Slurm


The Slurm job scheduler on the High Performance WiWi Cluster (HPC3)


== Cluster topology & hardware specs ==

The cluster is currently made up of 9 nodes:

* 3 x HP DL385
* 6 x HP ProLiant XL170r (housed in an HP Apollo r2200 chassis)

The servers' CPU and memory resources can briefly be summarized as follows:

{| class="wikitable"
! style="text-align:left;"| Server
! CPU
! Clock speed
! Sockets
! Cores/socket
! Memory
! GPU ready
|-
| DL 385
| AMD Epyc 7452
| 2.35 GHz (max. 3.35 GHz)
| 2
| 32
| 256 GB
| yes
|-
| ProLiant XL 170r
| Xeon-G 6226R
| 2.9 GHz (max. 3.9 GHz)
| 2
| 16
| 384 GB
| no
|}

Although all 9 nodes could serve as compute nodes, one of the DL 385 machines currently serves as a login node only. Because the compute nodes are heterogeneous, they are grouped into so-called ''partitions'', following Slurm's terminology. Furthermore, some of the nodes are "private", meaning that particular working groups have exclusive access to them as soon as they submit jobs. Whenever a private node is idle, users from other working groups may also use it for computational purposes. However, as soon as a high-priority job arrives, any running low-priority jobs on these machines are cancelled (re-queued). Details on how this access control is implemented on the HPC3 WiWi cluster are given in the next section.

The following table gives an overview of the nodes and the partitions they belong to:

{| class="wikitable"
! style="text-align:left;"| Server
! Node name
! Role
! Partitions
! "Private"
|-
| DL 385
| hpc3
| login, control
|
| no
|-
| DL 385
| gpu01
| compute, GPU
| defpart, gpu, gpucu
| yes
|-
| DL 385
| gpu02
| compute, GPU
| defpart, gpu, gpukr
| yes
|-
| ProLiant XL 170r
| n01-n05
| compute
| defpart, apollo, apollo_nonreserved
| no
|-
| ProLiant XL 170r
| n06
| compute
| defpart, apollo, apollokr
| yes
|}

== Submitting jobs ==

=== Batch jobs ===

Slurm provides support for the unattended execution of jobs on the cluster's resources (batch mode), which is perhaps the most common way of using the cluster. For this purpose, a shell script is passed to the job scheduler, containing

* the commands to be executed and
* some extra information for the Slurm job scheduler (optional).

Let us take a closer look at how to create such a script. We start with the first line, which tells the operating system which UNIX shell to use for interpreting the commands in the script.

#!/bin/bash

Then we add a series of directives for the Slurm job scheduler, each starting with '#SBATCH'. Although the '#' character usually indicates a comment, this specific string is interpreted by Slurm and allows various options to be set.

#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=john.doe@univie.ac.at

For the moment, we only state an e-mail address and specify which events trigger a notification via e-mail. In this case, we receive an e-mail when the job has started, that is, when it has been removed from the queue of waiting jobs and actually allocates resources on the cluster, and another e-mail when it has finished.

Finally, we add the commands that perform the actual computation. Let us assume in the following that the program we would like to run is called do-something and that it allows single- or multi-threaded execution. Assume further that the number of threads can be controlled by the command-line parameter --threads. If we wanted to use all 16 or 32 cores of a standard allocation (one socket), the program could be run either by

do-something --threads 16

or by parallelizing single-threaded instances of itself:

do-something --threads 1 &
do-something --threads 1 &
...
do-something --threads 1 &

Note that the '&' character at the end of each line tells the shell to run the program in background mode. The second mode of execution is useful, for example, when each instance of do-something takes a different file as input; a complete script of this kind is sketched below.
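
Putting these pieces together, a complete job script for the second mode of execution might look as follows. This is only a sketch: the input files data-1.txt to data-16.txt and the parameter --input are made-up examples, and the final wait command keeps the batch script (and thus the job) alive until all background instances have finished.

#!/bin/bash
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=john.doe@univie.ac.at

# Start one single-threaded instance per (hypothetical) input file.
for i in $(seq 1 16); do
    do-something --threads 1 --input data-$i.txt &
done

# Wait until all background instances have finished; otherwise the
# script, and with it the job, would end immediately.
wait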

After saving the script to disk as a file, for example job-script.sh, we can submit it using the sbatch command:

sbatch -J Job1 job-script.sh

The command takes the contents of the file job-script.sh and tries to allocate resources on the cluster. If enough resources are available (at least one socket), the job is started on the corresponding node; otherwise it is held in the queue. To keep track of one's jobs, an identifier (job name) can be assigned to a submitted job using the parameter '-J', as shown above.

An overview of queued and running jobs can be obtained by the command

squeue

The output might look as follows:

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  224    apollo     Job7 ag_do-br PD       0:00      1 (Resources)
  218    apollo     Job1 ag_do-br  R       0:29      1 n01
  219    apollo     Job2 ag_do-br  R       0:29      1 n02
  220    apollo     Job3 ag_do-br  R       0:29      1 n03
  221    apollo     Job4 ag_do-br  R       0:29      1 n04
  222    apollo     Job5 ag_do-br  R       0:29      1 n05
  223    apollo     Job6 ag_do-br  R       0:29      1 n06

In this case, 6 jobs are running on the 'apollo' partition, each allocating a whole node, i.e., two sockets. Job #7 is currently held in the queue because the partition is fully occupied. This is indicated by the field 'ST' (state), telling us that the job is currently pending (PD), while jobs #1 - #6 are in the running state (R). The last column shows the nodes on which each of the listed jobs is running or, for a pending job, the reason why it is waiting.
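
By default, squeue lists the jobs of all users. To restrict the output to one's own jobs, a user name can be passed via the option '-u'; in the shell, the variable $USER expands to the name of the current user:

squeue -u $USER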

To specify the partition on which a job should run, we can use the option '-p'. For partition 'apollo', this would be

sbatch -p apollo -J Job1 job-script.sh

If no partition is stated in the sbatch command line, the default partition ('defpart', i.e. all compute nodes) is used as the target.
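
Which partitions exist, which nodes they contain and whether those nodes are currently idle or allocated can be checked with the standard Slurm command sinfo (the output is not shown here, as it depends on the current state of the cluster):

sinfo              # list all partitions and their nodes
sinfo -p apollo    # restrict the output to the 'apollo' partition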

The quality of service (QoS) to use can be specified by the option '-q', for example

sbatch -q agcu job-script.sh

Again, the default QoS ('normal') is used if none is provided. Note that privileged QoS specifiers are accepted only

* for users who are entitled to them (see Section XXX) and
* on partitions for which they are admitted.
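
Which QoS values are associated with one's own account can usually be looked up in Slurm's accounting database. Whether the sacctmgr command is accessible to regular users depends on the site configuration, so the following query is only a sketch:

sacctmgr show associations user=$USER format=User,Account,Partition,QOS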

To avoid long chains of command line arguments, most of the parameters can be passed to sbatch via directives in the job script, as already introduced above in the context of notification e-mails. For example, the directives

#SBATCH --partition=apollo
#SBATCH --qos=normal

lead to the same result as the command line arguments '-p apollo' and '-q normal'.
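
As a summary, the following sketch collects the directives discussed in this section into a single job script; the job name, partition, QoS and the program call are example values and have to be adapted as needed:

#!/bin/bash
#SBATCH --job-name=Job1
#SBATCH --partition=apollo
#SBATCH --qos=normal
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=john.doe@univie.ac.at

do-something --threads 32

Such a script can then be submitted without any additional command line arguments:

sbatch job-script.sh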

=== Interactive jobs ===
