Slurm
The Slurm job scheduler on the High Performance WiWi Cluster (HPC3)
1 Introduction
Slurm (Simple Linux Utility for Resource Management) is open-source workload management software for high performance computing. According to the SchedMD website (https://www.schedmd.com/), 5 of the 10 highest-ranking systems in the TOP500 rely on this workload manager, and it has been chosen as the workload manager for the WiWi cluster as well.
The following sections provide an introduction to the basic usage of slurm on the WiWi cluster, starting with important details concerning the hardware layer and concluding with a primer on more advanced topics. For detailed documentation of all slurm commands, please refer to the official documentation at https://slurm.schedmd.com/.
2 Cluster topology & hardware specs
The cluster is currently made up of 9 nodes that are interconnected via a 10Gbit (Ethernet) network:
- 3 x HP DL385
- 6 x HP ProLiant XL170r (accommodated in an HP Apollo r2200 chassis)
The servers' CPU and memory resources can briefly be summarized as follows:
| Server | CPU | Clock speed | Sockets | Cores/socket | Memory | GPU ready |
|---|---|---|---|---|---|---|
| DL385 | AMD Epyc 7452 | 2.35 GHz (max. 3.35 GHz) | 2 | 32 | 256 GB | yes |
| ProLiant XL170r | Xeon-G 6226R | 2.9 GHz (max. 3.9 GHz) | 2 | 16 | 384 GB | no |
Although all 9 nodes could serve as compute nodes, one of the DL385 machines currently serves as a login node only. Because the compute nodes are heterogeneous, they are grouped into so-called partitions, following slurm terminology. Furthermore, some of the nodes are "private", meaning that particular working groups have exclusive access to them as soon as they submit jobs. Whenever a private node is idle, users from other working groups may also use it for computational purposes. However, as soon as a high-priority job arrives, any running low-priority jobs on these machines are cancelled (re-queued). Details on how access control is implemented on the HPC3 WiWi cluster are given in the next section.
The following table gives an overview of the nodes and the partitions they belong to:
| Node name | Role | Partitions | "Private" |
|---|---|---|---|
| hpc3 | login, control | – | no |
| gpu01 | compute, GPU | defpart, gpu, gpucu | yes |
| gpu02 | compute, GPU | defpart, gpu, gpukr | yes |
| n01-n04 | compute | defpart, apollo, apollo_nonreserved | no |
| n05, n06 | compute | defpart, apollo, apollokr | yes |
Partition defpart is the default partition, gpu and apollo encompass the corresponding group of server nodes. Partitions gpucu, gpukr and apollokr are single-node partitions for access to private nodes of working groups ag_cuchiero and ag_krivobokova. Nodes n01-n04 are grouped into a partition apollo_nonreserved for dedicated access to Apollo nodes that are not subject to reserved resources. The two GPU-ready nodes (DL385) in partition gpu are currently equipped with one Nvidia A40 each.
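To get a live overview of these partitions and the current state of their nodes, the standard sinfo command can be used on the login node; a minimal sketch (the exact output depends on the current cluster state):

sinfo
sinfo -N -l

The first form prints one line per partition and node state, the second one a node-oriented long listing.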
3 Access control & quality of service (QoS)
Each system user (see HPC) is assigned a corresponding slurm user. Access control and queue management are based on slurm accounts, which directly correspond to working groups. Each account is entitled to a particular set of quality of service (QoS) levels. On the HPC3 WiWi cluster, QoS mechanisms are used to implement the specific kind of resource reservation requested by the working groups that own private nodes. High-priority access to those nodes is achieved by what slurm calls preemption: whenever jobs with a privileged QoS level enter the queue, jobs with a standard QoS level that are running on the associated private node(s) are cancelled and re-queued.
The following table gives an overview of the working groups and the QoS levels they are allowed to select:
| Working group | QoS levels | Preemption on partition(s) |
|---|---|---|
| ag_cuchiero | normal, agcu | gpucu |
| ag_krivobokova | normal, agkr | gpukr, apollokr |
| ag_doerner | normal | – |
| ag_ehmke | normal | – |
| ag_hautsch | normal | – |
| ag_operres | normal | – |
Note that privileged QoS levels can only be selected
- by users that have the permission to use them (based on their working group affiliation), and
- on partitions that contain private nodes only.
For example, using QoS level agcu on partition apollo is not permitted because high-priority access (with occasional preemption) should only take effect on partition gpucu (node gpu01).
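To check which account and QoS levels your own slurm user is associated with, the accounting database can be queried with sacctmgr; a hedged sketch (whether regular users may query these associations depends on the accounting configuration):

sacctmgr show associations user=$USER format=Account,User,QOS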
4 Submitting jobs
4.1 Resource allocation and usage principles
The configuration of slurm on the WiWi cluster is such that CPU resources are allocated with a socket as the smallest unit. Although the number of CPU cores per socket varies depending on the node type (see Cluster topology & hardware specs), this allows for a more fine-grained provision of the available computation power; in effect, the number of allocatable units is doubled compared to a node-wise scheme. However, in contrast to a node-wise allocation, where each computation job runs exclusively on one or more servers, completely isolated from other jobs, the socket-based allocation principle means that a single server's CPUs and memory may be shared between two jobs of potentially different users.
It must be emphasized that in slurm, the socket-based view is used for job scheduling purposes only. Slurm in no way enforces resource usage limits, neither with regard to CPU cores nor with regard to memory. In other words: when allocating a CPU socket with, say, 16 cores, nothing prevents a user from spawning 32 processes (or threads) within that allocation. The system memory is also shared between the running jobs, which might lead to side effects as well. Users are therefore kindly asked to carefully take the available CPU and memory resources (see Cluster topology & hardware specs) into account when spawning processes or running multi-threaded applications. Excessive resource consumption not only potentially compromises another job running on the same machine; when sharing is in effect, it can be expected to slow down the originating job as well.
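One way to keep the number of processes or threads in line with the allocation is to read the CPU count that slurm exports into the job environment; a minimal sketch (SLURM_CPUS_ON_NODE is a standard slurm environment variable, available inside jobs only):

echo "CPUs allocated on this node: ${SLURM_CPUS_ON_NODE}"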
4.2 Batch jobs
Slurm provides support for unattended execution of jobs on the cluster's resources, which is perhaps the most common way of using it (batch mode). For this purpose, a shell script is passed to the job scheduler, containing
- the commands to be executed and
- some extra information for the slurm job scheduler (optional).
Let us take a closer look at how to create such a script. We start with the first line, telling the OS which kind of UNIX shell to use for interpreting the commands in the script.
#!/bin/bash
Then we add a series of directives for the slurm job scheduler, each starting with '#SBATCH'. Although the '#' character usually indicates a comment, this specific string is interpreted by slurm and allows various options to be set.
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=john.doe@univie.ac.at
For the moment, we only state an e-mail address and an indication of which events trigger a notification via mail. In this case, we receive an e-mail when the job is started, that is, when it is removed from the queue of waiting jobs and actually allocates resources on the cluster, and another one when it has finished.
Finally, we add commands to be executed for actual computation purposes. Let us assume in the following that the program we would like to run is called do-something, allowing single- or multi-threaded execution. Assume further that threading can be controlled by a command line parameter --threads. If we wanted to use all 16 or 32 processors of a standard allocation (1 socket), then the program could be run either by
do-something --threads 16
or by parallelizing single-threaded instances of itself:
do-something --threads 1 &
do-something --threads 1 &
...
do-something --threads 1 &
Note that the '&' character at the end of each line tells the shell to run the program in background mode. The second mode of execution is useful, for example, when each instance of do-something takes a different file as an input.
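As a sketch of this second mode, assume that each instance takes its input file as a positional argument (input-1.dat etc. are placeholder names); the final wait is important, because the job ends as soon as the script itself exits:

do-something --threads 1 input-1.dat &
do-something --threads 1 input-2.dat &
do-something --threads 1 input-3.dat &
wait    # block until all background instances have finished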
After saving the script to disk as a file, for example job-script.sh, we can pass it to the sbatch command:
sbatch -J Job1 job-script.sh
The command takes the contents of the file job-script.sh, and tries to allocate resources on the cluster. If there are enough resources available (at least one socket) then the job is started on the corresponding node. Otherwise the job is held in the queue. To keep track of one's jobs, an identifier (job name) can be assigned to a submitted job by using the parameter '-J', as shown above.
An overview of queued and running jobs can be obtained by the command
squeue
The output might look as follows:
JOBID PARTITION  NAME     USER  ST  TIME  NODES  NODELIST(REASON)
  224 apollo     Job7 ag_do-br  PD  0:00      1  (Resources)
  218 apollo     Job1 ag_do-br   R  0:29      1  n01
  219 apollo     Job2 ag_do-br   R  0:29      1  n02
  220 apollo     Job3 ag_do-br   R  0:29      1  n03
  221 apollo     Job4 ag_do-br   R  0:29      1  n04
  222 apollo     Job5 ag_do-br   R  0:29      1  n05
  223 apollo     Job6 ag_do-br   R  0:29      1  n06
In this case, 6 jobs are running on the 'apollo' partition, each allocating a whole node, i.e., two sockets. Job #7 is currently held in the queue because the partition is fully occupied. This is indicated by the field 'ST' (state), telling us that the job is currently pending (PD). Jobs #1 - #6 are in state running. The last column in this table shows the nodes on which each of the listed jobs is running.
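To show only your own jobs, or to inspect a particular job in more detail, the following standard commands can be used (224 stands for the job ID taken from the example output above):

squeue -u $USER
scontrol show job 224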
To specify the partition on which a job should run, we can use the option '-p'. For partition 'apollo', this would be
sbatch -p apollo -J Job1 job-script.sh
If no partition is stated in the sbatch command line, the default partition (all compute nodes) is assumed as a target.
The quality of service (QoS) to use can be specified by the option '-q', for example
sbatch -q agcu job-script.sh
Again, the default QoS ('normal') is used if none is provided. Note that privileged QoS specifiers are accepted only
- for users who are entitled to them, and
- for partitions on which they are admitted (see also Access control & quality of service (QoS)).
To avoid long chains of command line arguments, most of the parameters can be passed to sbatch via directives in the job script, as already introduced above in the context of notification e-mails. For example,
#SBATCH --partition=apollo
#SBATCH --qos=normal
lead to the same result as the command line arguments '-p apollo' and '-q normal'.
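Putting these pieces together, a complete job script might look as follows (a sketch combining the directives introduced above; do-something and the e-mail address are placeholders):

#!/bin/bash
#SBATCH --job-name=Job1
#SBATCH --partition=apollo
#SBATCH --qos=normal
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=john.doe@univie.ac.at

do-something --threads 16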
4.3 Interactive jobs
Besides batch-based job submission, it is also possible to run interactive jobs on the cluster. In essence, this means that as soon as resources are available, a shell prompt appears, allowing you to run any application or script on the node(s) that have just been allocated. Relying on the defaults, this can be achieved with the following command:
srun --pty /bin/bash
Note that the option '--pty' is important, because otherwise the srun command spawns multiple instances of /bin/bash across the allocation (all the cores of a socket). This is unwanted behavior in the interactive shell context, because all inputs & outputs would appear multiple times. Of course, any UNIX shell can be used instead of '/bin/bash'.
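The srun command accepts the same resource-related options as sbatch, so an interactive shell on a specific partition can be requested as well, for example (a sketch):

srun -p apollo --pty /bin/bash

Typing 'exit' in the interactive shell ends the job and releases the allocation.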
5 File system access
The users' home directories (/home/<username>/) are transparently available on all compute nodes via NFS. Consequently, any apps that are installed in a user's home directory can be run on any node. This is also the recommended way of proceeding when custom software is needed (instead of a system-wide installation).
For storing temporary data during computation, there is a global scratch space available. It is also NFS-mounted and can be reached via /scratch.global on all compute nodes. The current capacity is 3.0TB.
Please be aware that due to the NFS-based mount all file operations also have to pass the network layer. I/O-intensive tasks with thousands of read or write operations per second can therefore be subject to significant slow-downs compared to native file system access.
In such scenarios, it is advisable to use the local scratch space provided on each node, accessible via /scratch.local. The capacity is 800GB on the Apollo nodes and 1.8TB on the GPU nodes. Note that both scratch directories, /scratch.global and /scratch.local, are world-writable and thus large amounts of data might pile up over time. For this reason, files that are older than 30 days will be automatically removed.
Extremely fast I/O for limited amounts of temporary data is available through a ramdisk (/scratch.ramdisk) on all nodes. Its capacity is 64GB per node and any data stored in it is transient by nature. If files stored there need to persist, it is highly advisable to move them, for example to your home directory, before the computation job completes, i.e., at the end of the job script.
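A common pattern is to create a per-job working directory in one of the scratch spaces, run the computation there, and copy the results back to the home directory at the end of the job script. A hedged sketch (do-something, the result file name and the target directory are placeholders; SLURM_JOB_ID is a standard slurm environment variable):

#!/bin/bash
#SBATCH --mail-type=END
#SBATCH --mail-user=john.doe@univie.ac.at

# per-job directory on the node-local scratch space
WORKDIR=/scratch.local/${USER}/${SLURM_JOB_ID}
mkdir -p "$WORKDIR"
cd "$WORKDIR" || exit 1

do-something --threads 16 > result.out

# copy the results to the persistent home directory before the job ends
mkdir -p ~/results/${SLURM_JOB_ID}
cp result.out ~/results/${SLURM_JOB_ID}/
rm -rf "$WORKDIR"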
6 Advanced topics
6.1 Adjusting the resource allocation
The default allocation on the WiWi cluster is one socket with a varying number of CPU cores, depending on the node type (see Resource allocation and usage principles). Of course, it is possible to allocate more than one socket for a job. In most cases, a user will want to request a single node exclusively, such that no other jobs can be placed on that node while the job is running. This can be achieved by typing
sbatch --exclusive job-script.sh
It is also possible to request more than one node by using the '-N' option:
sbatch -N 2 --exclusive job-script.sh
It must be remarked that an allocation involving two or more nodes only makes sense for "truly" parallel applications, such as those using MPI (Message Passing Interface). In all other cases, i.e., for classical shared-memory applications, this leads to undesired behavior, because such applications will run on one node only (unless you start several separate processes). Even if you plan to run many computation tasks at the same time, it is better to accommodate them in separate socket-based job scripts (using the standard allocation). This makes it easier for the workload manager to schedule them: under a high job load, the chances of obtaining a free socket are much higher than those of being granted several nodes at once.
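With an exclusive allocation, a shared-memory application can then use all cores of the node; a minimal sketch reusing the do-something placeholder (SLURM_CPUS_ON_NODE reflects the number of CPUs slurm has allocated on the node):

#!/bin/bash
#SBATCH --exclusive

do-something --threads "$SLURM_CPUS_ON_NODE"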
6.2 Using GPUs
Nodes gpu01 and gpu02 are each equipped with one Nvidia A40 GPU. Nvidia CUDA drivers are installed on both nodes, granting easy access to these additional computing resources. To request and allocate a GPU resource via slurm, please use the '--gres' option as shown below:
sbatch -p gpu --gres=gpu:1 job-script.sh
This command instructs slurm to allocate one GPU in the partition gpu for running the job script. Note that it is unfortunately not possible to perform a GPU-only allocation. At least one socket (32 cores) is allocated together with the GPU. This is an inherent drawback of the workload manager and thus far, no workaround has been found for this.
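A minimal GPU job script could therefore look as follows (a sketch; nvidia-smi merely confirms that the allocated GPU is visible to the job, and my-gpu-app is a placeholder for a CUDA-enabled application):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

nvidia-smi
my-gpu-app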
6.3 Job steps
Contents will follow...