Let's explore how to use the Slurm Batch System on the ATOS HPCF or ECS.
Basic job submission
Access the default login node of the ATOS HPCF or ECS.
Create a directory for this tutorial so all the exercises and outputs are contained inside:
mkdir ~/batch_tutorial
cd ~/batch_tutorial
Create and submit a job called simplest.sh with just default settings that runs the command hostname. Can you find the output and inspect it? Where did your job run?
Using your favourite editor, create a file called simplest.sh with the following content
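simplest.sh
#!/bin/bash
# The only task of this job is to report the node it runs on
hostname
Then submit it to the batch system with:
sbatch simplest.sh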
The job should be run shortly. When finished, a new file called slurm-<jobid>.out should appear in the same directory. You can check the output with:
$ cat $(ls -1 slurm-*.out | tail -n1)
ab6-202.bullx
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
[ECMWF-INFO -ecepilog] This is the ECMWF job Epilogue
[ECMWF-INFO -ecepilog] +++ Please report issues using the Support portal +++
[ECMWF-INFO -ecepilog] +++ https://support.ecmwf.int +++
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
[ECMWF-INFO -ecepilog] Run at 2023-10-25T11:31:53 on ecs
[ECMWF-INFO -ecepilog] JobName : simplest.sh
[ECMWF-INFO -ecepilog] JobID : 64273363
[ECMWF-INFO -ecepilog] Submit : 2023-10-25T11:31:36
[ECMWF-INFO -ecepilog] Start : 2023-10-25T11:31:51
[ECMWF-INFO -ecepilog] End : 2023-10-25T11:31:53
[ECMWF-INFO -ecepilog] QueuedTime : 15.0
[ECMWF-INFO -ecepilog] ElapsedRaw : 2
[ECMWF-INFO -ecepilog] ExitCode : 0:0
[ECMWF-INFO -ecepilog] DerivedExitCode : 0:0
[ECMWF-INFO -ecepilog] State : COMPLETED
[ECMWF-INFO -ecepilog] Account : myaccount
[ECMWF-INFO -ecepilog] QOS : ef
[ECMWF-INFO -ecepilog] User : user
[ECMWF-INFO -ecepilog] StdOut : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
[ECMWF-INFO -ecepilog] StdErr : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
[ECMWF-INFO -ecepilog] NNodes : 1
[ECMWF-INFO -ecepilog] NCPUS : 2
[ECMWF-INFO -ecepilog] SBU : 0.011
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
You can then see that the script has run on a different node than the one you are on.
If you repeat the operation, you may get your job to run on a different node every time, whichever happens to be free at the time.
Configure your simplest.sh job to direct the output to simplest-<jobid>.out, the error to simplest-<jobid>.err both in the same directory, and the job name to just "simplest". Note you will need to use a special placeholder for the -<jobid>.
Using your favourite editor, open the simplest.sh job script and add the relevant #SBATCH directives:
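For the -<jobid> part of the file names you can use the %j placeholder, which Slurm replaces with the actual job ID:
#SBATCH --job-name=simplest
#SBATCH --output=simplest-%j.out
#SBATCH --error=simplest-%j.err
Submit the modified job again with sbatch simplest.sh. Once it has completed, you can inspect the accounting information Slurm kept for it with sacct, which by default lists all your jobs since midnight:
sacct
You should see three lines for this job.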
The first one corresponds to the job itself. The second one (always named batch) corresponds to the actual job script and the third (named extern) corresponds to the external step used to generate the end of job information. You may have more lines if your job contains more steps, which typically correspond to srun parallel executions.
If you want to list just the entry for the job itself, you can do:
sacct -X
Can you get information on all the jobs you ran today that were cancelled?
You can filter jobs by state with the -s option. But if you run it naively:
sacct -X -s CANCELLED
you will get no output. That is because when filtering by state you must also specify the start and end times of your query period. You can then do something like:
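For example, limiting the query to jobs since the start of today (the end time then defaults to the current time):
sacct -X -s CANCELLED -S $(date +%Y-%m-%d)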
The default information shown on the screen when querying past jobs is limited. Can you extract the submit, start, and end times of your cancelled jobs today? What about their output and error path? Hint: use the corresponding man page for all the options.
You can use the following command to see all the possible output fields you can query for:
sacct -e
While there are dedicated fields for the job submit, start and end times, there is none for the output and error paths. However, the AdminComment field is used to carry that information. Since it is a long field, you may want to pass a length to the fieldname to avoid truncation:
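For example, something along these lines should do (the number after % sets the column width so long paths are not cut off):
sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -o JobID,JobName,Submit,Start,End,AdminComment%150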
Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
The job above has the following problems:
There is no shebang at the beginning of the script.
There should be no spaces in the #SBATCH directives.
QoS "express" does not exist.
Here is an amended version following best practices for the job:
broken1_fixed.sh
#!/bin/bash
#SBATCH --job-name=broken1
#SBATCH --output=broken1-%J.out
#SBATCH --error=broken1-%J.out
#SBATCH --time=00:05:00
echo "I was broken!"
Note that the QoS line was removed, but you may also use the following if running on ECS:
#SBATCH --qos=ef
or alternatively, if on the Atos HPCF:
#SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
$ grep -v ECMWF-INFO $(ls -1 broken1-*.out | tail -n1)
I was broken!
Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken2.sh
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --qos=ns
#SBATCH --time=10-00
echo "I was broken!"
The job above has the following problems:
QoS "ns" does not exist. Either remove to use the default or use the corresponding QoS on ECS (ef) or HPCF (nf)
The time requested is 10 days, which is longer than the maximum allowed. it was probably meant to be 10 minutes
Here is an amended version:
broken2.sh
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --time=10:00
echo "I was broken!"
Again, note that the QoS line was removed, but you may also use the following if running on ECS:
#SBATCH --qos=ef
or alternatively, if on the Atos HPCF:
#SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
$ grep -v ECMWF-INFO $(ls -1 broken2-*.out | tail -n1)
I was broken!
Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken3.sh
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$SCRATCH
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out
echo "I was broken!"
The job above has the following problems:
Variables are not expanded in job directives. You must specify your paths explicitly.
The directory where the output and error files will go must exist before the job starts. Otherwise the job will fail, and you will not get any hint as to what may have happened; the only clue would come from checking sacct:
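For example, a query by job name such as the following (illustrative; the exact exit code may vary) would show the job in a FAILED state, even though no output file was ever written:
sacct -X --name=broken3 -o JobID,JobName,State,ExitCode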
You will need to create the output directory with:
mkdir -p $SCRATCH/broken3output/
Here is an amended version of the job:
broken3.sh
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=/scratch/<your_user_id>
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out
echo "I was broken!"
Check that the job actually ran and generated the expected output:
$ grep -v ECMWF-INFO $(ls -1 $SCRATCH/broken3output/broken3-*.out | tail -n1)
I was broken!
You may clean up the output directory with
rm -rf $SCRATCH/broken3output
Create a new job script broken4.sh with the contents below and try to submit the job. You should not see the message in the output. What happened? Can you fix the job and keep trying until it runs successfully?
broken4.sh
#!/bin/bash
#SBATCH --job-name=broken4
#SBATCH --output=broken4-%J.out
ls $FOO/bar
echo "I should not be here"
The job above has the following problems:
The FOO variable is undefined when used. Undefined variables often lead to unexpected failures that are not always easy to spot.
Even if FOO were defined as "", the ls command would fail, yet the job would keep running and eventually appear to finish successfully from Slurm's point of view, when it should have failed and stopped at the first error.
Here is an amended version of the job following best practices:
broken4.sh
#!/bin/bash
#SBATCH --output=broken4-%J.out
set -x # echo script lines as they are executed
set -e # stop the shell on first error
set -u # fail when using an undefined variable
set -o pipefail # If any command in a pipeline fails, that return code will be used as the return code of the whole pipeline
ls $FOO/bar
echo "I should not be here"
With the extra shell options, we get additional information in the output about the commands being executed, and we ensure that the job stops as soon as it encounters the first error (non-zero exit code) or an undefined variable.
Best practices
Even if most examples in this tutorial do not have the extra shell options for simplicity, you should always include those in your production jobs.
Understanding your limits
Although most limits are described in HPC2020: Batch system, you can also check them (or reach them) for yourself in the system.
Create a new job script naughty.sh with the following contents:
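A sketch consistent with the memory report shown further below: the job requests no specific resources, so it runs with the default memory limit, and then tries to use more memory than that:
naughty.sh
#!/bin/bash
#SBATCH --output=naughty.out
# Try to use more memory than the job's default limit allows, then wait
perl -e '$a="A"x(8000*1024*1024/2); sleep'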
Submit naughty.sh to the batch system and check its status. What happened to the job?
You can submit it with:
sbatch naughty.sh
You can then monitor the state of your job with squeue:
squeue -j <jobid>
After a few seconds of running, you may see the job finish and disappear. If you check it with sacct, you will see the job has failed with an exit code of 9, which indicates it was killed (SIGKILL). Looking at the job output, you will find a report from the ECMWF memory supervisor explaining why:
$ grep -v ECMWF-INFO naughty.out | head -n 22
__ __ _____ __ __ _ _____ _ _
__/\_| \/ | ____| \/ | |/ /_ _| | | | __/\__
\ / |\/| | _| | |\/| | ' / | || | | | \ /
/_ _\ | | | |___| | | | . \ | || |___| |__/_ _\
\/ |_| |_|_____|_| |_|_|\_\___|_____|_____|\/
BEGIN OF ECMWF MEMKILL REPORT
[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 11:16:43 2023]
[summary]
job/session: 64304303
requested/default memory limit for job/session: 8000MiB
sum of active and inactive _anonymous memory of job/session: 8001MiB
ACTION: about to issue: 'kill -SIGKILL' to pid: 4016899
to-be-killed process: "perl -e $a='A'x(8000*1024*1024/2); sleep", with resident-segment-size: 8004MiB
How could you have checked this beforehand instead of taking the trial and error approach?
You could have checked HPC2020: Batch system, or you could also ask Slurm for this information. Default memory is defined per partition, so you can then do
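scontrol show partition
Look for the DefMemPerNode or DefMemPerCPU value reported for the partition your job would use.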
Can you check, without trial and error this time, what is the maximum wall clock time, maximum CPUs, and maximum memory you can request to Slurm for each QoS?
Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the QoS setup, so the command is
sacctmgr show qos
The fields we are looking for this time are MaxWall and MaxTRES:
sacctmgr -P show qos format=name,MaxWall,MaxTRES
If you run this on HPCF, you may notice there is no maximum limit set at the QoS level for the np parallel QoS, so you are bound by the maximum memory available in the node.
You can also see other limits such as the local SSD tmpdir space.
How many jobs could you potentially have running concurrently? How many jobs could you have in the system (pending or running), before a further submission fails?
Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the Association setup, so the command is
sacctmgr show assoc where user=$USER
The fields we are looking for are MaxJobs and MaxSubmit:
sacctmgr show assoc user=$USER format=account,user,partition,maxjobs,maxsubmit
Remember that a Slurm Association is made of the user, project account and partition, and the limits are set at the association level.
So far we have only run serial jobs. You may also want to run small parallel jobs, using multiple threads, multiple processes, or both. Examples of this are MPI and OpenMP programs. We call this kind of small parallel job "fractional", because it runs on a fraction of a node, sharing it with other users.
If you followed this tutorial so far, you will have realised ECS users may run very small parallel jobs on the default ef QoS, whereas HPCF users may run slightly bigger jobs (up to half a GPIL node) on the default nf QoS.
Load the xthi tool with module load xthi, then run the program interactively to familiarise yourself with the output:
$ xthi
Host=ac6-200 MPI Rank=0 CPU=128 NUMA Node=0 CPU Affinity=0,128
As you can see, only 1 process and 1 thread are run, and they may run on either of the two virtual cores assigned to your session (which correspond to the same physical CPU). If you try to run with 4 OpenMP threads, you will see they effectively fight each other for those same two cores, impacting the performance of your application but not that of anyone else on the login node:
$ OMP_NUM_THREADS=4 xthi
Host=ac6-200 MPI Rank=0 OMP Thread=0 CPU=128 NUMA Node=0 CPU Affinity=0,128
Host=ac6-200 MPI Rank=0 OMP Thread=1 CPU= 0 NUMA Node=0 CPU Affinity=0,128
Host=ac6-200 MPI Rank=0 OMP Thread=2 CPU=128 NUMA Node=0 CPU Affinity=0,128
Host=ac6-200 MPI Rank=0 OMP Thread=3 CPU= 0 NUMA Node=0 CPU Affinity=0,128
Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it and check the output to ensure the right number of tasks and threads were spawned.
Here is a job template to start with:
fractional.sh
#!/bin/bash
#SBATCH --output=fractional.out
# TODO: Add here the missing SBATCH directives for the relevant resources
# Define the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Load xthi tool
module load xthi
# TODO: Add here the line to run xthi
# Hint: use srun
Using your favourite editor, create a file called fractional.sh with the following content:
fractional.sh
#!/bin/bash
#SBATCH --output=fractional.out
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
# Define the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Load xthi tool
module load xthi
srun -c $SLURM_CPUS_PER_TASK xthi
You need to request 2 tasks, and 2 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the cpus-per-task, which must be explicitly passed to srun.
The job should be run shortly. When finished, a new file called fractional.out should appear in the same directory. You can check the relevant output with:
grep -v ECMWF-INFO fractional.out
You should see an output similar to:
$ grep -v ECMWF-INFO fractional.out
Host=ad6-202 MPI Rank=0 OMP Thread=0 CPU= 5 NUMA Node=0 CPU Affinity=5,133
Host=ad6-202 MPI Rank=0 OMP Thread=1 CPU=133 NUMA Node=0 CPU Affinity=5,133
Host=ad6-202 MPI Rank=1 OMP Thread=0 CPU=137 NUMA Node=0 CPU Affinity=9,137
Host=ad6-202 MPI Rank=1 OMP Thread=1 CPU= 9 NUMA Node=0 CPU Affinity=9,137
Srun automatic cpu binding
You can see srun automatically ensures certain binding of the cores to the tasks. If you were to instruct srun to avoid any cpu binding with --cpu-bind=none, you would see something like:
$ grep -v ECMWF-INFO fractional.out
Host=aa6-203 MPI Rank=0 OMP Thread=0 CPU=136 NUMA Node=0 CPU Affinity=4,8,132,136
Host=aa6-203 MPI Rank=0 OMP Thread=1 CPU= 8 NUMA Node=0 CPU Affinity=4,8,132,136
Host=aa6-203 MPI Rank=1 OMP Thread=0 CPU=132 NUMA Node=0 CPU Affinity=4,8,132,136
Host=aa6-203 MPI Rank=1 OMP Thread=1 CPU= 4 NUMA Node=0 CPU Affinity=4,8,132,136
Here all processes/threads could run on any of the cores assigned to the job, potentially hopping from CPU to CPU during the program's execution.
Can you ensure each one of the OpenMP threads runs on a single physical core, without exploiting the hyperthreading, for optimal performance?
In order to ensure each thread gets its own core, you can use the environment variable OMP_PLACES=threads.
Then, to make sure only physical cores are used for best performance, we need to use the --hint=nomultithread directive:
fractional.sh
#!/bin/bash
#SBATCH --output=fractional.out
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --hint=nomultithread
# Define the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Ensure proper OpenMP thread CPU pinning
export OMP_PLACES=threads
# Load xthi tool
module load xthi
srun -c $SLURM_CPUS_PER_TASK xthi
For bigger parallel executions, you will need to use the HPCF's parallel QoS, np, which gives access to the biggest partition of nodes in every complex.
When running in such a configuration, your job will get exclusive use of the nodes where it runs, so external interference is minimised. It is then important that the allocated resources are used efficiently.
Here is a very simplified diagram of the Atos HPCF node that you should keep in mind when deciding your job geometries:
If not already on HPCF, open a session on hpc-login.
Create a new job script parallel.sh to run xthi with 32 MPI tasks and 4 OpenMP threads, leaving hyperthreading enabled. Submit it and check the output to ensure the right number of tasks and threads were spawned. Take note of which CPUs are used and how many SBUs you used.
Here is a job template to start with:
parallel.sh
#!/bin/bash
#SBATCH --output=parallel-%j.out
#SBATCH --qos=np
# TODO: Add here the missing SBATCH directives for the relevant resources
# Define the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Ensure proper OpenMP thread CPU pinning
export OMP_PLACES=threads
# Load xthi tool
module load xthi
srun -c $SLURM_CPUS_PER_TASK xthi
Using your favourite editor, create a file called parallel.sh with the following content:
parallel.sh
#!/bin/bash
#SBATCH --output=parallel-%j.out
#SBATCH --qos=np
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
# Define the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Ensure proper OpenMP thread CPU pinning
export OMP_PLACES=threads
# Load xthi tool
module load xthi
srun -c $SLURM_CPUS_PER_TASK xthi
You need to request 32 tasks, and 4 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the cpus-per-task, which must be explicitly passed to srun.
The job should be run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:
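grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)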
Modify the parallel.sh job geometry (number of tasks, threads and use of hyperthreading) so that you fully utilise all the physical cores, and only those, i.e. 0-127.
Without using hyperthreading, an Atos HPCF node has 128 physical cores available. Any combination of tasks and threads that adds up to that figure will fill the node. Examples include 32 tasks x 4 threads, 64 tasks x 2 threads, or 128 single-threaded tasks. For this example, we picked the first one:
parallel.sh
#!/bin/bash
#SBATCH --output=parallel-%j.out
#SBATCH --qos=np
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread
# Define the number of OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# Ensure proper OpenMP thread CPU pinning
export OMP_PLACES=threads
# Load xthi tool
module load xthi
srun -c $SLURM_CPUS_PER_TASK xthi
The job should be run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:
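grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)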
Modify the parallel.sh job geometry so it still runs on the np QoS, but only with 2 tasks and 2 threads. Check the SBU cost. Since the execution is 32 times smaller, did it cost 32 times less than the previous? Why?
Let's use the following job:
parallel.sh
#!/bin/bash
#SBATCH --output=parallel-%j.out
#SBATCH --qos=np
# Add here the missing SBATCH directives for the relevant resources
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --hint=nomultithread
module load xthi
export OMP_PLACES=threads
srun -c $SLURM_CPUS_PER_TASK xthi
The job should be run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:
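grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)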
The cost is on a similar scale to that of the previous job, which was 32 times bigger. The reason is that on the np QoS the allocation is done in full nodes: the SBU cost accounts for the nodes allocated for a given period of time, no matter how much of them is actually used.
You may compare the cost of your last parallel job with that of your last fractional job, both using the same geometry (2x2).
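For example, assuming both output files are still in place, you could compare the SBU field reported by the job epilogue in each of them:
grep SBU $(ls -1 parallel-*.out | tail -n1) fractional.out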