Slurm is the batch system available. Any script can be submitted as a job with no changes, but you might want to see Writing SLURM jobs to customise it.
To submit a script as a serial job with default options, enter the command:
sbatch yourscript.sh
You may query the queues to see the jobs currently running or pending with:
squeue
And cancel a job with:
scancel <jobid>
Note that currently the scancel command must be run on the login node of the same cluster where the job is running.
See the Slurm documentation for more details on the different commands available to submit, query or cancel jobs.
QoS available
These are the different QoS (or queues) available for standard users on the four complexes:
| QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory |
|---|---|---|---|---|---|---|---|
| nf | fractional | serial and small parallel jobs. It is the default | Yes | - | average runtime + standard deviation / 2 days | 1 / 64 | 8 GB / 128 GB |
| ni | interactive | serial and small parallel interactive jobs | Yes | 1 | 12 hours / 7 days | 1 / 32 | 8 GB / 32 GB |
| np | parallel | parallel jobs requiring more than half a node | No | - | average runtime + standard deviation / 2 days | - | 240 GB / 240 GB per node (all usable memory in a node) |
ECS
For those using ECS, these are the different QoS (or queues) available for standard users of this service:
| QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory |
|---|---|---|---|---|---|---|---|
| ef | fractional | serial and small parallel jobs - ECGATE service | Yes | - | average job runtime + standard deviation / 2 days | 1 / 8 | 8 GB / 16 GB |
| ei | interactive | serial and small parallel interactive jobs - ECGATE service | Yes | 1 | 12 hours / 7 days | 1 / 4 | 8 GB / 8 GB |
| el | long | serial and small parallel jobs with longer runtimes - ECGATE service | Yes | - | average job runtime + standard deviation / 7 days | 1 / 8 | 8 GB / 16 GB |
| et | Time-critical Option 1 | serial and small parallel Time-Critical jobs. Only usable through ECACCESS Time Critical Option-1 | Yes | - | average job runtime + standard deviation / 12 hours | 1 / 8 | 8 GB / 16 GB |
Time limit management
See HPC2020: Job Runtime Management for more information on how the default Wall Clock Time limit is calculated.
Limits are not set in stone
Different limits on the different QoSs may be introduced or changed as the system evolves.
Checking QoS setup
If you want to get all the details of a particular QoS on the system, you may run, for example:
sacctmgr list qos names=nf
Submitting jobs remotely
If you are submitting jobs from a different platform via ssh, please use the dedicated *-batch nodes instead of the *-login equivalents:
- For generic remote job submission on HPCF: hpc-batch or hpc2020-batch
- For remote job submission on a specific HPCF complex: <complex_name>-batch
- For remote job submission to the ECS virtual complex: ecs-batch
For example, to submit a job from a remote platform onto the Atos HPCF:
ssh hpc-batch "sbatch myjob.sh"
HPC2020: Writing SLURM jobs
Any shell script can be submitted as a Slurm job with no modifications. In such a case, sensible default values will be applied to the job. However, you can configure the script to fit your needs through job directives. In Slurm, these are special comments in your script, usually at the top just after the shebang line, with the form:
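#SBATCH --option=value

For example, a minimal sketch of a customised script could look like the following (the job name, QoS, limits and output path are illustrative values):

```bash
#!/bin/bash
#SBATCH --job-name=my_job       # name shown in squeue
#SBATCH --qos=nf                # queue to use
#SBATCH --time=01:00:00         # wall clock limit
#SBATCH --mem=8G                # memory limit
#SBATCH --output=my_job.%j.out  # %j expands to the job ID

# Regular commands follow the directives
echo "Running on $(hostname)"
```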
HPC2020: Submitting a serial or small parallel job
Serial and small parallel jobs, called fractional, run on the gpil partition, typically under the nf QoS for regular users of the Atos HPCF service. For ECS users, they run on the ecs partition in the ef queue.
These are the default queue and partition. They will be used if no directives are specified.
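As an illustrative sketch (the executable name is hypothetical), a small fractional job selecting the QoS explicitly could look like:

```bash
#!/bin/bash
#SBATCH --qos=nf        # fractional QoS; ECS users would use ef instead
#SBATCH --ntasks=4      # a small parallel job still fits in the fractional QoS
#SBATCH --mem=16G

srun ./my_small_program  # hypothetical executable
```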
HPC2020: Submitting a parallel job
Parallel jobs run on the compute partition and use the np QoS for regular users.
This queue is not the default, so make sure you define it explicitly in your job directives before submission.
Parallel jobs are allocated exclusive nodes, so they will not share resources with other jobs.
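A minimal sketch of such a parallel job (node and task counts are illustrative, and the executable is hypothetical):

```bash
#!/bin/bash
#SBATCH --qos=np                # parallel QoS; must be set explicitly
#SBATCH --nodes=2               # exclusive nodes, not shared with other jobs
#SBATCH --ntasks-per-node=128   # illustrative; match the core count of a node
#SBATCH --time=06:00:00

srun ./my_mpi_program           # hypothetical MPI executable
```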
HPC2020: Slurm - PBS cheatsheet
Top tips when working with SLURM
- Put all your SLURM directives at the top of the script file, above any commands. Any directive after an executable line in the script is ignored.
- Note that you can pass SLURM directives as options to the sbatch command, as shown below.
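For example, the following submission would take precedence over any equivalent directives inside the script (values are illustrative):

sbatch --qos=np --time=02:00:00 myjob.sh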
HPC2020: Running an interactive job
If you wish to run interactively but are constrained by the limits on the CPUs, CPU Time or memory, you may run a small interactive job requesting the resources you want.
By doing that, you will get a dedicated allocation of CPUs and memory to run your application interactively. There are several ways to do this, depending on your use case:
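For instance, one standard Slurm approach is an interactive allocation with salloc, requesting the interactive QoS (the resource values are illustrative, and the exact behaviour on allocation depends on the site configuration):

salloc --qos=ni --ntasks=4 --mem=16G --time=04:00:00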
HPC2020: Multi-complex SLURM management
With 4 identical Atos complexes (also known as clusters) installed in our Data Centre in Bologna (see Atos HPCF: System overview), we are now able to provide a more reliable computing service at ECMWF, including for batch work. For example, during a system session on one complex, batch jobs can be submitted to a different complex. This enhanced batch service may, however, require the use of some ECMWF-customised SLURM commands.
HPC2020: Job Runtime Management
ECMWF enforces the termination of jobs that reach their wall clock limit, whether that limit was given with the #SBATCH --time directive or the --time command line option.
If no time limit is specified in the job, an automatic limit is set based on the average runtime of previous similar jobs, with some grace added before the job is killed.
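Slurm accepts several formats for the time limit; for example (values are illustrative):

```bash
#SBATCH --time=02:30:00     # 2 hours 30 minutes
#SBATCH --time=1-12:00:00   # 1 day and 12 hours
```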
HPC2020: example Slurm serial batch job scripts for ECS
Job scripts
Here you find some simple serial batch job examples which are designed to be submitted to and run in the ef queue of the ECS virtual complex, but can easily be adapted to run on any other complex by just changing the QoS to nf. Use them as templates to learn from, or as starting points to construct your own jobs.
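As a sketch of what such a job could look like (the working directory and program name are hypothetical):

```bash
#!/bin/bash
#SBATCH --qos=ef                        # use nf instead on the Atos HPCF complexes
#SBATCH --job-name=serial_example
#SBATCH --output=serial_example.%j.out  # %j expands to the job ID
#SBATCH --time=00:30:00

cd $SCRATCH             # illustrative working directory
./my_serial_program     # hypothetical executable
```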
HPC2020: example Slurm parallel batch job scripts
Job scripts
Here you find some simple parallel batch job examples which are designed to be submitted to and run in the nf queue on the Atos HPCF. Use them as templates to learn from, or as starting points for constructing your own jobs.
Do not forget to modify the scripts with your own workdir, UID and GID as necessary!
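For instance, a hybrid MPI + OpenMP sketch along those lines (all values and names are illustrative):

```bash
#!/bin/bash
#SBATCH --qos=nf
#SBATCH --ntasks=8           # MPI tasks
#SBATCH --cpus-per-task=4    # OpenMP threads per task
#SBATCH --mem=32G
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Pass the cpus-per-task explicitly so srun honours it on all Slurm versions
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_hybrid_program  # hypothetical executable
```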
HPC2020: Batch jobs not starting - reasons
There may be a number of reasons why a submitted job does not start running. When that happens, it is a good idea to use squeue and pay attention to the STATE and NODELIST(REASON) columns:

$> squeue -j 64243399
   JOBID    NAME   USER   QOS     STATE   TIME  TIME_LIMIT  NODES  FEATURES  NODELIST(REASON)
64243399  my_job   user    nf   PENDING   0:00    03:00:00      1    (null)  (Priority)

In this example the job is pending because jobs with higher priority are ahead of it in the queue (Priority).
HPC2020: Affinity
When running parallel jobs, SLURM will automatically set up some default process affinity. This means that every task spawned by srun (each MPI rank in an MPI execution) will be pinned to a specific core or set of cores within every computing node.
However, the default affinity may not be what you would expect, and depending on the application it could have a significant impact on performance.
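If the defaults are not suitable, standard Slurm binding options can be used to inspect and control the pinning. A minimal sketch (the executable is hypothetical, and these are generic Slurm options rather than ECMWF-specific settings):

```bash
# Print the binding actually applied to each task, and pin tasks to physical cores
srun --cpu-bind=verbose,cores ./my_parallel_program
```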