This example shows how to run a parallel job with a Multiple Program Multiple Data (MPMD) hybrid MPI + OpenMP execution, such as a coupled model.

It uses the heterogeneous step feature in Slurm, which assigns complete nodes to each component of the MPMD run and allows a different layout of tasks and threads for each of them. The dummy program used is based on Cray's xthi tool, which displays the rank and thread of each unit of execution, as well as the node where it runs and the cpus pinned to it. See HPC2020: Affinity for more details on cpu pinning and why it is important. This example simulates a coupled MPI + OpenMP execution with 3 different binaries, each of them running with a different layout:

#!/bin/bash
#SBATCH -q np                    # QoS for parallel jobs
#SBATCH -N 3                     # 3 nodes, one per executable
#SBATCH --hint=nomultithread     # use physical cores only, no hyperthreading
#SBATCH -J test-coupled          # job name
#SBATCH -o test-coupled-%j.out   # stdout file (%j = job id)
#SBATCH -e test-coupled-%j.out   # stderr merged into the same file

executables="exe1 exe2 exe3"
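# MPI tasks x OpenMP threads for each executable. Each combination
# adds up to 128 cpus, i.e. one full node.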
mpi_tasks_exe1=128
omp_threads_exe1=1
mpi_tasks_exe2=64
omp_threads_exe2=2
mpi_tasks_exe3=32
omp_threads_exe3=4

set -e

# Load environment
module load hpcx-openmpi

# Fetch and build affinity checker xthi
# Tweak it slightly to be able to simulate different executables
[[ ! -e xthi.c ]] && wget https://docs.nersc.gov/jobs/affinity/xthi.c
sed -i -e "s/Hello from/%s/" -e "s/   rank, thread/   argv[0], rank, thread/" xthi.c
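# After this tweak, each output line starts with the executable name (argv[0])
# instead of "Hello from"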
mpicc -fopenmp -o xthi xthi.c

srun_line=""
for e in $executables; do
    # Simulate different executables
    ln -sf xthi $e
    # Configure srun step for executable
    mpi_tasks_var=mpi_tasks_$e
    omp_threads_var=omp_threads_$e
    srun_line+=": -n ${!mpi_tasks_var} -c ${!omp_threads_var} ./$e "
done
srun_line=${srun_line:1}

# Ensure correct OpenMP thread pinning
export OMP_PLACES=threads

# Avoid PMI hangs with heterogeneous jobs
export SLURM_MPI_TYPE=none

# Use srun in heterogeneous steps mode.
# Each executable will have a different layout; the minimum allocation is one node each.
# See https://slurm.schedmd.com/heterogeneous_jobs.html#het_steps
srun $srun_line | sort -k 3,3n -k 5,5n
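
With the values above, the srun line built by the loop is equivalent to:

srun -n 128 -c 1 ./exe1 : -n 64 -c 2 ./exe2 : -n 32 -c 4 ./exe3

The trailing sort simply reorders the interleaved output by rank and thread to make it easier to read.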

In the above configuration, the job will run on 3 nodes, one for each executable. In the output you can then see where each of the executables actually ran, with the requested layout and the corresponding cpu affinity. Note that the ranks are numbered globally across the three executables, so exe2 starts at rank 128 and exe3 at rank 192:

$ outfile=test-coupled-32379327.out; grep -e exe1 -m 2 $outfile; echo "..."; grep -e exe2 -m 4 $outfile; echo "...";  grep -e exe3 -m 8 $outfile
./exe1 rank 0, thread 0, on ac3-4036.bullx. (core affinity = 0)
./exe1 rank 1, thread 0, on ac3-4036.bullx. (core affinity = 1)
...
./exe2 rank 128, thread 0, on ac3-4039.bullx. (core affinity = 0)
./exe2 rank 128, thread 1, on ac3-4039.bullx. (core affinity = 1)
./exe2 rank 129, thread 0, on ac3-4039.bullx. (core affinity = 2)
./exe2 rank 129, thread 1, on ac3-4039.bullx. (core affinity = 3)
...
./exe3 rank 192, thread 0, on ac3-4040.bullx. (core affinity = 0)
./exe3 rank 192, thread 1, on ac3-4040.bullx. (core affinity = 1)
./exe3 rank 192, thread 2, on ac3-4040.bullx. (core affinity = 2)
./exe3 rank 192, thread 3, on ac3-4040.bullx. (core affinity = 3)
./exe3 rank 193, thread 0, on ac3-4040.bullx. (core affinity = 4)
./exe3 rank 193, thread 1, on ac3-4040.bullx. (core affinity = 5)
./exe3 rank 193, thread 2, on ac3-4040.bullx. (core affinity = 6)
./exe3 rank 193, thread 3, on ac3-4040.bullx. (core affinity = 7)
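
If you adapt the task and thread counts, remember that each heterogeneous component is allocated complete nodes, so the value given to #SBATCH -N must cover every component separately. Below is a minimal sketch of that calculation, reusing the layout variables from the job script and assuming 128 usable cores per node, as on the nodes in this example:

cores_per_node=128
nodes=0
for e in $executables; do
    mpi_tasks_var=mpi_tasks_$e
    omp_threads_var=omp_threads_$e
    cpus=$(( ${!mpi_tasks_var} * ${!omp_threads_var} ))
    # Each component gets whole nodes, so round up per component
    nodes=$(( nodes + (cpus + cores_per_node - 1) / cores_per_node ))
done
echo "#SBATCH -N $nodes"    # prints "#SBATCH -N 3" for the layouts above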