Running Jobs on foseres

The foseres facility uses Slurm to schedule jobs.

Writing a submission script is typically the most convenient way to submit your job to the job submission system. Example submission scripts (with explanations) for the most common job types are provided below.

Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.

If you have any questions about running jobs on foseres, do not hesitate to contact HPC Support at hpcsupport@plymouth.ac.uk.

Using Slurm

You typically interact with Slurm by (1) specifying Slurm directives in job submission scripts (see examples below) and (2) issuing Slurm commands from the login nodes.

There are three key commands used to interact with Slurm on the command line:

  • sbatch

  • squeue

  • scancel

Check the Slurm man page for more advanced commands:

man slurm

The sbatch command

The sbatch command submits a job to Slurm:

sbatch job_script

This will submit your job script “job_script” to the job queue. See the sections below for details on how to write job scripts.
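When a job is accepted, sbatch prints a confirmation line of the form "Submitted batch job <jobid>". The sketch below shows one way to capture that ID for later use with squeue or scancel. The helper name parse_job_id is hypothetical, and the job ID used here is a made-up sample value.

```shell
# Extract the numeric job ID from sbatch's confirmation line.
# parse_job_id is an illustrative helper, not part of Slurm.
parse_job_id() {
    # $1: a line such as "Submitted batch job 123456"
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# On a login node you would typically write:
#   jobid=$(parse_job_id "$(sbatch job_script)")
# Here a sample confirmation line stands in for the real sbatch output.
jobid=$(parse_job_id "Submitted batch job 123456")
echo "Job ID: ${jobid}"
```

Slurm also offers sbatch --parsable, which prints the job ID in a machine-readable form; the parsing approach above is only one option.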

The squeue command

Use the squeue command to view the job queue. For example:

squeue

will list all jobs on foseres.

You can view just your jobs by using:

squeue -u <username>

To see more information about a queued job, use:

scontrol show jobid -dd <jobid>

The scancel command

Use this command to delete a job from foseres’s job queue. For example:

scancel <jobid>

will remove the job with ID <jobid> from the queue.

Queues

Please note that the Slurm job scheduler uses the term ‘partition’ to refer to a queue, so you may see the two words used interchangeably both here and on other sites. There are two queues on foseres: the normal queue and the test queue. The test queue can also be used for pre- and post-processing, as it is usually quite empty and therefore has a shorter wait time.

normal

  • Maximum job size: 48 nodes (1536 cores)

  • Maximum walltime: 4320 minutes (3 days)

test

  • Maximum job size: 2 nodes (64 cores)

  • Maximum walltime: 4320 minutes (3 days)

Some nodes may occasionally be “down”, in which case fewer nodes are available.

If you have a special request, contact hpcsupport@plymouth.ac.uk.
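The queue limits listed above can be checked before submitting a job. The following is a minimal sketch, assuming the limits are exactly those quoted in this section; the function name fits_queue is hypothetical and the check does not query Slurm itself.

```shell
# Check a request against the queue limits listed in this section.
# Usage: fits_queue <queue> <nodes> <walltime_minutes>
# fits_queue is an illustrative helper, not a Slurm command.
fits_queue() {
    fq_max_minutes=4320               # 3 days, the same for both queues
    case "$1" in
        normal) fq_max_nodes=48 ;;
        test)   fq_max_nodes=2 ;;
        *) echo "unknown queue: $1" >&2; return 2 ;;
    esac
    [ "$2" -le "$fq_max_nodes" ] && [ "$3" -le "$fq_max_minutes" ]
}

# A 4-node, 90-minute request is too large for the test queue.
if fits_queue test 4 90; then
    echo "request fits the test queue"
else
    echo "request exceeds the test queue limits"
fi
```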

Output from Slurm jobs

By default, Slurm writes the standard output and standard error of each batch job to a single file, slurm-<Job ID>.out, in the directory from which the job was submitted. The -o option (used in the example scripts below) and the -e option change these file names. The files are complete once your job has finished or its maximum allocated time to run (i.e. walltime, see later sections) has run out.
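As a sketch of how the output file names can be set explicitly, the directives below use the standard sbatch options -o and -e with the %x (job name) and %j (job ID) placeholders. The job name and file name patterns are illustrative choices, not foseres defaults.

```shell
#!/bin/bash

#SBATCH -J myjob              # Job name (illustrative)
#SBATCH -o %x.%j.out          # stdout file, e.g. myjob.123456.out
#SBATCH -e %x.%j.err          # stderr file, e.g. myjob.123456.err

echo "this line goes to the stdout file"
echo "this line goes to the stderr file" >&2
```

If -e is omitted, standard error is written to the same file as standard output.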

Running MPI parallel jobs

When you are running parallel jobs that require MPI, you will use an MPI launch command to start your executable in parallel.

Intel MPI

Intel MPI is accessed at runtime by loading the appropriate modules, e.g.:

module load intel_oneapi/compiler/2021.2.0  intel_oneapi/mpi/2021.2.0

Intel MPI: parallel job launcher mpirun

The Intel MPI parallel job launcher on foseres is mpirun.

A sample MPI launch line using mpirun looks like:

mpirun -n 128 -ppn 32 ./my_mpi_executable.x arg1 arg2

This will start the parallel executable my_mpi_executable.x with arguments “arg1” and “arg2”. The job will be started using 128 MPI processes, with 32 MPI processes placed on each compute node (this would use all the physical cores on each node). This would require 4 nodes to be requested in the Slurm options.
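The relationship between -n, -ppn and the number of nodes to request is a rounded-up division. A minimal sketch of that arithmetic follows; the function name nodes_needed is hypothetical.

```shell
# Number of nodes needed for a given process count and processes per node.
# nodes_needed is an illustrative helper.
nodes_needed() {
    # $1: total MPI processes (-n), $2: MPI processes per node (-ppn)
    echo $(( ($1 + $2 - 1) / $2 ))   # integer ceiling of $1 / $2
}

nodes_needed 128 32   # prints 4, matching the example above
```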

The most important mpirun flags are:

-n [total number of MPI processes]

Specifies the total number of distributed memory parallel processes (not including shared-memory threads). For pure MPI jobs that use all physical cores this will usually be a multiple of 32. The default on foseres is 1.

-ppn [parallel processes per node]

Specifies the number of distributed memory parallel processes per node. There is a choice of 1-32 for physical cores on foseres compute nodes (1-64 if you are using Hyper-Threading). For pure MPI jobs, the most economical choice is usually to run with “fully-packed” nodes on all physical cores if possible, i.e. -ppn 32. Running “unpacked” or “underpopulated” (i.e. not using all the physical cores on a node) is useful if you need large amounts of memory per parallel process or you are using more than one shared-memory thread per parallel process.

Documentation on using Intel MPI (including mpirun) can be found online at:

Example parallel MPI job submission scripts

Example job submission scripts are included in full below. They are also available via the following links:

We recommend that you use the following C code to test the aforementioned Slurm script

Using Intel oneAPI as an example, the test programs can be compiled by typing:

module load compiler/2021.2.0 mpi/2021.2.0
mpiicc helloworld_mpi.c -o helloworld_mpi
mpiicc -qopenmp helloworld_hybrid.c -o helloworld_hybrid

Example: Intel MPI job submission script for MPI parallel job

A simple MPI job submission script to submit a job using 2 compute nodes (maximum of 64 physical cores) for 1.5 hours would look like:

#!/bin/bash

#SBATCH -J test               # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -N 2                  # Total number of nodes requested
#SBATCH -n 64                 # Total number of mpi tasks requested
#SBATCH -t 01:30:00           # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -p normal             # Request a specific queue: normal or test.
                              # If that line is removed, the job will be scheduled
                              # in the test partition

# Load the modules needed by the MPI-based executable
module purge
module load compiler/2021.2.0  mpi/2021.2.0 # using intel_oneAPI compilers and intel MPI implementation.

# Launch the parallel job
#   Using  64 MPI processes and 32 MPI processes per node
mpirun -np ${SLURM_NTASKS} -ppn 32 ./my_mpi_executable.x arg1 arg2 # $SLURM_NTASKS is automatically set to 64 in that case.

This will run your executable “my_mpi_executable.x” in parallel on 64 MPI processes using 2 nodes (32 cores per node, i.e. not using hyper-threading). Slurm will allocate 2 nodes to your job and mpirun will place 32 MPI processes on each node (one per physical core).
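Rather than hard-coding -ppn, the value can be derived from the variables Slurm sets inside a job. The sketch below assumes the standard SLURM_NTASKS and SLURM_JOB_NUM_NODES variables and falls back to 64 tasks on 2 nodes when run outside a job; the function name launch_line is hypothetical, and the command is only printed, not executed.

```shell
# Build the mpirun command line from the task and node counts.
# launch_line is an illustrative helper.
launch_line() {
    # $1: total MPI tasks, $2: number of nodes
    ll_ppn=$(( $1 / $2 ))            # processes per node
    echo "mpirun -np $1 -ppn ${ll_ppn} ./my_mpi_executable.x arg1 arg2"
}

# Inside a job the Slurm variables are set; the defaults are for illustration.
launch_line "${SLURM_NTASKS:-64}" "${SLURM_JOB_NUM_NODES:-2}"
```

This keeps the launch line consistent with the #SBATCH -N and -n values if they are changed later.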

Example: Intel MPI job submission script for MPI+OpenMP (mixed mode) parallel job

Mixed mode codes that use both MPI (or another distributed memory parallel model) and OpenMP must ensure that the shared memory portion of the process/thread placement does not span more than one node. On foseres, with 32 physical cores per node, this means that the number of shared memory threads should be a factor of 32.

In the example below, we are using 1 node for 1.5 hours. There are 8 MPI processes in total and 4 OpenMP threads per MPI process, which fills the 32 physical cores of the node. Note the use of the I_MPI_PIN_DOMAIN environment variable to specify that MPI process placement should leave space for threads.

#!/bin/bash

#SBATCH -J test_hybrid        # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -N 1                  # Total number of nodes requested
#SBATCH -n 8                  # Total number of mpi tasks requested
#SBATCH -t 01:30:00           # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -p test               # Request  a  specific queue: normal or test.
                              # If that line is removed, the job will be scheduled
                              # in the test partition

# Load the modules needed by the MPI-based executable
module purge
module load  intel/19.0.5.281 impi/2019.5.281 # using intel compilers and intel MPI suite version 2019.5.281


# Set the number of threads to 4
#   There are 4 OpenMP threads per MPI process
export OMP_NUM_THREADS=4

# Set placement to support hybrid jobs
export I_MPI_PIN_DOMAIN=omp

# Launch the parallel job
#   Using 8 MPI processes
#   8 MPI processes per node
#   4 OpenMP threads per MPI process
mpirun -n ${SLURM_NTASKS} -ppn 8 ./my_mixed_executable arg1 arg2 # $SLURM_NTASKS is automatically set to 8 in that case.
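The -ppn value for a hybrid job follows from the thread count: it is the number of physical cores per node divided by OMP_NUM_THREADS. The sketch below assumes the 32 physical cores per node quoted elsewhere on this page and rejects thread counts that do not divide that number evenly; the function name hybrid_ppn is hypothetical.

```shell
# MPI processes per node for a hybrid MPI+OpenMP job on 32-core nodes.
# hybrid_ppn is an illustrative helper.
hybrid_ppn() {
    # $1: OpenMP threads per MPI process
    hp_cores=32                       # physical cores per node (assumed)
    if [ $(( hp_cores % $1 )) -ne 0 ]; then
        echo "error: $1 threads do not divide ${hp_cores} cores evenly" >&2
        return 1
    fi
    echo $(( hp_cores / $1 ))
}

hybrid_ppn 4   # prints 8
```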

Interactive Jobs

When you are developing or debugging code you often want to run many short jobs with a small amount of code editing between runs. This can be done by running MPI on the login nodes, but you may want to test on the compute nodes (e.g. you may want to test running on multiple nodes across the high performance interconnect). One of the best ways to achieve this on foseres is to use interactive jobs.

An interactive job allows you to issue mpirun commands directly from the command line without using a job submission script, and to see the output from your program directly in the terminal.

To submit a request for an interactive job reserving 1 node with 32 processes you would issue the following srun command from the command line:

srun -N1 -n32 --pty bash -i

When you submit this job your terminal will display something like:

[username@node2 ~]$

It may take some time for your interactive job to start. Once it runs you will enter a standard interactive terminal session. Whilst the interactive session lasts you will be able to run parallel jobs on the compute nodes by issuing the mpirun command directly at your command prompt (remember you will need to load modules before running) using the same syntax as you would inside a job script. The maximum number of cores you can use is limited by the number of nodes and tasks (-N and -n) you specify when you submit the request for the interactive job.

If you know you will be doing a lot of intensive debugging you may find it useful to request an interactive session lasting the expected length of your working session, say a full day.

Your session will end when you hit the requested walltime. If you wish to finish before this you should use the exit command.
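The length of an interactive session can be requested with the standard srun -t option (hh:mm:ss). The sketch below only builds and prints the command line so that it can be inspected before use; the function name interactive_cmd and the 8-hour walltime are illustrative assumptions.

```shell
# Build an srun command line for an interactive session.
# interactive_cmd is an illustrative helper.
interactive_cmd() {
    # $1: nodes, $2: tasks, $3: walltime in whole hours
    printf 'srun -N%d -n%d -t %02d:00:00 --pty bash -i\n' "$1" "$2" "$3"
}

interactive_cmd 1 32 8
```

The requested walltime must stay within the limit of the queue you are using.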

Advanced topics

Intel MPI: running hybrid MPI/OpenMP applications

If you are running hybrid MPI/OpenMP code using Intel MPI you need to set the I_MPI_PIN_DOMAIN environment variable to omp so that MPI tasks are pinned with enough space for OpenMP threads.

For example, in your job submission script you would use:

export I_MPI_PIN_DOMAIN=omp

You can then also use the KMP_AFFINITY environment variable to control placement of OpenMP threads. For more information, see:

Intel MPI: Process Placement

By default, MPI processes are placed on nodes in a round-robin format. For example, if you use 4 nodes, with 16 MPI processes in total and 4 MPI processes per node, and launch your program with the command:

mpirun -n 16 -ppn 4 /path/to/my/exe

the processes would be placed in the following way:

MPI process 0: placed on Node 1
MPI process 1: placed on Node 2
MPI process 2: placed on Node 3
MPI process 3: placed on Node 4
MPI process 4: placed on Node 1
MPI process 5: placed on Node 2
MPI process 6: placed on Node 3
MPI process 7: placed on Node 4
MPI process 8: placed on Node 1
...
MPI process 15: placed on Node 4

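The alternative way to place MPI processes would be to fill one node with processes before moving onto the next node (this is often known as SMP placement). This can be achieved within a Slurm job on foseres by using the -f flag to pass a node list file explicitly. Slurm does not create such a file itself, but one can be generated from the $SLURM_JOB_NODELIST variable with scontrol. For example:

scontrol show hostnames $SLURM_JOB_NODELIST > nodefile.$SLURM_JOB_ID
mpirun -n 16 -ppn 4 -f nodefile.$SLURM_JOB_ID /path/to/my/exe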

The processes would be placed in the following way:

MPI process 0: placed on Node 1
MPI process 1: placed on Node 1
MPI process 2: placed on Node 1
MPI process 3: placed on Node 1
MPI process 4: placed on Node 2
MPI process 5: placed on Node 2
MPI process 6: placed on Node 2
MPI process 7: placed on Node 2
MPI process 8: placed on Node 3
...
MPI process 15: placed on Node 4
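The two placement schemes listed above differ only in how a process rank is mapped to a node. The following sketch reproduces both listings from that arithmetic. It does not call mpirun or inspect a real job; the function name placement and its arguments are hypothetical.

```shell
# Print the node each MPI process lands on for the two placement schemes.
# placement is an illustrative helper.
placement() {
    # $1: scheme (roundrobin|block), $2: total MPI processes, $3: nodes
    pl_ppn=$(( $2 / $3 ))             # processes per node
    pl_rank=0
    while [ "$pl_rank" -lt "$2" ]; do
        if [ "$1" = block ]; then
            pl_node=$(( pl_rank / pl_ppn + 1 ))   # fill one node, then the next
        else
            pl_node=$(( pl_rank % $3 + 1 ))       # cycle through the nodes
        fi
        echo "MPI process ${pl_rank}: placed on Node ${pl_node}"
        pl_rank=$(( pl_rank + 1 ))
    done
}

placement block 16 4
```

Calling placement roundrobin 16 4 prints the round-robin listing instead.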