SLURM

Access to the CPU cores and GPUs is based on a Slurm batch system.

Submitting

To submit a job to a Slurm partition (queue), specify the job options as #SBATCH directives in the header of the batch script. For example, the file could contain:

#!/bin/bash
#SBATCH --job-name=myJobName  # the name of the job
#SBATCH -p main               # partition (queue)
#SBATCH --time=2:00:00        # time limit for the job (hours:minutes:seconds)
#SBATCH --cpus-per-task=8     # how many threads you wish to use for the given job
#SBATCH -e %s                 # location where STDERR will be written
#SBATCH -o %s                 # location where STDOUT will be written
env                           # print the environment variables used
date                          # print the datetime when the script starts

python myProgram.py           # run the script you wish to submit to the cluster
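
The script itself is handed to the scheduler with sbatch, which normally replies with a line of the form "Submitted batch job <jobid>". The following is a minimal sketch of submitting and keeping the job ID for later use with scancel or sacct. The file name myJob.sh and the helper extract_job_id are assumptions made for illustration, not part of Slurm.

```shell
#!/bin/bash
# Hypothetical file name for the batch script shown above.
job_script="myJob.sh"

# Hypothetical helper: pull the job ID out of sbatch's confirmation line,
# which normally reads "Submitted batch job <jobid>".
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Submit only if Slurm is installed and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    job_id=$(sbatch "$job_script" | extract_job_id)
    echo "Submitted job $job_id"
fi
```

sbatch also accepts --parsable, which prints the job ID without the surrounding text, so the helper is only needed when the default output is used.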

NOTE: Please test your scripts before flooding the cluster with broken jobs. A checklist for a given job can be found here.
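
One possible way to catch obvious mistakes before submitting is sketched below. The function name check_job_script is hypothetical and the checks are only a suggestion, not the checklist referred to above. It verifies that the file exists and starts with a shebang, asks bash to parse it without running it, and, if Slurm is installed, lets sbatch --test-only validate the request without queueing anything.

```shell
#!/bin/bash
# check_job_script is a hypothetical helper, not a Slurm command.
check_job_script() {
    script="$1"
    # The file must exist and start with a shebang line.
    [ -f "$script" ] || { echo "missing: $script" >&2; return 1; }
    head -n 1 "$script" | grep -q '^#!' || { echo "no shebang: $script" >&2; return 1; }
    # bash -n parses the script without executing it.
    bash -n "$script" || return 1
    # If Slurm is available, validate the request without queueing the job.
    if command -v sbatch >/dev/null 2>&1; then
        sbatch --test-only "$script" || return 1
    fi
    return 0
}
```

A call such as check_job_script myJob.sh && sbatch myJob.sh would then submit only scripts that pass these basic checks. It does not replace running the program itself on a small input first.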

Available queues

Partition	Time limit¹	Use case
long	14-00:00:00	Regular jobs that require a longer time limit
gpu	8-00:00:00	Jobs that are to be executed on GPUs
main	2-00:00:00	Regular jobs
io	2-00:00:00	IO-heavy jobs, meaning a lot of reading from and writing to disk
short	0-02:00:00	Regular jobs that take a short time to execute

For more up-to-date information on the available queues and corresponding time limits, simply run sinfo.
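
When choosing a partition from a script, it can help to compare time limits as plain seconds. The sketch below converts a limit written as days-hours:minutes:seconds, or hours:minutes:seconds, into seconds. The function name timelimit_to_seconds is hypothetical, and other forms that Slurm accepts (such as "infinite") are deliberately rejected rather than guessed at.

```shell
#!/bin/bash
# timelimit_to_seconds is a hypothetical helper: it converts a limit such as
# 2-00:00:00 (days-hours:minutes:seconds) or 02:00:00 into seconds.
timelimit_to_seconds() {
    echo "$1" | tr -d ' ' | awk -F'[-:]' '
        NF == 4 { print (($1 * 24 + $2) * 60 + $3) * 60 + $4; next }
        NF == 3 { print ($1 * 60 + $2) * 60 + $3; next }
        { exit 1 }'   # any other form is treated as an error
}

timelimit_to_seconds "2-00:00:00"   # limit of the main partition in the table
```

The current limits can be listed with sinfo -h -o "%P %l", where %P is the partition name and %l its time limit, and each limit can then be passed to the helper.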

Useful commands

For all possible options, see the Slurm documentation.

Cancelling your job(s)

To cancel your jobs, use scancel. For example, to cancel all of your jobs that are in the PENDING state and are named MyJob:

scancel -u $USER -t PENDING --name MyJob

In order to cancel a single job:

scancel <jobid>

Checking job queues

Check how many jobs are currently in the queue

squeue -h | wc -l

Check how many jobs you have submitted to the queue

squeue -u $USER -h | wc -l

Check how many jobs are currently in the running state

squeue -h -t r | wc -l

Check how many jobs are currently in the pending state

squeue -h -t pd | wc -l

Check how many jobs each user has submitted to the queue

squeue -h -o "%u" | sort | uniq -c | sort -nr

Display the command, job ID, elapsed runtime and user for each job in the queue

squeue -h -o "%o %A %M %u"

To avoid typing or copying the per-user job count command every time you want to check this, you can add it as an alias to your ~/.bashrc:

echo "alias sstatus='squeue -h -o \"%u\" | sort | uniq -c | sort -nr'" >> ~/.bashrc

After opening a new shell (or running source ~/.bashrc), simply run sstatus to get the output in your terminal.
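
A shell function is an alternative to an alias and avoids nested quoting. The sketch below uses a hypothetical function name, count_jobs_per_user, and reads user names from standard input so that it can be tried without access to the queue. Sorting on the first column (the count) puts the busiest user first, with ties ordered by user name.

```shell
#!/bin/bash
# count_jobs_per_user is a hypothetical helper: it reads one user name per
# line on standard input and prints "count user", busiest user first.
count_jobs_per_user() {
    sort | uniq -c | sort -k1,1nr -k2 | awk '{ print $1, $2 }'
}

# With Slurm available, feed it the owner of every job in the queue.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -o "%u" | count_jobs_per_user
fi
```

Placing the function definition in ~/.bashrc makes it available in every new shell, in the same way as the alias.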

Job info

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc. To get statistics on completed jobs by jobID:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:

sacct -u $USER --format=JobID,JobName,MaxRSS,Elapsed
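
MaxRSS is typically reported with a unit suffix, for example 2048K. The sketch below converts such a value to MiB so that jobs can be compared more easily. The function name maxrss_to_mib is hypothetical, and it assumes that the suffixes K, M and G denote powers of 1024. Values in any other form are rejected instead of being interpreted.

```shell
#!/bin/bash
# maxrss_to_mib is a hypothetical helper: it converts a value such as 2048K,
# 512M or 2G to MiB, assuming the suffixes are powers of 1024.
maxrss_to_mib() {
    echo "$1" | LC_ALL=C awk '
        /^[0-9.]+[KMG]$/ {
            value = substr($0, 1, length($0) - 1)   # numeric part
            unit = substr($0, length($0))           # trailing suffix
            if (unit == "K") value /= 1024
            if (unit == "G") value *= 1024
            printf "%.1f\n", value
            found = 1
        }
        END { exit !found }'   # non-zero status if the value was not recognised
}

maxrss_to_mib "2048K"
```

To get values that are easy to feed into the helper, sacct can be run with -n (no header) and -P (fields separated by "|"), for example sacct -j <jobid> -n -P --format=JobID,MaxRSS.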

Alternatively, scontrol can be used to gather more detailed information about a job, but the output is more difficult to parse:

scontrol -o -d show job <jobid>

¹ The time limit is given as days-hours:minutes:seconds.