SLURM
Access to the CPU cores and GPUs is based on a Slurm batch system.
Submitting
To submit a job to the chosen Slurm queue, specify a set of options in the header of the batch script and pass the script to sbatch. The content of the file could be, for example:
#!/bin/bash
#SBATCH --job-name=myJobName # the name of the job
#SBATCH -p main # partition (queue)
#SBATCH --time=2:00:00 # amount of time the job takes
#SBATCH --cpus-per-task=8 # how many threads you wish to use for the given job
#SBATCH -e %s # location where STDERR will be written
#SBATCH -o %s # location where STDOUT will be written
env # print the environment variables used
date # print the datetime when the script starts
python myProgram.py # run the program you wish to submit to the cluster
NOTE: Please test your scripts before flooding the cluster with broken jobs. A checklist for a given job can be found here.
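One way to act on the note above is a small pre-submission check. The following is only a sketch: the function name precheck_job, the file name myJob.sh and the particular checks are assumptions made for illustration, not part of the cluster's own checklist. It relies on bash -n (parse without executing) and on sbatch --test-only, which asks Slurm to validate the script without submitting a job.

```shell
#!/bin/bash
# precheck_job: sanity-check a batch script before submitting it.
# The function name and the chosen checks are illustrative assumptions.
precheck_job() {
    local script="$1"

    # The script must exist and be readable.
    [ -r "$script" ] || { echo "cannot read $script" >&2; return 1; }

    # Parse the script without executing it to catch shell syntax errors.
    bash -n "$script" || { echo "syntax error in $script" >&2; return 1; }

    # Refuse scripts whose header contains no #SBATCH directives at all.
    grep -q '^#SBATCH' "$script" || { echo "no #SBATCH lines in $script" >&2; return 1; }

    # If Slurm is available, let the scheduler validate the job
    # without actually submitting it.
    if command -v sbatch >/dev/null 2>&1; then
        sbatch --test-only "$script"
    fi
}

# Usage (myJob.sh is a hypothetical file name):
#   precheck_job myJob.sh && sbatch myJob.sh
```

This only catches structural problems; running the program itself on a small input is still worthwhile before submitting many jobs.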
Available queues
Partition | Timelimit [1] | Use case |
---|---|---|
long | 14-00:00:00 | Meant for regular jobs that require a longer time limit |
gpu | 8-00:00:00 | Meant for jobs that are to be executed on GPUs |
main | 2-00:00:00 | Regular jobs |
io | 2-00:00:00 | Meant for jobs that are IO heavy, meaning a lot of reading/writing from/to disk |
short | 0-02:00:00 | Meant for regular jobs that take a short time to execute |
For up-to-date information on the available queues and their time limits, run sinfo.
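To compare a job's expected run time against a partition limit, it can help to turn the limit into seconds. The helper below is a minimal sketch; the name slurm_time_to_seconds is made up for illustration, and it only handles the days-hours:minutes:seconds and hours:minutes:seconds forms, not the shorter forms Slurm also accepts.

```shell
#!/bin/bash
# slurm_time_to_seconds: convert D-HH:MM:SS or HH:MM:SS to seconds.
# Hypothetical helper; other Slurm time formats are not handled.
slurm_time_to_seconds() {
    local limit="$1" days=0 h m s

    # Split off the optional "days-" prefix.
    if [[ "$limit" == *-* ]]; then
        days="${limit%%-*}"
        limit="${limit#*-}"
    fi

    # Split the remainder on ":" into hours, minutes and seconds.
    IFS=: read -r h m s <<< "$limit"

    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Examples:
#   slurm_time_to_seconds 14-00:00:00   # prints 1209600
#   slurm_time_to_seconds 2:00:00       # prints 7200
```

On the cluster, sinfo -h -o "%P %l" prints each partition name together with its time limit, which can be fed to such a helper.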
Useful commands
For all possible options, see the Slurm documentation.
Cancelling your job(s)
To cancel your jobs, use scancel. For example, to cancel all your jobs with status PENDING and the name MyJob, or to cancel a single job by its ID:
scancel -u $USER -t PENDING --name MyJob
scancel <jobid>
Checking job queues
Check how many jobs are currently in the queue
squeue -h | wc -l
Check how many jobs you have submitted to the queue
squeue -u $USER -h | wc -l
Check how many jobs are currently in the running state
squeue -h -t r | wc -l
Check how many jobs are currently in the pending state
squeue -h -t pd | wc -l
Check how many jobs each user has submitted to the queue
squeue -h -o "%u" | sort | uniq -c | sort -nr
Display the actual command, job ID, elapsed run time and user for each job in the queue
squeue -h -o "%o %A %M %u"
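The counting commands above all follow the same pattern, so they can be folded into one function that groups the queue by any single squeue format field. This is only a sketch; the name jobs_by is an assumption, and the default field %T (job state) is chosen for illustration.

```shell
#!/bin/bash
# jobs_by: count queue entries grouped by one squeue format field.
# Hypothetical helper. Common fields: %T = job state, %u = user, %P = partition.
jobs_by() {
    local field="${1:-%T}"
    # Print one value per job, group identical values, then sort by count.
    squeue -h -o "$field" | sort | uniq -c | sort -nr
}

# Usage:
#   jobs_by        # jobs per state
#   jobs_by %u     # jobs per user
#   jobs_by %P     # jobs per partition
```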
To avoid typing out or copying the per-user job count command every time, you can add it as an alias to your .bashrc:
echo "alias sstatus='squeue -h -o %u | sort | uniq -c | sort -nr'" >> ~/.bashrc
source ~/.bashrc
sstatus
Job info
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc. To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u $USER --format=JobID,JobName,MaxRSS,Elapsed
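sacct reports MaxRSS per job step, so finding the peak memory of a whole job means scanning several rows. The sketch below reads pipe-separated output (as produced by the --parsable2 and --noheader options of sacct) and prints the largest value in MiB. The function name peak_rss_mib is made up, and the assumption that a value without a unit suffix is in bytes should be checked against your Slurm version.

```shell
#!/bin/bash
# peak_rss_mib: read "JobID|MaxRSS" lines on stdin and print the largest
# MaxRSS in MiB. Hypothetical helper; unit handling is an assumption.
peak_rss_mib() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                   # numeric part, e.g. 204800 from "204800K"
            unit  = substr($2, length($2))   # trailing unit letter, if any
            if (unit == "K")      value /= 1024
            else if (unit == "G") value *= 1024
            else if (unit == "T") value *= 1048576
            else if (unit != "M") value /= 1048576   # no suffix: assume bytes
            if (value > peak) peak = value
        }
        END { printf "%.1f\n", peak }
    '
}

# Usage:
#   sacct -j <jobid> --format=JobID,MaxRSS --parsable2 --noheader | peak_rss_mib
```

Rows with an empty MaxRSS field, such as the top-level job line, are skipped.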
Alternatively, one can use scontrol to gather more information about jobs, but the output is more difficult to parse:
scontrol -o -d show job <jobid>
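Because the one-line scontrol output is a long run of key=value pairs, a small filter makes it easier to read. The following is a sketch under the assumption that the fields of interest contain no spaces; the function name job_fields and the selection of fields are illustrative.

```shell
#!/bin/bash
# job_fields: pick selected key=value fields out of one-line scontrol output.
# Hypothetical helper; values that contain spaces would be split incorrectly.
job_fields() {
    # Put every key=value pair on its own line, then keep the chosen keys.
    tr ' ' '\n' | grep -E '^(JobId|JobState|RunTime|TimeLimit|NodeList)='
}

# Usage:
#   scontrol -o show job <jobid> | job_fields
```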
[1] Timelimit is given as days-hours:minutes:seconds