Introduction to SLURM: Simple Linux Utility for Resource Management

  • An open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
  • HPC system administrators use SLURM to distribute resources smoothly among users. A user submits jobs, each requesting specific resources, to the centralized manager.

The three objectives of SLURM:

  • Lets a user request a compute node to do an analysis (job)
  • Provides a framework (commands) to start, cancel, and monitor a job
  • Keeps track of all jobs to ensure everyone can efficiently use all computing resources without stepping on each other's toes.

SLURM Commands:

The main SLURM user commands give the user access to information about the supercomputing cluster and the ability to submit or cancel a job. See the table below for a description of the main SLURM user functions.

Command     Description
sbatch      Submit a batch script to SLURM
squeue      List all jobs currently running or in the queue
scancel     Cancel a job you submitted
sinfo       Check the availability of nodes within all partitions
scontrol    See the configuration of a specific node or information about a job
sacct       Display accounting data for all jobs
salloc      Reserve an interactive node

squeue

The first SLURM command to learn is squeue. It provides a list of all jobs that have been submitted to the SLURM scheduler by everyone using the supercomputer. This command can tell you how busy a supercomputing resource is and whether your job is running or not.

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        2910274 long_1nod porechop  severin  PD   3:30:32      1 (Nodes required for job are DOWN, Drained or reserved)
        2910262 long_1nod       sh  severin  R    4:01:00      1 nova013
        2909617 long_1nod     bash   remkv6  R    7:13:38      1 nova027
Header column      Definition
JOBID              The ID the job has been given, usually a large number
PARTITION          The partition assigned to a given job
NAME               The name provided to SLURM by the user for this job
USER               The name of the user who submitted the job
ST                 The state of the job, e.g. running (R) or pending (PD)
NODES              The number of nodes requested
NODELIST(REASON)   The node(s) the job is running on (or the reason why it is not running)

This list can be very long, and often you only want to see your own jobs. To do that, specify a user with the ‘-u’ parameter.

squeue -u $USER

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2867457     short P3826e00 sivanand  R   21:50:29      1 ceres14-compute-53
           2867458     short P6370337 sivanand  R   21:50:29      1 ceres14-compute-53
           2867459     short Pa0567fb sivanand  R   21:50:29      1 ceres19-compute-38
           2867456      long   Falcon sivanand  R   21:50:45      1 ceres14-compute-55
           2867883     short       sh sivanand  R      48:03      1 ceres14-compute-64

In the above example, $USER is an environment variable that holds your username.
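
If you have many jobs, you can also filter the list by job state. As a minimal sketch using standard squeue options, -t restricts the output to jobs in the given state:

squeue -u $USER -t RUNNING    # show only your running jobs
squeue -u $USER -t PENDING    # show only your jobs still waiting in the queue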

scancel

If you submit a job and realize you need to cancel it for some reason, you will use the scancel command with the JOBID shown in the squeue output described above.

scancel 2867457

This sends a signal to the SLURM scheduler to stop a running job or remove a pending job from the SLURM queue.
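
scancel can also act on more than one job at a time. As a hedged example using standard scancel options, you can cancel all of your own jobs, or all of your jobs with a given name:

scancel -u $USER        # cancel all of your jobs
scancel --name=sleep    # cancel all of your jobs named "sleep"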

sbatch

The sbatch command is the most important command, as it is used to submit jobs to the supercomputing cluster. A job is a script that runs on computing resources. The script contains the commands you want to run on the compute node.

sbatch slurm.batch.sh

It is super easy to use once you have written the SLURM submission script. This is the part that many new users get stuck on, but it really isn't so bad: you just have to add a header to a text file that contains your commands.

SLURM batch script: Guidelines

The SLURM script contains a header of SLURM SBATCH comments, each starting with #SBATCH. These comments tell the SLURM scheduler the following information:

  • Number of nodes
  • Desired number of processors or jobs
  • Type of partition/queue you want to use (optional)
  • Memory requirement (Optional)
  • Length of time you want to run the job (Each partition has a default)
  • Where to write output and error files
  • Name for your job while running on HPC
  • Email ID to get job status (Optional)

Here is a table describing the most commonly used #SBATCH comments:

SBATCH command                       Description
#SBATCH -N 1                         Reserve a single node
#SBATCH -n 4                         Request 4 tasks (the job steps will launch at most 4 tasks)
#SBATCH -p short                     Reserve nodes in the short partition
#SBATCH -t 01:00:00                  Reserve for 01 hours:00 minutes:00 seconds
#SBATCH -J sleep                     Name the job "sleep"
#SBATCH -o sleep.o%j                 Write any standard output to a file named sleep.o%j, where %j is automatically replaced with the job ID
#SBATCH -e sleep.e%j                 Write any standard error to a file named sleep.e%j, where %j is automatically replaced with the job ID
#SBATCH --mail-user=user@domain.edu  Notify me at this email address
#SBATCH --mail-type=begin            Notify by email when the job begins
#SBATCH --mail-type=end              Notify by email when the job ends
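
The guidelines above also mention an optional memory requirement, which is not shown in the table. As a minimal sketch (the values here are hypothetical; your cluster will have its own defaults and limits), memory can typically be requested with one of the following directives:

## total memory per node (hypothetical value; adjust to your job)
#SBATCH --mem=8G
## or, alternatively, memory per allocated CPU (do not combine with --mem)
#SBATCH --mem-per-cpu=2G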

Super computing etiquette

One of the most important takeaways from this tutorial is that jobs are best run on compute nodes, not on the login node. We generally write a batch script that reserves the necessary resources and then lists the commands for the actual job you want to do. Obviously this example is trivial; in reality, most jobs run by users involve at least some heavy computing or memory use. It is poor etiquette to do any intensive computing on the login node, as it slows everyone down, sometimes to the point where no one can even use the ls command.

Writing a SLURM job script

Now that you know a little more about #SBATCH comments, a SLURM job script is straightforward to write. It contains two components:

  • SLURM header with #SBATCH comments that define the resources you need
  • The commands you want to run

SLURM header

Once you have written this header, you can reuse it for other scripts by modifying the #SBATCH comments according to your needs.

#!/bin/bash
## The shebang line: the absolute path to the bash interpreter

## All the lines below that start with #SBATCH are SLURM SBATCH comments

#SBATCH -N 1
#SBATCH -n 4
#SBATCH -p short
#SBATCH -t 01:00:00
#SBATCH -J sleep
#SBATCH -o sleep.o%j
#SBATCH -e sleep.e%j
#SBATCH --mail-user=user@domain.edu
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

cd $SLURM_SUBMIT_DIR  # change into the directory from which you submitted the script once the job starts

Commands you want to run

In this example we will be taking advantage of the sleep command.

## The following lines are the commands that you want to run


sleep 10 && echo "I slept for 10 seconds"
sleep 20 && ech "I slept for 20 seconds"
## Note: in the above line, I deliberately misspelt `ech`; this will cause an error message to be written to standard error
sleep 60 && echo "I slept for 1 min"

scontrol show job $SLURM_JOB_ID
## scontrol is a SLURM command to view the SLURM configuration and state. It is useful for seeing how many of the requested resources your job used.

Copy the “SLURM header” and the “Commands you want to run” into a new file.

  • Save the job script as slurm.batch.sh

This script can be submitted as follows:

sbatch slurm.batch.sh

This job will run for at least 1-2 minutes, so soon after submitting you can issue commands to see the job run.

squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2935316     short    sleep sivanand  R       0:04      1 ceres14-compute-34

Note: We are using the -u option for squeue and supplying the variable $USER, which refers to your user name. We can see that the job, sleep, is running on the node ceres14-compute-34 in the short partition and has the job ID 2935316.
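
To keep an eye on the job without retyping the command, you can re-run squeue periodically, for example with the standard watch utility (assuming it is available on the login node):

watch -n 10 squeue -u $USER    # refresh the listing every 10 seconds; press Ctrl-C to exit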

Once the job is completed, the following files appear:

sleep.o2935316 # this is the standard output where 2935316 is the JOBID
sleep.e2935316 # this is the standard error where 2935316 is the JOBID

Let’s take a look at the standard output file

more sleep.o2935316

I slept for 10 seconds
I slept for 1 min

JobId=2935316 JobName=sleep
   UserId=sivanandan.chudalayandi(1727000561) GroupId=sivanandan.chudalayandi(1727000561) MCS_label=N/A
   Priority=213721 Nice=0 Account=scinet QOS=memlimit
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-05-18T10:40:25 EligibleTime=2020-05-18T10:40:26
   AccrueTime=2020-05-18T10:40:26
   StartTime=2020-05-18T10:40:26 EndTime=2020-05-18T11:40:26 Deadline=N/A
   PreemptEligibleTime=2020-05-18T10:40:26 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-05-18T10:40:26
   Partition=short AllocNode:Sid=ceres19-ipa-0:39699
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=ceres14-compute-34
   BatchHost=ceres14-compute-34
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=12400M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3100M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/project/isu_gif_vrsc/Siva/Service/Slurm/slurm.batch.sh
   WorkDir=/project/isu_gif_vrsc/Siva/Service/Slurm
   StdErr=/project/isu_gif_vrsc/Siva/Service/Slurm/sleep.e2935316
   StdIn=/dev/null
   StdOut=/project/isu_gif_vrsc/Siva/Service/Slurm/sleep.o2935316
   Power=

Note: the lines from JobId= through Power= are the SLURM configuration and state (from scontrol) and give you an idea of how many resources you used, as mentioned before. The first two lines are directly from our echo commands in the script.

Additionally, the error file sleep.e2935316:

more sleep.e2935316
/var/spool/slurmd/job2935316/slurm_script: line 16: ech: command not found

This tells us that the command ech (deliberately misspelt) was not found.

sinfo

Sometimes it can be difficult to get a node and you end up waiting in the SLURM queue for a long time, or you just want to test a script before you submit it and walk away, to make sure that it will run well. The easiest way to find out which nodes are available is the sinfo command.

$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug               up    1:00:00      1  maint ceres19-compute-26
debug               up    1:00:00      1    mix ceres14-compute-4
debug               up    1:00:00      1   idle ceres19-compute-25
brief-low           up    2:00:00      2  maint ceres19-compute-[26,40]
brief-low           up    2:00:00      1  down* ceres19-compute-37
brief-low           up    2:00:00     59    mix ceres18-compute-[0-17,19-27],ceres19-compute-[0-5,7-9,12,21-24,35-36,38-39,41-42,44-45,47,55-63]
brief-low           up    2:00:00      4  alloc ceres18-compute-18,ceres19-compute-[6,28,43]
brief-low           up    2:00:00     26   idle ceres19-compute-[10-11,13-20,25,27,29-34,46,48-54]
mem768-low          up    2:00:00      3   idle ceres18-mem768-0,ceres19-mem768-[0-1]
mem-low             up    2:00:00      3    mix ceres18-mem-[0-1],ceres19-mem-1

sinfo provides the following information:

Header column   Definition
PARTITION       A group of nodes
AVAIL           Whether the partition is up or down
TIMELIMIT       The amount of time a user can request a node in a given partition
NODES           The number of nodes in a given partition with a given STATE
STATE           The state of the nodes, e.g. maint, mix, idle, alloc, down
NODELIST        The node names with a given STATE

With this information it is possible to find partitions that have idle nodes that could be used for a job. Unfortunately, sinfo by itself is a bit messy, so I have created an alias that formats the output to be easier to read.

sinfo -o "%20P %5D %14F %10m %11l %N"
PARTITION            NODES NODES(A/I/O/T) MEMORY     TIMELIMIT   NODELIST
debug                3     0/3/0/3        126000+    1:00:00     ceres14-compute-4,ceres19-compute-[25-26]
brief-low            92    33/58/1/92     381000     2:00:00     ceres18-compute-[0-27],ceres19-compute-[0-63]
priority-gpu         1     1/0/0/1        379000     14-00:00:00 ceres18-gpu-0
short*               100   51/48/1/100    126000+    2-00:00:00  ceres14-compute-[1-24,26-29,32-39,44-56,58-67],ceres18-compute-[24-27],ceres19-compute-[27-63]
medium               67    49/17/1/67     126000+    7-00:00:00  ceres14-compute-[26-29,32-39,44-56,58-67],ceres18-compute-[25-27],ceres19-compute-[35-63]
long                 34    31/3/0/34      126000+    21-00:00:00 ceres14-compute-[44-56,58-67],ceres18-compute-[26-27],ceres19-compute-[55-63]
mem                  8     3/4/1/8        1530000+   7-00:00:00  ceres14-mem-[0-3],ceres18-mem-2,ceres19-mem-[2-4]
mem768               1     0/1/0/1        763000     7-00:00:00  ceres18-mem768-1
huge                 1     1/0/0/1        4:16:1   3095104    14990      1-00:00:00  fat,AVX,AVX2,AVX novahuge001

If you edit your .bashrc file in your home directory and add this alias, you can use si instead.

nano ~/.bashrc
#add the following line
alias si='sinfo -o "%20P %5D %14F %10m %11l %N"'
#exit nano
si
debug                3     0/3/0/3        126000+    1:00:00     ceres14-compute-4,ceres19-compute-[25-26]
brief-low            92    33/58/1/92     381000     2:00:00     ceres18-compute-[0-27],ceres19-compute-[0-63]
priority-gpu         1     1/0/0/1        379000     14-00:00:00 ceres18-gpu-0
short*               100   51/48/1/100    126000+    2-00:00:00  ceres14-compute-[1-24,26-29,32-39,44-56,58-67],ceres18-compute-[24-27],ceres19-compute-[27-63]
medium               67    49/17/1/67     126000+    7-00:00:00  ceres14-compute-[26-29,32-39,44-56,58-67],ceres18-compute-[25-27],ceres19-compute-[35-63]
long                 34    31/3/0/34      126000+    21-00:00:00 ceres14-compute-[44-56,58-67],ceres18-compute-[26-27],ceres19-compute-[55-63]
mem                  8     3/4/1/8        1530000+   7-00:00:00  ceres14-mem-[0-3],ceres18-mem-2,ceres19-mem-[2-4]
mem768               1     0/1/0/1        763000     7-00:00:00  ceres18-mem768-1
huge                 1     1/0/0/1        4:16:1   3095104    14990      1-00:00:00  fat,AVX,AVX2,AVX novahuge001

scontrol

If you need to see the configuration of a specific node, to determine whether that type of node is sufficient for your analysis or to diagnose a problem (like a segmentation fault from insufficient memory), scontrol can be used to look up information on a node, for example ceres14-compute-8:

$ scontrol show nodes ceres14-compute-8

NodeName=ceres14-compute-8 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.01
   AvailableFeatures=AVX
   ActiveFeatures=AVX
   Gres=(null)
   NodeAddr=ceres14-compute-8 NodeHostName=ceres14-compute-8 Version=19.05.5
   OS=Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020
   RealMemory=126000 AllocMem=0 FreeMem=85536 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=975 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=short,geneious
   BootTime=2020-02-17T17:14:55 SlurmdStartTime=2020-02-18T17:12:06
   CfgTRES=cpu=40,mem=126000M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Sometimes you want to know more about a job you just ran or one that is currently running.

scontrol show job JOBID

You can get the JOBID from the output of the squeue command:

scontrol show job 2909617

JobId=2909617 JobName=bash
   UserId=remkv6(298590) GroupId=domain users(101) MCS_label=N/A
   Priority=84730 Nice=0 Account=gif QOS=gif
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=06:58:38 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2020-05-18T07:29:05 EligibleTime=2020-05-18T07:29:05
   AccrueTime=Unknown
   StartTime=2020-05-18T07:29:05 EndTime=2020-05-18T17:29:05 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-05-18T07:29:05
   Partition=long_1node192 AllocNode:Sid=nova:71501
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nova027
   BatchHost=nova027
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20400M,node=1
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=5100M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/work/gif/remkv6/Baum/04_DovetailSCNGenome/01_mikadoRerurn/01_BrakerFix/braker
   Comment=Time 600, Med priority, overdrawn
   Power=

Hint: if you put this command at the end of your SLURM script, it will write this information to your standard output file after your job completes.

scontrol show job $SLURM_JOB_ID

sacct

This command provides useful accounting information about submitted jobs.

Column      Description
JobID       Job ID number
JobName     Name of the job
Partition   The partition the job is running in or queued for
Account     The account/group the job is running under
AllocCPUS   Number of CPUs allocated/requested
State       State of the job
ExitCode    Exit code of the job

By itself, this command will only give you information about your own jobs.

sacct

Adding the -a parameter will provide information about all accounts.

sacct -a

And there is a format option that can give more useful column information.

sacct -a --format JobID,Partition,Timelimit,Start,Elapsed,NodeList%20,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocCPUS
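
You can also restrict sacct to a single job with the -j option, which is handy for checking how much memory and time a finished job actually used. A minimal sketch, reusing the job ID from the earlier example:

sacct -j 2935316 --format JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode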

salloc

Interactive Session

We could also have run the commands in the job script interactively by first reserving a node in a partition using salloc.

# this command will give 1 node with 4 CPUs in the short partition for a time of 00 hours:30 minutes:00 seconds

$ salloc -N 1 -n 4 -p short -t 00:30:00

salloc: Pending job allocation 2935626
salloc: job 2935626 queued and waiting for resources
salloc: job 2935626 has been allocated resources
salloc: Granted job allocation 2935626
salloc: Waiting for resource configuration
salloc: Nodes ceres14-compute-48 are ready for job
export TMPDIR=/local/bgfs//2935626
export TMOUT=5400

An interactive session is primarily useful for small test runs of a large job, or for running quick tasks such as compressing or un-tarring a bunch of files.
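
Note that on some clusters, commands typed after salloc still execute on the login node unless you launch them with srun. A quick way to check where a command actually runs (a sketch assuming a typical SLURM setup) is:

srun hostname    # should print the name of the allocated compute node, e.g. ceres14-compute-48
exit             # leave the interactive session and release the allocation when you are done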

We can run the commands from our job script above directly in the interactive session.

sleep 10 && echo "I slept for 10 seconds"
I slept for 10 seconds

or

sleep 20 && ech "I slept for 20 seconds"
bash: ech: command not found

Additional Resources

References

This tutorial is a rehash of material found on schedmd.

Here are those useful aliases you might want to put in your ~/.bashrc file:

alias si='sinfo -o "%20P %5D %14F %10m %11l %N"'
alias sq='squeue -o "%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R"'
alias sacct='sacct --format JobID,Partition,Timelimit,Start,Elapsed,NodeList%20,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocCPUS'
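
After adding these aliases, reload your shell configuration (or log out and back in) so they take effect:

source ~/.bashrc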