HPC cluster basics
SLURM: Simple Linux Utility for Resource Management
A job script is a simple text file that lists all the requirements for running your job:

- Memory requirement
- Desired number of processors
- Length of time you want to run the job
- Type of queue you want to use (optional)
- Where to write output and error files
- Name for your job while running on HPC
Job Script Basics
A typical job script will look like this:
```bash
#!/bin/bash
#SBATCH --nodes=1                     # number of nodes
#SBATCH --cpus-per-task=8             # number of CPUs per task
#SBATCH --time=02:00:00               # walltime limit (HH:MM:SS)
#SBATCH --mem=128G                    # memory requested
#SBATCH --mail-user=netid@gmail.com   # email address for notifications
#SBATCH --mail-type=begin             # email when the job starts
#SBATCH --mail-type=end               # email when the job ends
#SBATCH --error=JobName.%J.err        # STDERR file (%J = job ID)
#SBATCH --output=JobName.%J.out       # STDOUT file (%J = job ID)

cd $SLURM_SUBMIT_DIR                  # run from the directory the job was submitted from
module load modulename                # load the software module you need
your_commands_go_here
```
Lines starting with `#SBATCH` are directives for the SLURM resource manager; they request the resources your job needs on the HPC. Some important options are as follows:
Option | Examples | Description |
---|---|---|
`--nodes` | `#SBATCH --nodes=1` | Number of nodes |
`--cpus-per-task` | `#SBATCH --cpus-per-task=8` | Number of CPUs per node |
`--time` | `#SBATCH --time=02:00:00` | Total time requested for your job |
`--output` | `#SBATCH --output=JobName.%J.out` | STDOUT to a file |
`--error` | `#SBATCH --error=JobName.%J.err` | STDERR to a file |
`--mail-user` | `#SBATCH --mail-user=netid@gmail.com` | Email address to send notifications |
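Most of these options also have single-letter short forms that are handy on the command line. Here is a minimal sketch of the same directives written with the standard sbatch short flags:

```bash
#!/bin/bash
#SBATCH -N 1               # short form of --nodes
#SBATCH -c 8               # short form of --cpus-per-task
#SBATCH -t 02:00:00        # short form of --time
#SBATCH -o JobName.%J.out  # short form of --output
#SBATCH -e JobName.%J.err  # short form of --error
```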
Job Management Commands
Job Status | Commands |
---|---|
list all queues | `sinfo -a` |
list all jobs | `squeue` |
list jobs for userid | `squeue -u userid` |
list running jobs | `squeue -t R` |
Let’s go ahead and give these job management commands a try.
```bash
sinfo -a
squeue
squeue -t R
# pick a username you saw in the squeue output and list all of that person's jobs
squeue -u first.lastname
```
Those commands can produce a lot of information. I have created some useful aliases that change the output to something more informative.
```bash
alias sq='squeue -o "%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R"'
alias si='sinfo -o "%20P %5D %14F %8z %10m %10d %11l %16f %N"'
```
In the `si` output, the node-count column (A/I/O/T) shows allocated/idle/other/total nodes.
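Once defined, the aliases are used just like the commands they wrap:

```bash
sq    # squeue with the more informative format above
si    # sinfo with the more informative format above
```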
You can place those aliases into your ~/.bashrc file and they will automatically load every time you log in.

Exercise: add the two aliases above to your ~/.bashrc file.
```bash
nano ~/.bashrc
```
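The aliases take effect at your next login; to use them in your current shell right away, reload the file:

```bash
source ~/.bashrc
```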
Job Scheduling Commands
Commands | Function | Basic Usage | Example |
---|---|---|---|
`sbatch` | submit a slurm job | sbatch [script] | $ sbatch job.sub |
`scancel` | delete slurm batch job | scancel [job_id] | $ scancel 123456 |
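Putting the pieces together, a typical submit-check-cancel cycle looks like this; `job.sub` and the job ID `123456` are the placeholder examples from the table above:

```bash
sbatch job.sub     # sbatch replies with the assigned job ID, e.g. "Submitted batch job 123456"
squeue -u $USER    # check your job's state in the queue
scancel 123456     # cancel the job by its ID if needed
```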
Interactive Session
To start an interactive session, execute the following:
```bash
# this command will give you 1 node with 1 CPU in the brief-low queue
# for a time of 00 hours : 01 minutes : 00 seconds
salloc -N 1 -n 1 -p brief-low -t 00:01:00
# you can exit the interactive session by typing exit and hitting return
```
Interactive sessions are very helpful when you need more computing power than your laptop or desktop to wrangle the data or to test new software prior to submitting a full batch script.
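As a sketch of that testing workflow, assuming the same placeholder module and command names as the job script above:

```bash
# request a short interactive allocation (1 node, 1 task) on the brief-low queue
salloc -N 1 -n 1 -p brief-low -t 00:10:00

# once the allocation starts, load your software and try it on a small input
module load modulename     # placeholder module name
your_commands_go_here      # placeholder command you want to test

# release the allocation when you are done
exit
```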