Introduction to SLURM and Job Submission¶
Now that you are connected to the cluster, it's time to learn about SLURM (Simple Linux Utility for Resource Management), which is the workload manager used on the cluster. SLURM handles job scheduling, resource allocation, and job monitoring.
Login vs. Compute Nodes¶
In general, cluster users are expected to submit most computations to the job scheduler to be run on the dedicated compute nodes. The login nodes are meant for tasks like editing source and command files and running short test programs that do not use much memory or time and need only one or two CPUs. As a rough guideline, a test should run for less than five minutes, use less than 5 GB of memory, and use no more than two CPUs.
Submitting a Job with sbatch¶
You can submit a job by writing a job script. It's a simple text file that contains both the resource requirements and the commands you want to execute.
Let's create our first job script (you can use the editor of your choice, e.g., emacs, joe, nano, vim, etc.):
$ nano test_job.sh
We need a shebang line at the beginning of the script to specify that the file is a shell script.
#!/bin/sh
Slurm lets you specify options directly in a batch script through lines called Slurm “directives.” These directives provide job setup information used by Slurm, including resource requests, email options, and more. This information is then followed by the commands to be executed to do the computational work of your job.
Slurm directives must precede the executable section in your script.
# Run on the general partition
#SBATCH --partition=general
# Request one node
#SBATCH --nodes=1
# Request one task
#SBATCH --ntasks=1
# Request 4GB of RAM
#SBATCH --mem=4G
# Run for a maximum of 5 minutes
#SBATCH --time=5:00
# Name of the job
#SBATCH --job-name=testjob
# Name the output file (%x is the job name, %j is the job ID)
#SBATCH --output=%x_%j.out
# Set email address for notifications
#SBATCH --mail-user=netid@uvm.edu
# Request email to be sent at both begin and end, and if job fails
#SBATCH --mail-type=ALL
Below the job script’s directives is the section of code that Slurm will execute. This section is equivalent to running a Bash script in the command line – it’ll go through and sequentially run each command that you include. When there are no more commands to run, the job will stop.
For example, these commands change to jshmoe's home directory and execute a Python program.
# go to jshmoe's home directory
cd /gpfs1/home/j/s/jshmoe
# in that directory, run test.py
python test.py
When you are done editing your file, save and exit.
To submit the job, we use the sbatch command.
$ sbatch test_job.sh
Submitted batch job 123456
Your job will be submitted and will run once the requested resources are available.
Jobs that request fewer resources generally start sooner: although jobs submitted before yours are further ahead in the queue, the Slurm scheduler backfills smaller jobs into the gaps between larger ones, as long as doing so does not delay them. Being conservative in your resource requests therefore helps your jobs start sooner.
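While you wait, you can check the state of your job with squeue, which lists queued and running jobs along with their job ID, partition, state, and elapsed time. For example, to show only your own jobs:
$ squeue -u $USER
If you need to cancel a job, pass its job ID to scancel, for example scancel 123456 for the job submitted above.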
Running an Interactive Job with srun¶
In addition to batch jobs, you can run interactive jobs on the cluster using SLURM. An interactive job gives you direct access to a compute node, allowing you to run commands interactively as if you were logged into that node. This is useful for tasks like debugging and testing code.
To start an interactive session, use the srun command. Here's an example:
$ srun --partition=general --nodes=1 --ntasks=1 --mem=4G --time=30:00 --pty bash
In this command:
--partition=general : Specifies the partition to run the interactive session on.
--nodes=1 : Requests one compute node.
--ntasks=1 : Requests one task.
--mem=4G : Allocates 4GB of RAM.
--time=30:00 : Sets a maximum runtime of 30 minutes.
--pty bash : Starts a Bash shell interactively on the allocated compute node.
Once this command is executed, you will be dropped into a Bash shell running on a compute node. From here, you can run commands, load modules, or execute scripts as needed.
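For example, once your prompt appears on the compute node, you might verify where you are and run a short test. The module name below is illustrative; use module avail to see what is actually installed on the cluster.
$ hostname
$ module load python3
$ python test.py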
To end the interactive session, simply type:
$ exit
Interactive jobs are ideal for real-time experimentation and testing, complementing the batch job process.
Job Constraints¶
When a job has specific hardware requirements, you can use constraints to select the appropriate nodes. For example, to limit your job to a node with an Intel processor, use --constraint=intel. Here is a table of common constraints:
| Constraint | Description |
|---|---|
| intel | Nodes with Intel processors |
| amd | Nodes with AMD processors |
| v100 | Nodes with V100 GPUs |
| a100 | Nodes with A100 GPUs |
| h100 | Nodes with H100 GPUs |
| noib | Nodes without InfiniBand |
| ib | InfiniBand nodes, all types |
| ib1 | InfiniBand nodes, switch 1 |
| ib2 | InfiniBand nodes, switch 2 |
| 10g | 10 Gigabit Ethernet nodes |
| hc | High clock-speed nodes |
| cascadelake | Nodes with Cascade Lake generation processors |
| broadwell | Nodes with Broadwell generation processors |
There are more constraints than are listed here. You can look at them with this command:
$ show_node_constraints
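As a quick sketch, a constraint can be added to your job script as another directive, for example to restrict the test job from earlier to nodes with Intel processors:
# Only run on nodes with Intel processors
#SBATCH --constraint=intel
The same option can also be passed to sbatch on the command line:
$ sbatch --constraint=intel test_job.sh
Multiple constraints can be combined with &, e.g. --constraint="intel&ib" requests Intel nodes attached to InfiniBand.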