Introduction to SLURM and Job Submission

Now that you are connected to the cluster, it's time to learn about SLURM (Simple Linux Utility for Resource Management), which is the workload manager used on the cluster. SLURM handles job scheduling, resource allocation, and job monitoring.

Login vs. Compute Nodes

In general, cluster users are expected to submit most computations to the job scheduler so that they run on the dedicated compute nodes. The login nodes are meant for lighter tasks such as editing source and command files and running short tests that need little memory or time and only one or two CPUs. As a rough guideline, a test run on a login node should take less than five minutes, use less than 5 GB of memory, and use no more than two CPUs.

Submitting a Job with sbatch

You can submit a job by writing a job script. It's a simple text file that contains both the resource requirements and the commands you want to execute.

Let's create our first job script. (You can use the editor of your choice, e.g. emacs, joe, nano, vim, etc.)

$ nano test_job.sh

We need a shebang line at the beginning of the script to specify that the file is a shell script.

#!/bin/sh

Slurm lets you specify options directly in a batch script through lines called Slurm “directives.” These directives provide job setup information used by Slurm, including resource requests, email options, and more. The directives are then followed by the commands that do the computational work of your job.

Slurm directives must precede the executable section in your script.

# Run on the general partition 
#SBATCH --partition=general

# Request one node
#SBATCH --nodes=1

# Request one task
#SBATCH --ntasks=1

# Request 4GB of RAM
#SBATCH --mem=4G

# Run for a maximum of 5 minutes (the format here is MM:SS; HH:MM:SS and D-HH:MM:SS are also accepted)
#SBATCH --time=5:00

# Name of the job
#SBATCH --job-name=testjob

# Name the output file (%x expands to the job name, %j to the job ID)
#SBATCH --output=%x_%j.out

# Set email address for notifications 
#SBATCH --mail-user=netid@uvm.edu

# Request email to be sent at both begin and end, and if job fails
#SBATCH --mail-type=ALL 

Below the job script’s directives is the section of code that Slurm will execute. This section is equivalent to running a Bash script in the command line – it’ll go through and sequentially run each command that you include. When there are no more commands to run, the job will stop.

For example, these commands change to jshmoe's home directory and execute a Python program:

# go to jshmoe's home directory
cd /gpfs1/home/j/s/jshmoe
# in that directory, run test.py
python test.py
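
Putting the pieces together, the complete test_job.sh looks like this (comments omitted for brevity; the email address and home directory are placeholders to replace with your own):

#!/bin/sh
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=5:00
#SBATCH --job-name=testjob
#SBATCH --output=%x_%j.out
#SBATCH --mail-user=netid@uvm.edu
#SBATCH --mail-type=ALL

cd /gpfs1/home/j/s/jshmoe
python test.py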

When you are done editing your file, save and exit.

To submit the job, we use the sbatch command.

$ sbatch test_job.sh
Submitted batch job 123456
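
Any of the #SBATCH directives can also be given as a command-line option to sbatch, and command-line options override the values set in the script. For example, to reuse the same script with a longer time limit:

$ sbatch --time=10:00 test_job.sh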

Your job will be submitted and will run once the requested resources are available.

Jobs that request fewer resources will generally start sooner.

Although jobs submitted before yours are further ahead in the queue, the Slurm scheduler backfills: it looks for smaller jobs that can fit in the gaps between larger jobs, as long as running them does not delay those larger jobs. This means that being conservative in your resource requests will help your jobs start sooner.
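
To check on a job while it is waiting in the queue or running, you can use the standard Slurm commands squeue and scancel (the job ID below is the one reported by sbatch):

$ squeue -u $USER    # list your queued and running jobs
$ scancel 123456     # cancel a job by its job ID if needed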


Running an Interactive Job with srun

In addition to batch jobs, you can run interactive jobs on the cluster using SLURM. An interactive job gives you direct access to a compute node, allowing you to run commands interactively as if you were logged into that node. This is useful for tasks like debugging and testing code.

To start an interactive session, use the srun command. Here's an example:

$ srun --partition=general --nodes=1 --ntasks=1 --mem=4G --time=30:00 --pty bash

In this command:

  • --partition=general: Specifies the partition to run the interactive session on.
  • --nodes=1: Requests one compute node.
  • --ntasks=1: Requests one task.
  • --mem=4G: Allocates 4GB of RAM.
  • --time=30:00: Sets a maximum runtime of 30 minutes.
  • --pty bash: Starts a Bash shell interactively on the allocated compute node.

Once this command is executed, you will be dropped into a Bash shell running on a compute node. From here, you can run commands, load modules, or execute scripts as needed.
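
For example, once the shell starts you might confirm where you are, load software, and run a quick test (the module name and script are only illustrative; available modules vary):

$ hostname              # confirm you are on a compute node
$ module load python    # load software with the module system
$ python test.py        # run a script interactively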

To end the interactive session, simply type:

$ exit

Interactive jobs are ideal for real-time experimentation and testing, complementing the batch job process.


Job Constraints

When a job has specific hardware requirements, you can use constraints to select the appropriate nodes. For example, to limit your job to a node with an Intel processor, use --constraint=intel. Here is a table of common constraints:

Constraint     Description
intel          Nodes with Intel processors
amd            Nodes with AMD processors
v100           Nodes with V100 GPUs
a100           Nodes with A100 GPUs
h100           Nodes with H100 GPUs
noib           Nodes without InfiniBand
ib             InfiniBand nodes, all types
ib1            InfiniBand nodes, switch 1
ib2            InfiniBand nodes, switch 2
10g            10 Gig Ethernet nodes
hc             High clock-speed nodes
cascadelake    Nodes with Cascade Lake generation processors
broadwell      Nodes with Broadwell generation processors
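
For example, you can add a constraint to a batch script as a directive. A quoted '&' expression requires every listed feature to be present (the combination shown is only an illustration):

# Request a node with an Intel processor
#SBATCH --constraint=intel

# Require several features at once
#SBATCH --constraint="intel&ib1"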

There are more constraints than are listed here. You can look at them with this command:

$ show_node_constraints

Up Next

Data & Storage