Understanding the Batch Job System

What is a batch job system?

Submitting a job to an HPC cluster is done using a batch system. A batch system allows users to submit jobs requesting the resources (nodes, processors, memory, GPUs) that they need. The jobs are queued and then run as resources become available. Scheduling policies in place on the system attempt to balance the desire for short queue waits against the need for efficient system utilization.

Interactive vs. Batch Computing

When you type commands in an interactive command line shell and see a response displayed, you are working interactively. If you haven’t used a cluster before, you may be more accustomed to this way of working.

To run a batch job, you put the commands into a text file instead of typing them at the prompt. You submit this file to the batch system, which will run it as soon as resources become available. The output you would normally see on your display goes into a log file. You can check the status of your job interactively and/or receive emails when it begins and ends execution.

Appropriate use of the login nodes

You connect to the login nodes of the cluster, and from there you submit your job(s). The login nodes are also useful for editing job scripts, source code, or scripts that will later be used with batch jobs. The login nodes may also be used to run short test programs.

In general, cluster users are expected to submit most computations to the job scheduler to be run on the dedicated compute nodes. The login nodes are meant for tasks like editing source and command files and running short test programs that need little memory, little time, and only one or two CPUs. As rough guidelines, a test should run for less than five minutes, use less than 5 GB of memory, and use no more than two CPUs. If you need more resources than that, please use a regular job, an interactive job, or the Open OnDemand interface.

If you’d like help adapting interactive software to the cluster in any of these ways, don’t hesitate to reach out to our support team. We’re happy to help!

VACC Batch Systems

The batch system used for the Bluemoon and DeepGreen clusters is Slurm, a workload manager that performs both resource management and scheduling.

Batch Processing Overview

Here are the basic actions you will take when you want to run a job on the cluster.

1. Log in to the cluster

If you aren’t already set up to do this, see the article: Connect to the Cluster.

2. Write a job script

Load the Software You Need

Depending on your setup, you may need to load the software your job requires each time you log in to the cluster. For more information, see Loading Software.

Your job script is a text file that includes Slurm directives, as well as the commands you want executed. The directives tell the batch system what resources you need, among other things. You can prepare your job script using any text editor. For more information, see Bluemoon — Writing / Submitting a Job and DeepGreen — Writing / Submitting a Job.
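As a minimal sketch of what such a job script might look like: the job name, resource values, log file pattern, and email address below are illustrative assumptions, not VACC recommendations, so check the cluster-specific articles for the partitions and limits that apply to you.

```shell
#!/bin/bash
# Lines starting with #SBATCH are Slurm directives; the shell itself
# treats them as ordinary comments.

# Hypothetical job name.
#SBATCH --job-name=example_job
# One node, one task, one CPU core for that task.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Memory per node and wall-clock limit (HH:MM:SS); illustrative values.
#SBATCH --mem=2G
#SBATCH --time=00:10:00
# Log file named <job name>_<job ID>.out.
#SBATCH --output=%x_%j.out
# Email when the job begins and ends; the address is hypothetical.
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=netid@example.edu

# Slurm sets SLURM_JOB_ID inside a job; fall back to a placeholder so the
# script also runs when tested directly from a shell.
job_id="${SLURM_JOB_ID:-not-in-slurm}"

echo "Job ${job_id} running on $(hostname)"
echo "Started at $(date)"

# Replace the line below with the commands you would normally type
# at the prompt.
echo "Hello from a batch job"
```

Because the directives are comments to the shell, a script like this can also be run directly with `bash` as a quick syntax check before it is submitted.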

3. Submit your job

You submit your job to the batch system using the sbatch command, with the name of the script file as the argument.

For more information, see Bluemoon: Job Submission and DeepGreen: Job Submission.
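A submission might look like the following sketch. The script name `my_job.sh` is a hypothetical placeholder, and the check for `sbatch` is there only so the snippet fails gracefully when run somewhere other than a cluster login node.

```shell
#!/bin/bash
# Hypothetical script name; use the path to your own job script.
job_script="my_job.sh"

if command -v sbatch >/dev/null 2>&1; then
    # On success, sbatch prints "Submitted batch job <job ID>".
    sbatch "$job_script"
else
    echo "sbatch not found: run this from a cluster login node"
fi
```

The job ID that `sbatch` reports is worth noting, since the monitoring commands and the default log file name both refer to it.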

4. Check the job’s status

Your job may remain in the queue for minutes or days before it runs, depending on system load and the resources requested. It may then run for minutes or days, depending on the workload of the job. You can monitor your job’s progress directly in Slurm or configure your job to send you an email when it has finished.

For more information, see Monitoring / Managing a Job.
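As a rough sketch of checking on jobs from the command line, the snippet below wraps Slurm's `squeue` command in a small helper. The function name `show_my_jobs` and the job ID `123456` are placeholders chosen for illustration.

```shell
#!/bin/bash
# Fall back to `id -un` in case USER is not set in this shell.
me="${USER:-$(id -un)}"

show_my_jobs() {
    if command -v squeue >/dev/null 2>&1; then
        # List this user's pending and running jobs.
        squeue -u "$me"
    else
        echo "squeue not found: run this from a cluster login node"
    fi
}

show_my_jobs

# With a job ID reported by sbatch (123456 is a placeholder):
#   scontrol show job 123456   # detailed information about one job
#   scancel 123456             # cancel the job
```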

5. Retrieve your output

Slurm writes the text your job would normally print to the terminal into a log file, which by default is placed in the directory you submitted the job from. Any other new or changed files depend on the specifics of the job you’re running. Every node on the cluster has access to our GPFS file servers, so any files located there that were created or updated by jobs are easily accessible from the user nodes. See Transfer Files To/From the Cluster for information on accessing results from outside of the cluster.
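When a job script does not set its own output file name, Slurm names the log `slurm-<job ID>.out`. A minimal sketch for viewing the end of that file, assuming a placeholder job ID of `123456` and that you are in the directory the job was submitted from:

```shell
#!/bin/bash
# Placeholder job ID; use the number sbatch reported at submission.
job_id=123456

# Slurm's default log name when the script sets no --output directive.
log_file="slurm-${job_id}.out"

if [ -f "$log_file" ]; then
    # Show the last 20 lines of the job's terminal output.
    tail -n 20 "$log_file"
else
    echo "No file named ${log_file} in $(pwd)"
fi
```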

Updated on March 21, 2024
