University of Vermont

Vermont Advanced Computing Core

VACC User Guide

Logging In | Submitting Jobs to the Cluster | MPI Jobs | Common Commands for Monitoring Your Jobs | Caveats

Advanced VACC User Documentation
(UVM NetID login required)
Instructions on Compiling MATLAB code
(UVM NetID login required)

Logging In

There are two login machines accessible to VACC users: bluemoon-user1.uvm.edu and bluemoon-user2.uvm.edu. Use ssh to connect to either one.
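
For example, from a terminal on your own machine (assuming your cluster username is your UVM NetID; substitute your actual NetID for netid):

ssh netid@bluemoon-user1.uvm.edu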

Before using bluemoon, it is highly recommended that you change your shell to bash. To do this, log in to zoo, run shupdate to get new versions of the shell profiles for both bash and tcsh, and then run zoochsh to change your shell to bash. Note that it takes a sync from DCE-LDAP before bluemoon sees that your shell has changed; currently that happens overnight, between 2 and 5 AM.
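
A minimal sketch of that sequence, again assuming your NetID is your username (zoochsh may prompt you interactively, and the exact prompts can differ):

ssh netid@zoo.uvm.edu    # log in to zoo, not bluemoon
shupdate                 # fetch new versions of the bash and tcsh profiles
zoochsh                  # change your login shell to bash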

Submitting Jobs to the Cluster

We are currently using PBS (Portable Batch System) for scheduling jobs.

Single Processor Jobs

These are pretty easy, and if you have lots of them, PBS will farm them out to the different processors that are available, so you can queue up many jobs and they will be started as soon as processors become available. However, keep in mind that there is some overhead in scheduling jobs, so if you have many jobs (hundreds or thousands) that each take less than 10 seconds or so to run, you should think about aggregating them.

For each job you want to run, create a job script. In this case we'll call it myjob.script:

# This job needs 1 compute node with 1 processor per node.
#PBS -l nodes=1:ppn=1
# It should be allowed to run for up to 1 hour.
#PBS -l walltime=01:00:00
# Name of job.
#PBS -N myjob
# Join STDERR to STDOUT. (Omit this if you want separate STDOUT and STDERR.)
#PBS -j oe
# Send me mail on job start, job end, and if the job aborts.
#PBS -M kapoodle@uvm.edu
#PBS -m bea

cd $HOME/myjob
echo "This is myjob running on" `hostname`
myprogram -foo 1 -bar 2 -baz 3

Submit the job by running: qsub myjob.script
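
If you have many short tasks, one way to aggregate them (as suggested above) is to loop over them inside a single job script. This is only a sketch: the inputs/*.dat files, the mytasks job name, and the way myprogram takes its arguments are all made up for illustration.

# This single job works through many short tasks in sequence,
# avoiding the scheduling overhead of submitting each one separately.
#PBS -l nodes=1:ppn=1
#PBS -l walltime=04:00:00
#PBS -N mytasks
#PBS -j oe

cd $HOME/myjob
for infile in inputs/*.dat
do
    myprogram -foo "$infile"
done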

MPI Jobs

For jobs that take advantage of multiple processors, PBS will communicate to your program which nodes it has been assigned. You just need to make use of the right environment variables (see the note after the example below).

Example Scriptfile:

# This job needs 8 nodes, 16 processors total.
#PBS -l nodes=8:ppn=2
# It needs to run for 4 hours.
#PBS -l walltime=04:00:00
#PBS -N myjob
#PBS -j oe
#PBS -M kapoodle@uvm.edu
#PBS -m bea

echo "This is myjob being started on" `hostname`
cd ~/myjob
mpiexec ./myprogram
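
PBS writes the list of nodes assigned to your job to the file named in the $PBS_NODEFILE environment variable. Depending on how your MPI installation is configured, mpiexec may pick this up automatically; if yours does not, a minimal sketch of passing it explicitly (assuming an mpirun that accepts -machinefile and -np) looks like this:

# Count the processor slots PBS assigned and hand the node list to MPI.
NP=`wc -l < $PBS_NODEFILE`
mpirun -machinefile $PBS_NODEFILE -np $NP ./myprogram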

Common Commands for Monitoring Your Jobs

showq

showq shows the currently queued and running jobs, along with the remaining walltime allocated to each job. showq -u userid shows the status of only your jobs.

[jtl@bluemoon-user1 home]$ showq
ACTIVE JOBS--------------------
Job Name User Name State Proc Remaining Start time
47709 jtl Running 24 4:28:46 Fri Aug 11 08:59:51
47714 jtl Running 24 8:11:12 Fri Aug 11 12:42:17
47715 kapoodle Running 10 1:10:52:40 Fri Aug 11 13:23:45
47694 kapoodle Running 32 1:23:11:19 Thu Aug 10 01:42:24
47713 kapoodle Running 16 6:23:17:39 Fri Aug 11 13:48:44

5 Active Jobs

106 of 114 Processors Active (92.98%)
53 of 54 Nodes Active (98.15%)

IDLE JOBS----------------------

Job Name User Name State Proc WCLIMIT QUEUETIME

0 Idle Jobs
BLOCKED JOBS----------------

Job Name User Name State Proc WCLIMIT QUEUETIME

Total Jobs: 5 Active Jobs: 5 Idle Jobs: 0 Blocked Jobs: 0

qstat

qstat will print out the currently queued, running, and recently exited jobs.

[jtl@bluemoon-user1 ~]$ qstat
Job id Name User Time Use S Queue
37.bluemoon-mgmt hpl-2proc jtl 00:00:02 R exec1
38.bluemoon-mgmt ...procs-realbig jtl 00:00:00 R exec1
39.bluemoon-mgmt ...procs-realbig jtl 00:00:00 R exec1

You can see from the above that there are 3 jobs in the queue, and all three are running (S=R).
qstat -r will give slightly more detail (but only for jobs that are currently running):

bluemoon-mgmt1.cluster:
Job id Username Queue Jobname SessID NDS TSK Req'd Memory Req'd Time S Elap Time
37.bluemoon-mgm jtl exec1 hpl-2proc 3823 1 -- -- -- R 01:54
39.bluemoon-mgm jtl exec1 hpl-32proc -- 16 -- -- -- R 01:44

This shows the number of nodes assigned to each job, and the elapsed runtime (wall clock time, not CPU time). To get more detailed job status, use qstat -f. This will give details such as the nodes your job is currently running on, the environment variables set, the amount of resources used so far, or, if your job has not started yet, the reason why.

qstat -u userid shows the status of only your jobs.
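
For example, to pull up the full record for a single job from the qstat listing above (the numeric ID alone is enough; the exact fields shown may vary with the PBS version):

# Full details for one job; replace 37 with your own job's numeric ID.
qstat -f 37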

Caveats

While we consider research computing central to the University's mission, bluemoon is not considered a business-critical service. Its architecture has been chosen for performance, as most HPC clusters are, rather than for resiliency or reliability. This means that if a component of the cluster fails, we may not respond with the same timeliness that we would if, say, www.uvm.edu were down. If a node or two fails, it will probably be several days before they are repaired, since the cluster as a whole will continue to operate. In the event of multiple failures of critical systems across campus, we will attend to business-critical services before we get to fixing problems with bluemoon.

Also note that we have not invested any money in data backups for the cluster. In the event of a data failure, the cluster configuration would likely be recoverable, but research data is not regularly backed up, as it is on zoo.uvm.edu. At this time, users are responsible for backing up their own data.