VACC User Guide
Logging In | Submitting Jobs to the Cluster | MPI Jobs | Common Commands for Monitoring Your Jobs | Caveats
Advanced VACC User Documentation
(UVM NetID login required)
Instructions on Compiling MATLAB code
(UVM NetID login required)
Logging In
There are two machines accessible to VACC users: bluemoon-user1.uvm.edu and bluemoon-user2.uvm.edu. Use ssh to get access to the cluster.
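As a small convenience, the login step can be wrapped in a shell function on your own machine. This is only a sketch: the function name vacc_ssh and its arguments are illustrative and are not part of the cluster's tooling. The host names are the two login nodes listed above.

```shell
# vacc_ssh NETID [1|2]: open an ssh session to one of the two login nodes.
# The function name is hypothetical; only the host names come from this guide.
vacc_ssh() {
    ssh "${1}@bluemoon-user${2:-1}.uvm.edu"
}

# Example (replace kapoodle with your own NetID):
#   vacc_ssh kapoodle 2
```

With no second argument the function defaults to bluemoon-user1.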
We highly recommend changing your shell to bash before using bluemoon. To do this, log in to zoo, run shupdate to get new versions of the shell profiles for both bash and tcsh, and then run zoochsh to change your shell to bash. Note that bluemoon does not see the change until the next sync from DCE-LDAP, which currently happens overnight between 2 and 5 AM.
Submitting Jobs to the Cluster
We are currently using PBS (Portable Batch System) to schedule jobs.
Single Processor Jobs
These are straightforward. If you have lots of them, PBS will farm them out to the available processors, so you can queue up many jobs and each will start as soon as a processor becomes available. However, scheduling a job carries some overhead, so if you have many jobs (hundreds or thousands) that each take less than about 10 seconds to run, you should think about aggregating them into fewer, longer jobs.
For each job you want to run, create a job script. In this case we'll call it myjob.script:
# This job needs 1 compute node with 1 processor per node.
#PBS -l nodes=1:ppn=1
# It should be allowed to run for up to 1 hour.
#PBS -l walltime=01:00:00
# Name of job.
#PBS -N myjob
# Join STDERR to STDOUT. (Omit this if you want separate STDOUT and STDERR.)
#PBS -j oe
# Send me mail on job start, job end, and if job aborts.
#PBS -M kapoodle@uvm.edu
#PBS -m bea
cd $HOME/myjob
echo "This is myjob running on" `hostname`
myprogram -foo 1 -bar 2 -baz 3
Submit the job by running: qsub myjob.script
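The earlier advice about aggregating many very short jobs can be sketched as a single job script that works through a list of commands. This is an illustration under assumptions, not a site-provided tool: the helper run_batch, the job name myjob-batch, and the file tasks.txt (one command per line) are all hypothetical names.

```shell
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -N myjob-batch
#PBS -j oe

# run_batch FILE: run each non-empty line of FILE as a shell command, in order.
# A failing task is reported on STDERR and the loop continues with the next one.
run_batch() {
    while IFS= read -r task || [ -n "$task" ]; do
        [ -n "$task" ] || continue
        # </dev/null keeps a task from swallowing the rest of the task list
        sh -c "$task" </dev/null || echo "task failed: $task" >&2
    done < "$1"
}

# In the real job script you would then call, for example:
#   cd "$HOME/myjob" && run_batch tasks.txt
```

Running the tasks back to back inside one job means the scheduling overhead is paid once rather than once per task. Remember to request enough walltime for the whole list.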
MPI Jobs
For jobs that take advantage of multiple processors, PBS will help communicate to your program which nodes it should run on. You just need to make use of the right variables.
Example script file:
# this job needs 8 nodes, 16 processors total
#PBS -l nodes=8:ppn=2
# it needs to run for 4 hours
#PBS -l walltime=04:00:00
#PBS -N myjob
#PBS -j oe
#PBS -M kapoodle@uvm.edu
#PBS -m bea
echo "This is myjob being started on" `hostname`
cd ~/myjob
mpiexec ./myprogram
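One of the variables PBS sets inside a running job is PBS_NODEFILE, the path of a file that lists the assigned host names, one line per processor slot. The sketch below shows how a job script might inspect it. The helper names count_procs and list_nodes are hypothetical, and whether your mpiexec needs an explicit process count depends on the MPI installation, so treat the commented usage as an assumption.

```shell
# count_procs FILE: number of processor slots listed in a PBS node file.
count_procs() {
    wc -l < "$1" | tr -d ' '
}

# list_nodes FILE: the distinct host names in a PBS node file.
list_nodes() {
    sort -u "$1"
}

# Inside a job script (assumes your mpiexec accepts -n; many do):
#   NP=$(count_procs "$PBS_NODEFILE")
#   echo "Running on $NP processors across:"; list_nodes "$PBS_NODEFILE"
#   mpiexec -n "$NP" ./myprogram
```

For the nodes=8:ppn=2 request above, the node file would be expected to contain 16 lines covering 8 distinct hosts.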
Common Commands for Monitoring Your Jobs
showq
showq shows the currently queued and running jobs, along with the remaining walltime allocated to each job. showq -u userid shows the status of only your jobs.
ACTIVE JOBS--------------------
| Job Name | User Name | State | Proc | Remaining | Start time |
| 47709 | jtl | Running | 24 | 4:28:46 | Fri Aug 11 08:59:51 |
| 47714 | jtl | Running | 24 | 8:11:12 | Fri Aug 11 12:42:17 |
| 47715 | kapoodle | Running | 10 | 1:10:52:40 | Fri Aug 11 13:23:45 |
| 47694 | kapoodle | Running | 32 | 1:23:11:19 | Thu Aug 10 01:42:24 |
| 47713 | kapoodle | Running | 16 | 6:23:17:39 | Fri Aug 11 13:48:44 |
5 Active Jobs
| 106 of 114 | Processors Active (92.98%) |
| 53 of 54 | Nodes Active (98.15%) |
IDLE JOBS----------------------
| Job Name | User Name | State | Proc | WCLIMIT | QUEUETIME |
0 Idle Jobs
BLOCKED JOBS----------------
| Job Name | User Name | State | Proc | WCLIMIT | QUEUETIME |
Total Jobs: 5 Active Jobs: 5 Idle Jobs: 0 Blocked Jobs: 0
qstat
qstat will print out the currently queued, running, and recently exited jobs.
| Job id | Name | User | Time Use | S | Queue |
| 37.bluemoon-mgmt | hpl-2proc | jtl | 00:00:02 | R | exec1 |
| 38.bluemoon-mgmt | ...procs-realbig | jtl | 00:00:00 | R | exec1 |
| 39.bluemoon-mgmt | ...procs-realbig | jtl | 00:00:00 | R | exec1 |
You can see from the above that there are 3 jobs in the queue, and all three are running (S=R).
qstat -r will give slightly more detail (but only for jobs that are currently running):
| Job id | Username | Queue | Jobname | SessID | NDS | TSK | Req'd Memory | Req'd Time | S | Elap Time |
| 37.bluemoon-mgm | jtl | exec1 | hpl-2proc | 3823 | 1 | -- | -- | -- | R | 01:54 |
| 39.bluemoon-mgm | jtl | exec1 | hpl-32proc | -- | 16 | -- | -- | -- | R | 01:44 |
This shows the number of nodes assigned to each job and the elapsed runtime (wall-clock time, not CPU time). To get more detailed job status, use qstat -f. This gives details such as the nodes your job is currently running on, the environment variables set, the amount of resources used so far, or, if your job has not started yet, the reason why.
qstat -u userid shows the status of only your jobs.
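If you want a script to wait for a job to finish, one approach is to poll qstat -f and read its job_state line. The following is a minimal sketch, assuming the usual "job_state = X" line in the qstat -f output. The helper names job_state and wait_for_job are hypothetical, and the 60-second default interval is an arbitrary choice.

```shell
# job_state JOBID: print the one-letter state (for example Q, R or C)
# taken from the "job_state = X" line of qstat -f. Prints nothing if the
# job is unknown to qstat.
job_state() {
    qstat -f "$1" 2>/dev/null | awk -F' = ' '/job_state/ { print $2; exit }'
}

# wait_for_job JOBID [SECONDS]: poll until the job is completed (C) or is
# no longer reported by qstat, sleeping SECONDS (default 60) between polls.
wait_for_job() {
    interval="${2:-60}"
    while :; do
        state=$(job_state "$1")
        case "$state" in
            ""|C) break ;;
        esac
        sleep "$interval"
    done
}

# Example:
#   wait_for_job 37.bluemoon-mgmt 120 && echo "job finished"
```

Keep the polling interval generous, since every qstat call adds load to the scheduler.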
Caveats
While we consider research computing central to the University's mission, bluemoon is not considered a business-critical service. Like most HPC clusters, its architecture was chosen for performance rather than resiliency or reliability. This means that if a component of the cluster fails, we may not respond as quickly as we would if, say, www.uvm.edu were down. If a node or two fails, it will probably be several days before it is repaired, since the cluster is presumed to keep operating in the meantime. In the event of multiple failures of critical systems across campus, we will attend to business-critical services before fixing problems with bluemoon.
Also note that we have not invested any money in data backups for the cluster. In the event of data loss, the cluster configuration would likely be recoverable, but research data is not regularly backed up as it would be on zoo.uvm.edu. At this time, users are responsible for backing up their own data.



