Maintenance
Upcoming maintenance on the VACC cluster¶
Schedule¶
- Wednesday May 27th:
    - UPS B shutdown for removal (6 hours: 6:00AM - 12:00PM)
    - VACC again reduced to 50% capacity; back to 100% at noon.
- May 28th-June 9th:
    - VACC at 100% capacity, but B-side circuits not UPS backed. Any power disruption will crash the compute nodes on those circuits.
- Wednesday June 10th:
    - VACC maintenance day: Entire cluster offline for software & firmware updates.
    - Jobs that would continue through the maintenance will not be permitted to start until the maintenance is completed.
- June 11th-12th:
    - VACC continues operation with B-side circuits not UPS backed.
- Saturday June 13th:
    - 6AM - 4PM: UPS B installation (8 hours of load testing).
    - VACC at 50% capacity.
- Sunday June 14th:
    - VACC returns to normal operation, with new equipment and cooling.
Note: This schedule depends on the timely delivery of needed parts and the availability of personnel. Additional delays could affect the above schedule. We will update the schedule as changes occur.
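The rule that jobs which would run into the maintenance day are held can be checked ahead of time with simple date arithmetic. The following is a minimal sketch, not the scheduler's actual logic: the schedule gives the date of the maintenance (June 10th) but not a start time, so the midnight start used here is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# ASSUMPTION: the schedule lists June 10th as the maintenance day but gives
# no start time; midnight is used here only as a placeholder.
MAINTENANCE_START = datetime(2026, 6, 10, 0, 0)


def finishes_before_maintenance(start_time, walltime,
                                maintenance_start=MAINTENANCE_START):
    """Return True if a job starting at start_time with the requested
    walltime would end at or before the start of the maintenance."""
    return start_time + walltime <= maintenance_start


def max_walltime(start_time, maintenance_start=MAINTENANCE_START):
    """Longest walltime that still ends before the maintenance begins
    (zero if the maintenance has already started)."""
    return max(maintenance_start - start_time, timedelta(0))
```

For example, under the assumed start time, a job starting at noon on June 8th with a 30-hour time limit would finish before the maintenance, while the same job with a 48-hour limit would not and would be expected to wait until the maintenance is over. Requesting a shorter time limit is one way to let a job run beforehand.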
Data Center Upgrades¶
As of February 23, 2026:
The VACC has acquired new compute hardware for IceCore:
- 5 (five) HPE Cray XD670 with 100Gb Eth, NDR (400Gb) IB, 1 TB RAM, 8x NVIDIA H200 SXM 700W TDP GPUs with 141 GB HBM each
- 6 (six) HPE DL380a GPU nodes with 8x NVIDIA RTX 6000 Server Edition GPUs with 96 GB VRAM, 100Gb Eth, NDR (400Gb) IB, 1 TB RAM
- 16 (sixteen) HPE DL380a GPU nodes with 100Gb Eth, NDR (400Gb) IB, 1 TB RAM, 4x H200 NVL 600W TDP GPUs
    - 11 of these are already in production, in temporary racks. Three of these comprise GoldenMaple.
- 17 (seventeen) HPE DL365 compute nodes with 2x AMD EPYC 9655 CPUs, NDR200 IB, 100Gb Eth, 1.5 TB RAM
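To put the list above in perspective, the GPU counts can be tallied per model. This is only a small arithmetic sketch using the node counts and GPUs per node stated in the list; the variable names are illustrative.

```python
# (number of nodes, GPUs per node, GPU model), copied from the hardware list
gpu_nodes = [
    (5, 8, "H200 SXM"),
    (6, 8, "RTX 6000 Server Edition"),
    (16, 4, "H200 NVL"),
]

# Total GPUs per model
totals = {}
for nodes, per_node, model in gpu_nodes:
    totals[model] = totals.get(model, 0) + nodes * per_node

for model, count in totals.items():
    print(f"{model}: {count} GPUs")
print(f"All models: {sum(totals.values())} GPUs")
```

This works out to 40 H200 SXM, 48 RTX 6000 Server Edition, and 64 H200 NVL GPUs, or 152 new GPUs in total.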
To support this new hardware for IceCore, we will need new power circuits and additional cooling capacity in the data center. We are currently at the limit of what we can safely cool.
For cooling, UVM has purchased additional cooling distribution (Motivair MCDU-40) and 3 additional Motivair ChilledDoors to cool the new VACC hardware.
Several maintenance windows will require us to temporarily reduce compute capacity in the VACC and, in some cases, schedule downtime. Parts and personnel availability will determine actual dates; we will strive to provide ample notice of outages and keep the number of outages and reduced capacity periods to a minimum.
Circuit upgrades¶
New power circuits need to be added. This will be done in two phases: A-side circuits and B-side circuits.
A-side circuit upgrades:
- One window of 4 hours where the VACC is down.
- Eight hours of 50% capacity in the VACC. Deepgreen (V100 GPUs) will be entirely offline.
There will be a pause of 2 days between the A-side and B-side circuit upgrades.
B-side circuit upgrades:
- Eight hours of 50% capacity in the VACC. Deepgreen (V100 GPUs) will be entirely offline.
Current estimate for the start of A-side circuit upgrades is March 9th.
Secondary cooling upgrades¶
During secondary cooling upgrades, we will need to reduce load on the VACC by 50% so that the data center does not overheat.
Secondary cooling upgrades:
- Five to seven business days of 50% capacity in the VACC.
- At the end of the secondary cooling upgrades, commissioning tests will be performed on the new secondary cooling infrastructure. During the last two days, GPU nodes will be unavailable to users because we need to run them at 100% load to stress the infrastructure.
Current estimate for start of reduced capacity due to shutdown of secondary cooling is March 9th. Ideally, we will overlap the days of reduced capacity.
After secondary cooling is upgraded, Deepgreen will be retired; its GPUs will be replaced with nodes providing newer H200 GPUs.
UPS replacement¶
Our data center's UPSes are 20 years old. It is getting difficult to maintain them. We plan to replace them in the coming months.
Many compute nodes are covered by only a single UPS, so they must be powered down during electrical work.
A-side UPS replacement:
- 4-hour window of the VACC at 50% capacity.
- Four to five business days of the VACC at 100% capacity. However, 50% of nodes will not be UPS backed, so disruptions in utility power could cause node failure and loss of jobs.
- 8-hour window of the VACC at 50% capacity while load tests are performed and the new UPS is connected.
B-side UPS replacement:
- Essentially, a repeat of the A-side replacement.
A-side UPS replacement is estimated to begin in March. B-side UPS replacement is estimated to begin in late April.
We plan to pause data center work for Research Week (April 13-17). Remaining UPS, cooling, and power work will continue after April 20.
Completed maintenance¶
April, 2026¶
- April 1-April 3:
    - VACC continues at 50-100% capacity (A-side circuits/nodes not UPS backed. "Not UPS backed" means an electrical outage or disturbance could cause some compute nodes/jobs to fail.)
    - IceCore (H200) nodes have migrated into new racks. This also includes GoldenMaple (H200) nodes.
    - Substantial completion of secondary cooling loop. Commissioning still to be completed, but cooling is essentially online at this point.
- April 4 (Saturday):
    - UPS A shutdown for installation (8 hours of load testing).
    - VACC at 50% capacity.
- April 5-12:
    - VACC at 50-100% capacity, and beyond 100% as new hardware is brought into production.
    - New scheduler configuration in place (GPU features).
- April 13-17:
    - UVM Research Week: VACC/IceCore available beyond 100%.
March 9-25, 2026¶
- March 9:
    - Secondary cooling loop was shut down. VACC reduced to 50% capacity.
    - DeepGreen was shut down permanently. No more V100 GPUs will be available.
- March 10:
    - VACC cluster was shut down at 5:30AM due to the "A" side electrical panel shutdown.
    - During the day, the cluster was brought back online at 50% capacity.
- March 10-12:
    - VACC continued at 50% capacity, as the secondary cooling loop was offline.
- March 13:
    - VACC cluster resources were unavailable from 5:00AM-12:00PM, due to electrical panel shutdown.
    - After power was restored, VACC came back at 50% capacity.
    - Initial startup and configuration of rear door heat exchangers for IceCore.
- March 14-17:
    - VACC continued at 50% capacity.
    - New secondary cooling installation.
- March 18:
    - UPS A was shut down for removal (6 hours: 6:00AM - 12:00PM)
    - Nodes supported by UPS A were unavailable.
- March 19:
    - VACC continues at 50-100% capacity (A-side circuits/nodes not UPS backed. "Not UPS backed" means an electrical outage or disturbance could cause some compute nodes/jobs to fail.)
    - During this time, IceCore (H200) nodes will migrate into new racks. This also includes GoldenMaple (H200) nodes.
January 7-8, 2026¶
The cluster was down for scheduled maintenance to upgrade the operating system and scheduler. We upgraded:
- RHEL from 9.4 to 9.6, including many bugfixes.
- Slurm from 25.05 to 25.11.
GPFS3 rebuild¶
All files on /gpfs3 were deleted on January 7th so that we could rebuild the file system. A new policy of automatically deleting files that have not been accessed within 60 days was implemented. To emphasize the new policy, /gpfs3 was renamed /gpfs3tmp.
Details about /gpfs3tmp¶
To improve service to VACC users, we rebuilt the /gpfs3 filesystem on Jan 7, 2026. This filesystem was originally intended to be only for temporary files. After the rebuild, it was renamed /gpfs3tmp, and automatic purging of files that are not being accessed was implemented. Directories on it will only be created for each PI group. There are two main changes to be aware of:
- Files untouched for sixty (60) days will be automatically deleted. Since this is scratch (temporary storage), there is no backup. A warning email will be sent at the forty (40) day mark. No notifications will be sent about deletions on day sixty.
- No per-user directories are automatically created. Group members will be able to create subdirectories under their group's PI directory.
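To see which files are approaching the purge threshold, you can scan a directory for files whose last access time is older than a given number of days. The following is a minimal sketch under two assumptions: that the purge is driven by file access time (the policy text says "accessed" and "untouched" but does not name the exact attribute), and that you pass in your own group's directory. The function name `stale_files` is hypothetical and not a VACC-provided tool.

```python
import os
import time


def stale_files(root, max_age_days, now=None):
    """Yield (path, age_in_days) for regular entries under root whose last
    access time is more than max_age_days old.

    ASSUMPTION: staleness is judged by access time (st_atime); the actual
    purge policy may use a different criterion.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path, follow_symlinks=False).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path, (now - atime) / 86400
```

Calling `stale_files(your_group_directory, 40)` lists files already past the forty-day warning mark, which leaves time to copy anything important elsewhere before day sixty.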
Regarding the previously existing gpfs3: a snapshot of the filesystem was taken before it was deleted and rebuilt. However, our backup of the old gpfs3 (which we do not normally perform) will only be held for 60 days (until March 8, 2026).