When it comes to new supercomputers, a machine’s computational horsepower often gets the most attention. And while the novel computing hardware that gives supercomputers their processing power is indeed an engineering marvel, so too is the infrastructure required to operate the massive, world-class systems.
At the U.S. Department of Energy’s (DOE) Argonne National Laboratory, work has been underway for the past several years to expand and upgrade the Argonne Leadership Computing Facility (ALCF) data center that will house the upcoming Aurora exascale supercomputer.
Preparing for a new supercomputer requires years of planning, coordination, and collaboration. To get ready for Aurora, Argonne’s largest and most powerful supercomputer to date, the lab has completed some substantial facility upgrades, including adding new data center space, mechanical rooms, and equipment that significantly increase the building’s power and cooling capacity.
Built by Intel and Hewlett Packard Enterprise (HPE), Aurora will be theoretically capable of delivering more than two exaflops of computing power, or more than 2 billion billion calculations per second, when it’s deployed for science. The new supercomputer will follow the ALCF’s previous and current systems – Intrepid, Mira, Theta, and Polaris – to deliver on the facility’s mission to provide leading-edge supercomputing resources that enable breakthroughs in science and engineering. Open to researchers from across the world, ALCF supercomputers are used to tackle a wide range of scientific problems including designing more efficient airplanes, investigating the mysteries of the cosmos, modeling the impacts of climate change, and accelerating the discovery of new materials.
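As a back-of-the-envelope check (not part of the original announcement), the exaflop figure can be unpacked with a few lines of arithmetic: one exaflop is 10^18 floating-point operations per second, which is exactly a billion billion.

```python
# Sanity check of the exaflops arithmetic: one exaflop is 10**18
# floating-point operations per second, i.e. a billion billion.
billion = 10**9
one_exaflop = 10**18

# "More than 2 billion billion calculations per second" matches
# the stated figure of more than two exaflops.
aurora_peak = 2 * one_exaflop
assert aurora_peak == 2 * billion * billion

print(f"{aurora_peak:.0e} operations per second")  # prints "2e+18 operations per second"
```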
Aurora Hits the Floor
Over the past year, the physical Aurora system has begun to take shape with the delivery and installation of its computer racks and several components, including a test and development platform named Sunspot, the HPE Slingshot interconnect technology, and the Intel DAOS (Distributed Asynchronous Object Storage) storage system. Occupying the space of two professional basketball courts, Aurora is made up of rows of supercomputer cabinets that stand over 8 feet tall. The cabinets are outfitted with more than 300 miles of networking cables, countless red and blue hoses that pipe water in and out to cool the system, and specialized piping and equipment that bring the water in from beneath the data center floor and the electrical power from the floor above.
The installation continues this fall with the phased delivery of Intel’s state-of-the-art Ponte Vecchio GPUs (graphics processing units) and Sapphire Rapids CPUs (central processing units). The system is slated to be completed next year, when the Sapphire Rapids CPUs will be upgraded to versions equipped with high-bandwidth memory.
Facility Upgrades
While the supercomputer is nearing completion, the work to ready the Argonne site for Aurora has been years in the making. The process of deploying a new supercomputer begins with the major facility upgrades necessary to operate the system, including utility-scale electrical and mechanical work.
Because Aurora is a liquid-cooled system, Argonne had to upgrade its cooling capacity to pump 44,000 gallons of water through a complex loop of pipes that connects to cooling towers, chillers, heat exchangers, a filtration system, and other components. With pipes ranging from 4 inches to 30 inches in diameter, the cooling system ensures the water is at the right temperature, pressure, and purity levels as it passes through the Aurora hardware.
The electrical room, which is located on the second floor above the data center, contains 14 substations that provide 60 megawatts of capacity to power Aurora, future Argonne computing systems, and the building’s day-to-day electricity needs. The room is outfitted with a large ceiling hatch so the substations can be lowered in (and lifted out if needed) by construction cranes.
Once the major facility upgrades were in place, the team moved on to data center enhancements. Focused on the machine room, this work included making sure power is delivered to the right locations at the right voltage, installing heavy-duty floor tiles to support the 600-ton supercomputer, and putting in pipes to link the water loop to Aurora.
While there are always challenges associated with construction work at this scale, many Aurora facility upgrades were carried out during the COVID-19 pandemic, creating some unforeseen issues related to contractor access and supply chain disruptions. Argonne and its partners put protocols in place to ensure they could continue to work safely and mitigate the impacts of COVID as much as possible. With supply chain constraints delaying various parts, the Aurora team has been building the supercomputer piece by piece as components are delivered. Having a majority of the physical system and supporting infrastructure in place has allowed the Argonne-Intel-HPE team to test and fine-tune various components, such as DAOS and the cooling loop, ahead of the supercomputer’s deployment.
Science on Day One
In addition to the construction work, Argonne researchers are contributing to a broad range of activities to prepare Aurora for science on day one. To ensure key software is ready to run on the exascale system, scientists and developers participating in DOE’s Exascale Computing Project (ECP) and the ALCF’s Aurora Early Science Program (ESP) continue work to port and optimize dozens of scientific computing applications. With access to the Aurora software development kit and early hardware, the teams are working to improve the performance and functionality of various codes, frameworks, and libraries using the programming models that will be supported on Aurora.
Training events, such as workshops, hackathons, and webinars, have been an important mechanism for providing guidance on application development and disseminating the latest details on hardware and software. In June, for example, the Argonne-Intel Center for Excellence hosted a multi-day workshop for ECP and ESP teams to provide updates on the system, share approaches and best practices for performance portability, and facilitate hands-on sessions with exascale software tools.