Alps (supercomputer)
Active | operational 2024 |
---|---|
Sponsors | Swiss Confederation |
Operators | Swiss National Supercomputing Centre (CSCS) |
Location | Lugano-Cornadero, Switzerland |
Architecture | HPE Cray EX254n: Nvidia GH200 Grace Hopper superchips, each combining a 72-core Grace CPU (Arm Neoverse V2, Armv9) with a Hopper H100 Tensor Core GPU (1'305'600 cores total) |
Power | 10 MW under full load |
Operating system | Linux |
Memory | 144 terabytes (TB) |
Speed | 270 PFLOPS (Rmax) |
Ranking | TOP500: 6, June 2024 |
Website | cscs.ch |
Sources | "Nvidia GH200 Grace Hopper Superchip" |
The Alps supercomputer is a high-performance computer funded by the Swiss Confederation through the ETH Domain, with its main location in Lugano. It is part of the Swiss National Supercomputing Centre (CSCS), which provides computing services for selected scientific customers.[1]
The Swiss National Supercomputing Centre (CSCS) was founded in 1991 and operates a user lab for computing services. Past examples include the analysis of data from the Large Hadron Collider (LHC) at CERN, data storage for the SwissFEL X-ray laser at the Paul Scherrer Institute, and simulations for MeteoSwiss weather forecasts.[2] Over time, these services have been provided by increasingly powerful computing systems. The name Alps has been used for the new computers since 2020, when the HPE Cray EX high-performance computer was commissioned.
The latest supercomputer, the Alps HPE Cray EX254n, was inaugurated on September 14, 2024. Even beforehand, the planned performance of Alps was described as sufficient to train OpenAI's large language model GPT-3 in two days.[3] The supercomputer is based on Nvidia's GH200 Grace Hopper chips[4][5] and achieves a performance of 270 petaflops, that is, 270 quadrillion floating-point operations per second. In June 2024, it ranked 6th on the TOP500 list of the world's fastest computers; the in-house computers of Meta, Microsoft, Alphabet Inc./Google LLC, and Oracle are likely more powerful, but their performance is not publicly known.
A panel of experts from various natural sciences decides who may use the new computer. One project already approved is a research collaboration between EPFL and the Yale Institute for Global Health, which took an open-source AI model from Meta and trained it on Alps with health data from medical research. Alps gives scientists in Switzerland an infrastructure for exploiting the possibilities of artificial intelligence (AI), and it is used by ETH Zurich and EPFL as part of the Swiss AI Initiative.
Structure
To house and operate modern supercomputers, a new data center building and an adjacent office building were constructed in Lugano-Cornadero. The data center building has three floors. The lowest floor houses the basic infrastructure: primary power and water distribution as well as a battery-backed emergency power supply. The middle floor handles secondary distribution through power distribution units, which allow flexible installation of the computers above. The computers themselves are located on the top floor.[7] In summer, the computers and the buildings are cooled with water from Lake Lugano. From a depth of 45 meters, 460 liters of cold lake water per second are supplied to the data center through pipes 2.8 km long. There, a heat exchanger transfers heat from the computer's internal cooling circuit to the lake water.[6]
The latest Alps supercomputer, a highly parallel system, was delivered by Hewlett Packard Enterprise (HPE), which acquired the supercomputer specialist Cray as a subsidiary in 2019. It is installed on an area of 2000 m². The total cost was about 100 million CHF.
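As a rough plausibility check on the lake-water cooling figures above, the following sketch estimates how much the lake water would warm if the full 460 liters per second absorbed the machine's entire full-load power of 10 MW as heat. This is an illustration under simplifying assumptions, not a description of the actual CSCS cooling design: the density and specific heat of water are textbook values not taken from the article, and building cooling, pump losses, and partial load are ignored.

```python
# Figures quoted in the article.
FLOW_LITRES_PER_S = 460.0   # lake-water supply rate
FULL_LOAD_POWER_W = 10e6    # 10 MW at full load

# Assumed textbook constants for fresh water (not from the article).
DENSITY_KG_PER_LITRE = 1.0
SPECIFIC_HEAT_J_PER_KG_K = 4186.0


def lake_water_temperature_rise(power_w, flow_l_per_s):
    """Temperature rise in kelvin if all of power_w is absorbed by the flow."""
    mass_flow_kg_per_s = flow_l_per_s * DENSITY_KG_PER_LITRE
    return power_w / (mass_flow_kg_per_s * SPECIFIC_HEAT_J_PER_KG_K)


delta_t = lake_water_temperature_rise(FULL_LOAD_POWER_W, FLOW_LITRES_PER_S)
print(f"Estimated temperature rise: {delta_t:.1f} K")
```

Under these assumptions the water warms by only a few kelvin, which suggests the quoted flow rate is of the right order of magnitude for a load of this size.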
Electronics
To achieve high performance, each central processor (CPU) is combined with a graphics processor (GPU) and their associated memories (128 GB LPDDR5X RAM; 96 GB HBM3)[8] in close proximity within a single module supplied by Nvidia. The CPU, called Grace, consists of 72 Arm Neoverse V2 (Armv9) cores, which are RISC processor cores. The GPU, called Hopper H100 Tensor Core, contains 132 streaming multiprocessors.[9] The combination of one 72-core Grace CPU and one Hopper GPU is called the GH200 Grace Hopper superchip, named in honor of the computer scientist Grace Hopper. A total of 1'305'600 processor cores (CPU and GPU) are available on the Alps system. Data is exchanged between the 2'688 nodes over an Ethernet-type network called Slingshot-11 at a rate of 200 Gbit/s.[10][8] A single node is composed of four GH200 superchips in a Quad GH200 configuration. Every Quad GH200 node acts as a single NUMA system with 288 CPU cores and 4 GPUs. The Grace CPUs communicate through a cache-coherent interconnect, while the Hopper GPUs communicate through NVLink.[11]
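The per-node figures above follow directly from the per-superchip figures, as this short sketch shows. It uses only numbers quoted in the text; the function and constant names are hypothetical. It counts Grace CPU cores only and does not attempt to reproduce the 1'305'600-core total, which also includes GPU units.

```python
# Figures quoted in the article.
NODES = 2688                   # Quad GH200 nodes
SUPERCHIPS_PER_NODE = 4        # four GH200 per node
CPU_CORES_PER_SUPERCHIP = 72   # Grace CPU cores per GH200
LPDDR5X_GB = 128               # CPU-attached memory per GH200
HBM3_GB = 96                   # GPU-attached memory per GH200


def node_totals():
    """Return CPU cores, GPUs and memory (GB) for one Quad GH200 node."""
    return {
        "cpu_cores": SUPERCHIPS_PER_NODE * CPU_CORES_PER_SUPERCHIP,
        "gpus": SUPERCHIPS_PER_NODE,  # one Hopper GPU per GH200
        "memory_gb": SUPERCHIPS_PER_NODE * (LPDDR5X_GB + HBM3_GB),
    }


def system_totals():
    """Return superchip and Grace CPU core counts across all nodes."""
    per_node = node_totals()
    return {
        "superchips": NODES * SUPERCHIPS_PER_NODE,
        "cpu_cores": NODES * per_node["cpu_cores"],
    }


print(node_totals())    # per node: 288 CPU cores, 4 GPUs, 896 GB
print(system_totals())  # system: 10752 superchips, 774144 CPU cores
```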
Operation
A team at CSCS develops specialized software for different applications. At full load, the computer draws 10 MW of power. Electricity costs are estimated at around 15 million CHF per year.
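The two figures above can be related with a short calculation. The sketch below converts the 10 MW full-load power into an annual energy figure and derives the electricity price that the estimated 15 million CHF per year would imply. It assumes, purely for illustration, that the machine runs at full load all year; the `utilisation` parameter and the function names are hypothetical, and the resulting price is a derived value, not a figure from the sources.

```python
# Figures quoted in the article.
FULL_LOAD_POWER_MW = 10.0
ANNUAL_COST_CHF = 15e6

HOURS_PER_YEAR = 24 * 365  # non-leap year


def annual_energy_mwh(power_mw, utilisation=1.0):
    """Energy used in one year, in MWh, at the given average utilisation."""
    return power_mw * HOURS_PER_YEAR * utilisation


def implied_price_chf_per_kwh(cost_chf, energy_mwh):
    """Electricity price implied by an annual cost and annual energy use."""
    return cost_chf / (energy_mwh * 1000)  # 1 MWh = 1000 kWh


energy = annual_energy_mwh(FULL_LOAD_POWER_MW)
price = implied_price_chf_per_kwh(ANNUAL_COST_CHF, energy)
print(f"Annual energy at full load: {energy:.0f} MWh")
print(f"Implied price: {price:.3f} CHF/kWh")
```

A lower average utilisation would reduce the annual energy and so raise the implied price per kilowatt-hour for the same total cost.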
References
- Gioia da Silva: ETH weiht einen der modernsten KI-Supercomputer der Welt ein (in German). Neue Zürcher Zeitung, 14 September 2024. Retrieved 26 September 2024.
- About CSCS. cscs.ch. Retrieved 26 September 2024.
- 'Alps' system to advance research across climate, physics, life sciences with 7x more powerful AI capabilities than current world-leading system for AI on MLPerf. nvidia.com, 12 April 2021. Retrieved 26 September 2024.
- Benedikt Schwan: Nvidia: Die KI aus dem Monstercomputer (in German). Zeit Online, 1 June 2023. Retrieved 26 September 2024.
- Neue Forschungsinfrastruktur: 'Alps' Supercomputer eingeweiht (in German). ETH Zürich, 14 September 2024. Retrieved 26 September 2024.
- Lake water to cool supercomputers. cscs.ch, 2015. Retrieved 26 September 2024.
- Innovative new building for CSCS in Lugano. cscs.ch, 2015. Retrieved 26 September 2024.
- Alps: System Specification. cscs.ch. Retrieved 1 October 2024.
- Datasheet: NVIDIA GH200 Grace Hopper Superchip. nvidia.com. Retrieved 30 September 2024.
- TOP500: Alps. top500.org. Retrieved 30 September 2024.
- Luigi Fusco et al.: Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip. arXiv:2408.11556.