ThetaGPU Machine Overview
ThetaGPU is an extension of Theta comprising 24 NVIDIA DGX A100 nodes. Each DGX A100 node contains eight NVIDIA A100 Tensor Core GPUs and two AMD Rome CPUs. Twenty-two nodes provide 320 GB of GPU memory each and two nodes provide 640 GB each (8,320 GB in aggregate) for training artificial intelligence (AI) datasets, while also enabling GPU-specific and GPU-enhanced high-performance computing (HPC) applications for modeling and simulation.
The DGX A100’s integration into Theta is achieved via the ALCF’s Cobalt HPC scheduler and shared access to a 10-petabyte Lustre filesystem. Existing ALCF user accounts ensure a smooth onboarding process for the expanded system.
A 15-terabyte solid-state drive offers up to 25 gigabytes per second in bandwidth. The dedicated compute fabric comprises 20 Mellanox QM9700 HDR200 40-port switches wired in a fat-tree topology. Note that ThetaGPU cannot utilize the Aries interconnect.
Table 1 summarizes the capabilities of a ThetaGPU compute node.
| Component | Per Node | Aggregate (24 nodes) |
| --- | --- | --- |
| AMD Rome 64-core CPU | 2 | 48 |
| DDR4 Memory | 1 TB on 320 GB nodes & 2 TB on 640 GB nodes | 26 TB |
| NVIDIA A100 GPU | 8 | 192 |
| GPU Memory | 22 nodes w/ 320 GB & 2 nodes w/ 640 GB | 8,320 GB |
| HDR200 Compute Ports | 8 | 192 |
| HDR200 Storage Ports | 2 | 48 |
| 3.84 TB Gen4 NVMe drives | 4 | 96 |
ThetaGPU Login Nodes
The Theta login nodes (see above) are the intended method of accessing ThetaGPU. Initially, Cobalt jobs cannot be submitted from the Theta login nodes to run on the GPU nodes; until that is supported, users must log in to one of the ThetaGPU service nodes (thetagpusn1 or thetagpusn2) from a Theta login node and submit Cobalt jobs to the GPU nodes from there.
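The two-hop workflow described above can be sketched as follows. This is a minimal illustration, not a confirmed recipe: the queue name, walltime, project allocation, and script name are assumptions for the example.

```shell
# From a Theta login node, hop to one of the ThetaGPU service nodes
# (thetagpusn1 or thetagpusn2).
ssh thetagpusn1

# From the service node, submit a Cobalt job to the GPU nodes.
# -n: number of nodes, -t: walltime in minutes, -A: project allocation.
# The queue name "full-node", project "MyProject", and script name
# are illustrative assumptions.
qsub -n 1 -t 60 -q full-node -A MyProject ./train.sh
```

Once direct submission from the Theta login nodes is supported, the `ssh` hop to the service node should no longer be necessary.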