# Single node "GPU-Peak" benchmarks
This work was done on a pre-production supercomputer with early versions of the Aurora software development kit.
This page aims to give you a high-level overview of key performance numbers for a single Aurora node.
- We provide both One Tile and Full Node numbers.
- The Full Node numbers are the weak-scaling version of the One Tile ones.
- The Full Node numbers were obtained by running one MPI rank per tile (12 ranks per node).
- All benchmarks' source code and launch options are included so you can tweak them as needed.
- These benchmarks are not exhaustive. Please assume we cherry-picked the problem sizes that give the best numbers.
- We will not compare the results to some “theoretical” value. Theoretical values are full of assumptions, and we want to keep this page short.
- We will not compare the results to other hardware. Feel free to do it yourself 🙂
- To improve reproducibility, only the “best” numbers are reported (e.g., we take the minimum time over repetitions). When doing "real" science, please perform a more thorough statistical analysis.
- The code will use a mixture of OpenMP and SYCL in C++ (sorry, Fortran, Python, and Level Zero lovers).
An asterisk (*) means that the data was collected on Sunspot with an older software stack.
## Micro-benchmarks
| | One Tile | Full Node | Scaling |
|---|---|---|---|
| Single Precision Peak Flops | 23 TFlop/s | 267 TFlop/s | 11.8 |
| Double Precision Peak Flops | 17 TFlop/s | 187 TFlop/s | 10.9 |
| Memory Bandwidth (triad) | 1 TB/s | 12 TB/s | 11.9 |
| PCIe Unidirectional Bandwidth (H2D) | 54 GB/s | 329 GB/s | 6.1 |
| PCIe Unidirectional Bandwidth (D2H) | 55 GB/s | 263 GB/s | 4.8 |
| PCIe Bidirectional Bandwidth | 76 GB/s | 357 GB/s | 4.7 |
| Tile2Tile Unidirectional Bandwidth | 196 GB/s | 1 TB/s | 6.0 |
| Tile2Tile Bidirectional Bandwidth | 287 GB/s | 2 TB/s | 5.9 |
| GPU2GPU Unidirectional Bandwidth | 15 GB/s | 95 GB/s | 6.3 |
| GPU2GPU Bidirectional Bandwidth | 23 GB/s | 142 GB/s | 6.2 |
### Benchmark description
- Single/Double Precision Peak Flops: chain of FMAs (sketched after this list).
- Memory Bandwidth (triad): STREAM-style triad, 2 loads and 1 store per element (sketched after this list).
- PCIe Unidirectional Bandwidth (H2D): Host-to-Device data transfer (sketched after this list).
- PCIe Unidirectional Bandwidth (D2H): Device-to-Host data transfer.
- PCIe Bidirectional Bandwidth: concurrent Host-to-Device and Device-to-Host data transfers.
- Tile2Tile Unidirectional Bandwidth: MPI Rank 0 (GPU N, Tile 0) sends a GPU buffer to Rank 1 (GPU N, Tile 1); see the MPI sketch after this list.
- Tile2Tile Bidirectional Bandwidth: MPI Rank 0 (GPU N, Tile 0) sends a GPU buffer to Rank 1 (GPU N, Tile 1). Concurrently, Rank 1 also sends a buffer to Rank 0.
- GPU2GPU Unidirectional Bandwidth: MPI Rank 0 (GPU 0, Tile 0) sends a GPU buffer to Rank 1 (GPU 1, Tile 0).
- GPU2GPU Bidirectional Bandwidth: MPI Rank 0 (GPU 0, Tile 0) sends a GPU buffer to Rank 1 (GPU 1, Tile 0). Concurrently, Rank 1 also sends a buffer to Rank 0.
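Below is a minimal, hypothetical sketch of the FMA-chain peak-flops measurement in SYCL, assuming the oneAPI C++ compiler (`icpx -fsycl`). The buffer size, chain length, and repetition count are illustrative, not the values behind the table; a production peak-flops kernel also interleaves several independent FMA chains to hide instruction latency, which this sketch does not.

```cpp
// Hypothetical peak-FLOPS sketch: each work-item runs a dependent chain of FMAs.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <chrono>
#include <iostream>

int main() {
  constexpr size_t N = 1 << 24;   // number of work-items (illustrative)
  constexpr int ITERS = 4096;     // FMAs chained per work-item (illustrative)

  sycl::queue q{sycl::gpu_selector_v};
  float *x = sycl::malloc_device<float>(N, q);
  q.fill(x, 1.0f, N).wait();

  double best_s = 1e30;
  for (int rep = 0; rep < 5; ++rep) {  // report the best repetition
    auto t0 = std::chrono::steady_clock::now();
    q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
       float a = x[i], b = 0.5f, c = 1.0f;
       for (int k = 0; k < ITERS; ++k)
         a = sycl::fma(a, b, c);       // dependent chain of FMAs
       x[i] = a;                       // keep the result alive
     }).wait();
    auto t1 = std::chrono::steady_clock::now();
    best_s = std::min(best_s, std::chrono::duration<double>(t1 - t0).count());
  }
  std::cout << 2.0 * ITERS * N / best_s / 1e12 << " TFlop/s\n";  // 1 FMA = 2 flops
  sycl::free(x, q);
}
```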
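A similarly hedged sketch of the triad kernel follows. The buffer size is illustrative; it only needs to be much larger than the caches.

```cpp
// Hypothetical triad sketch: a[i] = b[i] + scalar * c[i] (2 loads + 1 store).
#include <sycl/sycl.hpp>
#include <algorithm>
#include <chrono>
#include <iostream>

int main() {
  constexpr size_t N = 1 << 28;   // 2^28 doubles ≈ 2 GiB per buffer (illustrative)
  sycl::queue q{sycl::gpu_selector_v};
  double *a = sycl::malloc_device<double>(N, q);
  double *b = sycl::malloc_device<double>(N, q);
  double *c = sycl::malloc_device<double>(N, q);
  q.fill(b, 1.0, N);
  q.fill(c, 2.0, N);
  q.wait();

  const double scalar = 3.0;
  double best_s = 1e30;
  for (int rep = 0; rep < 10; ++rep) {  // keep the best repetition
    auto t0 = std::chrono::steady_clock::now();
    q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
       a[i] = b[i] + scalar * c[i];
     }).wait();
    auto t1 = std::chrono::steady_clock::now();
    best_s = std::min(best_s, std::chrono::duration<double>(t1 - t0).count());
  }
  // 3 arrays of N doubles move through memory per iteration.
  std::cout << 3.0 * N * sizeof(double) / best_s / 1e12 << " TB/s\n";
  sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```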
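The PCIe numbers come down to timing large copies across the link. Here is a hypothetical H2D sketch assuming host USM for the source buffer; the D2H and bidirectional variants follow the same pattern.

```cpp
// Hypothetical PCIe H2D bandwidth sketch: time a large queue.memcpy.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <chrono>
#include <iostream>

int main() {
  constexpr size_t BYTES = size_t{1} << 30;        // 1 GiB transfer (illustrative)
  sycl::queue q{sycl::gpu_selector_v};
  char *host = sycl::malloc_host<char>(BYTES, q);  // host USM (typically pinned)
  char *dev  = sycl::malloc_device<char>(BYTES, q);
  std::fill(host, host + BYTES, 1);

  double best_s = 1e30;
  for (int rep = 0; rep < 10; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    q.memcpy(dev, host, BYTES).wait();             // H2D copy
    auto t1 = std::chrono::steady_clock::now();
    best_s = std::min(best_s, std::chrono::duration<double>(t1 - t0).count());
  }
  std::cout << BYTES / best_s / 1e9 << " GB/s (H2D)\n";
  // D2H swaps the memcpy arguments; the bidirectional case launches both
  // copies concurrently (e.g. on two queues) before waiting.
  sycl::free(host, q);
  sycl::free(dev, q);
}
```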
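For the Tile2Tile and GPU2GPU rows, here is a hypothetical unidirectional sketch. It assumes a GPU-aware MPI, so device USM pointers can be passed directly to `MPI_Send`/`MPI_Recv`, and that the launcher pins rank 0 and rank 1 to the intended tiles; message size and synchronization are simplified compared to the real benchmark.

```cpp
// Hypothetical ping sketch: rank 0 sends a device-resident buffer to rank 1.
#include <sycl/sycl.hpp>
#include <mpi.h>
#include <algorithm>
#include <chrono>
#include <iostream>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  constexpr size_t BYTES = size_t{1} << 28;   // 256 MiB message (illustrative)
  sycl::queue q{sycl::gpu_selector_v};        // tile selected by the launcher
  char *buf = sycl::malloc_device<char>(BYTES, q);
  q.memset(buf, rank, BYTES).wait();

  double best_s = 1e30;
  for (int rep = 0; rep < 10; ++rep) {
    MPI_Barrier(MPI_COMM_WORLD);
    auto t0 = std::chrono::steady_clock::now();
    if (rank == 0)
      MPI_Send(buf, (int)BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
      MPI_Recv(buf, (int)BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    auto t1 = std::chrono::steady_clock::now();
    best_s = std::min(best_s, std::chrono::duration<double>(t1 - t0).count());
  }
  if (rank == 0)
    std::cout << BYTES / best_s / 1e9 << " GB/s (unidirectional)\n";
  // The bidirectional variant posts MPI_Isend/MPI_Irecv on both ranks so the
  // two transfers are in flight at the same time.
  sycl::free(buf, q);
  MPI_Finalize();
}
```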
## GEMM
| | One Tile | Full Node | Scaling |
|---|---|---|---|
| DGEMM | 15 TFlop/s | 179 TFlop/s | 11.9 |
| SGEMM | 22 TFlop/s | 258 TFlop/s | 11.7 |
| HGEMM | 263 TFlop/s | 2606 TFlop/s | 9.9 |
| BF16GEMM | 273 TFlop/s | 2645 TFlop/s | 9.7 |
| TF32GEMM | 110 TFlop/s | 1311 TFlop/s | 11.9 |
| I8GEMM | 577 TFlop/s | 5394 TFlop/s | 9.4 |
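As a rough illustration of how a GEMM number is collected, here is a hypothetical DGEMM sketch using oneMKL's SYCL BLAS USM interface. The matrix size (`N = 8192`) and repetition count are illustrative, not necessarily the ones behind the table; the other precisions can be timed the same way by swapping the element types.

```cpp
// Hypothetical DGEMM sketch: time C = A * B and report 2*N^3 / time.
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <algorithm>
#include <chrono>
#include <iostream>

int main() {
  constexpr std::int64_t N = 8192;                  // square matrices (illustrative)
  sycl::queue q{sycl::gpu_selector_v};
  double *A = sycl::malloc_device<double>(N * N, q);
  double *B = sycl::malloc_device<double>(N * N, q);
  double *C = sycl::malloc_device<double>(N * N, q);
  q.fill(A, 1.0, N * N);
  q.fill(B, 2.0, N * N);
  q.fill(C, 0.0, N * N);
  q.wait();

  using oneapi::mkl::transpose;
  auto run = [&] {
    oneapi::mkl::blas::column_major::gemm(
        q, transpose::nontrans, transpose::nontrans,
        N, N, N, 1.0, A, N, B, N, 0.0, C, N).wait();
  };
  run();                                            // warm-up

  double best_s = 1e30;
  for (int rep = 0; rep < 5; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    run();
    auto t1 = std::chrono::steady_clock::now();
    best_s = std::min(best_s, std::chrono::duration<double>(t1 - t0).count());
  }
  std::cout << 2.0 * N * N * N / best_s / 1e12 << " TFlop/s (DGEMM)\n";
  sycl::free(A, q); sycl::free(B, q); sycl::free(C, q);
}
```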
## FFT
| | One Tile | Full Node | Scaling |
|---|---|---|---|
| Single-precision FFT C2C 1D | 3 TFlop/s | 34 TFlop/s | 10.8 |
| Single-precision FFT C2C 2D | 3 TFlop/s | 35 TFlop/s | 10.4 |
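Likewise, here is a hypothetical sketch of the single-precision C2C 1D measurement using oneMKL's DFT interface. The transform length and batch count are illustrative, and the flop count uses the standard 5·N·log2(N) estimate per complex transform; the 2D case is analogous with a two-dimensional descriptor.

```cpp
// Hypothetical batched, in-place single-precision C2C 1D FFT sketch.
#include <sycl/sycl.hpp>
#include <oneapi/mkl/dfti.hpp>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <complex>
#include <iostream>

namespace dft = oneapi::mkl::dft;

int main() {
  constexpr std::int64_t N = 1 << 22;              // transform length (illustrative)
  constexpr std::int64_t BATCH = 16;               // transforms per execution (illustrative)
  sycl::queue q{sycl::gpu_selector_v};
  auto *data = sycl::malloc_device<std::complex<float>>(N * BATCH, q);
  q.fill(data, std::complex<float>{1.0f, 0.0f}, N * BATCH).wait();

  dft::descriptor<dft::precision::SINGLE, dft::domain::COMPLEX> desc(N);
  desc.set_value(dft::config_param::NUMBER_OF_TRANSFORMS, BATCH);
  desc.set_value(dft::config_param::FWD_DISTANCE, N);
  desc.set_value(dft::config_param::BWD_DISTANCE, N);
  desc.commit(q);

  dft::compute_forward(desc, data).wait();         // warm-up

  double best_s = 1e30;
  for (int rep = 0; rep < 5; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    dft::compute_forward(desc, data).wait();
    auto t1 = std::chrono::steady_clock::now();
    best_s = std::min(best_s, std::chrono::duration<double>(t1 - t0).count());
  }
  // Standard FFT flop estimate: 5 * N * log2(N) per complex 1D transform.
  std::cout << 5.0 * N * std::log2((double)N) * BATCH / best_s / 1e12
            << " TFlop/s\n";
  sycl::free(data, q);
}
```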
Don't hesitate to contact ALCF staff (via email or Slack) for complaints, bug reports, or praise.