Single node "GPU-Peak" benchmarks

This work was done on a pre-production supercomputer with early versions of the Aurora software development kit.

This page aims to give you a high-level overview of key performance numbers for a single Aurora node.

  • We are providing both One Tile and Full Node numbers.
  • The Full Node numbers are the weak-scaling version of the One Tile numbers.
  • The Full Node numbers were obtained with one MPI rank per tile (12 ranks in total).
  • All benchmarks' source code and launch options are included so you can tweak them as needed.
  • We are not exhaustive. Please assume we cherry-picked the problem sizes that give the best numbers.
  • We will not compare the results to some “theoretical” value. Theoretical values are full of assumptions, and we want to keep this page short.
  • We will not compare the results to other hardware. Feel free to do it yourself 🙂
  • To improve reproducibility, only the “best” numbers are reported (e.g., we take the minimum time over repetitions; see the timing sketch after these notes). When doing "real" science, please perform a more rigorous statistical analysis.
  • The code uses a mixture of OpenMP and SYCL in C++ (sorry, Fortran, Python, and Level Zero lovers).

An asterisk (*) means the data was collected on Sunspot with an older software stack.
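As a concrete illustration of the "minimum over repetitions" methodology, here is a minimal SYCL sketch (not the actual benchmark code; the buffer size, repetition count, use of a pageable host buffer, and build command are illustrative assumptions) that times a host-to-device copy and reports the best bandwidth observed:

```cpp
// Minimal sketch of the timing methodology: repeat an operation and report
// the best (minimum-time) result. Sizes and repetition count are illustrative.
// Build with a SYCL compiler, e.g.: icpx -fsycl h2d.cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  constexpr size_t N = 1 << 28;                     // 2^28 doubles = 2 GiB
  sycl::queue q{sycl::gpu_selector_v};

  std::vector<double> host(N, 1.0);                 // pageable host buffer (assumption)
  double *dev = sycl::malloc_device<double>(N, q);

  double best = 1e30;                               // best (smallest) time in seconds
  for (int rep = 0; rep < 10; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    q.memcpy(dev, host.data(), N * sizeof(double)).wait();
    auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }
  std::printf("H2D bandwidth (best of 10): %.1f GB/s\n",
              N * sizeof(double) / best / 1e9);

  sycl::free(dev, q);
  return 0;
}
```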

Micro-benchmarks

| Benchmark | One Tile | Full Node |
| --- | --- | --- |
| Single Precision Peak Flops* | 23 TFlop/s | 264 TFlop/s |
| Double Precision Peak Flops | 17 TFlop/s | 203 TFlop/s |
| Memory Bandwidth (triad) | 1 TB/s | 12 TB/s |
| PCIe Unidirectional Bandwidth (H2D)* | 50 GB/s | 173 GB/s |
| PCIe Unidirectional Bandwidth (D2H)* | 49 GB/s | 122 GB/s |
| PCIe Bidirectional Bandwidth | 76 GB/s | 356 GB/s |
| Tile2Tile Unidirectional Bandwidth* | 296 GB/s | 1766 GB/s |
| Tile2Tile Bidirectional Bandwidth* | 147 GB/s | 689 GB/s |
| GPU2GPU Unidirectional Bandwidth* | 29 GB/s | 174 GB/s |
| GPU2GPU Bidirectional Bandwidth* | 57 GB/s | 316 GB/s |

Benchmark description

  • Double Precision Peak Flops: chain of FMAs (fused multiply-adds).
  • Memory Bandwidth (triad): triad kernel, 2 loads and 1 store per element (see the sketch after this list).
  • PCIe Unidirectional Bandwidth (H2D): Host-to-Device data transfer.
  • PCIe Unidirectional Bandwidth (D2H): Device-to-Host data transfer.
  • PCIe Bidirectional Bandwidth: concurrent Host-to-Device and Device-to-Host data transfers.
  • Tile2Tile Unidirectional Bandwidth: MPI Rank 0 (GPU N, Tile 0) sends a GPU buffer to Rank 1 (GPU N, Tile 1).
  • Tile2Tile Bidirectional Bandwidth: MPI Rank 0 (GPU N, Tile 0) sends a GPU buffer to Rank 1 (GPU N, Tile 1); concurrently, Rank 1 also sends a buffer to Rank 0.
  • GPU2GPU Unidirectional Bandwidth: MPI Rank 0 (GPU 0, Tile 0) sends a GPU buffer to Rank 1 (GPU 1, Tile 0).
  • GPU2GPU Bidirectional Bandwidth: MPI Rank 0 (GPU 0, Tile 0) sends a GPU buffer to Rank 1 (GPU 1, Tile 0); concurrently, Rank 1 also sends a buffer to Rank 0.
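For reference, here is a minimal SYCL sketch of the triad kernel (two loads and one store per element). The array size, scalar value, and repetition count are illustrative assumptions, not necessarily the settings used for the reported numbers:

```cpp
// Minimal triad sketch: a[i] = b[i] + scalar * c[i]  (2 loads, 1 store).
// Reported bandwidth counts 3 arrays of N doubles moved per iteration.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <chrono>
#include <cstdio>

int main() {
  constexpr size_t N = 1 << 28;                     // illustrative array size
  sycl::queue q{sycl::gpu_selector_v};

  double *a = sycl::malloc_device<double>(N, q);
  double *b = sycl::malloc_device<double>(N, q);
  double *c = sycl::malloc_device<double>(N, q);
  q.fill(b, 1.0, N).wait();
  q.fill(c, 2.0, N).wait();

  double best = 1e30;
  for (int rep = 0; rep < 10; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
       a[i] = b[i] + 3.0 * c[i];                    // triad
     }).wait();
    auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }
  std::printf("Triad bandwidth (best of 10): %.2f TB/s\n",
              3.0 * N * sizeof(double) / best / 1e12);

  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
  return 0;
}
```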

GEMM

| Benchmark | One Tile | Full Node |
| --- | --- | --- |
| DGEMM | 14 TFlop/s | 173 TFlop/s |
| SGEMM | 21 TFlop/s | 257 TFlop/s |
| HGEMM | 224 TFlop/s | 2094 TFlop/s |
| BF16GEMM | 238 TFlop/s | 2439 TFlop/s |
| TF32GEMM | 98 TFlop/s | 1204 TFlop/s |
| I8GEMM | 520 TFlop/s | 4966 TFlop/s |
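As a sketch of how such a measurement can be driven, here is a minimal DGEMM throughput example using oneMKL's SYCL (USM) BLAS interface. The matrix size, repetition count, and build flags are assumptions; the actual benchmark may use different settings:

```cpp
// Minimal DGEMM sketch with oneMKL's SYCL (USM) BLAS API.
// Build with e.g.: icpx -fsycl dgemm.cpp -qmkl   (flags may differ per system)
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  constexpr std::int64_t n = 8192;                  // illustrative square size
  sycl::queue q{sycl::gpu_selector_v};

  double *A = sycl::malloc_device<double>(n * n, q);
  double *B = sycl::malloc_device<double>(n * n, q);
  double *C = sycl::malloc_device<double>(n * n, q);
  q.fill(A, 1.0, n * n).wait();
  q.fill(B, 1.0, n * n).wait();
  q.fill(C, 0.0, n * n).wait();

  const auto nt = oneapi::mkl::transpose::nontrans;
  double best = 1e30;
  for (int rep = 0; rep < 5; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    oneapi::mkl::blas::column_major::gemm(q, nt, nt, n, n, n,
                                          1.0, A, n, B, n, 0.0, C, n).wait();
    auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }
  // A GEMM performs 2*m*n*k floating-point operations.
  std::printf("DGEMM (best of 5): %.1f TFlop/s\n", 2.0 * n * n * n / best / 1e12);

  sycl::free(A, q);
  sycl::free(B, q);
  sycl::free(C, q);
  return 0;
}
```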

FFT

| Benchmark | One Tile | Full Node |
| --- | --- | --- |
| Single-precision C2C 1D* | 2.3 TFlop/s | 25 TFlop/s |
| Single-precision C2C 2D* | 2.1 TFlop/s | 21 TFlop/s |
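Below is a minimal sketch of a single-precision 1D C2C forward transform using oneMKL's DFT interface. The transform length, repetition count, and build flags are illustrative assumptions, and the flop count uses the common 5·N·log2(N) convention, which may differ from the one used for the table above:

```cpp
// Minimal single-precision 1D complex-to-complex FFT sketch with oneMKL DFT.
// Build with e.g.: icpx -fsycl fft.cpp -qmkl   (flags may differ per system)
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <complex>
#include <cstdint>
#include <cstdio>

int main() {
  constexpr std::int64_t N = 1 << 24;               // illustrative transform length
  sycl::queue q{sycl::gpu_selector_v};

  oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE,
                               oneapi::mkl::dft::domain::COMPLEX> desc(N);
  desc.commit(q);

  auto *x = sycl::malloc_device<std::complex<float>>(N, q);
  q.fill(x, std::complex<float>{1.0f, 0.0f}, N).wait();

  double best = 1e30;
  for (int rep = 0; rep < 10; ++rep) {
    auto t0 = std::chrono::steady_clock::now();
    oneapi::mkl::dft::compute_forward(desc, x).wait();   // in-place forward FFT
    auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }
  // Common convention: a C2C FFT of length N costs ~5*N*log2(N) flops.
  std::printf("1D C2C FFT (best of 10): %.2f TFlop/s\n",
              5.0 * N * std::log2(static_cast<double>(N)) / best / 1e12);

  sycl::free(x, q);
  return 0;
}
```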

Don't hesitate to contact ALCF staff (via email or Slack) for complaints, bug reports, or praise.