Authors: John Holmen (Oak Ridge National Laboratory), Philipp Grete (Hamburg Observatory, University of Hamburg), Veronica Vergara Larrea (Oak Ridge National Laboratory)
Abstract: The Oak Ridge Leadership Computing Facility (OLCF) has been preparing the nation’s first exascale system, Frontier, for production and end users. Frontier is based on HPE Cray’s new EX architecture and Slingshot interconnect and features 74 cabinets of 3rd Gen AMD EPYC CPUs optimized for HPC and AI and AMD Instinct MI250X accelerators. As part of this preparation, “real-world” user codes have been selected to help assess the functionality, performance, and usability of the system. This paper describes early experiences using the system in collaboration with the Hamburg Observatory for two selected codes, which have since been adopted in the OLCF Test Harness. Experiences discussed include efforts to resolve performance variability and per-cycle slowdowns. Results are shown for a performance-portable astrophysical magnetohydrodynamics code, AthenaPK, and a miniapp stressing the core functionality of a performance-portable block-structured adaptive mesh refinement (AMR) framework, Parthenon-Hydro. These results show good scaling characteristics to the full system. At the largest scale, the Parthenon-Hydro miniapp reaches a total of $1.7 \times 10^{13}$ zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at $\approx92\%$ weak-scaling parallel efficiency (starting from a single node using a second-order, finite-volume method).
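For context, the quoted weak-scaling parallel efficiency follows the usual definition: per-node throughput at scale divided by the single-node baseline. The single-node baseline throughput is not quoted in this abstract, so the sketch below gives only the general definition together with the per-node figure implied by the numbers above; it is not a formula reproduced from the paper.

\[
  E(N) = \frac{T(N)/N}{T(1)},
  \qquad
  \frac{T(9216)}{9216} = \frac{1.7 \times 10^{13}\ \text{zone-cycles/s}}{9216}
  \approx 1.8 \times 10^{9}\ \text{zone-cycles/s per node},
\]

where $T(N)$ denotes the aggregate zone-update throughput on $N$ nodes and $T(1)$ the single-node baseline.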