CUG Archive

Papers

Balancing Workloads in More Ways than One

Authors: Veronica G. Melesse Vergara (Oak Ridge National Laboratory), Paul Peltz (Oak Ridge National Laboratory), Nick Hagerty (Oak Ridge National Laboratory), Christopher Zimmer (Oak Ridge National Laboratory), Reuben Budiardja (Oak Ridge National Laboratory), Dan Dietz (Oak Ridge National Laboratory), Thomas Papatheodore (Oak Ridge National Laboratory), Christopher Coffman (Oak Ridge National Laboratory), Benton Sparks (Oak Ridge National Laboratory)

Abstract: The newest system deployed by Oak Ridge National Laboratory (ORNL) as part of the National Climate-Computing Research Center (NCRC) strategic partnership between the U.S. Department of Energy and the National Oceanic and Atmospheric Administration (NOAA), named C5, is an HPE/Cray EX 3000 supercomputer with 1,792 nodes interconnected with HPE's Slingshot 10 technology. Each node comprises two 64-core AMD EPYC 7H12 processors and 256 GB of DRAM. In this paper, we describe the process ORNL used to deploy C5 and discuss the challenges we encountered while executing the acceptance test plan. These challenges include balancing: (1) production workloads running in parallel on the Gaea collection of systems, (2) the mixture and distribution of tests executed simultaneously on C5 against f2, the shared Lustre parallel file system, (3) the available compute and file system resources, and (4) schedule and resource constraints. Part of the work done to overcome these challenges was expanding the monitoring capabilities of the OLCF Test Harness, which we describe here. Finally, we present results from the NOAA benchmarks and OLCF applications used in this study, which could be useful for other centers deploying similar systems.


