Papers

Polaris and Acceptance Testing

Authors: Brian Homerding (Argonne National Laboratory), Ben Lenard (Argonne National Laboratory), Cyrus Blackworth (Argonne National Laboratory), Alex Kulyavtsev (Argonne National Laboratory), Carissa Holohan (Argonne National Laboratory), Gordon McPheeters (Argonne National Laboratory), Eric Pershy (Argonne National Laboratory), Paul Rich (Argonne National Laboratory), Doug Waldron (Argonne National Laboratory), Michael Zhang (Argonne National Laboratory), Kevin Harms (Argonne National Laboratory), Ti Leggett (Argonne National Laboratory), William Allcock (Argonne National Laboratory)

Abstract: The Argonne Leadership Computing Facility (ALCF) is home to Polaris, a system with a peak performance of 44 petaFLOPS (PF), developed in collaboration with Hewlett Packard Enterprise (HPE) and NVIDIA. Polaris is a heterogeneous system with 560 nodes utilizing NVIDIA GPUs, along with an HPE Slingshot interconnect and an HDR200 InfiniBand network to storage. Due to hardware availability, delivery was performed in multiple stages. We introduce both the hardware and software components of Polaris and discuss the performance results from our benchmarking analysis. ALCF policy is to perform a rigorous multi-week acceptance testing (AT) evaluation for every major system to ensure that the capabilities of that system can support ALCF users’ science application needs and meet ALCF system operational metrics. The various system components are thoroughly tested to ensure that the system will be stable for production operation, function correctly, and fulfill performance expectations for scientific workloads. We discuss how ALCF used Jenkins and ReFrame to perform the AT of the base Polaris system, as well as a second AT to evaluate the Polaris CPU upgrade. We present our approach for deploying Jenkins to streamline the AT evaluation, along with benchmarking improvements and lessons learned from the successful acceptance of the heterogeneous Polaris system.

Paper: PDF


