Authors: Zhengji Zhao (Lawrence Berkeley National Laboratory), Brian Austin (Lawrence Berkeley National Laboratory), Stefan Maintz (NVIDIA), Martijn Marsman (University of Vienna, VASP Software GmbH)
Abstract: NERSC’s new supercomputer, Perlmutter, an HPE Cray EX system, has recently entered production. NERSC users are transitioning from a Cray XC40 system based on Intel Haswell and KNL processors to Perlmutter, with NVIDIA A100 GPUs and AMD Milan CPUs offering more on-node parallelism and NUMA domains. VASP, a widely used materials science code that consumes about 20% of NERSC's computing cycles, has been ported to GPUs using OpenACC. For applications to achieve optimal performance, features specific to Cray EX must be explored, including build and runtime options. In this paper, we present a performance analysis of representative VASP workloads on Perlmutter, addressing practical questions of concern to hundreds of VASP users: What types of VASP workloads are suitable to run on GPUs? What is the optimal number of GPU nodes to use for a given problem size? How many MPI processes should share a GPU? What Slingshot options improve VASP performance? Is it worthwhile to enable OpenMP threads when running on GPU nodes? How many threads per task perform best on Milan CPU nodes? What are the most effective ways to minimize charging and energy costs when running VASP jobs on Perlmutter? This paper will serve as a Cray EX performance guide for VASP users and others.