Authors: Wenqing Peng (EPCC, The University of Edinburgh), Adrian Jackson (EPCC, The University of Edinburgh), Evgenij Belikov (EPCC, The University of Edinburgh)
Abstract: In this paper we present intra-node memory bandwidth measurements on ARCHER2 (AMD Rome) and LUMI (AMD Milan) obtained with the open-source CAMP (Configurable App for Memory Probing) tool, a configurable micro-benchmark that allows users to vary operational intensity, thread count and placement, and memory access pattern (contiguous, strided, several types of stencil, and random). We also gather power consumption data from the Slurm batch scheduler and correlate it with the access patterns used. For comparison, we run a further set of measurements on a NEXTGenIO node (Intel Ice Lake). Additionally, we extend CAMP to increase its resolution, so that operational intensities between zero and two can be assessed in more detail than in previous results. Moreover, we illustrate the mechanism for using custom kernels in CAMP, with a dot product as an example. Our results confirm and extend previous findings that maximum bandwidth is reached with only a fraction of the cores available on a node. In particular, for strided access with a stride of four and for contiguous access, we observe up to 11% higher bandwidth with 16 threads than with all 128 cores on an ARCHER2 node, and up to 15% higher on LUMI, especially for operational intensities below 0.5. This suggests that underpopulating nodes may be a viable way to achieve higher performance than full node utilisation, and that benchmarking should therefore include tests using only a fraction of the available cores per node. Additionally, sub-NUMA-node awareness may be required to reach the highest performance.
Paper: PDF