CUG Archive

Papers

Hiding I/O using SMT on the ARCHER2 HPE Cray EX system

Authors: Shrey Bhardwaj (EPCC, The University of Edinburgh), Paul Bartholomew (EPCC, The University of Edinburgh), Mark Parsons (EPCC, The University of Edinburgh)

Abstract: In modern HPC systems, the I/O bottleneck limits the overall application wall clock time. To address this problem, this work tests the hypothesis that the effective I/O bandwidth can be improved by using simultaneous multithreading (SMT) on ARCHER2, an HPE Cray EX supercomputing system. To test this, a benchmark library, iocomp, was developed; it uses MPI to separate the computation and I/O processes. These processes can then be mapped either to the two hardware threads of a single core using SMT or, for comparison, to separate cores. For preliminary testing, the STREAM benchmark is used as the computational kernel and the iocomp library is used for the I/O operations. Timers are added to the application to record the computation time and the wall clock time. Preliminary results show that, on a full ARCHER2 node, the wall clock time with SMT is 30% greater than when the computation and I/O processes are placed on separate cores. As the STREAM benchmark is an unrealistic test case, the HPCG and HPL benchmarks will be used next to test the hypothesis.
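
The separation of computation and I/O processes described in the abstract can be illustrated with a minimal sketch. It is not taken from iocomp: the even/odd pairing, the names `assign_role`, `Role`, `COMPUTE` and `IO`, and the requirement of an even rank count are all assumptions made for illustration. The sketch only computes the role and partner of each rank; in a real MPI program the resulting colour could be passed as the `color` argument of `MPI_Comm_split`, and whether a pair shares one core (SMT) or occupies two cores would be decided by the process binding chosen at launch.

```python
from dataclasses import dataclass

# Hypothetical colour values; any two distinct non-negative integers would do.
COMPUTE = 0
IO = 1


@dataclass(frozen=True)
class Role:
    colour: int   # value that could be passed as `color` to MPI_Comm_split
    partner: int  # global rank of the process this rank is paired with


def assign_role(rank: int, size: int) -> Role:
    """Pair ranks (0,1), (2,3), ...: even ranks compute, odd ranks do I/O.

    This pairing is an assumption for illustration, not iocomp's scheme.
    """
    if size <= 0 or size % 2 != 0:
        raise ValueError("this pairing needs a positive, even number of ranks")
    if not 0 <= rank < size:
        raise ValueError("rank out of range")
    if rank % 2 == 0:
        return Role(COMPUTE, rank + 1)  # compute rank sends data to rank + 1
    return Role(IO, rank - 1)           # I/O rank receives data from rank - 1


if __name__ == "__main__":
    # Show the role of each rank in a four-process job.
    for r in range(4):
        role = assign_role(r, 4)
        kind = "compute" if role.colour == COMPUTE else "I/O"
        print(f"rank {r}: {kind}, partner {role.partner}")
```

With this layout, each compute rank hands its data to a dedicated I/O rank, so the placement of the two members of a pair (same core or separate cores) is the only thing that changes between the two configurations being compared.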

Paper: PDF
