CUG2021 Proceedings

Monday, May 3rd

8:15am-8:45am

Acceptance and Testing

Stephen Leak

Acceptance Testing the Chicoma HPE-Cray EX Supercomputer

A Step Towards the Final Frontier: Lessons Learned from Acceptance Testing of the First HPE/Cray EX 3000 System at ORNL

Presentation, Paper

9:00am-9:30am

Storage and I/O 1

Tina Declerck

New data path solutions from HPE for HPC simulation, AI, and high performance workloads

Lustre and Spectrum Scale: Simplify parallel file system workflows with HPE Data Management Framework

Presentation, Paper

9:45am-10:15am

Storage and I/O 2

Veronica G. Vergara Larrea

h5bench: HDF5 I/O Kernel Suite for Exercising HPC I/O Patterns

h5bench: HDF5 I/O Kernel Suite for Exercising HPC I/O Patterns

Tonglin Li (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center); Suren Byna (Lawrence Berkeley National Laboratory); Quincey Koziol (Lawrence Berkeley National Laboratory, National Center for Supercomputing Applications); and Houjun Tang, Jean Luca Bez, and Qiao Kang (Lawrence Berkeley National Laboratory)

Abstract

Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputing systems. With massive amounts of data being produced or consumed by compute nodes, high performant parallel I/O is essential. I/O benchmarks play an important role in this process, however, there is a scarcity of I/O benchmarks that are representative of current workloads on HPC systems. Towards creating representative I/O kernels from real world applications, we have created h5bench a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is because of the parallel I/O library's heavy usage in a wide variety of scientific applications running on supercomputing systems. The various dimensions of h5bench include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes) and I/O modes (synchronous and asynchronous). In this paper, we present the observed performance of h5bench executed along several of these dimensions on a Cray system: Cori at NERSC using both the DataWarp burst buffer and a Lustre file system and Summit at Oak Ridge Leadership Computing Facility (OLCF) using a SpectrumScale file system. These performance measurements are using find performance bottlenecks, identify root causes of any poor performance, and optimize I/O performance. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful not only to the CUG community but also to the broader supercomputing community.

pdf, pdf

Architecture and Performance of Perlmutter's 35 PB ClusterStor E1000 All-Flash File System

Presentation, Paper

Tuesday, May 4th

8:05am-8:52am

System Analytics and Monitoring

Jim Brandt

Integrating System State and Application Performance Monitoring: Network Contention Impact

trellis — An Analytics Framework for Understanding Slingshot Performance

AIOps: Leveraging AI/ML for Anomaly Detection in System Management

Real-time Slingshot Monitoring in HPCM

Analytic Models to Improve Quality of Service of HPC Jobs

Presentation, Paper

9:05am-9:50am

Systems Support

Hai Ah Nam

Blue Waters System and Component Reliability

Configuring and Managing Multiple Shasta Systems: Best Practices Developed During the Perlmutter Deployment

Slurm on Shasta at NERSC: adapting to a new way of life

Declarative automation of compute node lifecycle through Shasta API integration

Cray EX Shasta v1.4 System Management Overview

Managing User Access with UAN and UAI

User and Administrative Access Options for CSM-Based Shasta Systems

HPE Ezmeral Container Platform: Current And Future

Presentation, Paper

Wednesday, May 5th

8:00am-8:45am

Applications and Performance (ARM)

Simon McIntosh-Smith

An Evaluation of the A64FX Architecture for HPC Applications

Vectorising and distributing NTTs to count Goldbach partitions on Arm-based supercomputers

Optimizing a 3D multi-physics continuum mechanics code for the HPE Apollo 80 System

Presentation, Paper

9:00am-9:47am

Applications and Performance

Zhengji Zhao

Optimizing the Cray Graph Engine for Performant Analytics on Cluster, SuperDome Flex, Shasta Systems and Cloud Deployment

Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis

Early Experiences Evaluating the HPE/Cray Ecosystem for AMD GPUs

Convergence of AI and HPC at HLRS. Our Roadmap.

Porting Codes to LUMI

Presentation, Paper

Thursday, May 6th

7:00am-11:00am

BoF 1

Bilel Hadri

Update of Cray Programming Environment

Programming Environments, Applications, and Documentation (PEAD) Special Interest Group meeting

HPC System Test: Building a cross-center collaboration for system testing

HPC System Test: Building a cross-center collaboration for system testing

Veronica G. Vergara Larrea (Oak Ridge National Laboratory), Bilel Hadri (King Abdullah University of Science and Technology), Reuben Budiardja (Oak Ridge National Laboratory), Vasileios Karakasis (Swiss National Supercomputing Centre), Shahzeb Siddiqui (Lawrence Berkeley National Laboratory), and George Markomanolis (CSC - IT Center for Science Ltd.)

Abstract

This session builds upon an effort started at CUG 2019 and continued at SC19 in which several HPC centers gathered to discuss acceptance and regression testing procedures and frameworks. From that session, we learned there are many commonalities in the procedures and tools utilized for system testing. CSCS, KAUST, and NERSC use the ReFrame framework for regression testing. While other centers, like NCSA and OLCF, have built in-house tools for acceptance testing. From the experiences shared, we see there are many benchmarks and applications that are widely run which often become part of a local test suite. These common elements are a strong indication that a tighter collaboration between centers would be beneficial. Furthermore, as systems become more complex, leveraging the HPC community to develop and maintain the growing number of tests needed to assess a system is key.

The BOF will include lightning presentations from HPC centers that are using different testing frameworks for regression and acceptance. These will be followed by discussions around these topics: What challenges are centers currently facing in this area? What role should the vendors play in testing? How can we leverage testing efforts across centers to develop a maintainable collection of tests?

This BOF invites attendees participation to form the HPC System Test working group. The group will collaborate to define a set of guidelines and methodologies that can be used to build and maintain a collection of HPC system tests. All products from the session will be publicly available at: https://olcf.github.io/hpc-system-test-wg/

pdf

Birds of a Feather, Paper