Presentation, Paper Acceptance and Testing Chair: Stephen Leak (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory, Lawrence Berkeley National Laboratory) Acceptance Testing the Chicoma HPE-Cray EX Supercomputer Kody Everson (Los Alamos National Laboratory, Dakota State University Advanced Research Laboratory) and Paul Ferrell, Jennifer Green, Francine Lapid, Daniel Magee, Jordan Ogas, Calvin Seamons, and Nicholas Sly (Los Alamos National Laboratory) Abstract Abstract Since the installation of MANIAC I in 1952, Los Alamos National Laboratory (LANL) has been at the forefront of addressing global crises using state-of-the-art computational resources to accelerate scientific innovation and discovery. This generation faces a new crisis in the global COVID-19 pandemic that continues to damage economies, health, and wellbeing; LANL is supplying high-performance computing (HPC) resources to contribute to the recovery from the impacts of this virus. A Step Towards the Final Frontier: Lessons Learned from Acceptance Testing of the First HPE/Cray EX 3000 System at ORNL Veronica G. Vergara Larrea, Reuben Budiardja, Paul Peltz, Jeffery Niles, Christopher Zimmer, Daniel Dietz, Christopher Fuson, Hong Liu, Paul Newman, James Simmons, and Chris Muzyn (Oak Ridge National Laboratory) Abstract Abstract In this paper, we summarize the deployment of the Air Force Weather (AFW) HPC11 system at Oak Ridge National Laboratory (ORNL) including the process followed to successfully complete acceptance testing of the system. HPC11 is the first HPE/Cray EX 3000 system that has been successfully released to its user community in a federal facility. HPC11 consists of two identical 800-node supercomputers, Fawbush and Miller, with access to two independent and identical Lustre parallel file systems. HPC11 is equipped with Slingshot 10 interconnect technology and relies on the HPE Performance Cluster Manager (HPCM) software for system configuration. ORNL has a clearly defined acceptance testing process used to ensure that every new system deployed can provide the necessary capabilities to support user workloads. We worked closely with HPE and AFW to develop a set of tests for the United Kingdom’s Meteorological Office’s Unified Model (UM) and 4DVAR. We also included benchmarks and applications from the Oak Ridge Leadership Computing Facility (OLCF) portfolio to fully exercise the HPE/Cray programming environment and evaluate the functionality and performance of the system. Acceptance testing of HPC11 required parallel execution of each element on Fawbush and Miller. In addition, careful coordination was needed to ensure successful acceptance of the newly deployed Lustre file systems alongside the compute resources. In this work, we present test results from specific system components and provide an overview of the issues identified, challenges encountered, and the lessons learned along the way. Presentation, Paper Storage and I/O 1 Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) New data path solutions from HPE for HPC simulation, AI, and high performance workloads Lance Evans and Marc Roskow (HPE) Abstract Abstract HPE is extending its HPC storage portfolio to include an IBM Spectrum Scale based solution. The HPE solution will leverage HPE servers and the robustness of Spectrum scale to address the increasing demand for “enterprise” HPC systems. 
IBM Spectrum Scale is an enterprise-grade parallel file system that provides superior resiliency, scalability and control. IBM Spectrum Scale delivers scalable capacity and performance to handle demanding data analytics, content repositories and technical computing workloads. Storage administrators can combine flash, disk, cloud, and tape storage into a unified system with higher performance and lower cost than traditional approaches. Leveraging HPE Proliant servers, we will deliver a range of storage (NSD), protocol, and data mover servers with a granularity that addresses small AI systems to large HPC scratch spaces with exceptional cost and flexibility. Lustre and Spectrum Scale: Simplify parallel file system workflows with HPE Data Management Framework Mark Wiertalla and Kirill Malkin (HPE) and Zsolt Ferenczy (HPEHPE) Abstract Abstract Lustre and Spectrum Scale: Simplify parallel file system workflows with HPE Data Management Framework Presentation, Paper Storage and I/O 2 Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) h5bench: HDF5 I/O Kernel Suite for Exercising HPC I/O Patterns Tonglin Li (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center); Suren Byna (Lawrence Berkeley National Laboratory); Quincey Koziol (Lawrence Berkeley National Laboratory, National Center for Supercomputing Applications); and Houjun Tang, Jean Luca Bez, and Qiao Kang (Lawrence Berkeley National Laboratory) Abstract Abstract Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputing systems. With massive amounts of data being produced or consumed by compute nodes, high performant parallel I/O is essential. I/O benchmarks play an important role in this process, however, there is a scarcity of I/O benchmarks that are representative of current workloads on HPC systems. Towards creating representative I/O kernels from real world applications, we have created h5bench a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is because of the parallel I/O library's heavy usage in a wide variety of scientific applications running on supercomputing systems. The various dimensions of h5bench include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes) and I/O modes (synchronous and asynchronous). In this paper, we present the observed performance of h5bench executed along several of these dimensions on a Cray system: Cori at NERSC using both the DataWarp burst buffer and a Lustre file system and Summit at Oak Ridge Leadership Computing Facility (OLCF) using a SpectrumScale file system. These performance measurements are using find performance bottlenecks, identify root causes of any poor performance, and optimize I/O performance. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful not only to the CUG community but also to the broader supercomputing community. Architecture and Performance of Perlmutter's 35 PB ClusterStor E1000 All-Flash File System Glenn K. Lockwood (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center) and Alberto Chiusole, Lisa Gerhardt, Kirill Lozinskiy, David Paul, and Nicholas J. 
Wright (Lawrence Berkeley National Laboratory) Abstract Abstract NERSC's newest system, Perlmutter, features a 35 PB all-flash Lustre file system built on HPE Cray ClusterStor E1000. We present its architecture, early performance figures, and performance considerations unique to this architecture. We demonstrate the performance of E1000 OSSes through low-level Lustre tests that achieve over 90% of the theoretical bandwidth of the SSDs at the OST and LNet levels. We also show end-to-end performance for both traditional dimensions of I/O performance (peak bulk-synchronous bandwidth) and non-optimal workloads endemic to production computing (small, incoherent I/Os at random offsets) and compare them to NERSC's previous system, Cori, to illustrate that Perlmutter achieves the performance of a burst buffer and the resilience of a scratch file system. Finally, we discuss performance considerations unique to all-flash Lustre and present ways in which users and HPC facilities can adjust their I/O patterns and operations to make optimal use of such architectures. Presentation, Paper System Analytics and Monitoring Chair: Jim Brandt (Sandia National Laboratories) Integrating System State and Application Performance Monitoring: Network Contention Impact Jim Brandt (Sandia National Laboratories); Tom Tucker (Open Grid Computing); and Simon Hammond, Ben Schwaller, Ann Gentile, Kevin Stroup, and Jeanine Cook (Sandia National Laboratories) Abstract Abstract Discovering and attributing application performance variation in production HPC systems requires continuous concurrent information on the state of the system and applications, and of applications’ progress. Even with such information, there is a continued lack of understanding of how time-varying system conditions relate to a quantifiable impact on application performance. We have developed a unified framework to obtain and integrate, at run time, both system and application information to enable insight into application performance in the context of system conditions. The Lightweight Distributed Metric Service (LDMS) is used on several significant large-scale Cray platforms for the collection of system data and is planned for inclusion on several upcoming HPE systems. We have developed a new capability to inject application progress information into the LDMS data stream. The consistent handling of both system and application data eases the development of storage, performance analytics, and dashboards. We illustrate the utility of our framework by providing runtime insight into application performance in conjunction with network congestion assessments on a Cray XC40 system with a beta Programming Environment being used to prepare for the upcoming ACES Crossroads system. We describe possibilities for application to the Slingshot network. The complete system is generic and can be applied to any *-nix system; the system data can be obtained by both generic and system-specific data collection plugins (e.g., Aries vs Slingshot counters); and no application changes are required when the injection is performed by a portability abstraction layer, such as that employed by kokkos. trellis — An Analytics Framework for Understanding Slingshot Performance Madhu Srinivasan, Dipanwita Mallick, Kristyn Maschhoff, and Haripriya Ayyalasomayajula (Hewlett Packard Enterprise) Abstract Abstract The next generation HPE Cray EX and HPE Apollo supercomputers with Slingshot interconnect are breaking new ground in the collection and analysis of system performance data. 
The monitoring frameworks on these systems provide visibility into Slingshot's operational characteristics through advanced instrumentation and transparency into real-time network performance. There still exists, however, a wide gap between the volume of telemetry generated by Slingshot and a user's ability to assimilate and explore this data to derive critical, timely, and actionable insights about fabric health, application performance, and potential congestion scenarios. In this work, we present trellis --- an analytical framework built on top of Slingshot monitoring APIs. The goal of trellis is to provide system-administrators and researchers insight into network performance, and its impact on complex workflows that include both AI and traditional simulation workloads. We also present a visualization interface, built on trellis, that allows users to interactively explore through various levels of the network topology over specified time windows, and gain key insights into job performance and communication patterns. We demonstrate these capabilities on an internal Shasta development system and visualize Slingshot's innovative congestion-control and adaptive-routing in action. AIOps: Leveraging AI/ML for Anomaly Detection in System Management Sergey Serebryakov, Jeff Hanson, Tahir Cader, Deepak Nanjundaiah, and Joshi Subrahmanya (Hewlett-Packard Enterprise) Abstract Abstract HPC datacenters rely on set-points and dashboards for system management, which leads to thousands of false alarms. Exascale systems will deploy thousands of servers and sensors, produce millions of data points per second, and be more prone to management errors and equipment failures. HPE and the National Renewable Energy Lab (NREL) are using AI/ML to improve data center resiliency and energy efficiency. HPE has developed and deployed in NREL’s production environment (since June 2020), an end-to-end anomaly detection pipeline that operates in real-time, automatically, and at massive scale. In the paper, we will provide detailed results from several end-to-end anomaly detection workflows either already deployed at NREL, or to be deployed soon. We will describe the upcoming AIOps release as a technology preview with HPCM 1.5, plans for future deployment with Cray System Manager, and potential use as an Edge processor (inferencing engine) for HPE’s InfoSight analytics platform. Real-time Slingshot Monitoring in HPCM Priya K, Prasanth Kurian, and Jyothsna Deshpande (Hewlett Packard Enterprise) Abstract Abstract HPE Performance Cluster Manager (HPCM) software is used to provision, monitor, and manage HPC cluster hardware and software components. HPCM has a centralized monitoring infrastructure for persistent storage of telemetry and alerting on these metrics based on thresholds. Slingshot fabric management and monitoring is the new feature in HPCM Monitoring infrastructure. Slingshot Telemetry (SST) monitoring framework in HPCM is used for collecting and storing Slingshot fabric health and performance telemetry. Real time telemetry information gathered by SST is used for fabric health monitoring, real-time analytics, visualization, and alerting. This solution is capable of both vertical and horizontal scalability, handling huge volumes of telemetry data. Flexible and extensible model of the SST collection agent makes it easy to collect metrics at different granularities and intervals. Visualization dashboards are designed to suit different use cases, giving a complete view of fabric health. 
Analytic Models to Improve Quality of Service of HPC Jobs Saba Naureen, Prasanth Kurian, and Amarnath Chilumukuru (HPE) Abstract Abstract A typical High Performance Computing (HPC) cluster comprises of components such as CPU, Memory, GPU, Ethernet, Fabric, storage, racks, cooling devices and switches. A cluster usually consists of 1000’s of compute nodes interconnected using an Ethernet network for management tasks and Fabric network for data traffic. Job scheduler need to be aware of the health & availability of the cluster components in order to deliver high performance results. Since the failure of any component will adversely impact the overall performance of a job, identifying the issues or outages is very critical for ensuring the desired Quality of Service (QoS) is achieved. We showcase an analytics based model implemented as part of HPE Performance Cluster Manager (HPCM) that gathers and analyzes the telemetry data pertaining to the various cluster components like the racks, enclosures, cluster nodes, storage devices, Fabric switches, Cooling Distribution Unit (CDU), ARC (Adaptive Rack Cooling), Chassis Management Controller (CMC), fabric, power supplies & system logs. This real-time status information based on the telemetry data is utilized by the Job Schedulers to perform scheduling tasks effectively. This enables schedulers to take smart decisions and ensure that it schedules jobs only on healthy nodes, thus preventing job failures and wastage of computational resources. Our solution enables HPC job schedulers to be health-aware resulting in improving the reliability of the clusters and improve overall customer experience. Presentation, Paper Systems Support Chair: Hai Ah Nam (Lawrence Berkeley National Laboratory) Blue Waters System and Component Reliability Brett Bode, David King, Celso Mendes, and William Kramer (National Center for Supercomputing Applications/University of Illinois); Saurabh Jha (University of Illinois); and Roger Ford, Justin Davis, and Steven Dramstad (Cray Inc.) Abstract Abstract The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual-socket CPU (XE) and single-socket CPU, single GPU (XK) nodes. The primary storage is provided by Cray’s Sonexion/ClusterStor Lustre storage system delivering 35PB (raw) storage at 1TB/sec. The statistical failure rates over time for each component including CPU, DIMM, GPU, disk drive, power supply, blower, etc and their impact on higher level failure rates for individual nodes and the systems as a whole are presented in detail, with a particular emphasis on identifying any increases in rate that might indicate the right-side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented. 
Configuring and Managing Multiple Shasta Systems: Best Practices Developed During the Perlmutter Deployment James Botts (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Zachary Crisler (Hewlett Packard Enterprise); Aditi Gaur and Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Harold Longley, Alex Lovell-Troy, and Dave Poulsen (Hewlett Packard Enterprise); and Eric Roman and Chris Samuel (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract The perlmutter supercomputer and related test systems provide an early look at Shasta system management and our ideas on best practices for managing Shasta systems. The cloud-native software and ethernet-based networking on the system enable tremendous flexibility in management policies and methods. Based on work performed using Shasta 1.3 and previewed 1.4 releases, NERSC has developed, in close collaboration with HPE through the perlmutter System Software COE, methodologies for efficiently managing multiple Shasta systems. We describe how we template and synchronize configurations and software between systems and orchestrate manipulations of the configuration of the managed system. Key to this is a secured external management system that provides both a configuration origin for the system and an interactive management space. Leveraging this external management system we simultaneously create a systems-development environment as well as secure key aspects of the Shasta system, enabling NERSC to rapidly deploy the perlmutter system. Slurm on Shasta at NERSC: adapting to a new way of life Christopher Samuel (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center, National Energy Research Scientific Computing Center) and Douglas M. Jacobsen and Aditi Gaur (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center) Abstract Abstract Shasta, with its heady mix of kubernetes, containers, software defined networking and 1970s batch computing provides a vast array of new concepts, strategies and acronyms for traditional HPC administrators to adapt to. NERSC has been working through this maze to take advantage of the new capabilities that Shasta brings in order to provide a stable and expected interface for traditional HPC workloads on Perlmutter whilst also taking advantage of Shasta and new abilities in Slurm to provide more modern interfaces and capabilities for production use. Declarative automation of compute node lifecycle through Shasta API integration J. Lowell Wofford and Kevin Pelzel (Los Alamos National Laboratory) Abstract Abstract Using the Cray Shasta system available at Los Alamos National Laboratory, we have experimented with integrating with various components of the HPE Cray Shasta software stack through the provided APIs. We have integrated with a LANL open-source software project, Kraken, which provides distributed state-based automation to provide new automation and management features to the Shasta system. We have focused on managing Shasta compute node lifecycle with Kraken, providing automation to node operations such as node image, kernel and configuration management. We examine the strengths and challenges of integrating with the Shasta APIs and discuss possibilities for further API integrations. 
Cray EX Shasta v1.4 System Management Overview Harold Longley (Hewlett Packard Enterprise) Abstract Abstract How do you manage a Cray EX (Shasta) system? This overview describes the Cray System Management software in the Shasta v1.4 release. This release has introduced new features such as booting management nodes from images, product streams, and configuration layers. The foundation of containerized microservices orchestrated by Kubernetes on the management nodes provides a highly available and resilient set of services to manage the compute and application nodes. Lower level hardware control is based on the DMTF Redfish standard enabling higher level hardware management services to control and monitor components and manage firmware updates. The network management services enable control of the high speed network fabric. The booting process relies upon preparation of images and configuration as well as run-time interaction between the nodes and services while nodes boot and configure. All microservices have published RESTful APIs for those who want to integrate management functions into their existing DevOps environment. The v1.4 software includes the cray CLI and the SAT (System Administration Toolkit) CLI which are clients that use these services. Identity and access management protect critical resources, such as the API gateway. Non-administrative users access the system either through a multi-user Linux node (User Access Node) or a single-user container (User Access Instance) managed by Kubernetes. Logging and telemetry data can be sent from the system to other site infrastructure. The tools for collection, monitoring, and analysis of telemetry and log data have been improved with new alerts and notifications. Managing User Access with UAN and UAI Harold Longley, Alex Lovell-Troy, and Gregory Baker (Hewlett Packard Enterprise) Abstract Abstract User Access Nodes (UANs) and User Access Instances (UAIs) represent the primary entry point for users on a Cray EX system to develop, build, and execute their applications on the Cray EX compute nodes. The UAN is a traditional, multi-user Linux node. The UAI is a dynamically provisioned, single user container which can be customized to the user’s needs. This presentation will describe the state of the Shasta v1.4 software for user access with UAN and UAI, provisioning software products for users, providing access to shared filesystems, granting and revoking authentication and authorization, logging of access, and monitoring of resource utilization. User and Administrative Access Options for CSM-Based Shasta Systems Alex Lovell-Troy, Sean Lynn, and Harold Longley (Hewlett Packard Enterprise) Abstract Abstract Cray System Management (CSM) from HPE is a cloud-like control system for High Performance Computing. CSM is designed to integrate the Supercomputer with multiple datacenter networks and provide secure administrative access via authenticated REST APIs. Access to the compute nodes and to the REST APIs may need to follow different network paths which has network routing implications. This paper outlines the flexible network configurations and guides administrators planning their Shasta/CSM systems. Site Administrators have configuration options for allowing users and administrators to access the REST APIs from outside. They also have options for allowing applications running on the compute nodes to access these same APIs. This paper is structured around three themes. 
The first theme defines a layer2/layer3 perimeter around the system and addresses upstream connections to the site network. The second theme deals primarily with layer 3 subnet routing from the network perimeter inward. The third theme deals with administrative access control at various levels of the network as well as user-based access controls to the APIs themselves. Finally, this paper will combine the themes to describe specific use cases and how to support them with available administrative controls. HPE Ezmeral Container Platform: Current And Future Thomas Phelan (HPE) Abstract Abstract The HPE Ezmeral Container Platform is the industry's first enterprise-grade container platform for both cloud-native and distributed non cloud-native applications using the open-source Kubernetes container orchestrator. Ezmeral enables true hybrid cloud operations across any location: on-premises, public cloud, and edge. Today, the HPE Ezmeral Container Platform is largely used for enterprise AI/ML/DL applications. However, the industry is starting to see a convergence of AI/ML/DL and High Performance Computer (HPC) workloads. This session will present an overview of the HPE Ezmeral Container Platform - its architecture, features, and usecases. It will also provide a look into the future product roadmap where the platform will support HPC workloads as well. Presentation, Paper Applications and Performance (ARM) Chair: Simon McIntosh-Smith (University of Bristol) An Evaluation of the A64FX Architecture for HPC Applications Andrei Poenaru and Tom Deakin (University of Bristol, GW4); Simon McIntosh-Smith (University of Bristol); and Si Hammond and Andrew Younge (Sandia National Laboratories) Abstract Abstract In this paper, we present some of the first in-depth, rigorous, independent benchmark results for the A64FX, the processor at the heart of Fugaku, the current #1 supercomputer in the world, and now available in Apollo 80 guise. The Isambard and Astra research teams have combined to perform this study, using a combination of mini-apps and application benchmarks to evaluate A64FX's performance for both compute- and bandwidth-bound scenarios. The study uniquely had access to all four major compilers for A64FX: Cray, Arm, GNU and Fujitsu. The results showed that the A64FX is extremely competitive, matching or exceeding contemporary dual-socket x86 servers. We also report tuning and optimisation techniques which proved essential for achieving good performance on this new architecture. Vectorising and distributing NTTs to count Goldbach partitions on Arm-based supercomputers Ricardo Jesus (EPCC, The University of Edinburgh); Tomás Oliveira e Silva (IEETA/DETI, Universidade de Aveiro); and Michèle Weiland (EPCC, The University of Edinburgh) Abstract Abstract In this paper we explore the usage of SVE to vectorise number-theoretic transforms (NTTs). In particular, we show that 64-bit modular arithmetic operations, including modular multiplication, can be efficiently implemented with SVE instructions. The vectorisation of NTT loops and kernels involving 64-bit modular operations was not possible in previous Arm-based SIMD architectures, since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on the A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. 
We evaluate this transform on HPE Apollo 70, Cray XC50, and HPE Apollo 80 systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilised to count the number of Goldbach partitions of all even numbers to large limits. We present some preliminary results concerning this problem, in particular a histogram of the number of Goldbach partitions of the even numbers up to 2^40. Optimizing a 3D multi-physics continuum mechanics code for the HPE Apollo 80 System Vince Graziano (New Mexico Consortium, Los Alamos National Laboratory) and David Nystrom, Howard Pritchard, Brandon Smith, and Brian Gravelle (Los Alamos National Laboratory) Abstract Abstract We present results of a performance evaluation of a LANL 3D multi-physics continuum mechanics code - Pagosa - on an HPE Apollo 80 system. The Apollo 80 features the Fujitsu A64FX ARM processor with Scalable Vector Extension (SVE) support and high bandwidth memory. This combination of SIMD vector units and high memory bandwidth offers the promise of realizing a significant fraction of the theoretical peak performance for applications like Pagosa. In this paper we present performance results of the code using the GNU, ARM, and CCE compilers, analyze these compilers’ ability to vectorize performance critical loops when targeting the SVE instruction set, and describe code modifications to improve the performance of the application on the A64FX processor. Presentation, Paper Applications and Performance Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory, Lawrence Berkeley National Laboratory) Optimizing the Cray Graph Engine for Performant Analytics on Cluster, SuperDome Flex, Shasta Systems and Cloud Deployment Christopher Rickett, Kristyn Maschhoff, and Sreenivas Sukumar (Hewlett Packard Enterprise) Abstract Abstract We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependences on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-side MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, Infiniband clusters and Slingshot interconnect-based Shasta systems. Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis Best Paper Johannes P. Blaschke (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Aaron S. Brewster, Daniel W. Paley, Derek Mendez, Asmit Bhowmick, and Nicholas K. 
Sauter (Lawrence Berkeley National Laboratory/Physical Biosciences Division); Wilko Kröger and Murali Shankar (SLAC National Accelerator Laboratory); and Bjoern Enders and Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSC’s Cori XC40 system, using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using the software package CCTBX. We achieved real time data analysis with a turnaround time from data acquisition to full molecular reconstruction in as little as 10 min -- sufficient time for the experiment’s operators to make informed decisions. By hosting the data analysis on Cori, and by automating LCLS-NERSC interoperability, we achieved a data analysis rate which matches the data acquisition rate. Completing data analysis with 10 mins is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends. Early Experiences Evaluating the HPE/Cray Ecosystem for AMD GPUs Veronica G. Vergara Larrea, Reuben Budiardja, and Wayne Joubert (Oak Ridge National Laboratory) Abstract Abstract Since deploying the Titan supercomputer in 2012, the Oak Ridge Leadership Computing Facility (OLCF) has continued to support and promote GPU-accelerated computing among its user community. Summit, the flagship system at the OLCF --- currently number 2 in the most recent TOP500 list --- has a theoretical peak performance of approximately 200 petaflops. Because the majority of Summit’s computational power comes from its 27,972 GPUs, users must port their applications to one of the supported programming models in order to make efficient use of the system. Looking ahead to Frontier, the OLCF’s exascale supercomputer, users will need to adapt to an entirely new ecosystem which will include new hardware and software technologies. First, users will need to familiarize themselves with the AMD Radeon GPU architecture. Furthermore, users who have been previously relying on CUDA will need to transition to the Heterogeneous-Computing Interface for Portability (HIP) or one of the other supported programming models (e.g., OpenMP, OpenACC). In this work, we describe our initial experiences in porting three applications or proxy apps currently running on Summit to the HPE/Cray ecosystem to leverage the compute power from AMD GPUs: minisweep, GenASiS, and Sparkler. Each one is representative of current production workloads utilized at the OLCF, different programming languages, and different programming models. We also share lessons learned from challenges encountered during the porting process and provide preliminary results from our evaluation of the HPE/Cray Programming Environment and the AMD software stack using these key OLCF applications. Convergence of AI and HPC at HLRS. Our Roadmap. 
Denns Hoppe (High Performance Computing Center Stuttgart) Abstract Abstract The growth of artificial intelligence (AI) is accelerating. AI has left research and innovation labs, and nowadays plays a significant role in everyday lives. The impact on society is graspable: autonomous driving cars produced by Tesla, voice assistants such as Siri, and AI systems that beat renowned champions in board games like Go. All these advancements are facilitated by powerful computing infrastructures based on HPC and advanced AI-specific hardware, as well as highly optimized AI codes. Since several years, HLRS is engaged in big data and AI-specific activities around HPC. The road towards AI at HLRS began several years ago with installing a Cray Urika-GX for processing large volumes of data. But due to the isolated platform and, for HPC users, different usage concept, uptake of this system was lower than expected. This drastically changed recently with the installation of a CS-Storm equipped with powerful GPUs. Since then, we are also extending our HPC system with GPUs due to a high customer demand. We foresee that the duality of using AI and HPC on different systems will soon be overcome, and hybrid AI/HPC workflows will be eventually possible. Porting Codes to LUMI Georgios Markomanolis (CSC - IT Center for Science Ltd.) Abstract Abstract LUMI is a new upcoming EuroHPC pre-exascale supercomputer with a peak performance of a bit over 550 petaflop/s by HPE Cray. Many countries of LUMI consortium will have access to this system among other users. It is known that this system will be based on the next generation of AMD Instinct GPUs and this is a new environment for all of us. In this presentation, we discuss the AMD ecosystem, we present with examples the procedure to convert CUDA codes to HIP, among also how to port Fortran codes with hipfort. We discuss the utilization of other HIP libraries and we demonstrate a performance comparison between CUDA and HIP. We explore the challenges that scientists will have to handle during their application porting and also we provide step-by-step guidance. Finally, we will discuss the potential of other programming models and the workflow that we follow to port codes depending on their readiness for GPUs and the used programming language. Birds of a Feather, Paper BoF 1 Chair: Bilel Hadri (KAUST Supercomputing Lab) Update of Cray Programming Environment John Levesque (HPE) Abstract Abstract Over the past year, the Cray Programming Environment (CPE) engineers have been hard at work on numerous projects to make the compiler and tools easier to use and to interact well with the new GPU systems. This talk will cover those facets of development and will give a futures perspective to where CPE is going. We recognize that CPE is the only programming environment that gives applications developers a portable development interface across all the popular nodes and GPU options. Programming Environments, Applications, and Documentation (PEAD) Special Interest Group meeting Bilel Hadri (KAUST Supercomputing Lab) Abstract Abstract The Programming Environments, Applications and Documentation Special Interest Group (“the SIG”) has as its mission to provide a forum for exchange of information related to the usability and performance of programming environments (including compilers, libraries and tools) and scientific applications running on Cray systems. Related topics in user support and communication (e.g. documentation) are also covered by the SIG. 
HPC System Test: Building a cross-center collaboration for system testing Veronica G. Vergara Larrea (Oak Ridge National Laboratory), Bilel Hadri (King Abdullah University of Science and Technology), Reuben Budiardja (Oak Ridge National Laboratory), Vasileios Karakasis (Swiss National Supercomputing Centre), Shahzeb Siddiqui (Lawrence Berkeley National Laboratory), and George Markomanolis (CSC - IT Center for Science Ltd.) Abstract Abstract This session builds upon an effort started at CUG 2019 and continued at SC19 in which several HPC centers gathered to discuss acceptance and regression testing procedures and frameworks. From that session, we learned there are many commonalities in the procedures and tools utilized for system testing. CSCS, KAUST, and NERSC use the ReFrame framework for regression testing. While other centers, like NCSA and OLCF, have built in-house tools for acceptance testing. From the experiences shared, we see there are many benchmarks and applications that are widely run which often become part of a local test suite. These common elements are a strong indication that a tighter collaboration between centers would be beneficial. Furthermore, as systems become more complex, leveraging the HPC community to develop and maintain the growing number of tests needed to assess a system is key. Presentation, Paper Acceptance and Testing Chair: Stephen Leak (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory, Lawrence Berkeley National Laboratory) Acceptance Testing the Chicoma HPE-Cray EX Supercomputer Kody Everson (Los Alamos National Laboratory, Dakota State University Advanced Research Laboratory) and Paul Ferrell, Jennifer Green, Francine Lapid, Daniel Magee, Jordan Ogas, Calvin Seamons, and Nicholas Sly (Los Alamos National Laboratory) Abstract Abstract Since the installation of MANIAC I in 1952, Los Alamos National Laboratory (LANL) has been at the forefront of addressing global crises using state-of-the-art computational resources to accelerate scientific innovation and discovery. This generation faces a new crisis in the global COVID-19 pandemic that continues to damage economies, health, and wellbeing; LANL is supplying high-performance computing (HPC) resources to contribute to the recovery from the impacts of this virus. A Step Towards the Final Frontier: Lessons Learned from Acceptance Testing of the First HPE/Cray EX 3000 System at ORNL Veronica G. Vergara Larrea, Reuben Budiardja, Paul Peltz, Jeffery Niles, Christopher Zimmer, Daniel Dietz, Christopher Fuson, Hong Liu, Paul Newman, James Simmons, and Chris Muzyn (Oak Ridge National Laboratory) Abstract Abstract In this paper, we summarize the deployment of the Air Force Weather (AFW) HPC11 system at Oak Ridge National Laboratory (ORNL) including the process followed to successfully complete acceptance testing of the system. HPC11 is the first HPE/Cray EX 3000 system that has been successfully released to its user community in a federal facility. HPC11 consists of two identical 800-node supercomputers, Fawbush and Miller, with access to two independent and identical Lustre parallel file systems. HPC11 is equipped with Slingshot 10 interconnect technology and relies on the HPE Performance Cluster Manager (HPCM) software for system configuration. ORNL has a clearly defined acceptance testing process used to ensure that every new system deployed can provide the necessary capabilities to support user workloads. 
We worked closely with HPE and AFW to develop a set of tests for the United Kingdom’s Meteorological Office’s Unified Model (UM) and 4DVAR. We also included benchmarks and applications from the Oak Ridge Leadership Computing Facility (OLCF) portfolio to fully exercise the HPE/Cray programming environment and evaluate the functionality and performance of the system. Acceptance testing of HPC11 required parallel execution of each element on Fawbush and Miller. In addition, careful coordination was needed to ensure successful acceptance of the newly deployed Lustre file systems alongside the compute resources. In this work, we present test results from specific system components and provide an overview of the issues identified, challenges encountered, and the lessons learned along the way. Presentation, Paper Storage and I/O 1 Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) New data path solutions from HPE for HPC simulation, AI, and high performance workloads Lance Evans and Marc Roskow (HPE) Abstract Abstract HPE is extending its HPC storage portfolio to include an IBM Spectrum Scale based solution. The HPE solution will leverage HPE servers and the robustness of Spectrum scale to address the increasing demand for “enterprise” HPC systems. IBM Spectrum Scale is an enterprise-grade parallel file system that provides superior resiliency, scalability and control. IBM Spectrum Scale delivers scalable capacity and performance to handle demanding data analytics, content repositories and technical computing workloads. Storage administrators can combine flash, disk, cloud, and tape storage into a unified system with higher performance and lower cost than traditional approaches. Leveraging HPE Proliant servers, we will deliver a range of storage (NSD), protocol, and data mover servers with a granularity that addresses small AI systems to large HPC scratch spaces with exceptional cost and flexibility. Lustre and Spectrum Scale: Simplify parallel file system workflows with HPE Data Management Framework Mark Wiertalla and Kirill Malkin (HPE) and Zsolt Ferenczy (HPEHPE) Abstract Abstract Lustre and Spectrum Scale: Simplify parallel file system workflows with HPE Data Management Framework Presentation, Paper Storage and I/O 2 Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) h5bench: HDF5 I/O Kernel Suite for Exercising HPC I/O Patterns Tonglin Li (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center); Suren Byna (Lawrence Berkeley National Laboratory); Quincey Koziol (Lawrence Berkeley National Laboratory, National Center for Supercomputing Applications); and Houjun Tang, Jean Luca Bez, and Qiao Kang (Lawrence Berkeley National Laboratory) Abstract Abstract Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputing systems. With massive amounts of data being produced or consumed by compute nodes, high performant parallel I/O is essential. I/O benchmarks play an important role in this process, however, there is a scarcity of I/O benchmarks that are representative of current workloads on HPC systems. Towards creating representative I/O kernels from real world applications, we have created h5bench a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is because of the parallel I/O library's heavy usage in a wide variety of scientific applications running on supercomputing systems. 
The various dimensions of h5bench include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes) and I/O modes (synchronous and asynchronous). In this paper, we present the observed performance of h5bench executed along several of these dimensions on a Cray system: Cori at NERSC using both the DataWarp burst buffer and a Lustre file system and Summit at Oak Ridge Leadership Computing Facility (OLCF) using a SpectrumScale file system. These performance measurements are using find performance bottlenecks, identify root causes of any poor performance, and optimize I/O performance. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful not only to the CUG community but also to the broader supercomputing community. Architecture and Performance of Perlmutter's 35 PB ClusterStor E1000 All-Flash File System Glenn K. Lockwood (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center) and Alberto Chiusole, Lisa Gerhardt, Kirill Lozinskiy, David Paul, and Nicholas J. Wright (Lawrence Berkeley National Laboratory) Abstract Abstract NERSC's newest system, Perlmutter, features a 35 PB all-flash Lustre file system built on HPE Cray ClusterStor E1000. We present its architecture, early performance figures, and performance considerations unique to this architecture. We demonstrate the performance of E1000 OSSes through low-level Lustre tests that achieve over 90% of the theoretical bandwidth of the SSDs at the OST and LNet levels. We also show end-to-end performance for both traditional dimensions of I/O performance (peak bulk-synchronous bandwidth) and non-optimal workloads endemic to production computing (small, incoherent I/Os at random offsets) and compare them to NERSC's previous system, Cori, to illustrate that Perlmutter achieves the performance of a burst buffer and the resilience of a scratch file system. Finally, we discuss performance considerations unique to all-flash Lustre and present ways in which users and HPC facilities can adjust their I/O patterns and operations to make optimal use of such architectures. Presentation, Paper System Analytics and Monitoring Chair: Jim Brandt (Sandia National Laboratories) Integrating System State and Application Performance Monitoring: Network Contention Impact Jim Brandt (Sandia National Laboratories); Tom Tucker (Open Grid Computing); and Simon Hammond, Ben Schwaller, Ann Gentile, Kevin Stroup, and Jeanine Cook (Sandia National Laboratories) Abstract Abstract Discovering and attributing application performance variation in production HPC systems requires continuous concurrent information on the state of the system and applications, and of applications’ progress. Even with such information, there is a continued lack of understanding of how time-varying system conditions relate to a quantifiable impact on application performance. We have developed a unified framework to obtain and integrate, at run time, both system and application information to enable insight into application performance in the context of system conditions. The Lightweight Distributed Metric Service (LDMS) is used on several significant large-scale Cray platforms for the collection of system data and is planned for inclusion on several upcoming HPE systems. We have developed a new capability to inject application progress information into the LDMS data stream. 
The consistent handling of both system and application data eases the development of storage, performance analytics, and dashboards. We illustrate the utility of our framework by providing runtime insight into application performance in conjunction with network congestion assessments on a Cray XC40 system with a beta Programming Environment being used to prepare for the upcoming ACES Crossroads system. We describe possibilities for application to the Slingshot network. The complete system is generic and can be applied to any *-nix system; the system data can be obtained by both generic and system-specific data collection plugins (e.g., Aries vs Slingshot counters); and no application changes are required when the injection is performed by a portability abstraction layer, such as that employed by kokkos. trellis — An Analytics Framework for Understanding Slingshot Performance Madhu Srinivasan, Dipanwita Mallick, Kristyn Maschhoff, and Haripriya Ayyalasomayajula (Hewlett Packard Enterprise) Abstract Abstract The next generation HPE Cray EX and HPE Apollo supercomputers with Slingshot interconnect are breaking new ground in the collection and analysis of system performance data. The monitoring frameworks on these systems provide visibility into Slingshot's operational characteristics through advanced instrumentation and transparency into real-time network performance. There still exists, however, a wide gap between the volume of telemetry generated by Slingshot and a user's ability to assimilate and explore this data to derive critical, timely, and actionable insights about fabric health, application performance, and potential congestion scenarios. In this work, we present trellis --- an analytical framework built on top of Slingshot monitoring APIs. The goal of trellis is to provide system-administrators and researchers insight into network performance, and its impact on complex workflows that include both AI and traditional simulation workloads. We also present a visualization interface, built on trellis, that allows users to interactively explore through various levels of the network topology over specified time windows, and gain key insights into job performance and communication patterns. We demonstrate these capabilities on an internal Shasta development system and visualize Slingshot's innovative congestion-control and adaptive-routing in action. AIOps: Leveraging AI/ML for Anomaly Detection in System Management Sergey Serebryakov, Jeff Hanson, Tahir Cader, Deepak Nanjundaiah, and Joshi Subrahmanya (Hewlett-Packard Enterprise) Abstract Abstract HPC datacenters rely on set-points and dashboards for system management, which leads to thousands of false alarms. Exascale systems will deploy thousands of servers and sensors, produce millions of data points per second, and be more prone to management errors and equipment failures. HPE and the National Renewable Energy Lab (NREL) are using AI/ML to improve data center resiliency and energy efficiency. HPE has developed and deployed in NREL’s production environment (since June 2020), an end-to-end anomaly detection pipeline that operates in real-time, automatically, and at massive scale. In the paper, we will provide detailed results from several end-to-end anomaly detection workflows either already deployed at NREL, or to be deployed soon. 
We will describe the upcoming AIOps release as a technology preview with HPCM 1.5, plans for future deployment with Cray System Manager, and potential use as an Edge processor (inferencing engine) for HPE’s InfoSight analytics platform. Real-time Slingshot Monitoring in HPCM Priya K, Prasanth Kurian, and Jyothsna Deshpande (Hewlett Packard Enterprise) Abstract Abstract HPE Performance Cluster Manager (HPCM) software is used to provision, monitor, and manage HPC cluster hardware and software components. HPCM has a centralized monitoring infrastructure for persistent storage of telemetry and alerting on these metrics based on thresholds. Slingshot fabric management and monitoring is the new feature in HPCM Monitoring infrastructure. Slingshot Telemetry (SST) monitoring framework in HPCM is used for collecting and storing Slingshot fabric health and performance telemetry. Real time telemetry information gathered by SST is used for fabric health monitoring, real-time analytics, visualization, and alerting. This solution is capable of both vertical and horizontal scalability, handling huge volumes of telemetry data. Flexible and extensible model of the SST collection agent makes it easy to collect metrics at different granularities and intervals. Visualization dashboards are designed to suit different use cases, giving a complete view of fabric health. Analytic Models to Improve Quality of Service of HPC Jobs Saba Naureen, Prasanth Kurian, and Amarnath Chilumukuru (HPE) Abstract Abstract A typical High Performance Computing (HPC) cluster comprises of components such as CPU, Memory, GPU, Ethernet, Fabric, storage, racks, cooling devices and switches. A cluster usually consists of 1000’s of compute nodes interconnected using an Ethernet network for management tasks and Fabric network for data traffic. Job scheduler need to be aware of the health & availability of the cluster components in order to deliver high performance results. Since the failure of any component will adversely impact the overall performance of a job, identifying the issues or outages is very critical for ensuring the desired Quality of Service (QoS) is achieved. We showcase an analytics based model implemented as part of HPE Performance Cluster Manager (HPCM) that gathers and analyzes the telemetry data pertaining to the various cluster components like the racks, enclosures, cluster nodes, storage devices, Fabric switches, Cooling Distribution Unit (CDU), ARC (Adaptive Rack Cooling), Chassis Management Controller (CMC), fabric, power supplies & system logs. This real-time status information based on the telemetry data is utilized by the Job Schedulers to perform scheduling tasks effectively. This enables schedulers to take smart decisions and ensure that it schedules jobs only on healthy nodes, thus preventing job failures and wastage of computational resources. Our solution enables HPC job schedulers to be health-aware resulting in improving the reliability of the clusters and improve overall customer experience. Presentation, Paper Systems Support Chair: Hai Ah Nam (Lawrence Berkeley National Laboratory) Blue Waters System and Component Reliability Brett Bode, David King, Celso Mendes, and William Kramer (National Center for Supercomputing Applications/University of Illinois); Saurabh Jha (University of Illinois); and Roger Ford, Justin Davis, and Steven Dramstad (Cray Inc.) Abstract Abstract The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. 
Blue Waters includes a mix of dual-socket CPU (XE) and single-socket CPU, single GPU (XK) nodes. The primary storage is provided by Cray’s Sonexion/ClusterStor Lustre storage system delivering 35PB (raw) storage at 1TB/sec. The statistical failure rates over time for each component including CPU, DIMM, GPU, disk drive, power supply, blower, etc and their impact on higher level failure rates for individual nodes and the systems as a whole are presented in detail, with a particular emphasis on identifying any increases in rate that might indicate the right-side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented. Configuring and Managing Multiple Shasta Systems: Best Practices Developed During the Perlmutter Deployment James Botts (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Zachary Crisler (Hewlett Packard Enterprise); Aditi Gaur and Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Harold Longley, Alex Lovell-Troy, and Dave Poulsen (Hewlett Packard Enterprise); and Eric Roman and Chris Samuel (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract The perlmutter supercomputer and related test systems provide an early look at Shasta system management and our ideas on best practices for managing Shasta systems. The cloud-native software and ethernet-based networking on the system enable tremendous flexibility in management policies and methods. Based on work performed using Shasta 1.3 and previewed 1.4 releases, NERSC has developed, in close collaboration with HPE through the perlmutter System Software COE, methodologies for efficiently managing multiple Shasta systems. We describe how we template and synchronize configurations and software between systems and orchestrate manipulations of the configuration of the managed system. Key to this is a secured external management system that provides both a configuration origin for the system and an interactive management space. Leveraging this external management system we simultaneously create a systems-development environment as well as secure key aspects of the Shasta system, enabling NERSC to rapidly deploy the perlmutter system. Slurm on Shasta at NERSC: adapting to a new way of life Christopher Samuel (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center, National Energy Research Scientific Computing Center) and Douglas M. Jacobsen and Aditi Gaur (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center) Abstract Abstract Shasta, with its heady mix of kubernetes, containers, software defined networking and 1970s batch computing provides a vast array of new concepts, strategies and acronyms for traditional HPC administrators to adapt to. NERSC has been working through this maze to take advantage of the new capabilities that Shasta brings in order to provide a stable and expected interface for traditional HPC workloads on Perlmutter whilst also taking advantage of Shasta and new abilities in Slurm to provide more modern interfaces and capabilities for production use. Declarative automation of compute node lifecycle through Shasta API integration J. 
Lowell Wofford and Kevin Pelzel (Los Alamos National Laboratory) Abstract Abstract Using the Cray Shasta system available at Los Alamos National Laboratory, we have experimented with integrating with various components of the HPE Cray Shasta software stack through the provided APIs. We have integrated Kraken, a LANL open-source software project that provides distributed state-based automation, to bring new automation and management features to the Shasta system. We have focused on managing the Shasta compute node lifecycle with Kraken, automating node operations such as image, kernel, and configuration management. We examine the strengths and challenges of integrating with the Shasta APIs and discuss possibilities for further API integrations. Cray EX Shasta v1.4 System Management Overview Harold Longley (Hewlett Packard Enterprise) Abstract Abstract How do you manage a Cray EX (Shasta) system? This overview describes the Cray System Management software in the Shasta v1.4 release. This release has introduced new features such as booting management nodes from images, product streams, and configuration layers. The foundation of containerized microservices orchestrated by Kubernetes on the management nodes provides a highly available and resilient set of services to manage the compute and application nodes. Lower-level hardware control is based on the DMTF Redfish standard, enabling higher-level hardware management services to control and monitor components and manage firmware updates. The network management services enable control of the high-speed network fabric. The booting process relies upon preparation of images and configuration as well as run-time interaction between the nodes and services while nodes boot and configure. All microservices have published RESTful APIs for those who want to integrate management functions into their existing DevOps environment. The v1.4 software includes the cray CLI and the SAT (System Administration Toolkit) CLI, which are clients that use these services. Identity and access management protect critical resources, such as the API gateway. Non-administrative users access the system either through a multi-user Linux node (User Access Node) or a single-user container (User Access Instance) managed by Kubernetes. Logging and telemetry data can be sent from the system to other site infrastructure. The tools for collection, monitoring, and analysis of telemetry and log data have been improved with new alerts and notifications. Managing User Access with UAN and UAI Harold Longley, Alex Lovell-Troy, and Gregory Baker (Hewlett Packard Enterprise) Abstract Abstract User Access Nodes (UANs) and User Access Instances (UAIs) represent the primary entry point for users on a Cray EX system to develop, build, and execute their applications on the Cray EX compute nodes. The UAN is a traditional, multi-user Linux node. The UAI is a dynamically provisioned, single-user container that can be customized to the user’s needs. This presentation will describe the state of the Shasta v1.4 software for user access with UAN and UAI, provisioning software products for users, providing access to shared filesystems, granting and revoking authentication and authorization, logging of access, and monitoring of resource utilization.
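As an illustration of how a site tool might drive these published RESTful APIs through the API gateway, the following is a minimal sketch of a token-authenticated query using libcurl in C++. The gateway hostname, endpoint path, and ACCESS_TOKEN environment variable are placeholders for illustration, not the documented CSM routes; consult each microservice's published API for the actual endpoints.

    // Minimal sketch: token-authenticated query of a management microservice
    // behind the API gateway. Hostname, path, and token variable are placeholders.
    #include <curl/curl.h>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    static size_t collect(char *data, size_t size, size_t nmemb, void *userp) {
        static_cast<std::string *>(userp)->append(data, size * nmemb);  // buffer the response body
        return size * nmemb;
    }

    int main() {
        const char *token = std::getenv("ACCESS_TOKEN");  // assumed: bearer token obtained beforehand from the identity service
        if (!token) { std::cerr << "no access token set\n"; return 1; }

        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        std::string body;
        std::string auth = std::string("Authorization: Bearer ") + token;
        struct curl_slist *headers = curl_slist_append(nullptr, auth.c_str());

        // Hypothetical endpoint; substitute the published route of the service being queried.
        curl_easy_setopt(curl, CURLOPT_URL, "https://api-gateway.example/apis/hardware-state/components");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

        CURLcode rc = curl_easy_perform(curl);
        if (rc == CURLE_OK) std::cout << body << "\n";    // JSON describing component state
        else std::cerr << curl_easy_strerror(rc) << "\n";

        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
        return rc == CURLE_OK ? 0 : 1;
    }

The same pattern, authenticate once and then call the REST endpoints, is what clients such as the cray CLI, SAT, and integrations like Kraken build upon.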
User and Administrative Access Options for CSM-Based Shasta Systems Alex Lovell-Troy, Sean Lynn, and Harold Longley (Hewlett Packard Enterprise) Abstract Abstract Cray System Management (CSM) from HPE is a cloud-like control system for High Performance Computing. CSM is designed to integrate the supercomputer with multiple datacenter networks and provide secure administrative access via authenticated REST APIs. Access to the compute nodes and to the REST APIs may need to follow different network paths, which has network routing implications. This paper outlines the flexible network configurations and guides administrators planning their Shasta/CSM systems. Site administrators have configuration options for allowing users and administrators to access the REST APIs from outside the system. They also have options for allowing applications running on the compute nodes to access these same APIs. This paper is structured around three themes. The first theme defines a layer 2/layer 3 perimeter around the system and addresses upstream connections to the site network. The second theme deals primarily with layer 3 subnet routing from the network perimeter inward. The third theme deals with administrative access control at various levels of the network as well as user-based access controls to the APIs themselves. Finally, this paper combines the themes to describe specific use cases and how to support them with available administrative controls. HPE Ezmeral Container Platform: Current And Future Thomas Phelan (HPE) Abstract Abstract The HPE Ezmeral Container Platform is the industry's first enterprise-grade container platform for both cloud-native and distributed non-cloud-native applications using the open-source Kubernetes container orchestrator. Ezmeral enables true hybrid cloud operations across any location: on-premises, public cloud, and edge. Today, the HPE Ezmeral Container Platform is largely used for enterprise AI/ML/DL applications. However, the industry is starting to see a convergence of AI/ML/DL and High Performance Computing (HPC) workloads. This session will present an overview of the HPE Ezmeral Container Platform: its architecture, features, and use cases. It will also provide a look into the future product roadmap, where the platform will support HPC workloads as well. Presentation, Paper Applications and Performance (ARM) Chair: Simon McIntosh-Smith (University of Bristol) An Evaluation of the A64FX Architecture for HPC Applications Andrei Poenaru and Tom Deakin (University of Bristol, GW4); Simon McIntosh-Smith (University of Bristol); and Si Hammond and Andrew Younge (Sandia National Laboratories) Abstract Abstract In this paper, we present some of the first in-depth, rigorous, independent benchmark results for the A64FX, the processor at the heart of Fugaku, the current #1 supercomputer in the world, and now available in Apollo 80 guise. The Isambard and Astra research teams have combined to perform this study, using a combination of mini-apps and application benchmarks to evaluate A64FX's performance for both compute- and bandwidth-bound scenarios. The study uniquely had access to all four major compilers for A64FX: Cray, Arm, GNU, and Fujitsu. The results showed that the A64FX is extremely competitive, matching or exceeding contemporary dual-socket x86 servers. We also report tuning and optimisation techniques which proved essential for achieving good performance on this new architecture.
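Bandwidth-bound scenarios of the kind evaluated above usually come down to simple streaming kernels. The sketch below is a generic triad-style bandwidth loop, not code from the Isambard/Astra study; the array size and build line are illustrative, and SVE-specific flags differ between the Cray, Arm, GNU, and Fujitsu toolchains.

    // Generic triad-style memory bandwidth kernel (illustrative sketch only).
    // Example build (flags vary by compiler): g++ -O3 -fopenmp triad.cpp -o triad
    #include <chrono>
    #include <cstdio>

    int main() {
        const long n = 1L << 25;                 // 32M doubles per array (~256 MiB each)
        double *a = new double[n];
        double *b = new double[n];
        double *c = new double[n];

        // Initialize in parallel so pages are first touched by the threads that use them.
        #pragma omp parallel for
        for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        const double scalar = 3.0;
        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for simd
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + scalar * c[i];         // triad: two loads and one store per element
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        double gbytes = 3.0 * sizeof(double) * n / 1e9;
        std::printf("triad bandwidth: %.1f GB/s\n", gbytes / secs);

        delete[] a; delete[] b; delete[] c;
        return 0;
    }

Comparing the sustained figure from such a loop against the processor's peak memory bandwidth gives a quick indication of whether a given compiler is vectorising the loop and exploiting the memory system effectively.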
Vectorising and distributing NTTs to count Goldbach partitions on Arm-based supercomputers Ricardo Jesus (EPCC, The University of Edinburgh); Tomás Oliveira e Silva (IEETA/DETI, Universidade de Aveiro); and Michèle Weiland (EPCC, The University of Edinburgh) Abstract Abstract In this paper, we explore the use of SVE to vectorise number-theoretic transforms (NTTs). In particular, we show that 64-bit modular arithmetic operations, including modular multiplication, can be efficiently implemented with SVE instructions. The vectorisation of NTT loops and kernels involving 64-bit modular operations was not possible in previous Arm-based SIMD architectures, since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on the A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. We evaluate this transform on HPE Apollo 70, Cray XC50, and HPE Apollo 80 systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilised to count the number of Goldbach partitions of all even numbers to large limits. We present some preliminary results concerning this problem, in particular a histogram of the number of Goldbach partitions of the even numbers up to 2^40. Optimizing a 3D multi-physics continuum mechanics code for the HPE Apollo 80 System Vince Graziano (New Mexico Consortium, Los Alamos National Laboratory) and David Nystrom, Howard Pritchard, Brandon Smith, and Brian Gravelle (Los Alamos National Laboratory) Abstract Abstract We present results of a performance evaluation of a LANL 3D multi-physics continuum mechanics code, Pagosa, on an HPE Apollo 80 system. The Apollo 80 features the Fujitsu A64FX Arm processor with Scalable Vector Extension (SVE) support and high-bandwidth memory. This combination of SIMD vector units and high memory bandwidth offers the promise of realizing a significant fraction of the theoretical peak performance for applications like Pagosa. In this paper we present performance results of the code using the GNU, Arm, and CCE compilers, analyze these compilers’ ability to vectorize performance-critical loops when targeting the SVE instruction set, and describe code modifications to improve the performance of the application on the A64FX processor. Presentation, Paper Applications and Performance Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory, Lawrence Berkeley National Laboratory) Optimizing the Cray Graph Engine for Performant Analytics on Cluster, SuperDome Flex, Shasta Systems and Cloud Deployment Christopher Rickett, Kristyn Maschhoff, and Sreenivas Sukumar (Hewlett Packard Enterprise) Abstract Abstract We present updates to the Cray Graph Engine, a high-performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems.
The porting effort focused primarily on removing dependencies on XPMEM and Cray PGAS and replacing them with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance that matches native execution. We present early benchmarking results for running CGE on the SDF, InfiniBand clusters, and Slingshot interconnect-based Shasta systems. Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis Best Paper Johannes P. Blaschke (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Aaron S. Brewster, Daniel W. Paley, Derek Mendez, Asmit Bhowmick, and Nicholas K. Sauter (Lawrence Berkeley National Laboratory/Physical Biosciences Division); Wilko Kröger and Murali Shankar (SLAC National Accelerator Laboratory); and Bjoern Enders and Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSC’s Cori XC40 system, using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using the software package CCTBX. We achieved real-time data analysis, with a turnaround time from data acquisition to full molecular reconstruction of as little as 10 minutes, sufficient time for the experiment’s operators to make informed decisions. By hosting the data analysis on Cori, and by automating LCLS-NERSC interoperability, we achieved a data analysis rate which matches the data acquisition rate. Completing data analysis within 10 minutes is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends. Early Experiences Evaluating the HPE/Cray Ecosystem for AMD GPUs Veronica G. Vergara Larrea, Reuben Budiardja, and Wayne Joubert (Oak Ridge National Laboratory) Abstract Abstract Since deploying the Titan supercomputer in 2012, the Oak Ridge Leadership Computing Facility (OLCF) has continued to support and promote GPU-accelerated computing among its user community. Summit, the flagship system at the OLCF and currently number 2 on the most recent TOP500 list, has a theoretical peak performance of approximately 200 petaflops. Because the majority of Summit’s computational power comes from its 27,972 GPUs, users must port their applications to one of the supported programming models in order to make efficient use of the system. Looking ahead to Frontier, the OLCF’s exascale supercomputer, users will need to adapt to an entirely new ecosystem which will include new hardware and software technologies. First, users will need to familiarize themselves with the AMD Radeon GPU architecture.
Furthermore, users who have previously relied on CUDA will need to transition to the Heterogeneous-Computing Interface for Portability (HIP) or one of the other supported programming models (e.g., OpenMP, OpenACC). In this work, we describe our initial experiences in porting three applications or proxy apps currently running on Summit to the HPE/Cray ecosystem to leverage the compute power of AMD GPUs: minisweep, GenASiS, and Sparkler. Each one is representative of current production workloads utilized at the OLCF, different programming languages, and different programming models. We also share lessons learned from challenges encountered during the porting process and provide preliminary results from our evaluation of the HPE/Cray Programming Environment and the AMD software stack using these key OLCF applications. Convergence of AI and HPC at HLRS. Our Roadmap. Dennis Hoppe (High Performance Computing Center Stuttgart) Abstract Abstract The growth of artificial intelligence (AI) is accelerating. AI has left research and innovation labs and nowadays plays a significant role in everyday life. The impact on society is tangible: self-driving cars produced by Tesla, voice assistants such as Siri, and AI systems that beat renowned champions in board games like Go. All these advancements are facilitated by powerful computing infrastructures based on HPC and advanced AI-specific hardware, as well as highly optimized AI codes. For several years, HLRS has been engaged in big data and AI-specific activities around HPC. The road towards AI at HLRS began several years ago with the installation of a Cray Urika-GX for processing large volumes of data. But due to the isolated platform and a usage concept unfamiliar to HPC users, uptake of this system was lower than expected. This drastically changed recently with the installation of a CS-Storm equipped with powerful GPUs. Since then, we have also been extending our HPC system with GPUs due to high customer demand. We foresee that the duality of using AI and HPC on different systems will soon be overcome, and hybrid AI/HPC workflows will eventually be possible. Porting Codes to LUMI Georgios Markomanolis (CSC - IT Center for Science Ltd.) Abstract Abstract LUMI is an upcoming EuroHPC pre-exascale supercomputer built by HPE Cray, with a peak performance of slightly over 550 petaflop/s. Users from the many countries of the LUMI consortium, among others, will have access to this system. The system will be based on the next generation of AMD Instinct GPUs, which is a new environment for all of us. In this presentation, we discuss the AMD ecosystem and present, with examples, the procedure for converting CUDA codes to HIP, as well as how to port Fortran codes with hipfort. We discuss the use of other HIP libraries and demonstrate a performance comparison between CUDA and HIP. We explore the challenges that scientists will face during application porting and provide step-by-step guidance. Finally, we discuss the potential of other programming models and the workflow that we follow to port codes depending on their readiness for GPUs and the programming language used.
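To make the CUDA-to-HIP path described in the Frontier and LUMI abstracts concrete, here is a minimal, hedged sketch of a HIP vector addition; the kernel and variable names are illustrative. A CUDA version of the same code differs essentially in the cuda*/hip* prefixes, which is the substitution that hipify-perl automates.

    // Minimal HIP vector addition (illustrative sketch of the CUDA-to-HIP porting path).
    // Build with: hipcc vadd.cpp -o vadd
    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void vadd(const double *x, const double *y, double *z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<double> hx(n, 1.0), hy(n, 2.0), hz(n, 0.0);
        const size_t bytes = n * sizeof(double);

        double *dx, *dy, *dz;
        hipMalloc(reinterpret_cast<void **>(&dx), bytes);   // cudaMalloc in the CUDA original
        hipMalloc(reinterpret_cast<void **>(&dy), bytes);
        hipMalloc(reinterpret_cast<void **>(&dz), bytes);
        hipMemcpy(dx, hx.data(), bytes, hipMemcpyHostToDevice);
        hipMemcpy(dy, hy.data(), bytes, hipMemcpyHostToDevice);

        const int block = 256;
        const int grid = (n + block - 1) / block;
        hipLaunchKernelGGL(vadd, dim3(grid), dim3(block), 0, 0, dx, dy, dz, n);
        hipDeviceSynchronize();

        hipMemcpy(hz.data(), dz, bytes, hipMemcpyDeviceToHost);
        std::printf("hz[0] = %.1f (expected 3.0)\n", hz[0]);

        hipFree(dx); hipFree(dy); hipFree(dz);
        return 0;
    }

Fortran codes typically follow the same route through the hipfort interface, with the kernel itself kept in HIP C++ and called from Fortran via C bindings.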