Sunday, May 7th

1:30pm-2:45pm | Programming Environments, Applications, and Documentation (PEAD)

PEAD Introduction
Christopher Fuson (Oak Ridge National Laboratory)
Abstract: Welcome and agenda overview.

HPE Documentation
Peggy Sanchez and Barbara Chapman (HPE)
Abstract: Documentation is a powerful tool used by CUG member sites and HPE to help users make effective use of center HPC resources. Available 24x7, documentation allows centers to reach a global user community and to cover a wide range of technical information with varying levels of detail and focus. During this BOF, HPE representatives will walk through existing documentation portals and discuss new and upcoming features. The goal of the BOF is to provide an opportunity for CUG member sites and HPE to discuss existing and future documentation needs.

Training
Christopher Fuson (Oak Ridge National Laboratory), Eva Siegmann (Stony Brook), Marco De La Pierre (Pawsey), and Barbara Chapman (HPE)
Abstract: HPC resources are large and complex. To use a center's HPC resources effectively, users must understand hardware configurations, data storage options, available scientific software, programming environments, batch schedulers, and everything in between. Centers and HPE invest significant effort in developing, organizing, and presenting training opportunities for the user community. In this BOF, representatives from multiple centers will discuss their center's training efforts. HPE will present its available training offerings and discuss future training needs with those in attendance.

Birds of a Feather

3:00pm-5:00pm | Programming Environments, Applications, and Documentation (PEAD)

User Module Environment
Kaylie Anderson (HPE), John Holmen (Oak Ridge National Laboratory), and Pascal Elahi (Pawsey)
Abstract: A center's HPC community is often composed of user groups with very diverse needs and goals. A resource's user environment must support multiple workflows, each with varying compiler, library, and tool requirements. User modules provide a mechanism to support an array of workflows with varied needs. During this BOF, HPE will discuss the CPE module environment, including new and upcoming Lmod features. Center representatives will also discuss use cases and methods used to augment the provided environment.

PE Updates and Testing
Jeff Hudson (HPE); Abhinav Thota (Indiana University); Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre); Juan Herrera (EPCC, The University of Edinburgh); and Koutsaniti Eirini (Swiss National Supercomputing Centre)
Abstract: HPC programming environments can be very complex, containing libraries, compilers, and tools that must work together to provide an effective resource to a center's user community. Over a resource's lifespan, upgrades can impact not only an individual component but also the ability of multiple components to work together successfully. Testing at various stages of a resource's lifespan is crucial to ensure the numerous hardware and software components are in working order. The goal of this BOF is to provide a venue for CUG member sites to share techniques, best practices, and lessons learned for resource testing. During the BOF, HPE representatives will discuss the environment, process, and tools used to test the CPE. Center representatives will discuss testing, including the use of the ReFrame framework for regression testing.
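For readers unfamiliar with ReFrame, the regression-testing framework mentioned above, the following minimal sketch shows the general shape of a check. The system and environment wildcards, the mpi_hello.c source file, and the expected output pattern are illustrative assumptions, not material from the session.

```python
# Minimal ReFrame check (sketch only; assumes a recent ReFrame release).
# The source file, output pattern, and task count below are hypothetical.
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class MpiHelloCheck(rfm.RegressionTest):
    descr = 'Smoke test: an MPI program builds and runs under the current PE'
    valid_systems = ['*']           # any configured system partition
    valid_prog_environs = ['*']     # any configured programming environment
    build_system = 'SingleSource'
    sourcepath = 'mpi_hello.c'      # hypothetical source shipped with the test
    num_tasks = 4

    @sanity_function
    def assert_all_ranks_reported(self):
        # Expect one "Hello from rank N" line per MPI rank in stdout.
        ranks = sn.extractall(r'Hello from rank (\d+)', self.stdout, 1, int)
        return sn.assert_eq(sn.count(ranks), self.num_tasks)
```

Running something like `reframe -c <checks_dir> -r` would then compile and launch the check on each selected system and programming-environment combination.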
Future Directions for Fortran
Barbara Chapman, John Levesque, and Bill Long (HPE)
Abstract: While there is significant change in the architectures of HPC platforms and in the applications deployed on them, Fortran has remained a key programming language for scientific and technical application development in the exascale era. HPE's Cray Programming Environment (CPE) continues to vigorously support the use of Fortran in HPC. CPE maintains its own Fortran compiler, which is continuously evolving to support the latest versions of the Fortran standard as well as both OpenMP and OpenACC directives. While Fortran has many inherent benefits for numerically intensive computations, it does not enjoy the support of a large developer community such as the one behind C++. Moreover, Fortran is seldom taught in Computer Science departments, and many application developers are not aware of its benefits, including its more recent features. Given that Fortran code dominates many HPC center workloads, what should we be doing as a community to maintain its relevance and ensure that it will meet future HPC needs? What can HPE do to encourage such efforts? The goal of this BOF is to provide an open forum for CUG members to share their thoughts on the future of Fortran in HPC and to discuss these questions.

Birds of a Feather

Monday, May 8th

8:30am-10:00am | Tutorial 1A

Using the HPE Cray Programming Environment (HCPE) to Port and Optimize Applications to hybrid systems with GPUs using OpenMP Offload or OpenACC
Harvey Richardson, John Levesque, Nina Mujkanovic, and Alfio Lazzaro (HPE)
Abstract: This tutorial will consist of instruction, demonstration, and hands-on use of the HPE Cray Programming Environment to port applications to hybrid systems with GPUs using OpenMP Offload and/or OpenACC directives. This will be a full-day tutorial, with lectures and exercises given throughout the day. Attendees will learn about the compiler, performance analysis tools, and debuggers targeting GPU usage. Later in the day we will have a video call with several of the OpenMP, OpenACC, and MPI developers so attendees can ask questions that come up throughout the day. The entire day will consist of attendees either using the HPE Cray Programming Environment on their own applications and/or working on assignments we will have available. Access to several systems will be made available to the attendees. The best way to learn the compiler and tool capabilities is to use them on your own application, so the hands-on session is extremely important. Both AMD and NVIDIA GPU systems will be available. The tutorial will cover the process of taking an all-MPI application, first identifying the computational bottlenecks using performance analysis tools, and then incrementally adding GPU directives to move the application to the GPU. Performance analysis tools will then be used to analyze the performance of the application running on the GPU to identify data motion, computational bottlenecks, and other issues.
Tutorial

Tutorial 1B

Advanced Topics for Cray System Management for HPE Cray EX Systems
Harold Longley (Hewlett Packard Enterprise)
Abstract: This tutorial session discusses several management topics for Cray System Management (CSM) and related software on the HPE Cray EX Supercomputer. This includes configuration, system monitoring, health validation, compute node environment tuning, boot troubleshooting, and extending the system management REST API toolset.

Tutorial

Tutorial 1C

Supercomputer Affinity on HPE Systems
Edgar A. Leon and Jane E. Herriman (Lawrence Livermore National Laboratory)
Abstract: HPE's Frontier supercomputer at Oak Ridge National Laboratory is the world's first exascale machine. To provide such computing power, HPE and its partners rely on a complex heterogeneous architecture with four NUMA domains, 64 SMT-2 CPU cores, and 8 GPUs per node. This complexity can create unnecessary data movement that negatively affects performance, scalability, and power. The key to minimizing data movement is to leverage hardware locality: minimize data movement between hardware components by changing an application's affinity policies, that is, the rules that assign processes, threads, and GPU kernels to the hardware.

Tutorial

10:30am-12:00pm | Tutorial 1A Continued

Using the HPE Cray Programming Environment (HCPE) to Port and Optimize Applications to hybrid systems with GPUs using OpenMP Offload or OpenACC
Harvey Richardson, John Levesque, Nina Mujkanovic, and Alfio Lazzaro (HPE)
Abstract: See Tutorial 1A above.

Tutorial

Tutorial 1B Continued

Advanced Topics for Cray System Management for HPE Cray EX Systems
Harold Longley (Hewlett Packard Enterprise)
Abstract: See Tutorial 1B above.
Tutorial

Tutorial 1C Continued

Supercomputer Affinity on HPE Systems
Edgar A. Leon and Jane E. Herriman (Lawrence Livermore National Laboratory)
Abstract: See Tutorial 1C above.

Tutorial

1:00pm-2:30pm | Tutorial 2A

System Monitoring with CSM and HPCM
Jeff Hanson (HPE)
Abstract: CSM and HPCM provide extensive system monitoring features. This tutorial will:
- Review the architecture and features of each stack
- Review optional features and how to enable them
- Present methods for extracting monitoring telemetry to customer data repositories
- Present use cases and methods for analyzing Slingshot telemetry at the node, switch, and fabric levels
- Present AIOps technology and how it is used and managed
- Present methods for scaling the monitoring infrastructure
- Present methods for managing the monitoring infrastructure

Tutorial

Tutorial 2B

Analyzing the Slingshot Fabric with the Slingshot Dashboard
Nilakantan Mahadevan, Forest Godfrey, and Jose Mendes (Hewlett Packard Enterprise)
Abstract: Understanding the performance and correctness of the Slingshot fabric is a complex task. To aid customers in this task, HPE is providing the Slingshot Fabric Dashboard. The dashboard can help in understanding fabric load and link failure rates, as well as give an overview of current fabric status. This tutorial will cover configuration and use of the dashboard to understand real-world issues on Slingshot fabrics. Sample data (either synthetically generated or taken from internal test systems) will be used to illustrate real-world problems from large systems such as Frontier and LUMI. The tutorial will include hands-on use of the dashboard.

Tutorial

Tutorial 2C

Omnitools: Performance Analysis Tools for AMD GPUs
George Markomanolis and Samuel Antao (AMD)
Abstract: The top entries of the TOP500 list feature systems enabled with AMD Instinct GPUs, including the world's and Europe's fastest supercomputers, Frontier and LUMI, respectively. As these systems enter production, application teams will require the ability to profile applications to ascertain performance. To enable this, AMD released two new profiling tools in 2022: Omnitrace and Omniperf. These tools are the result of close collaborations between AMD development teams and computational scientists aimed at unpicking performance bottlenecks in applications and identifying improvement strategies. Omnitrace targets end-to-end application performance, generating timelines that cover MPI, OpenMP, Kokkos, Python, and more. It enables the developer to identify relevant hardware counters to collect and to generate information on performance-limiting kernels.
Omniperf can then be used to seek further insight into these kernels through roofline analysis, memory chart analysis, and read-outs of many metrics including cache access, GPU utilization, and speed-of-light analysis. In this tutorial we will present advanced features of these tools, with live demonstrations, and provide numerous hands-on examples for attendees to identify and mitigate bottlenecks in scientific and machine learning applications running on AMD GPUs.

Tutorial

3:00pm-4:30pm | Tutorial 2A Continued

System Monitoring with CSM and HPCM
Jeff Hanson (HPE)
Abstract: See Tutorial 2A above.

Tutorial

Tutorial 2B Continued

Analyzing the Slingshot Fabric with the Slingshot Dashboard
Nilakantan Mahadevan, Forest Godfrey, and Jose Mendes (Hewlett Packard Enterprise)
Abstract: See Tutorial 2B above.

Tutorial

Tutorial 2C Continued

Omnitools: Performance Analysis Tools for AMD GPUs
George Markomanolis and Samuel Antao (AMD)
Abstract: See Tutorial 2C above.
Tutorial

4:35pm-6:00pm | BoF 1A

Systems Monitoring Working Group BOF
Craig West (Australian Bureau of Meteorology); Lena Lopatina (Los Alamos National Laboratory); and Stephen Leak (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Abstract: The System Monitoring Working Group (SMWG) is a CUG SIG (special interest group) that enables collaboration between HPE Cray and its customers on system monitoring capabilities. The working group includes representatives from many HPE Cray member sites. We meet to discuss and collaborate on any issues related to system monitoring.

Birds of a Feather

BoF 1B

Extending Software Tools for CSM
Harold Longley and Ryan Haasken (Hewlett Packard Enterprise), Doug Jacobsen and Brian Friesen (NERSC), Alden Stradling and Graham van Heule (Los Alamos National Laboratory), and Miguel Gila and Manuel Sopena Ballesteros (Swiss National Supercomputing Centre)
Abstract: This BOF encourages discussion among those who have extended the software toolset for managing HPE Cray EX systems with Cray System Management (CSM) or who want to use the extensions developed by other customers. The system management environment with CSM software on HPE Cray EX systems was designed to be extensible. An API gateway with integrated authentication and authorization is used to contact the containerized microservices and included open-source software.

Birds of a Feather

BoF 1C

HPC System Test: Challenges and Lessons Learned Deploying Bleeding-Edge Network Technologies
Veronica G. Melesse Vergara (Oak Ridge National Laboratory) and Bilel Hadri (King Abdullah University of Science and Technology)
Abstract: In the last couple of years, CUG sites across the world have begun deploying and testing the latest generation of HPE Cray EX systems, including Perlmutter at NERSC (USA), LUMI at CSC (Finland), Frontier at ORNL (USA), and Setonix at Pawsey (Australia), among others. In this birds-of-a-feather session, we aim to gather center and vendor staff from across the globe to describe the deployment processes used, discuss challenges encountered, and share lessons learned during the deployment of HPE's Slingshot 11 technology at different scales. The session will first feature speakers from the HPE Engineering and ORNL HPC Scalable Systems teams, followed by an interactive discussion encouraging participants to share their own experiences. First, Forest Godfrey (HPE) will discuss the Slingshot testing and troubleshooting methodologies developed by HPE. Then, Matt Ezell (ORNL) will provide a summary of the deployment and configuration from a center perspective. The moderated discussion will invite staff from Indiana University, NERSC, LLNL, CSC, and Pawsey to share their experiences, challenges, and tools developed. The session will focus on three primary goals: (1) identify common challenges across sites, (2) gather information on tests and tools that could be leveraged by the community, and (3) define and share best practices that can be used to validate the functionality, performance, and stability of Slingshot 11 based systems. At the end of the session, we will develop a technical report summarizing recommendations, workarounds implemented, gaps identified, and shareable tests that have proven to be helpful in detecting Slingshot 11 issues.
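As a concrete illustration of the kind of shareable test discussed in this BOF, the sketch below measures point-to-point bandwidth between rank pairs with mpi4py. It is an assumed example for this program, not one of the tools the speakers will present; the message size, repetition count, and one-rank-per-node launch convention are arbitrary choices.

```python
# Illustrative pairwise bandwidth probe (mpi4py + NumPy); not an official test.
# Launch with one rank per node, e.g.: srun -N 8 --ntasks-per-node=1 python pair_bw.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
assert size % 2 == 0, 'run with an even number of ranks'

NBYTES = 1 << 24   # 16 MiB message (arbitrary)
REPS = 20
half = size // 2
partner = rank + half if rank < half else rank - half
buf = np.ones(NBYTES, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    if rank < half:
        comm.Send(buf, dest=partner, tag=0)
        comm.Recv(buf, source=partner, tag=1)
    else:
        comm.Recv(buf, source=partner, tag=0)
        comm.Send(buf, dest=partner, tag=1)
elapsed = MPI.Wtime() - t0

# Each iteration moves NBYTES in each direction between one pair of nodes.
if rank < half:
    gb_s = 2 * REPS * NBYTES / elapsed / 1e9
    print(f'pair ({rank},{partner}): {gb_s:.2f} GB/s')
```

A pair reporting a markedly lower figure than its peers is a hint to inspect the links serving those two nodes.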
Birds of a Feather

Tuesday, May 9th

8:30am-10:00am | Plenary: Welcome, Keynote

CUG Welcome
Ashley Barker (Oak Ridge National Laboratory)
Abstract: Welcome from the CUG President.

Keynote: Daring to think of the impossible: The story of Vlasiator
Minna Palmroth (University of Helsinki, Finnish Meteorological Institute)
Abstract: Vlasiator is the world's first global Eulerian hybrid-Vlasov simulation code, going beyond magnetohydrodynamics in the solar wind-magnetosphere-ionosphere system. This presentation tells the story of Vlasiator. An important enabler of Vlasiator has been the rapid increase of computational resources over the last decade, but equally important have been the open-minded, courageous forerunners who have embraced this new opportunity, both as developers and as co-authors of our papers. Typically, when starting a new coding project, people think about the presently available resources. But when development continues for multiple years, the resources change. If one instead targets upcoming resources, one is always in possession of a code that does not contain large legacy parts unable to utilize the latest resources. It will be interesting to see how many modelling groups will take the opportunity to benefit from current high-performance computing trends, and where we will be in the next 10 years.

AMD Together We Advance - A Look Back and a Look Forward
David Cownie (AMD)
Abstract: We'll take a look back at the progress made since CUG19 in Montreal and review the current state of AMD. We'll also explore the exciting future of AMD and how we can advance together.

Multi-dimensional HPC for Breakthrough Results
Branden Bauer (Altair Engineering, Inc.)
Abstract: Today's hyper-competitive HPC landscape demands more than simple scheduling for CPUs. We'll show you how to optimize across software licenses and storage requirements, accelerate with GPUs, burst to the cloud, submit jobs from anywhere, and more.

Plenary

10:30am-12:00pm | Plenary: Sponsor talk, Best Paper, New sites

CSC Finland: 52 years of leadership in European HPC
Kimmo Koski (CSC - IT Center for Science Ltd.)
Abstract: Kimmo Koski, CSC's President & CEO, will give a presentation entitled "CSC Finland: 52 years of leadership in European HPC".

CUG Election Candidate Statement
Ashley Barker (Oak Ridge National Laboratory)
Abstract: CUG Election Candidate Statement.

Balancing Workloads in More Ways than One
Veronica G. Melesse Vergara, Paul Peltz, Nick Hagerty, Christopher Zimmer, Reuben Budiardja, Dan Dietz, Thomas Papatheodore, Christopher Coffman, and Benton Sparks (Oak Ridge National Laboratory)
Abstract: The newest system deployed by Oak Ridge National Laboratory (ORNL) as part of the National Climate-Computing Research Center (NCRC) strategic partnership between the U.S. Department of Energy and the National Oceanic and Atmospheric Administration (NOAA), named C5, is an HPE Cray EX 3000 supercomputer with 1,792 nodes interconnected with HPE's Slingshot 10 technology. Each node comprises two 64-core AMD EPYC 7H12 processors and 256 GB of DRAM. In this paper, we describe the process ORNL used to deploy C5 and discuss the challenges we encountered during execution of the acceptance test plan.
These challenges included balancing: (1) production workloads running in parallel on the Gaea collection of systems; (2) the mixture and distribution of tests executed simultaneously on C5 against f2, the shared Lustre parallel file system; (3) the compute and file system resources available; and (4) schedule and resource constraints. Part of the work done to overcome these challenges included expanding monitoring capabilities in the OLCF Test Harness, which are described here. Finally, we present results from NOAA benchmarks and OLCF applications used in this study that could be useful to other centers deploying similar systems.

Komondor, the Hungarian breed
Zoltan Kiss (KIFÜ)
Abstract: Security is one of the main concerns of running a massively parallel multiuser environment. Given the high costs of obtaining and running such a system, novel methods need to be introduced to protect it from threats coming from outside or inside, while keeping the system robust and user friendly. KIFÜ just set up its Komondor Cray EX system this year, and since it also operates the Academic Identification system in Hungary, we are investigating ways to improve HPC security connected to federated IDs by utilizing open-source toolsets. We will quickly introduce our open-source tools that offer firewall functionality, OTP-based multifactor authentication, and a fully secured HPC Portal integration with Jupyter support.

High Performance Remote Linux Desktops with ThinLinc
Pierre Ossman (Cendio AB)
Abstract: ThinLinc is a scalable remote desktop solution for Linux, built with open-source components. It enables access to graphical Linux applications and full desktop environments. CUG member sites have used ThinLinc to provide users with access to applications like MATLAB or VMD, as well as a "Remote Research Desktop". This environment allows users to run their entire workflow, from data retrieval and preparation to job submission and post-processing. A "Remote Research Desktop" also enables access to interactive applications and to jobs running in a batch system. Cendio, the company behind ThinLinc, is a strong supporter of open-source projects and the main contributor to projects like TigerVNC, noVNC, and others.

Plenary

1:00pm-2:30pm | Technical Session 1A (Pekka Manninen)

CPE Update
Barbara Chapman (HPE)
Abstract: The HPE Cray Programming Environment (CPE) provides a suite of integrated programming tools that facilitate application development on a diverse range of HPC systems delivered by HPE. It consists of an integrated set of compilers, math libraries, communications libraries, debuggers, and performance tools that enable the creation, evolution, and adaptation of portable application codes written using mainstream programming languages and the most widely used parallel programming models. Its components are optimized to provide scalable performance on a variety of hardware configurations.

Deploying Alternative User Environments on Alps
Jonathan Coles, Benjamin Donald Cumming, Theofilos-Ioannis Manitaras, Jean-Guillaume Piccinali, and Simon Pintarelli (CSCS) and Harmen Stoppels (Stoppels Consulting)
Abstract: We describe a method for defining, building, and deploying alternative programming environments alongside the CPE on the HPE Cray EX Alps infrastructure at CSCS.
This addresses an important strategic need at CSCS to deliver tailored environments within our versatile cluster (vCluster) configuration. We provide compact, testable, optimized software environments that can be updated independently of the CPE release cycle. The environments are defined with a descriptive YAML recipe, which is processed by a novel configuration tool that builds the software stack using Spack and generates a SquashFS image. Cray-MPICH is provided through a custom Spack package without the need for a CPE installation. We describe the command-line tools and Slurm plugin that facilitate loading environments per user and per job. Through a series of benchmarks we demonstrate application and micro-benchmark performance that matches the CPE.

Automating Software Stack Deployment on an HPE Cray EX Supercomputer
Pascal Jahan Elahi, Cristian Di Pietrantonio, Marco De La Pierre, and Deva Kumar Deeptimahanti (Pawsey Supercomputing Research Centre)
Abstract: The complexity and diversity of scientific software, in conjunction with a desire for reproducibility, led to the development of package managers such as Spack and EasyBuild, with the purpose of compiling and installing optimised software on supercomputers. In this paper, we present how Pawsey leverages such tools to deploy its system-wide software stack. Two aspects of Pawsey software stack deployment are discussed: the first comprises organisation, accessibility, interoperability with the HPE Cray EX environment, and the choice of technologies such as containers, derived from a set of policies and requirements; the second is the (almost) automated, self-contained deployment process using Spack and Bash scripts. This process clones a specific version of Spack, configures it, runs it to build the software stack using environments, deploys Singularity Registry HPC to set up the desired containers-as-modules, and then generates bespoke module files. The deployment is tested using the ReFrame framework. Meeting the requirements of our user base necessitated patching Spack, writing new Spack recipes, and patching existing recipes and/or software source code to build properly within the Cray Programming Environment. The whole Spack configuration at Pawsey is made publicly accessible on GitHub for the benefit of the broader HPC community.

Paper, Presentation

Technical Session 1B (Lena M Lopatina)

Building Efficient AI Pipelines with Self-Learning Data Foundation for AI
Annmary Justine, Aalap Tripathy, Revathy Venkataramanan, Sergey Serebryakov, Martin Foltin, Cong Xu, Suparna Bhattacharya, and Paolo Faraboschi (Hewlett Packard Enterprise)
Abstract: Development of trustworthy AI models often requires significant effort, resources, and energy. Available tools focus on optimization of individual AI pipeline stages but lack end-to-end optimization and reuse of historical experience from similar pipelines. This leads to excessive resource consumption from running unnecessary AI experiments with poor data or parameters. Hewlett Packard Labs is developing a novel Self-Learning Data Foundation for AI infrastructure that captures and learns from AI pipeline metadata to optimize the pipelines.
We show examples of how the Data Foundation is helping AI practitioners: i) enable reproducibility, audit trails, and incremental model development across distributed sites spanning edge, high-performance computing, and cloud (e.g., in particle trajectory reconstruction, autonomous microscopy computational steering, etc.); ii) reduce the number of AI model training experiments by initializing AutoML training runs based on historical experience; and iii) track resource consumption and carbon footprint through different stages of the AI lifecycle, enabling energy-aware pipeline optimizations. The visibility of pipeline metadata beyond training to inference and retraining provides insights about end-to-end tradeoffs between runtime, accuracy, and energy efficiency. The Data Foundation is built on the open-source Common Metadata Framework, which can be integrated with third-party workflow management, experiment tracking, data versioning, and storage back ends.

ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales
Xingfu Wu, Prasanna Balaprakash, Michael Kruse, Jaehoon Koo, Brice Videau, Paul Hovland, and Valerie Taylor (Argonne National Laboratory); Brad Geltz and Siddhartha Jana (Intel Corporation); and Mary Hall (University of Utah)
Abstract: As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime and power/energy for energy-efficient application execution; we then use this framework to autotune four ECP proxy applications: XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with a Random Forest surrogate model to effectively search parameter spaces with up to 6 million different configurations on two large-scale production systems, Theta at Argonne National Laboratory and Summit at Oak Ridge National Laboratory. The experimental results show that our autotuning framework has low overhead at large scales and achieves good scalability. Using the proposed autotuning framework to identify the best configurations, we achieve up to 91.59% performance improvement, up to 21.2% energy savings, and up to 37.84% EDP improvement on up to 4,096 nodes.

Benchmarking High-End ARM Systems with Scientific Applications: Performance and Energy Efficiency
Nikolay Simakov, Robert DeLeon, Joseph White, Mathew Jones, and Thomas Furlani (Center for Computational Research, SUNY University at Buffalo) and Eva Siegmann and Robert Harrison (Institute for Advanced Computational Science, Stony Brook University)
Abstract: Motivated by our positive experience with the Ookami HPE/Cray Apollo-80 system (the first open Fujitsu A64FX system in the USA), we report benchmark results on modern ARM processors (Amazon Graviton 2/3, Fujitsu A64FX, Ampere Altra, ThunderX2). Comparison is made to x86 systems (Intel and AMD) and hybrid Intel x86/NVIDIA GPU systems. The benchmarking was done with the application kernel module of XDMoD. XDMoD, developed at the University at Buffalo, is a comprehensive suite for HPC resource utilization and performance monitoring.
The applications span multiple HPC fields and paradigms: HPCC (several HPC benchmarks), NWChem (ab initio chemistry), OpenFOAM (partial differential equation solver, hydrodynamics), GROMACS (biomolecular simulation), AI Benchmark Alpha (AI benchmark), and Enzo (adaptive mesh refinement, astrophysical simulation).

Paper, Presentation

Technical Session 1C (Chris Fuson)

Flexible Slurm configuration for large scale HPC
Steven Robson, Kieran Leach, Stephen Booth, Greg Blow, Maciej Hamczyk, and Philip Cass (EPCC, The University of Edinburgh)
Abstract: EPCC operates a variety of services including ARCHER2, commissioned by UK Research and Innovation on behalf of the UK Government as the UK's Tier-1 service, and Cirrus, a UKRI Tier-2 service.

Supporting Many Task Workloads on Frontier using PMIx and PRRTE
Wael Elwasif and Thomas Naughton (Oak Ridge National Laboratory)
Abstract: Large-scale many-task ensembles are increasingly used as the basic building block for scientific applications running on leadership-class platforms. Workflow engines are used to coordinate the execution of such ensembles and make use of lower-level system software to manage the lifetime of processes. PMIx (Process Management Interface for Exascale) is a standard for interaction with system resource and task management software. The OpenPMIx reference implementation provides a useful basis for workflow engines running on large-scale HPC systems.

Slurm 23.02, 23.11, and Beyond
Tim Wickberg (SchedMD LLC, Slurm)
Abstract: This presentation will provide a technical overview of new features and functionality being released in the open-source Slurm workload manager versions 23.02 (released February 2023) and 23.11 (to be released November 2023).

Paper, Presentation

3:00pm-5:00pm | Technical Session 2A (Frank M. Indiviglio)

Deploying Cloud-Native HPC Clusters on HPE Cray EX
Felipe A. Cruz, Alejandro J. Dabin, and Manuel Sopena Ballesteros (Swiss National Supercomputing Centre)
Abstract: The software stack that manages a High-Performance Computing (HPC) cluster is a collection of applications and services put together by multiple engineers. Integrating all the software components is often complex and challenging. Therefore, the engineering effort frequently focuses on minimizing service disruption rather than on delivering new features. In this work, we introduce a cloud-native architecture for delivering an HPC cluster on top of HPE Cray EX that streamlines the development, operation, maintenance, and administration of the many services that compose an HPC cluster. Under a cloud-native approach, an HPC cluster is architected as a collection of small, loosely coupled services that can be independently delivered. Moreover, we leverage an on-prem cloud platform deployment that enables a self-service model for engineers to introduce controlled changes to the cluster while streamlining service and infrastructure automation. The presented cloud-native architecture is a starting point for delivering HPC clusters that are more resilient and scalable to operate.
Software-defined Multi-tenancy on HPE Cray EX Supercomputers
Richard Duckworth, Vinay Gavirangaswamy, David Gloe, and Brad Klein (HPE)
Abstract: Sandia National Laboratories' Red Storm system was designed to support "switching" hardware to isolate computation and data between data classification levels. This enabled Sandia and derivative system architectures to adapt investments in capability computing to evolving needs. Today, industry demand for multi-tenancy in modern converged HPC and AI platforms has not waned, but expectations around how the solution should be delivered have changed, as have the types of workloads being run. The industry is now strongly advocating for and investing in cloud-like platforms that treat multi-tenancy as a first-principles capability, align with modern DevOps management techniques, support resource elasticity, and enable customers to deliver their own IaaS, PaaS, and SaaS solutions. Enter HPE Cray Systems Management (CSM). CSM is a Kubernetes-based, turnkey, open-source, API-driven HPC systems software solution. Using CSM as a foundation, we have developed a software-defined multi-tenancy architecture, anchored by a tenancy "controller hub" called the Tenant and Partition Management System (TAPMS). Through extant features in CSM, TAPMS inherits the availability, scale, resiliency, disaster recovery, and security properties of the platform. This paper presents TAPMS, the supporting architecture, and the resulting composable, declarative tenant configuration interfaces that TAPMS and the underlying Kubernetes Operator pattern enable.

New User Experiences with K3s and MetalLB on Managed Nodes
Alan Mutschelknaus and Jeremy Duckworth (HPE)
Abstract: Traditional managed nodes on HPE Cray EX systems, dedicated to user compilations and job launch, have not supported user interaction beyond the standard SSH shell environment. This model works for many use cases, but it does not provide the flexibility that industry solutions around container orchestration offer. While User Access Instances (UAIs) are available as containerized login environments, they currently run in the Cray System Management Kubernetes cluster and would be better suited to run alongside other user processes.

The WLCG Journey at CSCS: from Piz Daint to Alps
Riccardo Di Maria, Miguel Gila, Dino Conciatore, Giuseppe Lo Re, Elia Oggian, and Dario Petrusic (ETH Zurich, CSCS)
Abstract: The Swiss National Supercomputing Centre (CSCS), in close collaboration with the Swiss Institute for Particle Physics (CHiPP), provides the Worldwide LHC Computing Grid (WLCG) project with cutting-edge HPC and HTC resources. These are reachable through a number of Computing Elements (CEs) that, along with a Storage Element (SE), characterise CSCS as a Tier-2 Grid site. The current flagship system, an HPE Cray XC named Piz Daint, has been the platform where all the computing requirements for the Tier-2 have been met for the last six years. With the commissioning of the future flagship infrastructure, an HPE Cray EX referred to as Alps, CSCS is gradually moving the computational resources to the new environment. The Centre has been investing heavily in the concept of Infrastructure as Code (IaC) and is embracing the multi-tenancy paradigm for its infrastructure.
As a result, the project leverages modern approaches and technologies borrowed from the cloud to perform a complete redesign of the service. The goal of this contribution is to describe the journey, design choices, and challenges encountered along the way to implementing the new WLCG platform, which other projects, such as the Cherenkov Telescope Array (CTA), are also benefiting from.

Paper, Presentation

Technical Session 2B (Lena M Lopatina)

Deploying a Parallel File System for the World's First Exascale Supercomputer
Jesse Hanley, Dustin Leverman, Christopher Coffman, Bradley Gipson, Christopher Brumgard, and Rick Mohr (Oak Ridge National Laboratory)
Abstract: The world's first exascale supercomputer, OLCF's Frontier, debuted last year and is allocated for INCITE awards this year. OLCF partnered with HPE to design, procure, and deploy a parallel file system to support the demands of this new machine. This file system is based on the ClusterStor E1000 storage platform and has been integrated into the OLCF site.

Hiding I/O using SMT on the ARCHER2 HPE Cray EX system
Shrey Bhardwaj, Paul Bartholomew, and Mark Parsons (EPCC, The University of Edinburgh)
Abstract: In modern HPC systems, the I/O bottleneck limits the overall application wall clock time. To address this problem, this work tests the hypothesis that the effective I/O bandwidth can be improved by using SMT on ARCHER2, an HPE Cray EX supercomputing system. This was achieved by developing a benchmark library, iocomp, which uses MPI to separate the computation and I/O processes. These processes can then be mapped to both threads of a single core using SMT on ARCHER2, or to separate cores for comparison. For preliminary testing, the STREAM benchmark is used as the computational kernel and the iocomp library is used for the I/O operations. Timers are added to the application to record the computation time and the wall time. Preliminary results show that when SMT is used, the wall clock time is 30% greater than when placing the computation and I/O processes onto separate cores using a full ARCHER2 node. As the STREAM benchmark is an unrealistic test case, the HPCG and HPL benchmarks will next be used to test the hypothesis.

MPI-IO Local Aggregation as Collective Buffering for NVMe Lustre Storage Targets
Michael Moore and Ashwin Reghunandanan (Hewlett Packard Enterprise) and Lisa Gerhardt (Lawrence Berkeley National Laboratory)
Abstract: HPC I/O workloads using shared-file access on distributed file systems such as Lustre have historically achieved lower performance relative to an optimal file-per-process workload. Optimizations at different levels of the application and file system stacks have alleviated many of the performance limitations for disk-based Lustre storage targets (OSTs). While many of the shared-file optimizations in Lustre and MPI-IO provide performance benefits on NVMe-based OSTs, the existing optimizations do not allow full utilization of the high throughput and random-access performance characteristics of the NVMe OSTs on existing systems. A new optimization in HPE Cray MPI, part of the HPE Cray Programming Environment, builds on existing shared-file optimizations and the performance characteristics of NVMe-backed OSTs to improve shared-file write performance for those targets.
This paper discusses the motivation and implementation of that new shared-file write optimization, MPI-IO Local Aggregation as Collective Buffering, for NVMe-based Lustre OSTs like those in the HPE Cray ClusterStor E1000 storage system. It also describes the new feature and how to evaluate application MPI-IO collective operation performance through HPE Cray MPI's MPI-IO statistics. Finally, results of benchmarks using the new collective MPI-IO write optimization are presented.

Kfabric Lustre Network Driver
Chris Horn, Ian Ziemba, Amith Abraham, Ron Gredvig, and John Fragalla (Hewlett Packard Enterprise)
Abstract: Lustre is a parallel distributed file system used for large-scale cluster computing. Lustre's performance scalability makes Cray ClusterStor the ideal storage system to pair with Cray EX computer systems. Between a high-performance compute system and storage system, it is necessary to deploy an equally performant and scalable network fabric and related software stack. Kfabric is a high-performance fabric software library based on libfabric, and it is optimized for bulk data and storage transfers. The Lustre Kfabric Network Driver (kfilnd) leverages Kfabric interfaces and the Kfabric kCXI provider (kfi_cxi) to enable Remote Direct Memory Access (RDMA) for Lustre file system communication on Slingshot networks. This presentation provides details on the designs of these new technologies and reports on some early lessons learned from their deployment at scale. We will provide a brief overview of the kfilnd, kfi_cxi, and cxi software. We will discuss some of the challenges we encountered in the areas of serviceability and resiliency, as well as some recent improvements we have made in those areas. Finally, we will provide a short preview of future work. This information should prepare system administrators to better operate and service Lustre clients and servers on Cray ClusterStor and Cray EX systems with Slingshot.

Paper, Presentation

Technical Session 2C (Brett Bode)

Stress-less MPI Stress Tests
Pascal Elahi and Craig Meyer (Pawsey Supercomputing Research Centre)
Abstract: The Message Passing Interface (MPI) is critical for running jobs at scale on High Performance Computing (HPC) systems. Consequently, it is common practice to test the MPI on an HPC system with packages such as the Ohio State University Micro-benchmarks and other MPI-enabled codes. However, these packages' focus on benchmarking or on specific, simple communication patterns means they leave much of the MPI deployment untested. As a consequence, users of Pawsey's HPE Cray EX system encountered issues with the MPI at initial deployment, even after the system had passed acceptance tests. We present here a suite of MPI tests that stress the MPI library, focusing on a wide variety of communication patterns. These tests were critical to uncovering or isolating a number of underlying issues with the communication libraries on our newly deployed HPE Cray EX system. We will also discuss in detail any issues uncovered by these tests that remain unresolved and what impact they might have on users of EX systems. We have integrated these tests into the ReFrame framework so that they can be deployed on any system.
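To make the idea of stressing "a wide variety of communication patterns" concrete, here is a small illustrative sketch in mpi4py that exercises a ring exchange and an all-to-all at several message sizes. It is not the Pawsey test suite itself; the pattern selection, sizes, and validation are assumptions chosen for illustration.

```python
# Illustrative communication-pattern exerciser (mpi4py + NumPy); not the suite
# described in the talk. Run with e.g.: srun -n 64 python pattern_stress.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()


def ring_exchange(nbytes):
    """Each rank sends to its right neighbour and receives from its left."""
    sendbuf = np.full(nbytes, rank % 256, dtype=np.uint8)
    recvbuf = np.empty(nbytes, dtype=np.uint8)
    right, left = (rank + 1) % size, (rank - 1) % size
    comm.Sendrecv(sendbuf, dest=right, recvbuf=recvbuf, source=left)
    assert recvbuf[0] == left % 256, 'unexpected payload from left neighbour'


def alltoall_exchange(nbytes):
    """Every rank exchanges an nbytes block with every other rank."""
    sendbuf = np.full(size * nbytes, rank % 256, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)
    comm.Alltoall(sendbuf, recvbuf)


for nbytes in (8, 4096, 1 << 20):
    comm.Barrier()
    t0 = MPI.Wtime()
    ring_exchange(nbytes)
    alltoall_exchange(nbytes)
    comm.Barrier()
    if rank == 0:
        print(f'{nbytes:>8} B: ring + alltoall completed in {MPI.Wtime() - t0:.4f} s')
```

Wrapping runs like this in ReFrame checks, as the abstract describes, would let the same patterns be replayed automatically after every software upgrade.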
Leveraging Libfabric to Compare Containerized MPI Applications' Performance Over Slingshot 11
Alberto Madonna (Swiss National Supercomputing Centre)
Abstract: The capability to flexibly access different high-speed network hardware is fundamental when pursuing performance portability of containerized HPC applications. Solutions should strive to provide two desirable qualities: first, to be independent of the MPI implementation, allowing developers to use the best flavor for their application; second, to minimize modifications to the container image software stack, improving workflow reproducibility. Recent developments demonstrated the possibility of achieving these traits by using the libfabric communication framework as middleware, abstracting the network hardware from the MPI libraries. In a nutshell, libfabric components are either added or replaced at container creation time, enabling containers to leverage optimized fabrics. This presentation provides an overview of such an approach and then describes early experiences in applying the technique on a system featuring the HPE Slingshot 11 interconnect. Experimental results are showcased, comparing open-source and proprietary MPI implementations across synthetic benchmarks and a selection of real-world scientific applications.

Designing HPE Cray Message Passing Toolkit Software Stack for HPE Cray EX supercomputers
Krishna Kandalla, Naveen Ravi, Kim McMahon, Larry Kaplan, and Mark Pagel (Hewlett Packard Enterprise)
Abstract: The Frontier supercomputer at ORNL is the world's first supercomputer to break the exascale barrier, and it is based on the HPE Cray EX architecture. The HPE Cray EX architecture is designed to be highly flexible and relies on HPE Slingshot technology. The HPE Cray Programming Environment is a key software component that is tightly integrated with the broader HPE Cray EX hardware and software ecosystem to offer high performance and improved programmer productivity on HPE Cray EX systems. The HPE Cray Message Passing Toolkit is one of the building blocks of the HPE Cray Programming Environment. It comprises the HPE Cray MPI and HPE Cray OpenSHMEMX software stacks. HPE Cray MPI is a proprietary implementation of the MPI specification, serves as the primary MPI stack on HPE Cray EX supercomputers, and was instrumental in surpassing the exascale performance barrier on Frontier. HPE Cray OpenSHMEMX is a proprietary implementation of the OpenSHMEM specification and is the premier SHMEM implementation on HPE EX systems. Both libraries leverage years of innovation to offer high-performance and scalable communication capabilities. This talk offers an overview of the HPE Cray MPI and HPE Cray OpenSHMEMX stacks on HPE Cray EX supercomputers.
Open MPI for HPE Cray EX Systems
Howard Pritchard (Los Alamos National Laboratory) and Thomas Naughton, Amir Shehata, and David Bernholdt (Oak Ridge National Laboratory)
Abstract: Open MPI for HPE Cray EX Systems.

Paper, Presentation

5:05pm-5:50pm | BoF 2A

Energy-based allocations and charging on large scale HPC systems
Sridutt Bhalachandra and Norman Bourassa (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Maciej Cytowski and Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre); Martin Bernreuther and Björn Dick (High Performance Computing Center Stuttgart); Juan Rodriguez Herrera, Alan Simpson, and Andrew Turner (EPCC, The University of Edinburgh); and Torsten Wilde (Hewlett Packard)
Abstract: Explicitly including per-job energy costs has not, historically, been important when allocating and charging for resources on HPC systems, because electricity costs for large HPC services over their lifetime have been dwarfed by the costs of procuring the systems. This is no longer true: for the ARCHER2 UK National Supercomputing Service, the lifetime electricity costs are estimated at more than 50% of the hardware costs. There is now wider interest in how allocation and charging schemes could be changed from residency-based approaches (e.g., core hours) to approaches that explicitly include an energy component. In this session, we will provide a forum for interested parties to come together to discuss energy-based allocation and charging and the issues surrounding the introduction of such schemes. We will invite participants to join a cross-site working group to continue sharing experience on energy-based charging.

Birds of a Feather

BoF 2B

HPCM Users
Jeff Hanson and Cornel Boac (Hewlett Packard)
Abstract: HPCM had a BoF at the virtual CUG in 2021 and again in 2022; both were 45-minute sessions. I propose a similar session for 2023.

Birds of a Feather

BoF 2C

Data Mobility Service – Data curation and intelligent management
Torben Kling Petersen (HPE)
Abstract: With storage requirements approaching exabyte levels, billions of files, and availability becoming more important to computational workflows, new paradigms for managing data are required. Many systems today are required to last for 5-7 years, support multi-tenancy data protection, and allow for geo-distributed data federation, making their management increasingly complicated. To address these challenges, HPE is developing the Data Mobility Service, a suite of tools comprising one of several functions of future HPE GreenLake offerings, for use in both on-prem and off-prem solutions. While requirements are still being defined, the concept is to manage data ranging from node-local storage, through parallel file systems, to archives. Local archiving, managed today via DMF7, can be stretched to multi-site data movement and potentially global federation, all managed as a dynamic service using HPE's GreenLake Cloud Platform. This talk will provide an outline of HPE's plans for the Data Mobility Service, to gather feedback and facilitate development of the service to meet customer needs. This includes ephemeral file systems, data access following FAIR principles, and open-source research projects such as HPE's Common Metadata Framework. Additionally, the use of policy-based data movement workflows will be discussed.
Birds of a Feather

Wednesday, May 10th

8:30am-10:00am | Plenary: CUG Elections, Panel

CUG Business
Ashley Barker (Oak Ridge National Laboratory)
Abstract: CUG Business meeting and elections.

Women in HPC presents: Equity in Technical Leadership
Kelly Rowland and Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Ann Gentile (Sandia National Laboratories); Lipi Gupta and Yun (Helen) He (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Lena Lopatina (Los Alamos National Laboratory); Verónica Melesse Vergara (Oak Ridge National Laboratory); Hai Ah Nam (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Amy Neeser (University of California, Berkeley); Jean Sexton (Lawrence Berkeley National Laboratory); and Laurie Stephey and Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Abstract: The San Francisco Bay Area chapter of Women in HPC invites everyone to attend this discussion of equity in technical leadership. Technical leadership includes the skills to lead, direct, and manage technical projects; we have invited a panel of experts on this topic to explore what it looks like to support individuals in groups underrepresented in HPC as technical leaders at various career stages.

The End of Heterogeneous Programming
Alex Koehler (NVIDIA)
Abstract: Heterogeneous programming is dying, and NVIDIA is killing it. Heterogeneous programming used to be required to get applications working on GPUs. As a result of a decade of co-design between ISO language parallelism and GPU hardware, it is now possible to develop GPU applications strictly based on ISO language parallelism, with no explicit support for heterogeneity, and have the performance be compelling. As a result of support for ISO language parallelism, the use of pragmas/directives as the portable way to port applications to GPUs should be considered deprecated.

Arm in HPC
David Lecomber (Arm)
Abstract: HPE and Cray have a long and collaborative history of firsts in the story of Arm in HPC, and with the upcoming systems announced in Europe and the US there will be more to come. We'll talk about that progress and what Arm is doing to make HPC successful on its architecture.

CUG Business, Plenary

10:30am-11:40am | Plenary: HPE Update

HPE update by Gerald Kleyn
Gerald Kleyn (HPE)
Abstract: HPE update by Gerald Kleyn.

Vendor, Plenary

2:15pm-3:45pm | Technical Session 3A (Martti Louhivuori)

Monitoring and characterizing GPU usage
Le Mai Weakley, Scott Michael, Abhinav Thota, Laura Huber, Ben Fulton, and Matthew Kusz (Indiana University)
Abstract: For systems with an accelerator component, it is important from an operational and planning perspective to understand how and to what extent the accelerators are being used. Having a framework for tracking the utilization of accelerator resources is important both for judging how efficiently a system is being used and for capacity and configuration planning of future systems. In addition to tracking total utilization and accelerator efficiency numbers, some attention should also be paid to the types of research and workflows being executed on the system.
In the past, the demand for accelerator resources was largely driven by more traditional simulation codes, such as molecular dynamics. But with the growing popularity of deep learning and artificial intelligence workflows, accelerators have become even more highly sought after and are being used in new ways. Provisioning resources to researchers via an allocation system allows sites to track a project's usage and workflow as well as the scientific impact of the project. With such tools and data in hand, characterizing the GPU utilization of deep learning frameworks versus more traditional GPU-enabled applications becomes possible. In this paper we present a survey of GPU monitoring tools used at HPC sites and a framework for tracking the utilization of NVIDIA GPUs on Slurm-scheduled HPC systems used at Indiana University. We also present an analysis of accelerator utilization on multiple systems, including an HPE Apollo system targeting AI workflows and a Cray EX system. Evaluating and Influencing Extreme-Scale Monitoring Implementations Evaluating and Influencing Extreme-Scale Monitoring Implementations Jim Brandt (Sandia National Laboratories), Chris Morrone (Lawrence Livermore National Laboratory), Eric Roman (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Ann Gentile (Sandia National Laboratories), Tom Tucker (Open Grid Computing), Jeff Hanson (HPE), and Kathleen Shoga and Alec Scott (Lawrence Livermore National Laboratory) Abstract Over the past decade we have been able to gain new insights into application resource utilization and to detect and diagnose problems with decreased latency through fine-grained monitoring of our HPC systems while incurring no statistically significant performance penalty. STREAM: A Scalable Federated HPC Telemetry Platform STREAM: A Scalable Federated HPC Telemetry Platform Ryan Adamson (Oak Ridge National Laboratory) Abstract Obtaining and analyzing high performance computing (HPC) telemetry in real time is a complex task that can impact algorithmic performance, operating costs, and ultimately scientific outcomes. If your organization operates multiple HPC systems, filesystems, and clusters, telemetry streams can be synthesized in order to ease the operational and analytics burden. In order to collect this telemetry, the Oak Ridge Leadership Computing Facility (OLCF) has deployed STREAM (Streaming Telemetry for Resource Events, Analytics, and Monitoring), which is a distributed and high-performance message bus based on Apache Kafka. STREAM collects center-wide performance information and must interface with many sources, including five HPE-deployed supercomputers, each with its own Kafka cluster managed by HPCM. OLCF supercomputers and their attached scratch filesystems currently send more than 300 million messages over 200 topics to produce around 1.3 terabytes per day of telemetry data to STREAM. This paper describes the architectural principles that enable STREAM to be both resilient and highly performant while supporting multiple upstream Kafka clusters and other data sources. It also discusses the design challenges and decisions faced in adapting our existing system-monitoring infrastructure to support the first exascale computing platform.
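The two monitoring abstracts above describe, respectively, tracking per-job GPU utilization on Slurm-scheduled systems and funnelling center-wide telemetry through an Apache Kafka message bus. The following is a minimal sketch of how those two ideas fit together, assuming the pynvml and kafka-python packages, an NVIDIA node, and a reachable broker; the broker address, topic name, and sampling interval are illustrative assumptions, not details of the Indiana University or OLCF deployments.

    # Sample NVIDIA GPU utilization and publish it to a Kafka topic (sketch only).
    import json, os, socket, time
    import pynvml
    from kafka import KafkaProducer

    pynvml.nvmlInit()
    producer = KafkaProducer(bootstrap_servers="broker.example.org:9092",
                             value_serializer=lambda v: json.dumps(v).encode())

    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            producer.send("gpu-utilization", {
                "host": socket.gethostname(),
                "gpu": i,
                "sm_util_pct": util.gpu,        # SM utilization, percent
                "mem_util_pct": util.memory,    # memory-controller utilization, percent
                "job": os.environ.get("SLURM_JOB_ID"),
                "ts": time.time(),
            })
        producer.flush()
        time.sleep(30)                          # assumed sampling interval

Tagging each sample with the Slurm job ID is what lets utilization later be attributed to projects and workflow types, as the Indiana University abstract describes.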
Paper, Presentation Technical Session 3B Tina Declerck HPE’s Holistic system Power and energy Management (HPM) vision HPE’s Holistic system Power and energy Management (HPM) vision Torsten Wilde, Larry Kaplan, and Andy Warner (Hewlett Packard Enterprise) Abstract With the movement towards a carbon neutral and sustainable economy, the landscape of HPC system management and optimization is changing rapidly. Rising energy prices leading to concerns of increased OPEX, regulatory concerns around data center sustainability (reduction of carbon footprint, total power burden on the grid), and the expected increase in system power consumption with upcoming technologies require a fresh take on data center and system operation along with power/energy management. Powersched: A HPC System Power and Energy Management Framework Powersched: A HPC System Power and Energy Management Framework Marcel Marquardt, Jan Mäder, Tobias Schiffmann, Christian Simmendinger, and Torsten Wilde (Hewlett Packard Enterprise) Abstract Supercomputers can consume huge amounts of energy. This rising power consumption has already led to a situation in which large HPC sites run overprovisioned supercomputers, where the peak power demand of the system can exceed the power available at a given site. In addition, energy prices, especially in Europe, have dramatically increased over the last year. To address both problems, we see the need to run HPC applications at a maximally energy-efficient sweet spot in terms of instructions per watt. Power Capping of Heterogeneous Systems Power Capping of Heterogeneous Systems Andrew Nieuwsma and Torsten Wilde (Hewlett Packard Enterprise) Abstract The landscape of HPC is changing rapidly because of rising energy prices and concerns of increased OPEX, regulatory concerns around data center sustainability (reduction of carbon footprint, total power burden on the grid), and the expected increase in system power consumption as systems get larger. Customers are asking for solutions that help them manage the changing landscape. Paper, Presentation Technical Session 3C Helen He Building AMD ROCm from Source on a Supercomputer Building AMD ROCm from Source on a Supercomputer Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre) Abstract ROCm is an open-source software development platform for GPU computing created by AMD to accompany its GPU hardware that is being increasingly adopted to build the next generation of supercomputers. We argue that the fast-paced evolution of the software platform and the complexities of installing software on a supercomputer mandate a more flexible installation process for ROCm than the available installation methods. Arkouda: A high-performance data analytics package Arkouda: A high-performance data analytics package Michelle Strout (Hewlett Packard Enterprise, University of Arizona); Brad Chamberlain (Hewlett Packard Enterprise, University of Washington); Elliot Ronaghan (Hewlett Packard Enterprise); and Scott Bachman (Hewlett Packard Enterprise, NCAR) Abstract This talk describes Arkouda, a Python package the Chapel team at HPE has co-developed with the U.S. DoD to support data science using familiar interfaces at massive scales (think "TB-scale arrays") and interactive rates (think "seconds to small numbers of minutes per operation"). In more detail: Arkouda supports a key subset of the NumPy and Pandas Dataframe interfaces out of the box, serving as a virtual drop-in replacement for those operations.
However, Arkouda's arrays can be transparently distributed across the memories of multiple compute nodes, and its operations are computed in parallel using all of the nodes' processor cores. This is achieved due to Arkouda's use of a client-server model in which the server is written in Chapel and runs on the local system, a cluster, cloud, or supercomputer. In practice, users have run Arkouda operations on data sets of up to 56 TB of memory in seconds to minutes using up to 112k cores. We have also seen Arkouda outperform NumPy for single-node computations and significantly outperform Dask at scale. This talk will describe Arkouda's feature set, architecture, usage models, and performance results. For more information about Arkouda, see https://github.com/Bears-R-Us/arkouda. (See the short client sketch below.) HPC workflow orchestration using the ipython notebook platform HPC workflow orchestration using the ipython notebook platform Jonathan Sparks and Ayad Jassim (Hewlett Packard Enterprise) Abstract This paper describes a methodology using novel ipython notebook technologies to orchestrate complex HPC workflows and illustrates how this infrastructure can aid developer productivity from the edge to the supercomputer. Notebooks' inherent capability to incorporate multi-language code segments and declarative metadata are two essential building blocks used in workflow modeling. We present a reference architecture using notebook infrastructure and components leveraging the accessibility and features of the web interface rather than the traditional SSH-based remote shell typically exposed by HPC systems. This architecture employs a unified framework to seamlessly support HPC workflow understanding and the generation of distributed tasks, allowing for better system and code team productivity. In addition to presenting a reference architecture, we investigate several competing technologies exposing the strengths and weaknesses of these approaches through the lens of HPC. The proposed framework uses existing companion works, such as Streamflow and Common Workflow Language (CWL), and shows that these components can fully describe complex real-world workflows. This paper uses exemplar use cases, such as Computational Fluid Dynamics (CFD) and weather models, illustrating edge-to-core dynamic workflow migration exposed by this platform and developer tools such as Microsoft Visual Studio plugins. Paper, Presentation 4:00pm-5:00pmTechnical Session 4C Tina Declerck Delta: Living on the Edge of Slingshot Support. Delta: Living on the Edge of Slingshot Support. Brett Bode, David King, Greg Bauer, Galen Arnold, and Robert Brunner (National Center for Supercomputing Applications/University of Illinois) Abstract The Delta system is a fairly standard x86 CPU cluster except that it utilizes HPE/Cray’s Slingshot interconnect. HPE supports Slingshot in a range of software and hardware environments, but clearly systems running Cray OS or SLES and managed by HPCM are the core supported platforms. Delta differs from those platforms in a number of ways including the use of xCAT to manage Red Hat Enterprise Linux based nodes, DDN-based storage attached to the fabric in multiple ways, and the use of Spack to build most of the user space software environment. This presentation will cover the initial installation of the Delta software stack on the initial Slingshot 10 system and then the various changes that we made to upgrade to Slingshot 11 and our final production software stack.
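Relating to the Arkouda abstract in Technical Session 3C above: the client-server model it describes means that familiar NumPy-style calls are issued from Python but executed by a Chapel server running on the cluster. A minimal sketch of what that looks like from the client side follows; the host, port, and array sizes are illustrative assumptions, and it presumes an arkouda_server is already running.

    # Minimal Arkouda client sketch (assumes a running arkouda_server).
    import arkouda as ak

    ak.connect("localhost", 5555)          # attach to the Chapel server

    a = ak.randint(0, 2**32, 10**8)        # arrays live server-side, possibly
    b = ak.randint(0, 2**32, 10**8)        # distributed over many nodes

    print((a + b).sum())                   # elementwise add and reduction on the server

    g = ak.GroupBy(a % 10)                 # distributed group-by, Pandas-style
    keys, counts = g.count()
    print(keys.to_ndarray(), counts.to_ndarray())

    ak.disconnect()

From the user's point of view this reads like ordinary NumPy/Pandas code, which is the "virtual drop-in replacement" point the abstract makes; only the connect and disconnect calls betray the server behind it.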
Cray Systems Management (CSM) Security Policy Engine Cray Systems Management (CSM) Security Policy Engine Srinivasa Rao Yarlagadda, Jeremy Duckworth, Viswanadha Murthy MVD, and Amarnath Chilumukuru (Hewlett Packard Enterprise) Abstract Cray Systems Management (CSM) provides enhanced security controls to enable a trustworthy HPC and AI system solution. Most of the exploits in Kubernetes occur due to inadequate security controls and misconfiguration. Kyverno is a policy engine designed for Kubernetes that runs as a dynamic admission controller. Kyverno uses the Kubernetes admission webhooks to validate, mutate, and generate Kubernetes resources. Image signature verification to prevent software supply chain attacks can also be achieved. Paper, Presentation 4:00pm-5:30pmTechnical Session 4A Craig West Polaris and Acceptance Testing Polaris and Acceptance Testing Brian Homerding, Ben Lenard, Cyrus Blackworth, Alex Kulyavtsev, Carissa Holohan, Gordon McPheeters, Eric Pershy, Paul Rich, Doug Waldron, Michael Zhang, Kevin Harms, Ti Leggett, and William Allcock (Argonne National Laboratory) Abstract Argonne Leadership Computing Facility (ALCF) is home to Polaris, a 44 peak PetaFLOP (PF) system developed in collaboration with Hewlett Packard Enterprise (HPE) and NVIDIA. Polaris is a heterogeneous system with 560 nodes utilizing NVIDIA GPUs along with an HPE Slingshot interconnect and an HDR200 InfiniBand network to storage. Due to hardware availability, the delivery was performed in multiple stages. We introduce both hardware and software components of Polaris and discuss the performance of our thorough benchmarking analysis. ALCF policy is to perform a rigorous multi-week acceptance testing (AT) evaluation for every major system to ensure the capabilities of that system can support ALCF users’ science application needs and meet ALCF system operational metrics. The various system components are thoroughly tested to ensure the system will be stable for production operation, function correctly, and fulfill performance expectations for scientific workloads. We will discuss how ALCF used Jenkins and ReFrame to perform the AT of the base Polaris system as well as a second AT to evaluate the Polaris CPU upgrade. We will present our approach for deploying Jenkins to streamline the AT evaluation with benchmarking improvements and lessons learned from the successful acceptance of the heterogeneous system, Polaris. Frontier Node Health Checking and State Management Frontier Node Health Checking and State Management Matthew A. Ezell (Oak Ridge National Laboratory) Abstract The HPE Cray EX235a compute blade that powers Frontier packs significant computational power in a small form factor. These complex nodes contain one CPU, 4 AMD GPUs (which present as 8 devices), 4 Slingshot NICs, and 2 NVMe devices. During the process of Frontier’s bring-up, as HPE and ORNL staff observed issues on nodes they would develop a health check to automatically detect the problem. A simple bash script called checknode collected these tests into one central location to ensure that each component in the node is working according to its specifications. ORNL developed procedures that ensure checknode is run before allowing nodes to be used by the workload manager. The full checknode script runs on boot before Slurm starts, and a reduced set of tests run during the epilog of every Slurm job. Errors detected by checknode will cause the node to be marked as “drain” in Slurm with the error message stored in the Slurm “reason” field.
Upon a healthy run of checknode, it can automatically undrain/resume a node as long as the “reason” was set by checknode itself. This presentation will discuss some of the checks present in checknode as well as outline the node state management workflow. (See the short drain/resume sketch below.) Performance Portable Thread Cooperation On AMD and NVIDIA GPUs Performance Portable Thread Cooperation On AMD and NVIDIA GPUs Sebastian Keller (ETH Zurich, CSCS) Abstract One difference between the AMD CDNA2 and NVIDIA GPU architectures lies in the implementation of shared memory or local data store. In the former, the local data store resides in a dedicated area of the compute units, whereas on NVIDIA GPUs, shared memory competes with normal stack variables for space in the register file. By implementing a simple N-body kernel as an example, it is explored whether this architectural difference leads to different choices for thread cooperation, i.e. data exchange through shared memory or intra-warp exchange with shuffle instructions, performing optimally on one GPU vs. the other, and what this means for performance portability. Paper, Presentation Technical Session 4B Abhinav S. Thota Using Containers to Deliver User Environments on HPE Cray EX Using Containers to Deliver User Environments on HPE Cray EX Felipe A. Cruz and Alberto Madonna (Swiss National Supercomputing Centre) Abstract In HPC systems, the user environment is the layer of the software stack with everything needed to support users' workflows for development, debugging, testing, and job execution. As such, environments can include compilers, libraries, environment variables, and command-line tools. User environments meet additional challenges: the need to provide stability and flexibility in large systems with thousands of users; as user environments are coupled and built over the system software layer, they are subject to overall system cadence and validation with all that this entails. Towards a "Containers Everywhere" HPC Platform Towards a "Containers Everywhere" HPC Platform Daniel Fulton, Laurie Stephey, Shane Canon, Brandon Cook, and Adam Lavely (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Although containers provide many operational advantages including flexibility, portability and reproducibility, a fully containerized ecosystem for HPC systems does not yet exist. To date, containers in HPC typically require both substantial user expertise and additional container and job configuration. In this paper, we argue that a fully containerized HPC platform is compelling for both HPC administrators and users, offer ideas for what this platform might look like, and identify gaps that must be addressed to move from the current state of the art to this containers-everywhere approach. Additionally, we will discuss enabling core functionality, including communicating with the Slurm scheduler, using custom user-designed images, and using tracing/debuggers inside containers. We argue that to achieve the greatest benefit for both HPC administrators and users, a model is needed that will enable both novice users, who have not yet adopted container technologies, as well as expert users who have already embraced containers. The aspiration of this work is to move towards a model in which all users can reap the benefits of working in a containerized environment without being an expert in containers or even knowing that they are inside of one.
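Relating to the Frontier node-health abstract in Technical Session 4A above: the workflow it describes is a set of per-component checks whose failure drains the node in Slurm, with an automatic resume only when the drain reason was set by the health check itself. A minimal sketch of that drain/resume pattern follows; the individual checks, expected device counts, and reason prefix are illustrative assumptions, not the contents of the real checknode script (which is a site-maintained bash tool).

    # Sketch of the drain/resume pattern described for Frontier's checknode.
    import glob, socket, subprocess

    REASON_PREFIX = "healthcheck:"   # only auto-resume nodes this script drained

    def nvme_ok():
        return len(glob.glob("/dev/nvme?n1")) >= 2          # assumed device count

    def hsn_ok():
        return len(glob.glob("/sys/class/net/hsn*")) >= 4   # assumed NIC count

    CHECKS = {"nvme": nvme_ok, "hsn": hsn_ok}

    def main():
        node = socket.gethostname()
        failed = [name for name, check in CHECKS.items() if not check()]
        if failed:
            subprocess.run(["scontrol", "update", f"NodeName={node}",
                            "State=DRAIN",
                            f"Reason={REASON_PREFIX}{','.join(failed)}"], check=True)
            return
        info = subprocess.run(["scontrol", "show", "node", node],
                              capture_output=True, text=True).stdout
        if f"Reason={REASON_PREFIX}" in info:                # we drained it; resume
            subprocess.run(["scontrol", "update", f"NodeName={node}",
                            "State=RESUME"], check=True)

    if __name__ == "__main__":
        main()

Gating the resume on the stored reason is the key detail from the abstract: it prevents the health check from un-draining nodes that an administrator drained on purpose.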
Reducing File System Stress Caused by Large Python Installations Using Containers Reducing File System Stress Caused by Large Python Installations Using Containers Henrik Nortamo (CSC - IT Center for Science Ltd.) Abstract We present Tykky, a tool used on LUMI to create large Python-based installations on parallel file systems, which do not handle a large number of small files well. Paper, Presentation 5:00pm-5:30pmBoF 3C Next Steps in System Management Next Steps in System Management Andy Warne, Cornel Boac, Alok Prakash, and Dennis Walker (HPE) Abstract Since the acquisition of Cray by HPE, development and deployment has continued for legacy-HPE and legacy-Cray system management solutions (HPCM and CSM, respectively). This presentation discusses the next stage in the evolution of system management for HPE systems. It will provide background, including crucial use cases, and describe how HPE plans to: continue and build on the innovations offered by CSM; support new use cases; and offer enhanced support for critical operational workflows. Birds of a Feather | Thursday, May 11th8:30am-10:00amTechnical Session 5A Helen He Climate Change Adaptation Digital Twin to support decision making Climate Change Adaptation Digital Twin to support decision making Jenni Kontkanen (CSC - IT Center for Science Ltd.); Mario Acosta, Pierre-Antoine Bretonnière, and Miguel Castrillo (Barcelona Supercomputing Centre); Paolo Davini (ISAC-CNR – Institute of Atmospheric Sciences and Climate, Consiglio Nazionale delle Ricerche); Francisco Doblas-Reyes (Barcelona Supercomputing Centre, Institució Catalana de Recerca i Estudis Avançats); Barbara Früh (DWD – Deutscher Wetterdienst); Jost von Hardenberg (Politecnico di Torino); Thomas Jung (Alfred Wegener Institute Helmholtz Center for Polar and Marine Research); Heikki Järvinen (University of Helsinki); Jan Keller (DWD – Deutscher Wetterdienst); Daniel Klocke (MPI-M – Max Planck Institute for Meteorology); Outi Sievi-Korte (CSC - IT Center for Science Ltd.); Sami Niemelä (Finnish Meteorological Institute); Bjorn Stevens (MPI-M – Max Planck Institute for Meteorology); Stephan Thober (Helmholtz Centre for Environmental Research); and Pekka Manninen (CSC - IT Center for Science Ltd.) Abstract To guide climate change adaptation efforts, there is a need for developing new types of climate information systems that provide timely information on local and regional impacts of climate change. We aim towards this by developing a Climate Change Adaptation Digital Twin, as part of the European Commission’s Destination Earth programme. Early Experiences on the OLCF Frontier System with AthenaPK and Parthenon-Hydro Early Experiences on the OLCF Frontier System with AthenaPK and Parthenon-Hydro John Holmen (Oak Ridge National Laboratory), Philipp Grete (Hamburg Observatory University of Hamburg), and Veronica Vergara Larrea (Oak Ridge National Laboratory) Abstract The Oak Ridge Leadership Computing Facility (OLCF) has been preparing the nation’s first exascale system, Frontier, for production and end users. Frontier is based on HPE Cray’s new EX architecture and Slingshot interconnect and features 74 cabinets of optimized 3rd Gen AMD EPYC CPUs for HPC and AI and AMD Instinct MI250X accelerators. As a part of this preparation, “real-world” user codes have been selected to help assess the functionality, performance, and usability of the system.
This paper describes early experiences using the system in collaboration with the Hamburg Observatory for two selected codes, which have since been adopted in the OLCF Test Harness. Experiences discussed include efforts to resolve performance variability and per-cycle slowdowns. Results are shown for a performance portable astrophysical magnetohydrodynamics code, AthenaPK, and a miniapp stressing the core functionality of a performance portable block-structured adaptive mesh refinement (AMR) framework, Parthenon-Hydro. These results show good scaling characteristics to the full system. At the largest scale, the Parthenon-Hydro miniapp reaches a total of 1.7×10^13 zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at ≈92% weak scaling parallel efficiency (starting from a single node using a second-order, finite-volume method). LUMI - Delivering Real-World Application Performance at Scale through Collaboration LUMI - Delivering Real-World Application Performance at Scale through Collaboration Alistair Hart (HPE); Nicholas Malaya (AMD); Fredrik Robertsén (CSC - IT Center for Science Ltd.); Aniello Esposito, Diana Moise, Andrei Poenaru, and Peter Wauligmann (HPE); and Alessandro Fanfarillo, Samuel Antao, and George Markomanolis (AMD) Abstract EuroHPC's LUMI is a Top-3 HPC facility. Achieving performance for both synthetic benchmarks (like HPL) and real-world applications has involved a multi-year technical collaboration between HPE, AMD and the LUMI Consortium coordinated by CSC. Paper, Presentation Technical Session 5B Zhengji Zhao Improving energy efficiency on the ARCHER2 UK National Supercomputing Service Improving energy efficiency on the ARCHER2 UK National Supercomputing Service Adrian Jackson, Alan Simpson, and Andrew Turner (EPCC, The University of Edinburgh) Abstract Energy and power efficiency of modern supercomputers are important for several reasons: reducing energy costs, being more “friendly” to power generation grids and potentially reducing emissions. While energy efficiency has always been important in data centre operation, the energy efficiency of the applications running on the infrastructure has received less attention, usually because costs for large HPC services have historically been heavily dominated by the capital costs of procuring the systems. This is no longer true: for the ARCHER2 UK National Supercomputing Service, the lifetime electricity costs were projected to be around 50% of the hardware costs. Recently, the ARCHER2 team at EPCC have undertaken work to improve the energy efficiency of the service, giving a cumulative saving of more than 20% in power draw of the compute cabinets without significantly affecting application performance. In this presentation we will describe this work and its impact. We also discuss the differences between energy efficiency and emissions efficiency on large HPC systems within the context of net zero initiatives and the potentially competing demands between the two goals. Reducing HPC energy footprint for large scale GPU accelerated workloads Reducing HPC energy footprint for large scale GPU accelerated workloads Gabriel Hautreux and Etienne Malaboeuf (CINES) Abstract This paper presents a parametric approach to reducing the energy footprint of a large-scale GPU-driven HPC system. Frequency capping as well as power capping approaches are tested and compared. This study is performed on Adastra at CINES, the #11 system in the Top500 and the #3 system in the Green500.
We hope the results of this study will be of help to accelerator-enabled HPC centers seeking to reduce their energy footprint by applying policies on either accelerator frequency or power capping at the node level. Estimating energy-efficiency in quantum optimization algorithms. Estimating energy-efficiency in quantum optimization algorithms. Rolando Pablo Hong Enriquez (Hewlett Packard Labs); Rosa Badia (Barcelona Supercomputing Center); Barbara Chapman (Hewlett Packard Enterprise); and Kirk Bresniker, Aditya Dhakal, Eitan Frachtenberg, Ninad Hogade, Gourav Rattihalli, Pedro Bruel, Alok Mishra, and Dejan Milojicic (Hewlett Packard Labs) Abstract Since the dawn of Quantum Computing (QC), theoretical developments like Shor’s algorithm proved the conceptual superiority of QC over traditional computing. However, such quantum supremacy claims are difficult to achieve in practice due to the technical challenges of realizing noiseless qubits. In the near future, QC applications will need to rely on noisy quantum devices that offload part of their work to classical devices. A way to achieve this is by using Parameterized Quantum Circuits (PQCs) in optimization or even machine learning tasks. The energy consumption of quantum algorithms has been poorly studied. Here we explore several optimization algorithms using both theoretical insights and numerical experiments to understand their impact on energy consumption. Specifically, we highlight why and how algorithms like Quantum Natural Gradient Descent, Simultaneous Perturbation Stochastic Approximations or Circuit Learning methods are at least 2× to 4× more energy efficient than their classical counterparts. We also discuss why Feedback-Based Quantum Optimization is energy-inefficient and how a technique like Rosalin could boost the energy efficiency of other algorithms by a factor of ≥ 20×. Paper, Presentation Technical Session 5C Bilel Hadri Morpheus unleashed: Fast cross-platform SpMV on emerging architectures Morpheus unleashed: Fast cross-platform SpMV on emerging architectures Christodoulos Stylianou, Mark Klaisoongnoen, Ricardo Jesus, Nick Brown, and Michele Weiland (EPCC, The University of Edinburgh) Abstract Sparse matrices and linear algebra are at the heart of scientific simulations. Over the years, more than 70 sparse matrix storage formats have been developed, targeting a wide range of hardware architectures and matrix types, each of which exploits the particular strengths of an architecture, or the specific sparsity patterns of the matrices. Deploying HPC Seismic Redatuming on HPE/Cray Systems Deploying HPC Seismic Redatuming on HPE/Cray Systems Hatem Ltaief (KAUST) Abstract Seismic redatuming entails the repositioning of seismic data from the surface of the Earth to a subsurface level closer to where reflections originated. This operation requires intensive data movement, a large memory footprint, and extensive computations due to the requirement to repeatedly apply a Multi-Dimensional Convolution (MDC) operator that manipulates large, formally dense, complex-valued frequency matrices. We present a high-performance implementation of Seismic Redatuming by Inversion (SRI), which combines algebraic compression with mixed-precision (MP) computations. First, we improve the memory footprint of the MDC operator using Tile Low-Rank Matrix-Vector Multiplications (TLR-MVMs). Mixed precision computations are then used to increase the arithmetic intensity of the operator using several FP32/FP16/INT8 MP TLR-MVM implementations.
The numerical robustness of our synergistic approach is validated on a benchmark 3D synthetic seismic dataset and its use is demonstrated on various hardware architectures, including XC40 (Intel Haswell), Apollo Series (Fujitsu ARM, AMD Rome, and NVIDIA A100), and EX (AMD Genoa). We achieve up to 63X memory footprint reduction and 12.5X speedup against the traditional dense MVM kernel. We release our code at: https://github.com/DIG-Kaust/TLR-MDC. Accelerating the Big Data Analytics Suite Accelerating the Big Data Analytics Suite Pierre Carrier (Hewlett Packard Enterprise); Scott Moe (Advanced Micro Devices, Inc; Microsoft Azure); Colin Wahl (Hewlett Packard Enterprise); and Alessandro Fanfarillo (Advanced Micro Devices, Inc) Abstract The Big Data Analytics Suite (BDAS) contains three classic machine learning codes: K-Means, Principal Component Analysis (PCA), and Support Vector Machine (SVM). This article describes how the three CPU codes, originally written in R, have been rewritten in C++ with HIP and MPI, and recast into GEMM-centric operations, taking full advantage of the heterogeneous architecture of the Frontier system. The new accelerated implementation of K-Means is now 80% GEMM-centric, PCA is 99% GEMM-centric, and finally, a new implementation in SVM will make it 20% GEMM-centric. Once completed in SVM, the entire machine learning suite will be GEMM-driven. A discussion about AMD Tensile optimization of the GEMM operation adapted to extremely tall-and-skinny matrices in BDAS will be included. The improvements from the original CPU R codes to the new accelerated versions, referenced to the same number of Frontier nodes in use, are 320X, 360X and 120X, respectively, for K-Means, PCA, and SVM. Future integration with Python will be discussed, especially in the context of Dragon, and inclusion of various precision types. Paper, Presentation 10:30am-11:00amCUG2023 General Session CUG2024 Site Presentation CUG2024 Site Presentation Maciej Cytowski (Pawsey) Abstract Plenary session to hear from the CUG2024 site representative. Plenary 11:00am-12:00pmTechnical Session 6A Jussi Heikonen Frontier As a Machine for Science: How To Build an Exascale Computer and Why You Would Want To Frontier As a Machine for Science: How To Build an Exascale Computer and Why You Would Want To Bronson Messer and Jim Rogers (Oak Ridge National Laboratory) Abstract We will present details of the project to build and operate Frontier--the world's first exascale computer at Oak Ridge National Laboratory--with a decidedly different set of emphases than the well-known "speeds and feeds" for the machine. First, we will describe some of the technical details of several of the facilities-related innovations necessary to field a supercomputer at this scale, including methods to deal with the prodigious power densities required, the move from medium-temperature to high-temperature water for the removal of waste heat, and the structural changes necessary in the datacenter to accommodate the sheer mass of the machine. Then, we will give an overview of the initial slate of science projects on Frontier for 2023, including projects from the Exascale Computing Project, the Oak Ridge Leadership Computing Facility's (OLCF) Center for Accelerated Application Readiness (CAAR), and the INCITE allocation program. We will present early results from a number of these projects and review the aims of several others.
Our aim is to highlight the totality of the collaboration necessary between the OLCF, vendors, subcontractors, scientists, and software developers to ultimately realize the full promise of exascale computing to advance science across a range of disciplines. The Ookami Apollo80 system: Progress, Challenges and Next Steps The Ookami Apollo80 system: Progress, Challenges and Next Steps Eva Siegmann and Robert Harrison (Stony Brook University) Abstract In this talk, we will share experiences with running the Apollo80 system, Ookami. Ookami is an NSF-funded testbed located at Stony Brook University. Since January 2021 it has given researchers worldwide access to 176 Fujitsu A64FX processors. To date, nearly 100 projects and 270 users have been onboarded. Since October 2022, Ookami has also been an ACCESS resource provider; ACCESS is a network of advanced computational resources within the US. Paper, Presentation Technical Session 6B Bilel Hadri Just One More Maintenance: Operating the Perlmutter Supercomputer While Upgrading to Slingshot 11 Just One More Maintenance: Operating the Perlmutter Supercomputer While Upgrading to Slingshot 11 Douglas Jacobsen (NERSC) Abstract The Perlmutter Supercomputer was originally integrated with the Slingshot 10 NIC, and as part of the Phase 2 Integration of the system, the compute, storage, and management systems were all upgraded to Slingshot 11. Owing to the lengthy nature of this process, NERSC leveraged the flexibility of the Cray EX platform to iteratively upgrade the system, one cabinet at a time, to Slingshot 11. In this presentation we will discuss the process for doing this, the benefits we enjoyed as well as the lessons we learned in doing so. This includes detailed discussions of various aspects of Slingshot software and configuration management, early experiences transitioning between the various LNet networking drivers (ko2iblnd for SS10, to ksocklnd for hybrid SS10/SS11, to kfilnd for SS11) for Lustre and DVS, hard-learned best practices for stabilizing the Slingshot network while expanding it (to add cabinets), as well as a description of the tools we use to keep all this working. Finally, we'll close with the current state of affairs on the system and the experiences and lessons we've learned in making the fully SS11 system operational. Orchestration of Exascale Fabrics using the Slingshot Fabric Manager: Practical Examples from LUMI and Frontier Orchestration of Exascale Fabrics using the Slingshot Fabric Manager: Practical Examples from LUMI and Frontier Jose Mendes and Forest Godfrey (Hewlett Packard Enterprise) Abstract The large number of devices required to support Exascale-sized fabrics presents a unique set of challenges for orchestration and configuration management software. These challenges occur not only in the large size of configuration files but in other aspects of the software such as responsiveness, end user interface, software reliability, and consistency of operations. These challenges sometimes require pragmatic rather than aesthetically appealing solutions. Varied workloads across the Slingshot customer base increase the complexity of the choices made. Paper, Presentation Technical Session 6C Jim Williams Slingshot and HPC Storage – Choosing the right Lustre Network Driver (LND) Slingshot and HPC Storage – Choosing the right Lustre Network Driver (LND) John Fragalla (HPE) Abstract Slingshot is HPE’s flagship HPC interconnect, and efficient data access is critical for HPC application performance.
When upgrading from Slingshot-10 to Slingshot-11, building new configurations starting with Slingshot-11 (with the HPE Slingshot NIC), attaching non-performant HPC compute nodes, and/or externally routing a Lustre Network (LNET), there are several Lustre network drivers (LNDs), such as kfilnd, ko2iblnd, and ksocklnd, that are recommended alone or in combination based on various criteria. In this presentation, we will recommend which LND should be used when, show performance results for each of the drivers, and outline storage connectivity considerations when moving from Slingshot-10 to Slingshot-11. Journey in Slingshot HSN segmentation using VLANs Journey in Slingshot HSN segmentation using VLANs Chris Gamboni and Miguel Gila (Swiss National Supercomputing Centre) Abstract The Swiss National Supercomputing Center (CSCS) based in Lugano, Switzerland, offers HPC facilities to academic users where co-investing customers can purchase "dedicated" scientific computing capacity. Paper, Presentation 1:00pm-2:30pmTechnical Session 7A Eva Siegmann Porting a large cosmology code to GPU, a case study examining JAX and OpenMP. Porting a large cosmology code to GPU, a case study examining JAX and OpenMP. Nestor Demeure (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Theodore Kisner and Reijo Keskitalo (Computational Cosmology Center, Lawrence Berkeley National Laboratory; Department of Physics, University of California Berkeley); Rollin Thomas (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Julian Borrill (Computational Cosmology Center, Lawrence Berkeley National Laboratory; Space Sciences Laboratory, University of California Berkeley); and Wahid Bhimji (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract In recent years, a common pattern has emerged where numerical software is designed around a Python interface calling high-performance kernels written in a lower-level language. (A minimal JAX illustration of this pattern appears below.) Adding GPU support to TurboGAP. Towards exascale molecular dynamics with machine learning potentials Adding GPU support to TurboGAP. Towards exascale molecular dynamics with machine learning potentials Cristian-Vasile Achim and Martti Louhivuori (CSC - IT Center for Science Ltd.), Miguel Caro (Aalto University), and Jussi Heikonen (CSC - IT Center for Science Ltd.) Abstract TurboGAP is a state-of-the-art Fortran code for atomistic simulations with machine-learning-based interatomic potentials within the Gaussian approximation potential framework. It is a relatively large code that offers different levels of functionality and it is still in development. This is why porting all code to C/C++ is not a viable option. Our main focus is improving the prediction of energies and forces, in particular increasing the number of operations per node by offloading as much as possible to GPUs. The LUMI supercomputer is the main target of our work; however, we also want to enable users to perform calculations on their own GPU-enabled machines. Directive-based approaches are in theory portable and require very few changes to the code, but in practice this depends on compiler availability and on the ability of individual users to set up the environment. Furthermore, the performance is not guaranteed. This is why native GPU programming was preferred for offloading selected hotspots. While increasing the implementation complexity, this approach ensures reasonable portability without losing performance.
In our presentation we will explain in detail the pros and cons of the mentioned approaches, the strategy used for porting, and early performance results for A100 and MI250X devices. Scalable High-Fidelity Simulation of Turbulence With Neko Using Accelerators Scalable High-Fidelity Simulation of Turbulence With Neko Using Accelerators Niclas Jansson, Martin Karp, Jacob Wahlgren, and Stefano Markidis (KTH Royal Institute of Technology) and Philipp Schlatter (KTH Royal Institute of Technology) Abstract Recent trends toward including more diverse and heterogeneous hardware in High-Performance Computing are challenging scientific software developers in their pursuit of efficient numerical methods with sustained performance across a diverse set of platforms. As a result, researchers are today forced to re-factor their codes to leverage these powerful new heterogeneous systems. We present Neko – a portable framework for high-fidelity spectral element flow simulations. Unlike prior works, Neko adopts a modern object-oriented Fortran 2008 approach, allowing multi-tier abstractions of the solver stack and facilitating various hardware backends ranging from general-purpose processors and accelerators down to exotic vector processors and Field-Programmable Gate Arrays (FPGAs). Focusing on the performance and portability of Neko, we describe the framework's device abstraction layer managing device memory, data transfer and kernel launches from Fortran, allowing for a solver written in a hardware-neutral yet performant way. Accelerator-specific optimisations are also discussed, with auto-tuning of key kernels and various communication strategies using device-aware MPI. Finally, we present performance measurements on a wide range of computing platforms, including the EuroHPC pre-exascale system LUMI, where Neko achieves excellent parallel efficiency for a large DNS of turbulent fluid flow using up to 80% of the entire LUMI supercomputer. Paper, Presentation Technical Session 7B Chris Fuson Observability, Monitoring, and In Situ Analytics in Exascale Applications Observability, Monitoring, and In Situ Analytics in Exascale Applications Dewi Yokelson (University of Oregon), Oskar Lappi (University of Helsinki), Srinivasan Ramesh (NVIDIA), Miikka Vaisala (Academia Sinica), Kevin Huck (University of Oregon), Touko Puro (University of Helsinki), Boyana Norris (University of Oregon), Maarit Korpi-Laag (Aalto University), Keijo Heljanko (University of Helsinki), and Allen Malony (University of Oregon) Abstract With the rise of exascale systems and large, data-centric workflows, the need to observe and analyze high performance computing (HPC) applications during their execution is becoming increasingly important. HPC applications are typically not designed with online monitoring in mind; therefore, the observability challenge lies in being able to access and analyze interesting events with low overhead while seamlessly integrating such capabilities into existing and new applications. We explore how our service-based observation, monitoring, and analytics (SOMA) approach to collecting and aggregating both application-specific telemetry data and performance data addresses these needs. We present our SOMA framework and demonstrate its viability with LULESH, a hydrodynamics proxy application. Then we focus on Astaroth, a multi-GPU library for stencil computations, highlighting the integration of the TAU and APEX performance tools and SOMA for application and performance data monitoring.
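The cosmology-porting case study in Technical Session 7A above describes the now-common pattern of a Python front end dispatching to compiled high-performance kernels, with JAX as one of the approaches it examines. Below is a minimal, generic illustration of that pattern with JAX; the kernel is an invented toy N-body acceleration, not code from the case study, and the sizes are arbitrary.

    # Toy illustration of the "Python interface, compiled kernel" pattern with JAX.
    # jax.jit traces the function once and compiles it for the available backend
    # (CPU or GPU), so the same Python-level code runs on either.
    import jax
    import jax.numpy as jnp

    @jax.jit
    def accelerations(pos, mass, eps=1e-3):
        # pairwise displacements, shape (N, N, 3)
        d = pos[None, :, :] - pos[:, None, :]
        r2 = (d * d).sum(-1) + eps**2
        inv_r3 = jnp.where(r2 > eps**2, r2 ** -1.5, 0.0)   # zero self-interaction
        return (d * (mass[None, :, None] * inv_r3[:, :, None])).sum(axis=1)

    key = jax.random.PRNGKey(0)
    pos = jax.random.normal(key, (1024, 3))
    mass = jnp.ones(1024)
    print(accelerations(pos, mass)[:2])

The appeal of the pattern, as the abstract notes, is that the numerics stay in Python while the heavy lifting is compiled; the talk examines this JAX route alongside an OpenMP-based one.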
Assessing Memory Bandwidth on ARCHER2 and LUMI Using CAMP Assessing Memory Bandwidth on ARCHER2 and LUMI Using CAMP Wenqing Peng, Adrian Jackson, and Evgenij Belikov (EPCC, The University of Edinburgh) Abstract In this paper we present intra-node bandwidth measurements on ARCHER2 (AMD Rome) and LUMI (AMD Milan) using the open-source CAMP (Configurable App for Memory Probing) tool, which is a configurable micro-benchmark that allows varying operational intensity, thread counts and placement, and memory access patterns including contiguous, strided, various types of stencils, and random. We also gather information on power consumption from the Slurm batch scheduler to correlate it with the access patterns used. For comparison, we run another set of the measurements on a node on NEXTGenIO (Intel Ice Lake). Additionally, we extend CAMP to increase its resolution so that we can assess the range of operational intensities between zero and two in more detail compared to previous results. Moreover, we illustrate the mechanism for using custom kernels in CAMP using the dot product as an example. Our results confirm and extend previous results showing that maximum bandwidth is reached using a fraction of threads compared to the maximum number of available cores on a node. In particular, for memory access with a stride of four and for a contiguous access case, we observe up to 11% higher bandwidth using 16 threads compared to the full node using 128 cores on an ARCHER2 node and up to 15% on LUMI, especially for operational intensities below 0.5. This suggests that underpopulation may be a viable option to achieve higher performance compared to full node utilisation, and thus that benchmarking should include tests using only a fraction of the available cores per node. Additionally, sub-NUMA-node awareness may be required to reach the highest performance. (See the short operational-intensity sketch below.) Overview of SPEC HPC Benchmarks and Details of the SPEChpc 2021 Benchmark Overview of SPEC HPC Benchmarks and Details of the SPEChpc 2021 Benchmark Robert Henschel (Indiana University) and Veronica Melesse Vergara (Oak Ridge National Laboratory) Abstract The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse standardized benchmarks and tools to evaluate performance and energy efficiency for the newest generation of computing systems. The SPEC High Performance Group (HPG) focuses specifically on developing industry-standard benchmarks for HPC systems and has a track record of producing high-quality benchmarks serving both academia and industry. This talk provides an overview of the HPC benchmarks that are available from SPEC and SPEC HPG and then dives into the details of the newest benchmark, SPEChpc 2021. This benchmark covers all prevalent programming models, supports hybrid execution on CPUs and GPUs and scales from a single node all the way to thousands of nodes and GPUs. In addition to talking about the architecture and use cases of the benchmark, results on relevant HPE/Cray systems will be presented and an outlook of future benchmark directions will be given.
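The CAMP abstract above sweeps operational intensities between zero and two and uses a dot product as a custom kernel; the small worked sketch below shows the bookkeeping behind those terms and the naive roofline bound they imply. The peak FLOP rate and memory bandwidth are placeholder assumptions, not measured ARCHER2 or LUMI figures.

    # Operational intensity of a float64 dot product, plus a naive roofline bound.
    # Peak numbers below are placeholders, not ARCHER2/LUMI measurements.
    N = 10**8
    flops = 2 * N                  # one multiply and one add per element pair
    bytes_moved = 2 * N * 8        # two float64 input streams read from memory
    oi = flops / bytes_moved       # = 0.125 flop/byte, well inside CAMP's 0..2 range

    peak_flops = 3.0e12            # assumed node peak, flop/s
    peak_bw = 3.0e11               # assumed node memory bandwidth, byte/s

    attainable = min(peak_flops, oi * peak_bw)
    print(f"OI = {oi:.3f} flop/byte, roofline bound = {attainable / 1e9:.1f} Gflop/s")

At such low intensities the bound is set entirely by memory bandwidth, which is why the abstract's observation that a fraction of the cores can already saturate bandwidth translates directly into the underpopulation advice it gives.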
Paper, Presentation Technical Session 7C Tina Declerck VASP Performance on HPE Cray EX Based on NVIDIA A100 GPUs and AMD Milan CPUs VASP Performance on HPE Cray EX Based on NVIDIA A100 GPUs and AMD Milan CPUs Zhengji Zhao and Brian Austin (Lawrence Berkeley National Laboratory); Stefan Maintz (NVIDIA); and Martijn Marsman (University of Vienna, VASP Software GmbH) Abstract NERSC’s new supercomputer, Perlmutter, an HPE Cray EX system, has recently entered production. NERSC users are transitioning from a Cray XC40 system based on Intel Haswell and KNL processors to Perlmutter, with NVIDIA A100 GPUs and AMD Milan CPUs offering more on-node parallelism and NUMA domains. VASP, a widely-used materials science code that uses about 20% of NERSC's computing cycles, has been ported to GPUs using OpenACC. For applications to achieve optimal performance, features specific to Cray EX must be explored, including the build and runtime options. In this paper, we present a performance analysis of representative VASP workloads on Perlmutter, addressing practical questions concerning hundreds of VASP users: What types of VASP workloads are suitable to run on GPUs? What is the optimal number of GPU nodes to use for a given problem size? How many MPI processes should share a GPU? What Slingshot options improve VASP performance? Is it worthwhile to enable OpenMP threads when running on GPU nodes? How many threads per task perform best on Milan CPU nodes? What are the most effective ways to minimize charging and energy costs when running VASP jobs on Perlmutter? This paper will serve as a Cray EX performance guide for VASP users and others. Containerization Workflow and Performance of Weather and Climate Applications Containerization Workflow and Performance of Weather and Climate Applications Usama Anber and Paulo Souza (Hewlett Packard Enterprise) Abstract Containers are a relatively new concept for weather and climate applications due to their immense complexity and infrastructure, platform, and environment dependencies. The accuracy of predictions made by these applications requires a collaborative effort from the scientific community in academia and national laboratories. However, the setup and configuration of these applications is not a straightforward task and often fails on different HPC architectures. In this paper, we show that containers offer a solution to this problem. We first demonstrate the containerization workflow of these applications, which facilitates portability and bursting out into the public cloud. We also demonstrate scalability and performance in comparison to the bare-metal version of the applications, which is another major concern to the weather and climate community. Integration of Modern HPC Performance Analysis in Vlasiator for Sustained Exascale Integration of Modern HPC Performance Analysis in Vlasiator for Sustained Exascale Camille Coti (Ecole de Technologie Superieure); Yann Pfau-Kempf, Markus Battarbee, and Urs Ganse (University of Helsinki); Sameer Shende, Kevin Huck, and Jordi Rodriguez (University of Oregon); Leo Kotipalo (University of Helsinki); Allen Malony (University of Oregon); and Minna Palmroth (University of Helsinki, Finnish Meteorological Institute) Abstract Delivering sustained exascale applications requires a dedicated development and optimization effort to leverage the power of heterogeneous architectures, complex memory hierarchies, and scalable interconnect technologies fueling exascale computing.
The challenge is complicated by the fact that application codes might need to use hybrid programming methods, more sophisticated memory management, and adaptive algorithms to leverage exascale capabilities. However, it is also the case that the factors contributing to performance variability will be more prevalent and difficult to understand. Thus, it is important that exascale environments include robust performance analysis technologies that application teams can integrate within their code development and use to conduct experiments that can help elucidate performance problems. The paper reports on a collaboration to integrate exascale-ready performance tools (TAU and APEX from the University of Oregon) with the Vlasiator application, a leading simulation code from the University of Helsinki for modeling the space plasma environment of the Earth. Our goal is to show the benefits of modern performance tool integration in Vlasiator as it is being ported to the EU’s flagship supercomputer in Finland (LUMI). In addition to presenting Vlasiator performance results, we will offer useful guidance and approaches for other exascale application projects. Paper, Presentation 2:45pm-4:15pmTechnical Session 8A Tina Declerck HPC Cluster CI/CD Image Build Pipelining HPC Cluster CI/CD Image Build Pipelining Travis Cotton (Los Alamos National Laboratory) Abstract Building images for HPC clusters tends to be a monolithic process, requiring complete rebuilds when new packages or configurations are added and when existing images are updated. It is also generally a manual process, heavily involving a system administrator and system-specific custom tooling. Rebuilding images from scratch can be time consuming and updating existing images can introduce unwanted/unexpected changes to production systems. These problems can be mitigated by using existing container models, creating and layering images to produce the final “production”-ready result. This allows for rapid turnaround and a guarantee that existing layers remain unchanged while safely updating others. We can then leverage this layer-based image building to allow for a more automated process using Continuous Integration/Continuous Delivery (CI/CD) pipelines. Leveraging standard tools for configuration management and version control combined with the OCI standard of layer-based image building and CI/CD pipelines, we can create an automated and even distributed image-building workflow while remaining customizable for specific sites and systems. LA-UR-23-20251 The Tri-Lab Operating System Stack (TOSS) on Cray EX Supercomputers The Tri-Lab Operating System Stack (TOSS) on Cray EX Supercomputers Adam Bertsch and Trent D'Hooge (Lawrence Livermore National Laboratory) Abstract We present the Tri-Lab Operating System Stack (TOSS) running on production Cray EX supercomputers and targeted for the El Capitan exascale platform. TOSS is a RedHat-derived operating system that has been enhanced at Lawrence Livermore National Laboratory (LLNL) to streamline High Performance Computing (HPC) workloads and systems management. TOSS runs LLNL’s four Cray EX supercomputers, three of which are on the Top 500 supercomputer list, in addition to dozens of commodity HPC clusters. TOSS includes resource management with Flux, configuration management via Ansible, the Lustre parallel filesystem, and a collection of tools for node and image management. TOSS also supports desirable features including high availability for image servers and zero-downtime rolling updates.
We compare and contrast the TOSS system management philosophy with Cray System Management (CSM). We point out the interfaces between TOSS and low level components of the Cray EX ecosystem, enabled by open source software and open interfaces to proprietary components, as well as demonstrate compatibility with higher level Cray software such as Cray PE. We discuss challenges encountered throughout the process, presenting a case study of our journey to deploying an open software stack on Cray EX. Manta, a cli to simplify common Shasta operations Manta, a cli to simplify common Shasta operations Manuel Sopena Ballesteros (Swiss National Supercomputing Centre, ETH Zurich) Abstract CSM is a great implementation of infrastructure as code that enables the Swiss National Supercomputing Centre (CSCS) to embrace the multi-tenancy paradigm on their HPE Cray EX infrastructures (Alps and PreAlps). But with these new capabilities, new engineering challenges arise. Compared to traditional HPC environments, the complexity of operating the environment has increased, and things like debugging the provisioning and configuration of the nodes have become more complicated: typical operations like reading the CFS Ansible logs or opening the node console require good knowledge of Kubernetes. In this presentation we will introduce Manta, a CLI built to address some of these issues, simplifying the most common operations related to compute node operation (boot, reboot, etc.) and configuration with CFS, using HSM groups as a reference framework. Paper, Presentation Technical Session 8B Cristian-Vasile Achim Header Only Porting: a light-weight header-only library for CUDA/HIP porting Header Only Porting: a light-weight header-only library for CUDA/HIP porting Martti Louhivuori (CSC - IT Center for Science Ltd.) Abstract LUMI, the fastest supercomputer in Europe, is an HPE Cray EX system that uses AMD MI250X GPUs in its GPU partition. To target the MI250X GPUs one can use the HIP portability layer. STAX, HPC meta-containers from edge to core workflow orchestration STAX, HPC meta-containers from edge to core workflow orchestration Jonathan Sparks (Hewlett Packard Enterprise) Abstract Containerized computing has changed the landscape for developers and deployment of science-based codes, including HPC, AI, and enterprise ISV applications. As new system architectures become more complex and diverse regarding choices of CPUs, accelerators, and networks, these requirements impose a tremendous challenge on application developers seeking to create portable and performant containerized applications. This paper will present a higher-level abstraction of containers – STAX. Whereas a container typically encapsulates a single application with associated libraries and configuration, a STAX container is a meta-container, including metadata for orchestration, components for portability between HPC compute environments, and productivity tools for enhanced workflow execution. We will discuss and demonstrate how the HPE Cray containerized programming environment addresses the portability and migration of workflows at the “edge” and the “core” for HPC. Paper, Presentation Technical Session 8C Zhengji Zhao Performance Study on CPU-based Machine Learning with PyTorch Performance Study on CPU-based Machine Learning with PyTorch Smeet Chheda, Anthony Curtis, Eva Siegmann, and Barbara Chapman (Stony Brook University) Abstract Over the past decade we have seen a surge in research in Machine Learning.
Deep neural networks represent a subclass of machine learning and are computationally intensive. Traditionally, GPUs have been leveraged to accelerate the training of such deep networks by taking advantage of parallelization and the many-core architecture. As the datasets and models grow larger, scaling the training or inference task can help reduce the time to solution for research or production purposes. The supercomputer Fugaku established state-of-the-art results in multiple benchmarks in machine learning by scaling ARM-based CPU technology. To that end, we study and present the performance of machine learning training and inference tasks on the 64-bit ARM CPU architecture by exploiting its features, namely the Scalable Vector Extension (SVE) in ARMv8-A. For our work, we utilize the Ookami testbed equipped with A64FX processors in the FX700 system and the Stampede2 Cluster equipped with Intel Skylake processors for performance comparisons, including throughput with respect to peak power consumption. (See the short thread-sweep sketch below.) Towards Training Trillion Parameter Models on HPE GPU Systems Towards Training Trillion Parameter Models on HPE GPU Systems Pierre Carrier, Manjunath Sripadarao, Shruti Ramesh, and Stephen Fleischman (Hewlett-Packard Enterprise) Abstract Large Language Models (LLMs) are deep neural networks with hundreds of billions of weight parameters, typically trained as next-word predictors, with attention, over trillions of text tokens. They can be adapted with very little (1-shot) or no fine-tuning (zero-shot) to a wide range of natural language tasks. The current industry standard is a third-generation transformer model, the Generative Pre-trained Transformer (GPT-3), with sizes of up to 530 billion parameters. A Deep Dive into NVIDIA's HPC Software A Deep Dive into NVIDIA's HPC Software Jeff Hammond, Jeff Larkin, and Axel Koehler (NVIDIA) Abstract The NVIDIA HPC SDK provides a full suite of compilers, math libraries, communication libraries, profilers, and debuggers for the NVIDIA platform. The HPC SDK is freely available to all developers on x86, Arm, and Power platforms and is included on HPE systems with NVIDIA GPUs. This presentation will provide a description of the technologies available in the HPC SDK and an update on recent developments. We will discuss NVIDIA's preferred methods of programming to the NVIDIA platform, including support for parallel programming in ISO C++, Fortran, and Python, compiler directives, and CUDA C++ or Fortran. Paper, Presentation 4:20pm-4:30pmCUG 23 Closing Plenary | Friday, May 12th8:00am-4:30pmSustainable HPC Operations Workshop (in Kajaani) Workshop in Kajaani |
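Relating to the CPU-based PyTorch study in Technical Session 8C above, which measures training and inference performance across thread counts and architectures: below is a minimal sketch of the kind of intra-op thread sweep such a study implies. The toy model, batch size, and thread counts are illustrative assumptions, and nothing here is specific to A64FX, SVE, or the Ookami and Stampede2 systems measured in the paper.

    # Sweep PyTorch's intra-op thread count and measure CPU inference throughput.
    # The model, batch size, and thread counts are toy assumptions for illustration.
    import time
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024)).eval()
    x = torch.randn(256, 1024)

    for threads in (1, 4, 16, 48):
        torch.set_num_threads(threads)          # intra-op parallelism
        with torch.inference_mode():
            model(x)                            # warm-up pass
            t0 = time.perf_counter()
            for _ in range(50):
                model(x)
            dt = time.perf_counter() - t0
        print(f"{threads:3d} threads: {50 * x.shape[0] / dt:,.0f} samples/s")

Sweeps like this expose the same effect the CAMP and PyTorch abstracts both point to: throughput does not necessarily keep scaling with core count, so the best thread count, and its power cost, is worth measuring rather than assuming.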
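The ISO-language parallelism mentioned in the A Deep Dive into NVIDIA's HPC Software abstract above can be as simple as the C++ standard parallel algorithms. The sketch below is an illustration rather than material from the presentation: a DAXPY written with std::transform and an execution policy, which a recent CPU toolchain runs multi-threaded and which the NVIDIA HPC SDK's nvc++ compiler can offload to a GPU via its -stdpar option.

```cpp
// Sketch: DAXPY with the ISO C++17 parallel algorithms.
// CPU:  g++ -std=c++17 -O2 daxpy_stdpar.cpp -ltbb
// GPU:  nvc++ -std=c++17 -stdpar=gpu daxpy_stdpar.cpp   (NVIDIA HPC SDK)
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
  const std::size_t n = 1 << 24;
  const double a = 2.5;
  std::vector<double> x(n, 1.0), y(n, 2.0);

  // y = a*x + y, element-wise, under an unordered parallel execution policy.
  std::transform(std::execution::par_unseq,
                 x.begin(), x.end(), y.begin(), y.begin(),
                 [a](double xi, double yi) { return a * xi + yi; });

  std::printf("y[0] = %.2f\n", y[0]);  // expect 4.50
  return 0;
}
```

Compiler directives (OpenMP, OpenACC) and CUDA C++ or Fortran remain available when finer control over data movement and kernels is needed, as the abstract notes.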