Birds of a Feather Programming Environments, Applications, and Documentation (PEAD) The PEAD (Programming Environments, Applications, and Documentation) is a CUG Special Interest Group that provides a forum for discussion and information exchange between CUG sites and Cray/HPE. The group's focus includes system usability, performance of programming environments (including compilers, libraries, and tools), scientific applications running on Cray/HPE systems, user support, communication, and documentation. The group hosts meetings at CUG each year to help foster discussions on these topics between HPE and member sites.
Following a successful event at last year's CUG, this year the PEAD SIG will meet Sunday, May 05, from 1:00 PM - 5:00 PM. We are planning topics surrounding the HPE PE roadmap, training collaborations, HPE documentation, as well as Fortran support. All topics will be interactive and discussion-based. Registration for the event is required. Lunch will be available for everyone who registers for the meeting. Birds of a Feather Programming Environments, Applications, and Documentation (PEAD) Collaborative Development of HPC Training Materials Lipi Gupta (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Ann Backhaus (Pawsey Supercomputing Research Centre), and Jane Herriman (Lawrence Livermore National Laboratory) Abstract HPC resources can only be used effectively when they are accessible to their users. Therefore, as CUG member sites continue to invest in HPE-Cray systems and software, it is critical to offer training sessions, training resources, and documentation to educate their seasoned scientific users about developments in their HPC resources and to bring newer users up to speed on HPC topics more broadly. There is often significant commonality in existing and needed training materials across CUG member sites. To improve the efficacy of their efforts, many HPC specialists working on training and education have expressed interest in collaborating in the curation, sharing, creation, and improvement of HPC and HPE-Cray specific training materials. This collaborative effort will improve learning outcomes for users at each site and improve scientific throughput. Therefore, we propose holding a BoF session to facilitate this collaboration.
Birds of a Feather BoF 1B High Performance Data-centre Digital Twins Matthias Maiterth (Oak Ridge National Laboratory); Tim Dykes and Jess Jones (HPE HPC/AI EMEA Research Lab); Adrian Jackson and Michele Weiland (EPCC, The University of Edinburgh); and Wes Brewer (Oak Ridge National Laboratory) Abstract Digital Twins are the basis of a rapidly growing field of study that pairs physical systems with their digital representation to enhance understanding of the physical counterpart. Digital twins of data-centres are rapidly becoming a reality, using bi-directional feedback loops to link operational telemetry with models of associated sub-systems, which are combined with visualisation and analytics components. Such digital twins promise improved system behaviour prediction, optimised system operation, and ultimately better-informed decision-making for operating large-scale compute centre installations with associated power, cooling, and management infrastructure. Birds of a Feather BoF 1A OpenCHAMI for collaborators and the collaborator-curious Travis Cotton and Alex Lovell-Troy (Los Alamos National Laboratory) Abstract OpenCHAMI, or OCHAMI (Open, Composable, Heterogeneous, Adaptable, Management Infrastructure), was founded in 2023 as a collaboration between CSCS, the University of Bristol, NERSC, LANL, and HPE to foster development of open source software originally written to manage Cray/HPE’s exascale-class systems. The OpenCHAMI BoF at CUG will include presentations and demonstrations that share project insights and discuss future directions for enhancing collaboration in the community. Birds of a Feather BoF 1C HPE Slingshot Birds of a Feather Jesse Treger (HPE) Abstract This birds-of-a-feather session will provide an opportunity for users to ask questions and share advice on managing and using HPE Slingshot systems, as well as to hear and provide input into HPE's Slingshot software roadmap.
The HPE Slingshot software scope covers capabilities both for the administrators who operate and manage the system and the fabric, and for HPC/AI application writers and users of the HPE Slingshot NIC’s Libfabric provider. Users will be encouraged to share desired use cases, learnings, and best-known methods. Birds of a Feather BoF 2B Birds of a Feather on Artificial Intelligence and Machine Learning for HPC Workload Analysis (AIMLHPCWorkload2024) Kadidia Konate and Richard Gerber (Lawrence Berkeley National Laboratory) Abstract HPC systems already produce terabytes of monitoring, usage, and performance data each day, ranging from that produced by low-level hardware telemetry and error reporting systems, to hardware performance counters, to job scheduling and system logs, along with natural-language text from administrator troubleshooting tickets and notes. Systems of the future will be even larger and more complex. This will increase the challenges of monitoring and characterizing user behaviors on these systems. Meanwhile, machine learning and artificial intelligence techniques have already started to demonstrate effectiveness for characterizing and extracting knowledge from large and complex datasets, but these efforts are just beginning to realize the full value of their potential across a wide variety of domains. For these reasons, we propose the First BoF on Artificial Intelligence and Machine Learning for HPC Workload Analysis. This BoF will provide a much-needed opportunity not only for discussing cutting-edge research ideas but also for bringing together researchers working across the disciplines of data science, machine learning, statistics, applied mathematics, systems design, systems monitoring, systems resilience, and hardware architecture. The proposed BoF aims to help the community advance towards better and more efficient monitoring and understanding of the usage of large-scale computing systems.
Machine learning and deep learning insights from the CUG systems of participating BoF authors will be shared. Our approach is to share best practices and listen to the audience, constructing new paths forward for HPC workload analysis. Our objective is to gather feedback from the audience and be highly interactive. Birds of a Feather BoF 2A 2024 HPC Testathon: Experiences and Results Veronica Melesse Vergara (Oak Ridge National Laboratory), Bilel Hadri (King Abdullah University of Science and Technology), and Maciej Cytowski (Pawsey Supercomputing Research Centre) Abstract The increasing complexity of HPC architectures requires a larger number of tests for thorough system evaluation post-installation and pre-software upgrades. HPC centers and vendors employ various methodologies for system evaluation throughout its lifespan, not only at the beginning during the installation and acceptance time, but also regularly during maintenance windows. The HPC Testathon 2024, co-organised with CUG2024, is a first-of-its-kind event allowing HPC professionals to gain hands-on experience with different HPC system testing environments and tests. The event has already been supported by several CUG sites, including KAUST, ORNL, LLNL, LANL, and Pawsey, as well as Microsoft. The goal is to share some of the most useful tests among participating institutions to leverage others' experience and further improve each site’s approaches. The event builds on the success of previous workshops of the HPC System Test Working Group, with contributions from many CUG sites and vendors. Birds of a Feather BoF 2C Architecting a Cloud-based Supercomputing as-a-Service Solution Pete Mendygral and Kirti Devi (Hewlett Packard Enterprise) Abstract The requirements from emerging HPC/AI workflows are pushing HPC systems to have more cloud-like capabilities and cloud environments to have more HPC-like capabilities.
Some key examples of this are workflows that combine data acquisition, real-time analysis, model training and inference, and traditional simulation. The requirements include heterogeneous hardware, tailored to the different components of the workflow, high availability of key support services, and elastic access to resources. They may also require technologies outside of traditional HPC, such as Kubernetes. Birds of a Feather BoF 3A System Monitoring Working Group Craig West (BOM) and Stephen Leak (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract The System Monitoring Working Group (SMWG) is a CUG special interest group (SIG) that enables collaboration between HPE Cray and their customers on system monitoring capabilities. The Systems Monitoring Working Group includes representatives from many HPE Cray member sites. We meet to discuss and collaborate on issues related to system monitoring. Break Coffee Break Break Coffee Break Break Coffee Break (sponsored by Altair) Break Coffee Break (sponsored by Linaro) Break Coffee Break (sponsored by SchedMD) Break Coffee Break (sponsored by Thinlinc) Break Coffee Break (sponsored by VAST) Break Coffee Break (sponsored by Pier Group) Break Coffee Break Break Coffee Break CUG Board CUG Board & Sponsors Lunch (closed) CUG sponsors (non-HPE) are invited to join the CUG Board for an informal lunch discussion. CUG Board HPE Executive Lunch (closed) HPE Executives and representatives are invited to join the CUG Board for an informal lunch discussion. CUG Board New CUG Board / Old CUG Board Lunch (closed) Newly elected board members are invited to an informal lunch with the prior CUG Board to discuss remaining activities for the week as well as future plans. Please bring food from the standard lunch buffet to the private area at the far end of the restaurant.
CUG Program Committee Program Committee Dinner (invite only) Participants who helped with the reviews and the program committee are invited to a private event.
Meet at 6:30 pm at the Beer Corner to be seated at 7:00 pm in The Mark, which is within COMO The Treasury, https://statebuildings.com/functions/the-mark/. CUG Program Committee CUG Advisory Board Lunch Cabinet (closed) The CUG Advisory Board comprises chairs and liaisons from the special interest groups and program committee members. This session is typically led by the CUG Vice President to discuss the program, provide guidance to session chairs for the week, and to receive feedback to improve processes and content for future events. CUG Program Committee CUG Advisory Board The CUG Advisory Board comprises chairs and liaisons from the special interest groups and program committee members. This session is typically led by the CUG Vice President to receive direct feedback from the conference and improve future events. Lunch Lunch (open to PEAD and XTreme participants) Lunch Lunch (sponsored by Nvidia) Lunch Lunch (sponsored by Codee) Lunch Lunch (sponsored by Nvidia) Lunch Lunch (sponsored by Codee) Networking/Social Event WHPC+ Australasia and AMD Diversity and Inclusion Breakfast Women in High Performance Computing Australasia (WHPC+) and AMD invite you to attend a community networking breakfast from 7:00 to 8:20am at the Westin Perth, in the Banksia Room. WHPC+ was created to promote diversity in the HPC industry by encouraging new people into the field and retaining those who are already here. This event is generously sponsored by AMD, which is very supportive of the Australasian chapter. This event is conveniently located in the beautiful Westin Perth hotel so that you can easily get to the first meeting session of the day at 8:30 am in Ballroom 2. Come along to meet and learn from others who are championing diversity and inclusion in HPC!
While this event is free to attend, numbers are capped, so registration is required: https://pawsey.org.au/event/whpc-australasia-and-amd-diversity-and-inclusion-breakfast/ Presentation, Paper Technical Session 1B Chair: Jim Williams (Los Alamos National Laboratory) Enhancing HPC Service Management on Alps using FirecREST API Juan Pablo Dorsch, Andreas Fink, Eirini Koutsaniti, and Rafael Sarmiento (Swiss National Supercomputing Centre) Abstract With the evolution of scientific computational needs, there is a growing demand for enhanced resource access and sophisticated services beyond traditional HPC offerings. These demands encompass a wide array of services and use cases, from interactive computing platforms like JupyterHub to the integration of Continuous Integration (CI) pipelines with tools such as GitHub Actions and GitLab runners, and the automation of complex workflows in Machine Learning using Airflow. Automated Hardware-Aware Node Selection for Cluster Computing Manuel Sopena Ballesteros, Miguel Gila, Matteo Chesi, and Mark Klein (Swiss National Supercomputing Centre, ETH Zurich) Abstract This paper introduces algorithms for automating the grouping of compute nodes into clusters based on user-defined hardware requirements and for simultaneously identifying potential hardware failures in HPC data centers. Addressing the challenges of dynamic workloads, the algorithms extract detailed hardware information through CSM APIs, automating node selection aligned with user-defined criteria. The automation streamlines node assignment, reducing human error and expediting the selection process. Versatile Software-defined Cluster on Cray HPE EX Systems Maxime Martinasso, Mark Klein, Benjamin Cumming, Miguel Gila, and Felipe Cruz (Swiss National Supercomputing Centre, ETH Zurich) Abstract This presentation introduces the versatile software-defined cluster (vCluster), a novel set of technologies for HPC infrastructure such as Cray HPE EX systems.
This integration offers a service-oriented approach to computing resources, maintaining infrastructure independence and avoiding vendor lock-in. The vCluster technology bridges the gap between Cloud abstraction and the vertically integrated HPC stack, enabling large-scale infrastructures to support multiple scientific domains with specifically tailored services. Presentation, Paper Technical Session 1A Chair: Lena M Lopatina (LANL) CPE Updates Barbara Chapman (HPE) Abstract The HPE Cray Programming Environment (CPE) provides a suite of integrated programming tools for application development on a diverse range of HPC systems delivered by HPE. Its compilers, math libraries, communications libraries, debuggers, and performance tools enable the creation, enhancement, and optimization of application codes written using mainstream programming languages and the most widely used parallel programming models. A Deep Dive Into NVIDIA's HPC Software Jeff Larkin and Becca Zandstein (NVIDIA) Abstract NVIDIA's HPC Software enables developers to build applications that take advantage of every aspect of the hardware available to them: CPU, GPU, and interconnect. In this presentation, you will learn the latest information on NVIDIA's HPC compilers, libraries, and tools. You will learn how NVIDIA's HPC software makes application developers productive and their applications portable and performant. This presentation will give an overview of NVIDIA's HPC SDK, optimized libraries for the GPU and CPU, performance tools, scalable libraries, Python support, and more. Slurm 24.05 and Beyond Tim Wickberg (SchedMD LLC) Abstract Slurm is the open-source workload manager used on the majority of the TOP500 systems.
Presentation, Paper Technical Session 1C Chair: Chris Fuson (Oak Ridge National Laboratory) Towards the Development of an Exascale Network Digital Twin John Holmen (Oak Ridge National Laboratory); Md Nahid Newaz (Oakland University); and Srikanth Yoginath, Matthias Maiterth, Amir Shehata, Nick Hagerty, Christopher Zimmer, and Wesley Brewer (Oak Ridge National Laboratory) Abstract Exascale high performance computing (HPC) systems introduce new challenges related to fault tolerance due to the large component counts needed to operate at such scales. For example, the exascale Frontier system consists of approximately 60 million components. These counts warrant the investigation of new approaches for helping to ensure the functionality, performance, and usability of such systems. An approach explored by the ExaDigiT project is the use of digital twins to help inform decisions related to the physical Frontier system. This paper discusses a subset of ExaDigiT’s Facility Digital Twin (FDT), the Network Digital Twin (NDT), which focuses on Frontier’s network as a target use case. We present the various strategies tested and early challenges faced towards the development of an exascale NDT, with the hope that such knowledge would benefit other practitioners who are interested in developing a similar digital twin. A Performance Deep Dive into HPC-AI Workflows with Digital Twins Ana Gainaru (Oak Ridge National Laboratory); Greg Eisenhauer (Georgia Institute of Technology); and Fred Suter, Norbert Podhorszki, and Scott Klasky (Oak Ridge National Laboratory) Abstract The landscape of High-Performance Computing (HPC) is evolving. Traditional HPC simulations are merging with advanced visualization and AI techniques for analysis, resulting in intricate workflows that push the boundaries of current benchmarks and performance models.
Here we focus on workflows that couple, in near real time, digital twins and low-fidelity Artificial Intelligence (AI) simulations with ongoing experiments or high-fidelity simulations to continuously drive the latter towards optimal results. It is expected that digital twin workflows will play a crucial role in optimizing the performance of next-generation simulations and instruments. This paper highlights performance limitations for the convergence of AI digital twins and HPC simulations by modeling and analyzing several I/O strategies at scale on HPE/Cray machines. We expose the limitations of relying on existing methods that benchmark individual components for these novel workflows, and propose a performance roofline model to predict the performance of these workflows on future machines and for more complex tasks. Additional layers of analytics and visualization further complicate the performance landscape. Understanding the unique performance characteristics of these intricate HPC-AI hybrid workflows is essential for designing future architectures and algorithms that can fully harness their potential. Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC Madan Timalsina, Lisa Gerhardt, Johannes Blaschke, Nicholas Tyler, and William Arndt (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments.
The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different environments. The methods, results, and potential future directions of this research, including its application in various scientific domains, are also covered, showcasing the critical advancements made in computational methodologies through this study. Presentation, Paper Technical Session 2B Chair: Lena M Lopatina (LANL) EMOI: CSCS Extensible Monitoring and Observability Infrastructure Massimo Benini (CSCS); Jeff Hanson (HPE); and Dino Conciatore, Gianni Mario Ricciardi, Michele Brambilla, Monica Frisoni, Mathilde Gianolli, Gianna Marano, and Jean-Guillaume Piccinali (CSCS) Abstract The Swiss National Supercomputing Centre (CSCS) is enhancing its computational capabilities through the expansion of the Alps architecture, a Cray HPE EX system equipped with approximately 5000 GH200 modules, in addition to the pre-existing 1000 nodes of a diverse combination of CPUs and GPUs. CSCS has developed an Extensible Monitoring and Observability Infrastructure (EMOI), designed to manage the substantial data influx and provide insightful analysis of the infrastructure's behavior. This paper presents the architecture and capabilities of EMOI at CSCS, emphasizing its scalability and adaptability to handle the increasing volume of monitoring data generated by the Alps infrastructure. We detail the integration of the Cray System Management (CSM) and Cray System Monitoring Application (SMA) within EMOI. The paper describes our hardware infrastructure, leveraging Kubernetes for dynamic deployment of data collection and analysis tools, and outlines our GitOps strategy for efficient service management.
We also explore the distinctions in data models across various node architectures within the Alps system, focusing on power consumption data and its relevance concerning global supercomputing challenges. The insights and methodologies presented in this paper are anticipated to be beneficial not only to CSCS, but also to other HPE/Cray sites facing similar challenges in supercomputing infrastructure management. Swordfish/Redfish and ClusterStor - Using Advanced Monitoring to Improve Insight into Complex I/O Workflows. Torben Kling Petersen, Tim Morneau, Dan Matthews, and Nathan Rutman (HPE) Abstract HPC storage systems today are complex solutions. Unlike typical compute environments, a single storage component operating subpar can have a significant impact on productivity. Further, as a storage solution ages, capacity fills up and is used unevenly. Understanding these changes and the reasons for performance bottlenecks is increasingly important. With the addition of a full RESTful monitoring API based on Swordfish, we now have the tools required to improve overall storage monitoring. Swordfish is a collaboration between DMTF and SNIA that extends the DMTF Redfish interface to provide a standardized, accessible way to represent and manage storage and file systems in both individual customer and cloud environments. A custom implementation of both Swordfish and Redfish in the ClusterStor software stack provides new ways of gaining insights into the inner workings of an HPE ClusterStor E1000 storage system or either of its forthcoming descendants, C500 and E2000. This paper is intended as an introduction to this new API, including examples and guidance on how it can be used to improve storage monitoring as well as understanding of how traditional HPC and/or modern AI/ML workflows behave and evolve over time.
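As a minimal illustration of the kind of storage monitoring a Swordfish/Redfish-style API enables, the sketch below reduces a Swordfish-style volume resource to the fields a health dashboard cares about. The sample payload, the `ost0001` volume name, and the exact property layout are illustrative assumptions, not ClusterStor's actual schema; a real client would fetch the JSON over HTTPS from the storage controller instead of embedding it.

```python
import json

# Illustrative Swordfish-style volume resource (NOT vendor-exact); real
# endpoints return this kind of document from paths under /redfish/v1/.
SAMPLE_VOLUME = json.dumps({
    "@odata.id": "/redfish/v1/StorageServices/1/Volumes/ost0001",
    "Name": "ost0001",
    "Capacity": {"Data": {"AllocatedBytes": 96_000_000_000_000,
                          "ConsumedBytes": 81_600_000_000_000}},
    "Status": {"Health": "OK", "State": "Enabled"},
})

def summarize_volume(payload: str) -> dict:
    """Extract name, health, and fill percentage from a volume resource."""
    vol = json.loads(payload)
    data = vol["Capacity"]["Data"]
    return {
        "name": vol["Name"],
        "health": vol["Status"]["Health"],
        "fill_pct": round(100.0 * data["ConsumedBytes"] / data["AllocatedBytes"], 1),
    }

if __name__ == "__main__":
    print(summarize_volume(SAMPLE_VOLUME))
```

Polling such summaries per volume over time is one way uneven capacity fill, as described in the abstract, could be tracked.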
CADDY: Scalable Summarizations over Voluminous Telemetry Data for Efficient Monitoring Saptashwa Mitra, Scott Ragland, Vanessa Zambrano, Dipanwita Mallick, Charlie Vollmer, Lance Kelley, and Nithin Singh Mohan (Hewlett Packard Enterprise) Abstract In the rapidly evolving landscape of High-Performance Computing (HPC), the efficient management and analysis of telemetry data are pivotal for ensuring system robustness and performance optimization. As HPC systems scale in complexity and capability, traditional data processing methodologies struggle to meet the demands of rapid real-time analytics and large-scale data management. This paper introduces an innovative framework, Caddy, which employs a novel approach to HPC telemetry storage and interactive analysis. Built on the foundation of HPE's Slingshot interconnect and the Fabric AIOps (FAIO) system, Caddy aims to address the critical need for a memory-efficient, scalable, and real-time analytical solution for seamless monitoring over large HPC environments. Command Lines vs. Requested Resources: How Well Do They Align? Ben Fulton, Abhinav Thota, Scott Michael, and Jefferson Davis (Indiana University) Abstract In the context of high-performance computing, a significant portion of users do not develop their own code from scratch but rely on existing software packages and libraries tailored for specific scientific or computational tasks. Many of these open source scientific software packages provide a variety of methods to use them efficiently on multicore, multinode, or large-memory systems. In this paper, we examine a set of applications that users run on Indiana University supercomputers, and determine for those applications the software parameter settings controlling CPU parallelism, GPU parallelism, and memory usage. We then investigate the common ways users employ these parameters and measure the degree of success with which they take advantage of available resources.
By comparing the data collected by XALT on the command-line parameters used with the corresponding Slurm resource requests, we are able to determine the degree to which users take advantage of the resources they request. This knowledge will inform how we can better provide example usages for the software available on our systems, and will inform future software development efforts, guiding the design of more efficient, user-friendly, and adaptable tools that align closely with the specific needs of the HPC community. Presentation, Paper Technical Session 2A Chair: Jim Rogers (Oak Ridge National Laboratory) Updated Node Power Management For New HPE Cray EX255a and EX254n Blades Brian Collum and Steven Martin (Hewlett Packard Enterprise) Abstract Cray EX nodes have always supported a form of power capping that would allow customers to lower power usage of specific nodes as desired. With the introduction of the HPE Cray EX254n (NVIDIA Grace Hopper) and HPE Cray EX255a (AMD MI300a), this became critical as the overall rack power pushed beyond the maximum supported at some customer sites. With the HPE Cray EX254n in particular, the total TDP of the modules exceeds the maximum that can be delivered by the Cray EX infrastructure. This drove the decision to set a power limit on the Grace Hopper modules by default (a first for Cray EX). This presentation will walk through the design goals of the blades, how power capping is implemented in the firmware, and how to configure the power limit in a running system. The presentation will also go through how to view the current limits configured via the node controller's Redfish API and in-band tools, where applicable, and how the in-band tools interact with the out-of-band configurations.
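Out-of-band power limits of the kind described above are typically set via a Redfish PATCH. The sketch below builds the target URL and JSON body using the classic DMTF `PowerControl`/`PowerLimit` schema; the chassis id `Node0`, the 2754 W value, and the exact resource path are assumptions for illustration, since node-controller firmware versions can expose different schema locations, and the request itself is not sent here.

```python
import json

def power_limit_patch(chassis_id: str, watts: int) -> tuple[str, str]:
    """Build the URL and JSON body for a DMTF-Redfish power-limit PATCH.

    Uses the classic PowerControl schema; actual paths and property names
    vary by BMC/node-controller firmware, so treat this as illustrative.
    """
    url = f"/redfish/v1/Chassis/{chassis_id}/Power"
    body = {"PowerControl": [{"PowerLimit": {"LimitInWatts": watts}}]}
    return url, json.dumps(body)

if __name__ == "__main__":
    # Hypothetical example: cap a node chassis at 2754 W.
    url, body = power_limit_patch("Node0", 2754)
    print(url)
    print(body)
```

A real client would send this body with an authenticated HTTP PATCH to the node controller and then GET the same resource to confirm the configured limit.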
HPE Cray EX Power Monitoring Counters Steven Martin, Brian Collum, and Sean Byland (HPE) Abstract HPE Cray Power Monitoring (PM) Counters were first deployed on Cray XC30 systems, and several papers were presented at CUG in 2014 that described their use. PM Counters expose power, energy, and related metadata collected out of band directly to in-band consumers. Since their introduction, PM Counters have been supported on all blades designed for use in Cray XC and HPE Cray EX Supercomputer systems. PM Counters have continued to be important as system and application power and energy consumption continues to be a top priority for system vendors, application developers, and the wider HPC research community. Over the last decade, the design of PM Counters has remained very stable, with only minor updates to support evolving node architecture changes. This presentation will give a brief history and overview of PM Counters basics; it will then present details of PM Counters on the latest HPE Cray EX supercomputer blades announced at SC23, and then discuss opportunities and challenges in supporting PM Counters on NGI (Next Generation Infrastructure). The presentation will conclude with a reinforcement of the value of PM Counters in supporting research, development, and testing of energy efficient and sustainable HPC systems. First Analysis on Cooling Temperature Impacts on MI250x Exascale Nodes (HPE Cray EX235A) Torsten Wilde (HPE), Michael Ott (LRZ), and Pete Guyan (HPE) Abstract With the focus on sustainable data center operations, the community is moving from chilled-water cooling to warm-water cooling solutions, expecting that running at higher system inlet temperatures enables more energy-efficient facility operation. Since the overall efficiency is determined by the combination of facility infrastructure and system behavior, understanding the impact of different system inlet cooling temperatures on the system performance and efficiency is important.
This inaugural presentation covers the analysis of the impact of the inlet cooling temperature on an Exascale compute blade when running HPL and HPCG. Data was collected using the HPE-LRZ PreExascale co-design project system (HPE Cray EX2500 with four modified Frontier blades optimized for higher cooling temperature support) installed at the Leibniz Supercomputing Centre. Our analysis will show that higher cooling temperatures increase node power consumption, leading to a reduction in node performance and overall energy efficiency. Results are presented for different inlet temperatures showing that the overall node efficiency is reduced by around 6% for HPCG and 4% for HPL (25°C vs. 40°C inlet temperature). Combining facility and system, warm-water cooling is more efficient, but the most efficient cooling temperature depends on the application and the efficiency of the cooling infrastructure. EVeREST: An Effective and Versatile Runtime Energy Saving Tool Anna Yue (Hewlett Packard Enterprise, University of Minnesota) and Sanyam Mehta and Torsten Wilde (Hewlett Packard Enterprise) Abstract Amid conflicting demands for better application performance and energy efficiency, HPC systems must be able to identify opportunities to save power/energy without compromising performance, while ideally being transparent to the user. We identify three primary challenges for a successful energy-saving solution: versatility to operate across processors of different types (e.g., CPUs and GPUs) and from different vendors; effectiveness in finding energy-saving opportunities and making the right power-performance tradeoffs; and the ability to handle parallel applications involving communication. We propose Everest, a lightweight runtime tool that switches to the ideal clock frequency, computed from dynamic application characterization, for individual application phases/regions while meeting a specified performance target.
Everest achieves versatility by relying on the minimum possible set of performance events for the needed characterization and power estimation. Region-awareness and accurate computation of MPI slack time allow Everest to find enhanced energy-saving opportunities and thus save up to 20% more energy than existing solutions on CPUs. These energy savings rise to up to 30% and are more prominent on GPUs, where Everest doubly benefits from its unique idle-time characterization and from choosing to sacrifice an allowed/acceptable performance loss. Presentation, Paper Technical Session 2C Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Optimising the Processing and Storage of Radio Astronomy Data Alexander Williamson (International Centre for Radio Astronomy Research, University of Western Australia); Pascal Elahi (Pawsey Supercomputing Research Centre); Richard Dodson and Jonghwan Rhee (International Centre for Radio Astronomy Research, University of Western Australia); and Qian Gong (Oak Ridge National Laboratory) Abstract Next-generation radio astronomy telescopes are challenging existing data analysis paradigms, as they have an order of magnitude larger collecting area and bandwidth. The two primary problems encountered when processing this data are the need for storage and the fact that processing is primarily I/O-limited. An example of this is the data deluge expected from the SKA-Low Telescope of about 300 PB per year. To remedy these issues, we have demonstrated lossy and lossless compression of data from an existing precursor telescope, the Australian Square Kilometre Array Pathfinder (ASKAP), using the MGARD and ADIOS2 libraries. We find data processing is faster by a factor of 7 and achieve compression ratios from a factor of 7 (lossless) up to 37 (lossy with an absolute error bound of 0.001).
We will discuss the effectiveness of lossy MGARD compression and its adherence to the designated error bounds, the trade-off between these error bounds and the corresponding compression ratios, as well as the potential consequences of these I/O and storage improvements for the science quality of the data products. Performance and scaling of the LFRic weather and climate model on different generations of HPE Cray EX supercomputers J. Mark Bull (EPCC, The University of Edinburgh); Andrew Coughtrie (Met Office, UK); Deva Deeptimahanti (Pawsey Supercomputing Research Centre); Mark Hedley (Met Office, UK); Caoimhin Laoide-Kemp (EPCC, The University of Edinburgh); Christopher Maynard (Met Office); Harry Shepherd (Met Office, UK); Sebastiaan Van De Bund and Michele Weiland (EPCC, The University of Edinburgh); and Benjamin Went (Met Office, UK) Abstract This study presents scaling results and a performance analysis across different supercomputers and compilers for the Met Office weather and climate model, LFRic. The model is shown to scale to large numbers of nodes, meeting its design criterion of exploiting parallelism to achieve good scaling. The model is written in a Domain Specific Language embedded in modern Fortran and uses a Domain Specific Compiler, PSyclone, to generate the parallel code. The performance analysis shows the effect of algorithmic choices, such as redundant computation, and of scaling with OpenMP threads. The analysis can be used to motivate a discussion of future work to improve the OpenMP performance of other parts of the code. Finally, an analysis of the performance tuning of the I/O server, XIOS, is presented.
Disaggregated memory in OpenSHMEM applications – Approach and Benefits Clarete Crasta, Sharad Singhal, Faizan Barmawer, Ramesh Chaurasiya, Sajeesh KV, Dave Emberson, Harumi Kuno, and John Byrne (Hewlett Packard) Abstract HPC architectures often handle High Performance Data Analytics (HPDA) and Explorative Data Analytics (EDA) workloads where the working data set cannot be easily partitioned or is too large to fit into local node memory. This poses challenges for programming models such as OpenSHMEM[1] or MPI[2], where all data in the working set is assumed to fit in the memory of the participating compute nodes. Additionally, existing HPC programming models use expensive all-to-all communication to share data and results between nodes. The data and results are ephemeral and require additional work to save them for analysis by other applications or subsequent invocations of the same application. Emerging disaggregated architectures, including CXL GFAM, enable data to be held in external memory accessible to all compute nodes, thus providing a new approach to handling large data sets in HPC applications. Most HPC libraries do not currently support disaggregated memory models. In this paper, we present how disaggregated memory can be accessed by existing programming models such as OpenSHMEM and the benefits of using disaggregated memory in these models. Migrating Complex Workflows to the Exascale: Challenges for Radio Astronomy Pascal Jahan Elahi (Pawsey Supercomputing Research Centre) and Matt Austin, Eric Bastholm, Paulus Lahur, Wasim Raja, Maxim Voronkov, Mark Wieringa, Matthew Whiting, Daniel Mitchell, and Stephen Ord (CSIRO) Abstract Real-time processing of radio astronomy data presents a unique challenge for HPC centers.
The science data processing contains memory-bound codes, CPU-bound codes, portions of the pipeline consisting of large numbers of embarrassingly parallel jobs combined with large numbers of moderate- to large-scale MPI jobs, and I/O ranging from parallel I/O writing large files to small jobs writing a large number of small files, all combined in a workflow with a complex job dependency graph and real-world time constraints from radio telescope observations. We present the migration of the Australian Square Kilometre Array Pathfinder Telescope's science processing pipeline from one of Pawsey's older Cray XC systems to our HPE-Cray EX system, Setonix. We also discuss the migration from bare-metal deployment of the complex software stack to a containerized, more modular deployment of the workflow. We detail the challenges faced and how the migration unearthed issues in the original deployment of the EX system. The lessons learned in the migration of such a complex software stack and workflow are valuable for other centers. Presentation, Paper Technical Session 3B Chair: Gabriel Hautreux (CINES) Spack Based Production Programming Environments on Cray Shasta Paul Ferrell and Timothy Goetsch (LANL) Abstract The Cray Programming Environment (CPE) provided for Cray Shasta OS based clusters offers a small but solid set of tools for developers and cluster users. The CPE includes Cray MPICH, Cray Libsci, the Cray debugging tools, and support for a range of compilers – and not much else. Users expect a wide range of additional software on these clusters, and frequently request new software outside of what's provided by the CPE or what can be provided via the system packages. Foregoing our old manual installation process, the LANL HPC Programming and Runtime Environments Team has instead opted to utilize Spack as the installation mechanism for most additional software on all of our new HPE/Cray Shasta clusters.
This brings with it several distinct advantages - Spack's vast library of package recipes, well-defined software inventories, automatically generated modulefiles, and binary packages produced through our CI infrastructure. It also brings with it substantial issues – markedly higher staffing requirements, longer turnaround times for software requests, a more challenging build debug process, and questionable long-term maintainability. Our paper will detail our approach and the benefits and pitfalls of using Spack to install and maintain production software environments. Containers-first user environments on HPE Cray EX Felipe Cruz and Alberto Madonna (Swiss National Supercomputing Centre) Abstract In High-Performance Computing (HPC), managing the user environment is a critical and complex task. It involves composing a mix of software that includes compilers, libraries, tools, environment settings, and their respective versions, all of which depend on each other in intricate ways. Traditional approaches to managing user environments often struggle to find a balance between stability and flexibility, especially in large systems serving diverse user needs. Cloud-Native Slurm management on HPE Cray EX Felipe A. Cruz, Manuel Sopena, and Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre) Abstract This work introduces a cloud-native deployment of the Slurm HPC Workload Manager, leveraging microservices, containerization, and on-premises cloud platforms to enhance efficiency and scalability. Utilizing Kubernetes and Nomad's APIs alongside DevOps tools, the system automates system operations, simplifies service configuration, and standardizes monitoring. However, implementing a cloud-native architecture poses challenges, including complex containerization and resource management issues that are intrinsic to HPC.
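The Spack workflow described above ultimately reduces to building packages in an order that respects a dependency graph (Spack's real concretizer does far more). A toy sketch of that ordering step using Python's standard-library topological sorter, with hypothetical package names:

```python
# Toy sketch of dependency ordering, the core problem a package manager
# like Spack must solve before building. Package names are hypothetical.
from graphlib import TopologicalSorter

deps = {
    "myapp": {"mpich-shim", "hdf5"},   # hypothetical top-level package
    "hdf5": {"zlib"},
    "mpich-shim": set(),
    "zlib": set(),
}
order = list(TopologicalSorter(deps).static_order())
# Every package appears after all of its dependencies.
pos = {p: i for i, p in enumerate(order)}
assert all(pos[d] < pos[p] for p, ds in deps.items() for d in ds)
print(order)
```

In practice each node would also carry a compiler, variant, and version choice, which is where the real complexity (and the staffing cost noted above) comes from.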
Presentation, Paper Technical Session 3A Chair: Bilel Hadri (KAUST Supercomputing Lab) Early Application Experiences on Aurora at ALCF: Moving From Petascale to Exascale Systems Colleen Bertoni, JaeHyuk Kwack, Thomas Applencourt, Abhishek Bagusetty, Yasaman Ghadar, Brian Homerding, Christopher Knight, Ye Luo, Mathialakan Thavappiragasam, John Tramm, Esteban Rangel, Umesh Unnikrishnan, Timothy J. Williams, and Scott Parker (Argonne National Laboratory) Abstract Aurora, installed in 2023, is the newest system being prepared for production at the Argonne Leadership Computing Facility (ALCF). Throughout multiple years of preparation, the ALCF has tracked the progress of over 40 applications from the Exascale Computing Project and ALCF's Early Science Project in terms of their ability to run on Aurora and their performance on Aurora compared to other systems. In addition, the ALCF has been tracking bugs and issues reported by application developers. This broad tracking of applications in a standardized way, as well as the tracking of over 1100 bugs and issues via source code reproducers, has been essential to ensuring the usability of Aurora. It has also helped ensure a smoother transition to Aurora for applications that run on past or current production systems, such as Polaris, the ALCF's current production system. To gain insight into the current state of the applications ported to Aurora, a set of applications is compared in terms of single-GPU and single-node performance on Aurora and Polaris. On average, the Figure-of-Merit performance for the set of applications was 1.3x greater on a single GPU of Aurora than on a single GPU of Polaris. The intra-node parallel efficiency of the set of applications was similar between Aurora and Polaris.
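Per-application speedups like the Figure-of-Merit ratios above are conventionally summarized with the geometric mean, so a single outlier application cannot dominate the average. A minimal sketch; the per-application ratios below are invented, not the paper's measurements:

```python
# Geometric mean of per-application speedup ratios (hypothetical data).
import math

def geomean(xs):
    # Equivalent to (x1 * x2 * ... * xn) ** (1/n), computed in log space
    # to avoid overflow/underflow for long lists.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

fom_speedups = [1.1, 0.9, 1.6, 1.4, 1.7]   # made-up per-app FoM ratios
print(round(geomean(fom_speedups), 2))
```

Note the geometric mean of ratios is symmetric: inverting every ratio inverts the mean, which an arithmetic mean of ratios does not guarantee.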
Streaming Data in HPC Workflows Using ADIOS Greg Eisenhauer (Georgia Institute of Technology); Norbert Podhorszki, Ana Gainaru, and Scott Klasky (Oak Ridge National Laboratory); Philip Davis and Manish Parashar (University of Utah); Matthew Wolf (Samsung SAIT); Eric Suchtya (Oak Ridge National Laboratory); Erick Fredj (Toga Networks, Jerusalem College of Technology); Vicente Bolea (Kitware, Inc); Franz Pöschel, Klaus Steiniger, and Michael Bussmann (Center for Advanced Systems Understanding); Richard Pausch (Helmholtz-Zentrum Dresden-Rossendorf); and Sunita Chandrasekaran (University of Delaware) Abstract The "IO Wall" problem, in which the gap between computation rate and data access rate grows continuously, poses significant problems for scientific workflows that have traditionally relied upon the filesystem for intermediate storage between workflow stages. One way to avoid this problem in scientific workflows is to stream data directly from producers to consumers, avoiding storage entirely. However, the manner in which this is accomplished is key to both performance and usability. This paper presents the Sustainable Staging Transport (SST), an approach which allows direct streaming between traditional file writers and readers with few application changes. SST is an ADIOS "engine", accessible via standard ADIOS APIs, and because ADIOS allows engines to be chosen at run-time, many existing file-oriented ADIOS workflows can utilize SST for direct application-to-application communication without any source code changes.
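The engine-swap idea can be sketched with a small mock: the application writes through one interface, and a run-time setting decides whether bytes go to a file or directly to a stream consumer. The class and function names below are invented for illustration; the real mechanism is ADIOS2's IO/Engine abstraction with engine names such as "BPFile" and "SST".

```python
# Mock of run-time engine selection (invented API, not adios2's).
import io

class FileEngine:
    def __init__(self):
        self.sink = io.BytesIO()       # stands in for the parallel filesystem
    def put(self, data):
        self.sink.write(data)

class StreamEngine:
    def __init__(self, consumer):
        self.consumer = consumer       # stands in for a connected reader
    def put(self, data):
        self.consumer.append(data)     # delivered directly, no file

def open_engine(kind, consumer=None):
    # The swap point: same writer code, different transport, chosen at run time.
    return StreamEngine(consumer) if kind == "SST" else FileEngine()

received = []
for kind in ("BPFile", "SST"):
    engine = open_engine(kind, received)
    engine.put(b"timestep-0")          # application code is identical
print(len(received))  # → 1 (only the SST run reached the consumer)
```

Because the writer code never mentions the transport, switching a workflow from files to streaming is a configuration change rather than a source change, which is the property the abstract highlights.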
This paper describes the design of SST and presents performance results from various applications that use SST: feeding model training with simulation data at substantially higher bandwidth than the theoretical limits of Frontier's file system, strongly coupling separately developed applications for multiphysics multiscale simulation, and in situ analysis and visualization of data to complete all data processing shortly after the simulation finishes. Enrichment and Acceleration of Edge to Exascale Computational Steering STEM Workflow using Common Metadata Framework Gayathri Saranathan (Hewlett Packard Enterprise, Hewlett Packard Labs); Martin Foltin, Aalap Tripathy, and Annmary Justine (Hewlett Packard Enterprise); Ayana Ghosh, Maxim Ziatdinov, and Kevin Roccapriore (Oak Ridge National Laboratory); and Suparna Bhattacharya, Paolo Faraboschi, and Sreenivas Rangan Sukumaran (Hewlett Packard Enterprise) Abstract Computational steering of experiments with the help of Artificial Intelligence (AI) has the potential to accelerate scientific discovery. Fulfilling this promise will require innovations in workflows and algorithms for experiment control. In this work, we developed data management infrastructure that facilitates such innovation by enabling fine-grained partitioning and optimization of complex workflows between computational and experimental facilities with dynamic data sharing. This is enabled by the Common Metadata Framework (CMF), which tracks workflow data lineages and provides visibility across facility boundaries for relevant data subsets. We demonstrate the benefits on a novel Scanning Transmission Electron Microscopy (STEM) workflow of nanoparticle plasmonic transitions in materials science that crosses facility boundaries several times. The AI control model is seeded by a meta-model developed at a computational facility using data from other experimental sites to help impart prior knowledge and reduce measurement cost and sample degradation.
The model is incrementally refined by successive measurements at an experimental facility. The evolution of model uncertainties is captured in CMF and fed back to the computational facility for analysis of potential new physical phenomena. The relevant experimental results can be used to calibrate molecular dynamics simulations at the computational facility that in turn influence the AI model refinement trajectory. Presentation, Paper Technical Session 3C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) CSM-based Software Stack Overview 2024 Harold Longley and Jason Sollom (Hewlett Packard Enterprise) Abstract The Cray System Management (CSM) software stack has been enhanced within the past year to improve the operating experience for an HPE Cray EX system. Features to be discussed include more automated software installation and upgrade, system backup and using that backup for disaster recovery reinstallation, automation for concurrent rolling reboots of management nodes, enhancements in CSM Diagnostics tooling, improved SMA monitoring tools, Slingshot switch Orchestrated Maintenance and LACP traffic load sharing, a containerized login environment, new compute node features (default kernel configuration, Low Noise Mode improvements, OS Noise Detection, DVS node health monitoring, Dynamic Kernel Module Support, and containers on compute nodes), improved multi-tenancy support, and image management for the aarch64 architecture. Overview of HPCM Peter Guyan and Sue Miller (HPE) Abstract This talk will give a brief overview of how HPCM is deployed on an HPE solution. It will describe what an admin node does, what Quorum HA is, and why we use SU-Leaders. This will include when to use Quorum HA and SU-Leaders, an introduction to the "cm" command suite, and a brief overview of the monitoring tools present with HPCM.
With a new set of monitoring tools available in HPCM 1.11, the new pipeline will be described, along with how to enable the components needed. Seamless Cluster Migration in CSM Miguel Gila and Manuel Sopena Ballesteros (Swiss National Supercomputing Centre) Abstract The ability to effortlessly migrate compute clusters between sites or zones is common in the cloud world, and recently it has also become a necessity for supercomputing facilities like the Swiss National Supercomputing Centre (CSCS), where its multi-region flagship infrastructure Alps serves multiple tenants and customers, some of them with very different development and operational requirements. Presentation, Paper Technical Session 4B Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois, National Center for Supercomputing Applications) Scalability and Performance of OFI and UCX on ARCHER2 Jaffery Irudayasamy, Juan F. R. Herrera, Evgenij Belikov, and Michael Bareford (EPCC, The University of Edinburgh) Abstract OpenFabrics Interfaces (OFI) and Unified Communication X (UCX) are both transport protocols that underlie the HPE Cray MPICH library on HPC systems like ARCHER2. They can be selected at runtime by users. This paper presents the scalability and performance of the OFI and UCX transport layer protocol implementations on ARCHER2, an HPE Cray EX system that features the Slingshot 10 interconnect. We use ReproMPI microbenchmarks to study the performance of MPI collectives and run experiments using some of the most commonly used applications on ARCHER2. The results show that in most cases OFI and UCX performance is comparable at under 32 nodes (16384 cores), but for larger numbers of nodes OFI runs more reliably. Ultimately, when it comes to applications there is no one-size-fits-all solution, and profiling can facilitate tuning for best performance.
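Microbenchmark suites in the ReproMPI style repeat each collective many times and reduce the timings to robust statistics rather than reporting a single measurement. A stdlib sketch of that reduction step, applied to hypothetical per-repetition timings in microseconds:

```python
# Reduce repeated benchmark timings to robust summary statistics
# (the timing values are invented for illustration).
import statistics

def summarise(timings):
    ordered = sorted(timings)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))],
    }

ofi = [101, 99, 103, 100, 250, 102, 98, 101, 100, 99]   # one outlier rep
ucx = [97, 96, 99, 98, 97, 99, 96, 98, 97, 98]
print(summarise(ofi)["median"], summarise(ucx)["median"])
```

The median of the OFI run is barely moved by its 250 µs outlier, whereas an arithmetic mean would be; this is why min/median comparisons are preferred when contrasting transports.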
Using P4 for Cassini-3 Software Development Environment Hardik Soni, Frank Zago, Khaled Diab, Igor Gorodetsky, and Puneet Sharma (HPE) Abstract We present a novel approach for co-developing the hardware of Network Interface Card (NIC) ASICs (e.g., Slingshot Cassini-3) and their software stacks. Due to the increasing gap between network link bandwidths and the compute power of hosts, critical parts of the software stack of many applications are offloaded to NICs for efficient processing. By processing certain functions using specialized hardware blocks in NICs, compute resources can be better utilized for actual application processing. Therefore, the use cases and design of NICs rapidly evolve with advancements in the transmission capacity of network links. To reduce time-to-market for next-generation NICs with complex features implemented in hardware, we leverage the software ecosystem of programmable networks and compiler technology for the development, testing, and verification of the Slingshot NICs. Running NCCL and RCCL Applications on HPE Slingshot NIC Jesse Treger and Caio Davi (HPE) Abstract There has been a rise in Machine Learning applications on GPU-equipped High-Performance Computing systems, motivated by increasing demands for model and dataset sizes. Although there are no fundamental limitations preventing running these workloads in MPI runtimes, the in-house chip makers' communications collectives libraries have become the preferred deployments. In this presentation, we will show how these libraries work and how to take full advantage of the HPE Slingshot interconnect. We will review the use and configuration of the two most common of these, NCCL and RCCL, to run using the HPE Slingshot NIC RDMA capability. We will explain the key parameters for the given environments and provide recommendations for the optimal settings.
Enabling NCCL on Slingshot 11 at NERSC Jim Dinan (NVIDIA), Peter Harrington (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Igor Gorodetsky (HPE), Josh Romero (NVIDIA), Steven Farrell (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Ian Ziemba (HPE), and Wahid Bhimji and Shashank Subramanian (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract The NVIDIA Collective Communications Library (NCCL) is widely used for multi-node AI applications as well as other scientific codes. Initial deployments of Slingshot 11 (SS11) did not support this library at high performance, with detrimental impacts on the performance of Deep Learning applications on SS11-based HPC systems. We describe collaborative efforts between NERSC, NVIDIA, and HPE to develop capabilities for NCCL on SS11. This involved development of the Libfabric NCCL plugin and an extensive period of testing and refinement utilizing the Perlmutter HPC system at NERSC. In this presentation, we will describe the development required as well as the performance improvements measured on Perlmutter for both benchmarks and cutting-edge scientific AI applications. Presentation, Paper Technical Session 4A Chair: Lena M Lopatina (LANL) Multi-stage Approach for Identifying Defective Hardware in Frontier Nick Hagerty (Oak Ridge National Laboratory), Andy Warner (HPE), and Jordan Webb (Oak Ridge National Laboratory) Abstract In June 2022, the long-awaited exaflop compute barrier (1 quintillion floating-point operations per second) was surpassed on the TOP500 list by Frontier, an HPE Cray EX supercomputer at Oak Ridge National Laboratory (ORNL). Drawing peak power of 21.1 MW, Frontier demonstrated 1.1 exaflops of computational capability, much of which is supplied by more than 37,000 AMD Instinct MI250X graphics processing units (GPUs).
With a single GPU drawing up to 560W thermal design power (TDP), each AMD MI250X draws 2x more power under load than the NVIDIA V100 GPUs used in Frontier's predecessor at ORNL, the 200 petaflop IBM POWER9 supercomputer, Summit. There are many other major technological advances in the memory, compute, power, and infrastructure of Frontier that are new to production environments. Frontier's mission to enable ground-breaking research in U.S. energy, economic, and national security sectors is fulfilled through leadership-class workloads, which are workloads that demand greater than 20% of the supercomputer. These large workloads are vulnerable to defective and failing computing hardware. The rate of failing hardware is quantified through the mean time between failures (MTBF), the length of time between hardware-level failures anywhere in the system. In this work, we describe the multi-stage approaches to stabilizing and maintaining the functionality of the hardware on Frontier. Three strategies are discussed: the first two utilize leadership-class tests to target improving the MTBF of Frontier, while the third utilizes single-node validation to efficiently identify individual instances of defective hardware in Frontier. We provide summarized data from each of the three strategies, then classify the diverse set of failures and discuss trends in defective hardware, before discussing several key challenges to identifying defective hardware and improving the MTBF of Frontier. From Frontier to Framework: Enhancing Hardware Triage for Exascale Machines Isa Wazirzada, Abhishek Mehta, and Vinanti Phadke (Hewlett Packard Enterprise) Abstract Supercomputers are complex systems that bring together bleeding-edge technologies. Take the example of the first exascale system, Frontier, installed at Oak Ridge National Laboratory. Frontier consists of more than 9400 compute nodes embedded in the HPE Cray EX4000 infrastructure.
Each compute node consists of an AMD CPU, four MI250 GPUs interlinked by high-speed XGMI interfaces, as well as Slingshot-11 high-speed NICs. The system comprises over 150,000 node-level components interconnected in an extremely dense mechanical framework and is cooled via warm-temperature liquid cooling. As impressive as all these technologies are on their own, the real value lies in bringing them together in a system to achieve sustained performance over time. With that in mind, it behooves us to recognize that system quality attributes such as diagnosability and serviceability are critical to achieving high levels of availability throughout the service life of a system. Therefore, for HPE and our customers, providing a product-level hardware triage framework will help reduce the return-to-service time for failed components, provide a standardized approach to diagnosing hardware failures, reduce the number of no-trouble-found replacements, and minimize the need for SMEs from R&D to directly support systems in the field. Full-stack Approach to HPC Testing Pascal Jahan Elahi and Craig Meyer (Pawsey Supercomputing Research Centre) Abstract A user of a High Performance Computing (HPC) system relies on a multitude of components, both on the user-facing side, such as modules, and in lower-level system software, such as Message Passing Interface (MPI) libraries. Thus, all these different aspects must be tested to guarantee an HPC system is production ready. We present here a suite of tests that covers this larger space, which not only focuses on benchmarking or sanity checks but also provides some diagnostic information in case failures are encountered. These tests cover the job scheduler, here Slurm; the MPI library, critical for running jobs at scale; GPUs, a vital part of any energy-efficient HPC system; and the performance of compilers that are part of the Cray Programming Environment (CPE).
Some tests were critical to uncovering a number of underlying issues with the communication libraries on a newly deployed HPE-Cray EX Shasta system that had gone undetected in other acceptance tests; others identified bugs within the job scheduler. The tests are implemented in the ReFrame framework and are open source. An Approach to Continuous Testing Francine Lapid and Shivam Mehta (Los Alamos National Laboratory) Abstract The National Nuclear Security Administration Department of Energy supercomputers at Los Alamos National Laboratory (LANL) are integral to supporting the lab's mission and therefore need to be reliable and performant. To identify potential problems ahead of time while minimizing the interruption to users' work, the High Performance Computing (HPC) Division at LANL implemented a Continuous Testing framework and the necessary infrastructure to automatically and frequently run a series of tests and proxy applications. The tests, which benchmark various system components, were integrated into the Pavilion2 testing framework and are launched in small Slurm jobs across each machine every weekend. The results of the tests are summarized in a comprehensive Splunk dashboard, enabling continuous monitoring of the health of the machines over time without having to parse through each run's output and logs. This project is currently running on three of LANL's newest fleet of Cray Shasta machines – Chicoma, Razorback, and Rocinante – and will eventually be implemented on all production clusters at LANL. This paper details the different components of the Continuous Testing framework, the resulting setup, and the impact it has on our HPC workflow.
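The dashboard roll-up at the heart of such continuous-testing setups can be sketched in a few lines: collapse per-run test records into a pass rate per machine so drift is visible without reading individual logs. The record format and results below are hypothetical, not LANL's actual data.

```python
# Toy roll-up of continuous-testing results into per-machine pass rates
# (machine/test names and outcomes are invented).
from collections import defaultdict

runs = [
    ("chicoma", "stream", True), ("chicoma", "osu_bw", True),
    ("chicoma", "hpl", False), ("rocinante", "stream", True),
    ("rocinante", "osu_bw", True), ("rocinante", "hpl", True),
]

def pass_rates(records):
    totals, passes = defaultdict(int), defaultdict(int)
    for machine, _test, ok in records:
        totals[machine] += 1
        passes[machine] += ok          # bool counts as 0/1
    return {m: passes[m] / totals[m] for m in totals}

rates = pass_rates(runs)
print({m: round(r, 2) for m, r in sorted(rates.items())})
```

Tracking this rate per weekend run, rather than per test, is what lets a dashboard surface slow degradation in a machine's health over time.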
Presentation, Paper Technical Session 4C Chair: Gabriel Hautreux (CINES) LLM Serving With Efficient KV-Cache Management Using Triggered Operations Aditya Dhakal, Pedro Bruel, Gourav Rattihalli, Sai Rahul Chalamalasetti, and Dejan Milojicic (Hewlett Packard Enterprise, Hewlett Packard Labs) Abstract A large language model (LLM)'s Key-Value (KV) cache requires enormous GPU memory during inference. For faster query processing and conversational memory in chat applications, this cache is stored to answer subsequent user queries. However, if the cache is buffered on the GPU, its large memory requirement prevents multiplexing and requires cache buffering in remote storage. In current systems, transferring and retrieving the cache requires the CPU to coordinate with the GPU and push and fetch the data through the network, increasing the overall latency. This paper proposes lower-overhead KV cache storage and retrieval with SmartNICs capable of triggered operations, such as Cassini. Triggered operations enqueue pre-defined data transfer instructions on the NIC. A GPU thread can trigger these instructions once the LLM computes the KV cache for a token. Cassini Network Interface Cards (NICs) then transfer the cache, bypassing the CPU and network stack and improving data-transfer latency. Our experiments show that data transfer with triggered operations provides a 19× speedup for transfers ranging from 32 KB to 5 TB. From Chatbots to Interfaces: Diversifying the Application of Large Language Models for Enhanced Usability Jonathan Sparks, Pierre Carrier, and Gallig Renaud (Hewlett Packard Enterprise) Abstract This paper explores the application of Large Language Models (LLMs) in three distinct scenarios, demonstrating their potential to aid user experience and efficiency. Firstly, we examine the application of LLMs such as OpenAI's GPT or Llama in chatbots to assist in programming environments, providing real-time assistance to developers.
Secondly, we explore using LLMs and Python to search an internal document corpus for performance engineering, significantly improving the retrieval of relevant information from extensive technical documentation. Lastly, we investigate using LLMs as an interface to system batch schedulers, such as Slurm or PBS, replacing domain-specific languages and prompts with natural text. This approach democratizes access to complex systems, fostering ease of use and enhancing the user experience. Through these use cases, we underscore the versatility and potential of LLMs, highlighting their role as an aid to system operation and user experience. Delivering Large Language Model Platforms With HPC Laura Huber, Abhinav Thota, Scott Michael, and Jefferson Davis (Indiana University) Abstract In 2023, we saw a huge rise in the capability and popularity of large language models (LLMs). OpenAI released ChatGPT 3.0 to the public in November 2022, and since then, many closed and open-source LLMs have been released. It has been reported that training ChatGPT 3.0 took more than 10,000 GPU cards, making training a foundational LLM out of reach for many research teams and HPC centers, but there are many ways to use a pre-trained LLM with GPUs at scales available at an HPC center: for example, fine-tuning a pre-trained model with site- or application-specific data, or augmenting the model with Retrieval Augmented Generation (RAG) to add specific knowledge to a pre-trained model. The availability of open-source LLMs has opened the field for individual researchers and service providers to do their own custom training and run their own chatbots. In this paper, we describe how we deployed and evaluated open-source LLMs on Quartz and Big Red 200, a Cray EX supercomputer, and provisioned access to these deployments to a select group of HPC users.
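The retrieval step of Retrieval Augmented Generation can be sketched minimally: score site documents against the user's question and prepend the best matches to the prompt sent to the model. The documents and the naive token-overlap scoring below are toy illustrations, not a production retriever (which would use embeddings).

```python
# Minimal RAG retrieval sketch (toy documents, toy overlap scoring).
def score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)                  # shared-token count as relevance

docs = [
    "Submit jobs with sbatch and monitor them with squeue",
    "Load compilers through the module command",
    "GPU nodes require the gpu partition in your job script",
]

def build_prompt(query, docs, k=1):
    # Keep the k highest-scoring documents as context for the model.
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    return "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"

prompt = build_prompt("How do I submit jobs?", docs)
print(prompt)
```

The model never needs retraining: site knowledge enters purely through the assembled prompt, which is what makes RAG feasible at the GPU scales available to an individual center.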
System for Recommendation and Evaluation of Large Language Models for practical tasks in Science Cong Xu, Tarun Kumar, Martin Foltin, Annmary Justine, Sergey Serebryakov, Arpit Shah, Agnieszka Ciborowska, Ashish Mishra, Gyanaranjan Nayak, Suparna Bhattacharya, and Paolo Faraboschi (Hewlett Packard Enterprise) Abstract Large Language Models (LLMs) are showing an impressive ability to reason and answer complex questions after a few contextual prompts. This has transformational potential for scientific productivity and new discoveries. However, LLMs commonly suffer from hallucination problems, and their performance strongly depends on the specific task and model. Several ad-hoc LLM benchmarking studies have been reported for specific tasks in medicine, biology, and chemistry. The science community typically prefers open-source models, including derivatives of Llama-2, Galactica, etc. These models usually perform worse than the commercial GPT-4, requiring more scrutiny. In this work, we developed an automated platform that co-optimizes LLM recommendation and evaluation for user-specified tasks to make it easier for science practitioners to select and evaluate promising LLMs for their use case. Uniquely, it uses a recommender engine trained on the characteristics of a large number of training datasets, LLM models, and inference data to find a list of the best candidate models for user-specified prompt templates. It then evaluates and ranks these candidates using an automated reviewer that performs multi-metric assessment and, uniquely, iteratively revisits the evaluation to validate its truthfulness, employing the reflection approach and self-consistency technique. The evaluation platform seamlessly integrates with HPE MLDE (Machine Learning Development Environment).
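The self-consistency idea mentioned above can be sketched in its simplest form: sample several independent judgements of the same item and keep the majority verdict, so that one noisy evaluation does not decide a ranking. The votes below are hypothetical, and this is only the voting step, not the paper's full reviewer.

```python
# Majority-vote self-consistency over repeated judgements (toy data).
from collections import Counter

def self_consistent(verdicts):
    # Return the most common verdict and the fraction of votes it received.
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner, count / len(verdicts)

votes = ["correct", "correct", "hallucinated", "correct", "correct"]
verdict, agreement = self_consistent(votes)
print(verdict, agreement)  # → correct 0.8
```

The agreement fraction doubles as a confidence signal: low agreement flags items where the automated reviewer itself is unreliable and a human check is warranted.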
Presentation, Paper Technical Session 5B Chair: Adrian Jackson (EPCC, The University of Edinburgh) Leveraging GNU Parallel for Optimal Utilization of HPC Resources on Frontier and Perlmutter Supercomputers Ketan Maheshwari (Oak Ridge National Laboratory), William Arndt (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Rafael Ferreira Da Silva (Oak Ridge National Laboratory) Abstract In the realm of HPC, efficiently exploiting parallelism on large-scale systems presents a significant challenge, particularly when aiming for a low-barrier, nonprogrammatic approach. This presentation showcases the effective application of GNU Parallel for optimizing the use of four key resources at OLCF's Frontier and NERSC's Perlmutter supercomputers: CPUs, GPUs, NVMe storage, and large-scale storage systems. Our work positions GNU Parallel not merely as a utility tool but as a mainstream, productive instrument for diverse HPC workloads. We illustrate GNU Parallel's capacity to efficiently harness CPU resources across up to 9,000 Frontier compute nodes, with a manageable increase in overhead even as node utilization scales up. This demonstration includes a detailed analysis of overhead metrics, revealing that overhead remains around 5 minutes for up to 7,000 nodes and under 10 minutes for up to 9,000 nodes comprising up to 1.15 million parallel tasks. Our approach is further elucidated through a series of practical vignettes, ranging from daily tasks to massively scaled operations and underpinned by real-world applications and practical use cases. These vignettes offer generalized solutions for common computational and I/O patterns observed in real-world applications, validated through scalable examples on the Frontier and Perlmutter supercomputers.
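GNU Parallel itself is a shell tool (its basic shape is `parallel 'cmd {}' ::: inputs`); the pattern it enables is task farming: a long list of independent tasks distributed over a fixed pool of workers with no inter-task coordination. A Python analogue of the same pattern, with a stand-in workload:

```python
# Task-farming analogue of GNU Parallel's pattern: many independent
# tasks, fixed worker pool, no coordination between tasks.
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Stand-in for one independent unit of work (one file, one sample, ...).
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(100)))
print(sum(results))  # → 328350
```

Because tasks share nothing, the only scaling cost is dispatch overhead, which is why the abstract's per-launch overhead stays in minutes even at a million-task scale.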
Portable Support for GPUs and Distributed-Memory Parallelism in Chapel Andrew Stone and Engin Kayraklioglu (Hewlett Packard Enterprise) Abstract Abstract Writing performant programs on modern supercomputers means targeting parallelism at multiple levels of scale: from vectorization in a single core, to multicore CPUs, to nodes with multiple CPUs, to GPUs and nodes with multiple GPUs. Traditionally, this has meant writing applications using multiple programming models; for example, an application might be written as a combination of MPI, OpenMP, and/or CUDA. This use of multiple programming models complicates an application’s implementation, maintainability, and portability. PaCER: Accelerating Science on Setonix Maciej Cytowski and Ann Backhaus (Pawsey Supercomputing Research Centre) and Joseph Schoonover (Fluid Numerics) Abstract Abstract The Pawsey Centre for Extreme-Scale Readiness (PaCER) was a unique Australian program supporting researchers in developing and optimising codes on Setonix, Australia’s most powerful research supercomputer, based on AMD Milan CPUs and MI250X GPUs. The focus of the PaCER program was on both extreme-scale research (algorithm design, code optimisation, application and workflow readiness) and using the computational infrastructure to facilitate research for producing world-class scientific outcomes. PaCER was a collaborative partnership between Pawsey and supercomputing vendors that provided early access to HPC tools and infrastructure, training, and exclusive hackathons focused on performance at scale. PaCER supported 10 projects spanning 18 international and national institutions, involving more than 60 researchers and 10 Research Software Engineers. The biggest success of the project is the creation of a supercomputing software developers’ community, the first of its kind in Australia. 
In this talk, we will describe the collaboration model created by PaCER and cover significant achievements of the project, including the series of GPU programming mentored sprints. We will also share some of the lessons learned and future plans. Presentation, Paper Technical Session 5A Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Power and Performance analysis of GraceHopper superchips on HPE Cray EX systems Benjamin Donald Cumming and Miguel Gila (CSCS), Brian Collum (HPE), Sebastian Keller (CSCS), and Bryan Villalon and Steven James Martin (HPE) Abstract Abstract Systems with tightly-integrated CPU and GPU on the same module or package will be deployed in 2024, based on NVIDIA GH200 and AMD MI300A processors. There are some significant differences from current systems, of which the key ones for this presentation are: Accelerating Scientific Workflows with the NVIDIA Grace Hopper Platform Gabriel Noaje (NVIDIA) Abstract Abstract This session will focus on providing a demonstration of top ML frameworks, HPC applications, and tools for data science on NVIDIA's Grace Hopper and Grace CPU Superchip. These superchips are the cornerstones of versatile and power-efficient supercomputers worldwide that combine Grace CPUs, Hopper GPUs, and extreme-scale networking technology in a standards-compliant server. We'll showcase recent results from key applications like PyTorch, JAX, WRF, GROMACS, and NAMD. We'll also provide lessons learned and experiences to help guide developers creating their own applications for NVIDIA Grace superchips. This session is a strong starting point for anyone looking to understand, and develop for, NVIDIA Grace Hopper or the Grace CPU Superchip. 
GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability Andrey Alekseenko (KTH/SciLifeLab); Szilárd Páll (KTH/PDC); and Erik Lindahl (KTH, Stockholm University) Abstract Abstract GROMACS is a widely-used molecular dynamics software package with a focus on performance, portability, and maintainability across a broad range of platforms. Thanks to its early algorithmic redesign and flexible heterogeneous parallelization, GROMACS has successfully harnessed GPU accelerators for more than a decade. With the diversification of accelerator platforms in HPC and no obvious choice for a well-suited multi-vendor programming model, the GROMACS project found itself at a crossroads. The performance and portability requirements, as well as a strong preference for a standards-based programming model, motivated our choice to use SYCL for production on both new HPC GPU platforms: AMD and Intel. Since the GROMACS 2022 release, the SYCL backend has been the primary means to target AMD GPUs in preparation for exascale HPC architectures like LUMI and Frontier. SYCL is a cross-platform, royalty-free, C++17-based standard for programming hardware accelerators, from embedded to HPC. It allows using the same code to target GPUs from all three major vendors with minimal specialization, which offers major portability benefits. While SYCL implementations build on native compilers and runtimes, whether such an approach is performant is not immediately evident. Biomolecular simulations have challenging performance characteristics: latency sensitivity, the need for strong scaling, and typical iteration times as short as hundreds of microseconds. Hence, obtaining good performance across the range of problem sizes and scaling regimes is particularly challenging. Here, we share the results of our work on readying GROMACS for AMD GPU platforms using SYCL, and demonstrate performance on Cray EX235a machines with MI250X accelerators. 
Our findings illustrate that portability is possible without major performance compromises. We provide a detailed analysis of node-level kernel and runtime performance with the aim of sharing best practices with the HPC community on using SYCL as a performance-portable GPU framework. Presentation, Paper Technical Session 5C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Cray EX Security Experiences Ben Matthews (NCAR/UCAR) Abstract Abstract Security is an important, if sometimes overlooked, part of running an HPC system. We will describe the out-of-box experience with the Cray EX system from a security perspective. Several security issues (and mitigations for them) present in the Cray EX (HPCM) software stack and, perhaps, HPC systems in general will be described. The process for reporting, patching, and publishing these issues will be discussed as well as some thoughts on where to look for and how to reduce the risk posed by as yet undiscovered vulnerabilities. Finally, some general advice for securing Cray and other HPC systems will be provided. Best of Times, Worst of Times: A Cautionary Tale of Vulnerability Handling Aaron Scantlin (National Energy Research Scientific Computing Center) Abstract Abstract Vulnerabilities are an unfortunate reality in modern computing - while there's plenty of discussion around the importance of detection and patching in HPC, there's not as much chatter about how to handle vulnerabilities discovered at one's institution. While this process is currently ad-hoc and depends in large part on the maintainers of the software in question, one HPC center's recent experience with the discovery of a critical vulnerability within Lustre within COS, reporting that vulnerability to HPE, and the subsequent handling of that information by both groups suggests that there's room for improvement in a variety of areas on both sides. 
In this presentation, NERSC Security will: AIOps Empowered: Failure Prediction in System Management Software Tools Deepak Nanjundaiah and Subrahmanya Vinayak Joshi (HPE) Abstract Abstract In the realm of High-Performance Computing (HPC), our project addresses the escalating challenge of failures, particularly with the anticipated complexity surge in Exascale systems. Our innovative Semi-Supervised Failure Prediction service, applicable to various system management software tools, utilizes deep learning on telemetry data for real-time, end-to-end failure prediction. Our approach centers on deep learning models proficient in deciphering intricate patterns within extensive datasets. From data acquisition to prediction, our solution seamlessly integrates with system management software, analyzing critical metrics like CPU usage, memory status, and network activity. By learning from historical data, the model distinguishes between normal and failure states, providing real-time predictions before potential failures. With a semi-supervised learning approach using both labeled and unlabeled data, our model adapts effectively to diverse failure scenarios. Integrated with the AIOps service in system management tools, it offers organizations a proactive edge, enabling early intervention for cost reduction, minimized downtime, and improved data center efficiency. In our upcoming presentation, we will delve into a concise results overview, showcasing the benefits of our approach in enhancing predictive capabilities and providing organizations with strategic advantages in system management. This functionality is being considered for future inclusion in HPE system management products. Presentation, Paper Technical Session 6B Chair: Paul L. Peltz Jr. 
(Oak Ridge National Laboratory) POD: Reconfiguring Compute and Storage Resources Between Cray EX Systems Eric Roman and Tina Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Sean Lynn (Hewlett Packard Enterprise) Abstract Abstract This paper describes how a set of liquid-cooled compute and air-cooled storage resources can be reconfigured between Cray EX systems. In each configuration, the resources function as a native set of managed nodes and/or directly attached storage resources to the associated host system. In a normal production configuration, for example, users can run compute jobs on the reconfigured compute nodes or read and write data to the corresponding filesystems. To administrators, these compute nodes are managed like conventional compute nodes, and the storage is managed by the Neo software stack. The Slingshot network is connected to each of the possible host systems, and the systems management networks interconnect via a layer 3 EVPN VXLAN tunnel. NERSC has implemented this reconfigurable architecture on a set of liquid-cooled and air-cooled resources termed POD. POD resources have been successfully transitioned between NERSC's development, staging, and production systems and are currently configured and in use on NERSC's flagship system, Perlmutter. Zero Downtime System Upgrade Strategy Alden Stradling and Joshi Fullop (Los Alamos National Laboratory) Abstract Abstract In a perfect world, HPC system downtime would be easy to minimize. Just keep a perfect copy of the production cluster to prevent scaling surprises. Multitenancy on HPE Cray EX: network segmentation and isolation Chris Gamboni (Swiss National Supercomputing Centre) Abstract Abstract The Swiss National Supercomputing Centre (CSCS) has developed a strategy to provide Infrastructure as a Service (IaaS) to select customers co-investing in the Alps infrastructure. 
A critical component of this service is the ability to segregate and isolate the Alps network for various IaaS tenants, enabling them to integrate their own site networks. This necessitates the management of network multitenancy capabilities within the infrastructure. The Alps system, powered by HPE Cray EX machines and managed through Cray System Management (CSM) with the Slingshot interconnect, utilizes VLANs for node segmentation at the High-Speed Network (HSN) level. Implementing network multitenancy with CSM requires a novel configuration approach in the node management network. Presentation, Paper Technical Session 6A Chair: Chris Fuson (ORNL, Oak Ridge National Laboratory) Unification of Alerting Engines for Monitoring in System Management Raghul Vasudevan, Ambresh Gupta, and Sinchana Karnik (Hewlett Packard) Abstract Abstract Unified alerting provides a single interface or platform in system management to create and manage alerts for various components within HPC systems. The system management monitoring stack collects different types of telemetry from various system components and stores events and logs in OpenSearch and metrics in Timescale. HPE Cray EX255a Telemetry - Improved Configurability and Performance Sean Byland, Steven Martin, and Brian Collum (HPE) Abstract Abstract The new HPE Cray EX255a blade (two nodes, each with 4 AMD MI300A sockets) has power demands and additional sensors that require a more robust power subsystem and data collection capability. This necessitated careful evaluation and changes to how we manage, collect, and publish sensor data. We factored all sensor access parameters and operational characteristics for the EX255a node cards into a standardized file format. These files define default values for compilation. The files can be edited and unmarshalled into runtime-accessible structures, enabling testing, tuning, and experimentation with alternative settings. 
Starting at the hardware access layer and working up the stack we optimized the code paths to enable collection of more sensors in our fixed time budget. This work is the foundation for future work that could enable the ability for higher-level management and monitoring software to customize data collection on behalf of users. Best Practices for deployment of LDMS on the HPE Cray EX platform James Brandt, Kevin Stroup, and Ann Gentile (Sandia National Laboratories) Abstract Abstract The Lightweight Distributed Metric Service (LDMS) has been deployed on some of the largest Cray systems over the past decade to enable low overhead capture of system and application metrics of interest. LDMS has evolved over time to provide new capabilities and associated configuration options to address the ever increasing size and heterogeneity of HPC systems. In the last quarter of 2023 a working group was formed to formalize “best practices” for LDMS deployment on large-scale HPC systems as well as to help guide future configuration management approaches and mechanisms in the LDMS open source/development project. We present the results from this working group as they apply to base-level configurations of samplers and aggregators, authentication mechanisms, and practices to simplify deployment, including use of pre-built Docker containers. For those interested in automated aggregator load balancing and resilience to host failure we describe capabilities of the LDMS distributed configuration manager (Maestro). Finally, we present planned extensions and the capabilities they provide. Presentation, Paper Technical Session 6C Chair: Bilel Hadri (KAUST Supercomputing Lab) ClusterStor Tiering, Overview, Setup, and Performance Nathan Rutman (Hewlett Packard) Abstract Abstract ClusterStor Tiering is a suite of software features designed to enhance the usability and management of hybrid storage systems, combining both flash and disk components. 
Specifically crafted for monitoring and maintaining file layouts and free space on E1000 flash and disk tiers, Tiering offers a range of customizable capabilities through data management policies. Administrators can tailor fine-grained indexing controls, orchestrate file migrations between Object Storage Targets (OSTs) or pools, execute restriping processes, perform purges, and generate reports. These actions are intelligently triggered by preset timers or dynamically in response to system conditions, such as reaching capacity thresholds. Leveraging a scale-out architecture, Tiering efficiently handles the movement of large data volumes. Utilizing the System Management Unit (SMU) for all functions, additional data mover nodes can be configured to augment throughput. Key functionalities supported by Tiering include scalable search, transparent tiering, parallel data movers, data purging, and reporting. Exploring new software-defined storage technology using VAST on Cray EX systems Mark Klein, Chris Gamboni, Gennaro Oliva, and Salvatore Di Nardo (Swiss National Supercomputing Centre, ETH Zurich); Maria Gutierrez (VAST Data); and Riccardo Di Maria and Miguel Gila (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Alps is the Swiss national supercomputing centre's multi-tenant software-defined infrastructure. This paper describes the configuration and experiences of getting VAST working as a performant filesystem option on the HPE Cray EX line of supercomputers and highlights the possibility of attaching additional storage options over the edge routers of these systems. Reducing Mean Time to Resolution (MTTR) for complex HPC-based systems with next-generation automated service tools Michael Cush (HPE) Abstract Abstract After years of experience with Cray’s System Snapshot Analyzer (SSA), the HPC Call Home team worked to develop a new, more flexible, scalable, open, and secure Call Home infrastructure to support our future HPC products. 
Becoming part of HPE allowed us to take advantage of and include HPE’s highly secure Remote Data Access (RDA) capabilities as part of that new infrastructure. A key design point was to make the new product useful even for sites that are not typically uploading data – which sounds rather odd for a “call home” tool set. Other points were the maintenance of a pluggable and highly configurable collection framework partnered with an efficient storage methodology. This paper will discuss the design and highlight where enhancements were made. Example collection plugins will be reviewed. Finally, the paper will seek to answer the question, “So why should I run SDU?” Presentation, Paper Technical Session 7A Chair: John Holmen (Oak Ridge National Laboratory) Proactive Precision: Enhancing High-Performance Computing with Early Job Failure Detection Dipanwita Mallick, Siddhi Potdar, Saptashwa Mitra, Nithin Mohan, and Charlie Vollmer (Hewlett Packard Enterprise) Abstract Abstract In the high-performance computing (HPC) realm, swiftly identifying job failures is critical to optimize resource allocation and ensure system efficiency. Given the high costs and extensive resource demands of HPC systems, the impact of job failures, particularly post-resource allocation, is significant. These failures, crucial in time-sensitive research domains, can derail progress and obstruct objectives. Proactive failure detection allows administrators to quickly enact corrective actions, like job resubmission or reconfiguration, reducing downtime and enhancing user satisfaction. Our approach includes predicting job failures at the initial stages, analyzing failure causes, and developing preventive strategies. By implementing a robust data collection process within the HPC system and utilizing the Slurm workload manager, we have streamlined the data handling procedures. 
Our methodology involves data preprocessing, feature engineering, and using machine learning models optimized with cross-validation, addressing class imbalances, and focusing on precision, recall, and F1-score metrics. This thorough approach aims to improve resource optimization and prevent future inefficiencies in HPC systems. Presentation, Paper Technical Session 8B Chair: Raj Gautam (ExxonMobil) Using HPE-Provided Resources to Integrate HPE Support into Internal Incident Management John Gann, Daniel Gens, and Elizabeth Bautista (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract High Performance Computing (HPC) centers need to streamline incident management workflows while keeping information synchronized between internal tickets and vendor support cases. Prior to HPE’s acquisition of Cray, NERSC created an integration between their ServiceNow incident management platform and the CrayPort platform. This integration became obsolete once HPE took over Cray, leaving NERSC staff no choice but to enter information manually every time a new incident was opened or required updating. Further, this manual entry needed to be performed in both ServiceNow and HPE’s platform. Presentation, Paper Technical Session 8A Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Optimizing I/O Patterns to Speed up Non-contiguous Data Retrieval and Analyses Scott Klasky, Qian Gong, and Norbert Podhorszki (Oak Ridge National Laboratory) Abstract Abstract Scientific applications running on Exascale supercomputers generate massive data requiring efficient storage for future analysis. While simulations leverage thousands of nodes for writing, reading and analysis typically utilize limited resources, i.e., a handful of nodes. 
As a result, users commonly query only a particular plane of a multidimensional array or read data from files in strides, hoping that the overall I/O and data processing cost is reduced. This presentation delves into these non-contiguous and striding I/O patterns commonly employed in scientific data analyses. Using visualization as examples, we reveal the detrimental impact of non-contiguous file access on overall throughput, counteracting the speedup gained from analyses with reduced data volume. Recognizing the pattern of scientific data access – primarily written once and read frequently – we propose to refactor data at the time of writing into a format leading to efficient retrieval. We investigate several data organization and refactoring strategies, assessing their impact on reading performance, writing performance, and error incurred on post-analysis across several commonly used query and post-analyses tasks. Our experiments are conducted on the Frontier Supercomputer at Oak Ridge National Laboratory, providing insights for optimizing I/O operations in the Exascale computing era. Presentation, Paper Technical Session 8C Chair: Jim Rogers (Oak Ridge National Laboratory) Building LDMS Slingshot Switch Samplers Kevin Stroup, Cory Lueninghoener, Jim Brandt, and Ann Gentile (Sandia National Laboratories) Abstract Abstract The Lightweight Distributed Metric Service (LDMS) is widely used for monitoring HPC systems and is integrated in HPE’s Cray System Management architecture as well as HPE’s High Performance Cluster Management architecture. One of the important components of an HPC system to monitor is the high-speed interconnect. In the case of HPE Cray EX family of systems, that interconnect is the Slingshot high-speed network. LDMS utilizes “samplers” to gather data about network metrics, including some metrics that can only be determined by a sampler running on the Slingshot switches. 
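The non-contiguous access penalty described in the I/O-optimization abstract above comes down to storage layout. A small NumPy sketch (illustrative only, not the authors' implementation) shows why a plane along the slowest axis of a row-major array is one contiguous run while the same-sized plane along the fastest axis is scattered, and how refactoring at write time can make the frequently read plane contiguous:

```python
import numpy as np

# A 3-D array stored in C (row-major) order, as most simulation output is.
a = np.arange(4 * 5 * 6).reshape(4, 5, 6)

plane_slow = a[2, :, :]   # fixes the slowest axis: one contiguous run
plane_fast = a[:, :, 3]   # fixes the fastest axis: 20 scattered elements

print(plane_slow.flags['C_CONTIGUOUS'])  # True
print(plane_fast.flags['C_CONTIGUOUS'])  # False

# Refactoring at write time (here, a transposed copy) makes the
# frequently-read plane contiguous, trading write cost for read speed.
b = np.ascontiguousarray(a.transpose(2, 0, 1))
print(b[3, :, :].flags['C_CONTIGUOUS'])  # True
```

On disk the same logic applies: each scattered element of `plane_fast` becomes a separate small read, which is the throughput killer the abstract measures at scale.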
Plenary Plenary: Welcome, Keynote Welcome to Country Ceremony Maciej Cytowski (Pawsey) Abstract Abstract A long tradition of the Noongar people across the southwest of WA is the "Welcome to Country" ceremony, which is meant to call out to the spiritual ancestors of the land, to advise them that new friends have come to visit, and to look over and protect them from harm during their stay. This ceremony varies from person to person, where some share stories, songs, and dances reflecting the song lines and connections of the land and cultural heritage of the place where the gathering occurs. This has been done since time immemorial as a valued and respected ceremony as part of our events. Convergence of Energy Efficient Scientific Computing and GenAI Gabriel Noaje (NVIDIA) Abstract Abstract NVIDIA Grace Hopper Superchips are a scale-up architecture ideal for scientific computing workflows involving CPUs and GPUs. This session dives deep into HPC and AI workload performance results with a technical focus on the specific features of Grace-Hopper that accelerate each workload. Explore how Grace-Hopper's distinctive coupling of the CPU/GPU hardware and the accompanying software stack create a platform which increases developer productivity, accelerates existing applications, and facilitates new standard programming models in C++, Fortran, and Python. Attendees will hear about some of the early customers working with these innovative products as they apply this innovative, energy-efficient platform towards their scientific, generative AI, and industrial use cases. High Performance Remote Linux Desktops with ThinLinc Robert Henschel and Aaron Sowry (Cendio AB) Abstract Abstract ThinLinc is a scalable remote desktop solution for Linux, built with open-source components. It enables access to graphical Linux applications and full desktop environments. 
CUG member sites have used ThinLinc to provide users with access to applications like MATLAB or VMD, as well as a “High Performance Research Desktop”. This environment allows users to run their entire workflow, from data retrieval and preparation to job submission and post-processing. A “High Performance Research Desktop” also enables access to interactive applications and running jobs in a batch system. Cendio, the company behind ThinLinc, is a strong supporter of open-source projects and the main contributor to projects like TigerVNC, noVNC, and others. Unlocking Exascale Debugging and Performance Engineering with Linaro Forge Marcin Krzysztofik (Linaro) Abstract Abstract Dive into the future of code development and see how Linaro Forge is reshaping what's possible in the world of parallel computing. Linaro Forge unveils the latest advancements: with Linaro DDT, MAP, and Performance Reports, we're setting new standards in scalability and ease-of-use. Discover how these tools have become the go-to solution for developers seeking to push the boundaries of code optimization and performance engineering. Plenary Plenary: CUG site, HPE update Plenary Plenary: CUG Board Updates (Open), CUG Elections, and Best papers CUG Board Updates, SIG Presentations, and Board Elections – Open Session Ashley Barker (Oak Ridge National Laboratory) Abstract Abstract CUG update from the board, overview of conference submissions and proceedings, Special Interest Group Updates, CUG Board Elections. 
Nine Months in the life of an all-flash file system Lisa Gerhardt, Stephen Simms, David Fox, Ershaad Basheer, and Kirill Lozinskiy (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Michael Moore (HPE); and Wahid Bhimji (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract NERSC’s Perlmutter scratch filesystem, an all-flash Lustre storage system running on Cray ClusterStor E1000 Storage Systems, has a capacity of 36 PetaBytes and a theoretical peak performance exceeding 7 TeraBytes per second across HPE’s Slingshot network fabric. Deploying an all-flash Lustre filesystem was a leap forward in an attempt to meet the diverse I/O needs of NERSC. With over 10,000 users representing over 1,000 different projects that span multiple disciplines, a file system that could overcome the performance limitations of spinning disk and reduce performance variation was very desirable. While solid state provided excellent performance gains, there were still challenges that required observation and tuning. Working with HPE’s storage team, NERSC staff engaged in an iterative process that through time, increased performance and provided more predictable outcomes. Through the use of IOR runs and OBDfilter scans, NERSC staff were able to closely monitor the performance of the file system at regular intervals to inform the process and chart progress. This paper will document the results of and report insights derived from over 6 months of NERSC’s continuous performance testing, and provide a comprehensive discussion of the tuning and adjustments that were made to improve performance. Isambard-AI: a leadership-class supercomputer optimised specifically for Artificial Intelligence Simon McIntosh-Smith, Sadaf Alam, and Christopher Woods (University of Bristol) Abstract Abstract Isambard-AI is a new, leadership-class supercomputer, designed to support AI-related research. 
Based on the HPE Cray EX4000 system, and housed in a new, energy-efficient Modular Data Centre in Bristol, UK, Isambard-AI employs 5,448 NVIDIA Grace-Hopper GPUs to deliver over 21 ExaFLOP/s of 8-bit floating point performance for LLM training, and over 250 PetaFLOP/s of 64-bit performance, for under 5MW. Isambard-AI integrates two all-flash storage systems: a 20 PiByte Cray ClusterStor and a 3.5 PiByte VAST solution. Combined, these give Isambard-AI flexibility for training, inference, and secure data access and sharing. But it is the software stack where Isambard-AI will be most different from traditional HPC systems. Isambard-AI is designed to support users who may have been using GPUs in the cloud, and so access will more typically be via Jupyter notebooks, MLOps, or other web-based, interactive interfaces, rather than the approach used on traditional supercomputers of ssh’ing into a system before submitting jobs to a batch scheduler. Its stack is designed to be quickly and regularly upgraded to keep pace with the rapid evolution of AI software, with full support for containers. Phase 1 of Isambard-AI is due online in May/June 2024, with the full system expected in production by the end of the year. Plenary Plenary: Sponsors talks, HPE 1-100 The Biggest Change to HPC Job Scheduling and Resource Management in 30 Years Branden Bauer (Altair Engineering, Inc.) Abstract Abstract HPC is rapidly becoming more complex. Administrators must support a wide range of new workloads that mix AI/ML with HPC while pulling data from varied data sources and a compute environment consisting of assorted structures, including GPUs, CPUs, and new accelerators. How do we meet this complexity as an industry while delivering better scalability and efficiency? Introducing the biggest change to HPC resource management in 30 years: Altair® Liquid Scheduling™. 
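As a rough consistency check on the Isambard-AI figures quoted above, the implied per-GPU rates can be recovered by simple division. Published GH200 peak rates vary with precision and configuration, so this is arithmetic on the abstract's numbers only, not a vendor specification:

```python
# Quoted Isambard-AI system totals.
n_gpus = 5448
fp8_total = 21e18      # 21 ExaFLOP/s at 8-bit precision
fp64_total = 250e15    # 250 PetaFLOP/s at 64-bit precision

# Implied per-GPU rates: roughly 3.85 PFLOP/s (FP8) and 46 TFLOP/s (FP64).
fp8_per_gpu_pflops = fp8_total / n_gpus / 1e15
fp64_per_gpu_tflops = fp64_total / n_gpus / 1e12
print(round(fp8_per_gpu_pflops, 2), round(fp64_per_gpu_tflops, 1))
```

Both implied rates sit in the range one would expect for a Grace-Hopper superchip, so the headline totals are internally consistent with the 5,448-GPU count.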
Codee: Automatic Code Inspection Tools for Performance and Code Modernization Manuel Arenaz (Codee) Abstract Abstract Codee is a suite of software development tools that help improve the performance of C/C++/Fortran applications, providing a systematic, more predictable approach that leverages parallel programming best practices. Codee Static Code Analyzer enforces C/C++/Fortran performance optimization best practices for the target environment: hardware, compiler, and operating system. It provides innovative Coding Assistant capabilities to enable semi-automatic source code rewriting, inserting OpenMP or OpenACC directives in your codes to run on CPUs or offload to accelerator devices such as GPUs, so that novice programmers can write code at the expert level. Codee provides integrations with IDEs and CI/CD frameworks to make it possible to Shift Left Performance. In this presentation we will also talk about how to use Codee in conjunction with the Cray tools, including compilers (CCE) and performance tools (e.g. CrayPat, Reveal). Plenary Plenary: CUG 2024, Invited speakers Advancing Gas Turbine Development using HPC: Challenges and Rewards Richard D. Sandberg and Melissa Kozul (University of Melbourne) Abstract Abstract The majority of research on the gas turbines used for aircraft propulsion was historically carried out via costly laboratory tests. This approach generally yields the overall heat transfer or loss of components, rather than providing sufficient detail to dissect the range of physical phenomena impacting the performance of a design. On the other hand, Computational Fluid Dynamics (CFD) yields the full, detailed air flow-field and has become a key gas turbine design tool. Yet, industrial design cycles rely on low-order models so that performing hundreds of engine-relevant analyses is tractable. 
The accuracy of these models, however, is notoriously challenged by the complex flow conditions within aircraft engines, including strong pressure gradients, separated flow and laminar to turbulent transition mechanisms. Plenary CUG 2024 Closing Presentation, Paper Technical Session 1B Chair: Jim Williams (Los Alamos National Laboratory) Enhancing HPC Service Management on Alps using FirecREST API Juan Pablo Dorsch, Andreas Fink, Eirini Koutsaniti, and Rafael Sarmiento (Swiss National Supercomputing Centre) Abstract Abstract With the evolution of scientific computational needs, there is a growing demand for enhanced resource access and sophisticated services beyond traditional HPC offerings. These demands encompass a wide array of services and use cases, from interactive computing platforms like JupyterHub to the integration of Continuous Integration (CI) pipelines with tools such as GitHub Actions and GitLab runners, and the automation of complex workflows in Machine Learning using AirFlow. Automated Hardware-Aware Node Selection for Cluster Computing Manuel Sopena Ballesteros, Miguel Gila, Matteo Chesi, and Mark Klein (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract This paper introduces algorithms for automating the grouping of compute nodes into clusters based on user-defined hardware requirements and simultaneously identifies potential hardware failures in HPC data centers. Addressing the challenges of dynamic workloads, the algorithms extract detailed hardware information through CSM APIs, automating node selection aligned with user-defined criteria. The automation streamlines node assignment, reducing human error and expediting the selection process. 
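The hardware-aware node selection described in the last abstract above can be sketched generically: filter a node inventory against user-defined minimums, then take the requested count. The field names and matching rule here are hypothetical placeholders; the actual CSM API data model differs:

```python
def select_nodes(inventory, requirements, count):
    """Return up to `count` node names whose attributes satisfy every
    requirement (minimum values for numeric fields, equality otherwise)."""
    def matches(node):
        for key, wanted in requirements.items():
            have = node.get(key)
            if isinstance(wanted, (int, float)):
                if have is None or have < wanted:
                    return False
            elif have != wanted:
                return False
        return True

    eligible = [n["name"] for n in inventory if matches(n)]
    return eligible[:count]

# Hypothetical inventory, standing in for data pulled via the CSM APIs.
inventory = [
    {"name": "nid001", "mem_gib": 256, "gpus": 4, "gpu_model": "MI250X"},
    {"name": "nid002", "mem_gib": 512, "gpus": 4, "gpu_model": "MI250X"},
    {"name": "nid003", "mem_gib": 512, "gpus": 0, "gpu_model": None},
]
picked = select_nodes(inventory, {"mem_gib": 512, "gpus": 4}, count=2)
```

Nodes that fail a criterion they should nominally meet (e.g. a blade reporting fewer GPUs than its model ships with) are exactly the hardware-failure candidates the paper's algorithms flag as a side effect of selection.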
Versatile Software-defined Cluster on Cray HPE EX Systems Maxime Martinasso, Mark Klein, Benjamin Cumming, Miguel Gila, and Felipe Cruz (Swiss National Supercomputing Centre, ETH Zurich) Abstract This presentation introduces the versatile software-defined cluster (vCluster), a novel set of technologies for HPC infrastructure such as Cray HPE EX systems. This integration offers a service-oriented approach to computing resources, maintaining infrastructure independence and avoiding vendor lock-in. The vCluster technology bridges the gap between Cloud abstraction and the vertically integrated HPC stack, enabling large-scale infrastructures to support multiple scientific domains with specifically tailored services. Presentation, Paper Technical Session 1A Chair: Lena M Lopatina (LANL) CPE Updates Barbara Chapman (HPE) Abstract The HPE Cray Programming Environment (CPE) provides a suite of integrated programming tools for application development on a diverse range of HPC systems delivered by HPE. Its compilers, math libraries, communications libraries, debuggers, and performance tools enable the creation, enhancement, and optimization of application codes written using mainstream programming languages and the most widely used parallel programming models. A Deep Dive Into NVIDIA's HPC Software Jeff Larkin and Becca Zandstein (NVIDIA) Abstract NVIDIA's HPC Software enables developers to build applications that take advantage of every aspect of the hardware available to them: CPU, GPU, and interconnect. In this presentation you will learn the latest information on NVIDIA's HPC compilers, libraries, and tools, and how NVIDIA's HPC software makes application developers productive and their applications portable and performant. This presentation will give an overview of NVIDIA's HPC SDK, optimized libraries for the GPU and CPU, performance tools, scalable libraries, Python support, and more. 
Slurm 24.05 and Beyond Tim Wickberg (SchedMD LLC) Abstract Slurm is the open-source workload manager used on the majority of the TOP500 systems. Presentation, Paper Technical Session 1C Chair: Chris Fuson (Oak Ridge National Laboratory) Towards the Development of an Exascale Network Digital Twin John Holmen (Oak Ridge National Laboratory); Md Nahid Newaz (Oakland University); and Srikanth Yoginath, Matthias Maiterth, Amir Shehata, Nick Hagerty, Christopher Zimmer, and Wesley Brewer (Oak Ridge National Laboratory) Abstract Exascale high performance computing (HPC) systems introduce new challenges related to fault tolerance due to the large component counts needed to operate at such scales. For example, the exascale Frontier system consists of approximately 60 million components. These counts warrant the investigation of new approaches for helping to ensure the functionality, performance, and usability of such systems. An approach explored by the ExaDigiT project is the use of digital twins to help inform decisions related to the physical Frontier system. This paper discusses a subset of ExaDigiT’s Facility Digital Twin (FDT), the Network Digital Twin (NDT), which focuses on Frontier’s network as a target use case. We present the various strategies tested and the early challenges faced in the development of an exascale NDT, in the hope that this knowledge will benefit other practitioners interested in developing a similar digital twin. A Performance Deep Dive into HPC-AI Workflows with Digital Twins Ana Gainaru (Oak Ridge National Laboratory); Greg Eisenhauer (Georgia Institute of Technology); and Fred Suter, Norbert Podhorszki, and Scott Klasky (Oak Ridge National Laboratory) Abstract The landscape of High-Performance Computing (HPC) is evolving. 
Traditional HPC simulations are merging with advanced visualization and AI techniques for analysis, resulting in intricate workflows that push the boundaries of current benchmarks and performance models. Here we focus on workflows that couple, in near real time, digital twins and low-fidelity Artificial Intelligence (AI) simulations with ongoing experiments or high-fidelity simulations to continuously drive the latter towards optimal results. It is expected that digital twin workflows will play a crucial role in optimizing the performance of next-generation simulations and instruments. This paper highlights performance limitations for the convergence of AI digital twins and HPC simulations by modeling and analyzing several I/O strategies at scale on HPE/Cray machines. We expose the limitations of relying on existing methods that benchmark individual components for these novel workflows, and propose a performance roofline model to predict the performance of these workflows on future machines and for more complex tasks. Additional layers of analytics and visualization further complicate the performance landscape. Understanding the unique performance characteristics of these intricate HPC-AI hybrid workflows is essential for designing future architectures and algorithms that can fully harness their potential. Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC Madan Timalsina, Lisa Gerhardt, Johannes Blaschke, Nicholas Tyler, and William Arndt (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. 
The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments. The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different environments. The methods, results, and potential future directions of this research, including its application in various scientific domains, are also covered, showcasing the critical advancements made in computational methodologies through this study. Presentation, Paper Technical Session 2B Chair: Lena M Lopatina (LANL) EMOI: CSCS Extensible Monitoring and Observability Infrastructure Massimo Benini (CSCS); Jeff Hanson (HPE); and Dino Conciatore, Gianni Mario Ricciardi, Michele Brambilla, Monica Frisoni, Mathilde Gianolli, Gianna Marano, and Jean-Guillaume Piccinali (CSCS) Abstract The Swiss National Supercomputing Centre (CSCS) is enhancing its computational capabilities through the expansion of the Alps architecture, a Cray HPE EX system equipped with approximately 5000 GH200 modules, in addition to the pre-existing 1000 nodes of a diverse combination of CPUs and GPUs. CSCS has developed an Extensible Monitoring and Observability Infrastructure (EMOI), designed to manage the substantial data influx and provide insightful analysis of the infrastructure's behavior. This paper presents the architecture and capabilities of EMOI at CSCS, emphasizing its scalability and adaptability to handle the increasing volume of monitoring data generated by the Alps infrastructure. 
We detail the integration of the Cray System Management (CSM) and Cray System Monitoring Application (SMA) within EMOI. The paper describes our hardware infrastructure, leveraging Kubernetes for dynamic deployment of data collection and analysis tools, and outlines our GitOps strategy for efficient service management. We also explore the distinctions in data models across various node architectures within the Alps system, focusing on power consumption data and its relevance concerning global supercomputing challenges. The insights and methodologies presented in this paper are anticipated to be beneficial not only to CSCS, but also to other HPE/Cray sites facing similar challenges in supercomputing infrastructure management. Swordfish/Redfish and ClusterStor - Using Advanced Monitoring to Improve Insight into Complex I/O Workflows Torben Kling Petersen, Tim Morneau, Dan Matthews, and Nathan Rutman (HPE) Abstract HPC storage systems today are complex solutions. Unlike typical compute environments, a single storage component operating subpar can have a significant impact on productivity. Further, as a storage solution ages, capacity fills up and is used unevenly. Understanding these changes and the reasons for performance bottlenecks is increasingly important. With the addition of a full RESTful monitoring API based on Swordfish, we now have the tools required to improve overall storage monitoring. Swordfish is a collaboration between DMTF and SNIA that extends the DMTF Redfish interface to provide a standardized, accessible way to represent and manage storage and file systems in both individual customer and cloud environments. A custom implementation of both Swordfish and Redfish in the ClusterStor software stack provides new ways of gaining insights into the inner workings of an HPE ClusterStor E1000 storage system or either of its forthcoming descendants, the C500 and E2000. 
This paper is intended as an introduction to this new API, including examples and guidance on how it can be used to improve storage monitoring as well as understanding of how traditional HPC and/or modern AI/ML workflows behave and evolve over time. CADDY: Scalable Summarizations over Voluminous Telemetry Data for Efficient Monitoring Saptashwa Mitra, Scott Ragland, Vanessa Zambrano, Dipanwita Mallick, Charlie Vollmer, Lance Kelley, and Nithin Singh Mohan (Hewlett Packard Enterprise) Abstract In the rapidly evolving landscape of High-Performance Computing (HPC), the efficient management and analysis of telemetry data is pivotal for ensuring system robustness and performance optimization. As HPC systems scale in complexity and capability, traditional data processing methodologies struggle to meet the demands of rapid real-time analytics and large-scale data management. This paper introduces an innovative framework, Caddy, which employs a novel approach to HPC telemetry storage and interactive analysis. Built on the foundation of HPE's Slingshot interconnect and the Fabric AIOps (FAIO) system, Caddy aims to address the critical need for a memory-efficient, scalable, and real-time analytical solution for seamless monitoring over large HPC environments. Command Lines vs. Requested Resources: How Well Do They Align? Ben Fulton, Abhinav Thota, Scott Michael, and Jefferson Davis (Indiana University) Abstract In the context of high-performance computing, a significant portion of users do not develop their own code from scratch but rely on existing software packages and libraries tailored for specific scientific or computational tasks. Many of these open source scientific software packages provide a variety of methods to efficiently use them in multicore, multinode, or large-memory systems. 
In this paper, we examine a set of applications that users run on Indiana University supercomputers, and determine for those applications the software parameter settings controlling CPU parallelism, GPU parallelism, and memory usage. We then investigate the common ways users employ these parameters and measure the degree of success with which they take advantage of available resources. By comparing data collected from XALT on the command line parameters used with the Slurm resource requests, we are able to determine the degree to which users take advantage of the resources they request. This knowledge will inform how we can better provide example usage for the software available on our systems, and will inform future software development efforts, guiding the design of more efficient, user-friendly, and adaptable tools that align closely with the specific needs of the HPC community. Presentation, Paper Technical Session 2A Chair: Jim Rogers (Oak Ridge National Laboratory) Updated Node Power Management For New HPE Cray EX255a and EX254n Blades Brian Collum and Steven Martin (Hewlett Packard Enterprise) Abstract Cray EX nodes have always supported a form of power capping that allows customers to lower the power usage of specific nodes as desired. With the introduction of the HPE Cray EX254n (NVIDIA Grace Hopper) and HPE Cray EX255a (AMD MI300A), this became critical as the overall rack power pushed beyond the maximum supported at some customer sites. With the HPE Cray EX254n in particular, the total TDP of the modules exceeds the maximum that can be delivered by the Cray EX infrastructure. This drove the decision to set a power limit on the Grace Hopper modules by default (a first for Cray EX). This presentation will walk through the design goals of the blades, how power capping is implemented in the firmware, and how to configure the power limit in a running system. 
The presentation will also go through how to view the currently configured limits via the node controller's Redfish API and in-band tools, where applicable, and how the in-band tools interact with the out-of-band configurations. HPE Cray EX Power Monitoring Counters Steven Martin, Brian Collum, and Sean Byland (HPE) Abstract HPE Cray Power Monitoring (PM) Counters were first deployed on Cray XC30 systems, and several papers presented at CUG in 2014 described their use. PM Counters expose power, energy, and related metadata collected out of band directly to in-band consumers. Since their introduction, PM Counters have been supported on all blades designed for use in Cray XC and HPE Cray EX Supercomputer systems. PM Counters have remained important as system and application power and energy consumption continues to be a top priority for system vendors, application developers, and the wider HPC research community. Over the last decade, the design of PM Counters has remained very stable, with only minor updates to support evolving node architecture changes. This presentation will give a brief history and overview of PM Counters basics, then present details of PM Counters on the latest HPE Cray EX supercomputer blades announced at SC23, and then discuss opportunities and challenges in supporting PM Counters on NGI (Next Generation Infrastructure). The presentation will conclude with a reinforcement of the value of PM Counters in supporting research, development, and testing of energy-efficient and sustainable HPC systems. First Analysis on Cooling Temperature Impacts on MI250X Exascale Nodes (HPE Cray EX235a) Torsten Wilde (HPE), Michael Ott (LRZ), and Pete Guyan (HPE) Abstract With the focus on sustainable data center operations, the community is moving from chilled water cooling to warm water cooling solutions, expecting that running at higher system inlet temperatures enables more energy-efficient facility operation. 
Since the overall efficiency is determined by the combination of facility infrastructure and system behavior, understanding the impact of different system inlet cooling temperatures on system performance and efficiency is important. This inaugural presentation covers the analysis of the impact of the inlet cooling temperature on an Exascale compute blade when running HPL and HPCG. Data was collected using the HPE-LRZ PreExascale co-design project system (HPE Cray EX2500 with four modified Frontier blades optimized for higher cooling temperature support) installed at the Leibniz Supercomputing Centre. Our analysis will show that higher cooling temperatures increase node power consumption, leading to a reduction in node performance and overall energy efficiency. Results are presented for different inlet temperatures showing that overall node efficiency is reduced by around 6% for HPCG and 4% for HPL (25°C vs. 40°C inlet temperature). Combining facility and system, warm water cooling is more efficient, but the most efficient cooling temperature depends on the application and the efficiency of the cooling infrastructure. EVeREST: An Effective and Versatile Runtime Energy Saving Tool Anna Yue (Hewlett Packard Enterprise, University of Minnesota) and Sanyam Mehta and Torsten Wilde (Hewlett Packard Enterprise) Abstract Amid conflicting demands for better application performance and energy efficiency, HPC systems must be able to identify opportunities to save power/energy without compromising performance, while ideally being transparent to the user. We identify three primary challenges for a successful energy saving solution: versatility to operate across processors of different types (e.g., CPUs and GPUs) and from different vendors, effectiveness in finding energy saving opportunities and making the right power-performance tradeoffs, and ability to handle parallel applications involving communication. 
We propose Everest, a lightweight runtime tool that switches to the ideal clock frequency, computed from dynamic application characterization, for individual application phases/regions while meeting a specified performance target. Everest achieves versatility by relying on the minimum possible set of performance events for the needed characterization and power estimation. Region-awareness and accurate computation of MPI slack time allow Everest to find enhanced energy saving opportunities and thus save up to 20% more energy than existing solutions on CPUs. These energy savings rise to up to 30% and are more prominent on GPUs, where Everest doubly benefits from its unique idle time characterization and by choosing to sacrifice an allowed/acceptable performance loss. Presentation, Paper Technical Session 2C Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Optimising the Processing and Storage of Radio Astronomy Data Alexander Williamson (International Centre for Radio Astronomy Research, University of Western Australia); Pascal Elahi (Pawsey Supercomputing Research Centre); Richard Dodson and Jonghwan Rhee (International Centre for Radio Astronomy Research, University of Western Australia); and Qian Gong (Oak Ridge National Laboratory) Abstract The next generation of radio astronomy telescopes is challenging existing data analysis paradigms, with an order of magnitude larger collecting area and bandwidth. The two primary problems encountered when processing this data are the need for storage and that processing is primarily I/O limited. An example of this is the data deluge expected from the SKA-Low Telescope of about 300 PB per year. To remedy these issues, we have demonstrated lossy and lossless compression of data from an existing precursor telescope, the Australian Square Kilometre Array Pathfinder (ASKAP), using the MGARD and ADIOS2 libraries. 
We find data processing is faster by a factor of 7 and achieve compression ratios from a factor of 7 (lossless) up to 37 (lossy, with an absolute error bound of 0.001). We will discuss the effectiveness of lossy MGARD compression and its adherence to the designated error bounds, the trade-off between these error bounds and the corresponding compression ratios, as well as the potential consequences of these I/O and storage improvements on the science quality of the data products. Performance and scaling of the LFRic weather and climate model on different generations of HPE Cray EX supercomputers J. Mark Bull (EPCC, The University of Edinburgh); Andrew Coughtrie (Met Office, UK); Deva Deeptimahanti (Pawsey Supercomputing Research Centre); Mark Hedley (Met Office, UK); Caoimhin Laoide-Kemp (EPCC, The University of Edinburgh); Christopher Maynard (Met Office); Harry Shepherd (Met Office, UK); Sebastiaan Van De Bund and Michele Weiland (EPCC, The University of Edinburgh); and Benjamin Went (Met Office, UK) Abstract This study presents scaling results and a performance analysis across different supercomputers and compilers for the Met Office weather and climate model, LFRic. The model is shown to scale to large numbers of nodes, meeting the design criterion of exploiting parallelism to achieve good scaling. The model is written in a Domain Specific Language, embedded in modern Fortran, and uses a Domain Specific Compiler, PSyclone, to generate the parallel code. The performance analysis shows the effect of choice of algorithm, such as redundant computation, and of scaling with OpenMP threads. The analysis can be used to motivate a discussion of future work to improve the OpenMP performance of other parts of the code. Finally, an analysis of the performance tuning of the I/O server, XIOS, is presented. 
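The error-bounded lossy compression described in the ASKAP abstract above rests on a simple contract: every reconstructed value must lie within a fixed absolute error bound of the original. A minimal sketch of that contract uses uniform scalar quantization; note that MGARD itself uses a multilevel decomposition, not this simple scheme, and the data values here are invented.

```python
# Minimal sketch of error-bounded lossy compression: uniform scalar
# quantization with bin width 2*eb guarantees
#   |x - decompress(compress(x))| <= eb  for every value x.
# MGARD's actual algorithm (multigrid decomposition) is far more
# sophisticated; this only illustrates the error-bound contract.

def compress(values, eb):
    """Map each value to an integer bin index; bins are 2*eb wide."""
    return [round(v / (2 * eb)) for v in values]

def decompress(indices, eb):
    """Reconstruct the bin centers from the indices."""
    return [i * 2 * eb for i in indices]

data = [0.1234, -3.3, 7.77, 0.0005]   # invented sample values
eb = 0.001                            # absolute error bound, as in the abstract
restored = decompress(compress(data, eb), eb)
worst = max(abs(a - b) for a, b in zip(data, restored))
assert worst <= eb
```

The compression gain then comes from entropy-coding the integer indices, which are far more repetitive than the raw floating-point values; loosening `eb` widens the bins and raises the ratio, which is the trade-off the talk examines.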
Disaggregated memory in OpenSHMEM applications – Approach and Benefits Clarete Crasta, Sharad Singhal, Faizan Barmawer, Ramesh Chaurasiya, Sajeesh KV, Dave Emberson, Harumi Kuno, and John Byrne (Hewlett Packard) Abstract HPC architectures most often handle High Performance Data Analytics (HPDA) and Explorative Data Analytics (EDA) workloads where the working data set cannot be easily partitioned or is too large to fit into local node memory. This poses challenges for programming models such as OpenSHMEM[1] or MPI[2], where all data in the working set is assumed to fit in the memory of the participating compute nodes. Additionally, existing HPC programming models use expensive all-to-all communication to share data and results between nodes. The data and results are ephemeral and require additional work to save for analysis by other applications or subsequent invocations of the same application. Emerging disaggregated architectures, including CXL GFAM, enable data to be held in external memory accessible to all compute nodes, thus providing a new approach to handling large data sets in HPC applications. Most HPC libraries do not currently support disaggregated memory models. In this paper, we present how disaggregated memory can be accessed by existing programming models such as OpenSHMEM, and the benefits of using disaggregated memory in these models. Migrating Complex Workflows to the Exascale: Challenges for Radio Astronomy Pascal Jahan Elahi (Pawsey Supercomputing Research Centre) and Matt Austin, Eric Bastholm, Paulus Lahur, Wasim Raja, Maxim Voronkov, Mark Wieringa, Matthew Whiting, Daniel Mitchell, and Stephen Ord (CSIRO) Abstract Real-time processing of radio astronomy data presents a unique challenge for HPC centers. 
The science data processing contains memory-bound codes, CPU-bound codes, portions of the pipeline consisting of large numbers of embarrassingly parallel jobs combined with large numbers of moderate- to large-scale MPI jobs, and IO that ranges from parallel IO writing large files to small jobs writing a large number of small files, all combined in a workflow with a complex job dependency graph and real-world time constraints from radio telescope observations. We present the migration of the Australian Square Kilometre Array Pathfinder Telescope's science processing pipeline from one of Pawsey's older Cray XC systems to our HPE-Cray EX system, Setonix. We also discuss the migration from bare-metal deployment of the complex software stack to a containerized, more modular deployment of the workflow. We detail the challenges faced and how the migration unearthed issues in the original deployment of the EX system. The lessons learned in the migration of such a complex software stack and workflow are valuable for other centers. Presentation, Paper Technical Session 3B Chair: Gabriel Hautreux (CINES) Spack Based Production Programming Environments on Cray Shasta Paul Ferrell and Timothy Goetsch (LANL) Abstract The Cray Programming Environment (CPE) provided for Cray Shasta OS based clusters provides a small but solid set of tools for developers and cluster users. The CPE includes Cray MPICH, Cray LibSci, the Cray debugging tools, and support for a range of compilers – and not much else. Users expect a wide range of additional software on these clusters, and frequently request new software outside of what’s provided by the CPE or what can be provided via the system packages. Foregoing our old manual installation process, the LANL HPC Programming and Runtime Environments Team has instead opted to utilize Spack as the installation mechanism for most additional software on all of our new HPE/Cray Shasta clusters. 
This brings with it several distinct advantages: Spack’s vast library of package recipes, well-defined software inventories, automatically generated modulefiles, and binary packages produced through our CI infrastructure. It also brings substantial issues: notably higher manpower requirements, longer turnaround times for software requests, a more challenging build debug process, and questionable long-term maintainability. Our paper will detail our approach and the benefits and pitfalls of using Spack to install and maintain production software environments. Containers-first user environments on HPE Cray EX Felipe Cruz and Alberto Madonna (Swiss National Supercomputing Centre) Abstract In High-Performance Computing (HPC), managing the user environment is a critical and complex task. It involves composing a mix of software that includes compilers, libraries, tools, environment settings, and their respective versions, all of which depend on each other in intricate ways. Traditional approaches to managing user environments often struggle to find a balance between stability and flexibility, especially in large systems serving diverse user needs. Cloud-Native Slurm management on HPE Cray EX Felipe A. Cruz, Manuel Sopena, and Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre) Abstract This work introduces a cloud-native deployment of the Slurm HPC Workload Manager, leveraging microservices, containerization, and on-premises cloud platforms to enhance efficiency and scalability. Utilizing Kubernetes and Nomad's APIs alongside DevOps tools, the system automates system operations, simplifies service configuration, and standardizes monitoring. However, implementing a cloud-native architecture poses challenges, including complex containerization and resource management issues that are intrinsic to HPC. 
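A Spack-based production environment like the one LANL describes is normally declared in a `spack.yaml` environment manifest. As a minimal sketch, the snippet below generates such a manifest programmatically; the package specs are illustrative placeholders, not LANL's actual software inventory.

```python
# Illustrative sketch: emit a minimal spack.yaml environment manifest.
# The spec list is a placeholder, not LANL's production inventory.

def spack_manifest(specs, unify=True):
    """Render a minimal spack.yaml manifest as a string.

    `unify=True` asks Spack's concretizer to resolve all specs into a
    single consistent dependency graph, which is what a shared
    production environment typically wants."""
    lines = ["spack:", "  specs:"]
    lines += [f"  - {s}" for s in specs]
    lines += ["  concretizer:", f"    unify: {str(unify).lower()}"]
    return "\n".join(lines) + "\n"

manifest = spack_manifest(["hdf5 +mpi", "petsc", "py-numpy"])
print(manifest)
```

Checking a generated manifest like this into the CI infrastructure mentioned above is one way the "well-defined software inventories" benefit materializes: the environment is reproducible from a single declarative file.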
Presentation, Paper Technical Session 3A Chair: Bilel Hadri (KAUST Supercomputing Lab) Early Application Experiences on Aurora at ALCF: Moving From Petascale to Exascale Systems Colleen Bertoni, JaeHyuk Kwack, Thomas Applencourt, Abhishek Bagusetty, Yasaman Ghadar, Brian Homerding, Christopher Knight, Ye Luo, Mathialakan Thavappiragasam, John Tramm, Esteban Rangel, Umesh Unnikrishnan, Timothy J. Williams, and Scott Parker (Argonne National Laboratory) Abstract Aurora, installed in 2023, is the newest system being prepared for production at the Argonne Leadership Computing Facility (ALCF). Throughout multiple years of preparation, the ALCF has tracked the progress of over 40 applications from the Exascale Computing Project and ALCF's Early Science Project in terms of their ability to run on Aurora and their performance on Aurora compared to other systems. In addition, the ALCF has been tracking bugs and issues reported by application developers. This broad tracking of applications in a standardized way, along with the tracking of over 1100 bugs and issues via source code reproducers, has been essential to ensuring the usability of Aurora. It has also helped ensure a smoother transition to Aurora for applications that run on past or current production systems, like Polaris, the ALCF's current production system. To gain insight into the current state of the applications ported to Aurora, a set of applications is compared in terms of single-GPU and single-node performance on Aurora and Polaris. On average, the Figure-of-Merit performance for the set of applications was 1.3x greater on a single GPU of Aurora than on a single GPU of Polaris. The intra-node parallel efficiency of the set of applications was similar between Aurora and Polaris. 
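The two metrics quoted in the Aurora abstract, per-GPU speedup and intra-node parallel efficiency, can be computed directly from per-application Figures of Merit. A small sketch, with invented FoM numbers rather than the paper's data:

```python
# Sketch: per-GPU speedup and intra-node parallel efficiency from
# Figure-of-Merit (FoM, higher is better) measurements.
# All numbers below are invented for illustration.

def speedup(fom_new, fom_ref):
    """FoM ratio of the new system over the reference system."""
    return fom_new / fom_ref

def parallel_efficiency(fom_node, fom_single_gpu, gpus_per_node):
    """Fraction of ideal scaling achieved across a full node."""
    return fom_node / (fom_single_gpu * gpus_per_node)

# One hypothetical application:
fom_polaris_gpu = 100.0
fom_aurora_gpu = 130.0    # a 1.3x per-GPU speedup, as in the abstract's average
fom_aurora_node = 624.0   # hypothetical measurement across 6 GPUs

s = speedup(fom_aurora_gpu, fom_polaris_gpu)
e = parallel_efficiency(fom_aurora_node, fom_aurora_gpu, 6)
```

Comparing FoM ratios rather than raw times lets heterogeneous applications (each with its own science metric) be averaged into a single cross-system summary, which is how a portfolio-wide "1.3x on average" claim is built.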
Streaming Data in HPC Workflows Using ADIOS Greg Eisenhauer (Georgia Institute of Technology); Norbert Podhorszki, Ana Gainaru, and Scott Klasky (Oak Ridge National Laboratory); Philip Davis and Manish Parashar (University of Utah); Matthew Wolf (Samsung SAIT); Eric Suchtya (Oak Ridge National Laboratory); Erick Fredj (Toga Networks, Jerusalem College of Technology); Vicente Bolea (Kitware, Inc); Franz Pöschel, Klaus Steiniger, and Michael Bussmann (Center for Advanced Systems Understanding); Richard Pausch (Helmholtz-Zentrum Dresden-Rossendorf); and Sunita Chandrasekaran (University of Delaware) Abstract The "IO Wall" problem, in which the gap between computation rate and data access rate grows continuously, poses significant problems for scientific workflows, which have traditionally relied upon the filesystem for intermediate storage between workflow stages. One way to avoid this problem is to stream data directly from producers to consumers, avoiding storage entirely. However, the manner in which this is accomplished is key to both performance and usability. This paper presents the Sustainable Staging Transport (SST), an approach that allows direct streaming between traditional file writers and readers with few application changes. SST is an ADIOS "engine", accessible via standard ADIOS APIs, and because ADIOS allows engines to be chosen at run-time, many existing file-oriented ADIOS workflows can utilize SST for direct application-to-application communication without any source code changes. 
This paper describes the design of SST and presents performance results from various applications that use SST: feeding model training with simulation data at substantially higher bandwidth than the theoretical limits of Frontier's file system, strongly coupling separately developed applications for multiphysics multiscale simulation, and performing in situ analysis and visualization of data to complete all data processing shortly after the simulation finishes. Enrichment and Acceleration of Edge to Exascale Computational Steering STEM Workflow using Common Metadata Framework Gayathri Saranathan (Hewlett Packard Enterprise, Hewlett Packard Labs); Martin Foltin, Aalap Tripathy, and Annmary Justine (Hewlett Packard Enterprise); Ayana Ghosh, Maxim Ziatdinov, and Kevin Roccapriore (Oak Ridge National Laboratory); and Suparna Bhattacharya, Paolo Faraboschi, and Sreenivas Rangan Sukumaran (Hewlett Packard Enterprise) Abstract Computational steering of experiments with the help of Artificial Intelligence (AI) has the potential to accelerate scientific discovery. Fulfilling this promise will require innovations in workflows and algorithms for experiment control. In this work we developed data management infrastructure that facilitates such innovation by enabling fine-grained partitioning and optimization of complex workflows between computational and experimental facilities with dynamic data sharing. This is enabled by the Common Metadata Framework (CMF), which tracks workflow data lineages and provides visibility across facility boundaries for relevant data subsets. We demonstrate the benefits on a novel Scanning Transmission Electron Microscopy (STEM) workflow of nanoparticle plasmonic transitions in materials science that crosses facility boundaries several times. The AI control model is seeded by a meta-model developed at a computational facility using data from other experimental sites to help impart prior knowledge and reduce measurement cost and sample degradation. 
The model is incrementally refined by successive measurements at an experimental facility. The evolution of model uncertainties is captured in CMF and fed back to the computational facility for analysis of potential new physical phenomena. The relevant experimental results can be used to calibrate molecular dynamics simulations at the computational facility that in turn influence the AI model refinement trajectory. Presentation, Paper Technical Session 3C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) CSM-based Software Stack Overview 2024 Harold Longley and Jason Sollom (Hewlett Packard Enterprise) Abstract The Cray System Management (CSM) software stack has been enhanced within the past year to improve the operating experience for an HPE Cray EX system. Features to be discussed include more automated software installation and upgrade, system backup and using that backup for disaster recovery reinstallation, automation for concurrent rolling reboots of management nodes, enhancements in CSM Diagnostics tooling, improved SMA monitoring tools, Slingshot switch Orchestrated Maintenance and LACP traffic load sharing, a containerized login environment, new compute node features (default kernel configuration, Low Noise Mode improvements, OS Noise Detection, DVS node health monitoring, Dynamic Kernel Module Support, and containers on compute nodes), improved multi-tenancy support, and image management for the aarch64 architecture. Overview of HPCM Peter Guyan and Sue Miller (HPE) Abstract This talk will give a brief overview of how HPCM is deployed on an HPE solution. It will describe what an admin node does, what Quorum HA is, and why we use SU-Leaders, including when to use Quorum HA and SU-Leaders, and will provide an introduction to the “cm” command suite and a brief overview of the monitoring tools present in HPCM. 
With a new set of monitoring tools available in HPCM 1.11, we will describe the new monitoring pipeline and how to enable the components it needs. Seamless Cluster Migration in CSM Miguel Gila and Manuel Sopena Ballesteros (Swiss National Supercomputing Centre) Abstract The ability to effortlessly migrate compute clusters between sites or zones is common in the cloud world, and it has recently also become a necessity for supercomputing facilities like the Swiss National Supercomputing Centre (CSCS), where its multi-region flagship infrastructure, Alps, serves multiple tenants and customers, some of them with very different development and operational requirements. Presentation, Paper Technical Session 4B Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois) Scalability and Performance of OFI and UCX on ARCHER2 Jaffery Irudayasamy, Juan F. R. Herrera, Evgenij Belikov, and Michael Bareford (EPCC, The University of Edinburgh) Abstract OpenFabrics Interfaces (OFI) and Unified Communication X (UCX) are both transport protocols that underlie the HPE Cray MPICH library on HPC systems like ARCHER2, and they can be selected at runtime by users. This paper presents the scalability and performance of the OFI and UCX transport layer protocol implementations on ARCHER2, an HPE Cray EX system that features the Slingshot 10 interconnect. We use ReproMPI microbenchmarks to study the performance of MPI collectives and run experiments using some of the most commonly used applications on ARCHER2. The results show that in most cases OFI and UCX performance is comparable at under 32 nodes (16384 cores), but at larger node counts OFI runs more reliably. Ultimately, when it comes to applications there is no one-size-fits-all solution, and profiling can facilitate tuning for best performance. 
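The runtime transport selection described above can be sketched as a job-script fragment. The module names follow the convention documented for ARCHER2-class Cray PE installations and may differ at other sites; `my_mpi_app` is a placeholder application name:

```shell
# Sketch: choosing the MPI transport layer for HPE Cray MPICH at job level.
# Module names are assumptions based on ARCHER2-style Cray PE setups.

# Default stack: Cray MPICH over OFI (libfabric).
module load craype-network-ofi

# Alternative stack: switch the network target and MPICH build to UCX.
# module load craype-network-ucx
# module swap cray-mpich cray-mpich-ucx

# Optional, illustrative UCX tuning: restrict the transports UCX considers.
# export UCX_TLS=rc,self,sm

srun ./my_mpi_app   # my_mpi_app is a placeholder
```

Because both stacks implement the same MPI interface, the application binary itself is typically unchanged; only the launch environment differs.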
Using P4 for Cassini-3 Software Development Environment Hardik Soni, Frank Zago, Khaled Diab, Igor Gorodetsky, and Puneet Sharma (HPE) Abstract We present a novel approach for co-developing the hardware of Network Interface Card (NIC) ASICs (e.g., Slingshot Cassini-3) and their software stacks. Due to the increasing gap between network link bandwidths and the compute power of hosts, critical parts of the software stacks of many applications are offloaded to NICs for efficient processing. By processing certain functions using specialized hardware blocks in NICs, compute resources can be better utilized for actual application processing. Therefore, the use cases and design of NICs evolve rapidly with advancements in the transmission capacity of network links. To reduce time-to-market for next-generation NICs with complex features implemented in hardware, we leverage the software ecosystem of programmable networks and compiler technology for development, testing, and verification of the Slingshot NICs. Running NCCL and RCCL Applications on HPE Slingshot NIC Jesse Treger and Caio Davi (HPE) Abstract There has been a rise in Machine Learning applications on High-Performance Computing systems equipped with GPUs, motivated by increasing demands for model and dataset sizes. Although there are no fundamental limitations preventing these workloads from running in MPI runtimes, the in-house chip makers' communications collectives libraries have become the preferred deployments. In this presentation, we will show how these libraries work and how to take full advantage of the HPE Slingshot interconnect. We will review the use and configuration of the two most common of these, NCCL and RCCL, to run using the HPE Slingshot NIC RDMA capability. We will explain the key parameters for the given environments and provide recommendations for optimal settings. 
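As a rough illustration of the kind of environment configuration such a presentation covers, the fragment below sets a few NCCL/libfabric variables commonly discussed for Slingshot systems. The specific variables and values here are illustrative assumptions, not vendor recommendations; consult the presentation and site documentation for actual settings. `my_nccl_app` is a placeholder:

```shell
# Sketch: example environment for NCCL/RCCL over the Slingshot NIC via the
# libfabric plugin. Values are placeholders for illustration only.

export NCCL_SOCKET_IFNAME=hsn      # bootstrap over the high-speed network interfaces
export NCCL_NET_GDR_LEVEL=PHB      # allow GPUDirect RDMA through the host bridge
export NCCL_CROSS_NIC=1            # permit flows to cross NICs between ranks
export FI_CXI_RDZV_THRESHOLD=0     # CXI provider rendezvous tuning (example value)

srun ./my_nccl_app                 # my_nccl_app is a placeholder
```

The same pattern applies to RCCL on AMD GPU systems, which reads most of the same `NCCL_*` environment variables.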
Enabling NCCL on Slingshot 11 at NERSC Jim Dinan (NVIDIA), Peter Harrington (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Igor Gorodetsky (HPE), Josh Romero (NVIDIA), Steven Farrell (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Ian Ziemba (HPE), and Wahid Bhimji and Shashank Subramanian (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract The NVIDIA Collective Communications Library (NCCL) is widely used for multi-node AI applications as well as other scientific codes. Initial deployments of Slingshot 11 (SS11) did not support this library at high performance, leading to detrimental impacts on the performance of Deep Learning applications on SS11-based HPC systems. We describe collaborative efforts between NERSC, NVIDIA, and HPE to develop capabilities for NCCL on SS11. This involved development of the Libfabric NCCL plugin and an extensive period of testing and refinement on the Perlmutter HPC system at NERSC. In this presentation we will describe the development required as well as the performance improvements measured on Perlmutter for both benchmarks and cutting-edge scientific AI applications. Presentation, Paper Technical Session 4A Chair: Lena M Lopatina (LANL) Multi-stage Approach for Identifying Defective Hardware in Frontier Nick Hagerty (Oak Ridge National Laboratory), Andy Warner (HPE), and Jordan Webb (Oak Ridge National Laboratory) Abstract In June 2022, the long-awaited exaflop compute barrier (1 quintillion floating-point operations per second) was surpassed on the TOP500 list by Frontier, an HPE Cray EX supercomputer at Oak Ridge National Laboratory (ORNL). Drawing a peak power of 21.1 MW, Frontier demonstrated 1.1 exaflops of computational capability, much of which is supplied by more than 37,000 AMD Instinct MI250X graphics processing units (GPUs). 
With a single GPU drawing up to 560 W thermal design power (TDP), each AMD MI250X draws twice as much power under load as the NVIDIA V100 GPUs used in Frontier’s predecessor at ORNL, the 200-petaflop IBM POWER9 supercomputer Summit. There are many other major technological advances in the memory, compute, power, and infrastructure of Frontier that are new to production environments. Frontier's mission to enable ground-breaking research in the U.S. energy, economic, and national security sectors is fulfilled through leadership-class workloads, which are workloads that demand greater than 20% of the supercomputer. These large workloads are vulnerable to defective and failing computing hardware. The rate of failing hardware is quantified through the mean time between failures (MTBF), the mean length of time between hardware-level failures anywhere in the system. In this work, we describe the multi-stage approaches to stabilizing and maintaining the functionality of the hardware on Frontier. Three strategies are discussed: the first two utilize leadership-class tests to improve the MTBF of Frontier, and the third utilizes single-node validation to efficiently identify individual instances of defective hardware in Frontier. We provide summarized data from each of the three strategies, then classify the diverse set of failures and discuss trends in defective hardware, before discussing several key challenges to identifying defective hardware and improving the MTBF of Frontier. From Frontier to Framework: Enhancing Hardware Triage for Exascale Machines Isa Wazirzada, Abhishek Mehta, and Vinanti Phadke (Hewlett Packard Enterprise) Abstract Supercomputers are complex systems that bring together bleeding-edge technologies. Take the example of the first exascale system, Frontier, installed at Oak Ridge National Laboratory. Frontier consists of more than 9,400 compute nodes embedded in the HPE Cray EX4000 infrastructure. 
Each compute node consists of an AMD CPU, four MI250X GPUs interlinked by high-speed XGMI interfaces, and Slingshot-11 high-speed NICs. The system comprises over 150,000 node-level components interconnected in an extremely dense mechanical framework and is cooled via warm-temperature liquid cooling. As impressive as all these technologies are on their own, the real value lies in bringing them together in a system that achieves sustained performance over time. With that in mind, it behooves us to recognize that system quality attributes such as diagnosability and serviceability are critical to achieving high levels of availability throughout the service life of a system. Therefore, for HPE and our customers, providing a product-level hardware triage framework will help reduce the return-to-service time for failed components, provide a standardized approach to diagnosing hardware failures, reduce the number of no-trouble-found replacements, and minimize the need for SMEs from R&D to directly support systems in the field. Full-stack Approach to HPC Testing Pascal Jahan Elahi and Craig Meyer (Pawsey Supercomputing Research Centre) Abstract A user of a High Performance Computing (HPC) system relies on a multitude of components, both on the user-facing side, such as modules, and in lower-level system software, such as Message Passing Interface (MPI) libraries. Thus, all these different aspects must be tested to guarantee that an HPC system is production ready. We present here a suite of tests that covers this larger space, which not only focuses on benchmarking and sanity checks but also provides some diagnostic information when failures are encountered. These tests cover the job scheduler (here, Slurm); the MPI library, critical for running jobs at scale; GPUs, a vital part of any energy-efficient HPC system; and the performance of the compilers that are part of the Cray Programming Environment (CPE). 
These tests were critical to uncovering a number of underlying issues with the communication libraries on a newly deployed HPE Cray EX Shasta system that had gone undetected in other acceptance tests; they also identified bugs within the job scheduler. The tests are implemented in the ReFrame framework and are open source. An Approach to Continuous Testing Francine Lapid and Shivam Mehta (Los Alamos National Laboratory) Abstract The Department of Energy's National Nuclear Security Administration supercomputers at Los Alamos National Laboratory (LANL) are integral to supporting the lab’s mission and therefore need to be reliable and performant. To identify potential problems ahead of time while minimizing the interruption to users’ work, the High Performance Computing (HPC) Division at LANL implemented a Continuous Testing framework and the necessary infrastructure to automatically and frequently run a series of tests and proxy applications. The tests, which benchmark various system components, were integrated into the Pavilion2 testing framework and are launched in small Slurm jobs across each machine every weekend. The results of the tests are summarized in a comprehensive Splunk dashboard, enabling continuous monitoring of the health of the machines over time without having to parse through each run’s output and logs. This project is currently running on three of LANL’s newest fleet of Cray Shasta machines (Chicoma, Razorback, and Rocinante) and will eventually be implemented on all production clusters at LANL. This paper details the different components of the Continuous Testing framework, the resulting setup, and the impact it has on our HPC workflow. 
Presentation, Paper Technical Session 4C Chair: Gabriel Hautreux (CINES) LLM Serving With Efficient KV-Cache Management Using Triggered Operations Aditya Dhakal, Pedro Bruel, Gourav Rattihalli, Sai Rahul Chalamalasetti, and Dejan Milojicic (Hewlett Packard Enterprise, Hewlett Packard Labs) Abstract A large language model (LLM)’s Key-Value (KV) cache requires enormous GPU memory during inference. For faster query processing and conversational memory in chat applications, this cache is stored to answer subsequent user queries. However, if the cache is buffered on the GPU, its large memory requirement prevents multiplexing and requires cache buffering in remote storage. In current systems, transferring and retrieving the cache requires the CPU to coordinate with the GPU and push and fetch the data through the network, increasing the overall latency. This paper proposes lower-overhead KV cache storage and retrieval using SmartNICs capable of triggered operations, such as Cassini. Triggered operations enqueue pre-defined data transfer instructions on the NIC. A GPU thread can trigger these instructions once the LLM computes the KV cache for a token. Cassini Network Interface Cards (NICs) then transfer the cache, bypassing the CPU and network stack and improving data-transfer latency. Our experiments show that data transfer with triggered operations provides a 19× speedup in transfers ranging from 32 KB to 5 TB. From Chatbots to Interfaces: Diversifying the Application of Large Language Models for Enhanced Usability Jonathan Sparks, Pierre Carrier, and Gallig Renaud (Hewlett Packard Enterprise) Abstract This paper explores the application of Large Language Models (LLMs) in three distinct scenarios, demonstrating their potential to improve user experience and efficiency. Firstly, we examine the application of LLMs such as OpenAI’s GPT or Llama in chatbots that assist in programming environments, providing real-time assistance to developers. 
Secondly, we explore using LLMs and Python to search an internal document corpus for performance engineering, significantly improving the retrieval of relevant information from extensive technical documentation. Lastly, we investigate using LLMs as an interface to system batch schedulers, such as Slurm or PBS, replacing domain-specific languages and prompts with natural text. This approach democratizes access to complex systems, fostering ease of use and enhancing the user experience. Through these use cases, we underscore the versatility and potential of LLMs, highlighting their role as an aid to system operation and user experience. Delivering Large Language Model Platforms With HPC Laura Huber, Abhinav Thota, Scott Michael, and Jefferson Davis (Indiana University) Abstract In 2023, we saw a huge rise in the capability and popularity of large language models (LLMs). OpenAI released ChatGPT 3.0 to the public in November 2022, and since then, many closed and open-source LLMs have been released. It has been reported that training ChatGPT 3.0 took more than 10,000 GPU cards, making training a foundational LLM out of reach for many research teams and HPC centers, but there are many ways to use a pre-trained LLM with GPUs at scales available at an HPC center. For example, one can fine-tune a pre-trained model with site- or application-specific data, or augment the model with Retrieval-Augmented Generation (RAG) to add specific knowledge. The availability of open-source LLMs has opened the field for individual researchers and service providers to do their own custom training and run their own chatbots. In this paper, we describe how we deployed and evaluated open-source LLMs on Quartz and Big Red 200, a Cray EX supercomputer, and provisioned access to these deployments for a select group of HPC users. 
System for Recommendation and Evaluation of Large Language Models for Practical Tasks in Science Cong Xu, Tarun Kumar, Martin Foltin, Annmary Justine, Sergey Serebryakov, Arpit Shah, Agnieszka Ciborowska, Ashish Mishra, Gyanaranjan Nayak, Suparna Bhattacharya, and Paolo Faraboschi (Hewlett Packard Enterprise) Abstract Large Language Models (LLMs) are showing an impressive ability to reason and answer complex questions after a few contextual prompts. This has transformational potential for scientific productivity and new discoveries. However, LLMs commonly suffer from hallucination problems, and their performance depends strongly on the specific task and model. Several ad-hoc LLM benchmarking studies have been reported for specific tasks in medicine, biology, and chemistry. The science community typically prefers open-source models, including derivatives of Llama-2, Galactica, etc. These models usually perform worse than the commercial GPT-4, requiring more scrutiny. In this work we developed an automated platform that co-optimizes LLM recommendation and evaluation for user-specified tasks to make it easier for science practitioners to select and evaluate promising LLMs for their use case. Uniquely, it uses a recommender engine trained on the characteristics of a large number of training datasets, LLM models, and inference data to find a list of the best candidate models for user-specified prompt templates. It then evaluates and ranks these candidates using an automated reviewer that performs multi-metric assessment and, uniquely, iteratively revisits the evaluation to validate its truthfulness, employing the reflection approach and the self-consistency technique. The evaluation platform integrates seamlessly with HPE MLDE (Machine Learning Development Environment). 
Presentation, Paper Technical Session 5B Chair: Adrian Jackson (EPCC, The University of Edinburgh) Leveraging GNU Parallel for Optimal Utilization of HPC Resources on Frontier and Perlmutter Supercomputers Ketan Maheshwari (Oak Ridge National Laboratory), William Arndt (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Rafael Ferreira Da Silva (Oak Ridge National Laboratory) Abstract In the realm of HPC, efficiently exploiting parallelism on large-scale systems presents a significant challenge, particularly when aiming for a low-barrier, nonprogrammatic approach. This presentation showcases the effective application of GNU Parallel for optimizing the use of four key resources at OLCF’s Frontier and NERSC’s Perlmutter supercomputers: CPUs, GPUs, NVMe storage, and large-scale storage systems. Our work positions GNU Parallel not merely as a utility tool but as a mainstream, productive instrument for diverse HPC workloads. We illustrate GNU Parallel’s capacity to efficiently harness CPU resources across up to 9,000 Frontier compute nodes, with a manageable increase in overhead even as node utilization scales up. A detailed analysis of overhead metrics reveals that overhead remains around 5 minutes for up to 7,000 nodes and under 10 minutes for up to 9,000 nodes comprising up to 1.15 million parallel tasks. Our approach is further elucidated through a series of practical vignettes, ranging from daily tasks to massively scaled operations. These vignettes offer generalized solutions for common computational and I/O patterns observed in real-world applications, validated through scalable examples on the Frontier and Perlmutter supercomputers. 
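A minimal sketch of the low-barrier pattern described above: inside a Slurm allocation, GNU Parallel fans a list of independent tasks out across the allocated nodes as single-task srun steps. The node count, task count, and `process_input` script are illustrative placeholders, not values from the presentation:

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=00:30:00
# Sketch: run 512 independent tasks across a 4-node allocation,
# keeping up to 128 task steps in flight at once. GNU Parallel
# substitutes each input value for {} in the command.

seq 1 512 | parallel --jobs 128 \
    srun --nodes=1 --ntasks=1 --exact ./process_input {}
```

The same pipeline shape scales to much larger node counts; only the `--jobs` limit and the input list change, which is what makes the approach nonprogrammatic.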
Portable Support for GPUs and Distributed-Memory Parallelism in Chapel Andrew Stone and Engin Kayraklioglu (Hewlett Packard Enterprise) Abstract Writing performant programs on modern supercomputers means targeting parallelism at multiple levels of scale: from vectorization in a single core, to multicore CPUs, to nodes with multiple CPUs, to GPUs and nodes with multiple GPUs. Traditionally, this has meant writing applications using multiple programming models; for example, an application might be written as a combination of MPI, OpenMP, and/or CUDA. This use of multiple programming models complicates an application’s implementation, maintainability, and portability. PaCER: Accelerating Science on Setonix Maciej Cytowski and Ann Backhaus (Pawsey Supercomputing Research Centre) and Joseph Schoonover (Fluid Numerics) Abstract The Pawsey Centre for Extreme-Scale Readiness (PaCER) was a unique Australian program supporting researchers in developing and optimising codes on Setonix, Australia’s most powerful research supercomputer, based on AMD Milan CPUs and MI250X GPUs. The focus of the PaCER program was on both extreme-scale research (algorithm design, code optimisation, application and workflow readiness) and using the computational infrastructure to facilitate research that produces world-class scientific outcomes. PaCER was a collaborative partnership between Pawsey and supercomputing vendors that provided early access to HPC tools and infrastructure, training, and exclusive hackathons focused on performance at scale. PaCER supported 10 projects gathering 18 international and national institutions, more than 60 researchers, and 10 Research Software Engineers. The biggest success of the project is the creation of a supercomputing software developers’ community, the first of its kind in Australia. 
In this talk, we will describe the collaboration model created by PaCER and cover significant achievements of the project, including the series of GPU programming mentored sprints. We will also share some of the lessons learned and future plans. Presentation, Paper Technical Session 5A Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Power and Performance Analysis of Grace Hopper Superchips on HPE Cray EX Systems Benjamin Donald Cumming and Miguel Gila (CSCS), Brian Collum (HPE), Sebastian Keller (CSCS), and Bryan Villalon and Steven James Martin (HPE) Abstract Systems with tightly integrated CPU and GPU on the same module or package will be deployed in 2024, based on NVIDIA GH200 and AMD MI300A processors. There are some significant differences from current systems, of which the key ones for this presentation are: Accelerating Scientific Workflows with the NVIDIA Grace Hopper Platform Gabriel Noaje (NVIDIA) Abstract This session will provide a demonstration of top ML frameworks, HPC applications, and tools for data science on NVIDIA's Grace Hopper and Grace CPU Superchips. These superchips are the cornerstones of versatile and power-efficient supercomputers worldwide that combine Grace CPUs, Hopper GPUs, and extreme-scale networking technology in a standards-compliant server. We'll showcase recent results from key applications like PyTorch, JAX, WRF, GROMACS, and NAMD. We'll also share lessons learned and experiences to help guide developers creating their own applications for NVIDIA Grace superchips. This session is a strong starting point for anyone looking to understand, and develop for, NVIDIA Grace Hopper or the Grace CPU Superchip. 
GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability Andrey Alekseenko (KTH/SciLifeLab); Szilárd Páll (KTH/PDC); and Erik Lindahl (KTH, Stockholm University) Abstract GROMACS is a widely used molecular dynamics software package with a focus on performance, portability, and maintainability across a broad range of platforms. Thanks to its early algorithmic redesign and flexible heterogeneous parallelization, GROMACS has successfully harnessed GPU accelerators for more than a decade. With the diversification of accelerator platforms in HPC and no obvious choice of a well-suited multi-vendor programming model, the GROMACS project found itself at a crossroads. The performance and portability requirements, as well as a strong preference for a standards-based programming model, motivated our choice to use SYCL for production on both new HPC GPU platforms: AMD and Intel. Since the GROMACS 2022 release, the SYCL backend has been the primary means to target AMD GPUs in preparation for exascale HPC architectures like LUMI and Frontier. SYCL is a cross-platform, royalty-free, C++17-based standard for programming hardware accelerators, from embedded devices to HPC. It allows using the same code to target GPUs from all three major vendors with minimal specialization, which offers major portability benefits. While SYCL implementations build on native compilers and runtimes, whether such an approach is performant is not immediately evident. Biomolecular simulations have challenging performance characteristics: latency sensitivity, the need for strong scaling, and typical iteration times as short as hundreds of microseconds. Hence, obtaining good performance across the range of problem sizes and scaling regimes is particularly challenging. Here, we share the results of our work on readying GROMACS for AMD GPU platforms using SYCL, and demonstrate performance on Cray EX235a machines with MI250X accelerators. 
Our findings illustrate that portability is possible without major performance compromises. We provide a detailed analysis of node-level kernel and runtime performance with the aim of sharing best practices with the HPC community on using SYCL as a performance-portable GPU framework. Presentation, Paper Technical Session 5C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Cray EX Security Experiences Ben Matthews (NCAR/UCAR) Abstract Security is an important, if sometimes overlooked, part of running an HPC system. We will describe the out-of-box experience with the Cray EX system from a security perspective. Several security issues (and mitigations for them) present in the Cray EX (HPCM) software stack and, perhaps, in HPC systems in general will be described. The process for reporting, patching, and publishing these issues will be discussed, as well as some thoughts on where to look for as-yet-undiscovered vulnerabilities and how to reduce the risk they pose. Finally, some general advice for securing Cray and other HPC systems will be provided. Best of Times, Worst of Times: A Cautionary Tale of Vulnerability Handling Aaron Scantlin (National Energy Research Scientific Computing Center) Abstract Vulnerabilities are an unfortunate reality in modern computing. While there's plenty of discussion around the importance of detection and patching in HPC, there's not as much chatter about how to handle vulnerabilities discovered at one's own institution. While this process is currently ad hoc and depends in large part on the maintainers of the software in question, one HPC center's recent experience with the discovery of a critical vulnerability in Lustre within COS, reporting that vulnerability to HPE, and the subsequent handling of that information by both groups suggests that there's room for improvement in a variety of areas on both sides. 
In this presentation, NERSC Security will: AIOps Empowered: Failure Prediction in System Management Software Tools Deepak Nanjundaiah and Subrahmanya Vinayak Joshi (HPE) Abstract In the realm of High-Performance Computing (HPC), our project addresses the escalating challenge of failures, particularly with the anticipated surge in complexity of Exascale systems. Our innovative Semi-Supervised Failure Prediction service, applicable to various system management software tools, utilizes deep learning on telemetry data for real-time, end-to-end failure prediction. Our approach centers on deep learning models proficient in deciphering intricate patterns within extensive datasets. From data acquisition to prediction, our solution integrates seamlessly with system management software, analyzing critical metrics like CPU usage, memory status, and network activity. By learning from historical data, the model distinguishes between normal and failure states, providing real-time predictions before potential failures. With a semi-supervised learning approach using both labeled and unlabeled data, our model adapts effectively to diverse failure scenarios. Integrated with the AIOps service in system management tools, it offers organizations a proactive edge, enabling early intervention for cost reduction, minimized downtime, and improved data center efficiency. In our upcoming presentation, we will delve into a concise results overview, showcasing the benefits of our approach in enhancing predictive capabilities and providing organizations with strategic advantages in system management. This functionality is being considered for future inclusion in HPE system management products. Presentation, Paper Technical Session 6B Chair: Paul L. Peltz Jr. 
(Oak Ridge National Laboratory) POD: Reconfiguring Compute and Storage Resources Between Cray EX Systems Eric Roman and Tina Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Sean Lynn (Hewlett Packard Enterprise) Abstract This paper describes how a set of liquid-cooled compute and air-cooled storage resources can be reconfigured between Cray EX systems. In each configuration, the resources function as a native set of managed nodes and/or directly attached storage resources on the associated host system. In a normal production configuration, for example, users can run compute jobs on the reconfigured compute nodes or read and write data to the corresponding filesystems. To administrators, these compute nodes are managed like conventional compute nodes, and the storage is managed by the Neo software stack. The Slingshot network is connected to each of the possible host systems, and the systems management networks interconnect via a layer 3 EVPN VXLAN tunnel. NERSC has implemented this reconfigurable architecture on a set of liquid-cooled and air-cooled resources termed POD. POD resources have been successfully transitioned between NERSC's development, staging, and production systems and are currently configured and in use on NERSC's flagship system, Perlmutter. Zero Downtime System Upgrade Strategy Alden Stradling and Joshi Fullop (Los Alamos National Laboratory) Abstract In a perfect world, HPC system downtime would be easy to minimize: just keep a perfect copy of the production cluster to prevent scaling surprises. Multitenancy on HPE Cray EX: network segmentation and isolation Chris Gamboni (Swiss National Supercomputing Centre) Abstract The Swiss National Supercomputing Centre (CSCS) has developed a strategy to provide Infrastructure as a Service (IaaS) to select customers co-investing in the Alps infrastructure. 
A critical component of this service is the ability to segregate and isolate the Alps network for various IaaS tenants, enabling them to integrate their own site networks. This necessitates the management of network multitenancy capabilities within the infrastructure. The Alps system, powered by HPE Cray EX machines and managed through Cray System Management (CSM) with the Slingshot interconnect, utilizes VLANs for node segmentation at the High-Speed Network (HSN) level. Implementing network multitenancy with CSM requires a novel configuration approach in the node management network. Presentation, Paper Technical Session 6A Chair: Chris Fuson (Oak Ridge National Laboratory) Unification of Alerting Engines for Monitoring in System Management Raghul Vasudevan, Ambresh Gupta, and Sinchana Karnik (Hewlett Packard) Abstract Unified alerting provides a single interface or platform in system management for creating and managing alerts for the various components within HPC systems. The system management monitoring stack collects different types of telemetry from various system components and stores events and logs in OpenSearch and metrics in Timescale. HPE Cray EX255a Telemetry - Improved Configurability and Performance Sean Byland, Steven Martin, and Brian Collum (HPE) Abstract The new HPE Cray EX255a blade (two nodes, each with 4 AMD MI300A sockets) has power demands and additional sensors that require a more robust power subsystem and data collection capability. This necessitated careful evaluation and changes to how we manage, collect, and publish sensor data. We factored all sensor access parameters and operational characteristics for the EX255a node cards into a standardized file format. These files define default values for compilation. The files can be edited and unmarshalled into runtime-accessible structures, enabling testing, tuning, and experimentation with alternative settings. 
Starting at the hardware access layer and working up the stack, we optimized the code paths to enable collection of more sensors within our fixed time budget. This work is the foundation for future work that could enable higher-level management and monitoring software to customize data collection on behalf of users. Best Practices for deployment of LDMS on the HPE Cray EX platform James Brandt, Kevin Stroup, and Ann Gentile (Sandia National Laboratories) Abstract The Lightweight Distributed Metric Service (LDMS) has been deployed on some of the largest Cray systems over the past decade to enable low-overhead capture of system and application metrics of interest. LDMS has evolved over time to provide new capabilities and associated configuration options to address the ever-increasing size and heterogeneity of HPC systems. In the last quarter of 2023, a working group was formed to formalize “best practices” for LDMS deployment on large-scale HPC systems as well as to help guide future configuration management approaches and mechanisms in the LDMS open-source development project. We present the results from this working group as they apply to base-level configurations of samplers and aggregators, authentication mechanisms, and practices to simplify deployment, including use of pre-built Docker containers. For those interested in automated aggregator load balancing and resilience to host failure, we describe capabilities of the LDMS distributed configuration manager (Maestro). Finally, we present planned extensions and the capabilities they provide. Presentation, Paper Technical Session 6C Chair: Bilel Hadri (KAUST Supercomputing Lab) ClusterStor Tiering, Overview, Setup, and Performance Nathan Rutman (Hewlett Packard) Abstract ClusterStor Tiering is a suite of software features designed to enhance the usability and management of hybrid storage systems, combining both flash and disk components.
Specifically crafted for monitoring and maintaining file layouts and free space on E1000 flash and disk tiers, Tiering offers a range of customizable capabilities through data management policies. Administrators can tailor fine-grained indexing controls, orchestrate file migrations between Object Storage Targets (OSTs) or pools, execute restriping processes, perform purges, and generate reports. These actions are intelligently triggered by preset timers or dynamically in response to system conditions, such as reaching capacity thresholds. Leveraging a scale-out architecture, Tiering efficiently handles the movement of large data volumes. Tiering uses the System Management Unit (SMU) for all functions; additional data mover nodes can be configured to augment throughput. Key functionalities supported by Tiering include scalable search, transparent tiering, parallel data movers, data purging, and reporting. Exploring new software-defined storage technology using VAST on Cray EX systems Mark Klein, Chris Gamboni, Gennaro Oliva, and Salvatore Di Nardo (Swiss National Supercomputing Centre, ETH Zurich); Maria Gutierrez (VAST Data); and Riccardo Di Maria and Miguel Gila (Swiss National Supercomputing Centre, ETH Zurich) Abstract Alps is the Swiss National Supercomputing Centre's multi-tenant software-defined infrastructure. This paper describes the configuration and experiences of getting VAST working as a performant filesystem option on the HPE Cray EX line of supercomputers and highlights the possibility of attaching additional storage options over the edge routers of these systems. Reducing Mean Time to Resolution (MTTR) for complex HPC-based systems with next generation automated service tools Michael Cush (HPE) Abstract After years of experience with Cray’s System Snapshot Analyzer (SSA), the HPC Call Home team worked to develop a new, more flexible, scalable, open, and secure Call Home infrastructure to support our future HPC products.
Becoming part of HPE allowed us to take advantage of and include HPE’s highly secure Remote Data Access (RDA) capabilities as part of that new infrastructure. A key design point was to make the new product useful even for sites that are not typically uploading data – which sounds rather odd for a “call home” tool set. Other points were the maintenance of a pluggable and highly configurable collection framework partnered with an efficient storage methodology. This paper will discuss the design and highlight where enhancements were made. Example collection plugins will be reviewed. Finally, the paper will seek to answer the question, “So why should I run SDU?” Presentation, Paper Technical Session 7A Chair: John Holmen (Oak Ridge National Laboratory) Proactive Precision: Enhancing High-Performance Computing with Early Job Failure Detection Dipanwita Mallick, Siddhi Potdar, Saptashwa Mitra, Nithin Mohan, and Charlie Vollmer (Hewlett Packard Enterprise) Abstract In the high-performance computing (HPC) realm, swiftly identifying job failures is critical to optimize resource allocation and ensure system efficiency. Given the high costs and extensive resource demands of HPC systems, the impact of job failures, particularly post-resource allocation, is significant. These failures, especially damaging in time-sensitive research domains, can derail progress and obstruct objectives. Proactive failure detection allows administrators to quickly enact corrective actions, like job resubmission or reconfiguration, reducing downtime and enhancing user satisfaction. Our approach includes predicting job failures at the initial stages, analyzing failure causes, and developing preventive strategies. By implementing a robust data collection process within the HPC system and utilizing the Slurm workload manager, we have streamlined the data handling procedures.
Our methodology involves data preprocessing, feature engineering, and using machine learning models optimized with cross-validation, addressing class imbalances, and focusing on precision, recall, and F1-score metrics. This thorough approach aims to improve resource optimization and prevent future inefficiencies in HPC systems. Presentation, Paper Technical Session 8B Chair: Raj Gautam (ExxonMobil) Using HPE-Provided Resources to Integrate HPE Support into Internal Incident Management John Gann, Daniel Gens, and Elizabeth Bautista (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract High Performance Computing (HPC) centers need to streamline incident management workflows while keeping information synchronized between internal tickets and vendor support cases. Prior to HPE’s acquisition of Cray, NERSC created an integration between their ServiceNow incident management platform and the Crayport platform. This integration became obsolete once HPE took over Cray, and NERSC staff had no choice other than to input information manually every time a new incident was opened or required updating. Further, this manual entry needed to be performed in both ServiceNow and HPE’s platform. Presentation, Paper Technical Session 8A Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Optimizing I/O Patterns to Speed up Non-contiguous Data Retrieval and Analyses Scott Klasky, Qian Gong, and Norbert Podhorszki (Oak Ridge National Laboratory) Abstract Scientific applications running on exascale supercomputers generate massive data requiring efficient storage for future analysis. While simulations leverage thousands of nodes for writing, reading and analysis typically utilize limited resources, i.e., a handful of nodes.
As a result, users commonly query only a particular plane of a multidimensional array or read data from files in strides, hoping that the overall I/O and data processing cost is reduced. This presentation delves into these non-contiguous and striding I/O patterns commonly employed in scientific data analyses. Using visualization as an example, we reveal the detrimental impact of non-contiguous file access on overall throughput, counteracting the speedup gained from analyses with reduced data volume. Recognizing the pattern of scientific data access – primarily written once and read frequently – we propose to refactor data at the time of writing into a format that leads to efficient retrieval. We investigate several data organization and refactoring strategies, assessing their impact on reading performance, writing performance, and error incurred on post-analysis across several commonly used query and post-analysis tasks. Our experiments are conducted on the Frontier supercomputer at Oak Ridge National Laboratory, providing insights for optimizing I/O operations in the exascale computing era. Presentation, Paper Technical Session 8C Chair: Jim Rogers (Oak Ridge National Laboratory) Building LDMS Slingshot Switch Samplers Kevin Stroup, Cory Lueninghoener, Jim Brandt, and Ann Gentile (Sandia National Laboratories) Abstract The Lightweight Distributed Metric Service (LDMS) is widely used for monitoring HPC systems and is integrated in HPE’s Cray System Management architecture as well as HPE’s High Performance Cluster Management architecture. One of the important components of an HPC system to monitor is the high-speed interconnect. In the case of the HPE Cray EX family of systems, that interconnect is the Slingshot high-speed network. LDMS utilizes “samplers” to gather data about network metrics, including some metrics that can only be determined by a sampler running on the Slingshot switches.
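For readers unfamiliar with how LDMS samplers are wired up, a sampler plugin is loaded, configured, and started through ldmsd directives. The fragment below is a minimal sketch in LDMS v4 configuration syntax; the host name, component id, interval, and the choice of the meminfo plugin are illustrative assumptions, not taken from the abstracts above:

```
# Load, configure, and start a sampler plugin on a node-level ldmsd
load name=meminfo
config name=meminfo producer=nid000001 instance=nid000001/meminfo component_id=1
start name=meminfo interval=1000000 offset=0   # sample every 1 s
```

An aggregator-level ldmsd would then add this daemon as a producer and store the collected metric sets; switch-resident samplers such as those described above follow the same plugin model.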
Tutorial Tutorial 1B Image Deployment and System Monitoring with HPCM Peter Guyan, Sue Miller, Andy Warner, and Raghul Vasudevan (HPE) Abstract Creating an image combining the components needed for an HPE cluster compute node on HPCM can be a daunting task. This tutorial will show how a recipe is curated, documented, created, and deployed for a GPU-enabled compute node. The image needed to perform user tasks requires the base OS, HPE Cray Supercomputing Programming Environment software, Slingshot host software, OS updates, GPU drivers, and a workload manager. Tutorial Tutorial 1A Supercomputer Affinity on HPE Systems Edgar A. Leon and Jane E. Herriman (Lawrence Livermore National Laboratory) Abstract When we consider the grand challenges addressed by supercomputing, we likely imagine large machines, like Lawrence Livermore National Laboratory's El Capitan or Oak Ridge National Laboratory's Frontier, and parallel applications that can leverage those machines. Yet, these two pillars of HPC, HPC hardware and HPC software, are not enough to ensure excellent application performance. When unaware of the topology of the underlying hardware, even well-designed software applications can fail to achieve full performance on top-notch systems. Affinity, how software maps to and leverages local hardware resources, forms a third pillar critical to HPC. Tutorial Tutorial 1C Omnitools: Performance Analysis Tools for AMD GPUs Samuel Antao (AMD) Abstract The top entries of the TOP500 list feature systems enabled with AMD Instinct GPUs, including the world's and Europe's fastest supercomputers, Frontier and LUMI, respectively. As these systems are already in production, application teams require the ability to profile applications to ascertain performance. To enable this, AMD released two new profiling tools in 2022: Omnitrace and Omniperf.
These tools are the result of close collaborations between AMD development teams and computational scientists aimed at unpicking performance bottlenecks in applications and identifying improvement strategies. Omnitrace targets end-to-end application performance, generating timelines that cover MPI, OpenMP, Kokkos, Python, etc. It enables the developer to identify relevant hardware counters to collect and generate information on performance-limiting kernels. Omniperf can then be used to seek further insight into these kernels through roofline analysis, memory chart analysis, and read-outs of many metrics including cache access, GPU utilization, and speed-of-light analysis. In this tutorial, we will present advanced features of these tools, with live demonstrations, and provide numerous hands-on examples for attendees to identify and mitigate bottlenecks in scientific and machine learning applications running on AMD GPUs. We will present the latest developments of the profiling tools along with examples and their relation to the hardware counters. Tutorial Tutorial 1B Continued Image Deployment and System Monitoring with HPCM Peter Guyan, Sue Miller, Andy Warner, and Raghul Vasudevan (HPE) Abstract Creating an image combining the components needed for an HPE cluster compute node on HPCM can be a daunting task. This tutorial will show how a recipe is curated, documented, created, and deployed for a GPU-enabled compute node. The image needed to perform user tasks requires the base OS, HPE Cray Supercomputing Programming Environment software, Slingshot host software, OS updates, GPU drivers, and a workload manager. Tutorial Tutorial 1A Continued Supercomputer Affinity on HPE Systems Edgar A. Leon and Jane E.
Herriman (Lawrence Livermore National Laboratory) Abstract When we consider the grand challenges addressed by supercomputing, we likely imagine large machines, like Lawrence Livermore National Laboratory's El Capitan or Oak Ridge National Laboratory's Frontier, and parallel applications that can leverage those machines. Yet, these two pillars of HPC, HPC hardware and HPC software, are not enough to ensure excellent application performance. When unaware of the topology of the underlying hardware, even well-designed software applications can fail to achieve full performance on top-notch systems. Affinity, how software maps to and leverages local hardware resources, forms a third pillar critical to HPC. Tutorial Tutorial 1C Continued Omnitools: Performance Analysis Tools for AMD GPUs Samuel Antao (AMD) Abstract The top entries of the TOP500 list feature systems enabled with AMD Instinct GPUs, including the world's and Europe's fastest supercomputers, Frontier and LUMI, respectively. As these systems are already in production, application teams require the ability to profile applications to ascertain performance. To enable this, AMD released two new profiling tools in 2022: Omnitrace and Omniperf. These tools are the result of close collaborations between AMD development teams and computational scientists aimed at unpicking performance bottlenecks in applications and identifying improvement strategies. Omnitrace targets end-to-end application performance, generating timelines that cover MPI, OpenMP, Kokkos, Python, etc. It enables the developer to identify relevant hardware counters to collect and generate information on performance-limiting kernels. Omniperf can then be used to seek further insight into these kernels through roofline analysis, memory chart analysis, and read-outs of many metrics including cache access, GPU utilization, and speed-of-light analysis.
In this tutorial, we will present advanced features of these tools, with live demonstrations, and provide numerous hands-on examples for attendees to identify and mitigate bottlenecks in scientific and machine learning applications running on AMD GPUs. We will present the latest developments of the profiling tools along with examples and their relation to the hardware counters. Tutorial Tutorial 2B Automated Inspection of C/C++/Fortran Code Using Codee for Performance Optimization on HPE/Cray Manuel Arenaz (Codee - Appentra Solutions) Abstract Codee is a suite of software development tools that helps improve the performance of C/C++/Fortran applications, providing a systematic, more predictable approach that leverages parallel programming best practices. Codee Static Code Analyzer provides a systematic, predictable approach to enforce C/C++/Fortran performance optimization best practices for the target environment: hardware, compiler, and operating system. It provides innovative Coding Assistant capabilities to enable semi-automatic source code rewriting, inserting OpenMP or OpenACC directives in your codes to run on CPUs or offload to accelerator devices such as GPUs, so that novice programmers can write code at the expert level. Codee provides integrations with IDEs and CI/CD frameworks to make it possible to Shift Left Performance. In this tutorial the participants will be introduced to Codee and to the first Open Catalog of Best Practices for Performance, using short demos and hands-on exercises with step-by-step guides for HPE/Cray systems like Perlmutter. The participants will learn to use the Codee tools, starting with simple, well-known kernels and quickly jumping to large HPC codes like WRF.
Tutorial Tutorial 2A Monitoring, Tuning, and Troubleshooting a CSM system Harold Longley and Jason Sollom (Hewlett Packard Enterprise) Abstract Once all the software has been installed, there are many tools available on a CSM-based HPE Cray EX system to monitor, test, and alert for proper health and operation of the system, tune the system software for a specific workload or a diverse job mixture, and troubleshoot any problems which arise. This tutorial will provide exposure to these tools and describe how to use them to tackle problems such as managed node boot failures, detecting and reacting to service unhealthiness related to storage or memory issues on management nodes, performance variability during job execution, finding nodes which underperform for CPU or network speed compared to other nodes, detecting faulty hardware on its way toward failure and after it fails, and other problems. There are some housekeeping activities which should be utilized to ensure continued system health. Don’t forget to run the tools to check and monitor system health. Differences in the toolset for the software stacks released with CSM 1.3, 1.4, and 1.5 systems will be highlighted while discussing these topics. Tutorial Tutorial 2C MGARD & ADIOS-2: A framework for extreme scale I/O with online data reduction Scott Klasky, Qian Gong, and Norbert Podhorszki (ORNL) Abstract This full-day tutorial provides a comprehensive introduction to the critical components of building a complex scientific workflow, encompassing storage I/O, data compression, in situ data processing, remote data access, and visualization. Through real-world applications and live demonstrations, attendees will learn how these tools enhance storage and I/O and enable the creation of streaming and file-coupled workflows with ease. The tutorial will also showcase utilizing these tools for remote data access over a wide-area network and conducting local analysis.
The day will conclude with an integrated workflow of simulation, data compression, analysis, and visualization, showcasing configurations for file-based or in-situ analysis workflows. Tutorial Tutorial 2B Continued Automated Inspection of C/C++/Fortran Code Using Codee for Performance Optimization on HPE/Cray Manuel Arenaz (Codee - Appentra Solutions) Abstract Codee is a suite of software development tools that helps improve the performance of C/C++/Fortran applications, providing a systematic, more predictable approach that leverages parallel programming best practices. Codee Static Code Analyzer provides a systematic, predictable approach to enforce C/C++/Fortran performance optimization best practices for the target environment: hardware, compiler, and operating system. It provides innovative Coding Assistant capabilities to enable semi-automatic source code rewriting, inserting OpenMP or OpenACC directives in your codes to run on CPUs or offload to accelerator devices such as GPUs, so that novice programmers can write code at the expert level. Codee provides integrations with IDEs and CI/CD frameworks to make it possible to Shift Left Performance. In this tutorial the participants will be introduced to Codee and to the first Open Catalog of Best Practices for Performance, using short demos and hands-on exercises with step-by-step guides for HPE/Cray systems like Perlmutter. The participants will learn to use the Codee tools, starting with simple, well-known kernels and quickly jumping to large HPC codes like WRF.
Tutorial Tutorial 2A Continued Monitoring, Tuning, and Troubleshooting a CSM system Harold Longley and Jason Sollom (Hewlett Packard Enterprise) Abstract Once all the software has been installed, there are many tools available on a CSM-based HPE Cray EX system to monitor, test, and alert for proper health and operation of the system, tune the system software for a specific workload or a diverse job mixture, and troubleshoot any problems which arise. This tutorial will provide exposure to these tools and describe how to use them to tackle problems such as managed node boot failures, detecting and reacting to service unhealthiness related to storage or memory issues on management nodes, performance variability during job execution, finding nodes which underperform for CPU or network speed compared to other nodes, detecting faulty hardware on its way toward failure and after it fails, and other problems. There are some housekeeping activities which should be utilized to ensure continued system health. Don’t forget to run the tools to check and monitor system health. Differences in the toolset for the software stacks released with CSM 1.3, 1.4, and 1.5 systems will be highlighted while discussing these topics. Tutorial Tutorial 2C Continued MGARD & ADIOS-2: A framework for extreme scale I/O with online data reduction Scott Klasky, Qian Gong, and Norbert Podhorszki (ORNL) Abstract This full-day tutorial provides a comprehensive introduction to the critical components of building a complex scientific workflow, encompassing storage I/O, data compression, in situ data processing, remote data access, and visualization. Through real-world applications and live demonstrations, attendees will learn how these tools enhance storage and I/O and enable the creation of streaming and file-coupled workflows with ease. The tutorial will also showcase utilizing these tools for remote data access over a wide-area network and conducting local analysis.
The day will conclude with an integrated workflow of simulation, data compression, analysis, and visualization, showcasing configurations for file-based or in-situ analysis workflows. Tutorial Lightning Tutorial 7B Data Science Beyond the Laptop: Handling Data of Any Size with Arkouda Ben McDonald and Michelle Strout (HPE) Abstract Attendees of this tutorial will learn how to perform data analysis at scale using Arkouda and gain an understanding of how distributed computing can benefit their work. Tutorial Lightning Tutorial 7C Exploring high performance object storage using DAOS Adrian Jackson (EPCC, The University of Edinburgh) Abstract Recently we have seen a change in the diversity of applications utilising high performance computing (HPC), from primarily computational simulation approaches to a more varied application mix including machine learning and data analytics. With this diversification in workloads, there has also been a diversification in I/O patterns: the movements in, and requirements on, data storage and access. Data storage technologies in HPC have long been optimised for large-scale bulk operations focussed on high-bandwidth/low-metadata operations. However, many applications now exhibit non-optimal I/O patterns for large-scale parallel filesystems, with large amounts of small I/O operations, non-contiguous data access, and increases in read as well as write I/O loads. XTreme (Approved NDA Members Only)
Birds of a Feather Programming Environments, Applications, and Documentation (PEAD) Break Coffee Break Break Coffee Break Break Coffee Break (sponsored by Altair) Break Coffee Break (sponsored by Linaro) Break Coffee Break (sponsored by SchedMD) Break Coffee Break (sponsored by Thinlinc) Break Coffee Break (sponsored by VAST) Break Coffee Break (sponsored by Pier Group) Break Coffee Break Break Coffee Break CUG Board CUG Board & Sponsors Lunch (closed) CUG sponsors (non-HPE) are invited to join the CUG Board for an informal lunch discussion. CUG Board HPE Executive Lunch (closed) HPE Executives and representatives are invited to join the CUG Board for an informal lunch discussion. CUG Board New CUG Board / Old CUG Board Lunch (closed) Newly elected board members are invited to an informal lunch with the prior CUG Board to discuss remaining activities for the week as well as future plans. Please bring food from the standard lunch buffet to the private area at the far end of the restaurant. CUG Program Committee Program Committee Dinner (invite only) Participants that helped with the reviews and program committee are invited to a private event.
6.30pm meet at the Beer Corner to be seated at 7pm in The Mark, which is within COMO The Treasury, https://statebuildings.com/functions/the-mark/. CUG Program Committee CUG Advisory Board Lunch Cabinet (closed) The CUG Advisory Board comprises chairs and liaisons from the special interest groups and program committee members. This session is typically led by the CUG Vice President to discuss the program, provide guidance to session chairs for the week, and receive feedback to improve processes and content for future events. CUG Program Committee CUG Advisory Board The CUG Advisory Board comprises chairs and liaisons from the special interest groups and program committee members. This session is typically led by the CUG Vice President to receive direct feedback from the conference and improve future events. Lunch Lunch (open to PEAD and XTreme participants) Lunch Lunch (sponsored by Nvidia) Lunch Lunch (sponsored by Codee) Lunch Lunch (sponsored by Nvidia) Lunch Lunch (sponsored by Codee) Networking/Social Event WHPC+ Australasia and AMD Diversity and Inclusion Breakfast Women in High Performance Computing Australasia (WHPC+) and AMD invite you to attend a community networking breakfast from 7:00 to 8:20am at the Westin Perth, in the Banksia Room. WHPC+ was created to promote diversity in the HPC industry by encouraging new people into the field and retaining those who are already here. This event is generously sponsored by AMD, who are very supportive of the Australasian chapter. This event is conveniently located in the beautiful Westin Perth hotel so that you can easily get to the first meeting session of the day at 8:30 am in Ballroom 2. Come along to meet and learn from others who are championing diversity and inclusion in HPC!
While this event is free to attend, numbers are capped so registration is required: https://pawsey.org.au/event/whpc-australasia-and-amd-diversity-and-inclusion-breakfast/ Presentation, Paper Technical Session 1B Chair: Jim Williams (Los Alamos National Laboratory) Presentation, Paper Technical Session 1A Chair: Lena M Lopatina (LANL) Presentation, Paper Technical Session 1C Chair: Chris Fuson (ORNL, Oak Ridge National Laboratory) Presentation, Paper Technical Session 2B Chair: Lena M Lopatina (LANL) Swordfish/Redfish and ClusterStor - Using Advanced Monitoring to Improve Insight into Complex I/O Workflows. Presentation, Paper Technical Session 2A Chair: Jim Rogers (Oak Ridge National Laboratory) Presentation, Paper Technical Session 2C Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Presentation, Paper Technical Session 3B Chair: Gabriel Hautreux (CINES) Presentation, Paper Technical Session 3A Chair: Bilel Hadri (KAUST Supercomputing Lab) Presentation, Paper Technical Session 3C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Presentation, Paper Technical Session 4B Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois, National Center for Supercomputing Applications) Presentation, Paper Technical Session 4A Chair: Lena M Lopatina (LANL) Presentation, Paper Technical Session 4C Chair: Gabriel Hautreux (CINES) From Chatbots to Interfaces: Diversifying the Application of Large Language Models for Enhanced Usability Presentation, Paper Technical Session 5B Chair: Adrian Jackson (EPCC, The University of Edinburgh) Presentation, Paper Technical Session 5A Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Presentation, Paper Technical Session 5C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Presentation, Paper Technical Session 6B Chair: Paul L. Peltz Jr. (Oak Ridge National Laboratory) Presentation, Paper Technical Session 6A Chair: Chris Fuson (ORNL, Oak Ridge National Laboratory) Presentation, Paper Technical Session 6C Chair: Bilel Hadri (KAUST Supercomputing Lab) Presentation, Paper Technical Session 7A Chair: John Holmen (Oak Ridge National Laboratory) Presentation, Paper Technical Session 8B Chair: Raj Gautam (ExxonMobil) Presentation, Paper Technical Session 8A Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Presentation, Paper Technical Session 8C Chair: Jim Rogers (Oak Ridge National Laboratory) Plenary Plenary: Welcome, Keynote Plenary Plenary: CUG site, HPE update Plenary Plenary: CUG Board Updates (Open), CUG Elections, and Best papers Plenary Plenary: Sponsors talks, HPE 1-100 Plenary Plenary: CUG 2024, Invited speakers Plenary CUG 2024 Closing