Birds of a Feather: Programming Environments, Applications, and Documentation (PEAD)

CPE in a Container
Kaylie Anderson (HPE), Ben Cumming (Swiss National Supercomputing Centre), Subil Abraham (Oak Ridge National Laboratory), and Panchapakesan Chitra Shyamshankar (Argonne National Laboratory)
Abstract: Containers provide a way to package applications along with their dependencies in a single unit. Containers can aid common HPC workflow processes such as building, testing, and execution. Many CUG centers are expanding their use of containers to provide additional workflow functionality to their user communities. However, integration with a center's HPC environment, including HPE's CPE, can be complex. In this BOF, representatives from multiple centers will discuss their centers' efforts and goals to integrate and utilize containers. Representatives from HPE will discuss the CPE integration options. Discussion is the focus of each PEAD BOF. To help promote discussion, each BOF will open with short presentations followed by open floor discussions.

Python Management
Chun Sun (HPE); Cristian Di Pietrantonio (Pawsey); Dave Carlson (Stony Brook University); and Juan Herrera (EPCC, The University of Edinburgh)
Abstract: Python continues to play a growing role in workflows used within the HPC community. Integrating Python environments with varying needs into HPE programming environments can be complex. For CUG centers, managing multiple Python environments and ensuring performance can be non-trivial. In this BOF, representatives from multiple centers will discuss their centers' Python environment use cases, management, and lessons learned. Representatives from HPE will discuss the CPE-provided Python. Discussion is the focus of each PEAD BOF. To help promote discussion, each BOF will open with short presentations followed by open floor discussions.

Birds of a Feather: Programming Environments, Applications, and Documentation (PEAD)

CPE Testing
Barbara Chapman (HPE), Cristian Di Pietrantonio (Pawsey), Brian Vanderwende (NCAR), Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Cedric Jourdain (CINES)
Abstract: HPC programming environments can be very complex, containing libraries, compilers, and tools that must work together to provide an effective resource to a center's user community. Over a resource's lifespan, upgrades can impact not only an individual component but also the ability of multiple components to work together successfully. Testing at various stages of a resource's lifespan is crucial to ensure the numerous hardware and software components are in working order. The goal of this BOF is to provide a venue for CUG member sites to share techniques, best practices, and lessons learned for resource testing. During the BOF, HPE representatives will discuss the environment, process, and tools used to test the CPE. Discussion is the focus of each PEAD BOF. To help promote discussion, each BOF will open with short presentations followed by open floor discussions.

Exploring the Challenges of the World-Class HPE Cray Programming Environment for Modern Software Development in Fortran
Manuel Arenaz (Codee)
Abstract: Modernizing the software development workflow for Fortran developers is crucial to enhance productivity, code quality, and maintainability. Despite dating back to the 1950s, Fortran remains widely used in Aerospace, Automotive, Climate & Weather, Defense, Energy & Utilities, High Performance Computing, Manufacturing, Oil and Gas, Scientific Research, and other industries. However, traditional Fortran development often lacks the modern tooling that exists for C/C++ and that empowers developers to implement modern DevOps best practices in their organizations.

Birds of a Feather (BoF 1D)

Security BoF
Aaron Scantlin (National Energy Research Scientific Computing Center)
Abstract: This is the first BoF at CUG for the new HPC Security Special Interest Group. Its mission is to provide a forum for exchanging and discussing strategies and ideas related to ensuring the secure configuration (of both hardware and software), operation (from the system administrator's perspective), and utilization (from the user's perspective) of Cray/HPE systems. Related topics in HPC security (e.g., awareness training for both users and admins) will also be covered by the SIG.

Birds of a Feather (BoF 2D)

Kubernetes on HPE Supercomputers BoF
Sadaf Alam (University of Bristol), Dino Conciatore (Swiss National Supercomputing Centre), and Jesse L. Treger (HPE)
Abstract: Cloud-native technologies, including containers, orchestration engines like Kubernetes, and virtualisation on the compute plane for HPE EX platforms, will be discussed in this BOF with the CUG community. Diverse use cases drive these somewhat non-native HPC requirements for user-defined or custom platforms, ranging from AI and ML workflows to Trusted Research Environments to user-managed CI/CD pipelines. Essentially, we will explore how community needs are evolving, not only for running containers on HPC but also for MLOps-style workflows, which require a level of user autonomy that is not feasible within a batch scheduling system alone. This interactive BOF will include brief presentations (in some cases summaries of topics covered in detailed sessions and submitted papers) on the progress made in prototyping a Kubernetes environment on the compute plane with the HPE Slingshot interconnect, use cases from sites running diverse supercomputing platforms including variants of CPU and GPU technologies from different vendors, and a panel-led discussion on future directions, priorities, and challenges.

Birds of a Feather (BoF 1B)

CUG SIG System Monitoring Working Group BoF
Massimo Benini (CSCS - ETH Zurich), Lena Lopatina (Los Alamos National Laboratory), and Jeff Hanson and Pete Guyan (HPE)
Abstract: The System Monitoring Working Group (SMWG) is a CUG SIG (Special Interest Group) established to promote collaboration between HPE Cray and its customers in enhancing system monitoring capabilities. The group includes representatives from numerous HPE Cray member sites and meets regularly to discuss and address system monitoring topics. We meet virtually throughout the year, and the annual CUG meeting is an opportunity for participants from normally incompatible time zones to swap ideas about their monitoring setups, needs, and findings. This year's key topics are new monitoring capabilities provided by HPE, member sites' creation of a standardized set of dashboards for monitoring HPC centers, new approaches to application code profiling, energy profiling, and data-center metadata management. This BoF is a unique opportunity to engage collaboratively with HPE Cray and with other HPE Cray sites to shape future enhancements of observability and operational data analytics (ODA). This forum also provides a venue to share best practices, showcase common problems and their solutions, and give HPE a platform to present the current status of its monitoring stack directly to relevant HPC staff.

Birds of a Feather (BoF 1C)

Sharing is Caring: Tackling Node-Sharing Challenges at CUG Sites
Tim Robinson (Swiss National Supercomputing Centre, ETH Zurich); Tim Wickberg (SchedMD LLC); Pengfei Ding (Lawrence Berkeley National Laboratory); and Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre)
Abstract: HPC centres have traditionally allocated computational resources in entire nodes, a practice rooted in the architectural and operational simplicity of earlier systems and the belief that "HPC" meant the ability to scale to hundreds or thousands of nodes. However, as technology has advanced, nodes have become increasingly powerful and expensive, incorporating hundreds of CPU cores, multiple GPUs, and large amounts of high-bandwidth memory. This evolution makes whole-node allocation inefficient for many modern workloads: high-throughput computing, data-intensive workflows, and interactive computing, to name a few. Consequently, subdividing nodes to allocate resources at finer granularity (by socket, core, GPU, or memory) has emerged as a necessary alternative.

Birds of a Feather (BoF 1A)

CSM updates, iSCSI boot content projection, and other CSM topics
Harold Longley, Dennis Walker, Ravi Bissa, Jason Coverston, Siri Vias Khalsa, Ashalatha A. M, and Ravikanth Nalla (HPE)
Abstract: This BOF offers an overview of recent changes in the Cray System Management (CSM) software stack, a technical dive into iSCSI-based boot content projection, a method for workload-manager-controlled reboots of compute nodes, and an open discussion amongst attendees.

Birds of a Feather (BoF 3D)

Rethinking Interactive HPC Resource Access: Enhancing Security and Flexibility
Maxime Martinasso (Swiss National Supercomputing Centre), Sadaf Alam (University of Bristol), and Isa Wazirzada and Larry Kaplan (HPE)
Abstract: The traditional approach to accessing HPC resources relies on login nodes and SSH connections authenticated through POSIX Identity and Access Management (IAM). While this method has served the community well, it presents significant challenges in today's landscape of cybersecurity threats and evolving user needs, such as maintaining a secure shared login node or managing the identity life-cycle. This Birds of a Feather (BoF) session aims to explore innovative approaches to modernizing CLI-based interactive HPC resource access, addressing the dual goals of enhancing security and increasing the flexibility of service customization for users. Emerging practices, such as SSH signed keys, offer a promising alternative to traditional login names and passwords, mitigating the risks associated with credential theft by enabling more advanced authentication flows such as multi-factor authentication. Virtualized login nodes, implemented as containerized environments, could allow user-defined environments with, for instance, advanced debugging capabilities, AI stacks, or tighter IDE integration, while improving isolation, user scalability, and individual session management. Additionally, the generation of temporary POSIX accounts from OpenID Connect (OIDC) tokens could seamlessly integrate modern federated and non-local identity providers, reducing administrative overhead and attack surfaces. The session will showcase existing solutions, discuss opportunities for innovation, challenge the classic HPC IAM and login-node workflow, and highlight the potential benefits of these new approaches. Attendees will hear from practitioners actively exploring these paradigms, sparking discussions on how the community can collectively advance this shift and benefit from a common solution. We invite participants to contribute their ideas, share experiences, and help shape a future where interactive HPC resource access is not only more secure but also more adaptable to the diverse and continuously evolving needs of its users.

Birds of a Feather (BoF 2B)

Managing System Reliability: From system acceptance through production
Pete Guyan and Sue Miller (HPE)
Abstract: Managing system reliability from acceptance through production can be a messy business. This BoF will start with a synopsis of what HPE plans to do in the short, mid, and long term to build a cohesive strategy for reproducibility. We will discuss the current tools in HPE Performance Cluster Manager (HPCM), how we weave these tools into the strategy, and how they interact with monitoring and reporting tools.

Birds of a Feather (BoF 2C)

HPE Slingshot Birds-of-a-Feather
Jesse Treger (HPE)
Abstract: This birds-of-a-feather session will provide an opportunity for users to ask questions and share advice on managing and using HPE Slingshot systems, as well as to hear about, and provide input into, HPE Slingshot's software roadmap. The HPE Slingshot software scope covers capabilities both for the administrators who operate and manage the system and the fabric, and for HPC and AI application writers/users of the HPE Slingshot NIC's Libfabric provider. Users will be encouraged to share desired use cases, learnings, and best-known methods.

Birds of a Feather (BoF 2A)

CPE Futures
Barbara Chapman (HPE, Stony Brook University) and Kaylie Anderson and Chun Sun (HPE)
Abstract: The HPE Cray Programming Environment (CPE) provides a suite of integrated programming tools for application development on a diverse range of HPC systems delivered by HPE, including those with integrated node architectures and those whose nodes are configured with AMD or NVIDIA GPUs. Its compilers, math libraries, communication libraries, debuggers, and performance tools enable the creation, enhancement, and optimization of application codes written using mainstream HPC programming languages and the most widely used parallel programming models.

Paper, Presentation, Birds of a Feather
Technical Session 7A: AI/ML GPU Workloads
Session Chair: Raj Gautam (ExxonMobil)

Porting Radio Astronomy Correlation to Setonix, an HPE Cray EX System Powered by AMD GPUs
Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre, Curtin Institute for Radio Astronomy); Marcin Sokolowski (Curtin Institute of Radio Astronomy); Christopher Harris (Pawsey Supercomputing Research Centre); and Daniel Price and Randal Wayth (SKAO)
Abstract: In low-frequency radio astronomy, correlation of signals coming from hundreds of radio antennas is an early and fundamental step in creating science-ready data products such as images of the sky at radio wavelengths. Because of the high volume of data to process and the rate at which they are produced, correlation is usually performed in real time by dedicated hardware, an FPGA or GPU cluster, installed near the telescope. However, there are science cases in which an astronomer would like to correlate data later with customised settings, such as time and frequency averaging of signals. Setonix, Pawsey Supercomputing Centre's HPE Cray EX supercomputer based on AMD CPUs and GPUs, provides radio astronomers with enough computational power for such processing, but the only established GPU correlator works only on NVIDIA GPUs and proved hard to port. In this paper we discuss the process of providing Australian astronomers with an implementation of the correlation algorithm that harnesses the computational power of Setonix.

Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers
Bishwo Dahal (University of Louisiana Monroe, Oak Ridge National Laboratory) and Elijah Maccarthy and Subil Abraham (Oak Ridge National Laboratory)
Abstract: Containers are transforming scientific computing by simplifying the packaging and distribution of applications. This enables researchers to create and deploy their applications in isolated environments with all necessary dependencies, enhancing portability and deployment flexibility. These advantages make containers especially suitable for High Performance Computing (HPC) facilities like the Oak Ridge Leadership Computing Facility (OLCF), where complex scientific applications are developed and deployed. In this work, we investigate the performance of containerized machine learning (ML) applications in comparison to bare-metal execution on the Frontier exascale supercomputer. Specifically, we aim to determine whether ML models, when trained and tested within containers on Frontier using Apptainer, exhibit performance similar to that of bare-metal implementations. To achieve this, we use containers to package and run Convolutional Neural Network (CNN)-based ML applications on the OLCF Frontier and Odo supercomputers and assess their performance against bare-metal runs. After conducting scalability tests across up to 30 nodes with 1,680 AMD EPYC CPU cores and 240 GPUs, we find that the performance of the containerized ML applications is on par with that of bare-metal runs. We apply the lessons learned from our containerized ML models to containerizing and evaluating the performance of LLMs such as AstroLLaMA and CodeLLaMA on Frontier.

BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI
Tulsi Mishra, Dean Roe, and Larry Kaplan (HPE)
Abstract: As the convergence of HPC and AI reshapes computational workflows, the complexity of managing hybrid environments has become a significant challenge for organizations. HPE Cray Supercomputing User Services Software (USS) offers a transformative approach to simplify, scale, and optimize workflows across HPC and AI landscapes. In this session, we will explore how USS aims to bridge the gap between traditional HPC workloads and AI-driven innovations, providing a unified platform for containerized environments, hybrid deployment orchestration, and energy-efficient operations.
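As a loose illustration of the measurement idea in the containerized ML study above, the sketch below times the same workload on bare metal and inside an Apptainer container. The training command and container image path are placeholders invented for the example, not the paper's actual setup.

```python
# Minimal bare-metal vs. container timing harness (illustrative only).
import subprocess
import time

def timed(cmd: list[str]) -> float:
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

TRAIN = ["python3", "train_cnn.py", "--epochs", "1"]        # placeholder workload
bare = timed(TRAIN)                                         # bare-metal run
contained = timed(["apptainer", "exec", "ml.sif"] + TRAIN)  # same run in a container
print(f"bare metal: {bare:.1f}s, container: {contained:.1f}s, "
      f"ratio: {contained / bare:.2f}")
```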
Breaks and Lunches

- Coffee breaks, variously sponsored by SchedMD, Pier Group, Linaro, VAST, and Altair
- CUG Board / New Sites lunch (closed)
- PEAD & XTreme SIG participants lunch
- CUG Advisory Board Cabinet lunch (closed)
- CUG Board & Sponsors lunch (closed)
- HPE Executive lunch (closed)
- Lunches sponsored by Codee and by NVIDIA

Networking/Social Events

Welcome Reception

Program Committee Dinner (invite only)

HPE Networking Event
HPE will host its annual CUG community networking reception from 6:00 to 8:00 pm ET at the Lokal Eatery & Bar. Lokal is located at 2 2nd St, Jersey City, NJ 07302, along Jersey City's waterfront, allowing CUG guests to enjoy expansive views of the Manhattan skyline. Co-presented by AMD; all registered CUG attendees and their guests are invited to a reception with light hors d'oeuvres and drinks. The first bus will leave at 5:55 pm; Lokal is about a 10-minute walk from the CUG hotel. The last bus will depart Lokal at 8:00 pm.

CUG AMD Night Out
CUG Night Out at Hudson House, 2 Chapel Ave, Jersey City, NJ. We invite all registered attendees and guests with a paid CUG Night Out ticket to join us for an unforgettable evening at Hudson House. Situated at the end of Port Liberte in Jersey City, NJ, the venue is an arm's length from the Hudson River and boasts a panoramic view of the Statue of Liberty, the Brooklyn, Manhattan, and Verrazzano bridges, and of course the NYC skyline. Coaches will depart from outside the Westin Jersey City Hotel at 18:10 to arrive at Hudson House for a drinks reception before seating for dinner at approximately 19:15. If you are making your own way to the venue, please use the full address, as Google Maps otherwise takes you to a different location! Hudson House, 2 Chapel Ave, is approximately a 15-20 minute drive. Our first bus will return to the hotel at approximately 21:00.

Paper, Presentation
Technical Session 1B: Workload Manager
Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University)

Slinky: The Missing Link Between Slurm and Kubernetes
Tim Wickberg (SchedMD LLC)
Abstract: Slinky is SchedMD's collection of projects to integrate the Slurm Workload Manager with the Kubernetes orchestrator.

How Best to Leverage Cloud for (Big) HPC Sites
Bill Nitzberg and Ian Littlewood (Altair Engineering, Inc.)
Abstract: Cloud (finally) works for HPC, but the devil is still in the details. HPC in the cloud has transitioned from "proof-of-concept engagements" to "hmm, but what about security" to "maybe, but what about the data" to "OK, but only if we carefully manage expenses". Today, many big sites have voted with their dollars that on-premises HPC is here to stay, adding cloud judiciously, with an eye towards resilience, and only where the end-to-end ROI makes sense.

Divide and Rule: Automated Workload Distribution for Efficient User Support Services
Luca Marsella (Swiss National Supercomputing Centre)
Abstract: User support services for High-Performance Computing (HPC) systems help users conduct simulations and optimize resources. The complexity of HPC platforms has increased, making user support more challenging. Site Reliability Engineering (SRE) best practices suggest that technical staff should focus on project work and automation to reduce repetitive tasks. Artificial intelligence and machine learning can help create effective knowledge bases from internal reports and user tickets, easing the burden on support staff.

Paper, Presentation
Technical Session 1C: Software Deployment
Session Chair: Chris Fuson (Oak Ridge National Laboratory)

Deploying and Tracking Software with NCCS Software Provisioning
Asa Rentschler, Nicholas Hagerty, Elijah Maccarthy, and Edwin F. Posada Correa (Oak Ridge National Laboratory)
Abstract: The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory has a long history of deploying ground-breaking leadership-class supercomputers for the U.S. Department of Energy. The latest in this line of supercomputers is Frontier, the first supercomputer to break the exascale barrier (10^18 floating-point operations per second) on the TOP500 list. Frontier serves a wide array of scientific domains, from traditional simulation-based workloads to newer AI and machine learning workloads. To best serve the NCCS user community, NCCS uses Spack to deploy a comprehensive stack of scientific software packages, providing straightforward access to these packages through Lmod environment modules. Maintaining a large software stack while also incorporating multiple new compiler releases each year is a very time-consuming task. Additionally, it is not straightforward to provide a software stack alongside existing vendor-provided software such as the HPE/Cray Programming Environment (CPE), and the existing CPE, Spack, and Lmod integration does not allow multiple versions of GPU libraries, such as AMD's ROCm, to be used. To address these challenges and shortcomings, NCCS has developed the NCCS Software Provisioning tool (NSP), a tool for deploying and monitoring software stacks on HPC systems. NSP allows NCCS to quickly and effectively provision software stacks from the ground up using template-driven recipes and configuration files. NSP is successfully deployed on Frontier and several other NCCS clusters, enabling the NCCS software team to quickly deploy software stacks for newly released compilers, expand current software offerings, better support GPU-based software, and monitor Lmod module usage to identify unused software packages that can be removed from the stack. In this work, we discuss the shortcomings of the previous CPE, Spack, and Lmod usage at NCCS, provide further details on the implementation and structure of NSP, and then discuss the benefits that NSP provides.

Modern Software Deployment on a Multi-Tenant Cray EX System
Ben Cumming, Andreas Fink, Simon Pintarelli, and John Biddiscombe (CSCS)
Abstract: User-facing software (libraries, tools, applications, and programming environments tuned for the node and network architecture) is a key part of HPC centers' service offering. Teams that maintain and support this software face challenges: providing a stable software platform for users with long-running projects while also providing the latest versions of software for developers; giving full responsibility for building, modifying, and deploying the whole software stack to staff who do not have root access; and achieving reproducible deployment based on GitOps practices. CSCS addresses these challenges on Alps by using small independent software environments called uenv, which deploy from text-file recipes without requiring installation of the Cray Programming Environment. This paper discusses installing communication libraries from HPE and NVIDIA with Slingshot support; the CI/CD pipeline that builds uenv and deploys them to a container registry; and the command-line tools and SLURM plugin that connect users with the software environments. We demonstrate diverse use cases such as JupyterHub, summarize the user and support team experience, and document how to build and deploy CPE containers.

Employing a Software-Driven Approach to Scalable HPC System Management
Aaron Barlow (Oak Ridge National Laboratory)
Abstract: Managing Frontier and other HPE Cray and Apollo clusters at Oak Ridge National Laboratory involves thousands of users, projects, and security policies across multiple HPC systems. With diverse research needs, varying security enclaves, and massive resource allocations, manual processes don't scale, and the administrative burden increases as HPE systems grow. To manage HPC systems at this scale, we developed RATS (Resource Allocation Tracking System), a software platform that centralizes operations.
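The NSP paper earlier in this session describes provisioning software stacks "from the ground up using template-driven recipes and configuration files." As a rough sketch of that general idea (the recipe fields and the rendered spack.yaml layout below are invented for illustration and are not NSP's actual format), a small recipe can be expanded into a Spack environment file:

```python
# Illustrative template-driven rendering of a Spack environment (not NSP's format).
from string import Template

ENV_TEMPLATE = Template("""\
spack:
  specs:
$spec_lines
  view: $view_path
""")

def render_env(recipe: dict) -> str:
    """Expand a recipe dict into a spack.yaml-style document."""
    spec_lines = "\n".join(
        f"  - {pkg}@{ver} %{recipe['compiler']}"   # one spec per package
        for pkg, ver in recipe["packages"].items()
    )
    return ENV_TEMPLATE.substitute(spec_lines=spec_lines, view_path=recipe["view"])

recipe = {
    "compiler": "gcc@13.2.0",                      # one recipe per compiler release
    "view": "/sw/views/gcc-13.2.0",                # hypothetical install view
    "packages": {"hdf5": "1.14.3", "fftw": "3.3.10"},
}
print(render_env(recipe))
```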
Paper, Presentation
Technical Session 1A: Multitenancy
Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh)

Infrastructure as a Service with Strong Tenant Separation on a Supercomputer
Riccardo Di Maria, Chris Gamboni, Manuel Sopena Ballesteros, Hussein Harake, Mark Klein, Marco Passerini, Miguel Gila, Maxime Martinasso, and Thomas C. Schulthess (Swiss National Supercomputing Centre) and Alun Ashton, Derek Feichtinger, Marc Caubet, Elsa Germann, Hans-Nikolai Viessmann, Achim Gsell, and Krisztian Pozsa (Paul Scherrer Institute)
Abstract: This paper explores an innovative implementation of Infrastructure-as-a-Service (IaaS) on an HPE Cray Shasta EX supercomputer. In cloud environments, IaaS offers scalable, on-demand access to virtualized resources. However, applying IaaS principles to high-performance computing (HPC) systems without relying on virtualization technologies poses some challenges, since such systems typically have a tightly coupled software stack. We address these challenges in a co-design partnership between an HPC provider, CSCS, and an end-user institution, PSI, by developing a suite of technologies for the HPE Cray Shasta EX system architecture that supports resource isolation and granular control. This approach not only brings the IaaS model to supercomputing environments but also enables dynamic resource management. Our contributions include a detailed exploration of the technological advancements necessary for integrating IaaS into HPC, together with the lessons learned from our collaborative efforts. By extending IaaS capabilities to supercomputers, we aim to provide scientific institutions with unprecedented flexibility and control over their computational resources.

Dynamic Network Perimeterization: Isolating Tenant Workloads With VLANs, VNIs, & ACLs
Nikhil Mukundan, Dennis Walker, Stephen Han, Atif Ali, Siri Vias Khalsa, Amit Jain, Vishal Bhatia, and Vinay Karanth (HPE)
Abstract: There is a growing trend in the high-performance computing (HPC) community where separate user groups with varying security clearances (tenants) share HPC infrastructure. In such cases, tenants require robust security boundaries to ensure data privacy, results integrity, and intellectual property secrecy. Additionally, sensitive transactions within a tenant may need to be further insulated from lower-clearance workloads. Join us as we show how product-agnostic, version-controlled configuration data can be used to dynamically isolate infrastructure resources supporting workloads, including compute node groups, data at rest (storage), and data in motion within high-speed and management networks. On the high-speed network (HSN), we'll examine how switch-port VLAN filters and VNIs (traffic labels) isolate TCP/IP and RDMA traffic per tenant. On the management network, we'll demonstrate how to segment compute node groups via switch ACLs, VLANs, and iptables. Complete dynamic network segmentation will be applied at various levels of the infrastructure: chassis, nodes, and within the OS. Finally, we'll review architecture features in Slingshot, CSM, and other products that enable elastic tenant reallocation. We'll compare and contrast the number of configuration options and the security posture when applying segmentation at the switch versus at the node.
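To make the per-tenant isolation idea above concrete, here is a hedged sketch of deriving node-level iptables rules from version-controlled configuration data. The tenant names, VLAN IDs, and subnets are invented; a real deployment would also program switch ACLs and VNI assignments through the fabric manager rather than relying on node firewalls alone.

```python
# Illustrative generation of per-tenant node firewall rules from config data.
TENANTS = {
    "tenant-a": {"vlan": 101, "subnet": "10.101.0.0/16"},   # hypothetical layout
    "tenant-b": {"vlan": 102, "subnet": "10.102.0.0/16"},
}

def isolation_rules(tenant: str) -> list[str]:
    """Allow intra-tenant traffic; drop traffic from every other tenant's subnet."""
    rules = [f"iptables -A INPUT -s {TENANTS[tenant]['subnet']} -j ACCEPT"]
    rules += [
        f"iptables -A INPUT -s {cfg['subnet']} -j DROP"
        for name, cfg in TENANTS.items()
        if name != tenant
    ]
    return rules

for rule in isolation_rules("tenant-a"):
    print(rule)
```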
CSCS' journey towards complete platform automation in a multi-tenant environment
Miguel Gila, Ivano Bonesana, and Alejandro Dabin (Swiss National Supercomputing Centre, CSCS)
Abstract: The Swiss National Supercomputing Centre operates a complex ecosystem of high-performance computing resources. With a focus on scalability and efficiency, we have implemented a multi-tenancy approach to serve diverse scientific communities. This layered architecture encompasses infrastructure, platform, and application layers, each with unique automation challenges.

Paper, Presentation
Technical Session 2B: Security & Configuration Management
Session Chair: Jim Williams (Los Alamos National Laboratory)

Pragmatic Security Audits: Fortifying HPC Environments at a Consumable Pace
Alden Stradling (Los Alamos National Laboratory) and Monica Dessouky and Dennis Walker (HPE)
Abstract: Do you know your security posture? Are you overwhelmed by the latest reports? Don't let a sea of security findings paralyze your progress. By establishing a frequent, recurring cadence for audits and remediation, organizations ensure continuous protection against emerging threats. This paper presents the practical, scalable approach to HPC security implemented at a recent customer to secure its many new environments.

Experimenting with Security Compliance Checking using ReFrame
Victor Holanda Rusu, Matteo Basso, Chris Gamboni, Fabio Zambrino, and Massimo Benini (Swiss National Supercomputing Centre)
Abstract: Security is a critical aspect of High-Performance Computing (HPC) systems, where the implementation of security compliance checks and hardened configurations is essential to safeguard resources and data. Continuous security checking is fundamental, especially for detecting indications of compromise, but its implementation must balance effectiveness and efficiency to avoid unnecessary strain. Open-source and freely available security-focused tools, such as OSCAP, are less known and less accessible to engineers from other disciplines, who may not be familiar with their functionality or utility. This creates a barrier to collaborative efforts to improve system-wide configurations and promote shift-left security in HPC centers. We leveraged ReFrame to perform robust security compliance testing to address these limitations. ReFrame enables the creation of customizable tests that evaluate system configurations and generic exploits, and it can execute tests in parallel, optimizing testing workflows without significant performance penalties. We will present the latest developments at CSCS, showcasing how we plan to use ReFrame to enhance security compliance testing in HPC environments using three different standards: DoD STIG, ANSSI BP-028 (enhanced), and CSCS's own. We aim to create a community to develop, maintain, and benefit from a shared set of security checks tailored to HPC systems, based on customer-specific, industry-specific, or government-mandated requirements.

From Weeks to Hours: Harnessing Configuration Management and Deployment Pipelines
Dennis Walker and Siri Vias Khalsa (HPE) and Alex Lovell-Troy (Los Alamos National Laboratory)
Abstract: Ensuring peak reliability, current functionality, and up-to-date security requires cultivating the capability to continuously update and integrate a complex array of dependencies spanning hardware, firmware, system management software, network configurations, API services, OS distros, job schedulers, AI libraries, and analytics tools. This paper presents a simple yet contemporary DevOps methodology designed to automate, validate, and replicate changes effectively to one or many production environments.

Rev Up Compute Node Reboots: 2x to 5x Faster
Dennis Walker (HPE) and Paul Selwood (Met Office, UK / NERC CMS)
Abstract: Join us as we race to the bottom, showcasing the innovations developed to speed up reboot times by 300% for the UK Met Office's latest multi-zoned, CSM-based HPE Cray EX systems. With increasing complexity in software and site-specific customizations, node reboot times had ballooned to over 35 minutes by November 2023, far exceeding operational requirements. In response, HPE developed automation to download logs, parse metrics, and graph boot stages by duration to better understand what was happening. Guided by data, the following changes were implemented, sorted by impact:

- CFS/Ansible run-time plays were moved into local systemd execution, running earlier in the boot cycle and without blocking other nodes.
- Software installation and select configuration activity were moved into the image build, streamlining deployment.
- CSM boot settings were tuned for optimal performance.
- Node Health Checks (NHC) were moved into systemd, running preemptively before the job scheduler agent to ensure nodes are consistently job-ready as early as possible.

Paper, Presentation
Technical Session 2C: Climate Applications
Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre)

Bit-reproducibility in UK Met Office Weather and Climate Applications
David Acreman (HPE)
Abstract: Weather and climate applications solve partial differential equations which are highly sensitive to small perturbations in model variables. Changes in even the least significant bit of a variable can have an observable impact on scientific results (the "butterfly effect"). The nature of floating-point arithmetic means that subtle changes to code, or to the order of a summation, can change results at the bit level due to different round-off errors. Consequently, achieving bit-reproducible results is challenging.

Enabling km-scale coupled climate simulations with ICON on AMD GPUs
Jussi Enkovaara (CSC - IT Center for Science Ltd.)
Abstract: The Icosahedral Nonhydrostatic (ICON) weather and climate model is a modelling framework for numerical weather prediction and climate simulations. ICON is implemented mostly in Fortran 2008, with the GPU version based mainly on OpenACC. ICON is used on a large variety of hardware, ranging from classical CPU clusters to vector architectures and various GPU systems. In coupled simulations ICON can utilize heterogeneous architectures, i.e., the ocean runs on CPUs while the atmosphere runs concurrently on GPUs.

MARBLChapel: Fortran-Chapel Interoperability in an Ocean Simulation
Brandon Neth and Ben Harshbarger (HPE); Scott Bachman ([C]Worthy); and Michelle Mills Strout (HPE, University of Arizona)
Abstract: As the climate crisis continues to have widespread effects on the biosphere, scientists increasingly turn to computer modeling to understand the impacts of different interventions. Modeling one such intervention, ocean carbon dioxide removal, requires incorporating multiple sources of interaction (air-sea gas exchange, biogeochemical processes, etc.) and high spatial and temporal resolutions. To address the need for scalable, high-resolution simulations, scientists at [C]Worthy have written the core of an ocean modeling code in Chapel, a parallel programming language for writing high-performance, distributed programs. Although Chapel has enabled rapid development, an important library for modeling biogeochemical processes, MARBL, is written in Fortran. MARBL is a robust, stand-alone library used in several state-of-the-art models, including MOM6, MPAS, POP, and ROMS. Rather than rewriting the MARBL library in Chapel, we use Chapel's and Fortran's C interoperability to integrate MARBL into the distributed Chapel simulation. This allows us to reuse reliable scientific code while using Chapel to orchestrate parallelism. In this talk, we demonstrate how the distributed Chapel simulation sets up the data structures needed by MARBL, calls out to the Fortran library, and brings results back to update the simulation. We show performance results on Perlmutter, an HPE Cray EX system.

Redefining Weather Forecasting Systems: The Transition to ICON and Alps
Mauro Bianco, Matthias Kraushaar, and Roberto Aielli (ETH Zurich); Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss); and Thomas Schulthess (ETH Zurich)
Abstract: The transition of MeteoSwiss operational weather forecasting from COSMO to the ICON model represents a major modernization in meteorological services, integrating software-defined infrastructure to improve flexibility, scalability, and resilience. The migration also involved significant hardware upgrades, from fixed systems with K80 GPUs to flexible architectures using V100 and later A100 GPUs, supported by the Alps infrastructure developed by CSCS, which is based on HPE's Cray EX product line [maximecug25].

Paper, Presentation
Technical Session 2A: Slingshot
Session Chair: Brett Bode (National Center for Supercomputing Applications, University of Illinois)

The HPE Slingshot 400 Expedition
Houfar Azgomi, Duncan Roweth, Gregory Faanes, and Jesse Treger (HPE)
Abstract: HPE Slingshot 400 is a high-performance interconnect for classic and AI supercomputing clusters. As the successor to HPE Slingshot, it comprises a PCIe Gen5 NIC (Cassini-2) and a 64-port switch (Rosetta-2), linking over standard 400 Gbps Ethernet physical interfaces and enabling dragonfly and fat-tree networks with up to 260,000 endpoints. HPE Slingshot is currently deployed in 7 of the 10 largest supercomputers worldwide and is the interconnect for the top three: El Capitan, Frontier, and Aurora. With such success, the Slingshot transport protocol has become a cornerstone of the HPC-optimized Ethernet networking standardization efforts led by the Ultra Ethernet Consortium (UEC). Building on its foundational adaptive routing and congestion management feature set, the HPE Slingshot 400 interconnect doubles its predecessor's bandwidth and adds significant enhancements: exact-match forwarding, increasing route visibility across the cluster; dedicated ACL tables for security and cloud isolation; feature-hardening flexibility through P4 programmability; and improved quality of service with 50% more traffic classes. It is supported across HPE's portfolio of rack- and chassis-based supercomputing platforms, including HPE Cray XD, HPE Cray EX, and the latest HPE Cray GX. This paper presents the key features and some early performance results of the Rosetta-2 and Cassini-2 devices.
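The "up to 260,000 endpoints" figure is consistent with the textbook balanced dragonfly construction for a radix-64 switch. The sketch below assumes the standard parameterization (p = k/4 endpoints per switch, a = k/2 switches per group, h = k/4 global links per switch); Slingshot 400's actual group sizing may differ, so treat this as a back-of-envelope check rather than a statement of the product's topology.

```python
# Balanced dragonfly capacity for a radix-64 switch (back-of-envelope check).
k = 64              # switch radix: Rosetta-2 is a 64-port switch
p = k // 4          # endpoints attached per switch          -> 16
a = k // 2          # switches per group                     -> 32
h = k // 4          # global (inter-group) links per switch  -> 16
g = a * h + 1       # max groups with all-to-all global links -> 513
endpoints = p * a * g
print(f"max groups: {g}, max endpoints: {endpoints}")  # 513 groups, 262,656 endpoints
```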
Introduction To HPE Slingshot NIC Libfabric Environment Variables
Jesse Treger and Ian Ziemba (HPE)
Abstract: Libfabric, a high-performance fabric software library, provides a rich API for applications to communicate efficiently over various networking technologies. The Libfabric provider for a specific networking interface type translates the API-requested communications into protocols that strive to make optimal use of the hardware. The HPE Slingshot NIC in particular has extensive hardware offloads to improve performance and reduce memory overhead. While Libfabric aims for seamless integration, achieving optimal performance may require users to configure environment variables to fine-tune the software for specific workloads and hardware setups. This presentation will demystify the role of environment variables in Libfabric, explaining the trade-offs and why they matter for performance and stability under various conditions. We will begin with an overview of how the HPE Slingshot NIC uses Libfabric to optimize performance under various messaging requirements. Next, we will explore the most common environment variables users may need to adjust from their default values, using examples learned on different applications, MPI middleware, processors, and job scales. Finally, we will briefly touch on some best-known methods to troubleshoot application failures that can be addressed with environment settings.

Math in Your Network: Slingshot Hardware Accelerated Reductions
Forest Godfrey and Duncan Roweth (HPE)
Abstract: In high performance computing applications, the use of collective operations such as reductions and barriers is commonplace. The performance of collectives is critical to overall performance in many applications, especially those where collectives become an increasingly large part of the runtime as jobs scale. Collective operations are typically performed in software, requiring packets carrying contributions to the collective to go all the way to endpoint memory and be acted upon, only for the result to have to transit back out to the network. This occurs at each level of a collective tree. By performing collective operations inside the network switch hardware itself, the round trips to memory are removed and significant improvements in latency can be achieved. The Slingshot Rosetta switch fabric supports hardware acceleration of many collective operations, such as barriers and 64-bit IEEE floating-point reductions. Upcoming Slingshot software will enable this functionality and present it to the end user transparently through the industry-standard libfabric network communication library. This presentation will cover the details of this upcoming feature and how it can be used to accelerate applications. The implementation inside libfabric, the interaction with the job scheduler and fabric manager, and initial benchmarking results will be discussed.

Slingshot Host Software Ethernet Tuning
Ravi Bissa, Ian Ziemba, Duncan Roweth, and Forest Godfrey (HPE)
Abstract: High-performing Ethernet is the cornerstone of exascale supercomputers, enabling seamless communication, minimizing latency, and supporting massive scalability. Without a robust Ethernet infrastructure, these systems cannot achieve their goal of solving the world's most complex computational problems efficiently and within reasonable time and energy budgets.
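As a hedged illustration of the kind of tuning the first talk in this session covers, the sketch below launches a job with a few Slingshot/CXI libfabric environment variables set. These variable names appear in the libfabric CXI provider's documentation, but the values and the launch command are placeholders; appropriate settings depend on the application and system.

```python
# Launching a job with illustrative libfabric CXI tuning variables set.
import os
import subprocess

env = os.environ.copy()
env.update({
    # Fall back to software tag matching if hardware match entries are exhausted.
    "FI_CXI_RX_MATCH_MODE": "hybrid",
    # Message size (bytes) at which the provider switches to a rendezvous protocol.
    "FI_CXI_RDZV_THRESHOLD": "16384",
    # Core libfabric logging verbosity, useful when troubleshooting failures.
    "FI_LOG_LEVEL": "warn",
})
# Placeholder launch command; a real job would use the site's launcher and binary.
subprocess.run(["srun", "-n", "128", "./my_mpi_app"], env=env, check=True)
```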
Plenary, Paper
Plenary Session: CUG Organizational Update and Best Paper Presentation

Evolving HPC services to enable ML workloads on HPE Cray EX
Stefano Schuppli, Fawzi Mohamed, Henrique Mendonca, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Torsten Hoefler, and Thomas Schulthess (Swiss National Supercomputing Centre)
Abstract: The Alps research infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the Swiss AI community's early-access phase of Alps (2023) and propose several technological enhancements. These include: a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service-plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. The paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructures like ours.

Alps, a versatile research infrastructure
Maxime Martinasso (Swiss National Supercomputing Centre, ETH Zurich) and Mark Klein and Thomas Schulthess (Swiss National Supercomputing Centre)
Abstract: The Swiss National Supercomputing Centre (CSCS) has a long-standing tradition of delivering top-tier high-performance computing systems, exemplified by the Piz Daint supercomputer. However, the increasing diversity of scientific needs has exposed limitations in traditional vertically integrated HPC architectures, which often lack flexibility and composability. To address these challenges, CSCS developed Alps, a next-generation HPC infrastructure designed around a transformative principle: resources operate as independent endpoints within a high-speed network. This architecture enables the creation of independent tenant-specific and platform-specific services tailored to diverse scientific requirements.

Paper, Presentation
Technical Session 3B: HPCM
Session Chair: Matthew A. Ezell (Oak Ridge National Laboratory)

A Brief Summary of the HPCM (HPE Performance Cluster Manager) Evolution Over Recent Releases
Sue Miller, Lee Morecroft, and Peter Guyan (HPE)
Abstract: This presentation will cover the enhancements to HPE Performance Cluster Manager (HPCM) over recent releases, from 1.11 to 1.13, along with the patches to those versions that served as the release mechanisms for these enhancements.

System Visualization Using Rackmap
Troy Dey and Peter Guyan (HPE)
Abstract: Understanding the health of an HPC system across dimensions such as power, environment, fabric, and job performance is a challenging task, and it continues to increase in difficulty as these systems become larger and more complex. To address this problem, the HPE Performance Cluster Manager (HPCM) now provides Rackmap, an extensible CLI tool capable of rendering telemetry data as a dense 2D representation of the physical layout of components within an HPC system. This dense display of information within the CLI allows a system administrator to view, for example, the power status of thousands of nodes instantly, on a single screen, without having to context-switch to a separate application. In addition, system administrators can easily create their own maps to display information of interest to them, such as whether nodes have passed acceptance tests. This presentation will provide an overview of the Rackmap tool, describe the maps currently shipped with HPCM, and explain how system administrators can create their own maps.

Harvesting, Storing and Processing Data from our HPCM Systems
Ben Lenard, Eric Pershey, Brian Toonen, Peter Upton, Doug Waldron, Lisa Childers, Micheal Zhang, and Bryan Brickman (Argonne National Laboratory)
Abstract: With the Argonne Leadership Computing Facility (ALCF) acquiring more HPE supercomputers, each equipped with its own HPCM stack, alongside other operational programs, we devised a strategy to centralize monitoring data from these systems. This centralized system aggregates data from various sources and securely distributes it to different consumers, including various teams and platforms within the ALCF.

Paper, Presentation
Technical Session 3C: Future Technology
Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh)

Evolving Sarus to augment Podman for HPC on Cray EX
Alberto Madonna, Gwangmu Lee, and Felipe Cruz (Swiss National Supercomputing Centre)
Abstract: Podman provides a modern and flexible containerization solution but lacks the specialized features required for high-performance computing. The evolution of the Sarus project aims to integrate Podman into a modular, open-source HPC container suite, bridging mainstream container technologies and supercomputing. This presentation highlights how Sarus deploys and optimizes Podman for HPC on CSCS's Alps infrastructure, a Cray EX system, focusing on the following areas:

- HPC-Optimized Podman: secure and scalable rootless containers for supercomputing environments, with HPC-specific configuration templates.
- Workload Management Integration: seamless orchestration of containerized workloads via a SLURM-compatible SPANK component.
- Transparent HPC Resource Access: Open Container Initiative (OCI) hooks and the Container Device Interface (CDI) provide pluggable access to compute, network, and storage resources on Cray EX systems.
- Parallel Filesystems Support: a Squashfs-based image store for efficient use of HPC storage systems.
- Secure Multi-Tenancy: rootless subid synchronization for Podman on shared distributed systems.

The presentation will include test results on Alps, demonstrating how Sarus enables Podman to handle containerized job submissions efficiently and seamlessly. By augmenting community container tools like Podman to meet HPC needs, Sarus delivers a modern and flexible container stack optimized for CSCS's vClusters architecture on Cray EX systems.
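As a hedged illustration of the CDI-based device access mentioned above, the following sketch runs a rootless Podman container with a CDI-style device request. The image name and the CDI device string are placeholders (CDI kinds are vendor-defined), and on a production system the workload manager integration, rather than the user, would typically assemble this invocation.

```python
# Illustrative rootless Podman invocation with a CDI-style device request.
import subprocess

cmd = [
    "podman", "run", "--rm",
    "--device", "amd.com/gpu=all",          # CDI device request (placeholder kind)
    "docker.io/library/ubuntu:24.04",       # placeholder image
    "bash", "-lc", "echo hello from a rootless container",
]
subprocess.run(cmd, check=True)
```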
What is RISC-V and why should we care?
Nick Brown (EPCC)
Abstract: RISC-V is an open Instruction Set Architecture (ISA) standard which enables the open development of CPUs and a shared common software ecosystem. With billions of RISC-V cores already produced, and production accelerating rapidly, we are seeing a revolution driven by open hardware. Nonetheless, for all the successes that RISC-V has enjoyed, it has yet to become mainstream in HPC. This comes at a time when HPC is facing new challenges, especially around the performance and sustainability of operations, and recent advances such as data-centre-class RISC-V hardware make this technology a more realistic proposition with the potential to address them. In this survey paper we explore the current state of the art of RISC-V for HPC, identifying areas where RISC-V can benefit the HPC community, assessing the maturity of the hardware and software ecosystem for HPC, and identifying areas where the HPC community can contribute. The outcome is a set of recommendations on where the HPC and RISC-V communities can come together and focus on high-priority action points to help increase adoption.

A Full Stack Framework for High Performance Quantum-Classical Computing
Xin Zhan, K. Grace Johnson, and Soumitra Chatterjee (HPE); Barbara Chapman (HPE, Stony Brook University); and Masoud Mohseni, Kirk Bresniker, and Ray Beausoleil (HPE)
Abstract: To address the growing need for scalable distributed High Performance Computing (HPC) and Quantum Computing (QC) integration, we present our HPC-QC full-stack framework and its hybrid workload development capability, which takes a modular, hardware- and device-agnostic software integration approach. The latest developments in extensible interfaces for quantum programming, dispatching, and compilation within an existing mature HPC programming environment are demonstrated. Our HPC-QC full stack enables high-level, portable invocation of quantum kernels from commercial quantum SDKs within HPC meta-programs in compiled languages (C/C++ and Fortran) as well as Python, through a quantum programming interface library extension. An adaptive circuit-knitting hypervisor is being developed to partition large quantum circuits into sub-circuits that fit on smaller noisy quantum devices and classical simulators. At the lower level, we leverage the Cray LLVM-based compilation framework to transform and consume LLVM IR and Quantum IR (QIR) from commercial quantum software front-ends in a retargetable fashion for different hardware architectures. Several hybrid HPC-QC multi-node, multi-CPU and GPU workloads (including solving linear systems of equations, quantum optimization, and simulating quantum phase transitions) have been demonstrated on HPE EX supercomputers to illustrate functionality and execution viability for all three components developed so far. This work provides the framework for a unified quantum-classical programming environment built upon the classical HPC software stack (compilers, libraries, parallel runtimes, and process scheduling).

Paper, Presentation
Technical Session 3A: Data Centers
Session Chair: Lena M Lopatina (LANL)

Causality inference for Digital Twins in GPU Data Centers and Smart Grids
Rolando Pablo Hong Enriquez, Pavana Prakash, Ebad Taheri, and Aditya Dhakal (HPE); Matthias Maiterth and Wesley Brewer (Oak Ridge National Laboratory); and Dejan Milojicic (HPE)
Abstract: To the benefit of both technologies, data centers and smart grids will likely become ever more integrated in the near future. The downside is that effectively managing those systems will rapidly become burdensome if we neglect to prepare accordingly. Digital twins can potentially bring the benefits of advanced analytics and visualization to the management of such complex environments. Yet even today's AI systems lack a proper causal understanding of the data. Here we embark on a journey to collect proper causal data for validating causal inference methods based on three fundamentally different theoretical foundations: causal calculus, information theory, and dynamical systems theory. Subsequently, we apply these methods to two target datasets, from a smart grid and from a GPU data center. Finally, we analyze the successes and failures of applying these methodologies and the insights they offered for creating more effective and energy-efficient prediction strategies for digital twins in support of smart grids and GPU data centers.

AlpsB – a Geographically Distributed Infrastructure to Facilitate Large-Scale Training of Weather and Climate AI Models
Alex Upton, Jerome Tissieres, and Maxime Martinasso (Swiss National Supercomputing Centre)
Abstract: AI-based models are transforming weather forecasting; these models are high quality and inexpensive to run compared to traditional physics-based models, and they are already outperforming existing forecasting systems on many standard scores. The size of training datasets, however, remains a challenge. The widely used ERA5 dataset, for example, is over 5 PB, and such datasets are not typically located close to the large-scale compute power required for training AI models. As such, new solutions are required.

Co-design, deployment and operation of a Modular Data Centre (MDC) with air and direct-liquid cooled supercomputers
Sadaf Alam (University of Bristol); Emma Akinyemi, Martin Podstata, and Jan Over (HPE); and Simon McIntosh-Smith, Ross Barnes, Naomi Harris, and Dave Moore (University of Bristol)
Abstract: The Bristol Centre for Supercomputing (BriCS) deployed its first HPE modular data centre (MDC), also known as a Performance Optimised Data Centre (POD), in March 2024. This has been a collaborative co-design project between HPE and the University of Bristol. The MDC enabled the rapid commencement of operations for the research community on the direct liquid cooled (DLC) Isambard-AI phase 1 (HPE Cray EX2500) and the air-cooled Isambard 3 (HPE Cray XD224), with NVIDIA Grace-Hopper and Grace-Grace superchips, respectively. A second set of MDCs has been deployed for Isambard-AI phase 2, containing 5,280 NVIDIA Grace-Hopper superchips in HPE Cray EX4000 DLC cabinets, together with the management and storage ecosystems. This manuscript outlines key features of the HPE POD MDCs for sustainability, efficiency, flexibility, and observability in an era when data centre cooling and power needs are changing with the growing demands of AI and HPC. We leverage community efforts, specifically the Energy Efficient High Performance Computing Working Group (EE HPC WG), which aims to sustainably support science through committed community action by encouraging the implementation of energy conservation measures and energy-efficient design in HPC [1]. We outline notable advantages of the MDC approach for the constraints and requirements unique to the Isambard-AI project that led to a co-design approach. We conclude by highlighting the key lessons drawn from this work.
Paper, Presentation Technical Session 4B: GPU Energy Efficiency Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre) Optimizing GPU Frequency for Sustainable HPC: Lessons Learned from a Year of Production on Adastra, an AMD GPU Supercomputer Gabriel Hautreux, Naïma Alaoui, and Etienne Malaboeuf (CINES) Abstract Abstract Power consumption is a critical concern for GPU-based high-performance computing (HPC) systems as rising energy costs and environmental challenges push for energy-efficient solutions. Modern GPUs, such as the AMD MI250X used in the Adastra supercomputer, offer features like frequency scaling to manage power consumption dynamically. However, optimizing frequency configurations for diverse HPC and AI workloads is complex due to their varied computational demands. CINES, operating the Adastra system in Montpellier, France, conducted a study analyzing the impact of reducing the GPU frequency from 1.7 GHz to 1.5 GHz. Adastra, ranked #3 on the Green500 list of energy-efficient supercomputers, supports French researchers across scientific domains. Previous findings presented at SuperComputing 23 showed that frequency downscaling improved energy efficiency while slightly impacting performance, prompting CINES to adopt the lower frequency in July 2024. A year-long analysis revealed a 15% reduction in energy consumption per node, aligning with sustainability goals without requiring hardware modifications. The study also assessed application performance, user satisfaction, hardware reliability, and differences between HPC and AI workloads. The results provide actionable insights for HPC centers aiming to enhance energy efficiency while avoiding the complexity and overhead of dynamic strategies. Fine-Grained Application Energy and Power Measurements on the Frontier Exascale System Oscar Hernandez and Wael Elwasif (Oak Ridge National Laboratory) Abstract Abstract The increasing complexity and power/energy demands of heterogeneous exascale systems, such as the Frontier supercomputers, present significant challenges for measuring and optimizing power consumption in applications. Current tools either lack the resolution to capture fine-grained power and energy measurements or fail to integrate this information with application performance events. This paper introduces a novel open-source performance toolkit that integrates extended PAPI components with Score-P to enable fine-grained millisecond-level power and energy measurements for AMD MI-250x GPUs and CPUs. EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs Anna Yue, Torsten Wilde, Sanyam Mehta, and Barbara Chapman (HPE) Abstract Abstract The widespread adoption of GPUs combined with the significant power consumption of GPU applications prepares a strong case for an effective power/energy saving tool for GPUs. Interestingly, however, GPUs present unique challenges (that are traditionally not seen in CPUs) towards this goal, such as very few available low-overhead performance counters and fewer optimization opportunities. We propose Everest, a proof-of-concept tool that dynamically characterizes applications to find novel and effective opportunities for power and energy savings while providing desired performance guarantees. Specifically, Everest finds two unique avenues for saving energy using DVFS (Dynamic Voltage Frequency Scaling) in GPUs in addition to the traditional method of lowering core clock for memory bound phases. 
Everest does not require any application modification or a priori profiling and has very low overhead. Everest relies on a single chosen performance event, available on both AMD and NVIDIA GPUs, that we show to be sufficient and effective for application characterization; this also makes Everest portable across GPU vendors. Experimental results of our PoC across 8 HPC and AI workloads demonstrate up to 25% energy savings while maintaining 90% performance relative to the maximum application performance, outperforming existing solutions on the latest NVIDIA and AMD GPUs. HPE Cray EX255a (MI300A) Blade Power Capping and HBM Page Retirement Steven Martin, Randy Law, Leo Flores, Ron Urwin, and Larry Kaplan (HPE) Abstract Abstract HPE Cray Supercomputing EX255a is a density-optimized, accelerated blade featuring eight AMD MI300A Accelerated Processing Units (APUs). To deploy the HPE Cray EX255a blades at maximum density in the HPE Cray EX4000 cabinet, the nodes need to be power capped to stay within the 400 kVA maximum cabinet power constraint. Managing node-level power to enforce the cabinet-level constraint while maximizing node-, cabinet-, and system-level performance drove the engineering team to a new power-capping design that will be described in this presentation. This new power-capping design is configured out-of-band via Redfish and is complementary to in-band capping that can be configured via rocm-smi. This presentation will show power and performance data collected on a large customer system and from a smaller system internal to HPE. This design is expected to be leveraged for future HPE Cray EX blades. Paper, Presentation Technical Session 4C: Monitoring Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University) Utilization and Performance Monitoring of Ookami, an ARM Fujitsu A64FX Testbed Cluster with XDMoD Nikolay A. Simakov, Joseph P. White, and Matthew D. Jones (SUNY University at Buffalo) and Eva Siegmann, David Carlson, and Robert J. Harrison (Stony Brook University) Abstract Abstract High-Performance Computing (HPC) resources are essential for scientific and engineering computations but come with substantial initial and operational costs. Therefore, ensuring their optimal utilization throughout their lifecycle is crucial. Monitoring utilization and performance helps maintain efficiency and proactively address user needs and performance issues. This is particularly important for technological testbed systems, where frequent software updates can mask localized performance degradations with improvements elsewhere. HPE Slingshot Monitoring Software: Actionable Insights for HPC and AI Systems Sahil Patel (HPE) Abstract Abstract Modern HPC and AI systems produce vast telemetry data, making performance monitoring and root cause analysis increasingly challenging. Traditional troubleshooting methods often lead to inefficiencies, lengthy resolution times, and costly downtime, unable to meet the demands of today’s high-performance computing environments. LDMS New Features for Deployment in Advanced Environments and Feedback for Operations Jim Brandt, Ben Schwaller, Jennifer Green, Ben Allan, Cory Lueninghoener, Evan Donato, Vanessa Surjadidjaja, Sara Walton, and Ann Gentile (Sandia National Laboratories) Abstract Abstract The Lightweight Distributed Metric Service (LDMS) monitoring, transport, and analysis framework has been deployed on large-scale Cray and HPE systems for over a decade.
Over that time its capabilities have improved dramatically. In this talk we provide updates on capabilities, including deployment and management methods in bare-metal, containerized, and cloud (including hybrid on- and off-premises) environments. We describe how LDMS is being used to collect application data concurrently with system data, and how the low-latency availability of this data for analysis can be used for real-time data analysis and feedback in order to support efficient, resilient, and reliable system operations. Finally, we will describe current related research areas, including 1) use of machine learning for modeling application and system behavioral characteristics and 2) use of new features in the bi-directional communication capability of LDMS to provide low-latency communication and feedback from a distributed analysis system to user, system, and application processes on disparate clusters and to inform data center orchestration decisions. Proactive Health Monitoring and Maintenance of High-Speed Slingshot Fabrics in HPC Environments Michael Cush, Jeff Kabel, Michael Schmit, Michael Accola, and Forest Godfrey (HPE) Abstract Abstract This whitepaper addresses the critical need for maintaining the health of high-speed Slingshot fabrics in high-performance computing (HPC) environments. Identifying and resolving known issues swiftly is essential for optimizing HPC workload performance, yet pinpointing common and emerging problems can be highly challenging. We propose a proactive solution that leverages automated capture of key configuration and performance metrics, coupled with sophisticated event logic, to detect unhealthy components and known bugs within the fabric. This is achieved through the System Diagnostic Utility (SDU), integrated with HPCM and CSM software, which automates data capture and securely transmits it to HPE using HPE Remote Device Access (RDA). This solution is complementary to other system monitoring solutions such as SMS and AIOps. In fact, it can also capture data from those tools to consolidate and enhance the level of data captured for analysis. Paper, Presentation Technical Session 4A: New Deployment Session Chair: Jim Rogers (Oak Ridge National Laboratory) A journey to provide GH200 Mark Klein, Thomas Schulthess, Jonathan Coles, and Miguel Gila (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Bringing a new hardware architecture to production involves a multi-phase process that ensures optimal performance, stability, and integration with existing infrastructure. The process begins with the physical installation of hardware and networking components. Once the hardware is in place, system engineers configure the operating system and the necessary software stacks for performance monitoring and fault detection. This is followed by rigorous testing, including stress tests and benchmark runs, to verify the system’s capabilities and identify any hardware or software anomalies.
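The EX255a power-capping presentation in Session 4B above notes that the new design is configured out-of-band via Redfish. As a hedged illustration of what such an out-of-band request can look like using the standard Redfish Power schema (the BMC address, chassis path, and credentials below are hypothetical placeholders, and the actual HPE Cray EX controller may expose a different schema):

    import requests

    # Sketch: set a node-level power cap out-of-band via a BMC's Redfish API.
    # Endpoint, chassis path, and credentials are placeholders for illustration.
    BMC = "https://bmc-node0042.example.org"
    POWER_URL = f"{BMC}/redfish/v1/Chassis/Node0/Power"

    # Standard Redfish Power schema: PowerControl[0].PowerLimit.LimitInWatts.
    payload = {"PowerControl": [{"PowerLimit": {"LimitInWatts": 2200}}]}

    resp = requests.patch(
        POWER_URL,
        json=payload,
        auth=("admin", "password"),  # placeholder credentials
        verify=False,                # lab setting only; use proper CA certs in production
        timeout=10,
    )
    resp.raise_for_status()
    print("power cap request accepted:", resp.status_code)

A site-wide tool would iterate such requests over every node BMC so that the sum of node caps respects the cabinet-level constraint the abstract describes.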
Evaluating AMD MI300A APU: Performance Insights on LLM Training via Knowledge Distillation Dennis Dickmann (Seedbox); Philipp Offenhäuser (HPE); Rishabh Saxena (HLRS, University of Stuttgart); George Markomanolis (AMD); Alessandro Rigazzi (HPE HPC/AI EMEA Research Lab); Patrick Keller (HPE); and Kerem Kayabay and Dennis Hoppe (HLRS, University of Stuttgart) Abstract Abstract AMD (Advanced Micro Devices) has recently launched the MI300A Accelerated Processing Unit (APU), which integrates Central Processing Unit (CPU) and Graphics Processing Unit (GPU) compute in a single chip with unified High Bandwidth Memory (HBM). This study assesses the capabilities of the AMD Instinct™ MI300A and examines how this new architecture handles real-world generative Artificial Intelligence (AI) workloads. While performance data for Large Language Model (LLM) use cases exists for the MI250X and MI300X, to the best of our knowledge, such assessments are absent for the MI300A. We apply a Knowledge Distillation (KD) use case to distil the knowledge of the Mistral-Small-24B-Base-2501 teacher model into a student model that is 50% sparse using a 2:4 sparsity pattern. The results show quasi-linear scaling of raw performance on up to 256 APUs. Lastly, we discuss the challenges, research and practical implications, and outlook. Evaluation of the NVIDIA Grace Superchip in the HPE/Cray XD Isambard 3 supercomputer Thomas Green and Sadaf Alam (University of Bristol) Abstract Abstract The Bristol Centre for Supercomputing (BriCS) has recently deployed 55,296 Arm Neoverse V2 CPU cores in a supercomputing platform, via 384 nodes of NVIDIA Grace CPU Superchips with LPDDR5 memory, as part of the Isambard 3 HPC service for the UK HPC research community. Isambard 3 is an HPE/Cray XD series system using the Slingshot 11 interconnect. As one of the first systems of this kind, this manuscript overviews details of the hardware and software configuration and presents early performance evaluation and benchmarking results using a representative subset of scientific applications. The focus is to evaluate Isambard 3 as a “plug-and-play” environment for researchers, especially those who are familiar with the Cray software environment. We include microbenchmark results to provide insights into the performance behaviour of this unique architecture. We present a small-scale scaling comparison between the NVIDIA Grace CPU Superchip and other mainstream CPUs, including Intel Sapphire Rapids and AMD Genoa and Bergamo. We report on issues encountered during our attempts to use several major software toolchains available for Arm, such as the HPE Cray Compiling Environment (CCE), the Arm Compiler for Linux, and the NVIDIA compiler, and therefore focus on GCC. Our findings include key opportunities for improvements that were discovered during our benchmarking, evaluation and regression testing on the system as we transitioned the service into operations from January 2025. Separating concerns: Decoupling the Slingshot Fabric Manager from Cray System Management Riccardo Di Maria and Chris Gamboni (Swiss National Supercomputing Centre), Davide Tacchella and Isa Wazirzada (HPE), and Mark Klein (Swiss National Supercomputing Centre) Abstract Abstract The Alps research infrastructure consists of 32 HPE Cray EX cabinets, which are managed by Cray System Management (CSM). A critical component of this system is the Slingshot fabric manager, which is responsible for managing the high-speed network fabric.
Presently, the fabric manager is deployed as a Kubernetes Pod and runs amongst other services on the system management nodes. An ongoing effort aims to separate the fabric manager from Kubernetes and deploy it on bare-metal hardware. The architectural decision-making process is examined in detail, accompanied by a walkthrough of the newly proposed design. The discussion of the design is framed within the context of key quality attributes, including reliability, resiliency, availability, observability, and performance. Subsequently, the focus transitions from the "what" to the "how," providing a comprehensive overview of the execution of the migration of the fabric manager from a Kubernetes-based deployment to a bare-metal environment. Insights are presented regarding aspects that were successful, challenges encountered, and whether the overall outcome of this effort achieved the intended objectives. Paper, Presentation Technical Session 5B: Maintaining Large Systems Session Chair: Aaron Scantlin (National Energy Research Scientific Computing Center) Hardware Triage Tool: Enhancements and Extensions Isa Muhammad Wazirzada, Abhishek Mehta, Vinanti Phadke, and Bhuvan Meda Rajesh (HPE) Abstract Abstract In 2023, HPE released the Hardware Triage Tool (HTT) with the mission to provide high-fidelity diagnoses and minimize time to repair for hardware faults across HPE Cray EX compute and accelerator blades, irrespective of the system manager being used. Detecting operating system noise with detect-detour Nagaraju KN, Clark Snyder, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract HPC applications, especially those that frequently perform global synchronization operations, can be negatively affected by background operating system (OS) activity. The background actions of interest are processing hardware interrupts, software interrupts, and process context switches. While these actions are necessary to the operation of the OS, from the application's point of view they are viewed as "OS noise" that affects performance, and the system should be tuned to minimize them. Identifying sources of OS noise is crucial for application performance but can be difficult. Few options exist to identify sources of OS noise without getting into the intricacies of the underlying kernel internals. The detect-detour tool makes use of the Linux kernel's extended Berkeley Packet Filter (eBPF) [1] feature to help system administrators identify sources of OS noise without requiring them to be kernel experts. Analyzing a Lifetime of Failures on a Cray XC40 Supercomputer Kevin Brown and Tanwi Mallick (Argonne National Laboratory), Zhiling Lan (University of Illinois Chicago), Robert Ross (Argonne National Laboratory), and Christopher Carothers (Rensselaer Polytechnic Institute) Abstract Abstract We analyze hardware errors over the seven-year lifetime of the Theta supercomputer, a large-scale Cray XC40 system at the Argonne Leadership Computing Facility. To ensure accurate interpretation of the logs, we leverage expert knowledge to clean the dataset and remove redundant information. Temporal and spatial analysis techniques are then used to expose how failures and errors trend over time and across components in the system. Additionally, we correlate hardware error logs with system downtime logs to capture the relationship between critical errors and outages over the lifetime of the system.
The results in this work represent a state-of-the-practice report highlighting how severe error types vary over time and across different component types, such as on-node and off-node (network) components. We also demonstrate the effectiveness of our technique in simplifying log analysis by using a unified error classification across components from different vendors, providing valuable insights into normal and anomalous system behaviors. Paper, Presentation Technical Session 5C: Filesystems & I/O Session Chair: Raj Gautam (ExxonMobil) E2000 Performance From Microbenchmarks to Applications William Loewe, Michael Moore, Sakib Samar, and Chris Walker (HPE) Abstract Abstract As the exascale age advances and FLOPS performance continues to grow, the associated I/O demands increase commensurately. To address this, the HPE Cray Supercomputing Storage Systems E2000 is the next generation of the HPE Cray Supercomputing Storage product line, with a focus on performance. This paper discusses the architecture changes in the E2000 and provides node and file system microbenchmarks measuring bandwidth, IOPS, and metadata performance. The improved PCIe and NVMe drive speeds, in addition to the higher-density enclosure in the E2000-F, allow for more than twice the throughput and IOPS performance compared to the previous generation, with nearly all of the performance achievable by optimal application workloads. System configuration choices that influence system-level performance, such as the number of storage targets and BIOS settings, will be compared with an aim to optimize the gains and determine ideal client/server tunings. Finally, performance of application-relevant workloads, including random access, shared file, and AI/ML storage workloads, will be presented along with discussion of application and job changes to utilize the E2000 performance improvements. Towards Empirical Roofline Modeling of Distributed Data Services: Mapping the Boundaries of RPC Throughput Philip Carns, Matthieu Dorier, Rob Latham, Shane Snyder, and Amal Gueroudji (Argonne National Laboratory); Seth Ockerman (University of Wisconsin-Madison); Jerome Soumagne (HPE); Dong Dai (University of Delaware); and Robert Ross (Argonne National Laboratory) Abstract Abstract The scientific computing community relies on distributed data services to augment file systems and decouple data management functionality from applications. These services may be native to HPC or adapted from cloud environments, and they encompass diverse use cases such as domain-specific indexing, in situ analytics, AI data orchestration, and special-purpose file systems. They unlock new levels of performance and productivity, but also introduce new tuning challenges. In particular, how do practitioners assess performance, select deployment footprints, and ensure that services reach their full potential? Roofline models could address these challenges by setting practical performance expectations and providing guidance to achieve them. HPC workload characterization using eBPF Shubh Pachchigar and Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Brian Friesen (Lawrence Berkeley National Laboratory) Abstract Abstract Efficient interactions with filesystems are essential for scientific workflows operating at scale on HPC systems. In order to design new filesystems and tune system configurations, effective I/O characterization is needed.
Darshan is a widely used tool for I/O characterization that relies on injecting code into application binaries but has some limitations in providing low-level insights. In this work, we propose leveraging eBPF, which enables the execution of user-defined programs within the kernel, to develop a new I/O characterization tool. Our approach aims to complement the capabilities of Darshan by using eBPF to gain deeper insight into application interactions with the underlying filesystems. This is achieved by deploying dynamic instrumentation techniques below the application layer to extract detailed I/O metrics. In this work, we demonstrate the collection of read/write operations and their associated latencies across various available filesystems. The metrics are periodically sampled with a custom eBPF-based LDMS sampler to enable the collection of data at scale. Finally, to demonstrate its feasibility in production HPC environments, we establish that the overhead of the tool is low. This work demonstrates the potential of eBPF to enhance I/O characterization in HPC environments, providing valuable insights that can lead to improved performance and resource utilization. Paper, Presentation Technical Session 5A: Slingshot & MPI Tuning Session Chair: Brett Bode (National Center for Supercomputing Applications, University of Illinois) MPI implementation optimization for the Slingshot network Rahulkumar Gayatri, Adam Lavely, Neil Mehta, Brandon Cook, and Afton Geil (Lawrence Berkeley National Laboratory) Abstract Abstract Optimizing MPI performance is one of the keys to improving the performance of HPC applications. While algorithmic improvements such as overlap of communication and computation are used to improve MPI parallelism within a workflow, the choice of MPI implementation for a given application can also affect overall performance. We have characterized how OpenMPI, MPICH, and Cray MPICH perform on Perlmutter for small to moderate problem sizes using the OSU micro-benchmarks and have shown that the vendor-tuned Cray MPICH typically outperforms the other MPI implementations. We plan to expand this work to additional microbenchmarks, identify ways to improve the MPI implementations, and show how MPI performance differences impact the performance of full HPC applications in the final paper. Using Different MPI Implementations on HPE Cray EX Supercomputers for Native and Containerized Applications Execution Maciej Pawlik and Maciej Szpindler (Academic Computer Centre CYFRONET), Marcin Krotkiewski (University of Oslo), and Alfio Lazzaro (HPE) Abstract Abstract Message Passing Interface (MPI) implementations have to be tailored to specific system architectures to maximize application performance. This applies to optimizations for network transport and the provision of efficient data movement between CPU and GPU memories. The default MPI on HPE Cray EX systems such as LUMI is a proprietary, optimized implementation based on the open-source MPICH ch4 implementation. There are situations where having an alternative such as OpenMPI would benefit users whose applications or containers target OpenMPI and where some effort would be needed to change. Having alternative MPI implementations also allows for performance comparisons, investigating bugs, and checking new MPI functionalities.
In this paper, we report on our experience with installing containerized and native OpenMPI environments on LUMI, showing how users can build and run containers and get the expected performance. We show a performance comparison with respect to HPE Cray MPI executions using the OSU benchmarks and an example real-world application for the solution of Dirac equations using GPUs. Although we only refer to LUMI, similar concepts can be applied to other supercomputers. Scaling MPI Applications on Aurora Nilakantan Mahadevan (Hewlett Packard Enterprise); Premanand Sakarda (Intel Corporation); Scott Parker, Servesh Muralidharan, Vitali Morozov, and Victor Anisimov (Argonne National Laboratory); Huda Ibeid, Anthony-Trung Nguyen, and Aditya Nishtala (Intel Corporation); Larry Kaplan and Michael Woodacre (Hewlett Packard Enterprise); and Kalyan Kumaran and JaeHyuk Kwack (Argonne National Laboratory) Abstract Abstract The Aurora supercomputer, which was deployed at Argonne National Laboratory in 2024, is currently one of three exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes, each of which contains six Intel Data Center Max Series GPUs, Intel’s first data center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel’s first Xeon processor to contain HBM memory. To achieve exascale performance, the system utilizes the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is the largest deployment of the Slingshot fabric to date, with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel-powered nodes and the Slingshot network enabled Aurora to become the second fastest system on the Top500 list in June of 2024 and the fastest system on the HPL-MxP benchmark. The system is one of the most powerful systems in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design, with a particular focus on the network fabric and the approach taken to validating it. The performance of the system is demonstrated through the presentation of the results of MPI benchmarks as well as performance benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally, results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM, demonstrating that Aurora provides the throughput, latency, and bandwidth across the system needed to allow applications to perform and scale to large node counts, providing new levels of capability and enabling breakthrough science. Paper, Presentation Technical Session 6B: Framework for HPC-AI workflows Session Chair: Chris Fuson (Oak Ridge National Laboratory) Framework for tracking metadata, lineage and model provenance in hybrid simulation-AI HPC exascale workflows Martin Foltin, Andrew Shao, Rishabh Sharma, Shreyas Kulkarni, Annmary Justine Koomthanam, Aalap Tripathy, and Cong Xu (HPE); Wenqian Dong (Oregon State University); Suparna Bhattacharya (HPE); Brian Sammuli (General Atomics); and Paolo Faraboschi (HPE) Abstract Abstract Integration of AI in HPC workflows can have a profound impact on HPC scale and usability, for example, by accelerating simulations with surrogate models or intelligently steering simulations based on previous results.
New workflows are explored in which AI models are iteratively improved by continual learning to better reflect input data distributions and avoid outliers and drifts. Tracking of model provenance in these workflows is important to understand how new data affect model performance, to allow unwinding to previous iterations, and to provide a better understanding of the conditions under which AI models perform well for future reuse. This is more challenging in hybrid HPC-AI workflows because the lineage and provenance must be tracked across multiple software components at different levels of scale. In this work, we extend the HPE Common Metadata Framework (CMF) to hybrid simulation-AI workflows. We demonstrate the benefits of CMF tracking across simulation, AI training, and inference alongside the HPE SmartSim system on a simple computational fluid dynamics problem with eddy kinetic energy parameterized by AI. We track out-of-distribution data for continuous learning and employ adaptive switching between different models to improve the quality of results. We are working with the fusion energy and materials science communities to enhance their workflows in a similar fashion. Search and Query Framework for Workflows with HPC and AI Models Christopher Rickett, Sreenivas Sukumar, and Karlon West (HPE) Abstract Abstract Modern computational science workflows increasingly involve complex, interactive, and iterative search through data from simulations of physics-based equations coupled with analytic, predictive, generative, and agentive tasks. Unfortunately, there are no query engines that empower scientists to search through scientific data with AI, analytic, and physics-based models the way SQL query engines search structured data or keyword-search/prompt engines search textual data. FirecREST v2: Lessons Learned from Redesigning an API for Scalable HPC Resource Access Elia Palme and Juan Pablo Dorsch (CSCS - ETH Zurich); Ali Khosravi and Giovanni Pizzi (PSI Center for Scientific Computing, Theory, and Data); and Francesco Pagnamenta, Andrea Ceriani, Eirini Koutsaniti, Rafael Sarmiento, Ivano Bonesana, and Alejandro Dabin (CSCS - ETH Zurich) Abstract Abstract We introduce FirecREST v2, the next generation of our open-source RESTful API for programmatic access to HPC resources. FirecREST v2 delivers a ~100x performance improvement over its predecessor. This paper explores the lessons learned from redesigning FirecREST from the ground up, with a focus on integrating enhanced security and high throughput as core requirements. We provide a detailed account of our systematic performance testing methodology, highlighting common bottlenecks in proxy-based APIs with intensive I/O operations. Key design and architectural changes that enabled these performance gains are presented. Finally, we demonstrate the impact of these improvements, supported by independent peer validation, and discuss opportunities for further improvements. Paper, Presentation Technical Session 6C: Programming Models Session Chair: Benjamin Cumming (CSCS, ETH Zurich) Designing GPU-aware OpenSHMEM for HPE Cray EX and XD Systems Danielle Sikich, Naveen Namashivayam Ravichandrasekaran, Md Rahman, Elliot Joseph Ronaghan, Nathan Wichmann, and William Okuno (HPE) Abstract Abstract OpenSHMEM is a Partitioned Global Address Space (PGAS) based library interface specification. It is the culmination of a standardization effort among many implementers and users of the SHMEM programming model.
The existing OpenSHMEM specification is not GPU-aware: the programming model does not support managing data movement operations involving GPU-attached memory buffers. However, OpenSHMEM users are exploring options to enable the execution of their data-driven workloads on heterogeneous system architectures. Quantifying Message Aggregation Optimisations for Energy Savings in PGAS Models Aaron Welch and Oscar Hernandez (Oak Ridge National Laboratory) and Stephen Poole and Wendy Poole (Los Alamos National Laboratory) Abstract Abstract Having broken past the exascale barrier, HPC systems are facing their greatest challenge yet: a power wall that must be addressed through new methods in both hardware and software. While energy costs are becoming a major issue at all levels, of particular concern is that of the network, as the relative cost of moving data is increasing faster than ever. The partitioned global address space (PGAS) model is critical within certain HPC domains, but is known to suffer from the small-message problem, where irregular many-to-many access patterns congest the network with excessive numbers of small messages. To address this, the conveyor aggregation library was developed to defer individual messages and group them for subsequent bulk processing. In this paper, we investigate its impact on energy use related to the network, with a focus on the Slingshot 11 interconnect. We will demonstrate that this strategy is not only highly performant, but also crucial to reducing energy footprints to remain within target power envelopes. Accelerating LArTPC Simulations: Enhancing larnd-sim with GPU Optimization Techniques Madan Timalsina (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Matt Kramer (Lawrence Berkeley National Laboratory); Pengfei Ding (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Ronan Doherty (Trinity College Dublin); Rishabh Dave (UC Berkeley); Nicholas Tyler, Urjoshi Sinha, and William Arndt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); and Callum Wilkinson (Lawrence Berkeley National Laboratory) Abstract Abstract Advancements in general-purpose computing on GPUs have enabled highly parallelized Monte Carlo simulations for particle physics experiments, including for the Deep Underground Neutrino Experiment (DUNE), which will use the world's most powerful neutrino beam to study the properties of these elusive particles. Here, we present our efforts on the optimization of larnd-sim, a microphysical simulation for liquid argon time projection chambers (LArTPCs) with light and pixelated charge readout, originally developed for the DUNE Near Detector (ND-LAr). Implemented in Python and utilizing GPU acceleration via Numba and CuPy, larnd-sim processes energy depositions from Geant4 to simulate physical phenomena (such as ionization electron drift) and the response of the detector electronics. By profiling with NVIDIA tools and optimizing memory transfers, adjusting register counts, tuning block and grid dimensions, altering floating-point precision, enabling the "fastmath" option for transcendental functions, converting arrays to a jagged format, and tuning CUDA kernels, we achieved an over-50% GPU-memory reduction, a ~30% wallclock speed improvement, and individual kernel speedups of 10–500%.
In addition to these ongoing tests on the NERSC Perlmutter supercomputer, we are working with collaborators at ANL to run these simulations on the Polaris machine, further expanding larnd-sim's reach. Paper, Presentation Technical Session 6A: DAOS Session Chair: Jesse A. Hanley (Oak Ridge National Laboratory) DAOS - New Horizons for High Performance Storage Michael Hennecke and Jerome Soumagne (HPE) Abstract Abstract DAOS is an open-source scale-out storage system that has redefined performance for a wide spectrum of HPC and AI workloads (https://daos.io/). It is an all-flash solution that can be deployed as a stand-alone storage system, or it can be a high-performance storage tier used in combination with traditional Lustre, GPFS, or (cloud) object storage environments. Enhancing RPC on Slingshot for Aurora’s DAOS Storage System Jerome Soumagne, Alexander Oganezov, Ian Ziemba, and Steve Welch (HPE); Philip Carns and Kevin Harms (Argonne National Laboratory); and John Carrier, Johann Lombardi, Mohamad Chaarawi, Zhen Liang, and Scott Peirce (HPE) Abstract Abstract DAOS, an open-source software-defined high-performance storage solution designed for massively distributed NVMe SSDs and Non-Volatile Memory (NVM), is a key component of the Aurora Exascale system that aims to deliver high storage throughput and low latency to application users. Utilizing the Slingshot interconnect, DAOS leverages Remote Procedure Calls (RPCs) to communicate between compute and storage nodes. While the preexisting RPC mechanism used by DAOS was already designed for High-Performance Computing (HPC) fabrics, it required a number of scalability, performance, and security enhancements in order to be successfully deployed on Aurora. Global Distributed Client-side Cache for DAOS Clarete R. Crasta, John L Byrne, Abhishek Dwaraki, David Emberson, Harumi Kuno, Sekwon Lee, Ramya Ahobala Rao, Shreyas Vinayaka Basri K S, Amitha C, Chinmay Ghosh, Rishi Kesh Kumar Rajak, Sriram Ravishankar, Porno Shome, and Lance Evans (HPE) Abstract Abstract HPC/AI workloads process large amounts of data and perform complex operations on the data at exascale rates for time-critical insights and results. Distributed workloads are often bottlenecked by communication when storage systems are used to coordinate and share results. Storage solutions supporting effective, scalable parallel access from compute clusters are critical to HPC architectures. Caching data on storage servers and/or clients is a known technique used by storage systems to ameliorate communication costs. Current server-side caching methodologies are constrained by the amount of memory and network bandwidth on the fixed and finite set of server nodes. Furthermore, most client-side caches are node-local, meaning the cached data is accessible solely by the node on which the data is stored. DAOS is a promising exascale storage stack recently acquired by HPE. Global client-side caching for DAOS is an attractive proposition due to the higher aggregate client-side resources (e.g., DRAM and network bandwidth) that can scale independently of the number of server nodes. In addition to providing faster data access, a client-side cache should also be efficient, as it consumes expensive resources and requires an efficient caching framework with associated policies. In this paper, we cover the details of realizing efficient shared client-side caching for DAOS.
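The client-side caching abstract above argues that keeping recently read objects on the client avoids repeated server round trips. A generic sketch of that idea follows (an LRU read cache in front of a remote fetch; the names are illustrative placeholders, not the DAOS API):

    from collections import OrderedDict

    # Generic sketch of a node-local read cache in front of a remote object
    # store. fetch_remote() stands in for a real storage call; it is a
    # placeholder, not an actual DAOS interface.
    class LRUReadCache:
        def __init__(self, fetch_remote, capacity=1024):
            self.fetch_remote = fetch_remote
            self.capacity = capacity
            self.entries = OrderedDict()
            self.hits = self.misses = 0

        def get(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)      # keep recently used entries hot
                self.hits += 1
                return self.entries[key]
            self.misses += 1
            value = self.fetch_remote(key)         # pay the network round trip once
            self.entries[key] = value
            if len(self.entries) > self.capacity:  # evict the least recently used
                self.entries.popitem(last=False)
            return value

    cache = LRUReadCache(fetch_remote=lambda k: f"object-{k}", capacity=2)
    for k in [1, 2, 1, 3, 1]:
        cache.get(k)
    print(cache.hits, cache.misses)   # 2 hits, 3 misses

The paper's contribution goes further: making such caches global and shared across clients so aggregate client DRAM and bandwidth scale independently of the server count, which a purely node-local structure like this sketch cannot do.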
Paper, Presentation Technical Session 7B: Access Nodes & Kubernetes Management Session Chair: Jim Williams (Los Alamos National Laboratory) Addressing Resource Constraints on Aurora with Admin Access Nodes Peter Upton, Ben Lenard, Ben Allen, and Cyrus Blackworth (Argonne National Laboratory) Abstract Abstract This paper presents Administrator Access Nodes (AANs) as an alternative to the traditional reliance on a single Admin Node for all aspects of system administration in an HPE Performance Cluster Manager (HPCM) managed supercomputer cluster. At the Argonne Leadership Computing Facility (ALCF), managing the Aurora supercomputer, a large HPE Cray EX system, requires a team of skilled developers and administrators. These professionals require access to many tools for tasks such as parsing log files, issuing power commands, and connecting to nodes via SSH. These tasks have typically been performed solely on the Admin Node. However, this centralization can lead to resource constraints due to simultaneous resource requirements by multiple administrators. To address these issues, the paper details the implementation and operation of AANs, including custom tools for interacting with HPCM, scripts to replicate some Admin Node functionality on AANs, and synchronization tools for configuration files. The introduction of AANs has alleviated resource constraints, streamlined workflows, and enhanced system manageability. Possible future work is also discussed, focusing on further integrating HPCM's APIs and improving usability, aiming to enhance AAN capabilities and administrative efficiency for Aurora's complex environment. HPE Slingshot in the Kubernetes Ecosystem Caio Davi and Jesse Treger (HPE) Abstract Abstract The convergence of traditional HPC systems with AI increases expectations for supercomputing sites to deliver new capabilities (beyond traditional batch scheduling, single tenancy, and bare-metal application deployment methodologies) for more dynamic provisioning. Convergence with enterprise cloud computing techniques such as containerized applications and Kubernetes has become a priority. But transitioning high-performance computing (HPC) environments and applications to Kubernetes is complex because of the critical requirement to maintain low-latency networking for high performance. In this context, we have HPE Slingshot, a modern high-performance interconnect for HPC and AI clusters that delivers industry-leading performance, bandwidth, and low latency for HPC, AI/ML, and data analytics applications through innovations in the fabric to overcome congestion and innovations in the NIC to significantly offload communications and message processing from the hosts. Because the HPE Slingshot NICs run native Ethernet alongside their optimized RDMA transport and connectionless protocol, ensuring that the RDMA transport is operating as intended is critical to delivering the high performance expected in HPC and AI. This requires careful configuration of Kubernetes because, if not configured correctly, the system can fall back to standard TCP/IP over Ethernet, forfeiting the expected HPC and AI performance. Our proposed solution is composed of a number of Kubernetes components, such as device plugins, CNIs, an Operator, and admission policies.
These contributions represent a significant advancement in deploying and operating HPC applications within containerized environments, offering a robust framework for future developments in distributed computing and ensuring both high performance and ease of management for the continuing convergence of HPC/AI and cloud computing and the coming transition from siloed HPC interconnects to interoperable Ultra Ethernet transport. Building non-standard images for CSM systems Harold Longley, Isa Wazirzada, Dennis Walker, Andy Warner, and Davide Tacchella (HPE) Abstract Abstract HPC scientists increasingly must innovate using diverse toolchains crafted from various Linux distributions, ensuring they meet individual and project-specific needs to tackle complex challenges. Paper, Presentation Technical Session 7C: Application Performance Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Task-decomposed Overlapped Pressure Preconditioner for Sustained Strong Scalability on Accelerated Exascale Systems Niclas Jansson (KTH Royal Institute of Technology) Abstract Abstract Computational Fluid Dynamics is a natural driver for exascale computing, with a virtually unbounded need for computational resources for accurate simulation of turbulent fluid flow, both for academic and engineering usage. However, with exascale computing capabilities on the horizon, we have seen a transition to more heterogeneous computer architectures with various accelerators. While these systems offer high theoretical peak performance and high memory bandwidth, complex programming models and significant programming investment are necessary to exploit them efficiently. We detail our work on improving the performance and scalability of key numerical methods in the high-fidelity spectral element code Neko on accelerated exascale machines. Efficient preconditioners are essential in incompressible fluid dynamics; however, the most efficient method (with respect to convergence) might be challenging to implement with good performance on an accelerator. We present our development of a GPU-optimised preconditioner with task overlapping for the pressure-Poisson equation, improving the preconditioner's throughput (in TDoF/s) by more than 61%. The new preconditioner is explained in detail, together with detailed performance studies on Cray EX platforms, including strong scalability studies on Frontier, a performance comparison between AMD and NVIDIA accelerated nodes, and an assessment of the feasibility of mixing both node types in a single simulation. Supernovae in HPC: Benchmarking FLASH Across Advanced Computing Clusters Joshua Martin, Eva Siegmann, and Alan Calder (Stony Brook University, Institute of Advanced Computational Science) Abstract Abstract Astrophysical simulations are highly demanding in terms of computation, memory, and energy, requiring new advancements in hardware. Stony Brook University recently expanded its "SeaWulf" computing cluster by adding 94 new nodes with Intel Sapphire Rapids Xeon Max series CPUs. This benchmarking study evaluates the performance and power efficiency of this new hardware using FLASH: a multi-scale, multi-physics software instrument that utilizes adaptive mesh refinement. Our study also compares the performance of Stony Brook's Ookami testbed, which features ARM-based A64FX-700 processors, as well as SeaWulf’s existing AMD EPYC Milan and Intel Skylake nodes.
The focus of our simulation is the evolution of a bright stellar explosion known as a thermonuclear (Type Ia) supernova, a complex 3D problem that incorporates various operators for hydrodynamics, gravity, nuclear burning, and routines for the material equation of state. We perform strong-scaling tests on a 220 GB problem size and assess both single-node and multi-node performance. We analyze the performance of various MPI mappings and processor distributions across nodes. From our strong-scaling tests, we determine the optimal configuration for balancing runtime and energy consumption for our application. Expanding Community Access to Real-World HPC Application I/O Characterization Data Using Darshan Shane Snyder, Philip Carns, Robert Ross, Robert Latham, and Kevin Harms (Argonne National Laboratory) Abstract Abstract HPC systems are deployed with massive, distributed storage subsystems to meet the demands of data-intensive applications. While these storage systems offer impressive peak performance, it is often only attainable in idealized scenarios not reflective of production workloads. In general, there continues to be a lack of community understanding of the I/O performance characteristics of real-world applications. Paper, Presentation, Birds of a Feather Technical Session 7A: AI/ML GPU Workloads Session Chair: Raj Gautam (ExxonMobil) Porting Radio Astronomy Correlation to Setonix, an HPE Cray EX system powered by AMD GPUs Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre, Curtin Institute for Radio Astronomy); Marcin Sokolowski (Curtin Institute of Radio Astronomy); Christopher Harris (Pawsey Supercomputing Research Centre); and Daniel Price and Randal Wayth (SKAO) Abstract Abstract In low-frequency radio astronomy, correlation of the signals coming from hundreds of radio antennas is an early and fundamental step in creating science-ready data products such as images of the sky at radio wavelengths. Because of the high volume of data to process and the rate at which it is produced, correlation is usually performed in real time by dedicated hardware, an FPGA or GPU cluster, installed near the telescope. However, there are science cases where an astronomer would like to correlate data later with customised settings such as time and frequency averaging of signals. Setonix, Pawsey Supercomputing Centre’s HPE Cray EX supercomputer based on AMD CPUs and GPUs, provides radio astronomers with enough computational power for such processing, but the only established GPU correlator works only on NVIDIA GPUs and proved hard to port. In this paper we discuss the process of providing Australian astronomers with an implementation of the correlation algorithm that harnesses the computational power of Setonix. Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers Bishwo Dahal (University of Louisiana Monroe, Oak Ridge National Laboratory) and Elijah Maccarthy and Subil Abraham (Oak Ridge National Laboratory) Abstract Abstract Containers are transforming scientific computing by simplifying the packaging and distribution of applications. This enables researchers to create and deploy their applications in isolated environments with all necessary dependencies, enhancing portability and deployment flexibility.
These advantages make containers especially suitable for High Performance Computing (HPC) facilities like the Oak Ridge Leadership Computing Facility (OLCF), where complex scientific applications are being developed and deployed. In this work, we investigate the performance of containerized machine learning (ML) applications in comparison to bare-metal execution on the Frontier exascale supercomputer. Specifically, we aim to determine whether ML models, when trained and tested within containers on Frontier using Apptainer, exhibit performance similar to that of bare-metal implementations. To achieve this, we use containers to package and run Convolutional Neural Network (CNN)-based ML applications on the OLCF Frontier and Odo supercomputers and assess their performance against bare-metal runs. After conducting scalability tests across up to 30 nodes with 1,680 AMD EPYC CPU cores and 240 GPUs, we find that the performance of the containerized ML applications is on par with that of bare-metal runs. We apply the lessons learned from our containerized ML model to containerizing and evaluating the performance of LLMs like AstroLLaMA and CodeLLaMA on Frontier. BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI Tulsi Mishra, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract As the convergence of HPC and AI reshapes computational workflows, the complexity of managing hybrid environments has become a significant challenge for organizations. HPE Cray Supercomputing User Services Software (USS) offers a transformative approach to simplify, scale, and optimize workflows across HPC and AI landscapes. In this session, we will explore how USS aims to bridge the gap between traditional HPC workloads and AI-driven innovations, providing a unified platform for containerized environments, hybrid deployment orchestration, and energy-efficient operations. Plenary Plenary Session: CUG 2025 Welcome, Keynote Presentation Keynote: What I’ve Learned About Supercomputing from Blowing Up Stars, Michael Zingale (Stony Brook University) Michael Zingale (Stony Brook University) Abstract Abstract Stars shine throughout their lives by converting light elements into heavier elements via nuclear burning. While there are different pathways that low and high mass stars take in their evolution as they exhaust their fuel, explosions of both groups (or their remnants) are possible, leading to a wide range of stellar transients. Modeling these events requires capturing the interplay between hydrodynamics, nuclear reactions, gravity, radiation, rotation, and more physics. These models are also inherently multi-dimensional and span a vast range of timescales. Both algorithmic developments and leveraging of modern supercomputer architectures are key to performing accurate and efficient simulations of these explosions. In this talk, I will discuss some of the lessons I’ve learned from more than two decades of writing simulation codes for these problems. I will show examples of where new algorithms needed to be developed, instead of using general codes, and when complete rewrites of codes were needed to support new architectures. Finally, I will talk about how we will train our students to write the next generation of codes. New Member Site: Introducing LRZ Markus Michael Müller (LRZ) Abstract Abstract The Leibniz Supercomputing Centre (LRZ) was founded in 1962 as an institute of the Bavarian Academy of Sciences and Humanities (BAdW).
LRZ is one of the three German high-performance computing centres of the Gauss Centre for Supercomputing (GCS). Apart from providing compute services, LRZ also provides various IT services for universities and research institutes in Munich, among them the Munich Scientific Network, identity management, and many more. Plenary Plenary Session: Stony Brook LOC Welcome, HPE Update Altair: AI/ML Intelligent Scheduling for HPC with Altair® Bill Nitzberg (Altair Engineering, Inc.) Abstract Abstract What's possible when you start with a deep understanding of HPC usage patterns (via Altair InsightPro™) and use that data to build predictive AI models (via Altair RapidMiner®) to augment traditional HPC (via Altair HPCWorks®)? Better utilization (a lot better), faster turnaround (a lot faster), and more throughput (a lot more). NVIDIA HPC Software - Expanding HPC with Python & AI Becca Zandstein (NVIDIA) Abstract Abstract NVIDIA's HPC software enables developers to build applications that take advantage of every aspect of the hardware available to them: CPU, GPU, and interconnect. In this presentation you will learn the latest updates on NVIDIA's HPC software that is being used in HPC centers, including the latest AI for Science and Python products. This presentation will provide an overview of the NVIDIA HPC software stack, taking you from traditional HPC compilers and CPU-optimized libraries to Python HPC tooling and AI for Science software that you can start using today. Plenary, Paper Plenary Session: CUG Organizational Update and Best Paper Presentation Evolving HPC services to enable ML workloads on HPE Cray EX Stefano Schuppli, Fawzi Mohamed, Henrique Mendonca, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Torsten Hoefler, and Thomas Schulthess (Swiss National Supercomputing Centre) Abstract Abstract The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the Swiss AI community's early-access phase on Alps (2023) and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.
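Among the enhancements the best-paper abstract above lists is a utility to vet allocated nodes for compute readiness before a large ML job launches. A minimal sketch of that pattern follows (the probe command and Slurm usage are illustrative assumptions, not CSCS's actual utility):

    import subprocess

    # Sketch: run a cheap probe on every allocated node and flag stragglers
    # before launching the real workload. The probe command is a placeholder;
    # a real check might time a small GEMM or query the GPUs on each node.
    def vet_nodes(nodes, probe="hostname", timeout=30):
        bad = []
        for node in nodes:
            try:
                subprocess.run(
                    ["srun", "-w", node, "-N", "1", "-n", "1", probe],
                    check=True, capture_output=True, timeout=timeout,
                )
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                bad.append(node)   # node failed or hung: exclude it
        return bad

    if __name__ == "__main__":
        suspect = vet_nodes(["nid001234", "nid001235"])   # hypothetical node names
        if suspect:
            print("exclude before launch:", ",".join(suspect))

At the scale of 10,752 GPUs even rare per-node faults become near-certain per job, which is why such a screening pass pays for itself before long training runs.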
Alps, a versatile research infrastructure Maxime Martinasso (Swiss National Supercomputing Centre, ETH Zurich) and Mark Klein and Thomas Schulthess (Swiss National Supercomputing Centre) Abstract Abstract The Swiss National Supercomputing Centre (CSCS) has a long-standing tradition of delivering top-tier high-performance computing systems, exemplified by the Piz Daint supercomputer. However, the increasing diversity of scientific needs has exposed limitations in traditional vertically integrated HPC architectures, which often lack flexibility and composability. To address these challenges, CSCS developed Alps, a next-generation HPC infrastructure designed with a transformative principle: resources operate as independent endpoints within a high-speed network. This architecture enables the creation of independent tenant-specific and platform-specific services, tailored to diverse scientific requirements. Plenary, Vendor Plenary: Sponsors Talks, HPE 1-100 Linaro: Unlocking Exascale Debugging and Performance Engineering with Linaro Forge Rudy Shand (Linaro Ltd) Abstract Abstract Dive into the future of code development and see how Linaro Forge is reshaping what's possible in the world of parallel computing. Linaro Forge unveils the latest advancements: with Linaro DDT, MAP and Performance Reports, we're setting new standards in scalability and ease of use. Discover how these tools have become the go-to solution for developers seeking to push the boundaries of code optimization and performance engineering. Codee: A Tool to Enhance Correctness, Modernization, Security, Portability and Optimization in Fortran and C/C++ Software Applications Manuel Arenaz (Codee) Abstract Abstract Fortran/C/C++ developers are under constant pressure to deliver increasingly complex simulation software that is correct, secure and fast. It is critical to empower development teams with tools to automate code reviews, enforce compliance with industry standards, and prioritize reducing the risk of security vulnerabilities. Codee features unique capabilities for Deep Analysis of Fortran/C/C++ code, helping to catch bugs, enforce coding guidelines, modernize legacy code, ensure code portability, address security vulnerabilities, and optimize code efficiency. Codee provides automated checkers for the rules documented in the Open Catalog as well as AutoFix capabilities for semi-automatic source code rewriting, including modification of source code statements and insertion of OpenMP or OpenACC directives. Codee integrates seamlessly with popular editors, IDEs, version control systems and CI/CD frameworks, making it easy to incorporate into existing development workflows. Overall, developers who are actively writing, modifying, testing and benchmarking Fortran code will increase their productivity by using Codee. Developers, team leaders and managers will benefit from DevOps and DevSecOps best practices, mitigating risks, boosting productivity, and reducing costs. In this presentation we will also talk about how to use Codee in conjunction with the Cray tools, including compilers (CCE) and performance tools (e.g. CrayPat, Reveal). AMD: The Unreasonable Effectiveness of FP64 Precision Arithmetic Nicholas Malaya (AMD) Abstract Abstract The double-precision datatype, also known as FP64, has been a mainstay of high-performance computing (HPC) for decades. Recent advances in AI have extensively leveraged reduced precision, such as FP16 or, more recently, FP8 for DeepSeek.
Many HPC teams are now exploring mixed and reduced precision to see if significant speed-ups are possible in traditional scientific applications, including methods such as the Ozaki scheme for emulating FP64 matrix multiplications with INT8 datatypes. In this talk, we will discuss the opportunities, and significant challenges, in migrating from double precision to reduced precision. Ultimately, AMD believes a spectrum of precisions is necessary to support the full range of computational motifs in HPC, and that native FP64 remains necessary in the near future. Plenary Plenary: CUG 2026, Panel New Member Site: Introducing Cyfronet Patryk Lasoń (Academic Computer Centre Cyfronet AGH) Abstract Abstract The Academic Computer Centre Cyfronet AGH is the longest-operating and one of the largest supercomputing and networking centres in Poland, with a history of providing access to supercomputing resources dating back to 1975. It operates the fastest supercomputers in Poland, as well as very high-capacity data storage systems and a MAN network. Cyfronet is the organiser and leader of the PLGrid Consortium, consolidating national computing resources, and is also actively involved in leading European projects related to the development of supercomputing technologies and services based on them. The centre also works with SMEs and large companies to enable the effective implementation of HPC (High-Performance Computing) and AI (Artificial Intelligence). VAST Data Platform Jan Heichler (VAST) Abstract Abstract The VAST Data Platform leverages its Disaggregated Shared Everything (DASE) architecture to seamlessly unify HPC and AI data management, offering a data platform that prioritizes high speed, scalability, and simplicity. By eliminating downtime during system upgrades and streamlining parallel I/O, VAST ensures continuous and efficient operation at any scale. Its support for multiple data protocols facilitates effortless integration into existing infrastructures, requiring no custom tuning or complex configurations. This session will explore how the VAST Data Platform addresses the evolving needs of modern technical environments, enabling enhanced performance and operational efficiency. Panel: The Future of Precision in HPC, which FP is the Right One? Ashley Barker (Oak Ridge National Laboratory) Abstract Abstract This panel will explore the evolving role of floating-point precision in high-performance computing (HPC) and AI workloads, analyzing the trade-offs between FP64 and emerging alternatives such as mixed-precision techniques, lower-precision formats, and emulation methods. The panelists will share a nuanced discussion on balancing accuracy, performance, and hardware constraints in modern HPC and AI workloads. Plenary CUG 2025 Closing Remarks Paper, Presentation Technical Session 1B: Workload manager Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University) Slinky: The Missing Link Between Slurm and Kubernetes Tim Wickberg (SchedMD LLC) Abstract Abstract Slinky is SchedMD's collection of projects to integrate the Slurm Workload Manager with the Kubernetes Orchestrator. How Best to Leverage Cloud for (Big) HPC Sites Bill Nitzberg and Ian Littlewood (Altair Engineering, Inc.) Abstract Abstract Cloud (finally) works for HPC, but the devil is still in the details. HPC in the Cloud has transitioned from "proof-of-concept engagements" to "hmm, but what about security" to "maybe, but what about the data" to "OK, but only if we carefully manage expenses".
Today, many big sites have voted with their dollars that on-premise HPC is here to stay, adding Cloud judiciously, with an eye towards resilience, and only where the end-to-end ROI makes sense. Divide and Rule: Automated Workload Distribution for Efficient User Support Services Luca Marsella (Swiss National Supercomputing Centre) Abstract Abstract User support services for High-Performance Computing (HPC) systems help users conduct simulations and optimize resources. The complexity of HPC platforms has increased, making user support more challenging. Site Reliability Engineering (SRE) best practices suggest that technical staff should focus on project work and automation to reduce repetitive tasks. Artificial intelligence and machine learning can help create effective knowledge bases from internal reports and user tickets, easing the burden on support staff. Paper, Presentation Technical Session 1C: Software deployment Session Chair: Chris Fuson (ORNL, Oak Ridge National Laboratory) Deploying and Tracking Software with NCCS Software Provisioning Asa Rentschler, Nicholas Hagerty, Elijah Maccarthy, and Edwin F. Posada Correa (Oak Ridge National Laboratory) Abstract Abstract The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory has a long history of deploying groundbreaking leadership-class supercomputers for the U.S. Department of Energy. The latest in this line of supercomputers is Frontier, the first supercomputer to break the exascale barrier (10^18 floating-point operations per second) on the TOP500 list. Frontier serves a wide array of scientific domains, from traditional simulation-based workloads to newer AI and Machine Learning workloads. To best serve the NCCS user community, NCCS uses Spack to deploy a comprehensive software stack of scientific software packages, providing straightforward access to these packages through Lmod Environment Modules. Maintaining a large software stack while also including multiple new compiler releases each year is a very time-consuming task. Additionally, it is not straightforward to provide a software stack alongside existing vendor-provided software such as the HPE/Cray Programming Environment (CPE), and existing CPE, Spack, and Lmod integration does not allow for multiple versions of GPU libraries such as AMD's ROCm to be used. To address these challenges and shortcomings, NCCS has developed the NCCS Software Provisioning tool (NSP), a tool for deploying and monitoring software stacks on HPC systems. NSP allows NCCS to quickly and effectively provision software stacks from the ground up using template-driven recipes and configuration files. NSP is successfully deployed on Frontier and several other NCCS clusters, enabling the NCCS software team to quickly deploy software stacks for newly-released compilers, expand current software offerings, better support GPU-based software, and monitor Lmod module usage to identify unused software packages that can be removed from the software stack. In this work, we discuss the shortcomings of the previous CPE, Spack, and Lmod usage at NCCS, provide further details on the implementation and structure of NSP, then discuss the benefits that NSP provides.
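As a rough illustration of the template-driven provisioning pattern that NSP adopts, the sketch below renders a recipe into standard Spack CLI calls and refreshes the Lmod modulefiles. The recipe schema and function names here are illustrative assumptions, not NSP's actual format; only the spack commands themselves are standard.

    import subprocess

    # Hypothetical recipe in the spirit of a template-driven approach:
    # one compiler toolchain plus the packages to build against it.
    recipe = {
        "compiler": "gcc@13.2.0",
        "packages": ["hdf5+mpi", "netcdf-c"],
    }

    def provision(recipe):
        # Build each package against the recipe's compiler via the Spack CLI.
        for pkg in recipe["packages"]:
            spec = f"{pkg} %{recipe['compiler']}"
            subprocess.run(["spack", "install", spec], check=True)
        # Regenerate Lmod modulefiles so the new stack is visible to users.
        subprocess.run(["spack", "module", "lmod", "refresh", "-y"], check=True)

    provision(recipe)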
Modern Software Deployment on a Multi-Tenant Cray-EX System Ben Cumming, Andreas Fink, Simon Pintarelli, and John Biddiscombe (CSCS) Abstract Abstract User-facing software -- libraries, tools, applications and programming environments tuned for the node and network architecture -- is a key part of HPC centers' service offering. Teams that maintain and support this software face challenges: providing a stable software platform for users with long-running projects while also providing the latest versions of software for developers; giving full responsibility to build, modify and deploy the whole software stack to staff who do not have root access; and achieving reproducible deployment based on GitOps practices. CSCS addresses these challenges on Alps by using small independent software environments called uenv, which deploy from text-file recipes without requiring installation of the Cray Programming Environment. This paper discusses installing communication libraries from HPE and NVIDIA with Slingshot support; the CI/CD pipeline that builds uenv and deploys them in a container registry; and the command line tools and SLURM plugin that interface users with the software environments. We demonstrate diverse use cases such as JupyterHub, summarize the user and support team experience, and document how to build and deploy CPE containers. Employing a Software-Driven Approach to Scalable HPC System Management Aaron Barlow (Oak Ridge National Laboratory) Abstract Abstract Managing Frontier and other HPE Cray and Apollo clusters at Oak Ridge National Laboratory involves thousands of users, projects, and security policies across multiple HPC systems. With diverse research needs, varying security enclaves, and massive resource allocations, manual processes don't scale, and administrative burden increases as HPE systems grow. To manage HPC systems at this scale, we developed RATS (Resource Allocation Tracking System), a software platform that centralizes operations. Paper, Presentation Technical Session 1A: Multitenancy Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Infrastructure as a Service with Strong Tenant Separation on a Supercomputer Riccardo Di Maria, Chris Gamboni, Manuel Sopena Ballesteros, Hussein Harake, Mark Klein, Marco Passerini, Miguel Gila, Maxime Martinasso, and Thomas C. Schulthess (Swiss National Supercomputing Centre) and Alun Ashton, Derek Feichtinger, Marc Caubet, Elsa Germann, Hans-Nikolai Viessmann, Achim Gsell, and Krisztian Pozsa (Paul Scherrer Institute) Abstract Abstract This paper explores the innovative implementation of Infrastructure-as-a-Service (IaaS) on an HPE Cray Shasta EX supercomputer. In cloud environments, IaaS offers scalable, on-demand access to virtualized resources. However, applying IaaS principles to high-performance computing (HPC) systems without relying on virtualization technologies poses some challenges, since they typically have a tightly coupled software stack. We address these challenges in a co-design partnership between an HPC provider, CSCS, and an end-user institution, PSI, by developing a suite of technologies for the HPE Cray Shasta EX system architecture that supports resource isolation and granular control. This approach not only provides the IaaS model on supercomputing environments but also enables dynamic resource management. Our contributions include a detailed exploration of the technological advancements necessary for integrating IaaS into HPC, together with the lessons learned from our collaborative efforts.
By extending IaaS capabilities to supercomputers, we aim to provide scientific institutions with unprecedented flexibility and control over their computational resources. Dynamic Network Perimeterization: Isolating Tenant Workloads With VLANs, VNIs, & ACLs Nikhil Mukundan, Dennis Walker, Stephen Han, Atif Ali, Siri Vias Khalsa, Amit Jain, Vishal Bhatia, and Vinay Karanth (HPE) Abstract Abstract There is a growing trend in the high-performance computing (HPC) community where separate user groups share HPC infrastructure with varying security clearances (tenants). In such cases, tenants require robust security boundaries to ensure data privacy, results integrity, and intellectual property secrecy. Additionally, sensitive transactions within a tenant may need to be further insulated from lower-clearance workloads. Join us as we show how product-agnostic, version-controlled configuration data can be used to dynamically isolate infrastructure resources supporting workloads, including compute node groups, data-at-rest (storage), and data-in-motion within high-speed and management networks. On the high-speed network (HSN), we'll examine how switch port VLAN filters and VNIs (traffic labels) isolate TCP/IP and RDMA traffic per tenant. On the management network, we'll demonstrate how to segment compute node groups via switch ACLs, VLANs, and iptables. Complete dynamic network segmentation will be applied at various levels of the infrastructure: chassis, nodes, and within the OS. Finally, we'll review architecture features in Slingshot, CSM, and other products that enable elastic tenant reallocation. We'll compare and contrast the differences in the number of configuration options and security posture when applying segmentation at the switch vs the node. CSCS' journey towards complete platform automation in a multi-tenant environment Miguel Gila, Ivano Bonesana, and Alejandro Dabin (Swiss National Supercomputing Centre, CSCS) Abstract Abstract The Swiss National Supercomputing Centre operates a complex ecosystem of high-performance computing resources. With a focus on scalability and efficiency, we have implemented a multi-tenancy approach to serve diverse scientific communities. This layered architecture encompasses infrastructure, platform, and application layers, each with unique automation challenges. Paper, Presentation Technical Session 2B: Security & Configuration Management Session Chair: Jim Williams (Los Alamos National Laboratory) Pragmatic Security Audits: Fortifying HPC Environments at a Consumable Pace Alden Stradling (Los Alamos National Laboratory) and Monica Dessouky and Dennis Walker (HPE) Abstract Abstract Do you know your security posture? Are you overwhelmed by the latest reports? Don't let a sea of security findings paralyze your progress. By establishing a frequent, recurring cadence for audits and remediation, organizations ensure continuous protection against emerging threats. This paper presents the practical, scalable approach to HPC security implemented at a recent customer site to secure its many new environments. Experimenting with Security Compliance Checking using ReFrame Victor Holanda Rusu, Matteo Basso, Chris Gamboni, Fabio Zambrino, and Massimo Benini (Swiss National Supercomputing Centre) Abstract Abstract Security is a critical aspect of High-Performance Computing (HPC) systems, where the implementation of security compliance checks and hardened configurations is essential to safeguard resources and data.
Continuous security checking is fundamental, especially for detecting indications of compromise, but its implementation must balance effectiveness and efficiency to avoid unnecessary strain. Open-source and freely available security-focused tools, such as OSCAP, are less well known and less accessible to engineers from other disciplines, who may not be familiar with their functionality or utility. This creates a barrier to collaborative efforts in improving system-wide configurations and promoting shift-left security in HPC centers. We leveraged ReFrame to perform robust security compliance testing to address these limitations. ReFrame enables the creation of customizable tests to evaluate system configurations, check for generic exploits, and execute tests in parallel, optimizing testing workflows without significant performance penalties. We will present the latest developments at CSCS, showcasing how we plan to use ReFrame to enhance security compliance testing in HPC environments using three different standards: STIG DoD, ANSSI BP-28-enhanced, and CSCS's own. We aim to create a community to develop, maintain, and benefit from a shared set of security checks tailored to HPC systems based on customer-specific, industry-specific, or government-mandated requirements. From Weeks to Hours: Harnessing Configuration Management and Deployment Pipelines Dennis Walker and Siri Vias Khalsa (HPE) and Alex Lovell-Troy (Los Alamos National Laboratory) Abstract Abstract Ensuring peak reliability, current functionality, and up-to-date security requires cultivating the capability to continuously update and integrate a complex array of dependencies, spanning hardware, firmware, system management software, network configurations, API services, OS distros, job schedulers, AI libraries, and analytics tools. This paper presents a simple yet contemporary DevOps methodology designed to automate, validate, and replicate changes effectively to one or many production environments. Rev Up Compute Node Reboots: 2x to 5x Faster Dennis Walker (HPE) and Paul Selwood (Met Office, UK / NERC CMS) Abstract Abstract Join us as we race to the bottom, showcasing the innovations developed to speed up reboot times by 300% for UK Met Office's latest, multi-zoned, CSM-based HPE Cray EX systems. With increasing complexity in software and site-specific customizations, node reboot times ballooned to over 35 minutes by November 2023, far exceeding operational requirements. In response, HPE developed automation to download logs, parse metrics, and graph boot stages by duration to understand better what was happening. Guided by data, the following changes were implemented, sorted by impact:
- CFS/Ansible run-time plays were moved into local systemd execution, running earlier in the boot cycle and without blocking other nodes.
- Software installation and select configuration activity were moved into the image build, streamlining deployment.
- CSM boot settings were tuned for optimal performance.
- Node Health Checks (NHC) were moved into systemd, to run preemptively before the job scheduler agent and ensure nodes are consistently job-ready as early as possible.
Paper, Presentation Technical Session 2C: Climate applications Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre) Bit-reproducibility in UK Met Office Weather and Climate Applications David Acreman (HPE) Abstract Abstract Weather and climate applications solve partial differential equations which are highly sensitive to small perturbations in model variables.
Changes in even the least significant bit of a variable can have an observable impact on scientific results ("Butterfly effect"). The nature of floating point arithmetic means that subtle changes to code or the order of a summation can change results at the bit level due to different round-off errors. Consequently, achieving bit-reproducible results is challenging. Enabling km-scale coupled climate simulations with ICON on AMD GPUs Jussi Enkovaara (CSC - IT Center for Science Ltd.) Abstract Abstract The Icosahedral Nonhydrostatic (ICON) weather and climate model is a modelling framework for numerical weather prediction and climate simulations. ICON is implemented mostly in Fortran 2008 with the GPU version based mainly on OpenACC. ICON is used on a large variety of hardware, ranging from classical CPU clusters to vector architectures and different GPU systems. In coupled simulations ICON can utilize heterogeneous architectures, e.g. the ocean runs on CPUs while the atmosphere runs concurrently on GPUs. MARBLChapel: Fortran-Chapel Interoperability in an Ocean Simulation Brandon Neth and Ben Harshbarger (HPE); Scott Bachman ([C]Worthy); and Michelle Mills Strout (HPE, University of Arizona) Abstract Abstract As the climate crisis continues to have widespread effects on the biosphere, scientists increasingly turn to computer modeling to understand the impacts of different interventions. Modeling one such intervention, ocean carbon dioxide removal, requires incorporating multiple sources of interaction (air-sea gas exchange, biogeochemical processes, etc.) and high spatial and temporal resolutions. To address the need for scalable and high-resolution simulations, scientists at [C]Worthy have written the core of an ocean modeling code using Chapel, a parallel programming language for writing high-performance, distributed programs. Although Chapel has enabled rapid development, an important library for modeling biogeochemical processes, MARBL, is written in Fortran. MARBL is a robust, stand-alone library and is used in several state-of-the-art models including MOM6, MPAS, POP, and ROMS. Rather than rewriting the MARBL library in Chapel, we use Chapel and Fortran's C interoperability to integrate MARBL into the distributed Chapel simulation. This allows us to reuse reliable scientific code while using Chapel to orchestrate parallelism. In this talk, we demonstrate how the distributed Chapel simulation sets up data structures needed by MARBL, calls out to the Fortran library, and brings results back to update the simulation. We show performance results on Perlmutter, an HPE Cray EX system. Redefining Weather Forecasting Systems: The Transition to ICON and Alps Mauro Bianco, Matthias Kraushaar, and Roberto Aielli (ETH Zurich); Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss); and Thomas Schulthess (ETH Zurich) Abstract Abstract The transition of MeteoSwiss operational weather forecasting from COSMO to the ICON model represents a major modernization in meteorological services, integrating software-defined infrastructure to improve flexibility, scalability, and resilience. The migration also involved significant hardware upgrades, from fixed systems with K80 GPUs to flexible architectures using V100 and later A100 GPUs, supported by the Alps infrastructure developed by CSCS, which is based on the HPE Cray EX product line.
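The round-off sensitivity described in the bit-reproducibility abstract above is easy to reproduce; a minimal demonstration with IEEE-754 doubles (pure Python, no assumptions beyond standard floating-point semantics):

    import random

    # Floating-point addition is not associative: regrouping changes the result.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- (b + c) rounds back to -1e16

    # The same effect appears when a reduction is evaluated in a different order,
    # which is why parallel sums are hard to make bit-reproducible.
    random.seed(42)
    vals = [random.uniform(-1.0, 1.0) for _ in range(100000)]
    s1 = sum(vals)
    random.shuffle(vals)
    s2 = sum(vals)
    print(s1 == s2, abs(s1 - s2))  # typically False, differing in the last bits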
Paper, Presentation Technical Session 2A: Slingshot Session Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois) The HPE Slingshot 400 Expedition Houfar Azgomi, Duncan Roweth, Gregory Faanes, and Jesse Treger (HPE) Abstract Abstract HPE Slingshot 400 is a high-performance interconnect for classic and AI supercomputing clusters. As the successor of HPE Slingshot, it comprises a PCIe Gen5 NIC (Cassini-2) and a 64-port switch (Rosetta-2), linking over standard 400 Gbps Ethernet physical interfaces, and enabling dragonfly and fat-tree networks with up to 260,000 endpoints. HPE Slingshot is currently deployed in 7 of the 10 largest supercomputers worldwide and dominates the top 3 as the interconnect solution for the El Capitan, Frontier, and Aurora machines. With such success, the Slingshot Transport protocol has become the cornerstone for HPC-optimized Ethernet networking standardization efforts led by the Ultra Ethernet Consortium (UEC). Growing around its foundational adaptive routing and congestion management feature set, the HPE Slingshot 400 interconnect doubles its predecessor's bandwidth with significant enhancements: exact match forwarding increasing route visibility across the cluster; dedicated ACL tables for security and cloud isolation; feature hardening flexibility with P4-programmability; and improved quality of service with 50% more traffic classes. It is supported across HPE's portfolio of rack- and chassis-based supercomputing platforms including HPE Cray XD, HPE Cray EX, and the latest HPE Cray GX. This paper presents the key features and some early performance results of Rosetta-2 and Cassini-2 devices. Introduction To HPE Slingshot NIC Libfabric Environment Variables Jesse Treger and Ian Ziemba (HPE) Abstract Abstract Libfabric, a high-performance fabric software library, provides a rich API for applications to communicate efficiently over various networking technologies. The Libfabric provider for the specific networking interface type translates API-requested communications into protocols that strive to make optimal use of the hardware. The HPE Slingshot NIC in particular has extensive hardware offloads to improve performance and reduce memory overhead. While Libfabric aims for seamless integration, achieving optimal performance may require users to configure environment variables to fine-tune the software for specific workloads and hardware setups. This presentation will demystify the role of environment variables in Libfabric, explaining trade-offs and why they matter to performance and stability under various conditions. We will begin with an overview of how the HPE Slingshot NIC uses Libfabric to optimize performance with various messaging requirements. Next, we will explore the most common environment variables users may need to adjust from default values, using examples learned on different applications, MPI middleware, processors, or job scale. Finally, we will briefly touch on some best-known methods to troubleshoot application failures that can be addressed with environment settings. Math in Your Network: Slingshot Hardware Accelerated Reductions Forest Godfrey and Duncan Roweth (HPE) Abstract Abstract In high performance computing applications, the use of collective operations such as reductions and barriers is commonplace.
The performance of collectives is critical to overall performance in many applications, especially those where collectives are an increasingly large part of the runtime as jobs scale. Collective operations are typically performed by software, requiring packets carrying contributions to the collective to travel all the way to endpoint memory and be acted upon, only for the result to have to transit back out to the network. This occurs at each level of a collective tree. By performing collective operations inside the network switch hardware itself, the round trips to memory are removed and significant improvements in latency can be achieved. The Slingshot Rosetta switch fabric supports the hardware acceleration of many collective operations such as barriers and 64-bit IEEE floating point reductions. Upcoming Slingshot software will enable this functionality and present it to the end user transparently through the industry-standard libfabric network communication library. This presentation will cover the details of this upcoming feature and how it can be used to accelerate applications. The implementation inside libfabric, interaction with the job scheduler and fabric manager, as well as initial benchmarking results will be discussed. Slingshot Host Software Ethernet Tuning Ravi Bissa, Ian Ziemba, Duncan Roweth, and Forest Godfrey (HPE) Abstract Abstract High-performing Ethernet is the cornerstone of Exascale supercomputers, enabling seamless communication, minimizing latency, and supporting massive scalability. Without robust Ethernet infrastructure, these systems cannot achieve their goal of solving the world's most complex computational problems efficiently and within reasonable time and energy budgets. Paper, Presentation Technical Session 3B: HPCM Session Chair: Matthew A. Ezell (Oak Ridge National Laboratory) A Brief Summary of the HPCM (HPE Performance Cluster Manager) Evolution Over Recent Releases Sue Miller, Lee Morecroft, and Peter Guyan (HPE) Abstract Abstract This presentation will cover the enhancements to the HPE Performance Cluster Manager (HPCM) over the current releases, from 1.11 to 1.13, including the patches to those versions that serve as the release mechanisms for these enhancements. System Visualization Using Rackmap Troy Dey and Peter Guyan (HPE) Abstract Abstract Understanding the health of an HPC system across various dimensions such as power, environment, fabric, and job performance is a challenging task, and continues to increase in difficulty as these systems become larger and more complex. To address this problem, the HPE Performance Cluster Manager (HPCM) now provides Rackmap, an extensible CLI tool capable of rendering telemetry data as a dense 2D representation of the physical layout of components within an HPC system. This dense display of information within the CLI allows a system administrator to view, for example, the power status of thousands of nodes instantly, on a single screen, without having to context switch to a separate application. In addition, system administrators can easily create their own maps to display information of interest to them such as whether nodes have passed acceptance tests. This presentation will provide an overview of the Rackmap tool, describe the maps currently being shipped with HPCM, and go over how system administrators can create their own maps.
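To give a feel for the kind of dense CLI display the Rackmap abstract above describes, here is a toy sketch that paints per-node power state as a rack-by-slot grid; the data layout and symbols are invented for illustration and are not Rackmap's actual map format:

    # Toy telemetry: power state per (rack, slot), normally fed by HPCM telemetry.
    node_power = {(rack, slot): ("off" if (rack + slot) % 7 == 0 else "on")
                  for rack in range(4) for slot in range(16)}

    SYMBOL = {"on": ".", "off": "X"}  # one character per node keeps the view dense

    for rack in range(4):
        row = "".join(SYMBOL[node_power[(rack, slot)]] for slot in range(16))
        print(f"rack{rack:02d}  {row}")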
Harvesting, Storing and Processing Data from our HPCM Systems Ben Lenard, Eric Pershey, Brian Toonen, Peter Upton, Doug Waldron, Lisa Childers, Micheal Zhang, and Bryan Brickman (Argonne National Laboratory) Abstract Abstract With the Argonne Leadership Computing Facility (ALCF) acquiring more HPE Supercomputers, each equipped with its own HPCM stack, alongside other operational programs, we devised a strategy to centralize monitoring data from these systems. This centralized system aggregates data from various sources and securely distributes it to different consumers, including various teams and platforms within the ALCF. Paper, Presentation Technical Session 3C: Future Technology Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Evolving Sarus to augment Podman for HPC on Cray EX Alberto Madonna, Gwangmu Lee, and Felipe Cruz (Swiss National Supercomputing Centre) Abstract Abstract Podman provides a modern and flexible containerization solution but lacks the specialized features required for high-performance computing. The evolution of the Sarus project aims to integrate Podman into a modular, open-source HPC container suite, bridging mainstream container technologies and supercomputing. This presentation highlights how Sarus deploys and optimizes Podman for HPC on CSCS's Alps infrastructure, a Cray EX system, focusing on the following areas: HPC-Optimized Podman: secure and scalable rootless containers for supercomputing environments with HPC-specific configuration templates. Workload Management Integration: seamless job orchestration of containerized workloads via a SLURM-compatible SPANK component. Transparent HPC Resource Access: Open Container Initiative (OCI) hooks and the Container Device Interface (CDI) provide pluggable access to compute, network, and storage resources on Cray EX systems. Parallel Filesystems Support: a Squashfs-based image store for efficient usage of HPC storage systems. Secure Multi-Tenancy: rootless subid synchronization for Podman on shared distributed systems. This presentation will include test results on Alps, demonstrating how Sarus enables Podman to handle containerized job submissions efficiently and seamlessly. By augmenting community container tools like Podman to meet HPC needs, Sarus delivers a modern and flexible container stack optimized for CSCS's vClusters architecture on Cray EX systems. What is RISC-V and why should we care? Nick Brown (EPCC) Abstract Abstract RISC-V is an open Instruction Set Architecture (ISA) standard which enables the open development of CPUs and a shared common software ecosystem. With billions of RISC-V cores already produced, and production accelerating rapidly, we are seeing a revolution driven by open hardware. Nonetheless, for all the successes that RISC-V has enjoyed, it is yet to become mainstream in HPC. This comes at a time when HPC is facing new challenges, especially around performance and sustainability of operations, and recent advances in RISC-V, such as data centre RISC-V hardware, make this technology a more realistic proposition with the potential to address these challenges. In this survey paper we explore the current state of the art of RISC-V for HPC, identifying areas where RISC-V can benefit the HPC community and the level of maturity of the hardware and software ecosystem for HPC, and identify areas where the HPC community can contribute. The outcome is a set of recommendations around where the HPC and RISC-V communities can come together and focus on high-priority action points to help increase adoption.
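For readers unfamiliar with the CDI mechanism mentioned in the Sarus abstract above, the sketch below shows the general shape of a CDI device request with rootless Podman; the image name and device identifier are placeholders, and the exact CDI names on a real system come from its CDI specification files:

    import subprocess

    # Request a GPU through the Container Device Interface (CDI). Podman resolves
    # "vendor.com/class=index" names against the CDI spec files installed on the
    # host; "nvidia.com/gpu=0" is a common convention, used here as a placeholder.
    subprocess.run([
        "podman", "run", "--rm",
        "--device", "nvidia.com/gpu=0",
        "ubuntu:24.04", "nvidia-smi",
    ], check=True)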
A Full Stack Framework for High Performance Quantum-Classical Computing Xin Zhan, K. Grace Johnson, and Soumitra Chatterjee (HPE); Barbara Chapman (HPE, Stony Brook University); and Masoud Mohseni, Kirk Bresniker, and Ray Beausoleil (HPE) Abstract Abstract To address the growing needs for scalable distributed High Performance Computing (HPC) and Quantum Computing (QC) integration, we present our HPC-QC full stack framework and its hybrid workload development capability with a modular, hardware/device-agnostic software integration approach. The latest developments in extensible interfaces for quantum programming, dispatching, and compilation within an existing mature HPC programming environment are demonstrated. Our HPC-QC full stack enables high-level, portable invocation of quantum kernels from commercial quantum SDKs within HPC meta-programs in compiled languages (C/C++ and Fortran) as well as Python through a quantum programming interface library extension. An adaptive circuit knitting hypervisor is being developed to partition large quantum circuits into sub-circuits that fit on smaller noisy quantum devices and classical simulators. At the lower level, we leverage the Cray LLVM-based compilation framework to transform and consume LLVM IR and Quantum IR (QIR) from commercial quantum software front-ends in a retargetable fashion for different hardware architectures. Several hybrid HPC-QC multi-node multi-CPU and GPU workloads (including solving linear systems of equations, quantum optimization, and simulating quantum phase transitions) have been demonstrated on HPE EX supercomputers to illustrate functionality and execution viability for all three components developed so far. This work provides the framework for a unified quantum-classical programming environment built upon the classical HPC software stack (compilers, libraries, parallel runtime and process scheduling). Paper, Presentation Technical Session 3A: Data Centers Session Chair: Lena M Lopatina (LANL) Causality inference for Digital Twins in GPU Data Centers and Smart Grids Rolando Pablo Hong Enriquez, Pavana Prakash, Ebad Taheri, and Aditya Dhakal (HPE); Matthias Maiterth and Wesley Brewer (Oak Ridge National Laboratory); and Dejan Milojicic (HPE) Abstract Abstract To the benefit of both technologies, data centers and smart grids will likely become ever more integrated in the near future. The downside is that effectively managing those systems will rapidly become burdensome if we neglect to prepare accordingly. Digital twins can potentially wrap the benefits of advanced analytics and visualization to manage such complex environments. Yet even today's AI systems lack a proper causal understanding of the data. Here we embark on a journey to collect proper causal data for validating causal inference methods based on three fundamentally different theoretical foundations: causal calculus, information theory, and dynamical systems theory. Subsequently, we apply such methods to two target datasets from a smart grid and a GPU data center. We finally analyze the successes and failures of applying these methodologies and the indications they offered to create more insightful and energy-efficient prediction strategies for digital twins in support of smart grids and GPU data centers.
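As a flavor of one of the three method families named in the causality abstract above, here is a minimal regression-based (Granger-style) test on synthetic data, where a lagged signal x drives y; this is an illustrative sketch, not the authors' implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    x = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):  # y is driven by its own past and by lagged x
        y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()

    # Restricted model: predict y[t] from y[t-1] alone.
    A1 = np.column_stack([y[:-1], np.ones(n - 1)])
    r1 = y[1:] - A1 @ np.linalg.lstsq(A1, y[1:], rcond=None)[0]

    # Full model: also include x[t-1]. A large drop in residual variance
    # indicates that lagged x adds predictive power ("Granger causality").
    A2 = np.column_stack([y[:-1], x[:-1], np.ones(n - 1)])
    r2 = y[1:] - A2 @ np.linalg.lstsq(A2, y[1:], rcond=None)[0]

    print(r1.var() / r2.var())  # >> 1 when x causally drives y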
AlpsB – a Geographically Distributed Infrastructure to Facilitate Large-Scale Training of Weather and Climate AI Models Alex Upton, Jerome Tissieres, and Maxime Martinasso (Swiss National Supercomputing Centre) Abstract Abstract AI-based models are transforming weather forecasting; these models are high quality and inexpensive to run compared to traditional physics-based models, and are already outperforming existing forecasting systems for many standard scores. The size of training datasets, however, remains a challenge. The widely-used ERA5, for example, is over 5PB, and these datasets are not typically located close to the large-scale compute power required for training AI models. As such, new solutions are required. Co-design, deployment and operation of a Modular Data Centre (MDC) with air and direct-liquid cooled supercomputers Sadaf Alam (University of Bristol); Emma Akinyemi, Martin Podstata, and Jan Over (HPE); and Simon McIntosh-Smith, Ross Barnes, Naomi Harris, and Dave Moore (University of Bristol) Abstract Abstract The Bristol Centre for Supercomputing (BriCS) deployed its first HPE modular data centre (MDC), also known as a Performance Optimised Data Centre (POD), in March 2024. This has been a collaborative, co-design project between HPE and the University of Bristol. The MDC has enabled the rapid commencement of operations for the research community for the direct liquid cooled (DLC) Isambard-AI phase 1 (HPE Cray EX2500) and the air-cooled Isambard 3 (HPE Cray XD224), with NVIDIA Grace-Hopper and Grace-Grace superchips, respectively. The second set of MDCs has been deployed for Isambard-AI phase 2, containing 5,280 NVIDIA Grace-Hopper superchips in HPE Cray EX4000 DLC cabinets, together with the management and storage ecosystems. This manuscript outlines key features of the HPE POD MDCs for sustainability, efficiency, flexibility and observability in an era where data centre cooling and power needs are changing with growing demands for AI and HPC. We leverage community efforts, specifically the Energy Efficient High Performance Computing Working Group (EE HPC WG), which aims to sustainably support science through committed community action by encouraging the implementation of energy conservation measures and energy-efficient design in HPC [1]. We outline notable advantages of the MDC approach for the constraints and requirements unique to the Isambard-AI project, which led to a co-design approach. We conclude by highlighting the key lessons drawn from this work. Paper, Presentation Technical Session 4B: GPU Energy Efficiency Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre) Optimizing GPU Frequency for Sustainable HPC: Lessons Learned from a Year of Production on Adastra, an AMD GPU Supercomputer Gabriel Hautreux, Naïma Alaoui, and Etienne Malaboeuf (CINES) Abstract Abstract Power consumption is a critical concern for GPU-based high-performance computing (HPC) systems as rising energy costs and environmental challenges push for energy-efficient solutions. Modern GPUs, such as the AMD MI250X used in the Adastra supercomputer, offer features like frequency scaling to manage power consumption dynamically. However, optimizing frequency configurations for diverse HPC and AI workloads is complex due to their varied computational demands. CINES, operating the Adastra system in Montpellier, France, conducted a study analyzing the impact of reducing the GPU frequency from 1.7 GHz to 1.5 GHz.
Adastra, ranked #3 on the Green500 list of energy-efficient supercomputers, supports French researchers across scientific domains. Previous findings presented at SC23 showed that frequency downscaling improved energy efficiency while slightly impacting performance, prompting CINES to adopt the lower frequency in July 2024. A year-long analysis revealed a 15% reduction in energy consumption per node, aligning with sustainability goals without requiring hardware modifications. The study also assessed application performance, user satisfaction, hardware reliability, and differences between HPC and AI workloads. The results provide actionable insights for HPC centers aiming to enhance energy efficiency while avoiding the complexity and overhead of dynamic strategies. Fine-Grained Application Energy and Power Measurements on the Frontier Exascale System Oscar Hernandez and Wael Elwasif (Oak Ridge National Laboratory) Abstract Abstract The increasing complexity and power/energy demands of heterogeneous exascale systems, such as the Frontier supercomputer, present significant challenges for measuring and optimizing power consumption in applications. Current tools either lack the resolution to capture fine-grained power and energy measurements or fail to integrate this information with application performance events. This paper introduces a novel open-source performance toolkit that integrates extended PAPI components with Score-P to enable fine-grained millisecond-level power and energy measurements for AMD MI250X GPUs and CPUs. EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs Anna Yue, Torsten Wilde, Sanyam Mehta, and Barbara Chapman (HPE) Abstract Abstract The widespread adoption of GPUs, combined with the significant power consumption of GPU applications, makes a strong case for an effective power/energy saving tool for GPUs. Interestingly, however, GPUs present unique challenges (that are traditionally not seen in CPUs) towards this goal, such as very few available low-overhead performance counters and fewer optimization opportunities. We propose Everest, a proof-of-concept tool that dynamically characterizes applications to find novel and effective opportunities for power and energy savings while providing desired performance guarantees. Specifically, Everest finds two unique avenues for saving energy using DVFS (Dynamic Voltage and Frequency Scaling) in GPUs, in addition to the traditional method of lowering the core clock for memory-bound phases. Everest does not require any application modification or a priori profiling and has very low overhead. Everest relies on a single chosen performance event, available across both AMD and NVIDIA GPUs, which we show to be sufficient and effective for application characterization; this also makes Everest portable across GPU vendors. Experimental results of our PoC across 8 HPC and AI workloads demonstrate up to 25% energy savings while maintaining 90% performance relative to the maximum application performance, outperforming existing solutions on the latest NVIDIA and AMD GPUs. HPE Cray EX255a (MI300A) Blade Power Capping and HBM Page Retirement Steven Martin, Randy Law, Leo Flores, Ron Urwin, and Larry Kaplan (HPE) Abstract Abstract HPE Cray Supercomputing EX255a is a density-optimized, accelerated blade featuring eight AMD MI300A Accelerated Processing Units (APUs).
To deploy the HPE Cray EX255a blades at maximum density in the HPE Cray EX4000 cabinet, the nodes need to be power-capped to keep total cabinet power within the 400 kVA maximum cabinet power constraint. Managing node-level power to enforce the cabinet-level constraint while maximizing node-, cabinet-, and system-level performance drove the engineering team to a new power-capping design that will be described in this presentation. This new power-capping design is configured out-of-band via Redfish and is complementary to in-band capping that can be configured via rocm-smi. This presentation will show power and performance data collected on a large customer system and from a smaller system internal to HPE. This design is expected to be leveraged for future HPE Cray EX blades. Paper, Presentation Technical Session 4C: Monitoring Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University) Utilization and Performance Monitoring of Ookami, an ARM Fujitsu A64FX Testbed Cluster with XDMoD Nikolay A. Simakov, Joseph P. White, and Matthew D. Jones (SUNY University at Buffalo) and Eva Siegmann, David Carlson, and Robert J. Harrison (Stony Brook University) Abstract Abstract High-Performance Computing (HPC) resources are essential for scientific and engineering computations but come with substantial initial and operational costs. Therefore, ensuring their optimal utilization throughout their lifecycle is crucial. Monitoring utilization and performance helps maintain efficiency and proactively address user needs and performance issues. This is particularly important for technological testbed systems, where frequent software updates can mask localized performance degradations with improvements elsewhere. HPE Slingshot Monitoring Software: Actionable Insights for HPC and AI Systems Sahil Patel (HPE) Abstract Abstract Modern HPC and AI systems produce vast telemetry data, making performance monitoring and root cause analysis increasingly challenging. Traditional troubleshooting methods often lead to inefficiencies, lengthy resolution times, and costly downtime, and cannot meet the demands of today's high-performance computing environments. LDMS New Features for Deployment in Advanced Environments and Feedback for Operations Jim Brandt, Ben Schwaller, Jennifer Green, Ben Allan, Cory Lueninghoener, Evan Donato, Vanessa Surjadidjaja, Sara Walton, and Ann Gentile (Sandia National Laboratories) Abstract Abstract The Lightweight Distributed Metric Service (LDMS) monitoring, transport, and analysis framework has been deployed on large-scale Cray and HPE systems for over a decade. Over that time its capabilities have improved dramatically. In this talk we provide updates on capabilities including deployment and management methods in bare metal, containerized, and cloud (including hybrid on+off prem) environments. We describe how LDMS is being used to collect application data concurrent with system data, and how the low-latency availability of this data for analysis can be used for real-time data analysis and feedback in order to support efficient, resilient, and reliable system operations.
Finally, we will describe current related research areas, including 1) the use of machine learning for modeling application and system behavioral characteristics and 2) the use of new features in the bi-directional communication capability of LDMS to provide low-latency communication and feedback from a distributed analysis system to user, system, and application processes on disparate clusters and to inform data center orchestration decisions. Proactive Health Monitoring and Maintenance of High-Speed Slingshot Fabrics in HPC Environments Michael Cush, Jeff Kabel, Michael Schmit, Michael Accola, and Forest Godfrey (HPE) Abstract Abstract This whitepaper addresses the critical need for maintaining the health of high-speed Slingshot fabrics in high-performance computing (HPC) environments. Identifying and resolving known issues swiftly is essential for optimizing HPC workload performance, yet pinpointing common and emerging problems can be highly challenging. We propose a proactive solution that leverages automated capture of key configuration and performance metrics, coupled with sophisticated event logic, to detect unhealthy components and known bugs within the fabric. This is achieved through the System Diagnostic Utility (SDU), integrated with HPCM and CSM software, which automates data capture and securely transmits it to HPE using HPE Remote Device Access (RDA). This solution is complementary to other system monitoring solutions such as SMS and AIOps. In fact, this solution can also capture data from those tools to consolidate and enhance the level of data captured for analysis. Paper, Presentation Technical Session 4A: New Deployment Session Chair: Jim Rogers (Oak Ridge National Laboratory) A journey to provide GH200 Mark Klein, Thomas Schulthess, Jonathan Coles, and Miguel Gila (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Bringing a new hardware architecture to production involves a multi-phase process that ensures optimal performance, stability, and integration with existing infrastructure. The process begins with the physical installation of hardware and networking components. Once the hardware is in place, system engineers configure the operating system and necessary software stacks for performance monitoring and fault detection. This is followed by rigorous testing, including stress tests and benchmark runs, to verify the system's capabilities and identify any hardware or software anomalies. Evaluating AMD MI300A APU: Performance Insights on LLM Training via Knowledge Distillation Dennis Dickmann (Seedbox); Philipp Offenhäuser (HPE); Rishabh Saxena (HLRS, University of Stuttgart); George Markomanolis (AMD); Alessandro Rigazzi (HPE HPC/AI EMEA Research Lab); Patrick Keller (HPE); and Kerem Kayabay and Dennis Hoppe (HLRS, University of Stuttgart) Abstract Abstract AMD (Advanced Micro Devices) has recently launched the MI300A Accelerated Processing Unit (APU), which integrates Central Processing Unit (CPU) and Graphics Processing Unit (GPU) compute in a single chip with unified High Bandwidth Memory (HBM). This study assesses the capabilities of the AMD Instinct™ MI300A and examines how this new architecture handles real-world generative Artificial Intelligence (AI) workloads. While performance data for Large Language Model (LLM) use cases exists for the MI250X and MI300X, to the best of our knowledge, such assessments are absent for the MI300A.
We apply a Knowledge Distillation (KD) use case to distil the knowledge of the Mistral-Small-24B-Base-2501 teacher model into a student model that is 50% sparse using a 2:4 sparsity pattern. The results show quasi-linear scaling of raw performance on up to 256 APUs. Lastly, we discuss the challenges, research and practical implications, and outlook. Evaluation of the Nvidia Grace Superchip in the HPE/Cray XD Isambard 3 supercomputer Thomas Green and Sadaf Alam (University of Bristol) Abstract Abstract The Bristol Centre for Supercomputing (BriCS) has recently deployed 55,296 Arm Neoverse V2 CPU cores in a supercomputing platform, via 384 nodes of NVIDIA Grace CPU Superchips with LPDDR5 memory, as part of the Isambard 3 HPC service for the UK HPC research community. Isambard 3 is an HPE/Cray XD series system using the Slingshot 11 interconnect. As one of the first systems of this kind, this manuscript overviews details of the hardware and software configuration and presents early performance evaluation and benchmarking results using a representative subset of scientific applications. The focus is to evaluate Isambard 3 as a "plug-and-play" environment for researchers, especially those who are familiar with the Cray software environment. We include microbenchmark results to provide insights into the performance behaviour of this unique architecture. We present a small-scale scaling comparison between the NVIDIA Grace CPU Superchip and other mainstream CPUs, including Intel Sapphire Rapids and AMD Genoa and Bergamo. We report on issues encountered during attempts to use several major software toolchains available for Arm, such as the HPE Cray Compiler Environment (CCE), the Arm Compiler for Linux, and the NVIDIA Compiler, and therefore focus on GCC. Our findings include key opportunities for improvements that were discovered during our benchmarking, evaluation and regression testing on the system as we transitioned the service into operations from January 2025. Separating concerns: Decoupling the Slingshot Fabric Manager from Cray System Management Riccardo Di Maria and Chris Gamboni (Swiss National Supercomputing Centre), Davide Tacchella and Isa Wazirzada (HPE), and Mark Klein (Swiss National Supercomputing Centre) Abstract Abstract The Alps research infrastructure consists of 32 HPE Cray EX cabinets which are managed by Cray System Management (CSM). A critical component of this system is the Slingshot fabric manager, which is responsible for managing the high-speed network fabric. Presently, the fabric manager is deployed as a Kubernetes Pod and runs amongst other services on the system management nodes. An ongoing effort aims to separate the fabric manager from Kubernetes and deploy it on bare-metal hardware. The architectural decision-making process is examined in detail, accompanied by a walkthrough of the newly proposed design. The discussion of the design is framed within the context of key quality attributes, including reliability, resiliency, availability, observability, and performance. Subsequently, the focus transitions from the "what" to the "how," providing a comprehensive overview of the execution of the migration of the fabric manager from a Kubernetes-based deployment to a bare-metal environment. Insights are presented regarding aspects that were successful, challenges encountered, and whether the overall outcome of this effort achieved the intended objectives.
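The 2:4 structured sparsity pattern used in the MI300A knowledge-distillation study above keeps the two largest-magnitude weights in every group of four; a NumPy illustration of the pruning step (not the authors' code):

    import numpy as np

    def prune_2_4(w):
        # Zero the two smallest-magnitude entries in each group of four weights,
        # yielding the 50%-sparse 2:4 pattern that GPU sparse units can exploit.
        groups = w.reshape(-1, 4)
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]
        out = groups.copy()
        np.put_along_axis(out, drop, 0.0, axis=1)
        return out.reshape(w.shape)

    w = np.random.default_rng(1).standard_normal((4, 8))
    print(prune_2_4(w))  # exactly two non-zeros per group of four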
Paper, Presentation Technical Session 5B: Maintaining Large Systems Session Chair: Aaron Scantlin (National Energy Research Scientific Computing Center) Hardware Triage Tool: Enhancements and Extensions Isa Muhammad Wazirzada, Abhishek Mehta, Vinanti Phadke, and Bhuvan Meda Rajesh (HPE) Abstract Abstract In 2023, HPE released the Hardware Triage Tool (HTT) with the mission to provide high-fidelity diagnoses and minimize time to repair for hardware faults across HPE Cray EX compute and accelerator blades, irrespective of the system manager being used. Detecting operating system noise with detect-detour Nagaraju KN, Clark Snyder, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract HPC applications, especially those that frequently perform global synchronization operations, can be negatively affected by background operating system (OS) activity. The background actions of interest are processing hardware interrupts, software interrupts, and process context switches. While these actions are necessary to the operation of the OS, from the application's point of view they are viewed as "OS noise" that affects performance, and the system should be tuned to minimize them. Identifying sources of OS noise is crucial for application performance but can be difficult. Few options exist to identify sources of OS noise without getting into the intricacies of the underlying kernel internals. The detect-detour tool makes use of the Linux kernel enhanced Berkeley Packet Filter (eBPF) [1] feature to help system administrators identify sources of OS noise without requiring them to be kernel experts. Analyzing a Lifetime of Failures on a Cray XC40 Supercomputer Kevin Brown and Tanwi Mallick (Argonne National Laboratory), Zhiling Lan (University of Illinois Chicago), Robert Ross (Argonne National Laboratory), and Christopher Carothers (Rensselaer Polytechnic Institute) Abstract Abstract We analyze hardware errors over the seven-year lifetime of the Theta supercomputer, a large-scale Cray XC40 system at the Argonne Leadership Computing Facility. To ensure accurate interpretation of the logs, we leverage expert knowledge to clean the dataset and remove redundant information. Temporal and spatial analysis techniques are then used to expose how failures and errors trend over time and across components in the system. Additionally, we correlate hardware error logs to system downtime logs to capture the relationship between critical errors and outages over the lifetime of the system. The results in this work represent a state-of-the-practice report highlighting how severe error types vary over time and across different component types, such as on-node and off-node (network) components. We also demonstrate the effectiveness of our technique in simplifying log analysis by using a unified error classification across components from different vendors, providing valuable insights into normal and anomalous system behaviors. Paper, Presentation Technical Session 5C: Filesystems & I/O Session Chair: Raj Gautam (ExxonMobil) E2000 Performance From Microbenchmarks to Applications William Loewe, Michael Moore, Sakib Samar, and Chris Walker (HPE) Abstract Abstract With the advance of the Exascale Age and its continued gains in FLOPS performance, the associated I/O performance demands increase commensurately. To address this, the HPE Cray Supercomputing Storage Systems E2000 is the next generation of the HPE Cray Supercomputing Storage product line with a focus on performance.
This paper discusses the architecture changes in the E2000 and provides node and file system microbenchmarks measuring bandwidth, IOPS, and metadata performance. The improved PCIe and NVMe drive speeds, in addition to the higher-density enclosure in the E2000-F, allow for more than twice the throughput and IOPS performance compared to the previous generation, with nearly all of that performance achievable by optimal application workloads. System configuration choices, such as the number of storage targets and BIOS settings, which influence system-level performance, will be compared with the aim of optimizing the gains and determining ideal client/server tunings. Finally, performance of application-relevant workloads including random access, shared file, and AI/ML storage workloads will be presented, along with discussion of application and job changes to utilize the E2000 performance improvements. Towards Empirical Roofline Modeling of Distributed Data Services: Mapping the Boundaries of RPC Throughput Philip Carns, Matthieu Dorier, Rob Latham, Shane Snyder, and Amal Gueroudji (Argonne National Laboratory); Seth Ockerman (University of Wisconsin-Madison); Jerome Soumagne (HPE); Dong Dai (University of Delaware); and Robert Ross (Argonne National Laboratory) Abstract Abstract The scientific computing community relies on distributed data services to augment file systems and decouple data management functionality from applications. These services may be native to HPC or adapted from cloud environments, and they encompass diverse use cases such as domain-specific indexing, in situ analytics, AI data orchestration, and special-purpose file systems. They unlock new levels of performance and productivity, but also introduce new tuning challenges. In particular, how do practitioners assess performance, select deployment footprints, and ensure that services reach their full potential? Roofline models could address these challenges by setting practical performance expectations and providing guidance to achieve them. HPC workload characterization using eBPF Shubh Pachchigar and Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Brian Friesen (Lawrence Berkeley National Laboratory) Abstract Abstract Efficient interactions with filesystems are essential for scientific workflows operating at scale on HPC systems. To design new filesystems and tune system configurations, effective I/O characterization is needed. Darshan is a widely used tool for I/O characterization that relies on injecting code into application binaries but has some limitations in providing low-level insights. In this work, we propose leveraging eBPF, which enables the execution of user-defined programs within the kernel, to develop a new I/O characterization tool. Our approach aims to complement the capabilities of Darshan by using eBPF to gain deeper insight into application interactions with the underlying filesystems. This is achieved by deploying dynamic instrumentation techniques below the application layer to extract detailed I/O metrics. In this work, we demonstrate the collection of read/write operations and their associated latencies across various filesystems. The metrics are periodically sampled with a custom eBPF-based LDMS sampler to enable the collection of data at scale. Finally, to demonstrate its feasibility in production HPC environments, we establish that the overhead of the tool is low.
This work demonstrates the potential of eBPF to enhance I/O characterization in HPC environments, providing valuable insights that can lead to improved performance and resource utilization. Paper, Presentation Technical Session 5A: Slingshot & MPI Tuning Session Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois) MPI implementation optimization for the Slingshot network Rahulkumar Gayatri, Adam Lavely, Neil Mehta, Brandon Cook, and Afton Geil (Lawrence Berkeley National Laboratory) Abstract Abstract Optimizing MPI performance is one of the keys to improving the performance of HPC applications. While algorithmic improvements such as overlap of communication and computation are used to improve MPI parallelism within a workflow, the choice of MPI implementation for a given application can also affect overall performance. We have characterized how OpenMPI, MPICH, and Cray MPICH perform on Perlmutter for small to moderate problem sizes using the OSU micro-benchmarks and have shown that the vendor-tuned Cray MPICH typically outperforms the other MPI implementations. We plan to expand this work with additional microbenchmarks, identify ways to improve the MPI implementations, and show how MPI performance differences impact the performance of full HPC applications for the final paper. Using Different MPI Implementations on HPE Cray EX Supercomputers for Native and Containerized Applications Execution Maciej Pawlik and Maciej Szpindler (Academic Computer Centre CYFRONET), Marcin Krotkiewski (University of Oslo), and Alfio Lazzaro (HPE) Abstract Abstract Message Passing Interface (MPI) implementations have to be tailored for specific system architectures to maximize application performance. This applies to optimizations for network transport and the provision of efficient data movement between CPU and GPU memories. The default MPI on HPE Cray EX systems such as LUMI is a proprietary, optimized implementation based on the open-source MPICH CH4 implementation. There are situations where having an alternative such as OpenMPI would benefit users whose applications or containers target OpenMPI and would require some effort to change. Having alternative MPI implementations also allows for performance comparisons, investigating bugs, and checking new MPI functionalities. In this paper, we report on our experience with installing containerized and native OpenMPI environments on LUMI, showing how users can build and run containers and get the expected performance. We show a performance comparison with respect to HPE Cray MPI executions using OSU benchmarks and an example real-world application for the solution of Dirac equations using GPUs. Although we only refer to LUMI, similar concepts can be applied to other supercomputers. Scaling MPI Applications on Aurora Nilakantan Mahadevan (Hewlett Packard Enterprise); Premanand Sakarda (Intel Corporation); Scott Parker, Servesh Muralidharan, Vitali Morozov, and Victor Anisimov (Argonne National Laboratory); Huda Ibeid, Anthony-Trung Nguyen, and Aditya Nishtala (Intel Corporation); Larry Kaplan and Michael Woodacre (Hewlett Packard Enterprise); and Kalyan Kumaran and JaeHyuk Kwack (Argonne National Laboratory) Abstract Abstract The Aurora supercomputer, which was deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list.
Scaling MPI Applications on Aurora Nilakantan Mahadevan (Hewlett Packard Enterprise); Premanand Sakarda (Intel Corporation); Scott Parker, Servesh Muralidharan, Vitali Morozov, and Victor Anisimov (Argonne National Laboratory); Huda Ibeid, Anthony-Trung Nguyen, and Aditya Nishtala (Intel Corporation); Larry Kaplan and Michael Woodacre (Hewlett Packard Enterprise); and Kalyan Kumaran and JaeHyuk Kwack (Argonne National Laboratory) Abstract Abstract The Aurora supercomputer, deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes, each of which contains six Intel Data Center Max Series GPUs, Intel's first data-center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel's first Xeon processor to contain HBM memory. To achieve Exascale performance, the system utilizes the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is the largest deployment of the Slingshot fabric to date, with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel-powered nodes and the Slingshot network enabled Aurora to become the second-fastest system on the Top500 list in June 2024 and the fastest system on the HPL-MxP benchmark. The system is one of the most powerful in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design, with a particular focus on the network fabric and the approach taken to validating it. The performance of the system is demonstrated through results of MPI benchmarks as well as system-level benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally, results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM, demonstrating that Aurora provides the throughput, latency, and bandwidth needed for applications to perform and scale to large node counts, delivering new levels of capability and enabling breakthrough science. Paper, Presentation Technical Session 6B: Framework for HPC-AI workflows Session Chair: Chris Fuson (Oak Ridge National Laboratory) Framework for tracking metadata, lineage and model provenance in hybrid simulation-AI HPC exascale workflows Martin Foltin, Andrew Shao, Rishabh Sharma, Shreyas Kulkarni, Annmary Justine Koomthanam, Aalap Tripathy, and Cong Xu (HPE); Wenqian Dong (Oregon State University); Suparna Bhattacharya (HPE); Brian Sammuli (General Atomics); and Paolo Faraboschi (HPE) Abstract Abstract Integration of AI into HPC workflows can have a profound impact on HPC scale and usability, for example, by accelerating simulations with surrogate models or intelligently steering simulations based on previous results. New workflows are being explored in which AI models are iteratively improved by continual learning to better reflect input data distributions and avoid outliers and drift. Tracking model provenance in these workflows is important to understand how new data affect model performance, to allow unwinding to previous iterations, and to provide a better understanding of the conditions under which AI models perform well for future reuse. This is more challenging in hybrid HPC-AI workflows because the lineage and provenance must be tracked across multiple software components at different levels of scale. In this work, we extend the HPE Common Metadata Framework (CMF) to hybrid simulation-AI workflows. We demonstrate the benefits of CMF tracking across simulation, AI training, and inference, alongside the HPE SmartSim system, on a simple computational fluid dynamics problem with eddy kinetic energy parameterized by AI. We track out-of-distribution data for continual learning and employ adaptive switching between different models to improve the quality of results. We are working with the fusion energy and materials science communities to enhance their workflows in a similar fashion.
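The following sketch illustrates the kind of stage/artifact lineage record such a framework maintains across a simulate-train-infer loop. It is not the CMF API; the record schema, content hashing, and stage names are assumptions made for illustration only.

```python
# Hypothetical lineage bookkeeping for a hybrid simulation-AI loop.
import hashlib
import json
import time

def artifact_id(path: str) -> str:
    """Content-address an artifact so lineage survives renames."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:16]

lineage = []

def record(stage, inputs, outputs, params):
    """Append one provenance record linking inputs to outputs by hash."""
    lineage.append({
        "stage": stage,                      # e.g. "simulation", "training"
        "inputs": {p: artifact_id(p) for p in inputs},
        "outputs": {p: artifact_id(p) for p in outputs},
        "params": params,
        "time": time.time(),
    })

# One loop iteration might be logged as (file names hypothetical):
#   record("simulation", ["mesh.h5"], ["fields.h5"], {"dt": 1e-3})
#   record("training", ["fields.h5"], ["model.pt"], {"epochs": 10})
#   record("inference", ["model.pt"], ["steering.json"], {})
#   json.dump(lineage, open("lineage.json", "w"), indent=2)
```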
Search and Query Framework for Workflows with HPC and AI Models Christopher Rickett, Sreenivas Sukumar, and Karlon West (HPE) Abstract Abstract Modern computational science workflows increasingly involve complex, interactive, and iterative search through data from simulations of physics-based equations coupled with analytic, predictive, generative, and agentive tasks. Unfortunately, there are no query engines that empower scientists to search through scientific data with AI, analytic, and physics-based models in the way that SQL query engines search structured data or keyword-search/prompt engines search textual data. FirecREST v2: Lessons Learned from Redesigning an API for Scalable HPC Resource Access Elia Palme and Juan Pablo Dorsch (CSCS - ETH Zurich); Ali Khosravi and Giovanni Pizzi (PSI Center for Scientific Computing, Theory, and Data); and Francesco Pagnamenta, Andrea Ceriani, Eirini Koutsaniti, Rafael Sarmiento, Ivano Bonesana, and Alejandro Dabin (CSCS - ETH Zurich) Abstract Abstract We introduce FirecREST v2, the next generation of our open-source RESTful API for programmatic access to HPC resources. FirecREST v2 delivers a ~100x performance improvement over its predecessor. This paper explores the lessons learned from redesigning FirecREST from the ground up, with a focus on integrating enhanced security and high throughput as core requirements. We provide a detailed account of our systematic performance-testing methodology, highlighting common bottlenecks in proxy-based APIs with intensive I/O operations. Key design and architectural changes that enabled these performance gains are presented. Finally, we demonstrate the impact of these improvements, supported by independent peer validation, and discuss opportunities for further improvements.
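For readers unfamiliar with programmatic HPC access of this kind, the sketch below shows a REST-style job submission and status poll in Python. The base URL, endpoint paths, and payload fields are illustrative assumptions, not the documented FirecREST v2 interface.

```python
# Hedged sketch of REST-based job submission to an HPC facility.
import requests

BASE = "https://firecrest.example.org"      # hypothetical deployment URL
TOKEN = "..."                               # OIDC bearer token (elided)
HDRS = {"Authorization": f"Bearer {TOKEN}"}

# Submit a batch job by posting a job script (endpoint path is assumed).
script = "#!/bin/bash\n#SBATCH -N 1\nsrun hostname\n"
r = requests.post(f"{BASE}/compute/jobs", headers=HDRS,
                  json={"system": "cluster", "script": script}, timeout=30)
r.raise_for_status()
job_id = r.json().get("jobid")

# Poll the job state through the same proxy API.
r = requests.get(f"{BASE}/compute/jobs/{job_id}", headers=HDRS, timeout=30)
print(r.json())
```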
Paper, Presentation Technical Session 6C: Programming Models Session Chair: Benjamin Cumming (CSCS, ETH Zurich) Designing GPU-aware OpenSHMEM for HPE Cray EX and XD Systems Danielle Sikich, Naveen Namashivayam Ravichandrasekaran, Md Rahman, Elliot Joseph Ronaghan, Nathan Wichmann, and William Okuno (HPE) Abstract Abstract OpenSHMEM is a Partitioned Global Address Space (PGAS) based library interface specification. It is the culmination of a standardization effort among many implementers and users of the SHMEM programming model. The existing OpenSHMEM specification is not GPU-aware: the programming model does not enable management of data-movement operations involving GPU-attached memory buffers. However, OpenSHMEM users are exploring options to run their data-driven workloads on heterogeneous system architectures. Quantifying Message Aggregation Optimisations for Energy Savings in PGAS Models Aaron Welch and Oscar Hernandez (Oak Ridge National Laboratory) and Stephen Poole and Wendy Poole (Los Alamos National Laboratory) Abstract Abstract Upon breaking past the exascale barrier, HPC systems are facing their greatest challenge yet: a power wall that must be addressed through new methods in both hardware and software. While energy costs are becoming a major issue at all levels, of particular concern is the network, as the relative cost of moving data is increasing faster than ever. The partitioned global address space (PGAS) model is critical within certain HPC domains, but is known to suffer from the small-message problem, where irregular many-to-many access patterns congest the network with excessive numbers of small messages. To address this, the conveyor aggregation library was developed to defer individual messages and group them for subsequent bulk processing. In this paper, we investigate its impact on energy use related to the network, with a focus on the Slingshot 11 interconnect. We demonstrate that this strategy is not only highly performant, but also crucial to reducing energy footprints to remain within target power envelopes. Accelerating LArTPC Simulations: Enhancing larnd-sim with GPU Optimization Techniques Madan Timalsina (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Matt Kramer (Lawrence Berkeley National Laboratory); Pengfei Ding (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Ronan Doherty (Trinity College Dublin); Rishabh Dave (UC Berkeley); Nicholas Tyler, Urjoshi Sinha, and William Arndt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); and Callum Wilkinson (Lawrence Berkeley National Laboratory) Abstract Abstract Advancements in general-purpose computing on GPUs have enabled highly parallelized Monte Carlo simulations for particle physics experiments, including the Deep Underground Neutrino Experiment (DUNE), which will use the world's most powerful neutrino beam to study the properties of these elusive particles. Here, we present our efforts on the optimization of larnd-sim, a microphysical simulation for liquid argon time projection chambers (LArTPCs) with light and pixelated charge readout, originally developed for the DUNE Near Detector (ND-LAr). Implemented in Python and utilizing GPU acceleration via Numba and CuPy, larnd-sim processes energy depositions from Geant4 to simulate physical phenomena (such as ionization electron drift) and the response of the detector electronics. By profiling with NVIDIA tools and optimizing memory transfers, adjusting register counts, tuning block and grid dimensions, altering floating-point precision, enabling the "fastmath" option for transcendental functions, converting arrays to a jagged format, and tuning CUDA kernels, we achieved an over-50% GPU-memory reduction, a ~30% wallclock speed improvement, and individual kernel speedups of 10–500%. In addition to these ongoing tests on the NERSC Perlmutter supercomputer, we are working with collaborators at ANL to run these simulations on the Polaris machine, further expanding larnd-sim's reach.
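Several of the larnd-sim optimizations above (single precision, fastmath transcendentals, explicit block/grid sizing) can be seen in miniature in the Numba CUDA sketch below. This is illustrative only, not larnd-sim code; the kernel, constants, and sizes are invented.

```python
# Toy Numba CUDA kernel in the style of the tuning described above.
import math
import numpy as np
from numba import cuda

@cuda.jit(fastmath=True)                     # fastmath transcendentals
def drift_attenuation(x, out, vdrift, lifetime):
    """Per-element toy physics: drift time, then exponential charge loss."""
    i = cuda.grid(1)
    if i < x.size:
        t = x[i] / vdrift                    # drift time
        out[i] = math.exp(-t / lifetime)     # surviving charge fraction

n = 1 << 20
x = cuda.to_device(np.random.rand(n).astype(np.float32))   # single precision
out = cuda.device_array(n, dtype=np.float32)

threads = 256                                # block size found by tuning
blocks = (n + threads - 1) // threads        # grid sized to cover the array
drift_attenuation[blocks, threads](x, out, np.float32(0.16), np.float32(2200.0))
print(out.copy_to_host()[:4])
```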
Paper, Presentation Technical Session 6A: DAOS Session Chair: Jesse A. Hanley (Oak Ridge National Laboratory) DAOS - New Horizons for High Performance Storage Michael Hennecke and Jerome Soumagne (HPE) Abstract Abstract DAOS is an open-source scale-out storage system that has redefined performance for a wide spectrum of HPC and AI workloads (https://daos.io/). It is an all-flash solution that can be deployed as a stand-alone storage system, or as a high-performance storage tier used in combination with traditional Lustre, GPFS, or (cloud) object storage environments. Enhancing RPC on Slingshot for Aurora's DAOS Storage System Jerome Soumagne, Alexander Oganezov, Ian Ziemba, and Steve Welch (HPE); Philip Carns and Kevin Harms (Argonne National Laboratory); and John Carrier, Johann Lombardi, Mohamad Chaarawi, Zhen Liang, and Scott Peirce (HPE) Abstract Abstract DAOS, an open-source software-defined high-performance storage solution designed for massively distributed NVMe SSDs and Non-Volatile Memory (NVM), is a key component of the Aurora Exascale system that aims to deliver high storage throughput and low latency to application users. Utilizing the Slingshot interconnect, DAOS leverages Remote Procedure Calls (RPCs) to communicate between compute and storage nodes. While the preexisting RPC mechanism used by DAOS was already designed for High-Performance Computing (HPC) fabrics, it required a number of scalability, performance, and security enhancements in order to be successfully deployed on Aurora. Global Distributed Client-side Cache for DAOS Clarete R. Crasta, John L Byrne, Abhishek Dwaraki, David Emberson, Harumi Kuno, Sekwon Lee, Ramya Ahobala Rao, Shreyas Vinayaka Basri K S, Amitha C, Chinmay Ghosh, Rishi Kesh Kumar Rajak, Sriram Ravishankar, Porno Shome, and Lance Evans (HPE) Abstract Abstract HPC/AI workloads process large amounts of data and perform complex operations on that data at exascale rates to deliver time-critical insights. Distributed workloads are often bottlenecked by communication when storage systems are used to coordinate and share results. Storage solutions supporting effective, scalable parallel access from compute clusters are critical to HPC architectures. Caching data on storage servers and/or clients is a well-known technique used by storage systems to ameliorate these communication costs. Current server-side caching methodologies are constrained by the amount of memory and network bandwidth available on the fixed, finite set of server nodes. Furthermore, most client-side caches are node-local, meaning the cached data is accessible solely by the node on which the data is stored. DAOS is a promising exascale storage stack recently acquired by HPE. Global client-side caching for DAOS is an attractive proposition due to the higher aggregate client-side resources (e.g., DRAM and network bandwidth), which can scale independently of the number of server nodes. In addition to providing faster data access, a client-side cache should also be efficient, as it consumes expensive resources, and requires a well-designed caching framework with appropriate policies. In this paper, we cover the details of realizing efficient shared client-side caching for DAOS. Paper, Presentation Technical Session 7B: Access Nodes & Kubernetes Management Session Chair: Jim Williams (Los Alamos National Laboratory) Addressing Resource Constraints on Aurora with Admin Access Nodes Peter Upton, Ben Lenard, Ben Allen, and Cyrus Blackworth (Argonne National Laboratory) Abstract Abstract This paper presents Administrator Access Nodes (AANs) as an alternative to the traditional reliance on a single Admin Node for all aspects of system administration in an HPE Performance Cluster Manager (HPCM) managed supercomputer cluster. At the Argonne Leadership Computing Facility (ALCF), managing the Aurora supercomputer, a large HPE Cray EX system, requires a team of skilled developers and administrators. These professionals require access to many tools for tasks such as parsing log files, issuing power commands, and connecting to nodes via SSH.
These tasks have typically been performed solely on the Admin Node. However, this centralization can lead to resource constraints due to simultaneous demands from multiple administrators. To address these issues, the paper details the implementation and operation of AANs, including custom tools for interacting with HPCM, scripts to replicate some Admin Node functionality on AANs, and synchronization tools for configuration files. The introduction of AANs has alleviated resource constraints, streamlined workflows, and enhanced system manageability. Possible future work is also discussed, focusing on further integrating HPCM's APIs and improving usability, aiming to enhance AAN capabilities and administrative efficiency for Aurora's complex environment. HPE Slingshot in the Kubernetes Ecosystem Caio Davi and Jesse Treger (HPE) Abstract Abstract The convergence of traditional HPC systems with AI increases expectations for supercomputing sites to deliver new capabilities (beyond traditional batch scheduling, single tenancy, and bare-metal application deployment methodologies) for more dynamic provisioning. Convergence with enterprise cloud-computing techniques such as containerized applications and Kubernetes has become a priority. But transitioning high-performance computing (HPC) environments and applications to Kubernetes is complex because of the critical requirement to maintain low-latency networking for high performance. In this context, we have HPE Slingshot, a modern high-performance interconnect for HPC and AI clusters that delivers industry-leading performance, bandwidth, and low latency for HPC, AI/ML, and data analytics applications through innovations in the fabric to overcome congestion and innovations in the NIC to significantly offload communications and message processing from the hosts. Because HPE Slingshot NICs run native Ethernet alongside an optimized RDMA transport and connectionless protocol, ensuring that the RDMA transport is operating as intended is critical to delivering the high performance expected in HPC and AI. This requires careful configuration of Kubernetes: if misconfigured, the system can fall back to standard TCP/IP over Ethernet, sacrificing the expected HPC and AI performance. Our proposed solution is composed of a number of Kubernetes components, such as device plugins, CNIs, an Operator, and admission policies. These contributions represent a significant advancement in deploying and operating HPC applications within containerized environments, offering a robust framework for future developments in distributed computing and ensuring both high performance and ease of management for the continuing convergence of HPC/AI and cloud computing and the coming transition from siloed HPC interconnects to interoperable Ultra-Ethernet transport.
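To ground the device-plugin idea above, the sketch below requests a high-performance NIC as an extended resource in a pod spec, so that a plugin (rather than the default TCP/IP path) wires up the RDMA transport. The resource name hpe.com/slingshot, image, and namespace are assumptions for illustration, not HPE's published component names.

```python
# Hedged sketch: scheduling a pod against a device-plugin-managed NIC resource.
from kubernetes import client, config

config.load_kube_config()                    # assumes a reachable cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mpi-worker"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="app",
            image="registry.example.org/hpc-app:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"hpe.com/slingshot": "1"},  # hypothetical device name
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```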
Building non-standard images for CSM systems Harold Longley, Isa Wazirzada, Dennis Walker, Andy Warner, and Davide Tacchella (HPE) Abstract Abstract HPC scientists increasingly must innovate using diverse toolchains crafted from various Linux distributions, ensuring they meet individual and project-specific needs to tackle complex challenges. Paper, Presentation Technical Session 7C: Application Performance Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Task-decomposed Overlapped Pressure Preconditioner for Sustained Strong Scalability on Accelerated Exascale Systems Niclas Jansson (KTH Royal Institute of Technology) Abstract Abstract Computational Fluid Dynamics is a natural driver for exascale computing, with a virtually unbounded need for computational resources for the accurate simulation of turbulent fluid flow in both academic and engineering settings. However, with exascale computing capabilities on the horizon, we have seen a transition to more heterogeneous computer architectures with various accelerators. While these systems offer high theoretical peak performance and high memory bandwidth, complex programming models and significant programming investment are necessary to exploit them efficiently. We detail our work on improving the performance and scalability of key numerical methods in the high-fidelity spectral element code Neko on accelerated exascale machines. Efficient preconditioners are essential in incompressible fluid dynamics; however, the most efficient method (with respect to convergence) can be challenging to implement with good performance on an accelerator. We present our development of a GPU-optimised preconditioner with task overlapping for the pressure-Poisson equation, improving the preconditioner's throughput (in TDoF/s) by more than 61%. The new preconditioner is explained in detail, together with detailed performance studies on Cray EX platforms, including strong-scalability studies on Frontier, a performance comparison between AMD- and NVIDIA-accelerated nodes, and an assessment of the feasibility of mixing both node types in a single simulation. Supernovae in HPC: Benchmarking FLASH Across Advanced Computing Clusters Joshua Martin, Eva Siegmann, and Alan Calder (Stony Brook University, Institute of Advanced Computational Science) Abstract Abstract Astrophysical simulations are highly demanding in terms of computation, memory, and energy, requiring new advancements in hardware. Stony Brook University recently expanded its "SeaWulf" computing cluster by adding 94 new nodes with Intel Sapphire Rapids Xeon Max series CPUs. This benchmarking study evaluates the performance and power efficiency of this new hardware using FLASH: a multi-scale, multi-physics software instrument that utilizes adaptive mesh refinement. Our study also compares the performance of Stony Brook's Ookami testbed, which features ARM-based A64FX-700 processors, as well as SeaWulf's existing AMD EPYC Milan and Intel Skylake nodes. The focus of our simulation is the evolution of a bright stellar explosion known as a thermonuclear (Type Ia) supernova, a complex 3D problem that incorporates various operators for hydrodynamics, gravity, nuclear burning, and routines for the material equation of state. We perform strong-scaling tests on a 220 GB problem size and assess both single-node and multi-node performance. We analyze the performance of various MPI mappings and processor distributions across nodes. From our strong-scaling tests, we determine the optimal configuration for balancing runtime and energy consumption for our application.
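As a reminder of the bookkeeping behind such strong-scaling conclusions, the sketch below computes speedup and parallel efficiency for a fixed problem size; the node counts and timings are placeholders, not measurements from the paper.

```python
# Strong scaling: fixed problem size, varying node count (timings hypothetical).
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}   # nodes -> seconds
t1 = timings[1]
for n, t in sorted(timings.items()):
    speedup = t1 / t            # S(n) = T(1) / T(n)
    efficiency = speedup / n    # E(n) = S(n) / n
    print(f"{n:3d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```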
Expanding Community Access to Real-World HPC Application I/O Characterization Data Using Darshan Shane Snyder, Philip Carns, Robert Ross, Robert Latham, and Kevin Harms (Argonne National Laboratory) Abstract Abstract HPC systems are deployed with massive, distributed storage subsystems to meet the demands of data-intensive applications. While these storage systems offer impressive peak performance, it is often only attainable in idealized scenarios that are not reflective of production workloads. In general, there continues to be a lack of community understanding of the I/O performance characteristics of real-world applications. Paper, Presentation, Birds of a Feather Technical Session 7A: AI/ML GPU Workloads Session Chair: Raj Gautam (ExxonMobil) Porting Radio Astronomy Correlation to Setonix, an HPE Cray EX system powered by AMD GPUs Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre, Curtin Institute for Radio Astronomy); Marcin Sokolowski (Curtin Institute of Radio Astronomy); Christopher Harris (Pawsey Supercomputing Research Centre); and Daniel Price and Randal Wayth (SKAO) Abstract Abstract In low-frequency radio astronomy, correlation of signals coming from hundreds of radio antennas is an early and fundamental step in creating science-ready data products such as images of the sky at radio wavelengths. Because of the high volume of data to process and the rate at which they are produced, correlation is usually performed in real time by dedicated hardware, an FPGA or GPU cluster, installed near the telescope. However, there are science cases in which an astronomer would like to correlate data later with customised settings, such as time and frequency averaging of signals. Setonix, Pawsey Supercomputing Centre's HPE Cray EX supercomputer based on AMD CPUs and GPUs, provides radio astronomers with enough computational power for such processing, but the only established GPU correlator works only on NVIDIA GPUs and proved hard to port. In this paper we discuss the process of providing Australian astronomers with an implementation of the correlation algorithm that harnesses the computational power of Setonix.
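The correlation step itself is compact enough to sketch: for antennas i and j, accumulate the time-averaged visibility V_ij = <v_i(t) v_j*(t)> per frequency channel. The numpy version below is illustrative only; a production correlator (as the abstract notes) runs this on FPGAs or GPUs in real time, and the array sizes are invented.

```python
# Minimal X-engine sketch: time-averaged cross-correlation of antenna voltages.
import numpy as np

n_ant, n_chan, n_time = 8, 64, 1024
# Channelised complex voltage samples per (antenna, channel, time).
rng = np.random.default_rng(0)
v = (rng.standard_normal((n_ant, n_chan, n_time))
     + 1j * rng.standard_normal((n_ant, n_chan, n_time))).astype(np.complex64)

# Correlate every antenna pair and average over time.
vis = np.einsum("ict,jct->ijc", v, v.conj()) / n_time
print(vis.shape)   # (8, 8, 64): visibility matrix per frequency channel
```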
Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers Bishwo Dahal (University of Louisiana Monroe, Oak Ridge National Laboratory) and Elijah Maccarthy and Subil Abraham (Oak Ridge National Laboratory) Abstract Abstract Containers are transforming scientific computing by simplifying the packaging and distribution of applications. This enables researchers to create and deploy their applications in isolated environments with all necessary dependencies, enhancing portability and deployment flexibility. These advantages make containers especially suitable for High Performance Computing (HPC) facilities like the Oak Ridge Leadership Computing Facility (OLCF), where complex scientific applications are developed and deployed. In this work, we investigate the performance of containerized machine learning (ML) applications in comparison to bare-metal execution on the Frontier Exascale supercomputer. Specifically, we aim to determine whether ML models, when trained and tested within containers on Frontier using Apptainer, exhibit performance similar to that of bare-metal implementations. To achieve this, we use containers to package and run Convolutional Neural Network (CNN)-based ML applications on the OLCF Frontier and Odo supercomputers and assess their performance against bare-metal runs. After conducting scalability tests across up to 30 nodes with 1680 AMD EPYC CPU cores and 240 GPUs, we find that the performance of the containerized ML applications is on par with that of bare-metal runs. We apply the lessons learned from our containerized ML model to containerizing and evaluating the performance of LLMs such as AstroLLaMA and CodeLLaMA on Frontier. BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI Tulsi Mishra, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract As the convergence of HPC and AI reshapes computational workflows, the complexity of managing hybrid environments has become a significant challenge for organizations. HPE Cray Supercomputing User Services Software (USS) offers a transformative approach to simplify, scale, and optimize workflows across HPC and AI landscapes. In this session, we will explore how USS aims to bridge the gap between traditional HPC workloads and AI-driven innovations, providing a unified platform for containerized environments, hybrid deployment orchestration, and energy-efficient operations. Program Event Content Expanding Horizons in AI with HPC Workshop This workshop, held at Stony Brook University, aims to explore the dynamic intersection of AI and HPC, focusing on how advanced computing can accelerate AI research and applications. As AI models become more complex and data-intensive, traditional computing systems struggle to meet the demand for scalability, efficiency, and speed. HPC offers a solution by providing the necessary infrastructure for training large-scale models, enhancing AI algorithms, and enabling breakthroughs in fields such as deep learning, natural language processing, and autonomous systems. Registration and more details are available here: https://cug.org/cug-2025-aiwithhpc-workshop-2/ Tutorial Tutorial 1B Hands on with uenv and CPE in a container with Grace Hopper on Alps Ben Cumming and Tim Robinson (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Management of the Cray Programming Environment (CPE) can pose challenges for CUG sites, including dependency issues within a monolithic stack, disruptive updates, and the need for administrative privileges. For these reasons, the CUG community has expressed considerable interest in exploring alternative approaches for deploying and running software on HPE Cray systems that do not rely on the CPE being included in the operating system image. This tutorial offers a hands-on experience with building and managing software on an HPE Cray system that does not have the CPE installed, specifically using uenv and a containerized CPE.
Participants will get access to the cutting-edge NVIDIA Grace-Hopper (ARM64) nodes on Alps, an HPE Cray Supercomputing EX system at CSCS. https://eth-cscs.github.io/cug25-uenv/ Tutorial Tutorial 1C Best Practices For Operating and Maintaining Slingshot Fabrics Forest Godfrey (Hewlett Packard Enterprise) Abstract Abstract Whether performing an initial bringup, applying software updates, or adding capacity, handling routine maintenance tasks is a fact of life. This tutorial will guide attendees through best practices for handling common scenarios in the lifecycle of Slingshot fabrics. Emulated Slingshot environments and sample fabric designs will be used to provide hands-on experience working through important workflows in the lifecycle of a Slingshot deployment. Tutorial Tutorial 1A Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE) Abstract Abstract This tutorial covers the current monitoring architecture for HPE Cray HPC systems running HPCM (HPE Performance Cluster Manager) and CSM (Cray System Management) and how to configure the monitoring components to ensure log and telemetry data flows into the proper location for analysis. The various available analysis tools are discussed, including OpenSearch dashboards for log data and Grafana for telemetry such as compute-node resources and fabric metrics, along with infrastructure monitoring and alerting and the latest data inspection tools. The Monitoring Pipeline Visualization Tool (MPVT) enhances system monitoring by collecting real-time metrics from monitoring stack components such as Kafka, OpenSearch, and VictoriaMetrics via Prometheus exporters, and storing the data in VictoriaMetrics. MPVT generates detailed graphs using Grafana's Node Graph panel to represent the health, performance, and data flow of the monitoring pipeline. Alerting configuration and customization are also presented.
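As a hedged sketch of the exporter pattern MPVT scrapes, the snippet below publishes one gauge on a /metrics endpoint for Prometheus or VictoriaMetrics to collect; the metric name and the stand-in measurement are invented for illustration.

```python
# Minimal Prometheus exporter sketch (pip install prometheus_client).
import random
import time
from prometheus_client import Gauge, start_http_server

lag = Gauge("monitoring_kafka_consumer_lag",      # hypothetical metric name
            "Kafka consumer lag in messages")

start_http_server(8000)          # serves /metrics on port 8000
while True:
    lag.set(random.randint(0, 100))   # stand-in for a real lag measurement
    time.sleep(15)
```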
Tutorial Tutorial 1D Exploring High Performance Storage with DAOS Adrian Jackson (EPCC, The University of Edinburgh) and Mohamad Chaarawi and Kenneth Cain (HPE) Abstract Abstract The diversity of applications utilizing HPC has been increasing beyond computational simulation approaches to a more varied mix including machine learning and data analytics. This introduces changes in I/O patterns and requirements on data storage. Traditional data storage technologies in HPC have long been optimized for bulk data movement, focused on high bandwidth with relatively low volumes of metadata operations. However, many applications now exhibit I/O patterns that are non-optimal for Parallel File Systems (PFS), with significant amounts of small I/O operations, non-contiguous data access, and increases in read as well as write I/O activity. PFS today are not optimized for all such patterns and struggle with these diversified application workloads.
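The small-I/O problem the tutorial describes, and the classic client-side mitigation of aggregating many small records into large contiguous writes, can be sketched in a few lines. This is purely illustrative; the record and buffer sizes are arbitrary.

```python
# Aggregating small writes into ~4 MiB contiguous writes.
import os

records = (os.urandom(512) for _ in range(10_000))   # many 512 B records

# Naive pattern (one write per record) is latency- and metadata-bound on a PFS:
#   for r in records: f.write(r)

buf, BUFSZ = bytearray(), 4 << 20
with open("out.dat", "wb") as f:
    for r in records:
        buf += r
        if len(buf) >= BUFSZ:        # flush as one large sequential write
            f.write(buf)
            buf.clear()
    if buf:
        f.write(buf)                 # flush the tail
```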
Tutorial Tutorial 1B Continued Hands on with uenv and CPE in a container with Grace Hopper on Alps Ben Cumming and Tim Robinson (Swiss National Supercomputing Centre, ETH Zurich) Tutorial Tutorial 1C Continued Best Practices For Operating and Maintaining Slingshot Fabrics Forest Godfrey (Hewlett Packard Enterprise) Tutorial Tutorial 1A Continued Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE) Tutorial Tutorial 1D Continued Exploring High Performance Storage with DAOS Adrian Jackson (EPCC, The University of Edinburgh) and Mohamad Chaarawi and Kenneth Cain (HPE) Tutorial Tutorial 2B Automated Inspection of Fortran/C/C++ Code Using Codee for Correctness, Modernization, Optimization, and Security on HPE/Cray Manuel Arenaz (Codee - Appentra Solutions) Abstract Abstract Codee is a software development tool that facilitates the development, maintenance, modernization, and optimization of Fortran/C/C++ codes by providing a systematic and predictable approach to finding and fixing issues related to correctness, modernization, optimization, and security vulnerabilities. Codee provides automated checkers for the rules documented in the Open Catalog of Code Guidelines for Correctness, Modernization, and Optimization. It also features AutoFix capabilities to enable semi-automatic source-code rewriting, by modifying Fortran statements and inserting OpenMP or OpenACC directives. Codee integrates seamlessly with popular IDEs, version control systems, and CI/CD frameworks. Overall, Codee helps developers uncover hidden bugs, avoid introducing new ones, and pinpoint suggestions for various code improvements. In this tutorial, participants will explore Codee and the Open Catalog through short demos and hands-on exercises, with step-by-step instructions for HPE/Cray systems such as Perlmutter. The session begins with simple, well-known kernels and quickly progresses to large HPC codes like WRF, enabling participants to effectively use Codee tools in real-world scenarios. Tutorial Tutorial 2C Performance Analysis on AMD GPUs Georgios Markomanolis (AMD) Abstract Abstract Over the past few years, AMD has released several profiling tools for AMD GPUs. These tools, called rocprofv3, ROCm Systems Profiler (rocprof-sys, formerly Omnitrace), and ROCm Compute Profiler (rocprof-compute, formerly Omniperf), are now AMD products. This tutorial is divided into four parts. In the first part, we discuss the applications and GPUs used in the tutorial, including the MI300A. Afterwards, we will showcase the latest advancements and offer guidance on selecting the most suitable tool based on your specific needs. We will introduce rocprofv3, showcasing its new features, usage, and functionality designed to enhance the user experience; instrumenting MPI applications is straightforward compared to the previous rocprof version. Notable additions include support for profiling OpenMP offloading in Fortran, which we will demonstrate along with other improvements, such as reduced overhead. In the third part, we will explain the capabilities of the rocprof-sys tool for timeline analysis of an application execution, visualizing traces, and understanding performance characteristics. Participants will learn how to use profiling tools with key applications and various programming models, including MPI, OpenMP, Python, and Kokkos. In the last part, rocprof-compute will be used for roofline analysis of kernel performance, presenting the new developments and improvements. Additionally, we will delve into identifying inefficient metrics affecting specific kernel performance and provide deeper insights into optimization strategies. Tutorial Tutorial 1A Continued Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE)
Tutorial Tutorial 2B Continued Automated Inspection of Fortran/C/C++ Code Using Codee for Correctness, Modernization, Optimization, and Security on HPE/Cray Manuel Arenaz (Codee - Appentra Solutions) Tutorial Tutorial 2C Continued Performance Analysis on AMD GPUs Georgios Markomanolis (AMD)
Tutorial Tutorial 1A Continued Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE) Plenary, Vendor Plenary: Sponsors Talks, HPE 1-100 Linaro: Unlocking Exascale Debugging and Performance Engineering with Linaro Forge Rudy Shand (Linaro Ltd) Abstract Abstract Dive into the future of code development and see how Linaro Forge is reshaping what's possible in the world of parallel computing. With Linaro DDT, MAP, and Performance Reports, Linaro Forge's latest advancements set new standards in scalability and ease of use. Discover how these tools have become the go-to solution for developers seeking to push the boundaries of code optimization and performance engineering. Codee: A Tool to Enhance Correctness, Modernization, Security, Portability and Optimization in Fortran and C/C++ Software Applications Manuel Arenaz (Codee) Abstract Abstract Fortran/C/C++ developers are under constant pressure to deliver increasingly complex simulation software that is correct, secure, and fast. It is critical to empower development teams with tools to automate code reviews, enforce compliance with industry standards, and prioritize reducing the risk of security vulnerabilities. Codee features unique capabilities for Deep Analysis of Fortran/C/C++ code, helping to catch bugs, enforce coding guidelines, modernize legacy code, ensure code portability, address security vulnerabilities, and optimize code efficiency. Codee provides automated checkers for the rules documented in the Open Catalog as well as AutoFix capabilities for semi-automatic source-code rewriting, including modification of source-code statements and insertion of OpenMP or OpenACC directives. Codee integrates seamlessly with popular editors, IDEs, version control systems, and CI/CD frameworks, making it easy to incorporate into existing development workflows. Overall, developers who are actively writing, modifying, testing, and benchmarking Fortran code will increase their productivity by using Codee. Developers, team leaders, and managers will benefit from DevOps and DevSecOps best practices, mitigating risks, boosting productivity, and reducing costs.
In this presentation we will also discuss how to use Codee in conjunction with the Cray tools, including compilers (CCE) and performance tools (e.g., CrayPat, Reveal). AMD: The Unreasonable Effectiveness of FP64 Precision Arithmetic Nicholas Malaya (AMD) Abstract Abstract The double-precision datatype, also known as FP64, has been a mainstay of high performance computing (HPC) for decades. Recent advances in AI have extensively leveraged reduced precision, such as FP16 or, more recently, FP8 for DeepSeek. Many HPC teams are now exploring mixed and reduced precision to see if significant speed-ups are possible in traditional scientific applications, including methods such as the Ozaki scheme for emulating FP64 matrix multiplications with INT8 datatypes. In this talk, we will discuss the opportunities, and the significant challenges, in migrating from double precision to reduced precision. Ultimately, AMD believes a spectrum of precisions is necessary to support the full range of computational motifs in HPC, and that native FP64 remains necessary for the near future. XTreme (Approved NDA Members Only) XTreme (Under NDA, Members Only)
Networking/Social Event Welcome Reception Networking/Social Event Program Committee Dinner (invite only) Networking/Social Event HPE Networking Event HPE will host its annual CUG community networking reception from 6:00 to 8:00 pm ET at the Lokal Eatery & Bar. Lokal is located at 2 2nd St, Jersey City, NJ 07302, along Jersey City's waterfront, allowing CUG guests to enjoy expansive views of the Manhattan skyline. Co-presented by AMD, all registered CUG attendees and their guests are invited to attend a reception with light hors d'oeuvres and drinks.
The first bus will leave at 5:55 pm; Lokal is about a 10-minute walk from the CUG hotel. The last bus will depart from Lokal at 8:00 pm. Networking/Social Event CUG AMD Night Out CUG Night Out at Hudson House, 2 Chapel Ave, Jersey City, NJ. We invite all registered attendees and guests with a paid CUG Night Out ticket to join us for an unforgettable evening at Hudson House. Situated at the end of Port Liberte in Jersey City, NJ, the venue sits an arm's length from the Hudson River and boasts panoramic views of the Statue of Liberty; the Brooklyn, Manhattan, and Verrazzano bridges; and, of course, the NYC skyline. Coaches will depart from outside the Westin Jersey City Hotel at 18:10 to arrive at Hudson House for a drinks reception before seating for dinner at approximately 19:15. If you are making your own way to the venue, please use the full address, as Google Maps takes you to a different location! Hudson House, 2 Chapel Ave is approximately a 15-20-minute drive. Our first bus will return to the hotel at approximately 21:00.
