Birds of a Feather: Programming Environments, Applications, and Documentation (PEAD)

CPE in a Container
Kaylie Anderson (HPE), Ben Cumming (Swiss National Supercomputing Centre), Subil Abraham (Oak Ridge National Laboratory), and Panchapakesan Chitra Shyamshankar (Argonne National Laboratory)
Abstract: Containers provide a way to package applications along with their dependencies in a single unit. Containers can aid common HPC workflow processes such as building, testing, and execution. Many CUG centers are expanding their use of containers to provide additional workflow functionality to their user communities. However, integration with a center's HPC environment, including HPE's CPE, can be complex. In this BOF, representatives from multiple centers will discuss their centers' efforts and goals to integrate and utilize containers. Representatives from HPE will discuss the CPE integration options. Discussion is the focus of each PEAD BOF. To help promote discussion, each BOF will open with short presentations followed by open floor discussions.

Python Management
Chun Sun (HPE); Cristian Di Pietrantonio (Pawsey); Dave Carlson (Stony Brook University); and Juan Herrera (EPCC, The University of Edinburgh)
Abstract: Python continues to play a growing role in workflows used within the HPC community. Integrating Python environments with varying needs into HPE programming environments can be complex. For CUG centers, managing multiple Python environments and ensuring performance can be non-trivial. In this BOF, representatives from multiple centers will discuss their centers' Python environment use cases, management, and lessons learned. Representatives from HPE will discuss the CPE-provided Python. Discussion is the focus of each PEAD BOF. To help promote discussion, each BOF will open with short presentations followed by open floor discussions.

Birds of a Feather: Programming Environments, Applications, and Documentation (PEAD)

CPE Testing
Barbara Chapman (HPE), Cristian Di Pietrantonio (Pawsey), Brian Vanderwende (NCAR), Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Cedric Jourdain (CINES)
Abstract: HPC programming environments can be very complex, containing libraries, compilers, and tools that must work together to provide an effective resource to a center's user community. Over a resource's lifespan, upgrades can impact not only an individual component but also the ability of multiple components to work together successfully. Testing at various stages of a resource's lifespan is crucial to ensure the numerous hardware and software components are in working order. The goal of this BOF is to provide a venue for CUG member sites to share techniques, best practices, and lessons learned for resource testing. During the BOF, HPE representatives will discuss the environment, process, and tools used to test the CPE. Discussion is the focus of each PEAD BOF. To help promote discussion, each BOF will open with short presentations followed by open floor discussions.

Exploring the Challenges of the World-Class HPE Cray Programming Environment for Modern Software Development in Fortran
Manuel Arenaz (Codee)
Abstract: Modernizing the software development workflow for Fortran developers is crucial to enhance productivity, code quality, and maintainability. Despite dating back to the 1950s, Fortran remains widely used in Aerospace, Automotive, Climate & Weather, Defense, Energy & Utilities, High Performance Computing, Manufacturing, Oil and Gas, Scientific Research, and other industries. However, traditional Fortran development often lacks the modern tooling that exists for C/C++ and that empowers developers to implement modern DevOps best practices in their organizations.

Birds of a Feather (BoF 1D)

Security BoF
Aaron Scantlin (National Energy Research Scientific Computing Center)
Abstract: This is the first BoF at CUG for the new HPC Security Special Interest Group. Its mission is to provide a forum for exchanging and discussing strategies and ideas related to ensuring the secure configuration (of both hardware and software), operation (from the system administrator's perspective), and utilization (from the user's perspective) of Cray/HPE systems. Related topics in HPC security (e.g., awareness training for both users and admins) will also be covered by the SIG.

Birds of a Feather (BoF 2D)

Kubernetes on HPE Supercomputers BoF
Sadaf Alam (University of Bristol), Dino Conciatore (Swiss National Supercomputing Centre), and Jesse L. Treger (HPE)
Abstract: Cloud-native technologies, including containers, orchestration engines like Kubernetes, and virtualisation on the compute plane for HPE EX platforms, will be discussed in this BOF with the CUG community. Diverse use cases drive these somewhat non-native HPC requirements for user-defined or custom platforms, ranging from AI and ML workflows to Trusted Research Environments to user-managed CI/CD pipelines. Essentially, we will explore how community needs are evolving, not only for running containers on HPC but also for MLOps-style workflows, which require a level of user autonomy that is not feasible within a batch scheduling system alone. This interactive BOF will include brief presentations (in some cases summaries of topics covered in detailed sessions and submitted papers) on the progress made in prototyping a Kubernetes environment on the compute plane with the HPE Slingshot interconnect, use cases from sites running diverse supercomputing platforms including variants of CPU and GPU technologies from different vendors, and a panel-led discussion on future directions, priorities, and challenges.

Birds of a Feather (BoF 1B)

CUG SIG System Monitoring Working Group BoF
Massimo Benini (CSCS - ETH Zurich), Lena Lopatina (Los Alamos National Laboratory), and Jeff Hanson and Pete Guyan (HPE)
Abstract: The System Monitoring Working Group (SMWG) is a CUG SIG (Special Interest Group) established to promote collaboration between HPE Cray and its customers in enhancing system monitoring capabilities. The group includes representatives from numerous HPE Cray member sites and meets regularly to discuss and address system monitoring topics. We meet virtually throughout the year, and the annual CUG meeting is an opportunity for participants from normally incompatible time zones to swap ideas about their monitoring setups, needs, and findings. This year's key topics are new monitoring capabilities provided by HPE, member sites' creation of a standardized set of dashboards for monitoring HPC centers, new approaches to application code profiling, energy profiling, and data-center metadata management. This BoF is a unique opportunity to engage collaboratively with HPE Cray and with other HPE Cray sites to shape future enhancements of observability and operational data analytics (ODA). This forum also provides a venue to share best practices, showcase common problems and their solutions, and give HPE a platform to present the current status of its monitoring stack directly to relevant HPC staff.

Birds of a Feather (BoF 1C)

Sharing is Caring: Tackling Node-Sharing Challenges at CUG Sites
Tim Robinson (Swiss National Supercomputing Centre, ETH Zurich); Tim Wickberg (SchedMD LLC); Pengfei Ding (Lawrence Berkeley National Laboratory); and Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre)
Abstract: HPC centres have traditionally allocated computational resources in entire nodes, a practice rooted in the architectural and operational simplicity of earlier systems and the belief that "HPC" meant the ability to scale to hundreds or thousands of nodes. However, as technology has advanced, nodes have become increasingly powerful and expensive, incorporating hundreds of CPU cores, multiple GPUs, and large amounts of high-bandwidth memory. This evolution makes whole-node allocation inefficient for many modern workloads: high-throughput computing, data-intensive workflows, and interactive computing, to name a few. Consequently, subdividing nodes to allocate resources at finer granularity (by socket, core, GPU, or memory) has emerged as a necessary alternative.

Birds of a Feather (BoF 1A)

CSM updates, iSCSI boot content projection, and other CSM topics
Harold Longley, Dennis Walker, Ravi Bissa, Jason Coverston, Siri Vias Khalsa, Ashalatha A. M, and Ravikanth Nalla (HPE)
Abstract: This BOF offers an overview of recent changes in the Cray System Management (CSM) software stack, a technical dive into iSCSI-based boot content projection, a method for workload-manager-controlled reboots of compute nodes, and an open discussion amongst attendees.

Birds of a Feather (BoF 3D)

Rethinking Interactive HPC Resource Access: Enhancing Security and Flexibility
Maxime Martinasso (Swiss National Supercomputing Centre), Sadaf Alam (University of Bristol), and Isa Wazirzada and Larry Kaplan (HPE)
Abstract: The traditional approach to accessing HPC resources relies on login nodes and SSH connections authenticated through POSIX Identity and Access Management (IAM). While this method has served the community well, it presents significant challenges in today's landscape of cybersecurity threats and evolving user needs, such as maintaining a secure shared login node or managing the identity life-cycle. This Birds of a Feather (BoF) session aims to explore innovative approaches to modernizing CLI-based interactive HPC resource access, addressing the dual goals of enhancing security and increasing the flexibility of service customization for users. Emerging practices, such as SSH signed keys, offer a promising alternative to traditional login names and passwords, mitigating the risks associated with credential theft by enabling more advanced authentication flows such as multi-factor authentication. Virtualized login nodes, implemented as containerized environments, could allow user-defined environments with, for instance, advanced debugging capabilities, AI stacks, or tighter IDE integration, while improving isolation, user scalability, and individual session management. Additionally, the generation of temporary POSIX accounts from OpenID Connect (OIDC) tokens could seamlessly integrate modern federated and non-local identity providers, reducing administrative overhead and attack surfaces. The session will showcase existing solutions, discuss opportunities for innovation, challenge the classic HPC IAM and login-node workflow, and highlight the potential benefits of these new approaches. Attendees will hear from practitioners actively exploring these paradigms, sparking discussions on how the community can collectively advance this shift and benefit from a common solution. We invite participants to contribute their ideas, share experiences, and help shape a future where interactive HPC resource access is not only more secure but also more adaptable to the diverse and continuously evolving needs of its users.

Birds of a Feather (BoF 2B)

Managing System Reliability: From system acceptance through production
Pete Guyan and Sue Miller (HPE)
Abstract: Managing system reliability from acceptance through production can be a messy business. This BoF will start with a synopsis of what HPE plans to do in the short, mid, and long term to build a cohesive strategy for reproducibility. We will discuss the current tools in HPE Performance Cluster Manager (HPCM), how we weave these tools into the strategy, and how they interact with monitoring and reporting tools.

Birds of a Feather (BoF 2C)

HPE Slingshot Birds-of-a-Feather
Jesse Treger (HPE)
Abstract: This birds-of-a-feather session will provide an opportunity for users to ask questions and share advice on managing and using HPE Slingshot systems, as well as to hear about, and provide input into, HPE Slingshot's software roadmap. The HPE Slingshot software scope covers capabilities both for the administrators who operate and manage the system and the fabric, and for HPC and AI application writers/users of the HPE Slingshot NIC's Libfabric provider. Users will be encouraged to share desired use cases, learnings, and best-known methods.

Birds of a Feather (BoF 2A)

CPE Futures
Barbara Chapman (HPE, Stony Brook University) and Kaylie Anderson and Chun Sun (HPE)
Abstract: The HPE Cray Programming Environment (CPE) provides a suite of integrated programming tools for application development on a diverse range of HPC systems delivered by HPE, including those with integrated node architectures and those whose nodes are configured with AMD or NVIDIA GPUs. Its compilers, math libraries, communication libraries, debuggers, and performance tools enable the creation, enhancement, and optimization of application codes written using mainstream HPC programming languages and the most widely used parallel programming models.

Paper, Presentation, Birds of a Feather
Technical Session 7A: AI/ML GPU Workloads
Session Chair: Raj Gautam (ExxonMobil)

Porting Radio Astronomy Correlation to Setonix, an HPE Cray EX System Powered by AMD GPUs
Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre, Curtin Institute for Radio Astronomy); Marcin Sokolowski (Curtin Institute of Radio Astronomy); Christopher Harris (Pawsey Supercomputing Research Centre); and Daniel Price and Randal Wayth (SKAO)
Abstract: In low-frequency radio astronomy, correlation of signals coming from hundreds of radio antennas is an early and fundamental step in creating science-ready data products such as images of the sky at radio wavelengths. Because of the high volume of data to process and the rate at which they are produced, correlation is usually performed in real time by dedicated hardware, an FPGA or GPU cluster, installed near the telescope. However, there are science cases in which an astronomer would like to correlate data later with customised settings, such as time and frequency averaging of signals. Setonix, Pawsey Supercomputing Centre's HPE Cray EX supercomputer based on AMD CPUs and GPUs, provides radio astronomers with enough computational power for such processing, but the only established GPU correlator works only on NVIDIA GPUs and proved hard to port. In this paper we discuss the process of providing Australian astronomers with an implementation of the correlation algorithm that harnesses the computational power of Setonix.

Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers
Bishwo Dahal (University of Louisiana Monroe, Oak Ridge National Laboratory) and Elijah Maccarthy and Subil Abraham (Oak Ridge National Laboratory)
Abstract: Containers are transforming scientific computing by simplifying the packaging and distribution of applications. This enables researchers to create and deploy their applications in isolated environments with all necessary dependencies, enhancing portability and deployment flexibility. These advantages make containers especially suitable for High Performance Computing (HPC) facilities like the Oak Ridge Leadership Computing Facility (OLCF), where complex scientific applications are developed and deployed. In this work, we investigate the performance of containerized machine learning (ML) applications in comparison to bare-metal execution on the Frontier exascale supercomputer. Specifically, we aim to determine whether ML models, when trained and tested within containers on Frontier using Apptainer, exhibit performance similar to that of bare-metal implementations. To achieve this, we use containers to package and run Convolutional Neural Network (CNN)-based ML applications on the OLCF Frontier and Odo supercomputers and assess their performance against bare-metal runs. After conducting scalability tests across up to 30 nodes with 1,680 AMD EPYC CPU cores and 240 GPUs, we find that the performance of the containerized ML applications is on par with that of bare-metal runs. We apply the lessons learned from our containerized ML models to containerizing and evaluating the performance of LLMs such as AstroLLaMA and CodeLLaMA on Frontier.

BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI
Tulsi Mishra, Dean Roe, and Larry Kaplan (HPE)
Abstract: As the convergence of HPC and AI reshapes computational workflows, the complexity of managing hybrid environments has become a significant challenge for organizations. HPE Cray Supercomputing User Services Software (USS) offers a transformative approach to simplify, scale, and optimize workflows across HPC and AI landscapes. In this session, we will explore how USS aims to bridge the gap between traditional HPC workloads and AI-driven innovations, providing a unified platform for containerized environments, hybrid deployment orchestration, and energy-efficient operations.
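As a loose illustration of the measurement idea in the containerized ML study above, the sketch below times the same workload on bare metal and inside an Apptainer container. The training command and container image path are placeholders invented for the example, not the paper's actual setup.

```python
# Minimal bare-metal vs. container timing harness (illustrative only).
import subprocess
import time

def timed(cmd: list[str]) -> float:
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

TRAIN = ["python3", "train_cnn.py", "--epochs", "1"]        # placeholder workload
bare = timed(TRAIN)                                         # bare-metal run
contained = timed(["apptainer", "exec", "ml.sif"] + TRAIN)  # same run in a container
print(f"bare metal: {bare:.1f}s, container: {contained:.1f}s, "
      f"ratio: {contained / bare:.2f}")
```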
Breaks and Lunches

- Coffee breaks, variously sponsored by SchedMD, Pier Group, Linaro, VAST, and Altair
- CUG Board / New Sites lunch (closed)
- PEAD & XTreme SIG participants lunch
- CUG Advisory Board Cabinet lunch (closed)
- CUG Board & Sponsors lunch (closed)
- HPE Executive lunch (closed)
- Lunches sponsored by Codee and by NVIDIA

Networking/Social Events

Welcome Reception

Program Committee Dinner (invite only)

HPE Networking Event
HPE will host its annual CUG community networking reception from 6:00 to 8:00 pm ET at the Lokal Eatery & Bar. Lokal is located at 2 2nd St, Jersey City, NJ 07302, along Jersey City's waterfront, allowing CUG guests to enjoy expansive views of the Manhattan skyline. Co-presented by AMD; all registered CUG attendees and their guests are invited to a reception with light hors d'oeuvres and drinks. The first bus will leave at 5:55 pm; Lokal is about a 10-minute walk from the CUG hotel. The last bus will depart Lokal at 8:00 pm.

CUG AMD Night Out
CUG Night Out at Hudson House, 2 Chapel Ave, Jersey City, NJ. We invite all registered attendees and guests with a paid CUG Night Out ticket to join us for an unforgettable evening at Hudson House. Situated at the end of Port Liberte in Jersey City, NJ, the venue is an arm's length from the Hudson River and boasts a panoramic view of the Statue of Liberty, the Brooklyn, Manhattan, and Verrazzano bridges, and of course the NYC skyline. Coaches will depart from outside the Westin Jersey City Hotel at 18:10 to arrive at Hudson House for a drinks reception before seating for dinner at approximately 19:15. If you are making your own way to the venue, please use the full address, as Google Maps otherwise takes you to a different location! Hudson House, 2 Chapel Ave, is approximately a 15-20 minute drive. Our first bus will return to the hotel at approximately 21:00.

Paper, Presentation
Technical Session 1B: Workload Manager
Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University)

Slinky: The Missing Link Between Slurm and Kubernetes
Tim Wickberg (SchedMD LLC)
Abstract: Slinky is SchedMD's collection of projects to integrate the Slurm Workload Manager with the Kubernetes orchestrator.

How Best to Leverage Cloud for (Big) HPC Sites
Bill Nitzberg and Ian Littlewood (Altair Engineering, Inc.)
Abstract: Cloud (finally) works for HPC, but the devil is still in the details. HPC in the cloud has transitioned from "proof-of-concept engagements" to "hmm, but what about security" to "maybe, but what about the data" to "OK, but only if we carefully manage expenses". Today, many big sites have voted with their dollars that on-premises HPC is here to stay, adding cloud judiciously, with an eye towards resilience, and only where the end-to-end ROI makes sense.

Divide and Rule: Automated Workload Distribution for Efficient User Support Services
Luca Marsella (Swiss National Supercomputing Centre)
Abstract: User support services for High-Performance Computing (HPC) systems help users conduct simulations and optimize resources. The complexity of HPC platforms has increased, making user support more challenging. Site Reliability Engineering (SRE) best practices suggest that technical staff should focus on project work and automation to reduce repetitive tasks. Artificial intelligence and machine learning can help create effective knowledge bases from internal reports and user tickets, easing the burden on support staff.

Paper, Presentation
Technical Session 1C: Software Deployment
Session Chair: Chris Fuson (Oak Ridge National Laboratory)

Deploying and Tracking Software with NCCS Software Provisioning
Asa Rentschler, Nicholas Hagerty, Elijah Maccarthy, and Edwin F. Posada Correa (Oak Ridge National Laboratory)
Abstract: The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory has a long history of deploying ground-breaking leadership-class supercomputers for the U.S. Department of Energy. The latest in this line of supercomputers is Frontier, the first supercomputer to break the exascale barrier (10^18 floating-point operations per second) on the TOP500 list. Frontier serves a wide array of scientific domains, from traditional simulation-based workloads to newer AI and machine learning workloads. To best serve the NCCS user community, NCCS uses Spack to deploy a comprehensive stack of scientific software packages, providing straightforward access to these packages through Lmod environment modules. Maintaining a large software stack while also incorporating multiple new compiler releases each year is a very time-consuming task. Additionally, it is not straightforward to provide a software stack alongside existing vendor-provided software such as the HPE/Cray Programming Environment (CPE), and the existing CPE, Spack, and Lmod integration does not allow multiple versions of GPU libraries, such as AMD's ROCm, to be used. To address these challenges and shortcomings, NCCS has developed the NCCS Software Provisioning tool (NSP), a tool for deploying and monitoring software stacks on HPC systems. NSP allows NCCS to quickly and effectively provision software stacks from the ground up using template-driven recipes and configuration files. NSP is successfully deployed on Frontier and several other NCCS clusters, enabling the NCCS software team to quickly deploy software stacks for newly released compilers, expand current software offerings, better support GPU-based software, and monitor Lmod module usage to identify unused software packages that can be removed from the stack. In this work, we discuss the shortcomings of the previous CPE, Spack, and Lmod usage at NCCS, provide further details on the implementation and structure of NSP, and then discuss the benefits that NSP provides.

Modern Software Deployment on a Multi-Tenant Cray EX System
Ben Cumming, Andreas Fink, Simon Pintarelli, and John Biddiscombe (CSCS)
Abstract: User-facing software (libraries, tools, applications, and programming environments tuned for the node and network architecture) is a key part of HPC centers' service offering. Teams that maintain and support this software face challenges: providing a stable software platform for users with long-running projects while also providing the latest versions of software for developers; giving full responsibility for building, modifying, and deploying the whole software stack to staff who do not have root access; and achieving reproducible deployment based on GitOps practices. CSCS addresses these challenges on Alps by using small independent software environments called uenv, which deploy from text-file recipes without requiring installation of the Cray Programming Environment. This paper discusses installing communication libraries from HPE and NVIDIA with Slingshot support; the CI/CD pipeline that builds uenv and deploys them to a container registry; and the command-line tools and SLURM plugin that connect users with the software environments. We demonstrate diverse use cases such as JupyterHub, summarize the user and support team experience, and document how to build and deploy CPE containers.

Employing a Software-Driven Approach to Scalable HPC System Management
Aaron Barlow (Oak Ridge National Laboratory)
Abstract: Managing Frontier and other HPE Cray and Apollo clusters at Oak Ridge National Laboratory involves thousands of users, projects, and security policies across multiple HPC systems. With diverse research needs, varying security enclaves, and massive resource allocations, manual processes don't scale, and the administrative burden increases as HPE systems grow. To manage HPC systems at this scale, we developed RATS (Resource Allocation Tracking System), a software platform that centralizes operations.
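The NSP paper earlier in this session describes provisioning software stacks "from the ground up using template-driven recipes and configuration files." As a rough sketch of that general idea (the recipe fields and the rendered spack.yaml layout below are invented for illustration and are not NSP's actual format), a small recipe can be expanded into a Spack environment file:

```python
# Illustrative template-driven rendering of a Spack environment (not NSP's format).
from string import Template

ENV_TEMPLATE = Template("""\
spack:
  specs:
$spec_lines
  view: $view_path
""")

def render_env(recipe: dict) -> str:
    """Expand a recipe dict into a spack.yaml-style document."""
    spec_lines = "\n".join(
        f"  - {pkg}@{ver} %{recipe['compiler']}"   # one spec per package
        for pkg, ver in recipe["packages"].items()
    )
    return ENV_TEMPLATE.substitute(spec_lines=spec_lines, view_path=recipe["view"])

recipe = {
    "compiler": "gcc@13.2.0",                      # one recipe per compiler release
    "view": "/sw/views/gcc-13.2.0",                # hypothetical install view
    "packages": {"hdf5": "1.14.3", "fftw": "3.3.10"},
}
print(render_env(recipe))
```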
Paper, Presentation
Technical Session 1A: Multitenancy
Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh)

Infrastructure as a Service with Strong Tenant Separation on a Supercomputer
Riccardo Di Maria, Chris Gamboni, Manuel Sopena Ballesteros, Hussein Harake, Mark Klein, Marco Passerini, Miguel Gila, Maxime Martinasso, and Thomas C. Schulthess (Swiss National Supercomputing Centre) and Alun Ashton, Derek Feichtinger, Marc Caubet, Elsa Germann, Hans-Nikolai Viessmann, Achim Gsell, and Krisztian Pozsa (Paul Scherrer Institute)
Abstract: This paper explores an innovative implementation of Infrastructure-as-a-Service (IaaS) on an HPE Cray Shasta EX supercomputer. In cloud environments, IaaS offers scalable, on-demand access to virtualized resources. However, applying IaaS principles to high-performance computing (HPC) systems without relying on virtualization technologies poses some challenges, since such systems typically have a tightly coupled software stack. We address these challenges in a co-design partnership between an HPC provider, CSCS, and an end-user institution, PSI, by developing a suite of technologies for the HPE Cray Shasta EX system architecture that supports resource isolation and granular control. This approach not only brings the IaaS model to supercomputing environments but also enables dynamic resource management. Our contributions include a detailed exploration of the technological advancements necessary for integrating IaaS into HPC, together with the lessons learned from our collaborative efforts. By extending IaaS capabilities to supercomputers, we aim to provide scientific institutions with unprecedented flexibility and control over their computational resources.

Dynamic Network Perimeterization: Isolating Tenant Workloads With VLANs, VNIs, & ACLs
Nikhil Mukundan, Dennis Walker, Stephen Han, Atif Ali, Siri Vias Khalsa, Amit Jain, Vishal Bhatia, and Vinay Karanth (HPE)
Abstract: There is a growing trend in the high-performance computing (HPC) community where separate user groups with varying security clearances (tenants) share HPC infrastructure. In such cases, tenants require robust security boundaries to ensure data privacy, results integrity, and intellectual property secrecy. Additionally, sensitive transactions within a tenant may need to be further insulated from lower-clearance workloads. Join us as we show how product-agnostic, version-controlled configuration data can be used to dynamically isolate infrastructure resources supporting workloads, including compute node groups, data at rest (storage), and data in motion within high-speed and management networks. On the high-speed network (HSN), we'll examine how switch-port VLAN filters and VNIs (traffic labels) isolate TCP/IP and RDMA traffic per tenant. On the management network, we'll demonstrate how to segment compute node groups via switch ACLs, VLANs, and iptables. Complete dynamic network segmentation will be applied at various levels of the infrastructure: chassis, nodes, and within the OS. Finally, we'll review architecture features in Slingshot, CSM, and other products that enable elastic tenant reallocation. We'll compare and contrast the number of configuration options and the security posture when applying segmentation at the switch versus at the node.
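To make the per-tenant isolation idea above concrete, here is a hedged sketch of deriving node-level iptables rules from version-controlled configuration data. The tenant names, VLAN IDs, and subnets are invented; a real deployment would also program switch ACLs and VNI assignments through the fabric manager rather than relying on node firewalls alone.

```python
# Illustrative generation of per-tenant node firewall rules from config data.
TENANTS = {
    "tenant-a": {"vlan": 101, "subnet": "10.101.0.0/16"},   # hypothetical layout
    "tenant-b": {"vlan": 102, "subnet": "10.102.0.0/16"},
}

def isolation_rules(tenant: str) -> list[str]:
    """Allow intra-tenant traffic; drop traffic from every other tenant's subnet."""
    rules = [f"iptables -A INPUT -s {TENANTS[tenant]['subnet']} -j ACCEPT"]
    rules += [
        f"iptables -A INPUT -s {cfg['subnet']} -j DROP"
        for name, cfg in TENANTS.items()
        if name != tenant
    ]
    return rules

for rule in isolation_rules("tenant-a"):
    print(rule)
```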
CSCS' journey towards complete platform automation in a multi-tenant environment
Miguel Gila, Ivano Bonesana, and Alejandro Dabin (Swiss National Supercomputing Centre, CSCS)
Abstract: The Swiss National Supercomputing Centre operates a complex ecosystem of high-performance computing resources. With a focus on scalability and efficiency, we have implemented a multi-tenancy approach to serve diverse scientific communities. This layered architecture encompasses infrastructure, platform, and application layers, each with unique automation challenges.

Paper, Presentation
Technical Session 2B: Security & Configuration Management
Session Chair: Jim Williams (Los Alamos National Laboratory)

Pragmatic Security Audits: Fortifying HPC Environments at a Consumable Pace
Alden Stradling (Los Alamos National Laboratory) and Monica Dessouky and Dennis Walker (HPE)
Abstract: Do you know your security posture? Are you overwhelmed by the latest reports? Don't let a sea of security findings paralyze your progress. By establishing a frequent, recurring cadence for audits and remediation, organizations ensure continuous protection against emerging threats. This paper presents the practical, scalable approach to HPC security implemented at a recent customer to secure its many new environments.

Experimenting with Security Compliance Checking using ReFrame
Victor Holanda Rusu, Matteo Basso, Chris Gamboni, Fabio Zambrino, and Massimo Benini (Swiss National Supercomputing Centre)
Abstract: Security is a critical aspect of High-Performance Computing (HPC) systems, where the implementation of security compliance checks and hardened configurations is essential to safeguard resources and data. Continuous security checking is fundamental, especially for detecting indications of compromise, but its implementation must balance effectiveness and efficiency to avoid unnecessary strain. Open-source and freely available security-focused tools, such as OSCAP, are less known and less accessible to engineers from other disciplines, who may not be familiar with their functionality or utility. This creates a barrier to collaborative efforts to improve system-wide configurations and promote shift-left security in HPC centers. We leveraged ReFrame to perform robust security compliance testing to address these limitations. ReFrame enables the creation of customizable tests that evaluate system configurations and generic exploits, and it can execute tests in parallel, optimizing testing workflows without significant performance penalties. We will present the latest developments at CSCS, showcasing how we plan to use ReFrame to enhance security compliance testing in HPC environments using three different standards: DoD STIG, ANSSI BP-028 (enhanced), and CSCS's own. We aim to create a community to develop, maintain, and benefit from a shared set of security checks tailored to HPC systems, based on customer-specific, industry-specific, or government-mandated requirements.

From Weeks to Hours: Harnessing Configuration Management and Deployment Pipelines
Dennis Walker and Siri Vias Khalsa (HPE) and Alex Lovell-Troy (Los Alamos National Laboratory)
Abstract: Ensuring peak reliability, current functionality, and up-to-date security requires cultivating the capability to continuously update and integrate a complex array of dependencies spanning hardware, firmware, system management software, network configurations, API services, OS distros, job schedulers, AI libraries, and analytics tools. This paper presents a simple yet contemporary DevOps methodology designed to automate, validate, and replicate changes effectively to one or many production environments.

Rev Up Compute Node Reboots: 2x to 5x Faster
Dennis Walker (HPE) and Paul Selwood (Met Office, UK / NERC CMS)
Abstract: Join us as we race to the bottom, showcasing the innovations developed to speed up reboot times by 300% for the UK Met Office's latest multi-zoned, CSM-based HPE Cray EX systems. With increasing complexity in software and site-specific customizations, node reboot times had ballooned to over 35 minutes by November 2023, far exceeding operational requirements. In response, HPE developed automation to download logs, parse metrics, and graph boot stages by duration to better understand what was happening. Guided by data, the following changes were implemented, sorted by impact:

- CFS/Ansible run-time plays were moved into local systemd execution, running earlier in the boot cycle and without blocking other nodes.
- Software installation and select configuration activity were moved into the image build, streamlining deployment.
- CSM boot settings were tuned for optimal performance.
- Node Health Checks (NHC) were moved into systemd, running preemptively before the job scheduler agent to ensure nodes are consistently job-ready as early as possible.

Paper, Presentation
Technical Session 2C: Climate Applications
Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre)

Bit-reproducibility in UK Met Office Weather and Climate Applications
David Acreman (HPE)
Abstract: Weather and climate applications solve partial differential equations which are highly sensitive to small perturbations in model variables. Changes in even the least significant bit of a variable can have an observable impact on scientific results (the "butterfly effect"). The nature of floating-point arithmetic means that subtle changes to code, or to the order of a summation, can change results at the bit level due to different round-off errors. Consequently, achieving bit-reproducible results is challenging.

Enabling km-scale coupled climate simulations with ICON on AMD GPUs
Jussi Enkovaara (CSC - IT Center for Science Ltd.)
Abstract: The Icosahedral Nonhydrostatic (ICON) weather and climate model is a modelling framework for numerical weather prediction and climate simulations. ICON is implemented mostly in Fortran 2008, with the GPU version based mainly on OpenACC. ICON is used on a large variety of hardware, ranging from classical CPU clusters to vector architectures and various GPU systems. In coupled simulations ICON can utilize heterogeneous architectures, i.e., the ocean runs on CPUs while the atmosphere runs concurrently on GPUs.

MARBLChapel: Fortran-Chapel Interoperability in an Ocean Simulation
Brandon Neth and Ben Harshbarger (HPE); Scott Bachman ([C]Worthy); and Michelle Mills Strout (HPE, University of Arizona)
Abstract: As the climate crisis continues to have widespread effects on the biosphere, scientists increasingly turn to computer modeling to understand the impacts of different interventions. Modeling one such intervention, ocean carbon dioxide removal, requires incorporating multiple sources of interaction (air-sea gas exchange, biogeochemical processes, etc.) and high spatial and temporal resolutions. To address the need for scalable, high-resolution simulations, scientists at [C]Worthy have written the core of an ocean modeling code in Chapel, a parallel programming language for writing high-performance, distributed programs. Although Chapel has enabled rapid development, an important library for modeling biogeochemical processes, MARBL, is written in Fortran. MARBL is a robust, stand-alone library used in several state-of-the-art models, including MOM6, MPAS, POP, and ROMS. Rather than rewriting the MARBL library in Chapel, we use Chapel's and Fortran's C interoperability to integrate MARBL into the distributed Chapel simulation. This allows us to reuse reliable scientific code while using Chapel to orchestrate parallelism. In this talk, we demonstrate how the distributed Chapel simulation sets up the data structures needed by MARBL, calls out to the Fortran library, and brings results back to update the simulation. We show performance results on Perlmutter, an HPE Cray EX system.

Redefining Weather Forecasting Systems: The Transition to ICON and Alps
Mauro Bianco, Matthias Kraushaar, and Roberto Aielli (ETH Zurich); Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss); and Thomas Schulthess (ETH Zurich)
Abstract: The transition of MeteoSwiss operational weather forecasting from COSMO to the ICON model represents a major modernization in meteorological services, integrating software-defined infrastructure to improve flexibility, scalability, and resilience. The migration also involved significant hardware upgrades, from fixed systems with K80 GPUs to flexible architectures using V100 and later A100 GPUs, supported by the Alps infrastructure developed by CSCS, which is based on HPE's Cray EX product line [maximecug25].

Paper, Presentation
Technical Session 2A: Slingshot
Session Chair: Brett Bode (National Center for Supercomputing Applications, University of Illinois)

The HPE Slingshot 400 Expedition
Houfar Azgomi, Duncan Roweth, Gregory Faanes, and Jesse Treger (HPE)
Abstract: HPE Slingshot 400 is a high-performance interconnect for classic and AI supercomputing clusters. As the successor to HPE Slingshot, it comprises a PCIe Gen5 NIC (Cassini-2) and a 64-port switch (Rosetta-2), linking over standard 400 Gbps Ethernet physical interfaces and enabling dragonfly and fat-tree networks with up to 260,000 endpoints. HPE Slingshot is currently deployed in 7 of the 10 largest supercomputers worldwide and is the interconnect for the top three: El Capitan, Frontier, and Aurora. With such success, the Slingshot transport protocol has become a cornerstone of the HPC-optimized Ethernet networking standardization efforts led by the Ultra Ethernet Consortium (UEC). Building on its foundational adaptive routing and congestion management feature set, the HPE Slingshot 400 interconnect doubles its predecessor's bandwidth and adds significant enhancements: exact-match forwarding, increasing route visibility across the cluster; dedicated ACL tables for security and cloud isolation; feature-hardening flexibility through P4 programmability; and improved quality of service with 50% more traffic classes. It is supported across HPE's portfolio of rack- and chassis-based supercomputing platforms, including HPE Cray XD, HPE Cray EX, and the latest HPE Cray GX. This paper presents the key features and some early performance results of the Rosetta-2 and Cassini-2 devices.
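The "up to 260,000 endpoints" figure is consistent with the textbook balanced dragonfly construction for a radix-64 switch. The sketch below assumes the standard parameterization (p = k/4 endpoints per switch, a = k/2 switches per group, h = k/4 global links per switch); Slingshot 400's actual group sizing may differ, so treat this as a back-of-envelope check rather than a statement of the product's topology.

```python
# Balanced dragonfly capacity for a radix-64 switch (back-of-envelope check).
k = 64              # switch radix: Rosetta-2 is a 64-port switch
p = k // 4          # endpoints attached per switch          -> 16
a = k // 2          # switches per group                     -> 32
h = k // 4          # global (inter-group) links per switch  -> 16
g = a * h + 1       # max groups with all-to-all global links -> 513
endpoints = p * a * g
print(f"max groups: {g}, max endpoints: {endpoints}")  # 513 groups, 262,656 endpoints
```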
Introduction To HPE Slingshot NIC Libfabric Environment Variables
Jesse Treger and Ian Ziemba (HPE)
Abstract: Libfabric, a high-performance fabric software library, provides a rich API for applications to communicate efficiently over various networking technologies. The Libfabric provider for a specific networking interface type translates the API-requested communications into protocols that strive to make optimal use of the hardware. The HPE Slingshot NIC in particular has extensive hardware offloads to improve performance and reduce memory overhead. While Libfabric aims for seamless integration, achieving optimal performance may require users to configure environment variables to fine-tune the software for specific workloads and hardware setups. This presentation will demystify the role of environment variables in Libfabric, explaining the trade-offs and why they matter for performance and stability under various conditions. We will begin with an overview of how the HPE Slingshot NIC uses Libfabric to optimize performance under various messaging requirements. Next, we will explore the most common environment variables users may need to adjust from their default values, using examples learned on different applications, MPI middleware, processors, and job scales. Finally, we will briefly touch on some best-known methods to troubleshoot application failures that can be addressed with environment settings.

Math in Your Network: Slingshot Hardware Accelerated Reductions
Forest Godfrey and Duncan Roweth (HPE)
Abstract: In high performance computing applications, the use of collective operations such as reductions and barriers is commonplace. The performance of collectives is critical to overall performance in many applications, especially those where collectives become an increasingly large part of the runtime as jobs scale. Collective operations are typically performed in software, requiring packets carrying contributions to the collective to go all the way to endpoint memory and be acted upon, only for the result to have to transit back out to the network. This occurs at each level of a collective tree. By performing collective operations inside the network switch hardware itself, the round trips to memory are removed and significant improvements in latency can be achieved. The Slingshot Rosetta switch fabric supports hardware acceleration of many collective operations, such as barriers and 64-bit IEEE floating-point reductions. Upcoming Slingshot software will enable this functionality and present it to the end user transparently through the industry-standard libfabric network communication library. This presentation will cover the details of this upcoming feature and how it can be used to accelerate applications. The implementation inside libfabric, the interaction with the job scheduler and fabric manager, and initial benchmarking results will be discussed.

Slingshot Host Software Ethernet Tuning
Ravi Bissa, Ian Ziemba, Duncan Roweth, and Forest Godfrey (HPE)
Abstract: High-performing Ethernet is the cornerstone of exascale supercomputers, enabling seamless communication, minimizing latency, and supporting massive scalability. Without a robust Ethernet infrastructure, these systems cannot achieve their goal of solving the world's most complex computational problems efficiently and within reasonable time and energy budgets.
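As a hedged illustration of the kind of tuning the first talk in this session covers, the sketch below launches a job with a few Slingshot/CXI libfabric environment variables set. These variable names appear in the libfabric CXI provider's documentation, but the values and the launch command are placeholders; appropriate settings depend on the application and system.

```python
# Launching a job with illustrative libfabric CXI tuning variables set.
import os
import subprocess

env = os.environ.copy()
env.update({
    # Fall back to software tag matching if hardware match entries are exhausted.
    "FI_CXI_RX_MATCH_MODE": "hybrid",
    # Message size (bytes) at which the provider switches to a rendezvous protocol.
    "FI_CXI_RDZV_THRESHOLD": "16384",
    # Core libfabric logging verbosity, useful when troubleshooting failures.
    "FI_LOG_LEVEL": "warn",
})
# Placeholder launch command; a real job would use the site's launcher and binary.
subprocess.run(["srun", "-n", "128", "./my_mpi_app"], env=env, check=True)
```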
Plenary, Paper
Plenary Session: CUG Organizational Update and Best Paper Presentation

Evolving HPC services to enable ML workloads on HPE Cray EX
Stefano Schuppli, Fawzi Mohamed, Henrique Mendonca, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Torsten Hoefler, and Thomas Schulthess (Swiss National Supercomputing Centre)
Abstract: The Alps research infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the Swiss AI community's early-access phase of Alps (2023) and propose several technological enhancements. These include: a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service-plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. The paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructures like ours.

Alps, a versatile research infrastructure
Maxime Martinasso (Swiss National Supercomputing Centre, ETH Zurich) and Mark Klein and Thomas Schulthess (Swiss National Supercomputing Centre)
Abstract: The Swiss National Supercomputing Centre (CSCS) has a long-standing tradition of delivering top-tier high-performance computing systems, exemplified by the Piz Daint supercomputer. However, the increasing diversity of scientific needs has exposed limitations in traditional vertically integrated HPC architectures, which often lack flexibility and composability. To address these challenges, CSCS developed Alps, a next-generation HPC infrastructure designed around a transformative principle: resources operate as independent endpoints within a high-speed network. This architecture enables the creation of independent tenant-specific and platform-specific services tailored to diverse scientific requirements.

Paper, Presentation
Technical Session 3B: HPCM
Session Chair: Matthew A. Ezell (Oak Ridge National Laboratory)

A Brief Summary of the HPCM (HPE Performance Cluster Manager) Evolution Over Recent Releases
Sue Miller, Lee Morecroft, and Peter Guyan (HPE)
Abstract: This presentation will cover the enhancements to HPE Performance Cluster Manager (HPCM) over recent releases, from 1.11 to 1.13, along with the patches to those versions that served as the release mechanisms for these enhancements.

System Visualization Using Rackmap
Troy Dey and Peter Guyan (HPE)
Abstract: Understanding the health of an HPC system across dimensions such as power, environment, fabric, and job performance is a challenging task, and it continues to increase in difficulty as these systems become larger and more complex. To address this problem, the HPE Performance Cluster Manager (HPCM) now provides Rackmap, an extensible CLI tool capable of rendering telemetry data as a dense 2D representation of the physical layout of components within an HPC system. This dense display of information within the CLI allows a system administrator to view, for example, the power status of thousands of nodes instantly, on a single screen, without having to context-switch to a separate application. In addition, system administrators can easily create their own maps to display information of interest to them, such as whether nodes have passed acceptance tests. This presentation will provide an overview of the Rackmap tool, describe the maps currently shipped with HPCM, and explain how system administrators can create their own maps.

Harvesting, Storing and Processing Data from our HPCM Systems
Ben Lenard, Eric Pershey, Brian Toonen, Peter Upton, Doug Waldron, Lisa Childers, Micheal Zhang, and Bryan Brickman (Argonne National Laboratory)
Abstract: With the Argonne Leadership Computing Facility (ALCF) acquiring more HPE supercomputers, each equipped with its own HPCM stack, alongside other operational programs, we devised a strategy to centralize monitoring data from these systems. This centralized system aggregates data from various sources and securely distributes it to different consumers, including various teams and platforms within the ALCF.

Paper, Presentation
Technical Session 3C: Future Technology
Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh)

Evolving Sarus to augment Podman for HPC on Cray EX
Alberto Madonna, Gwangmu Lee, and Felipe Cruz (Swiss National Supercomputing Centre)
Abstract: Podman provides a modern and flexible containerization solution but lacks the specialized features required for high-performance computing. The evolution of the Sarus project aims to integrate Podman into a modular, open-source HPC container suite, bridging mainstream container technologies and supercomputing. This presentation highlights how Sarus deploys and optimizes Podman for HPC on CSCS's Alps infrastructure, a Cray EX system, focusing on the following areas:

- HPC-Optimized Podman: secure and scalable rootless containers for supercomputing environments, with HPC-specific configuration templates.
- Workload Management Integration: seamless orchestration of containerized workloads via a SLURM-compatible SPANK component.
- Transparent HPC Resource Access: Open Container Initiative (OCI) hooks and the Container Device Interface (CDI) provide pluggable access to compute, network, and storage resources on Cray EX systems.
- Parallel Filesystems Support: a Squashfs-based image store for efficient use of HPC storage systems.
- Secure Multi-Tenancy: rootless subid synchronization for Podman on shared distributed systems.

The presentation will include test results on Alps, demonstrating how Sarus enables Podman to handle containerized job submissions efficiently and seamlessly. By augmenting community container tools like Podman to meet HPC needs, Sarus delivers a modern and flexible container stack optimized for CSCS's vClusters architecture on Cray EX systems.
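As a hedged illustration of the CDI-based device access mentioned above, the following sketch runs a rootless Podman container with a CDI-style device request. The image name and the CDI device string are placeholders (CDI kinds are vendor-defined), and on a production system the workload manager integration, rather than the user, would typically assemble this invocation.

```python
# Illustrative rootless Podman invocation with a CDI-style device request.
import subprocess

cmd = [
    "podman", "run", "--rm",
    "--device", "amd.com/gpu=all",          # CDI device request (placeholder kind)
    "docker.io/library/ubuntu:24.04",       # placeholder image
    "bash", "-lc", "echo hello from a rootless container",
]
subprocess.run(cmd, check=True)
```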
What is RISC-V and why should we care?
Nick Brown (EPCC)
Abstract: RISC-V is an open Instruction Set Architecture (ISA) standard which enables the open development of CPUs and a shared common software ecosystem. With billions of RISC-V cores already produced, and production accelerating rapidly, we are seeing a revolution driven by open hardware. Nonetheless, for all the successes that RISC-V has enjoyed, it has yet to become mainstream in HPC. This comes at a time when HPC is facing new challenges, especially around the performance and sustainability of operations, and recent advances such as data-centre-class RISC-V hardware make this technology a more realistic proposition with the potential to address them. In this survey paper we explore the current state of the art of RISC-V for HPC, identifying areas where RISC-V can benefit the HPC community, assessing the maturity of the hardware and software ecosystem for HPC, and identifying areas where the HPC community can contribute. The outcome is a set of recommendations on where the HPC and RISC-V communities can come together and focus on high-priority action points to help increase adoption.

A Full Stack Framework for High Performance Quantum-Classical Computing
Xin Zhan, K. Grace Johnson, and Soumitra Chatterjee (HPE); Barbara Chapman (HPE, Stony Brook University); and Masoud Mohseni, Kirk Bresniker, and Ray Beausoleil (HPE)
Abstract: To address the growing need for scalable distributed High Performance Computing (HPC) and Quantum Computing (QC) integration, we present our HPC-QC full-stack framework and its hybrid workload development capability, which takes a modular, hardware- and device-agnostic software integration approach. The latest developments in extensible interfaces for quantum programming, dispatching, and compilation within an existing mature HPC programming environment are demonstrated. Our HPC-QC full stack enables high-level, portable invocation of quantum kernels from commercial quantum SDKs within HPC meta-programs in compiled languages (C/C++ and Fortran) as well as Python, through a quantum programming interface library extension. An adaptive circuit-knitting hypervisor is being developed to partition large quantum circuits into sub-circuits that fit on smaller noisy quantum devices and classical simulators. At the lower level, we leverage the Cray LLVM-based compilation framework to transform and consume LLVM IR and Quantum IR (QIR) from commercial quantum software front-ends in a retargetable fashion for different hardware architectures. Several hybrid HPC-QC multi-node, multi-CPU and GPU workloads (including solving linear systems of equations, quantum optimization, and simulating quantum phase transitions) have been demonstrated on HPE EX supercomputers to illustrate functionality and execution viability for all three components developed so far. This work provides the framework for a unified quantum-classical programming environment built upon the classical HPC software stack (compilers, libraries, parallel runtimes, and process scheduling).

Paper, Presentation
Technical Session 3A: Data Centers
Session Chair: Lena M Lopatina (LANL)

Causality inference for Digital Twins in GPU Data Centers and Smart Grids
Rolando Pablo Hong Enriquez, Pavana Prakash, Ebad Taheri, and Aditya Dhakal (HPE); Matthias Maiterth and Wesley Brewer (Oak Ridge National Laboratory); and Dejan Milojicic (HPE)
Abstract: To the benefit of both technologies, data centers and smart grids will likely become ever more integrated in the near future. The downside is that effectively managing those systems will rapidly become burdensome if we neglect to prepare accordingly. Digital twins can potentially bring the benefits of advanced analytics and visualization to the management of such complex environments. Yet even today's AI systems lack a proper causal understanding of the data. Here we embark on a journey to collect proper causal data for validating causal inference methods based on three fundamentally different theoretical foundations: causal calculus, information theory, and dynamical systems theory. Subsequently, we apply these methods to two target datasets, from a smart grid and from a GPU data center. Finally, we analyze the successes and failures of applying these methodologies and the insights they offered for creating more effective and energy-efficient prediction strategies for digital twins in support of smart grids and GPU data centers.

AlpsB – a Geographically Distributed Infrastructure to Facilitate Large-Scale Training of Weather and Climate AI Models
Alex Upton, Jerome Tissieres, and Maxime Martinasso (Swiss National Supercomputing Centre)
Abstract: AI-based models are transforming weather forecasting; these models are high quality and inexpensive to run compared to traditional physics-based models, and they are already outperforming existing forecasting systems on many standard scores. The size of training datasets, however, remains a challenge. The widely used ERA5 dataset, for example, is over 5 PB, and such datasets are not typically located close to the large-scale compute power required for training AI models. As such, new solutions are required.

Co-design, deployment and operation of a Modular Data Centre (MDC) with air and direct-liquid cooled supercomputers
Sadaf Alam (University of Bristol); Emma Akinyemi, Martin Podstata, and Jan Over (HPE); and Simon McIntosh-Smith, Ross Barnes, Naomi Harris, and Dave Moore (University of Bristol)
Abstract: The Bristol Centre for Supercomputing (BriCS) deployed its first HPE modular data centre (MDC), also known as a Performance Optimised Data Centre (POD), in March 2024. This has been a collaborative co-design project between HPE and the University of Bristol. The MDC enabled the rapid commencement of operations for the research community on the direct liquid cooled (DLC) Isambard-AI phase 1 (HPE Cray EX2500) and the air-cooled Isambard 3 (HPE Cray XD224), with NVIDIA Grace-Hopper and Grace-Grace superchips, respectively. A second set of MDCs has been deployed for Isambard-AI phase 2, containing 5,280 NVIDIA Grace-Hopper superchips in HPE Cray EX4000 DLC cabinets, together with the management and storage ecosystems. This manuscript outlines key features of the HPE POD MDCs for sustainability, efficiency, flexibility, and observability in an era when data centre cooling and power needs are changing with the growing demands of AI and HPC. We leverage community efforts, specifically the Energy Efficient High Performance Computing Working Group (EE HPC WG), which aims to sustainably support science through committed community action by encouraging the implementation of energy conservation measures and energy-efficient design in HPC [1]. We outline notable advantages of the MDC approach for the constraints and requirements unique to the Isambard-AI project that led to a co-design approach. We conclude by highlighting the key lessons drawn from this work.
Paper, Presentation Technical Session 4B: GPU Energy Efficiency Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre) Optimizing GPU Frequency for Sustainable HPC: Lessons Learned from a Year of Production on Adastra, an AMD GPU Supercomputer Gabriel Hautreux, Naïma Alaoui, and Etienne Malaboeuf (CINES) Abstract Abstract Power consumption is a critical concern for GPU-based high-performance computing (HPC) systems as rising energy costs and environmental challenges push for energy-efficient solutions. Modern GPUs, such as the AMD MI250X used in the Adastra supercomputer, offer features like frequency scaling to manage power consumption dynamically. However, optimizing frequency configurations for diverse HPC and AI workloads is complex due to their varied computational demands. CINES, operating the Adastra system in Montpellier, France, conducted a study analyzing the impact of reducing the GPU frequency from 1.7 GHz to 1.5 GHz. Adastra, ranked #3 on the Green500 list of energy-efficient supercomputers, supports French researchers across scientific domains. Previous findings presented at SuperComputing 23 showed that frequency downscaling improved energy efficiency while slightly impacting performance, prompting CINES to adopt the lower frequency in July 2024. A year-long analysis revealed a 15% reduction in energy consumption per node, aligning with sustainability goals without requiring hardware modifications. The study also assessed application performance, user satisfaction, hardware reliability, and differences between HPC and AI workloads. The results provide actionable insights for HPC centers aiming to enhance energy efficiency while avoiding the complexity and overhead of dynamic strategies. Fine-Grained Application Energy and Power Measurements on the Frontier Exascale System Oscar Hernandez and Wael Elwasif (Oak Ridge National Laboratory) Abstract Abstract The increasing complexity and power/energy demands of heterogeneous exascale systems, such as the Frontier supercomputers, present significant challenges for measuring and optimizing power consumption in applications. Current tools either lack the resolution to capture fine-grained power and energy measurements or fail to integrate this information with application performance events. This paper introduces a novel open-source performance toolkit that integrates extended PAPI components with Score-P to enable fine-grained millisecond-level power and energy measurements for AMD MI-250x GPUs and CPUs. EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs Anna Yue, Torsten Wilde, Sanyam Mehta, and Barbara Chapman (HPE) Abstract Abstract The widespread adoption of GPUs combined with the significant power consumption of GPU applications prepares a strong case for an effective power/energy saving tool for GPUs. Interestingly, however, GPUs present unique challenges (that are traditionally not seen in CPUs) towards this goal, such as very few available low-overhead performance counters and fewer optimization opportunities. We propose Everest, a proof-of-concept tool that dynamically characterizes applications to find novel and effective opportunities for power and energy savings while providing desired performance guarantees. Specifically, Everest finds two unique avenues for saving energy using DVFS (Dynamic Voltage Frequency Scaling) in GPUs in addition to the traditional method of lowering core clock for memory bound phases. 
Everest does not require any application modification or a priori profiling and has very low overhead. Everest relies on a single chosen performance event, available on both AMD and NVIDIA GPUs, that we show to be sufficient and effective for application characterization; this also makes Everest portable across GPU vendors. Experimental results of our PoC across 8 HPC and AI workloads demonstrate up to 25% energy savings while maintaining 90% performance relative to the maximum application performance, outperforming existing solutions on the latest NVIDIA and AMD GPUs. HPE Cray EX255a (MI300A) Blade Power Capping and HBM Page Retirement Steven Martin, Randy Law, Leo Flores, Ron Urwin, and Larry Kaplan (HPE) Abstract Abstract HPE Cray Supercomputing EX255a is a density-optimized, accelerated blade featuring eight AMD MI300A Accelerated Processing Units (APUs). To deploy the HPE Cray EX255a blades at maximum density in the HPE Cray EX4000 cabinet, the nodes need to be power capped to stay within the 400 kVA maximum cabinet power constraint. Managing node-level power to enforce the cabinet-level constraint while maximizing node-, cabinet-, and system-level performance drove the engineering team to a new power-capping design that will be described in this presentation. This new power-capping design is configured out-of-band via Redfish and is complementary to in-band capping that can be configured via rocm-smi. This presentation will show power and performance data collected on a large customer system and from a smaller system internal to HPE. This design is expected to be leveraged for future HPE Cray EX blades. Paper, Presentation Technical Session 4C: Monitoring Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University) Utilization and Performance Monitoring of Ookami, an ARM Fujitsu A64FX Testbed Cluster with XDMoD Nikolay A. Simakov, Joseph P. White, and Matthew D. Jones (SUNY University at Buffalo) and Eva Siegmann, David Carlson, and Robert J. Harrison (Stony Brook University) Abstract Abstract High-Performance Computing (HPC) resources are essential for scientific and engineering computations but come with substantial initial and operational costs. Therefore, ensuring their optimal utilization throughout their lifecycle is crucial. Monitoring utilization and performance helps maintain efficiency and proactively address user needs and performance issues. This is particularly important for technological testbed systems, where frequent software updates can mask localized performance degradations with improvements elsewhere. HPE Slingshot Monitoring Software: Actionable Insights for HPC and AI Systems Sahil Patel (HPE) Abstract Abstract Modern HPC and AI systems produce vast telemetry data, making performance monitoring and root cause analysis increasingly challenging. Traditional troubleshooting methods often lead to inefficiencies, lengthy resolution times, and costly downtime, unable to meet the demands of today’s high-performance computing environments. LDMS New Features for Deployment in Advanced Environments and Feedback for Operations Jim Brandt, Ben Schwaller, Jennifer Green, Ben Allan, Cory Lueninghoener, Evan Donato, Vanessa Surjadidjaja, Sara Walton, and Ann Gentile (Sandia National Laboratories) Abstract Abstract The Lightweight Distributed Metric Service (LDMS) monitoring, transport, and analysis framework has been deployed on large-scale Cray and HPE systems for over a decade.
Over that time its capabilities have improved dramatically. In this talk we provide updates on capabilities, including deployment and management methods in bare-metal, containerized, and cloud (including hybrid on- and off-premises) environments. We describe how LDMS is being used to collect application data concurrently with system data, and how the low-latency availability of this data for analysis can be used for real-time data analysis and feedback in order to support efficient, resilient, and reliable system operations. Finally, we will describe current related research areas, including 1) use of machine learning for modeling application and system behavioral characteristics and 2) use of new features in the bi-directional communication capability of LDMS to provide low-latency communication and feedback from a distributed analysis system to user, system, and application processes on disparate clusters and to inform data center orchestration decisions. Proactive Health Monitoring and Maintenance of High-Speed Slingshot Fabrics in HPC Environments Michael Cush, Jeff Kabel, Michael Schmit, Michael Accola, and Forest Godfrey (HPE) Abstract Abstract This whitepaper addresses the critical need for maintaining the health of high-speed Slingshot fabrics in high-performance computing (HPC) environments. Identifying and resolving known issues swiftly is essential for optimizing HPC workload performance, yet pinpointing common and emerging problems can be highly challenging. We propose a proactive solution that leverages automated capture of key configuration and performance metrics, coupled with sophisticated event logic, to detect unhealthy components and known bugs within the fabric. This is achieved through the System Diagnostic Utility (SDU), integrated with HPCM and CSM software, which automates data capture and securely transmits it to HPE using HPE Remote Device Access (RDA). This solution is complementary to other system monitoring solutions such as SMS and AIOps. In fact, it can also capture data from those tools to consolidate and enhance the level of data captured for analysis. Paper, Presentation Technical Session 4A: New Deployment Session Chair: Jim Rogers (Oak Ridge National Laboratory) A journey to provide GH200 Mark Klein, Thomas Schulthess, Jonathan Coles, and Miguel Gila (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Bringing a new hardware architecture to production involves a multi-phase process that ensures optimal performance, stability, and integration with existing infrastructure. The process begins with the physical installation of hardware and networking components. Once the hardware is in place, system engineers configure the operating system and the necessary software stacks for performance monitoring and fault detection. This is followed by rigorous testing, including stress tests and benchmark runs, to verify the system’s capabilities and identify any hardware or software anomalies.
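The EX255a power-capping presentation in Session 4B above notes that the new design is configured out-of-band via Redfish. As a hedged illustration of what such an out-of-band request can look like using the standard Redfish Power schema (the BMC address, chassis path, and credentials below are hypothetical placeholders, and the actual HPE Cray EX controller may expose a different schema):

    import requests

    # Sketch: set a node-level power cap out-of-band via a BMC's Redfish API.
    # Endpoint, chassis path, and credentials are placeholders for illustration.
    BMC = "https://bmc-node0042.example.org"
    POWER_URL = f"{BMC}/redfish/v1/Chassis/Node0/Power"

    # Standard Redfish Power schema: PowerControl[0].PowerLimit.LimitInWatts.
    payload = {"PowerControl": [{"PowerLimit": {"LimitInWatts": 2200}}]}

    resp = requests.patch(
        POWER_URL,
        json=payload,
        auth=("admin", "password"),  # placeholder credentials
        verify=False,                # lab setting only; use proper CA certs in production
        timeout=10,
    )
    resp.raise_for_status()
    print("power cap request accepted:", resp.status_code)

A site-wide tool would iterate such requests over every node BMC so that the sum of node caps respects the cabinet-level constraint the abstract describes.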
Evaluating AMD MI300A APU: Performance Insights on LLM Training via Knowledge Distillation Dennis Dickmann (Seedbox); Philipp Offenhäuser (HPE); Rishabh Saxena (HLRS, University of Stuttgart); George Markomanolis (AMD); Alessandro Rigazzi (HPE HPC/AI EMEA Research Lab); Patrick Keller (HPE); and Kerem Kayabay and Dennis Hoppe (HLRS, University of Stuttgart) Abstract Abstract AMD (Advanced Micro Devices) has recently launched the MI300A Accelerated Processing Unit (APU), which integrates Central Processing Unit (CPU) and Graphics Processing Unit (GPU) compute in a single chip with unified High Bandwidth Memory (HBM). This study assesses the capabilities of the AMD Instinct™ MI300A and examines how this new architecture handles real-world generative Artificial Intelligence (AI) workloads. While performance data for Large Language Model (LLM) use cases exists for the MI250X and MI300X, to the best of our knowledge, such assessments are absent for the MI300A. We apply a Knowledge Distillation (KD) use case to distil the knowledge of the Mistral-Small-24B-Base-2501 teacher model into a student model that is 50% sparse using a 2:4 sparsity pattern. The results show quasi-linear scaling of raw performance on up to 256 APUs. Lastly, we discuss the challenges, research and practical implications, and outlook. Evaluation of the NVIDIA Grace Superchip in the HPE/Cray XD Isambard 3 supercomputer Thomas Green and Sadaf Alam (University of Bristol) Abstract Abstract The Bristol Centre for Supercomputing (BriCS) has recently deployed 55,296 Arm Neoverse V2 CPU cores in a supercomputing platform, via 384 nodes of NVIDIA Grace CPU Superchips with LPDDR5 memory, as part of the Isambard 3 HPC service for the UK HPC research community. Isambard 3 is an HPE/Cray XD series system using the Slingshot 11 interconnect. As one of the first systems of this kind, this manuscript overviews details of the hardware and software configuration and presents early performance evaluation and benchmarking results using a representative subset of scientific applications. The focus is to evaluate Isambard 3 as a “plug-and-play” environment for researchers, especially those who are familiar with the Cray software environment. We include microbenchmark results to provide insights into the performance behaviour of this unique architecture. We present a small-scale scaling comparison between the NVIDIA Grace CPU Superchip and other mainstream CPUs, including Intel Sapphire Rapids and AMD Genoa and Bergamo. We report on issues encountered during our attempts to use several major software toolchains available for Arm, such as the HPE Cray Compiling Environment (CCE), the Arm Compiler for Linux, and the NVIDIA compiler, and therefore focus on GCC. Our findings include key opportunities for improvements that were discovered during our benchmarking, evaluation and regression testing on the system as we transitioned the service into operations from January 2025. Separating concerns: Decoupling the Slingshot Fabric Manager from Cray System Management Riccardo Di Maria and Chris Gamboni (Swiss National Supercomputing Centre), Davide Tacchella and Isa Wazirzada (HPE), and Mark Klein (Swiss National Supercomputing Centre) Abstract Abstract The Alps research infrastructure consists of 32 HPE Cray EX cabinets, which are managed by Cray System Management (CSM). A critical component of this system is the Slingshot fabric manager, which is responsible for managing the high-speed network fabric.
Presently, the fabric manager is deployed as a Kubernetes Pod and runs amongst other services on the system management nodes. An ongoing effort aims to separate the fabric manager from Kubernetes and deploy it on bare-metal hardware. The architectural decision-making process is examined in detail, accompanied by a walkthrough of the newly proposed design. The discussion of the design is framed within the context of key quality attributes, including reliability, resiliency, availability, observability, and performance. Subsequently, the focus transitions from the "what" to the "how," providing a comprehensive overview of the execution of the migration of the fabric manager from a Kubernetes-based deployment to a bare-metal environment. Insights are presented regarding aspects that were successful, challenges encountered, and whether the overall outcome of this effort achieved the intended objectives. Paper, Presentation Technical Session 5B: Maintaining Large Systems Session Chair: Aaron Scantlin (National Energy Research Scientific Computing Center) Hardware Triage Tool: Enhancements and Extensions Isa Muhammad Wazirzada, Abhishek Mehta, Vinanti Phadke, and Bhuvan Meda Rajesh (HPE) Abstract Abstract In 2023, HPE released the Hardware Triage Tool (HTT) with the mission to provide high-fidelity diagnoses and minimize time to repair for hardware faults across HPE Cray EX compute and accelerator blades, irrespective of the system manager being used. Detecting operating system noise with detect-detour Nagaraju KN, Clark Snyder, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract HPC applications, especially those that frequently perform global synchronization operations, can be negatively affected by background operating system (OS) activity. The background actions of interest are processing hardware interrupts, software interrupts, and process context switches. While these actions are necessary to the operation of the OS, from the application's point of view they are viewed as "OS noise" that affects performance, and the system should be tuned to minimize them. Identifying sources of OS noise is crucial for application performance but can be difficult. Few options exist to identify sources of OS noise without getting into the intricacies of the underlying kernel internals. The detect-detour tool makes use of the Linux kernel's extended Berkeley Packet Filter (eBPF) [1] feature to help system administrators identify sources of OS noise without requiring them to be kernel experts. Analyzing a Lifetime of Failures on a Cray XC40 Supercomputer Kevin Brown and Tanwi Mallick (Argonne National Laboratory), Zhiling Lan (University of Illinois Chicago), Robert Ross (Argonne National Laboratory), and Christopher Carothers (Rensselaer Polytechnic Institute) Abstract Abstract We analyze hardware errors over the seven-year lifetime of the Theta supercomputer, a large-scale Cray XC40 system at the Argonne Leadership Computing Facility. To ensure accurate interpretation of the logs, we leverage expert knowledge to clean the dataset and remove redundant information. Temporal and spatial analysis techniques are then used to expose how failures and errors trend over time and across components in the system. Additionally, we correlate hardware error logs with system downtime logs to capture the relationship between critical errors and outages over the lifetime of the system.
The results in this work represent a state-of-the-practice report highlighting how severe error types vary over time and across different component types, such as on-node and off-node (network) components. We also demonstrate the effectiveness of our technique in simplifying log analysis by using a unified error classification across components from different vendors, providing valuable insights into normal and anomalous system behaviors. Paper, Presentation Technical Session 5C: Filesystems & I/O Session Chair: Raj Gautam (ExxonMobil) E2000 Performance From Microbenchmarks to Applications William Loewe, Michael Moore, Sakib Samar, and Chris Walker (HPE) Abstract Abstract As the exascale age advances and FLOPS performance continues to grow, the associated I/O demands increase commensurately. To address this, the HPE Cray Supercomputing Storage Systems E2000 is the next generation of the HPE Cray Supercomputing Storage product line, with a focus on performance. This paper discusses the architecture changes in the E2000 and provides node and file system microbenchmarks measuring bandwidth, IOPS, and metadata performance. The improved PCIe and NVMe drive speeds, in addition to the higher-density enclosure in the E2000-F, allow for more than twice the throughput and IOPS performance compared to the previous generation, with nearly all of the performance achievable by optimal application workloads. System configuration choices that influence system-level performance, such as the number of storage targets and BIOS settings, will be compared with an aim to optimize the gains and determine ideal client/server tunings. Finally, performance of application-relevant workloads, including random access, shared file, and AI/ML storage workloads, will be presented along with discussion of application and job changes to utilize the E2000 performance improvements. Towards Empirical Roofline Modeling of Distributed Data Services: Mapping the Boundaries of RPC Throughput Philip Carns, Matthieu Dorier, Rob Latham, Shane Snyder, and Amal Gueroudji (Argonne National Laboratory); Seth Ockerman (University of Wisconsin-Madison); Jerome Soumagne (HPE); Dong Dai (University of Delaware); and Robert Ross (Argonne National Laboratory) Abstract Abstract The scientific computing community relies on distributed data services to augment file systems and decouple data management functionality from applications. These services may be native to HPC or adapted from cloud environments, and they encompass diverse use cases such as domain-specific indexing, in situ analytics, AI data orchestration, and special-purpose file systems. They unlock new levels of performance and productivity, but also introduce new tuning challenges. In particular, how do practitioners assess performance, select deployment footprints, and ensure that services reach their full potential? Roofline models could address these challenges by setting practical performance expectations and providing guidance to achieve them. HPC workload characterization using eBPF Shubh Pachchigar and Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Brian Friesen (Lawrence Berkeley National Laboratory) Abstract Abstract Efficient interactions with filesystems are essential for scientific workflows operating at scale on HPC systems. In order to design new filesystems and tune system configurations, effective I/O characterization is needed.
Darshan is a widely used tool for I/O characterization that relies on injecting code into application binaries but has some limitations in providing low-level insights. In this work, we propose leveraging eBPF, which enables the execution of user-defined programs within the kernel, to develop a new I/O characterization tool. Our approach aims to complement the capabilities of Darshan by using eBPF to gain deeper insight into application interactions with the underlying filesystems. This is achieved by deploying dynamic instrumentation techniques below the application layer to extract detailed I/O metrics. In this work, we demonstrate the collection of read/write operations and their associated latencies across various available filesystems. The metrics are periodically sampled with a custom eBPF-based LDMS sampler to enable the collection of data at scale. Finally, to demonstrate its feasibility in production HPC environments, we establish that the overhead of the tool is low. This work demonstrates the potential of eBPF to enhance I/O characterization in HPC environments, providing valuable insights that can lead to improved performance and resource utilization. Paper, Presentation Technical Session 5A: Slingshot & MPI Tuning Session Chair: Brett Bode (National Center for Supercomputing Applications, University of Illinois) MPI implementation optimization for the Slingshot network Rahulkumar Gayatri, Adam Lavely, Neil Mehta, Brandon Cook, and Afton Geil (Lawrence Berkeley National Laboratory) Abstract Abstract Optimizing MPI performance is one of the keys to improving the performance of HPC applications. While algorithmic improvements such as overlap of communication and computation are used to improve MPI parallelism within a workflow, the choice of MPI implementation for a given application can also affect overall performance. We have characterized how OpenMPI, MPICH, and Cray MPICH perform on Perlmutter for small to moderate problem sizes using the OSU micro-benchmarks and have shown that the vendor-tuned Cray MPICH typically outperforms the other MPI implementations. We plan to expand this work to additional microbenchmarks, identify ways to improve the MPI implementations, and show how MPI performance differences impact the performance of full HPC applications in the final paper. Using Different MPI Implementations on HPE Cray EX Supercomputers for Native and Containerized Applications Execution Maciej Pawlik and Maciej Szpindler (Academic Computer Centre CYFRONET), Marcin Krotkiewski (University of Oslo), and Alfio Lazzaro (HPE) Abstract Abstract Message Passing Interface (MPI) implementations have to be tailored to specific system architectures to maximize application performance. This applies to optimizations for network transport and the provision of efficient data movement between CPU and GPU memories. The default MPI on HPE Cray EX systems such as LUMI is a proprietary, optimized implementation based on the open-source MPICH ch4 implementation. There are situations where having an alternative such as OpenMPI would benefit users whose applications or containers target OpenMPI and where some effort would be needed to change. Having alternative MPI implementations also allows for performance comparisons, investigating bugs, and checking new MPI functionalities.
In this paper, we report on our experience with installing containerized and native OpenMPI environments on LUMI, showing how users can build and run containers and get the expected performance. We show a performance comparison with respect to HPE Cray MPI executions using the OSU benchmarks and an example real-world application for the solution of Dirac equations using GPUs. Although we only refer to LUMI, similar concepts can be applied to other supercomputers. Scaling MPI Applications on Aurora Nilakantan Mahadevan (Hewlett Packard Enterprise); Premanand Sakarda (Intel Corporation); Scott Parker, Servesh Muralidharan, Vitali Morozov, and Victor Anisimov (Argonne National Laboratory); Huda Ibeid, Anthony-Trung Nguyen, and Aditya Nishtala (Intel Corporation); Larry Kaplan and Michael Woodacre (Hewlett Packard Enterprise); and Kalyan Kumaran and JaeHyuk Kwack (Argonne National Laboratory) Abstract Abstract The Aurora supercomputer, which was deployed at Argonne National Laboratory in 2024, is currently one of three exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes, each of which contains six Intel Data Center Max Series GPUs, Intel’s first data center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel’s first Xeon processor to contain HBM memory. To achieve exascale performance, the system utilizes the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is the largest deployment of the Slingshot fabric to date, with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel-powered nodes and the Slingshot network enabled Aurora to become the second fastest system on the Top500 list in June of 2024 and the fastest system on the HPL-MxP benchmark. The system is one of the most powerful systems in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design, with a particular focus on the network fabric and the approach taken to validating it. The performance of the system is demonstrated through the presentation of the results of MPI benchmarks as well as performance benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally, results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM, demonstrating that Aurora provides the throughput, latency, and bandwidth across the system needed to allow applications to perform and scale to large node counts, providing new levels of capability and enabling breakthrough science. Paper, Presentation Technical Session 6B: Framework for HPC-AI workflows Session Chair: Chris Fuson (Oak Ridge National Laboratory) Framework for tracking metadata, lineage and model provenance in hybrid simulation-AI HPC exascale workflows Martin Foltin, Andrew Shao, Rishabh Sharma, Shreyas Kulkarni, Annmary Justine Koomthanam, Aalap Tripathy, and Cong Xu (HPE); Wenqian Dong (Oregon State University); Suparna Bhattacharya (HPE); Brian Sammuli (General Atomics); and Paolo Faraboschi (HPE) Abstract Abstract Integration of AI in HPC workflows can have a profound impact on HPC scale and usability, for example, by accelerating simulations with surrogate models or intelligently steering simulations based on previous results.
New workflows are explored in which AI models are iteratively improved by continual learning to better reflect input data distributions and avoid outliers and drifts. Tracking of model provenance in these workflows is important to understand how new data affect model performance, to allow unwinding to previous iterations, and to provide a better understanding of the conditions under which AI models perform well for future reuse. This is more challenging in hybrid HPC-AI workflows because the lineage and provenance must be tracked across multiple software components at different levels of scale. In this work, we extend the HPE Common Metadata Framework (CMF) to hybrid simulation-AI workflows. We demonstrate the benefits of CMF tracking across simulation, AI training, and inference alongside the HPE SmartSim system on a simple computational fluid dynamics problem with eddy kinetic energy parameterized by AI. We track out-of-distribution data for continuous learning and employ adaptive switching between different models to improve the quality of results. We are working with the fusion energy and materials science communities to enhance their workflows in a similar fashion. Search and Query Framework for Workflows with HPC and AI Models Christopher Rickett, Sreenivas Sukumar, and Karlon West (HPE) Abstract Abstract Modern computational science workflows increasingly involve complex, interactive, and iterative search through data from simulations of physics-based equations coupled with analytic, predictive, generative, and agentive tasks. Unfortunately, there are no query engines that empower scientists to search through scientific data with AI, analytic, and physics-based models the way SQL query engines search structured data or keyword-search/prompt engines search textual data. FirecREST v2: Lessons Learned from Redesigning an API for Scalable HPC Resource Access Elia Palme and Juan Pablo Dorsch (CSCS - ETH Zurich); Ali Khosravi and Giovanni Pizzi (PSI Center for Scientific Computing, Theory, and Data); and Francesco Pagnamenta, Andrea Ceriani, Eirini Koutsaniti, Rafael Sarmiento, Ivano Bonesana, and Alejandro Dabin (CSCS - ETH Zurich) Abstract Abstract We introduce FirecREST v2, the next generation of our open-source RESTful API for programmatic access to HPC resources. FirecREST v2 delivers a ~100x performance improvement over its predecessor. This paper explores the lessons learned from redesigning FirecREST from the ground up, with a focus on integrating enhanced security and high throughput as core requirements. We provide a detailed account of our systematic performance testing methodology, highlighting common bottlenecks in proxy-based APIs with intensive I/O operations. Key design and architectural changes that enabled these performance gains are presented. Finally, we demonstrate the impact of these improvements, supported by independent peer validation, and discuss opportunities for further improvements. Paper, Presentation Technical Session 6C: Programming Models Session Chair: Benjamin Cumming (CSCS, ETH Zurich) Designing GPU-aware OpenSHMEM for HPE Cray EX and XD Systems Danielle Sikich, Naveen Namashivayam Ravichandrasekaran, Md Rahman, Elliot Joseph Ronaghan, Nathan Wichmann, and William Okuno (HPE) Abstract Abstract OpenSHMEM is a Partitioned Global Address Space (PGAS) based library interface specification. It is the culmination of a standardization effort among many implementers and users of the SHMEM programming model.
The existing OpenSHMEM specification is not GPU-aware: the programming model does not support managing data movement operations involving GPU-attached memory buffers. However, OpenSHMEM users are exploring options to enable the execution of their data-driven workloads on heterogeneous system architectures. Quantifying Message Aggregation Optimisations for Energy Savings in PGAS Models Aaron Welch and Oscar Hernandez (Oak Ridge National Laboratory) and Stephen Poole and Wendy Poole (Los Alamos National Laboratory) Abstract Abstract Having broken past the exascale barrier, HPC systems are facing their greatest challenge yet: a power wall that must be addressed through new methods in both hardware and software. While energy costs are becoming a major issue at all levels, of particular concern is that of the network, as the relative cost of moving data is increasing faster than ever. The partitioned global address space (PGAS) model is critical within certain HPC domains, but is known to suffer from the small-message problem, where irregular many-to-many access patterns congest the network with excessive numbers of small messages. To address this, the conveyor aggregation library was developed to defer individual messages and group them for subsequent bulk processing. In this paper, we investigate its impact on energy use related to the network, with a focus on the Slingshot 11 interconnect. We will demonstrate that this strategy is not only highly performant, but also crucial to reducing energy footprints to remain within target power envelopes. Accelerating LArTPC Simulations: Enhancing larnd-sim with GPU Optimization Techniques Madan Timalsina (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Matt Kramer (Lawrence Berkeley National Laboratory); Pengfei Ding (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Ronan Doherty (Trinity College Dublin); Rishabh Dave (UC Berkeley); Nicholas Tyler, Urjoshi Sinha, and William Arndt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); and Callum Wilkinson (Lawrence Berkeley National Laboratory) Abstract Abstract Advancements in general-purpose computing on GPUs have enabled highly parallelized Monte Carlo simulations for particle physics experiments, including for the Deep Underground Neutrino Experiment (DUNE), which will use the world's most powerful neutrino beam to study the properties of these elusive particles. Here, we present our efforts on the optimization of larnd-sim, a microphysical simulation for liquid argon time projection chambers (LArTPCs) with light and pixelated charge readout, originally developed for the DUNE Near Detector (ND-LAr). Implemented in Python and utilizing GPU acceleration via Numba and CuPy, larnd-sim processes energy depositions from Geant4 to simulate physical phenomena (such as ionization electron drift) and the response of the detector electronics. By profiling with NVIDIA tools and optimizing memory transfers, adjusting register counts, tuning block and grid dimensions, altering floating-point precision, enabling the "fastmath" option for transcendental functions, converting arrays to a jagged format, and tuning CUDA kernels, we achieved an over-50% GPU-memory reduction, a ~30% wallclock speed improvement, and individual kernel speedups of 10–500%.
In addition to these ongoing tests on the NERSC Perlmutter supercomputer, we are working with collaborators at ANL to run these simulations on the Polaris machine, further expanding larnd-sim's reach. Paper, Presentation Technical Session 6A: DAOS Session Chair: Jesse A. Hanley (Oak Ridge National Laboratory) DAOS - New Horizons for High Performance Storage Michael Hennecke and Jerome Soumagne (HPE) Abstract Abstract DAOS is an open-source scale-out storage system that has redefined performance for a wide spectrum of HPC and AI workloads (https://daos.io/). It is an all-flash solution that can be deployed as a stand-alone storage system, or it can be a high-performance storage tier used in combination with traditional Lustre, GPFS, or (cloud) object storage environments. Enhancing RPC on Slingshot for Aurora’s DAOS Storage System Jerome Soumagne, Alexander Oganezov, Ian Ziemba, and Steve Welch (HPE); Philip Carns and Kevin Harms (Argonne National Laboratory); and John Carrier, Johann Lombardi, Mohamad Chaarawi, Zhen Liang, and Scott Peirce (HPE) Abstract Abstract DAOS, an open-source software-defined high-performance storage solution designed for massively distributed NVMe SSDs and Non-Volatile Memory (NVM), is a key component of the Aurora Exascale system that aims to deliver high storage throughput and low latency to application users. Utilizing the Slingshot interconnect, DAOS leverages Remote Procedure Calls (RPCs) to communicate between compute and storage nodes. While the preexisting RPC mechanism used by DAOS was already designed for High-Performance Computing (HPC) fabrics, it required a number of scalability, performance, and security enhancements in order to be successfully deployed on Aurora. Global Distributed Client-side Cache for DAOS Clarete R. Crasta, John L Byrne, Abhishek Dwaraki, David Emberson, Harumi Kuno, Sekwon Lee, Ramya Ahobala Rao, Shreyas Vinayaka Basri K S, Amitha C, Chinmay Ghosh, Rishi Kesh Kumar Rajak, Sriram Ravishankar, Porno Shome, and Lance Evans (HPE) Abstract Abstract HPC/AI workloads process large amounts of data and perform complex operations on the data at exascale rates for time-critical insights and results. Distributed workloads are often bottlenecked by communication when storage systems are used to coordinate and share results. Storage solutions supporting effective, scalable parallel access from compute clusters are critical to HPC architectures. Caching data on storage servers and/or clients is a known technique used by storage systems to ameliorate communication costs. Current server-side caching methodologies are constrained by the amount of memory and network bandwidth on the fixed and finite set of server nodes. Furthermore, most client-side caches are node-local, meaning the cached data is accessible solely by the node on which the data is stored. DAOS is a promising exascale storage stack recently acquired by HPE. Global client-side caching for DAOS is an attractive proposition due to the higher aggregate client-side resources (e.g., DRAM and network bandwidth) that can scale independently of the number of server nodes. In addition to providing faster data access, a client-side cache should also be efficient, as it consumes expensive resources and requires an efficient caching framework with associated policies. In this paper, we cover the details of realizing efficient shared client-side caching for DAOS.
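The client-side caching abstract above argues that keeping recently read objects on the client avoids repeated server round trips. A generic sketch of that idea follows (an LRU read cache in front of a remote fetch; the names are illustrative placeholders, not the DAOS API):

    from collections import OrderedDict

    # Generic sketch of a node-local read cache in front of a remote object
    # store. fetch_remote() stands in for a real storage call; it is a
    # placeholder, not an actual DAOS interface.
    class LRUReadCache:
        def __init__(self, fetch_remote, capacity=1024):
            self.fetch_remote = fetch_remote
            self.capacity = capacity
            self.entries = OrderedDict()
            self.hits = self.misses = 0

        def get(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)      # keep recently used entries hot
                self.hits += 1
                return self.entries[key]
            self.misses += 1
            value = self.fetch_remote(key)         # pay the network round trip once
            self.entries[key] = value
            if len(self.entries) > self.capacity:  # evict the least recently used
                self.entries.popitem(last=False)
            return value

    cache = LRUReadCache(fetch_remote=lambda k: f"object-{k}", capacity=2)
    for k in [1, 2, 1, 3, 1]:
        cache.get(k)
    print(cache.hits, cache.misses)   # 2 hits, 3 misses

The paper's contribution goes further: making such caches global and shared across clients so aggregate client DRAM and bandwidth scale independently of the server count, which a purely node-local structure like this sketch cannot do.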
Paper, Presentation Technical Session 7B: Access Nodes & Kubernetes Management Session Chair: Jim Williams (Los Alamos National Laboratory) Addressing Resource Constraints on Aurora with Admin Access Nodes Peter Upton, Ben Lenard, Ben Allen, and Cyrus Blackworth (Argonne National Laboratory) Abstract Abstract This paper presents Administrator Access Nodes (AANs) as an alternative to the traditional reliance on a single Admin Node for all aspects of system administration in an HPE Performance Cluster Manager (HPCM) managed supercomputer cluster. At the Argonne Leadership Computing Facility (ALCF), managing the Aurora supercomputer, a large HPE Cray EX system, requires a team of skilled developers and administrators. These professionals require access to many tools for tasks such as parsing log files, issuing power commands, and connecting to nodes via SSH. These tasks have typically been performed solely on the Admin Node. However, this centralization can lead to resource constraints due to simultaneous resource requirements by multiple administrators. To address these issues, the paper details the implementation and operation of AANs, including custom tools for interacting with HPCM, scripts to replicate some Admin Node functionality on AANs, and synchronization tools for configuration files. The introduction of AANs has alleviated resource constraints, streamlined workflows, and enhanced system manageability. Possible future work is also discussed, focusing on further integrating HPCM's APIs and improving usability, aiming to enhance AAN capabilities and administrative efficiency for Aurora's complex environment. HPE Slingshot in the Kubernetes Ecosystem Caio Davi and Jesse Treger (HPE) Abstract Abstract The convergence of traditional HPC systems with AI increases expectations for supercomputing sites to deliver new capabilities (beyond traditional batch scheduling, single tenancy, and bare-metal application deployment methodologies) for more dynamic provisioning. Convergence with enterprise cloud computing techniques such as containerized applications and Kubernetes has become a priority. But transitioning high-performance computing (HPC) environments and applications to Kubernetes is complex because of the critical requirement to maintain low-latency networking for high performance. In this context, we have HPE Slingshot, a modern high-performance interconnect for HPC and AI clusters that delivers industry-leading performance, bandwidth, and low latency for HPC, AI/ML, and data analytics applications through innovations in the fabric to overcome congestion and innovations in the NIC to significantly offload communications and message processing from the hosts. Because the HPE Slingshot NICs run native Ethernet alongside their optimized RDMA transport and connectionless protocol, ensuring that the RDMA transport is operating as intended is critical to delivering the high performance expected in HPC and AI. This requires careful configuration of Kubernetes because, if not configured correctly, the system can fall back to standard TCP/IP over Ethernet, forfeiting the expected HPC and AI performance. Our proposed solution is composed of a number of Kubernetes components, such as device plugins, CNIs, an Operator, and admission policies.
These contributions represent a significant advancement in deploying and operating HPC applications within containerized environments, offering a robust framework for future developments in distributed computing and ensuring both high performance and ease of management for the continuing convergence of HPC/AI and cloud computing and the coming transition from siloed HPC interconnects to interoperable Ultra Ethernet transport. Building non-standard images for CSM systems Harold Longley, Isa Wazirzada, Dennis Walker, Andy Warner, and Davide Tacchella (HPE) Abstract Abstract HPC scientists increasingly must innovate using diverse toolchains crafted from various Linux distributions, ensuring they meet individual and project-specific needs to tackle complex challenges. Paper, Presentation Technical Session 7C: Application Performance Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Task-decomposed Overlapped Pressure Preconditioner for Sustained Strong Scalability on Accelerated Exascale Systems Niclas Jansson (KTH Royal Institute of Technology) Abstract Abstract Computational Fluid Dynamics is a natural driver for exascale computing, with a virtually unbounded need for computational resources for accurate simulation of turbulent fluid flow, both for academic and engineering usage. However, with exascale computing capabilities on the horizon, we have seen a transition to more heterogeneous computer architectures with various accelerators. While these systems offer high theoretical peak performance and high memory bandwidth, complex programming models and significant programming investment are necessary to exploit them efficiently. We detail our work on improving the performance and scalability of key numerical methods in the high-fidelity spectral element code Neko on accelerated exascale machines. Efficient preconditioners are essential in incompressible fluid dynamics; however, the most efficient method (with respect to convergence) might be challenging to implement with good performance on an accelerator. We present our development of a GPU-optimised preconditioner with task overlapping for the pressure-Poisson equation, improving the preconditioner's throughput (in TDoF/s) by more than 61%. The new preconditioner is explained in detail, together with detailed performance studies on Cray EX platforms, including strong scalability studies on Frontier, a performance comparison between AMD and NVIDIA accelerated nodes, and an assessment of the feasibility of mixing both node types in a single simulation. Supernovae in HPC: Benchmarking FLASH Across Advanced Computing Clusters Joshua Martin, Eva Siegmann, and Alan Calder (Stony Brook University, Institute of Advanced Computational Science) Abstract Abstract Astrophysical simulations are highly demanding in terms of computation, memory, and energy, requiring new advancements in hardware. Stony Brook University recently expanded its "SeaWulf" computing cluster by adding 94 new nodes with Intel Sapphire Rapids Xeon Max series CPUs. This benchmarking study evaluates the performance and power efficiency of this new hardware using FLASH: a multi-scale, multi-physics software instrument that utilizes adaptive mesh refinement. Our study also compares the performance of Stony Brook's Ookami testbed, which features ARM-based A64FX-700 processors, as well as SeaWulf’s existing AMD EPYC Milan and Intel Skylake nodes.
The focus of our simulation is the evolution of a bright stellar explosion known as a thermonuclear (Type Ia) supernova, a complex 3D problem that incorporates various operators for hydrodynamics, gravity, nuclear burning, and routines for the material equation of state. We perform strong-scaling tests on a 220 GB problem size and assess both single-node and multi-node performance. We analyze the performance of various MPI mappings and processor distributions across nodes. From our strong-scaling tests, we determine the optimal configuration for balancing runtime and energy consumption for our application. Expanding Community Access to Real-World HPC Application I/O Characterization Data Using Darshan Shane Snyder, Philip Carns, Robert Ross, Robert Latham, and Kevin Harms (Argonne National Laboratory) Abstract Abstract HPC systems are deployed with massive, distributed storage subsystems to meet the demands of data-intensive applications. While these storage systems offer impressive peak performance, it is often only attainable in idealized scenarios not reflective of production workloads. In general, there continues to be a lack of community understanding of the I/O performance characteristics of real-world applications. Paper, Presentation, Birds of a Feather Technical Session 7A: AI/ML GPU Workloads Session Chair: Raj Gautam (ExxonMobil) Porting Radio Astronomy Correlation to Setonix, an HPE Cray EX system powered by AMD GPUs Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre, Curtin Institute for Radio Astronomy); Marcin Sokolowski (Curtin Institute of Radio Astronomy); Christopher Harris (Pawsey Supercomputing Research Centre); and Daniel Price and Randal Wayth (SKAO) Abstract Abstract In low-frequency radio astronomy, correlation of the signals coming from hundreds of radio antennas is an early and fundamental step in creating science-ready data products such as images of the sky at radio wavelengths. Because of the high volume of data to process and the rate at which it is produced, correlation is usually performed in real time by dedicated hardware, an FPGA or GPU cluster, installed near the telescope. However, there are science cases where an astronomer would like to correlate data later with customised settings such as time and frequency averaging of signals. Setonix, Pawsey Supercomputing Centre’s HPE Cray EX supercomputer based on AMD CPUs and GPUs, provides radio astronomers with enough computational power for such processing, but the only established GPU correlator works only on NVIDIA GPUs and proved hard to port. In this paper we discuss the process of providing Australian astronomers with an implementation of the correlation algorithm that harnesses the computational power of Setonix. Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers Bishwo Dahal (University of Louisiana Monroe, Oak Ridge National Laboratory) and Elijah Maccarthy and Subil Abraham (Oak Ridge National Laboratory) Abstract Abstract Containers are transforming scientific computing by simplifying the packaging and distribution of applications. This enables researchers to create and deploy their applications in isolated environments with all necessary dependencies, enhancing portability and deployment flexibility.
These advantages make containers especially suitable for High Performance Computing (HPC) facilities like the Oak Ridge Leadership Computing Facility (OLCF), where complex scientific applications are being developed and deployed. In this work, we investigate the performance of containerized machine learning (ML) applications in comparison to bare-metal execution on the Frontier exascale supercomputer. Specifically, we aim to determine whether ML models, when trained and tested within containers on Frontier using Apptainer, exhibit performance similar to that of bare-metal implementations. To achieve this, we use containers to package and run Convolutional Neural Network (CNN)-based ML applications on the OLCF Frontier and Odo supercomputers and assess their performance against bare-metal runs. After conducting scalability tests across up to 30 nodes with 1,680 AMD EPYC CPU cores and 240 GPUs, we find that the performance of the containerized ML applications is on par with that of bare-metal runs. We apply the lessons learned from our containerized ML model to containerizing and evaluating the performance of LLMs like AstroLLaMA and CodeLLaMA on Frontier. BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI Tulsi Mishra, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract As the convergence of HPC and AI reshapes computational workflows, the complexity of managing hybrid environments has become a significant challenge for organizations. HPE Cray Supercomputing User Services Software (USS) offers a transformative approach to simplify, scale, and optimize workflows across HPC and AI landscapes. In this session, we will explore how USS aims to bridge the gap between traditional HPC workloads and AI-driven innovations, providing a unified platform for containerized environments, hybrid deployment orchestration, and energy-efficient operations. Plenary Plenary Session: CUG 2025 Welcome, Keynote Presentation Keynote: What I’ve Learned About Supercomputing from Blowing Up Stars, Michael Zingale (Stony Brook University) Michael Zingale (Stony Brook University) Abstract Abstract Stars shine throughout their lives by converting light elements into heavier elements via nuclear burning. While there are different pathways that low and high mass stars take in their evolution as they exhaust their fuel, explosions of both groups (or their remnants) are possible, leading to a wide range of stellar transients. Modeling these events requires capturing the interplay between hydrodynamics, nuclear reactions, gravity, radiation, rotation, and more physics. These models are also inherently multi-dimensional and span a vast range of timescales. Both algorithmic developments and leveraging of modern supercomputer architectures are key to performing accurate and efficient simulations of these explosions. In this talk, I will discuss some of the lessons I’ve learned from more than two decades of writing simulation codes for these problems. I will show examples of where new algorithms needed to be developed, instead of using general codes, and when complete rewrites of codes were needed to support new architectures. Finally, I will talk about how we will train our students to write the next generation of codes. New Member Site: Introducing LRZ Markus Michael Müller (LRZ) Abstract Abstract The Leibniz Supercomputing Centre (LRZ) was founded in 1962 as an institute of the Bavarian Academy of Sciences and Humanities (BAdW).
LRZ is one of the three German high-performance computing centres of the Gauss Centre for Supercomputing (GCS). Apart from providing compute services, LRZ also provides various IT services for universities and research institutes in Munich, among them the Munich Scientific Network, identity management, and many more. Plenary Plenary Session: Stony Brook LOC Welcome, HPE Update Altair: AI/ML Intelligent Scheduling for HPC with Altair® Bill Nitzberg (Altair Engineering, Inc.) Abstract Abstract What's possible when you start with a deep understanding of HPC usage patterns (via Altair InsightPro™) and use that data to build predictive AI models (via Altair RapidMiner®) to augment traditional HPC (via Altair HPCWorks®)? Better utilization (a lot better), faster turnaround (a lot faster), and more throughput (a lot more). NVIDIA HPC Software - Expanding HPC with Python & AI Becca Zandstein (NVIDIA) Abstract Abstract NVIDIA's HPC software enables developers to build applications that take advantage of every aspect of the hardware available to them: CPU, GPU, and interconnect. In this presentation you will learn the latest updates on NVIDIA's HPC software that is being used in HPC centers, including the latest AI for Science and Python products. This presentation will provide an overview of the NVIDIA HPC software stack, taking you from traditional HPC compilers and CPU-optimized libraries to Python HPC tooling and AI for Science software that you can start using today. Plenary, Paper Plenary Session: CUG Organizational Update and Best Paper Presentation Evolving HPC services to enable ML workloads on HPE Cray EX Stefano Schuppli, Fawzi Mohamed, Henrique Mendonca, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Torsten Hoefler, and Thomas Schulthess (Swiss National Supercomputing Centre) Abstract Abstract The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the Swiss AI community's early-access phase on Alps (2023) and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.
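Among the enhancements the best-paper abstract above lists is a utility to vet allocated nodes for compute readiness before a large ML job launches. A minimal sketch of that pattern follows (the probe command and Slurm usage are illustrative assumptions, not CSCS's actual utility):

    import subprocess

    # Sketch: run a cheap probe on every allocated node and flag stragglers
    # before launching the real workload. The probe command is a placeholder;
    # a real check might time a small GEMM or query the GPUs on each node.
    def vet_nodes(nodes, probe="hostname", timeout=30):
        bad = []
        for node in nodes:
            try:
                subprocess.run(
                    ["srun", "-w", node, "-N", "1", "-n", "1", probe],
                    check=True, capture_output=True, timeout=timeout,
                )
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                bad.append(node)   # node failed or hung: exclude it
        return bad

    if __name__ == "__main__":
        suspect = vet_nodes(["nid001234", "nid001235"])   # hypothetical node names
        if suspect:
            print("exclude before launch:", ",".join(suspect))

At the scale of 10,752 GPUs even rare per-node faults become near-certain per job, which is why such a screening pass pays for itself before long training runs.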
Alps, a versatile research infrastructure Maxime Martinasso (Swiss National Supercomputing Centre, ETH Zurich) and Mark Klein and Thomas Schulthess (Swiss National Supercomputing Centre) Abstract Abstract The Swiss National Supercomputing Centre (CSCS) has a long-standing tradition of delivering top-tier high-performance computing systems, exemplified by the Piz Daint supercomputer. However, the increasing diversity of scientific needs has exposed limitations in traditional vertically integrated HPC architectures, which often lack flexibility and composability. To address these challenges, CSCS developed Alps, a next-generation HPC infrastructure designed with a transformative principle: resources operate as independent endpoints within a high-speed network. This architecture enables the creation of independent tenant-specific and platform-specific services, tailored to diverse scientific requirements. Plenary, Vendor Plenary: Sponsors Talks, HPE 1-100 Linaro: Unlocking Exascale Debugging and Performance Engineering with Linaro Forge Rudy Shand (Linaro Ltd) Abstract Abstract Dive into the future of code development and see how Linaro Forge is reshaping what's possible in the world of parallel computing. Linaro Forge unveils the latest advancements: with Linaro DDT, MAP and Performance Reports, we're setting new standards in scalability and ease of use. Discover how these tools have become the go-to solution for developers seeking to push the boundaries of code optimization and performance engineering. Codee: A Tool to Enhance Correctness, Modernization, Security, Portability and Optimization in Fortran and C/C++ Software Applications Manuel Arenaz (Codee) Abstract Abstract Fortran/C/C++ developers are under constant pressure to deliver increasingly complex simulation software that is correct, secure and fast. It is critical to empower development teams with tools to automate code reviews, enforce compliance with industry standards, and prioritize reducing the risk of security vulnerabilities. Codee features unique capabilities for Deep Analysis of Fortran/C/C++ code, helping to catch bugs, enforce coding guidelines, modernize legacy code, ensure code portability, address security vulnerabilities, and optimize code efficiency. Codee provides automated checkers for the rules documented in the Open Catalog as well as AutoFix capabilities for semi-automatic source code rewriting, including modification of source code statements and insertion of OpenMP or OpenACC directives. Codee integrates seamlessly with popular editors, IDEs, version control systems and CI/CD frameworks, making it easy to incorporate into existing development workflows. Overall, developers who are actively writing, modifying, testing and benchmarking Fortran code will increase their productivity by using Codee. Developers, team leaders and managers will benefit from DevOps and DevSecOps best practices, mitigating risks, boosting productivity, and reducing costs. In this presentation we will also talk about how to use Codee in conjunction with the Cray tools, including compilers (CCE) and performance tools (e.g. CrayPat, Reveal). AMD: The Unreasonable Effectiveness of FP64 Precision Arithmetic Nicholas Malaya (AMD) Abstract Abstract The double-precision datatype, also known as FP64, has been a mainstay of high-performance computing (HPC) for decades. Recent advances in AI have extensively leveraged reduced precision, such as FP16 or, more recently, FP8 for DeepSeek.
Many HPC teams are now exploring mixed and reduced precision to see if significant speed-ups are possible in traditional scientific applications, including methods such as the Ozaki scheme for emulating FP64 matrix multiplications with INT8 datatypes. In this talk, we will discuss the opportunities, and significant challenges, in migrating from double precision to reduced precision. Ultimately, AMD believes a spectrum of precisions is necessary to support the full range of computational motifs in HPC, and that native FP64 remains necessary in the near future. Plenary Plenary: CUG 2026, Panel New Member Site: Introducing Cyfronet Patryk Lasoń (Academic Computer Centre Cyfronet AGH) Abstract Abstract The Academic Computer Centre Cyfronet AGH is the longest-operating and one of the largest supercomputing and networking centres in Poland, with a history of providing access to supercomputing resources dating back to 1975. It operates the fastest supercomputers in Poland, as well as very high-capacity data storage systems and a MAN network. Cyfronet is the organiser and leader of the PLGrid Consortium, consolidating national computing resources, and is also actively involved in leading European projects related to the development of supercomputing technologies and services based on them. The centre also works with SMEs and large companies to enable the effective implementation of HPC (High-Performance Computing) and AI (Artificial Intelligence). VAST Data Platform Jan Heichler (VAST) Abstract Abstract The VAST Data Platform leverages its Disaggregated Shared Everything (DASE) architecture to seamlessly unify HPC and AI data management, offering a data platform that prioritizes high speed, scalability, and simplicity. By eliminating downtime during system upgrades and streamlining parallel I/O, VAST ensures continuous and efficient operation at any scale. Its support for multiple data protocols facilitates effortless integration into existing infrastructures, requiring no custom tuning or complex configurations. This session will explore how the VAST Data Platform addresses the evolving needs of modern technical environments, enabling enhanced performance and operational efficiency. Panel: The Future of Precision in HPC, which FP is the Right One? Ashley Barker (Oak Ridge National Laboratory) Abstract Abstract This panel will explore the evolving role of floating-point precision in high-performance computing (HPC) and AI workloads, analyzing the trade-offs between FP64 and emerging alternatives such as mixed-precision techniques, lower-precision formats, and emulation methods. The panelists will share a nuanced discussion on balancing accuracy, performance, and hardware constraints in modern HPC and AI workloads. Plenary CUG 2025 Closing Remarks Paper, Presentation Technical Session 1B: Workload manager Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University) Slinky: The Missing Link Between Slurm and Kubernetes Tim Wickberg (SchedMD LLC) Abstract Abstract Slinky is SchedMD's collection of projects to integrate the Slurm Workload Manager with the Kubernetes Orchestrator. How Best to Leverage Cloud for (Big) HPC Sites Bill Nitzberg and Ian Littlewood (Altair Engineering, Inc.) Abstract Abstract Cloud (finally) works for HPC, but the devil is still in the details. HPC in the Cloud has transitioned from "proof-of-concept engagements" to "hmm, but what about security" to "maybe, but what about the data" to "OK, but only if we carefully manage expenses".
Today, many big sites have voted with their dollars that on-premise HPC is here to stay, adding Cloud judiciously, with an eye towards resilience, and only where the end-to-end ROI makes sense. Divide and Rule: Automated Workload Distribution for Efficient User Support Services Luca Marsella (Swiss National Supercomputing Centre) Abstract Abstract User support services for High-Performance Computing (HPC) systems help users conduct simulations and optimize resources. The complexity of HPC platforms has increased, making user support more challenging. Site Reliability Engineering (SRE) best practices suggest that technical staff should focus on project work and automation to reduce repetitive tasks. Artificial intelligence and machine learning can help create effective knowledge bases from internal reports and user tickets, easing the burden on support staff. Paper, Presentation Technical Session 1C: Software deployment Session Chair: Chris Fuson (ORNL, Oak Ridge National Laboratory) Deploying and Tracking Software with NCCS Software Provisioning Asa Rentschler, Nicholas Hagerty, Elijah Maccarthy, and Edwin F. Posada Correa (Oak Ridge National Laboratory) Abstract Abstract The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory has a long history of deploying groundbreaking leadership-class supercomputers for the U.S. Department of Energy. The latest in this line of supercomputers is Frontier, the first supercomputer to break the exascale barrier (10^18 floating-point operations per second) on the TOP500 list. Frontier serves a wide array of scientific domains, from traditional simulation-based workloads to newer AI and Machine Learning workloads. To best serve the NCCS user community, NCCS uses Spack to deploy a comprehensive software stack of scientific software packages, providing straightforward access to these packages through Lmod Environment Modules. Maintaining a large software stack while also including multiple new compiler releases each year is a very time-consuming task. Additionally, it is not straightforward to provide a software stack alongside existing vendor-provided software such as the HPE/Cray Programming Environment (CPE), and existing CPE, Spack, and Lmod integration does not allow for multiple versions of GPU libraries such as AMD's ROCm to be used. To address these challenges and shortcomings, NCCS has developed the NCCS Software Provisioning tool (NSP), a tool for deploying and monitoring software stacks on HPC systems. NSP allows NCCS to quickly and effectively provision software stacks from the ground up using template-driven recipes and configuration files. NSP is successfully deployed on Frontier and several other NCCS clusters, enabling the NCCS software team to quickly deploy software stacks for newly-released compilers, expand current software offerings, better support GPU-based software, and monitor Lmod module usage to identify unused software packages that can be removed from the software stack. In this work, we discuss the shortcomings of the previous CPE, Spack, and Lmod usage at NCCS, provide further details on the implementation and structure of NSP, then discuss the benefits that NSP provides.
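As a rough illustration of the template-driven provisioning pattern that NSP adopts, the sketch below renders a recipe into standard Spack CLI calls and refreshes the Lmod modulefiles. The recipe schema and function names here are illustrative assumptions, not NSP's actual format; only the spack commands themselves are standard.

    import subprocess

    # Hypothetical recipe in the spirit of a template-driven approach:
    # one compiler toolchain plus the packages to build against it.
    recipe = {
        "compiler": "gcc@13.2.0",
        "packages": ["hdf5+mpi", "netcdf-c"],
    }

    def provision(recipe):
        # Build each package against the recipe's compiler via the Spack CLI.
        for pkg in recipe["packages"]:
            spec = f"{pkg} %{recipe['compiler']}"
            subprocess.run(["spack", "install", spec], check=True)
        # Regenerate Lmod modulefiles so the new stack is visible to users.
        subprocess.run(["spack", "module", "lmod", "refresh", "-y"], check=True)

    provision(recipe)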
Modern Software Deployment on a Multi-Tenant Cray-EX System Ben Cumming, Andreas Fink, Simon Pintarelli, and John Biddiscombe (CSCS) Abstract Abstract User-facing software -- libraries, tools, applications and programming environments tuned for the node and network architecture -- is a key part of HPC centers' service offering. Teams that maintain and support this software face challenges: providing a stable software platform for users with long-running projects while also providing the latest versions of software for developers; giving full responsibility to build, modify and deploy the whole software stack to staff who do not have root access; and achieving reproducible deployment based on GitOps practices. CSCS addresses these challenges on Alps by using small independent software environments called uenv, which deploy from text-file recipes without requiring installation of the Cray Programming Environment. This paper discusses installing communication libraries from HPE and NVIDIA with Slingshot support; the CI/CD pipeline that builds uenv and deploys them in a container registry; and the command line tools and SLURM plugin that interface users with the software environments. We demonstrate diverse use cases such as JupyterHub, summarize the user and support team experience, and document how to build and deploy CPE containers. Employing a Software-Driven Approach to Scalable HPC System Management Aaron Barlow (Oak Ridge National Laboratory) Abstract Abstract Managing Frontier and other HPE Cray and Apollo clusters at Oak Ridge National Laboratory involves thousands of users, projects, and security policies across multiple HPC systems. With diverse research needs, varying security enclaves, and massive resource allocations, manual processes don't scale, and administrative burden increases as HPE systems grow. To manage HPC systems at this scale, we developed RATS (Resource Allocation Tracking System), a software platform that centralizes operations. Paper, Presentation Technical Session 1A: Multitenancy Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Infrastructure as a Service with Strong Tenant Separation on a Supercomputer Riccardo Di Maria, Chris Gamboni, Manuel Sopena Ballesteros, Hussein Harake, Mark Klein, Marco Passerini, Miguel Gila, Maxime Martinasso, and Thomas C. Schulthess (Swiss National Supercomputing Centre) and Alun Ashton, Derek Feichtinger, Marc Caubet, Elsa Germann, Hans-Nikolai Viessmann, Achim Gsell, and Krisztian Pozsa (Paul Scherrer Institute) Abstract Abstract This paper explores the innovative implementation of Infrastructure-as-a-Service (IaaS) on an HPE Cray Shasta EX supercomputer. In cloud environments, IaaS offers scalable, on-demand access to virtualized resources. However, applying IaaS principles to high-performance computing (HPC) systems without relying on virtualization technologies poses some challenges, since they typically have a tightly coupled software stack. We address these challenges in a co-design partnership between an HPC provider, CSCS, and an end-user institution, PSI, by developing a suite of technologies for the HPE Cray Shasta EX system architecture that supports resource isolation and granular control. This approach not only provides the IaaS model on supercomputing environments but also enables dynamic resource management. Our contributions include a detailed exploration of the technological advancements necessary for integrating IaaS into HPC, together with the lessons learned from our collaborative efforts.
By extending IaaS capabilities to supercomputers, we aim to provide scientific institutions with unprecedented flexibility and control over their computational resources. Dynamic Network Perimeterization: Isolating Tenant Workloads With VLANs, VNIs, & ACLs Nikhil Mukundan, Dennis Walker, Stephen Han, Atif Ali, Siri Vias Khalsa, Amit Jain, Vishal Bhatia, and Vinay Karanth (HPE) Abstract Abstract There is a growing trend in the high-performance computing (HPC) community where separate user groups share HPC infrastructure with varying security clearances (tenants). In such cases, tenants require robust security boundaries to ensure data privacy, results integrity, and intellectual property secrecy. Additionally, sensitive transactions within a tenant may need to be further insulated from lower-clearance workloads. Join us as we show how product-agnostic, version-controlled configuration data can be used to dynamically isolate infrastructure resources supporting workloads, including compute node groups, data-at-rest (storage), and data-in-motion within high-speed and management networks. On the high-speed network (HSN), we'll examine how switch port VLAN filters and VNIs (traffic labels) isolate TCP/IP and RDMA traffic per tenant. On the management network, we'll demonstrate how to segment compute node groups via switch ACLs, VLANs, and iptables. Complete dynamic network segmentation will be applied at various levels of the infrastructure: chassis, nodes, and within the OS. Finally, we'll review architecture features in Slingshot, CSM, and other products that enable elastic tenant reallocation. We'll compare and contrast the differences in the number of configuration options and security posture when applying segmentation at the switch vs the node. CSCS' journey towards complete platform automation in a multi-tenant environment Miguel Gila, Ivano Bonesana, and Alejandro Dabin (Swiss National Supercomputing Centre, CSCS) Abstract Abstract The Swiss National Supercomputing Centre operates a complex ecosystem of high-performance computing resources. With a focus on scalability and efficiency, we have implemented a multi-tenancy approach to serve diverse scientific communities. This layered architecture encompasses infrastructure, platform, and application layers, each with unique automation challenges. Paper, Presentation Technical Session 2B: Security & Configuration Management Session Chair: Jim Williams (Los Alamos National Laboratory) Pragmatic Security Audits: Fortifying HPC Environments at a Consumable Pace Alden Stradling (Los Alamos National Laboratory) and Monica Dessouky and Dennis Walker (HPE) Abstract Abstract Do you know your security posture? Are you overwhelmed by the latest reports? Don't let a sea of security findings paralyze your progress. By establishing a frequent, recurring cadence for audits and remediation, organizations ensure continuous protection against emerging threats. This paper presents the practical, scalable approach to HPC security implemented at a recent customer site to secure its many new environments. Experimenting with Security Compliance Checking using ReFrame Victor Holanda Rusu, Matteo Basso, Chris Gamboni, Fabio Zambrino, and Massimo Benini (Swiss National Supercomputing Centre) Abstract Abstract Security is a critical aspect of High-Performance Computing (HPC) systems, where the implementation of security compliance checks and hardened configurations is essential to safeguard resources and data.
Continuous security checking is fundamental, especially for detecting indications of compromise, but its implementation must balance effectiveness and efficiency to avoid unnecessary strain. Open-source and freely available security-focused tools, such as OSCAP, are less well known and less accessible to engineers from other disciplines, who may not be familiar with their functionality or utility. This creates a barrier to collaborative efforts in improving system-wide configurations and promoting shift-left security in HPC centers. We leveraged ReFrame to perform robust security compliance testing to address these limitations. ReFrame enables the creation of customizable tests to evaluate system configurations, check for generic exploits, and execute tests in parallel, optimizing testing workflows without significant performance penalties. We will present the latest developments at CSCS, showcasing how we plan to use ReFrame to enhance security compliance testing in HPC environments using three different standards: STIG DoD, ANSSI BP-28-enhanced, and CSCS's own. We aim to create a community to develop, maintain, and benefit from a shared set of security checks tailored to HPC systems based on customer-specific, industry-specific, or government-mandated requirements. From Weeks to Hours: Harnessing Configuration Management and Deployment Pipelines Dennis Walker and Siri Vias Khalsa (HPE) and Alex Lovell-Troy (Los Alamos National Laboratory) Abstract Abstract Ensuring peak reliability, current functionality, and up-to-date security requires cultivating the capability to continuously update and integrate a complex array of dependencies, spanning hardware, firmware, system management software, network configurations, API services, OS distros, job schedulers, AI libraries, and analytics tools. This paper presents a simple yet contemporary DevOps methodology designed to automate, validate, and replicate changes effectively to one or many production environments. Rev Up Compute Node Reboots: 2x to 5x Faster Dennis Walker (HPE) and Paul Selwood (Met Office, UK / NERC CMS) Abstract Abstract Join us as we race to the bottom, showcasing the innovations developed to speed up reboot times by 300% for UK Met Office's latest, multi-zoned, CSM-based HPE Cray EX systems. With increasing complexity in software and site-specific customizations, node reboot times ballooned to over 35 minutes by November 2023, far exceeding operational requirements. In response, HPE developed automation to download logs, parse metrics, and graph boot stages by duration to understand better what was happening. Guided by data, the following changes were implemented, sorted by impact:
- CFS/Ansible run-time plays were moved into local systemd execution, running earlier in the boot cycle and without blocking other nodes.
- Software installation and select configuration activity were moved into the image build, streamlining deployment.
- CSM boot settings were tuned for optimal performance.
- Node Health Checks (NHC) were moved into systemd, to run preemptively before the job scheduler agent and ensure nodes are consistently job-ready as early as possible.
Paper, Presentation Technical Session 2C: Climate applications Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre) Bit-reproducibility in UK Met Office Weather and Climate Applications David Acreman (HPE) Abstract Abstract Weather and climate applications solve partial differential equations which are highly sensitive to small perturbations in model variables.
Changes in even the least significant bit of a variable can have an observable impact on scientific results ("Butterfly effect"). The nature of floating point arithmetic means that subtle changes to code or the order of a summation can change results at the bit level due to different round-off errors. Consequently, achieving bit-reproducible results is challenging. Enabling km-scale coupled climate simulations with ICON on AMD GPUs Jussi Enkovaara (CSC - IT Center for Science Ltd.) Abstract Abstract The Icosahedral Nonhydrostatic (ICON) weather and climate model is a modelling framework for numerical weather prediction and climate simulations. ICON is implemented mostly in Fortran 2008 with the GPU version based mainly on OpenACC. ICON is used on a large variety of hardware, ranging from classical CPU clusters to vector architectures and different GPU systems. In coupled simulations ICON can utilize heterogeneous architectures, e.g. the ocean runs on CPUs while the atmosphere runs concurrently on GPUs. MARBLChapel: Fortran-Chapel Interoperability in an Ocean Simulation Brandon Neth and Ben Harshbarger (HPE); Scott Bachman ([C]Worthy); and Michelle Mills Strout (HPE, University of Arizona) Abstract Abstract As the climate crisis continues to have widespread effects on the biosphere, scientists increasingly turn to computer modeling to understand the impacts of different interventions. Modeling one such intervention, ocean carbon dioxide removal, requires incorporating multiple sources of interaction (air-sea gas exchange, biogeochemical processes, etc.) and high spatial and temporal resolutions. To address the need for scalable and high-resolution simulations, scientists at [C]Worthy have written the core of an ocean modeling code using Chapel, a parallel programming language for writing high-performance, distributed programs. Although Chapel has enabled rapid development, an important library for modeling biogeochemical processes, MARBL, is written in Fortran. MARBL is a robust, stand-alone library and is used in several state-of-the-art models including MOM6, MPAS, POP, and ROMS. Rather than rewriting the MARBL library in Chapel, we use Chapel and Fortran's C interoperability to integrate MARBL into the distributed Chapel simulation. This allows us to reuse reliable scientific code while using Chapel to orchestrate parallelism. In this talk, we demonstrate how the distributed Chapel simulation sets up data structures needed by MARBL, calls out to the Fortran library, and brings results back to update the simulation. We show performance results on Perlmutter, an HPE Cray EX system. Redefining Weather Forecasting Systems: The Transition to ICON and Alps Mauro Bianco, Matthias Kraushaar, and Roberto Aielli (ETH Zurich); Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss); and Thomas Schulthess (ETH Zurich) Abstract Abstract The transition of MeteoSwiss operational weather forecasting from COSMO to the ICON model represents a major modernization in meteorological services, integrating software-defined infrastructure to improve flexibility, scalability, and resilience. The migration also involved significant hardware upgrades, from fixed systems with K80 GPUs to flexible architectures using V100 and later A100 GPUs, supported by the Alps infrastructure developed by CSCS, which is based on the HPE Cray EX product line.
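The round-off sensitivity described in the bit-reproducibility abstract above is easy to reproduce; a minimal demonstration with IEEE-754 doubles (pure Python, no assumptions beyond standard floating-point semantics):

    import random

    # Floating-point addition is not associative: regrouping changes the result.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- (b + c) rounds back to -1e16

    # The same effect appears when a reduction is evaluated in a different order,
    # which is why parallel sums are hard to make bit-reproducible.
    random.seed(42)
    vals = [random.uniform(-1.0, 1.0) for _ in range(100000)]
    s1 = sum(vals)
    random.shuffle(vals)
    s2 = sum(vals)
    print(s1 == s2, abs(s1 - s2))  # typically False, differing in the last bits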
Paper, Presentation Technical Session 2A: Slingshot Session Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois) The HPE Slingshot 400 Expedition Houfar Azgomi, Duncan Roweth, Gregory Faanes, and Jesse Treger (HPE) Abstract Abstract HPE Slingshot 400 is a high-performance interconnect for classic and AI supercomputing clusters. As the successor of HPE Slingshot, it comprises a PCIe Gen5 NIC (Cassini-2) and a 64-port switch (Rosetta-2), linking over standard 400 Gbps Ethernet physical interfaces, and enabling dragonfly and fat-tree networks with up to 260,000 endpoints. HPE Slingshot is currently deployed in 7 of the 10 largest supercomputers worldwide and dominates the top 3 as the interconnect solution for the El Capitan, Frontier, and Aurora machines. With such success, the Slingshot Transport protocol has become the cornerstone for HPC-optimized Ethernet networking standardization efforts led by the Ultra Ethernet Consortium (UEC). Growing around its foundational adaptive routing and congestion management feature set, the HPE Slingshot 400 interconnect doubles its predecessor's bandwidth with significant enhancements: exact match forwarding increasing route visibility across the cluster; dedicated ACL tables for security and cloud isolation; feature hardening flexibility with P4-programmability; and improved quality of service with 50% more traffic classes. It is supported across HPE's portfolio of rack- and chassis-based supercomputing platforms including HPE Cray XD, HPE Cray EX, and the latest HPE Cray GX. This paper presents the key features and some early performance results of Rosetta-2 and Cassini-2 devices. Introduction To HPE Slingshot NIC Libfabric Environment Variables Jesse Treger and Ian Ziemba (HPE) Abstract Abstract Libfabric, a high-performance fabric software library, provides a rich API for applications to communicate efficiently over various networking technologies. The Libfabric provider for the specific networking interface type translates API-requested communications into protocols that strive to make optimal use of the hardware. The HPE Slingshot NIC in particular has extensive hardware offloads to improve performance and reduce memory overhead. While Libfabric aims for seamless integration, achieving optimal performance may require users to configure environment variables to fine-tune the software for specific workloads and hardware setups. This presentation will demystify the role of environment variables in Libfabric, explaining trade-offs and why they matter to performance and stability under various conditions. We will begin with an overview of how the HPE Slingshot NIC uses Libfabric to optimize performance with various messaging requirements. Next, we will explore the most common environment variables users may need to adjust from default values, using examples learned on different applications, MPI middleware, processors, or job scale. Finally, we will briefly touch on some best-known methods to troubleshoot application failures that can be addressed with environment settings. Math in Your Network: Slingshot Hardware Accelerated Reductions Forest Godfrey and Duncan Roweth (HPE) Abstract Abstract In high performance computing applications, the use of collective operations such as reductions and barriers is commonplace.
The performance of collectives is critical to overall performance in many applications, especially those where collectives are an increasingly large part of the runtime as jobs scale. Collective operations are typically performed by software, requiring packets carrying contributions to the collective to travel all the way to endpoint memory and be acted upon, only for the result to have to transit back out to the network. This occurs at each level of a collective tree. By performing collective operations inside the network switch hardware itself, the round trips to memory are removed and significant improvements in latency can be achieved. The Slingshot Rosetta switch fabric supports the hardware acceleration of many collective operations such as barriers and 64-bit IEEE floating point reductions. Upcoming Slingshot software will enable this functionality and present it to the end user transparently through the industry-standard libfabric network communication library. This presentation will cover the details of this upcoming feature and how it can be used to accelerate applications. The implementation inside libfabric, interaction with the job scheduler and fabric manager, as well as initial benchmarking results will be discussed. Slingshot Host Software Ethernet Tuning Ravi Bissa, Ian Ziemba, Duncan Roweth, and Forest Godfrey (HPE) Abstract Abstract High-performing Ethernet is the cornerstone of Exascale supercomputers, enabling seamless communication, minimizing latency, and supporting massive scalability. Without robust Ethernet infrastructure, these systems cannot achieve their goal of solving the world's most complex computational problems efficiently and within reasonable time and energy budgets. Paper, Presentation Technical Session 3B: HPCM Session Chair: Matthew A. Ezell (Oak Ridge National Laboratory) A Brief Summary of the HPCM (HPE Performance Cluster Manager) Evolution Over Recent Releases Sue Miller, Lee Morecroft, and Peter Guyan (HPE) Abstract Abstract This presentation will cover the enhancements to the HPE Performance Cluster Manager (HPCM) over the current releases, from 1.11 to 1.13, including the patches to those versions that serve as the release mechanisms for these enhancements. System Visualization Using Rackmap Troy Dey and Peter Guyan (HPE) Abstract Abstract Understanding the health of an HPC system across various dimensions such as power, environment, fabric, and job performance is a challenging task, and continues to increase in difficulty as these systems become larger and more complex. To address this problem, the HPE Performance Cluster Manager (HPCM) now provides Rackmap, an extensible CLI tool capable of rendering telemetry data as a dense 2D representation of the physical layout of components within an HPC system. This dense display of information within the CLI allows a system administrator to view, for example, the power status of thousands of nodes instantly, on a single screen, without having to context switch to a separate application. In addition, system administrators can easily create their own maps to display information of interest to them such as whether nodes have passed acceptance tests. This presentation will provide an overview of the Rackmap tool, describe the maps currently being shipped with HPCM, and go over how system administrators can create their own maps.
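To give a feel for the kind of dense CLI display the Rackmap abstract above describes, here is a toy sketch that paints per-node power state as a rack-by-slot grid; the data layout and symbols are invented for illustration and are not Rackmap's actual map format:

    # Toy telemetry: power state per (rack, slot), normally fed by HPCM telemetry.
    node_power = {(rack, slot): ("off" if (rack + slot) % 7 == 0 else "on")
                  for rack in range(4) for slot in range(16)}

    SYMBOL = {"on": ".", "off": "X"}  # one character per node keeps the view dense

    for rack in range(4):
        row = "".join(SYMBOL[node_power[(rack, slot)]] for slot in range(16))
        print(f"rack{rack:02d}  {row}")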
Harvesting, Storing and Processing Data from our HPCM Systems Ben Lenard, Eric Pershey, Brian Toonen, Peter Upton, Doug Waldron, Lisa Childers, Micheal Zhang, and Bryan Brickman (Argonne National Laboratory) Abstract Abstract With the Argonne Leadership Computing Facility (ALCF) acquiring more HPE Supercomputers, each equipped with its own HPCM stack, alongside other operational programs, we devised a strategy to centralize monitoring data from these systems. This centralized system aggregates data from various sources and securely distributes it to different consumers, including various teams and platforms within the ALCF. Paper, Presentation Technical Session 3C: Future Technology Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Evolving Sarus to augment Podman for HPC on Cray EX Alberto Madonna, Gwangmu Lee, and Felipe Cruz (Swiss National Supercomputing Centre) Abstract Abstract Podman provides a modern and flexible containerization solution but lacks the specialized features required for high-performance computing. The evolution of the Sarus project aims to integrate Podman into a modular, open-source HPC container suite, bridging mainstream container technologies and supercomputing. This presentation highlights how Sarus deploys and optimizes Podman for HPC on CSCS's Alps infrastructure, a Cray EX system, focusing on the following areas: HPC-Optimized Podman: secure and scalable rootless containers for supercomputing environments with HPC-specific configuration templates. Workload Management Integration: seamless job orchestration of containerized workloads via a SLURM-compatible SPANK component. Transparent HPC Resource Access: Open Container Initiative (OCI) hooks and the Container Device Interface (CDI) provide pluggable access to compute, network, and storage resources on Cray EX systems. Parallel Filesystems Support: a Squashfs-based image store for efficient usage of HPC storage systems. Secure Multi-Tenancy: rootless subid synchronization for Podman on shared distributed systems. This presentation will include test results on Alps, demonstrating how Sarus enables Podman to handle containerized job submissions efficiently and seamlessly. By augmenting community container tools like Podman to meet HPC needs, Sarus delivers a modern and flexible container stack optimized for CSCS's vClusters architecture on Cray EX systems. What is RISC-V and why should we care? Nick Brown (EPCC) Abstract Abstract RISC-V is an open Instruction Set Architecture (ISA) standard which enables the open development of CPUs and a shared common software ecosystem. With billions of RISC-V cores already produced, and production accelerating rapidly, we are seeing a revolution driven by open hardware. Nonetheless, for all the successes that RISC-V has enjoyed, it is yet to become mainstream in HPC. This comes at a time when HPC is facing new challenges, especially around performance and sustainability of operations, and recent advances in RISC-V, such as data centre RISC-V hardware, make this technology a more realistic proposition with the potential to address these challenges. In this survey paper we explore the current state of the art of RISC-V for HPC, identifying areas where RISC-V can benefit the HPC community and the level of maturity of the hardware and software ecosystem for HPC, and identify areas where the HPC community can contribute. The outcome is a set of recommendations around where the HPC and RISC-V communities can come together and focus on high-priority action points to help increase adoption.
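For readers unfamiliar with the CDI mechanism mentioned in the Sarus abstract above, the sketch below shows the general shape of a CDI device request with rootless Podman; the image name and device identifier are placeholders, and the exact CDI names on a real system come from its CDI specification files:

    import subprocess

    # Request a GPU through the Container Device Interface (CDI). Podman resolves
    # "vendor.com/class=index" names against the CDI spec files installed on the
    # host; "nvidia.com/gpu=0" is a common convention, used here as a placeholder.
    subprocess.run([
        "podman", "run", "--rm",
        "--device", "nvidia.com/gpu=0",
        "ubuntu:24.04", "nvidia-smi",
    ], check=True)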
A Full Stack Framework for High Performance Quantum-Classical Computing Xin Zhan, K. Grace Johnson, and Soumitra Chatterjee (HPE); Barbara Chapman (HPE, Stony Brook University); and Masoud Mohseni, Kirk Bresniker, and Ray Beausoleil (HPE) Abstract Abstract To address the growing needs for scalable distributed High Performance Computing (HPC) and Quantum Computing (QC) integration, we present our HPC-QC full stack framework and its hybrid workload development capability with a modular, hardware/device-agnostic software integration approach. The latest developments in extensible interfaces for quantum programming, dispatching, and compilation within an existing mature HPC programming environment are demonstrated. Our HPC-QC full stack enables high-level, portable invocation of quantum kernels from commercial quantum SDKs within HPC meta-programs in compiled languages (C/C++ and Fortran) as well as Python through a quantum programming interface library extension. An adaptive circuit knitting hypervisor is being developed to partition large quantum circuits into sub-circuits that fit on smaller noisy quantum devices and classical simulators. At the lower level, we leverage the Cray LLVM-based compilation framework to transform and consume LLVM IR and Quantum IR (QIR) from commercial quantum software front-ends in a retargetable fashion for different hardware architectures. Several hybrid HPC-QC multi-node multi-CPU and GPU workloads (including solving linear systems of equations, quantum optimization, and simulating quantum phase transitions) have been demonstrated on HPE EX supercomputers to illustrate functionality and execution viability for all three components developed so far. This work provides the framework for a unified quantum-classical programming environment built upon the classical HPC software stack (compilers, libraries, parallel runtime and process scheduling). Paper, Presentation Technical Session 3A: Data Centers Session Chair: Lena M Lopatina (LANL) Causality inference for Digital Twins in GPU Data Centers and Smart Grids Rolando Pablo Hong Enriquez, Pavana Prakash, Ebad Taheri, and Aditya Dhakal (HPE); Matthias Maiterth and Wesley Brewer (Oak Ridge National Laboratory); and Dejan Milojicic (HPE) Abstract Abstract To the benefit of both technologies, data centers and smart grids will likely become ever more integrated in the near future. The downside is that effectively managing those systems will rapidly become burdensome if we neglect to prepare accordingly. Digital twins can potentially wrap the benefits of advanced analytics and visualization to manage such complex environments. Yet even today's AI systems lack a proper causal understanding of the data. Here we embark on a journey to collect proper causal data for validating causal inference methods based on three fundamentally different theoretical foundations: causal calculus, information theory, and dynamical systems theory. Subsequently, we apply such methods to two target datasets from a smart grid and a GPU data center. We finally analyze the successes and failures of applying these methodologies and the indications they offered to create more insightful and energy-efficient prediction strategies for digital twins in support of smart grids and GPU data centers.
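As a flavor of one of the three method families named in the causality abstract above, here is a minimal regression-based (Granger-style) test on synthetic data, where a lagged signal x drives y; this is an illustrative sketch, not the authors' implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    x = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):  # y is driven by its own past and by lagged x
        y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()

    # Restricted model: predict y[t] from y[t-1] alone.
    A1 = np.column_stack([y[:-1], np.ones(n - 1)])
    r1 = y[1:] - A1 @ np.linalg.lstsq(A1, y[1:], rcond=None)[0]

    # Full model: also include x[t-1]. A large drop in residual variance
    # indicates that lagged x adds predictive power ("Granger causality").
    A2 = np.column_stack([y[:-1], x[:-1], np.ones(n - 1)])
    r2 = y[1:] - A2 @ np.linalg.lstsq(A2, y[1:], rcond=None)[0]

    print(r1.var() / r2.var())  # >> 1 when x causally drives y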
AlpsB – a Geographically Distributed Infrastructure to Facilitate Large-Scale Training of Weather and Climate AI Models Alex Upton, Jerome Tissieres, and Maxime Martinasso (Swiss National Supercomputing Centre) Abstract Abstract AI-based models are transforming weather forecasting; these models are high quality and inexpensive to run compared to traditional physics-based models, and are already outperforming existing forecasting systems for many standard scores. The size of training datasets, however, remains a challenge. The widely-used ERA5, for example, is over 5PB, and these datasets are not typically located close to the large-scale compute power required for training AI models. As such, new solutions are required. Co-design, deployment and operation of a Modular Data Centre (MDC) with air and direct-liquid cooled supercomputers Sadaf Alam (University of Bristol); Emma Akinyemi, Martin Podstata, and Jan Over (HPE); and Simon McIntosh-Smith, Ross Barnes, Naomi Harris, and Dave Moore (University of Bristol) Abstract Abstract The Bristol Centre for Supercomputing (BriCS) deployed its first HPE modular data centre (MDC), also known as a Performance Optimised Data Centre (POD), in March 2024. This has been a collaborative, co-design project between HPE and the University of Bristol. The MDC has enabled the rapid commencement of operations for the research community for the direct liquid cooled (DLC) Isambard-AI phase 1 (HPE Cray EX2500) and the air-cooled Isambard 3 (HPE Cray XD224), with NVIDIA Grace-Hopper and Grace-Grace superchips, respectively. The second set of MDCs has been deployed for Isambard-AI phase 2, containing 5,280 NVIDIA Grace-Hopper superchips in HPE Cray EX4000 DLC cabinets, together with the management and storage ecosystems. This manuscript outlines key features of the HPE POD MDCs for sustainability, efficiency, flexibility and observability in an era where data centre cooling and power needs are changing with growing demands for AI and HPC. We leverage community efforts, specifically the Energy Efficient High Performance Computing Working Group (EE HPC WG), which aims to sustainably support science through committed community action by encouraging the implementation of energy conservation measures and energy-efficient design in HPC [1]. We outline notable advantages of the MDC approach for the constraints and requirements unique to the Isambard-AI project, which led to a co-design approach. We conclude by highlighting the key lessons drawn from this work. Paper, Presentation Technical Session 4B: GPU Energy Efficiency Session Chair: Maciej Cytowski (Pawsey Supercomputing Research Centre) Optimizing GPU Frequency for Sustainable HPC: Lessons Learned from a Year of Production on Adastra, an AMD GPU Supercomputer Gabriel Hautreux, Naïma Alaoui, and Etienne Malaboeuf (CINES) Abstract Abstract Power consumption is a critical concern for GPU-based high-performance computing (HPC) systems as rising energy costs and environmental challenges push for energy-efficient solutions. Modern GPUs, such as the AMD MI250X used in the Adastra supercomputer, offer features like frequency scaling to manage power consumption dynamically. However, optimizing frequency configurations for diverse HPC and AI workloads is complex due to their varied computational demands. CINES, operating the Adastra system in Montpellier, France, conducted a study analyzing the impact of reducing the GPU frequency from 1.7 GHz to 1.5 GHz.
Adastra, ranked #3 on the Green500 list of energy-efficient supercomputers, supports French researchers across scientific domains. Previous findings presented at SC23 showed that frequency downscaling improved energy efficiency while slightly impacting performance, prompting CINES to adopt the lower frequency in July 2024. A year-long analysis revealed a 15% reduction in energy consumption per node, aligning with sustainability goals without requiring hardware modifications. The study also assessed application performance, user satisfaction, hardware reliability, and differences between HPC and AI workloads. The results provide actionable insights for HPC centers aiming to enhance energy efficiency while avoiding the complexity and overhead of dynamic strategies. Fine-Grained Application Energy and Power Measurements on the Frontier Exascale System Oscar Hernandez and Wael Elwasif (Oak Ridge National Laboratory) Abstract Abstract The increasing complexity and power/energy demands of heterogeneous exascale systems, such as the Frontier supercomputer, present significant challenges for measuring and optimizing power consumption in applications. Current tools either lack the resolution to capture fine-grained power and energy measurements or fail to integrate this information with application performance events. This paper introduces a novel open-source performance toolkit that integrates extended PAPI components with Score-P to enable fine-grained millisecond-level power and energy measurements for AMD MI250X GPUs and CPUs. EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs Anna Yue, Torsten Wilde, Sanyam Mehta, and Barbara Chapman (HPE) Abstract Abstract The widespread adoption of GPUs, combined with the significant power consumption of GPU applications, makes a strong case for an effective power/energy saving tool for GPUs. Interestingly, however, GPUs present unique challenges (that are traditionally not seen in CPUs) towards this goal, such as very few available low-overhead performance counters and fewer optimization opportunities. We propose Everest, a proof-of-concept tool that dynamically characterizes applications to find novel and effective opportunities for power and energy savings while providing desired performance guarantees. Specifically, Everest finds two unique avenues for saving energy using DVFS (Dynamic Voltage and Frequency Scaling) in GPUs, in addition to the traditional method of lowering the core clock for memory-bound phases. Everest does not require any application modification or a priori profiling and has very low overhead. Everest relies on a single chosen performance event, available across both AMD and NVIDIA GPUs, which we show to be sufficient and effective for application characterization; this also makes Everest portable across GPU vendors. Experimental results of our PoC across 8 HPC and AI workloads demonstrate up to 25% energy savings while maintaining 90% performance relative to the maximum application performance, outperforming existing solutions on the latest NVIDIA and AMD GPUs. HPE Cray EX255a (MI300A) Blade Power Capping and HBM Page Retirement Steven Martin, Randy Law, Leo Flores, Ron Urwin, and Larry Kaplan (HPE) Abstract Abstract HPE Cray Supercomputing EX255a is a density-optimized, accelerated blade featuring eight AMD MI300A Accelerated Processing Units (APUs).
To deploy the HPE Cray EX255a blades at maximum density in the HPE Cray EX4000 cabinet, the nodes need to be power-capped to keep total cabinet power within the 400 kVA maximum cabinet power constraint. Managing node-level power to enforce the cabinet-level constraint while maximizing node-, cabinet-, and system-level performance drove the engineering team to a new power-capping design that will be described in this presentation. This new power-capping design is configured out-of-band via Redfish and is complementary to in-band capping that can be configured via rocm-smi. This presentation will show power and performance data collected on a large customer system and from a smaller system internal to HPE. This design is expected to be leveraged for future HPE Cray EX blades. Paper, Presentation Technical Session 4C: Monitoring Session Chair: David Carlson (Institute for Advanced Computational Science, Stony Brook University) Utilization and Performance Monitoring of Ookami, an ARM Fujitsu A64FX Testbed Cluster with XDMoD Nikolay A. Simakov, Joseph P. White, and Matthew D. Jones (SUNY University at Buffalo) and Eva Siegmann, David Carlson, and Robert J. Harrison (Stony Brook University) Abstract Abstract High-Performance Computing (HPC) resources are essential for scientific and engineering computations but come with substantial initial and operational costs. Therefore, ensuring their optimal utilization throughout their lifecycle is crucial. Monitoring utilization and performance helps maintain efficiency and proactively address user needs and performance issues. This is particularly important for technological testbed systems, where frequent software updates can mask localized performance degradations with improvements elsewhere. HPE Slingshot Monitoring Software: Actionable Insights for HPC and AI Systems Sahil Patel (HPE) Abstract Abstract Modern HPC and AI systems produce vast telemetry data, making performance monitoring and root cause analysis increasingly challenging. Traditional troubleshooting methods often lead to inefficiencies, lengthy resolution times, and costly downtime, and cannot meet the demands of today's high-performance computing environments. LDMS New Features for Deployment in Advanced Environments and Feedback for Operations Jim Brandt, Ben Schwaller, Jennifer Green, Ben Allan, Cory Lueninghoener, Evan Donato, Vanessa Surjadidjaja, Sara Walton, and Ann Gentile (Sandia National Laboratories) Abstract Abstract The Lightweight Distributed Metric Service (LDMS) monitoring, transport, and analysis framework has been deployed on large-scale Cray and HPE systems for over a decade. Over that time its capabilities have improved dramatically. In this talk we provide updates on capabilities including deployment and management methods in bare metal, containerized, and cloud (including hybrid on+off prem) environments. We describe how LDMS is being used to collect application data concurrent with system data, and how the low-latency availability of this data for analysis can be used for real-time data analysis and feedback in order to support efficient, resilient, and reliable system operations.
Finally, we will describe current related research areas, including 1) the use of machine learning for modeling application and system behavioral characteristics and 2) the use of new features in the bi-directional communication capability of LDMS to provide low-latency communication and feedback from a distributed analysis system to user, system, and application processes on disparate clusters and to inform data center orchestration decisions. Proactive Health Monitoring and Maintenance of High-Speed Slingshot Fabrics in HPC Environments Michael Cush, Jeff Kabel, Michael Schmit, Michael Accola, and Forest Godfrey (HPE) Abstract Abstract This whitepaper addresses the critical need for maintaining the health of high-speed Slingshot fabrics in high-performance computing (HPC) environments. Identifying and resolving known issues swiftly is essential for optimizing HPC workload performance, yet pinpointing common and emerging problems can be highly challenging. We propose a proactive solution that leverages automated capture of key configuration and performance metrics, coupled with sophisticated event logic, to detect unhealthy components and known bugs within the fabric. This is achieved through the System Diagnostic Utility (SDU), integrated with HPCM and CSM software, which automates data capture and securely transmits it to HPE using HPE Remote Device Access (RDA). This solution is complementary to other system monitoring solutions such as SMS and AIOps. In fact, this solution can also capture data from those tools to consolidate and enhance the level of data captured for analysis. Paper, Presentation Technical Session 4A: New Deployment Session Chair: Jim Rogers (Oak Ridge National Laboratory) A journey to provide GH200 Mark Klein, Thomas Schulthess, Jonathan Coles, and Miguel Gila (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Bringing a new hardware architecture to production involves a multi-phase process that ensures optimal performance, stability, and integration with existing infrastructure. The process begins with the physical installation of hardware and networking components. Once the hardware is in place, system engineers configure the operating system and necessary software stacks for performance monitoring and fault detection. This is followed by rigorous testing, including stress tests and benchmark runs, to verify the system's capabilities and identify any hardware or software anomalies. Evaluating AMD MI300A APU: Performance Insights on LLM Training via Knowledge Distillation Dennis Dickmann (Seedbox); Philipp Offenhäuser (HPE); Rishabh Saxena (HLRS, University of Stuttgart); George Markomanolis (AMD); Alessandro Rigazzi (HPE HPC/AI EMEA Research Lab); Patrick Keller (HPE); and Kerem Kayabay and Dennis Hoppe (HLRS, University of Stuttgart) Abstract Abstract AMD (Advanced Micro Devices) has recently launched the MI300A Accelerated Processing Unit (APU), which integrates Central Processing Unit (CPU) and Graphics Processing Unit (GPU) compute in a single chip with unified High Bandwidth Memory (HBM). This study assesses the capabilities of the AMD Instinct™ MI300A and examines how this new architecture handles real-world generative Artificial Intelligence (AI) workloads. While performance data for Large Language Model (LLM) use cases exists for the MI250X and MI300X, to the best of our knowledge, such assessments are absent for the MI300A.
We apply a Knowledge Distillation (KD) use case to distil the knowledge of the Mistral-Small-24B-Base-2501 teacher model into a student model that is 50% sparse using a 2:4 sparsity pattern. The results show quasi-linear scaling of raw performance on up to 256 APUs. Lastly, we discuss the challenges, research and practical implications, and outlook. Evaluation of the Nvidia Grace Superchip in the HPE/Cray XD Isambard 3 supercomputer Thomas Green and Sadaf Alam (University of Bristol) Abstract Abstract The Bristol Centre for Supercomputing (BriCS) has recently deployed 55,296 Arm Neoverse V2 CPU cores in a supercomputing platform, via 384 nodes of NVIDIA Grace CPU Superchips with LPDDR5 memory, as part of the Isambard 3 HPC service for the UK HPC research community. Isambard 3 is an HPE/Cray XD series system using the Slingshot 11 interconnect. As one of the first systems of this kind, this manuscript overviews details of the hardware and software configuration and presents early performance evaluation and benchmarking results using a representative subset of scientific applications. The focus is to evaluate Isambard 3 as a "plug-and-play" environment for researchers, especially those who are familiar with the Cray software environment. We include microbenchmark results to provide insights into the performance behaviour of this unique architecture. We present a small-scale scaling comparison between the NVIDIA Grace CPU Superchip and other mainstream CPUs, including Intel Sapphire Rapids and AMD Genoa and Bergamo. We report on issues encountered during attempts to use several major software toolchains available for Arm, such as the HPE Cray Compiler Environment (CCE), the Arm Compiler for Linux, and the NVIDIA Compiler, and therefore focus on GCC. Our findings include key opportunities for improvements that were discovered during our benchmarking, evaluation and regression testing on the system as we transitioned the service into operations from January 2025. Separating concerns: Decoupling the Slingshot Fabric Manager from Cray System Management Riccardo Di Maria and Chris Gamboni (Swiss National Supercomputing Centre), Davide Tacchella and Isa Wazirzada (HPE), and Mark Klein (Swiss National Supercomputing Centre) Abstract Abstract The Alps research infrastructure consists of 32 HPE Cray EX cabinets which are managed by Cray System Management (CSM). A critical component of this system is the Slingshot fabric manager, which is responsible for managing the high-speed network fabric. Presently, the fabric manager is deployed as a Kubernetes Pod and runs amongst other services on the system management nodes. An ongoing effort aims to separate the fabric manager from Kubernetes and deploy it on bare-metal hardware. The architectural decision-making process is examined in detail, accompanied by a walkthrough of the newly proposed design. The discussion of the design is framed within the context of key quality attributes, including reliability, resiliency, availability, observability, and performance. Subsequently, the focus transitions from the "what" to the "how," providing a comprehensive overview of the execution of the migration of the fabric manager from a Kubernetes-based deployment to a bare-metal environment. Insights are presented regarding aspects that were successful, challenges encountered, and whether the overall outcome of this effort achieved the intended objectives.
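The 2:4 structured sparsity pattern used in the MI300A knowledge-distillation study above keeps the two largest-magnitude weights in every group of four; a NumPy illustration of the pruning step (not the authors' code):

    import numpy as np

    def prune_2_4(w):
        # Zero the two smallest-magnitude entries in each group of four weights,
        # yielding the 50%-sparse 2:4 pattern that GPU sparse units can exploit.
        groups = w.reshape(-1, 4)
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]
        out = groups.copy()
        np.put_along_axis(out, drop, 0.0, axis=1)
        return out.reshape(w.shape)

    w = np.random.default_rng(1).standard_normal((4, 8))
    print(prune_2_4(w))  # exactly two non-zeros per group of four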
Paper, Presentation Technical Session 5B: Maintaining Large Systems Session Chair: Aaron Scantlin (National Energy Research Scientific Computing Center) Hardware Triage Tool: Enhancements and Extensions Isa Muhammad Wazirzada, Abhishek Mehta, Vinanti Phadke, and Bhuvan Meda Rajesh (HPE) Abstract Abstract In 2023, HPE released the Hardware Triage Tool (HTT) with the mission to provide high-fidelity diagnoses and minimize time to repair for hardware faults across HPE Cray EX compute and accelerator blades, irrespective of the system manager being used. Detecting operating system noise with detect-detour Nagaraju KN, Clark Snyder, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract HPC applications, especially those that frequently perform global synchronization operations, can be negatively affected by background operating system (OS) activity. The background actions of interest are processing hardware interrupts, software interrupts, and process context switches. While these actions are necessary to the operation of the OS, from the application's point of view they are viewed as "OS noise" that affects performance, and the system should be tuned to minimize them. Identifying sources of OS noise is crucial for application performance but can be difficult. Few options exist to identify sources of OS noise without getting into the intricacies of the underlying kernel internals. The detect-detour tool makes use of the Linux kernel enhanced Berkeley Packet Filter (eBPF) [1] feature to help system administrators identify sources of OS noise without requiring them to be kernel experts. Analyzing a Lifetime of Failures on a Cray XC40 Supercomputer Kevin Brown and Tanwi Mallick (Argonne National Laboratory), Zhiling Lan (University of Illinois Chicago), Robert Ross (Argonne National Laboratory), and Christopher Carothers (Rensselaer Polytechnic Institute) Abstract Abstract We analyze hardware errors over the seven-year lifetime of the Theta supercomputer, a large-scale Cray XC40 system at the Argonne Leadership Computing Facility. To ensure accurate interpretation of the logs, we leverage expert knowledge to clean the dataset and remove redundant information. Temporal and spatial analysis techniques are then used to expose how failures and errors trend over time and across components in the system. Additionally, we correlate hardware error logs to system downtime logs to capture the relationship between critical errors and outages over the lifetime of the system. The results in this work represent a state-of-the-practice report highlighting how severe error types vary over time and across different component types, such as on-node and off-node (network) components. We also demonstrate the effectiveness of our technique in simplifying log analysis by using a unified error classification across components from different vendors, providing valuable insights into normal and anomalous system behaviors. Paper, Presentation Technical Session 5C: Filesystems & I/O Session Chair: Raj Gautam (ExxonMobil) E2000 Performance From Microbenchmarks to Applications William Loewe, Michael Moore, Sakib Samar, and Chris Walker (HPE) Abstract Abstract With the advance of the Exascale Age and its continued gains in FLOPS performance, the associated I/O performance demands increase commensurately. To address this, the HPE Cray Supercomputing Storage Systems E2000 is the next generation of the HPE Cray Supercomputing Storage product line with a focus on performance.
This paper discusses the architecture changes in the E2000 and provides node and file system microbenchmarks measuring bandwidth, IOPS, and metadata performance. The improved PCIe and NVMe drive speeds, in addition to the higher-density enclosure in the E2000-F, allow for more than twice the throughput and IOPS performance compared to the previous generation, with nearly all of that performance achievable by optimal application workloads. System configuration choices, such as the number of storage targets and BIOS settings, which influence system-level performance, will be compared with the aim of optimizing the gains and determining ideal client/server tunings. Finally, performance of application-relevant workloads including random access, shared file, and AI/ML storage workloads will be presented, along with discussion of application and job changes to utilize the E2000 performance improvements. Towards Empirical Roofline Modeling of Distributed Data Services: Mapping the Boundaries of RPC Throughput Philip Carns, Matthieu Dorier, Rob Latham, Shane Snyder, and Amal Gueroudji (Argonne National Laboratory); Seth Ockerman (University of Wisconsin-Madison); Jerome Soumagne (HPE); Dong Dai (University of Delaware); and Robert Ross (Argonne National Laboratory) Abstract Abstract The scientific computing community relies on distributed data services to augment file systems and decouple data management functionality from applications. These services may be native to HPC or adapted from cloud environments, and they encompass diverse use cases such as domain-specific indexing, in situ analytics, AI data orchestration, and special-purpose file systems. They unlock new levels of performance and productivity, but also introduce new tuning challenges. In particular, how do practitioners assess performance, select deployment footprints, and ensure that services reach their full potential? Roofline models could address these challenges by setting practical performance expectations and providing guidance to achieve them. HPC workload characterization using eBPF Shubh Pachchigar and Brandon Cook (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Brian Friesen (Lawrence Berkeley National Laboratory) Abstract Abstract Efficient interactions with filesystems are essential for scientific workflows operating at scale on HPC systems. To design new filesystems and tune system configurations, effective I/O characterization is needed. Darshan is a widely used tool for I/O characterization that relies on injecting code into application binaries but has some limitations in providing low-level insights. In this work, we propose leveraging eBPF, which enables the execution of user-defined programs within the kernel, to develop a new I/O characterization tool. Our approach aims to complement the capabilities of Darshan by using eBPF to gain deeper insight into application interactions with the underlying filesystems. This is achieved by deploying dynamic instrumentation techniques below the application layer to extract detailed I/O metrics. In this work, we demonstrate the collection of read/write operations and their associated latencies across various filesystems. The metrics are periodically sampled with a custom eBPF-based LDMS sampler to enable the collection of data at scale. Finally, to demonstrate its feasibility in production HPC environments, we establish that the overhead of the tool is low.
This work demonstrates the potential of eBPF to enhance I/O characterization in HPC environments, providing valuable insights that can lead to improved performance and resource utilization. Paper, Presentation Technical Session 5A: Slingshot & MPI Tuning Session Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois) MPI implementation optimization for the Slingshot network Rahulkumar Gayatri, Adam Lavely, Neil Mehta, Brandon Cook, and Afton Geil (Lawrence Berkeley National Laboratory) Abstract Abstract Optimizing MPI performance is one of the keys to improving the performance of HPC applications. While algorithmic improvements such as overlap of communication and computation are used to improve MPI parallelism within a workflow, the choice of MPI implementation for a given application can also affect overall performance. We have characterized how OpenMPI, MPICH, and Cray MPICH perform on Perlmutter for small to moderate problem sizes using the OSU micro-benchmarks and have shown that the vendor-tuned Cray MPICH typically outperforms the other MPI implementations. We plan to expand this work with additional microbenchmarks, identify ways to improve the MPI implementations, and show how MPI performance differences impact the performance of full HPC applications for the final paper. Using Different MPI Implementations on HPE Cray EX Supercomputers for Native and Containerized Applications Execution Maciej Pawlik and Maciej Szpindler (Academic Computer Centre CYFRONET), Marcin Krotkiewski (University of Oslo), and Alfio Lazzaro (HPE) Abstract Abstract Message Passing Interface (MPI) implementations have to be tailored for specific system architectures to maximize application performance. This applies to optimizations for network transport and the provision of efficient data movement between CPU and GPU memories. The default MPI on HPE Cray EX systems such as LUMI is a proprietary, optimized implementation based on the open-source MPICH CH4 implementation. There are situations where having an alternative such as OpenMPI would benefit users whose applications or containers target OpenMPI and would require some effort to change. Having alternative MPI implementations also allows for performance comparisons, investigating bugs, and checking new MPI functionalities. In this paper, we report on our experience with installing containerized and native OpenMPI environments on LUMI, showing how users can build and run containers and get the expected performance. We show a performance comparison with respect to HPE Cray MPI executions using OSU benchmarks and an example real-world application for the solution of Dirac equations using GPUs. Although we only refer to LUMI, similar concepts can be applied to other supercomputers. Scaling MPI Applications on Aurora Nilakantan Mahadevan (Hewlett Packard Enterprise); Premanand Sakarda (Intel Corporation); Scott Parker, Servesh Muralidharan, Vitali Morozov, and Victor Anisimov (Argonne National Laboratory); Huda Ibeid, Anthony-Trung Nguyen, and Aditya Nishtala (Intel Corporation); Larry Kaplan and Michael Woodacre (Hewlett Packard Enterprise); and Kalyan Kumaran and JaeHyuk Kwack (Argonne National Laboratory) Abstract Abstract The Aurora supercomputer, which was deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list.
Scaling MPI Applications on Aurora Nilakantan Mahadevan (Hewlett Packard Enterprise); Premanand Sakarda (Intel Corporation); Scott Parker, Servesh Muralidharan, Vitali Morozov, and Victor Anisimov (Argonne National Laboratory); Huda Ibeid, Anthony-Trung Nguyen, and Aditya Nishtala (Intel Corporation); Larry Kaplan and Michael Woodacre (Hewlett Packard Enterprise); and Kalyan Kumaran and JaeHyuk Kwack (Argonne National Laboratory) Abstract Abstract The Aurora supercomputer, deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes, each of which contains six Intel Data Center Max Series GPUs, Intel's first data-center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel's first Xeon processor to contain HBM memory. To achieve Exascale performance, the system utilizes the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is the largest deployment of the Slingshot fabric to date, with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel-powered nodes and the Slingshot network enabled Aurora to become the second-fastest system on the Top500 list in June 2024 and the fastest system on the HPL-MxP benchmark. The system is one of the most powerful in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design, with a particular focus on the network fabric and the approach taken to validating it. The performance of the system is demonstrated through results of MPI benchmarks as well as system-level benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally, results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM, demonstrating that Aurora provides the throughput, latency, and bandwidth needed for applications to perform and scale to large node counts, delivering new levels of capability and enabling breakthrough science. Paper, Presentation Technical Session 6B: Framework for HPC-AI workflows Session Chair: Chris Fuson (Oak Ridge National Laboratory) Framework for tracking metadata, lineage and model provenance in hybrid simulation-AI HPC exascale workflows Martin Foltin, Andrew Shao, Rishabh Sharma, Shreyas Kulkarni, Annmary Justine Koomthanam, Aalap Tripathy, and Cong Xu (HPE); Wenqian Dong (Oregon State University); Suparna Bhattacharya (HPE); Brian Sammuli (General Atomics); and Paolo Faraboschi (HPE) Abstract Abstract Integration of AI into HPC workflows can have a profound impact on HPC scale and usability, for example, by accelerating simulations with surrogate models or intelligently steering simulations based on previous results. New workflows are being explored in which AI models are iteratively improved by continual learning to better reflect input data distributions and avoid outliers and drift. Tracking model provenance in these workflows is important to understand how new data affect model performance, to allow unwinding to previous iterations, and to provide a better understanding of the conditions under which AI models perform well for future reuse. This is more challenging in hybrid HPC-AI workflows because the lineage and provenance must be tracked across multiple software components at different levels of scale. In this work, we extend the HPE Common Metadata Framework (CMF) to hybrid simulation-AI workflows. We demonstrate the benefits of CMF tracking across simulation, AI training, and inference, alongside the HPE SmartSim system, on a simple computational fluid dynamics problem with eddy kinetic energy parameterized by AI. We track out-of-distribution data for continual learning and employ adaptive switching between different models to improve the quality of results. We are working with the fusion energy and materials science communities to enhance their workflows in a similar fashion.
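The following sketch illustrates the kind of stage/artifact lineage record such a framework maintains across a simulate-train-infer loop. It is not the CMF API; the record schema, content hashing, and stage names are assumptions made for illustration only.

```python
# Hypothetical lineage bookkeeping for a hybrid simulation-AI loop.
import hashlib
import json
import time

def artifact_id(path: str) -> str:
    """Content-address an artifact so lineage survives renames."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:16]

lineage = []

def record(stage, inputs, outputs, params):
    """Append one provenance record linking inputs to outputs by hash."""
    lineage.append({
        "stage": stage,                      # e.g. "simulation", "training"
        "inputs": {p: artifact_id(p) for p in inputs},
        "outputs": {p: artifact_id(p) for p in outputs},
        "params": params,
        "time": time.time(),
    })

# One loop iteration might be logged as (file names hypothetical):
#   record("simulation", ["mesh.h5"], ["fields.h5"], {"dt": 1e-3})
#   record("training", ["fields.h5"], ["model.pt"], {"epochs": 10})
#   record("inference", ["model.pt"], ["steering.json"], {})
#   json.dump(lineage, open("lineage.json", "w"), indent=2)
```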
Search and Query Framework for Workflows with HPC and AI Models Christopher Rickett, Sreenivas Sukumar, and Karlon West (HPE) Abstract Abstract Modern computational science workflows increasingly involve complex, interactive, and iterative search through data from simulations of physics-based equations coupled with analytic, predictive, generative, and agentive tasks. Unfortunately, there are no query engines that empower scientists to search through scientific data with AI, analytic, and physics-based models in the way that SQL query engines search structured data or keyword-search/prompt engines search textual data. FirecREST v2: Lessons Learned from Redesigning an API for Scalable HPC Resource Access Elia Palme and Juan Pablo Dorsch (CSCS - ETH Zurich); Ali Khosravi and Giovanni Pizzi (PSI Center for Scientific Computing, Theory, and Data); and Francesco Pagnamenta, Andrea Ceriani, Eirini Koutsaniti, Rafael Sarmiento, Ivano Bonesana, and Alejandro Dabin (CSCS - ETH Zurich) Abstract Abstract We introduce FirecREST v2, the next generation of our open-source RESTful API for programmatic access to HPC resources. FirecREST v2 delivers a ~100x performance improvement over its predecessor. This paper explores the lessons learned from redesigning FirecREST from the ground up, with a focus on integrating enhanced security and high throughput as core requirements. We provide a detailed account of our systematic performance-testing methodology, highlighting common bottlenecks in proxy-based APIs with intensive I/O operations. Key design and architectural changes that enabled these performance gains are presented. Finally, we demonstrate the impact of these improvements, supported by independent peer validation, and discuss opportunities for further improvements.
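For readers unfamiliar with programmatic HPC access of this kind, the sketch below shows a REST-style job submission and status poll in Python. The base URL, endpoint paths, and payload fields are illustrative assumptions, not the documented FirecREST v2 interface.

```python
# Hedged sketch of REST-based job submission to an HPC facility.
import requests

BASE = "https://firecrest.example.org"      # hypothetical deployment URL
TOKEN = "..."                               # OIDC bearer token (elided)
HDRS = {"Authorization": f"Bearer {TOKEN}"}

# Submit a batch job by posting a job script (endpoint path is assumed).
script = "#!/bin/bash\n#SBATCH -N 1\nsrun hostname\n"
r = requests.post(f"{BASE}/compute/jobs", headers=HDRS,
                  json={"system": "cluster", "script": script}, timeout=30)
r.raise_for_status()
job_id = r.json().get("jobid")

# Poll the job state through the same proxy API.
r = requests.get(f"{BASE}/compute/jobs/{job_id}", headers=HDRS, timeout=30)
print(r.json())
```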
Paper, Presentation Technical Session 6C: Programming Models Session Chair: Benjamin Cumming (CSCS, ETH Zurich) Designing GPU-aware OpenSHMEM for HPE Cray EX and XD Systems Danielle Sikich, Naveen Namashivayam Ravichandrasekaran, Md Rahman, Elliot Joseph Ronaghan, Nathan Wichmann, and William Okuno (HPE) Abstract Abstract OpenSHMEM is a Partitioned Global Address Space (PGAS) based library interface specification. It is the culmination of a standardization effort among many implementers and users of the SHMEM programming model. The existing OpenSHMEM specification is not GPU-aware: the programming model does not enable management of data-movement operations involving GPU-attached memory buffers. However, OpenSHMEM users are exploring options to run their data-driven workloads on heterogeneous system architectures. Quantifying Message Aggregation Optimisations for Energy Savings in PGAS Models Aaron Welch and Oscar Hernandez (Oak Ridge National Laboratory) and Stephen Poole and Wendy Poole (Los Alamos National Laboratory) Abstract Abstract Upon breaking past the exascale barrier, HPC systems are facing their greatest challenge yet: a power wall that must be addressed through new methods in both hardware and software. While energy costs are becoming a major issue at all levels, of particular concern is the network, as the relative cost of moving data is increasing faster than ever. The partitioned global address space (PGAS) model is critical within certain HPC domains, but is known to suffer from the small-message problem, where irregular many-to-many access patterns congest the network with excessive numbers of small messages. To address this, the conveyor aggregation library was developed to defer individual messages and group them for subsequent bulk processing. In this paper, we investigate its impact on energy use related to the network, with a focus on the Slingshot 11 interconnect. We demonstrate that this strategy is not only highly performant, but also crucial to reducing energy footprints to remain within target power envelopes. Accelerating LArTPC Simulations: Enhancing larnd-sim with GPU Optimization Techniques Madan Timalsina (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Matt Kramer (Lawrence Berkeley National Laboratory); Pengfei Ding (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Ronan Doherty (Trinity College Dublin); Rishabh Dave (UC Berkeley); Nicholas Tyler, Urjoshi Sinha, and William Arndt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); and Callum Wilkinson (Lawrence Berkeley National Laboratory) Abstract Abstract Advancements in general-purpose computing on GPUs have enabled highly parallelized Monte Carlo simulations for particle physics experiments, including the Deep Underground Neutrino Experiment (DUNE), which will use the world's most powerful neutrino beam to study the properties of these elusive particles. Here, we present our efforts on the optimization of larnd-sim, a microphysical simulation for liquid argon time projection chambers (LArTPCs) with light and pixelated charge readout, originally developed for the DUNE Near Detector (ND-LAr). Implemented in Python and utilizing GPU acceleration via Numba and CuPy, larnd-sim processes energy depositions from Geant4 to simulate physical phenomena (such as ionization electron drift) and the response of the detector electronics. By profiling with NVIDIA tools and optimizing memory transfers, adjusting register counts, tuning block and grid dimensions, altering floating-point precision, enabling the "fastmath" option for transcendental functions, converting arrays to a jagged format, and tuning CUDA kernels, we achieved an over-50% GPU-memory reduction, a ~30% wallclock speed improvement, and individual kernel speedups of 10–500%. In addition to these ongoing tests on the NERSC Perlmutter supercomputer, we are working with collaborators at ANL to run these simulations on the Polaris machine, further expanding larnd-sim's reach.
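Several of the larnd-sim optimizations above (single precision, fastmath transcendentals, explicit block/grid sizing) can be seen in miniature in the Numba CUDA sketch below. This is illustrative only, not larnd-sim code; the kernel, constants, and sizes are invented.

```python
# Toy Numba CUDA kernel in the style of the tuning described above.
import math
import numpy as np
from numba import cuda

@cuda.jit(fastmath=True)                     # fastmath transcendentals
def drift_attenuation(x, out, vdrift, lifetime):
    """Per-element toy physics: drift time, then exponential charge loss."""
    i = cuda.grid(1)
    if i < x.size:
        t = x[i] / vdrift                    # drift time
        out[i] = math.exp(-t / lifetime)     # surviving charge fraction

n = 1 << 20
x = cuda.to_device(np.random.rand(n).astype(np.float32))   # single precision
out = cuda.device_array(n, dtype=np.float32)

threads = 256                                # block size found by tuning
blocks = (n + threads - 1) // threads        # grid sized to cover the array
drift_attenuation[blocks, threads](x, out, np.float32(0.16), np.float32(2200.0))
print(out.copy_to_host()[:4])
```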
Paper, Presentation Technical Session 6A: DAOS Session Chair: Jesse A. Hanley (Oak Ridge National Laboratory) DAOS - New Horizons for High Performance Storage Michael Hennecke and Jerome Soumagne (HPE) Abstract Abstract DAOS is an open-source scale-out storage system that has redefined performance for a wide spectrum of HPC and AI workloads (https://daos.io/). It is an all-flash solution that can be deployed as a stand-alone storage system, or as a high-performance storage tier used in combination with traditional Lustre, GPFS, or (cloud) object storage environments. Enhancing RPC on Slingshot for Aurora's DAOS Storage System Jerome Soumagne, Alexander Oganezov, Ian Ziemba, and Steve Welch (HPE); Philip Carns and Kevin Harms (Argonne National Laboratory); and John Carrier, Johann Lombardi, Mohamad Chaarawi, Zhen Liang, and Scott Peirce (HPE) Abstract Abstract DAOS, an open-source software-defined high-performance storage solution designed for massively distributed NVMe SSDs and Non-Volatile Memory (NVM), is a key component of the Aurora Exascale system that aims to deliver high storage throughput and low latency to application users. Utilizing the Slingshot interconnect, DAOS leverages Remote Procedure Calls (RPCs) to communicate between compute and storage nodes. While the preexisting RPC mechanism used by DAOS was already designed for High-Performance Computing (HPC) fabrics, it required a number of scalability, performance, and security enhancements in order to be successfully deployed on Aurora. Global Distributed Client-side Cache for DAOS Clarete R. Crasta, John L Byrne, Abhishek Dwaraki, David Emberson, Harumi Kuno, Sekwon Lee, Ramya Ahobala Rao, Shreyas Vinayaka Basri K S, Amitha C, Chinmay Ghosh, Rishi Kesh Kumar Rajak, Sriram Ravishankar, Porno Shome, and Lance Evans (HPE) Abstract Abstract HPC/AI workloads process large amounts of data and perform complex operations on that data at exascale rates to deliver time-critical insights. Distributed workloads are often bottlenecked by communication when storage systems are used to coordinate and share results. Storage solutions supporting effective, scalable parallel access from compute clusters are critical to HPC architectures. Caching data on storage servers and/or clients is a well-known technique used by storage systems to ameliorate these communication costs. Current server-side caching methodologies are constrained by the amount of memory and network bandwidth available on the fixed, finite set of server nodes. Furthermore, most client-side caches are node-local, meaning the cached data is accessible solely by the node on which the data is stored. DAOS is a promising exascale storage stack recently acquired by HPE. Global client-side caching for DAOS is an attractive proposition due to the higher aggregate client-side resources (e.g., DRAM and network bandwidth), which can scale independently of the number of server nodes. In addition to providing faster data access, a client-side cache should also be efficient, as it consumes expensive resources, and requires a well-designed caching framework with appropriate policies. In this paper, we cover the details of realizing efficient shared client-side caching for DAOS. Paper, Presentation Technical Session 7B: Access Nodes & Kubernetes Management Session Chair: Jim Williams (Los Alamos National Laboratory) Addressing Resource Constraints on Aurora with Admin Access Nodes Peter Upton, Ben Lenard, Ben Allen, and Cyrus Blackworth (Argonne National Laboratory) Abstract Abstract This paper presents Administrator Access Nodes (AANs) as an alternative to the traditional reliance on a single Admin Node for all aspects of system administration in an HPE Performance Cluster Manager (HPCM) managed supercomputer cluster. At the Argonne Leadership Computing Facility (ALCF), managing the Aurora supercomputer, a large HPE Cray EX system, requires a team of skilled developers and administrators. These professionals require access to many tools for tasks such as parsing log files, issuing power commands, and connecting to nodes via SSH.
These tasks have typically been performed solely on the Admin Node. However, this centralization can lead to resource constraints due to simultaneous demands from multiple administrators. To address these issues, the paper details the implementation and operation of AANs, including custom tools for interacting with HPCM, scripts to replicate some Admin Node functionality on AANs, and synchronization tools for configuration files. The introduction of AANs has alleviated resource constraints, streamlined workflows, and enhanced system manageability. Possible future work is also discussed, focusing on further integrating HPCM's APIs and improving usability, aiming to enhance AAN capabilities and administrative efficiency for Aurora's complex environment. HPE Slingshot in the Kubernetes Ecosystem Caio Davi and Jesse Treger (HPE) Abstract Abstract The convergence of traditional HPC systems with AI increases expectations for supercomputing sites to deliver new capabilities (beyond traditional batch scheduling, single tenancy, and bare-metal application deployment methodologies) for more dynamic provisioning. Convergence with enterprise cloud-computing techniques such as containerized applications and Kubernetes has become a priority. But transitioning high-performance computing (HPC) environments and applications to Kubernetes is complex because of the critical requirement to maintain low-latency networking for high performance. In this context, we have HPE Slingshot, a modern high-performance interconnect for HPC and AI clusters that delivers industry-leading performance, bandwidth, and low latency for HPC, AI/ML, and data analytics applications through innovations in the fabric to overcome congestion and innovations in the NIC to significantly offload communications and message processing from the hosts. Because HPE Slingshot NICs run native Ethernet alongside an optimized RDMA transport and connectionless protocol, ensuring that the RDMA transport is operating as intended is critical to delivering the high performance expected in HPC and AI. This requires careful configuration of Kubernetes: if misconfigured, the system can fall back to standard TCP/IP over Ethernet, sacrificing the expected HPC and AI performance. Our proposed solution is composed of a number of Kubernetes components, such as device plugins, CNIs, an Operator, and admission policies. These contributions represent a significant advancement in deploying and operating HPC applications within containerized environments, offering a robust framework for future developments in distributed computing and ensuring both high performance and ease of management for the continuing convergence of HPC/AI and cloud computing and the coming transition from siloed HPC interconnects to interoperable Ultra-Ethernet transport.
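To ground the device-plugin idea above, the sketch below requests a high-performance NIC as an extended resource in a pod spec, so that a plugin (rather than the default TCP/IP path) wires up the RDMA transport. The resource name hpe.com/slingshot, image, and namespace are assumptions for illustration, not HPE's published component names.

```python
# Hedged sketch: scheduling a pod against a device-plugin-managed NIC resource.
from kubernetes import client, config

config.load_kube_config()                    # assumes a reachable cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mpi-worker"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="app",
            image="registry.example.org/hpc-app:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"hpe.com/slingshot": "1"},  # hypothetical device name
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```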
Building non-standard images for CSM systems Harold Longley, Isa Wazirzada, Dennis Walker, Andy Warner, and Davide Tacchella (HPE) Abstract Abstract HPC scientists increasingly must innovate using diverse toolchains crafted from various Linux distributions, ensuring they meet individual and project-specific needs to tackle complex challenges. Paper, Presentation Technical Session 7C: Application Performance Session Chair: Juan F R Herrera (EPCC, The University of Edinburgh) Task-decomposed Overlapped Pressure Preconditioner for Sustained Strong Scalability on Accelerated Exascale Systems Niclas Jansson (KTH Royal Institute of Technology) Abstract Abstract Computational Fluid Dynamics is a natural driver for exascale computing, with a virtually unbounded need for computational resources for the accurate simulation of turbulent fluid flow in both academic and engineering settings. However, with exascale computing capabilities on the horizon, we have seen a transition to more heterogeneous computer architectures with various accelerators. While these systems offer high theoretical peak performance and high memory bandwidth, complex programming models and significant programming investment are necessary to exploit them efficiently. We detail our work on improving the performance and scalability of key numerical methods in the high-fidelity spectral element code Neko on accelerated exascale machines. Efficient preconditioners are essential in incompressible fluid dynamics; however, the most efficient method (with respect to convergence) can be challenging to implement with good performance on an accelerator. We present our development of a GPU-optimised preconditioner with task overlapping for the pressure-Poisson equation, improving the preconditioner's throughput (in TDoF/s) by more than 61%. The new preconditioner is explained in detail, together with detailed performance studies on Cray EX platforms, including strong-scalability studies on Frontier, a performance comparison between AMD- and NVIDIA-accelerated nodes, and an assessment of the feasibility of mixing both node types in a single simulation. Supernovae in HPC: Benchmarking FLASH Across Advanced Computing Clusters Joshua Martin, Eva Siegmann, and Alan Calder (Stony Brook University, Institute of Advanced Computational Science) Abstract Abstract Astrophysical simulations are highly demanding in terms of computation, memory, and energy, requiring new advancements in hardware. Stony Brook University recently expanded its "SeaWulf" computing cluster by adding 94 new nodes with Intel Sapphire Rapids Xeon Max series CPUs. This benchmarking study evaluates the performance and power efficiency of this new hardware using FLASH: a multi-scale, multi-physics software instrument that utilizes adaptive mesh refinement. Our study also compares the performance of Stony Brook's Ookami testbed, which features ARM-based A64FX-700 processors, as well as SeaWulf's existing AMD EPYC Milan and Intel Skylake nodes. The focus of our simulation is the evolution of a bright stellar explosion known as a thermonuclear (Type Ia) supernova, a complex 3D problem that incorporates various operators for hydrodynamics, gravity, nuclear burning, and routines for the material equation of state. We perform strong-scaling tests on a 220 GB problem size and assess both single-node and multi-node performance. We analyze the performance of various MPI mappings and processor distributions across nodes. From our strong-scaling tests, we determine the optimal configuration for balancing runtime and energy consumption for our application.
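As a reminder of the bookkeeping behind such strong-scaling conclusions, the sketch below computes speedup and parallel efficiency for a fixed problem size; the node counts and timings are placeholders, not measurements from the paper.

```python
# Strong scaling: fixed problem size, varying node count (timings hypothetical).
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}   # nodes -> seconds
t1 = timings[1]
for n, t in sorted(timings.items()):
    speedup = t1 / t            # S(n) = T(1) / T(n)
    efficiency = speedup / n    # E(n) = S(n) / n
    print(f"{n:3d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```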
Expanding Community Access to Real-World HPC Application I/O Characterization Data Using Darshan Shane Snyder, Philip Carns, Robert Ross, Robert Latham, and Kevin Harms (Argonne National Laboratory) Abstract Abstract HPC systems are deployed with massive, distributed storage subsystems to meet the demands of data-intensive applications. While these storage systems offer impressive peak performance, it is often only attainable in idealized scenarios that are not reflective of production workloads. In general, there continues to be a lack of community understanding of the I/O performance characteristics of real-world applications. Paper, Presentation, Birds of a Feather Technical Session 7A: AI/ML GPU Workloads Session Chair: Raj Gautam (ExxonMobil) Porting Radio Astronomy Correlation to Setonix, an HPE Cray EX system powered by AMD GPUs Cristian Di Pietrantonio (Pawsey Supercomputing Research Centre, Curtin Institute for Radio Astronomy); Marcin Sokolowski (Curtin Institute of Radio Astronomy); Christopher Harris (Pawsey Supercomputing Research Centre); and Daniel Price and Randal Wayth (SKAO) Abstract Abstract In low-frequency radio astronomy, correlation of signals coming from hundreds of radio antennas is an early and fundamental step in creating science-ready data products such as images of the sky at radio wavelengths. Because of the high volume of data to process and the rate at which they are produced, correlation is usually performed in real time by dedicated hardware, an FPGA or GPU cluster, installed near the telescope. However, there are science cases in which an astronomer would like to correlate data later with customised settings, such as time and frequency averaging of signals. Setonix, Pawsey Supercomputing Centre's HPE Cray EX supercomputer based on AMD CPUs and GPUs, provides radio astronomers with enough computational power for such processing, but the only established GPU correlator works only on NVIDIA GPUs and proved hard to port. In this paper we discuss the process of providing Australian astronomers with an implementation of the correlation algorithm that harnesses the computational power of Setonix.
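The correlation step itself is compact enough to sketch: for antennas i and j, accumulate the time-averaged visibility V_ij = <v_i(t) v_j*(t)> per frequency channel. The numpy version below is illustrative only; a production correlator (as the abstract notes) runs this on FPGAs or GPUs in real time, and the array sizes are invented.

```python
# Minimal X-engine sketch: time-averaged cross-correlation of antenna voltages.
import numpy as np

n_ant, n_chan, n_time = 8, 64, 1024
# Channelised complex voltage samples per (antenna, channel, time).
rng = np.random.default_rng(0)
v = (rng.standard_normal((n_ant, n_chan, n_time))
     + 1j * rng.standard_normal((n_ant, n_chan, n_time))).astype(np.complex64)

# Correlate every antenna pair and average over time.
vis = np.einsum("ict,jct->ijc", v, v.conj()) / n_time
print(vis.shape)   # (8, 8, 64): visibility matrix per frequency channel
```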
Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers Bishwo Dahal (University of Louisiana Monroe, Oak Ridge National Laboratory) and Elijah Maccarthy and Subil Abraham (Oak Ridge National Laboratory) Abstract Abstract Containers are transforming scientific computing by simplifying the packaging and distribution of applications. This enables researchers to create and deploy their applications in isolated environments with all necessary dependencies, enhancing portability and deployment flexibility. These advantages make containers especially suitable for High Performance Computing (HPC) facilities like the Oak Ridge Leadership Computing Facility (OLCF), where complex scientific applications are developed and deployed. In this work, we investigate the performance of containerized machine learning (ML) applications in comparison to bare-metal execution on the Frontier Exascale supercomputer. Specifically, we aim to determine whether ML models, when trained and tested within containers on Frontier using Apptainer, exhibit performance similar to that of bare-metal implementations. To achieve this, we use containers to package and run Convolutional Neural Network (CNN)-based ML applications on the OLCF Frontier and Odo supercomputers and assess their performance against bare-metal runs. After conducting scalability tests across up to 30 nodes with 1680 AMD EPYC CPU cores and 240 GPUs, we find that the performance of the containerized ML applications is on par with that of bare-metal runs. We apply the lessons learned from our containerized ML model to containerizing and evaluating the performance of LLMs such as AstroLLaMA and CodeLLaMA on Frontier. BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI Tulsi Mishra, Dean Roe, and Larry Kaplan (HPE) Abstract Abstract As the convergence of HPC and AI reshapes computational workflows, the complexity of managing hybrid environments has become a significant challenge for organizations. HPE Cray Supercomputing User Services Software (USS) offers a transformative approach to simplify, scale, and optimize workflows across HPC and AI landscapes. In this session, we will explore how USS aims to bridge the gap between traditional HPC workloads and AI-driven innovations, providing a unified platform for containerized environments, hybrid deployment orchestration, and energy-efficient operations. Program Event Content Expanding Horizons in AI with HPC Workshop This workshop, held at Stony Brook University, aims to explore the dynamic intersection of AI and HPC, focusing on how advanced computing can accelerate AI research and applications. As AI models become more complex and data-intensive, traditional computing systems struggle to meet the demand for scalability, efficiency, and speed. HPC offers a solution by providing the necessary infrastructure for training large-scale models, enhancing AI algorithms, and enabling breakthroughs in fields such as deep learning, natural language processing, and autonomous systems. Registration and more details are available here: https://cug.org/cug-2025-aiwithhpc-workshop-2/ Tutorial Tutorial 1B Hands on with uenv and CPE in a container with Grace Hopper on Alps Ben Cumming and Tim Robinson (Swiss National Supercomputing Centre, ETH Zurich) Abstract Abstract Management of the Cray Programming Environment (CPE) can pose challenges for CUG sites, including dependency issues within a monolithic stack, disruptive updates, and the need for administrative privileges. For these reasons, the CUG community has expressed considerable interest in exploring alternative approaches for deploying and running software on HPE Cray systems that do not rely on the CPE being included in the operating system image. This tutorial offers a hands-on experience with building and managing software on an HPE Cray system that does not have the CPE installed, specifically using uenv and a containerized CPE.
Participants will get access to the cutting-edge NVIDIA Grace-Hopper (ARM64) nodes on Alps, an HPE Cray Supercomputing EX system at CSCS. https://eth-cscs.github.io/cug25-uenv/ Tutorial Tutorial 1C Best Practices For Operating and Maintaining Slingshot Fabrics Forest Godfrey (Hewlett Packard Enterprise) Abstract Abstract Whether performing an initial bringup, applying software updates, or adding capacity, handling routine maintenance tasks is a fact of life. This tutorial will guide attendees through best practices for handling common scenarios in the lifecycle of Slingshot fabrics. Emulated Slingshot environments and sample fabric designs will be used to provide hands-on experience working through important workflows in the lifecycle of a Slingshot deployment. Tutorial Tutorial 1A Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE) Abstract Abstract This tutorial covers the current monitoring architecture for HPE Cray HPC systems running HPCM (HPE Performance Cluster Manager) and CSM (Cray System Management) and how to configure the monitoring components to ensure log and telemetry data flows into the proper location for analysis. The various available analysis tools are discussed, including OpenSearch dashboards for log data and Grafana for telemetry such as compute-node resources and fabric metrics, along with infrastructure monitoring and alerting and the latest data inspection tools. The Monitoring Pipeline Visualization Tool (MPVT) enhances system monitoring by collecting real-time metrics from monitoring stack components such as Kafka, OpenSearch, and VictoriaMetrics via Prometheus exporters, and storing the data in VictoriaMetrics. MPVT generates detailed graphs using Grafana's Node Graph panel to represent the health, performance, and data flow of the monitoring pipeline. Alerting configuration and customization are also presented.
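As a hedged sketch of the exporter pattern MPVT scrapes, the snippet below publishes one gauge on a /metrics endpoint for Prometheus or VictoriaMetrics to collect; the metric name and the stand-in measurement are invented for illustration.

```python
# Minimal Prometheus exporter sketch (pip install prometheus_client).
import random
import time
from prometheus_client import Gauge, start_http_server

lag = Gauge("monitoring_kafka_consumer_lag",      # hypothetical metric name
            "Kafka consumer lag in messages")

start_http_server(8000)          # serves /metrics on port 8000
while True:
    lag.set(random.randint(0, 100))   # stand-in for a real lag measurement
    time.sleep(15)
```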
Tutorial Tutorial 1D Exploring High Performance Storage with DAOS Adrian Jackson (EPCC, The University of Edinburgh) and Mohamad Chaarawi and Kenneth Cain (HPE) Abstract Abstract The diversity of applications utilizing HPC has been increasing beyond computational simulation approaches to a more varied mix including machine learning and data analytics. This introduces changes in I/O patterns and requirements on data storage. Traditional data storage technologies in HPC have long been optimized for bulk data movement, focused on high bandwidth with relatively low volumes of metadata operations. However, many applications now exhibit I/O patterns that are non-optimal for Parallel File Systems (PFS), with significant amounts of small I/O operations, non-contiguous data access, and increases in read as well as write I/O activity. PFS today are not optimized for all such patterns and struggle with these diversified application workloads.
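The small-I/O problem the tutorial describes, and the classic client-side mitigation of aggregating many small records into large contiguous writes, can be sketched in a few lines. This is purely illustrative; the record and buffer sizes are arbitrary.

```python
# Aggregating small writes into ~4 MiB contiguous writes.
import os

records = (os.urandom(512) for _ in range(10_000))   # many 512 B records

# Naive pattern (one write per record) is latency- and metadata-bound on a PFS:
#   for r in records: f.write(r)

buf, BUFSZ = bytearray(), 4 << 20
with open("out.dat", "wb") as f:
    for r in records:
        buf += r
        if len(buf) >= BUFSZ:        # flush as one large sequential write
            f.write(buf)
            buf.clear()
    if buf:
        f.write(buf)                 # flush the tail
```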
Tutorial Tutorial 1B Continued Hands on with uenv and CPE in a container with Grace Hopper on Alps Ben Cumming and Tim Robinson (Swiss National Supercomputing Centre, ETH Zurich) Tutorial Tutorial 1C Continued Best Practices For Operating and Maintaining Slingshot Fabrics Forest Godfrey (Hewlett Packard Enterprise) Tutorial Tutorial 1A Continued Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE) Tutorial Tutorial 1D Continued Exploring High Performance Storage with DAOS Adrian Jackson (EPCC, The University of Edinburgh) and Mohamad Chaarawi and Kenneth Cain (HPE) Tutorial Tutorial 2B Automated Inspection of Fortran/C/C++ Code Using Codee for Correctness, Modernization, Optimization, and Security on HPE/Cray Manuel Arenaz (Codee - Appentra Solutions) Abstract Abstract Codee is a software development tool that facilitates the development, maintenance, modernization, and optimization of Fortran/C/C++ codes by providing a systematic and predictable approach to finding and fixing issues related to correctness, modernization, optimization, and security vulnerabilities. Codee provides automated checkers for the rules documented in the Open Catalog of Code Guidelines for Correctness, Modernization, and Optimization. It also features AutoFix capabilities to enable semi-automatic source-code rewriting, by modifying Fortran statements and inserting OpenMP or OpenACC directives. Codee integrates seamlessly with popular IDEs, version control systems, and CI/CD frameworks. Overall, Codee helps developers uncover hidden bugs, avoid introducing new ones, and pinpoint suggestions for various code improvements. In this tutorial, participants will explore Codee and the Open Catalog through short demos and hands-on exercises, with step-by-step instructions for HPE/Cray systems such as Perlmutter. The session begins with simple, well-known kernels and quickly progresses to large HPC codes like WRF, enabling participants to effectively use Codee tools in real-world scenarios. Tutorial Tutorial 2C Performance Analysis on AMD GPUs Georgios Markomanolis (AMD) Abstract Abstract Over the past few years, AMD has released several profiling tools for AMD GPUs. These tools, called rocprofv3, ROCm Systems Profiler (rocprof-sys, formerly Omnitrace), and ROCm Compute Profiler (rocprof-compute, formerly Omniperf), are now AMD products. This tutorial is divided into four parts. In the first part, we discuss the applications and GPUs used in the tutorial, including the MI300A. Afterwards, we will showcase the latest advancements and offer guidance on selecting the most suitable tool based on your specific needs. We will introduce rocprofv3, showcasing its new features, usage, and functionality designed to enhance the user experience; instrumenting MPI applications is straightforward compared to the previous rocprof version. Notable additions include support for profiling OpenMP offloading in Fortran, which we will demonstrate along with other improvements, such as reduced overhead. In the third part, we will explain the capabilities of the rocprof-sys tool for timeline analysis of an application execution, visualizing traces, and understanding performance characteristics. Participants will learn how to use profiling tools with key applications and various programming models, including MPI, OpenMP, Python, and Kokkos. In the last part, rocprof-compute will be used for roofline analysis of kernel performance, presenting the new developments and improvements. Additionally, we will delve into identifying inefficient metrics affecting specific kernel performance and provide deeper insights into optimization strategies. Tutorial Tutorial 1A Continued Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE)
Tutorial Tutorial 2B Continued Automated Inspection of Fortran/C/C++ Code Using Codee for Correctness, Modernization, Optimization, and Security on HPE/Cray Manuel Arenaz (Codee - Appentra Solutions) Tutorial Tutorial 2C Continued Performance Analysis on AMD GPUs Georgios Markomanolis (AMD)
Tutorial Tutorial 1A Continued Monitoring HPE Cray HPC systems Harold Longley, Sue Miller, Pete Guyan, and Raghul Vasudevan (HPE) Plenary, Vendor Plenary: Sponsors Talks, HPE 1-100 Linaro: Unlocking Exascale Debugging and Performance Engineering with Linaro Forge Rudy Shand (Linaro Ltd) Abstract Abstract Dive into the future of code development and see how Linaro Forge is reshaping what's possible in the world of parallel computing. With Linaro DDT, MAP, and Performance Reports, Linaro Forge's latest advancements set new standards in scalability and ease of use. Discover how these tools have become the go-to solution for developers seeking to push the boundaries of code optimization and performance engineering. Codee: A Tool to Enhance Correctness, Modernization, Security, Portability and Optimization in Fortran and C/C++ Software Applications Manuel Arenaz (Codee) Abstract Abstract Fortran/C/C++ developers are under constant pressure to deliver increasingly complex simulation software that is correct, secure, and fast. It is critical to empower development teams with tools to automate code reviews, enforce compliance with industry standards, and prioritize reducing the risk of security vulnerabilities. Codee features unique capabilities for Deep Analysis of Fortran/C/C++ code, helping to catch bugs, enforce coding guidelines, modernize legacy code, ensure code portability, address security vulnerabilities, and optimize code efficiency. Codee provides automated checkers for the rules documented in the Open Catalog as well as AutoFix capabilities for semi-automatic source-code rewriting, including modification of source-code statements and insertion of OpenMP or OpenACC directives. Codee integrates seamlessly with popular editors, IDEs, version control systems, and CI/CD frameworks, making it easy to incorporate into existing development workflows. Overall, developers who are actively writing, modifying, testing, and benchmarking Fortran code will increase their productivity by using Codee. Developers, team leaders, and managers will benefit from DevOps and DevSecOps best practices, mitigating risks, boosting productivity, and reducing costs.
In this presentation we will also discuss how to use Codee in conjunction with the Cray tools, including compilers (CCE) and performance tools (e.g., CrayPat, Reveal). AMD: The Unreasonable Effectiveness of FP64 Precision Arithmetic Nicholas Malaya (AMD) Abstract Abstract The double-precision datatype, also known as FP64, has been a mainstay of high performance computing (HPC) for decades. Recent advances in AI have extensively leveraged reduced precision, such as FP16 or, more recently, FP8 for DeepSeek. Many HPC teams are now exploring mixed and reduced precision to see if significant speed-ups are possible in traditional scientific applications, including methods such as the Ozaki scheme for emulating FP64 matrix multiplications with INT8 datatypes. In this talk, we will discuss the opportunities, and the significant challenges, in migrating from double precision to reduced precision. Ultimately, AMD believes a spectrum of precisions is necessary to support the full range of computational motifs in HPC, and that native FP64 remains necessary for the near future. XTreme (Approved NDA Members Only) XTreme (Under NDA, Members Only)
Networking/Social Event Welcome Reception Networking/Social Event Program Committee Dinner (invite only) Networking/Social Event HPE Networking Event HPE will host its annual CUG community networking reception from 6:00 to 8:00 pm ET at the Lokal Eatery & Bar. Lokal is located at 2 2nd St, Jersey City, NJ 07302, along Jersey City's waterfront, allowing CUG guests to enjoy expansive views of the Manhattan skyline. Co-presented by AMD, all registered CUG attendees and their guests are invited to attend a reception with light hors d'oeuvres and drinks.
The first bus will leave at 5:55 pm; Lokal is about a 10-minute walk from the CUG hotel. The last bus will depart from Lokal at 8:00 pm. Networking/Social Event CUG AMD Night Out CUG Night Out at Hudson House, 2 Chapel Ave, Jersey City, NJ. We invite all registered attendees and guests with a paid CUG Night Out ticket to join us for an unforgettable evening at Hudson House. Situated at the end of Port Liberte in Jersey City, NJ, the venue sits an arm's length from the Hudson River and boasts panoramic views of the Statue of Liberty; the Brooklyn, Manhattan, and Verrazzano bridges; and, of course, the NYC skyline. Coaches will depart from outside the Westin Jersey City Hotel at 18:10 to arrive at Hudson House for a drinks reception before seating for dinner at approximately 19:15. If you are making your own way to the venue, please use the full address, as Google Maps takes you to a different location! Hudson House, 2 Chapel Ave is approximately a 15-20-minute drive. Our first bus will return to the hotel at approximately 21:00.
