CUG2025 Proceedings


Sunday, May 4


8:30am-10:00am

XTreme (Under NDA, Members Only)
XTreme (Approved NDA Members Only)

10:00am-10:30am

Coffee Break
Break

10:30am-12:00pm

XTreme (Under NDA, Members Only)
XTreme (Approved NDA Members Only)

12:00pm-1:00pm

CUG Board/New Sites Lunch (closed)
Lunch

Lunch/PEAD & XTreme SIG Participants
Lunch

1:00pm-2:30pm

Programming Environments, Applications, and Documentation (PEAD)
Introduction PEAD
CPE in a Container
Python Management
Birds of a Feather

XTreme (Under NDA, Members Only)
XTreme (Approved NDA Members Only)

2:30pm-3:00pm

Coffee Break
Break

3:00pm-5:00pm

Programming Environments, Applications, and Documentation (PEAD)
CPE Update
CPE Testing
Exploring the Challenges of the World-Class HPE Cray Programming Environment for Modern Software Development in Fortran
Open Floor Discussion
Birds of a Feather

XTreme (Under NDA, Members Only)
XTreme (Approved NDA Members Only)

5:30pm-6:30pm

Welcome Reception
Networking/Social Event

6:30pm-8:30pm

Program Committee Dinner (invite only)
Networking/Social Event

Monday, May 5


8:30am-10:00am

Tutorial 1A
Monitoring HPE Cray HPC systems
Tutorial

Tutorial 1B
Hands on with uenv and CPE in a container with Grace Hopper on Alps
Tutorial

Tutorial 1C
Best Practices For Operating and Maintaining Slingshot Fabrics
Tutorial

Tutorial 1D
Exploring High Performance Storage with DAOS
Tutorial

10:00am-10:30am

Coffee Break
Break

10:30am-12:00pm

Tutorial 1A Continued
Monitoring HPE Cray HPC systems
Tutorial

Tutorial 1B Continued
Hands on with uenv and CPE in a container with Grace Hopper on Alps
Tutorial

Tutorial 1C Continued
Best Practices For Operating and Maintaining Slingshot Fabrics
Tutorial

Tutorial 1D Continued
Exploring High Performance Storage with DAOS
Tutorial

12:00pm-1:00pm

CUG Advisory Board Lunch Cabinet (closed)
Lunch

Lunch Sponsored by Codee
Lunch

1:00pm-2:30pm

BoF 1D
Security BoF
Birds of a Feather

Tutorial 1A Continued
Monitoring HPE Cray HPC systems
Tutorial

Tutorial 2B
Automated Inspection of Fortran/C/C++ Code Using Codee for Correctness, Modernization, Optimization, and Security on HPE/Cray
Tutorial

Tutorial 2C
Performance Analysis on AMD GPUs
Tutorial

2:30pm-2:45pm

Coffee Break Sponsored by SchedMD
Break

2:45pm-4:15pm

BoF 2D
Kubernetes on HPE Supercomputers BoF
Birds of a Feather

Tutorial 1A Continued
Monitoring HPE Cray HPC systems
Tutorial

Tutorial 2B Continued
Automated Inspection of Fortran/C/C++ Code Using Codee for Correctness, Modernization, Optimization, and Security on HPE/Cray
Tutorial

Tutorial 2C Continued
Performance Analysis on AMD GPUs
Tutorial

4:20pm-5:30pm

BoF 1A
CSM updates, iSCSI boot content projection, and other CSM topics
Birds of a Feather

BoF 1B
CUG SIG System Monitoring Working Group BoF
Birds of a Feather

BoF 1C
Sharing is Caring: Tackling Node-Sharing Challenges at CUG Sites
Birds of a Feather

BoF 3D
Rethinking Interactive HPC Resource Access: Enhancing Security and Flexibility
Birds of a Feather

Tuesday, May 6


8:30am-10:00am

Plenary Session: CUG 2025 Welcome, Keynote Presentation
Welcome from the CUG President, Ashley Barker
Keynote: What I’ve Learned About Supercomputing from Blowing Up Stars, Michael Zingale (Stony Brook University)
New Member Site: Introducing LRZ
CUG 2026 Elections: Candidate Statements
Plenary

10:00am-10:30am

Coffee Break Sponsored by Pier Group
Break

10:30am-12:00pm

Plenary Session: Stony Brook LOC Welcome, HPE Update
Welcome by Stony Brook University
Altair: AI/ML Intelligent Scheduling for HPC with Altair®
NVIDIA HPC Software - Expanding HPC with Python & AI
HPE Corporate Update, Gerald Kleyn
Plenary

12:00pm-1:00pm

CUG Board & Sponsors Lunch (closed)
Lunch

Lunch Sponsored by NVIDIA
Lunch

1:00pm-2:30pm

Technical Session 1A: Multitenancy
Juan F R Herrera
Infrastructure as a Service with Strong Tenant Separation on a Supercomputer
Dynamic Network Perimeterization: Isolating Tenant Workloads With VLANs, VNIs, & ACLs
CSCS' journey towards complete platform automation in a multi-tenant environment
Paper, Presentation

Technical Session 1B: Workload manager
David Carlson
Slinky: The Missing Link Between Slurm and Kubernetes
How Best to Leverage Cloud for (Big) HPC Sites
Divide and Rule: Automated Workload Distribution for Efficient User Support Services
Paper, Presentation

Technical Session 1C: Software deployment
Chris Fuson
Deploying and Tracking Software with NCCS Software Provisioning
Modern Software Deployment on a Multi-Tenant Cray-EX System
Employing a Software-Driven Approach to Scalable HPC System Management
Paper, Presentation

2:30pm-3:00pm

Coffee Break Sponsored by Linaro
Break

3:00pm-5:00pm

Technical Session 2A: Slingshot
Brett Bode
The HPE Slingshot 400 Expedition
Introduction To HPE Slingshot NIC Libfabric Environment Variables
Math in Your Network: Slingshot Hardware Accelerated Reductions
Slingshot Host Software Ethernet Tuning
Paper, Presentation

Technical Session 2B: Security & Configuration Management
Jim Williams
Pragmatic Security Audits: Fortifying HPC Environments at a Consumable Pace
Experimenting with Security Compliance Checking using ReFrame
From Weeks to Hours: Harnessing Configuration Management and Deployment Pipelines
Rev Up Compute Node Reboots: 2x to 5x Faster
Paper, Presentation

Technical Session 2C: Climate applications
Maciej Cytowski
Bit-reproducibility in UK Met Office Weather and Climate Applications
Enabling km-scale coupled climate simulations with ICON on AMD GPUs
MARBLChapel: Fortran-Chapel Interoperability in an Ocean Simulation
Redefining Weather Forecasting Systems: The Transition to ICON and Alps
Paper, Presentation

6:00pm-8:00pm

HPE Networking Event
Networking/Social Event

Wednesday, May 7


8:30am-10:00am

Plenary Session: CUG Organizational Update and Best Paper Presentation
CUG Organizational Update
Evolving HPC services to enable ML workloads on HPE Cray EX
Alps, a versatile research infrastructure
Plenary, Paper

10:00am-10:30am

Coffee Break Sponsored by VAST
Break

10:30am-12:00pm

Plenary: Sponsors Talks, HPE 1-100
Linaro: Unlocking Exascale Debugging and Performance Engineering with Linaro Forge
Codee: A Tool to Enhance Correctness, Modernization, Security, Portability and Optimization in Fortran and C/C++ Software Applications
AMD: The Unreasonable Effectiveness of FP64 Precision Arithmetic
HPE 1 on 100 with Trish Damkroger (HPE Customers only. No HPE partners or CUG sponsors)
Plenary, Vendor

12:00pm-1:00pm

HPE Executive Lunch (closed)
Lunch

Lunch Sponsored by Codee
Lunch

1:00pm-2:30pm

Technical Session 3A: Data Centers
Lena M Lopatina
Causality inference for Digital Twins in GPU Data Centers and Smart Grids.
AlpsB – a Geographically Distributed Infrastructure to Facilitate Large-Scale Training of Weather and Climate AI Models
Co-design, deployment and operation of a Modular Data Centre (MDC) with air and direct-liquid cooled supercomputers
Paper, Presentation

Technical Session 3B: HPCM
Matthew A. Ezell
A Brief Summary of the HPCM (HPE Performance Cluster Manager) Evolution Over Recent Releases
System Visualization Using Rackmap
Harvesting, Storing and Processing Data from our HPCM Systems
Paper, Presentation

Technical Session 3C: Future Technology
Juan F R Herrera
Evolving Sarus to augment Podman for HPC on Cray EX
What is RISC-V and why should we care?
A Full Stack Framework for High Performance Quantum-Classical Computing
Paper, Presentation

2:30pm-3:00pm

Coffee Break Sponsored by Altair
Break

3:00pm-5:00pm

Technical Session 4A: New Deployment
Jim Rogers
A journey to provide GH200
Evaluating AMD MI300A APU: Performance Insights on LLM Training via Knowledge Distillation
Evaluation of the Nvidia Grace Superchip in the HPE/Cray XD Isambard 3 supercomputer
Separating concerns: Decoupling the Slingshot Fabric Manager from Cray System Management
Paper, Presentation

Technical Session 4B: GPU Energy Efficiency
Maciej Cytowski
Optimizing GPU Frequency for Sustainable HPC: Lessons Learned from a Year of Production on Adastra, an AMD GPU Supercomputer
Fine-Grained Application Energy and Power Measurements on the Frontier Exascale System
EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs
HPE Cray EX225a (MI300a) Blade Power Capping and HBM Page Retirement
Paper, Presentation

Technical Session 4C: Monitoring
David Carlson
Utilization and Performance Monitoring of Ookami, an ARM Fujitsu A64FX Testbed Cluster with XDMoD
HPE Slingshot Monitoring Software: Actionable Insights for HPC and AI Systems
LDMS New Features for Deployment in Advanced Environments and Feedback for Operations
Proactive Health Monitoring and Maintenance of High-Speed Slingshot Fabrics in HPC Environments
Paper, Presentation

5:05pm-5:45pm

BoF 2A
CPE Futures
Birds of a Feather

BoF 2B
Managing System Reliability: From system acceptance through production
Birds of a Feather

BoF 2C
HPE Slingshot Birds-of-a-Feather
Birds of a Feather

6:00pm-10:00pm

CUG AMD Night Out
Networking/Social Event

Thursday, May 8


8:30am-10:00am

Plenary: CUG 2026, Panel
New Member Site: Introducing GeoSphere
New Member Site: Introducing Cyfronet
VAST Data Platform
CUG2026 site presentation
Panel: The Future of Precision in HPC, which FP is the Right One?
Plenary

10:00am-10:30am

Coffee Break
Break

CUG Advisory Board (closed)
CUG Program Committee

10:30am-12:00pm

Technical Session 5A: Slingshot & MPI Tuning
Brett Bode
MPI implementation optimization for Slingshot network
Using Different MPI Implementations on HPE Cray EX Supercomputers for Native and Containerized Applications Execution
Scaling MPI Applications on Aurora
Paper, Presentation

Technical Session 5B: Maintaining Large Systems
Aaron Scantlin
Hardware Triage Tool: Enhancements and Extensions
Detecting operating system noise with detect-detour
Analyzing a Lifetime of Failures on a Cray XC40 Supercomputer
Paper, Presentation

Technical Session 5C: Filesystems & I/O
Raj Gautam
E2000 Performance From Microbenchmarks to Applications
Towards Empirical Roofline Modeling of Distributed Data Services: Mapping the Boundaries of RPC Throughput
HPC workload characterization using eBPF
Paper, Presentation

12:00pm-1:00pm

Lunch Sponsored by NVIDIA
Lunch

1:00pm-2:30pm

Technical Session 6A: DAOS
Jesse A. Hanley
DAOS - New Horizons for High Performance Storage
Enhancing RPC on Slingshot for Aurora’s DAOS Storage System
Global Distributed Client-side Cache for DAOS
Paper, Presentation

Technical Session 6B: Framework for HPC-AI workflows
Chris Fuson
Framework for tracking metadata, lineage and model provenance in hybrid simulation-AI HPC exascale workflows
Search and Query Framework for Workflows with HPC and AI Models
FirecREST v2: Lessons Learned from Redesigning an API for Scalable HPC Resource Access
Paper, Presentation

Technical Session 6C: Programming Models
Benjamin Cumming
Designing GPU-aware OpenSHMEM for HPE Cray EX and XD Systems
Quantifying Message Aggregation Optimisations for Energy Savings in PGAS Models
Accelerating LArTPC Simulations: Enhancing larnd-sim with GPU Optimization Techniques
Paper, Presentation

2:30pm-3:00pm

Coffee Break
Break

3:00pm-4:30pm

Technical Session 7A: AI/ML GPU Workloads
Raj Gautam
Porting Radio Astronomy Correlation to Setonix, a HPE Cray EX system powered by AMD GPUs
Evaluating the Performance of Containerized ML and LLM Applications on the Frontier and Odo Supercomputers
BoF on Transforming Hybrid Workflows: The Role of HPE Cray Supercomputing User Services Software in Bridging HPC and AI
Paper, Presentation, Birds of a Feather

Technical Session 7B: Access Nodes & Kubernetes Management
Jim Williams
Addressing Resource Constraints on Aurora with Admin Access Nodes
HPE Slingshot in the Kubernetes Ecosystem
Building non-standard images for CSM systems
Paper, Presentation

Technical Session 7C: Application Performance
Juan F R Herrera
Task-decomposed Overlapped Pressure Preconditioner for Sustained Strong Scalability on Accelerated Exascale Systems
Supernovae in HPC: Benchmarking FLASH Across Advanced Computing Clusters
Expanding Community Access to Real-World HPC Application I/O Characterization Data Using Darshan
Paper, Presentation

4:30pm-4:40pm

CUG 2025 Closing Remarks
Plenary

Friday, May 9


9:00am-5:00pm

Expanding Horizons in AI with HPC Workshop
Program Event Content

Saturday, May 10


9:00am-3:00pm

Expanding Horizons in AI with HPC Workshop
Program Event Content

Created 2025-5-15 2:51