CUG2024 Proceedings


Sunday, May 5th


8:30am-10:15am

XTreme (Approved NDA Members Only)

10:15am-10:30am

Coffee Break
Break

10:30am-12:00pm

XTreme (Approved NDA Members Only)

12:00pm-1:00pm

Lunch (open to PEAD and XTreme participants)
Lunch

1:00pm-2:30pm

Programming Environments, Applications, and Documentation (PEAD)
PEAD Introduction
HPE Fortran Support
HPE/Cray CPE Roadmap
Programming Environment Management BoF
Birds of a Feather

XTreme (Approved NDA Members Only)

2:30pm-2:45pm

Coffee Break
Break

2:45pm-5:00pm

Programming Environments, Applications, and Documentation (PEAD)
HPE Documentation and Training Updates
Collaborative Development of HPC Training Materials
HPE User Engagement Survey
Open Discussions
Birds of a Feather

XTreme (Approved NDA Members Only)

6:30pm-9:00pm

Program Committee Dinner (invite only)
CUG Program Committee

Monday, May 6th


8:30am-10:00am

Tutorial 1A
Supercomputer Affinity on HPE Systems
Tutorial

Tutorial 1B
Image Deployment and System Monitoring with HPCM
Tutorial

Tutorial 1C
Omnitools: Performance Analysis Tools for AMD GPUs
Tutorial

10:00am-10:30am

Coffee Break (sponsored by Altair)
Break

10:30am-12:00pm

Tutorial 1A Continued
Supercomputer Affinity on HPE Systems
Tutorial

Tutorial 1B Continued
Image Deployment and System Monitoring with HPCM
Tutorial

Tutorial 1C Continued
Omnitools: Performance Analysis Tools for AMD GPUs
Tutorial

12:00pm-1:00pm

CUG Advisory Board Lunch Cabinet (closed)
CUG Program Committee

Lunch (sponsored by Nvidia)
Lunch

1:00pm-2:30pm

Tutorial 2A
Monitoring, Tuning, and Troubleshooting a CSM system
Tutorial

Tutorial 2B
Automated Inspection of C/C++/Fortran Code Using Codee for Performance Optimization on HPE/Cray
Tutorial

Tutorial 2C
MGARD & ADIOS-2: A framework for extreme scale I/O with online data reduction
Tutorial

2:30pm-3:00pm

Coffee Break (sponsored by Linaro)
Break

3:00pm-4:30pm

Tutorial 2A Continued
Monitoring, Tuning, and Troubleshooting a CSM system
Tutorial

Tutorial 2B Continued
Automated Inspection of C/C++/Fortran Code Using Codee for Performance Optimization on HPE/Cray
Tutorial

Tutorial 2C Continued
MGARD & ADIOS-2: A framework for extreme scale I/O with online data reduction
Tutorial

4:35pm-6:00pm

BoF 1A
OpenCHAMI for collaborators and the collaborator-curious
Birds of a Feather

BoF 1B
High Performance Data-centre Digital Twins
Birds of a Feather

BoF 1C
HPE Slingshot Birds of a Feather
Birds of a Feather

Tuesday, May 7th


8:00am-10:00am

Plenary: Welcome, Keynote
Opening
Welcome to Country Ceremony
Welcome from the CUG President
Talk by Dr. Sarah Pearce, SKA-Low Telescope Director
Convergence of Energy Efficient Scientific Computing and GenAI
High Performance Remote Linux Desktops with ThinLinc
Unlocking Exascale Debugging and Performance Engineering with Linaro Forge
Plenary

10:00am-10:30am

Coffee Break (sponsored by SchedMD)
Break

10:30am-12:00pm

Plenary: CUG site, HPE update
Welcome by Pawsey
HPE corporate update by Gerald Kleyn
Plenary

12:00pm-1:00pm

CUG Board & Sponsors Lunch (closed)
CUG Board

Lunch (sponsored by Codee)
Lunch

1:00pm-2:30pm

Technical Session 1A
Lena M Lopatina
CPE Updates
A Deep Dive Into NVIDIA's HPC Software
Slurm 24.05 and Beyond
Presentation, Paper

Technical Session 1B
Jim Williams
Enhancing HPC Service Management on Alps using FirecREST API
Automated Hardware-Aware Node Selection for Cluster Computing
Versatile Software-defined Cluster on Cray HPE EX Systems
Presentation, Paper

Technical Session 1C
Chris Fuson
Towards the Development of an Exascale Network Digital Twin
A Performance Deep Dive into HPC-AI Workflows with Digital Twins
Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC
Presentation, Paper

2:30pm-3:00pm

Coffee Break (sponsored by ThinLinc)
Break

3:00pm-5:00pm

Technical Session 2A
Jim Rogers
Updated Node Power Management For New HPE Cray EX255a and EX254n Blades
HPE Cray EX Power Monitoring Counters
First Analysis on Cooling Temperature Impacts on MI250x Exascale Nodes (HPE Cray EX235A)
EVeREST: An Effective and Versatile Runtime Energy Saving Tool
Presentation, Paper

Technical Session 2B
Lena M Lopatina
EMOI: CSCS Extensible Monitoring and Observability Infrastructure
Swordfish/Redfish and ClusterStor - Using Advanced Monitoring to Improve Insight into Complex I/O Workflows.
CADDY: Scalable Summarizations over Voluminous Telemetry Data for Efficient Monitoring
Command Lines vs. Requested Resources: How Well Do They Align?
Presentation, Paper

Technical Session 2C
Veronica G. Vergara Larrea
Optimising the Processing and Storage of Radio Astronomy Data
Performance and scaling of the LFRic weather and climate model on different generations of HPE Cray EX supercomputers
Disaggregated memory in OpenSHMEM applications – Approach and Benefits
Migrating Complex Workflows to the Exascale: Challenges for Radio Astronomy
Presentation, Paper

5:05pm-6:00pm

BoF 2A
2024 HPC Testathon: Experiences and Results
Birds of a Feather

BoF 2B
Birds of a Feather on Artificial Intelligence and Machine Learning for HPC Workload Analysis (AIMLHPCWorkload2024)
Birds of a Feather

BoF 2C
Architecting a Cloud-based Supercomputing as-a-Service Solution
Birds of a Feather

Wednesday, May 8th


7:00am-8:20am

WHPC+ Australasia and AMD Diversity and Inclusion Breakfast
Networking/Social Event

8:30am-10:00am

Plenary: CUG Board Updates (Open), CUG Elections, and Best papers
CUG Board Updates, SIG Presentations, and Board Elections – Open Session
Nine Months in the life of an all-flash file system
Isambard-AI: a leadership-class supercomputer optimised specifically for Artificial Intelligence
Plenary

10:00am-10:30am

Coffee Break (sponsored by VAST)
Break

10:30am-12:00pm

Plenary: Sponsors talks, HPE 1-100
The Biggest Change to HPC Job Scheduling and Resource Management in 30 Years
Codee: Automatic Code Inspection Tools for Performance and Code Modernization
AMD Together We Advance Supercomputing
HPE 1 on 100 (HPE Customers only: no HPE partners or CUG sponsors)
Plenary

12:00pm-1:00pm

HPE Executive Lunch (closed)
CUG Board

Lunch (sponsored by Nvidia)
Lunch

1:00pm-2:30pm

Technical Session 3A
Bilel Hadri
Early Application Experiences on Aurora at ALCF: Moving From Petascale to Exascale Systems
Streaming Data in HPC Workflows Using ADIOS
Enrichment and Acceleration of Edge to Exascale Computational Steering STEM Workflow using Common Metadata Framework
Presentation, Paper

Technical Session 3B
Gabriel Hautreux
Spack Based Production Programming Environments on Cray Shasta
Containers-first user environments on HPE Cray EX
Cloud-Native Slurm management on HPE Cray EX
Presentation, Paper

Technical Session 3C
Tina Declerck
CSM-based Software Stack Overview 2024
Overview of HPCM
Seamless Cluster Migration in CSM
Presentation, Paper

2:30pm-3:00pm

Coffee Break (sponsored by Pier Group)
Break

3:00pm-5:00pm

Technical Session 4A
Lena M Lopatina
Multi-stage Approach for Identifying Defective Hardware in Frontier
From Frontier to Framework: Enhancing Hardware Triage for Exascale Machines
Full-stack Approach to HPC Testing
An Approach to Continuous Testing
Presentation, Paper

Technical Session 4B
Brett Bode
Scalability and Performance of OFI and UCX on ARCHER2
Using P4 for Cassini-3 Software Development Environment
Running NCCL and RCCL Applications on HPE Slingshot NIC
Enabling NCCL on Slingshot 11 at NERSC
Presentation, Paper

Technical Session 4C
Gabriel Hautreux
LLM Serving With Efficient KV-Cache Management Using Triggered Operations
From Chatbots to Interfaces: Diversifying the Application of Large Language Models for Enhanced Usability
Delivering Large Language Model Platforms With HPC
System for Recommendation and Evaluation of Large Language Models for practical tasks in Science
Presentation, Paper

Thursday, May 9th


8:30am-10:00am

Plenary: CUG 2024, Invited speakers
CUG2025 site presentation
VAST Data
Advancing Gas Turbine Development using HPC: Challenges and Rewards
Plenary

10:00am-10:30am

Coffee Break
Break

CUG Advisory Board
CUG Program Committee

10:30am-12:00pm

Technical Session 5A
Veronica G. Vergara Larrea
Power and Performance analysis of GraceHopper superchips on HPE Cray EX systems
Accelerating Scientific Workflows with the NVIDIA Grace Hopper Platform
GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability
Presentation, Paper

Technical Session 5B
Adrian Jackson
Leveraging GNU Parallel for Optimal Utilization of HPC Resources on Frontier and Perlmutter Supercomputers
Portable Support for GPUs and Distributed-Memory Parallelism in Chapel
PaCER: Accelerating Science on Setonix
Presentation, Paper

Technical Session 5C
Tina Declerck
Cray EX Security Experiences
Best of Times, Worst of Times: A Cautionary Tale of Vulnerability Handling
AIOPS Empowered: Failure Prediction in System Management Software Tools
Presentation, Paper

12:00pm-1:00pm

Lunch (sponsored by Codee)
Lunch

New CUG Board / Old CUG Board Lunch (closed)
CUG Board

1:00pm-2:30pm

Technical Session 6A
Chris Fuson
Unification of Alerting Engines for Monitoring in System Management
HPE Cray EX255a Telemetry - Improved Configurability and Performance
Best Practices for deployment of LDMS on the HPE Cray EX platform
Presentation, Paper

Technical Session 6B
Paul L. Peltz Jr.
POD: Reconfiguring Compute and Storage Resources Between Cray EX Systems
Zero Downtime System Upgrade Strategy
Multitenancy on HPE Cray EX: network segmentation and isolation
Presentation, Paper

Technical Session 6C
Bilel Hadri
ClusterStor Tiering, Overview, Setup, and Performance
Exploring new software-defined storage technology using VAST on Cray EX systems
Reducing Mean Time to Resolution (MTTR) for complex HPC-based systems with next generation automated service tools.
Presentation, Paper

2:30pm-3:00pm

Coffee Break
Break

3:00pm-4:00pm

BoF 3A
System Monitoring Working Group
Birds of a Feather

3:00pm-4:30pm

Lightning Tutorial 7B
Data Science Beyond the Laptop: Handling Data of Any Size with Arkouda
Tutorial

Lightning Tutorial 7C
Exploring high performance object storage using DAOS
Tutorial

4:00pm-4:30pm

Technical Session 7A
John Holmen
Proactive Precision: Enhancing High-Performance Computing with Early Job Failure Detection
Presentation, Paper

4:30pm-5:00pm

Technical Session 8A
Tina Declerck
Optimizing I/O Patterns to Speed up Non-contiguous Data Retrieval and Analyses
Presentation, Paper

Technical Session 8B
Raj Gautam
Using HPE-Provided Resources to Integrate HPE Support into Internal Incident Management
Presentation, Paper

Technical Session 8C
Jim Rogers
Building LDMS Slingshot Switch Samplers
Presentation, Paper

5:00pm-5:10pm

CUG 2024 Closing
Plenary

Created 2024-5-7 17:35