CUG2018 Proceedings

Monday, May 21st

8:30am-10:00am

Tutorial 1B

Event Small

Harold Longley

Managing SMW/eLogin and ARM XC nodes

Tutorial

Tutorial 1C

Studio 6

Michael Ringenburg

Analytics and Artificial Intelligence Workloads on Cray Systems

Tutorial

Tutorial 1D

Studio 3

John Levesque

Applying a “Whack-a-Mole” Method Using Cray perftools to Identify the Moles

Tutorial

10:30am-12:00pm

Tutorial 1B Continued

Event Small

Harold Longley

Managing SMW/eLogin and ARM XC nodes

Tutorial

Tutorial 1C Continued

Studio 6

Michael Ringenburg

Analytics and Artificial Intelligence Workloads on Cray Systems

Tutorial

Tutorial 1D Continued

Studio 3

John Levesque

Applying a “Whack-a-Mole” Method Using Cray perftools to Identify the Moles

Tutorial

1:00pm-2:30pm

Tutorial 2B

Event Small

Harold Longley

Managing SMW/eLogin and ARM XC nodes

Tutorial

Tutorial 2C

Studio 6

Michael Ringenburg

Analytics and Artificial Intelligence Workloads on Cray Systems

Tutorial

Tutorial 2D

Studio 3

Benjamin Landsteiner

DataWarp Administration Tutorial

Tutorial

3:00pm-4:30pm

Tutorial 2B Continued

Event Small

Managing SMW/eLogin and ARM XC nodes

Tutorial

Tutorial 2C Continued

Studio 6

Michael Ringenburg

Analytics and Artificial Intelligence Workloads on Cray Systems

Tutorial

Tutorial 2D Continued

Studio 3

DataWarp Administration Tutorial

Tutorial

4:40pm-6:40pm

BoF 3A

Event Large

Bilel Hadri

Programming Environments, Applications, and Documentation (PEAD) Special Interest Group meeting

Birds of a Feather

BoF 3B

Event Small

Nicholas Cardo

Systems Support SIG Meeting

Birds of a Feather

BoF 3C

Studio 6

Michele Bertasi; Torben Kling Petersen

When to use Flash and when not to …..

Scalable Accounting & Reporting for Compute Jobs

Birds of a Feather

BoF 3D

Studio 3

Sadaf R. Alam

Tools and Utilities for Data Science Workloads and Workflows

Opportunities for containers in HPC ecosystems

Birds of a Feather

Tuesday, May 22nd

9:40am-9:50am

Sponsor Talk 5

Event Large

Trey Breckenridge

Software ecosystem for Arm-based HPC

Sponsor Talk

9:50am-10:00am

Sponsor Talk 6

Event Large

Trey Breckenridge

PBS Works 2018 and Beyond: Multi-sched, Power Rate Ramp Limiting, Soft walltime, Job Equivalence Classes, …, Native Mode

Sponsor Talk

1:00pm-2:30pm

Technical Session 8A

Event Large

Ronald Brightwell

Cray Next Generation Software Integration Options

CLE Port to ARM: Functionality, Performance, and Lessons Learned

SSA, ClusterStor Call-home Service Actions, and an Introduction to Cray Central Telemetry and Triage Services (C2TS)

Paper

Technical Session 8B

Event Small

Bilel Hadri

Performance Evaluation of MPI on Cray XC40 Xeon Phi Systems

Performance Impact of Rank Reordering on Advanced Polar Decomposition Algorithms

Are We Witnessing the Spectre of an HPC Meltdown?

Paper

Technical Session 8C

Studio 6

Frank M. Indiviglio

How Deep is Your I/O? Toward Practical Large-Scale I/O Optimization via Machine Learning Methods

TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis

Improving Nektar++ IO Performance for Cray XC Architecture

Paper

3:00pm-5:00pm

Technical Session 9A

Event Large

Ann Gentile; Jim Rogers

Cray System Monitoring: Successes, Priorities, Visions

Cray System Monitoring: Successes, Priorities, Visions

Ville Ahlgren (CSC - IT Center for Science Ltd.); Stefan Andersson (Cray Inc., High Performance Computing Center Stuttgart); Jim Brandt (Sandia National Laboratories); Nicholas Cardo (Swiss National Supercomputing Centre); Sudheer Chunduri (Argonne National Laboratory); Jeremy Enos (National Center for Supercomputing Applications); Parks Fields (Los Alamos National Laboratory); Ann Gentile (Sandia National Laboratories); Richard Gerber (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Joe Greenseid (Cray Inc.); Annette Greiner (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Bilel Hadri (King Abdullah University of Science and Technology); Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Dennis Hoppe (High Performance Computing Center Stuttgart); Urpo Kaila (CSC - IT Center for Science Ltd.); Kaki Kelly (Los Alamos National Laboratory); Mark Klein (Swiss National Supercomputing Centre); Alex Kristiansen (Argonne National Laboratory); Steve Leak (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Michael Mason (Los Alamos National Laboratory); Kevin Pedretti (Sandia National Laboratories); Jean_Guillaume Piccinali (Swiss National Supercomputing Centre); Jason Repik (Sandia National Laboratories); Jim Rogers (Oak Ridge National Laboratory); Susanna Salminen (CSC - IT Center for Science Ltd.); Michael Showerman (National Center for Supercomputing Applications); Cary Whitney (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jim Williams (Los Alamos National Laboratory)

Abstract

Effective HPC system operations and utilization require unprecedented insight into system state, applications’ demands for resources, contention for shared resources, and system demands on center power and cooling. Monitoring can provide such insights when the necessary fundamental capabilities for data availability and usability are provided. In this paper, multiple Cray sites seek to motivate monitoring as a core capability in HPC design, through the presentation of success stories illustrating enhanced understanding and improved performance and/or operations as a result of monitoring and analysis.We present the utility, limitations, and gaps of the data necessary to enable the required insights. The capabilities developed to enable the case successes drive our identification and prioritization of monitoring system requirements. Ultimately, we seek to engage all HPC stakeholders to drive community and vendor progress on these priorities.

pdf, pdf

Use of the ERD for administrative monitoring of Theta

Supporting failure analysis with discoverable, annotated log datasets

Paper

Technical Session 9B

Event Small

Zhengji Zhao

Intel® Xeon processor Scalable family performance in HPC, AI and other key segments

Storage and memory hierarchy in HPC: new paradigm and new solutions with Intel

Chapel Comes of Age: Making Scalable Programming Productive

The Cray Programming Environment: Current Status and Future Directions

Paper

Technical Session 9C

Studio 6

Ronald Brightwell

Scaling Deep Learning without Impacting Batchsize

Alchemist: An Apache Spark <=> MPI Interface

Continuous integration in a Cray multiuser environment

Paper

5:10pm-6:20pm

BoF 10A

Event Large

Larry Kaplan

Cray Next Generation Software Integration Options

Birds of a Feather

BoF 10B

Event Small

Bilel Hadri

Managing Effectively the User Software Ecosystem

Managing Effectively the User Software Ecosystem

Bilel Hadri (KAUST Supercomputing Lab), Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre), Ashley Barker (Oak Ridge National Laboratory), Mario Melara and Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Peggy Sanchez (Cray Inc.)

Abstract

Supercomputing centers host and manage HPC systems to enable researchers to productively carry out their science and engineering research and provide them support and assistance not only with software from the vendor-supplied programming environment, but also third-party software installed by HPC staff.

The staff often supports and maintains hundreds of software packages, each with multiple versions, and each version potentially built with multiple compilers and usually have complex dependencies. Furthermore, it is necessary to upgrade packages on a regular basis and to make the process of installation versatile and automated as much as possible. Following these guidelines, the installation process becomes more reproducible by any member of the staff and result in less issues faced by end users.

To manage effectively the user software ecosystem, many CUG member sites have adopted different strategies and employed different tools such as EasyBuild, SWTools and Spack. Nevertheless, each of these tools have its advantages and inconveniences and in some cases, the installation does not consider the optimal configuration to achieve the best performance on Cray platform(s).

The goal of this BoF is to start a discussion between CUG members along with Cray Documentation and Performance teams, in order to share the strengths and weaknesses of the strategies currently adopted, gather the best recipes of installation, and merge our efforts in a common repository hosted in the Cray Documentation Portal. This catalogue gathering of the different applications installations will benefit the whole HPC community using Cray systems.

pdf

Birds of a Feather

BoF 10C

Studio 6

Stephen Leak

Practical implementation of monitoring on Cray system

Birds of a Feather

BoF 10D

Studio 3

David Hancock

Open Discussion with CUG Board

Birds of a Feather

Wednesday, May 23rd

9:40am-9:50am

Sponsor Talk 12

Event Large

Trey Breckenridge

PGI Compilers for Accelerated Computing

Sponsor Talk

9:50am-10:00am

Sponsor Talk 13

Event Large

Trey Breckenridge

Making a Large Investment in HPC? Think Bright

Sponsor Talk

11:50am-12:00pm

New Site 15

Event Large

Helen He

Isambard, the world’s first production Arm-based supercomputer

New Site

2:00pm-2:10pm

New Site 17

Event Large

Helen He

The NIWA/NeSI HPC Replacement Project - A Voyage into Complexity: Integrating multi-site XC, CS, ESS, Spectrum Scale, and OpenStack

The NIWA/NeSI HPC Replacement Project - A Voyage into Complexity: Integrating multi-site XC, CS, ESS, Spectrum Scale, and OpenStack

Michael Uddstrom (National Institute of Water and Atmospheric Research); Brian Corrie and Nick Jones (NeSI); Fabrice Cantos and Aaron Hicks (National Institute of Water and Atmospheric Research); Greg Hall (NeSI); Wolfgang Hayek (National Institute of Water and Atmospheric Research); David Kelly, Patricia Balle, Adam Sachitano, and Brian Gilmer (Cray Inc.); and Dale McCurdy and Andrew Beattie (IBM)

Abstract

Replacement of New Zealand’s national HPC infrastructure was initiated in December 2016 through the release of a single RFP for three HPC systems, funded by four entities: NIWA, University of Auckland, University of Otago and Landcare Research.

The design combined requirements for both resilient operational forecasting services for NIWA, and national research services for both HPC (e.g. climate, seismic) and HTC (e.g. materials, genomics) problems, virtual laboratories, remote visualisation, and data management. To meet these requirements a two-site, geographically disparate solution is being implemented. The infrastructure at each site is similar in design, although different in scale. In summary:

1) In Auckland, a Disaster Recovery system for NIWA operational services: An XC50-AC, CS500 Bright OpenStack managed cluster for pre/post processing, Slurm multi-cluster, IBM Spectrum Scale (ESS) storage, Spectrum Protect backup and HSM services, EDR network, Virtual Laboratories (including NICE DCV remote visualisation), and a CS400 OpenStack managed cluster for shared services (e.g. LDAP (FreeIPA), ELK, Accounting). This system is housed in a unique seismically-anchored air-containment pod. 2) The primary Wellington site houses larger instances of the abovementioned components, including an XC50-LC, plus a CS400 cluster that includes compute, large and huge memory nodes, part of which will also be managed as a Private Cloud under Bright OpenStack. All systems share the IBM Spectrum Scale filesystems (parts of which are replicated between Wellington and Auckland).

This presentation will highlight some of the design features and choices made, and the challenges encountered along the way to implementation of this integrated heterogeneous system.

pdf

New Site

2:10pm-2:20pm

Sponsor Talk 18

Event Large

Trey Breckenridge

Applying DDN to Machine Learning

Sponsor Talk

2:20pm-2:30pm

Sponsor Talk 19

Event Large

Trey Breckenridge

Slurm Overview and Road Map

Sponsor Talk

3:00pm-5:00pm

Technical Session 20A

Event Large

Jim Rogers

Modernizing Cray Systems Management – Use of Redfish APIs on Next Generation Shasta Hardware

Managing the SMW as a git Branch

External Login Nodes at Scale

Best Practices for Management and Operation of Large HPC Installations

Paper

Technical Session 20B

Event Small

Zhengji Zhao

Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard

Evaluating Runtime and Power Requirements of Multilevel Checkpointing MPI Applications on Four Different Parallel Architectures: An Empirical Study

OpenACC and CUDA Unified Memory

Strategies to Accelerate VASP with GPUs Using OpenACC

Paper

Technical Session 20C

Studio 6

Jim Williams

Cray Storage Road Map

DataWarp Transparent Cache: Implementation, Challenges, and Early Experience

PBS Professional - Optimizing the "When & Where" of Scheduling Cray DataWarp Jobs

DataWarp Transparent Caching: Data Path Implementation

Paper

5:10pm-6:20pm

BoF 21A

Event Large

Harold Longley

XC System Management Usability BOF

Birds of a Feather

BoF 21B

Event Small

Chris Fuson

Best Practices for Supporting Diverse HPC User Storage Needs with Finite Storage Resources

Birds of a Feather

BoF 21C

Studio 6

Michael Showerman

Automated Analysis and Effective Feedback

Birds of a Feather

BoF 21D

Studio 3

Kelly J. Marquardt

Customer Collaboration and the Shasta Software Stack

Birds of a Feather

Thursday, May 24th

9:50am-10:00am

Sponsor Talk 23

Event Large

Trey Breckenridge

AMD update - EPYC !

Sponsor Talk

10:30am-12:00pm

Technical Session 24A

Event Large

Jim Rogers

Trinity: Opportunities and Challenges of a Heterogeneous System

Cray Advanced Power Management Updates

Weathering the Storm – Lessons Learnt in Managing a 24x7x365 HPC Delivery Platform

Paper

Technical Session 24B

Event Small

David Paul

Using CAASCADE and CrayPAT for Analysis of HPC Applications

Toward Automated Application Profiling on Cray Systems

Roofline Analysis with Cray Performance Analysis Tools (CrayPat) and Roofline-based Performance Projections for a Future Architecture

Paper

Technical Session 24C

Studio 6

Tim Robinson

Enabling Docker for HPC

Installation, Configuration and Performance Tuning of Shifter V16 on Blue Waters

Incorporating a Test and Development System Within the Production System

Paper

1:00pm-2:30pm

Technical Session 25A

Event Large

Tim Robinson

TensorFlow at Scale - MPI, RDMA and All That

High Performance Scalable Deep Learning with the Cray Programming Environments Deep Learning Plugin

Performance evaluation of parallel computing and Big Data processing with Java and PCJ library

Paper

Technical Session 25B

Event Small

Frank M. Indiviglio

Leveraging MPI RMA to optimise halo-swapping communications in MONC on Cray machines

Cray Performance Tools for Analyzing Applications at Scale

Performance Study of Popular Computational Chemistry Softwares on Cray HPC

Paper

Technical Session 25C

Studio 6

Paul L. Peltz Jr.

The Role of SSD Block Caches in a World with Networked Burst Buffers

Use of View for ClusterStor to monitor and diagnose performance of your Lustre filesystem

Nuclear Meltdown?: Assessing the impact of the Meltdown/Spectre bug at Los Alamos National Laboratory

Paper

3:00pm-4:30pm

Technical Session 26A

Event Large

Bilel Hadri

Optimised all-to-all Communication on Multicore Architectures Applied to FFTs with Pencil Decomposition

Eigensolver Performance Comparison on Cray XC Systems

On the Use of Vectorization in Production Engineering Workloads

Paper

Technical Session 26B

Event Small

Jim Williams

Unikernels/Library Operating Systems for HPC

GPU Usage Reporting

Instrumenting Slurm Command Line Commands to Gain Workload Insight

Paper

Technical Session 26C

Studio 6

Chris Fuson

How to implement the Sonexion RestAPI and correlate it with SEDC and other data.

Usage and Performance of libhio on XC-40 Systems

Improved I/O Using Native Spectrum Scale (GPFS) Clients on a Cray XC System

Paper