Tutorial 1A
Chair: Harold Longley (Cray Inc.)

Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: System management on Cray XC systems has improved since SMW 8.0/CLE 6.0 was introduced. This tutorial moves from introductory information to advanced topics in system management of XC series systems including CMC/eLogin. An overview of system management for the XC series system is provided: the Image Management and Provisioning System (IMPS), the Configuration Management Framework (CMF), and the Node Image Mapping Service (NIMS) for XC series systems, and the CSMS, OpenStack, and eLogin software for external login nodes.

Tutorial 1B
Chair: John Levesque (Cray Inc.)

Getting the Most out of Knights Landing
John Levesque (Cray Inc.)
Abstract: There are several large Knights Landing systems in the community, and there is still a lot of confusion on how best to utilize this new many-core system. KNL is a difficult system to utilize most effectively because there are so many configurations that can be employed, and users do not have enough time or access to discern how best to configure the system for their particular application. This tutorial is built upon the experience that the author has accumulated over the past twelve months using very large applications on the Trinity KNL system at Los Alamos National Laboratory. Over that time a series of tests were conducted to determine the best clustering mode for different types of applications. The clustering modes will be discussed with results from a variety of applications. The next important aspect of KNL is how best to configure the High Bandwidth Memory; once again the decision depends upon the size of memory required on each node and memory access patterns. The most important aspect of using KNL is to vectorize as much as possible, as this is the only mode in which KNL can outperform the state-of-the-art Xeon node. While threading on the node is not as critical as originally thought, there are still places where employing OpenMP threads can gain additional performance. This tutorial will cover all aspects of getting the most out of Knights Landing and when Knights Landing may not be better than the state-of-the-art Xeon.
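Illustrative sketch (not tutorial material): the tutorial's discussion of clustering modes and High Bandwidth Memory can be made concrete with a small launcher that prefers MCDRAM when a KNL node is booted in a flat memory mode, where MCDRAM appears as a CPU-less NUMA node. The node-detection heuristic, the numactl flags, and the placeholder binary name "./my_app" are assumptions, not content from the tutorial.

```python
#!/usr/bin/env python3
"""Minimal sketch: prefer MCDRAM (flat mode) when launching an app on a KNL node.

Assumes MCDRAM is exposed as a CPU-less NUMA node (flat/flat-quadrant mode) and
that the standard numactl utility is installed. Not taken from the tutorial.
"""
import re
import subprocess
import sys


def mcdram_nodes():
    """Return NUMA node IDs that list memory but no CPUs (MCDRAM in flat mode)."""
    out = subprocess.run(["numactl", "--hardware"],
                         capture_output=True, text=True, check=True).stdout
    nodes = []
    for m in re.finditer(r"node (\d+) cpus:(.*)", out):
        if not m.group(2).strip():      # no CPUs listed -> treat as MCDRAM node
            nodes.append(int(m.group(1)))
    return nodes


def launch(app_argv):
    """Run the application, preferring MCDRAM if present, else run unmodified."""
    nodes = mcdram_nodes()
    if nodes:
        # --preferred falls back to DDR once the 16 GB of MCDRAM is exhausted;
        # --membind would fail allocations instead.
        cmd = ["numactl", f"--preferred={nodes[0]}"] + app_argv
    else:
        cmd = app_argv                  # cache mode or non-KNL node
    return subprocess.call(cmd)


if __name__ == "__main__":
    sys.exit(launch(sys.argv[1:] or ["./my_app"]))   # "./my_app" is a placeholder
```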
Tutorial 1C
Chair: Lisa Gerhardt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)

Shifter: Bringing Container Computing to HPC
Lisa Gerhardt, Shane Canon, and Doug Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: This half-day tutorial will introduce researchers and developers to the basics of container computing and running those containers in a Cray environment using Shifter, a framework that delivers Docker-like functionality to HPC by extracting images from native formats (such as a Docker image) and converting them to a common format that is optimally tuned for the HPC environment. The tutorial will also cover more advanced topics, including how to set up a Shifter Image Gateway and create images that run MPI applications that require high-performance networks. The tutorial will also cover ways that Docker images can be used in the scientific process, including packaging images so they can be used to regenerate and confirm results and in the publication process, and will integrate hands-on exercises throughout the training. These exercises will include building Docker images on attendees' own laptops and running those images on a Shifter-enabled HPC system via tutorial accounts. While attendees will not require advanced knowledge of Docker or Shifter, they should be familiar with basic Linux administration such as installing packages and building software. This tutorial will be presented by an experienced team including the authors of the Shifter tool and members of NERSC's Data and Analytics Services Group, which focuses on training scientists to use HPC. This tutorial will present an updated version of the very popular tutorial presented at the Supercomputing 2016 conference.
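Illustrative sketch (not tutorial material): the Shifter tutorial above describes running Docker-derived images on a Shifter-enabled system. The snippet below shows roughly what that looks like when submitted through Slurm, following the Shifter/Slurm integration described in NERSC documentation. The --image batch option, the docker:ubuntu:latest image, and the job parameters are assumptions; site deployments may differ.

```python
#!/usr/bin/env python3
"""Minimal sketch: submit a Slurm job that runs a command inside a Shifter container.

Follows the Shifter/Slurm integration described in NERSC documentation.
Image name and job parameters are placeholders and may differ by site.
Pull the image beforehand with, for example:  shifterimg pull docker:ubuntu:latest
"""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --image=docker:ubuntu:latest

# Each task launched via `shifter` starts inside the container image named above.
srun -n 2 shifter cat /etc/os-release
"""

def submit():
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(JOB_SCRIPT)
        path = f.name
    # sbatch prints "Submitted batch job <id>" on success.
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    print(result.stdout.strip())

if __name__ == "__main__":
    submit()
```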
Tutorial 1A Continued
Chair: Harold Longley (Cray Inc.)
Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: See Tutorial 1A.

Tutorial 1B Continued
Chair: John Levesque (Cray Inc.)
Getting the Most out of Knights Landing
John Levesque (Cray Inc.)
Abstract: See Tutorial 1B.

Tutorial 1C Continued
Chair: Lisa Gerhardt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Shifter: Bringing Container Computing to HPC
Lisa Gerhardt, Shane Canon, and Doug Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: See Tutorial 1C.

Tutorial 2A
Chair: Harold Longley (Cray Inc.)
Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: See Tutorial 1A.

Tutorial 2B
Chair: Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

Burst Buffer Basics
Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Bilel Hadri and George Markomanolis (KAUST), and David Paul (NERSC)
Abstract: The long-awaited Burst Buffer technology is now being deployed on major supercomputing systems. In this tutorial we will introduce how Burst Buffers have been configured on new supercomputers at NERSC and KAUST and briefly discuss early experience with Burst Buffers from both a system and a user's perspective. Attendees will be given access to Cori, NERSC's newest supercomputer, and will use the Cori Burst Buffer in a series of exercises designed to demonstrate the IO capabilities of the SSD storage at scale, as well as some of the limitations. Simple optimisation exercises will be provided. The tutorial will conclude with a live demonstration of a complex workflow executed on the Burst Buffer, including simulation, analysis and visualisation. Materials are available at https://sites.google.com/lbl.gov/burstbuffer-tutorial-cug17.
Tutorial 2C
Chair: Michael Ringenburg (Cray Inc.)

Analytics and Machine Learning on Cray XC and Intel systems
Michael Ringenburg (Cray Inc.); Kristyn Maschhoff (Cray Inc.); Lisa Gerhardt, Rollin Thomas, and Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jing Huang and Vivek Rane (Intel Corporation)
Abstract: Big data analytics and machine learning have become important application areas for Cray systems. This tutorial will describe a variety of widely used analytics frameworks, such as Apache Spark, the Python PyData stack, and R, and show how these frameworks can be run on Cray XC series supercomputers. We will also provide an overview of the Cray Graph Engine (CGE), a semantic graph database and graph analytics framework, and show how to generate an RDF dataset, load it into CGE, and run SPARQL queries. In addition, we will discuss installing and running deep learning frameworks such as TensorFlow, and describe Intel optimizations for deep learning. The tutorial will also describe tips and tricks for getting the best performance from analytics and machine learning on Cray platforms. Simple exercises using interactive Jupyter notebooks and software installed at NERSC will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray XC systems.

Tutorial 2A Continued
Chair: Harold Longley (Cray Inc.)
Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: See Tutorial 1A.

Tutorial 2B Continued
Chair: Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Burst Buffer Basics
Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Bilel Hadri and George Markomanolis (KAUST), and David Paul (NERSC)
Abstract: See Tutorial 2B.
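Illustrative sketch (not tutorial material): the Burst Buffer tutorial above covers allocating DataWarp space and staging data in and out of it from a batch job. The Python-embedded batch script below sketches that pattern, assuming Slurm with DataWarp support as on Cori; capacities, Lustre paths, and the executable name are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: a Slurm + DataWarp job that stages data through the burst buffer.

The #DW directives follow Cray DataWarp/Slurm documentation; sizes, Lustre paths,
and ./sim.x are placeholders. Not taken from the tutorial material.
"""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=01:00:00
#DW jobdw capacity=200GB access_mode=striped type=scratch
#DW stage_in source=/lustre/project/input destination=$DW_JOB_STRIPED/input type=directory
#DW stage_out source=$DW_JOB_STRIPED/output destination=/lustre/project/output type=directory

# The application reads and writes SSD-backed DataWarp space instead of Lustre.
mkdir -p $DW_JOB_STRIPED/output
srun -n 128 ./sim.x --in $DW_JOB_STRIPED/input --out $DW_JOB_STRIPED/output
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(JOB_SCRIPT)
    script = f.name

# Stage-in happens before compute nodes are allocated; stage-out after the job ends.
subprocess.run(["sbatch", script], check=True)
```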
Tutorial 2C Continued
Chair: Michael Ringenburg (Cray Inc.)
Analytics and Machine Learning on Cray XC and Intel systems
Michael Ringenburg (Cray Inc.); Kristyn Maschhoff (Cray Inc.); Lisa Gerhardt, Rollin Thomas, and Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jing Huang and Vivek Rane (Intel Corporation)
Abstract: See Tutorial 2C.

Birds of a Feather BoF 3A
Chair: Nicholas Cardo (Swiss National Supercomputing Centre)

Birds of a Feather BoF 3B
Chair: Kelly J. Marquardt (Cray); Sadaf R. Alam (CSCS)

New use cases and usage models for Cray DataWarp
Sadaf R. Alam (CSCS-ETHZ), Thomas Schulthess (ETH Zurich), Bilel Hadri (KAUST), Debbie Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Maxime Martinasso (CSCS-ETHZ)
Abstract: This BOF will be a collaborative effort between the sites that have deployed the Cray DataWarp technology and Cray DataWarp engineers and developers. The goal of this BOF is to explore use cases and usage models that could open up opportunities for data science workflows. Motivating examples will be presented, followed by a technical discussion of the current implementation of the DataWarp software stack. The BOF participants will have an opportunity to share experiences using the DataWarp technology and to contribute representative data science use cases that could benefit from the Cray DataWarp technology.

Sharing Cray Solutions
Kelly J. Marquardt (Cray Inc.)
Abstract: You're invited to help brainstorm! Cray is considering ways to enhance the ecosystem of solutions for Cray platforms by building a strong community. Could include:

Tutorial 3C (continued)
Chair: Michael Ringenburg (Cray Inc.)
Analytics and Machine Learning on Cray XC and Intel systems
Michael Ringenburg (Cray Inc.); Kristyn Maschhoff (Cray Inc.); Lisa Gerhardt, Rollin Thomas, and Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jing Huang and Vivek Rane (Intel Corporation)
Abstract: See Tutorial 2C.
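Illustrative sketch (not tutorial material): the analytics tutorial listed above (Tutorial 2C/3C) covers running frameworks such as Apache Spark and the PyData stack on Cray XC systems. The fragment below is a generic, minimal PySpark illustration of the kind of workload involved; the input path is a placeholder, and on an XC system the Spark context would normally be created through a site-provided launcher rather than locally.

```python
"""Generic PySpark sketch: a word count, representative of the analytics
workloads discussed in the tutorial. The input path is a placeholder."""
from operator import add

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-sketch")
sc = SparkContext(conf=conf)

counts = (sc.textFile("/lustre/project/corpus/*.txt")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add)
            .sortBy(lambda kv: kv[1], ascending=False))

# Print the ten most frequent words and their counts.
for word, n in counts.take(10):
    print(f"{word}\t{n}")

sc.stop()
```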
Plenary General Session 4
Chair: David Hancock (Indiana University)

Keynote: What Are the Opportunities and Challenges for a New Class of Exascale Applications? What Challenge Problems Can These Applications Address through Modeling and Simulation & Data Analytic Computing Solutions?
Douglas Kothe (Oak Ridge National Laboratory)
Abstract: The Department of Energy's (DOE) Exascale Computing Project (ECP) is a partnership between the DOE Office of Science and the National Nuclear Security Administration. Its mission is to transform today's high performance computing (HPC) ecosystem by executing a multi-faceted plan: developing mission-critical applications of unprecedented complexity; supporting U.S. national security initiatives; partnering with the U.S. HPC industry to develop exascale computer architectures; collaborating with U.S. software vendors to develop a software stack that is both exascale-capable and usable on U.S. industrial and academic scale systems; and training the next-generation workforce of computer and computational scientists, engineers, mathematicians, and data scientists. The ECP aims to accelerate delivery of a capable exascale computing ecosystem that will enable breakthrough modeling and simulation (M&S) and data analytic computing (DAC) solutions to the most critical challenges in scientific research, energy assurance, economic competitiveness, and national security.

Sponsor Talk 5
Chair: Trey Breckenridge (Mississippi State University)

[DDN] Flash-Native Caching for Predictable Job Completion in Data-Intensive Environments
James Coomer (DataDirect Networks)
Abstract: In this talk, Dr. James Coomer, Senior Technical Advisor for EMEA, will provide examples of testing and deployment results from the past year of DDN's flash-native caching solution, Infinite Memory Engine. He will also discuss the company's Lustre development and productization efforts and future plans.

Sponsor Talk 6
Chair: Trey Breckenridge (Mississippi State University)

[ANSYS] Why Supercomputing Partnerships Matter for CFD Simulations
Wim Slagter (ANSYS)
Abstract: This presentation will address how CFD scalability and capabilities for customization have evolved over the last decade, and how supercomputing partnerships are playing a crucial role. Examples of extreme scalability (on Cray systems) and application customization will be featured that are illustrative for all users, whether you're running on a 100,000+ core supercomputer or a 1,000-core cluster.
Plenary General Session 7
Chair: Helen He (National Energy Research Scientific Computing Center)

Paper Technical Session 8A
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Early Evaluation of the Cray XC40 Xeon Phi System 'Theta' at Argonne
Scott Parker, Vitali Morozov, Sudheer Chunduri, Kevin Harms, Christopher Knight, and Kalyan Kumaran (Argonne National Laboratory)
Abstract: The Argonne Leadership Computing Facility (ALCF) has recently deployed a nearly 10 PF Cray XC40 system named Theta. Theta is nearly identical in performance capacity to the ALCF's current IBM Blue Gene/Q system, Mira, and represents the first step in a path from Mira to a much larger next-generation Intel-Cray system named Aurora, to be deployed in 2018. Theta serves as both a production resource for scientific computation and a platform for transitioning applications to run efficiently on the Intel Xeon Phi architecture and the Dragonfly topology based network. This paper presents an initial evaluation of Theta. The Theta system architecture will be described along with the results of benchmarks characterizing the performance at the processor core, memory, and network component levels. In addition, benchmarks are used to evaluate the performance of important runtime components such as MPI and OpenMP. Finally, performance results for the scientific applications are described.

Performance on Trinity Phase 2 (a Cray XC40 utilizing Intel Xeon Phi processors) with Acceptance Applications and Benchmarks
Anthony Agelastos and Mahesh Rajan (Sandia National Laboratories); Nathan Wichmann (Cray Inc.); Randal Baker (Los Alamos National Laboratory); Stefan Domino (Sandia National Laboratories); Erik Draeger (Lawrence Livermore National Laboratory); and Sarah Anderson, Jacob Balma, Steve Behling, Mike Berry, Pierre Carrier, Mike Davis, Kim McMahon, Dick Sandness, Kevin Thomas, Steve Warren, and Ting-Ting Zhu (Cray Inc.)
Abstract: Trinity is the first NNSA ASC Advanced Technology System (ATS-1) designed to provide the application scalability, performance, and system throughput required for the Nuclear Security Enterprise. Trinity Phase 1, with 9,436 dual-socket Haswell nodes, is currently in production use. The Phase 2 system, the focus of this paper, has close to 9,900 Intel Knights Landing (KNL) Xeon Phi nodes. This paper documents the performance of the selected acceptance applications, the Sustained System Performance benchmark suite, and a number of micro-benchmarks. The paper discusses the experiences of the Tri-Lab (LANL, SNL, and LLNL) and Cray teams in extracting optimal performance, with consideration of the choice of KNL memory mode, hybrid MPI+OpenMP parallelization, vectorization, and HBM utilization.

Evaluating the Networking Characteristics of the Cray XC-40 Intel Knights Landing Based Cori Supercomputer at NERSC
Douglas Doerfler, Brian Austin, Brandon Cook, and Jack Deslippe (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Krishna Kandalla and Peter Mendygral (Cray Inc.)
Abstract: There are many potential issues associated with deploying the Intel Knights Landing (KNL) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi core is a fraction of that of a Xeon core.
In this paper we take a look at the tradeoffs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g. the tradeoff between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as inter-node optimizations and support for KNL's high-speed memory (MCDRAM). We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above tradeoffs and features using a suite of NERSC applications.

Paper Technical Session 8B
Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration)

Toward Interactive Supercomputing at NERSC with Jupyter
Rollin Thomas, Shane Canon, Shreyas Cholia, Lisa Gerhardt, and Evan Racah (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Extracting scientific insights from data increasingly demands a richer, more interactive experience than traditional high-performance computing systems have historically provided. We present our efforts to leverage Jupyter for interactive data-intensive supercomputing on the Cray XC40 Cori system at the National Energy Research Scientific Computing Center (NERSC). Jupyter is a flexible, popular literate-computing web application for creating "notebooks" containing code, equations, visualization, and text. We explain the motivation for interactive supercomputing, describe our implementation strategy, and outline lessons learned along the way. Our deployment will allow users access to software packages and specialized kernels for scalable analytics with Spark, real-time data visualization with yt, complex analytics workflows with Dask and IPyParallel, and much more. We anticipate that many users may come to rely exclusively on Jupyter at NERSC, leaving behind the traditional login shell.

In-situ data analytics for highly scalable cloud modelling on Cray machines
Nick Brown (EPCC, The University of Edinburgh); Adrian Hill and Ben Shipway (Met Office, UK); and Michele Weiland (EPCC, The University of Edinburgh)
Abstract: MONC is a highly scalable modelling tool for the investigation of atmospheric flows, turbulence and cloud microphysics. Simulations produce very large amounts of raw data which must then be analysed for scientific investigation. For performance and scalability, this analysis should be performed in situ as the data is generated; however, one does not wish to pause the computation whilst analysis is performed.

Precipitation Nowcasting: Leveraging Deep Recurrent Convolutional Neural Networks
Alexander Heye, Karthik Venkatesan, and Jericho Cain (Cray Inc.)
Abstract: Automating very short-term precipitation forecasts can prove a significant challenge in that traditional physics-based weather models are computationally expensive; by the time the forecast is made, it may already be irrelevant. Deep learning offers a solution to this problem, in that a computationally dense machine can train a neural network ahead of time using historical data and deploy that trained network in real time to produce a new output within seconds or minutes. Our team intends to prove the capabilities of deep learning in short-term forecasting by leveraging a model built on Convolutional Long Short-Term Memory (convLSTM) networks. By designing a 3D sequence-to-sequence convLSTM model, we hope to offer accurate precipitation forecasts at minute-level time resolution and comparable spatial resolution to the radar input data. Our work will be accelerated by the GPU-dense CS-Storm system for training and the Cray GX for real-time processing of radar data.
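Illustrative sketch (not from the paper): the nowcasting abstract above describes a sequence-to-sequence convLSTM model over radar frames. The snippet below is a generic Keras sketch of that family of architectures, not the authors' network; layer sizes, sequence length, and frame dimensions are arbitrary placeholders, and random arrays stand in for radar data.

```python
"""Generic Keras sketch of a convLSTM sequence-to-sequence model for radar frames.
Not the authors' architecture; all shapes and layer sizes are placeholders."""
import numpy as np
from keras.models import Sequential
from keras.layers import ConvLSTM2D, BatchNormalization, Conv3D

frames, height, width = 10, 64, 64     # placeholder sequence length and frame size

model = Sequential([
    # Stacked convLSTM layers learn spatio-temporal features from the input frames.
    ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=True,
               input_shape=(frames, height, width, 1)),
    BatchNormalization(),
    ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=True),
    BatchNormalization(),
    # A 3D convolution maps the learned features back to one predicted frame per step.
    Conv3D(1, kernel_size=(3, 3, 3), padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()

# Toy data standing in for normalized radar reflectivity sequences.
x = np.random.rand(8, frames, height, width, 1).astype("float32")
y = np.random.rand(8, frames, height, width, 1).astype("float32")
model.fit(x, y, batch_size=2, epochs=1)
```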
Paper Technical Session 8C
Chair: Chris Fuson (ORNL)

Telemetry-enabled Customer Support using the Cray System Snapshot Analyzer (SSA)
Richard J. Duckworth, Kevin Coryell, Scott McLeod, and Jay Blakeborough (Cray Inc.)
Abstract: SSA is a Cray customer service application designed to support product issue diagnosis and reduce time to resolution. SSA is focused on the submission of product telemetry from a customer system to Cray over a secure network channel. SSA provides value to the support process by automating (1) the collection, submission, and analysis of product diagnostic information, (2) the collection, submission, and analysis of product health information, and (3) key aspects of the customer support process. The focal topic for this paper is a discourse on the benefits of SSA use. For background, we will provide a general overview of SSA and references for further reading. Next, we will provide an update on SSA product release and operations history. Finally, we will discuss the anticipated product roadmap for SSA.

How to write a plugin to export job, power, energy, and system environmental data from your Cray XC system
Steven J. Martin (Cray Inc.), Cary Whitney (Lawrence Berkeley National Laboratory), and David Rush and Matthew Kappel (Cray Inc.)
Abstract: In this paper we take a deep dive into writing a plugin to export power, energy, and other system environmental data from a Cray XC system. With the release of SMW 8.0/CLE 6.0 software, Cray has enabled customers to create site-specific plugins to export all of the data that can flow into the Cray Power Management Database (PMDB) into site-specific infrastructure. In this paper we give practical information on what data is available using the plugin, and how to write, test, and deploy a plugin. We also share and explain example plugin code, detail design considerations to take into account when architecting a plugin, and look at some practical use cases supported by exporting telemetry data off a Cray XC system. This paper is targeted at plugin developers, system administrators, data scientists, and site planners.

Using Open XDMoD for accounting analytics on the Cray XC supercomputer
Thomas Lorenzen and Damon Kasacjak (Danish Meteorological Institute) and Jason Coverston (Cray Inc.)
Abstract: As supercomputers grow and accommodate more users and projects, the system utilization and accounting log files grow as well, often beyond easy native comprehension, thus requiring a flexible graphical tool for accounting analytics. This presentation will depict the joint effort of the Danish Meteorological Institute (DMI) and Cray to adapt Open XDMoD, http://open.xdmod.org, to the DMI Cray XC supercomputer. Extensions to the Cray RUR framework that monitor metrics of particular relevance to the site have been embedded into Open XDMoD and will be presented as well. This will show Open XDMoD to be a flexible tool for use with the Cray XC supercomputer, with strong potential for extending metrics ingestion and graphical presentation in numerous ways.

Paper Technical Session 9A
Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Using Spack to Manage Software on Cray Supercomputers
Mario A. Melara (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Todd Gamblin and Gregory Becker (Lawrence Livermore National Laboratory), Robert French and Matt Belhorn (Oak Ridge National Laboratory), Kelly Thompson (Los Alamos National Laboratory), and Rebecca Hartman-Baker (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: HPC software is becoming increasingly complex. A single application may have over 70 dependency libraries, and HPC centers deploy even more libraries for users. Users demand many different builds of packages with different compilers and options, but building them can be tedious. HPC support teams cannot keep up with user demand without better tools.

Regression testing on Shaheen Cray XC40: Implementation and Lessons Learned
Bilel Hadri and Samuel Kortas (KAUST Supercomputing Lab), Robert Fiedler (Cray Inc.), and George Markomanolis (KAUST Supercomputing Lab)
Abstract: Leadership-class supercomputers are becoming larger and more complex, tightly integrated systems consisting of many different hardware components, tens of thousands of processors and memory chips, kilometers of networking cables, large numbers of disks, and hundreds of applications and libraries. To increase scientific productivity and ensure that applications efficiently and effectively exploit a system's full potential, all the components must deliver reliable, stable, and performant service. Therefore, to deliver the best computing environment to our users, system performance assessments are critical, especially after an unplanned downtime or any scheduled maintenance session. This paper describes the design and implementation of the regression testing methodology used on the Shaheen2 XC40 to detect and track issues related to the performance and functionality of compute nodes, storage, network, and programming environment. We also present an analysis of the results over 24 months, along with the lessons learned.

Python Usage Metrics on Blue Waters
Colin A. MacLean (National Center for Supercomputing Applications/University of Illinois)
Abstract: Blue Waters supports a large Python stack containing over 650 total packages. As part of maintaining this support, logging functionality has been introduced to track the usage statistics of both National Center for Supercomputing Applications (NCSA) and user-provided Python packages. Due to the number of NCSA-supplied packages, it is rare to receive a request for packages which are not already installed, leading to a lack of information about which packages and their dependencies are being used. By tracking module imports, a detailed log of usage information has been used to focus support efforts on improving the usability and performance of popular usage patterns.
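Illustrative sketch (not NCSA's implementation): the Blue Waters paper above tracks which Python packages users actually import. The snippet below shows one generic way such import logging can be wired in, for example from a sitecustomize module; the log path and record format are placeholders.

```python
"""Generic sketch of Python import logging, e.g. placed in sitecustomize.py.
Illustration only, not NCSA's implementation; the log path is a placeholder."""
import atexit
import builtins
import getpass
import json
import time

_seen = set()
_real_import = builtins.__import__


def _logging_import(name, *args, **kwargs):
    # Record only the top-level package name, once per interpreter session.
    _seen.add(name.partition(".")[0])
    return _real_import(name, *args, **kwargs)


def _flush():
    record = {"user": getpass.getuser(),
              "time": time.time(),
              "packages": sorted(_seen)}
    try:
        # Placeholder path; a real deployment would write to site-chosen storage.
        with open("/tmp/python-usage.log", "a") as fh:
            fh.write(json.dumps(record) + "\n")
    except OSError:
        pass  # never let logging break the user's program


builtins.__import__ = _logging_import
atexit.register(_flush)
```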
Paper Technical Session 9B
Chair: Georgios Markomanolis (KAUST - King Abdullah University of Science and Technology)

libhio: Optimizing IO on Cray XC Systems With DataWarp
Nathan Hjelm and Cornell Wright (Los Alamos National Laboratory)
Abstract: High performance systems are rapidly increasing in size and complexity. To keep up with the IO demands of applications and to provide improved functionality, performance, and cost, IO subsystems are also increasing in complexity. To help applications utilize and exploit increased functionality and improved performance in this more complex environment, we developed a new Hierarchical Input/Output (HIO) library: libhio. In this paper we present the motivation behind the development and the design of libhio.

How to Use Datawarp
Glen Overby (Cray Inc.)
Abstract: Cray DataWarp is a set of technologies that accelerates application I/O in order to reduce job wall clock time. It creates a near storage layer between main memory and hard disk drives, with direct-attached solid-state disk (SSD) storage to provide more cost-effective bandwidth than an external parallel file system (PFS), allowing DataWarp to be provisioned for bandwidth and the PFS to be provisioned for capacity and resiliency. This paper will discuss ways in which DataWarp can benefit applications and provide specific examples of using DataWarp with the Moab, PBS and Slurm workload managers. The detailed examples will include how to use DataWarp striped storage, how applications access that storage, how to stage data, how to use DataWarp per-node storage, how to request storage that persists across multiple jobs, and how to use DataWarp as a transparent cache.
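Rough companion illustration (not the paper's examples): the abstract above mentions per-job striped allocations and storage that persists across jobs. The sketch below combines the two in a single Slurm job, assuming a persistent reservation named "shared_bb" already exists; the reservation name, sizes, and paths are placeholders, and directive spellings and environment variable names may differ by workload manager and DataWarp release.

```python
"""Rough sketch (not the paper's examples): a Slurm job combining a per-job
DataWarp scratch allocation with an existing persistent reservation.
Reservation name, sizes, and paths are placeholders; directive spellings and
environment variable names may differ between workload managers and releases."""

JOB_SCRIPT = """#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#DW jobdw capacity=100GB access_mode=striped type=scratch
#DW persistentdw name=shared_bb

# The per-job allocation is released when the job ends; the persistent
# reservation (created earlier and named "shared_bb" here) survives across jobs.
# $DW_JOB_STRIPED points at the per-job space; the persistent reservation is
# exposed through a similarly named DW_PERSISTENT_* environment variable.
srun -n 64 ./app.x --scratch "$DW_JOB_STRIPED" --shared "$DW_PERSISTENT_STRIPED_shared_bb"
"""

with open("dw_job.sh", "w") as fh:      # submit later with: sbatch dw_job.sh
    fh.write(JOB_SCRIPT)
```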
Datawarp Accounting Metrics
Andrew Barry (Cray Inc.)
Abstract: Datawarp, a burst buffer package of fast flash storage and filesystem software, can provide a large improvement in I/O performance for jobs run on a Cray system, but the scale of this improvement depends on the configuration of the system as well as application optimization. Datawarp is a limited resource on Cray systems, with insufficient storage capacity for all jobs to completely replace their use of parallel filesystems. Thus, it is useful to know how various applications make use of the resource and to track total utilization. These statistics indicate which users are using Datawarp and for which applications. Cray Resource Utilization Reporting (RUR) plugins are available to collect Datawarp statistics and archive them for future analysis. This paper describes the available Datawarp usage statistics, context for interpreting those metrics, and some case studies of applications using Datawarp in different ways.

Paper Technical Session 9C
Chair: Tina Butler (National Energy Research Scientific Computing Center)

Theta: Rapid Installation and Acceptance of an XC40 KNL System
Ti Leggett, Mark Fahey, Susan Coghlan, Kevin Harms, Paul Rich, Ben Allen, Ed Holohan, and Gordon McPheeters (Argonne National Laboratory)
Abstract: In order to provide a stepping stone from the Argonne Leadership Computing Facility's (ALCF) world-class production 10 petaFLOPS IBM Blue Gene/Q system, Mira, to its next-generation 180 petaFLOPS 3rd-generation Intel Xeon Phi system, Aurora, ALCF worked with Intel and Cray to acquire an 8.6 petaFLOPS 2nd-generation Intel Xeon Phi based system named Theta. Theta was delivered, installed, integrated, and accepted on an aggressive schedule in just over 3 months. We will detail how we were able to successfully meet the aggressive deadline as well as lessons learned during the process.

Extending CLE6 to a multi-supercomputer Operating System
Douglas M. Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: NERSC operates multiple Cray supercomputing platforms using CLE6. In this paper we describe our methods, customizations, and additions to CLE6 to enable a coordinated management strategy for all four systems (two production systems and two development systems). Our methods use modern software engineering tools and practices to run precisely the same management software on all four systems, while still allowing version drift, customization, and some amount of acceptable feature drift between systems. In particular, we have devised procedures, software, and tools to allow the test and development systems to move rapidly between the exact production system configuration and various testing (future) configurations. We also discuss capabilities added to CLE6, such as customized Ansible facts trees, integration of ansible-vault for securing sensitive information, "branching" zypper repositories in a context-sensitive way, and replicating NIMS data structures across systems. These methods reduce administration and development cost and increase availability and reproducibility.

Updating the SPP Benchmark Suite for Extreme-Scale Systems
Gregory Bauer, Victor Anisimov, Galen Arnold, Brett Bode, Robert Brunner, Tom Cortese, Roland Haas, Andriy Kot, William Kramer, JaeHyuk Kwack, Jing Li, Celso Mendes, Ryan Mokos, and Craig Steffen (National Center for Supercomputing Applications/University of Illinois)
Abstract: For the effective deployment of modern extreme-scale systems, it is critical to rely on meaningful benchmarks that provide an assessment of the achievable performance a system may yield on real applications. The Sustained Petascale Performance (SPP) benchmark suite was used very successfully in evaluating the Blue Waters system, deployed by Cray in 2012, and in ensuring that the system achieved sustained petascale performance on applications from several areas. However, some of the original SPP codes did not have significant use, or underwent continuous optimizations. Hence, NCSA prepared an updated SPP suite containing codes that more closely reflect the recent workload observed on Blue Waters. NCSA is also creating a public website with source codes, input data, build/run scripts and instructions, plus performance results of this updated SPP suite. This paper describes the characteristics of those codes and analyzes their observed performance, pointing to areas of potential enhancement on modern systems.

Birds of a Feather BoF 10A
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Birds of a Feather BoF 10B
Chair: David Paul (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

A BoF - "Bursts of a Feather" - Burst Buffers from a Systems Perspective
David Paul (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), John Bent (Seagate), and Andrey Kudryavtsev (Intel Corporation)
Abstract: This BoF will bring interested parties together to discuss Burst Buffers: how they are integrated into today's supercomputers, what their underpinnings and use cases are, and what lies ahead for this exciting new high-performance I/O technology.

Birds of a Feather BoF 10C
Chair: Matteo Chesi (Swiss National Supercomputing Centre); Jeff Keopp (Cray Inc.)

XC System Management Usability BOF
Harold Longley and Joel Landsteiner (Cray Inc.)
Abstract: This BOF will be a facilitated discussion around usability of system management software on an XC system with SMW 8.0/CLE 6.0 software, focusing on standard and advanced administrator use cases. The goal will be to gain an understanding of the best and worst parts of interacting with XC System Management software and to understand how customers would like to see the software evolve.
HPC Storage Operations: from experience to new tools
Matteo Chesi (Swiss National Supercomputing Centre), Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory), Oliver Treiber (European Centre for Medium-Range Weather Forecasts), and Maciej L. Olchowik (King Abdullah University of Science and Technology)
Abstract: Managing storage systems at large scale is a challenging duty: detecting incidents, correlating events, troubleshooting strange I/O behaviors, migrating data, planning maintenance, testing new technologies, and dealing with user requests. Starting from storage operations experience, we will discuss the tools that are available today for storage administrators to monitor and control the I/O subsystem of HPC clusters, and the development of new features and tools that may satisfy current and future needs arising from the data analytics and cloud computing trends.

Birds of a Feather BoF 10D
Chair: David Hancock (Indiana University); Michael Showerman (National Center for Supercomputing Applications)

Open Discussion with CUG Board
David Hancock (Indiana University)
Abstract: This session is designed as an open discussion with the CUG Board, but there are a few high-level topics that will also be on the agenda. The discussion will focus on corporation changes to achieve non-profit status (including bylaw changes), feedback on increasing CUG participation, and feedback on SIG structure and communication. An open floor question and answer period will follow these topics. Formal voting (on candidates and the bylaws) will open after this session, so any candidates or members with questions about the process are welcome to bring up those topics.

Holistic Systems Monitoring and Analysis
Michael T. Showerman (National Center for Supercomputing Applications/University of Illinois) and Ann Gentile and James M. Brandt (Sandia National Laboratories)
Abstract: This BOF will be used to improve collaborations in the monitoring and analysis of Cray systems. It will include updates and future directions from many of the sites represented within the CUG monitoring working group. A goal is to improve the number and content of tool-and-technique quick start guides being developed by the Cray monitoring community. This will help the community to gather both lessons learned and requirements for future deployments of the full spectrum of Cray resources.

Plenary General Session 11
Chair: David Hancock (Indiana University)

Invited Talk: Perspectives on HPC and Enterprise High Performance Data Analytics
Arno Kolster (Providentia Worldwide)
Abstract: Mr. Kolster will present his experience of blending HPC and enterprise architectures to solve real-time, web-scale analytics problems and discuss the need to bridge the gap between HPC and enterprise. His unique perspective illustrates the need for enterprise to embrace HPC technologies and vice versa.

Sponsor Talk 12
Chair: Trey Breckenridge (Mississippi State University)

[Seagate] The Effects Fragmentation and Remaining Capacity Have on File System Performance
John Fragalla (Seagate)
Abstract: After a Lustre file system is put into production, and the usage model evolves as users delete and create files over time, fragmentation and the available storage capacity have an effect on overall performance throughput compared to a pristine file system.
In this presentation, Seagate will discuss the benchmark setup methodology: how to fill up the storage capacity of a file system and introduce fragmentation at different capacity points in order to analyze the impact on throughput performance. The presentation will illustrate that, at various percentages of filled capacity, the biggest impact comes from the amount of fragmentation that exists and not only from the capacity filled.

Sponsor Talk 13
Chair: Trey Breckenridge (Mississippi State University)

[SchedMD] Slurm Roadmap
Morris Jette (SchedMD)
Abstract: Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system used on many of the largest computers in the world, including five of the top ten systems on the TOP500 supercomputer list. This presentation will briefly describe Slurm capabilities with respect to Cray systems and the Slurm roadmap.

Plenary General Session 14
Chair: Helen He (National Energy Research Scientific Computing Center)

Lustre Lockahead: Early Experience and Performance using Optimized Locking
Michael Moore, Patrick Farrell, and Bob Cernohous (Cray Inc.)
Abstract: Recent Cray-authored Lustre modifications known as "Lustre lockahead" show significantly improved write performance for collective, shared-file I/O workloads. Initial tests show write performance improvements of more than 200% for small transfer sizes and over 100% for larger transfer sizes compared to traditional Lustre locking. Standard Lustre shared-file locking mechanisms limit scaling of shared-file I/O performance on modern high performance Lustre servers. The new lockahead feature provides a mechanism for applications (or libraries) with knowledge of their I/O patterns to overcome this limitation by explicitly requesting locks. MPI-IO is able to use this feature to dramatically improve shared-file collective I/O performance, achieving more than 80% of file-per-process performance. This paper discusses our early experience using lockahead with applications. We also present application and synthetic performance results and discuss performance considerations for applications that benefit from lockahead.

Plenary talk from Intel: Exascale Reborn
Rajeeb Hazra (Intel Corporation)
Abstract: Join Intel corporate vice president and general manager of the Enterprise and Government group, Dr. Rajeeb Hazra, for our plenary talk. Raj will discuss what must be done to address an ecosystem of ever-changing, complex, and diversified applications and workloads to deliver real-world performance at exascale.

Sponsor Talk 15
Chair: Trey Breckenridge (Mississippi State University)

[PGI] OpenACC and Unified Memory
Doug Miles (PGI)
Abstract: Optimizing data movement between host and device memories is an important step when porting applications to GPUs. This is true for any programming model (CUDA, OpenACC, OpenMP 4+, ...), and becomes even more challenging with complex aggregate data structures. While OpenACC data management directives are designed so they can be safely ignored on a shared memory system with a single address space, such as a multicore CPU, both the CUDA and OpenACC APIs require the programmer or compiler to explicitly manage data allocation and coherence on a system with separated memories. The OpenACC committee is designing directives to extend explicit data management for aggregate data structures.
CUDA C++ has managed memory allocation routines and CUDA Fortran has the managed attribute for allocatable arrays, allowing the CUDA driver to manage data movement and coherence for dynamically allocated data. The latest NVIDIA GPUs include hardware support for fully unified memory, enabling operating system and driver support for sharing of the entire address space between the host CPU and the GPU. We will compare current and future explicit data movement with driver- and system-managed memory, and discuss the impact of these on application development, programmer productivity, and performance.

Sponsor Talk 16
Chair: Trey Breckenridge (Mississippi State University)

[Allinea] Tools and Methodology for Ensuring HPC Program Correctness and Performance
Beau Paisley (ARM)
Abstract: In this presentation we will discuss best practices and methodology for HPC software engineering. We will provide illustrations of how the Allinea debugging and performance analysis tools can be used to ensure that you obtain optimal performance from your codes and that your codes run correctly.

Sponsor Talk 17
Chair: Trey Breckenridge (Mississippi State University)

[Altair] PBS Professional - Stronger, Faster, Better!
Scott J. Suchyta (Altair Engineering, Inc.)
Abstract: Not all workloads are created equal. They vary in size, duration, and priority level, and are influenced by many other site-specific factors. Such requirements often lead administrators to compromise system utilization, service level agreements, or even limit the capabilities of the system itself. Altair continues to provide features and flexibility allowing administrators to control how jobs are scheduled without making dramatic compromises. PBS Professional v13 is architected for exascale with increased speed, scale, resiliency, and much more. This presentation will provide a high-level overview of some of the key new PBS capabilities such as Multi-Scheduler, Preemption Targets, and flexibility and performance enhancements.

Plenary General Session 18
Chair: David Hancock (Indiana University)

Paper Technical Session 19A
Chair: Richard Barrett (Sandia National Labs)

The Cray Programming Environment: Current Status and Future Directions
Luiz DeRose (Cray Inc.)
Abstract: The scale and complexity of current and future high-end systems, with wide nodes, many integrated core (MIC) architectures, multiple levels in the memory hierarchy, and heterogeneous processing elements, bring a new set of challenges for application developers. These technology changes in the supercomputing industry are forcing computational scientists to face new critical system characteristics that will significantly impact the performance and scalability of applications. Users must be supported by intelligent compilers, automatic performance analysis and porting tools, scalable debugging tools, and adaptive libraries. In this talk I will present the recent activities, new functionalities, roadmap, and future directions of the Cray Programming Environment, which is being developed and deployed on Cray clusters and Cray supercomputers for scalable performance with high programmability.
Current State of the Cray MPT Software Stacks on the Cray XC Series Supercomputers
Krishna Kandalla, Peter Mendygral, Naveen Ravichandrasekaran, Nick Radcliffe, Bob Cernohous, Kim McMahon, Christopher Sadlo, and Mark Pagel (Cray)
Abstract: HPC applications heavily rely on the Message Passing Interface (MPI) and SHMEM programming models to develop distributed memory parallel applications. This paper describes a set of new features and optimizations that have been introduced in the Cray MPT software libraries to optimize the performance of scientific parallel applications on modern Cray XC series supercomputers. For Cray XC systems based on the Intel KNL processor, Cray MPT libraries have been optimized to improve communication performance and memory utilization, while also facilitating better use of the MCDRAM technology. Cray MPT continues to improve the performance of hybrid MPI/OpenMP applications that perform communication operations within threaded regions. The latest Cray MPICH offers a new lock-ahead optimization for MPI I/O, along with exposing internal timers and statistics for I/O performance profiling. Finally, this paper describes efforts involved in optimizing real-world applications such as WOMBAT and SNAP, along with deep learning applications, on the latest Cray XC supercomputers.

Profiling and Analyzing Program Performance Using Cray Tools
Heidi Poxon (Cray Inc.)
Abstract: The Cray Performance Tools help the user obtain maximum computing performance on Cray systems with profiling and analysis that focuses on discovering key bottlenecks in programs that run across many nodes. The tools' robust analysis capability helps users identify program hot spots, imbalance, communication patterns, and memory usage issues that impede scaling or optimal performance. As an example, CrayPAT and Cray Apprentice2 were recently used to scale the CNTK deep learning code to new levels for the system at the Swiss National Supercomputing Centre (CSCS). In addition to focusing on simple interfaces to make profiling and analysis accessible to more users, recent enhancements to CrayPAT, Cray Apprentice2 and Reveal include the new HBM memory analysis tool that identifies key arrays that can benefit from allocation in KNL's MCDRAM, a per-NUMA-node memory high-water mark, general Intel KNL and NVIDIA P100 support, as well as profiling support for Charm++.

Novel approaches to HPC user engagement
Clair Barrass and David Henty (EPCC, The University of Edinburgh)
Abstract: EPCC operates the UK national HPC service ARCHER, a Cray XC30 with a diverse user community. A key challenge to any HPC provider is growing the user base, making new users aware of the potential benefits of the service and ensuring a low barrier to entry. To this end, we have explored a number of approaches to user engagement that are novel within the context of UK HPC:

Paper Technical Session 19B
Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory)

Improving I/O Bandwidth With Cray DVS Client-Side Caching
Bryce T. Hicks (Cray Inc.)
Abstract: Cray's Data Virtualization Service, DVS, is an I/O forwarder providing access to native parallel filesystems and to Cray DataWarp application I/O accelerator storage at the largest system scales while still maximizing data throughput. This paper introduces DVS Client-Side Caching, a new option for DVS to improve I/O bandwidth, reduce network latency costs, and decrease the load on both DVS servers and backing parallel filesystems.
Implementing a Hierarchical Storage Management system in a large-scale Lustre and HPSS environment
Brett M. Bode, Michelle Butler, Jim Glasgow, and Sean Stevens (National Center for Supercomputing Applications/University of Illinois) and Nathan Schumann and Frank Zago (Cray Inc.)
Abstract: HSM functionality has been available with Lustre for several releases and is an important capability for HPC systems, providing data protection, space savings, and cost efficiencies; it is especially important to the NCSA Blue Waters system. Very few operational HPC centers have deployed HSM with Lustre, and even fewer at the scale of Blue Waters.

Understanding the IO Performance Gap Between Cori KNL and Haswell
Jialin Liu, Quincey Koziol, and Houjun Tang (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Francois Tessier (Argonne National Laboratory); and Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn Lockwood, Jack Deslippe, and Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004-node Haswell partition and a 9,688-node KNL partition, which ranked as the 5th fastest supercomputer on the November 2016 Top500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important, impacting not only NERSC/LBNL users and other national labs, but also the relevant hardware vendors and software developers. In this paper, we have analyzed the performance of single-core and single-node IO comprehensively on the Haswell and KNL partitions, and have discovered the major bottlenecks, which include CPU frequencies and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost differences caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.

DXT: Darshan eXtended Tracing
Cong Xu (Intel Corporation), Shane Snyder (Argonne National Laboratory), Omkar Kulkarni and Vishwanath Venkatesan (Intel Corporation), Philip Carns (Argonne National Laboratory), Suren Byna (Lawrence Berkeley National Laboratory), Robert Sisneros (National Center for Supercomputing Applications), and Kalyana Chadalavada (Intel Corporation)
Abstract: As modern supercomputers evolve to exascale, their I/O subsystems are becoming increasingly complex, making optimization of I/O for scientific applications a daunting task. Although I/O profiling tools facilitate the process of optimizing application I/O performance, legacy profiling tools lack flexibility in their level of detail and ability to correlate traces with other sources of data. Additionally, a lack of robust trace analysis tools makes it difficult to derive actionable insights from large-scale I/O traces.

Paper Technical Session 19C
Chair: Chris Fuson (ORNL)

Scheduler Optimization for Current Generation Cray Systems
Morris Jette (SchedMD LLC) and Douglas Jacobsen and David Paul (NERSC)
Abstract: The current generation of Cray systems introduces two major complications for workload scheduling: DataWarp burst buffers and Intel Knights Landing (KNL) processors.
In a typical use case, DataWarp resources are allocated to a job and data is staged in before compute resources are allocated to that job, then retained after computation for staging out of data. KNL supports five different NUMA modes and three MCDRAM modes. An application may require a specific KNL configuration, or its performance may be highly configuration dependent. If KNL resources with the desired configuration are not available for pending work, the overhead of rebooting compute nodes must be weighed against running the application in a less than ideal configuration or waiting for processors already in the desired configuration to become available. This paper will present the algorithms used by the Slurm workload manager, a statistical analysis of NERSC’s workload, and experiences with Slurm’s management of DataWarp and KNL. Trust Separation on the Cray XC40 using PBS Pro Sam Clarke (Met Office, UK) Abstract Abstract As the UK's national weather agency, the Met Office has a requirement to produce regular, timely weather forecasts. As a major centre for climate and weather research, it has a need to provide access to large-scale supercomputing resources to users from within the organisation. It also provides a supercomputer facility for academic partners inside the UK, and to international collaborators. Each of these user categories has a different set of availability requirements and requires a different level of access. Experiences running different workload managers across Cray Platforms Haripriya Ayyalasomayajula and Karlon West (Cray Inc.) Abstract Abstract Workload management is a challenging problem both in Analytics and in High Performance Computing. The desire is to have efficient platform utilization while still meeting scalability and scheduling requirements. SLURM and Moab/Torque are two commonly used workload managers that serve both resource allocation and scheduling requirements on the Cray CS and XC series supercomputers. Analytics applications interact with a different set of workload managers such as YARN or, more recently, Apache Mesos, which is the main resource manager for Urika-GX. In this paper, we describe our experiences using different workload managers across Cray platforms (Analytics and HPC). We describe the characteristics and functioning of each of the workload managers. We compare the different workload managers, specifically discussing the pros and cons of the HPC schedulers vs. Mesos, and run a sample workflow on each of the Cray platforms to illustrate resource allocation and job scheduling. An Operational Perspective on a Hybrid and Heterogeneous Cray XC50 System Sadaf R. Alam, Nicola Bianchi, Nicholas Cardo, Matteo Chesi, Miguel Gila, Stefano Gorini, Mark Klein, Marco Passerini, Carmelo Ponti, Fabio Verzelloni, and Colin McMurtrie (CSCS-ETHZ) Abstract Abstract The Swiss National Supercomputing Centre (CSCS) upgraded its flagship system, called Piz Daint, in Q4 2016 in order to support a wider range of services. The upgraded system is a heterogeneous Cray XC50 and XC40 system with Nvidia GPU-accelerated (Pascal) devices as well as multi-core nodes with diverse memory configurations. Despite the state-of-the-art hardware and the design complexity, the system was built in a matter of weeks and was returned to fully operational service for CSCS user communities in less than two months, while at the same time providing significant improvements in energy efficiency.
This paper focuses on the innovative features of the Piz Daint system that not only resulted in an adaptive, scalable and stable platform but also offer a very high level of operational robustness for a complex ecosystem. Birds of a Feather BoF 20A Chair: Patricia Langer (Cray) Sonexion Monitoring and Metrics: data collection, data retention, user workflows Patricia Langer and Craig Flaskerud (Cray) Abstract Abstract This BoF will explore types of metrics useful in analyzing performance issues on Cray Sonexion storage systems, data retention and reduction concerns considering the volume of metrics that can be collected, and typical workflows administrators use to analyze and isolate Sonexion performance issues related to jobs launched from their Cray HPC systems. Birds of a Feather BoF 20B Chair: Harold Longley (Cray Inc.) eLogin Usability and Best Practices Jeff Keopp, Mark Ahlstrom, and Harold Longley (Cray Inc.) Abstract Abstract This BoF session is a facilitated discussion around usability and best practices for administering and configuring eLogin nodes. eLogin nodes are the external login nodes for Cray XC systems running CLE 6.x. They replace the esLogin nodes used with CLE 5.x. The discussion will also include the new Cray Management Controller (CMC) and Cray System Management Software (CSMS), which replace the current CIMS and Bright Cluster Manager software for managing eLogin nodes. The goal will be to gain an understanding of the best and worst parts of administering eLogin nodes and to understand how customers would like to see the software evolve. Birds of a Feather BoF 20C Chair: Robert Stober (Bright) Building an Enterprise-Grade Deep Learning Environment with Bright and Cray Robert Stober (Bright Computing) Abstract Abstract Enterprises are collecting increasing amounts of data. By leveraging deep and machine learning technologies, the analysis of corporate data can be taken to the next level, providing organizations with richer insight into their business, resulting in increased sales and/or significant competitive advantage. When business advantage is tied to the insights achieved via deep learning, it is essential for the underlying IT infrastructure to be deployed and managed as enterprise-grade, not as a lab experiment. However, building and managing an advanced cluster, installing the software that satisfies all of the library dependencies, and making it all work together presents an enormous challenge. Birds of a Feather BoF 20D Chair: Nicholas Cardo (Swiss National Supercomputing Centre) Bringing "Shifter" to the Broader Community Nicholas Cardo (Swiss National Supercomputing Centre) and Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract The success and popularity of using "Shifter", developed by NERSC, for containers continues to grow. In order to keep pace with this success and growth of "Shifter", a community is forming behind it. During this BoF, we'll discuss the status, efforts, and opportunities of bringing "Shifter" to an Open Source Community. Topics will include organisational structure, software management, and levels of participation.
Plenary General Session 21 Chair: Helen He (National Energy Research Scientific Computing Center) Panel: Future Directions of Data Analytics and High Performance Computing Yun (Helen) He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Scott Michael (Indiana University), Mr Prabhat (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory), Rangan Sukumar (Cray Inc.), Javier Luraschi (RStudio), and Meena Arunachalam (Intel) Abstract Abstract Panel Discussion: Future Directions of Data Analytics and High Performance Computing Plenary General Session 22 Chair: David Hancock (Indiana University) Sponsor Talk Sponsor Talk 23 Chair: Trey Breckenridge (Mississippi State University) [Adaptive Computing] Reporting and Analytics, Portal-based Job Submission, Remote Visualization, Accounting and High Throughput Task Processing on Torque and Slurm Nick Ihli (Adaptive Computing) Abstract Abstract Adaptive Computing will present on Reporting & Analytics, Viewpoint (portal-based job submission), Remote Visualization, Accounting and Nitro (high throughput task processing) on Torque and Slurm as it covers how the “Open Platform” initiative helps bring Enterprise solutions to your choice of scheduler. Further, Adaptive Computing will present on significant new Torque advancements intended for its platform “Unification” initiative, as well as advancements in power management, DataWarp integration and other product enhancements. Sponsor Talk Sponsor Talk 24 Chair: Trey Breckenridge (Mississippi State University) [Bright Computing] Achieving a Dynamic Datacenter with Bright and Cray Robert Stober (Bright) Abstract Abstract IT teams are under pressure to manage an emerging range of computing-intensive and data-intensive workloads with very different characteristics from traditional IT. Executing these workloads means that companies need to master multiple technologies ranging from high performance computing and big data analytics to virtualization, containerization, and cloud. Paper Technical Session 25A Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Porting the microphysics model CASIM to GPU and KNL Cray machines Nick Brown and Alexandr Nigay (EPCC, The University of Edinburgh); Ben Shipway and Adrian Hill (Met Office, UK); and Michele Weiland (EPCC, The University of Edinburgh) Abstract Abstract CASIM is a microphysics model used to investigate interactions at the millimetre scale and study the formation and development of moisture. This is a crucial aspect of atmospheric modelling, and as such CASIM is used as a sub-model by other models, but it is computationally intensive and can severely impact the runtime of these models. An in-depth evaluation of GCC’s OpenACC implementation on Cray systems Veronica G. Vergara Larrea, Wael R. Elwasif, and Oscar Hernandez (Oak Ridge National Laboratory) and Cesar Philippidis and Randy Allen (Mentor Graphics) Abstract Abstract In this study, we will perform an in-depth evaluation of GCC’s OpenACC implementation on ORNL’s Cray XK7 and compare it with other available implementations. The results presented will be useful for the larger community interested in using and evaluating new OpenACC implementations. Finally, a discussion on how an OpenACC implementation in GCC may help the interoperability of both OpenACC and OpenMP 4.5 (offload) specifications will be presented.
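As context for the GCC OpenACC evaluation above, the sketch below shows the kind of directive-based kernel such cross-compiler comparisons are typically built around: a simple vector update annotated with an OpenACC parallel loop and explicit data clauses. It is an illustrative example only, not taken from the paper's benchmark suite; with GCC it would be compiled with -fopenacc, and the same source is accepted by other OpenACC compilers, which is what makes implementation comparisons on systems such as the Cray XK7 possible.

/* Illustrative OpenACC kernel (not from the paper): offload a vector
 * update to the accelerator.  Compile with GCC using -fopenacc. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* copyin/copy clauses describe data movement to and from device memory */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
    free(x);
    free(y);
    return 0;
}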
HPCG and HPGMG benchmark tests on Multiple Program, Multiple Data (MPMD) mode on Blue Waters – a Cray XE6/XK7 hybrid system JaeHyuk Kwack and Gregory H. Bauer (National Center for Supercomputing Applications) Abstract Abstract The High Performance Conjugate Gradients (HPCG) and High Performance Geometric Multi-Grid (HPGMG) benchmarks are alternatives to the traditional LINPACK benchmark (HPL) in measuring the performance of modern HPC platforms. We performed HPCG and HPGMG benchmark tests on a Cray XE6/XK7 hybrid supercomputer, Blue Waters, at the National Center for Supercomputing Applications (NCSA). The benchmarks were tested on CPU-based and GPU-enabled nodes separately, and then we analyzed characteristic parameters that affect their performance. Based on our analyses, we performed HPCG and HPGMG runs in Multiple Program, Multiple Data (MPMD) mode in the Cray Linux Environment in order to measure their hybrid performance on both CPU-based and GPU-enabled nodes. We observed and analyzed several performance issues during those tests. Based on lessons learned from this study, we provide recommendations about how to optimize science applications on modern hybrid HPC platforms. Paper Technical Session 25B Chair: Bilel Hadri (KAUST Supercomputing Lab) Project Caribou: Monitoring and Metrics for Sonexion Craig Flaskerud (Cray) Abstract Abstract The scale and number of subsystems in today’s High Performance Computing system deployments make it difficult to monitor application performance and determine root causes when performance is not what is expected. System component failures, system resource oversubscription, or poorly written applications can all contribute to systems not running as expected and thus to poorly performing applications. This problem is exacerbated by the need to mine information from multiple sources across system subcomponents. Collecting the data may require privileged access, and the data must be collected in a timely manner or critical information can be lost. Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, and Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Eddie Baron (University of Oklahoma); and Peter Hauschildt (Hamburger Sternwarte) Abstract Abstract The newest NERSC supercomputer, Cori, is a Cray XC40 system consisting of 2,388 Intel Xeon Haswell nodes and 9,688 Intel Xeon-Phi “Knights Landing” (KNL) nodes. Compared to the Xeon-based clusters NERSC users are familiar with, optimal performance on Cori requires consideration of KNL mode settings; process, thread, and memory affinity; fine-grain parallelization; vectorization; and use of the high-bandwidth MCDRAM memory. This paper describes our efforts preparing NERSC users for KNL through the NERSC Exascale Science Application Program (NESAP), web documentation, and user training. We discuss how we configured the Cori system for usability and productivity, addressing programming concerns, batch system configurations, and default KNL cluster and memory modes. System usage data, job completion analysis, issues with programming and running jobs, and a few successful user stories on KNL are presented.
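The Cori abstract above points to use of the high-bandwidth MCDRAM memory as one of the key KNL considerations. As a hedged illustration, not taken from the paper, of one common way an application places a hot array in MCDRAM when a node is booted in flat mode, the following C sketch uses the memkind library's hbwmalloc interface and falls back to ordinary DDR allocation when high-bandwidth memory is unavailable; in cache mode MCDRAM is used transparently and no source change is needed.

/* Hedged sketch: explicit MCDRAM allocation on KNL in flat mode via the
 * memkind hbwmalloc interface.  Link with -lmemkind. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;                 /* 16M doubles, 128 MiB */
    int have_hbw = (hbw_check_available() == 0);

    double *field = have_hbw ? hbw_malloc(n * sizeof(double))  /* MCDRAM */
                             : malloc(n * sizeof(double));     /* DDR fallback */
    if (!field) return 1;

    for (size_t i = 0; i < n; ++i)
        field[i] = (double)i;
    printf("last element: %f\n", field[n - 1]);

    if (have_hbw)
        hbw_free(field);
    else
        free(field);
    return 0;
}

In flat mode a similar effect can often be obtained without source changes by preferring the MCDRAM NUMA node with numactl.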
A High Performance SVD Solver on Manycore Systems Dalal Sukkari and Hatem Ltaief (KAUST), Aniello Esposito (Cray EMEA Research Lab (CERL)), and David Keyes (KAUST) Abstract Abstract We describe the high performance implementation of a new singular value decomposition (SVD) solver for dense matrices on distributed-memory manycore systems. Based on the iterative QR dynamically-weighted Halley algorithm (QDWH), the new SVD solver performs more floating-point operations than the bidiagonal reduction variant of the standard SVD, but exposes at the same time more parallelism, and therefore runs closer to the theoretical peak performance of the system, thanks to more compute-bound matrix operations. The resulting distributed-memory QDWH-SVD solver is more numerically robust in the presence of large ill-conditioned matrices and achieves up to fourfold speedup on thousands of cores against the current state-of-the-art SVD implementation from the Cray Scientific Library. Paper Technical Session 25C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Cray XC40 System Diagnosability: Functionality, Performance, and Lessons Learned Jeffrey Schutkoske (Cray Inc.) Abstract Abstract The Intel® Xeon Phi CPU 7250 processor presents new opportunities for diagnosing the node in the Cray® XC40 system. This processor supports a new high-bandwidth on-package MCDRAM memory and interfaces. It also supports different Non-Uniform Memory Access (NUMA) configurations. The new Cray Processor Daughter Card (PDC) also supports an optional PCIe SSD card. Previous work has outlined Cray system diagnosability for the Cray® XC Series. This processor requires new BIOS, administrative commands, power and thermal limits, as well as new diagnostics to validate functionality and performance. KNL System Software Peter Hill, Clark Snyder, and John Sygulla (Cray Inc.) Abstract Abstract Intel Xeon Phi "Knights Landing" (KNL) presents opportunities and challenges for system software. This paper starts with an overview of KNL architecture. We describe some of the key differences from traditional Xeon processors, such as processor (NUMA) and memory (MCDRAM) modes. We describe which KNL modes are most useful, and why. From there, we describe a day in the life of a KNL system, emphasizing unique features such as mode reconfiguration (selecting the processor and memory configuration for a job) and the zone sort feature (which optimizes performance of the MCDRAM cache). As part of this coverage, we'll look at implementation, scaling and performance issues. Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II Adam J. DeConinck, Hai Ah Nam, Amanda Bonnie, David Morton, and Cory Lueninghoener (Los Alamos National Laboratory); James Brandt, Ann Gentile, Kevin Pedretti, Anthony Agelastos, Courtenay Vaughan, Simon Hammond, and Benjamin Allan (Sandia National Laboratories); and Jason Repik and Mike Davis (Cray Inc.) Abstract Abstract We present the holistic approach taken by the ACES team in the design and implementation of a monitoring system tailored to the new Cray XC40 KNL-based Trinity Phase II system currently being deployed in an Open Science campaign. We have created a unique dataset from controlled experiments to which we apply various numerical analyses and visualizations in order to determine actionable monitoring data combinations that we can associate with performance impact and system issues.
Our ultimate goal is to perform run-time analysis of such data combinations and apply runtime feedback to users and system software in order to improve application performance and system efficiency. Birds of a Feather BoF 26 Chair: Sreenivas Sukumar (Oak Ridge National Lab) Deep Learning on Cray Platforms Sreenivas Sukumar (Cray Inc.), Mr. Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Maxime Martinasso (Swiss National Supercomputing Centre) Abstract Abstract The application of machine learning and deep learning has gained tremendous interest both in academic and commercial organizations. With increased adoption and faster rates of data creation and collection, the need to scale these deep learning problems has arrived. In this Birds-of-a-Feather session, we will engage in active discussion around running deep learning workloads on Cray platforms at scale. Researchers from leadership computing facilities along with Cray engineers will be sharing their experiences. More specifically, we will discuss: (i) how to run deep learning workloads on the XC, CS and Urika-GX platforms, (ii) popular DL toolkits (TensorFlow, MXNet, Caffe, CNTK, BigDL etc.), (iii) HPC best practices toward “strong-scaling” deep learning workloads on multi-node configurations, (iv) application of deep learning in the different science domains today and at exascale, (v) experiences with deep learning on different HPC architectures (IB vs. Aries, CPU vs. GPU, etc.). Paper Technical Session 27A Chair: Tina Butler (National Energy Research Scientific Computing Center) Next Generation Science Applications for the Next Generation of Supercomputing Courtenay Vaughan, Simon Hammond, Dennis Dinge, Paul Lin, Christian Trott, Douglas Pase, Jeanine Cook, Clay Hughes, and Robert Hoekstra (Sandia National Laboratories) Abstract Abstract The Trinity supercomputer deployment by Los Alamos and Sandia National Laboratories represents the first Advanced Technology System deployment for the United States National Nuclear Security Administration. It will be one of the largest XC40 deployments in the world when installed during 2017. We present performance analysis of a suite of new applications that have been written from the ground up to be portable across computing architectures, parallel in terms of multi-node and on-node threading, and to feature more flexible component-based code design. These applications leverage Kokkos, Sandia’s C++ Performance Portability programming model, the Trilinos linear solver library, and our broader performance analysis capabilities in a close-knit codesign program. Driven by the NNSA’s Advanced Technology Development and Mitigation (“ATDM”) program, the new codes represent prototypes of fully-capable production science codes that will execute with high levels of efficiency on the next generation of supercomputing platforms, including Trinity and beyond. Fusion PIC Code Performance Analysis on The Cori KNL System Tuomas Koskela, Jack Deslippe, and Brian Friesen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Karthik Raman (Intel Corporation) Abstract Abstract We study the attainable performance of Particle-In-Cell codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1.
We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance and focus optimization efforts there. Particle push kernels operate at high arithmetic intensity and are not likely to be memory-bandwidth or even cache-bandwidth bound on KNL. Therefore, we see only minor benefits from the high-bandwidth memory available on KNL; achieving good vectorization is the most beneficial optimization path and can theoretically yield up to an 8x speedup on KNL, but is in practice limited by the data layout to 4x. Performance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture Zhengji Zhao (National Energy Research Scientific Computing Center (NERSC), USA); Martijn Marsman (Universität Wien, Austria); Florian Wende (Zuse Institute Berlin (ZIB), Germany); and Jeongnim Kim (Intel, USA) Abstract Abstract With the recent installation of Cori, a Cray XC40 system with the Intel Xeon Phi Knights Landing (KNL) many integrated core (MIC) architecture, NERSC is transitioning from the multi-core to the more energy-efficient many-core era. The developers of VASP, a widely used materials science code, have adopted MPI/OpenMP parallelism to better exploit the increased on-node parallelism, wider vector units, and the high-bandwidth on-package memory (MCDRAM) of KNL. To achieve optimal performance, KNL specifics relevant to the build, boot, and run time setup must be explored. In this paper, we will present the performance analysis of representative VASP workloads on Cori, focusing on the effects of the compilers, libraries, and boot/run time options such as the NUMA/MCDRAM modes, Hyper-Threading, huge pages, core specialization, and thread scaling. The paper is intended to serve as a KNL performance guide for VASP users, but it will also benefit other KNL users. Paper Technical Session 27B Chair: Scott Michael (Indiana University) Toward a Scalable Bank of Filters for High Throughput Image Analysis on the Cray Urika-GX System FNU Shilpika, Nicola Ferrier, and Venkatram Vishwanath (Argonne National Laboratory) Abstract Abstract High throughput image analysis is critical for experimental sciences facilities and enables one to glean timely insights into the various experiments and to better understand the physical phenomena being imaged. We present the design and evaluation of the bank of filters, the core building blocks for high throughput image analysis, on the Cray Urika-GX system. We describe our infrastructure developed with Apache Spark. We scaled this to 500 cores of the Urika-GX system for the analysis of a combustion engine dataset imaged at the Advanced Photon Source at Argonne National Laboratory and observed significant speedups. This scalable infrastructure now opens the doors to the application of a wide range of image processing algorithms and filters to the large-scale datasets being imaged at various light sources. Towards Seamless Integration of Data Analytics into Existing HPC Infrastructures Dennis Hoppe, Michael Gienger, and Thomas Boenisch (High Performance Computing Center Stuttgart); Diana Moise (Cray Inc.); and Oleksandr Shcherbakov (High Performance Computing Center Stuttgart) Abstract Abstract Customers of the High Performance Computing Center Stuttgart (HLRS) tend to execute more complex and data-driven applications, often resulting in large amounts of data of up to 1 petabyte.
The majority of our customers, however, currently lack the ability and knowledge to process this amount of data in a timely manner and extract meaningful information from it. We have therefore established a new project in order to support our users with the task of knowledge discovery by means of data analytics. We put the high performance data analytics system, a Cray Urika-GX, into operation to cope with this challenge. In this paper, we give an overview of our project and discuss the inherent challenges in bridging the gap between HPC and data analytics in a production environment. The paper concludes with a case study about analyzing log files of a Cray XC40 to detect variations in system performance. We were able to successfully identify so-called aggressor jobs, which significantly reduce the performance of other simultaneously running jobs. Quantifying Performance of CGE: A Unified Scalable Pattern Mining and Search System Kristyn J. Maschhoff, Robert Vesse, Sreenivas R. Sukumar, Michael F. Ringenburg, and James Maltby (Cray Inc.) Abstract Abstract CGE was developed as one of the first applications to embody our vision of an analytics ecosystem that can be run on multiple Cray platforms. This paper presents the Cray Graph Engine (CGE) as a solution that addresses the need for a unified ad-hoc subject-matter driven graph-pattern search and linear-algebraic graph analysis system. We demonstrate that CGE, implemented using the PGAS parallel programming model, performs better than most off-the-shelf graph query engines on ad-hoc pattern search while also enabling the study of graph-theoretic spectral properties in runtimes comparable to optimized graph-analysis libraries. Currently CGE is provided with the Cray Urika-GX and can also run on Cray XC systems. Through experiments, we show that compared to other state-of-the-art tools, CGE offers strong scaling and can often handle graphs three orders of magnitude larger, more complex datasets (long diameters, hypergraphs, etc.), and more computationally intensive complex pattern searches. Paper Technical Session 27C Chair: Jean-Guillaume Piccinali (Swiss National Supercomputing Centre) Application-Level Regression Testing Framework using Jenkins Reuben D. Budiardja (Oak Ridge National Laboratory) and Timothy Bouvet and Galen Arnold (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract This paper will explore the challenges of regression testing and monitoring of large-scale systems such as NCSA’s Blue Waters. Our goal was to develop an automated solution for running user-level regression tests to evaluate system usability and performance. Jenkins was chosen for its versatility, large user base, and multitude of plugins, including plotting of test results over time. We utilize these plots to track trends and alert us to system-level issues before they are reported by our partners (users). Not only does Jenkins have the ability to store historical data, but it can also send email or text messages based on the results of a test. Our other requirements included two-factor authentication for accessing the Jenkins GUI with administrator privileges and account management through LDAP. In this paper we describe our experience in deploying Jenkins as a user-level, system-wide regression testing and monitoring framework for Blue Waters.
Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo Valerio Formicola, Saurabh Jha, Fei Deng, and Daniel Chen (University of Illinois at Urbana-Champaign); Amanda Bonnie and Mike Mason (Los Alamos National Laboratory); Annette Greiner (National Energy Research Scientific Computing Center); Ann Gentile and Jim Brandt (Sandia National Laboratories); Larry Kaplan and Jason Repik (Cray Inc.); Jeremy Enos and Michael Showerman (National Center for Supercomputing Applications/University of Illinois); Zbigniew Kalbarczyk (University of Illinois at Urbana-Champaign); William Kramer (National Center for Supercomputing Applications/University of Illinois); and Ravishankar Iyer (University of Illinois at Urbana-Champaign) Abstract Abstract We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA’s Blue Waters. We use data collected from the logs and from network performance counters 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scales, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems. A regression framework for checking the health of large HPC systems Vasileios Karakasis, Victor Holanda Rusu, Andreas Jocksch, Jean-Guillaume Piccinali, and Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre) Abstract Abstract In this paper, we present a new framework for writing regression tests for HPC systems, called ReFrame. The goal of this framework is to abstract away the complexity of the interactions with the system, separating the logic of a regression test from the low-level details, which pertain to the system configuration and setup. This allows users to write easily portable regression tests, focusing only on the functionality. Paper Technical Session 28A Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) HPC Containers in use Jonathan Sparks (Cray Inc.) Abstract Abstract Linux containers in the commercial world are changing the landscape for application development and deployments. Container technologies are also making inroads into HPC environments, as exemplified by NERSC’s Shifter and LBL’s Singularity. While the first generation of HPC containers offers some of the same benefits as the existing open container frameworks, like CoreOS or Docker, they do not address cloud/commercial feature sets such as virtualized networks, full isolation, and orchestration. This paper will explore the use of containers in the HPC environment and summarize our study of how best to use these technologies at scale. Shifter: Fast and consistent HPC workflows using containers Lucas Benedicic, Felipe A.
Cruz, and Thomas Schulthess (Swiss National Supercomputing Centre) Abstract Abstract In this work we describe the experiences of building and deploying containers using Docker and Shifter, respectively. We present basic benchmarking tests that show the performance portability of certain workflows as well as performance results from the deployment of widely used non-trivial scientific applications. Furthermore, we discuss the resulting workflows through use cases that cover container creation on a laptop and deployment at scale, taking advantage of specialized hardware: the Cray Aries interconnect and NVIDIA Tesla P100 GPU accelerators. ExPBB: A framework to explore the performance of Burst Buffer Georgios Markomanolis (KAUST Supercomputing Laboratory) Abstract Abstract The ShaheenII supercomputer provides 268 Burst Buffer nodes based on Cray DataWarp technology, adding an SSD-based layer between the compute nodes and the parallel filesystem. However, this technology is new, and many scientists are still trying to understand how to obtain maximum performance from it. We present an auto-tuning I/O framework called Explore the Performance of Burst Buffer (ExPBB). The purpose of this project is to determine the optimum parameters for achieving the maximum performance of applications executed on the Burst Buffer. We study the number of Burst Buffer nodes used, the number of MPI aggregators, the file striping unit, and the number of MPI/OpenMP processes. The framework aggregates I/O performance data from the Darshan tool and the MPI I/O statistics provided by Cray MPICH, and then studies the parameters, according to several criteria, until it converges on the maximum performance. We report results showing that in some cases we achieved speedups of up to 4.52x using this framework. Paper Technical Session 28B Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration) Tuning Sub-filing Performance on Parallel File Systems Suren Byna (Lawrence Berkeley National Laboratory), Mohamad Chaarawi (Intel Corporation), Quincey Koziol (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and John Mainzer and Frank Willmore (The HDF Group) Abstract Abstract Subfiling is a technique used on parallel file systems to reduce locking and contention issues when multiple compute nodes interact with the same storage target node. Subfiling provides a compromise between the single-shared-file approach, which instigates lock contention problems on parallel file systems, and having one file per process, which results in a massive and unmanageable number of files. In this paper, we evaluate and tune the performance of the recently implemented subfiling feature in HDF5. Specifically, we explain the implementation strategy of the subfiling feature in HDF5, provide examples of using the feature, and evaluate and tune the parallel I/O performance of this feature with the parallel file systems of the Cray XC40 system at NERSC (Cori), which include burst buffer storage and Lustre disk-based storage. We also evaluate I/O performance on the Cray XC30 system, Edison, at NERSC. Our results show a performance advantage of 1.2x to 6x with subfiling compared to writing a single shared HDF5 file. We present our exploration of configurations, such as the number of subfiles and the number of Lustre storage targets used to store files, as optimization parameters to obtain superior I/O performance.
Based on this exploration, we discuss recommendations for achieving good I/O performance as well as limitations of using the subfiling feature. Enabling Portable I/O Analysis of Commercially Sensitive HPC Applications Through Workload Replication James Dickson and Steven Wright (University of Warwick); Satheesh Maheswaran, Andy Herdman, and Duncan Harris (Atomic Weapons Establishment); Mark C. Miller (Lawrence Livermore National Laboratory); and Stephen Jarvis (University of Warwick) Abstract Abstract Benchmarking and analyzing I/O performance across high performance computing (HPC) platforms is necessary to identify performance bottlenecks and guide effective use of new and existing storage systems. Doing this with large production applications, which can often be commercially sensitive and lack portability, is not a straightforward task, and the availability of a representative proxy for I/O workloads can help to provide a solution. We use Darshan I/O characterization and the MACSio proxy application to replicate five production workloads, showing how these can be used effectively to investigate I/O performance when migrating between HPC systems ranging from small local clusters to leadership-scale machines. Preliminary results indicate that it is possible to generate datasets that match the target application with a good degree of accuracy. This enables a predictive performance analysis study of a representative workload to be conducted on five different systems. The results of this analysis are used to identify how workloads exhibit different I/O footprints on a file system and what effect file system configuration can have on performance. An Exploration into Object Storage for Exascale Supercomputers Raghunath Raja Chandrasekar, Lance Evans, and Robert Wespetal (Cray Inc.) Abstract Abstract The need for scalable, resilient, high-performance storage is greater now than ever in high performance computing. Exploratory research at Cray studies aspects of emerging storage hardware and software design for exascale-class supercomputers, analytics frameworks, and commodity clusters. Our outlook toward object storage and scalable database technologies is improving as the trends, opportunities, and challenges of transitioning to them also evolve. Cray's prototype SAROJA (Scalable And Resilient ObJect storAge) library is presented as one example of our exploration, highlighting design principles guided by the I/O semantics of HPC codes and the characteristics of up-and-coming storage media. SAROJA is extensible I/O middleware that has been designed from the ground up with object semantics exposed via APIs to applications, while supporting a variety of pluggable file and object back-ends. It decouples the metadata and data paths, allowing for independent implementation, management, and scaling of each. Initial functional and performance evaluations indicate there is both promise and plenty of opportunity for advancement. Paper Technical Session 28C Chair: Richard Barrett (Sandia National Labs) Enabling the Super Facility with Software Defined Networking Richard S. Canon, Brent R. Draney, Jason R. Lee, David L. Paul, David E. Skinner, and Tina M. Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Experimental and observational facilities are increasingly turning to high-performance computing centers to meet their growing analysis requirements. The combination of experimental facilities with HPC centers has been termed the Super Facility.
This vision requires a new level of connectivity and bandwidth between these remote instruments and the HPC systems. NERSC, in collaboration with Cray, has been exploring a new model of networking that builds on the principles of Software Defined Networking. We envision an architecture that allows the wide-area network to extend into the Cray system and enables external facilities to stream data directly to compute resources inside the system at over 100 Gb/s in the near future, eventually reaching beyond 1 Tb/s. In this paper we expand on this vision, describe some of the motivating use cases in more detail, lay out our proposed architecture and implementation, describe our progress to date, and outline future plans. Advanced Risk Mitigation of Software Vulnerabilities at Research Computing Centers Urpo Kaila (CSC - IT Center for Science Ltd.) Abstract Abstract Software security vulnerabilities have caused research computing centers concern, excess work, service breaks, and data leakages since the time of the great Morris Internet worm in 1988. Despite evolving awareness, testing, and patching procedures, vulnerabilities and vulnerability patching still cause too much trouble both for users and for the sites. Comparing Spark GraphX and Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash (Deloitte) Abstract Abstract Graph analytics are useful for overcoming real-world analytic challenges such as detecting cyber threats. Our Urika-GX system is configured to use both the Cray Graph Engine (CGE) and Apache Spark for developing and executing hybrid workflows utilizing both Spark and graph analytic engines. Spark allows us to quickly process data stored in HDFS, powering flexible analytics in addition to graph analytics for cyber threat detection. Spark’s GraphX library offers an alternative graph engine to CGE. In this presentation, we will compare the available algorithms, challenges, and performance of both the CGE and GraphX engines in the context of a real-world client use case utilizing 40 billion RDF triples. New Site Talk New Site Talk 29 Chair: Helen He (National Energy Research Scientific Computing Center) Plenary General Session 30 Chair: David Hancock (Indiana University) Hexagon@University of Bergen, Norway Csaba Anderlik (University of Bergen, Norway); Ingo Bethke, Mats Bentsen, and Alok Kumar Gupta (Uni Research Climate, Norway); Jon Albretsen and Lars Asplin (Institute for Marine Research, Norway); Michel S. Mesquita (Uni Research Climate, Norway); Saurabh Bhardwaj (The Energy and Resources Institute, India); and Laurent Bertino (Nansen Environmental and Remote Sensing Center, Norway) Abstract Abstract Hexagon, the current High Performance Computing (HPC) resource at the University of Bergen, Norway, is approaching its end of life. This article highlights some of the scientific results in the field of Climate Modelling obtained using this exceptional resource, a Cray XE6m-200 machine.