Tutorial 1A
Chair: Harold Longley (Cray Inc.)

Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: System management on Cray XC systems has improved since SMW 8.0/CLE 6.0 was introduced. This tutorial moves from introductory information to advanced topics in system management of XC series systems including CMC/eLogin. An overview of system management for the XC series system is provided: the Image Management and Provisioning System (IMPS), the Configuration Management Framework (CMF), and the Node Image Mapping Service (NIMS) for XC series systems, and the CSMS, OpenStack, and eLogin software for external login nodes.

Tutorial 1B
Chair: John Levesque (Cray Inc.)

Getting the Most out of Knights Landing
John Levesque (Cray Inc.)
Abstract: There are several large Knights Landing systems in the community, and there is still a lot of confusion on how best to utilize this new many-core system. KNL is a difficult system to utilize most effectively because there are so many configurations that can be employed, and users do not have enough time or access to discern how best to configure the system for their particular application. This tutorial is built upon the experience that the author has accumulated over the past twelve months using very large applications on the Trinity KNL system at Los Alamos National Laboratory. Over that time a series of tests were conducted to determine the best clustering mode for different types of applications. The clustering modes will be discussed with results from a variety of applications. The next important aspect of KNL is how best to configure the High Bandwidth Memory; once again the decision depends upon the size of memory required on each node and memory access patterns. The most important aspect of using KNL is to vectorize as much as possible, as this is the only mode in which KNL can outperform the state-of-the-art Xeon node. While threading on the node is not as critical as originally thought, there are still places where employing OpenMP threads can gain additional performance. This tutorial will cover all aspects of getting the most out of Knights Landing and when Knights Landing may not be better than the state-of-the-art Xeon.
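Illustrative sketch (not tutorial material): the tutorial's discussion of clustering modes and High Bandwidth Memory can be made concrete with a small launcher that prefers MCDRAM when a KNL node is booted in a flat memory mode, where MCDRAM appears as a CPU-less NUMA node. The node-detection heuristic, the numactl flags, and the placeholder binary name "./my_app" are assumptions, not content from the tutorial.

```python
#!/usr/bin/env python3
"""Minimal sketch: prefer MCDRAM (flat mode) when launching an app on a KNL node.

Assumes MCDRAM is exposed as a CPU-less NUMA node (flat/flat-quadrant mode) and
that the standard numactl utility is installed. Not taken from the tutorial.
"""
import re
import subprocess
import sys


def mcdram_nodes():
    """Return NUMA node IDs that list memory but no CPUs (MCDRAM in flat mode)."""
    out = subprocess.run(["numactl", "--hardware"],
                         capture_output=True, text=True, check=True).stdout
    nodes = []
    for m in re.finditer(r"node (\d+) cpus:(.*)", out):
        if not m.group(2).strip():      # no CPUs listed -> treat as MCDRAM node
            nodes.append(int(m.group(1)))
    return nodes


def launch(app_argv):
    """Run the application, preferring MCDRAM if present, else run unmodified."""
    nodes = mcdram_nodes()
    if nodes:
        # --preferred falls back to DDR once the 16 GB of MCDRAM is exhausted;
        # --membind would fail allocations instead.
        cmd = ["numactl", f"--preferred={nodes[0]}"] + app_argv
    else:
        cmd = app_argv                  # cache mode or non-KNL node
    return subprocess.call(cmd)


if __name__ == "__main__":
    sys.exit(launch(sys.argv[1:] or ["./my_app"]))   # "./my_app" is a placeholder
```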
Tutorial 1C
Chair: Lisa Gerhardt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)

Shifter: Bringing Container Computing to HPC
Lisa Gerhardt, Shane Canon, and Doug Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: This half-day tutorial will introduce researchers and developers to the basics of container computing and running those containers in a Cray environment using Shifter, a framework that delivers Docker-like functionality to HPC by extracting images from native formats (such as a Docker image) and converting them to a common format that is optimally tuned for the HPC environment. The tutorial will also cover more advanced topics, including how to set up a Shifter Image Gateway and create images that run MPI applications that require high-performance networks. The tutorial will also cover ways that Docker images can be used in the scientific process, including packaging images so they can be used to regenerate and confirm results and in the publication process, and will integrate hands-on exercises throughout the training. These exercises will include building Docker images on attendees' own laptops and running those images on a Shifter-enabled HPC system via tutorial accounts. While attendees will not require advanced knowledge of Docker or Shifter, they should be familiar with basic Linux administration such as installing packages and building software. This tutorial will be presented by an experienced team including the authors of the Shifter tool and members of NERSC's Data and Analytics Services Group, which focuses on training scientists to use HPC. This tutorial will present an updated version of the very popular tutorial presented at the Supercomputing 2016 conference.
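Illustrative sketch (not tutorial material): the Shifter tutorial above describes running Docker-derived images on a Shifter-enabled system. The snippet below shows roughly what that looks like when submitted through Slurm, following the Shifter/Slurm integration described in NERSC documentation. The --image batch option, the docker:ubuntu:latest image, and the job parameters are assumptions; site deployments may differ.

```python
#!/usr/bin/env python3
"""Minimal sketch: submit a Slurm job that runs a command inside a Shifter container.

Follows the Shifter/Slurm integration described in NERSC documentation.
Image name and job parameters are placeholders and may differ by site.
Pull the image beforehand with, for example:  shifterimg pull docker:ubuntu:latest
"""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --image=docker:ubuntu:latest

# Each task launched via `shifter` starts inside the container image named above.
srun -n 2 shifter cat /etc/os-release
"""

def submit():
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(JOB_SCRIPT)
        path = f.name
    # sbatch prints "Submitted batch job <id>" on success.
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    print(result.stdout.strip())

if __name__ == "__main__":
    submit()
```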
Tutorial 1A Continued
Chair: Harold Longley (Cray Inc.)
Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: See Tutorial 1A.

Tutorial 1B Continued
Chair: John Levesque (Cray Inc.)
Getting the Most out of Knights Landing
John Levesque (Cray Inc.)
Abstract: See Tutorial 1B.

Tutorial 1C Continued
Chair: Lisa Gerhardt (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Shifter: Bringing Container Computing to HPC
Lisa Gerhardt, Shane Canon, and Doug Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: See Tutorial 1C.

Tutorial 2A
Chair: Harold Longley (Cray Inc.)
Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: See Tutorial 1A.

Tutorial 2B
Chair: Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

Burst Buffer Basics
Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Bilel Hadri and George Markomanolis (KAUST), and David Paul (NERSC)
Abstract: The long-awaited Burst Buffer technology is now being deployed on major supercomputing systems. In this tutorial we will introduce how Burst Buffers have been configured on new supercomputers at NERSC and KAUST and briefly discuss early experience with Burst Buffers from both a system and a user's perspective. Attendees will be given access to Cori, NERSC's newest supercomputer, and will use the Cori Burst Buffer in a series of exercises designed to demonstrate the IO capabilities of the SSD storage at scale, as well as some of the limitations. Simple optimisation exercises will be provided. The tutorial will conclude with a live demonstration of a complex workflow executed on the Burst Buffer, including simulation, analysis and visualisation. Materials are available at https://sites.google.com/lbl.gov/burstbuffer-tutorial-cug17.
Tutorial 2C
Chair: Michael Ringenburg (Cray Inc.)

Analytics and Machine Learning on Cray XC and Intel systems
Michael Ringenburg (Cray Inc.); Kristyn Maschhoff (Cray Inc.); Lisa Gerhardt, Rollin Thomas, and Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jing Huang and Vivek Rane (Intel Corporation)
Abstract: Big data analytics and machine learning have become important application areas for Cray systems. This tutorial will describe a variety of widely used analytics frameworks, such as Apache Spark, the Python PyData stack, and R, and show how these frameworks can be run on Cray XC series supercomputers. We will also provide an overview of the Cray Graph Engine (CGE), a semantic graph database and graph analytics framework, and show how to generate an RDF dataset, load it into CGE, and run SPARQL queries. In addition, we will discuss installing and running deep learning frameworks such as TensorFlow, and describe Intel optimizations for deep learning. The tutorial will also describe tips and tricks for getting the best performance from analytics and machine learning on Cray platforms. Simple exercises using interactive Jupyter notebooks and software installed at NERSC will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray XC systems.

Tutorial 2A Continued
Chair: Harold Longley (Cray Inc.)
Migrating, Managing, and Booting Cray XC and CMC/eLogin
Harold Longley (Cray Inc.)
Abstract: See Tutorial 1A.

Tutorial 2B Continued
Chair: Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Burst Buffer Basics
Deborah Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Bilel Hadri and George Markomanolis (KAUST), and David Paul (NERSC)
Abstract: See Tutorial 2B.
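Illustrative sketch (not tutorial material): the Burst Buffer tutorial above covers allocating DataWarp space and staging data in and out of it from a batch job. The Python-embedded batch script below sketches that pattern, assuming Slurm with DataWarp support as on Cori; capacities, Lustre paths, and the executable name are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: a Slurm + DataWarp job that stages data through the burst buffer.

The #DW directives follow Cray DataWarp/Slurm documentation; sizes, Lustre paths,
and ./sim.x are placeholders. Not taken from the tutorial material.
"""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=01:00:00
#DW jobdw capacity=200GB access_mode=striped type=scratch
#DW stage_in source=/lustre/project/input destination=$DW_JOB_STRIPED/input type=directory
#DW stage_out source=$DW_JOB_STRIPED/output destination=/lustre/project/output type=directory

# The application reads and writes SSD-backed DataWarp space instead of Lustre.
mkdir -p $DW_JOB_STRIPED/output
srun -n 128 ./sim.x --in $DW_JOB_STRIPED/input --out $DW_JOB_STRIPED/output
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(JOB_SCRIPT)
    script = f.name

# Stage-in happens before compute nodes are allocated; stage-out after the job ends.
subprocess.run(["sbatch", script], check=True)
```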
Tutorial 2C Continued
Chair: Michael Ringenburg (Cray Inc.)
Analytics and Machine Learning on Cray XC and Intel systems
Michael Ringenburg (Cray Inc.); Kristyn Maschhoff (Cray Inc.); Lisa Gerhardt, Rollin Thomas, and Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jing Huang and Vivek Rane (Intel Corporation)
Abstract: See Tutorial 2C.

Birds of a Feather BoF 3A
Chair: Nicholas Cardo (Swiss National Supercomputing Centre)

Birds of a Feather BoF 3B
Chair: Kelly J. Marquardt (Cray); Sadaf R. Alam (CSCS)

New use cases and usage models for Cray DataWarp
Sadaf R. Alam (CSCS-ETHZ), Thomas Schulthess (ETH Zurich), Bilel Hadri (KAUST), Debbie Bard (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Maxime Martinasso (CSCS-ETHZ)
Abstract: This BOF will be a collaborative effort between the sites that have deployed the Cray DataWarp technology and Cray DataWarp engineers and developers. The goal of this BOF is to explore use cases and usage models that could open up opportunities for data science workflows. Motivating examples will be presented, followed by a technical discussion of the current implementation of the DataWarp software stack. The BOF participants will have an opportunity to share experiences using the DataWarp technology and to contribute representative data science use cases that could benefit from the Cray DataWarp technology.

Sharing Cray Solutions
Kelly J. Marquardt (Cray Inc.)
Abstract: You're invited to help brainstorm! Cray is considering ways to enhance the ecosystem of solutions for Cray platforms by building a strong community. Could include:

Tutorial 3C (continued)
Chair: Michael Ringenburg (Cray Inc.)
Analytics and Machine Learning on Cray XC and Intel systems
Michael Ringenburg (Cray Inc.); Kristyn Maschhoff (Cray Inc.); Lisa Gerhardt, Rollin Thomas, and Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jing Huang and Vivek Rane (Intel Corporation)
Abstract: See Tutorial 2C.
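Illustrative sketch (not tutorial material): the analytics tutorial listed above (Tutorial 2C/3C) covers running frameworks such as Apache Spark and the PyData stack on Cray XC systems. The fragment below is a generic, minimal PySpark illustration of the kind of workload involved; the input path is a placeholder, and on an XC system the Spark context would normally be created through a site-provided launcher rather than locally.

```python
"""Generic PySpark sketch: a word count, representative of the analytics
workloads discussed in the tutorial. The input path is a placeholder."""
from operator import add

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-sketch")
sc = SparkContext(conf=conf)

counts = (sc.textFile("/lustre/project/corpus/*.txt")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add)
            .sortBy(lambda kv: kv[1], ascending=False))

# Print the ten most frequent words and their counts.
for word, n in counts.take(10):
    print(f"{word}\t{n}")

sc.stop()
```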
Plenary General Session 4
Chair: David Hancock (Indiana University)

Keynote: What Are the Opportunities and Challenges for a New Class of Exascale Applications? What Challenge Problems Can These Applications Address through Modeling and Simulation & Data Analytic Computing Solutions?
Douglas Kothe (Oak Ridge National Laboratory)
Abstract: The Department of Energy's (DOE) Exascale Computing Project (ECP) is a partnership between the DOE Office of Science and the National Nuclear Security Administration. Its mission is to transform today's high performance computing (HPC) ecosystem by executing a multi-faceted plan: developing mission-critical applications of unprecedented complexity; supporting U.S. national security initiatives; partnering with the U.S. HPC industry to develop exascale computer architectures; collaborating with U.S. software vendors to develop a software stack that is both exascale-capable and usable on U.S. industrial and academic scale systems; and training the next-generation workforce of computer and computational scientists, engineers, mathematicians, and data scientists. The ECP aims to accelerate delivery of a capable exascale computing ecosystem that will enable breakthrough modeling and simulation (M&S) and data analytic computing (DAC) solutions to the most critical challenges in scientific research, energy assurance, economic competitiveness, and national security.

Sponsor Talk 5
Chair: Trey Breckenridge (Mississippi State University)

[DDN] Flash-Native Caching for Predictable Job Completion in Data-Intensive Environments
James Coomer (DataDirect Networks)
Abstract: In this talk, Dr. James Coomer, Senior Technical Advisor for EMEA, will provide examples of testing and deployment results from the past year of DDN's flash-native caching solution, Infinite Memory Engine. He will also discuss the company's Lustre development and productization efforts and future plans.

Sponsor Talk 6
Chair: Trey Breckenridge (Mississippi State University)

[ANSYS] Why Supercomputing Partnerships Matter for CFD Simulations
Wim Slagter (ANSYS)
Abstract: This presentation will address how CFD scalability and capabilities for customization have evolved over the last decade, and how supercomputing partnerships are playing a crucial role. Examples of extreme scalability (on Cray systems) and application customization will be featured that are illustrative for all users, whether you're running on a 100,000+ core supercomputer or a 1,000-core cluster.
Plenary General Session 7
Chair: Helen He (National Energy Research Scientific Computing Center)

Paper Technical Session 8A
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Early Evaluation of the Cray XC40 Xeon Phi System 'Theta' at Argonne
Scott Parker, Vitali Morozov, Sudheer Chunduri, Kevin Harms, Christopher Knight, and Kalyan Kumaran (Argonne National Laboratory)
Abstract: The Argonne Leadership Computing Facility (ALCF) has recently deployed a nearly 10 PF Cray XC40 system named Theta. Theta is nearly identical in performance capacity to the ALCF's current IBM Blue Gene/Q system, Mira, and represents the first step in a path from Mira to a much larger next-generation Intel-Cray system named Aurora, to be deployed in 2018. Theta serves as both a production resource for scientific computation and a platform for transitioning applications to run efficiently on the Intel Xeon Phi architecture and the Dragonfly topology based network. This paper presents an initial evaluation of Theta. The Theta system architecture will be described along with the results of benchmarks characterizing the performance at the processor core, memory, and network component levels. In addition, benchmarks are used to evaluate the performance of important runtime components such as MPI and OpenMP. Finally, performance results for the scientific applications are described.

Performance on Trinity Phase 2 (a Cray XC40 utilizing Intel Xeon Phi processors) with Acceptance Applications and Benchmarks
Anthony Agelastos and Mahesh Rajan (Sandia National Laboratories); Nathan Wichmann (Cray Inc.); Randal Baker (Los Alamos National Laboratory); Stefan Domino (Sandia National Laboratories); Erik Draeger (Lawrence Livermore National Laboratory); and Sarah Anderson, Jacob Balma, Steve Behling, Mike Berry, Pierre Carrier, Mike Davis, Kim McMahon, Dick Sandness, Kevin Thomas, Steve Warren, and Ting-Ting Zhu (Cray Inc.)
Abstract: Trinity is the first NNSA ASC Advanced Technology System (ATS-1) designed to provide the application scalability, performance, and system throughput required for the Nuclear Security Enterprise. Trinity Phase 1, with 9,436 dual-socket Haswell nodes, is currently in production use. The Phase 2 system, the focus of this paper, has close to 9,900 Intel Knights Landing (KNL) Xeon Phi nodes. This paper documents the performance of the selected acceptance applications, the Sustained System Performance benchmark suite, and a number of micro-benchmarks. The paper discusses the experiences of the Tri-Lab (LANL, SNL, and LLNL) and Cray teams in extracting optimal performance, with consideration of the choice of KNL memory mode, hybrid MPI+OpenMP parallelization, vectorization, and HBM utilization.

Evaluating the Networking Characteristics of the Cray XC-40 Intel Knights Landing Based Cori Supercomputer at NERSC
Douglas Doerfler, Brian Austin, Brandon Cook, and Jack Deslippe (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Krishna Kandalla and Peter Mendygral (Cray Inc.)
Abstract: There are many potential issues associated with deploying the Intel Knights Landing (KNL) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi core is a fraction of that of a Xeon core.
In this paper we take a look at the tradeoffs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g. the tradeoff between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as inter-node optimizations and support for KNL's high-speed memory (MCDRAM). We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above tradeoffs and features using a suite of NERSC applications.

Paper Technical Session 8B
Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration)

Toward Interactive Supercomputing at NERSC with Jupyter
Rollin Thomas, Shane Canon, Shreyas Cholia, Lisa Gerhardt, and Evan Racah (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Extracting scientific insights from data increasingly demands a richer, more interactive experience than traditional high-performance computing systems have historically provided. We present our efforts to leverage Jupyter for interactive data-intensive supercomputing on the Cray XC40 Cori system at the National Energy Research Scientific Computing Center (NERSC). Jupyter is a flexible, popular literate-computing web application for creating "notebooks" containing code, equations, visualization, and text. We explain the motivation for interactive supercomputing, describe our implementation strategy, and outline lessons learned along the way. Our deployment will allow users access to software packages and specialized kernels for scalable analytics with Spark, real-time data visualization with yt, complex analytics workflows with Dask and IPyParallel, and much more. We anticipate that many users may come to rely exclusively on Jupyter at NERSC, leaving behind the traditional login shell.

In-situ data analytics for highly scalable cloud modelling on Cray machines
Nick Brown (EPCC, The University of Edinburgh); Adrian Hill and Ben Shipway (Met Office, UK); and Michele Weiland (EPCC, The University of Edinburgh)
Abstract: MONC is a highly scalable modelling tool for the investigation of atmospheric flows, turbulence and cloud microphysics. Simulations produce very large amounts of raw data which must then be analysed for scientific investigation. For performance and scalability, this analysis should be performed in situ as the data is generated; however, one does not wish to pause the computation whilst analysis is performed.

Precipitation Nowcasting: Leveraging Deep Recurrent Convolutional Neural Networks
Alexander Heye, Karthik Venkatesan, and Jericho Cain (Cray Inc.)
Abstract: Automating very short-term precipitation forecasts can prove a significant challenge in that traditional physics-based weather models are computationally expensive; by the time the forecast is made, it may already be irrelevant. Deep learning offers a solution to this problem, in that a computationally dense machine can train a neural network ahead of time using historical data and deploy that trained network in real time to produce a new output within seconds or minutes. Our team intends to prove the capabilities of deep learning in short-term forecasting by leveraging a model built on Convolutional Long Short-Term Memory (convLSTM) networks. By designing a 3D sequence-to-sequence convLSTM model, we hope to offer accurate precipitation forecasts at minute-level time resolution and comparable spatial resolution to the radar input data. Our work will be accelerated by the GPU-dense CS-Storm system for training and the Cray GX for real-time processing of radar data.
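Illustrative sketch (not from the paper): the nowcasting abstract above describes a sequence-to-sequence convLSTM model over radar frames. The snippet below is a generic Keras sketch of that family of architectures, not the authors' network; layer sizes, sequence length, and frame dimensions are arbitrary placeholders, and random arrays stand in for radar data.

```python
"""Generic Keras sketch of a convLSTM sequence-to-sequence model for radar frames.
Not the authors' architecture; all shapes and layer sizes are placeholders."""
import numpy as np
from keras.models import Sequential
from keras.layers import ConvLSTM2D, BatchNormalization, Conv3D

frames, height, width = 10, 64, 64     # placeholder sequence length and frame size

model = Sequential([
    # Stacked convLSTM layers learn spatio-temporal features from the input frames.
    ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=True,
               input_shape=(frames, height, width, 1)),
    BatchNormalization(),
    ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=True),
    BatchNormalization(),
    # A 3D convolution maps the learned features back to one predicted frame per step.
    Conv3D(1, kernel_size=(3, 3, 3), padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()

# Toy data standing in for normalized radar reflectivity sequences.
x = np.random.rand(8, frames, height, width, 1).astype("float32")
y = np.random.rand(8, frames, height, width, 1).astype("float32")
model.fit(x, y, batch_size=2, epochs=1)
```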
Paper Technical Session 8C
Chair: Chris Fuson (ORNL)

Telemetry-enabled Customer Support using the Cray System Snapshot Analyzer (SSA)
Richard J. Duckworth, Kevin Coryell, Scott McLeod, and Jay Blakeborough (Cray Inc.)
Abstract: SSA is a Cray customer service application designed to support product issue diagnosis and reduce time to resolution. SSA is focused on the submission of product telemetry from a customer system to Cray over a secure network channel. SSA provides value to the support process by automating (1) the collection, submission, and analysis of product diagnostic information, (2) the collection, submission, and analysis of product health information, and (3) key aspects of the customer support process. The focal topic for this paper is a discourse on the benefits of SSA use. For background, we will provide a general overview of SSA and references for further reading. Next, we will provide an update on SSA product release and operations history. Finally, we will discuss the anticipated product roadmap for SSA.

How to write a plugin to export job, power, energy, and system environmental data from your Cray XC system
Steven J. Martin (Cray Inc.), Cary Whitney (Lawrence Berkeley National Laboratory), and David Rush and Matthew Kappel (Cray Inc.)
Abstract: In this paper we take a deep dive into writing a plugin to export power, energy, and other system environmental data from a Cray XC system. With the release of SMW 8.0/CLE 6.0 software, Cray has enabled customers to create site-specific plugins to export all of the data that can flow into the Cray Power Management Database (PMDB) into site-specific infrastructure. In this paper we give practical information on what data is available using the plugin, and how to write, test, and deploy a plugin. We also share and explain example plugin code, detail design considerations to take into account when architecting a plugin, and look at some practical use cases supported by exporting telemetry data off a Cray XC system. This paper is targeted at plugin developers, system administrators, data scientists, and site planners.

Using Open XDMoD for accounting analytics on the Cray XC supercomputer
Thomas Lorenzen and Damon Kasacjak (Danish Meteorological Institute) and Jason Coverston (Cray Inc.)
Abstract: As supercomputers grow and accommodate more users and projects, the system utilization and accounting log files grow as well, often beyond easy native comprehension, thus requiring a flexible graphical tool for accounting analytics. This presentation will depict the joint effort of the Danish Meteorological Institute (DMI) and Cray to adapt Open XDMoD, http://open.xdmod.org, to the DMI Cray XC supercomputer. Extensions to the Cray RUR framework that monitor metrics of particular relevance to the site have been embedded into Open XDMoD and will be presented as well. This will show Open XDMoD to be a flexible tool for use with the Cray XC supercomputer, with strong potential for extending metrics ingestion and graphical presentation in numerous ways.

Paper Technical Session 9A
Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Using Spack to Manage Software on Cray Supercomputers
Mario A. Melara (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Todd Gamblin and Gregory Becker (Lawrence Livermore National Laboratory), Robert French and Matt Belhorn (Oak Ridge National Laboratory), Kelly Thompson (Los Alamos National Laboratory), and Rebecca Hartman-Baker (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: HPC software is becoming increasingly complex. A single application may have over 70 dependency libraries, and HPC centers deploy even more libraries for users. Users demand many different builds of packages with different compilers and options, but building them can be tedious. HPC support teams cannot keep up with user demand without better tools.

Regression testing on Shaheen Cray XC40: Implementation and Lessons Learned
Bilel Hadri and Samuel Kortas (KAUST Supercomputing Lab), Robert Fiedler (Cray Inc.), and George Markomanolis (KAUST Supercomputing Lab)
Abstract: Leadership-class supercomputers are becoming larger and more complex, tightly integrated systems consisting of many different hardware components, tens of thousands of processors and memory chips, kilometers of networking cables, large numbers of disks, and hundreds of applications and libraries. To increase scientific productivity and ensure that applications efficiently and effectively exploit a system's full potential, all the components must deliver reliable, stable, and performant service. Therefore, to deliver the best computing environment to our users, system performance assessments are critical, especially after an unplanned downtime or any scheduled maintenance session. This paper describes the design and implementation of the regression testing methodology used on the Shaheen2 XC40 to detect and track issues related to the performance and functionality of compute nodes, storage, network, and programming environment. We also present an analysis of the results over 24 months, along with the lessons learned.

Python Usage Metrics on Blue Waters
Colin A. MacLean (National Center for Supercomputing Applications/University of Illinois)
Abstract: Blue Waters supports a large Python stack containing over 650 total packages. As part of maintaining this support, logging functionality has been introduced to track the usage statistics of both National Center for Supercomputing Applications (NCSA) and user-provided Python packages. Due to the number of NCSA-supplied packages, it is rare to receive a request for packages which are not already installed, leading to a lack of information about which packages and their dependencies are being used. By tracking module imports, a detailed log of usage information has been used to focus support efforts on improving the usability and performance of popular usage patterns.
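Illustrative sketch (not NCSA's implementation): the Blue Waters paper above tracks which Python packages users actually import. The snippet below shows one generic way such import logging can be wired in, for example from a sitecustomize module; the log path and record format are placeholders.

```python
"""Generic sketch of Python import logging, e.g. placed in sitecustomize.py.
Illustration only, not NCSA's implementation; the log path is a placeholder."""
import atexit
import builtins
import getpass
import json
import time

_seen = set()
_real_import = builtins.__import__


def _logging_import(name, *args, **kwargs):
    # Record only the top-level package name, once per interpreter session.
    _seen.add(name.partition(".")[0])
    return _real_import(name, *args, **kwargs)


def _flush():
    record = {"user": getpass.getuser(),
              "time": time.time(),
              "packages": sorted(_seen)}
    try:
        # Placeholder path; a real deployment would write to site-chosen storage.
        with open("/tmp/python-usage.log", "a") as fh:
            fh.write(json.dumps(record) + "\n")
    except OSError:
        pass  # never let logging break the user's program


builtins.__import__ = _logging_import
atexit.register(_flush)
```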
Paper Technical Session 9B
Chair: Georgios Markomanolis (KAUST - King Abdullah University of Science and Technology)

libhio: Optimizing IO on Cray XC Systems With DataWarp
Nathan Hjelm and Cornell Wright (Los Alamos National Laboratory)
Abstract: High performance systems are rapidly increasing in size and complexity. To keep up with the IO demands of applications and to provide improved functionality, performance, and cost, IO subsystems are also increasing in complexity. To help applications utilize and exploit increased functionality and improved performance in this more complex environment, we developed a new Hierarchical Input/Output (HIO) library: libhio. In this paper we present the motivation behind the development and the design of libhio.

How to Use Datawarp
Glen Overby (Cray Inc.)
Abstract: Cray DataWarp is a set of technologies that accelerates application I/O in order to reduce job wall clock time. It creates a near storage layer between main memory and hard disk drives, with direct-attached solid-state disk (SSD) storage to provide more cost-effective bandwidth than an external parallel file system (PFS), allowing DataWarp to be provisioned for bandwidth and the PFS to be provisioned for capacity and resiliency. This paper will discuss ways in which DataWarp can benefit applications and provide specific examples of using DataWarp with the Moab, PBS and Slurm workload managers. The detailed examples will include how to use DataWarp striped storage, how applications access that storage, how to stage data, how to use DataWarp per-node storage, how to request storage that persists across multiple jobs, and how to use DataWarp as a transparent cache.
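Rough companion illustration (not the paper's examples): the abstract above mentions per-job striped allocations and storage that persists across jobs. The sketch below combines the two in a single Slurm job, assuming a persistent reservation named "shared_bb" already exists; the reservation name, sizes, and paths are placeholders, and directive spellings and environment variable names may differ by workload manager and DataWarp release.

```python
"""Rough sketch (not the paper's examples): a Slurm job combining a per-job
DataWarp scratch allocation with an existing persistent reservation.
Reservation name, sizes, and paths are placeholders; directive spellings and
environment variable names may differ between workload managers and releases."""

JOB_SCRIPT = """#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#DW jobdw capacity=100GB access_mode=striped type=scratch
#DW persistentdw name=shared_bb

# The per-job allocation is released when the job ends; the persistent
# reservation (created earlier and named "shared_bb" here) survives across jobs.
# $DW_JOB_STRIPED points at the per-job space; the persistent reservation is
# exposed through a similarly named DW_PERSISTENT_* environment variable.
srun -n 64 ./app.x --scratch "$DW_JOB_STRIPED" --shared "$DW_PERSISTENT_STRIPED_shared_bb"
"""

with open("dw_job.sh", "w") as fh:      # submit later with: sbatch dw_job.sh
    fh.write(JOB_SCRIPT)
```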
Datawarp Accounting Metrics
Andrew Barry (Cray Inc.)
Abstract: Datawarp, a burst buffer package of fast flash storage and filesystem software, can provide a large improvement in I/O performance for jobs run on a Cray system, but the scale of this improvement depends on the configuration of the system as well as application optimization. Datawarp is a limited resource on Cray systems, with insufficient storage capacity for all jobs to completely replace their use of parallel filesystems. Thus, it is useful to know how various applications make use of the resource and to track total utilization. These statistics indicate which users are using Datawarp and for which applications. Cray Resource Utilization Reporting (RUR) plugins are available to collect Datawarp statistics and archive them for future analysis. This paper describes the available Datawarp usage statistics, context for interpreting those metrics, and some case studies of applications using Datawarp in different ways.

Paper Technical Session 9C
Chair: Tina Butler (National Energy Research Scientific Computing Center)

Theta: Rapid Installation and Acceptance of an XC40 KNL System
Ti Leggett, Mark Fahey, Susan Coghlan, Kevin Harms, Paul Rich, Ben Allen, Ed Holohan, and Gordon McPheeters (Argonne National Laboratory)
Abstract: In order to provide a stepping stone from the Argonne Leadership Computing Facility's (ALCF) world-class production 10 petaFLOPS IBM Blue Gene/Q system, Mira, to its next-generation 180 petaFLOPS 3rd-generation Intel Xeon Phi system, Aurora, ALCF worked with Intel and Cray to acquire an 8.6 petaFLOPS 2nd-generation Intel Xeon Phi based system named Theta. Theta was delivered, installed, integrated, and accepted on an aggressive schedule in just over 3 months. We will detail how we were able to successfully meet the aggressive deadline as well as lessons learned during the process.

Extending CLE6 to a multi-supercomputer Operating System
Douglas M. Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: NERSC operates multiple Cray supercomputing platforms using CLE6. In this paper we describe our methods, customizations, and additions to CLE6 to enable a coordinated management strategy for all four systems (two production systems and two development systems). Our methods use modern software engineering tools and practices to run precisely the same management software on all four systems, while still allowing version drift, customization, and some amount of acceptable feature drift between systems. In particular, we have devised procedures, software, and tools to allow the test and development systems to move rapidly between the exact production system configuration and various testing (future) configurations. We also discuss capabilities added to CLE6, such as customized Ansible facts trees, integration of ansible-vault for securing sensitive information, "branching" zypper repositories in a context-sensitive way, and replicating NIMS data structures across systems. These methods reduce administration and development cost and increase availability and reproducibility.

Updating the SPP Benchmark Suite for Extreme-Scale Systems
Gregory Bauer, Victor Anisimov, Galen Arnold, Brett Bode, Robert Brunner, Tom Cortese, Roland Haas, Andriy Kot, William Kramer, JaeHyuk Kwack, Jing Li, Celso Mendes, Ryan Mokos, and Craig Steffen (National Center for Supercomputing Applications/University of Illinois)
Abstract: For the effective deployment of modern extreme-scale systems, it is critical to rely on meaningful benchmarks that provide an assessment of the achievable performance a system may yield on real applications. The Sustained Petascale Performance (SPP) benchmark suite was used very successfully in evaluating the Blue Waters system, deployed by Cray in 2012, and in ensuring that the system achieved sustained petascale performance on applications from several areas. However, some of the original SPP codes did not have significant use, or underwent continuous optimizations. Hence, NCSA prepared an updated SPP suite containing codes that more closely reflect the recent workload observed on Blue Waters. NCSA is also creating a public website with source codes, input data, build/run scripts and instructions, plus performance results of this updated SPP suite. This paper describes the characteristics of those codes and analyzes their observed performance, pointing to areas of potential enhancement on modern systems.

Birds of a Feather BoF 10A
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Birds of a Feather BoF 10B
Chair: David Paul (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

A BoF - "Bursts of a Feather" - Burst Buffers from a Systems Perspective
David Paul (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), John Bent (Seagate), and Andrey Kudryavtsev (Intel Corporation)
Abstract: This BoF will bring interested parties together to discuss Burst Buffers: how they are integrated into today's supercomputers, what their underpinnings and use cases are, and what lies ahead for this exciting new high-performance I/O technology.

Birds of a Feather BoF 10C
Chair: Matteo Chesi (Swiss National Supercomputing Centre); Jeff Keopp (Cray Inc.)

XC System Management Usability BOF
Harold Longley and Joel Landsteiner (Cray Inc.)
Abstract: This BOF will be a facilitated discussion around usability of system management software on an XC system with SMW 8.0/CLE 6.0 software, focusing on standard and advanced administrator use cases. The goal will be to gain an understanding of the best and worst parts of interacting with XC System Management software and to understand how customers would like to see the software evolve.
HPC Storage Operations: from experience to new tools
Matteo Chesi (Swiss National Supercomputing Centre), Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory), Oliver Treiber (European Centre for Medium-Range Weather Forecasts), and Maciej L. Olchowik (King Abdullah University of Science and Technology)
Abstract: Managing storage systems at large scale is a challenging duty: detecting incidents, correlating events, troubleshooting strange I/O behaviors, migrating data, planning maintenance, testing new technologies, and dealing with user requests. Starting from storage operations experience, we will discuss the tools that are available today for storage administrators to monitor and control the I/O subsystem of HPC clusters, and the development of new features and tools that may satisfy current and future needs arising from the data analytics and cloud computing trends.

Birds of a Feather BoF 10D
Chair: David Hancock (Indiana University); Michael Showerman (National Center for Supercomputing Applications)

Open Discussion with CUG Board
David Hancock (Indiana University)
Abstract: This session is designed as an open discussion with the CUG Board, but there are a few high-level topics that will also be on the agenda. The discussion will focus on corporation changes to achieve non-profit status (including bylaw changes), feedback on increasing CUG participation, and feedback on SIG structure and communication. An open floor question and answer period will follow these topics. Formal voting (on candidates and the bylaws) will open after this session, so any candidates or members with questions about the process are welcome to bring up those topics.

Holistic Systems Monitoring and Analysis
Michael T. Showerman (National Center for Supercomputing Applications/University of Illinois) and Ann Gentile and James M. Brandt (Sandia National Laboratories)
Abstract: This BOF will be used to improve collaborations in the monitoring and analysis of Cray systems. It will include updates and future directions from many of the sites represented within the CUG monitoring working group. A goal is to improve the number and content of tool-and-technique quick start guides being developed by the Cray monitoring community. This will help the community to gather both lessons learned and requirements for future deployments of the full spectrum of Cray resources.

Plenary General Session 11
Chair: David Hancock (Indiana University)

Invited Talk: Perspectives on HPC and Enterprise High Performance Data Analytics
Arno Kolster (Providentia Worldwide)
Abstract: Mr. Kolster will present his experience of blending HPC and enterprise architectures to solve real-time, web-scale analytics problems and discuss the need to bridge the gap between HPC and enterprise. His unique perspective illustrates the need for enterprise to embrace HPC technologies and vice versa.

Sponsor Talk 12
Chair: Trey Breckenridge (Mississippi State University)

[Seagate] The Effects Fragmentation and Remaining Capacity Have on File System Performance
John Fragalla (Seagate)
Abstract: After a Lustre file system is put into production, and the usage model evolves as users delete and create files over time, fragmentation and the available storage capacity have an effect on overall performance throughput compared to a pristine file system.
In this presentation, Seagate will discuss the benchmark setup methodology: how to fill up the storage capacity of a file system and introduce fragmentation at different capacity points in order to analyze the impact on throughput performance. The presentation will illustrate that, at various percentages of filled capacity, the biggest impact comes from the amount of fragmentation that exists and not only from the capacity filled.

Sponsor Talk 13
Chair: Trey Breckenridge (Mississippi State University)

[SchedMD] Slurm Roadmap
Morris Jette (SchedMD)
Abstract: Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system used on many of the largest computers in the world, including five of the top ten systems on the TOP500 supercomputer list. This presentation will briefly describe Slurm capabilities with respect to Cray systems and the Slurm roadmap.

Plenary General Session 14
Chair: Helen He (National Energy Research Scientific Computing Center)

Lustre Lockahead: Early Experience and Performance using Optimized Locking
Michael Moore, Patrick Farrell, and Bob Cernohous (Cray Inc.)
Abstract: Recent Cray-authored Lustre modifications known as "Lustre lockahead" show significantly improved write performance for collective, shared-file I/O workloads. Initial tests show write performance improvements of more than 200% for small transfer sizes and over 100% for larger transfer sizes compared to traditional Lustre locking. Standard Lustre shared-file locking mechanisms limit scaling of shared-file I/O performance on modern high performance Lustre servers. The new lockahead feature provides a mechanism for applications (or libraries) with knowledge of their I/O patterns to overcome this limitation by explicitly requesting locks. MPI-IO is able to use this feature to dramatically improve shared-file collective I/O performance, achieving more than 80% of file-per-process performance. This paper discusses our early experience using lockahead with applications. We also present application and synthetic performance results and discuss performance considerations for applications that benefit from lockahead.

Plenary talk from Intel: Exascale Reborn
Rajeeb Hazra (Intel Corporation)
Abstract: Join Intel corporate vice president and general manager of the Enterprise and Government group, Dr. Rajeeb Hazra, for our plenary talk. Raj will discuss what must be done to address an ecosystem of ever-changing, complex, and diversified applications and workloads to deliver real-world performance at exascale.

Sponsor Talk 15
Chair: Trey Breckenridge (Mississippi State University)

[PGI] OpenACC and Unified Memory
Doug Miles (PGI)
Abstract: Optimizing data movement between host and device memories is an important step when porting applications to GPUs. This is true for any programming model (CUDA, OpenACC, OpenMP 4+, ...), and becomes even more challenging with complex aggregate data structures. While OpenACC data management directives are designed so they can be safely ignored on a shared memory system with a single address space, such as a multicore CPU, both the CUDA and OpenACC APIs require the programmer or compiler to explicitly manage data allocation and coherence on a system with separated memories. The OpenACC committee is designing directives to extend explicit data management for aggregate data structures.
CUDA C++ has managed memory allocation routines and CUDA Fortran has the managed attribute for allocatable arrays, allowing the CUDA driver to manage data movement and coherence for dynamically allocated data. The latest NVIDIA GPUs include hardware support for fully unified memory, enabling operating system and driver support for sharing of the entire address space between the host CPU and the GPU. We will compare current and future explicit data movement with driver- and system-managed memory, and discuss the impact of these on application development, programmer productivity, and performance.

Sponsor Talk 16
Chair: Trey Breckenridge (Mississippi State University)

[Allinea] Tools and Methodology for Ensuring HPC Program Correctness and Performance
Beau Paisley (ARM)
Abstract: In this presentation we will discuss best practices and methodology for HPC software engineering. We will provide illustrations of how the Allinea debugging and performance analysis tools can be used to ensure that you obtain optimal performance from your codes and that your codes run correctly.

Sponsor Talk 17
Chair: Trey Breckenridge (Mississippi State University)

[Altair] PBS Professional - Stronger, Faster, Better!
Scott J. Suchyta (Altair Engineering, Inc.)
Abstract: Not all workloads are created equal. They vary in size, duration, and priority level, and are influenced by many other site-specific factors. Such requirements often lead administrators to compromise system utilization, service level agreements, or even limit the capabilities of the system itself. Altair continues to provide features and flexibility allowing administrators to control how jobs are scheduled without making dramatic compromises. PBS Professional v13 is architected for exascale with increased speed, scale, resiliency, and much more. This presentation will provide a high-level overview of some of the key new PBS capabilities such as Multi-Scheduler, Preemption Targets, and flexibility and performance enhancements.

Plenary General Session 18
Chair: David Hancock (Indiana University)

Paper Technical Session 19A
Chair: Richard Barrett (Sandia National Labs)

The Cray Programming Environment: Current Status and Future Directions
Luiz DeRose (Cray Inc.)
Abstract: The scale and complexity of current and future high-end systems, with wide nodes, many integrated core (MIC) architectures, multiple levels in the memory hierarchy, and heterogeneous processing elements, bring a new set of challenges for application developers. These technology changes in the supercomputing industry are forcing computational scientists to face new critical system characteristics that will significantly impact the performance and scalability of applications. Users must be supported by intelligent compilers, automatic performance analysis and porting tools, scalable debugging tools, and adaptive libraries. In this talk I will present the recent activities, new functionalities, roadmap, and future directions of the Cray Programming Environment, which is being developed and deployed on Cray clusters and Cray supercomputers for scalable performance with high programmability.
Current State of the Cray MPT Software Stacks on the Cray XC Series Supercomputers
Krishna Kandalla, Peter Mendygral, Naveen Ravichandrasekaran, Nick Radcliffe, Bob Cernohous, Kim McMahon, Christopher Sadlo, and Mark Pagel (Cray)
Abstract: HPC applications heavily rely on the Message Passing Interface (MPI) and SHMEM programming models to develop distributed memory parallel applications. This paper describes a set of new features and optimizations that have been introduced in the Cray MPT software libraries to optimize the performance of scientific parallel applications on modern Cray XC series supercomputers. For Cray XC systems based on the Intel KNL processor, Cray MPT libraries have been optimized to improve communication performance and memory utilization, while also facilitating better use of the MCDRAM technology. Cray MPT continues to improve the performance of hybrid MPI/OpenMP applications that perform communication operations within threaded regions. The latest Cray MPICH offers a new lock-ahead optimization for MPI I/O, along with exposing internal timers and statistics for I/O performance profiling. Finally, this paper describes efforts involved in optimizing real-world applications such as WOMBAT and SNAP, along with deep learning applications, on the latest Cray XC supercomputers.

Profiling and Analyzing Program Performance Using Cray Tools
Heidi Poxon (Cray Inc.)
Abstract: The Cray Performance Tools help the user obtain maximum computing performance on Cray systems with profiling and analysis that focuses on discovering key bottlenecks in programs that run across many nodes. The tools' robust analysis capability helps users identify program hot spots, imbalance, communication patterns, and memory usage issues that impede scaling or optimal performance. As an example, CrayPAT and Cray Apprentice2 were recently used to scale the CNTK deep learning code to new levels for the system at the Swiss National Supercomputing Centre (CSCS). In addition to focusing on simple interfaces to make profiling and analysis accessible to more users, recent enhancements to CrayPAT, Cray Apprentice2 and Reveal include the new HBM memory analysis tool that identifies key arrays that can benefit from allocation in KNL's MCDRAM, a per-NUMA-node memory high-water mark, general Intel KNL and NVIDIA P100 support, as well as profiling support for Charm++.

Novel approaches to HPC user engagement
Clair Barrass and David Henty (EPCC, The University of Edinburgh)
Abstract: EPCC operates the UK national HPC service ARCHER, a Cray XC30 with a diverse user community. A key challenge to any HPC provider is growing the user base, making new users aware of the potential benefits of the service and ensuring a low barrier to entry. To this end, we have explored a number of approaches to user engagement that are novel within the context of UK HPC:

Paper Technical Session 19B
Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory)

Improving I/O Bandwidth With Cray DVS Client-Side Caching
Bryce T. Hicks (Cray Inc.)
Abstract: Cray's Data Virtualization Service, DVS, is an I/O forwarder providing access to native parallel filesystems and to Cray DataWarp application I/O accelerator storage at the largest system scales while still maximizing data throughput. This paper introduces DVS Client-Side Caching, a new option for DVS to improve I/O bandwidth, reduce network latency costs, and decrease the load on both DVS servers and backing parallel filesystems.
Implementing a Hierarchical Storage Management system in a large-scale Lustre and HPSS environment
Brett M. Bode, Michelle Butler, Jim Glasgow, and Sean Stevens (National Center for Supercomputing Applications/University of Illinois) and Nathan Schumann and Frank Zago (Cray Inc.)
Abstract: HSM functionality has been available with Lustre for several releases and is an important capability for HPC systems, providing data protection, space savings, and cost efficiencies; it is especially important to the NCSA Blue Waters system. Very few operational HPC centers have deployed HSM with Lustre, and even fewer at the scale of Blue Waters.

Understanding the IO Performance Gap Between Cori KNL and Haswell
Jialin Liu, Quincey Koziol, and Houjun Tang (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Francois Tessier (Argonne National Laboratory); and Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn Lockwood, Jack Deslippe, and Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004-node Haswell partition and a 9,688-node KNL partition, which ranked as the 5th fastest supercomputer on the November 2016 Top500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important, impacting not only NERSC/LBNL users and other national labs, but also the relevant hardware vendors and software developers. In this paper, we have analyzed the performance of single-core and single-node IO comprehensively on the Haswell and KNL partitions, and have discovered the major bottlenecks, which include CPU frequencies and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost differences caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.

DXT: Darshan eXtended Tracing
Cong Xu (Intel Corporation), Shane Snyder (Argonne National Laboratory), Omkar Kulkarni and Vishwanath Venkatesan (Intel Corporation), Philip Carns (Argonne National Laboratory), Suren Byna (Lawrence Berkeley National Laboratory), Robert Sisneros (National Center for Supercomputing Applications), and Kalyana Chadalavada (Intel Corporation)
Abstract: As modern supercomputers evolve to exascale, their I/O subsystems are becoming increasingly complex, making optimization of I/O for scientific applications a daunting task. Although I/O profiling tools facilitate the process of optimizing application I/O performance, legacy profiling tools lack flexibility in their level of detail and ability to correlate traces with other sources of data. Additionally, a lack of robust trace analysis tools makes it difficult to derive actionable insights from large-scale I/O traces.

Paper Technical Session 19C
Chair: Chris Fuson (ORNL)

Scheduler Optimization for Current Generation Cray Systems
Morris Jette (SchedMD LLC) and Douglas Jacobsen and David Paul (NERSC)
Abstract: The current generation of Cray systems introduces two major complications for workload scheduling: DataWarp burst buffers and Intel Knights Landing (KNL) processors.
In a typical use case, DataWarp resources are allocated to a job and data is staged in before compute resources are allocated to that job, then retained after computation for staging out of data. KNL supports five different NUMA modes and three MCDRAM modes. An application may require a specific KNL configuration, or its performance may be highly configuration dependent. If KNL resources with the desired configuration are not available for pending work, the overhead of rebooting compute nodes must be weighed against running the application in a less than ideal configuration or waiting for processors already in the desired configuration to become available. This paper will present the algorithms used by the Slurm workload manager, a statistical analysis of NERSC’s workload, and experiences with Slurm’s management of DataWarp and KNL. Trust Separation on the Cray XC40 using PBS Pro Sam Clarke (Met Office, UK) Abstract Abstract As the UK's national weather agency, the Met Office has a requirement to produce regular, timely weather forecasts. As a major centre for climate and weather research, it has a need to provide access to large-scale supercomputing resources to users from within the organisation. It also provides a supercomputer facility for academic partners inside the UK, and to international collaborators. Each of these user categories has a different set of availability requirements and requires a different level of access. Experiences running different workload managers across Cray Platforms Haripriya Ayyalasomayajula and Karlon West (Cray Inc.) Abstract Abstract Workload management is a challenging problem both in Analytics and in High Performance Computing. The desire is to have efficient platform utilization while still meeting scalability and scheduling requirements. SLURM and Moab/Torque are two commonly used workload managers that serve both resource allocation and scheduling requirements on the Cray CS and XC series supercomputers. Analytics applications interact with a different set of workload managers such as YARN or, more recently, Apache Mesos, which is the main resource manager for Urika-GX. In this paper, we describe our experiences using different workload managers across Cray platforms (Analytics and HPC). We describe the characteristics and functioning of each of the workload managers. We compare the different workload managers, specifically discussing the pros and cons of the HPC schedulers vs. Mesos, and run a sample workflow on each of the Cray platforms to illustrate resource allocation and job scheduling. An Operational Perspective on a Hybrid and Heterogeneous Cray XC50 System Sadaf R. Alam, Nicola Bianchi, Nicholas Cardo, Matteo Chesi, Miguel Gila, Stefano Gorini, Mark Klein, Marco Passerini, Carmelo Ponti, Fabio Verzelloni, and Colin McMurtrie (CSCS-ETHZ) Abstract Abstract The Swiss National Supercomputing Centre (CSCS) upgraded its flagship system, called Piz Daint, in Q4 2016 in order to support a wider range of services. The upgraded system is a heterogeneous Cray XC50 and XC40 system with Nvidia GPU-accelerated (Pascal) devices as well as multi-core nodes with diverse memory configurations. Despite the state-of-the-art hardware and the design complexity, the system was built in a matter of weeks and was returned to fully operational service for CSCS user communities in less than two months, while at the same time providing significant improvements in energy efficiency.
This paper focuses on the innovative features of the Piz Daint system that not only resulted in an adaptive, scalable and stable platform but also offer a very high level of operational robustness for a complex ecosystem. Birds of a Feather BoF 20A Chair: Patricia Langer (Cray) Sonexion Monitoring and Metrics: data collection, data retention, user workflows Patricia Langer and Craig Flaskerud (Cray) Abstract Abstract This BoF will explore types of metrics useful in analyzing performance issues on Cray Sonexion storage systems, data retention and reduction concerns considering the volume of metrics that can be collected, and typical workflows administrators use to analyze and isolate Sonexion performance issues related to jobs launched from their Cray HPC systems. Birds of a Feather BoF 20B Chair: Harold Longley (Cray Inc.) eLogin Usability and Best Practices Jeff Keopp, Mark Ahlstrom, and Harold Longley (Cray Inc.) Abstract Abstract This BoF session is a facilitated discussion around usability and best practices for administering and configuring eLogin nodes. eLogin nodes are the external login nodes for Cray XC systems running CLE 6.x. They replace the esLogin nodes used with CLE 5.x. The discussion will also include the new Cray Management Controller (CMC) and Cray System Management Software (CSMS), which replace the current CIMS and Bright Cluster Manager software for managing eLogin nodes. The goal will be to gain an understanding of the best and worst parts of administering eLogin nodes and to understand how customers would like to see the software evolve. Birds of a Feather BoF 20C Chair: Robert Stober (Bright) Building an Enterprise-Grade Deep Learning Environment with Bright and Cray Robert Stober (Bright Computing) Abstract Abstract Enterprises are collecting increasing amounts of data. By leveraging deep and machine learning technologies, the analysis of corporate data can be taken to the next level, providing organizations with richer insight into their business, resulting in increased sales and/or significant competitive advantage. When business advantage is tied to the insights achieved via deep learning, it is essential for the underlying IT infrastructure to be deployed and managed as enterprise-grade, not as a lab experiment. However, building and managing an advanced cluster, installing the software that satisfies all of the library dependencies, and making it all work together presents an enormous challenge. Birds of a Feather BoF 20D Chair: Nicholas Cardo (Swiss National Supercomputing Centre) Bringing "Shifter" to the Broader Community Nicholas Cardo (Swiss National Supercomputing Centre) and Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract The success and popularity of using "Shifter", developed by NERSC, for containers continues to grow. In order to keep pace with this success and growth of "Shifter", a community is forming behind it. During this BoF, we'll discuss the status, efforts, and opportunities of bringing "Shifter" to an Open Source Community. Topics will include organisational structure, software management, and levels of participation.
Plenary General Session 21 Chair: Helen He (National Energy Research Scientific Computing Center) Panel: Future Directions of Data Analytics and High Performance Computing Yun (Helen) He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Scott Michael (Indiana University), Mr Prabhat (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory), Rangan Sukumar (Cray Inc.), Javier Luraschi (RStudio), and Meena Arunachalam (Intel) Abstract Abstract Panel Discussion: Future Directions of Data Analytics and High Performance Computing Plenary General Session 22 Chair: David Hancock (Indiana University) Sponsor Talk Sponsor Talk 23 Chair: Trey Breckenridge (Mississippi State University) [Adaptive Computing] Reporting and Analytics, Portal-based Job Submission, Remote Visualization, Accounting and High Throughput Task Processing on Torque and Slurm Nick Ihli (Adaptive Computing) Abstract Abstract Adaptive Computing will present on Reporting & Analytics, Viewpoint (portal-based job submission), Remote Visualization, Accounting and Nitro (high throughput task processing) on Torque and Slurm as it covers how the “Open Platform” initiative helps bring Enterprise solutions to your choice of scheduler. Further, Adaptive Computing will present on significant new Torque advancements intended for its platform “Unification” initiative, as well as advancements in power management, DataWarp integration and other product enhancements. Sponsor Talk Sponsor Talk 24 Chair: Trey Breckenridge (Mississippi State University) [Bright Computing] Achieving a Dynamic Datacenter with Bright and Cray Robert Stober (Bright) Abstract Abstract IT teams are under pressure to manage an emerging range of computing-intensive and data-intensive workloads with very different characteristics from traditional IT. Executing these workloads means that companies need to master multiple technologies ranging from high performance computing and big data analytics to virtualization, containerization, and cloud. Paper Technical Session 25A Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Porting the microphysics model CASIM to GPU and KNL Cray machines Nick Brown and Alexandr Nigay (EPCC, The University of Edinburgh); Ben Shipway and Adrian Hill (Met Office, UK); and Michele Weiland (EPCC, The University of Edinburgh) Abstract Abstract CASIM is a microphysics model used to investigate interactions at the millimetre scale and study the formation and development of moisture. This is a crucial aspect of atmospheric modelling, and as such CASIM is used as a sub-model by other models, but it is computationally intensive and can severely impact the runtime of these models. An in-depth evaluation of GCC’s OpenACC implementation on Cray systems Veronica G. Vergara Larrea, Wael R. Elwasif, and Oscar Hernandez (Oak Ridge National Laboratory) and Cesar Philippidis and Randy Allen (Mentor Graphics) Abstract Abstract In this study, we will perform an in-depth evaluation of GCC’s OpenACC implementation on ORNL’s Cray XK7 and compare it with other available implementations. The results presented will be useful for the larger community interested in using and evaluating new OpenACC implementations. Finally, a discussion on how an OpenACC implementation in GCC may help the interoperability of both OpenACC and OpenMP 4.5 (offload) specifications will be presented.
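As context for the GCC OpenACC evaluation above, the sketch below shows the kind of directive-based kernel such cross-compiler comparisons are typically built around: a simple vector update annotated with an OpenACC parallel loop and explicit data clauses. It is an illustrative example only, not taken from the paper's benchmark suite; with GCC it would be compiled with -fopenacc, and the same source is accepted by other OpenACC compilers, which is what makes implementation comparisons on systems such as the Cray XK7 possible.

/* Illustrative OpenACC kernel (not from the paper): offload a vector
 * update to the accelerator.  Compile with GCC using -fopenacc. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* copyin/copy clauses describe data movement to and from device memory */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
    free(x);
    free(y);
    return 0;
}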
HPCG and HPGMG benchmark tests on Multiple Program, Multiple Data (MPMD) mode on Blue Waters – a Cray XE6/XK7 hybrid system JaeHyuk Kwack and Gregory H. Bauer (National Center for Supercomputing Applications) Abstract Abstract The High Performance Conjugate Gradients (HPCG) and High Performance Geometric Multi-Grid (HPGMG) benchmarks are alternatives to the traditional LINPACK benchmark (HPL) in measuring the performance of modern HPC platforms. We performed HPCG and HPGMG benchmark tests on a Cray XE6/XK7 hybrid supercomputer, Blue Waters, at the National Center for Supercomputing Applications (NCSA). The benchmarks were tested on CPU-based and GPU-enabled nodes separately, and then we analyzed characteristic parameters that affect their performance. Based on our analyses, we performed HPCG and HPGMG runs in Multiple Program, Multiple Data (MPMD) mode in the Cray Linux Environment in order to measure their hybrid performance on both CPU-based and GPU-enabled nodes. We observed and analyzed several performance issues during those tests. Based on lessons learned from this study, we provide recommendations about how to optimize science applications on modern hybrid HPC platforms. Paper Technical Session 25B Chair: Bilel Hadri (KAUST Supercomputing Lab) Project Caribou: Monitoring and Metrics for Sonexion Craig Flaskerud (Cray) Abstract Abstract The scale and number of subsystems in today’s High Performance Computing system deployments make it difficult to monitor application performance and determine root causes when performance is not what is expected. System component failures, system resource oversubscription, or poorly written applications can all contribute to systems not running as expected and thus to poorly performing applications. This problem is exacerbated by the need to mine information from multiple sources across system subcomponents. Collecting the data may require privileged access, and the data must be collected in a timely manner or critical information can be lost. Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, and Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Eddie Baron (University of Oklahoma); and Peter Hauschildt (Hamburger Sternwarte) Abstract Abstract The newest NERSC supercomputer, Cori, is a Cray XC40 system consisting of 2,388 Intel Xeon Haswell nodes and 9,688 Intel Xeon-Phi “Knights Landing” (KNL) nodes. Compared to the Xeon-based clusters NERSC users are familiar with, optimal performance on Cori requires consideration of KNL mode settings; process, thread, and memory affinity; fine-grain parallelization; vectorization; and use of the high-bandwidth MCDRAM memory. This paper describes our efforts preparing NERSC users for KNL through the NERSC Exascale Science Application Program (NESAP), web documentation, and user training. We discuss how we configured the Cori system for usability and productivity, addressing programming concerns, batch system configurations, and default KNL cluster and memory modes. System usage data, job completion analysis, issues with programming and running jobs, and a few successful user stories on KNL are presented.
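The Cori abstract above points to use of the high-bandwidth MCDRAM memory as one of the key KNL considerations. As a hedged illustration, not taken from the paper, of one common way an application places a hot array in MCDRAM when a node is booted in flat mode, the following C sketch uses the memkind library's hbwmalloc interface and falls back to ordinary DDR allocation when high-bandwidth memory is unavailable; in cache mode MCDRAM is used transparently and no source change is needed.

/* Hedged sketch: explicit MCDRAM allocation on KNL in flat mode via the
 * memkind hbwmalloc interface.  Link with -lmemkind. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;                 /* 16M doubles, 128 MiB */
    int have_hbw = (hbw_check_available() == 0);

    double *field = have_hbw ? hbw_malloc(n * sizeof(double))  /* MCDRAM */
                             : malloc(n * sizeof(double));     /* DDR fallback */
    if (!field) return 1;

    for (size_t i = 0; i < n; ++i)
        field[i] = (double)i;
    printf("last element: %f\n", field[n - 1]);

    if (have_hbw)
        hbw_free(field);
    else
        free(field);
    return 0;
}

In flat mode a similar effect can often be obtained without source changes by preferring the MCDRAM NUMA node with numactl.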
A High Performance SVD Solver on Manycore Systems Dalal Sukkari and Hatem Ltaief (KAUST), Aniello Esposito (Cray EMEA Research Lab (CERL)), and David Keyes (KAUST) Abstract Abstract We describe the high performance implementation of a new singular value decomposition (SVD) solver for dense matrices on distributed-memory manycore systems. Based on the iterative QR dynamically-weighted Halley algorithm (QDWH), the new SVD solver performs more floating-point operations than the bidiagonal reduction variant of the standard SVD, but exposes at the same time more parallelism, and therefore runs closer to the theoretical peak performance of the system, thanks to more compute-bound matrix operations. The resulting distributed-memory QDWH-SVD solver is more numerically robust in the presence of large ill-conditioned matrices and achieves up to fourfold speedup on thousands of cores against the current state-of-the-art SVD implementation from the Cray Scientific Library. Paper Technical Session 25C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Cray XC40 System Diagnosability: Functionality, Performance, and Lessons Learned Jeffrey Schutkoske (Cray Inc.) Abstract Abstract The Intel® Xeon Phi CPU 7250 processor presents new opportunities for diagnosing the node in the Cray® XC40 system. This processor supports a new high-bandwidth on-package MCDRAM memory and interfaces. It also supports different Non-Uniform Memory Access (NUMA) configurations. The new Cray Processor Daughter Card (PDC) also supports an optional PCIe SSD card. Previous work has outlined Cray system diagnosability for the Cray® XC Series. This processor requires new BIOS, administrative commands, power and thermal limits, as well as new diagnostics to validate functionality and performance. KNL System Software Peter Hill, Clark Snyder, and John Sygulla (Cray Inc.) Abstract Abstract Intel Xeon Phi "Knights Landing" (KNL) presents opportunities and challenges for system software. This paper starts with an overview of KNL architecture. We describe some of the key differences from traditional Xeon processors, such as processor (NUMA) and memory (MCDRAM) modes. We describe which KNL modes are most useful, and why. From there, we describe a day in the life of a KNL system, emphasizing unique features such as mode reconfiguration (selecting the processor and memory configuration for a job) and the zone sort feature (which optimizes performance of the MCDRAM cache). As part of this coverage, we'll look at implementation, scaling and performance issues. Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II Adam J. DeConinck, Hai Ah Nam, Amanda Bonnie, David Morton, and Cory Lueninghoener (Los Alamos National Laboratory); James Brandt, Ann Gentile, Kevin Pedretti, Anthony Agelastos, Courtenay Vaughan, Simon Hammond, and Benjamin Allan (Sandia National Laboratories); and Jason Repik and Mike Davis (Cray Inc.) Abstract Abstract We present the holistic approach taken by the ACES team in the design and implementation of a monitoring system tailored to the new Cray XC40 KNL-based Trinity Phase II system currently being deployed in an Open Science campaign. We have created a unique dataset from controlled experiments to which we apply various numerical analyses and visualizations in order to determine actionable monitoring data combinations that we can associate with performance impact and system issues.
Our ultimate goal is to perform run-time analysis of such data combinations and apply runtime feedback to users and system software in order to improve application performance and system efficiency. Birds of a Feather BoF 26 Chair: Sreenivas Sukumar (Oak Ridge National Lab) Deep Learning on Cray Platforms Sreenivas Sukumar (Cray Inc.), Mr. Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Maxime Martinasso (Swiss National Supercomputing Centre) Abstract Abstract The application of machine learning and deep learning has gained tremendous interest both in academic and commercial organizations. With increased adoption and faster rates of data creation and collection, the need to scale these deep learning problems has arrived. In this Birds-of-a-Feather session, we will engage in active discussion around running deep learning workloads on Cray platforms at scale. Researchers from leadership computing facilities along with Cray engineers will be sharing their experiences. More specifically, we will discuss: (i) how to run deep learning workloads on the XC, CS and Urika-GX platforms, (ii) popular DL toolkits (TensorFlow, MXNet, Caffe, CNTK, BigDL etc.), (iii) HPC best practices toward “strong-scaling” deep learning workloads on multi-node configurations, (iv) application of deep learning in the different science domains today and at exascale, (v) experiences with deep learning on different HPC architectures (IB vs. Aries, CPU vs. GPU, etc.). Paper Technical Session 27A Chair: Tina Butler (National Energy Research Scientific Computing Center) Next Generation Science Applications for the Next Generation of Supercomputing Courtenay Vaughan, Simon Hammond, Dennis Dinge, Paul Lin, Christian Trott, Douglas Pase, Jeanine Cook, Clay Hughes, and Robert Hoekstra (Sandia National Laboratories) Abstract Abstract The Trinity supercomputer deployment by Los Alamos and Sandia National Laboratories represents the first Advanced Technology System deployment for the United States National Nuclear Security Administration. It will be one of the largest XC40 deployments in the world when installed during 2017. We present performance analysis of a suite of new applications that have been written from the ground up to be portable across computing architectures, parallel in terms of multi-node and on-node threading, and to feature more flexible component-based code design. These applications leverage Kokkos, Sandia’s C++ Performance Portability programming model, the Trilinos linear solver library, and our broader performance analysis capabilities in a close-knit codesign program. Driven by the NNSA’s Advanced Technology Development and Mitigation (“ATDM”) program, the new codes represent prototypes of fully-capable production science codes that will execute with high levels of efficiency on the next generation of supercomputing platforms, including Trinity and beyond. Fusion PIC Code Performance Analysis on The Cori KNL System Tuomas Koskela, Jack Deslippe, and Brian Friesen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Karthik Raman (Intel Corporation) Abstract Abstract We study the attainable performance of Particle-In-Cell codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1.
We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance and focus optimization efforts there. Particle push kernels operate at high arithmetic intensity and are not likely to be memory-bandwidth or even cache-bandwidth bound on KNL. Therefore, we see only minor benefits from the high-bandwidth memory available on KNL; achieving good vectorization is the most beneficial optimization path and can theoretically yield up to an 8x speedup on KNL, but is in practice limited by the data layout to 4x. Performance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture Zhengji Zhao (National Energy Research Scientific Computing Center (NERSC), USA); Martijn Marsman (Universität Wien, Austria); Florian Wende (Zuse Institute Berlin (ZIB), Germany); and Jeongnim Kim (Intel, USA) Abstract Abstract With the recent installation of Cori, a Cray XC40 system with the Intel Xeon Phi Knights Landing (KNL) many integrated core (MIC) architecture, NERSC is transitioning from the multi-core to the more energy-efficient many-core era. The developers of VASP, a widely used materials science code, have adopted MPI/OpenMP parallelism to better exploit the increased on-node parallelism, wider vector units, and the high-bandwidth on-package memory (MCDRAM) of KNL. To achieve optimal performance, KNL specifics relevant to the build, boot, and run time setup must be explored. In this paper, we will present the performance analysis of representative VASP workloads on Cori, focusing on the effects of the compilers, libraries, and boot/run time options such as the NUMA/MCDRAM modes, Hyper-Threading, huge pages, core specialization, and thread scaling. The paper is intended to serve as a KNL performance guide for VASP users, but it will also benefit other KNL users. Paper Technical Session 27B Chair: Scott Michael (Indiana University) Toward a Scalable Bank of Filters for High Throughput Image Analysis on the Cray Urika-GX System FNU Shilpika, Nicola Ferrier, and Venkatram Vishwanath (Argonne National Laboratory) Abstract Abstract High throughput image analysis is critical for experimental sciences facilities and enables one to glean timely insights into the various experiments and to better understand the physical phenomena being imaged. We present the design and evaluation of the bank of filters, the core building blocks for high throughput image analysis, on the Cray Urika-GX system. We describe our infrastructure developed with Apache Spark. We scaled this to 500 cores of the Urika-GX system for the analysis of a combustion engine dataset imaged at the Advanced Photon Source at Argonne National Laboratory and observed significant speedups. This scalable infrastructure now opens the doors to the application of a wide range of image processing algorithms and filters to the large-scale datasets being imaged at various light sources. Towards Seamless Integration of Data Analytics into Existing HPC Infrastructures Dennis Hoppe, Michael Gienger, and Thomas Boenisch (High Performance Computing Center Stuttgart); Diana Moise (Cray Inc.); and Oleksandr Shcherbakov (High Performance Computing Center Stuttgart) Abstract Abstract Customers of the High Performance Computing Center Stuttgart (HLRS) tend to execute more complex and data-driven applications, often resulting in large amounts of data of up to 1 petabyte.
The majority of our customers, however, currently lack the ability and knowledge to process this amount of data in a timely manner and extract meaningful information from it. We have therefore established a new project in order to support our users with the task of knowledge discovery by means of data analytics. We put the high performance data analytics system, a Cray Urika-GX, into operation to cope with this challenge. In this paper, we give an overview of our project and discuss the inherent challenges in bridging the gap between HPC and data analytics in a production environment. The paper concludes with a case study about analyzing log files of a Cray XC40 to detect variations in system performance. We were able to successfully identify so-called aggressor jobs, which significantly reduce the performance of other simultaneously running jobs. Quantifying Performance of CGE: A Unified Scalable Pattern Mining and Search System Kristyn J. Maschhoff, Robert Vesse, Sreenivas R. Sukumar, Michael F. Ringenburg, and James Maltby (Cray Inc.) Abstract Abstract CGE was developed as one of the first applications to embody our vision of an analytics ecosystem that can be run on multiple Cray platforms. This paper presents the Cray Graph Engine (CGE) as a solution that addresses the need for a unified ad-hoc subject-matter driven graph-pattern search and linear-algebraic graph analysis system. We demonstrate that CGE, implemented using the PGAS parallel programming model, performs better than most off-the-shelf graph query engines on ad-hoc pattern search while also enabling the study of graph-theoretic spectral properties in runtimes comparable to optimized graph-analysis libraries. Currently CGE is provided with the Cray Urika-GX and can also run on Cray XC systems. Through experiments, we show that compared to other state-of-the-art tools, CGE offers strong scaling and can often handle graphs three orders of magnitude larger, more complex datasets (long diameters, hypergraphs, etc.), and more computationally intensive complex pattern searches. Paper Technical Session 27C Chair: Jean-Guillaume Piccinali (Swiss National Supercomputing Centre) Application-Level Regression Testing Framework using Jenkins Reuben D. Budiardja (Oak Ridge National Laboratory) and Timothy Bouvet and Galen Arnold (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract This paper will explore the challenges of regression testing and monitoring of large-scale systems such as NCSA’s Blue Waters. Our goal was to develop an automated solution for running user-level regression tests to evaluate system usability and performance. Jenkins was chosen for its versatility, large user base, and multitude of plugins, including plotting of test results over time. We utilize these plots to track trends and alert us to system-level issues before they are reported by our partners (users). Not only does Jenkins have the ability to store historical data, but it can also send email or text messages based on the results of a test. Our other requirements included two-factor authentication for accessing the Jenkins GUI with administrator privileges and account management through LDAP. In this paper we describe our experience in deploying Jenkins as a user-level, system-wide regression testing and monitoring framework for Blue Waters.
Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo Valerio Formicola, Saurabh Jha, Fei Deng, and Daniel Chen (University of Illinois at Urbana-Champaign); Amanda Bonnie and Mike Mason (Los Alamos National Laboratory); Annette Greiner (National Energy Research Scientific Computing Center); Ann Gentile and Jim Brandt (Sandia National Laboratories); Larry Kaplan and Jason Repik (Cray Inc.); Jeremy Enos and Michael Showerman (National Center for Supercomputing Applications/University of Illinois); Zbigniew Kalbarczyk (University of Illinois at Urbana-Champaign); William Kramer (National Center for Supercomputing Applications/University of Illinois); and Ravishankar Iyer (University of Illinois at Urbana-Champaign) Abstract Abstract We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA’s Blue Waters. We use data collected from the logs and from network performance counters 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scales, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems. A regression framework for checking the health of large HPC systems Vasileios Karakasis, Victor Holanda Rusu, Andreas Jocksch, Jean-Guillaume Piccinali, and Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre) Abstract Abstract In this paper, we present a new framework for writing regression tests for HPC systems, called ReFrame. The goal of this framework is to abstract away the complexity of the interactions with the system, separating the logic of a regression test from the low-level details, which pertain to the system configuration and setup. This allows users to write easily portable regression tests, focusing only on the functionality. Paper Technical Session 28A Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) HPC Containers in use Jonathan Sparks (Cray Inc.) Abstract Abstract Linux containers in the commercial world are changing the landscape for application development and deployments. Container technologies are also making inroads into HPC environments, as exemplified by NERSC’s Shifter and LBL’s Singularity. While the first generation of HPC containers offers some of the same benefits as the existing open container frameworks, like CoreOS or Docker, they do not address cloud/commercial feature sets such as virtualized networks, full isolation, and orchestration. This paper will explore the use of containers in the HPC environment and summarize our study of how best to use these technologies at scale. Shifter: Fast and consistent HPC workflows using containers Lucas Benedicic, Felipe A.
Cruz, and Thomas Schulthess (Swiss National Supercomputing Centre) Abstract Abstract In this work we describe the experiences of building and deploying containers using Docker and Shifter, respectively. We present basic benchmarking tests that show the performance portability of certain workflows as well as performance results from the deployment of widely used non-trivial scientific applications. Furthermore, we discuss the resulting workflows through use cases that cover container creation on a laptop and deployment at scale, taking advantage of specialized hardware: the Cray Aries interconnect and NVIDIA Tesla P100 GPU accelerators. ExPBB: A framework to explore the performance of Burst Buffer Georgios Markomanolis (KAUST Supercomputing Laboratory) Abstract Abstract The ShaheenII supercomputer provides 268 Burst Buffer nodes based on Cray DataWarp technology, adding an SSD-based layer between the compute nodes and the parallel filesystem. However, this technology is new, and many scientists are still trying to understand how to obtain maximum performance from it. We present an auto-tuning I/O framework called Explore the Performance of Burst Buffer (ExPBB). The purpose of this project is to determine the optimum parameters for achieving the maximum performance of applications executed on the Burst Buffer. We study the number of Burst Buffer nodes used, the number of MPI aggregators, the file striping unit, and the number of MPI/OpenMP processes. The framework aggregates I/O performance data from the Darshan tool and the MPI I/O statistics provided by Cray MPICH, and then studies the parameters, according to several criteria, until it converges on the maximum performance. We report results showing that in some cases we achieved speedups of up to 4.52x using this framework. Paper Technical Session 28B Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration) Tuning Sub-filing Performance on Parallel File Systems Suren Byna (Lawrence Berkeley National Laboratory), Mohamad Chaarawi (Intel Corporation), Quincey Koziol (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and John Mainzer and Frank Willmore (The HDF Group) Abstract Abstract Subfiling is a technique used on parallel file systems to reduce locking and contention issues when multiple compute nodes interact with the same storage target node. Subfiling provides a compromise between the single-shared-file approach, which instigates lock contention problems on parallel file systems, and having one file per process, which results in a massive and unmanageable number of files. In this paper, we evaluate and tune the performance of the recently implemented subfiling feature in HDF5. Specifically, we explain the implementation strategy of the subfiling feature in HDF5, provide examples of using the feature, and evaluate and tune the parallel I/O performance of this feature with the parallel file systems of the Cray XC40 system at NERSC (Cori), which include burst buffer storage and Lustre disk-based storage. We also evaluate I/O performance on the Cray XC30 system, Edison, at NERSC. Our results show a performance advantage of 1.2x to 6x with subfiling compared to writing a single shared HDF5 file. We present our exploration of configurations, such as the number of subfiles and the number of Lustre storage targets used to store files, as optimization parameters to obtain superior I/O performance.
Based on this exploration, we discuss recommendations for achieving good I/O performance as well as limitations of using the subfiling feature. Enabling Portable I/O Analysis of Commercially Sensitive HPC Applications Through Workload Replication James Dickson and Steven Wright (University of Warwick); Satheesh Maheswaran, Andy Herdman, and Duncan Harris (Atomic Weapons Establishment); Mark C. Miller (Lawrence Livermore National Laboratory); and Stephen Jarvis (University of Warwick) Abstract Abstract Benchmarking and analyzing I/O performance across high performance computing (HPC) platforms is necessary to identify performance bottlenecks and guide effective use of new and existing storage systems. Doing this with large production applications, which can often be commercially sensitive and lack portability, is not a straightforward task, and the availability of a representative proxy for I/O workloads can help to provide a solution. We use Darshan I/O characterization and the MACSio proxy application to replicate five production workloads, showing how these can be used effectively to investigate I/O performance when migrating between HPC systems ranging from small local clusters to leadership-scale machines. Preliminary results indicate that it is possible to generate datasets that match the target application with a good degree of accuracy. This enables a predictive performance analysis study of a representative workload to be conducted on five different systems. The results of this analysis are used to identify how workloads exhibit different I/O footprints on a file system and what effect file system configuration can have on performance. An Exploration into Object Storage for Exascale Supercomputers Raghunath Raja Chandrasekar, Lance Evans, and Robert Wespetal (Cray Inc.) Abstract Abstract The need for scalable, resilient, high-performance storage is greater now than ever in high performance computing. Exploratory research at Cray studies aspects of emerging storage hardware and software design for exascale-class supercomputers, analytics frameworks, and commodity clusters. Our outlook toward object storage and scalable database technologies is improving as the trends, opportunities, and challenges of transitioning to them also evolve. Cray's prototype SAROJA (Scalable And Resilient ObJect storAge) library is presented as one example of our exploration, highlighting design principles guided by the I/O semantics of HPC codes and the characteristics of up-and-coming storage media. SAROJA is extensible I/O middleware that has been designed from the ground up with object semantics exposed via APIs to applications, while supporting a variety of pluggable file and object back-ends. It decouples the metadata and data paths, allowing for independent implementation, management, and scaling of each. Initial functional and performance evaluations indicate there is both promise and plenty of opportunity for advancement. Paper Technical Session 28C Chair: Richard Barrett (Sandia National Labs) Enabling the Super Facility with Software Defined Networking Richard S. Canon, Brent R. Draney, Jason R. Lee, David L. Paul, David E. Skinner, and Tina M. Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Experimental and observational facilities are increasingly turning to high-performance computing centers to meet their growing analysis requirements. The combination of experimental facilities with HPC centers has been termed the Super Facility.
This vision requires a new level of connectivity and bandwidth between these remote instruments and the HPC systems. NERSC, in collaboration with Cray, has been exploring a new model of networking that builds on the principles of Software Defined Networking. We envision an architecture that allows the wide-area network to extend into the Cray system and enables external facilities to stream data directly to compute resources inside the system at over 100 Gb/s in the near future, eventually reaching beyond 1 Tb/s. In this paper we expand on this vision, describe some of the motivating use cases in more detail, lay out our proposed architecture and implementation, describe our progress to date, and outline future plans. Advanced Risk Mitigation of Software Vulnerabilities at Research Computing Centers Urpo Kaila (CSC - IT Center for Science Ltd.) Abstract Abstract Software security vulnerabilities have caused research computing centers concern, excess work, service breaks, and data leakages since the time of the great Morris Internet worm in 1988. Despite evolving awareness, testing, and patching procedures, vulnerabilities and vulnerability patching still cause too much trouble both for users and for the sites. Comparing Spark GraphX and Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash (Deloitte) Abstract Abstract Graph analytics are useful for overcoming real-world analytic challenges such as detecting cyber threats. Our Urika-GX system is configured to use both the Cray Graph Engine (CGE) and Apache Spark for developing and executing hybrid workflows utilizing both Spark and graph analytic engines. Spark allows us to quickly process data stored in HDFS, powering flexible analytics in addition to graph analytics for cyber threat detection. Spark’s GraphX library offers an alternative graph engine to CGE. In this presentation, we will compare the available algorithms, challenges, and performance of both the CGE and GraphX engines in the context of a real-world client use case utilizing 40 billion RDF triples. New Site Talk New Site Talk 29 Chair: Helen He (National Energy Research Scientific Computing Center) Plenary General Session 30 Chair: David Hancock (Indiana University) Hexagon@University of Bergen, Norway Csaba Anderlik (University of Bergen, Norway); Ingo Bethke, Mats Bentsen, and Alok Kumar Gupta (Uni Research Climate, Norway); Jon Albretsen and Lars Asplin (Institute for Marine Research, Norway); Michel S. Mesquita (Uni Research Climate, Norway); Saurabh Bhardwaj (The Energy and Resources Institute, India); and Laurent Bertino (Nansen Environmental and Remote Sensing Center, Norway) Abstract Abstract Hexagon, the current High Performance Computing (HPC) resource at the University of Bergen, Norway, is approaching its end of life. This article highlights some of the scientific results in the field of Climate Modelling obtained using this exceptional resource, a Cray XE6m-200 machine.