Paper · Applications & Programming Environments Technical Session 7A Chair: Helen He (National Energy Research Scientific Computing Center) Performance on Trinity (a Cray XC40) with Acceptance-Applications and Benchmarks Mahesh Rajan (Sandia National Laboratories); Nathan Wichmann, Cindy Nuss, Pierre Carrier, Ryan Olson, Sarah Anderson, and Mike Davis (Cray Inc.); Randy Baker (Los Alamos National Laboratory); Erik Draeger (Lawrence Livermore National Laboratory); and Stefan Domino and Anthony Agelastos (Sandia National Laboratories) Abstract Trinity is NNSA's first ASC Advanced Technology System (ATS), targeted to support the largest, most demanding nuclear weapon simulations. Trinity Phase-1 (the focus of this paper) has 9436 dual-socket Haswell nodes, while Phase-2 will have close to 9500 KNL nodes. This paper documents the performance of applications and benchmarks used for Trinity acceptance. It discusses the early experiences of the Tri-Lab (LANL, SNL and LLNL) and Cray teams in meeting the challenges of achieving optimal performance on this new architecture by taking advantage of the large number of cores on the node, wider SIMD/vector units, and the Cray Aries network. Application performance comparisons to our previous-generation large Cray capability systems show excellent scalability. The overall architecture is facilitating easy migration of our production simulations to this 11 PFLOPS system, while improved workflow through the use of Burst Buffer nodes is still under investigation. Improving I/O Performance of the Weather Research and Forecast (WRF) Model Tricia Balle and Peter Johnsen (Cray Inc.) Abstract As HPC resources continue to increase in size and availability, the complexity of numerical weather prediction models also rises. This increases demands on HPC I/O subsystems, which continue to cause bottlenecks in efficient production weather forecasting. Performance Evaluation of Apache Spark on Cray XC Systems Nicholas Chaimov and Allen Malony (University of Oregon), Khaled Ibrahim and Costin Iancu (Lawrence Berkeley National Laboratory), and Shane Canon and Jay Srinivasan (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract We report our experiences in porting and tuning the Apache Spark data analytics framework on the Cray XC30 (Edison) and XC40 (Cori) systems installed at NERSC. Spark has been designed for cloud environments where local disk I/O is cheap and performance is constrained by network latency. In large HPC systems, diskless nodes are connected by fast networks: without careful tuning, Spark execution is dominated by I/O performance. In the default configuration, a centralized storage system such as Lustre makes metadata access latency a major bottleneck that severely constrains scalability. We show how to mitigate this by using per-node loopback filesystems for temporary storage. With this technique, we reduce the communication (data shuffle) time by multiple orders of magnitude and improve application scalability from O(100) to O(10,000) cores on Cori. With this configuration, Spark's execution again becomes network dominated, which shows in the performance comparison with a cluster that has fast local SSDs and was specifically designed for data-intensive workloads. Owing to its slightly faster processors and better network, Cori delivers performance that is better by an average of 13.7% on the machine learning benchmark suite.
This is the first such result where HPC systems outperform systems designed for data-intensive workloads. Overall, we believe this paper demonstrates that local disks are not necessary for good performance on data analytics workloads. Paper · Applications & Programming Environments Technical Session 8A Chair: Bilel Hadri (KAUST Supercomputing Lab) Trinity: Architecture and Early Experience Scott Hemmert (Sandia National Laboratories); Manuel Vigil and James Lujan (Los Alamos National Laboratory); Rob Hoekstra (Sandia National Laboratories); Daryl Grunau; David Morton; Hai Ah Nam; Paul Peltz, Jr.; Alfred Torrez; and Cornell Wright (Los Alamos National Laboratory); Shawn Dawson (Lawrence Livermore National Laboratory); and Simon Hammond and Michael Glass (Sandia National Laboratories) Abstract The Trinity supercomputer is the first in a series of Advanced Technology Systems (ATS) that will be procured by the DOE's Advanced Simulation and Computing program over the next decade. The ATS systems serve the dual role of meeting immediate mission needs and helping to prepare for future system designs. Trinity meets this goal through a two-phase delivery. Phase 1 consists of Xeon-based compute nodes, while phase 2 adds Xeon Phi-based nodes. Code Porting to Cray XC-40: Lessons Learned James McClean and Raj Gautam (Petroleum Geo-Services) Abstract We present a case study of porting seismic applications from a Beowulf cluster using an Ethernet network to the Cray XC40 cluster. The applications in question are Tilted Transverse Anisotropic Reverse Time Migration (TTI RTM), Kirchhoff Depth Migration (KDMIG) and Wave Equation Migration (WEM). The primary obstacle in this port was that TTI RTM and WEM use local scratch disk heavily and imaging is performed one shot per node. The Cray nodes do not have local scratch disks. The primary obstacle in KDMIG was its heavy IO usage from permanent disk due to the constant reading of Travel Time Maps. We briefly explain how these algorithms were refactored so as not to depend primarily on scratch disk and to fully utilize the better networking in the Cray XC40. In the case of KDMIG, we explain how its IO load was reduced via a memory pool concept. Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea, Wayne Joubert, M. Graham Lopez, and Oscar Hernandez (Oak Ridge National Laboratory) Abstract At least two major architectural trends are leading the way to Exascale: accelerator-based (e.g., Summit and Sierra) and self-hosted compute nodes (e.g., Aurora). Today, the ability to produce performance portable code is crucial to take full advantage of these different architectures. Directive-based programming APIs (e.g., OpenMP, OpenACC) have helped in this regard, and recently OpenMP added an accelerator programming model in addition to its shared memory programming model support. However, as of today, little is understood about how efficiently the accelerator programming model can be mapped onto different architectures, including self-hosted and traditional shared memory systems, and whether it can be used to generate performance portable code across architectures. In this paper, we parallelize a representative computational kernel using the two different OpenMP 4 styles (shared memory and accelerator models), and compare their performance on multiple architectures including OLCF's Titan supercomputer.
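As a rough illustration of the two OpenMP 4 styles compared by Vergara Larrea et al. above, the sketch below expresses the same simple vector-update kernel first with the host shared-memory worksharing directives and then with the accelerator (target) directives; the kernel, array names and sizes are placeholders chosen for illustration, not code from the paper.

```c
/* Minimal sketch (not from the paper): one kernel in the two OpenMP 4 styles.
 * Requires an OpenMP 4 capable compiler, e.g. cc -fopenmp sketch.c */
#include <stdio.h>
#define N 1000000

/* Style 1: traditional shared-memory worksharing on the host. */
void daxpy_shared(double *x, double *y, double a) {
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

/* Style 2: accelerator model; data is mapped to a target device
 * (on a self-hosted node this simply runs on the host). */
void daxpy_target(double *x, double *y, double a) {
    #pragma omp target teams distribute parallel for simd \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy_shared(x, y, 3.0);   /* y[i] becomes 5.0 */
    daxpy_target(x, y, 3.0);   /* y[i] becomes 8.0 */
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```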
A Reasoning And Hypothesis Generation Framework Based On Scalable Graph Analytics: Enabling Discoveries In Medicine Using Cray Urika-XA And Urika-GD Sreenivas R. Sukumar, Larry W. Roberts, Jeffrey Graves, and Jim Rogers (Oak Ridge National Laboratory) Abstract Finding actionable insights from data has always been difficult. As the scale and forms of data increase tremendously, the task of finding value becomes even more challenging. Data scientists at Oak Ridge National Laboratory are leveraging unique leadership infrastructure (e.g., the Urika-XA and Urika-GD appliances) to develop scalable algorithms for semantic, logical and statistical reasoning with unstructured Big Data. We present the deployment of such a framework, called ORiGAMI (Oak Ridge Graph Analytics for Medical Innovations), on the National Library of Medicine's Semantic MEDLINE (an archive of medical knowledge since 1994). Medline contains over 70 million knowledge nuggets published in 23.5 million papers in the medical literature, with thousands more added daily. ORiGAMI is available as an open-science medical hypothesis generation tool - both as a web service and as an application programming interface (API) - at http://hypothesis.ornl.gov . Paper · Applications & Programming Environments Technical Session 14A Chair: Chris Fuson (ORNL) Estimating the Performance Impact of the MCDRAM on KNL Using Dual-Socket Ivy Bridge Nodes on Cray XC30 Zhengji Zhao (NERSC/LBNL) and Martijn Marsman (University of Vienna) Abstract NERSC is preparing for its next petascale system, named Cori, a Cray XC system based on the Intel KNL MIC architecture. Each Cori node will have 72 cores (288 threads), 512-bit vector units, and a low-capacity (16GB), high-bandwidth (~5x DDR4) on-package memory (MCDRAM or HBM). To help applications get ready for Cori, NERSC has developed optimization strategies that focus on the MPI+OpenMP programming model, vectorization, and the HBM. While the optimization of MPI+OpenMP and vectorization can be carried out on today's multi-core architectures, optimization for the HBM is difficult to perform where the HBM is unavailable. In this paper, we present our HBM performance analysis of the VASP code, a widely used materials science code, using Intel's development tools Memkind and AutoHBW, and a dual-socket Ivy Bridge processor node on Edison, a Cray XC30, as a proxy for the HBM on KNL. Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon (Cray Inc.) Abstract The Cray performance tools provide a complete solution, from instrumentation and measurement to analysis and visualization of data. The focus of the tools is on whole-program analysis, providing insight into performance bottlenecks within programs that use many computing resources across many nodes. With two complementary interfaces - one for first-time users that provides a program profile at the end of program execution, and one for advanced users that provides in-depth performance investigation and tuning assistance - the tools enable users to quickly identify areas in their programs that most heavily impact performance or energy consumption. Recent development activity targets the new Intel KNL many-core processors, more assistance with adding OpenMP to MPI programs, improved tool usability, and enhanced application power and energy monitoring feedback. New CrayPat, Reveal and Cray Apprentice2 functionality is presented that will offer additional insight into application performance on next generation Cray systems.
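For readers unfamiliar with the Memkind interface named in the MCDRAM/HBM study by Zhao and Marsman earlier in this session, the sketch below shows the general hbw_malloc usage pattern for placing a bandwidth-critical array in high-bandwidth memory, with a DDR fallback; the buffer name and sizes are illustrative only and are not taken from VASP.

```c
/* Illustrative sketch of the Memkind high-bandwidth-memory API
 * (hbw_check_available, hbw_malloc, hbw_free); link with -lmemkind.
 * On a KNL node the allocation is served from MCDRAM when present;
 * AutoHBW aims for a similar effect without source changes. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void) {
    size_t n = 1 << 20;
    int in_hbm = (hbw_check_available() == 0);   /* 0 means HBM is available */
    double *wave;                                /* bandwidth-critical buffer */

    /* Place the buffer in MCDRAM when possible, otherwise fall back to DDR
     * (e.g., when prototyping on an Ivy Bridge node as the paper does). */
    wave = in_hbm ? hbw_malloc(n * sizeof *wave) : malloc(n * sizeof *wave);
    if (!wave) return 1;

    for (size_t i = 0; i < n; i++)
        wave[i] = (double)i;
    printf("buffer in %s memory, wave[42] = %g\n",
           in_hbm ? "high-bandwidth" : "DDR", wave[42]);

    if (in_hbm) hbw_free(wave); else free(wave);
    return 0;
}
```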
Paper · Applications & Programming Environments Technical Session 15A Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Lonestar 5: Customizing the Cray XC40 Software Environment Cyrus Proctor, David Gignac, Robert McLay, Si Liu, Doug James, Tommy Minyard, and Dan Stanzione (Texas Advanced Computing Center) Abstract Lonestar 5, a 30,000-core, 1.2 petaflop Cray XC40, entered production at the Texas Advanced Computing Center (TACC) on January 12, 2016. Customized to meet the needs of TACC's diverse computational research community, Lonestar 5 provides each user a choice between two alternative, independent configurations. Each is robust, mature, and proven: Lonestar 5 hosts both the environment delivered by Cray and a second, customized environment that mirrors Stampede, Lonestar 4, and other TACC clusters. The Cray Programming Environment: Current Status and Future Directions Luiz DeRose (Cray Inc.) Abstract In this talk I will present the recent activities, roadmap, and future directions of the Cray Programming Environment, which is being developed and deployed on Cray clusters and Cray supercomputers for scalable performance with high programmability. The presentation will discuss the programming environment's new functionality to help port and hybridize applications for systems with Intel KNL processors. This new functionality includes compiler directives to access high bandwidth memory, new features in the scoping tool Reveal to assist in parallelization of applications, and the Cray Comparative Debugger, which was designed and developed to help identify porting issues. In addition, I will present the recent activities in the Cray Scientific Libraries and the Cray Message Passing Toolkit, and will discuss the Cray Programming Environment strategy for accelerated computing with GPUs, as well as the Cray Compiling Environment standards compliance plans for C++14, OpenMP 4.5, and OpenACC. Making Scientific Software Installation Reproducible On Cray Systems Using EasyBuild Petar Forai (Research Institute of Molecular Pathology (IMP)), Kenneth Hoste (Central IT Department of Ghent University), Guilherme Peretti Pezzi (Swiss National Supercomputing Centre), and Brett Bode (National Center for Supercomputing Applications/University of Illinois) Abstract Cray provides a tuned and supported OS and programming environment (PE), including compilers and libraries integrated with the modules system. While the Cray PE is updated frequently, tools and libraries not in it quickly become outdated. In addition, the number of tools, libraries and scientific applications that HPC user support teams are expected to support is increasing significantly. The uniformity of the software environment across Cray sites makes it an attractive target for sharing this ubiquitous burden and for collaborating on a common solution.
Paper · Applications & Programming Environments Technical Session 18A Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Opportunities for container environments on Cray XC30 with GPU devices Lucas Benedicic, Miguel Gila, and Sadaf Alam (Swiss National Supercomputing Centre) Abstract Thanks to the significant popularity gained lately by Docker, the HPC community has recently started exploring container technology and the potential benefits its use would bring to the users of supercomputing systems like the Cray XC series. In this paper, we explore the feasibility of diverse, nontraditional data- and computing-oriented use cases with practically no overhead, thus achieving native execution performance. Working in close collaboration with NERSC and an engineering team at Nvidia, CSCS is extending the Shifter framework in order to enable GPU access for containers at scale. We also briefly discuss the implications of using containers within a shared HPC system from the security point of view, to provide a service that does not compromise the stability of the system or the privacy of the user. Furthermore, we describe several valuable lessons learned through our analysis and share the challenges we encountered. Shifter: Containers for HPC Richard S. Canon and Douglas M. Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and David Henseler (Cray Inc.) Abstract Container-based computing is rapidly changing the way software is developed, tested, and deployed. We will present a detailed overview of the design and implementation of Shifter, which in partnership with Cray has extended the early prototype concepts and is now in production at NERSC. Shifter enables end users to execute containers using images constructed with various methods, including the popular Docker-based ecosystem. We will discuss some of the improvements and implementation details. In addition, we will discuss lessons learned, performance results, and real-world use cases of Shifter in action, and the potential role of containers in scientific and technical computing, including how they complement the scientific process. We will conclude with a discussion about the future directions of Shifter. Dynamic RDMA Credentials James Shimek and James Swaro (Cray Inc.) Abstract Dynamic RDMA Credentials (DRC) is a new system service to allow shared network access between different user applications. DRC allows user applications to request managed network credentials, which can be shared with other users, groups or jobs. Access to a credential is governed by the application and DRC to provide authorized and protected sharing of network access between applications. DRC extends the existing protection domain functionality provided by ALPS without exposing application data to unauthorized applications. DRC can also be used with other batch systems such as SLURM, without any loss of functionality. Paper · Applications & Programming Environments Technical Session 18B Chair: Jason Hill (Oak Ridge National Laboratory) Characterizing the Performance of Analytics Workloads on the Cray XC40 Michael F. Ringenburg, Shuxia Zhang, Kristyn Maschhoff, and Bill Sparks (Cray Inc.)
and Evan Racah and Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract This paper describes an investigation of the performance characteristics of high performance data analytics (HPDA) workloads on the Cray XC40, with a focus on commonly used open source analytics frameworks like Apache Spark. We look at two types of Spark workloads: the Spark benchmarks from the Intel HiBench 4.0 suite and a CX matrix decomposition algorithm. We study performance from both the bottom-up view (via system metrics) and the top-down view (via application log analysis), and show how these two views can help identify performance bottlenecks and system issues impacting data analytics workload performance. Based on this study, we provide recommendations for improving the performance of analytics workloads on the XC40. Interactive Data Analysis using Spark on Cray Urika Appliance (WITHDRAWN) Gaurav Kaul (Intel) and Xavier Tordoir and Andy Petrella (Data Fellas) Abstract In this talk, we discuss how data scientists can use the Intel Data Analytics Acceleration Library (DAAL) with Spark and Spark Notebook. Intel DAAL provides building blocks for analytics optimized for the x86 architecture, and we have integrated DAAL with Spark Notebook. This provides Spark users with an interactive and optimized interface for running Spark jobs. We take real-world machine learning and graph analytics workloads in bioinformatics and run them on Spark using the Cray Urika-XA appliance. The benefits of a co-designed hardware and software stack become apparent with these examples, with the added benefit of an interactive front end for users. Experiences Running Mixed Workloads On Cray Analytics Platform Haripriya Ayyalasomayajula and Kristyn Maschhoff (Cray Inc.) Abstract The ability to run both HPC and big data frameworks together on the same machine is a principal design goal for future Cray analytics platforms. Hadoop provides a reasonable solution for parallel processing of batch workloads using the YARN resource manager. Spark is a general-purpose cluster-computing framework, which also provides parallel processing of batch workloads as well as in-memory data analytics capabilities; iterative, incremental algorithms; ad hoc queries; and stream processing. Spark can be run using YARN, Mesos or its own standalone resource manager. The Cray Graph Engine (CGE) supports real-time analytics on the largest and most complex graph problems. CGE is a more traditional HPC application that runs under either Slurm or PBS. Traditionally, running workloads that require different resource managers requires static partitioning of the cluster. This can lead to underutilization of resources. Paper · Applications & Programming Environments Technical Session 19A Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Optimizing Cray MPI and SHMEM Software Stacks for Cray-XC Supercomputers based on Intel KNL Processors Krishna Kandalla, Peter Mendygral, Nick Radcliffe, Bob Cernohous, Kim McMahon, and Mark Pagel (Cray) Abstract HPC applications commonly use Message Passing Interface (MPI) and SHMEM programming models to achieve high performance in a portable manner. With the advent of the Intel MIC processor technology, hybrid programming models that involve the use of MPI/SHMEM along with threading models (such as OpenMP) are gaining traction.
However, most current generation MPI implementations are not poised to offer high performance communication in highly threaded environments. The latest MIC architecture, Intel Knights Landing (KNL), also offers High Bandwidth Memory - a new memory technology - along with complex NUMA topologies. This paper describes the current status of the Cray MPI and SHMEM implementations for optimizing application performance on Cray XC supercomputers that rely on KNL processors. A description of the evolution of WOMBAT (a high fidelity astrophysics code) to leverage thread-hot RMA in Cray MPICH is included. Finally, this paper also summarizes new optimizations in the Cray MPI and SHMEM implementations. What's new in Allinea's tools: from easy batch script integration and remote access to energy profiling. Patrick Wohlschlegel (Allinea Software) Abstract We address application energy use and performance, productivity, and the future in this talk. The Allinea Forge debugging and profiling tools, DDT and MAP, are deployed on most Cray systems - we take this opportunity to highlight important recent innovations and share the future. Configuring and Customizing the Cray Programming Environment on CLE 6.0 Systems Geir Johansen (Cray Inc.) Abstract The Cray CLE 6.0 system will provide a new installation model for the Cray Programming Environment. This paper will focus on the new processes for configuring and customizing the Cray Programming Environment to best meet the customer site's requirements. Topics will include configuring the login shell start-up scripts to load the appropriate modulefiles, creating specialized modulefiles to load specific versions of Cray Programming Environment components, and how to install third-party programming tools and libraries not released by Cray. Directions will be provided on porting programming environment software to CLE 6.0 systems, including instructions on how to create modulefiles. The specific example of porting the Python MPI library (mpi4py) to CLE 6.0 will be included. Paper · Applications & Programming Environments Technical Session 19C Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) The GNI Provider Layer for OFI libfabric Howard Pritchard and Evan Harvey (Los Alamos National Laboratory) and Sung-Eun Choi, James Swaro, and Zachary Tiffany (Cray Inc.) Abstract The Open Fabrics Interfaces (OFI) libfabric, a community-designed networking API, has gained increasing attention over the past two years as an API which promises both high performance and portability across a wide variety of network technologies. The code itself is being developed as open source software with contributions from government labs, industry and academia. In this paper, we present a libfabric provider implementation for the Cray XC system using the Generic Network Interface (GNI) library. The provider is especially targeted at highly multi-threaded applications requiring concurrent access to the Aries High Speed Network with minimal contention between threads. Big Data Analytics on Cray XC Series DataWarp using Hadoop, Spark and Flink Robert Schmidtke, Guido Laubender, and Thomas Steinke (Zuse Institute Berlin) Abstract We explore the Big Data analytics capabilities of the Cray XC architectures to harness their computing power for increasingly common programming paradigms for handling large volumes of data.
These include MapReduce and, more recently, in-memory data processing approaches such as Apache Spark and Apache Flink. We use our Cray XC Test and Development System (TDS) with 16 diskless compute nodes and eight DataWarp nodes. We use Hadoop, Spark and Flink implementations of select benchmarks from the Intel HiBench micro benchmark suite to find suitable runtime configurations of these frameworks for the TDS hardware. Motivated by preliminary results in throughput per node in the popular Hadoop TeraSort benchmark, we conduct a detailed scaling study and investigate resource utilization. Furthermore, we identify scenarios where using DataWarp nodes is advantageous compared to using Lustre. Performance Test of Parallel Linear Equation Solvers on Blue Waters - Cray XE6/XK7 system JaeHyuk Kwack, Gregory Bauer, and Seid Koric (National Center for Supercomputing Applications) Abstract Parallel linear equation solvers are one of the most important components determining the scalability and efficiency of many supercomputing applications. Several groups and companies are leading the development of linear system solver libraries for HPC applications. In this paper, we present an objective performance study of the solvers available on a Cray XE6/XK7 supercomputer, named Blue Waters, at the National Center for Supercomputing Applications (NCSA). A series of non-symmetric matrices are created through mesh refinements of a CFD problem. The PETSc, MUMPS, SuperLU, Cray LibSci, Intel PARDISO, IBM WSMP, ACML, GSL, NVIDIA cuSOLVER and AmgX solvers are employed for the performance test. CPU-compatible libraries are tested on XE6 nodes while GPU-compatible libraries are tested on XK7 nodes. We present scalability test results for each library on Blue Waters, and how far and how fast the employed libraries can solve the series of matrices. Paper · Applications & Programming Environments Technical Session 20A Chair: Chris Fuson (ORNL) Scaling hybrid coarray/MPI miniapps on Archer Luis Cebamanos (EPCC, The University of Edinburgh); Anton Shterenlikht (Mech Eng Dept, The University of Bristol); and David Arregui and Lee Margetts (School of Mech, Aero and Civil Engineering, The University of Manchester) Abstract We have developed miniapps from the MPI finite element library ParaFEM and the Fortran 2008 coarray cellular automata library CGPACK. The miniapps represent multi-scale fracture models of polycrystalline solids. The software from which these miniapps have been derived will improve predictive modelling in the automotive, aerospace, power generation, defense and manufacturing sectors. The libraries and miniapps are distributed under the BSD license, so they can be used by computer scientists and hardware vendors to test various tools including compilers and performance monitoring applications. CrayPAT tools have been used for sampling and tracing analysis of the miniapps. Two routines with all-to-all communication structures have been identified as primary candidates for optimisation. New routines have been written implementing the nearest-neighbour algorithm and using coarray collectives. The scaling limit of the miniapps has been increased by a factor of 3, from about 2k to over 7k cores. The miniapps uncovered several issues in CrayPAT and in the Cray implementation of Fortran coarrays. We are working with Cray engineers to resolve these. Hybrid coarray/MPI programming is uniquely enabled on Cray systems.
This work is of particular interest to Cray developers, because it details real experiences of using hybrid Fortran coarray/MPI programming for scientific computing in an area of cutting-edge research. Enhancing Scalability of the Gyrokinetic Code GS2 by using MPI Shared Memory for FFTs Lucian Anton (Cray UK), Ferdinand van Wyk and Edmund Highcock (University of Oxford), Colin Roach (CCFE Culham Science Centre), and Joseph Parker (STFC) Abstract GS2 (http://sourceforge.net/projects/gyrokinetics) is a 5-D initial value parallel code used to simulate low frequency electromagnetic turbulence in magnetically confined fusion plasmas. Feasible calculations routinely capture plasma turbulence at length scales close either to the electron or the ion Larmor radius. Self-consistently capturing the interaction between turbulence at ion scale and electron scale requires a huge increase in the scale of computation. Scalable Remote Memory Access Halo Exchange with Reduced Synchronization Cost Maciej Szpindler (ICM, University of Warsaw) Abstract Remote Memory Access (RMA) is a popular technique for data exchange in parallel processing. The Message Passing Interface (MPI), the ubiquitous environment for distributed memory programming, introduced an improved RMA model in the recent version of the standard. While RMA provides direct access to low-level high performance hardware, MPI one-sided communication enables various synchronization regimes, including scalable group synchronization. This combination provides methods to improve the performance of commonly used communication schemes in parallel computing. This work evaluates a one-sided halo exchange implementation on the Cray XC40 system. A large numerical weather prediction code is studied. To address already-identified overheads of RMA synchronization, the recently proposed Notified Access extension is considered. To reduce the cost of the most frequent message passing communication scheme, an alternative RMA implementation is proposed. Additionally, to identify more scalable approaches, the performance of general active target synchronization, the Notified Access modes of RMA, and the original message passing implementation are compared. Paper · Applications & Programming Environments Technical Session 20B Chair: Bilel Hadri (KAUST Supercomputing Lab) Directive-based Programming for Highly-scalable Nodes Douglas Miles and Michael Wolfe (PGI) Abstract High-end supercomputers have increased in performance from about 4 TFLOPS to 33 PFLOPS in the past 15 years, a factor of about 10,000. Increased node count accounts for a factor of 10, and clock rate increases for another factor of 5. Most of the increase, a factor of about 200, is due to increases in single-node performance. We expect this trend to continue, with single-node performance increasing faster than node count. Building scalable applications for such targets means exploiting as much intra-node parallelism as possible. We discuss coming supercomputer node designs and how to abstract the differences to enable design of portable scalable applications, and the implications for HPC programming languages and models such as OpenACC and OpenMP. Balancing Particle and Mesh Computation in a Particle-In-Cell Code Patrick H. Worley and Eduardo F.
D'Azevedo (Oak Ridge National Laboratory), Robert Hager and Seung-Hoe Ku (Princeton Plasma Physics Laboratory), Eisung Yoon (Rensselaer Polytechnic Institute), and Choong-Seock Chang (Princeton Plasma Physics Laboratory) Abstract The XGC1 plasma microturbulence particle-in-cell simulation code has both particle-based and mesh-based computational kernels that dominate performance. Both of these are subject to load imbalances that can degrade performance and that evolve during a simulation. Each can be addressed adequately on its own, but optimizing just for one can introduce significant load imbalances in the other, degrading overall performance. A technique has been developed, based on Golden Section Search, that minimizes wallclock time given prior information on wallclock time and on the current particle distribution and mesh cost per cell, and that also adapts to the evolution of load imbalance in both particle and mesh work. In problems of interest this doubled the performance of full-system runs on the XK7 at the Oak Ridge Leadership Computing Facility compared to load balancing only one of the kernels. Computational Efficiency Of The Aerosol Scheme In The Met Office Unified Model Mark Richardson (University of Leeds); Fiona O'Connor (Met Office Hadley Centre, UK); Graham W. Mann (University of Leeds); and Paul Selwood (Met Office, UK) Abstract A new data structuring has been implemented in the Met Office Unified Model (MetUM) which improves the performance of the aerosol subsystem. Smaller amounts of atmospheric data, arranged as segments of atmospheric columns, are passed to the aerosol sub-processes. The number of columns in a segment can be changed at runtime and thus can be tuned to the hardware and the science in operation. This revision alone has halved the time spent in some of the aerosol sections for the case under investigation. The new arrangement allows simpler implementation of OpenMP around the whole of the aerosol subsystem and is shown to give close to ideal speed-up. Whether a dynamic schedule or a simpler static schedule for the OpenMP parallel loop performs better is shown to depend on the number of threads. The percentage of the run spent in the UKCA sections has been reduced from 30% to 24%, with a corresponding reduction in runtime of 11% for a single-threaded run. When the reference version uses 4 threads, the percentage of time spent in UKCA is higher, at 40%, but with the OpenMP and segmenting modifications this is reduced to 20%, with a corresponding reduction in run time of 17%. For 4 threads the parallel speed-up of the reference code was 1.78 and after the modifications it is 1.91. Both these values indicate that there is still a significant amount of the run that is serial (within an MPI task), which is continually being addressed by the software development teams involved in MetUM. Paper · Applications & Programming Environments Technical Session 21A Chair: Richard Barrett (Sandia National Labs) Stitching Threads into the Unified Model Matthew Glover, Paul Selwood, Andy Malcolm, and Michele Guidolin (Met Office, UK) Abstract The Met Office Unified Model (UM) uses a hybrid parallelization strategy: MPI and OpenMP. As the UM is legacy code, OpenMP has been retrofitted in piecemeal fashion over recent years.
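As a rough sketch of the column-segmenting and OpenMP scheduling trade-off described in the MetUM aerosol paper above, and of the kind of loop-level OpenMP retrofitting the UM papers discuss, the fragment below parallelizes a loop over column segments and lets the schedule be chosen at run time; the segment size, loop body and function names are illustrative and do not come from the UM code.

```c
/* Hypothetical sketch: process atmospheric columns in segments, with OpenMP
 * over the segments. schedule(runtime) lets OMP_SCHEDULE select "static" or
 * "dynamic,1" without recompiling, mirroring the static-vs-dynamic comparison
 * discussed above. */
#include <stdio.h>
#include <omp.h>

#define NCOLS  8192   /* total atmospheric columns (illustrative) */
#define SEGLEN 64     /* columns per segment; tunable at run time in the UM */

static double process_segment(int first, int len) {
    double work = 0.0;
    for (int c = first; c < first + len; c++)
        work += (double)c * 1.0e-6;   /* placeholder for aerosol chemistry */
    return work;
}

int main(void) {
    int nseg = (NCOLS + SEGLEN - 1) / SEGLEN;
    double total = 0.0;

    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (int s = 0; s < nseg; s++) {
        int first = s * SEGLEN;
        int len = (first + SEGLEN <= NCOLS) ? SEGLEN : NCOLS - first;
        total += process_segment(first, len);
    }

    printf("threads=%d total=%f\n", omp_get_max_threads(), total);
    return 0;
}
```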
On Enhancing 3D-FFT Performance in VASP Florian Wende (Zuse Institute Berlin), Martijn Marsman (Universität Wien), and Thomas Steinke (Zuse Institute Berlin) Abstract We optimize the computation of the 3D-FFT in VASP in order to prepare the code for efficient execution on multi- and many-core CPUs like Intel's Xeon Phi. Along with the transition from MPI to MPI+OpenMP, library calls need to adapt to threaded versions. One of the most time-consuming components in VASP is the 3D-FFT. Besides assessing the performance of multi-threaded calls to FFTW and Intel MKL, we investigate strategies to improve the performance of FFT in a general sense. We incorporate our insights and strategies for FFT computation into a library which encapsulates FFTW and Intel MKL specifics and implements the following features: reuse of FFT plans, composed FFTs, and the use of high bandwidth memory on Intel's KNL Xeon Phi. We present results on a Cray XC40 and a Cray XC30 Xeon Phi system using synthetic benchmarks and with the library integrated into VASP. Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers Abhinav Sarje (Lawrence Berkeley National Laboratory), Douglas Jacobsen (LANL), Samuel Williams (LBNL), Todd Ringler (LANL), and Leonid Oliker (LBNL) Abstract The incorporation of increasing core counts in the modern processors used to build state-of-the-art supercomputers is driving application development towards the implementation of thread parallelism, in addition to distributed memory parallelism, to deliver efficient high-performance codes. In this work we describe the implementation of threading and our experiences with it in a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers, and show the benefits of threading on run time performance and energy requirements with increasing concurrency. Cori - A System to Support Data-Intensive Computing Katie Antypas, Deborah Bard, Wahid Bhimji, Tina M. Declerck, Yun (Helen) He, Douglas Jacobsen, Shreyas Cholia, Prabhat, and Nicholas J. Wright (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Richard Shane Canon (Lawrence Berkeley National Laboratory) Abstract The first phase of Cori, NERSC's next generation supercomputer, a Cray XC40, has been configured to specifically support data-intensive computing. With increasing dataset sizes coming from experimental and observational facilities, including telescopes, sensors, detectors, microscopes, sequencers, and supercomputers, scientific users from the Department of Energy, Office of Science are increasingly relying on NERSC for extreme-scale data analytics. This paper discusses the Cori Phase 1 architecture and its installation into the new and energy-efficient CRT facility, and explains how the system will be combined with the larger Cori Phase 2 system based on the Intel Knights Landing processor. In addition, the paper describes the unique features and configuration of the Cori system that allow it to support data-intensive science. Paper · Applications & Programming Environments Technical Session 21C Chair: David Hancock (Indiana University) Maintaining Large Software Stacks in a Cray Ecosystem with Gentoo Portage Colin A.
MacLean (National Center for Supercomputing Applications/University of Illinois) Abstract Building and maintaining a large collection of software packages from source is difficult without powerful package management tools. This task is made more difficult in an environment where many libraries do not reside in standard paths and where loadable modules can drastically alter the build environment, such as on a Cray system. The need to maintain multiple Python interpreters with a large collection of Python modules is one such case of having a large and complicated software stack, and is described in this paper. To address limitations of current tools, Gentoo Prefix was ported to the Cray XE/XK system Blue Waters, giving the ability to use the Portage package manager. This infrastructure allows for fine-grained dependency tracking, consistent build environments, multiple Python implementations, and customizable builds. It is used to build and maintain over 400 packages for Python support on Blue Waters for use by its partners. Early Application Experiences on Trinity - the Next Generation of Supercomputing for the NNSA Courtenay Vaughan, Dennis Dinge, Paul T. Lin, Kendall H. Pierson, Simon D. Hammond, J. Cook, Christian R. Trott, Anthony M. Agelastos, Douglas M. Pase, Robert E. Benner, Mahesh Rajan, and Robert J. Hoekstra (Sandia National Laboratories) Abstract Trinity, a Cray XC40 supercomputer, will be the flagship capability computing platform for the United States nuclear weapons stockpile stewardship program when the machine enters full production during 2016. In the first phase of the machine, almost 10,000 dual-socket Haswell processor nodes will be deployed, followed by a second phase utilizing Intel's next-generation Knights Landing processor. Executing dynamic heterogeneous workloads on Blue Waters with RADICAL-Pilot Mark Santcroos (Rutgers University); Ralph Castain (Intel Corporation); Andre Merzky (Rutgers University); Iain Bethune (EPCC, The University of Edinburgh); and Shantenu Jha (Rutgers University) Abstract Traditionally, HPC systems such as Crays have been designed to support mostly monolithic workloads. However, the workload of many important scientific applications is constructed out of spatially and temporally heterogeneous tasks that are often dynamically inter-related. These workloads can benefit from being executed at scale on HPC resources, but a tension exists between the workloads' resource utilization requirements and the capabilities of the HPC system software and usage policies. Pilot systems have the potential to relieve this tension. RADICAL-Pilot is a scalable and portable pilot system that enables the execution of such diverse workloads. In this paper we describe the design and characterize the performance of RADICAL-Pilot's scheduling and executing components on Crays, which are engineered for efficient resource utilization while maintaining the full generality of the Pilot abstraction. We discuss four different implementations of support for RADICAL-Pilot on Cray systems and analyze and report on their performance. Evaluating Shifter for HPC Applications Donald M. Bahls (Cray Inc.) Abstract Shifter is a powerful tool that has the potential to expand the availability of HPC applications on Cray XC systems by allowing Docker-based containers to be run with little porting effort.
In this paper, we explore the use of Shifter as a means of running HPC applications built for commodity Linux cluster environments on a Cray XC under the Shifter environment. We compare developer productivity, application performance, and application scaling of stock applications compiled for commodity Linux clusters with both Cray XC-tuned Docker images and natively compiled applications not using the Shifter environment. We also discuss pitfalls and issues associated with running non-SLES-based Docker images in the Cray XC environment. Paper · Filesystems & I/O Technical Session 7B Chair: Jason Hill (Oak Ridge National Laboratory) Collective I/O Optimizations for Adaptive Mesh Refinement Data Writes on Lustre File System Dharshi Devendran, Suren Byna, Bin Dong, Brian Van Straalen, Hans Johansen, and Noel Keen (Lawrence Berkeley National Laboratory) and Nagiza F. Samatova (North Carolina State University) Abstract Adaptive mesh refinement (AMR) applications refine small regions of a physical space. As a result, when AMR data has to be stored in a file, writing the data involves storing a large number of small blocks. Chombo is an AMR software library for solving partial differential equations over block-structured grids, and is used in large-scale climate and fluid dynamics simulations. Chombo's current implementation for writing data on an AMR hierarchy uses several independent write operations, causing low I/O performance. In this paper, we investigate collective I/O optimizations for Chombo's write function. We introduce Aggregated Collective Buffering (ACB) to reduce the number of small writes. We demonstrate that our approach outperforms the current implementation by 2X to 9.1X and MPI-IO collective buffering by 1.5X to 3.4X on the Edison and Cori platforms at NERSC using the Chombo-IO benchmark. We also test ACB on the BISICLES Antarctica benchmark on Edison, and show that it outperforms the current implementation by 13.1X to 20X, and MPI-IO collective buffering by 6.4X to 12.8X. Using the Darshan I/O characterization tool, we show that ACB makes larger contiguous writes than collective buffering at the POSIX level, and this difference gives ACB a significant performance benefit over collective buffering. Finally, A Way to Measure Frontend I/O Performance. Christopher Zimmer, Veronica Vergara Larrea, and Saurabh Gupta (Oak Ridge National Laboratory) Abstract Identifying sources of variability in the Spider II file system on Titan is challenging because it spans multiple networks, with layers of hardware performing various functions to fulfill the needs of the parallel file system. Several efforts have targeted file system monitoring but have focused only on metric logging associated with the storage side of the file system. In this work, we enhance that view by designing and deploying a low-impact network congestion monitor designed especially for the IO routers that are deployed on service nodes within the Titan Cray XK7 Gemini network. To the best of our knowledge, this is the first tool that provides a capability of live monitoring for performance bottlenecks at the IO router. Our studies show high correlation between IO router congestion and IO bandwidth. Ultimately, we plan on using this tool for IO hotspot identification within Titan and guided scheduling for large IO.
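For context on the collective-write pattern that the Aggregated Collective Buffering work by Devendran et al. above builds on, the sketch below shows a generic MPI-IO collective write in which each rank writes one small block at its own offset and the MPI-IO layer is free to aggregate the requests; this is plain MPI-IO for illustration, not the ACB implementation from the paper, and the file name and block size are arbitrary.

```c
/* Minimal sketch of an MPI-IO collective write of many small per-rank blocks.
 * MPI_File_write_at_all lets the library's collective buffering aggregate
 * the small blocks into larger, contiguous file-system requests. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { BLOCK = 1024 };             /* doubles per rank (illustrative) */
    double data[BLOCK];
    for (int i = 0; i < BLOCK; i++)
        data[i] = rank + 0.001 * i;    /* stand-in for one AMR block */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "amr_blocks.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```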
A Classification of Parallel I/O Toward Demystifying HPC I/O Best Practices Robert Sisneros (National Center for Supercomputing Applications) and Kalyana Chadalavada (Intel Corporation) Abstract The process of optimizing parallel I/O can quite easily become daunting. By the nature of its implementation there are many highly sensitive, tunable parameters, and a subtle change to any of these may have drastic or even completely counterintuitive results. There are many factors affecting performance: complex hardware configurations, significant yet unpredictable system loads, and system-level implementations that perform tasks in unexpected ways. A final compounding issue is that an optimization is very likely specific to only a single application. The state of the art, then, is usually a fuzzy mixture of expertise and trial-and-error testing. In this work we introduce a characterization of application I/O based on aggregation, which we define as a combination of job-level and filesystem-level aggregation. We will show how this characterization may be used to analyze parallel I/O performance, to not only validate I/O best practices but also communicate benefits in a user-centric way. Paper · Filesystems & I/O Technical Session 14B Chair: Ashley Barker (Oak Ridge National Laboratory) Architecture and Design of Cray DataWarp Dave Henseler, Benjamin R. Landsteiner, and Doug Petesch (Cray Inc.); Cornell Wright (Los Alamos National Laboratory); and Nicholas J. Wright (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract This paper describes the architecture, design, use and performance of Cray DataWarp, an infrastructure that uses direct-attached solid state disk (SSD) storage to provide more cost-effective bandwidth than an external parallel file system (PFS), allowing DataWarp to be provisioned for bandwidth and the PFS to be provisioned for capacity and resiliency. Placing this storage between the application and the PFS allows application I/O to be decoupled from (and in some cases eliminate) PFS I/O. This reduces the time required for the application to do I/O, while also increasing the overlap of computation with PFS I/O and typically reducing application elapsed time. DataWarp allocates and configures SSD-backed storage for jobs and users on demand, providing many of the benefits of both software defined storage and storage virtualization. Exascale HPC Storage – A possibility or a pipe dream? Torben K. Petersen (Seagate) Abstract The advances in flash and NV-RAM technologies promise exascale-level throughput; however, building and implementing full solutions continues to be expensive. With HDDs increasing in capacity and speed, are these new drives good enough to fill these essential roles? Paper · Filesystems & I/O Technical Session 15B Chair: Jason Hill (Oak Ridge National Laboratory) The Evolution of Lustre Networking at Cray Chris A. Horn (Cray Inc.) Abstract Lustre Network (LNet) routers with more than one InfiniBand Host Channel Adapter (HCA) have been in use at Cray for some time. This type of LNet router configuration is necessary on Cray supercomputers in order to extract maximum performance out of a single LNet router node. This paper provides a look at the state of the art in this dual-HCA router configuration. Topics include avoiding ARP flux with proper subnet configuration, flat vs. fine-grained routing, and configuration emplacement.
We’ll also provide a look at how LNet will provide compatibility with InfiniBand HCAs requiring the latest mlx5 drivers, and what is needed to support a mix of mlx4 and mlx5 on the same fabric. Extreme Scale Storage & IO Eric Barton (Intel Corporation) Abstract New technologies such as 3D XPoint and integrated high performance fabrics are set to revolutionize the storage landscape as we reach towards Exascale computing. Unfortunately, the latencies inherent in today's storage software mask the benefits of these technologies and of horizontal scaling. This talk will describe the work currently underway in the DOE-funded Extreme Scale Storage & IO project to prototype a storage stack capable of exploiting these new technologies to the full, designed to overcome the extreme scaling and resilience challenges presented by Exascale computing. Managing your Digital Data Explosion Matt Starr and Janice Kinnin (Spectra Logic) Abstract Our society is currently undergoing an explosion in digital data. It is predicted that our digital universe will double every two years to reach more than 44 zettabytes (ZB) by 2020. The volume of data created each day has increased immensely and will continue to grow exponentially over time. This trend in data growth makes it clear that the data storage problems we struggle with today will soon seem very minor. Paper · Filesystems & I/O Technical Session 19B Chair: Richard Barrett (Sandia National Labs) FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems Feiyi Wang, Veronica Vergara Larrea, Dustin Leverman, and Sarp Oral (Oak Ridge National Laboratory) Abstract The design of HPC file and storage systems has largely been driven by requirements on capability, reliability, and capacity. However, the convergence of large-scale simulations with big data analytics has put the data, its usability, and its management back in the front-and-center position. LIOProf: Exposing Lustre File System Behavior for I/O Middleware Cong Xu (Intel Corporation), Suren Byna (Lawrence Berkeley National Laboratory), Vishwanath Venkatesan (Intel Corporation), Robert Sisneros (National Center for Supercomputing Applications), Omkar Kulkarni (Intel Corporation), Mohamad Chaarawi (The HDF Group), and Kalyana Chadalavada (Intel Corporation) Abstract As the parallel I/O subsystem in large-scale supercomputers becomes more complex due to multiple levels of software libraries, hardware layers, and various I/O patterns, detecting performance bottlenecks is a critical requirement. While there exist a few tools to characterize application I/O, robust analysis of file system behavior and the ability to associate file-system feedback with application I/O patterns are largely missing. Toward filling this void, we introduce the Lustre IO Profiler, called LIOProf, for monitoring I/O behavior and for characterizing I/O activity statistics in the Lustre file system. In this paper, we use LIOProf both to uncover pitfalls of MPI-IO's collective read operation over the Lustre file system and to identify HDF5 overhead. Based on LIOProf characterization, we have implemented a Lustre-specific MPI-IO collective read algorithm, enabled HDF5 collective metadata operations and applied HDF5 dataset optimizations. Our evaluation results on two Cray systems (Cori at NERSC and Blue Waters at NCSA) demonstrate the efficiency of our optimization efforts.
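As background for the Lustre collective-read discussion in the LIOProf abstract above, the sketch below shows a generic MPI-IO collective read together with the standard ROMIO/Cray MPI-IO hints that control collective buffering and Lustre striping; the hint values and file name are illustrative, and the paper's Lustre-specific collective read algorithm is not reproduced here.

```c
/* Generic MPI-IO collective read with commonly used Lustre-related hints.
 * The hint names (romio_cb_read, cb_nodes, striping_factor) are standard
 * ROMIO hints; the values below are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");  /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "8");            /* number of aggregator ranks */
    MPI_Info_set(info, "striping_factor", "16");    /* Lustre stripe count (honored on create) */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY, info, &fh);

    const int count = 4096;                         /* doubles per rank */
    double *buf = malloc(count * sizeof(double));
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_read_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    if (rank == 0) printf("read %d doubles on each of %d ranks\n", count, nranks);

    free(buf);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```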
Psync - Parallel Synchronization Of Multi-Pebibyte File Systems Andy Loftus (NCSA) Abstract When challenged to find a way to migrate an entire file system onto new hardware while maximizing availability and ensuring exact data and metadata duplication, NCSA found that existing file copy tools couldn't fit the bill. So they set out to create a new tool: one that would scale to the limits of the file system and provide a robust interface to adjust to the dynamic needs of the cluster. The resulting tool, Psync, effectively manages many syncs running in parallel. It is dynamically scalable (nodes can be added and removed on the fly) and robust (the sync can be started, stopped, and restarted). Psync has been run successfully on hundreds of nodes, each with multiple processes (yielding possibly thousands of parallel processes). This talk will present the overall design of Psync and its use as a general-purpose tool for copying lots of data as quickly as possible. Paper · Filesystems & I/O Technical Session 21B Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration) H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, and Richard Shane Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Alex Gittens (University of California, Berkeley); Lisa Gerhardt (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Suren Byna (Lawrence Berkeley National Laboratory); Michael F. Ringenburg (Cray Inc.); and Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Spark has been tremendously powerful in performing Big Data analytics in distributed data centers. However, using the Spark framework on HPC systems to analyze large-scale scientific data has several challenges. For instance, the parallel file system is shared among all compute nodes, in contrast to shared-nothing architectures. Another challenge is in accessing data stored in scientific data formats, such as HDF5 and NetCDF, that are not natively supported in Spark. Our study focuses on improving the I/O performance of Spark on HPC systems for reading and writing scientific data arrays, e.g., HDF5/netCDF. We select several scientific use cases to drive the design of an efficient parallel I/O API for Spark on HPC systems, called H5Spark. We optimize the I/O performance, taking into account Lustre file system striping. We evaluate the performance of H5Spark on Cori, a Cray XC40 system located at NERSC. The time is now. Unleash your CPU cores with Intel® SSDs Andrey O. Kudryavtsev (Intel Corporation) and Ken Furnanz (Intel) Abstract Andrey Kudryavtsev, HPC Solution Architect for the Intel® Non-Volatile Memory Solutions Group (NSG), will discuss advancements in Intel SSD technology that are unleashing the power of the CPU. He will dive into the benefits of Intel® NVMe SSDs that can greatly improve HPC-specific performance with parallel file systems. He will also share the HPC solutions and performance benefits that Intel has already seen with its customers today, and how adoption of current SSD technology sets the foundation for consumption of Intel's next generation of memory technology, 3D XPoint Intel® SSDs with Intel Optane™ Technology, in the High Performance Compute segment.
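For readers unfamiliar with the parallel HDF5 machinery that an H5Spark-style reader (described earlier in this session) sits on top of, the sketch below shows the standard MPI-IO-backed collective read of a 1-D HDF5 dataset partitioned across ranks; the file name "data.h5", the dataset "/x" and its layout are assumptions for illustration and are not taken from H5Spark.

```c
/* Sketch of a parallel-HDF5 collective read: each MPI rank reads its own
 * hyperslab of a 1-D dataset. Compile with the parallel HDF5 wrapper (h5pcc). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Open the file for parallel access through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);

    hid_t dset = H5Dopen(file, "/x", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);
    hsize_t total;
    H5Sget_simple_extent_dims(fspace, &total, NULL);

    /* Evenly partition the dataset across ranks (remainder to the last rank). */
    hsize_t count = total / nranks;
    hsize_t start = (hsize_t)rank * count;
    if (rank == nranks - 1) count = total - start;

    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mspace = H5Screate_simple(1, &count, NULL);

    /* Collective transfer so the MPI-IO layer can aggregate requests. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double *buf = malloc(count * sizeof(double));
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    if (rank == 0) printf("rank 0 read %llu values, first = %g\n",
                          (unsigned long long)count, buf[0]);

    free(buf);
    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```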
Introducing a new IO tier for HPC Storage James Coomer (DDN Storage) Abstract Tier 1 "performance" storage is becoming increasingly flanked by new solid-state-based tiers and active archive tiers that improve the economics of both performance and capacity. The available implementations of solid-state tiers in parallel filesystems are typically based on a separate namespace and/or utilise existing filesystem technologies. Given the price/performance characteristics of SSDs today, huge value is gained by addressing both optimal data placement to the SSD tier and comprehensively building this tier to accelerate the broadest spectrum of IO, rather than just small random-read IO. Paper · Systems Support Technical Session 8C Chair: Jean-Guillaume Piccinali (Swiss National Supercomputing Centre) Crossing the Rhine - Moving to CLE 6.0 System Management Tina Butler and Tina Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract With the release of Cray Linux Environment 6.0, Cray has introduced a new paradigm for CLE system configuration and management. This major shift requires significant changes in formatting and practices on the System Management Workstation (SMW). Although Cray has committed to delivering migration tools for legacy systems, they will not be available until CLE 6.0 UP02, scheduled for release in July 2016. In the third quarter of 2016, NERSC will be taking delivery of the second phase of its Cori system, with Intel KNL processors. KNL requires CLE 6.0. In order to support phase 2, Cori will have to be upgraded to CLE 6.0 - the hard way. This paper will chronicle that effort. The NERSC Data Collect Environment Cary Whitney, Elizabeth Bautista, and Tom Davis (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract As computational facilities prepare for Exascale computing, there is a wider range of data that can be collected and analyzed, but existing infrastructures have not scaled to the magnitude of the data. Further, as systems grow, their environmental footprint has a wider impact, and data analysis should address power consumption, its correlation with the jobs processed, and power efficiency, as well as how jobs can be scheduled to leverage this data. At NERSC, we have created a new data collection methodology for the Cray system that goes beyond the system and extends into the computational center. This robust and scalable system can help us manage the center and the Cray, and ultimately help us learn how to scale our system and workload to the Exascale realm. Making the jump to Light Speed with Cray's DataWarp - An Administrator's Perspective Tina M. Declerck and David Paul (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Cori, the first phase of NERSC's next generation supercomputer, has 144 DataWarp nodes available to its users. Cray's DataWarp technology provides an intermediate storage capability, sitting between on-node memory and the parallel file system. It utilizes Cray's DVS to provide access from compute nodes on a request basis. In order for this to work, the workload manager interacts with Cray's DataWarp APIs to create the requested file system and make it available on the nodes requested by the job (or jobs).
Some of the tools an administrator needs are therefore included in the workload manager (SLURM at our site), while other information requires the use of tools and commands provided by Cray. It is important to know what information is available, where to find it, and how to get it. Paper · Systems Support Technical Session 14C Chair: Jim Rogers (Oak Ridge National Laboratory) ACES and Cray Collaborate on Advanced Power Management for Trinity James H. Laros III, Kevin Pedretti, Stephen Olivier, Ryan Grant, and Michael Levenhagen (Sandia National Laboratories); David Debonis (Hewlett Packard Enterprise); Scott Pakin (Los Alamos National Laboratory); and Paul Falde, Steve Martin, and Matthew Kappel (Cray Inc.) Abstract Abstract The motivation for power and energy measurement and control capabilities for High Performance Computing (HPC) systems is now well accepted by the community. While technology providers have begun to deliver some features in this area, the interfaces that expose these features are vendor specific. The need for a standard interface, now and in the future, is clear. Cray XC Power Monitoring and Control for Knights Landing (KNL) Steven J. Martin, David Rush, Matthew Kappel, Michael Sandstedt, and Joshua Williams (Cray Inc.) Abstract Abstract This paper details the Cray XC40 power monitoring and control capabilities for Intel Knights Landing (KNL) based systems. The Cray XC40 hardware blade design for Intel KNL processors is the first in the XC family to incorporate enhancements directly related to power monitoring feedback driven by customers and the HPC community. This paper focuses on power monitoring and control directly related to Cray blades with Intel KNL processors and the interfaces available to users, system administrators, and workload managers to access power management features. Paper · Systems Support Technical Session 15C Chair: Helen He (National Energy Research Scientific Computing Center) SLURM. Our way. A tale of two XCs transitioning to SLURM. Douglas M. Jacobsen, James F. Botts, and Yun (Helen) He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract NERSC recently transitioned its batch system and workload manager from an ALPS-based solution to SLURM running “natively” on our Cray XC systems. The driving motivation for making this change is to gain access to features NERSC has long implemented in alternate forms, such as a capacity for running large numbers of serial tasks, and to gain tight user-interface integration with new features of our systems, such as Burst Buffers, Shifter, and VTune, while still retaining access to a flexible batch system that delivers high utilization of our systems. While we have had successes in all these areas, perhaps the largest unexpected impact has been the change in how our staff interacts with the system. Using SLURM as the native WLM has blurred the line between system management and operation. This has been greatly beneficial in the impact our staff have on system configuration and deployment of new features: a platform for innovation. Early experiences configuring a Cray CS Storm for Mission Critical Workloads Mark D. Klein and Marco Induni (Swiss National Supercomputing Centre) Abstract Abstract MeteoSwiss is transitioning from a traditional Cray XE6 system to a very dense GPU configuration of the Cray CS Storm.
This paper will discuss some of the system design choices and configuration decisions that have gone into the new setup in order to operate the mission-critical workloads of weather forecasting. It will also describe modifications made to improve CPU/GPU/HCA affinity in the job scheduler and the monitoring systems set up to examine performance fluctuations, and it will discuss the design of the failover system. The paper will also share some challenges found with the CS Storm management software, as well as the current support situation for this product line. Analysis of Gemini Interconnect Recovery Mechanisms: Methods and Observations Saurabh Jha and Valerio Formicola (University of Illinois), Catello Di Martino (Nokia), Zbigniew Kalbarczyk (University of Illinois), William T. Kramer (National Center for Supercomputing Applications/University of Illinois), and Ravishankar K. Iyer (University of Illinois) Abstract Abstract This paper presents methodology and tools to understand and characterize the recovery mechanisms of the Gemini interconnect system from raw system logs. The tools can assess the impact of these recovery mechanisms on the system and user workloads. The methodology is based on a topology-aware, state-machine-based clustering algorithm that coalesces Gemini-related events (i.e., error, failure, and recovery events) into groups. The presented methodology has been used to analyze more than two years of logs from Blue Waters, the 13.1-petaflop Cray hybrid supercomputer at the University of Illinois - National Center for Supercomputing Applications (NCSA). Paper · Systems Support Technical Session 18C Chair: Jim Rogers (Oak Ridge National Laboratory) Network Performance Counter Monitoring and Analysis on the Cray XC Platform Jim Brandt (Sandia National Laboratories), Edwin Froese (Cray Inc.), Ann Gentile (Sandia National Laboratories), Larry Kaplan (Cray Inc.), and Benjamin Allan and Edward Walsh (Sandia National Laboratories) Abstract Abstract The instrumentation of Cray's Aries network ASIC, which makes up the XC platform's High Speed Network (HSN), offers unprecedented potential for better understanding and utilization of platform HSN resources. The amount of monitoring data generated on a large-scale system presents challenges with respect to synchronization, data management, and analysis. There are over a thousand raw counter metrics per Aries router and interface, and functional combinations of these raw metrics are required for insight into network state. Design and implementation of a scalable monitoring system for Trinity Adam DeConinck, Amanda Bonnie, Kathleen Kelly, Samuel Sanchez, Cynthia Martin, and Michael Mason (Los Alamos National Laboratory); Jim Brandt, Ann Gentile, Benjamin Allan, and Anthony Agelastos (Sandia National Laboratories); and Michael Davis and Michael Berry (Cray Inc.) Abstract Abstract The Trinity XC-40 system at Los Alamos National Laboratory presents unprecedented challenges to our system management capabilities, including increased scale, new and unfamiliar subsystems, and radical changes in the system software stack. These challenges have motivated the development of a next-generation monitoring system with new capabilities for collection and analysis of system and facilities data.
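To make the event-coalescing step in the Gemini recovery analysis above concrete, here is a deliberately simplified Python sketch that groups timestamped interconnect log records into recovery episodes whenever they fall within a fixed time window of one another. The log format, locations, messages, and threshold are invented, and the grouping is far simpler than the topology-aware, state-machine-based clustering the paper describes.

    from datetime import datetime, timedelta

    # Hypothetical, already-parsed interconnect log records: (timestamp, location, message)
    records = [
        ("2015-03-01 10:00:01", "c0-0c0s3g0", "LANE ERROR"),
        ("2015-03-01 10:00:07", "c0-0c0s3g0", "LINK FAILED"),
        ("2015-03-01 10:00:40", "c0-0c0s3g0", "ROUTE RECOMPUTE COMPLETE"),
        ("2015-03-01 14:22:05", "c1-0c1s7g1", "LANE ERROR"),
    ]

    GAP = timedelta(minutes=5)  # events closer together than this join the same episode

    def coalesce(recs):
        """Group time-ordered records into episodes separated by gaps larger than GAP."""
        episodes, current, last = [], [], None
        for ts_str, loc, msg in sorted(recs):
            ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
            if last is not None and ts - last > GAP:
                episodes.append(current)
                current = []
            current.append((ts, loc, msg))
            last = ts
        if current:
            episodes.append(current)
        return episodes

    for ep in coalesce(records):
        print(ep[0][0], "->", ep[-1][0], ":", len(ep), "events")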
Dynamic Model Specific Register (MSR) Data Collection as a System Service Greg Bauer (National Center for Supercomputing Applications/University of Illinois), Jim Brandt and Ann Gentile (Sandia National Laboratories), and Andriy Kot and Michael Showerman (National Center for Supercomputing Applications) Abstract Abstract The typical use case for Model Specific Register (MSR) data is to provide application profiling tools with hardware performance counter data (e.g., cache misses, flops, instructions executed). This enables the user/developer to gain an understanding of the relative performance/efficiency of the code overall as well as of smaller code sections. Due to the overhead of collecting data at sufficient fidelity for the required resolution, these tools are typically only run while tuning a code. Paper · Systems Support Technical Session 20C Chair: Andrew Winfer (KAUST) How to Automate and not Manage under Rhine/Redwood Paul L. Peltz Jr., Adam J. DeConinck, and Daryl W. Grunau (Los Alamos National Laboratory) Abstract Abstract Los Alamos National Laboratory and Sandia National Laboratories, under the Alliance for Computing at Extreme Scale (ACES), have partnered with Cray to deliver Trinity, the Department of Energy’s next supercomputer on the path to exascale. Trinity, which is an XC40, is an ambitious system for a number of reasons, one of which is the deployment of Cray’s new Rhine/Redwood (CLE 6.0/SMW 8.0) system management stack. With this release came a much-needed update to the system management stack to provide scalability and a new philosophy on system management. However, this update required LANL to update its own system management philosophy, and presented a number of challenges in integrating the system into the larger computing infrastructure at Los Alamos. This paper will discuss the work the LANL team is doing to integrate Trinity, automate system management with the new Rhine/Redwood stack, and combine LANL’s and Cray’s new system management philosophies. The Intel® Omni-Path Architecture: Game-Changing Performance, Scalability, and Economics Andrew Russell (Intel Corporation) Abstract Abstract The Intel® Omni-Path Architecture, Intel’s next-generation fabric product line, is off to an extremely fast start since its launch in late 2015. With high-profile customer deployments being announced at a feverish pace, the performance, resiliency, scalability, and economics of Intel’s innovative fabric product line are winning over customers across the HPC industry. Learn from an Intel Fabric Solution Architect how to maximize both the performance and economic benefits when deploying an Intel® OPA-based cluster, and how it delivers significant benefits to HPC applications over standard InfiniBand-based designs. The Hidden Cost of Large Jobs - Drain Time Analysis at Scale Joseph 'Joshi' Fullop (National Center for Supercomputing Applications) Abstract Abstract At supercomputing centers where many users submit jobs of various sizes, scheduling efficiency is the key to maximizing system utilization. With the capability of running jobs on massive numbers of nodes being the hallmark of large clusters, draining sufficient nodes in order to launch those jobs can severely impact the throughput of these systems. While these principles apply to clusters of any size, the idle node-hours due to drain on the scale of today's systems warrant attention. In this paper we provide methods of accounting for system-wide drain time as well as for attributing drain time to a specific job.
Having data like this allows for real evaluation of scheduling policies and their effect on node occupancy. This type of measurement is also necessary to allow for backfill recovery analytics and enables other types of assessments. Paper · User Services Technical Session 7C Chair: Ashley Barker (Oak Ridge National Laboratory) Unified Workload Management for the Cray XC30/40 System with Univa Grid Engine Daniel Gruber and Fritz Ferstl (Univa Corporation) Abstract Abstract Workload management (WLM) software provides batch queues and scheduling intelligence for efficient management of Cray systems. The widely used Univa Grid Engine (UGE) WLM software has been available and in commercial production use for several years now on Cray XC30/40 systems. UGE allows a Cray system to be integrated seamlessly with an organization's other computational resources to form one unified WLM domain. This paper describes the general structure and features of UGE, how those components map onto a Cray system, how UGE is integrated with the Cray infrastructure, and what the user will see for job structure and features when using Univa Grid Engine. Special features of UGE are also highlighted. Driving More Efficient Workload Management on Cray Systems with PBS Professional Graham Russell and Jemellah Alhabashneh (Altair Engineering, Inc.) Abstract Abstract The year 2015 saw continued growth in the adoption of key HPC technologies, from data analytics solutions to power-efficient scheduling. The HPC user landscape is changing, and it is now critical for workload management vendors to provide not only foundational scheduling functionality but also the adjacent capabilities that truly optimize system performance. In this presentation, Altair will provide a look at key advances in PBS Professional for improved performance on Cray systems. Topics include new Cray-specific features like Suspend/Resume, Xeon Phi support, Power-aware Scheduling, and DataWarp. Broadening Moab for Expanding User Needs Gary Brown (Adaptive Computing) Abstract Abstract Adaptive Computing's Moab HPC scheduler, Nitro HTC scheduler, and the open-source TORQUE RM are broadening their reach to handle increasingly diverse users and their needs. This presentation will discuss the following examples. Paper · User Services Technical Session 8B Chair: Jenett Tillotson (Indiana University) Technical Publications New Portal Peggy A. Sanchez (Cray Inc.) Abstract Abstract Over the last year and a half, the Cray Publications department has gone through revolutionary changes. The new Cray Portal (not to be confused with CrayPort, the customer service portal) is a tool based on the results of these changes. There are internal benefits for Cray, but more important are the benefits to users due to the new standard. The portal is innovative for technical publications and a sought-after example of new technology. While the portal is relatively simple to use, it is a rare, albeit excellent, model of what users can do within the same standards. Attendees will come away understanding how the documentation portal brings opportunities to customize, rate, and respond to content, as well as seeing the future of content delivery in a responsive design at Cray. Improving User Notification on Frequently Changing HPC Environments Chris Fuson, William Renaud, and James Wynne III (ORNL) Abstract Abstract Today’s HPC centers’ user environments can be very complex.
Centers often contain multiple large, complicated computational systems, each with its own user environment. Changes to a system’s environment can be very impactful; however, a center’s user environment is, in one way or another, frequently changing. Because of this, it is vital for centers to notify users of changes. For users, untracked changes can be costly, resulting in unnecessary debug time as well as wasted compute allocations and research time. Communicating frequent change to diverse user communities is a common and ongoing task for HPC centers. This paper will cover the OLCF’s current processes and methods used to communicate change to users of the center’s large Cray systems and supporting resources. The paper will share lessons learned and goals as well as practices, tools, and methods used to continually improve and reach members of the OLCF user community. Slurm Overview and Road Map Jacob Jenson (SchedMD LLC) Abstract Abstract Slurm is an open-source workload manager used on five of the world's top 10 most powerful computers and provides a rich set of features including topology-aware optimized resource allocation, the ability to expand and shrink jobs on demand, failure management support for applications, hierarchical bank accounts with fair-share job prioritization, job profiling, and a multitude of plugins for easy customization. Interactive Visualization of Scheduler Usage Data Daniel Gall (Engility/NOAA) Abstract Abstract This paper describes the use of contemporary web client-based interactive visualization software to explore HPC job scheduler usage data for a large system. In particular, we offer a visualization web application to NOAA users that enables them to compare their current experience with their earlier experience and with that of other users. The application draws from a 30-day history of job data aggregated hourly by user, machine, QoS, and other factors. This transparency enables users to draw more informed conclusions about the behavior of the system. The technologies used are dc.js, d3.js, crossfilter, and Bootstrap. This application differs from most visualizations of job data in that we eschewed the absolute time domain to focus on developing relevant charts in other domains, like relative time, job size, priority boost, and queue wait time. This enables us to more easily see repetitive trends in the data, like diurnal and weekly cycles of job submissions. Tutorial Tutorial 1A Cray Management System for XC Systems with SMW 8.0/CLE 6.0 Harold Longley (Cray Inc.) Abstract Abstract New versions of SMW 8.0 and CLE 6.0 have been developed that include a new Cray Management System (CMS) for Cray XC systems. This new CMS includes system management tools and processes that separate software and configuration for the Cray XC systems, while at the same time preserving the system reliability and scalability upon which you depend. The new CMS includes a new common installation process for SMW and CLE, and more tightly integrates external login nodes (eLogin) as part of the Cray XC system. It includes the Image Management and Provisioning System (IMPS), the Configuration Management Framework (CMF), and the Node Image Mapping Service (NIMS). Finally, it integrates with SUSE Linux Enterprise Server 12. Tutorial Tutorial 1B Knights Landing and Your Application: Getting Everything from the Hardware Andrew C.
Mallinson (Intel Corporation) Abstract Abstract Knights Landing, the 2nd-generation Intel® Xeon Phi™ processor, utilizes many breakthrough technologies to combine breakthroughs in power performance with standard, portable, and familiar programming models. This presentation provides an in-depth tour of Knights Landing's features and focuses on how to ensure that your applications can get the most performance out of the hardware. Tutorial Tutorial 1C CUG 2016 Cray XC Power Monitoring and Management Tutorial Steven Martin, David Rush, Matthew Kappel, and Joshua Williams (Cray Inc.) Abstract Abstract This half-day (3-hour) tutorial will focus on the setup, usage, and use cases for Cray XC power monitoring and management features. The tutorial will cover power and energy monitoring and control from three perspectives: site and system administrators working from the SMW command line, users who run jobs on the system, and third-party software development partners integrating with Cray’s RUR and CAPMC features. Tutorial Tutorial 1A Continued Cray Management System for XC Systems with SMW 8.0/CLE 6.0 Harold Longley (Cray Inc.) Tutorial Tutorial 1B Continued Knights Landing and Your Application: Getting Everything from the Hardware Andrew C. Mallinson (Intel Corporation) Tutorial Tutorial 1C Continued CUG 2016 Cray XC Power Monitoring and Management Tutorial Steven Martin, David Rush, Matthew Kappel, and Joshua Williams (Cray Inc.) Tutorial Tutorial 2A Cray Management System for XC Systems with SMW 8.0/CLE 6.0 Harold Longley (Cray Inc.)
Tutorial Tutorial 2B Getting the full potential of OpenMP on Many-core systems John Levesque (Cray Inc) and Jacob Poulsen (Danish Met) Abstract Abstract With the advent of the Knights series of Phi processors, code developers who employed only MPI in their applications will be challenged to achieve good performance. The traditional methods of employing loop-level OpenMP are not suitable for larger legacy codes due to the risk of significant inefficiencies. Runtime overhead, NUMA effects, and load imbalance are the principal issues facing the code developer. This tutorial will suggest a higher-level approach that has shown promise of circumventing these inefficiencies and achieving good performance on many-core systems. Tutorial Tutorial 2C eLogin Made Easy - An Introduction and Tutorial on the new Cray External Login Node. Jeff Keopp and Blaine Ebeling (Cray Inc.) Abstract Abstract The new eLogin product (external login node) is substantially different from its esLogin predecessor. Management is provided by the new OpenStack-based Cray System Management Software. Images are prescriptively built with the same technology used to build CLE images for Cray compute and service nodes, and the Cray Programming Environment is kept separate from the operational image. Tutorial Tutorial 2A Continued Cray Management System for XC Systems with SMW 8.0/CLE 6.0 Harold Longley (Cray Inc.) Tutorial Tutorial 2B Continued Getting the full potential of OpenMP on Many-core systems John Levesque (Cray Inc) and Jacob Poulsen (Danish Met) Tutorial Tutorial 2C Continued eLogin Made Easy - An Introduction and Tutorial on the new Cray External Login Node. Jeff Keopp and Blaine Ebeling (Cray Inc.)
Birds of a Feather Interactive 3A Chair: Matteo Chesi (Swiss National Supercomputing Centre) Birds of a Feather Interactive 3B Chair: Richard S. Canon (LBNL) Containers for HPC Richard S. Canon and Douglas M. Jacobson (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Sadaf Alam (Swiss National Supercomputing Centre) Abstract Abstract Container-based computing is an emerging model for developing and deploying applications and is now making inroads into the HPC community with the release of products like Shifter. The promise is large, but many challenges remain. What security concerns still need to be addressed? How do we train users to take advantage of this capability? What are best practices around creating and distributing images? How can we build up an ecosystem across the broader HPC community to promote reusability and increase productivity? This BOF will provide an opportunity to discuss these questions with experts from NERSC, CSCS, and Cray. Birds of a Feather Interactive 3C Chair: David Hancock (Indiana University) Open Discussion with CUG Board David Hancock (Indiana University) Abstract Abstract This session is designed as an open discussion with the CUG Board, but there are a few high-level topics that will also be on the agenda. The discussion will focus on changes to the corporation to achieve non-profit status (including bylaw changes), feedback on increasing CUG participation, and feedback on SIG structure and communication. An open-floor question-and-answer period will follow these topics. Formal voting (on candidates and the bylaws) will open after this session, so any candidates or members with questions about the process are welcome to bring up those topics. Birds of a Feather Interactive 4A Chair: Derek Burke (Seagate) Birds of a Feather Interactive 4B Chair: CJ Corbett (Cray Inc.) Cray and HPC in the Cloud Steve Scott (Cray Inc.) and CJ Corbett, Kunju Kothari, and Ryan Waite (Cray Inc.) Abstract Abstract Supercomputing customers have a unique set of requirements, and some fit better than others with standard cloud offerings. Many Cray customers have asked for cloud enablement and co-existence of their applications in the Cloud. Others provide de facto private and specialized cloud services to their own communities and customers. These sessions will drill down on use cases (current and anticipated), best practices, and the necessary infrastructure (SW/HW) developments and enablement for you to successfully deploy Cray for HPC in the cloud. Participants will be asked to participate in two consecutive BoF sessions. Session 1 will include a facilitator-led exercise to determine and group participants’ requirements for embracing cloud services in their operations. This session ends with an exercise. Session 2 collects the results of the exercise for focused discussion and prioritization of the findings. Results will be shared with the participants of this BoF. For continuity of discussion, participants are asked to commit to both sessions.
Birds of a Feather Interactive 4C Plenary General Session 5 Chair: David Hancock (Indiana University) The Strength of a Common Goal Florence Rabier (ECMWF - European Centre for Medium-Range Weather Forecasts) Abstract Abstract ECMWF is an intergovernmental organisation supported by 34 European States. It provides forecasts of global weather to 15 days ahead as well as monthly and seasonal forecasts. The National Meteorological Services of Member and Co-operating States use ECMWF's products for their own national duties, in particular to give early warning of potentially damaging severe weather. Plenary General Session 6 Chair: Andrew Winfer (KAUST)
Paper · Applications & Programming Environments Technical Session 8A Chair: Bilel Hadri (KAUST Supercomputing Lab) A Reasoning And Hypothesis Generation Framework Based On Scalable Graph Analytics: Enabling Discoveries In Medicine Using Cray Urika-XA And Urika-GD Sreenivas R. Sukumar, Larry W. Roberts, Jeffrey Graves, and Jim Rogers (Oak Ridge National Laboratory) Abstract Abstract Finding actionable insights from data has always been difficult.
As the scale and forms of data increase tremendously, the task of finding value becomes even more challenging. Data scientists at Oak Ridge National Laboratory are leveraging unique leadership infrastructure (e.g. Urika-XA and Urika-GD appliances) to develop scalable algorithms for semantic, logical and statistical reasoning with unstructured Big Data. We present the deployment of such a framework called ORiGAMI (Oak Ridge Graph Analytics for Medical Innovations) on the National Library of Medicine’s SEMANTIC Medline (archive of medical knowledge since 1994). Medline contains over 70 million knowledge nuggets published in 23.5 million papers in medical literature with thousands more added daily. ORiGAMI is available as an open-science medical hypothesis generation tool - both as a web-service and an application programming interface (API) at http://hypothesis.ornl.gov.
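The general literature-based-discovery pattern behind hypothesis-generation tools of this kind can be sketched in a few lines: if concept A co-occurs with B in the literature, and B co-occurs with C, but A and C are never mentioned together, then the A-C link is a candidate hypothesis. The toy Python below uses a tiny, invented co-occurrence graph (loosely echoing Swanson's classic fish-oil/Raynaud's example) and a simple two-hop scan; it is only an illustration of the pattern, not ORiGAMI's algorithms or its Urika-based implementation.

    # Toy concept co-occurrence graph; the terms and links are chosen for illustration only.
    cooccur = {
        "fish oil": {"blood viscosity", "platelet aggregation"},
        "blood viscosity": {"fish oil", "raynaud syndrome"},
        "platelet aggregation": {"fish oil", "raynaud syndrome"},
        "raynaud syndrome": {"blood viscosity", "platelet aggregation"},
    }

    def candidate_hypotheses(graph, source):
        """Score concepts two hops from `source` that are not directly linked to it,
        by the number of intermediate concepts connecting them."""
        direct = graph.get(source, set())
        scores = {}
        for b in direct:
            for c in graph.get(b, set()):
                if c != source and c not in direct:
                    scores[c] = scores.get(c, 0) + 1
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(candidate_hypotheses(cooccur, "fish oil"))
    # -> [('raynaud syndrome', 2)]: an indirect link surfaced through two intermediates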
Birds of a Feather Interactive 9A Chair: Jason Hill (Oak Ridge National Laboratory) Birds of a Feather Interactive 9B Chair: Chris Fuson (ORNL) Best Practices for Managing HPC User Documentation and Communication Chris Fuson and Ashley Barker (ORNL), Richard Gerber (NERSC), Frank Indiviglio (GFDL), and Helen He (NERSC) Abstract Abstract HPC centers provide large, complex, state-of-the-art computational and data resources to large user communities that span diverse science domains and contain members with varied experience levels. Effectively using these resources can pose a challenge to users, especially considering that each center often has site-specific configurations and procedures. Birds of a Feather Interactive 9C Chair: Peter Messmer (NVIDIA) GPU accelerated Cray XC systems: Where HPC meets Big Data Peter Messmer (NVIDIA), Chris Lindahl (Cray Inc.), and Sadaf Alam (Swiss National Supercomputing Centre) Abstract Abstract We discuss the accelerated compute, data analysis, and visualization capabilities of NVIDIA Tesla GPUs within the scalable and adaptable Cray XC series supercomputers. Historically, HPC and Big Data had their own distinct challenges, algorithms, and computing hardware. Today, the amount of data produced by HPC applications turns their analysis into a Big Data challenge. On the other hand, the computational complexity of modern data analysis requires compute and messaging performance only found in HPC systems. The ideal supercomputer therefore unites compute, analysis, and visualization capabilities. Presenters from NVIDIA, Cray, and HPC sites will showcase features of the XC series that go beyond accelerated computing and demonstrate how heterogeneous systems are the ideal platform for converging high-end computing and data analysis. We will cover new features of NVIDIA drivers and libraries that allow users to leverage the GPU's graphics capabilities, Cray's support for container technologies like Shifter, and an integrated yet decoupled programming and execution environment as a high-performance yet flexible platform for a wide range of traditional HPC as well as emerging analytics and data science applications and workflows. Birds of a Feather Interactive 10A Chair: Michael Showerman (National Center for Supercomputing Applications) Addressing the challenges of "systems monitoring" data flows Mike Showerman (NCSA at the University of Illinois) and Jim Brandt and Ann Gentile (Sandia National Laboratories) Abstract Abstract As Cray systems have evolved in both scale and complexity, the volume of quality systems data has grown to levels that are challenging to process and store. This BOF is an opportunity to discuss some of the use cases for high-resolution power, interconnect, compute, and storage subsystem data. We hope to be able to gain insights into the requirements sites have for future systems deployments, and how these data need to be integrated and processed.
There will be presentations of known problems that cannot be addressed with the current infrastructure design, as well as directions Cray could go to meet the needs of sites. Birds of a Feather Interactive 10B Chair: Peggy A. Sanchez (Cray Inc) Technical Documentation and Users Peggy A. Sanchez (Cray Inc) Abstract Abstract An open discussion about the direction of documentation for users as part of the continuing effort to improve the user experience and better understand the needs outside of Cray. This is an opportunity to contribute to the direction of content within technical publications as well as identify immediate needs. Birds of a Feather Interactive 10C Plenary General Session 11 Chair: David Hancock (Indiana University) Weather and Climate Services and the Need for High Performance Computing (HPC) Resources Petteri Taalas (WMO - World Meteorological Organization) Abstract Abstract We are entering a new era in technological innovation and in the use and integration of different sources of information for the well-being of society and its ability to cope with multi-hazards through weather and climate services. New predictive tools that will detail weather conditions down to neighbourhood and street level, provide early warnings a month ahead, and deliver forecasts of everything from rainfall to energy consumption will be some of the main outcomes of research activities in weather science over the next decade. Plenary General Session 12 Chair: Andrew Winfer (KAUST) Accelerating Science with the NERSC Burst Buffer Early User Program Wahid Bhimji, Deborah Bard, Melissa Romanus AbdelBaky, David Paul, Andrey Ovsyannikov, and Brian Friesen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Matt Bryson (Lawrence Berkeley National Laboratory); Joaquin Correa and Glenn K. Lockwood (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Vakho Tsulaia, Suren Byna, and Steve Farrell (Lawrence Berkeley National Laboratory); Doga Gursoy (Argonne National Laboratory); Chris Daley (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Vince Beckner, Brian Van Straalen, David Trebotich, Craig Tull, and Gunther H. Weber (Lawrence Berkeley National Laboratory); and Nicholas J. Wright, Katie Antypas, and Mr Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract NVRAM-based Burst Buffers are an important part of the emerging HPC storage landscape. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory recently installed one of the first Burst Buffer systems as part of its new Cori supercomputer, collaborating with Cray on the development of the DataWarp software. NERSC has a diverse user base comprising over 6,500 users in 750 different projects spanning a wide variety of scientific applications, including climate modeling, combustion, fusion, astrophysics, computational biology, and many more. The potential applications of the Burst Buffer at NERSC are therefore also considerable and diverse. We describe here the Burst Buffer Early User Program at NERSC, which selected a number of research projects to gain early access to the Burst Buffer and exercise its different capabilities to enable new scientific advancements. We present details of the program, in-depth performance results, and lessons learned from highlighted projects. Is Cloud A Passing Fancy?
Rajeeb Hazra (Intel) Abstract Abstract Enterprise businesses are rapidly moving to the cloud, transforming from bricks and mortar into cloud-based services. They increasingly rely on high-volume data collection enabling complex analysis and modelling as fundamental business capabilities. Plenary General Session 13 Chair: David Hancock (Indiana University) Paper · Applications & Programming Environments Technical Session 14A Chair: Chris Fuson (ORNL) Estimating the Performance Impact of the MCDRAM on KNL Using Dual-Socket Ivy Bridge nodes on Cray XC30 Zhengji Zhao (NERSC/LBNL) and Martijn Marsman (University of Vienna) Abstract Abstract NERSC is preparing for its next petascale system, named Cori, a Cray XC system based on the Intel KNL MIC architecture. Each Cori node will have 72 cores (288 threads), 512-bit vector units, and a low-capacity (16 GB), high-bandwidth (~5x DDR4) on-package memory (MCDRAM or HBM). To help applications get ready for Cori, NERSC has developed optimization strategies that focus on the MPI+OpenMP programming model, vectorization, and the HBM. While MPI+OpenMP and vectorization optimizations can be carried out on today’s multi-core architectures, optimizing for the HBM is difficult where the HBM is unavailable. In this paper, we will present our HBM performance analysis of the VASP code, a widely used materials science code, using Intel's development tools, Memkind and AutoHBW, and a dual-socket Ivy Bridge processor node on Edison, a Cray XC30, as a proxy for the HBM on KNL. Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon (Cray Inc.) Abstract Abstract The Cray performance tools provide a complete solution spanning instrumentation, measurement, analysis, and visualization of data. The focus of the tools is on whole-program analysis, providing insight into performance bottlenecks within programs that use many computing resources across many nodes. With two complementary interfaces, one for first-time users that provides a program profile at the end of program execution, and one for advanced users that provides in-depth performance investigation and tuning assistance, the tools enable users to quickly identify areas in their programs that most heavily impact performance or energy consumption. Recent development activity targets the new Intel KNL many-core processors, more assistance with adding OpenMP to MPI programs, improved tool usability, and enhanced application power and energy monitoring feedback. New CrayPat, Reveal, and Cray Apprentice2 functionality is presented that will offer additional insight into application performance on next-generation Cray systems.
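The MCDRAM paper in Session 14A above estimates how much an application stands to gain from high-bandwidth memory by running it against memory with different bandwidth characteristics. As a very rough way to get a feel for how bandwidth-bound a kernel is (this is not the paper's Memkind/AutoHBW methodology, just a toy measurement with made-up array sizes), one can time a STREAM-like triad and report the achieved memory bandwidth:

    import time
    import numpy as np

    N = 20_000_000                # three float64 arrays of ~160 MB each
    a = np.zeros(N)
    b = np.random.rand(N)
    c = np.random.rand(N)
    scalar = 3.0

    t0 = time.perf_counter()
    a[:] = b + scalar * c         # STREAM-like triad: nominally 2 reads + 1 write per element
    t1 = time.perf_counter()

    # NumPy temporaries add extra traffic, so this is only a rough lower bound.
    bytes_moved = 3 * N * 8
    print(f"effective bandwidth: {bytes_moved / (t1 - t0) / 1e9:.1f} GB/s")

If the reported figure sits close to the memory bandwidth of the node, the kernel is a plausible candidate to benefit from MCDRAM.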
Paper · Applications & Programming Environments Technical Session 15A Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Lonestar 5: Customizing the Cray XC40 Software Environment Cyrus Proctor, David Gignac, Robert McLay, Si Liu, Doug James, Tommy Minyard, and Dan Stanzione (Texas Advanced Computing Center) Abstract Abstract Lonestar 5, a 30,000-core, 1.2-petaflop Cray XC40, entered production at the Texas Advanced Computing Center (TACC) on January 12, 2016. Customized to meet the needs of TACC’s diverse computational research community, Lonestar 5 provides each user a choice between two alternative, independent configurations. Each is robust, mature, and proven: Lonestar 5 hosts both the environment delivered by Cray and a second, customized environment that mirrors Stampede, Lonestar 4, and other TACC clusters. The Cray Programming Environment: Current Status and Future Directions Luiz DeRose (Cray Inc.) Abstract Abstract In this talk I will present the recent activities, roadmap, and future directions of the Cray Programming Environment, which is being developed and deployed on Cray clusters and Cray supercomputers for scalable performance with high programmability. The presentation will discuss the Cray programming environment's new functionality to help port and hybridize applications to support systems with Intel KNL processors. This new functionality includes compiler directives to access high-bandwidth memory, new features in the scoping tool Reveal to assist in parallelization of applications, and the Cray Comparative Debugger, which was designed and developed to help identify porting issues. In addition, I will present recent activities in the Cray Scientific Libraries and the Cray Message Passing Toolkit, and will discuss the Cray Programming Environment strategy for accelerated computing with GPUs, as well as the Cray Compiling Environment standards compliance plans for C++14, OpenMP 4.5, and OpenACC. Making Scientific Software Installation Reproducible On Cray Systems Using EasyBuild Petar Forai (Research Institute of Molecular Pathology (IMP)), Kenneth Hoste (Central IT Department of Ghent University), Guilherme Peretti Pezzi (Swiss National Supercomputing Centre), and Brett Bode (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract Cray provides a tuned and supported OS and programming environment (PE), including compilers and libraries integrated with the modules system. While the Cray PE is updated frequently, tools and libraries not in it quickly become outdated. In addition, the number of tools, libraries, and scientific applications that HPC user support teams are expected to support is increasing significantly.
The uniformity of the software environment across Cray sites makes it an attractive target for sites to share this ubiquitous burden and to collaborate on a common solution. Paper · Filesystems & I/O Technical Session 15B Chair: Jason Hill (Oak Ridge National Laboratory) The Evolution of Lustre Networking at Cray Chris A. Horn (Cray Inc.) Abstract Abstract Lustre Network (LNet) routers with more than one InfiniBand Host Channel Adapter (HCA) have been in use at Cray for some time. This type of LNet router configuration is necessary on Cray supercomputers in order to extract maximum performance out of a single LNet router node. This paper provides a look at the state of the art in this dual-HCA router configuration. Topics include avoiding ARP flux with proper subnet configuration, flat vs. fine-grained routing, and configuration emplacement. We’ll also provide a look at how LNet will provide compatibility with InfiniBand HCAs requiring the latest mlx5 drivers, and what is needed to support a mix of mlx4 and mlx5 on the same fabric. Extreme Scale Storage & IO Eric Barton (Intel Corporation) Abstract Abstract New technologies such as 3D XPoint and integrated high-performance fabrics are set to revolutionize the storage landscape as we reach towards Exascale computing. Unfortunately, the latencies inherent in today’s storage software mask the benefits of these technologies and of horizontal scaling. This talk will describe the work currently underway in the DOE-funded Extreme Scale Storage & IO project to prototype a storage stack capable of exploiting these new technologies to the full and designed to overcome the extreme scaling and resilience challenges presented by Exascale computing. Managing your Digital Data Explosion Matt Starr and Janice Kinnin (Spectra Logic) Abstract Abstract Our society is currently undergoing an explosion in digital data. It is predicted that our digital universe will double every two years to reach more than 44 zettabytes (ZB) by 2020. The volume of data created each day has increased immensely and will continue to grow exponentially over time. This trend in data growth makes it clear that the data storage problems we struggle with today will soon seem very minor.
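For the EasyBuild paper in Session 15A above, it may help to see what the input looks like: an easyconfig is a small Python-syntax file consumed by the eb command. The sketch below is illustrative only; the package, version, and Cray toolchain version are example values rather than anything taken from the paper.

    # zlib-1.2.8-CrayGNU-2016.03.eb -- illustrative easyconfig (Python syntax)
    easyblock = 'ConfigureMake'       # generic configure / make / make install procedure

    name = 'zlib'
    version = '1.2.8'

    homepage = 'http://www.zlib.net/'
    description = "zlib is a free, general-purpose lossless data-compression library."

    # Build against the Cray GNU programming environment (PrgEnv-gnu)
    toolchain = {'name': 'CrayGNU', 'version': '2016.03'}

    source_urls = ['http://zlib.net/']
    sources = [SOURCE_TAR_GZ]         # EasyBuild template that expands to zlib-1.2.8.tar.gz

    moduleclass = 'lib'

Running something like eb zlib-1.2.8-CrayGNU-2016.03.eb --robot then builds the package against the Cray PE, resolves dependencies, and generates the corresponding module file.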
Early experiences configuring a Cray CS Storm for Mission Critical Workloads Mark D. Klein and Marco Induni (Swiss National Supercomputing Centre) Abstract Abstract MeteoSwiss is transitioning from a traditional Cray XE6 system to a very dense GPU configuration of the Cray CS Storm. This paper will discuss some of the system design choices and configuration decisions that have gone into the new setup in order to operate the mission-critical workloads of weather forecasting. This paper will share some of the modifications that have been made to enhance CPU/GPU/HCA affinity in the job scheduler, describe the monitoring systems that have been set up to examine performance fluctuations, and discuss the design of the failover system. This paper will also share some challenges found with the CS Storm management software, as well as the current support situation for this product line. Analysis of Gemini Interconnect Recovery Mechanisms: Methods and Observations Saurabh Jha and Valerio Formicola (University of Illinois), Catello Di Martino (Nokia), Zbigniew Kalbarczyk (University of Illinois), William T. Kramer (National Center for Supercomputing Applications/University of Illinois), and Ravishankar K. Iyer (University of Illinois) Abstract Abstract This paper presents a methodology and tools to understand and characterize the recovery mechanisms of the Gemini interconnect system from raw system logs. The tools can assess the impact of these recovery mechanisms on the system and user workloads. The methodology is based on a topology-aware, state-machine-based clustering algorithm that coalesces Gemini-related events (i.e., error, failure, and recovery events) into groups. The presented methodology has been used to analyze more than two years of logs from Blue Waters, the 13.1-petaflop Cray hybrid supercomputer at the University of Illinois - National Center for Supercomputing Applications (NCSA). Birds of a Feather Interactive 16A Chair: Timothy W. Robinson (Swiss National Supercomputing Centre) Birds of a Feather Interactive 16B Chair: Cory Spitz (Cray Inc.) Evolving parallel file systems in response to the changing storage and memory landscape Cory Spitz (Cray Inc.) Abstract Abstract Burst buffers and storage-memory hierarchies are disruptive technologies for parallel file systems (PFS), but there isn’t consensus among members of the HPC community on how PFSs should adapt to include their use, if at all. There also isn’t consensus on how HPC users should ultimately use and manage their data with these emerging technologies. In this BoF we will discuss how HPC and technical computing users want to interact with burst buffers or other storage-memory hierarchies and how their PFS should adapt. What do they expect? Will they want to continue to use POSIX-like semantics for access, as with MPI-I/O or HDF5 containers? What do users expect for legacy codes? Generally, what do users, application developers, and systems engineers require? Will they accept exotic solutions or must a de facto industry standard emerge? Birds of a Feather Interactive 16C Chair: Wendy L. Palm (Cray Inc.) Birds of a Feather Interactive 17A Chair: Bill Nitzberg (Altair Engineering, Inc.) PBS Professional: Welcome to the Open Source Community Bill Nitzberg (Altair Engineering, Inc.) Abstract Abstract Altair will be releasing PBS Pro under an open source license in mid-2016. Birds of a Feather Interactive 17B Chair: CJ Corbett (Cray Inc.)
Cray and HPC in the Cloud: Discussion and Conclusion Steve Scott, Kunju Kothari, CJ Corbett, and Ryan Waite (Cray Inc.) Abstract Abstract The BoF session follows up on and concludes a previous BoF session dedicated to Cray and HPC in the Cloud. (For continuity of discussion, participants in this BoF are asked to commit to the previous BoF.) This BoF collects the results of the previous exercise for focused discussion. Cray's roadmap and related development efforts for HPC in the Cloud will be presented and positioned against those findings for affinity, ability ranking, and prioritization. The session will conclude with a “next steps” action plan. Birds of a Feather Interactive 17C Paper · Applications & Programming Environments Technical Session 18A Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Opportunities for container environments on Cray XC30 with GPU devices Lucas Benedicic, Miguel Gila, and Sadaf Alam (Swiss National Supercomputing Centre) Abstract Abstract Thanks to the significant popularity Docker has gained lately, the HPC community has recently started exploring container technology and the potential benefits its use would bring to the users of supercomputing systems like the Cray XC series. In this paper, we explore the feasibility of diverse, nontraditional data- and computing-oriented use cases with practically no overhead, thus achieving native execution performance. In close collaboration with NERSC and an engineering team at Nvidia, CSCS is working on extending the Shifter framework in order to enable GPU access for containers at scale. We also briefly discuss the implications of using containers within a shared HPC system from the security point of view, to provide a service that does not compromise the stability of the system or the privacy of the user. Furthermore, we describe several valuable lessons learned through our analysis and share the challenges we encountered. Shifter: Containers for HPC Richard S. Canon and Douglas M. Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and David Henseleer (Cray Inc.) Abstract Abstract Container-based computing is rapidly changing the way software is developed, tested, and deployed. We will present a detailed overview of the design and implementation of Shifter, which, in partnership with Cray, has extended the early prototype concepts and is now in production at NERSC. Shifter enables end users to execute containers using images constructed from various methods, including the popular Docker-based ecosystem. We will discuss some of the improvements and implementation details. In addition, we will discuss lessons learned, performance results, and real-world use cases of Shifter in action, and the potential role of containers in scientific and technical computing, including how they complement the scientific process. We will conclude with a discussion about the future directions of Shifter. Dynamic RDMA Credentials James Shimek and James Swaro (Cray Inc.) Abstract Abstract Dynamic RDMA Credentials (DRC) is a new system service to allow shared network access between different user applications. DRC allows user applications to request managed network credentials, which can be shared with other users, groups or jobs. Access to a credential is governed by the application and DRC to provide authorized and protected sharing of network access between applications.
DRC extends the existing protection domain functionality provided by ALPS without exposing application data to unauthorized applications. DRC can also be used with other batch systems such as SLURM, without any loss of functionality. Paper · Applications & Programming Environments Technical Session 18B Chair: Jason Hill (Oak Ridge National Laboratory) Characterizing the Performance of Analytics Workloads on the Cray XC40 Michael F. Ringenburg, Shuxia Zhang, Kristyn Maschhoff, and Bill Sparks (Cray Inc.) and Evan Racah and Mr Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract This paper describes an investigation of the performance characteristics of high performance data analytics (HPDA) workloads on the Cray XC40, with a focus on commonly used open-source analytics frameworks like Apache Spark. We look at two types of Spark workloads: the Spark benchmarks from the Intel HiBench 4.0 suite and a CX matrix decomposition algorithm. We study performance from both the bottom-up view (via system metrics) and the top-down view (via application log analysis), and show how these two views can help identify performance bottlenecks and system issues impacting data analytics workload performance. Based on this study, we provide recommendations for improving the performance of analytics workloads on the XC40. Interactive Data Analysis using Spark on Cray Urika Appliance (WITHDRAWN) Gaurav Kaul (Intel) and Xavier Tordoir and Andy Petrella (Data Fellas) Abstract Abstract In this talk, we discuss how data scientists can use the Intel Data Analytics Acceleration Library (DAAL) with Spark and Spark Notebook. Intel DAAL provides building blocks for analytics optimized for the x86 architecture. We have integrated DAAL with Spark Notebook, which provides Spark users with an interactive, optimized interface for running Spark jobs. We take real-world machine learning and graph analytics workloads in bioinformatics and run them on Spark using the Cray Urika-XA appliance. The benefits of a co-designed hardware and software stack become apparent in these examples, with the added advantage of an interactive front end for users. Experiences Running Mixed Workloads On Cray Analytics Platform Haripriya Ayyalasomayajula and Kristyn Maschhoff (Cray Inc.) Abstract Abstract The ability to run both HPC and big data frameworks together on the same machine is a principal design goal for future Cray analytics platforms. Hadoop provides a reasonable solution for parallel processing of batch workloads using the YARN resource manager. Spark is a general-purpose cluster-computing framework, which also provides parallel processing of batch workloads as well as in-memory data analytics capabilities; iterative, incremental algorithms; ad hoc queries; and stream processing. Spark can be run using YARN, Mesos, or its own standalone resource manager. The Cray Graph Engine (CGE) supports real-time analytics on the largest and most complex graph problems. CGE is a more traditional HPC application that runs under either Slurm or PBS. Traditionally, running workloads that require different resource managers requires static partitioning of the cluster. This can lead to underutilization of resources.
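As a toy illustration of the Spark workloads characterized in the Ringenburg et al. paper above, the PySpark sketch below runs a small aggregation whose shuffle stage is precisely the phase most sensitive to where intermediate files are placed on diskless XC nodes. The data size, partition count, and key space are assumptions for the example and are not taken from HiBench or the CX benchmark.

# Toy PySpark job with a shuffle stage; sizes and partition counts are illustrative.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toy-shuffle").getOrCreate()
sc = spark.sparkContext

start = time.time()
# reduceByKey forces a shuffle; where its intermediate files land
# (node-local storage versus a parallel file system) strongly affects runtime.
counts = (sc.parallelize(range(10000000), numSlices=256)
            .map(lambda x: (x % 1000, 1))
            .reduceByKey(lambda a, b: a + b)
            .collect())
print("%d keys aggregated in %.1f s" % (len(counts), time.time() - start))

spark.stop()

Timing a job like this from the application side, alongside system-level metrics, is the kind of top-down/bottom-up pairing the paper uses to locate bottlenecks.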
Paper · Systems Support Technical Session 18C Chair: Jim Rogers (Oak Ridge National Laboratory) Network Performance Counter Monitoring and Analysis on the Cray XC Platform Jim Brandt (Sandia National Laboratories), Edwin Froese (Cray Inc.), Ann Gentile (Sandia National Laboratories), Larry Kaplan (Cray Inc.), and Benjamin Allan and Edward Walsh (Sandia National Laboratories) Abstract Abstract The instrumentation of Cray's Aries network ASIC, from which the XC platform's High Speed Network (HSN) is built, offers unprecedented potential for better understanding and utilization of platform HSN resources. The amount of monitoring data generated on a large-scale system presents challenges with respect to synchronization, data management, and analysis. There are over a thousand raw counter metrics per Aries router and interface, and functional combinations of these raw metrics are required for insight into network state. Design and implementation of a scalable monitoring system for Trinity Adam DeConinck, Amanda Bonnie, Kathleen Kelly, Samuel Sanchez, Cynthia Martin, and Michael Mason (Los Alamos National Laboratory); Jim Brandt, Ann Gentile, Benjamin Allan, and Anthony Agelastos (Sandia National Laboratories); and Michael Davis and Michael Berry (Cray Inc.) Abstract Abstract The Trinity XC-40 system at Los Alamos National Laboratory presents unprecedented challenges to our system management capabilities, including increased scale, new and unfamiliar subsystems, and radical changes in the system software stack. These challenges have motivated the development of a next-generation monitoring system with new capabilities for collection and analysis of system and facilities data. Dynamic Model Specific Register (MSR) Data Collection as a System Service Greg Bauer (National Center for Supercomputing Applications/University of Illinois), Jim Brandt and Ann Gentile (Sandia National Laboratories), and Andriy Kot and Michael Showerman (National Center for Supercomputing Applications) Abstract Abstract The typical use case for Model Specific Register (MSR) data is to provide application profiling tools with hardware performance counter data (e.g., cache misses, flops, instructions executed). This enables the user/developer to gain an understanding of the relative performance and efficiency of the code overall as well as of smaller code sections. Due to the overhead of collecting data at sufficient fidelity for the required resolution, these tools are typically only run while tuning a code. Paper · Filesystems & I/O Technical Session 19B Chair: Richard Barrett (Sandia National Labs) FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems Feiyi Wang, Veronica Vergara Larrea, Dustin Leverman, and Sarp Oral (Oak Ridge National Laboratory) Abstract Abstract The design of HPC file and storage systems has largely been driven by requirements for capability, reliability, and capacity. However, the convergence of large-scale simulation with big data analytics has put data, its usability, and its management back in a front-and-center position.
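The FCP paper above addresses fast, scalable copying on parallel file systems. As a purely conceptual sketch of the underlying pattern (fanning per-file copy work out to many workers), the Python fragment below copies a directory tree with a process pool; the paths and worker count are assumptions, and this is in no way the FCP implementation, which targets whole clusters rather than a single node.

# Minimal sketch of fanning per-file copy work out to parallel workers.
# Paths and worker count are illustrative; this is not FCP.
import os
import shutil
from multiprocessing import Pool

def copy_one(pair):
    src, dst = pair
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)  # copies file data and basic metadata
    return dst

def walk_pairs(src_root, dst_root):
    # Enumerate (source, destination) paths for every regular file in the tree.
    for dirpath, _dirs, files in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        for name in files:
            yield (os.path.join(dirpath, name), os.path.join(dst_root, rel, name))

if __name__ == "__main__":
    src_root, dst_root = "/scratch/src_tree", "/scratch/dst_tree"  # illustrative paths
    with Pool(processes=16) as pool:  # worker count chosen arbitrarily
        for _ in pool.imap_unordered(copy_one, walk_pairs(src_root, dst_root)):
            pass  # a production tool adds progress tracking, checksums, and restart support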
LIOProf: Exposing Lustre File System Behavior for I/O Middleware Cong Xu (Intel Corporation), Suren Byna (Lawrence Berkeley National Laboratory), Vishwanath Venkatesan (Intel Corporation), Robert Sisneros (National Center for Supercomputing Applications), Omkar Kulkarni (Intel Corporation), Mohamad Chaarawi (The HDF Group), and Kalyana Chadalavada (Intel Corporation) Abstract Abstract As the parallel I/O subsystems of large-scale supercomputers become more complex, with multiple levels of software libraries, hardware layers, and various I/O patterns, detecting performance bottlenecks is a critical requirement. While a few tools exist to characterize application I/O, robust analysis of file system behavior and the association of file-system feedback with application I/O patterns are largely missing. Toward filling this void, we introduce the Lustre IO Profiler, called LIOProf, for monitoring I/O behavior and characterizing I/O activity statistics in the Lustre file system. In this paper, we use LIOProf both for uncovering pitfalls of MPI-IO’s collective read operation over the Lustre file system and for identifying HDF5 overhead. Based on LIOProf characterization, we have implemented a Lustre-specific MPI-IO collective read algorithm, enabled HDF5 collective metadata operations, and applied HDF5 dataset optimizations. Our evaluation results on two Cray systems (Cori at NERSC and Blue Waters at NCSA) demonstrate the efficiency of our optimization efforts. Psync - Parallel Synchronization Of Multi-Pebibyte File Systems Andy Loftus (NCSA) Abstract Abstract When challenged to find a way to migrate an entire file system onto new hardware while maximizing availability and ensuring exact data and metadata duplication, NCSA found that existing file copy tools couldn’t fit the bill. So they set out to create a new tool: one that would scale to the limits of the file system and provide a robust interface to adjust to the dynamic needs of the cluster. The resulting tool, Psync, effectively manages many syncs running in parallel. It is dynamically scalable (nodes can be added and removed on the fly) and robust (the sync can be started, stopped, and restarted). Psync has been run successfully on hundreds of nodes, each with multiple processes (yielding possibly thousands of parallel processes). This talk will present the overall design of Psync and its use as a general-purpose tool for copying lots of data as quickly as possible. Paper · Applications & Programming Environments Technical Session 20A Chair: Chris Fuson (ORNL) Scaling hybrid coarray/MPI miniapps on ARCHER Luis Cebamanos (EPCC, The University of Edinburgh); Anton Shterenlikht (Mech Eng Dept, The University of Bristol); and David Arregui and Lee Margetts (School of Mech, Aero and Civil Engineering, The University of Manchester) Abstract Abstract We have developed miniapps from the MPI finite element library ParaFEM and the Fortran 2008 coarray cellular automata library CGPACK. The miniapps represent multi-scale fracture models of polycrystalline solids. The software from which these miniapps have been derived will improve predictive modelling in the automotive, aerospace, power generation, defense and manufacturing sectors. The libraries and miniapps are distributed under a BSD license, so they can be used by computer scientists and hardware vendors to test various tools including compilers and performance monitoring applications. CrayPAT tools have been used for sampling and tracing analysis of the miniapps.
Two routines with all-to-all communication structures have been identified as primary candidates for optimisation. New routines have been written implementing the nearest neighbour algorithm and using coarray collectives. The scaling limit for the miniapps has been increased by a factor of 3, from about 2k to over 7k cores. The miniapps uncovered several issues in CrayPAT and in Cray's implementation of Fortran coarrays. We are working with Cray engineers to resolve these. Hybrid coarray/MPI programming is uniquely enabled on Cray systems. This work is of particular interest to Cray developers, because it details real experiences of using hybrid Fortran coarray/MPI programming for scientific computing in an area of cutting-edge research. Enhancing Scalability of the Gyrokinetic Code GS2 by using MPI Shared Memory for FFTs Lucian Anton (Cray UK), Ferdinand van Wyk and Edmund Highcock (University of Oxford), Colin Roach (CCFE Culham Science Centre), and Joseph Parker (STFC) Abstract Abstract GS2 (http://sourceforge.net/projects/gyrokinetics) is a 5-D initial-value parallel code used to simulate low-frequency electromagnetic turbulence in magnetically confined fusion plasmas. Feasible calculations routinely capture plasma turbulence at length scales close either to the electron or the ion Larmor radius. Self-consistently capturing the interaction between turbulence at the ion scale and the electron scale requires a huge increase in the scale of computation. Scalable Remote Memory Access Halo Exchange with Reduced Synchronization Cost Maciej Szpindler (ICM, University of Warsaw) Abstract Abstract Remote Memory Access (RMA) is a popular technique for data exchange in parallel processing. Message Passing Interface (MPI), the ubiquitous environment for distributed-memory programming, introduced an improved RMA model in the recent version of the standard. While RMA provides direct access to low-level high-performance hardware, MPI one-sided communication enables various synchronization regimes, including scalable group synchronization. This combination provides methods to improve the performance of commonly used communication schemes in parallel computing. This work evaluates a one-sided halo exchange implementation on the Cray XC40 system. A large numerical weather prediction code is studied. To address already identified overheads of RMA synchronization, the recently proposed Notified Access extension is considered. To reduce the cost of the most frequent message-passing communication scheme, an alternative RMA implementation is proposed. Additionally, to identify more scalable approaches, the performance of general active-target synchronization, the Notified Access modes of RMA, and the original message-passing implementation is compared. Paper · Applications & Programming Environments Technical Session 20B Chair: Bilel Hadri (KAUST Supercomputing Lab) Directive-based Programming for Highly-scalable Nodes Douglas Miles and Michael Wolfe (PGI) Abstract Abstract High-end supercomputers have increased in performance from about 4 TFLOPS to 33 PFLOPS in the past 15 years, a factor of about 10,000. Increased node count accounts for a factor of 10, and clock rate increases for another factor of 5. Most of the increase, a factor of about 200, is due to increases in single-node performance. We expect this trend to continue, with single-node performance increasing faster than node count. Building scalable applications for such targets means exploiting as much intra-node parallelism as possible.
We discuss upcoming supercomputer node designs, how to abstract their differences to enable the design of portable, scalable applications, and the implications for HPC programming languages and models such as OpenACC and OpenMP. Balancing particle and Mesh Computation in a Particle-In-Cell Code Patrick H. Worley and Eduardo F. D'Azevedo (Oak Ridge National Laboratory), Robert Hager and Seung-Hoe Ku (Princeton Plasmas Physics Laboratory), Eisung Yoon (Rensselaer Polytechnic Institute), and Choong-Seock Chang (Princeton Plasmas Physics Laboratory) Abstract Abstract The XGC1 plasma microturbulence particle-in-cell simulation code has both particle-based and mesh-based computational kernels that dominate performance. Both of these are subject to load imbalances that can degrade performance and that evolve during a simulation. Each separately can be addressed adequately, but optimizing just for one can introduce significant load imbalances in the other, degrading overall performance. A technique based on Golden Section Search has been developed that minimizes wallclock time given prior timing information, the current particle distribution, and the mesh cost per cell, and that adapts to evolving load imbalance in both the particle and mesh work. In problems of interest this doubled the performance of full-system runs on the XK7 at the Oak Ridge Leadership Computing Facility compared to load balancing only one of the kernels. Computational Efficiency Of The Aerosol Scheme In The Met Office Unified Model Mark Richardson (University of Leeds); Fiona O'Connor (Met Office Hadley Centre, UK); Graham W. Mann (University of Leeds); and Paul Selwood (Met Office, UK) Abstract Abstract A new data structuring has been implemented in the Met Office Unified Model (MetUM) that improves the performance of the aerosol subsystem. Smaller amounts of atmospheric data, arranged as segments of atmospheric columns, are passed to the aerosol sub-processes. The number of columns in a segment can be changed at runtime and thus can be tuned to the hardware and science in operation. This revision alone has halved the time spent in some of the aerosol sections for the case under investigation. The new arrangement allows a simpler implementation of OpenMP around the whole of the aerosol subsystem and is shown to give close to ideal speed-up. Whether a dynamic schedule or a simpler static schedule is preferable for the OpenMP parallel loop is shown to depend on the number of threads. The percentage of the run spent in the UKCA sections has been reduced from 30% to 24%, with a corresponding 11% reduction in runtime for a single-threaded run. When the reference version uses 4 threads, the percentage of time spent in UKCA is higher, at 40%, but with the OpenMP and segmenting modifications this is reduced to 20%, with a corresponding reduction in runtime of 17%. For 4 threads the parallel speed-up of the reference code was 1.78 and after the modifications it is 1.91. Both values indicate that a significant portion of the run is still serial (within an MPI task), which is continually being addressed by the software development teams involved in MetUM. Paper · Systems Support Technical Session 20C Chair: Andrew Winfer (KAUST) How to Automate and not Manage under Rhine/Redwood Paul L. Peltz Jr., Adam J. DeConinck, and Daryl W.
Grunau (Los Alamos National Laboratory) Abstract Abstract Los Alamos National Laboratory and Sandia National Laboratories, under the Alliance for Computing at Extreme Scale (ACES), have partnered with Cray to deliver Trinity, the Department of Energy’s next supercomputer on the path to exascale. Trinity, which is an XC40, is an ambitious system for a number of reasons, one of which is the deployment of Cray’s new Rhine/Redwood (CLE 6.0/SMW 8.0) system management stack. This release brought a much-needed update to the system management stack to provide scalability, along with a new philosophy on system management. However, this update required LANL to update its own system management philosophy, and presented a number of challenges in integrating the system into the larger computing infrastructure at Los Alamos. This paper will discuss the work the LANL team is doing to integrate Trinity, automate system management with the new Rhine/Redwood stack, and combine LANL’s and Cray’s new system management philosophies. The Intel® Omni-Path Architecture: Game-Changing Performance, Scalability, and Economics Andrew Russell (Intel Corporation) Abstract Abstract The Intel® Omni-Path Architecture, Intel’s next-generation fabric product line, is off to an extremely fast start since its launch in late 2015. With high-profile customer deployments being announced at a feverish pace, the performance, resiliency, scalability, and economics of Intel’s innovative fabric product line are winning over customers across the HPC industry. Learn from an Intel Fabric Solution Architect how to maximize both the performance and economic benefits when deploying an Intel® OPA-based cluster, and how it delivers huge benefits to HPC applications over standard InfiniBand-based designs. The Hidden Cost of Large Jobs - Drain Time Analysis at Scale Joseph 'Joshi' Fullop (National Center for Supercomputing Applications) Abstract Abstract At supercomputing centers where many users submit jobs of various sizes, scheduling efficiency is the key to maximizing system utilization. With the capability of running jobs on massive numbers of nodes being the hallmark of large clusters, draining sufficient nodes in order to launch those jobs can severely impact the throughput of these systems. While these principles apply to clusters of any size, the idle node-hours due to drain on the scale of today's systems warrant attention. In this paper we provide methods of accounting for system-wide drain time as well as ways to attribute drain time to a specific job. Having data like this allows for real evaluation of scheduling policies and their effect on node occupancy. This type of measurement is also necessary to allow for backfill recovery analytics and enables other types of assessments. Paper · Applications & Programming Environments Technical Session 21A Chair: Richard Barrett (Sandia National Labs) Stitching Threads into the Unified Model Matthew Glover, Paul Selwood, Andy Malcolm, and Michele Guidolin (Met Office, UK) Abstract Abstract The Met Office Unified Model (UM) uses a hybrid parallelization strategy: MPI and OpenMP. Because the UM is legacy code, OpenMP has been retrofitted in a piecemeal fashion over recent years. On Enhancing 3D-FFT Performance in VASP Florian Wende (Zuse Institute Berlin), Martijn Marsman (Universität Wien), and Thomas Steinke (Zuse Institute Berlin) Abstract Abstract We optimize the computation of 3D-FFT in VASP in order to prepare the code for efficient execution on multi- and many-core CPUs like Intel's Xeon Phi.
Along with the transition from MPI to MPI+OpenMP, library calls need to adapt to threaded versions. One of the most time-consuming components in VASP is 3D-FFT. Besides assessing the performance of multi-threaded calls to FFTW and Intel MKL, we investigate strategies to improve the performance of FFT in a general sense. We incorporate our insights and strategies for FFT computation into a library which encapsulates FFTW and Intel MKL specifics and implements the following features: reuse of FFT plans, composed FFTs, and the use of high bandwidth memory on Intel's KNL Xeon Phi. We will present results on a Cray XC40 and a Cray XC30 Xeon Phi system using synthetic benchmarks and with the library integrated into VASP. Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers Abhinav Sarje (Lawrence Berkeley National Laboratory), Douglas Jacobsen (LANL), Samuel Williams (LBNL), Todd Ringler (LANL), and Leonid Oliker (LBNL) Abstract Abstract The increasing core counts of the modern processors used to build state-of-the-art supercomputers are driving application development towards thread parallelism, in addition to distributed-memory parallelism, to deliver efficient high-performance codes. In this work we describe the implementation of threading in a real-world ocean modeling application code, MPAS-Ocean, and our experiences with it. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers, and show the benefits of threading for runtime performance and energy requirements with increasing concurrency. Cori - A System to Support Data-Intensive Computing Katie Antypas, Deborah Bard, Wahid Bhimji, Tina M. Declerck, Yun (Helen) He, Douglas Jacobsen, Shreyas Cholia, Mr Prabhat, and Nicholas J. Wright (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Richard Shane Canon (Lawrence Berkeley National Laboratory) Abstract Abstract The first phase of Cori, NERSC’s next-generation supercomputer, a Cray XC40, has been configured specifically to support data-intensive computing. With increasing dataset sizes coming from experimental and observational facilities, including telescopes, sensors, detectors, microscopes, sequencers, and supercomputers, scientific users from the Department of Energy Office of Science are increasingly relying on NERSC for extreme-scale data analytics. This paper will discuss the Cori Phase 1 architecture and its installation into the new, energy-efficient CRT facility, and explain how the system will be combined with the larger Cori Phase 2 system based on the Intel Knights Landing processor. In addition, the paper will describe the unique features and configuration of the Cori system that allow it to support data-intensive science. Paper · Filesystems & I/O Technical Session 21B Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration) H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, and Richard Shane Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Alex Gittens (University of California, Berkeley); Lisa Gerhardt (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Suren Byna (Lawrence Berkeley National Laboratory); and Michael F.
Ringenburg (Cray Inc.); and Mr Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Spark has been tremendously powerful in performing Big Data analytics in distributed data centers. However, using the Spark framework on HPC systems to analyze large-scale scientific data poses several challenges. For instance, the parallel file system is shared among all compute nodes, in contrast to shared-nothing architectures. Another challenge is accessing data stored in scientific data formats, such as HDF5 and NetCDF, which are not natively supported in Spark. Our study focuses on improving the I/O performance of Spark on HPC systems for reading and writing scientific data arrays, e.g., HDF5/netCDF. We select several scientific use cases to drive the design of an efficient parallel I/O API for Spark on HPC systems, called H5Spark. We optimize the I/O performance, taking into account Lustre file system striping. We evaluate the performance of H5Spark on Cori, a Cray XC40 system located at NERSC. The time is now. Unleash your CPU cores with Intel® SSDs Andrey O. Kudryavtsev (Intel Corporation) and Ken Furnanz (Intel) Abstract Abstract Andrey Kudryavtsev, HPC Solution Architect for the Intel® Non-Volatile Solutions Group (NSG), will discuss advancements in Intel SSD technology that are unleashing the power of the CPU. He will dive into how Intel® NVMe SSDs can greatly benefit HPC-specific performance with parallel file systems. He will also share the HPC solutions and performance benefits that Intel has already seen with its customers today, and how adoption of current SSD technology lays the foundation for Intel’s next generation of memory technology, 3D XPoint Intel® SSDs with Intel Optane™ technology, in the high-performance computing segment. Introducing a new IO tier for HPC Storage James Coomer (DDN Storage) Abstract Abstract Tier 1 “performance” storage is increasingly being flanked by new solid-state-based tiers and active archive tiers that improve the economics of both performance and capacity. The available implementations of solid-state tiers in parallel filesystems are typically based on a separate namespace and/or utilise existing filesystem technologies. Given the price/performance characteristics of SSDs today, huge value is gained by addressing both optimal data placement to the SSD tier and the comprehensive construction of this tier to accelerate the broadest spectrum of IO, rather than just small random reads. Paper · Applications & Programming Environments Technical Session 21C Chair: David Hancock (Indiana University) Maintaining Large Software Stacks in a Cray Ecosystem with Gentoo Portage Colin A. MacLean (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract Building and maintaining a large collection of software packages from source is difficult without powerful package management tools. This task is made more difficult in an environment where many libraries do not reside in standard paths and where loadable modules can drastically alter the build environment, such as on a Cray system. The need to maintain multiple Python interpreters with a large collection of Python modules is one such case of a large and complicated software stack, and is described in this paper. To address the limitations of current tools, Gentoo Prefix was ported to the Cray XE/XK system Blue Waters, giving the ability to use the Portage package manager.
This infrastructure allows for fine-grained dependency tracking, consistent build environments, multiple Python implementations, and customizable builds. It is used to build and maintain over 400 packages for Python support on Blue Waters for use by its partners. Early Application Experiences on Trinity - the Next Generation of Supercomputing for the NNSA Courtenay Vaughan, Dennis Dinge, Paul T. Lin, Kendall H. Pierson, Simon D. Hammond, J. Cook, Christian R. Trott, Anthony M. Agelastos, Douglas M. Pase, Robert E. Benner, Mahesh Rajan, and Robert J. Hoekstra (Sandia National Laboratories) Abstract Abstract Trinity, a Cray XC40 supercomputer, will be the flagship capability computing platform for the United States nuclear weapons stockpile stewardship program when the machine enters full production during 2016. In the first phase of the machine, almost 10,000 dual-socket Haswell processor nodes will be deployed, followed by a second phase utilizing Intel's next-generation Knights Landing processor. Executing dynamic heterogeneous workloads on Blue Waters with RADICAL-Pilot Mark Santcroos (Rutgers University); Ralph Castain (Intel Corporation); Andre Merzky (Rutgers University); Iain Bethune (EPCC, The University of Edinburgh); and Shantenu Jha (Rutgers University) Abstract Abstract Traditionally, HPC systems such as Crays have been designed to support mostly monolithic workloads. However, the workload of many important scientific applications is constructed out of spatially and temporally heterogeneous tasks that are often dynamically inter-related. These workloads can benefit from being executed at scale on HPC resources, but a tension exists between the workloads' resource utilization requirements and the capabilities of the HPC system software and usage policies. Pilot systems have the potential to relieve this tension. RADICAL-Pilot is a scalable and portable pilot system that enables the execution of such diverse workloads. In this paper we describe the design and characterize the performance of RADICAL-Pilot's scheduling and execution components on Crays, which are engineered for efficient resource utilization while maintaining the full generality of the Pilot abstraction. We will discuss four different implementations of support for RADICAL-Pilot on Cray systems and analyze and report on their performance. Evaluating Shifter for HPC Applications Donald M. Bahls (Cray Inc.) Abstract Abstract Shifter is a powerful tool that has the potential to expand the availability of HPC applications on Cray XC systems by allowing Docker-based containers to be run with little porting effort. In this paper, we explore the use of Shifter as a means of running HPC applications built for commodity Linux cluster environments on a Cray XC. We compare the developer productivity, application performance, and application scaling of stock applications compiled for commodity Linux clusters with both Cray XC-tuned Docker images and natively compiled applications not using the Shifter environment. We also discuss pitfalls and issues associated with running non-SLES-based Docker images in the Cray XC environment.
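The RADICAL-Pilot paper above relies on the pilot abstraction: one placeholder allocation that executes many small, heterogeneous tasks without paying a batch-queue round trip per task. The sketch below is a conceptual, single-node illustration of that pattern only; the commands and slot count are made up for the example, and nothing here reflects RADICAL-Pilot's actual API.

# Conceptual sketch of the pilot pattern: many small tasks executed inside one allocation.
# Commands and worker count are illustrative; this is not RADICAL-Pilot's API.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

# In a real pilot system, task descriptions arrive dynamically from the application.
tasks = [["/bin/echo", "task %d" % i] for i in range(32)]

def run_task(cmd):
    # Each task runs inside the already-acquired allocation, so no per-task
    # batch-queue wait is incurred.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return cmd, result.returncode

with ThreadPoolExecutor(max_workers=8) as pool:  # eight concurrent task "slots"
    futures = [pool.submit(run_task, cmd) for cmd in tasks]
    for fut in as_completed(futures):
        cmd, rc = fut.result()
        print(" ".join(cmd), "->", rc)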