Birds of a Feather BoF 3A
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Programming Environments, Applications, and Documentation (PEAD) Special Interest Group meeting
Bilel Hadri (KAUST Supercomputing Lab)
Abstract: The Programming Environments, Applications and Documentation Special Interest Group ("the SIG") has as its mission to provide a forum for exchange of information related to the usability and performance of programming environments (including compilers, libraries and tools) and scientific applications running on Cray systems. Related topics in user support and communication (e.g. documentation) are also covered by the SIG.

Birds of a Feather BoF 3B
Chair: Nicholas Cardo (Swiss National Supercomputing Centre)

Systems Support SIG Meeting
Nicholas Cardo (Swiss National Supercomputing Centre)
Abstract: This BoF will focus on topics related to the day-to-day operations of Cray supercomputers, including operating system support, storage, and networking.

Birds of a Feather BoF 3C
Chair: Torben Kling Petersen (Cray Inc.); Michele Bertasi (Bright Computing)

When to use Flash and when not to...
Torben Kling Petersen (Cray Inc.)
Abstract: Flash in all forms and designs is said to solve all storage problems, but is the higher cost of flash really worth the investment? What about reliability and the effects of long-term use? And are HDDs really dead?

Scalable Accounting & Reporting for Compute Jobs
Michele Bertasi (Bright Computing)
Abstract: HPC systems usually come with a significant price tag, which means that it is highly desirable to be able to gain insight into how effectively the resources in an HPC system are being used. It is important to be able to differentiate between reservation of resources and actual resource usage (e.g. wall-clock time versus CPU time) to be able to make statements about efficiency. In this session we will describe the workings of an accounting and reporting engine that was introduced in the latest version of Bright Cluster Manager. HPC system administrators can use this to answer questions such as "Which users typically allocate resources that they don't use effectively?" or "How much power was consumed for running jobs of type X by user Y?". In particular, we will address how the system can scale to large numbers of nodes and jobs, and how the use of the PromQL query language provides flexibility in terms of report generation.

Birds of a Feather BoF 3D
Chair: Sadaf R. Alam (CSCS)

Tools and Utilities for Data Science Workloads and Workflows
Sadaf Alam and Maxime Martinasso (Swiss National Supercomputing Centre) and Michael Ringenburg (Cray Inc.)
Abstract: The goal of this BOF is to share experiences in using data science software packages, tools and utilities in HPC environments. These include packages and solutions that HPC sites offer as a service and the Cray Urika-XC software suite. A further goal of this BOF is to identify opportunities and challenges that the HPC community is facing in order to offer integrated solutions for HPC and data science workloads and workflows.

Opportunities for containers in HPC ecosystems
Sadaf Alam and Lucas Benedicic (Swiss National Supercomputing Centre) and Shane Canon and Douglas Jacobsen (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Abstract: Several container solutions have emerged in the past couple of years to take advantage of this technology in high performance computing environments.
As a follow-up to the CUG 2017 BOF on creating a community around the Shifter container solution, in this BOF we share updates, experiences and challenges. We present an architectural design for extensibility and community engagement. We also discuss opportunities for engagement and integration into the broader Docker community.

Birds of a Feather BoF 10A
Chair: Larry Kaplan (Cray Inc.)

Cray Next Generation Software Integration Options
Larry Kaplan (Cray Inc.) and Nicholas Cardo (Swiss National Supercomputing Centre)
Abstract: This BOF is intended to collect feedback and use cases based on the information presented in pap114, "Cray Next Generation Software Integration Options". The paper presents a framework and some details for potential types of software integrations into Cray's systems. The BOF will explore further details and examples of such integrations, driven by Cray customer input.

Birds of a Feather BoF 10B
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Managing Effectively the User Software Ecosystem
Bilel Hadri (KAUST Supercomputing Lab), Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre), Ashley Barker (Oak Ridge National Laboratory), Mario Melara and Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Peggy Sanchez (Cray Inc.)
Abstract: Supercomputing centers host and manage HPC systems to enable researchers to productively carry out their science and engineering research, and provide them support and assistance not only with software from the vendor-supplied programming environment, but also with third-party software installed by HPC staff.

Birds of a Feather BoF 10C
Chair: Stephen Leak (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)

Practical implementation of monitoring on Cray system
Stephen Leak (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Jean-Guillaume Piccinali (CSCS)
Abstract: Monitoring has become an area of significant interest for many Cray sites, and a System Monitoring Working Group was established by Cray to support collaboration between sites in monitoring efforts. An important topic within this group is the practical implementation of monitoring. Different sites have implemented monitoring of different aspects of Cray systems, and the aim of this BoF is to share practical implementation recommendations so we can learn from each other's experiences rather than independently repeating them via trial and error. This BoF will consist of presentations from sites on lessons learned in practical experience, interspersed with discussion about problems sites are facing and known or proposed solutions to these.

Birds of a Feather BoF 10D
Chair: David Hancock (Indiana University)

Open Discussion with CUG Board
David Hancock (Indiana University)
Abstract: This session is designed as an open discussion with the CUG Board, but there are a few high-level topics that will also be on the agenda. The discussion will focus on a corporation update, feedback on increasing CUG participation, and feedback on SIG structure and communication. An open floor question and answer period will follow these topics. Formal voting (on candidates and the bylaws) will open after this session, so any candidates or members with questions about the process are welcome to bring up those topics.

Birds of a Feather BoF 21A
Chair: Harold Longley (Cray Inc.)
XC System Management Usability BOF
Harold Longley and Eric Cozzi (Cray Inc.)
Abstract: This BOF will be a facilitated discussion around usability of system management software on an XC series system with SMW 8.0/CLE 6.0 software, focusing on standard and advanced administrator use cases. The goal will be to gain an understanding of the best and worst parts of interacting with the system management software for the SMW, CLE nodes, and eLogin nodes, to understand how customers would like to see the software evolve.

Birds of a Feather BoF 21B
Chair: Chris Fuson (ORNL)

Best Practices for Supporting Diverse HPC User Storage Needs with Finite Storage Resources
Chris Fuson (Oak Ridge National Laboratory), Bilel Hadri (King Abdullah University of Science and Technology), Frank Indiviglio (National Oceanic and Atmospheric Administration), and Maciej Olchowik (King Abdullah University of Science and Technology)
Abstract: HPC centers provide large, multi-tiered, shared data storage resources to large user communities who span diverse science domains with varied storage requirements. The diverse storage needs of user communities can create contention for a center's finite storage resources.

Birds of a Feather BoF 21C
Chair: Michael Showerman (National Center for Supercomputing Applications)

Automated Analysis and Effective Feedback
Mike Showerman (National Center for Supercomputing Applications) and Jim Brandt and Ann Gentile (Sandia National Laboratories)
Abstract: As systems increase in size and complexity, diagnosing problems, assessing the best responses from a set of choices, and coordinating complex responses are too labor-intensive to complete manually in a timely fashion. Thus detecting and diagnosing problems will increasingly need to rely on automated, rather than manual, analysis. Taking mitigating action on meaningful timescales will require automation as well.

Birds of a Feather BoF 21D
Chair: Kelly J. Marquardt (Cray)

Customer Collaboration and the Shasta Software Stack
Kelly J. Marquardt (Cray Inc.)
Abstract: The Shasta software stack is being architected to support far greater flexibility. The software will be highly modular and there will be published APIs at various layers. This is intended to give customers the ability to use the software stack in new ways and to integrate with other software components in ways that weren't possible before.

New Site 15
Chair: Helen He (National Energy Research Scientific Computing Center)

Isambard, the world's first production Arm-based supercomputer
Simon McIntosh-Smith (University of Bristol)
Abstract: This is a New Site talk from GW4. Isambard will be the first Cray Scout XC50 Arm-based supercomputer when it ships in July 2018. We believe it will also be the first production Arm-based supercomputer of any type. Run by the UK's GW4 Alliance universities of Bristol, Bath, Cardiff and Exeter, in collaboration with the Met Office, Cray and Arm, Isambard will include over 10,000 high-performance Armv8 cores, delivered by Cavium ThunderX2 processors. The project has already made significant progress, courtesy of eight early access nodes, delivered in November 2017 and upgraded to near-production B0 silicon in March 2018. Two hackathons have seen many of our target science codes already ported and optimised for ThunderX2. The porting process has been remarkably straightforward, and the performance results extremely encouraging.
New Site 17
Chair: Helen He (National Energy Research Scientific Computing Center)

The NIWA/NeSI HPC Replacement Project - A Voyage into Complexity: Integrating multi-site XC, CS, ESS, Spectrum Scale, and OpenStack
Michael Uddstrom (National Institute of Water and Atmospheric Research); Brian Corrie and Nick Jones (NeSI); Fabrice Cantos and Aaron Hicks (National Institute of Water and Atmospheric Research); Greg Hall (NeSI); Wolfgang Hayek (National Institute of Water and Atmospheric Research); David Kelly, Patricia Balle, Adam Sachitano, and Brian Gilmer (Cray Inc.); and Dale McCurdy and Andrew Beattie (IBM)
Abstract: Replacement of New Zealand's national HPC infrastructure was initiated in December 2016 through the release of a single RFP for three HPC systems, funded by four entities: NIWA, University of Auckland, University of Otago and Landcare Research.

Paper Technical Session 8A
Chair: Ronald Brightwell (Sandia National Laboratories)

Cray Next Generation Software Integration Options
Larry Kaplan, Kitrick Sheets, Dean Roe, Bill Sparks, Michael Ringenburg, and Jeff Schutkoske (Cray Inc.)
Abstract: The ability to integrate third-party software into Cray's systems is a critical feature. This paper describes the types of third-party software integrations being investigated for Cray systems in the future. One of the key goals is to allow a wide range of third-party integration options that take advantage of all parts of the Cray technology stack. As a position paper, it aims to be directional and provides a guide through design and architecture decisions. It should not be considered a plan of record. It is intended to enumerate the possibilities but does not attempt to prioritize them or consider the additional effort to enable them, if any.

CLE Port to ARM: Functionality, Performance, and Lessons Learned
Jeffrey Schutkoske (Cray Inc.)
Abstract: The Cray Linux Environment (CLE) supports the Cavium ThunderX2 CN99xx ARM processor in the Cray XC50 system. This is the first ARM processor supported by CLE. The port of CLE to ARM provides the same level of functionality, performance, and scalability that is available on other Cray XC systems. This paper describes the functionality, performance, and lessons learned from the port of CLE to ARM.

SSA, ClusterStor Call-home Service Actions, and an Introduction to Cray Central Telemetry and Triage Services (C2TS)
Jeremy Duckworth and Tim Morneau (Cray Inc.)
Abstract: The ClusterStor platform is designed to minimize a customer's operational burden (time and money) by offering an enterprise-ready reliability, availability, and serviceability (RAS) solution. ClusterStor RAS, via SSA, can securely submit comprehensive diagnostic information to Cray in near real time, over the Internet. Using this data stream, Cray plans to generate proactive service opportunities and, in select cases, automate repair part shipment and service dispatch. As a complementary serviceability feature, Cray also plans to make it easier for customers to capture and securely transfer support data to Cray by utilizing SSA on ClusterStor as a triage data collection framework. Finally, as a foundation for the future of call-home systems at Cray, this paper introduces the Cray Central Telemetry and Triage Services (C2TS), including key motivations for Cray's work in this area and how C2TS relates to ClusterStor, SSA, and future products.
Paper Technical Session 8B
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Performance Evaluation of MPI on Cray XC40 Xeon Phi Systems
Scott Parker, Sudheer Chunduri, and Kevin Harms (Argonne National Laboratory) and Krishna Kandalla (Cray Inc.)
Abstract: The scale and complexity of large-scale systems continue to increase; therefore, optimal performance of commonly used communication primitives such as MPI point-to-point and collective operations is essential to the scalability of parallel applications. This work presents an analysis of the performance of the Cray MPI point-to-point and collective operations on the Argonne Theta Cray XC Xeon Phi system. The performance of key MPI routines is benchmarked using the OSU benchmarks, and analytical models are fit to the collected data in order to quantify the performance and scaling of the point-to-point and collective implementations. In addition, the impact of congestion on the repeatability and relative performance consistency of MPI collectives is discussed.

Performance Impact of Rank Reordering on Advanced Polar Decomposition Algorithms
Aniello Esposito (Cray EMEA Research Lab) and David Keyes, Hatem Ltaief, and Dalal Sukkari (King Abdullah University of Science and Technology)
Abstract: We demonstrate the importance of MPI rank reordering for the performance of parallel scientific applications in the context of the cray-mpich library. Using MPICH_RANK_REORDER_METHOD=3 and a custom reorder file MPICH_RANK_ORDER, end-users may change the default process placement policy on Cray XC systems to maximize the on-node communications while reducing the expensive off-node data traffic [see the illustrative sketch following this session's listings]. We investigate the performance impact of rank reordering using two advanced polar decomposition (PD) algorithms, i.e., the QR-based Dynamically Weighted Halley method (QDWH) and the Zolotarev rational functions (ZOLOPD), whose irregular workloads may greatly suffer from process misplacement. PD is the first computational step toward solving symmetric eigenvalue problems and the singular value decomposition. We consider an extensive combination of grid topologies and rank reorderings for different matrix sizes and numbers of nodes. Performance profiling reveals an improvement of up to 54%, thanks to a careful process placement.

Are We Witnessing the Spectre of an HPC Meltdown?
Veronica G. Vergara Larrea, Michael J. Brim, Wayne Joubert, Swen Boehm, Oscar Hernandez, Sarp Oral, James Simmons, Don Maxwell, and Matthew Baker (Oak Ridge National Laboratory)
Abstract: This study will measure and analyze the performance observed when running applications and benchmarks before and after the Meltdown and Spectre fixes have been applied to the Cray supercomputers and supporting systems at the Oak Ridge Leadership Computing Facility (OLCF). Of particular interest to this work is the effect of these fixes on applications selected from the OLCF portfolio when running at scale. This comprehensive study will present results from experiments run on the Titan, Eos, Cumulus, and Percival supercomputers at the OLCF. The results from this study could be useful for HPC users running on leadership-class supercomputers, and could serve as a guide to better understand the impact that these two vulnerabilities will have on diverse HPC workloads at scale.
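Illustrative note: the rank-reordering study above relies on the standard Cray MPICH controls named in its abstract (MPICH_RANK_REORDER_METHOD=3 plus an MPICH_RANK_ORDER file; see the intro_mpi man page for the file format). The C sketch below is not taken from the paper; it is a minimal, generic way to report which node each MPI rank lands on, so that a custom placement can be verified before and after those settings are applied.

    /* placement_check.c: print the node hosting each MPI rank.
     * Illustrative sketch only; compile with "cc placement_check.c"
     * and run under aprun/srun. Set MPICH_RANK_REORDER_METHOD=3 and
     * provide an MPICH_RANK_ORDER file (see intro_mpi) to test a
     * custom placement, then compare the reported mapping.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char node[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &len);

        /* Print in rank order so the mapping is easy to read. */
        for (int r = 0; r < size; r++) {
            if (r == rank)
                printf("rank %d of %d -> %s\n", rank, size, node);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }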
Paper Technical Session 8C
Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration)

How Deep is Your I/O? Toward Practical Large-Scale I/O Optimization via Machine Learning Methods
Robert Sisneros and Jonathan Kim (National Center for Supercomputing Applications), Mohammad Raji (University of Tennessee), and Kalyana Chadalavada (Intel Corporation)
Abstract: Performance-related diagnostic data routinely collected by administrators of HPC machines is an excellent target for the application of machine learning approaches. There is a clear notion of "good" and "bad", and there is an obvious application: performance prediction and optimization. In this paper we will detail utilizing machine learning to model I/O on the Blue Waters supercomputer. We will outline data collection alongside usage of two representative machine learning approaches. Our final goal is the creation of a practical utility to advise application developers on I/O optimization strategies and further provide a heuristic allowing developers to weigh efforts against expectations. We have additionally devised an incremental experimental framework in an attempt to pinpoint impacts and causes thereof; in this way we hope to partially open the machine learning black box and communicate additional insights and considerations for future efforts.

TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis
Glenn K. Lockwood (Lawrence Berkeley National Laboratory); Shane Snyder, George Brown, Kevin Harms, and Philip Carns (Argonne National Laboratory); and Nicholas J. Wright (Lawrence Berkeley National Laboratory)
Abstract: At present, I/O performance analysis requires different tools to characterize individual components of the I/O subsystem, and institutional I/O expertise is relied upon to translate these disparate data into an integrated view of application performance. This process is labor-intensive and not sustainable as the storage hierarchy deepens and system complexity increases. To address this growing disparity, we have developed the Total Knowledge of I/O (TOKIO) framework to combine the insights from existing component-level monitoring tools and provide a holistic view of performance across the entire I/O stack.

Improving Nektar++ IO Performance for Cray XC Architecture
Michael Bareford, Nick Johnson, and Michele Weiland (EPCC, The University of Edinburgh)
Abstract: Future machine architectures are likely to have higher core counts, placing tougher demands on the parallel IO routinely performed by codes such as Nektar++, an open-source MPI-based spectral element code that is widely used within the UK CFD community. There is therefore a need to compare the performance of different IO techniques on today's platforms in order to determine the most promising candidates for exascale machines. We measure file access times for three IO methods, XML, HDF5 and SIONlib, over a range of core counts (up to 6144) on the ARCHER Cray XC-30. The first of these (XML) follows a file-per-process approach, whereas HDF5 and SIONlib allow one to manage a single shared file, thus minimising metadata IO costs. We conclude that SIONlib is the preferred choice for a single shared file as a result of two advantages: lower decompositional overhead and a greater responsiveness to Lustre file customisations.
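Illustrative note: to make the file-per-process versus single-shared-file distinction from the Nektar++ abstract concrete, the sketch below uses plain MPI-IO as a stand-in (the paper itself evaluates XML, HDF5 and SIONlib, not raw MPI-IO). Every rank writes its block of data at a rank-dependent offset in one shared file, so a single file is produced regardless of the core count.

    /* shared_file.c: minimal single-shared-file write with MPI-IO.
     * Illustrative only; the Nektar++ study uses XML, HDF5 and SIONlib
     * rather than raw MPI-IO. Each rank writes COUNT doubles at an
     * offset determined by its rank, producing one file in total.
     */
    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 1024

    int main(int argc, char **argv)
    {
        int rank;
        double buf[COUNT];
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < COUNT; i++)
            buf[i] = (double)rank;            /* dummy payload */

        offset = (MPI_Offset)rank * COUNT * sizeof(double);

        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        /* Collective write: all ranks participate, one shared file. */
        MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }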
Paper Technical Session 9A
Chair: Jim Rogers (Oak Ridge National Laboratory); Ann Gentile (Sandia National Laboratories)

Cray System Monitoring: Successes, Priorities, Visions
Ville Ahlgren (CSC - IT Center for Science Ltd.); Stefan Andersson (Cray Inc., High Performance Computing Center Stuttgart); Jim Brandt (Sandia National Laboratories); Nicholas Cardo (Swiss National Supercomputing Centre); Sudheer Chunduri (Argonne National Laboratory); Jeremy Enos (National Center for Supercomputing Applications); Parks Fields (Los Alamos National Laboratory); Ann Gentile (Sandia National Laboratories); Richard Gerber (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Joe Greenseid (Cray Inc.); Annette Greiner (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Bilel Hadri (King Abdullah University of Science and Technology); Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Dennis Hoppe (High Performance Computing Center Stuttgart); Urpo Kaila (CSC - IT Center for Science Ltd.); Kaki Kelly (Los Alamos National Laboratory); Mark Klein (Swiss National Supercomputing Centre); Alex Kristiansen (Argonne National Laboratory); Steve Leak (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Michael Mason (Los Alamos National Laboratory); Kevin Pedretti (Sandia National Laboratories); Jean-Guillaume Piccinali (Swiss National Supercomputing Centre); Jason Repik (Sandia National Laboratories); Jim Rogers (Oak Ridge National Laboratory); Susanna Salminen (CSC - IT Center for Science Ltd.); Michael Showerman (National Center for Supercomputing Applications); Cary Whitney (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Jim Williams (Los Alamos National Laboratory)
Abstract: Effective HPC system operations and utilization require unprecedented insight into system state, applications' demands for resources, contention for shared resources, and system demands on center power and cooling. Monitoring can provide such insights when the necessary fundamental capabilities for data availability and usability are provided. In this paper, multiple Cray sites seek to motivate monitoring as a core capability in HPC design, through the presentation of success stories illustrating enhanced understanding and improved performance and/or operations as a result of monitoring and analysis. We present the utility, limitations, and gaps of the data necessary to enable the required insights. The capabilities developed to enable the case successes drive our identification and prioritization of monitoring system requirements. Ultimately, we seek to engage all HPC stakeholders to drive community and vendor progress on these priorities.

Use of the ERD for administrative monitoring of Theta
Alex Kristiansen (Argonne National Laboratory)
Abstract: Monitoring the state of an HPC cluster in a timely and accurate fashion is critical to most system administration functions. For many Cray users, the first step in monitoring is ingestion of log files. Unfortunately, log parsing is an inherently inefficient process, requiring multiple software components to read and write from files on disk. Cray's own utilities use a message bus, the ERD, for a wide variety of purposes.
At ALCF, we have begun to use this message bus for monitoring via a client library written in Go, allowing us to read in structured data directly from Cray's services and, in many instances, bypass log files entirely. In this paper we will examine the implementation and utilization of this approach on our 4,392-node XC40, Theta, as well as the overall benefits and drawbacks of using the ERD for real-time monitoring.

Supporting failure analysis with discoverable, annotated log datasets
Stephen Leak and Annette Greiner (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) and Ann Gentile and James Brandt (Sandia National Laboratories)
Abstract: Detection, characterization, and mitigation of faults on supercomputers are complicated by the large variety of interacting subsystems. Failures often manifest as vague observations like "my job failed" and may result from system hardware/firmware/software, filesystems, networks, resource manager state, and more. Data such as system logs, environmental metrics, job history, cluster state snapshots, published outage notices and user reports is routinely collected. These data are typically stored in different locations and formats for specific use by targeted consumers. Combining data sources for analysis generally requires a consumer-dependent custom approach. We present a vocabulary for describing data, including format and access details, an annotation schema for attaching observations to a dataset, and tools to aid in discovery and publishing system-related insights. We present case studies in which our analysis tools utilize information from disparate data sources to investigate failures and performance issues from user and administrator perspectives.

Paper Technical Session 9B
Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)

Intel® Xeon processor Scalable family performance in HPC, AI and other key segments
Jean-Laurent Philippe (Intel Corporation)
Abstract: In this session, Dr. Jean-Laurent Philippe from Intel will talk about the new Intel Xeon processor Scalable family and the platform features it enables. The first part of the presentation will be a refresher with an overview of the new features of the Intel Xeon Scalable processor. This will include the core description, the Intel Mesh architecture that connects the cores (up to 28 cores), the significant increases in memory and I/O bandwidth, as well as the new generation of Intel Advanced Vector Extensions called Intel AVX-512, enabling up to double the flops per clock cycle compared to the previous-generation Intel AVX2. The presenter will show that this new family of processors is designed to deliver advanced HPC capabilities. The presenter will then give several examples of the new levels of performance reached on various workloads and applications in the HPC segment and beyond.

Storage and memory hierarchy in HPC: new paradigm and new solutions with Intel
Jean-Laurent Philippe (Intel Corporation)
Abstract: With networking more capable than ever with regard to transmission speeds, the storage element has now become the most important aspect of any modern-day performance solution, as we must no longer hold back compute resources from realising their maximum potential due to IO wait and high latency from the storage feeding them.
NVMe is now one of the biggest advances in storage capability, as it allows the architecting of solutions that deliver huge advances in storage performance, with the promise of more to come. In this session, you will gain insight into how Intel's Optane™ 3D XPoint™ products, coupled with its 3D NVMe NAND device portfolio of non-volatile memory products, address all tiers of performance now and into the future.

Chapel Comes of Age: Making Scalable Programming Productive
Bradford L. Chamberlain, Ben Albrecht, Elliot Ronaghan, et al. (Cray Inc.)
Abstract: Chapel is a programming language whose goal is to support productive, general-purpose parallel computing at scale. Chapel's approach can be thought of as combining the strengths of Python, Fortran, C/C++, and MPI in a single language. Five years ago, the DARPA High Productivity Computing Systems (HPCS) program that launched Chapel wrapped up, and the team embarked on a five-year effort to improve Chapel's appeal to end-users. This paper follows up on our CUG 2013 paper by summarizing the progress made by the Chapel project since that time. Specifically, Chapel's performance now competes with or beats hand-coded C+MPI/SHMEM+OpenMP; its suite of standard libraries has grown to include FFTW, BLAS, LAPACK, MPI, ZMQ, and other key technologies; its documentation has been modernized and fleshed out; and the set of tools available to Chapel users has grown. This paper also characterizes the experiences of early adopters from communities as diverse as astrophysics and artificial intelligence.

The Cray Programming Environment: Current Status and Future Directions
Luiz DeRose (Cray Inc.)
Abstract: One of Seymour Cray's famous quotes was "anyone can build a fast CPU. The trick is to build a fast system". Today, with the availability of commodity processors, anyone can build a system, but it is still tricky to build a balanced and fast supercomputer, and a key component is the programming environment. The technology changes in the supercomputing industry are forcing computational scientists to face new critical system characteristics that will significantly impact the performance and scalability of applications. Hence, application developers need sophisticated compilers, tools, and libraries that can help maximize programmability with low porting and tuning effort, while not losing sight of performance portability across a wide range of architectures. In this talk I will present the new functionalities, roadmap, and future directions of the Cray Programming Environment, which are being developed and deployed on Cray clusters and Cray supercomputers for scalable performance with high programmability.

Paper Technical Session 9C
Chair: Ronald Brightwell (Sandia National Laboratories)

Scaling Deep Learning without Impacting Batchsize
Alexander D. Heye (Cray Inc.)
Abstract: Deep learning has proven itself to be a difficult problem in the HPC space. Though the algorithm can scale very efficiently with a sufficiently large batchsize, the efficacy of training tends to decrease as the batchsize grows. Scaling the training of a single model may be effective in narrow fields such as image classification, but more generalizable options can be achieved when considering alternate methods of parallelism and the larger workflow surrounding neural network training.
Hyperparameter optimization, dataset segmentation, hierarchical fine-tuning and model parallelism can all provide significant scaling capacity without increasing batchsize, and can be paired with a traditional, single-model scaling approach for a multiplicative scaling improvement. This paper intends to further define and examine these scaling techniques, both in how they perform individually and in how combining them can provide significant improvements in overall training times.

Alchemist: An Apache Spark <=> MPI Interface
Alex Gittens (Rensselaer Polytechnic Institute); Kai Rothauge, Michael W. Mahoney, and Shusen Wang (UC Berkeley); Michael Ringenburg and Kristyn Maschhoff (Cray Inc.); Mr. Prabhat and Lisa Gerhardt (NERSC/LBNL); and Jey Kottalam (UC Berkeley)
Abstract: The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map directly onto this model. One way to mitigate these costs is to off-load computations onto MPI codes. In recent work, we introduced Alchemist, a system for the analysis of large-scale data sets. Alchemist calls MPI-based libraries from within Spark applications, and it has minimal coding, communication, and memory overheads. In particular, Alchemist allows users to retain the productivity benefits of working within the Spark software ecosystem without sacrificing performance efficiency in linear algebra, machine learning, and other related computations.

Continuous integration in a Cray multiuser environment
Ben Lenard and Tommie Jackson (Argonne National Laboratory)
Abstract: Continuous integration (CI) provides the ability for developers to compile and unit test their code after commits to their repository. This is a tool for producing better code by recompiling and testing often. CI in this environment means that a build will happen either on a schedule or triggered by a commit to a software repo. In this paper, we will look at how Argonne National Laboratory's LCF is implementing CI with a two-pronged approach, security considerations, and project isolation. We are currently implementing a Jenkins instance that has the ability to connect to external software repositories, listen for web events, compile code, and then execute tests or submit jobs to Cobalt, our job scheduler. While Jenkins provides the ability to build code on demand and execute jobs on our systems, we will also be deploying another solution, tied to GitLab, that provides seamless integration.

Paper Technical Session 20A
Chair: Jim Rogers (Oak Ridge National Laboratory)

Modernizing Cray Systems Management – Use of Redfish APIs on Next Generation Shasta Hardware
Steven Martin, Kevin Hughes, Matt Kelly, and David Rush (Cray Inc.)
Abstract: This paper will give a high-level overview of, and a deep dive into, using a Redfish-based paradigm and strategy for low-level hardware management on future Cray platforms. A brief overview of the Distributed Management Task Force (DMTF) Redfish systems management specification is given, along with an outline of some of our motivations for adopting this open specification for future Cray platforms. Details and examples are also provided illustrating the use of these open and accepted industry-standard REST API mechanisms and schemas.
The goal of modernizing Cray hardware management, while still providing optimized capabilities in areas such as telemetry that Cray has provided in the past, is considered. This paper is targeted at system administrators, systems designers, site planners, and anyone else wishing to learn more about trends and considerations for next-generation Cray hardware management. [A hedged example of a Redfish REST call appears after this session's listings.]

Managing the SMW as a git Branch
Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Randy Kleinman and Harold Longley (Cray Inc.)
Abstract: Modern software engineering/DevOps techniques applied to Cray XC systems management enable higher-fidelity translation from test to production environments, reduce administration costs by avoiding duplicate effort, and increase reliability. This can be done by using a git repository to track and manage all system configurations on the SMW(s), and then adapting a gitflow-like development methodology.

External Login Nodes at Scale
Georg Rath, Douglas Jacobsen, and Tina Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: NERSC is currently in the process of converting a Cray CS system to an 800-node eLogin cluster. We operate two high-throughput Cray CS systems (PDSF and Genepool) alongside four Cray XC production supercomputers (Cori, Edison, Gerty and Alva). With the adoption of CLE6 and our SMWFlow system management framework, the configuration management of the supercomputers was unified, which led to reduced cost of system management, consistency across machines and lower turnaround times for changes in system configuration. To extend these advantages to the high-throughput systems, we are replacing the current way of managing those systems, using xCAT and CFEngine, with one based on the Cray eLogin management platform. We will describe the current state of the project and how it will lead to a more consistent user experience across our systems and a marked decrease in operational effort. It will be the cornerstone of a common job submission system for all systems at NERSC.

Best Practices for Management and Operation of Large HPC Installations
Scott Lathrop, Celso Mendes, Jeremy Enos, Brett Bode, Gregory H Bauer, Roberto Sisneros, and William Kramer (National Center for Supercomputing Applications/University of Illinois)
Abstract: To achieve their mission and goals, HPC centers continually strive to improve their resources and services to best serve their constituencies. Collectively, the community has learned a great deal about how to manage and operate HPC centers, provide robust and effective services, develop new communities, and other important aspects. Yet cataloguing best practices to help inform and guide the broader HPC community is not often done. To improve this situation, the Blue Waters project has internally documented sets of best practices that have been adopted for the deployment and operation of the Blue Waters system, a large Cray XE6/XK7 supercomputer at NCSA, over the past five years. Those practices, described in this paper, cover several aspects of managing and operating the system, and supporting its users. Although these practices are particularly relevant for Cray systems, we believe that they would benefit the operation of other large HPC installations as well.
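Illustrative note: as background for the Redfish paper above (Modernizing Cray Systems Management), a Redfish service is an HTTPS REST endpoint rooted at /redfish/v1, as defined by the DMTF specification. The C sketch below issues a single GET against the standard Systems collection using libcurl; the hostname and credentials are placeholders, and nothing here is specific to Cray's implementation.

    /* redfish_get.c: fetch the Redfish systems collection from a BMC.
     * Generic illustration only; "bmc.example.com" and the credentials
     * are placeholders, not Cray-specific values.
     * Build with: cc redfish_get.c -lcurl
     */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        CURL *curl;
        CURLcode rc;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        curl = curl_easy_init();
        if (!curl) {
            fprintf(stderr, "failed to initialize libcurl\n");
            return 1;
        }

        /* Standard DMTF Redfish entry point for the systems collection. */
        curl_easy_setopt(curl, CURLOPT_URL,
                         "https://bmc.example.com/redfish/v1/Systems");
        curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:password");
        /* Lab-only convenience: skip certificate checks in this sketch. */
        curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0L);

        rc = curl_easy_perform(curl);   /* JSON body goes to stdout */
        if (rc != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }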
Paper Technical Session 20B
Chair: Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)

Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard
Simon McIntosh-Smith, James Price, Tom Deakin, and Andrei Poenaru (University of Bristol)
Abstract: In this paper we present performance results from Isambard, the first production supercomputer to be based on Arm CPUs that have been optimised specifically for HPC. Isambard is the first Cray XC50 Scout system, combining Cavium ThunderX2 Arm-based CPUs with Cray's Aries interconnect. The full Isambard system will be delivered in the summer of 2018, when it will contain over 10,000 Arm cores. In this work we present node-level performance results from eight early-access nodes that were upgraded to B0 beta silicon in March 2018. We present node-level benchmark results comparing ThunderX2 with mainstream CPUs, including Intel Skylake and Broadwell, as well as Xeon Phi. We focus on a range of applications and mini-apps important to the UK national HPC service, ARCHER, as well as to the Isambard project partners and the wider HPC community. We also compare performance across three major software toolchains available for Arm: Cray's CCE, Arm's version of Clang/Flang/LLVM, and GNU.

Evaluating Runtime and Power Requirements of Multilevel Checkpointing MPI Applications on Four Different Parallel Architectures: An Empirical Study
Xingfu Wu and Valerie Taylor (Argonne National Laboratory, The University of Chicago) and Zhiling Lan (Illinois Institute of Technology)
Abstract: While reducing execution time is still the major objective for high performance computing, future systems and applications will have additional power and resilience requirements that represent a multidimensional tuning challenge. In this paper we present an empirical study to evaluate the runtime and power requirements of multilevel checkpointing MPI applications using the FTI (Fault Tolerance Interface) library. We develop an FTI version of the Intel MPI benchmarks to evaluate how FTI affects MPI communication. We then conduct experiments with two applications, an MPI heat distribution application code (HDC), which is compute-intensive, and the benchmark STREAM, which is memory-intensive, on four parallel systems: Cray XC40, IBM BG/Q, Intel Haswell, and AMD Kaveri. We evaluate how checkpointing and bit-flip failure injection affect the application runtime and power requirements. The experimental results indicate that the runtime and power consumption for both applications vary across the different architectures. Both the Cray XC40 and the AMD Kaveri with dynamic power management exhibited the smallest impact, whereas the Intel Haswell without dynamic power management manifested the largest impact. Bit-flip failure injections with and without different bit positions for FTI had little impact on runtime and power. This provides a good starting point for understanding the tradeoffs among runtime, power and resilience on these architectures.

OpenACC and CUDA Unified Memory
Sebastien Deldon (NVIDIA/PGI), James Beyer (NVIDIA), and Doug Miles (NVIDIA/PGI)
Abstract: CUDA Unified Memory (UM) simplifies application development for GPU-accelerated systems by presenting a single memory address space to both CPUs and GPUs. Data allocated in UM can be read or written through the same virtual address by code running on either a CPU or an NVIDIA GPU.
OpenACC is a directive-based parallel programming model for both traditional shared-memory SMP systems and heterogeneous GPU-accelerated systems. It includes directives for managing data movement between levels of a memory hierarchy, and in particular between host and device memory on GPU-accelerated systems. OpenACC data directives can be safely ignored or even omitted on a shared-memory system, allowing programmers to focus on exposing and expressing parallelism rather than on underlying system details. This paper describes an implementation of OpenACC built on top of CUDA Unified Memory that provides dramatic productivity benefits for porting and optimization of Fortran, C and C++ programs to GPU-accelerated Cray systems. [A minimal illustrative sketch appears after this session's listings.]

Strategies to Accelerate VASP with GPUs Using OpenACC
Stefan Maintz and Markus Wetzstein (NVIDIA)
Abstract: We report on a porting effort of VASP (the Vienna Ab Initio Simulation Package) to GPUs using OpenACC. While useful to researchers, the existing CUDA C based port of VASP was hard to maintain due to source code duplication. We demonstrate a directive-based OpenACC adaptation for the most important DFT-level solvers available in VASP: RMM-DIIS and blocked-Davidson. A comparative performance study shows that the OpenACC effort can even significantly outperform the former port. No extensive code refactoring was necessary. Guidelines to manage device memory for heavily aggregated data structures are presented. These lead to cleaner code, lower the entry barrier to accelerating additional parts of VASP, and may also help accelerate other high-performance applications.
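Illustrative note: the two OpenACC papers above describe running directive-based code on top of CUDA Unified Memory so that explicit data directives become unnecessary. The generic C sketch below shows that style: a parallel loop over malloc'd arrays with no copyin/copyout clauses, assuming the compiler is asked to use managed memory (for example with a PGI flag such as -ta=tesla:managed). It is not code from either paper.

    /* saxpy_acc.c: OpenACC saxpy with no explicit data directives.
     * Illustrative sketch only. When compiled with unified/managed
     * memory enabled (e.g. pgcc -acc -ta=tesla:managed saxpy_acc.c),
     * the arrays allocated with malloc migrate automatically between
     * host and device, so no "copyin"/"copyout" clauses are needed.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 1 << 20;
        const float a = 2.0f;
        float *x = malloc(n * sizeof(float));
        float *y = malloc(n * sizeof(float));

        for (int i = 0; i < n; i++) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        /* Offloaded loop; data movement is handled by Unified Memory. */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f (expected 4.0)\n", y[0]);
        free(x);
        free(y);
        return 0;
    }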
Paper Technical Session 20C
Chair: Jim Williams (Los Alamos National Laboratory)

DataWarp Transparent Cache: Implementation, Challenges, and Early Experience
Benjamin R. Landsteiner (Cray Inc.) and David Paul (NERSC)
Abstract: DataWarp accelerates performance by making use of fast SSDs layered between a parallel file system (PFS) and applications. Transparent caching functionality provides a new way to accelerate application performance. By configuring the SSDs as a transparent cache to the PFS, DataWarp enables improved application performance without requiring users to manually manage the copying of their data between the PFS and DataWarp. We provide an overview of the implementation to show how easy it is to get started. We then cover some of the challenges encountered during implementation. We discuss our early experience on Cray development systems and on NERSC's Gerty supercomputer. We also discuss future work opportunities.

PBS Professional - Optimizing the "When & Where" of Scheduling Cray DataWarp Jobs
Scott J. Suchyta (Altair Engineering)
Abstract: Integrating Cray DataWarp with PBS Professional was not difficult. The challenge was identifying when and where it made sense to use the "applications I/O accelerator technology that delivers a balanced and cohesive system architecture from compute to storage." As users began to learn more about how their applications performed in this environment, it became clear that when and where jobs ran could greatly affect performance and efficiency. With DataWarp, job data is staged into a "special" storage object (the where), the job executes, and the data is staged out. The catch: minimize wasted compute cycles waiting for the data staging (the when).

DataWarp Transparent Caching: Data Path Implementation
Matt Richerson (Cray Inc.)
Abstract: DataWarp transparent cache uses SSDs located on the high-speed network to provide an implicit cache for the parallel filesystem. We'll look at which components make up the data path for DataWarp transparent cache and see how they interact with each other. The implementation of each component is discussed in depth, and we'll see how the design decisions affect performance for different I/O patterns.

Paper Technical Session 24A
Chair: Jim Rogers (Oak Ridge National Laboratory)

Trinity: Opportunities and Challenges of a Heterogeneous System
K. Scott Hemmert (Sandia National Laboratories); James Lujan (Los Alamos National Laboratory); David Morton, Hai Ah Nam, Paul Peltz, and Alfred Torrez (Los Alamos National Laboratory); Stan Moore (Sandia National Laboratories); Mike Davis and John Levesque (Cray Inc.); Nathan Hjelm and Galen Shipman (Los Alamos National Laboratory); and Michail Gallis (Sandia National Laboratories)
Abstract: This short paper is an expanded outline of a full paper on the high-level architecture, successes and challenges of Trinity, the first DOE ASC Advanced Technology System. Trinity is a Cray XC40 supercomputer which was delivered in two phases: a Haswell first phase and a Knights Landing second phase. The paper will describe the bring-up and acceptance of the KNL partition, as well as the merge of the system into a single heterogeneous supercomputer.

Cray Advanced Power Management Updates
Steven J. Martin, Greg J. Koprowski, and Sean J. Wallace (Cray Inc.)
Abstract: This paper will highlight the power management features of Cray's two newest blades for the XC50, and updates to the Cray PMDB (Power Management Database). The paper will first describe the power monitoring and control features of XC50 compute blades. It will then cover power management changes for the PMDB in the SMW 8.0.UP06 release; these database implementation changes improve PMDB performance and enhance HA support. This paper is targeted at system administrators and researchers involved in advanced power monitoring and management, power-aware computing, and energy efficiency.

Weathering the Storm – Lessons Learnt in Managing a 24x7x365 HPC Delivery Platform
Craig West (Australian Bureau of Meteorology)
Abstract: The Bureau of Meteorology is Australia's national weather agency. Its mandate covers weather forecasting, extreme weather events and operational advice to aviation, maritime, military and agriculture clients. In a country of significant weather extremes, checking the forecast is a daily ritual for most Australians, and "the BoM" provides one of the most widely used services in Australia.

Paper Technical Session 24B
Chair: David Paul (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

Using CAASCADE and CrayPAT for Analysis of HPC Applications
Reuben D. Budiardja, M. Graham Lopez, Oscar Hernandez, and Jack C. Wells (Oak Ridge National Laboratory)
Abstract: We describe our work on integrating CAASCADE with CrayPAT to obtain both static and dynamic information on characteristics of high-performance computing (HPC) applications. CAASCADE (Compiler-Assisted Application Source Code Analysis and Database) is a system we are developing to extract features of an application from its source code by utilizing compiler plugins. CrayPAT enables us to add runtime-based information to CAASCADE's feature detection. We present results from analysis of HPC applications.
Toward Automated Application Profiling on Cray Systems
Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook, and Samuel Williams (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Application performance data can be used by HPC users to optimize their code and prioritize their development efforts, and by HPC facilities to better understand their user base and guide their future procurements. This paper presents an exploration of six commonly used profiling tools in order to assess their ability to enable automated, passive performance data collection on Cray systems. Each tool is benchmarked with three applications with distinct performance characteristics, to collect five pre-selected metrics such as the total number of floating-point operations and memory bandwidth. Results are then used to evaluate the tools' usability, runtime overhead, amount of actionable information that they can provide, and accuracy of the information provided.

Roofline Analysis with Cray Performance Analysis Tools (CrayPat) and Roofline-based Performance Projections for a Future Architecture
JaeHyuk Kwack, Galen Arnold, Celso Mendes, and Gregory H. Bauer (National Center for Supercomputing Applications/University of Illinois)
Abstract: The roofline analysis model is a visually intuitive performance model used to understand hardware performance limitations as well as potential benefits of optimizations for science and engineering applications. Intel Advisor has provided a useful roofline analysis feature since its version 2017 update 2, but it is not widely compatible with other compilers and chip architectures. As an alternative, we have employed the Cray Performance Analysis Tools (CrayPat), which are more flexible across multiple compilers and architectures. First, we present our procedure for measuring a reliable computational intensity for roofline analysis. We performed several numerical studies for validation via manually derived reference data as well as data from Intel Advisor. Second, we provide roofline analysis results on Blue Waters for several HPC benchmarks and sparse linear algebra libraries. In addition, we present an example of roofline-based performance projection for a future system.
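Illustrative note: for readers new to the roofline model referenced above, the bound that CrayPat-based or Advisor-based measurements feed into is the standard one (general background, not a formula specific to the paper). For a kernel with computational (arithmetic) intensity I, measured in flops per byte moved to or from memory, the attainable performance is

    P_{\text{attainable}}(I) = \min\bigl( P_{\text{peak}},\ I \times B_{\text{mem}} \bigr)

where P_peak is the peak floating-point rate and B_mem is the peak bandwidth of the memory level under consideration; the kernel is memory-bound whenever I x B_mem < P_peak.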
Paper Technical Session 24C
Chair: Tim Robinson (Swiss National Supercomputing Centre)

Enabling Docker for HPC
Jonathan Sparks (Cray)
Abstract: Docker is quickly becoming the de facto standard for containerization. Besides running on all the major Linux distributions, Docker is supported by all the main cloud platform providers. The Docker ecosystem provides the capabilities necessary to build, manage, and execute containers on any platform. High-performance computing (HPC) systems present their own unique set of challenges to the standard deployment of Docker with respect to scaling image storage, user security, and access to host-level resources. This paper presents a set of Docker API plugins and features to address the HPC concerns of scaling, security, resource access, and execution in an HPC environment.

Installation, Configuration and Performance Tuning of Shifter V16 on Blue Waters
Hon Wai Leong, Timothy A. Bouvet, Brett Bode, Jeremy J. Enos, and David A. King (National Center for Supercomputing Applications/University of Illinois)
Abstract: NCSA recently announced the availability of Shifter version 16.08.3 (V16) for production use on Blue Waters. Shifter provides researchers with the capability to execute container-based HPC applications on Blue Waters. In this paper, we present the procedure that we followed to backport Shifter V16 to Blue Waters. We describe the details of the installation of the Shifter software stack, code customization, configuration, and the complex integration efforts required to scale Shifter jobs to start in parallel on a few thousand compute nodes. We also discuss the methods and workarounds that we used to address the challenges encountered during the deployment, including security hardening, performance tuning, running GPU workloads, and other operational issues. Today, we have successfully tuned Shifter to the point where it can execute a container-based job on Blue Waters across more than 4,000 compute nodes.

Incorporating a Test and Development System Within the Production System
Nicholas Cardo and Marco Induni (Swiss National Supercomputing Centre)
Abstract: Test and development systems (TDS) often get traded off for investments in more computational capability. However, the value a TDS can contribute to the overall success of a production resource is tremendous. The Swiss National Supercomputing Centre (CSCS) has developed a way to provide TDS capabilities on a Cray CS-Storm system by utilizing the production hardware, with only a small investment. An understanding of the system architecture will be provided, leading up to the creation of a TDS on the production hardware, without removing the system from production operations.

Paper Technical Session 25A
Chair: Tim Robinson (Swiss National Supercomputing Centre)

TensorFlow at Scale - MPI, RDMA and All That
Thorsten Kurth (Lawrence Berkeley National Laboratory), Mikhail Smorkalov (Intel Corporation), Peter Mendygral (Cray Inc.), and Srinivas Sridharan and Amrita Mathuriya (Intel Corporation)
Abstract: Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus an increasing amount of computing resources is required in order to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks which allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both good performance and flexibility. In this paper we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems.

High Performance Scalable Deep Learning with the Cray Programming Environments Deep Learning Plugin
Peter Mendygral, Nick Hill, Krishna Kandalla, Diana Moise, and Jacob Balma (Cray Inc.) and Marcel Schongens (Swiss National Supercomputing Centre)
Abstract: Deep learning (DL) with neural networks is emerging as a critical tool for academia and industry, with transformative potential for a wide variety of problems. The amount of computational resources needed to train sufficiently complex networks can limit the use of DL in production, however. High performance computing (HPC) techniques, in particular efficient scaling to large numbers of nodes, are ideal for addressing this problem. This paper describes the Cray Programming Environments Machine Learning Plugin (CPE ML Plugin), a framework-portable solution for high-performance scaling of DL.
Performance on Cray platforms and a selection of neural network topology implementations using TensorFlow are described.

Performance evaluation of parallel computing and Big Data processing with Java and PCJ library
Marek Nowicki (Nicolaus Copernicus University in Toruń) and Łukasz Górski and Piotr Bała (ICM University of Warsaw)
Abstract: In this paper, we present PCJ (Parallel Computing in Java), a novel tool for scalable high-performance computing and big data processing in Java. PCJ is a Java library implementing the PGAS (Partitioned Global Address Space) programming paradigm. It allows for the straightforward development of computational applications as well as Big Data processing. The use of Java brings HPC and Big Data processing together and enables running on different types of hardware. In particular, the high scalability and good performance of PCJ applications have been demonstrated using Cray XC40 systems.

Paper Technical Session 25B
Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration)

Leveraging MPI RMA to optimise halo-swapping communications in MONC on Cray machines
Michael Bareford and Nick Brown (EPCC, The University of Edinburgh)
Abstract: Remote Memory Access (RMA), also known as single-sided communication, provides a way of reading and writing directly into the memory of other processes without having to issue explicit message-passing-style communication calls. Previous studies have concluded that MPI RMA can provide increased communication performance over traditional MPI point-to-point (P2P) communication, but these are based on synthetic benchmarks rather than real-world codes. In this work, we replace the existing non-blocking P2P communication calls in the Met Office NERC Cloud model, a mature code for modelling the atmosphere, with MPI RMA. We describe our approach in detail and discuss the options taken for correctness and performance. Experiments are performed on ARCHER, a Cray XC30, and Cirrus, an SGI ICE machine. We demonstrate on ARCHER that by using RMA we can obtain between a 5% and 10% reduction in communication time at each timestep on up to 32,768 cores, which over the entirety of a run (with many timesteps) results in a significant improvement in performance compared to P2P on the Cray. However, RMA is not a silver bullet, and there are challenges when integrating RMA calls into existing codes: important optimisations are necessary to achieve good performance, and library support is not universally mature, as is the case on Cirrus. In this paper we discuss, in the context of a real-world code, the lessons learned converting P2P to RMA, explore performance and scaling challenges, and contrast alternative RMA synchronisation approaches in detail.
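Illustrative note: the MONC paper above replaces point-to-point halo swaps with RMA. The generic C sketch below shows the basic shape of such an exchange for a 1D periodic decomposition, using the simplest synchronisation mode (MPI_Win_fence); it is not the authors' code, and the paper contrasts several RMA synchronisation approaches beyond this one.

    /* halo_rma.c: 1D halo exchange using MPI RMA (fence synchronisation).
     * Generic illustration, not the MONC implementation.
     */
    #include <mpi.h>
    #include <stdio.h>

    #define N 8   /* local interior points */

    int main(int argc, char **argv)
    {
        int rank, size;
        /* field[0] and field[N+1] are halo cells filled by neighbours. */
        double field[N + 2];
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 1; i <= N; i++)
            field[i] = rank;                  /* dummy interior data */

        int left  = (rank - 1 + size) % size; /* periodic neighbours */
        int right = (rank + 1) % size;

        /* Expose the whole local array, halos included, as a window. */
        MPI_Win_create(field, (N + 2) * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        /* My first interior cell goes into the right halo (disp N+1) of
         * my left neighbour; my last interior cell goes into the left
         * halo (disp 0) of my right neighbour. */
        MPI_Put(&field[1], 1, MPI_DOUBLE, left,  N + 1, 1, MPI_DOUBLE, win);
        MPI_Put(&field[N], 1, MPI_DOUBLE, right, 0,     1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        printf("rank %d halos: left=%g right=%g\n",
               rank, field[0], field[N + 1]);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }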
This presentation highlights some of the recently added features and provides a preview of what’s coming next for users of the Cray tools. Performance Study of Popular Computational Chemistry Softwares on Cray HPC Junjie Li, Shijie Sheng, and Ray Sheppard (Indiana University) Abstract Abstract For decades, chemistry has been one of the major scientific fields that rely heavily on simulation. Today, 8 of the top 15 HPC applications are computational chemistry programs, and they are all GPU accelerated. At Indiana University, where multiple large Cray systems are installed, computational chemistry represents approximately 40% of the total workload; therefore, understanding the performance and limitations of these applications is crucial to improving the throughput of scientific research as well as the quality of our service. In this work, numerous highly popular computational chemistry programs are studied on Cray XE6/XK7 and Cray XC30 with respect to 1) performance tuning and profiling, 2) comparison of different parallelization paradigms including MPI, MPI+CUDA, MPI+OpenMP, etc., and 3) scalability using Cray’s different interconnects. Paper Technical Session 25C Chair: Paul L. Peltz Jr. (Los Alamos National Laboratory) The Role of SSD Block Caches in a World with Networked Burst Buffers Torben Petersen and Bill Loewe (Cray) Abstract Abstract Cray’s NXD storage appliance transparently uses an SSD cache behind the block controller in a hard drive disk array. In a world with SSDs in separate storage tiers altogether, like Cray’s DataWarp system, the role of additional SSDs in the hard drive tier might seem redundant. This perceived redundancy grows with upcoming parallel file system features such as Lustre’s Data on MDS. However, in this talk, we will identify and discuss the different workloads that uniquely benefit from SSDs in a variety of different locations within the storage stack. We will show the counter-intuitive benefits of the seemingly redundant NXD in a world with networked SSDs. Use of View for ClusterStor to monitor and diagnose performance of your Lustre filesystem Patricia Langer (Cray Inc.) Abstract Abstract View for ClusterStor continuously collects metric data about the health of your ClusterStor Lustre filesystem and correlates this data with job information from your Cray XC system. By combining Lustre performance metrics, job statistics, and job details, administrators have visibility into the health and performance of their Lustre filesystem(s) and what job or jobs may be contributing to performance concerns. This presentation will discuss techniques for identifying and isolating potential job performance issues with your Lustre filesystem using View for ClusterStor. Nuclear Meltdown?: Assessing the impact of the Meltdown/Spectre bug at Los Alamos National Laboratory Joseph Fullop and Jennifer Green (Los Alamos National Laboratory) Abstract Abstract With the recent revelation of the Meltdown/Spectre bug, much speculation has been levied on the performance impact of the fixes to avoid potential compromises. Single-node, single-process codes have had very high impact estimates, but little is known about the extent to which the early patches will affect large-scale MPI jobs. Will the delays cascade into jitter vapor lock, or will the delays get lost in the waves on the ocean? Or is reality somewhere in the middle? With flagship codes and their performance driving future systems design and purchase, it is important to understand the new normal.
In a small-scale parameter study, we review how the updates affect the job performance of various benchmark codes as well as our mainstay workloads across different job sizes and architectures. Paper Technical Session 26A Chair: Bilel Hadri (KAUST Supercomputing Lab) Optimised all-to-all Communication on Multicore Architectures Applied to FFTs with Pencil Decomposition Andreas Jocksch and Matthias Kraushaar (Swiss National Supercomputing Centre) and David Daverio (University of Cambridge) Abstract Abstract All-to-all communication is a basic functionality of parallel communication libraries such as the Message Passing Interface (MPI). Typically, multiple underlying algorithms are available and are chosen according to message size. We propose a communication algorithm which exploits the fact that modern supercomputers combine shared-memory parallelism and distributed-memory parallelism. The application example of our algorithm is FFTs with pencil decomposition. Furthermore, we propose an extension of the MPI standard in order to accommodate this and other algorithms in an efficient way. Eigensolver Performance Comparison on Cray XC Systems Brandon Cook, Thorsten Kurth, and Jack Deslippe (Lawrence Berkeley National Laboratory) and Nick Hill, Pierre Carrier, and Nathan Wichmann (Cray Inc.) Abstract Abstract Scalable dense symmetric eigensolvers are important for the performance of a wide class of HPC applications, including many materials science and chemistry applications. In this paper, we create a benchmark for exploring the performance of two leading libraries, the incumbent ScaLAPACK library and the newer ELPA library, for arbitrary matrix size. We include a performance study of these two libraries by varying matrix size, node count, MPI ranks, OpenMP threads, and system architecture (including KNL and Haswell). We demonstrate that choosing optimal parameters for a given matrix makes a significant difference in walltime and provide a tool to help users generate an optimal configuration. On the Use of Vectorization in Production Engineering Workloads Courtenay Vaughan, Jeanine Cook, Robert Benner, Dennis Dinge, Paul Lin, Clayton Hughes, Robert Hoekstra, and Simon Hammond (Sandia National Laboratories) Abstract Abstract Many recent high-performance computing platforms have seen a resurgence in the use of vector or vector-like hardware units. Some have argued that efficient GPU kernels must be written with vector programming in mind; of particular interest to the authors of this paper is the use of wide vector units in Intel’s recent Knights Landing, Haswell, and, most recently, Skylake server processors. Paper Technical Session 26B Chair: Jim Williams (Los Alamos National Laboratory) Unikernels/Library Operating Systems for HPC Jonathan Sparks (Cray) Abstract Abstract Both OS virtualization and containerization have become core technologies in enterprise data centers and for the cloud. Containers offer users an agile, lightweight methodology to build and deploy applications. OS virtualization, on the other hand, brings security, isolation, flexibility, and better sharing and utilization of hardware resources. Still, a container remains dependent on the host kernel and the services it provides. To avoid this dependency, a different approach can be taken: a developer can build an application against a set of modular OS libraries. These OS libraries are then compiled directly into the kernel, resulting in a portable, self-contained minimal application called a unikernel.
A unikernel implements the bare minimum of operating system functions - just enough to enable the application to execute in a secure and isolated environment. This paper will investigate the use of unikernels, along with tools and technologies for building and launching HPC unikernels on supercomputer systems. GPU Usage Reporting Nicholas Cardo, Mark Klein, and Miguel Gila (Swiss National Supercomputing Centre) Abstract Abstract For systems with accelerators, such as Graphics Processing Units (GPUs), it is vital to understand their usage in solving scientific problems. Not only is this important to understand for current systems, but it also provides insight into future system needs. However, there are limitations and challenges that have prevented reliable statistics capture, recording, and reporting. The Swiss National Supercomputing Centre (CSCS) has developed a mechanism for capturing and storing GPU statistical information for each batch job. Additionally, a batch job summary report has been developed to display useful statistics about the job, including GPU utilization statistics. This paper will discuss the challenges that needed to be overcome along with the design and implementation of the solution. Instrumenting Slurm Command Line Commands to Gain Workload Insight Douglas Jacobsen and Zhengji Zhao (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Understanding user behavior and user interactions with an HPC system is of critical importance when planning future work, reviewing existing policies, and debugging specific issues on the system. The batch system and scheduler interface represent the user interface to the system, and thus collecting data from the batch system about user commands in a systematic, structured way generates a valuable dataset revealing user behavior. Using the in-development cli_filter plugin capability of Slurm, we collect all user options for all job submission and application startup requests (including failed submissions that are never transmitted to the server), ALTD application data, and desired portions of the user environment in each job and job step. The cli_filter plugins also allow client-side policy enforcement and enable user-definable functionality (like setting user-specified defaults for specific options). We discuss our implementation and how we gather and analyze these data scalably on our Cray systems. Paper Technical Session 26C Chair: Chris Fuson (ORNL) How to implement the Sonexion RestAPI and correlate it with SEDC and other data. Cary Whitney (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract I am using the Sonexion RestAPI to gather Lustre statistics from our Lustre filesystems to replace the current LMT. I will be talking about what it took to implement it and the types of data being gathered. I will also cover how we are combining this information with the current SEDC, facilities, and environmental data to help debug and troubleshoot user issues. Usage and Performance of libhio on XC-40 Systems Nathan Hjelm and Howard Pritchard (Los Alamos National Lab) and Cornell Wright (Los Alamos National Laboratory) Abstract Abstract High performance systems are rapidly increasing in size and complexity. To keep up with the Input/Output (IO) demands of applications and to provide improved functionality, performance, and cost, IO subsystems are also increasing in complexity.
To help applications utilize and exploit the increased functionality and improved performance of this more complex environment, we developed a user-space Hierarchical Input/Output (HIO) library: libHIO. In this paper we discuss the libHIO Application Programming Interface (API) and its usage on XC-40 systems with both DataWarp and Lustre filesystems. We detail the performance of libHIO on the Trinity supercomputer, a Cray XC-40 at Los Alamos National Lab (LANL), with multiple user applications at large scale. Improved I/O Using Native Spectrum Scale (GPFS) Clients on a Cray XC System Jesse A. Hanley, Chris J. Muzyn, and Matt A. Ezell (Oak Ridge National Laboratory) Abstract Abstract The National Center for Computational Sciences (NCCS) created a method for natively routing communication between Cumulus, a Cray Rhine/Redwood XC40 supercomputer, and Wolf, a Spectrum Scale file system, using IP over InfiniBand (IPoIB) and native Linux kernel tools. Spectrum Scale lacks a routing facility like Lustre’s LNET capability. To facilitate communication between storage and compute, Cumulus originally projected Wolf using Cray’s Data Virtualization Service (DVS). The lack of native file system support impacted users as they ported workflows to Cumulus. To support these and future use cases, the DVS projection method has been replaced by a native Spectrum Scale cluster on Cumulus that routes traffic to and from Wolf at comparable performance. This paper presents an introduction to the systems involved, a summary of motivations for the work, challenges faced, and details about the native routing configuration. Sponsor Talk Sponsor Talk 5 Chair: Trey Breckenridge (Mississippi State University) Sponsor Talk Sponsor Talk 6 Chair: Trey Breckenridge (Mississippi State University) PBS Works 2018 and Beyond: Multi-sched, Power Rate Ramp Limiting, Soft walltime, Job Equivalence Classes, …, Native Mode Scott J. Suchyta (Altair Engineering) Abstract Abstract At Altair, we believe HPC improves people’s lives - that’s why we focus on technologies to better access HPC, to better optimize HPC, and to better control HPC. PBS Works 2018 combines an amazingly natural user experience for engineers and researchers with new system-level optimizations and superior administrator controls. In particular, PBS Professional v18 merges two code-lines, combining the Cray-specific features from v13 with all the features generally available on other platforms. PBS Pro v18 brings: multi-sched (parallel job scheduling that speeds throughput and retains a single-system view), power rate ramp limiting (energy management that minimizes demand spikes), soft walltime (scheduler advice enabling better utilization and job turnaround by optimizing backfilling), job equivalence classes (an internal data structure that groups like jobs together, greatly speeding scheduling), and numerous enhancements and fixes. Finally, in 2018, Altair and Cray engineering teams have begun joint development for PBS Pro to support Native mode on the Cray XC environment. Sponsor Talk Sponsor Talk 12 Chair: Trey Breckenridge (Mississippi State University) PGI Compilers for Accelerated Computing Doug Miles (NVIDIA) Abstract Abstract Speaker: Doug Miles, Director of PGI Compilers & Tools, NVIDIA. Sponsor Talk Sponsor Talk 13 Chair: Trey Breckenridge (Mississippi State University) Making a Large Investment in HPC? Think Bright Lee Carter (Bright Computing) Abstract Abstract Maximizing your HPC cluster investment is imperative in the current economic climate.
If your organization is planning on - or in the process of - making a large investment in HPC, then this is the presentation for you. Lee Carter will highlight how Bright’s integration with Cray’s CS series of cluster supercomputers gives organizations the ability to create an agile clustered infrastructure ready to tackle the demands of compute- and data-intensive workloads. Lee will unveil the latest features in Bright Cluster Manager, including Workload Accounting and Reporting, which shows how effectively resources are being used, providing vital information to ensure maximum return on your HPC investment. Sponsor Talk Sponsor Talk 18 Chair: Trey Breckenridge (Mississippi State University) Applying DDN to Machine Learning Jean-Thomas Acquaviva (DataDirect Networks) Abstract Abstract While deep learning’s impact is shaking the industry, DDN, by applying its HPC technologies and know-how, delivers 10 times higher performance than competing enterprise solutions. These improvements are robust and apply to more types of data using a wide variety of techniques. It also allows machine learning and deep learning programs to start small for proof of concept and scale to production-level performance and petabytes per rack with no additional architecting required. Sponsor Talk Sponsor Talk 19 Chair: Trey Breckenridge (Mississippi State University) Tutorial Tutorial 1B Chair: Harold Longley (Cray Inc.) Managing SMW/eLogin and ARM XC nodes Harold Longley, Eric Cozzi, Jeff Keopp, and Mark Ahlstrom (Cray Inc.) Abstract Abstract This year’s system management tutorial for Cray XC systems with CLE 6.0.UP06 covers exciting new topics, including managing the eLogin nodes from the SMW and tool changes to support ARM nodes. Tutorial Tutorial 1C Chair: Michael Ringenburg (Cray, Inc) Analytics and Artificial Intelligence Workloads on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) Abstract Abstract Over the last few years, Artificial Intelligence (AI) and Data Analytics have emerged as critical use cases for supercomputing resources. This tutorial will describe a variety of analytics and AI frameworks, and show how these frameworks can be run on Cray XC systems. We will also provide tips for maximizing performance, and show how Cray’s Urika-XC stack brings an optimized and fully integrated analytics and AI stack to Cray XC systems with Shifter containers. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray XC systems. Tutorial Tutorial 1D Chair: John Levesque (Cray Inc.) Applying a “Whack-a-Mole” Method Using Cray perftools to Identify the Moles John Levesque (Cray Inc.) Abstract Abstract Over the past several years, Cray has strived to make obtaining and analyzing application performance information easier for the developer. Ease-of-use is especially important because the task of application profiling is not typically performed as regularly as, for example, compilation, and trying to remember how to use a tool can be an easy deterrent when considering whether or not to tackle application performance tuning. This tutorial will cover a recommended process of using the Cray performance tools to identify key bottlenecks (moles) in a program, and then reduce/remove (whack) them with some innovative optimization techniques. In addition to demonstrating the ease of using perftools-lite experiments, we will discuss how to interpret data from the generated reports.
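A short illustration of the kind of instrumentation this tutorial builds on: the C sketch below marks a suspected bottleneck (a "mole") as a named region with the CrayPat API from pat_api.h, so that it shows up as its own entry in a perftools report. This is an illustrative sketch rather than tutorial material; the region id, the do_work kernel, and the assumption that a perftools or perftools-lite module is loaded at build time are hypothetical details to adapt to your own code and system.

    /* Sketch: mark a candidate bottleneck ("mole") as a named CrayPat region
     * so it appears explicitly in perftools/perftools-lite reports.
     * Assumes a perftools module is loaded at build time so that pat_api.h
     * and the CrayPat runtime are available, e.g.: cc -o mole mole.c */
    #include <stdio.h>
    #include <pat_api.h>

    #define REGION_WORK 1   /* user-chosen region id (hypothetical) */

    /* Hypothetical stand-in for the expensive kernel under investigation. */
    static double do_work(long n)
    {
        double sum = 0.0;
        for (long i = 0; i < n; i++)
            sum += (double)i * 0.5;
        return sum;
    }

    int main(void)
    {
        /* Work executed between begin/end is attributed to the named
         * region "suspected_mole" in the generated report. */
        PAT_region_begin(REGION_WORK, "suspected_mole");
        double result = do_work(10000000L);
        PAT_region_end(REGION_WORK);

        printf("result = %f\n", result);
        return 0;
    }

Rebuilding with perftools-lite loaded and rerunning the job would then produce a report in which the "suspected_mole" region appears on its own line, which is the kind of evidence the whack-a-mole process uses to decide which bottleneck to attack next.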
Tutorial Tutorial 1B Continued Chair: Harold Longley (Cray Inc.) Managing SMW/eLogin and ARM XC nodes Harold Longley, Eric Cozzi, Jeff Keopp, and Mark Ahlstrom (Cray Inc.) Abstract Abstract This year’s system management tutorial for Cray XC systems with CLE 6.0.UP06 covers exciting new topics, including managing the eLogin nodes from the SMW and tool changes to support ARM nodes. Tutorial Tutorial 1C Continued Chair: Michael Ringenburg (Cray, Inc) Analytics and Artificial Intelligence Workloads on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) Abstract Abstract Over the last few years, Artificial Intelligence (AI) and Data Analytics have emerged as critical use cases for supercomputing resources. This tutorial will describe a variety of analytics and AI frameworks, and show how these frameworks can be run on Cray XC systems. We will also provide tips for maximizing performance, and show how Cray’s Urika-XC stack brings an optimized and fully integrated analytics and AI stack to Cray XC systems with Shifter containers. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray XC systems. Tutorial Tutorial 1D Continued Chair: John Levesque (Cray Inc.) Applying a “Whack-a-Mole” Method Using Cray perftools to Identify the Moles John Levesque (Cray Inc.) Abstract Abstract Over the past several years, Cray has strived to make obtaining and analyzing application performance information easier for the developer. Ease-of-use is especially important because the task of application profiling is not typically performed as regularly as, for example, compilation, and trying to remember how to use a tool can be an easy deterrent when considering whether or not to tackle application performance tuning. This tutorial will cover a recommended process of using the Cray performance tools to identify key bottlenecks (moles) in a program, and then reduce/remove (whack) them with some innovative optimization techniques. In addition to demonstrating the ease of using perftools-lite experiments, we will discuss how to interpret data from the generated reports. Tutorial Tutorial 2B Chair: Harold Longley (Cray Inc.) Managing SMW/eLogin and ARM XC nodes Harold Longley, Eric Cozzi, Jeff Keopp, and Mark Ahlstrom (Cray Inc.) Abstract Abstract This year’s system management tutorial for Cray XC systems with CLE 6.0.UP06 covers exciting new topics, including managing the eLogin nodes from the SMW and tool changes to support ARM nodes. Tutorial Tutorial 2C Chair: Michael Ringenburg (Cray, Inc) Analytics and Artificial Intelligence Workloads on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) Abstract Abstract Over the last few years, Artificial Intelligence (AI) and Data Analytics have emerged as critical use cases for supercomputing resources. This tutorial will describe a variety of analytics and AI frameworks, and show how these frameworks can be run on Cray XC systems. We will also provide tips for maximizing performance, and show how Cray’s Urika-XC stack brings an optimized and fully integrated analytics and AI stack to Cray XC systems with Shifter containers. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray XC systems. Tutorial Tutorial 2D Chair: Benjamin Landsteiner (Cray Inc.) DataWarp Administration Tutorial Benjamin Landsteiner (Cray Inc.)
and David Paul (LBNL/NERSC) Abstract Abstract The DataWarp ecosystem includes fast SSD hardware, Cray IO and management software, and workload manager software. From the smallest test system to the largest production environment with hundreds of DataWarp nodes, the procedures for administering a DataWarp installation are the same. We will explore how to perform basic initial configuration of your DataWarp system and how to fine-tune settings to best meet the needs of the expected application workload. Armed with real-world experience and examples, we will show how to recognize and fix common problems. For everything else, we will show you the tools needed to properly troubleshoot and debug. This includes examining how all of the major components interact with each other over the lifetime of a DataWarp-enabled batch job, log file analysis, and an introduction to scripts that assist with analysis. Tutorial Tutorial 2B Continued Managing SMW/eLogin and ARM XC nodes Harold Longley, Eric Cozzi, Jeff Keopp, and Mark Ahlstrom (Cray Inc.) Abstract Abstract This year’s system management tutorial for Cray XC systems with CLE 6.0.UP06 covers exciting new topics, including managing the eLogin nodes from the SMW and tool changes to support ARM nodes. Tutorial Tutorial 2C Continued Chair: Michael Ringenburg (Cray, Inc) Analytics and Artificial Intelligence Workloads on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) Abstract Abstract Over the last few years, Artificial Intelligence (AI) and Data Analytics have emerged as critical use cases for supercomputing resources. This tutorial will describe a variety of analytics and AI frameworks, and show how these frameworks can be run on Cray XC systems. We will also provide tips for maximizing performance, and show how Cray’s Urika-XC stack brings an optimized and fully integrated analytics and AI stack to Cray XC systems with Shifter containers. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray XC systems. Tutorial Tutorial 2D Continued DataWarp Administration Tutorial Benjamin Landsteiner (Cray Inc.) and David Paul (LBNL/NERSC) Abstract Abstract The DataWarp ecosystem includes fast SSD hardware, Cray IO and management software, and workload manager software. From the smallest test system to the largest production environment with hundreds of DataWarp nodes, the procedures for administering a DataWarp installation are the same. We will explore how to perform basic initial configuration of your DataWarp system and how to fine-tune settings to best meet the needs of the expected application workload. Armed with real-world experience and examples, we will show how to recognize and fix common problems. For everything else, we will show you the tools needed to properly troubleshoot and debug. This includes examining how all of the major components interact with each other over the lifetime of a DataWarp-enabled batch job, log file analysis, and an introduction to scripts that assist with analysis.
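To complement the administrator-focused tutorial above with the user-facing side of the same lifecycle, the C sketch below shows a job writing a file into its DataWarp allocation and explicitly staging it out with the libdatawarp API. This is a hedged sketch, not tutorial content: the datawarp.h header, the dw_stage_file_out/dw_wait_file_stage prototypes and the DW_STAGE_IMMEDIATE flag, the DW_JOB_STRIPED environment variable, and the /lus/scratch destination path are all assumptions to verify against the libdatawarp documentation for your CLE release.

    /* Sketch: explicit stage-out from a DataWarp job allocation using the
     * libdatawarp user API (assumed header datawarp.h, link with -ldatawarp).
     * All names and prototypes here should be checked against Cray's
     * libdatawarp documentation before use. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <datawarp.h>

    int main(void)
    {
        /* Path of the job's striped DataWarp instance, assumed to be
         * exported by the workload manager for #DW jobdw allocations. */
        const char *dw_root = getenv("DW_JOB_STRIPED");
        if (dw_root == NULL) {
            fprintf(stderr, "no DataWarp allocation visible to this job\n");
            return 1;
        }

        char dw_file[4096];
        snprintf(dw_file, sizeof(dw_file), "%s/checkpoint.dat", dw_root);

        /* Stand-in for application output written into the DataWarp namespace. */
        FILE *fp = fopen(dw_file, "w");
        if (fp == NULL) { perror("fopen"); return 1; }
        fputs("checkpoint payload\n", fp);
        fclose(fp);

        /* Hypothetical destination on the parallel file system. */
        const char *pfs_file = "/lus/scratch/username/checkpoint.dat";

        /* Request an immediate copy-out to the PFS rather than waiting
         * for end-of-job stage-out. */
        if (dw_stage_file_out(dw_file, pfs_file, DW_STAGE_IMMEDIATE) != 0) {
            fprintf(stderr, "dw_stage_file_out failed\n");
            return 1;
        }

        /* Block until the stage-out of this file completes. */
        if (dw_wait_file_stage(dw_file) != 0) {
            fprintf(stderr, "dw_wait_file_stage failed\n");
            return 1;
        }

        printf("checkpoint staged out to %s\n", pfs_file);
        return 0;
    }

A small end-to-end test like this, run on a test system before production rollout, exercises the same kind of component interactions (workload manager, DataWarp service, and parallel file system) that the tutorial's troubleshooting and log-analysis material examines.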