Birds of a Feather BoF 3A
Chair: Matt Allen (Indiana University)

Systems Support SIG Meeting
Matthew Allen (Indiana University)
Abstract: The Systems Support Special Interest Group provides an opportunity for CUG members to communicate directly with senior Cray support representatives, discuss outstanding support issues, and share lessons learned.

Birds of a Feather BoF 3B
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Programming Environments, Applications, and Documentation (PEAD) Special Interest Group meeting
Bilel Hadri (KAUST Supercomputing Lab)
Abstract: The mission of the Programming Environments, Applications and Documentation Special Interest Group ("the SIG") is to provide a forum for the exchange of information related to the usability and performance of programming environments (including compilers, libraries and tools) and scientific applications running on Cray systems. Related topics in user support and communication (e.g. documentation) are also covered by the SIG.

Birds of a Feather BoF 3C
Chair: Sadaf R. Alam (CSCS)

Tools and Utilities for Data Science Workloads and Workflows
Sadaf R. Alam (Swiss National Supercomputing Centre), Mike Ringenburg (Cray Inc.), and Maxime Martinasso (Swiss National Supercomputing Centre)
Abstract: The goal of this BOF is to share experiences in using data science software packages, tools and utilities in HPC environments. These include packages and solutions that HPC sites offer as a service, such as Jupyter for interactive computing and the Cray Urika-XC software suite. A further goal of this BOF is to identify opportunities and challenges that the HPC community is facing in order to offer integrated solutions for HPC and data science workloads and workflows.

Birds of a Feather BoF 12A
Chair: Sadaf R. Alam (CSCS)

Third Annual Meeting on Opportunities for containers in HPC ecosystems
Sadaf R. Alam (Swiss National Supercomputing Centre), Shane Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Lucas Benedicic (Swiss National Supercomputing Centre), and Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Several container solutions have emerged in the past few years to take advantage of this technology in high performance computing environments. This BOF is the third annual meeting on creating a community around container solutions within and beyond Cray ecosystems. We will share updates, experiences and challenges across multiple Cray sites using container technologies in production on hybrid and heterogeneous Petascale systems. We present an architectural design for extensibility and community engagement. We also discuss opportunities for engagement and integration into the broader Docker community.

Birds of a Feather BoF 12B
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Managing Effectively the User Software Ecosystem
Bilel Hadri (KAUST Supercomputing Lab); Peggy Sanchez (Cray Inc.); Guilherme Peretti-Pezzi (Swiss National Supercomputing Centre); Christopher Fuson (Oak Ridge National Laboratory); and Yun He and Mario Melara (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: This is a follow-up to the CUG 2018 BoF on Managing Effectively the User Software Ecosystem.

Birds of a Feather BoF 12C
Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory)

Large-scale System Acceptance Testing: Procedures, Tests, and Automation
Veronica G. Vergara Larrea and Reuben Budiardja (Oak Ridge National Laboratory (ORNL))
Abstract: The increasing complexity of High Performance Computing (HPC) architectures requires a larger number of tests in order to thoroughly evaluate a new system before it is accepted and transitioned to production. Large-scale systems, in particular, test the boundaries of new technologies, as vendors often do not have an internal system of the same scale to test on before shipping it to the customer site. For that reason, in many cases, HPC centers run hundreds of tests to verify the functionality, performance, and stability of both the hardware and the software stack.

Birds of a Feather BoF 24C
Chair: Colin McMurtrie (Swiss National Supercomputing Centre)

Open Discussion with CUG Board
Colin McMurtrie (Swiss National Supercomputing Centre)
Abstract: This session is designed as an open discussion with the CUG Board, but there are a few high-level topics that will also be on the agenda. The discussion will focus on a corporation update, feedback on increasing CUG participation, and feedback on SIG structure and communication. An open-floor question and answer period will follow these topics. Formal voting (on candidates and the bylaws) will open after this session, so any candidates or members with questions about the process are welcome to bring up those topics.

New Site New Site 8
Chair: Brian Skjerven (Pawsey Supercomputing Centre)

New Site New Site 17
Chair: Brian Skjerven (Pawsey Supercomputing Centre)

New Site Site Talk 21
Chair: Trey Breckenridge (Mississippi State University)

Paper/Presentation Technical Session 10A
Chair: Jim Rogers (Oak Ridge National Laboratory)

Shasta System Management Overview
Harold W. Longley (Cray Inc.)
Abstract: What should you expect in Cray Shasta systems? "The future is seldom the same as the past", said Seymour Cray. The Cray Shasta system has new hardware, more flexibility in choice of operating system, and a new DevOps-ready system management paradigm. The new architecture has containerized microservices with well-documented RESTful APIs, orchestrated by Kubernetes to provide highly available, resilient services on management nodes and to enable continuous operation of scalable computational resources. Hardware management, network management, image management, configuration management, and booting processes can now be managed via DevOps methods. Authentication and authorization protect critical resources. There are enhanced tools for the collection, monitoring, and analysis of telemetry and log data.

Shasta System Monitoring Framework
Patricia Langer (Cray)
Abstract: Monitoring of components within an HPC system is critical to understanding what is going on within the system and to aid in diagnosing component failures and/or application performance degradation. There are many aspects to monitoring which help paint the picture of system health, including hardware monitoring, network monitoring, and overall system monitoring.

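The monitoring abstracts above are about turning telemetry streams into actionable signals of failing or degrading components. As a minimal, hypothetical illustration (not Cray's monitoring stack), the sketch below applies a rolling z-score to a stream of node temperature readings and flags outliers; the metric, window and threshold are assumptions chosen for the example.

```python
from collections import deque
import statistics

def flag_anomalies(samples, window=60, threshold=3.0):
    """Yield (index, value) pairs whose z-score against a rolling window
    of recent samples exceeds the threshold.

    Toy detector for illustration only; production telemetry pipelines
    (e.g. the Shasta monitoring framework) are far richer.
    """
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) >= 10:                       # need some history first
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history) or 1e-9
            if abs(value - mean) / stdev > threshold:
                yield i, value
        history.append(value)

# Example: a synthetic temperature trace with one spike at index 120.
trace = [45.0 + 0.1 * (i % 7) for i in range(200)]
trace[120] = 95.0
for idx, temp in flag_anomalies(trace):
    print(f"sample {idx}: {temp:.1f} C looks anomalous")
```
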
Significant Advances in Cray System Architecture for Diagnostics, Availability, Resiliency, and Health
Stephen Fisher and Christer Lundin (Cray Inc.)
Abstract: This paper gives a high-level overview of, and a deep dive into, architectural advances in Cray HPC systems that significantly decrease operational downtime, increase diagnostics efficiency, and speed root-cause analysis of component and service failures. The paper also covers modern advances in system resiliency to hardware and software failure and to performance degradation. It covers a range of technologies and industry patterns new in Cray's Shasta systems, including microservice architectures, container-based services, orchestration by Kubernetes, failure domains, continuous-operation goals, and correlated failure diagnostics and root-cause analysis in an integrated and multi-tenant HPC system. This paper is targeted at system administrators, operations and monitoring teams, performance and application engineers, support engineers, systems designers, and others who want to learn more about the new Cray system diagnostics, health, high availability and resiliency.

Paper/Presentation Technical Session 10B
Chair: Scott Michael (Indiana University)

Cray Performance Tools: New Functionality and Future Directions
Heidi Poxon (Cray Inc.)
Abstract: Creating optimized, scalable applications brings challenges, especially as the HPC industry marches towards Exascale-class systems with powerful nodes, many cores, and heterogeneous environments. Using application profiling tools that are intuitive and that identify key application performance inhibitors is critical to the tuning process necessary to take advantage of these powerful systems.

Porting Quantum ESPRESSO Hybrid Functional DFT to GPUs Using CUDA Fortran
Thorsten Kurth (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Joshua Romero, Everett Phillips, and Massimiliano Fatica (NVIDIA); and Brandon Cook, Rahul Gayatri, Zhengji Zhao, and Jack Deslippe (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Abstract: Quantum ESPRESSO is a widely used framework for electronic structure calculations. The exact exchange hybrid DFT component is one of the most commonly used at NERSC. In this talk we discuss how we are porting this package to GPUs utilizing CUDA Fortran support in the PGI compiler. Using selected benchmark problems, we compare the achieved performance on a Cray CS-Storm system equipped with NVIDIA V100 GPUs to the KNL-based Cori as well as the Ivy Bridge-based Edison systems at NERSC. We identify remaining performance bottlenecks in our GPU port and discuss how those could be mitigated. We finally give an outlook on potentially promising algorithmic optimizations, such as mixed-precision arithmetic, to enable optimal performance of Quantum ESPRESSO on future GPU systems such as the upcoming NERSC Perlmutter system.

Accelerating modern scientific simulations with FPGAs
Tobias Kenter, Paolo Gorlani, and Christian Plessl (Paderborn University)
Abstract: In this talk, we will present our recent work on using FPGAs to implement the nodal discontinuous Galerkin method for computing time-domain solutions of Maxwell's equations on an unstructured mesh. For designing the FPGA accelerators, we use the Intel OpenCL SDK for FPGAs. This tool flow allows us to generate efficient implementations while maintaining a high level of abstraction, productivity, and flexibility. Our target system is the Cray CS500 cluster Noctua at Paderborn University. Our cluster is one of the first systems to feature the most recent generation of Intel Stratix 10 FPGAs and is currently the largest academic HPC installation with state-of-the-art FPGAs worldwide.

Paper/Presentation Technical Session 10C
Chair: Bilel Hadri (KAUST Supercomputing Lab)

Dynamically Provisioning Cray DataWarp Storage
François Tessier, Maxime Martinasso, Mark Klein, Matteo Chesi, and Miguel Gila (Swiss National Supercomputing Centre)
Abstract: The multiplication of layers in the I/O software stack deployed on HPC systems makes access to resources difficult and is often not compatible with the workloads. Burst buffers such as Cray DataWarp, for instance, or hybrid storage tiers such as NVMe, have been designed to mitigate the I/O bottleneck by providing an intermediate tier of fast storage between the compute nodes and the parallel file system. However, to take advantage of this technology, application developers are dependent on the installed data management service. In this work, we propose to dynamically supply a data management system on top of DataWarp and NVMe devices such that the workload can decide the type of interface it needs to use intermediate storage resources. We particularly focus our effort on deploying a BeeGFS instance across multiple DataWarp nodes on a Cray XC50 system.

Exploring Lustre Overstriping For Shared File Performance on Disk and Flash
Michael Moore (Cray, Inc.) and Patrick Farrell (Whamcloud)
Abstract: From its earliest versions, Lustre has included striping files across multiple data targets (OSTs). This foundational feature enables scaling performance of shared-file I/O workloads by striping across additional OSTs. Current Lustre software places one file stripe on each OST, and for many I/O workloads this behavior is optimal. However, faster OSTs backed by non-rotational storage show individual stripe bandwidth limitations due to the underlying file systems (ldiskfs, ZFS). Additionally, shared-file write performance, for I/O workloads that don't use optimizations such as Lustre lock ahead, may be limited by write-lock contention, since Lustre file locks are granted per stripe. A new Lustre feature known as 'overstriping' addresses these limitations by allowing a single file to have more than one stripe per OST. This paper discusses synthetic I/O workload performance using overstriping and the implications for achieving expected performance of next-generation file systems in shared-file I/O workloads.

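For readers unfamiliar with the feature, overstriping is requested through the normal Lustre striping interface. The sketch below is a minimal illustration, assuming a Lustre client new enough to support the overstriping option of lfs setstripe (-C, the overstripe count, added around Lustre 2.13); the file system path is hypothetical and the option should be checked against the installed release.

```python
import subprocess

def create_overstriped_file(path, stripe_count):
    """Create an empty Lustre file whose layout asks for more stripes than
    OSTs by using the overstripe count option (-C).

    Assumes a Lustre version that supports overstriping; on older clients
    only -c (at most one stripe per OST) is available.
    """
    subprocess.run(["lfs", "setstripe", "-C", str(stripe_count), path], check=True)

def show_layout(path):
    """Print the resulting layout so the stripe placement can be inspected."""
    subprocess.run(["lfs", "getstripe", path], check=True)

if __name__ == "__main__":
    # Hypothetical shared file used by an N-to-1 checkpoint writer.
    target = "/lus/scratch/checkpoint.dat"
    create_overstriped_file(target, stripe_count=32)
    show_layout(target)
```
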
Designing an All-Flash Lustre File System for the 2020 NERSC Perlmutter System
Kirill Lozinskiy, Glenn K. Lockwood, Lisa Gerhardt, Ravi Cheema, Damian Hazen, and Nicholas J. Wright (Lawrence Berkeley National Laboratory)
Abstract: New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this paper, we present the quantitative approach to requirements definition that resulted in the 30 PB all-flash Lustre file system that will be deployed with NERSC's upcoming Perlmutter system in 2020. By integrating analysis of current workloads and projections of future performance and throughput, we were able to constrain many critical design space parameters and quantitatively demonstrate that Perlmutter will not only deliver optimal performance, but effectively balance cost with capacity, endurance, and many modern features of Lustre.

Paper/Presentation Technical Session 11A
Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory)

Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System
Jim Brandt and Ann Gentile (Sandia National Laboratories), Joe Greenseid (Cray Inc.), William Kramer (National Center for Supercomputing Applications/University of Illinois), Patti Langer and Aamir Rashad (Cray Inc.), and Michael Showerman (National Center for Supercomputing Applications)
Abstract: Cray, NCSA, and Sandia staff and engineers are collaborating to jointly investigate and provide new insights on the monitoring aspects of Cray's recently released "Software Preview System." In this paper, we explore how data collection and aggregation services interact with platform services, system services, user applications, and the available network fabrics (management networks and the High Speed Network (HSN)). We explore how data is made available to the telemetry bus and how it can be turned into actionable information for users and system services. Further, we provide recommendations on what functionalities may need to be extended to support complex scenarios, such as monitoring applications running inside containers, integrating application monitoring data into existing data streams, use of the available networks for low-latency data transport, and feedback of analysis results to user and system software.

Reimagining image management in the new Shasta environment
Harold W. Longley and Eric Cozzi (Cray Inc.)
Abstract: The process to build and customize compute and non-compute node images within Cray's new Shasta system has been dramatically rethought to use industry-standard tools, technologies and best practices. This talk introduces the Cray Image Management Service (IMS) and the Cray Package Repository Service (PRS). DevOps techniques are used to manipulate the IMS and PRS RESTful micro-services running within the Kubernetes-orchestrated Cray Management Plane via a Cray-provided CLI. IMS provides the ability to build images from recipes and to customize images. Images are built using the industry-standard Kiwi-NG tool, which has been containerized and made part of a Kubernetes Job workflow. Once built, images can be customized pre-boot via an SSH configuration environment. The PRS service is used to define zypper/yum RPM package repositories and to provide the RPM content, at scale, for installing and updating software for every compute and non-compute node in the system.

Exploring the Mysterious Universe of Shasta Software for Perlmutter
James F. Botts and Douglas M. Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: The Perlmutter system is to be the first delivered Shasta-architecture system. CLE for Shasta brings with it an entirely new system software stack and new system management methods. As Perlmutter is the initial Shasta deployment, NERSC is partnering with Cray in a Systems Software COE intended to ensure that the goals of continuous operation, continuous integration for software, and efficient systems management are all fully integrated. NERSC is participating in the Shasta Software Preview Program, which ships with a development system. The system is neither a TDS nor production hardware; it runs current Shasta software development snapshots. The administrative and user environment is very different from CLE 6. System services on management nodes are provided as microservices in Docker containers. The cluster of management nodes, and the pods of containers that run on them, are orchestrated by Kubernetes. This work describes our current progress toward a highly productive initial deployment of Perlmutter.

Paper/Presentation Technical Session 11B
Chair: Abhinav S. Thota (Indiana University)

Challenges in Providing an Interactive Service with Jupyter on Large-Scale HPC Systems
Tim Robinson, Lucas Benedicic, Mark Klein, and Maxime Martinasso (Swiss National Supercomputing Centre)
Abstract: High-performance computing users have traditionally interacted with supercomputing environments through a command-line interface, with compute-intensive jobs scheduled in batch mode via workload managers. Driven in part by the emergence of data science as a dominating field, and the convergence of Big Data and HPC, we have seen increased efforts to bridge the gap between scientific computing in the desktop and supercomputing worlds. The Jupyter notebook is an open-source web application that allows the creation of portable documents containing executable code, narrative text and equations, and visualization of generated results. Jupyter notebooks have gained significant adoption as research tools and teaching aids, and a number of cloud providers have built front-end interfaces for their users based on the Jupyter notebook or derivatives thereof (examples include Amazon's SageMaker Notebook, Google's Colaboratory and Microsoft's Azure Notebook). Likewise, JupyterHub has proved valuable for spawning and managing multiple instances of the Jupyter notebook server in settings such as research groups and institutional clusters. Large-scale HPC centers face distinct challenges in deploying such services, however, including access to remote or distributed file systems, integration with central authentication services, and, most obviously, the apparent incompatibility between interactivity and batch scheduling. In this presentation we survey Jupyter-based offerings available at a number of HPC centers, both Cray and non-Cray based, including CSCS, NERSC, and the Jülich Supercomputing Centre. By highlighting advantages and limitations, our goal is to establish a set of best practices for implementing gateways for user-friendly interactive computing for present and future Cray architectures.

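One common building block for such services is JupyterHub combined with a batch-aware spawner, so that each user's notebook server runs as a job under the site's workload manager. The jupyterhub_config.py fragment below is a minimal sketch using the community batchspawner package and assuming a Slurm-managed system; the partition name, resource requests and paths are illustrative placeholders, not any specific site's configuration, and attribute names should be checked against the installed batchspawner version.

```python
# jupyterhub_config.py -- minimal sketch of a batch-backed JupyterHub.
# Assumes the open-source `batchspawner` package and a Slurm workload manager;
# all site-specific values below (partition, paths) are hypothetical.

c = get_config()  # noqa: F821  (provided by JupyterHub at config load time)

# Spawn each user's notebook server as a Slurm job instead of a local process.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Modest, illustrative resource requests for an interactive session.
c.SlurmSpawner.req_partition = "interactive"   # hypothetical partition name
c.SlurmSpawner.req_runtime = "04:00:00"
c.SlurmSpawner.req_memory = "16G"
c.SlurmSpawner.req_nprocs = "4"

# Start users on a shared parallel file system so notebooks stay portable.
c.Spawner.notebook_dir = "/scratch/{username}"

# Idle notebook servers tie up batch resources, so cull them after an hour.
c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "command": ["python3", "-m", "jupyterhub_idle_culler", "--timeout=3600"],
    }
]
```
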
Resource Management in a Heterogeneous Environment
Jonathan Sparks (Cray Inc.)
Abstract: Over the last decade, cloud computing, which promises unlimited compute power, flexibility and new technologies, has come to be seen as the new global compute resource, but this technology has seen slow adoption in the traditional HPC environment. In this work, we will explore how both cloud-native orchestration with Kubernetes and traditional HPC resource management technologies can be used to provide a unified resource model to efficiently manage large systems. We will start by examining some of the fundamental issues built into these different technologies and how they intersect with typical data science workflows. We will then explore some of the potential ways these technologies can be utilized, both by different resource management frameworks and by end users, to conduct end-to-end workflows. We will also examine some of the gaps in these technologies that may currently limit their effectiveness for these use cases, and potential mitigations.

Cray Programming Environments within Containers on Cray XC Systems
Maxime Martinasso, Miguel Gila, William Sawyer, Rafael Sarmiento, Guilherme Peretti-Pezzi, and Vasileios Karakasis (ETH Zurich / CSCS)
Abstract: Containers have been welcomed in the High-Performance Computing community as a solution for packaging software stacks for supercomputer facilities and for managing large ecosystems of applications. However, the exploitation of containers on HPC systems is still at an early stage, and it is foreseeable that new use cases will arise to benefit from their many advantages. We present a methodology to enable the complete software development life cycle on Cray XC systems within containers by containerizing any version of the Cray Programming Environment (CPE). The procedure introduced here for building a CPE inside a container consists of three main parts: the creation of containers holding the CPE, the compilation of software within such containers, and the packaging of the resulting binaries, libraries and dependencies into lightweight images. The installation of the CPE inside containers facilitates many aspects of the typical HPC support and operations workload of managing Cray XC systems.

Paper/Presentation Technical Session 11C
Chair: David Hancock (Indiana University)

Scheduling Data Streams for Low Latency and High Throughput on a Cray XC40 Using Libfabric
Farouk Salem, Thorsten Schütt, Florian Schintke, and Alexander Reinefeld (Zuse Institute Berlin)
Abstract: Achieving efficient many-to-many communication on a given network topology is a challenging task when many data streams from different sources have to be scattered concurrently to many destinations with low variance in arrival times. In such scenarios, it is critical to saturate, but not to congest, the bisectional bandwidth of the network topology in order to achieve good aggregate throughput. When there are many concurrent point-to-point connections, the communication pattern needs to be dynamically scheduled in a fine-grained manner to avoid network congestion (links, switches), overload of a node's incoming links, and receive-buffer overflow. Motivated by the use case of the Compressed Baryonic Matter (CBM) experiment, we study the performance and variance of such communication patterns on a Cray XC40 with different routing schemes and scheduling approaches. We present a distributed Data Flow Scheduler (DFS) that reduces the variance of arrival times from all sources by at least 30 times and increases the achieved aggregate bandwidth by up to 50%.

Characterizing Full-system Network Performance and Congestion Management Capabilities with Improved Network Benchmarks
Peter Mendygral, Nathan Wichmann, Duncan Roweth, Krishna Kandalla, and Kim McMahon (Cray Inc.)
Abstract: High performance interconnects for petaflop and exaflop systems must scale to large numbers of endpoints and provide low latency and high bandwidth for diverse workloads. A common practice is to measure the latency and bandwidth characteristics of a network with a set of nodes on an otherwise idle system, often with only a single MPI rank on each node.
Such measurements are not representative of the conditions under which real HPC applications execute. Communication patterns generated by MPI collectives are implementation dependent, which limits their usefulness for measuring network performance. HPC applications demand high performance at scale from a network under load, typically on a production system running many applications at once. Such networks need to efficiently mitigate the effects of congestion. HPC applications and even system services can reliably or spontaneously generate traffic patterns that overwhelm a network. These events can significantly impact the performance of other applications running on the system. New methods are needed for characterizing network performance at scale and under load or in the presence of congestion. These new methods should better represent the performance of HPC workloads in real-world conditions. In this talk we introduce a new methodology for characterizing network performance. We present a framework for measuring tail latency and bandwidth at scale and discuss techniques for measuring the impact of congestion on latency and bandwidth.

New Lustre Features to Improve Lustre Metadata and Small-File Performance
John Fragalla; Bill Loewe, PhD; and Torben Kling-Petersen, PhD (Cray Inc.)
Abstract: As HPC I/O evolves beyond the challenges of large-I/O performance, the new Lustre features Distributed Namespace (DNE) 2, Progressive File Layout (PFL), and Data on Metadata (DoM), when combined with flash-based Lustre targets, can provide metadata and small-file performance improvements transparently to applications. To improve the execution of single-directory operations and increase the scalability of overall metadata operations, striped or remote directories with DNE2 can be configured across multiple MDTs. For small-file I/O, Progressive File Layout with flash and disks can be used to optimize for both small and large I/O by seamlessly storing files on flash-based MDTs or specific flash OSTs. Cray will share performance results showing the scalability benefits of metadata performance using striped or remote directories, as well as small-file performance with DoM and flash OSTs with PFL, both with and without Cray's block I/O acceleration (NXD) configured.

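As a concrete illustration of the features named above, the sketch below shows how a site or user might request a DNE2 striped directory and a PFL layout with a Data-on-Metadata first component, using the standard lfs utility driven from Python. The component sizes, stripe counts and paths are assumptions for illustration only, and exact option support depends on the installed Lustre version.

```python
import subprocess

def run(*cmd):
    """Echo and run an lfs command (illustration only)."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# DNE2: create a directory striped across two MDTs to spread metadata load.
run("lfs", "mkdir", "-c", "2", "/lus/project/many_small_files")

# PFL with DoM: the first 64 KiB of each file lives on the MDT (-L mdt), the
# next component up to 1 GiB uses a single OST stripe, and anything larger
# spreads across 8 OSTs. Values are illustrative, not tuned recommendations.
run("lfs", "setstripe",
    "-E", "64K", "-L", "mdt",
    "-E", "1G", "-c", "1",
    "-E", "-1", "-c", "8",
    "/lus/project/many_small_files")
```
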
Paper/Presentation Technical Session 19A
Chair: Brian Skjerven (Pawsey Supercomputing Centre)

I/O Performance Analysis of Science Applications Using HDF5 File-level Provenance
Tonglin Li and Quincey Koziol (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Houjun Tang (Lawrence Berkeley National Laboratory); Jialin Liu (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); and Suren Byna (Lawrence Berkeley National Laboratory)
Abstract: Systematic capture of extensive, useful metadata and provenance requires an easy-to-use strategy to automatically record information throughout the data life cycle without posing significant performance impact. Towards that goal, we have developed a Virtual Object Layer (VOL) connector for HDF5, the most popular I/O middleware on HPC systems. The VOL connector, called H5Prov, transparently intercepts HDF5 calls and records I/O and metadata operations at multiple levels, namely the file, group, dataset, and data object levels. The provenance data produced can also be analyzed to reveal I/O patterns and correlations between application behaviors/semantics and I/O performance issues, which opens up optimization opportunities. We analyze the captured provenance information to understand HDF5 file usage and to detect I/O patterns, with preliminary results showing good promise.

Roofline-based Performance Efficiency of HPC Benchmarks and Applications on Current Generation of Processor Architectures
JaeHyuk Kwack, Thomas Applencourt, Colleen Bertoni, Yasaman Ghadar, Huihuo Zheng, Christopher Knight, and Scott Parker (Argonne National Laboratory)
Abstract: The emerging pre-exascale/exascale systems are composed of innovative components that evolved from existing petascale systems. One of the most exciting evolutions is ongoing in processor architecture. In this study, we present performance results of a test suite consisting of HPC benchmarks (e.g., HPGMG and NEKBONE) and HPC applications (e.g., GAMESS, LAMMPS, QMCPACK, and Qbox) on several processor architectures (e.g., Intel Xeon, Intel Xeon Phi, Arm, and NVIDIA GPU). For the baseline performance, we employ the Argonne Leadership Computing Facility (ALCF)'s Theta system, a Cray XC40 system that has 4,392 Intel Xeon Phi 7230 processors with a peak of 11.69 PF. We perform roofline performance analysis for the tests in the test suite and categorize them by their computational intensities (CI). Based on the CI values and the corresponding achievable performance peaks from the rooflines, we provide their performance efficiencies on the processor architectures.

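The roofline analysis used here boils down to a simple formula: for a kernel with arithmetic intensity AI (flops per byte moved), the attainable performance on a given machine is bounded by min(peak flops, AI x memory bandwidth). A minimal sketch of that calculation follows; the machine numbers and kernel figures are purely illustrative, not measurements from Theta or any of the systems in the paper, and the efficiency definition is one reasonable reading of the abstract.

```python
def attainable_gflops(ai_flops_per_byte, peak_gflops, bandwidth_gbs):
    """Classic roofline bound: performance is capped either by the compute
    peak or by memory bandwidth times arithmetic intensity (AI)."""
    return min(peak_gflops, ai_flops_per_byte * bandwidth_gbs)

def efficiency(measured_gflops, ai, peak_gflops, bandwidth_gbs):
    """One way to define roofline-based efficiency: measured performance
    relative to the roofline ceiling at this kernel's AI."""
    return measured_gflops / attainable_gflops(ai, peak_gflops, bandwidth_gbs)

# Purely illustrative machine: 2600 GF/s double-precision peak, 450 GB/s memory BW.
PEAK, BW = 2600.0, 450.0

# A bandwidth-bound stencil (low AI) versus a compute-bound dense kernel (high AI).
for name, ai, measured in [("stencil", 0.25, 90.0), ("dgemm-like", 10.0, 1900.0)]:
    ceiling = attainable_gflops(ai, PEAK, BW)
    print(f"{name}: ceiling {ceiling:.0f} GF/s, "
          f"efficiency {efficiency(measured, ai, PEAK, BW):.0%}")
```
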
Experiences porting mini-applications to OpenACC and OpenMP on heterogeneous systems
Veronica G. Vergara Larrea and Reuben Budiardja (Oak Ridge National Laboratory), Rahulkumar Gayatri and Christopher Daley (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), and Oscar Hernandez and Wayne Joubert (Oak Ridge National Laboratory (ORNL))
Abstract: This paper studies mini-applications (minisweep, GenASIS, GPP, and FF) that use computational methods commonly encountered in HPC. We port these applications to develop OpenACC and OpenMP versions and evaluate their performance on Titan (Cray XK7 with K20x GPUs), Cori (Cray XC40 with Intel KNL), Summit (IBM Power9 with Volta GPUs), and Cori-GPU (Cray CS-Storm 500NX with Intel Skylake and Volta GPUs). Our goals are for these new ports to be useful to both application and compiler developers, to document and describe the lessons learned and the methodology used to create optimized OpenMP and OpenACC versions, and to provide a description of possible migration paths between the two specifications. Cases where specific directives or code patterns result in improved performance for a given architecture will be highlighted. We also include discussions of the functionality and maturity of the latest compilers available on the above platforms with respect to their OpenACC or OpenMP implementations.

Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System
Charlene Yang and Thorsten Kurth (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) and Samuel Williams (Computational Research Division/Lawrence Berkeley National Laboratory)
Abstract: The Roofline performance model provides an intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In preparation for the next-generation Perlmutter system at NERSC, this paper presents a methodology for constructing a hierarchical Roofline on GPUs, together with three use cases. The hierarchical Roofline incorporates L1, L2, device memory and system memory bandwidths into a single figure and provides more insight into performance analysis. Three proxy applications, GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow, are showcased to demonstrate the ability of our methodology to readily capture the various nuances of GPU performance, such as memory coalescing and thread predication.

Paper/Presentation Technical Session 19B
Chair: David Hancock (Indiana University)

Uncovering Lustre Performance Issues in Operational Weather Forecasting at DMI with View for ClusterStor
Thomas Lorenzen (Danish Meteorological Institute (DMI)) and Torben Kling Petersen (Cray Inc.)
Abstract: DMI, as a national weather site, requires timely execution of its operational jobs with predictable performance while also servicing a diverse workload. The forecast production chains produce and consume large amounts of data, which means that I/O performance becomes a critical component of forecast production. Because the production chains and diverse workloads execute on a shared Lustre file system, the ability to understand per-job and per-process I/O statistics is critical for DMI to ensure that production jobs complete on time and to identify, in a timely manner, performance problems that could impede production. DMI is evaluating View for ClusterStor and is working with Cray to understand how View for ClusterStor could be leveraged to identify performance problems and inefficiencies in a weather forecast production environment. Based on the results of this evaluation, the goal is to broaden the use cases that View for ClusterStor can address beyond ALPS-launched jobs to jobs running natively in PBSPro and to processes running on login and other service nodes. This work will make View for ClusterStor more useful for a broader set of use cases at DMI as well as for the community.

Analysis of parallel I/O use on the UK national supercomputing service, ARCHER, using Cray LASSi and EPCC SAFE
Andrew Turner and Dominic Sloan-Murphy (EPCC, The University of Edinburgh); Karthee Sivalingam and Harvey Richardson (Cray Inc.); and Julian Kunkel (University of Reading)
Abstract: In this paper we describe how we have used a combination of the LASSi tool (developed by Cray) and the SAFE software (developed by EPCC) to collect and analyse Lustre I/O performance data for all jobs running on the UK national supercomputing service, ARCHER, and to provide reports on I/O usage for users in our standard reporting framework. We also present results from analysis of parallel I/O use on ARCHER, together with analysis of the potential impact of different applications on file system performance using metrics we have derived from the LASSi data. We show that the performance data from LASSi reveals how the same application can stress different components of the file system depending on how it is run, and how the LASSi risk metrics allow us to identify use cases that could potentially cause issues for global I/O performance and to work with users to improve their I/O use.

Interfacing HDF5 with A Scalable Object-centric Storage System on Hierarchical Storage
Jingqing Mu and Jerome Soumagne (The HDF Group); Suren Byna, Quincey Koziol, and Houjun Tang (Lawrence Berkeley National Laboratory); and Richard Warren (The HDF Group)
Abstract: Object storage technologies that take advantage of multi-tier storage on HPC systems are emerging. However, to use these technologies, applications currently have to be modified significantly relative to existing I/O libraries. HDF5, a widely used I/O middleware on HPC systems, provides a Virtual Object Layer (VOL) that allows applications to connect to different storage mechanisms transparently, without requiring significant code modifications. We recently designed the Proactive Data Containers (PDC) object-centric storage system, which provides transparent, asynchronous, and autonomous data movement that takes advantage of multiple storage tiers, a decision that has so far been left to the user on most current systems.

Hybrid Flash/Disk Storage Systems with Lustre
Nathan Rutman (Cray)
Abstract: Vendors are now offering Lustre storage systems with flash-based OSTs and MDTs in addition to traditional HDD OSTs. With the introduction of multiple classes of storage tiers, new questions arise for system designers. What are the various use cases that are well served by mixed systems? What are the right ratios of flash to disk? What features in Lustre lend themselves to tier management? This presentation will explore various dimensions of these issues and look at some new features Cray is developing to ease management of mixed-media systems, including spillover space, pool quotas, and HSM extensions.

Paper/Presentation Technical Session 19C
Chair: Abhinav S. Thota (Indiana University)

Running Alchemist on Cray XC and CS Series Supercomputers: Dask and PySpark Interfaces, Deployment Options, and Data Transfer Times
Kai Rothauge (UC Berkeley); Haripriya Ayyalasomayajula, Kristyn J. Maschhoff, and Michael Ringenburg (Cray Inc.); and Michael W. Mahoney (UC Berkeley)
Abstract: Alchemist allows Apache Spark to achieve better performance by interfacing with HPC libraries for large-scale distributed computations. In this paper we highlight some recent developments in Alchemist that are of interest to Cray users and the scientific community in general. We discuss our experience porting Alchemist to container images and deploying it on Cray XC (using Shifter) and CS (using Singularity) series supercomputers, on a local Kubernetes cluster, and on the cloud.

Machine learning on Crays to optimise petrophysical workflows in oil and gas exploration
Nick Brown (EPCC)
Abstract: The oil and gas industry is awash with sub-surface data, which is used to characterize the rock and fluid properties beneath the seabed. This in turn drives commercial decision making and exploration, but the industry currently relies upon highly manual workflows when processing data. A key question is whether this can be improved using machine learning to complement the activities of petrophysicists searching for hydrocarbons. In this paper we present work done, in collaboration with Rock Solid Images (RSI), using supervised machine learning on a Cray XC30 to train models that streamline the manual data interpretation process.
With the general aim of decreasing the petrophysical interpretation time from over 7 days to 7 minutes, in this paper we describe the use of mathematical models, trained using raw well-log data, to complete each of the four stages of a petrophysical interpretation workflow, along with the initial data cleaning. We explore how the predictions from these models compare against the interpretations of human petrophysicists, along with the numerous options and techniques that were used to optimise the predictions of our models. The power provided by modern supercomputers such as Cray machines is crucial here, but some popular machine learning frameworks are unable to take full advantage of modern HPC machines. As such, we also explore the suitability of the machine learning tools we have used and describe the steps we took to work around their limitations. The result of this work is the ability, for the first time, to use machine learning for the entire petrophysical workflow. Whilst there are numerous challenges, limitations and caveats, we demonstrate that machine learning has an important role to play in the processing of sub-surface data.

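As a rough illustration of the kind of supervised model described above (not RSI's or EPCC's actual pipeline), the sketch below trains a random forest to predict a porosity-like target from other well-log curves using scikit-learn; the feature names and synthetic data are stand-ins for real well-log measurements.

```python
# Minimal sketch of supervised learning on well-log-style data.
# Feature/target names are illustrative stand-ins, not the real RSI curves.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000

# Synthetic "well log" features: gamma ray, resistivity, sonic travel time.
X = np.column_stack([
    rng.normal(75, 20, n),       # gamma ray (API units)
    rng.lognormal(1.0, 0.5, n),  # resistivity (ohm-m)
    rng.normal(90, 15, n),       # sonic (us/ft)
])
# Synthetic target: a porosity-like quantity loosely derived from the features.
y = 0.3 - 0.001 * X[:, 0] + 0.02 * np.log(X[:, 1]) + rng.normal(0, 0.01, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```
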
Scalable Reinforcement Learning on Cray Systems
Ananda Vardhan Kommaraju, Kristyn J. Maschhoff, Michael F. Ringenburg, and Benjamin Robbins (Cray Inc.)
Abstract: Recent advancements in deep learning have enabled reinforcement learning (RL) to scale to a wide range of decision-making problems. The emergence of this paradigm brings multiple challenges to system resource management, as RL applications continuously train a deep learning or machine learning model while interacting with uncertain simulation models. This new generation of AI applications imposes significant demands on system resources such as memory, storage, network, and compute.

Urika-GX Platform's Multi-Tenancy Support: Lessons Learned
Oleksandr Shcherbakov, Dennis Hoppe, Thomas Bönisch, and Michael Gienger (High Performance Computing Center Stuttgart) and Stefan Andersson, Juri Kuebler, and Nina Mujkanovic (Cray Inc.)
Abstract: HPDA systems such as the Cray Urika-GX enable academic and industrial users to perform compute-intensive data analytics on huge amounts of data for the first time. This endeavor has allowed us to develop new markets and user communities in the domains of data analytics, machine learning, and AI. However, new potential customers with limited knowledge and/or high demands on security face various hurdles when moving to HPDA, which prevents us from unlocking the full potential of the Urika-GX. First, the default way of using the Urika-GX, via the command line, requires expert knowledge. Secondly, support for multi-tenancy was only recently introduced with Urika-GX platform 2.2 and is not yet available across the entire software stack to satisfy the requirements of a production environment, where multiple users must be able to access the system simultaneously while isolation of data and processes is guaranteed. Therefore, we have taken actions to lower the barrier for (non-professional) customers and to improve multi-tenancy support. In this presentation, we discuss security issues with respect to multi-tenancy and introduce a virtualization layer on top of the Urika-GX software stack in order to serve multiple users simultaneously with a secured desktop environment, which is a crucial requirement to attract non-professionals. Specifically, we have secured access to the system via further modifications to the underlying software stack, and have set up VMs with an Ubuntu desktop running Jupyter notebooks, GNU R, and KNIME; KNIME is an easy-to-use graphical interface for creating HPDA tasks by connecting the building blocks of an analytics pipeline.

Paper/Presentation Technical Session 28A
Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)

Hardware Discovery and Maintenance Workflows in Shasta Systems
Steven Presser and Brent Shields (Cray Inc.)
Abstract: Cray's Shasta supercomputers support more varied hardware than any previous Cray system. This includes a significantly wider variety of processors, coprocessors, and accelerators than has previously been available on Cray systems. Further, Cray is supporting the use of certain commodity hardware in Shasta systems. The more complicated hardware ecosystem in Shasta makes hardware management more complicated than on previous Cray systems.

The Beat Goes On... Cascade XC Release Schedule and Patching Strategy
Kelly Mark (Cray Inc.)
Abstract: The Beat Goes On: the XC release schedule and patching strategy.

The Art of Conversation with CrayPort (Bidirectional Record Management)
Daniel Gens, Owen James, and Elizabeth Bautista (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Melissa Abdelbaky (Lawrence Berkeley National Laboratory)
Abstract: The National Energy Research Scientific Computing Center (NERSC) is the primary scientific computing facility for the Office of Science in the U.S. Department of Energy. NERSC houses top-ranked Cray supercomputers, and staff work very closely with on-site Cray engineers, submitting on average 75 Cray cases each month. Until recently, submitting a Cray case was done by phone or email, and all subsequent updates were delivered manually. This process caused delays, introduced errors during manual data entry, and increased incident processing and resolution time. In 2018, NERSC deployed an API-based bidirectional integration that allows Cray cases to be submitted and updated directly from a single incident management platform, streamlining 24x7 operations and enhancing communication between engineering teams.

Paper/Presentation Technical Session 28B
Chair: Ann Gentile (Sandia National Laboratories)

Continuous Deployment Automation in Supercomputer Operations: Techniques, Experiences and Challenges
Nicholas Cardo, Matteo Chesi, and Miguel Gila (Swiss National Supercomputing Centre)
Abstract: Continuous Deployment (CD) is a software development process that pursues the objective of immediately deploying software to production for users as soon as it is developed. CD is part of the DevOps methodology and has been widely adopted in industry, including by large web technology companies like Facebook and Amazon, to release new features to their public.

The role of emerging orchestration and execution models in HPC Environments
Richard Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Jonathan Sparks (Cray Inc.)
Abstract: New models for deployment and workload execution are emerging in the enterprise and cloud space, but do these technologies and approaches have a role in HPC environments? In this paper we explore two popular trends in the enterprise space, Kubernetes and serverless compute, and evaluate how they can be used for HPC workloads.
We will start by examining some of the assumptions built into these technologies and how they intersect with typical HPC settings. We will then explore some of the potential ways these technologies can be utilized, both by resource providers and by end users, and some of the ramifications. We will also examine some of the gaps in these technologies that may limit their effectiveness for these use cases, and potential mitigations.

PBS Professional on Shasta
Lisa Endrjukaitis and Vincent Stumpf (Altair Engineering, Inc.)
Abstract: Altair and Cray have collaborated on the integration of the PBS Professional workload manager with Cray's new APIs for the new Shasta supercomputer. Altair is the first to leverage and provide feedback on these new APIs, which cover services including the new Parallel Application Launch Service (PALS). We will discuss how PBS leverages the APIs, how we've adapted PBS features for them, and our collaboration methods. We will also discuss the use of specific PBS features on Shasta versus Cray's current-generation XC supercomputers with the Application Level Placement Scheduler (ALPS).

Paper/Presentation Technical Session 28C
Chair: Bilel Hadri (KAUST Supercomputing Lab)

FirecREST: RESTful API on Cray XC systems
Felipe A. Cruz and Maxime Martinasso (Swiss National Supercomputing Centre)
Abstract: As science gateways become an increasingly popular digital interface for scientific communities, it is also becoming increasingly important for High-Performance Computing centers to provide modern Web-enabled APIs that facilitate the integration of their services into science gateways. This work presents the FirecREST API, a RESTful Web API infrastructure that allows scientific communities to access the various integrated resources and services available on the Cray XC systems at the Swiss National Supercomputing Centre.

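To make the idea concrete, the sketch below shows how a science gateway might talk to a REST interface of this kind from Python. The base URL, endpoint paths and token handling are illustrative placeholders, not the actual FirecREST specification, which should be consulted for real integrations.

```python
# Illustrative client for a FirecREST-style REST interface.
# Host name, endpoint paths and token handling are hypothetical placeholders.
import os
import requests

BASE_URL = "https://api.example-hpc-centre.ch"   # placeholder host
TOKEN = os.environ["GATEWAY_ACCESS_TOKEN"]       # e.g. an OIDC bearer token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def list_directory(machine, path):
    """Ask the web API to list a directory on the target system."""
    r = requests.get(f"{BASE_URL}/utilities/ls",
                     headers=HEADERS,
                     params={"machine": machine, "targetPath": path},
                     timeout=30)
    r.raise_for_status()
    return r.json()

def submit_job(machine, script_path):
    """Submit a batch script through the web API instead of ssh + sbatch."""
    r = requests.post(f"{BASE_URL}/compute/jobs",
                      headers=HEADERS,
                      data={"machine": machine, "scriptPath": script_path},
                      timeout=30)
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    print(list_directory("daint", "/scratch/snx3000/user"))
    print(submit_job("daint", "/home/user/job.sh"))
```
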
User-Friendly Data Management for Scientific Computing Users
Kirill Lozinskiy, Lisa Gerhardt, Annette Greiner, Ravi Cheema, Damian Hazen, Kristy Kallback-Rose, and Rei Lee (Lawrence Berkeley National Laboratory)
Abstract: Wrangling data at a scientific computing center can be a major challenge for users, particularly when quotas may impact their ability to utilize resources. In such an environment, a task as simple as listing space usage for one's files can take hours. The National Energy Research Scientific Computing Center (NERSC) has roughly 50 PB of shared storage utilizing more than 4.6 billion inodes, and a 146 PB high-performance tape archive, all accessible from two supercomputers. As data volumes increase exponentially, managing data is becoming a larger burden on scientists. To ease the pain, we have designed and built a "Data Dashboard". Here, in a web-enabled visual application, our 7,000 users can easily review their usage against quotas, discover patterns, and identify candidate files for archiving or deletion. We describe this system, the framework supporting it, and the challenges for such a framework moving into the exascale age.

Implementation of a multi-purpose DataHub: Making the XC-40 more attractive to Data Scientists at KAUST
Samuel Kortas (KAUST) and Kristyn Maschhoff and Jim Maltby (Cray Inc.)
Abstract: The support of containers, along with the release of the Urika-XC data analytics stack, has made Shaheen, our 6,174-node XC-40, more accessible to data scientists.

Paper/Presentation Technical Session 29A
Chair: Kevin Buckley (Pawsey Supercomputing Centre)

Using Slurm to Balance the XC Equation
Douglas Jacobsen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Brian Christiansen (SchedMD LLC), and Christopher Samuel (Lawr)
Abstract: Slurm as a workload manager has demonstrated great success in automating an XC supercomputer workload while enforcing site policy. In this paper we explore how recent additions to Slurm can be used to automate aspects of the management and maintenance of the XC supercomputer itself, to drive maximal performance and user productivity. One important aspect of this is that Slurm can be used to coordinate and schedule micro-maintenance activities that continuously maintain optimal performance of the system, for example by maintaining hugepage availability. Using Slurm as a system management policy engine, combined with advanced scheduling algorithms aware of transient states of the system, makes it possible to actively manage compute node performance while keeping future planning of large and full-scale jobs intact. This work describes how NERSC configures and uses Slurm to actively manage system performance, schedule a wide diversity of jobs, and increase application performance.

Optimisation of PBS Hooks on the Cray XC40
Sam Clarke (Met Office, UK)
Abstract: The Met Office, the UK's national weather agency, is both a global forecasting centre with a requirement to produce regular, timely weather forecasts, and a major centre for climate and weather science research. Each of these groups requires access to a large supercomputing facility which is highly available and reliable, and which provides good turnaround to facilitate scientific development work.

Impact of Large Jobs and Reservations on Cray Systems Using Slurm
Yun (Helen) He (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Emily Zhang (University of California, Berkeley); and Woo-Sun Yang (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Abstract: Compute node resources on the two large NERSC production Cray systems, Cori and Edison, are scheduled with the Slurm batch scheduler. A system "drains" when the scheduler gathers nodes for a large, high-priority job or a system reservation, meaning that only small and short backfill jobs can run. It is very important to understand the impact of such drain events in order to maintain good system utilization.

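A simple way to start quantifying such drain events is to sample the scheduler state over time. The sketch below polls sinfo for per-node states and prints how many nodes are idle, allocated or draining at each sample; it is a minimal illustration using standard Slurm commands, not the analysis tooling used in the paper, and the sampling cadence is arbitrary.

```python
# Toy sampler of Slurm node states, to observe "draining" behaviour.
# Uses standard sinfo output formatting; cadence and labels are arbitrary.
import subprocess
import time
from collections import Counter

def node_state_counts():
    """Return a Counter of Slurm node states, e.g. {'idle': 120, 'allocated': 9000}."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%T"],   # one extended state per node, no header
        capture_output=True, text=True, check=True
    ).stdout.split()
    return Counter(state.rstrip("*~#") for state in out)  # strip flag suffixes

def sample(interval_s=300, samples=12):
    """Print a small time series of idle/allocated/drained node counts."""
    for _ in range(samples):
        counts = node_state_counts()
        print(time.strftime("%H:%M"),
              "idle:", counts.get("idle", 0),
              "alloc:", counts.get("allocated", 0) + counts.get("mixed", 0),
              "drain:", counts.get("drained", 0) + counts.get("draining", 0))
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()
```
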
PBS 18 Multiplatform Scheduling
Peter Schmid (General Electric)
Abstract: GE Global Research has interesting and novel things to share about running PBS Professional v18.2.101 in a single complex with both Cray and Linux execution nodes, using boolean expressions and MultiSched to handle different scheduling policies for different platforms in our HPC environment. We will also talk about Docker integration on the machine learning side of our program and how this solves some of the security concerns surrounding Docker while giving us the flexibility of using Docker for machine learning frameworks.

Paper/Presentation Technical Session 29B
Chair: Jim Rogers (Oak Ridge National Laboratory)

Statistical Analysis of Titan Reliability as it reaches End of Life
Jim Rogers (Oak Ridge National Laboratory)
Abstract: After seven years, the Cray XK7 Titan supercomputer at ORNL will end production in 2019. With 18,688 heterogeneous nodes, each containing a CPU, a GPU, DIMMs, and GDDR5 memory, Titan continues to represent one of the largest HPC systems ever produced. The majority of HPC systems are retired in their fourth year of production. The longevity of this system allows a statistical analysis of component failure rates that is rarely available. Titan has experienced three very significant maintenance events in its life, with the latest, in 2017, requiring the replacement of more than 11,000 SXM/GPU assemblies. The opportunity remains to examine the changes to the FIT rate for many SKUs as the system nears EOL, with an analysis of the right-hand side of the expected bathtub curve.

Perlmutter: A 2020 pre-exascale system optimized for Science
Katie Antypas, Jay Srinivasan, and Nicholas Wright (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Perlmutter, the next-generation NERSC computing system, is a pre-exascale Cray Shasta system that will be deployed in 2020. It will have GPU-accelerated nodes as well as AMD Milan CPU-only nodes and will feature the Cray Slingshot network, resulting in a system that is well optimized to support the diverse NERSC science workload. This talk will provide details on the system and on the collaborative efforts with Cray and partners to ensure that the system is usable and will meet the needs of both large-scale simulations and data analysis from experimental and observational facilities.

Measuring and Mitigating Processor Performance Inconsistencies
Kevin D. Stroup (Los Alamos National Laboratory, Cray Inc.) and Paul Peltz (Oak Ridge National Laboratory)
Abstract: Application performance inconsistency is a problem that has plagued users and system engineers for a long time. When a user reports that an application took longer than normal to run or was running more slowly than usual, the engineer is faced with a wide range of potential causes. Some of them may be outside the engineer's control, including changes the user has made, interactions with other workloads, and numerous other factors. One possibility that is detectable and within the engineer's control is that one or more nodes is underperforming, or possibly overperforming. Some sophisticated users may be able to detect this if they have instrumented their application, but quite often the problem report is far from specific or informative. Overperforming nodes can impact application performance in unpredictable ways and may also result in thermal issues that can affect processor lifetime and reliability, as well as impacting other components of the system. Los Alamos National Laboratory (LANL) has worked on a number of processes to detect, isolate, and where possible resolve the issue of nodes performing outside expected levels.

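A common first step in hunting for under- or over-performing nodes is to run the same small compute kernel on every node and compare the timings against the fleet median. The sketch below shows a per-node measurement and an outlier test over a collection of such results; it is a toy illustration (DGEMM via NumPy, a fixed 10% tolerance), not LANL's production procedure.

```python
# Toy node-performance screen: time a fixed DGEMM on this node, then compare a
# collection of per-node timings against the median. Thresholds are arbitrary.
import time
import numpy as np

def dgemm_gflops(n=4096, repeats=3):
    """Measure achieved GF/s for an n x n double-precision matrix multiply."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

def flag_outliers(results, tolerance=0.10):
    """results: {hostname: gflops}. Flag nodes more than `tolerance` away from
    the median in either direction (slow *or* suspiciously fast)."""
    median = float(np.median(list(results.values())))
    return {host: g for host, g in results.items()
            if abs(g - median) / median > tolerance}

if __name__ == "__main__":
    # In practice the measurement would run once per compute node (e.g. via the
    # workload manager) and the results dictionary would be gathered centrally.
    print("this node:", round(dgemm_gflops(), 1), "GF/s")
```
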
Evaluating Compiler Vectorization Capabilities on Blue Waters
Celso L. Mendes, Gregory H. Bauer, Brett Bode, and William T. Kramer (National Center for Supercomputing Applications/University of Illinois)
Abstract: Vectorization remains an important technique for scientific applications to achieve good performance on current computers. From a programming productivity viewpoint, compilers are essential tools for generating efficient vectorized code. However, previous work has shown that popular compilers may vary considerably in their vectorizing capabilities. In this paper, we analyze the vectorizing behavior of the four compilers currently available on Blue Waters (namely Cray, Intel, PGI and GNU). By employing a public test suite consisting of 151 loops, together with some of our SPP application benchmarks, we measure both the fraction of loops successfully vectorized and the vectorization speedup obtained on Blue Waters for each compiler. We also compare these measurements to those obtained on a more modern Cray XC system with the same compilers. Our results show that, despite concrete progress over earlier versions, there is still diversity across the compilers' capabilities, and there is certainly room for improvement in some compilers.

Paper/Presentation Technical Session 29C
Chair: Brian Skjerven (Pawsey Supercomputing Centre)

Modelling the earth's geomagnetic environment on Cray machines using PETSc and SLEPc
Nick Brown (EPCC)
Abstract: The British Geological Survey's global geomagnetic model, the Model of the Earth's Magnetic Environment (MEME), is an important tool for calculating the earth's magnetic field, which is continually in flux. Whilst the ability to collect data from ground-based observation sites and satellites has grown rapidly, the memory-bound nature of the code has proved a significant limitation in modelling the problem sizes required by modern science. In this paper we describe work done replacing the bespoke, sequential eigen-solver with the SLEPc package for solving the system of normal equations. This work had a dual purpose: to break through the memory limit of the code, and thus support the modelling of much larger systems by enabling execution on distributed machines, and to improve performance. But adopting SLEPc changed not just the solving of the normal equations, but also, fundamentally, how we build and distribute the data structures. We describe an approach for building symmetric matrices in a way that provides good load balance and avoids the need for close co-ordination between the processes or replication of work. We also study the memory-bound nature of the code from an irregular memory-access perspective and combine detailed profiling with software cache prefetching to significantly optimise this. Performance and scaling characteristics are explored on ARCHER, a Cray XC30, where we achieved a speed-up of 294 times for the solver by replacing the model's bespoke approach with SLEPc. This work also provided the ability to model much larger system sizes, up to 100,000 model coefficients, which is also demonstrated. Some of the challenges of modelling systems at this large scale are explored, and mitigations including hybrid MPI+OpenMP along with the use of iterative solvers are also considered. The result of this work is a modern MEME model that is not only capable of simulating problem sizes demanded by state-of-the-art geomagnetism but also acts as further evidence of the utility of the SLEPc library.

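For readers who have not used SLEPc from a high-level language, the sketch below solves a small symmetric eigenvalue problem with slepc4py, the Python bindings. It follows the standard slepc4py EPS workflow rather than the MEME code itself, and the tridiagonal matrix is a trivial stand-in for the normal-equations matrix discussed above.

```python
# Minimal slepc4py example: a few eigenvalues of a symmetric matrix.
# Mirrors the standard EPS usage, not the MEME implementation.
import sys
import slepc4py
slepc4py.init(sys.argv)
from petsc4py import PETSc
from slepc4py import SLEPc

n = 1000
A = PETSc.Mat().createAIJ([n, n])           # sparse, MPI-distributed matrix
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):               # simple 1D Laplacian stand-in
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

eps = SLEPc.EPS().create()
eps.setOperators(A)
eps.setProblemType(SLEPc.EPS.ProblemType.HEP)   # Hermitian (symmetric) problem
eps.setDimensions(4)                            # ask for four eigenpairs
eps.setFromOptions()                            # allow -eps_* runtime options
eps.solve()

for i in range(eps.getConverged()):
    print("eigenvalue", i, "=", eps.getEigenvalue(i).real)
```
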
Our investigations are conducted on various Cray XC40 systems, using the very high-order spectral element method. Single-node efficiency is achieved by auto-generated assembly implementations of small matrix multiplies and key vector-vector operations, streaming lossless I/O compression, aggressive loop merging, and selective single-precision evaluations. Comparative studies across different Cray XC40 systems at scale, Trinity (LANL), Cori (NERSC) and Shaheen II (KAUST), show that the Cray programming environment, network configuration, parallel file system, and burst buffer all have a major impact on performance. All three systems have similar hardware, with similar CPU nodes and parallel file systems, but they differ in theoretical network bandwidth, operating system, and CDT version of the programming environment. Our study reveals how these slight configuration differences can be critical for the performance of the application. We also find that using 294,912 cores (9,216 nodes) on the Trinity XC40 sustains petascale performance, as well as 50% of peak memory bandwidth over the entire solver (500 TB/s in aggregate). On 3,072 KNL nodes of Cori, we reach 378 TFLOP/s with an aggregated bandwidth of 310 TB/s, corresponding to a time-to-solution 2.11x faster than obtained with the same number of Haswell nodes. Unravelling the Origin of Magnetic fields in Accretion Discs Through Numerical Simulations Prasun Dhang and Prateek Sharma (Indian Institute of Science, Bangalore) Abstract Abstract A coupled system of an astrophysical accretion disc and a jet is ubiquitous in the Universe. Large-scale magnetic fields are integral to producing jets. However, the origin of the large-scale magnetic fields in the accretion disc is an unsolved problem. The most promising mechanism for generating large-scale magnetic fields is a large-scale dynamo driven by the magnetorotational instability (MRI). We study the dynamo action in accretion discs using a widely used computational model, the shearing box. To begin with, we check the performance of our grid-based code PLUTO by investigating its scalability on SahasraT, a Cray XC40 system at SERC, IISc Bangalore. We find that PLUTO shows good scalability for our numerical set-up on SahasraT up to 31,104 processors. We also observe that the performance of the code is best when the ratio of the total number of grid points to the total number of processors is an integer. In the near future, we wish to characterize convergence and dynamo action by studying the long-term behaviour of MRI turbulence in the shearing-box set-up at unprecedented resolution. Massively Parallel SVD Solver on Cray Supercomputers Hatem Ltaief and Dalal Sukkari (KAUST), Aniello Esposito (Cray Inc.), Yuji Nakatsukasa (University of Oxford), and David Keyes (KAUST) Abstract Abstract We present the performance of a massively parallel Singular Value Decomposition (SVD) solver, i.e., the workhorse of linear algebra, on two large-scale Cray supercomputers. Based on the polar decomposition with Zolotarev (ZOLO) rational functions, introduced by Zolotarev in 1877, the new ZOLO-SVD algorithm comes at the price of higher arithmetic cost and memory footprint than the standard SVD solver as implemented in ScaLAPACK from the Cray scientific library. The extra floating-point operations of ZOLO-SVD can, however, be processed in an embarrassingly parallel fashion, as opposed to the traditional one-stage bidiagonal reduction.
We demonstrate performance improvements using up to 102,400 cores on two Cray systems based on homogeneous Intel Haswell and Broadwell architectures. In particular, in the presence of a large number of processing units, ZOLO-SVD is able to outperform PDGESVD from Cray LibSci by up to an order of magnitude, especially in situations where PDGESVD runs out of work, for instance, in the strong-scaling mode of operation. ZOLO-SVD has been integrated into the latest release of Cray LibSci (v19.02.1) and may significantly improve scientific applications relying on the SVD. Plenary General Session 4 Chair: Colin McMurtrie (Swiss National Supercomputing Centre) Keynote: Robust Deep Learning Inference with Limited Resources Vincent Gripon (IMT Atlantique, Université de Montréal) Abstract Abstract Deep learning architectures are the gold standard for many machine learning problems. Thanks to their large number of trainable parameters, they are able to absorb complex dependencies in the input data and produce correct decisions, when trained appropriately. However, this dependency on a very large number of parameters is also a weakness: their computation and memory footprints are considerable, and it is hard, if not impossible, to guarantee their ability to perform well when dealing with corrupted and noisy inputs. In this talk, we shall review the main strategies that have been proposed in the literature to reduce the computation and memory requirements of deep learning systems, including quantization, factorization, and pruning. We shall also discuss how well these systems tolerate faulty implementations. In the last part, we will discuss the susceptibility of deep learning architectures to deviations in their inputs, which appears to have become a major open question. Plenary General Session 7 Chair: Brian Skjerven (Pawsey Supercomputing Centre) Scaling Results From the First Generation of Arm-based Supercomputers Simon McIntosh-Smith, James Price, Andrei Poenaru, and Tom Deakin (University of Bristol) Abstract Abstract In this paper we present the first scaling results from Isambard, the first production supercomputer to be based on Arm CPUs that have been optimised specifically for HPC. Isambard is a Cray XC50 'Scout' system, combining Marvell ThunderX2 Arm-based CPUs with Cray's Aries interconnect. The full Isambard system was delivered in late 2018 and contains a full cabinet of 168 dual-socket nodes, for a total of 10,752 heavyweight Arm cores. In this work, we build on the single-node results we presented at CUG 2018 and present scaling results for the full system. We compare Isambard's scaling results with those of Aries-based XC systems built on x86 CPUs, including Intel Skylake and Broadwell. We focus on a range of applications and mini-apps important to the UK national HPC service, ARCHER, and to Isambard project partners. Driving Innovation in HPC Trish Damkroger (Intel Corporation) Abstract Abstract In this presentation, Trish Damkroger of Intel will discuss the changing landscape of high performance computing, key trends, and the convergence of HPC-AI-HPDA that is transforming our industry and will fuel HPC to fulfil its potential as a scientific tool for business and innovation. Trish will not only highlight the key forces driving this shift but also discuss how this transformation requires a fundamental paradigm shift and is opening up unprecedented opportunities for HPC.
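To make the compression strategies named in the keynote abstract above ("Robust Deep Learning Inference with Limited Resources") more concrete, here is a minimal illustrative sketch of uniform 8-bit post-training weight quantization in C. It is not material from the talk; the synthetic weights, the single per-tensor scale, and the helper names are assumptions chosen purely for illustration.

/* Illustrative only: symmetric, per-tensor 8-bit quantization of a weight
 * array, one of the memory-reduction strategies the keynote surveys.
 * Compile with: cc -O2 quant_sketch.c -lm                                   */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Map float weights to int8 using a single scale so that the largest
 * magnitude maps to 127 (symmetric quantization). */
static void quantize_int8(const float *w, int8_t *q, float *scale, size_t n)
{
    float maxabs = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (fabsf(w[i]) > maxabs) maxabs = fabsf(w[i]);
    *scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] / *scale);   /* round to nearest */
}

int main(void)
{
    const size_t n = 1u << 20;              /* a hypothetical 1M-weight layer */
    float  *w = malloc(n * sizeof *w);
    int8_t *q = malloc(n);
    float scale, err = 0.0f;

    for (size_t i = 0; i < n; i++)          /* synthetic weights in [-1, 1] */
        w[i] = 2.0f * (float)rand() / (float)RAND_MAX - 1.0f;

    quantize_int8(w, q, &scale, n);

    for (size_t i = 0; i < n; i++)          /* mean absolute reconstruction error */
        err += fabsf(w[i] - (float)q[i] * scale);

    printf("memory: %zu -> %zu bytes, mean |error| = %g\n",
           n * sizeof *w, n, err / (float)n);
    free(w); free(q);
    return 0;
}

The 4x storage reduction comes for free; the interesting trade-off the keynote addresses is how much accuracy and robustness survive this and more aggressive schemes.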
Plenary General Session 13 Chair: Colin McMurtrie (Swiss National Supercomputing Centre) Fifty Years and Counting: The past and future of Numerical Weather Prediction at Environment & Climate Change Canada (ECCC) Richard Hogue (Meteorological Service of Canada) Abstract Abstract Our presentation will have two parts. We will first look at the evolution of high performance computing (HPC) within ECCC's 50-year history of Numerical Weather Prediction (NWP) production. Along with this history, we will highlight some key milestones in the advancement of weather and environmental modeling. To begin, we will show how HPC is the workhorse of ECCC's monitoring and forecasting system, assimilating and processing hundreds of terabytes of meteorological and environmental data on a daily basis to support weather, air quality, water, and climate change predictions over various time scales. HPC is also the backbone of ECCC's climate and weather research and innovation, the master tool that enables advances in environmental decision-making information services to Canadians. Plenary General Session 16 Chair: Brian Skjerven (Pawsey Supercomputing Centre) Plenary General Session 20 Chair: Brian Skjerven (Pawsey Supercomputing Centre) Plenary General Session 25 Chair: Jim Rogers (Oak Ridge National Laboratory) The HPC Processor Landscape: A Panel Discussion of Shasta Architecture Options David Cownie (AMD), Brent Gorda (Arm), Jeff Watters (Intel), and Tom Reed (NVIDIA) Abstract Abstract Cray has announced its new Shasta system as a design that will address the next generation of problems in HPC, including exascale computing and data-centric workloads. Part of this design includes a range of processor options (x86, ARM, GPUs, FPGAs) and interconnects (Omni-Path, Slingshot, Mellanox). This panel will feature representatives from Intel, AMD, ARM, and NVIDIA, and the discussion will focus on upcoming challenges in HPC and how each architecture can address them. Technical Workshop Technical Workshop 1A Chair: Wade Doll (Cray Inc.) Shasta Hardware Technical Workshop Wade Doll and Bob Alverson (Cray Inc.) Abstract Abstract This 3-hour session will cover Shasta hardware basics, including Mountain and River cabinets, diverse processor support, infrastructure such as packaging, cooling, and power, and the features of the Slingshot network. Technical Workshop Technical Workshop 1A Continued Chair: Wade Doll (Cray Inc.) Shasta Hardware Technical Workshop Wade Doll and Bob Alverson (Cray Inc.) Abstract Abstract This 3-hour session will cover Shasta hardware basics, including Mountain and River cabinets, diverse processor support, infrastructure such as packaging, cooling, and power, and the features of the Slingshot network. Technical Workshop Technical Workshop 2A Chair: Larry Kaplan (Cray Inc.) Shasta Software Technical Workshop Larry Kaplan, Matt Haines, Jason Rouault, Harold Longley, Bill Sparks, and John Fragalla (Cray Inc.) and Dave Poulsen (Cray) Abstract Abstract This 3-hour session will provide an overview of Shasta software. This will consist of highlights from areas such as system administration, including various management and service provisioning topics, and the Shasta Linux Environment, including the User Access Services (UAS), workload management, and containers. The Cray programming environment and analytics software will also be briefly covered. Technical Workshop Technical Workshop 2A Continued Chair: Larry Kaplan (Cray Inc.)
Shasta Software Technical Workshop Larry Kaplan, Matt Haines, Jason Rouault, Harold Longley, Bill Sparks, and John Fragalla (Cray Inc.) and Dave Poulsen (Cray) Abstract Abstract This 3-hour session will provide an overview of Shasta software. This will consist of highlights from areas such as system administration, including various management and service provisioning topics, and the Shasta Linux Environment, including the User Access Services (UAS), workload management, and containers. The Cray programming environment and analytics software will also be briefly covered. Tutorial Tutorial 1B Chair: Michael Ringenburg (Cray, Inc) Analytics and AI on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) and Mustafa Mustafa, Thorsten Kurth, and Steven Farrell (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Artificial Intelligence (AI) and data analytics have emerged as critical use cases for supercomputing resources. This tutorial will continue and expand upon our well-attended tutorials from the 2017 and 2018 CUG conferences. We will describe various options for optimizing and running the most popular AI and analytics frameworks on Cray systems, and discuss the pros and cons of containerizing these workflows. We will also explore the capabilities of Cray's Urika software stacks, and preview the analytics and artificial intelligence capabilities of Shasta systems. In addition, we will add some new content from the successful NERSC/Cray joint tutorial at Supercomputing '18, covering topics including distributed, scalable training of deep neural networks, convergence of training at scale, and distributed hyperparameter optimization. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray systems. Tutorial Tutorial 1C Chair: John Levesque (Cray Inc.) Preparing an application for Hybrid Supercomputing using Cray's Tool Suite John Levesque (Cray Inc.) Abstract Abstract This talk will investigate a computational fluid dynamics application called Leslie3D. In its initial state Leslie3D is all MPI, and moving it to an efficient hybrid multi-/many-core application is a challenge that is made easier with the use of several tools from Cray's Perftools suite. First, the computational characteristics of the application are obtained using several of the perftools-lite facilities. Then Reveal is employed to assist in high-level threading of the major computational loops. Finally, Cray's memory analysis tool will be used to identify areas where memory bandwidth is limiting the performance. In the end, we will show how this application is made performance portable to GPU-accelerated systems using OpenMP 4.5 accelerator directives. Performance results will be given on a variety of architectures. Tutorial Tutorial 1B Continued Chair: Michael Ringenburg (Cray, Inc) Analytics and AI on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) and Mustafa Mustafa, Thorsten Kurth, and Steven Farrell (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Artificial Intelligence (AI) and data analytics have emerged as critical use cases for supercomputing resources. This tutorial will continue and expand upon our well-attended tutorials from the 2017 and 2018 CUG conferences.
We will describe various options for optimizing and running the most popular AI and analytics frameworks on Cray systems, and discuss the pros and cons of containerizing these workflows. We will also explore the capabilities of Cray's Urika software stacks, and preview the analytics and artificial intelligence capabilities of Shasta systems. In addition, we will add some new content from the successful NERSC/Cray joint tutorial at Supercomputing '18, covering topics including distributed, scalable training of deep neural networks, convergence of training at scale, and distributed hyperparameter optimization. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray systems. Tutorial Tutorial 1C Continued Chair: John Levesque (Cray Inc.) Preparing an application for Hybrid Supercomputing using Cray's Tool Suite John Levesque (Cray Inc.) Abstract Abstract This talk will investigate a computational fluid dynamics application called Leslie3D. In its initial state Leslie3D is all MPI, and moving it to an efficient hybrid multi-/many-core application is a challenge that is made easier with the use of several tools from Cray's Perftools suite. First, the computational characteristics of the application are obtained using several of the perftools-lite facilities. Then Reveal is employed to assist in high-level threading of the major computational loops. Finally, Cray's memory analysis tool will be used to identify areas where memory bandwidth is limiting the performance. In the end, we will show how this application is made performance portable to GPU-accelerated systems using OpenMP 4.5 accelerator directives. Performance results will be given on a variety of architectures. Tutorial Tutorial 2B Chair: Michael Ringenburg (Cray, Inc) Analytics and AI on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) and Mustafa Mustafa, Thorsten Kurth, and Steven Farrell (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Artificial Intelligence (AI) and data analytics have emerged as critical use cases for supercomputing resources. This tutorial will continue and expand upon our well-attended tutorials from the 2017 and 2018 CUG conferences. We will describe various options for optimizing and running the most popular AI and analytics frameworks on Cray systems, and discuss the pros and cons of containerizing these workflows. We will also explore the capabilities of Cray's Urika software stacks, and preview the analytics and artificial intelligence capabilities of Shasta systems. In addition, we will add some new content from the successful NERSC/Cray joint tutorial at Supercomputing '18, covering topics including distributed, scalable training of deep neural networks, convergence of training at scale, and distributed hyperparameter optimization. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray systems. Tutorial Tutorial 2C Chair: Andrey Ovsyannikov (Intel) Intel® Xeon® Platinum 9200 Processor Performance Evaluation for Weather, AI Benchmarking, and Preparing for Intel® Optane™ DC Persistent Memory Andrey Ovsyannikov, Christine Cheng, and Jackson Marusarz (Intel Corporation) Abstract Abstract This two-hour talk covers three topics, presented by Intel content experts.
There will be an opportunity for in-depth discussion. Tutorial Tutorial 2B Continued Chair: Michael Ringenburg (Cray, Inc) Analytics and AI on Cray Systems Michael Ringenburg and Kristyn Maschhoff (Cray Inc.) and Mustafa Mustafa, Thorsten Kurth, and Steven Farrell (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Abstract Abstract Artificial Intelligence (AI) and data analytics have emerged as critical use cases for supercomputing resources. This tutorial will continue and expand upon our well-attended tutorials from the 2017 and 2018 CUG conferences. We will describe various options for optimizing and running the most popular AI and analytics frameworks on Cray systems, and discuss the pros and cons of containerizing these workflows. We will also explore the capabilities of Cray's Urika software stacks, and preview the analytics and artificial intelligence capabilities of Shasta systems. In addition, we will add some new content from the successful NERSC/Cray joint tutorial at Supercomputing '18, covering topics including distributed, scalable training of deep neural networks, convergence of training at scale, and distributed hyperparameter optimization. Simple exercises using interactive Jupyter notebooks will be interspersed to allow attendees to apply what they have learned. We will assume a basic familiarity with Cray systems. Tutorial Tutorial 2C Continued Chair: Andrey Ovsyannikov (Intel) Intel® Xeon® Platinum 9200 Processor Performance Evaluation for Weather, AI Benchmarking, and Preparing for Intel® Optane™ DC Persistent Memory Andrey Ovsyannikov, Christine Cheng, and Jackson Marusarz (Intel Corporation) Abstract Abstract This two-hour talk covers three topics, presented by Intel content experts. There will be an opportunity for in-depth discussion.
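As a companion to the "Preparing an application for Hybrid Supercomputing using Cray's Tool Suite" tutorial listed above, the sketch below shows, in generic C rather than Leslie3D source, the two loop-level transformations that tutorial walks through: threading a hot loop on the host with OpenMP, then retargeting the same loop to a GPU with an OpenMP 4.5 accelerator directive. The function names and problem size are illustrative assumptions.

/* Illustrative only; not Leslie3D code. Build with an OpenMP 4.5 compiler,
 * e.g. cc -fopenmp hybrid_sketch.c (or the Cray compiler with -homp).      */
#include <stdio.h>
#include <stdlib.h>

/* Step 1: host threading of a formerly MPI-only hot loop. */
static void axpy_host(double a, const double *x, double *y, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Step 2: the same loop offloaded with OpenMP 4.5 target directives;
 * most implementations fall back to the host when no device is present. */
static void axpy_device(double a, const double *x, double *y, long n)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void)
{
    const long n = 1L << 24;                 /* illustrative problem size */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    axpy_host(3.0, x, y, n);                 /* y becomes 5.0 everywhere */
    axpy_device(3.0, x, y, n);               /* y becomes 8.0 everywhere */

    printf("y[0] = %.1f (expected 8.0)\n", y[0]);
    free(x);
    free(y);
    return 0;
}

Profiling with perftools-lite to find such hot loops, and using Reveal to check that they are safe to thread, are the parts of the workflow the tutorial covers that this sketch necessarily leaves out.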
Birds of a Feather BoF 3B Chair: Bilel Hadri (KAUST Supercomputing Lab) Birds of a Feather BoF 3C Chair: Sadaf R. Alam (CSCS) Birds of a Feather BoF 12A Chair: Sadaf R. Alam (CSCS) Birds of a Feather BoF 12B Chair: Bilel Hadri (KAUST Supercomputing Lab) Birds of a Feather BoF 12C Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Birds of a Feather BoF 24C Chair: Colin McMurtrie (Swiss National Supercomputing Centre) New Site New Site 8 Chair: Brian Skjerven (Pawsey Supercomputing Centre) New Site New Site 17 Chair: Brian Skjerven (Pawsey Supercomputing Centre) New Site Site Talk 21 Chair: Trey Breckenridge (Mississippi State University) Paper/Presentation Technical Session 10A Chair: Jim Rogers (Oak Ridge National Laboratory) Paper/Presentation Technical Session 10B Chair: Scott Michael (Indiana University) Paper/Presentation Technical Session 10C Chair: Bilel Hadri (KAUST Supercomputing Lab) Paper/Presentation Technical Session 11A Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory) Paper/Presentation Technical Session 11B Chair: Abhinav S. Thota (Indiana University) Paper/Presentation Technical Session 11C Chair: David Hancock (Indiana University) Characterizing Full-system Network Performance and Congestion Management Capabilities with Improved Network Benchmarks Paper/Presentation Technical Session 19A Chair: Brian Skjerven (Pawsey Supercomputing Centre) Roofline-based Performance Efficiency of HPC Benchmarks and Applications on Current Generation of Processor Architectures Paper/Presentation Technical Session 19B Chair: David Hancock (Indiana University) Uncovering Lustre Performance Issues in Operational Weather Forecasting at DMI with View for ClusterStor Analysis of parallel I/O use on the UK national supercomputing service, ARCHER using Cray LASSi and EPCC SAFE Paper/Presentation Technical Session 19C Chair: Abhinav S. Thota (Indiana University) Running Alchemist on Cray XC and CS Series Supercomputers: Dask and PySpark Interfaces, Deployment Options, and Data Transfer Times Paper/Presentation Technical Session 28A Chair: Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Paper/Presentation Technical Session 28B Chair: Ann Gentile (Sandia National Laboratories) Paper/Presentation Technical Session 28C Chair: Bilel Hadri (KAUST Supercomputing Lab) Paper/Presentation Technical Session 29A Chair: Kevin Buckley (Pawsey Supercomputing Centre) Paper/Presentation Technical Session 29B Chair: Jim Rogers (Oak Ridge National Laboratory) Plenary General Session 4 Chair: Colin McMurtrie (Swiss National Supercomputing Centre) Plenary General Session 7 Chair: Brian Skjerven (Pawsey Supercomputing Centre) Plenary General Session 13 Chair: Colin McMurtrie (Swiss National Supercomputing Centre) Plenary General Session 16 Chair: Brian Skjerven (Pawsey Supercomputing Centre) Plenary General Session 20 Chair: Brian Skjerven (Pawsey Supercomputing Centre) Plenary General Session 25 Chair: Jim Rogers (Oak Ridge National Laboratory) Technical Workshop Technical Workshop 1A Chair: Wade Doll (Cray Inc.) Technical Workshop Technical Workshop 1A Continued Chair: Wade Doll (Cray Inc.) Technical Workshop Technical Workshop 2A Chair: Larry Kaplan (Cray Inc.) Tutorial Tutorial 1B Chair: Michael Ringenburg (Cray, Inc) Tutorial Tutorial 1C Chair: John Levesque (Cray Inc.) Tutorial Tutorial 1B Continued Chair: Michael Ringenburg (Cray, Inc) Tutorial Tutorial 1C Continued Chair: John Levesque (Cray Inc.)
Tutorial Tutorial 2B Chair: Michael Ringenburg (Cray, Inc) Tutorial Tutorial 2C Chair: Andrey Ovsyannikov (Intel) Tutorial Tutorial 2B Continued Chair: Michael Ringenburg (Cray, Inc)