Special Paper Session

I/O Performance Characterization and Prediction through Machine Learning on HPC Systems
Lipeng Wan, Matthew Wolf, Feiyi Wang, Jong Youl Choi, George Ostrouchov, Jieyang Chen, Norbert Podhorszki, Jeremy Logan, Kshitij Mehta, Scott Klasky, and Dave Pugmire (Oak Ridge National Laboratory)
Abstract
When scientists run their applications on high-performance computing (HPC) systems, they often experience highly variable runtime I/O performance, and unexpected I/O performance degradations can sometimes dramatically slow down the applications' execution. This issue is mainly caused by I/O bandwidth contention, since the storage subsystem of HPC systems is usually shared by many concurrently running applications, and the I/O performance of each application can be affected by I/O traffic from the others. To mitigate this contention, scientific applications running on HPC systems need to schedule their I/O operations in a proactive and intelligent manner, which requires the capability to predict near-future runtime I/O performance. However, runtime I/O performance prediction in production HPC environments is extremely challenging, as the storage subsystems are complex and the I/O operations of running applications may have irregular patterns.

Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods
Xingfu Wu and Valerie Taylor (Argonne National Laboratory, The University of Chicago) and Zhiling Lan (Illinois Institute of Technology)
Abstract
In this paper, we use the modeling and prediction tool Multiple Metrics Modeling Infrastructure (MuMMI) and 10 machine learning methods to model and predict performance and power, and we compare their prediction error rates. We use a fault-tolerant linear algebra code and a fault-tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM Blue Gene/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experimental results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. Based on the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential optimization efforts associated with the application characteristics and the target architectures, and we predict theoretical outcomes of the potential optimizations. When we compare the prediction accuracy of MuMMI with that of the 10 machine learning methods, we observe that MuMMI not only results in more accurate prediction of both performance and power but also shows how performance counters contribute to the performance and power models. This information provides some insights into how to fine-tune the applications and/or systems for energy efficiency.
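MuMMI's internal modeling pipeline is not described in the abstract above, so the following is only a minimal sketch of the general idea it summarizes: regress a measured quantity such as node power on hardware performance counter values and report a relative prediction error rate. The counter names, data, and coefficients below are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch (not MuMMI): fit a linear model of node power on
# hardware-performance-counter features and report the prediction error rate.
# Counter names and data are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
counters = ["PAPI_TOT_INS", "PAPI_L2_TCM", "PAPI_RES_STL"]   # example counters
X = rng.uniform(size=(200, len(counters)))                   # normalized counter values
power = 80 + X @ np.array([40.0, 25.0, 10.0]) + rng.normal(0, 2, 200)  # synthetic node power (W)

X_tr, X_te, y_tr, y_te = train_test_split(X, power, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

pred = model.predict(X_te)
error_rate = np.mean(np.abs(pred - y_te) / y_te) * 100       # mean relative error (%)
print(f"prediction error rate: {error_rate:.2f}%")

# Coefficient magnitudes hint at which counters matter most for power.
for name, coef in sorted(zip(counters, model.coef_), key=lambda p: -abs(p[1])):
    print(f"{name}: {coef:+.2f} W per unit counter value")
```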
Unified Model Global Weather Forecast Performance on HPE Cray EX
Peter Johnsen and Steven Warren (HPE)
Abstract
The next-generation HPE Cray EX (formerly Cray Shasta) supercomputer offers excellent performance for a wide range of applications, including numerical weather prediction. In a compact architecture that includes AMD EPYC Rome processors along with the latest HPE Slingshot high-speed interconnect, the HPE Cray EX system is showing superb weather simulation performance. In this paper we look at the performance of the Unified Model (UM) from the UK Met Office. The UM is currently producing global and regional forecasts at a number of operational weather centers around the world. UM global weather forecast ensembles at 10 km resolution are achieving net simulation speeds of up to 45 forecast days per wall-clock hour on 700 nodes of the HPE Cray EX. This includes full model forecast output and shows very little run-time variability across ensemble copies.

Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks
Xingfu Wu and Valerie Taylor (Argonne National Laboratory, The University of Chicago)
Abstract
Machine learning (ML) continues to grow in importance across nearly all domains and is a natural tool for modeling that learns from data. Often a tradeoff exists between a model's ability to minimize bias and variance. In this paper, we utilize ensemble learning to combine linear, nonlinear, and tree-/rule-based ML methods to cope with the bias-variance tradeoff and produce more accurate models. Hardware performance counter values are correlated with properties of applications that impact performance and power on the underlying system. We use the datasets collected for two parallel cancer deep learning CANDLE benchmarks, NT3 (weak scaling) and P1B2 (strong scaling), to build performance and power models based on hardware performance counters using single-object and multiple-objects ensemble learning, and to identify the most important counters for improvement. Based on the insights from these models, we improve the performance and energy of P1B2 and NT3 by optimizing the deep learning environments TensorFlow, Keras, Horovod, and Python under a huge page size of 8 MB on the Cray XC40 Theta at Argonne National Laboratory. Experimental results show that ensemble learning not only produces more accurate models but also provides a more robust performance counter ranking. We achieve up to 61.15% performance improvement and up to 62.58% energy saving for P1B2, and up to 55.81% performance improvement and up to 52.60% energy saving for NT3, on up to 24,576 cores.
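The abstract does not spell out which learners the authors combine or how, so the sketch below only illustrates the stated idea of mixing linear, nonlinear, and tree-/rule-based methods to balance bias and variance, here using scikit-learn's StackingRegressor on synthetic counter-like features. All names and numbers are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch of combining linear, nonlinear, and tree-based learners
# in one ensemble (illustrative only; not the paper's exact configuration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 5))                       # hypothetical counter features
y = 3 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.5 * X[:, 2] * X[:, 3] + rng.normal(0, 0.1, 300)

ensemble = StackingRegressor(
    estimators=[
        ("linear", Ridge(alpha=1.0)),                # low variance, higher bias
        ("svr", SVR(C=10.0)),                        # nonlinear kernel method
        ("forest", RandomForestRegressor(n_estimators=200, random_state=1)),  # tree-based
    ],
    final_estimator=Ridge(),
)

# Cross-validated R^2: the stacked model trades off the bias/variance
# profiles of its base learners.
score = cross_val_score(ensemble, X, y, cv=5, scoring="r2").mean()
print(f"stacked ensemble mean R^2: {score:.3f}")
```

Counter importance could then be ranked, for example from the tree-based member's feature importances or via permutation importance on the stacked model; the abstract does not state how the authors derive their ranking.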
HPE Cray Supercomputers: System User Access; User Access Node or User Access Instance, Which is Right for Me?
Jeff Keopp and Alan Mutschelknaus (Hewlett Packard Enterprise)
Abstract
User Access Nodes (UANs) and User Access Instances (UAIs) represent the primary entry point for end users of the new HPE Cray supercomputers. While UANs align closely with the eLogin nodes on prior Cray systems, UAIs offer a new cloud-like approach with dynamic, single-user containers that provide portability and user isolation. Together, UANs and UAIs offer complementary feature sets that benefit different sets of users. This paper discusses the features of UANs and UAIs to help customers choose which implementation best suits their needs. Differences from the eLogin nodes used by previous Cray systems are also discussed.

Early User Experience on and Lessons Learned from the NERSC Cori GPU Cluster
Kelly L. Rowland, Brian Friesen, Brandon Cook, Jack Deslippe, and Ershaad Basheer (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) and Max Katz (NVIDIA)
Abstract
The next-generation NERSC supercomputer "Perlmutter" will feature a combination of CPU-only nodes and nodes that contain both CPUs and GPUs. To help users prepare for this system, NERSC has procured a Cray CS-Storm 500NX system of 18 CPU-GPU nodes. Despite having little in common architecturally with the current NERSC production system "Cori", this cluster has been fully integrated into Cori and is available to users by access request. These GPU-accelerated nodes are primarily for testing and development work, with priority access given to users participating in the NERSC Exascale Science Applications Program (NESAP). In this paper, we discuss management and deployment of the GPU-specific software provided by NERSC consultants for use on the nodes, job scheduling policies, and efforts in user communication.

A scheduling policy to improve 10% of communication time in parallel FFT
Samar Aseeri (King Abdullah University of Science and Technology), Anando Chatterjee and Mahendra Verma (Indian Institute of Technology Kanpur), and David Keyes (King Abdullah University of Science and Technology)
Abstract
The fast Fourier transform (FFT) has applications in almost every frequency-related study, e.g., image and signal processing and radio astronomy. It is also used to solve partial differential equations arising in fluid flows, density functional theory, many-body theory, and other fields. A three-dimensional FFT on \(N^3\) data has a large time complexity, \(O(N^3 \log_2 N)\). Hence, parallel algorithms are used to compute such FFTs. Popular libraries perform slab division or pencil decomposition of the \(N^3\) data. None of the existing libraries has achieved perfect inverse scaling of time with the number of cores \(n\) (i.e., \(T \propto n^{-1}\)), because FFT requires all-to-all communication and clusters hitherto have not had physical all-to-all connections. With advances in switches and topologies, we now have the Dragonfly topology, which physically connects various units in an all-to-all fashion. We show that if we align the all-to-all communication of the FFT with the physical connections of the Dragonfly topology, we achieve better scaling and reduce communication time by around 10%.

Deriving Workload Expectations: Monitoring and Analysis Using HPC Job Profiles
Joshi Fullop and Brett L. Layman (Los Alamos National Laboratory)
Abstract
With the growing availability of time-series metric data from High Performance Computing (HPC) machines, there is significant potential for using this data to improve the monitoring and analysis of HPC workloads and the systems on which they run. In this paper, we use statistical aggregation of node data to create hierarchical job-profile data structures in JSON format. These job-oriented structures enable a wide set of applications. In addition to creating job profiles, we generate workload expectations based on job profiles of past successful runs. We then compare a currently running job to its expectation to determine deviance in real time. We further examine potential use cases for the profiles and expectations in areas such as shared resource scheduling, holistic system monitoring, and benchmark trending.
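The paper's actual profile schema is not given in the abstract, so the sketch below shows one plausible shape for the approach it describes: reduce each node's metric time series to summary statistics, nest those into a job-level JSON profile, and flag deviation from an expectation derived from past runs. The metric names, structure, and z-score threshold are assumptions for illustration only.

```python
# Minimal sketch: aggregate per-node time-series metrics into a hierarchical
# JSON job profile, then flag deviation from an expectation built on past runs.
# Metric names, structure, and the z-score threshold are illustrative assumptions.
import json
import statistics

def node_summary(samples):
    """Reduce one node's metric time series to summary statistics."""
    return {
        "mean": statistics.fmean(samples),
        "max": max(samples),
        "min": min(samples),
        "stdev": statistics.pstdev(samples),
    }

def job_profile(job_id, per_node_metrics):
    """per_node_metrics: {node: {metric: [samples...]}} -> hierarchical profile."""
    return {
        "job_id": job_id,
        "nodes": {
            node: {metric: node_summary(vals) for metric, vals in metrics.items()}
            for node, metrics in per_node_metrics.items()
        },
    }

def deviance(current, expectation, metric, threshold=3.0):
    """Return nodes whose current mean is more than `threshold` sigma off expectation."""
    mu, sigma = expectation["mean"], expectation["stdev"] or 1e-9
    return {
        node: (stats[metric]["mean"] - mu) / sigma
        for node, stats in current["nodes"].items()
        if abs(stats[metric]["mean"] - mu) / sigma > threshold
    }

# Toy usage: two nodes report CPU power samples; node n02 runs unexpectedly hot.
profile = job_profile("12345", {
    "n01": {"cpu_power_w": [180, 185, 182]},
    "n02": {"cpu_power_w": [250, 255, 260]},
})
expected = {"mean": 183.0, "stdev": 5.0}             # built from past successful runs
print(json.dumps(deviance(profile, expected, "cpu_power_w"), indent=2))
```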
Towards Acceptance Testing at the Exascale Frontier
Veronica G. Vergara Larrea, Michael J. Brim, Arnold Tharrington, Reuben Budiardja, and Wayne Joubert (Oak Ridge National Laboratory)
Abstract
In 2007, the Oak Ridge Leadership Computing Facility (OLCF) introduced the Acceptance Test Harness (ATH). The ATH is the testing framework that was utilized to conduct acceptance testing of Jaguar and Titan.

Advanced Topics in Configuration Management
Ryan Bak and Randy Kleinman (HPE)
Abstract
For the configuration of the latest generation of Cray supercomputers, the Configuration Framework Service (CFS) is a flexible framework used to prepare both images and booted nodes to meet their functional requirements. To help users get the most out of CFS, this paper explores many advanced topics, such as the different modes of operation for CFS, configuration of both compute and non-compute nodes, how to configure CFS for best performance, and how to write Ansible code for fast and efficient deployment of your configuration, as well as the differences between CFS and the previous-generation Cray XC series configuration management.

Enabling Power Measurement and Control on Astra: The First Petascale Arm Supercomputer
Ryan E. Grant, Simon D. Hammond, James H. Laros III, Michael Levenhagen, Stephen L. Olivier, Kevin Pedretti, H. Lee Ward, and Andrew J. Younge (Sandia National Laboratories)
Abstract
As the first large-scale deployment of the Apollo 70 architecture and the Marvell ThunderX2 Arm processor, the Astra system at Sandia required close collaboration among all stakeholders to bring to fruition. One key area of co-development has been enabling power measurement and control capabilities for the platform and improving them over time. This functionality has proven useful from multiple perspectives, including enabling real-time evaluation of the system’s health and measuring the power usage of important workloads. This paper describes the design and implementation of Astra’s power management capabilities from the individual Arm processor to full-system scale, along with two case studies: 1) visual analysis of node-level power usage to investigate and resolve node crashes, and 2) characterizing CPU frequency vs. power usage for key workloads. Our results are the first exploration of Apollo 70 power management and provide motivation for offering similar capabilities on future systems.

Not All Applications Have Boring Communication Patterns: Profiling Message Matching with BMM
Taylor Groves (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory); Naveen Ravichandrasekaran (Cray Inc.); Brandon Cook and Brian Friesen (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center); Noel Keen, David Trebotich, and Nicholas J. Wright (Lawrence Berkeley National Laboratory); and Bob Alverson, Duncan Roweth, and Keith Underwood (Cray Inc.)
Abstract
Message matching within MPI is an important performance consideration for applications that utilize two-sided semantics. In this work we present an instrumentation of the Cray MPI library that allows the collection of detailed message-matching statistics, as well as an implementation of hashed matching in software. We use this functionality to profile key DoE applications with complex communication patterns to determine under what circumstances an application might benefit from hardware offload capabilities within the NIC to accelerate message matching.
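Neither BMM nor the Cray MPI instrumentation is shown in the abstract, so the following mpi4py sketch only illustrates why message matching can become expensive: it pre-posts many receives with distinct tags and satisfies them in reverse order, forcing each arriving message to traverse a long posted-receive queue before it matches. The message counts and sizes are arbitrary assumptions, and this is not the authors' tool.

```python
# Minimal sketch (not the paper's BMM instrumentation): stress MPI message
# matching by pre-posting many receives with distinct tags and sending them
# in reverse tag order, so each arrival traverses a long posted-receive queue.
# Run with: mpiexec -n 2 python match_sketch.py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
NUM_MSGS = 10_000          # arbitrary; large enough to make matching cost visible
BUF_LEN = 8                # tiny payloads keep the cost in matching, not bandwidth

if rank == 0:
    recv_bufs = [np.empty(BUF_LEN, dtype=np.float64) for _ in range(NUM_MSGS)]
    reqs = [comm.Irecv([recv_bufs[t], MPI.DOUBLE], source=1, tag=t)
            for t in range(NUM_MSGS)]

comm.Barrier()             # ensure all receives are pre-posted before any send

if rank == 0:
    t0 = time.perf_counter()
    MPI.Request.Waitall(reqs)
    print(f"matched {NUM_MSGS} messages in {time.perf_counter() - t0:.3f} s")
elif rank == 1:
    send_buf = np.zeros(BUF_LEN, dtype=np.float64)
    # Reverse tag order is the worst case for a linear posted-receive queue scan.
    for t in reversed(range(NUM_MSGS)):
        comm.Send([send_buf, MPI.DOUBLE], dest=0, tag=t)
```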