CUG2015 Proceedings | Created 2015-05-13
Birds of a Feather · CUG Board, Filesystems & I/O, PE & Applications, Systems, XTreme | Interactive Session 15B

Tutorial · Filesystems & I/O | Tutorial 1B continued
Cray XC Power Monitoring and Control
Steven Martin, David Rush, and Matthew Kappel (Cray Inc.)
Abstract: This tutorial will focus on the setup, usage, and use cases for Cray XC power monitoring and management features. The tutorial will cover power and energy monitoring and control from three perspectives: site and system administrators working from the SMW command line, users who run jobs on the system, and third-party software development partners integrating with Cray's RUR and CAPMC features.

Paper · Filesystems & I/O | Technical Session 7B
Chair: Sharif Islam (National Center for Supercomputing Applications)

How Distributed Namespace Boosts Lustre Metadata Performance
Andreas Dilger (Intel Corporation)
Abstract: The Lustre Distributed Namespace Environment (DNE) feature allows Lustre metadata performance to scale upward with the addition of metadata servers to a single file system. Under development by Intel and others for several years, DNE functionality is a vital part of the latest production releases of Lustre software. During this technical session you'll learn how DNE works today, hear an update on continued improvements, and see how DNE allows Lustre metadata performance to scale to meet the demands of applications with many thousands of threads.

Toward Understanding Life-Long Performance of a Sonexion File System
Mark S. Swan and Doug Petesch (Cray Inc.)
Abstract: Many of Cray's customers will be using their systems for several years to come. The one resource that is most affected by long-term use is storage. Files, both big and small, striped and unstriped, are continually created and deleted, leaving behind free space of different sizes and in different places on the spinning media. This paper will explore the effects of continual reuse of a Sonexion file system and a method of tuning the allocation parameters of the OSTs to minimize these effects.

Paper · Filesystems & I/O, Systems | Technical Session 8B
Chair: Jason Hill (Oak Ridge National Laboratory)

A Storm (Lake) is Coming to Fast Fabrics: The Next-Generation Intel® Omni-Path Architecture
Barry Davis (Intel Corporation)
Abstract: The Intel® Omni-Path Architecture, Intel's next-generation fabric product line, is designed around industry-leading technologies developed as a result of Intel's multi-year fabric development program. The Intel Omni-Path Architecture will deliver new levels of performance, resiliency, and scalability, overcoming InfiniBand limitations and paving the path to Exascale. Learn how the Intel Omni-Path Architecture will deliver significant enhancements and optimizations for HPC at both the host and fabric levels, providing huge benefits to HPC applications over standard InfiniBand-based designs.

Data Transfer Study for HPSS Archiving
James R. Wynne, Suzanne T. Parete-Koon, Quinn D. Mitchell, Stanley White, and Tom Barron (Oak Ridge National Laboratory)
Abstract: The movement of large data produced by codes run in a High Performance Computing (HPC) environment can be a bottleneck for project workflows. To balance filesystem capacity and performance requirements, HPC centers enforce data management policies that purge old files to make room for new user project data. Users at the Oak Ridge Leadership Computing Facility (OLCF) and other HPC user facilities must archive data to avoid the purge; therefore, the time associated with data movement is something that all users must consider. This study observed the difference in transfer speed from the Lustre filesystem to the High Performance Storage System (HPSS) using a number of different transfer agents. The study tested files that spanned a variety of sizes and compositions that reflect OLCF user data. The results will be used to help users of Titan plan their workflow and archival data transfers to increase their project's efficiency.

Applying Advanced IO Architectures to Improve Efficiency in Single and Multi-Cluster Environments
Mike Vildibill (DataDirect Networks)
Abstract: For 15 years DDN has been working with the majority of leading supercomputing facilities, pushing the limits of storage I/O to improve the productivity of the world's largest systems. Storage technology advances toward Exascale have not progressed as quickly as computing technology. The gap cannot be bridged by improving today's technologies: drive interface speeds are not increasing fast enough, parallel file systems need optimization to achieve Exascale concurrency, and scientists will always want to model more data than is financially reasonable to hold in memory. Discontinuous innovation is called for.

Paper · Filesystems & I/O | Technical Sessions 10B
Chair: Sharif Islam (National Center for Supercomputing Applications)

Tuning Parallel I/O on Blue Waters for Writing 10 Trillion Particles
Suren Byna (Lawrence Berkeley National Laboratory); Robert Sisneros and Kalyana Chadalavada (National Center for Supercomputing Applications); and Quincey Koziol (The HDF Group)
Abstract: Large-scale simulations running on hundreds of thousands of processors produce hundreds of terabytes of data that need to be written to files for analysis. One such application is the VPIC code, which simulates plasma behavior such as magnetic reconnection and turbulence in solar weather. The number of particles VPIC simulates is in the range of trillions, and the size of the data files to store is in the range of hundreds of terabytes. To test and optimize parallel I/O performance at this scale on Blue Waters, we used the I/O kernel extracted from a VPIC magnetic reconnection simulation. Blue Waters is a supercomputer at the National Center for Supercomputing Applications (NCSA) that contains Cray XE6 and XK7 nodes with Lustre parallel file systems. In this paper, we will present the optimizations used in tuning the VPIC-IO kernel to write a 5TB file with 5,120 MPI processes and a 290TB file with ~300,000 MPI processes.

Evaluation of Parallel I/O Performance and Energy Consumption with Frequency Scaling on Cray XC30
Suren Byna (Lawrence Berkeley National Laboratory) and Brian Austin (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Large-scale simulations produce massive data that needs to be stored on parallel file systems. The simulations use parallel I/O to write data into file systems, such as Lustre. Since writing data to disks is often a synchronous operation, the application-level computing workload on CPU cores is minimal during I/O, and hence we consider whether energy may be saved by keeping the cores in lower power states. To examine this postulation, we have conducted a thorough evaluation of energy consumption and performance of various I/O kernels from real simulations on a Cray XC30 supercomputer, named Edison, at the National Energy Research Scientific Computing Center (NERSC). To adjust CPU power consumption, we use the frequency scaling capabilities provided by the Cray power management and monitoring tools. In this paper, we present our initial observations that when the I/O load is high enough to saturate the capability of the filesystem, down-scaling the CPU frequency on compute nodes reduces energy consumption without diminishing I/O performance.

A More Realistic Way of Stressing the End-to-end I/O System
Veronica G. Vergara Larrea, Sarp Oral, and Dustin B. Leverman (Oak Ridge National Laboratory); Hai Ah Nam (Los Alamos National Laboratory); and Feiyi Wang and James Simmons (Oak Ridge National Laboratory)
Abstract: Synthetic I/O benchmarks and tests are insufficient by themselves in realistically stressing a complex end-to-end I/O path. Evaluations built solely around these benchmarks can help establish a high-level understanding of the system and save resources and time; however, they fail to identify subtle bugs and error conditions that can occur only when running at large scale. The Oak Ridge Leadership Computing Facility recently started an effort to assess the I/O path more realistically and improve the evaluation methodology used for major and minor file system software upgrades. To this end, an I/O test harness was built using a combination of real-world scientific applications and synthetic benchmarks. The experience with the harness and the testing methodology introduced are presented in this paper. The more systematic testing performed with the harness resulted in a successful upgrade of Lustre on OLCF systems and a more stable computational and analysis environment.

Paper · Filesystems & I/O | Technical Session 14B
Chair: Tina Butler (National Energy Research Scientific Computing Center)

The time is now. Unleash your CPU cores with Intel® SSDs
Andrey Kudryavtsev (Intel Corporation)
Abstract: When trying to solve humankind's most difficult and important challenges, time is critical. Whether it's mapping population flows to thwart the spread of Ebola, identifying potential terrorists in real time, or analyzing big data to find a promising cure for cancer, data scientists, government leaders, researchers, engineers, all of us can't wait. Yet today most supercomputing platforms require users to do just this: CPUs remain idle while waiting for data to arrive for analysis or waiting for data to be written back. In this session, Bill Leszinske and Andrey Kudryavtsev will discuss advancements in Intel SSD technology that are unleashing the power of the CPU and Moore's Law. They'll dive into NVMe, a new standard specification interface for SSDs that can greatly benefit the HPC community, talk about the results early adopters are experiencing, and show how adoption sets the foundation for consumption of the disruptive NVM technology on the horizon.

DataWarp: First Experiences
Stefan Andersson and Stephen Sachs (Cray Inc.) and Christian Tuma and Thorsten Schuett (Zuse Institute Berlin)
Abstract: In this paper, we'll talk about our first experiences using the new Cray® XC™ DataWarp™ applications I/O accelerator technology on both I/O benchmarks and real-world applications.
Birds of a Feather · CUG Board, Filesystems & I/O, PE & Applications, Systems, XTreme | Interactive Session 15B

Paper · Filesystems & I/O | Technical Session 18B
Chair: Brett Bode (National Center for Supercomputing Applications/University of Illinois)

Utilizing Unused Resources To Improve Checkpoint Performance
Ross G. Miller and Scott Atchley (Oak Ridge National Laboratory)
Abstract: Titan, the Cray XK7 at Oak Ridge National Laboratory, has 18,688 compute nodes. Each node consists of a 16-core AMD CPU, an NVIDIA GPU, and 32GB of RAM. In addition, there is another 6GB of RAM on each GPU card. Not all the applications that run on Titan make use of all of a node's resources. For applications that are not otherwise using the GPU, this paper discusses a technique for using the GPU's RAM as a large write-back cache to improve the application's file write performance.

Sonexion - SW Versions/Roadmap
Stan Friesen (Cray Inc.)
Abstract: New software releases for the Sonexion product line will include significant improvements, including changes to Reliability, Availability, and Serviceability, as well as support for Lustre 2.5. The paper will explain the incremental changes, the planned timeline, and the targeted products (i.e., 900, 1600, 2000) for each software release.

Lustre Metadata DNE Performance on Seagate Lustre System
John Fragalla (Seagate)
Abstract: Alongside the high demands of streaming bandwidth in High Performance Computing (HPC) storage, there is a growing need for increased metadata performance associated with various applications and workloads. The Lustre parallel filesystem provides a distributed namespace (DNE) feature which, by dividing the namespace across multiple metadata servers, allows metadata throughput to scale with increasing numbers of servers. This presentation explains how Seagate's solution addresses DNE Phase 1 in terms of performance, scalability, and high availability, including details on the DNE configuration and MDTEST performance benchmark results.

Paper · Filesystems & I/O | Technical Session 19B
Chair: Veronica G. Vergara Larrea (Oak Ridge National Laboratory)

Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms
Kristyn J. Maschhoff and Michael F. Ringenburg (Cray Inc.)
Abstract: The Berkeley Data Analytics Stack (BDAS) is an emerging framework for big data analytics. It consists of the Spark analytics framework, the Tachyon in-memory filesystem, and the Mesos cluster manager. Spark was designed as an in-memory replacement for Hadoop that can in some cases improve performance by up to 100X.

Cyber-threat Analytics Using Graph Techniques
Eric Dull (Cray Inc.)
Abstract: Computer network analysis can be very challenging due to the volumes and varieties of data. Organizations struggle with analyzing their network data, merging it against contextual information, and using that information. Graph analysis is an analytic approach that overcomes these challenges. Urika-GD-powered graph analytics were demonstrated at SC14, where Cray participated in the Network Security team on SCinet, the SC14 conference network. Cray participates on the network security team because of its scale-of-data (18 billion triples from 5 days of data), time-to-first-solution (analytics need to be developed in minutes to an hour or two), and time-to-solution (answers need to be generated in seconds to minutes to be useful) requirements. This talk describes computer network information, computer network analysis problems, graph algorithm applications to these problems, and successes using Urika-GD to perform graph analytics during SC14.

Staying Out of the Wind Tunnel with Virtual Aerodynamics
Greg Clifford (Cray Inc.) and Scott Suchyta (Altair Engineering, Inc.)
Abstract: In this presentation, Altair will present results from recent benchmark testing for both small and large simulations using HyperWorks Virtual Wind Tunnel on Cray XC30 systems. The tests focused on two problems of different sizes: a relatively small (22 million element) analysis of a benchmark car model used frequently in auto manufacturing, and a large (1 billion finite element cells) problem involving the drafting simulation of two Formula 1 cars. Result highlights included virtually ideal efficiency when scaling to 300 cores and, for the larger problem, excellent performance up to 1,600 cores with very good performance at 3,000+ cores.

Tutorial · PE & Applications, Systems | Tutorial 1C
Preparing for a smooth landing: Intel's Knights Landing and Modern Applications
Jason Sewall (Intel Corporation)
Abstract: Knights Landing, the 2nd-generation Intel® Xeon Phi™ processor, utilizes many breakthrough technologies to combine breakthroughs in power performance with standard, portable, and familiar programming models. This presentation will provide an overview of the new technologies delivered by the Knights Landing microarchitecture. Additionally, Dr. Sewall will provide studies of how applications have been developed using the first-generation Intel® Xeon Phi™ coprocessor to be ready for Knights Landing.

Tutorial · PE & Applications | Tutorial 1C continued
Preparing for a smooth landing: Intel's Knights Landing and Modern Applications
Jason Sewall (Intel Corporation)
Abstract: Knights Landing, the 2nd-generation Intel® Xeon Phi™ processor, utilizes many breakthrough technologies to combine breakthroughs in power performance with standard, portable, and familiar programming models. This presentation will provide an overview of the new technologies delivered by the Knights Landing microarchitecture. Additionally, Dr. Sewall will provide studies of how applications have been developed using the first-generation Intel® Xeon Phi™ coprocessor to be ready for Knights Landing.

Tutorial · PE & Applications, Systems | Tutorial 2B
Job-Level Tracking with XALT: A Tutorial for System Administrators and Data Analysts
Mark Fahey (Argonne National Laboratory), Robert McLay (Texas Advanced Computing Center), and Reuben Budiardja (National Institute for Computational Sciences)
Abstract: Let's talk real, no-kiddin' supercomputer analytics, aimed at moving beyond monitoring the machine as a whole or even its individual hardware components. We're interested in drilling down to the level of individual batch submissions, users, and binaries. And we're not just targeting performance: we're after ready answers to the "what, where, how, when and why" that stakeholders are clamoring for – everything from which libraries (or individual functions!) are in demand to preventing the problems that get in the way of successful research. This tutorial will show how to install and set up the XALT tool that can provide this type of job-level insight.
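The kind of job-level question the XALT tutorial targets, for example "which libraries are actually in demand on this machine?", reduces to simple aggregation once per-job records are being collected. The short Python sketch below illustrates only that analysis step; it assumes a hypothetical newline-delimited JSON log with "user", "executable", and "libraries" fields, an illustrative format that is not XALT's actual schema or database layout.

# Hypothetical illustration of job-level analytics in the spirit of XALT.
# Assumes a newline-delimited JSON log (one record per job or link event) with
# fields "user", "executable", and "libraries"; these names and the path below
# are invented for this sketch and are NOT XALT's real record format.
import json
from collections import Counter
from pathlib import Path

LOG = Path("job_records.jsonl")  # hypothetical input file

def load_records(path):
    """Yield one parsed record per non-empty line, skipping malformed lines."""
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate partial or corrupt lines

def summarize(records):
    """Count how often each library and executable appears, and who ran them."""
    lib_counts, exe_counts, users = Counter(), Counter(), set()
    for rec in records:
        users.add(rec.get("user", "unknown"))
        exe_counts[rec.get("executable", "unknown")] += 1
        for lib in rec.get("libraries", []):
            lib_counts[lib] += 1
    return lib_counts, exe_counts, users

if __name__ == "__main__":
    if not LOG.exists():
        print(f"No {LOG} found; nothing to summarize.")
    else:
        libs, exes, users = summarize(load_records(LOG))
        print(f"{len(users)} distinct users")
        print("Top libraries:", libs.most_common(10))
        print("Top executables:", exes.most_common(10))

In a real deployment the records would come from XALT's own data store rather than a flat file; the sketch is only meant to show how little code the "what, where, and how often" questions require once job-level data exists.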
Tutorial · PE & Applications, Systems | Tutorial 2C
Debugging, Profiling and Tuning Applications on Cray CS and XC Systems
Beau Paisley (Allinea Software)
Abstract: The debugger Allinea DDT and the profiler Allinea MAP are widely available to users of Cray systems. This tutorial, aimed at scientists and developers who are involved in writing or maintaining code, will introduce debugging and profiling using these tools.

Tutorial · PE & Applications, Systems | Tutorial 2B continued
Job-Level Tracking with XALT: A Tutorial for System Administrators and Data Analysts
Mark Fahey (Argonne National Laboratory), Robert McLay (Texas Advanced Computing Center), and Reuben Budiardja (National Institute for Computational Sciences)
Abstract: Let's talk real, no-kiddin' supercomputer analytics, aimed at moving beyond monitoring the machine as a whole or even its individual hardware components. We're interested in drilling down to the level of individual batch submissions, users, and binaries. And we're not just targeting performance: we're after ready answers to the "what, where, how, when and why" that stakeholders are clamoring for – everything from which libraries (or individual functions!) are in demand to preventing the problems that get in the way of successful research. This tutorial will show how to install and set up the XALT tool that can provide this type of job-level insight.

Tutorial · PE & Applications | Tutorial 2C continued
Debugging, Profiling and Tuning Applications on Cray CS and XC Systems
Beau Paisley (Allinea Software)
Abstract: The debugger Allinea DDT and the profiler Allinea MAP are widely available to users of Cray systems. This tutorial, aimed at scientists and developers who are involved in writing or maintaining code, will introduce debugging and profiling using these tools.

Birds of a Feather · PE & Applications | Interactive 3B
Chair: Timothy W. Robinson (Swiss National Supercomputing Centre)

Birds of a Feather · PE & Applications, Systems | Interactive 4A
Chair: Ashley Barker (Oak Ridge National Laboratory)

System Testing and Resiliency in HPC
Ashley D. Barker (Oak Ridge National Laboratory)
Abstract: As supercomputing system offerings from Cray become increasingly larger, more heterogeneous, and more tightly integrated with storage and data analytics, the verification of hardware and software components becomes an ever more difficult and important aspect of system management. Whether HPC resources are dedicated to local users or are shared with an international user community, regression testing is necessary to ensure that centers are providing usable and trustworthy resources for scientific discovery. However, unlike regression testing in software development projects, where there exists a range of well-established continuous integration tools, regression testing in HPC production environments is typically carried out in a more ad hoc fashion, using custom scripts or tools developed independently by individual HPC centers and with little or no collaboration between centers. The aim of this session is to bring together those with experience and interest in regression testing theory and practice, with the goal of fostering collaboration and coordination across CUG member sites. We will assess the state of the art in regression testing at member sites and determine the needs of the community moving forward. We will discuss the testing of components in terms of both functionality and performance, including best practices for operating system, driver, and programming environment updates. The session will provide an open forum to share ideas and concerns in order to produce a more concerted effort towards the treatment of system testing and resilience across HPC centers. The session will be used to kick off a cross-site working group dedicated to sharing ideas and frameworks.

Paper · PE & Applications | Technical Session 7C
Chair: Frank M. Indiviglio (National Oceanic and Atmospheric Administration)

Porting the Urika-GD Graph Analytic Database to the XC30/40 Platform
Kristyn J. Maschhoff, Rob Vesse, and James D. Maltby (Cray Inc.)
Abstract: The Urika-GD appliance is a state-of-the-art graph analytics database that provides high performance on complex SPARQL queries. This performance is due to a combination of custom multithreaded processors, a shared memory programming model, and a unique network.

Implementing a social-network analytics pipeline using Spark on Urika-XA
Michael Hinchey (Cray Inc.)
Abstract: We intend to discuss and demonstrate the use of new-generation analytic techniques to find communities of users that discuss certain topics (consumer electronics, sports) and to identify key users that play a role in or between those communities (originators, rebroadcasters, connectors).

A Graph Mining "App-Store" for Urika-GD
Sreenivas R. Sukumar, Sangkeun Lee, and Tyler C. Brown (Oak Ridge National Laboratory); Seokyong Hong (North Carolina State University); and Larry W. Roberts, Keela Ainsworth, and Seung-Hwan Lim (Oak Ridge National Laboratory)
Abstract: Researchers at Oak Ridge National Lab have created a suite of tools called EAGLE that will be made available for users of the Urika-GD installation. EAGLE is the acronym for "EAGLE 'Is A' Algorithmic Graph Library for Exploratory-Analysis" and includes an emulator environment for code development and testing, graph conversion and creation from heterogeneous data sources, and interactive visualization, along with implementations of traditional graph mining algorithms. We will present benchmark results of EAGLE on real-world datasets across five seminal graph-theoretic algorithms (degree distribution, PageRank, connected component analysis, node eccentricity, and triangle count). We compare EAGLE on Urika-GD with graph mining on other architectures (e.g., distributed-memory GraphX, distributed-storage Pegasus) and programming models (MapReduce, Pregel, SQL). We will conclude by demonstrating how EAGLE is serving as the building block of knowledge discovery using semantic reasoning and its application to biology, medicine, and national security.

Paper · PE & Applications | Technical Session 8C
Chair: Gregory Bauer (National Center for Supercomputing Applications)

Sorting at Scale on Blue Waters in a Cosmological Simulation
Yu Feng (University of California, Berkeley); Mark Straka (National Center for Supercomputing Applications); and Tiziana Di Matteo and Rupert Croft (Carnegie Mellon University)
Abstract: We implement and investigate a parallel sorting algorithm (MP-sort) on Blue Waters. MP-sort sorts distributed array items with non-unique integer keys into a new distributed array. The sorting algorithm belongs to the family of partition sorting algorithms: the target storage space of each parallel computing unit is represented by a histogram bin whose edges are determined by partitioning the input keys, requiring exactly one global shuffling of the input data. The algorithm is used in a cosmology simulation (BlueTides) that utilizes 90% of the computing nodes of Blue Waters, the Cray XE6 supercomputer at the National Center for Supercomputing Applications. MP-sort is optimal in communication: any array item is exchanged over the network at most once. We analyze a series of tests on Blue Waters with up to 160,000 MPI ranks. At scale, the single global shuffling of items takes up to 90% of the total sorting time, and the overhead added by other steps becomes negligible. MP-sort demonstrates the expected performance on Blue Waters and served its purpose in the BlueTides simulation. We make the source code of MP-sort freely available to the public.

Parallel Software usage on UK National HPC Facilities 2009-2015: How well have applications kept up with increasingly parallel hardware?
Andrew Turner (EPCC, The University of Edinburgh)
Abstract: One of the largest challenges facing the HPC user community in moving from terascale, through petascale, towards exascale HPC is the ability of parallel software to meet the scaling demands placed on it by modern HPC architectures. In this paper we analyse the usage of parallel software across two UK national HPC facilities, HECToR and ARCHER, to understand how well applications have kept pace with hardware advances. These systems have spanned the rise of multicore architectures: from 2 to 24 cores per compute node. We analyse and comment on trends in usage over time, trends in parallel programming models, trends in calculation size, and changes in research areas on the systems. The in-house Python tool that is used to collect and analyse the application usage statistics is also described. We conclude by using this analysis to look forward to how particular parallel applications may fare on future HPC systems.

Use of Continuous Integration Tools for Application Performance Monitoring
Veronica G. Vergara Larrea, Wayne Joubert, and Chris Fuson (Oak Ridge National Laboratory)
Abstract: High performance computing systems are becoming increasingly complex, both in node architecture and in the multiple layers of the software stack required to compile and run applications. As a consequence, the likelihood is increasing for application performance regressions to occur as a result of routine upgrades of system software components which interact in complex ways. The purpose of this study is to evaluate the effectiveness of continuous integration tools for application performance monitoring on HPC systems. In addition, this paper also describes a prototype system for application performance monitoring based on Jenkins, a Java-based continuous integration tool. The monitoring system described leverages several features in Jenkins to track application performance results over time. Preliminary results and lessons learned from monitoring applications on Cray systems at the Oak Ridge Leadership Computing Facility are presented.

Birds of a Feather · PE & Applications, Systems | Interactive 9B
Chair: Suzanne T. Parete-Koon (Oak Ridge National Laboratory)

Getting the Most Out of HPC User Groups
Suzanne Parete-Koon (Oak Ridge National Laboratory)
Abstract: User groups can provide HPC user facilities with valuable feedback about current and future center resources, services, and policies. User groups serve as a hub that regularly allows users and HPC facility staff to connect and identify user needs for training, software, and hardware. They also provide a forum where facility staff and vendors, such as Cray, can make users aware of beneficial new resources and services.

Birds of a Feather · PE & Applications | Interactive 9C
Chair: Duncan J. Poole (NVIDIA)

Experiences with OpenACC
Duncan Poole (NVIDIA) and Fernanda Foertter (Oak Ridge National Laboratory)
Abstract: The OpenACC API has been earning praise for leadership in directive-based programming models which accelerate code in a performance-portable manner. This BOF will discuss recent developer experiences with the latest OpenACC compilers available from Cray and PGI. Several teams composed of developers, compiler vendors, and other OpenACC supporters were brought together just prior to CUG in a week-long effort to make significant progress porting their codes to use accelerators. Attendees of this BOF will get an opportunity to understand what obstacles were faced, how they were overcome, and what results could be achieved in short order with good support. Attendees will also come away with knowledge about the strengths and weaknesses of the approaches the teams took, of the current implementations, and of what to expect in the future.

Paper · PE & Applications | Technical Sessions 10C
Chair: Zhengji Zhao (Lawrence Berkeley National Laboratory)

The Cray Programming Environment: Current Status and Future Directions
Luiz DeRose (Cray Inc.)
Abstract: In order to achieve high performance on large-scale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high-end HPC systems. In this talk I will present the recent activities and future directions of the Cray Programming Environment, which is being developed and deployed on Cray clusters and Cray supercomputers for scalable performance with high programmability. I will discuss some of the new functionality in the Cray compilers, tools, and libraries, such as support for GNU intrinsics and our C++11 and OpenMP plans, and will highlight Cray's activities to help port and hybridize applications for the emerging MIC architectures (Intel Xeon Phi), such as the scoping tool Reveal and the recently released Cray Comparative Debugger. Finally, I will discuss our roadmap for all areas of the Cray Programming Environment.

Using Reveal to Automate Parallelization for Many-Core Systems
Heidi Poxon (Cray Inc.)
Abstract: Reveal, an application parallelization assistant, helps users add deeper levels of parallelism to an MPI program by analyzing loops, identifying issues with parallelization, and automating tedious and error-prone tasks for the user. In preparation for Intel KNL many-core systems, Cray is extending Reveal with a new automatic parallelization mechanism that can be used in both "Build & Go" and "Tune & Go" user environments. With this functionality, the user follows a simple recipe to collect performance data, as is typically done prior to application tuning. Instead of the user analyzing the data to determine where parallelism should be applied, Reveal and the Cray compiling environment analyze the data and focus automated parallelization efforts on the best candidate loops. With a single step that requires no source code modifications, Reveal and the Cray compiling environment parallelize selected loops and rebuild the program with the applied parallelism.

An Investigation of Compiler Vectorization on Current and Next-generation Intel Processors using Benchmarks and Sandia's Sierra Application
Mahesh Rajan, Doug Doerfler, Mike Tupek, and Si Hammond (Sandia National Laboratories)
Abstract: Motivated by the need for effective vectorization in order to take full advantage of the dual AVX-512 vector units in Intel's Knights Landing (KNL) processor, to be used in the NNSA's Cray XC Trinity supercomputer, we carry out a systematic study of vectorization effectiveness with the GNU, Intel, and Cray compilers on current-generation Intel processors. The study analyzes micro-benchmarks, mini-applications, and a set of kernel operations from Sandia's SIERRA mechanics application suite. Performance is measured with and without vectorization/optimizations, and the effectiveness of the compiler-generated performance improvement is measured. We also present an approach using C++ templates, data structure layout modifications, and the direct use of Intel vector intrinsics to systematically improve the vector performance of important Sandia SIERRA application kernels, such as eigenvalue/eigenvector computations and nonlinear material model evaluations, which the current generation of compilers cannot effectively auto-vectorize.

Paper · PE & Applications | Technical Session 14C
Chair: Timothy W. Robinson (Swiss National Supercomputing Centre)

Using Maali to Efficiently Recompile Software Post-CLE Updates on the Cray XC Systems
Ralph C. Bording, Christopher Harris, and David Schibeci (iVEC)
Abstract: One of the main operational challenges for High Performance Computing centers is maintaining numerous scientific applications to support a large and diverse user community. At the Pawsey Supercomputing Centre we have developed "Maali", a lightweight automated system for managing a diverse set of optimized scientific libraries and applications on our HPC resources. Maali is a set of BASH scripts that reads a template file containing all the information necessary to download a specific version of an application or library and to configure and compile it. This paper will present how we recently used Maali after the latest CLE update and the hardware changes of Magnus, a Cray XC40, to recompile a large portion of our scientific software stack, including what changes to Maali were needed for both the CLE and hardware updates to differentiate between Magnus and our Cray XC30 system, Galaxy.

PGI C++ with OpenACC
Brent Leback, Mat Colgrove, Michael Wolfe, and Christian Trott (PGI)
Abstract: Over the last year, PGI has moved OpenACC for C++ from a place where it could only offload code and data structures that looked like C to providing support for many C++-specific language features. Working closely with Sandia National Labs, we continue to push into new areas of the language. In this paper and talk, we will use examples to illustrate accelerating code using class member functions, inheritance, templates, containers, handling the implicit 'this' pointer, lambda functions, private data, and deep copies. OpenACC 2.0 features such as unstructured data regions and the "routine" directive are highlighted, as well as a PGI feature to auto-detect and generate class member functions which are called from compute regions as "routine seq". Results using the beta Unified Memory functionality in PGI 15.x, which can simplify data management, will also be presented. Finally, we'll discuss current limitations and the future directions of OpenACC with respect to C++.

Contain This, Unleashing Docker for HPC
Richard S. Canon, Larry Pezzaglia, Douglas M. Jacobsen, and Shreyas Cholia (Lawrence Berkeley National Laboratory)
Abstract: Container-based computing is revolutionizing the way applications are developed and deployed, and a new ecosystem has emerged around Docker to enable container-based computing. However, this revolution has yet to reach the HPC community. In this paper, we will provide an overview of container computing and its potential value to the HPC community. We describe early work in using Docker to support scientific computing workloads. We will also discuss investigations into how Docker could be deployed in large-scale HPC systems.

Birds of a Feather · CUG Board, Filesystems & I/O, PE & Applications, Systems, XTreme | Interactive Session 15B

Paper · PE & Applications | Technical Session 17B
Chair: Zhengji Zhao (Lawrence Berkeley National Laboratory)

Optimizing Cray MPI and Cray SHMEM for Current and Next Generation Cray-XC Supercomputers
Krishna Kandalla, David Knaak, and Mark Pagel (Cray Inc.)
Abstract: Modern compute architectures such as the Intel Many Integrated Core (MIC) and NVIDIA GPUs are shaping the landscape of supercomputing systems. Current-generation interconnect technologies, such as the Cray Aries, are further fueling the design and development of extreme-scale systems. The Message Passing Interface (MPI) and SHMEM programming models are strongly entrenched in High Performance Computing. However, it is critical to carefully design and optimize communication libraries on emerging computing and networking architectures to facilitate the development of next-generation science. In this talk, I will present the primary research and development thrust areas in the Cray MPI and SHMEM software products targeting current and next-generation Cray XC series systems. Next, we will discuss some of the MPI-I/O enhancements and our experiences with optimizing I/O-intensive applications on the Cray XC. Finally, we will discuss the design and development of MPI-4 Fault Tolerance capabilities for Cray XC systems.

Illuminating and Electrifying OpenMP + MPI Performance
Beau Paisley (Allinea Software)
Abstract: The "one size fits all" MPI age has passed: ahead lie complex combinations of MPI with many-core OpenMP, or MPI with GPUs. Increased core counts per CPU mean that performance will increasingly come from optimization within each node, and this calls out for developer tools that point to the root causes of underwhelming performance or of bugs that prevent successful completion.

Performance and Extension of a Particle Transport Code using Hybrid MPI/OpenMP Programming Models
Gavin Pringle (EPCC, The University of Edinburgh); Dave Barrett and David Turland (AWE plc); and Michele Weiland and Mark Parsons (EPCC, The University of Edinburgh)
Abstract: We describe AWE's HPC benchmark particle transport code, which employs a wavefront sweep algorithm. After almost 4 years of collaboration between EPCC and AWE, we present Chimaera-2_3D: a Fortran90 and MPI/OpenMP code which scales well to thousands of cores for large problem sizes. Significant restructuring has increased the degrees of parallelism available to efficiently exploit future many-core exascale systems. For OpenMP, we have introduced slices through the cuboid mesh which present a set of cells that may be computed independently; computation over the angles within each cell can also be parallelized using OpenMP. Previously, the initial form of Chimaera computed a coupled, inter-dependent iteration over 'Energy Groups'. Our new code decouples these iterations which, whilst increasing the computational time, permits a new task level of efficient parallelism encoded using MPI. This paper will present results from an extensive benchmarking exercise using a Cray XT4/5 (HECToR) and a Cray XC30 (ARCHER).

Paper · PE & Applications | Technical Session 17C
Chair: Suzanne T. Parete-Koon (Oak Ridge National Laboratory)

Application Performance on a Cray XC30 Evaluation System with Xeon Phi Coprocessors at HLRN-III
Florian Wende, Matthias Noack, and Thorsten Schütt (Zuse Institute Berlin); Stephen Sachs (Cray Inc.); and Thomas Steinke (Zuse Institute Berlin)
Abstract: We report experiences in using the Cray XC30 Test and Development System (TDS) at the HLRN-III site at ZIB for many-core computing on Intel Xeon Phi coprocessors. The TDS comprises 16 compute nodes, each of which has one Intel Xeon Phi 5120D coprocessor installed. We present performance data for selected workloads including BQCD, VASP, GLAT, and Ising-Swendsen-Wang. For the GLAT application, we use the HAM-Offload framework (developed at ZIB) to offload computations to remote Xeon Phis using Heterogeneous Active Messages. By means of micro-benchmarks, we determined the characteristics of the different communication paths between the host(s) and the Xeon Phi(s) involving the Aries interconnect and the PCIe link(s), and we compare the respective measurements against those taken on an InfiniBand cluster. Based on these results, we discuss their impact on the performance of the applications considered.

Climate Science Performance, Data and Productivity on Titan
Benjamin Mayer (Oak Ridge National Laboratory), Rafael F. da Silva (USC Information Sciences Institute), and Patrick Worley and Abigail Gaddis (Oak Ridge National Laboratory)
Abstract: Climate science models are flagship codes for the largest HPC resources, both in visibility, with the newly launched DOE ACME effort, and in terms of significant fractions of system usage. The performance of the DOE ACME model is captured with application-level timers and examined through a sizeable run archive. The performance and variability of compute time, queue time, and ancillary services are examined and discussed.

Memory Scalability and Efficiency Analysis of Parallel Codes
Tomislav Janjusic and Christos Kartsaklis (Oak Ridge National Laboratory)
Abstract: Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for high-performance systems are typically developed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful reconsideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exascale systems.

Paper · PE & Applications, Systems | Technical Session 18A
Chair: Chris Fuson (Oak Ridge National Laboratory)

Custom Product Integration and the Cray Programming Environment
Sean Byland and Ryan Ward (Cray Inc.)
Abstract: With Cray's increasing customer base and product portfolio, a faster, more scalable, and more flexible software access solution for the Cray Programming Environment became necessary. The xt-asyncpe product offering required manual updates to add new product and platform support, took a significant amount of time to evaluate the environment when building applications, and didn't harness useful standards used by the Linux community. CrayPE 2.x, by incorporating the flexibility of modules, the power of pkg-config, and a programmatic design, offers a stronger solution going forward, with simplified extensibility, a more robust way of adding products to a system, and a significant reduction in application build time for users. This paper discusses the issues addressed and the improved functionality available to support Cray, customers, and third-party software access.

Cray Storm Programming
David Race (Cray Inc.)
Abstract: The Cray Cluster Storm is a dense but highly power-efficient computing platform for both current and next-generation scientific applications. This product combines the latest Intel processors (Haswell), eight NVIDIA K40s or K80s, and single or dual Mellanox IB connections into a hardware package that delivers performance to applications. The ability to access this computing capability relies on the different programming options available to users and their applications. At the end of this presentation, the user will have a basic understanding of the programming options available on the Storm and some basic performance information for several of these options. The basic programming options include compilers, OpenACC, MPI, and MPI+X.

HPC Workforce Preparation
Scott Lathrop (National Center for Supercomputing Applications)
Abstract: Achieving the full potential of today's HPC systems, with all of their advanced technology components, requires well-educated and knowledgeable computational scientists and engineers. Blue Waters is committed to working closely with the community to train and educate current and future generations of scientists and engineers to enable them to make effective use of the extraordinary capabilities provided by Blue Waters and other petascale computing systems.

Paper · PE & Applications | Technical Session 18C
Chair: Abhinav S. Thota (Indiana University)

Large-Scale Modeling of Epileptic Seizures: Scaling Properties of Two Parallel Neuronal Network Simulation Algorithms
Lorenzo Pesce, Albert Wildeman, Jyothsna Suresh, and Tahra Eissa (The University of Chicago); Victor Eijkhout (Texas Advanced Computing Center); Mark Hereld (Argonne National Laboratory); and Wim Van Dongelen, Kazutaka Takahashi, and Karthikeyan Balasubramanian (The University of Chicago)
Abstract: Our limited understanding of the relationship between the behavior of individual neurons and large neuronal networks is an important limitation in current epilepsy research and may be one of the main causes of our inadequate ability to treat it. Addressing this problem directly via experiments is impossibly complex; thus, we have been developing and studying medium-to-large scale simulations of detailed neuronal networks to guide us. Flexibility in the connection schemas and a complete description of the cortical tissue seem necessary for this purpose. In this paper we examine some of the basic issues encountered in these multi-scale simulations. The observed memory and computation-time scaling behavior for a distributed memory implementation was very good over the range studied, both in terms of network sizes and processor pool sizes. We believe that these simulations prove that modeling of epileptic seizures on networks with millions of cells should be feasible on Cray supercomputers.

The Impact of High-Performance Computing Best Practice Applied to Next-Generation Sequencing Workflows
Carlos P. Sosa, Pierre Carrier, Bill Long, and Richard Walsh (Cray Inc.); Brian Haas and Timothy Tickle (Broad Institute of MIT and Harvard); and Thomas William (TU Dresden)
Abstract: Authors: Pierre Carrier, Richard Walsh, Bill Long, and Jef Dawson (Cray Inc., Saint Paul, MN); Carlos P. Sosa (Cray Inc. and University of Minnesota Rochester, Saint Paul, MN); Brian Haas and Timothy Tickle (The Broad Institute, MIT & Harvard, Cambridge, MA); and Thomas William (Technische Universität Dresden, Dresden, Germany).

Parallelization of Whole Genome Analysis on a Cray XE6
Megan Puckelwartz (Northwestern University), Lorenzo Pesce (The University of Chicago), Elizabeth McNally (Northwestern University), and Ian Foster (The University of Chicago)
Abstract: The declining cost of generating DNA sequence is promoting an increase in whole genome sequencing, especially as applied to the human genome. Whole genome analysis requires the alignment and comparison of raw sequence data. Given that the human genome is made of approximately 3 billion base pairs, each of which can be sequenced 30 to 50 times, this generates large amounts of data that have to be processed by complex, computationally expensive, and quickly evolving workflows.

Paper · PE & Applications | Technical Session 19C
Chair: Gregory Bauer (National Center for Supercomputing Applications)

Reducing Cluster Compatibility Mode (CCM) Complexity
Marlys A. Kohnke and Andrew Barry (Cray Inc.)
Abstract: Cluster Compatibility Mode (CCM) provides a suitable environment for running out-of-the-box ISV and third-party MPI applications, serial workloads, and X11, and for doing compilation on Cray XE/XC compute nodes.

Preparation of Codes for Trinity
Courtenay T. Vaughan, Mahesh Rajan, Dennis C. Dinge, Clark R. Dohrmann, Micheal W. Glass, Kenneth J. Franko, Kendall H. Pierson, and Michael R. Tupek (Sandia National Laboratories)
Abstract: Sandia and Los Alamos National Laboratories are acquiring Trinity, a Cray XC40, with half of the nodes having Haswell processors and the other half having Knights Landing processors. As part of our Center of Excellence with Cray, we are working on porting three codes, a solid mechanics code, a solid dynamics code, and an aero code, to effectively use this machine. In this paper, we will detail the work that we have done in porting the codes in preparation for receiving the machine. We started by profiling the codes using tools including CrayPat, which showed that a large portion of the time is being spent in the solvers. We will describe the work we are doing on the solvers, including ongoing work on Haswell processors and Knights Corner machines.
Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer
Esteban Meneses (University of Pittsburgh), Xiang Ni (University of Illinois at Urbana-Champaign), and Terry Jones and Don Maxwell (Oak Ridge National Laboratory)
Abstract: The unprecedented computational power of current supercomputers now makes possible the exploration of complex problems in many scientific fields, from genomic analysis to computational fluid dynamics. Modern machines are powerful because they are massive: they assemble millions of cores and a huge quantity of disks, cards, routers, and other components. But it is precisely the size of these machines that clouds the future of supercomputing. A system that comprises so many components has a high chance of failing, and failing often. In order to make the next generation of supercomputers usable, it is imperative to use some type of fault tolerance platform to run applications on large machines. Most fault tolerance strategies can be optimized for the peculiarities of each system and boost efficacy by keeping the system productive. In this paper, we aim to understand how failure characterization can improve resilience in several layers of the software stack: applications, runtime systems, and job schedulers. We examine the Titan supercomputer, one of the fastest systems in the world. We analyze a full year of Titan in production and distill the failure patterns of the machine. By looking into Titan's log files and using the criteria of experts, we provide a detailed description of the types of failures. In addition, we inspect the job submission files and describe how the system is used. Using those two sources, we cross-correlate failures in the machine to executing jobs and provide a picture of how failures affect the user experience. We believe such characterization is fundamental in developing appropriate fault tolerance solutions for Cray systems similar to Titan. We also investigate how failures impact long-running jobs. We provide a series of recommendations for developing resilient software on supercomputers.

Invited Talk · Plenary | General Session 5
Chair: David Hancock (Indiana University)

Supercomputing in an Era of Big Data and Big Collaboration
Edward Seidel (National Center for Supercomputing Applications)
Abstract: Supercomputing has reached a level of maturity and capability where many areas of science and engineering are not only advancing rapidly due to computing power, they cannot progress without it. Detailed simulations of complex astrophysical phenomena, HIV, earthquake events, and industrial engineering processes are being done, leading to major scientific breakthroughs or new products that cannot be achieved any other way. These simulations typically require larger and larger teams, with more and more complex software environments to support them, as well as real-world data. But as experiments and observation systems now generate unprecedented amounts of data, which also must be analyzed via large-scale computation and compared with simulation, a new type of highly integrated environment must be developed in which computing, experiment, and data services are developed together. I will illustrate examples from NCSA's Blue Waters supercomputer and from major data-intensive projects including the Large Synoptic Survey Telescope, and give thoughts on what will be needed going forward.

Invited Talk · Plenary | General Session 6
Chair: Nicholas Cardo (Swiss National Supercomputing Centre)

Cray Corporate Update
Peter Ungaro (Cray Inc.)
Biography: Peter Ungaro serves as President and Chief Executive Officer. Mr. Ungaro joined Cray in 2003 as the senior vice president responsible for sales and marketing. He was appointed president in March 2005 and chief executive officer in August 2005. Prior to joining Cray in 2003, he served as vice president of worldwide deep computing sales for IBM, where he led global sales of all IBM server and storage products for the high performance computing, life sciences, digital media, and business intelligence markets. Prior to that role, he served in a variety of sales leadership positions at IBM starting in 1991. Mr. Ungaro received a B.A. from Washington State University.

Invited Talk · Plenary | General Session 11
Chair: Nicholas Cardo (Swiss National Supercomputing Centre)

Changing Needs/Solutions/Roles
Raj Hazra (Intel Corporation)
Abstract: Relentless focus on system performance continues to be the mantra for HPC, driving fundamental changes in memory, fabric, power efficiency, and storage, and the need for new architectural frameworks for future HPC systems. Big data analytics coupled with HPC will enable access to broad data sets for real-time simulation, further increasing demand for HPC and storage as well as cloud-based capabilities. Join Raj as he discusses significant trends in technology and how Intel is working with key partners to innovate in HPC system architecture.

Invited Talk · Plenary | General Session 12
Chair: David Hancock (Indiana University)

Invited Talk · Plenary | General Session 13
Chair: David Hancock (Indiana University)

Scalability Limits for Scientific Simulation
Paul Fischer (University of Illinois)
Abstract: Current high-performance computing platforms feature millions of processing units, and it is anticipated that exascale architectures featuring billion-way concurrency will be in place in the early 2020s. The extreme levels of parallelism in these architectures influence many design choices in the development of next-generation algorithms and software for scientific simulation. This talk explores some of the challenges faced by the scientific computing community in the post-frequency-scaling era. To set the stage, we first describe our experiences in the development of scalable codes for computational fluid dynamics that have been deployed on over a million processors. We then explore fundamental computational complexity considerations that are technology drivers for the future of PDE-based simulation. We present performance data from leading-edge platforms over the past three decades and couple this with communication and work models to predict the performance of domain decomposition methods on model exascale architectures. We identify the key performance bottlenecks and expected performance limits at these scales and note a particular need for design considerations that will support strong scaling in the future.

Invited Talk · Plenary | General Session 16
Chair: Nicholas Cardo (Swiss National Supercomputing Centre)

New Member Lightning Talk, Hong Kong Sanatorium & Hospital
Louis Shun and Thomas Leung (Hong Kong Sanatorium & Hospital)

New Member Lightning Talk, European Centre for Medium-Range Weather Forecasts
Christian Weihrauch (European Centre for Medium-Range Weather Forecasts)

Invited Talk · Plenary | General Session 20
Chair: David Hancock (Indiana University)

Tutorial · Systems | Tutorial 1A
Next Generation Cray Management System for XC Systems
Harold Longley, John Hesterberg, and John Navitsky (Cray Inc.)
Abstract: New major versions of CLE and SMW are being developed that include the next-generation Cray Management System (CMS) for Cray XC systems. This next generation of CMS brings more common and easier-to-use system management tools and processes to Cray XC systems, while at the same time preserving the system reliability and scalability upon which you depend. The next-generation CMS includes a new common installation process for SMW and CLE, and it more tightly integrates external Cray Development and Login (CDL) nodes as part of the Cray XC system. It includes the Image Management and Provisioning System (IMPS), which provides prescriptive image creation and centralized configuration. Finally, it integrates with the next major Linux distribution version from SUSE, SUSE Linux Enterprise Server 12.

Tutorial · Systems | Tutorial 1B
Cray XC Power Monitoring and Control
Steven Martin, David Rush, and Matthew Kappel (Cray Inc.)
Abstract: This tutorial will focus on the setup, usage, and use cases for Cray XC power monitoring and management features. The tutorial will cover power and energy monitoring and control from three perspectives: site and system administrators working from the SMW command line, users who run jobs on the system, and third-party software development partners integrating with Cray's RUR and CAPMC features.
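For the "users who run jobs on the system" perspective, Cray XC compute nodes expose node-level power and energy counters through a sysfs-style directory, commonly /sys/cray/pm_counters. The Python sketch below samples the energy counter around a code region; the directory path, the file name "energy", and the file format (leading integer value in joules) are assumptions drawn from Cray's published power-monitoring documentation and may differ by system and software level, so the code degrades gracefully when the counters are absent.

# Minimal sketch: sample Cray XC node-level energy counters around a code
# region. Assumes the node exposes /sys/cray/pm_counters/energy (joules);
# the path, file name, and format are assumptions and may vary by system.
import time
from pathlib import Path

PM_DIR = Path("/sys/cray/pm_counters")

def read_counter(name):
    """Return the leading integer from a pm_counters file, or None if absent."""
    f = PM_DIR / name
    if not f.exists():
        return None
    try:
        return int(f.read_text().split()[0])
    except (ValueError, IndexError, OSError):
        return None

def measure(fn, *args, **kwargs):
    """Run fn and report wall time plus node energy delta, when available."""
    e0, t0 = read_counter("energy"), time.time()
    result = fn(*args, **kwargs)
    e1, t1 = read_counter("energy"), time.time()
    joules = (e1 - e0) if (e0 is not None and e1 is not None) else None
    return result, t1 - t0, joules

if __name__ == "__main__":
    _, secs, joules = measure(sum, range(10_000_000))
    if joules is None:
        print(f"{secs:.2f} s (no pm_counters available on this node)")
    else:
        print(f"{secs:.2f} s, ~{joules} J, ~{joules / secs:.1f} W average")

System-wide and per-job views come from the SMW tools, RUR, and the CAPMC interfaces covered in the tutorial; this per-node sketch is only the smallest possible entry point.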
Tutorial · PE & Applications, Systems Tutorial 1C Preparing for a smooth landing: Intel’s Knights Landing and Modern Applications Jason Sewall (Intel Corporation) Abstract Abstract Knights Landing, the 2nd generation Intel® Xeon Phi™ processor, utilizes many breakthrough technologies to combine breakthrough’s in power performance with standard, portable, and familiar programming models. This presentation will provide an overview of new technologies delivered by Knights Landing microarchitecture. Additionally, Dr. Sewall will provide studies of how applications have been developed using the first generation Intel® Xeon Phi™ coprocessor to be ready for Knights Landing. Tutorial · Systems Tutorial 1A continued Next Generation Cray Management System for XC Systems Harold Lonelgy, John Hesterberg, and John Navitsky (Cray Inc.) Abstract Abstract New major versions of CLE and SMW are being developed that include the next generation Cray Management System (CMS) for Cray XC systems. This next generation of CMS is bringing more common and easy to use system management tools and processes to the Cray XC systems, while at the same time preserving the system reliability and scalability upon which you depend. The next generation CMS includes a new common installation process for SMW and CLE, and more tightly integrates external Cray Development and Login (CDL) nodes as part of the Cray XC system. It includes the Image Management and Provisioning System (IMPS) that provides prescriptive image creation and centralized configuration. Finally, it integrates with the next major Linux distribution version from SUSE, SUSE Linux Enterprise Server 12. Tutorial · Systems Tutorial 2A Next Generation Cray Management System for XC Systems Harold Lonelgy, John Hesterberg, and John Navitsky (Cray Inc.) Abstract Abstract New major versions of CLE and SMW are being developed that include the next generation Cray Management System (CMS) for Cray XC systems. This next generation of CMS is bringing more common and easy to use system management tools and processes to the Cray XC systems, while at the same time preserving the system reliability and scalability upon which you depend. The next generation CMS includes a new common installation process for SMW and CLE, and more tightly integrates external Cray Development and Login (CDL) nodes as part of the Cray XC system. It includes the Image Management and Provisioning System (IMPS) that provides prescriptive image creation and centralized configuration. Finally, it integrates with the next major Linux distribution version from SUSE, SUSE Linux Enterprise Server 12. Tutorial · PE & Applications, Systems Tutorial 2B Job-Level Tracking with XALT: A Tutorial for System Administrators and Data Analysts Mark Fahey (Argonne National Laboratory), Robert McLay (Texas Advanced Computing Center), and Reuben Budiardja (National Institute for Computational Sciences) Abstract Abstract Let’s talk real, no-kiddin’ supercomputer analytics, aimed at moving beyond monitoring the machine as a whole or even its individual hardware components. We’re interested in drilling down to the level of individual batch submissions, users, and binaries. And we’re not just targeting performance: we’re after ready answers to the "what, where, how, when and why" that stakeholders are clamoring for – everything from which libraries (or individual functions!) are in demand to preventing the problems that get in the way of successful research. 
This tutorial will show how to install and set up the XALT tool that can provide this type of job-level insight. Tutorial · PE & Applications, Systems Tutorial 2C Debugging, Profiling and Tuning Applications on Cray CS and XC Systems Beau Paisley (Allinea Software) Abstract Abstract The debugger Allinea DDT and profiler Allinea MAP are widely available to users of Cray systems - this tutorial, aimed at scientists and developers who are involved in writing or maintaining code, will introduce debugging and profiling using the tools. Tutorial · Systems Tutorial 2A continued Next Generation Cray Management System for XC Systems Harold Longley, John Hesterberg, and John Navitsky (Cray Inc.) Abstract Abstract New major versions of CLE and SMW are being developed that include the next generation Cray Management System (CMS) for Cray XC systems. This next generation of CMS is bringing more common and easy-to-use system management tools and processes to the Cray XC systems, while at the same time preserving the system reliability and scalability upon which you depend. The next generation CMS includes a new common installation process for SMW and CLE, and more tightly integrates external Cray Development and Login (CDL) nodes as part of the Cray XC system. It includes the Image Management and Provisioning System (IMPS) that provides prescriptive image creation and centralized configuration. Finally, it integrates with the next major Linux distribution version from SUSE, SUSE Linux Enterprise Server 12. Tutorial · PE & Applications, Systems Tutorial 2B continued Job-Level Tracking with XALT: A Tutorial for System Administrators and Data Analysts Mark Fahey (Argonne National Laboratory), Robert McLay (Texas Advanced Computing Center), and Reuben Budiardja (National Institute for Computational Sciences) Abstract Abstract Let’s talk real, no-kiddin’ supercomputer analytics, aimed at moving beyond monitoring the machine as a whole or even its individual hardware components. We’re interested in drilling down to the level of individual batch submissions, users, and binaries. And we’re not just targeting performance: we’re after ready answers to the "what, where, how, when and why" that stakeholders are clamoring for – everything from which libraries (or individual functions!) are in demand to preventing the problems that get in the way of successful research. This tutorial will show how to install and set up the XALT tool that can provide this type of job-level insight. Birds of a Feather · Systems Interactive 3A Chair: Jason Hill (Oak Ridge National Laboratory) Birds of a Feather · PE & Applications, Systems Interactive 4A Chair: Ashley Barker (Oak Ridge National Laboratory) System Testing and Resiliency in HPC Ashley D. Barker (Oak Ridge National Laboratory) Abstract Abstract As supercomputing system offerings from Cray become increasingly larger, more heterogeneous, and more tightly integrated with storage and data analytics, the verification of hardware and software components becomes an ever more difficult and important aspect of system management. Whether HPC resources are dedicated to local users or are shared with an international user community, regression testing is necessary to ensure that centers are providing usable and trustworthy resources for scientific discovery. 
However, unlike regression testing in software development projects, where there exists a range of well-established continuous integration tools, regression testing in HPC production environments is typically carried out in a more ad hoc fashion, using custom scripts or tools developed independently by individual HPC centers and with little or no collaboration between centers. The aim of this session is to bring together those with experience and interest in regression testing theory and practice in order to foster collaboration and coordination across CUG member sites. We will assess the state of the art in regression testing at member sites and determine the needs of the community moving forward. We will discuss the testing of components in terms of both functionality and performance, including best practices for operating system, driver, and programming environment updates. The session will provide an open forum to share ideas and concerns in order to produce a more concerted effort towards the treatment of system testing and resilience across HPC centers. The session will be used to kick off a cross-site working group dedicated to sharing ideas and frameworks. Paper · Systems Technical Session 7A Chair: Matthew A. Ezell (Oak Ridge National Laboratory) Innovations for The Cray David Beer and Gary Brown (Adaptive Computing) Abstract Abstract Moab and Torque have been efficiently managing the workload on Cray supercomputers for years. Aside from the policy-rich scheduling Moab provides, several new advancements have been made and are being developed specifically for Cray supercomputers. Slurm Road Map 15.08 Jacob Jenson (SchedMD LLC) Abstract Abstract Slurm is an open source workload manager used on six of the world's top 10 most powerful computers and provides a rich set of features including topology-aware optimized resource allocation, the ability to expand and shrink jobs on demand, failure management support for applications, hierarchical bank accounts with fair-share job prioritization, job profiling, and a multitude of plugins for easy customization. Driving More Efficient Workload Management on Cray Systems with PBS Professional Scott Suchyta (Altair Engineering, Inc.) Abstract Abstract The year 2014 brought an increase in adoption of key HPC technologies, from data analytics solutions to power-efficient scheduling. The HPC user landscape is changing, and it is now critical for workload management vendors to provide not only foundational scheduling functionality but also the adjacent capabilities that truly optimize system performance. In this presentation, Altair will provide a look at key advances in PBS Professional for improved performance on Cray systems. Topics include new Cray-specific features like Suspend/Resume, Xeon Phi support, HyperThreading, Power-aware Scheduling, and Exclusive/Non-exclusive ALPS reservations. The presentation will also preview the upcoming capabilities of cgroups and DataWarp integration. Paper · Systems Technical Session 8A Chair: Tina Butler (Lawrence Berkeley National Laboratory) Realtime process monitoring on the Cray XC30 Douglas M. Jacobsen and Shane Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Jay Srinivasan (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Abstract Abstract Increasingly complex workflows of data-intensive calculations are extremely challenging to characterize. 
In preparation for the increasing prevalence of this new style of workload, we describe a recent effort to implement the “procmon” system on the Cray XC30 system. The procmon system was developed to characterize data-intensive workflows on the mid-range clusters at NERSC, enabling efficient whole-system monitoring of all running processes on the system with live real-time analysis of the data. procmon's resource-conscious implementation results in a scalable monitoring system that is minimally disruptive to both user and system processes, thereby providing useful monitoring opportunities on the large-scale Cray systems deployed at NERSC. Use of AMQP messaging enables flexible and fault-tolerant delivery of messages, while HDF5 storage of data allows efficient analysis using standardized tools. This approach results in an open monitoring system that provides users and operators with detailed, real-time feedback about the state of the system. Cray DataWarp: Administration & SLURM integration Tina M. Declerck and Iwona Sakrejda (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) and Dave Henseler (Cray Inc.) Abstract Abstract The National Energy Research Scientific Computing Center (NERSC) is one of the Department of Energy’s (DOE) primary centers for high performance computing needed for research. One area in which large compute centers have sought a solution is the efficient movement of data to and from compute nodes. Cray is addressing this with its DataWarp technology. As new technologies are being developed and used, new tools are needed to address administration and troubleshooting. NERSC is collaborating with Cray to develop the capabilities needed to provide this functionality for our user base. In addition, integration into the workload manager will be needed to allow access for jobs. To address this, NERSC is collaborating with SchedMD to implement the key features needed to integrate Cray’s DataWarp solution into SLURM. This paper will concentrate on the administrative interface and the integration with SLURM. Bright Cluster Manager - Managing your cluster for HPC, Hadoop and OpenStack Craig Hunneyman (Bright Computing) Abstract Abstract Bright Cluster Manager has been provisioning, monitoring and managing HPC clusters for over a decade. Last year, an add-on for Hadoop clusters became generally available. From bare-metal servers to the application stack, a single instance of the Bright GUI or command-line interface delivers a converged administration experience for HPC, Big Data Analytics, Private and Public Clouds. In practice, this means HPC admins can rapidly introduce Hadoop or OpenStack clusters alongside their existing HPC environments. And with Bright, HPC admins do not need to be experts on emerging technologies (e.g., Apache Hadoop including HDFS and YARN, Apache Spark and OpenStack) to deploy and maintain environments for pilot or production purposes. An introduction to Bright Cluster Manager for HPC, Hadoop, and OpenStack will be presented. A use case will be shared to illustrate how Bright Computing has used Bright OpenStack to create our own Private Cloud used for developing and testing our own software product. Our Private Cloud is then used for remote demonstrations to show the power of Bright Cluster Manager. 
Customers have begun using Bright to accelerate the creation of Hadoop clusters for Big Data Analytics and OpenStack clusters for temporary and permanent Private Clouds alongside their existing HPC environments. Additional features have been added to Bright specifically for the Cray Management Nodes and will be listed for those who are interested. A live demo will show how easy it is for Bright Cluster Manager to monitor a site's custom metrics, such as FLEXlm licenses, and display graphical data with a simple click of the mouse. Cray System Snapshot Analyzer (SSA) Richard J. Duckworth (Cray Inc.) Abstract Abstract The Cray System Snapshot Analyzer (SSA) represents a new support technology offering. SSA is a managed technology program designed to collect and analyze key customer system information. With SSA, we are targeting three areas of improvement. These areas are (1) reducing turn-around time for the collection of data in response to customer inquiries and issues, (2) improving detection of and resolution time for customer system issues, and (3) improving Cray’s knowledge of the product configurations in the field throughout their life-cycle. In this paper, we will first provide an overview of SSA. Next, we will discuss anticipated benefits to Cray and, most importantly, to our customers. We will then discuss the architecture, including measures to ensure transparency in the operation of SSA and its security features. Finally, we will discuss the anticipated release and feature schedules for SSA. Paper · Filesystems & I/O, Systems Technical Session 8B Chair: Jason Hill (Oak Ridge National Laboratory) A Storm (Lake) is Coming to Fast Fabrics: The Next-Generation Intel® Omni-Path Architecture Barry Davis (Intel Corporation) Abstract Abstract The Intel® Omni-Path Architecture, Intel’s next-generation fabric product line, is designed around industry-leading technologies developed as a result of Intel’s multi-year fabric development program. The Intel Omni-Path Architecture will deliver new levels of performance, resiliency, and scalability, overcoming InfiniBand limitations and paving the path to Exascale. Learn how the Intel Omni-Path Architecture will deliver significant enhancements and optimization for HPC at both the host and fabric levels, providing huge benefits to HPC applications over standard InfiniBand-based designs. Data Transfer Study for HPSS Archiving James R. Wynne, Suzanne T. Parete-Koon, Quinn D. Mitchell, Stanley White, and Tom Barron (Oak Ridge National Laboratory) Abstract Abstract The movement of large data produced by codes run in a High Performance Computing (HPC) environment can be a bottleneck for project workflows. To balance filesystem capacity and performance requirements, HPC centers enforce data management policies to purge old files to make room for new user project data. Users at the Oak Ridge Leadership Computing Facility (OLCF) and other HPC user facilities must archive data to avoid the purge; therefore, the time associated with data movement is something that all users must consider. This study observed the difference in transfer speed from the Lustre filesystem to the High Performance Storage System (HPSS) using a number of different transfer agents. The study tested files that spanned a variety of sizes and compositions that reflect OLCF user data. This information will be used to help users of Titan plan their workflow and archival data transfers to increase their project’s efficiency. 
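Since the HPSS study above compares transfer agents by measuring Lustre-to-HPSS transfer speed, a small timing harness conveys the basic method. The sketch below is hypothetical and not taken from the paper: it assumes the hsi client is on PATH and that the quoted "put local : remote" command form and the /archive destination path are accepted at your site, all of which should be verified against local HPSS documentation.

    import subprocess
    import time
    from pathlib import Path

    def time_transfer(cmd):
        """Run one archival transfer command and return (elapsed seconds, return code)."""
        start = time.time()
        result = subprocess.run(cmd, capture_output=True, text=True)
        return time.time() - start, result.returncode

    def benchmark(files, agents):
        """Time each transfer agent against each test file and report effective MB/s."""
        for path in files:
            size_mb = Path(path).stat().st_size / 1e6
            for name, make_cmd in agents.items():
                elapsed, rc = time_transfer(make_cmd(path))
                rate = size_mb / elapsed if rc == 0 else 0.0
                print("{:10s} {}: {:8.1f} s  {:8.1f} MB/s  rc={}".format(name, path, elapsed, rate, rc))

    if __name__ == "__main__":
        # The hsi command string is an assumed single-file form; adapt it to your site's HPSS setup,
        # and add entries for other agents (e.g. htar) to compare them under the same harness.
        agents = {
            "hsi": lambda p: ["hsi", "put {} : /archive/{}".format(p, Path(p).name)],
        }
        benchmark(["testfile_1g.dat"], agents)

Running the same harness over files of different sizes and compositions is what lets a study like this separate per-file overhead from sustained bandwidth.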
Applying Advanced IO Architectures to Improve Efficiency in Single and Multi-Cluster Environments Mike Vildibill (DataDirect Networks) Abstract Abstract For 15 years DDN has been working with the majority of leading supercomputing facilities, pushing the limits of storage IO to improve the productivity of the world’s largest systems. Storage technology advancements toward Exascale have not progressed as quickly as computing technology. The gap cannot be bridged by improving today's technologies - drive interface speeds are not increasing fast enough, parallel file systems need optimization to accomplish Exascale concurrency, and scientists will always want to model more data than is financially reasonable to hold in memory. Discontinuous innovation is called for. Birds of a Feather · Systems Interactive 9A Chair: Michael Showerman (National Center for Supercomputing Applications) Systems monitoring of Cray systems Mike Showerman (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract This session is intended to present some of the challenges and solutions to monitoring Cray systems. The range of topics includes data collection methods, application impact analysis for large-scale systems, data storage strategies, and visualization. While solutions for monitoring compute node statistics are beginning to mature, there remain many challenges in integrating data across subsystems that produce the insights necessary for effective administration. We will seek to gather current best practices as well as approaches to produce cross-cutting data that incorporates job, system, and filesystem information to maximize the end-to-end performance of Cray systems. The session will include a few short presentations from sites and move to an open discussion format. The product will be a summary of the findings and a report made available to CUG sites. Birds of a Feather · PE & Applications, Systems Interactive 9B Chair: Suzanne T. Parete-Koon (Oak Ridge National Laboratory) Getting the Most Out of HPC User Groups Suzanne Parete-Koon (Oak Ridge National Laboratory) Abstract Abstract User groups can provide HPC user facilities with valuable feedback about current and future center resources, services, and policies. User groups serve as a hub that regularly allows users and HPC facility staff to connect and identify user needs for training, software, and hardware. They also provide a forum where facility staff and vendors, such as Cray, can make users aware of beneficial new resources and services. Paper · Systems Technical Sessions 10A Chair: Don Maxwell (ORNL) Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective Devesh Tiwari, Saurabh Gupta, Jim Rogers, and Don Maxwell (Oak Ridge National Laboratory) Abstract Abstract The Titan supercomputer, currently the world's second-fastest supercomputer for open science, has more than 18,000 GPUs that domain scientists routinely use to perform scientific simulations. While GPUs have been shown to be performance-efficient, their reliability, utilization and energy-efficiency at scale have not been fully understood. In this work, we present a detailed study of GPU reliability characteristics, performance and energy efficiency. We share our experience with the 18,688 GPUs on the Titan supercomputer and our findings from the process of operating these GPUs efficiently. We hope that our experience will be beneficial to other GPU-enabled HPC sites as well. 
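Screening for the kinds of GPU health issues examined in these Titan and CSCS studies can be approximated with the vendor query interface. The sketch below wraps nvidia-smi's CSV query mode to flag GPUs that exceed simple temperature or uncorrected-ECC thresholds; the query field names and thresholds are assumptions (available fields vary by driver version), and this is an illustration rather than the method used by either set of authors.

    import csv
    import io
    import subprocess

    # Query fields are assumptions; list the names your driver supports with 'nvidia-smi --help-query-gpu'.
    FIELDS = ["index", "name", "temperature.gpu", "ecc.errors.uncorrected.aggregate.total"]

    def to_int(value):
        """Treat '[N/A]' or blank fields as zero so boards without ECC reporting do not raise."""
        value = value.strip()
        return int(value) if value.isdigit() else 0

    def query_gpus():
        """Return one dict per GPU parsed from nvidia-smi's CSV query output."""
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
            check=True, capture_output=True, text=True,
        ).stdout
        reader = csv.reader(io.StringIO(out), skipinitialspace=True)
        return [dict(zip(FIELDS, row)) for row in reader]

    def suspect_gpus(max_temp=85, max_ecc=0):
        """Flag GPUs whose temperature or uncorrected-ECC count exceeds site-chosen thresholds."""
        flagged = []
        for gpu in query_gpus():
            if to_int(gpu["temperature.gpu"]) > max_temp or to_int(gpu[FIELDS[3]]) > max_ecc:
                flagged.append(gpu)
        return flagged

    if __name__ == "__main__":
        for gpu in suspect_gpus():
            print("suspect GPU:", gpu)

In practice such a check would run node-local under the batch prologue or a health-check framework, with flagged nodes drained rather than merely reported.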
Detecting and Managing GPU Failures Nicholas P. Cardo (Swiss National Supercomputing Centre) Abstract Abstract GPUs have been found to have a variety of failure modes. The easiest to detect and correct is a clear hardware failure of the device. However, there are a number of not-so-obvious failures that can be more difficult to detect. With the objective of providing a stable and reliable GPU computing platform, it is imperative to identify problematic GPUs and remove them from service. At the Swiss National Supercomputing Centre (CSCS), a significant amount of effort has been invested in the detection and isolation of suspect GPUs. Techniques have been developed to identify suspect GPUs and automated testing put into practice, resulting in a more stable and reliable GPU computing platform. This paper will discuss these GPU failures and the techniques used to identify suspect nodes. Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity Jim Brandt, David DeBonis, and Ann Gentile (Sandia National Laboratories); Jim Lujan and Cindy Martin (Los Alamos National Laboratory); Dave Martinez, Stephen Olivier, and Kevin Pedretti (Sandia National Laboratories); Narate Taerat (Open Grid Computing); and Ron Velarde (Los Alamos National Laboratory) Abstract Abstract Operations management of the New Mexico Alliance for Computing at Extreme Scale (ACES) (a collaboration between Los Alamos National Laboratory and Sandia National Laboratories) Trinity platform will rely on data from a variety of sources including System Environment Data Collections (SEDC); node-level information, such as high-speed network (HSN) performance counters and high-fidelity energy measurements; the scheduler/resource manager; and plant environmental facilities. The water-cooled Cray XC platform requires a cohesive way to manage both the facility infrastructure and the platform due to several critical dependencies. We present preliminary results from analysis of integrated data on the Trinity Application Readiness Testbed (ART) systems as they pertain to enabling advanced operational analysis through the understanding of operational behaviors, relationships, and outliers. Paper · Systems Technical Session 14A Chair: Matthew A. Ezell (Oak Ridge National Laboratory) Cray XC System Node Level Diagnosability Jeffrey Schutkoske (Cray Inc.) Abstract Abstract Cray XC System node level diagnosability is not just about diagnostics. Diagnostics are just one aspect of the tool chain that includes BIOS, user commands, power and thermal data, and event logs. There are component-level tests that are used to check out the individual components, but quite often issues do not appear until full scale is reached. From experience over the last few years, we have seen that no single tool or diagnostic can be used to identify problems, but rather multiple tools and multiple sources of data must be analyzed to provide proper identification, isolation, and notification of hardware and software problems. This paper provides detailed examples using the existing tool chain to diagnose node faults within the Cray XC system. Cray XC System Level Diagnosability Roadmap Update Jeffrey Schutkoske (Cray Inc.) Abstract Abstract This paper highlights the current capabilities and the technical direction of Cray XC System level diagnosability. Cray has made a number of enhancements to existing diagnostics, commands, and utilities as well as providing new diagnostics, commands and utilities. 
This paper reviews the new capabilities that are available now, such as the Simple Event Correlator (SEC), Workload Test Suite (WTS), node diagnostics, and HSS diagnostic utilities. It will also look at what is planned for upcoming releases, including the initial integration of new technologies from the OpenStack projects. Lustre Resiliency: Understanding Lustre Message Loss and Tuning for Resiliency Chris A. Horn (Cray Inc.) Abstract Abstract Cray systems are engineered to withstand the loss of components; historically, however, Lustre has not been as resilient in some cases. In this paper, we discuss recent enhancements made to Lustre to improve resiliency, as well as best practices for realizing Lustre RAS on Cray systems, including how to tune timeouts and configure certain Lustre features for resiliency. Birds of a Feather · CUG Board, Filesystems & I/O, PE & Applications, Systems, XTreme Interactive Session 15B Paper · Systems Technical Session 17A Chair: Jim Rogers (Oak Ridge National Laboratory) Overview of KAUST’s Cray XC40 System – Shaheen II Bilel Hadri, Samuel Kortas, Saber Feki, Rooh Khurram, and Greg Newby (King Abdullah University of Science and Technology) Abstract Abstract In November 2014, King Abdullah University of Science and Technology (KAUST) acquired a Cray XC40 supercomputer along with DataWarp technology, a Cray Sonexion 2000 storage system, a Cray Tiered Adaptive Storage (TAS) system and a Cray Urika-GD graph analytics appliance. This new Cray XC40 system, installed in March 2015 and named Shaheen II, will deliver 25 times the sustained computing capability of KAUST’s current system. Shaheen II is composed of 6174 nodes representing a total of 197,568 processor cores tightly integrated with a richly layered memory hierarchy and a dragonfly interconnection network. Total storage space is 17 PB, with an additional 1.5 PB dedicated to burst buffer. An overview of the system’s specifications, the challenges raised in terms of power capping, and the software ecosystem monitoring its usage will be presented and discussed. Resource Utilization Reporting Two Year Update Andrew P. Barry (Cray Inc.) Abstract Abstract In the two years since CUG 2013, the Cray RUR feature has gone from PowerPoint to its fourth software release, running on a variety of Cray systems. The most basic features of RUR have proven the most interesting to the widest spread of users: CPU usage, memory usage, and energy usage are enduring concerns for site planning. Functionality added since the first release of RUR has largely focused on providing greater fidelity of measurement and support for a full range of hardware. Cray Advanced Platform Monitoring and Control (CAPMC) Steven J. Martin, David Rush, and Matthew Kappel (Cray Inc.) Abstract Abstract With SMW 7.2.UP02 and CLE 5.2.UP02, Cray released its platform monitoring and management API called CAPMC (Cray Advanced Platform Monitoring and Control). This API is primarily directed toward workload manager vendors to enable power-aware scheduling and resource management on Cray XC-series systems and beyond. In this paper, we give an overview of CAPMC features, applets, and their driving use cases. We further describe the RESTful architecture of CAPMC and its security model, and discuss tradeoffs made in design and development. Finally, we preview future enhancements to CAPMC in support of in-band control and additional use cases. 
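As a rough illustration of how a workload manager or site tool might consume CAPMC, the sketch below shells out to the capmc command (which fronts the RESTful API) and decodes its JSON reply. The node_status subcommand name and the shape of its output ('e' error code, per-state nid lists) are assumptions drawn from typical CAPMC usage and should be checked against the CAPMC API documentation for your SMW/CLE release.

    import json
    import subprocess

    def capmc(*args):
        """Invoke the capmc CLI (a front end to the CAPMC REST API) and decode its JSON reply."""
        out = subprocess.run(["capmc", *args], check=True, capture_output=True, text=True).stdout
        return json.loads(out)

    if __name__ == "__main__":
        # 'node_status' and the reply layout are assumptions for illustration only.
        status = capmc("node_status")
        if status.get("e", 0) == 0:
            for state in ("ready", "off", "halt"):
                nids = status.get(state, [])
                print("{:>6s}: {} nodes".format(state, len(nids)))
        else:
            print("CAPMC error:", status.get("err_msg", "unknown"))

A scheduler plugin would use the same pattern with power-related subcommands to implement power-aware placement, which is the use case the abstract highlights.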
Paper · PE & Applications, Systems Technical Session 18A Chair: Chris Fuson (ORNL) Custom Product Integration and the Cray Programming Environment Sean Byland and Ryan Ward (Cray Inc.) Abstract Abstract With Cray’s increasing customer base and product portfolio, a faster, more scalable, and more flexible software access solution for the Cray Programming Environment became necessary. The xt-asyncpe product offering required manual updates to add new product and platform support, took a significant amount of time to evaluate the environment when building applications, and didn’t harness useful standards used by the Linux community. CrayPE 2.x, by incorporating the flexibility of modules, the power of pkg-config and a programmatic design, offers a stronger solution going forward, with simplified extensibility, a more robust way of adding products to a system, and a significant reduction in application build time for users. This paper discusses the issues addressed and the improved functionality available to support Cray, customer, and third-party software access. Cray Storm Programming David Race (Cray Inc.) Abstract Abstract The Cray Cluster Storm is a dense but highly power-efficient computing platform for both current and next generation scientific applications. This product combines the latest Intel processors (Haswell), eight NVIDIA K40s or K80s, and single/dual Mellanox IB connections into a hardware package that delivers performance to applications. The ability to access this computing capability relies on the different programming options available to the users and their applications. At the end of this presentation, the user will have a basic understanding of the programming options available on the Storm and some basic performance information for these options. The basic programming options include compilers, OpenACC, MPI, and MPI+X. HPC Workforce Preparation Scott Lathrop (National Center for Supercomputing Applications) Abstract Abstract Achieving the full potential of today’s HPC systems, with all of their advanced technology components, requires well-educated and knowledgeable computational scientists and engineers. Blue Waters is committed to working closely with the community to train and educate current and future generations of scientists and engineers to enable them to make effective use of the extraordinary capabilities provided by Blue Waters and other petascale computing systems. Paper · Systems Technical Session 19A Chair: Jim Rogers (Oak Ridge National Laboratory) Monitoring and Analyzing Job Performance Using Resource Utilization Reporting (RUR) on A Cray XE6 System Shiquan Su (National Institute for Computational Sciences) and Troy Baer, Gary Rogers, Stephen McNally, Robert Whitten, and Lonnie Crosby (University of Tennessee) Abstract Abstract This paper describes the collection and analysis of job performance metrics using the Cray Resource Utilization Reporting (RUR) software on Mars, a Cray XE6 system at the National Institute for Computational Sciences (NICS). Cray introduced RUR as a new feature in the second half of 2013. With RUR, we can collect an easily expanded set of utilization data about each user’s applications. The overhead and scalability of RUR will be measured using an assortment of benchmarks that covers a wide range of typical cases in a realistic user environment, including compute-bound, memory-bound, and communication-bound applications. A number of the Cray-supplied data and output RUR plugins will be investigated. 
Possible integration with the XSEDE Metrics on Demand (XDMoD) project will also be discussed. Molecular Modelling and the Cray XC30 Power Management Counters Michael R. Bareford (EPCC, The University of Edinburgh) Abstract Abstract This paper explores the usefulness of the data provided by the power management (PM) hardware counters available on the Cray XC30 platform. PM data are collected for two molecular modelling codes, DL POLY and CP2K, both of which are run over multiple compute nodes. The first application is built for three programming environments (Cray, Intel and GNU); hence, the data collected is used to test the hypothesis that the choice of compiler should not impact energy use significantly. The second code, CP2K, is run in a mixed OpenMP/MPI mode, allowing us to explore the relationship between energy usage and thread count. Implementing "Pliris-C/R" Resiliency Features Into the EIGER Application Mike Davis (Cray Inc.), Joseph D. Kotulski (Sandia National Laboratories), and William W. Tucker (Consultant) Abstract Abstract EIGER is a frequency-domain electromagnetics simulation code based on the boundary element method. This results in a linear equation whose matrix is complex-valued and dense. To solve this equation, the Pliris direct-solver package from the Trilinos library is used to factor and solve the matrix. This code has been used on the Cielo XE6 platform to solve matrix equations of order 2 million, requiring 5000 nodes for 24 hours. Birds of a Feather Interactive 3C Birds of a Feather Interactive 4B Birds of a Feather Interactive 4C Birds of a Feather Interactive Session 15A Birds of a Feather Interactive Session 15C