CUG 2013 Proceedings | Created 2013-08-06
Birds of a Feather Interactive 3A
Chair: Colin McMurtrie (Swiss National Supercomputing Centre)

Birds of a Feather Interactive 3B
Chair: Helen He (National Energy Research Scientific Computing Center)

Programming Environments, Applications and Documentation SIG
Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

Birds of a Feather Interactive 3C

Birds of a Feather Interactive 8A
Chair: Nick Cardo (National Energy Research Scientific Computing Center)

Open discussion with CUG Board
Nick Cardo (National Energy Research Scientific Computing Center)

Birds of a Feather Interactive 8B
Chair: Duncan J. Poole (NVIDIA)

OpenACC BOF
Duncan Poole (NVIDIA)
Abstract: This BOF will discuss the status of OpenACC as an organization and as a specification. Topics of interest to CUG include the OpenACC 2.0 specification and member activities, including new products, benchmarks, example codes, and a profiling interface. Many OpenACC members will be present at CUG, and a lot of progress has been made, so this can be a lively interactive session.

Birds of a Feather Interactive 8C
Chair: John Hesterberg (Cray Inc.)

System Management Futures
John Hesterberg (Cray Inc.)
Abstract: System management futures. Discuss ideas about what comes next after the Installation, Image Management, Provisioning, and Configuration changes being planned at Cray. What is the right way to do system administration and management for exascale? What are your best practices in system administration and management for large systems?

Birds of a Feather Interactive 11A/B
Chair: David Henty (EPCC, The University of Edinburgh)

HPC training and education
David Henty (EPCC, The University of Edinburgh)
Abstract: Education and training activities have a crucial role in ensuring that the end-users of any HPC infrastructure are able to fully exploit the strengths of existing and future hardware and software resources. In this interactive session we will discuss the status of HPC education and training activities around the globe, identify existing and potential challenges, and possibly find some solutions to them as well. The session will be prefaced by two short case studies, one about experiences in running an MSc programme in HPC at EPCC, the University of Edinburgh, and another about organizing training activities within PRACE, the pan-European virtual research infrastructure for HPC. Attendees are invited to contribute similar case studies.

Birds of a Feather Interactive 11C
Chair: Jeff Keopp (Cray Inc.)

Cray External Services Systems
Jeff Keopp (Cray Inc.)

Birds of a Feather Interactive 17A
Chair: John Hesterberg (Cray Inc.)

System Monitoring, Accounting and Metrics
John Hesterberg (Cray Inc.)
Abstract: Let's talk about data collection!

Birds of a Feather Interactive 17B
Chair: Jenett Tillotson (Indiana University)

Experiences with Moab and TORQUE
Jenett Tillotson (Indiana University)
Abstract: This BoF will focus on administrator experiences with Moab and TORQUE: in particular, the interface with ALPS, experiences with Moab 7 and TORQUE 4, and running Moab and/or TORQUE outside the Cray on an external scheduling node. Attendees will be asked to share their configurations, and we will discuss possible best practices for Moab and TORQUE configurations on Cray systems.
Birds of a Feather Interactive 17C

Invited Talk General Session 4
Chair: Nick Cardo (National Energy Research Scientific Computing Center)

CUG Welcome
Nick Cardo (National Energy Research Scientific Computing Center)

Why we need Exascale, and why we won't get there by 2020
Horst D. Simon (Lawrence Berkeley National Laboratory)
Abstract: It may come as a surprise to many who are currently deeply engaged in research and development activities that could lead us to exascale computing that it has already been exactly six years since the first set of community town hall meetings were convened in the U.S. to discuss the challenges for the next level of computing in science. It was in April and May 2007 that three meetings were held in Berkeley, Argonne, and Oak Ridge, forming the basis for the first comprehensive look at exascale [1].

Invited Talk General Session 5
Chair: David Hancock (Indiana University)

Invited Talk General Session 9
Chair: Nick Cardo (National Energy Research Scientific Computing Center)

Big Bang, Big Data, Big Iron – Analyzing Data From The Planck Satellite Mission
Julian Borrill (Lawrence Berkeley National Laboratory)
Abstract: On March 21st, 2013, the European Space Agency announced the first cosmology results from its billion-dollar Planck satellite mission. The culmination of 20 years of work, Planck's observations of the Cosmic Microwave Background – the faint echo of the Big Bang itself – provide profound insights into the foundations of cosmology and fundamental physics.

Invited Talk General Session 10
Chair: David Hancock (Indiana University)

Introduction and CUG 2013 Best Paper Award
David Hancock (Indiana University)

The Changing Face of High Performance Computing
Rajeeb Hazra (Intel Corporation)
Abstract: The continuing growth of computer performance is delivering an unprecedented capability to solve increasingly complex problems. This growth in performance, along with the recent explosion of new devices, sensors, and social networks delivering real-time feeds over the web and into datacenters, is causing a flood of data, adding a new challenge in systems, software, and application development for organizations that are looking to convert this data into knowledge.

Invited Talk General Session 12
Chair: David Hancock (Indiana University)

Invited Talk Closing General Session 20
Chair: Nick Cardo (National Energy Research Scientific Computing Center)

Paper Technical Session 6A
Chair: Tina Butler (National Energy Research Scientific Computing Center)

Image Management and Provisioning System Overview
John Hesterberg (Cray Inc.)
Abstract: This document provides an overview of the new Image Management and Provisioning System (IMPS) under development at Cray. IMPS is a new set of features that changes how software is installed, managed, provisioned, booted, and configured on Cray systems. It focuses on adopting common industry tools and procedures where possible, combined with scalable Cray technology, to produce an enhanced solution ultimately capable of effectively supporting all Cray systems, from the smallest to the largest.

Paper Technical Session 6B
Chair: Jason Hill (Oak Ridge National Laboratory)

Instrumenting IOR to Diagnose Performance Issues on Lustre File Systems
Doug J. Petesch and Mark S. Swan (Cray Inc.)
Abstract: Large Lustre file systems are made of thousands of individual components, all of which have to perform nominally to deliver the designed I/O bandwidth. When the measured performance of a file system does not meet expectations, it is important to identify the slow pieces of such a complex infrastructure quickly. This paper will describe how Cray has instrumented IOR (a popular I/O benchmark program) to automatically generate pictures that show the relative performance of the many OSTs, servers, LNET routers, and other components involved. The plots have been used to diagnose many unique problems with Lustre installations at Cray customer sites.

Taking Advantage of Multicore for the Lustre Gemini LND Driver
James A. Simmons (Oak Ridge National Laboratory) and John Lewis (Cray Inc.)
Abstract: High performance computing systems have long embraced the move to multi-core processors, but parts of the operating system stack have only recently been optimized for this scenario. Lustre improved its performance on high core-count systems by keeping related work on a common set of cores, though low-level network drivers must be adapted to the new API. The multi-threaded Lustre network driver (LND) for the Cray Gemini high-speed network improved performance over its single-threaded implementation, but did not employ the benefits of the new API. In this paper, we describe the advantages of the new API and the performance gains achieved by modifying the Gemini LND to use it.

A file system utilization metric for I/O characterization
Andrew Uselton and Nicholas Wright (Lawrence Berkeley National Laboratory)
Abstract: Today, an HPC platform's "scratch" file system typically represents 10-20% of its cost. However, disk performance is not keeping up with gains in processors, so keeping the same relative I/O performance will require an increasingly larger fraction of the budget. It is therefore important to understand the I/O workload of HPC platforms in order to provision the file system correctly. Although it is relatively straightforward to measure the peak bandwidth of a file system, this accounts for only part of the overall load: the size of individual I/O transactions strongly affects performance. In this work we introduce a new metric for file system utilization that accounts for such effects and provides a better view of the overall load on the file system. We present a description of our model, our work to calibrate it, and early results from the file systems at NERSC. (An illustrative toy sketch of this size dependence appears below.)

Paper Technical Session 6C
Chair: Craig Stewart (Indiana University)

The Cray Programming Environment: Current Status and Future Directions
Luiz DeRose (Cray Inc.)
Abstract: The scale of current and future high end systems, as well as the increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to achieve high performance on peta-scale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high end HPC systems. Users must be supported by intelligent compilers and runtime systems, automatic performance analysis tools, adaptive libraries, and debugging and porting tools. Moreover, this programming environment must be capable of supporting millions of processing elements in a heterogeneous environment. In this talk I will present the recent activities and future directions of the Cray Programming Environment, which are being developed and deployed according to Cray's adaptive supercomputing strategy to improve users' productivity on Cray supercomputers.
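Referring back to "A file system utilization metric for I/O characterization" above: the sketch below is a hypothetical, minimal cost model (not the authors' calibrated model) showing why a size-blind bandwidth fraction can understate load when I/O transactions are small. The peak bandwidth and per-transaction overhead constants are invented for the example.

```c
#include <stdio.h>

/* Invented constants for the toy model. */
#define PEAK_BW  35.0e9   /* hypothetical peak bandwidth, bytes/s       */
#define OVERHEAD 0.5e-3   /* hypothetical fixed cost per transaction, s */

/* Modeled time the file system is "busy" serving one transaction. */
static double service_time(double bytes)
{
    return OVERHEAD + bytes / PEAK_BW;
}

/* Fraction of a wall-clock window consumed by equal-sized transactions
 * moving `volume` bytes in total. */
static double utilization(double volume, double txn_bytes, double window)
{
    double n_txns = volume / txn_bytes;
    return n_txns * service_time(txn_bytes) / window;
}

int main(void)
{
    double volume = 1.0 * 1024 * 1024 * 1024;  /* 1 GiB moved ...     */
    double window = 10.0;                      /* ... over 10 seconds */

    /* Size-blind view: delivered bandwidth as a fraction of peak. */
    printf("bandwidth fraction (size-blind): %.4f\n",
           (volume / window) / PEAK_BW);

    /* The same volume moved as 64 MiB transfers vs. 64 KiB transfers. */
    printf("modeled utilization, 64 MiB txns: %.4f\n",
           utilization(volume, 64.0 * 1024 * 1024, window));
    printf("modeled utilization, 64 KiB txns: %.4f\n",
           utilization(volume, 64.0 * 1024, window));
    return 0;
}
```

Under this toy model the delivered-bandwidth view reports a nearly idle file system for both workloads, while the 64 KiB workload keeps the file system busy for most of the window; that qualitative gap is the effect a size-aware utilization metric is meant to capture.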
Enhancements to the Cray Performance Measurements and Analysis Tools
Heidi Poxon (Cray Inc.)
Abstract: The Cray Performance Measurement and Analysis Tools offer performance measurement and analysis feedback for applications running on Cray multi-core and hybrid computing systems. As with any tool, using the Cray performance analysis toolset involves a learning curve. Recent work focuses on a new interface to obtain basic application performance information for users not familiar with the Cray performance tools. CrayPat-lite has been developed to provide performance statistics at the end of a job by simply loading a modulefile. After a program completes execution, output such as job size, wallclock time, MFLOPS, and the top time-consuming routines is automatically presented through stdout. Modifications to the "classic" performance tools interface have also been made to unify the two paths so that users who start with CrayPat-lite can easily transition to using CrayPat. This paper presents the CrayPat-lite enhancement to the toolset.

Cray Compiling Environment Update
Suzanne LaCroix and James Beyer (Cray Inc.)
Abstract: The Cray Compiling Environment (CCE) has evolved over the last several years to support high performance computing needs on Cray systems. New system architectures, new language standards, and ever-increasing performance and scaling requirements have driven this change. This talk will present an overview of current CCE capabilities and recently added features. Future plans and challenges will also be discussed.

Paper Technical Session 7A
Chair: Jeff Broughton (NERSC/LBNL)

New Member Talk: iVEC and the Pawsey Centre
Charles Schwartz (iVEC)
Abstract: The Pawsey Centre is a supercomputing facility being built in Kensington, Western Australia, to be operated by iVEC, an unincorporated joint venture of four public universities and CSIRO. It is a research facility, specialising in radio astronomy and geosciences, but available to the larger Australian academic research community as well.

The Evolution of Cray Management Services
Tara Fly, Alan Mutschelknaus, Andrew Barry and John Navitsky (Cray Inc.)
Abstract: Cray Management Services is quickly evolving to address the changing nature of Cray systems. NodeKares adds advanced features to support gang scheduling, reservation and application-level health checking, as well as other serviceability features. Lightweight Log Manager provides more complete and standardized log collection. Modular xtdumpsys will provide an extensible framework for system dumping. Resource Utilization Reporting provides a scalable, extensible framework for data collection, including power management, GPU utilization, and application resource utilization data. This paper presents these new features, including configuration, migration, and benefits.

CRAY XC30 Installation – A System Level Overview
Nicola Bianchi, Colin McMurtrie and Sadaf Alam (Swiss National Supercomputing Centre)
Abstract: In this paper we detail the installation of the 12-cabinet Cray XC30 system at the Swiss National Supercomputing Centre (CSCS). At the time of writing this is the largest such system worldwide, and hence the system-level challenges of this latest-generation Cray platform will be of interest to other sites. The intent is to present a systems and facilities point of view on the Cray XC30 installation and operational setup, and to identify key differences between the Cray XC30 and previous-generation Cray systems such as the Cray XE6. We identify key system configuration options and challenges when integrating the entire machine ecosystem into a complex operational environment: Sonexion 1600 Lustre storage appliance management and tuning, Lustre fine-grained routing, esLogin cluster installation and management using Bright Cluster Manager, IBM GPFS integration, Slurm installation, facility management, and network considerations.

Cray External Services Systems Overview
Harold Longley and Jeff Keopp (Cray Inc.)
Abstract: Cray External Services systems expand the functionality of the Cray XE/XK and Cray XC systems by providing more powerful external login (esLogin) nodes and an external Lustre file system (esFS). A management server (esMS) provides administration and monitoring functions as well as node provisioning and automated Lustre failover for the external file system. The esMS is available in a single-server or high-availability configuration. A great advantage of these systems is that the external Lustre file system remains available to the external login nodes regardless of the state of the Cray XE/XK or Cray XC system. External login nodes are the standard login node on Cray XC systems.

Paper Technical Session 7B
Chair: Andrew Uselton (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)

Architecting Resilient Lustre Storage Solution
John Fragalla (Xyratex)
Abstract: The concept of scratch HPC storage is quickly becoming less critical than the importance of high availability (HA) and reliability. In this presentation, Xyratex discusses architecting a resilient and reliable Lustre storage solution to increase availability and eliminate downtime within HPC environments for continual data access. Xyratex will discuss how solutions based on ClusterStor technologies address the architectural challenges of HA and reliability without sacrificing performance, protecting against hardware faults, power failures, data loss, and potential software issues through tight integration, thorough test processes, and an integrated Lustre storage platform. Xyratex will also describe its extensive disk drive testing at multiple stages to reduce disk failures and decrease annual failure rates (AFR); the benefits of providing options for live software patches, updates, and revisions by using failover and failback procedures; and the overall Xyratex ClusterStor based solution, which leverages these concepts within its design.

Sonexion 1600 I/O Performance
Nicholas P. Cardo (National Energy Research Scientific Computing Center)
Abstract: The Sonexion 1600 is the latest in Cray's storage products. An investigative look into the I/O performance of the new devices yields insights into the expected performance. Various I/O scenarios are explored by varying the number of readers and writers to files along with differing I/O patterns. These tests explore the performance characteristics of individual OSTs as well as the aggregate for the file system. Metadata performance is also investigated for creates, unlinks, and stats. In both cases, metadata and data, the investigation attempts to identify the sustained and peak performance of the Sonexion 1600. The results can then be used to design a file system on the Sonexion 1600 to achieve desired I/O performance.

OLCF's 1 TB/s, next-generation Spider file system
David Dillow, Sarp Oral, Douglas Fuller, Jason Hill, Dustin Leverman, Sudharshan Vazhkudai, Feiyi Wang, Kim Youngjae, James H. Rogers, James Simmons and Ross G. Miller (Oak Ridge National Laboratory)
Abstract: The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) has a long history of deploying the world's fastest supercomputers to enable open science. At the time it was deployed in 2008, the Spider file system had a formatted capacity of 10 PB and sustained transfer speeds of 240 GB/s, which made it the fastest Lustre file system in the world. However, the addition of Titan, a 27 PFLOPS Cray XK7 system, along with other OLCF computational resources, has radically increased the I/O demand beyond the capabilities of the existing Spider parallel file system. The next-generation Spider Lustre file system is designed to provide 32 PB of capacity to open science users at OLCF, at an aggregate transfer rate of 1 TB/s. This paper details the architecture, design choices, and configuration of the next-generation Spider file system at OLCF.

Paper Technical Session 7C
Chair: Sadaf R. Alam (CSCS)

Optimizing GPU to GPU Communication on Cray XK7
Jeff M. Larkin (NVIDIA)
Abstract: When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must consider the effect of the GPU's distinct address space and the PCIe bus on application scalability. Without such considerations, applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper will demonstrate methods for optimizing GPU to GPU communication and present XK7 results for these methods.

Debugging and Optimizing Programs Accelerated with Intel® Xeon® Phi™ Coprocessors
Chris Gottbrath (Rogue Wave Software)
Abstract: Intel® Xeon® Phi™ coprocessors present an exciting opportunity for Cray users to take advantage of many-core processor technology. Since the Intel Xeon Phi coprocessor shares many architectural features and much of the development tool chain with multi-core Intel Xeon processors, it is generally fairly easy to get a program running on the Intel Xeon Phi coprocessor. However, taking full advantage of the Intel Xeon Phi coprocessor requires expressing a level of parallelism that may require significant re-thinking of algorithms. Scientists need tools that allow them to debug and optimize hybrid MPI/OpenMP parallel applications that may have dozens or even hundreds of threads per node.

Portable and Productive Performance on Hybrid System with OpenACC Compilers and Tools
Luiz DeRose (Cray Inc.)
Abstract: The current trend in the supercomputing industry is to provide hybrid systems with accelerators attached to multi-core processors. Some of the critical hurdles for the widespread adoption of accelerated computing in high performance computing are portability and programmability. In order to facilitate the migration to hybrid systems with accelerators attached to CPUs, users need a simple programming model that is portable across machine types. Moreover, to allow users to maintain a single code base, this programming model, and the required optimization techniques, should not be significantly different for "accelerated" nodes from the approaches used on current multi-core x86 processors.

Tesla vs Xeon Phi vs Radeon: A Compiler Writer's Perspective
Brent Leback, Douglas Miles and Michael Wolfe (The Portland Group)
Abstract: Today, most CPU+Accelerator systems incorporate NVIDIA GPUs. Intel Xeon Phi and the continued evolution of AMD Radeon GPUs make it likely we will soon see, and have to program, a wider variety of CPU+Accelerator systems. PGI already supports NVIDIA GPUs and is working to add support for Xeon Phi and AMD Radeon. This talk explores the features common to all three types of accelerators, those unique to each, and the implications for programming models and performance portability from a compiler writer's and applications perspective.

Paper Technical Session 13A
Chair: Douglas W. Doerfler (Sandia National Laboratories)

SeaStar Unchained: Multiplying the Performance of the Cray SeaStar Network
David A. Dillow and Scott Atchley (Oak Ridge National Laboratory)
Abstract: The Cray SeaStar ASIC, with its programmable embedded processor, provides an excellent platform to investigate the properties of various network protocols and programming interfaces. This paper describes our native implementation of the Common Communication Interface (CCI) on the SeaStar platform and details how we implemented full operating system (OS) bypass for common operations. We demonstrate a 30% to 50% reduction in latency, more than a six-fold increase in message injection rate, and an almost 7x improvement in bandwidth for small message sizes when compared to the generic Cray Portals implementation.

Intel Multicore, Manycore, and Fabric Integrated Parallel Computing
Jim Jeffers (Intel Corporation)
Abstract: Dramatic increases in node-level parallelism are here with the introduction of many-core Intel® Xeon Phi™ coprocessors along with the continued generational core increases in multi-core Intel® Xeon® processors. Jim will discuss the impacts on software development for these platforms and the important considerations for scaling highly parallel applications both within the node and across clusters. He will also discuss Intel's current network fabric products and the future directions Intel is pursuing to address the next critical challenge: efficient internode communications for the next generation of HPC platforms.

Understanding the Impact of Interconnect Failures on System Operation
Matthew A. Ezell (Oak Ridge National Laboratory)
Abstract: Hardware failures are inevitable on large high performance computing systems. Faults or performance degradations in the high-speed network can reduce the entire system's performance. Since the introduction of the Gemini interconnect, Cray systems have become resilient to many networking faults. These new network reliability and resiliency features have enabled higher uptimes on Cray systems by allowing them to continue running with reduced network performance. Oak Ridge National Laboratory has developed a set of user-level diagnostics that stress the high-speed network and search for components that are not performing as expected. Nearest-neighbor bandwidth tests check every network chip and network link in the system. Additionally, performance counters stored in the network ASIC's memory-mapped registers (MMRs) are used to get a fuller picture of the state of the network. Applications have also been characterized under various suboptimal network conditions to better understand what impact network problems have on user codes.
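To give a flavour of the kind of user-level diagnostic described in the abstract above, the sketch below is a minimal MPI bandwidth probe between paired ranks. It is an illustration only, not the ORNL diagnostic suite: the pairing here is simply even/odd rank order, whereas a real nearest-neighbor test would pair ranks by their physical location on the network.

```c
/* Minimal nearest-neighbor bandwidth probe: even ranks stream a buffer
 * to the next rank and report the sustained point-to-point bandwidth of
 * that pair.  Illustrative only; a production diagnostic would pair
 * ranks by their physical network location, not by rank order. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (8 * 1024 * 1024)   /* 8 MiB per message */
#define REPS      50

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_BYTES);
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    int active  = (partner >= 0 && partner < size);

    MPI_Barrier(MPI_COMM_WORLD);          /* start all pairs together */
    double t0 = MPI_Wtime();
    for (int i = 0; active && i < REPS; i++) {
        if (rank % 2 == 0)
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double elapsed = MPI_Wtime() - t0;

    if (active && rank % 2 == 0)          /* one report per pair */
        printf("ranks %d-%d: %.1f MB/s\n", rank, partner,
               REPS * (double)MSG_BYTES / elapsed / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

A pair that reports markedly lower bandwidth than its peers points at a suspect link or router for closer inspection.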
Paper Technical Session 13B
Chair: Jason Hill (Oak Ridge National Laboratory)

The Changing Face of Storage for Exascale
Brent Gorda (Intel Corporation)
Abstract: Cray joins Intel (Whamcloud), The HDF Group, EMC, and DDN as partners in the US Department of Energy FastForward program, which is aimed at spurring research in key technologies for exascale. This two-year program is mostly research, but does have proof-of-concept (and open source) code delivery attached. As we near the halfway point in the program, we will present the big exascale picture, progress to date, and the view of the path forward at this point in time.

Cray's Implementation of LNET Fine Grained Routing: Overview and Characteristics
Mark S. Swan (Cray Inc.) and Nic Henke (Xyratex)
Abstract: As external Lustre file systems become larger and more complicated, configuring the Lustre network transport layer (LNET) can also become more complicated. This paper will focus on where Fine Grained Routing (FGR) came from, why Cray uses FGR, tools Cray has developed to aid in FGR configurations, analysis of FGR schemes, and performance characteristics.

Discovery in Big Data using a Graph Analytics Appliance
Amar Shan and Ramesh Menon (Cray Inc.)
Abstract: Discovery, the uncovering of hidden relationships and unknown patterns, lies at the heart of advancing knowledge. Discovery has long been viewed as the province of human intellect, with automation difficult. However, things have to change: the explosion of Big Data has made automating the synthesis of insight from raw data mandatory.

Paper Technical Session 13C
Chair: Helen He (National Energy Research Scientific Computing Center)

Using the Cray Gemini Performance Counters
Kevin Pedretti, Courtenay Vaughan, Richard Barrett, Karen Devine and K. Scott Hemmert (Sandia National Laboratories)
Abstract: This paper describes our experience using the Cray Gemini performance counters to gain insight into the network resources being used by applications. The Gemini chip consists of two network interfaces and a common router core, each providing an extensive set of performance counters. Based on our experience, we have found some of these counters to be more enlightening than others. More importantly, we have performed a set of controlled experiments to better understand what the counters are actually measuring. These experiments led to several surprises, described in this paper. This supplements the documentation provided by Cray and is essential information for anybody wishing to make use of the Gemini performance counters. The MPI library and associated tools that we have developed for gathering Gemini performance counters are described and are available to other Cray users as open-source software.

Performance Measurements of the NERSC Cray Cascade System
Harvey J. Wasserman, Nicholas J. Wright, Brian M. Austin and Matthew J. Cordery (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: We present preliminary performance results for NERSC's "Edison" system, one of the first Cray XC30 supercomputers. The primary new feature of the XC30 architecture is the Cray Aries interconnect. We use several network-centric microbenchmarks to measure the Aries' substantial improvements in bandwidth, latency, message rate, and scalability. The distinctive contribution of this work consists of performance results for the NERSC Sustained System Performance (SSP) application benchmarks. The SSP benchmarks span a wide range of science domains, algorithms, and implementation choices, and provide a more holistic performance metric. We examine the performance and scalability of these benchmarks on the XC30 and compare performance with other state-of-the-art HPC platforms. Edison nodes are composed of two 8-core Intel "Sandy Bridge" processors, with two hyperthreads per core. With 32 hardware threads per node, multi-threading is essential for optimal performance. We report the OpenMP, core-specialization, and hyperthreading settings that maximize SSP on the XC30.

From thousands to millions: visual and system scalability for debugging and profiling
Mark O'Connor, David Lecomber, Ian Lumb and Jonathan Byrd (Allinea Software)
Abstract: Behind the achievements of double-digit petaflop counts, million-core systems, and sustained petaflop real-world applications, software tools have been the silent unsung heroes. Whether aiding software migration or solving a critical acceptance bug, tools such as Allinea DDT have been ready. We will explore how Allinea DDT has been prepared for today's hybrid Cray XK7s and the Cray XC30s.

Paper Technical Session 14A
Chair: Ashley Barker (Oak Ridge National Laboratory)

Investigating Topology Aware Scheduling
David Jackson (Adaptive Computing)
Abstract: For many years, HPC networks have been able to assume good support for all-to-all communications, meaning that no matter how workloads were placed across the network, the application would experience maximum performance. While all networks have some limitations associated with their underlying hardware and topology, the difference between the best possible allocation and the worst was often small enough to be in the realm of statistical noise, and thus any associated issues were generally ignored. Now, as systems and workloads grow into the petascale and exascale range, the communication within an application becomes massive, and the difference between best-case and worst-case allocations becomes significant. The differences between one placement decision and another can now noticeably impact application efficiency and job run-time consistency, and even impact neighboring workloads.

External Torque / Moab and Fairshare on the Cray XC30
Tina Declerck and Iwona Sakrejda (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory)
Abstract: NERSC's new Cray XC30, Edison, utilizes a new capability in Adaptive Computing's Torque 4.x and Moab 7.x products which allows the Torque server and Moab to execute external to the mainframe. This configuration offloads the mainframe server database and provides a unified view of the workload. Additionally, it allows job submissions when the mainframe is unavailable or offline. This paper discusses the configuration process, differences between the old and new methods, troubleshooting techniques, fairshare experiences, and user feedback. While this capability addresses some of the needs of the NERSC community, it is not without tradeoffs and challenges.

Production Experiences with the Cray-Enabled TORQUE Resource Manager
Matthew A. Ezell and Don Maxwell (Oak Ridge National Laboratory) and David Beer (Adaptive Computing)
Abstract: High performance computing resources utilize batch systems to manage the user workload. Cray systems are uniquely different from typical clusters due to Cray's Application Level Placement Scheduler (ALPS). ALPS manages binary transfer, job launch and monitoring, and error handling. Batch systems require special support to integrate with ALPS using an XML protocol called BASIL.

Paper Technical Session 14B
Chair: Steve Simms (Indiana University)

Evaluation of A Flash Storage Filesystem on the Cray XE-6
Jay Srinivasan and Shane Canon (Lawrence Berkeley National Laboratory)
Abstract: This paper will discuss some of the approaches and show early results for a flash file system mounted on a Cray XE-6 using high-performance PCIe-based cards. We also discuss some of the gaps and challenges in integrating flash into HPC systems and potential mitigations, as well as new solid state storage technologies and their likely role in the future.

Analysis of the Blue Waters File System Architecture for Application I/O Performance
Kalyana Chadalavada and Robert Sisneros (National Center for Supercomputing Applications, University of Illinois)
Abstract: The NCSA Blue Waters system features one of the fastest file systems for scientific applications. Using Lustre file system technology, Blue Waters provides over 1 TB/s of usable storage bandwidth. The underlying storage units are connected to the compute nodes in a unique fashion: the Blue Waters file system connects a subset of storage units to the high-speed torus network at distinct points. Utilizing standard benchmarks and scientific applications, we examine the impact of this architecture on application I/O performance. Given the size of the system and its intended applications, scaling I/O performance will be a challenge. Identifying the optimal I/O methodology can help alleviate a large number of application performance issues. All exercises are done in a production environment to ensure that beneficial results are directly applicable to Blue Waters users.

Trillion Particles, 120,000 cores, and 350 TBs: Lessons Learned from a Hero I/O Run on Hopper
Suren Byna and Andrew Uselton (Lawrence Berkeley National Laboratory), Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), David Knaak (Cray Inc.) and Yun (Helen) He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Modern petascale applications can present a variety of configuration, runtime, and data management challenges when run at scale. In this paper, we describe our experiences in running a large-scale plasma physics simulation, called VPIC, on the NERSC Hopper Cray XE6 system. The simulation ran on 120,000 cores using ~80% of computing resources, 90% of the available memory on each node, and 50% of a Lustre file system. Over two trillion particles were simulated for 23,000 timesteps, and 10 one-trillion-particle dumps, each ranging between 30 and 42 TB, were written to HDF5 files at a sustained rate of ~27 GB/s. To the best of our knowledge, this job represents the largest I/O undertaken by a NERSC application and the largest collective writes to single HDF5 files. We outline several obstacles that we overcame in the process of completing this run, and list lessons learned that are of potential interest to HPC practitioners.

Paper Technical Session 14C
Chair: Helen He (National Energy Research Scientific Computing Center)

Performance Comparison of Scientific Applications on Cray Architectures
Haihang You, Reuben D. Budiardja, Jeremy Logan, Lonnie D. Crosby, Vincent Betro, Pragneshkumar Patel, Bilel Hadri and Mark Fahey (National Institute for Computational Sciences)
Abstract: Current HPC architectures are changing drastically and rapidly while mature scientific applications usually evolve at a much slower rate. New architectures almost certainly impact the performance of these heavily used scientific applications. Therefore, it is prudent to understand how the supposed performance benefits and improvements of new architectures translate to the applications. In this paper, we attempt to quantify the differences between theoretical performance improvements (due to changes in architecture) and "real-world" improvements in applications by gathering performance data for selected applications from the fields of chemistry, climate, weather, materials science, fusion, and astrophysics running on three different Cray architectures: XT5, XE6, and XC30. The performance evaluations of these selected applications on these three architectures may give the user perspective into the potential benefits of each architecture. These evaluations are done by comparing the improvements of numerical (micro)benchmarks to the improvements of the selected applications when run on these architectures.

First 12-cabinets Cray XC30 System at CSCS: Scaling and Performance Efficiencies of Applications
Sadaf Alam, Themis Athanassiadou, Tim Robinson, Gilles Fourestey, Andreas Jocksch, Luca Marsella, Jean-Guillaume Piccinali and Jeff Poznanovic (Swiss National Supercomputing Centre)
Abstract: CSCS has recently deployed one of the largest Cray XC30 systems, which is composed of 6 groups, or 12 cabinets, of dual-socket Intel Sandy Bridge processors and the new Aries network chips with a dragonfly topology. With respect to earlier Cray XT and XE series platforms, the Cray XC30 has several unique features that have the potential to affect application performance: (1) Intel Xeon vs. AMD Opteron based nodes; (2) Aries vs. Gemini network and router chip; (3) PCIe vs. HyperTransport interface to the network chip; (4) dragonfly vs. 3D torus topology; (5) mixed optical and copper vs. all-copper cables; (6) number of compute nodes per communication NIC; (7) Hyper-Threading enabled nodes; and (8) compute cabinet layouts. In this report, we compare scaling and performance efficiencies of a range of applications on the CSCS Cray XC30 and Cray XE6 platforms.

Effects of Hyper-Threading on the NERSC workload on Edison
Zhengji Zhao, Nicholas J. Wright and Katie Antypas (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center)
Abstract: Edison, a Cray XC30, is NERSC's newest petascale supercomputer. Along with the Aries interconnect, Hyper-Threading (HT) is one of the new technologies available on the system. HT provides simultaneous multithreading capability on each core, with two hardware threads available. In this paper, we analyze the potential benefits of HT for the NERSC workload by investigating the performance implications of HT on a few selected applications among the top 15 codes at NERSC, which represent more than 60% of the workload. By connecting the observed HT results with more detailed profiling data, we discuss whether it is possible to predict how and when users should utilize HT in their production computations on Edison.
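A back-of-the-envelope way to explore the question raised in the Hyper-Threading abstract above is to time the same threaded kernel once with one thread per core and once with two. The sketch below is such a toy probe, written for this summary rather than taken from the paper; the kernel, array sizes, and run parameters are all invented.

```c
/* Toy OpenMP probe for comparing runs with and without Hyper-Threading:
 * run once with one thread per core and once with two, then compare the
 * reported times.  Illustrative only; the paper evaluates full NERSC
 * applications, not a micro-kernel like this. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 25)   /* ~33 M elements, ~256 MiB per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double sum = 0.0;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < N; i++)
        sum += a[i] * b[i];
    double t1 = omp_get_wtime();

    printf("threads=%d  dot=%g  time=%.3f s\n",
           omp_get_max_threads(), sum, t1 - t0);

    free(a);
    free(b);
    return 0;
}
```

On an XC30 node one might launch this once with 16 threads and once with 32 (for example via aprun's -d and -j options, subject to site documentation) and compare the reported times; memory-bandwidth-bound kernels such as this one typically gain little from the second hardware thread, whereas latency-bound kernels can benefit.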
Paper Technical Session 15A
Chair: Craig Stewart (Indiana University)

Preparing Slurm for use on the Cray XC30
Stephen Trofinoff and Colin McMurtrie (Swiss National Supercomputing Centre)
Abstract: In this paper we describe the technical details associated with the preparation of Slurm for use on the 12-cabinet XC30 system installed at the Swiss National Supercomputing Centre (CSCS). The system comprises internal and external login nodes and a new ALPS/BASIL version, so a number of technical challenges needed to be overcome in order to have Slurm working on the system. Thanks to a Cray-supplied emulator of the system interface, work was possible ahead of delivery, and this eased the installation when the system arrived. However, some problems were encountered, and their identification and resolution are described in detail. We also provide detail of the work done to improve the Slurm task affinity bindings on a general-purpose Linux cluster so that they match the Cray bindings as closely as possible, thereby providing our users with some degree of consistency in application behaviour between these systems.

Lessons From 20 Continuous Years of Cray/HPC Systems
Liam Forbes, Don Bahls, Gene McGill, Oralee Nudson and Gregory Newby (Arctic Region Supercomputing Center, UAF)
Abstract: The Arctic Region Supercomputing Center (ARSC) was founded in 1992/1993 with a Cray Y-MP (denali) and since then has operated or owned at least one Cray system, including most recently a Cray XK6m-200 (fish). For 20 years, ARSC has shared high performance computing (HPC) experiences, users, and problems with other university HPC centers, DoD HPC centers, and DoE HPC centers. In this paper, we document and present the user support and system administration lessons we have learned from the perspective of a smaller, regional university HPC center operating and supporting the same architectures as some of the largest systems in the world over that time. Comparisons to experiences with HPC hardware and software products from other vendors will be used to illustrate some of the points.

Cray Workload Management with PBS Professional 12.0
Scott Suchyta and Sam Goosen (Altair Engineering, Inc.)
Abstract: Changing requirements, trends, and technologies in HPC are frequent, and workload managers like PBS Professional must continually evolve to accommodate them. One challenge sites have faced has been configuring PBS to address their individual requirements. Site-defined custom resources and configurable scheduling policies were introduced to help accomplish this, but are insufficient to address more complex scenarios. A more robust infrastructure is required to manage the dynamic resources and policies that are unique to modern HPC sites. Our discussion will include examples that customers may wish to adopt or customize to address their specific needs, including admission control, allocation management, and on-the-fly tuning. Independent of plugins, PBS Professional supports the multithreaded processors available on current Cray platforms. Additional enhancements will become available when integration with BASIL version 1.3 is complete. In the interim, details about configuring these systems for use with PBS Professional 12.0 will be presented.

Paper Technical Session 15B
Chair: Robert Henschel (Indiana University)

Introduction to HSA Hardware, Software and HSAIL with A HPC Usage Example
Vinod Tipparaju (AMD, Inc.)
Abstract: Heterogeneous systems have been around for several years, and accelerator-based heterogeneous systems (CPU-GPU) have become popular in the last five years. In particular, accelerating general-purpose computation using GPUs is gaining momentum in both academic research and industry. OpenCL and CUDA are the two most popular programming models that enable end-application programmers to take advantage of the GPGPU through the compiler, runtime, and driver tool chain. While the opportunity of GPGPU has been opened up to expert programmers, it has not yet reached a broad audience, primarily for the following reasons: (i) the CPU-GPU system has a distributed, asymmetric memory that needs to be explicitly managed for coherency and synchronization; (ii) memory copies and kernel dispatch involve two-way, high-latency transfers; and (iii) there is a lack of support for dynamic scheduling or load balancing, advanced debugging, system calls, exception handling, etc.

Reliable Computation Using Unpredictable Components
Joel O. Stevenson, Robert A. Ballance, Suzanne M. Kelly, John P. Noe and Jon R. Stearley (Sandia National Laboratories) and Michael E. Davis (Cray Inc.)
Abstract: Based on our experiences over the last year running large simulations on the DOE/ASC platform Cielo, we will discuss strategies that enable large, long-running simulations to make predictable progress despite platform component failures. From an application perspective, complex systems like Cielo have multiple sources of interrupts and slowdowns that combine to make the system appear unpredictable. We will discuss the component failures observed and identify those where application recovery has been possible.

Requirements Analysis for Adaptive Supercomputing using the Cray XK7 as a Case Study
Sadaf R. Alam, Mauro Bianco, Ben Cumming, Gilles Fourestey, Jeffrey Poznanovic and Ugo Varetto (Swiss National Supercomputing Centre)
Abstract: In this report, we analyze the readiness of the code development and execution environment for adaptive supercomputers, where a processing node is composed of heterogeneous computing and memory architectures. Current instances of such a system are Cray XK6 and XK7 compute nodes, which are composed of x86_64 CPU and NVIDIA GPU devices and DDR3 and GDDR5 memories, respectively. Specifically, we focus on the integration of the CPU and accelerator programming environments, tools, MPI, and numerical libraries, as well as operational features such as resource monitoring, system maintainability, and upgradability. We highlight portable, platform-independent technologies that exist for the Cray XE, XK, and XC30 platforms and discuss dependencies in the CPU, GPU, and network tool chains that lead to current challenges for integrated solutions. This discussion enables us to formulate requirements for a future adaptive supercomputing platform, which could contain a diverse set of node architectures.

Paper Technical Session 15C
Chair: Douglas W. Doerfler (Sandia National Laboratories)

Improving the Performance of the PSDNS Pseudo-Spectral Turbulence Application on Blue Waters using Coarray Fortran and Task Placement
Robert A. Fiedler, Nathan Wichmann and Stephen Whalen (Cray Inc.) and Dmitry Pekurovsky (San Diego Supercomputer Center)
Abstract: The PSDNS turbulence application performs many 3D FFTs per time step, which entail frequently transposing distributed 3D arrays. These transposes are achieved via multiple concurrent all-to-all communication operations, which dominate the overall execution time at large scales. We improve the all-to-all times for benchmarks on 3072 to 12288 nodes using three main strategies: 1) eliminating off-node communication for one of the two sets of transposes by assigning one sheet of the 3D Cartesian grid to each node (35% speedup); 2) placing tasks on nodes that are distributed randomly throughout the Gemini network in order to maximize the all-to-all bandwidth that can be utilized by the job's nodes (21% speedup); and 3) reducing contention and overhead by replacing calls to MPI_Alltoall with a drop-in library written in Coarray Fortran (33% speedup). We also describe how this library is implemented and integrated efficiently in PSDNS.

A Review of The Challenges and Results of Refactoring the Community Climate Code COSMO for Hybrid Cray HPC Systems
Benjamin Cumming (Swiss National Supercomputing Centre), Carlos Osuna (Center for Climate Systems Modeling ETHZ), Tobias Gysi (Supercomputing Systems AG), Mauro Bianco (Swiss National Supercomputing Centre), Xavier Lapillonne and Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss) and Thomas C. Schulthess (ETH Zurich)
Abstract: We summarize the results of porting the numerical weather simulation code COSMO to different hybrid Cray HPC systems. COSMO was written in Fortran with MPI, and the aim of the refactoring was to support both many-core systems and GPU-accelerated systems with minimal disruption to the user community. With this in mind, different approaches were taken to refactor the different components of the code: the dynamical core was refactored with a C++-based domain-specific language for structured grids, which provides both CUDA and OpenMP back ends, and the physical parameterizations were refactored by adding OpenACC and OpenMP directives to the original Fortran code. This report gives a detailed description of the challenges presented by such a large refactoring effort using different languages on Cray systems, along with performance results on three different Cray systems at CSCS: Rosa (XE6), Todi (XK7) and Daint (XC30).

CloverLeaf: Preparing Hydrodynamics Codes for Exascale
Andrew C. Mallinson and David A. Beckingsale (University of Warwick), Wayne P. Gaudin and John A. Herdman (Atomic Weapons Establishment), John M. Levesque (Cray Inc.) and Stephen A. Jarvis (University of Warwick)
Abstract: In this work we directly evaluate five candidate programming models for future exascale applications (MPI, MPI+OpenMP, MPI+OpenACC, MPI+CUDA and CAF) using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE. Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scale required to reach exascale levels of computation on future system architectures. We present our results and experiences of scaling these different approaches to high node counts on existing large-scale Cray systems (Titan and HECToR). We also examine the effect that improving the mapping between process layout and the underlying machine interconnect topology can have on performance and scalability, as well as highlighting several communication-focused optimisations.
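To make the directive-based models in the CloverLeaf comparison above a little more concrete, the sketch below expresses the same simple cell-centred update once with OpenMP and once with OpenACC. It is written for this summary and is not CloverLeaf source; the array sizes, variable names, and the ideal-gas-style update are invented for illustration.

```c
/* The same cell-centred update under two of the programming models
 * compared in the CloverLeaf study: OpenMP for the host CPU and OpenACC
 * for an attached accelerator.  Illustrative only; not CloverLeaf code. */
#include <stdio.h>

#define NX 2048
#define NY 2048

static double dens[NY][NX], energy[NY][NX], pressure[NY][NX];

/* Host version: threads across cores via OpenMP. */
void eos_openmp(double gamma)
{
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            pressure[j][i] = (gamma - 1.0) * dens[j][i] * energy[j][i];
}

/* Accelerator version: same loop body, with OpenACC handling offload
 * and data movement. */
void eos_openacc(double gamma)
{
    #pragma acc parallel loop collapse(2) copyin(dens, energy) copyout(pressure)
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            pressure[j][i] = (gamma - 1.0) * dens[j][i] * energy[j][i];
}

int main(void)
{
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++) { dens[j][i] = 1.0; energy[j][i] = 2.5; }

    eos_openmp(1.4);
    eos_openacc(1.4);
    printf("p[0][0] = %f\n", pressure[0][0]);
    return 0;
}
```

The point of such a comparison is that the loop body stays identical; what differs between the models is how parallelism is described and, for the accelerator case, how data movement is managed.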
Paper Technical Session 16A
Chair: John Noe (Sandia National Laboratories)

Methods and Results for Measuring Kepler Utilization on a Cray XK7
Jim Rogers (Oak Ridge National Laboratory), Roger Green (NVIDIA) and Kevin Peterson (Cray Inc.)
Abstract: NVIDIA is providing an API as part of their official CUDA 5.5 release branch (R319) that Cray can then use to provide specific and inherent utilization information from the Kepler GPU. NVIDIA and Cray will provide this capability as part of a featured release once the cadence for both the NVIDIA driver and the Cray software release is complete. The intent of the talk is to provide an early description of the driver changes, the API, the Cray interface, and some examples against the Titan workload using a pre-release version of both the NVIDIA driver/API and the Cray accounting software.

Resource Utilization Reporting on Cray Systems
Andrew P. Barry (Cray Inc.)
Abstract: Many Cray customers want to evaluate how their systems are being used, across a variety of metrics. Neither previous Cray accounting tools nor commercial server management software allows the collection of all the desirable statistics with minimal performance impact. Resource Utilization Reporting (RUR) is being developed by Cray to collect statistics on how systems are used. RUR provides a reliable, high-performance framework into which plugins may be inserted, which will collect data about the usage of a particular resource. RUR is configurable, extensible, and lightweight. Cray will supply plugins to support several sets of collected data, which will be useful to a wide array of Cray customers; customers can implement plugins to collect data uniquely interesting to that system. Plugins also support multiple methods to output collected data. Cray expects to release RUR in the second half of 2013.

The Complexity of Arriving at Useful Reports to Aid in the Successful Operation of an HPC Center
Ashley Barker, Adam Carlyle, Chris Fuson, Mitch Griffith and Don Maxwell (Oak Ridge National Laboratory)
Abstract: While reporting may not be the first item to come to mind as one of the many challenges that HPC centers face, it is certainly a task that all of us have to devote resources to. One of the biggest problems with reporting is determining what information is needed in order to make impactful decisions that can influence everything from policies to purchasing decisions. There is also the problem of how frequently to review the data collected. For some data points it is necessary to look at reports on a daily basis, while others are not useful unless examined over longer periods of time. This paper will look at the efforts the Oak Ridge Leadership Computing Facility has taken over the last few years to refine the data that is collected, reported, and reviewed.

Paper Technical Session 16B
Chair: Liz Sim (EPCC, The University of Edinburgh)

Building Balanced Systems for the Cray Datacenter of the Future
Keith Miller (DataDirect Networks)
Abstract: The top computing sites worldwide are faced with unique data access, management, and protection challenges. In this talk, DDN, the leader in massively scalable storage solutions for Big Data applications, will discuss how joint DDN and Cray customers are achieving balanced, highly productive HPC environments today in the face of huge capacity, performance, and reliability requirements, as well as directions in building the Cray data center of the future. The content will include recent developments and the roadmap for DDN block, file, object, and analytics solutions and appliances, and will touch on Lustre performance testing.

Surviving the Life Sciences Data Deluge using Cray Supercomputers
Bhanu Rekapalli and Paul Giblock (National Institute for Computational Sciences)
Abstract: The growing deluge of data in the life science domains threatens to overwhelm computing architectures. This persistent trend necessitates the development of effective and user-friendly computational components for rapid data analysis and knowledge discovery. Bioinformatics, in particular, employs data-intensive applications driven by novel DNA-sequencing technologies, as do the high-throughput approaches that complement proteomics, genomics, metabolomics, and meta-genomics. We are developing massively parallel applications to analyze this rising flood of life sciences data for large-scale knowledge discovery. We have chosen to work with the desktop or cluster based applications most widely used by the scientific community, such as NCBI BLAST, HMMER, DOCK6, and MUSCLE. Our endeavors encompass extending highly scalable parallel applications that scale to tens of thousands of cores on Cray's XT architecture to Cray's next-generation XE, XK, and XC architectures, while also focusing on making them robust and optimized; this will be discussed in this paper.

Early Experience on Crays with Genomic Applications Used as Part of Next Generation Sequencing Workflow
Mikhail Kandel (University of Illinois), Steve Behling and Bill Long (Cray Inc.), Carlos P. Sosa (Cray Inc. and University of Minnesota Rochester), Sebastien Boisvert and Jacques Corbeil (Universite Laval) and Lorenzo Pesce (University of Chicago)
Abstract: Recent progress in DNA sequencing technology has yielded a new class of devices that allow for the analysis of genetic material with unprecedented speed and efficiency. These advances, styled under the name Next Generation Sequencing (NGS), are well suited for High-Performance Computing (HPC) systems. By breaking up DNA into millions of small strands (20 to 1000 bases) and reading them in parallel, the rate at which genetic material can be acquired has increased by several orders of magnitude. The technology to generate raw genomic data is becoming increasingly fast and inexpensive when compared to the rate at which this data can be analyzed. In general, assembling small reads into a useful form is done either by assembling individual reads (de novo) or by mapping these pieces against a reference. In this paper we present our experience with these applications on Cray supercomputers, in particular with Ray, a parallel short-read assembler.

Paper Technical Session 16C
Chair: Nicholas J. Wright (LBNL/NERSC)

Measuring Sustained Performance on Blue Waters with the SPP Metric
William Kramer (National Center for Supercomputing Applications)
Abstract: The Blue Waters Project developed the Sustained Petascale Performance (SPP) metric to assess the potential for the Blue Waters system to meet its goal of sustained petascale performance for a diverse set of science and engineering problems. The SPP, consisting of over 20 individual tests (code+input), is unique and truly representative of the ability of a system to support many areas of science and engineering. The SPP is a method that allows an accurate assessment of hybrid systems that have more than one type of node, which has not been possible before.
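The SPP abstract above does not spell out how the more than 20 test results are combined, so the sketch below should be read only as a generic illustration of building a composite sustained-performance figure from per-test rates. The geometric mean used here is a common choice for such composites, and the rates are invented; neither is necessarily the actual SPP definition.

```c
/* Generic composite-metric sketch: aggregate per-test sustained rates
 * (e.g., TF/s measured for each test problem) with a geometric mean so
 * that no single test dominates the composite.  The rates below are
 * made up, and the geometric mean is only one common aggregation
 * choice, not necessarily the exact SPP formula. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical sustained rates, in TF/s, for a handful of tests. */
    double rate[] = { 120.0, 85.0, 310.0, 42.0, 190.0, 77.0 };
    int n = sizeof rate / sizeof rate[0];

    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(rate[i]);

    double composite = exp(log_sum / n);   /* geometric mean */
    printf("composite sustained rate: %.1f TF/s\n", composite);
    return 0;
}
```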
Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7 Donald K. Berry (Indiana University), Joseph Schuchart (Technische Universität Dresden) and Robert Henschel (Indiana University) Abstract Abstract GPU computing has rapidly gained popularity as a way to achieve higher performance of many scientific applications. In this paper we report on the experience of porting a hybrid MPI+OpenMP molecular dynamics code to a GPU enabled CrayXK7 to make a hybrid MPI+GPU code. The target machine, Indiana University's Big Red II, consists of a mix of nodes equipped with two 16-core Abu Dhabi X86-64 processors, and nodes equipped with one AMD Interlagos X86-64 processor and one Nvidia Kepler K20 GPU board. The code, IUMD, is a Fortran program developed at Indiana University for modeling matter in compact stellar objects (white dwarf stars, neutron stars and supernovas). We compare experiences using CUDA and OpenACC. Chasing Exascale: the Future of GPU Computing Steve Scott (NVIDIA) Abstract Abstract Changes in underlying silicon technology are creating a significant disruption to computer architectures and programming models. Power has become the primary constraint to processor performance, and threatens our ability to continue historic rates of performance improvement. With silicon technology no longer providing the rapid rate of improvement it once did, we must rely on advances in architectural efficiency. This has led to the creation of heterogeneous (or accelerated) architectures, and the rise of GPU computing. Paper Technical Session 18A Chair: Ashley Barker (Oak Ridge National Laboratory) Blue Waters Acceptance: Challenges and Accomplishments Celso L. Mendes, Brett Bode, Gregory H. Bauer, Joseph R. Muggli, Cristina Beldica and William T. Kramer (National Center for Supercomputing Applications) Abstract Abstract Blue Waters, the largest supercomputer ever built by Cray, comprises an enormous amount of computational power. This paper describes some of the challenges encountered during the deployment and acceptance of Blue Waters, and presents how those challenges were handled by the NCSA team. After briefly reviewing our originally designed acceptance plans, we highlight the steps actually taken for that process, describe how those steps were conducted, and comment on lessons learned during that process. Besides listing the scope of the applied tests, we present an overview of their results and analyze the manner in which those results guided both the Cray and NCSA teams in tuning the system configuration. The Blue Waters acceptance testing process consisted of hundreds of tests summarized in the paper, covering many areas directly related to the Cray system as well as other items, such as the near-line storage and the external user-support environment. Saving Energy with “Free” Cooling and the Cray XC30 Brent Draney, Tina Declerck, Jeffrey Broughton and John Hutchings (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Abstract Abstract Located in Oakland, CA, NERSC is running its new XC30, Edison, using “free” cooling. Leveraging the benign San Francisco Bay Area environment, we are able to provide a year-round source of water from cooling towers alone (no chillers) to supply the innovative cooling system in the XC30. While this approach provides excellent energy efficiency (PUE < 1.1), it is not without its challenges. 
This paper describes our experience designing and operating such a system, the benefits that we have realized, and the trade-offs relative to conventional approaches. Real-time mission critical supercomputing with Cray systems Jason Temple and Luc Corbeil (Swiss National Supercomputing Centre) Abstract Abstract System integrity and availability are essential for real-time scientific computing in mission-critical environments. Human lives rely on decisions derived from results provided by Cray supercomputers. The tools used for science must be reliable and must produce the same results every time, on demand and without fail, or the results will not be trustworthy or worthwhile. In this paper, we will describe the engineering challenges of providing a reliable and highly available system to the Swiss weather service using Cray solutions, and we will relate recent real-life experiences that led to specific design choices. Paper Technical Session 18B Chair: Jenett Tillotson (Indiana University) High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6 Jim Brandt (Sandia National Laboratories), Tom Tucker (Open Grid Computing), Ann Gentile (Sandia National Laboratories), David Thompson (Kitware Inc.) and Victor Kuhns and Jason Repik (Cray Inc.) Abstract Abstract A common problem experienced by users of large-scale High Performance Computing (HPC) systems, including the Cray XE6, is the inability to gain insight into their computational environments. Our Lightweight Distributed Metric Service (LDMS) is intended to run as a continuous system service providing low-overhead remote collection of, and on-node access to, high-fidelity data. It is capable of handling hundreds of data values per node per second, vastly exceeding the data collection sizes and rates typically handled by current HPC monitoring services, while still maintaining much lower overhead. We present a case study of using LDMS on the Cray XE6 platform Cielo to enable remote storage of system resource data for post-run analysis and node-local access to data for run-time in situ analysis and workload rebalancing. We also present information from a deployment on an XK6 system at Sandia, where we leverage RDMA over the Gemini transport to further reduce LDMS overhead. Production I/O Characterization on the Cray XE6 Philip Carns (Argonne National Laboratory), Yushu Yao (Lawrence Berkeley National Laboratory), Kevin Harms, Robert Latham and Robert Ross (Argonne National Laboratory) and Katie Antypas (Lawrence Berkeley National Laboratory) Abstract Abstract I/O performance is an increasingly important factor in the productivity and efficiency of large-scale HPC systems such as Hopper, a 153,216-core Cray XE6 system operated by the National Energy Research Scientific Computing Center (NERSC). The scientific workload diversity of such systems presents a challenge for I/O performance tuning, however. Applications vary in terms of data volume, I/O strategy, and access method, making it difficult to consistently evaluate and enhance their I/O performance. Improvement of TOMCAT-GLOMAP File Access with User Defined MPI Datatypes Mark Richardson (Numerical Algorithms Group) and Martyn Chipperfield (University of Leeds) Abstract Abstract This paper describes the modification of the file access patterns that occur throughout the simulation runs. The analysis identified several subroutines where the cost lay not in accessing the data but in processing the data either before writing it or after reading it. 
The main gains in the project have come from a change in the practice of overloading MPI task zero. The per-iteration overhead of the writing step has been reduced from 8 seconds to 0.01 s for a small case and from 38 seconds to 0.05 s for a larger case. Paper Technical Session 18C Chair: Liam O. Forbes (Arctic Region Supercomputing Center, UAF) Cray’s Cluster Supercomputer Architecture John Lee, Susan Kraus and Maria McLaughlin (Cray Inc.) Abstract Abstract In the first half of this presentation, we will discuss Cray's cluster supercomputer architecture designs, built upon industry-standard, optimized modular server platforms. You will learn how platform selection is one of the key factors in today's datacenter decisions regarding configuration flexibility, scalability, and performance per watt with the latest processing technologies. You will also learn how industry-standard high-performance network connectivity, streamlined I/O, and diverse storage options can maximize system performance at a lower cost of ownership. We will share examples of different high-performance network topologies, such as fat tree or 3D torus (InfiniBand) with single- or dual-rail configurations, that meet a variety of HPC workload requirements. In the second half of this presentation, we will discuss the essential cluster software and management tools required to build and support a cluster architecture, combined with key compatibility features of the Advanced Cluster Engine™ (ACE) management software. Performance Metrics and Application Experiences on a Cray CS300-AC™ Cluster Supercomputer Equipped with Intel® Xeon Phi™ Coprocessors Vincent C. Betro, Robert P. Harkness, Bilel Hadri, Haihang You, Ryan C. Hulguin, R. Glenn Brook and Lonnie D. Crosby (National Institute for Computational Sciences) Abstract Abstract Given the growing popularity of accelerator-based supercomputing systems, it is beneficial for application software programmers to understand the underlying platform and its workings when writing or porting their codes to a new architecture. In this work, the authors highlight experiences and knowledge gained from porting such codes as ENZO, H3D, GYRO, a BGK Boltzmann solver, HOMME-CAM, PSC, AWP-ODC, TRANSIMS, and ASCAPE to the Intel Xeon Phi architecture running on a Cray CS300-AC™ Cluster Supercomputer named Beacon. Beacon achieved 2.449 GFLOP/W in High Performance LINPACK (HPL) testing and a number-one ranking on the November 2012 Green500 list. The areas of optimization that yielded the most performance gain are highlighted, and a set of metrics for comparison and lessons learned by the team at the National Institute for Computational Sciences Application Acceleration Center of Excellence is presented, with the intention of giving new developers a head start in porting as well as a baseline for comparing their own code's exploitation of fine- and medium-grained parallelism. Paper Technical Session 19A Chair: Tina Butler (National Energy Research Scientific Computing Center) Effect of Rank Placement on Cray XC30 Communication Cost Reuben D. Budiardja, Lonnie D. Crosby and Haihang You (National Institute for Computational Sciences) Abstract Abstract The newly released Cray XC30 supercomputer boasts the new Aries interconnect, which incorporates a Dragonfly network topology. This hierarchical network topology has obvious advantages with respect to local communication. 
However, as communication patterns extend further down the hierarchy and become more widely separated, the overall impact of particular bottlenecks and of trade-offs between bandwidth and latency becomes less apparent. In particular, applications may be more or less latency sensitive based on their communication patterns. The dynamic routing options, as a result, may affect some applications more severely than others. In this paper, we investigate the effect of process placement on the communication costs associated with typical communication patterns shared by many scientific applications. Observations concerning the communication performance of benchmarks and selected applications are presented and discussed. Evaluating Node Orderings For Improved Compactness Carl Albing (Cray Inc.) Abstract Abstract This paper demonstrates an evaluation technique that provides guidance for site-specific selection of the node ordering used in application placement. Reasonable performance of parallel applications has been achieved through application placement in Cray XT/XE/XK 3D-torus systems using allocation strategies based on an ordered, one-dimensional sequence of nodes. Node ordering is a computationally low-cost way to incorporate topological information into application placement decisions. With several orderings from which to choose - and others that could be created - what is the basis for choosing one ordering over another? Improving Task Placement for Applications with 2D, 3D, and 4D Virtual Cartesian Topologies on 3D Torus Networks with Service Nodes Robert A. Fiedler and Stephen Whalen (Cray Inc.) Abstract Abstract We describe two new methods for mapping applications with multidimensional virtual Cartesian process topologies onto 3D torus networks with randomly distributed service nodes. The first method, “Adaptive Layout”, works for any number of processes and distributes the MILC (lattice QCD, 4D topology) workload to ensure that communicating processes are close together on the torus. This scheme reduces the run time by 2.7X compared to default placement. The second method, “Topaware”, selects a prism of nodes slightly larger than the ideal prism one would select if there were no service nodes. The application's processes are ordered to group neighboring processes on the same node and to place groups of neighbors onto nodes that are no more than a few hops apart. Run-time reductions of up to 40% are obtained for 2D and 3D virtual topologies. In dedicated mode, using Topaware with MILC reduces the run time by 3.7X compared to default placement. Paper Technical Session 19B Chair: Zhengji Zhao (Lawrence Berkeley National Laboratory) The State of the Chapel Union Bradford L. Chamberlain, Sung-Eun Choi, Martha B. Dumler, Thomas Hildebrandt, David Iten, Vassily Litvinov and Greg Titus (Cray Inc.) Abstract Abstract Chapel is an emerging parallel programming language that originated under the DARPA High Productivity Computing Systems (HPCS) program. Although the HPCS program is now complete, the Chapel language and project remain very much alive and well. Under the HPCS program, Chapel generated sufficient interest among HPC user communities to warrant continuing its evolution and development over the next several years. In this paper, we reflect on the progress made with Chapel under the auspices of the HPCS program, noting key decisions made during the project's history. We also summarize the current state of Chapel for programmers who are interested in using it today. 
Finally, we describe current and ongoing work to evolve Chapel from a prototype to a production-grade language and to make it better suited for execution on next-generation systems. Recent enhancements to the Automatic Library Tracking Database infrastructure at the Swiss National Supercomputing Centre Timothy W. Robinson and Neil Stringfellow (Swiss National Supercomputing Centre) Abstract Abstract The Automatic Library Tracking Database (ALTD), an infrastructure developed previously by staff at the National Institute for Computational Sciences (NICS), is in production today on Cray XT, XE, XK, and XC30 systems at several Cray sites, including NICS, Oak Ridge National Laboratory, the National Energy Research Scientific Computing Center, and the Swiss National Supercomputing Centre (CSCS). ALTD automatically and transparently stores information about applications running on Cray systems and records which libraries are linked into those applications. From these data, support staff at HPC centres can derive a wealth of information about software usage, such as the use or non-use of particular compiler suites or the uptake of numerical libraries and third-party applications, right down to the level of specific version numbers. The tool works by intercepting the GNU linker to gather information on compilers and libraries, and by intercepting the job launcher to track the execution of applications at launch time. We have recently extended the ALTD framework deployed at CSCS to record more detailed information on the individual jobs executed on our machines. The job information recorded by the previous incarnation of ALTD was limited to user name, executable, (batch) job id, and run date; we have extended the tool to record many additional job characteristics, such as begin and end times, requested versus used core counts, number of processing elements and threads per process, and mode of linking (e.g. static or dynamic). In combination with custom post-processing scripts, which map executables to software codes, research domains, or research groups, our ALTD implementation now delivers a far more complete picture of system usage, providing not only a list of running applications but also information on the way these applications are being run. On a practical level, such information can be used, for example, to guide future hardware and software procurements, or to assess whether researchers are using our systems in the manner for which they were granted resource allocations. Comparing Compiler and Library Performance in Material Science Applications on Edison Jack Deslippe and Zhengji Zhao (National Energy Research Scientific Computing Center) Abstract Abstract Materials science and chemistry applications are expected to represent approximately one-third of the computational workload on NERSC's Cray XC30 system, Edison. The performance of these applications can often depend sensitively on the compiler and compiler options used at build time. For this reason, the NERSC user services group supplies users with optimized builds of the most commonly used materials science applications in order to ensure that these cycles are used as efficiently as possible. In this paper, we compare the performance of various materials science and chemistry applications when built with the Cray, Intel, and GNU compiler suites under various compiler options, as well as when linked against the MKL, LibSci, and FFTW libraries. 
We compare the optimal compilers and libraries on Edison with those previously obtained on the NERSC Cray XE6 machine, Hopper. Paper Technical Session 19C Chair: John Noe (Sandia National Laboratories) A Single Pane of Glass: Bright Cluster Manager for Cray Matthijs van Leeuwen, Mark Blessing and David Maples (Bright Computing) Abstract Abstract Bright Cluster Manager provides comprehensive cluster management for Cray systems in one integrated solution: deployment, provisioning, scheduling, monitoring, and management. Its intuitive GUI provides complete system visibility and ease of use for multiple systems and clusters simultaneously, including automated tasks and intervention. Bright also provides a powerful management shell for those who prefer to manage via a command-line interface. Supporting Multiple Workloads, Batch Systems, and Computing Environments on a Single Linux Cluster Larry Pezzaglia (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Abstract Abstract A new Intel-based, InfiniBand-attached computing system at NERSC from Cray Cluster Solutions (formerly Appro) provides computational resources to transparently expand several existing NERSC production systems serving three different constituencies: a mixed serial/parallel mid-range workload; a serial, high-throughput, high-energy physics/nuclear physics workload; and a mixed serial/parallel genomics workload. Tools to Execute An Ensemble of Serial Jobs on a Cray Abhinav Thota, Scott Michael, Sen Xu, Thomas G. Doak and Robert Henschel (Indiana University) Abstract Abstract Traditionally, Cray supercomputers have been located at large supercomputing centers and were used to run highly parallel applications. The user base consisted mostly of researchers from the fields of physics, mathematics, astronomy, and chemistry. But in recent times, Cray supercomputers have become available to a wider range of users from a variety of disciplines. Examples include the Kraken machine at the National Institute for Computational Sciences (NICS), Hopper at the National Energy Research Scientific Computing Center (NERSC), and Big Red II at Indiana University. Predictably, as the diversity of end users has grown, the workload has expanded to include a variety of workflows containing serial and hybrid applications, as well as complex workflows involving pilot jobs. Projects that employ a massive number of serial jobs, run in an embarrassingly data-parallel manner, have not traditionally been targeted to run on Cray supercomputers. To accomplish such projects, it is usually necessary to bundle a large number of serial jobs into a much larger parallel job via a pilot-job framework, an MPI wrapper, or custom scripting. In this article, we explore several of the current offerings for bundling serial jobs on a Cray supercomputer and discuss some of the benefits and shortcomings of each approach. The approaches we evaluate include BigJob, PCP, and native aprun with scripts. Tutorial Tutorial 1A Programming Accelerators using OpenACC in the Cray Compilation Environment James C. Beyer (Cray Inc.) Abstract Abstract This tutorial will introduce the novice accelerator programmer to the OpenACC Application Programming Interface (API) as well as provide the more advanced programmer with ideas for extracting even more performance. The tutorial will start with an introduction to the OpenACC 2.0 specification. 
The specification will be presented in a user-centric manner intended to teach the novice user how to port code to heterogeneous systems such as the XE6 and XK7. The significance of the execution and memory models will be presented first. Once the groundwork has been laid, the parallel and kernels constructs will be introduced, along with how they are inserted into the code (a brief illustrative sketch of these constructs appears at the end of this listing). Examples will be used to introduce the rest of the API in situ. Special attention will be given to the new features in the 2.0 specification, covering both their benefits and their pitfalls. (Assuming all of these features make it into the specification, the following, and possibly more, will be covered.) The concept of unstructured data lifetimes will be discussed and use cases presented. The highly anticipated separate-compilation-unit ("call") support feature will be explained. The interaction between call support and nested parallelism will be explored, given its impact on the call support feature. Once the API has been covered, hints and tricks for using both the API itself and the Cray Compilation Environment (CCE) will be presented. Tutorial Tutorial 1B System Administration for Cray XE and XK Systems Richard Slick (Cray Inc.) Abstract Abstract The Cray Linux Environment requires tasks and processes beyond what is required for managing basic Linux systems. This short seminar covers some system administration basics, as well as a collection of tools and procedures to enhance monitoring, logging, and efficient command usage. The talk will include new capabilities in logging, Node Health Check, and ALPS. New features in recent releases will also be discussed. The session is geared towards new system administrators as well as those with more experience. Tutorial Tutorial 1C Lustre Troubleshooting and Tuning Brett Lee (Intel Corporation) Abstract Abstract Lustre is an open-source, parallel file system that has earned a reputation in the High Performance Computing (HPC) community for its speed and scalability. Lustre, however, has also earned a reputation for being mysterious and thus hard to administer. The purpose of this talk is to pull back the curtain on some of that mystery and provide the fundamental knowledge necessary to administer, troubleshoot, and tune a Lustre file system. Tutorial Tutorial 2A Refactoring Applications for the XK7 John Levesque (Cray Inc.) and Jeff Larkin (NVIDIA) Abstract Abstract This tutorial will cover the process of porting an all-MPI application to the XK7. Numerous paths will be explored, including OpenACC, CUDA Fortran, and CUDA. Examples during the tutorial will be drawn from the applications that were developed for Titan over the past year. In porting an application, one must first generate a good hybrid version that uses OpenMP on the node and MPI between the nodes. The process of developing the hybrid code frequently ends up improving the overall performance of the application even before using the accelerator. In developing the hybrid version, significant code modifications may be necessary to restructure the application to exhibit high-level parallelism, keeping in mind that the accelerator will need large kernels of computation in order to achieve the best performance. 
OpenACC has progressed to a viable programming model that allows the application developer to generate a performance-portable application that runs well on many-core systems, including current XK7 and future XC30 systems with Intel MIC or Nvidia accelerators. This past year the number of OpenACC applications has grown to the point where an excellent foundation of techniques can be presented. A wide variety of applications will be presented in the process of explaining the techniques used to develop efficient hybrid applications. Larkin and Levesque are currently writing a book that will contain the examples given in the tutorial. Tutorial Tutorial 2B Configuration and Administration of Cray External Services Systems Jeff Keopp and Harold Longley (Cray Inc.) Abstract Abstract Cray External Services systems expand the functionality of Cray XE/XK and Cray XC systems by providing more powerful external login (esLogin) nodes and an external Lustre file system (esFS). A management server (esMS) provides administration and monitoring functions as well as node provisioning and automated Lustre failover for the external Lustre file system. The esMS is available in a single-server or high-availability configuration. A great advantage of these systems is that the external Lustre file system remains available to the external login nodes regardless of the state of the Cray XE/XK or Cray XC system. Tutorial Tutorial 2C Debugging Heterogeneous HPC Applications with TotalView Chris Gottbrath (Rogue Wave Software) Abstract Abstract The new Cray XC series gives users the option of using either accelerators or coprocessors. Regardless of which path is chosen, truly utilizing the full power of Cray systems hosting accelerators and coprocessors, such as NVIDIA® Kepler/Fermi or Intel® Xeon® Phi™, means leveraging several different levels of parallelism. In addition, developers need to juggle a variety of technologies, from MPI and OpenMP to CUDA™, OpenACC, and Intel Language Extensions for Offloading (LEO) on Intel Xeon Phi coprocessors. While troubleshooting and debugging applications are a natural part of any development or porting process, these efforts become even more critical when working with multiple levels of parallelism and a mix of technologies.
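As a companion to the Tutorial 1A abstract above, the following is a minimal, hedged C sketch of the two OpenACC offload constructs the tutorial introduces: parallel, where the programmer asserts that a loop is safe to parallelize, and kernels, where the compiler analyzes the region and decides what to offload, both wrapped in a structured data region. The routine and variable names are illustrative only and are not drawn from the tutorial material.

    /* Minimal OpenACC sketch (C): parallel vs. kernels inside a data region.
     * Names (scale_and_sum, x, y, a) are hypothetical, for illustration only. */
    #include <stdlib.h>

    void scale_and_sum(int n, const double *restrict x, double *restrict y, double a)
    {
        /* Copy x to the device; copy y back to the host when the region ends. */
        #pragma acc data copyin(x[0:n]) copyout(y[0:n])
        {
            /* parallel: the programmer asserts the loop iterations are independent. */
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i];

            /* kernels: the compiler analyzes the region and decides how to offload it. */
            #pragma acc kernels
            for (int i = 0; i < n; ++i)
                y[i] += x[i];
        }
    }

With a compiler that supports OpenACC (for example CCE, as discussed in the tutorial), the directives are honored; with other compilers the pragmas are ignored and the routine still runs correctly on the host, which is one reason the directive-based approach is attractive for incremental porting.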