CUG 2013 Proceedings | Created 2013-08-06 |
Sunday, May 5th | Monday, May 6th 8:30am-12pm Tutorial 1A Zinfandel / Cabernet Programming Accelerators using OpenACC in the Cray Compilation Environment James C. Beyer (Cray Inc.) This tutorial will introduce the novice accelerator programmer to the OpenACC Application Programming Interface (API) as well as provide the more advanced programmer with ideas for extracting even more performance. The tutorial will start with an introduction to the OpenACC 2.0 specification. The specification will be presented in a user-centric manner intended to teach the novice user how to port code to heterogeneous systems such as the XE6 and XK7. The significance of the execution and memory models will be presented first. Once the groundwork has been laid, the parallel and kernels constructs will be introduced along with how they are inserted into the code. Examples will be used to introduce the rest of the API in situ. Special attention will be given to the new features in the 2.0 specification, covering the benefits and pitfalls. (Assuming all of these features make it into the specification, the following, and possibly more, will be covered.) The concept of unstructured data lifetimes will be discussed and use cases presented. The highly anticipated separate compilation unit ("call") support feature will be explained. The interaction between call support and nested parallelism will be explored, given its impact on the call support feature. Once the API has been covered, hints and tricks for using both the API itself and the Cray Compilation Environment (CCE) will be presented. Tutorial Tutorial 1B Merlot / Syrah System Administration for Cray XE and XK Systems Richard Slick (Cray Inc.) The Cray Linux Environment requires tasks and processes beyond what is required for managing basic Linux systems. This short seminar covers some system administration basics, as well as a collection of tools and procedures to enhance monitoring and logging and efficient command usage. The talk will include new capabilities in logging, Node Health Check, and ALPS. New features in recent releases will also be discussed. The session is geared towards new system administrators, as well as those with more experience. Tutorial Tutorial 1C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Lustre Troubleshooting and Tuning Brett Lee (Intel Corporation) Lustre is an open source, parallel file system that has earned a reputation in the High Performance Computing (HPC) community for its speed and scalability. Lustre, however, has also earned a reputation for being mysterious and thus hard to administer. The purpose of this talk is to pull back the curtain on some of the mystery and provide the fundamental knowledge necessary to administer, troubleshoot and tune a Lustre file system. The topics to be presented in this talk include: 1) Lustre functionality, a significant hurdle in learning to troubleshoot problems, 2) monitoring Lustre, a necessary skill for detecting problems, 3) some of the most commonly seen problems, 4) benchmarking Lustre performance, 5) configuring Lustre components for performance, and 6) tunable parameters available for Lustre. Tutorial 1pm-4:30pm Tutorial 2A Zinfandel / Cabernet Refactoring Applications for the XK7 John Levesque (Cray Inc.) and Jeff Larkin (NVIDIA) This tutorial will cover the process of porting an all-MPI application to the XK7. Numerous paths will be explored, including OpenACC, CUDA Fortran, and CUDA.
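For readers unfamiliar with the directive-based approach featured in these tutorials, the short sketch below (generic C written for this overview, not taken from the tutorial materials) illustrates what an OpenACC port of a simple loop looks like: the directive asks an OpenACC compiler such as CCE to build an accelerator kernel, and the data clauses describe the host-device transfers.

    #include <stdlib.h>

    /* Illustrative only: a simple vector update offloaded with OpenACC.
     * The parallel loop construct generates an accelerator kernel; the
     * copyin/copy clauses describe data movement between host and device. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

On hardware without an accelerator the same code still compiles as ordinary C, which is part of the performance-portability argument made in the refactoring tutorial below.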
Examples during the tutorial will be drawn from the applications that were developed for Titan over the past year. In the process of porting the application, one must first generate a good hybrid version of the application that uses OpenMP on the node and MPI between the nodes. The process of developing the hybrid code frequently ends up improving the overall performance of the application even before using the accelerator. In the process of developing the hybrid version of the application, significant code modifications may be necessary to restructure the application to exhibit high-level parallelism, keeping in mind that the accelerator will need large kernels of computation in order to achieve the best performance. OpenACC has progressed to a viable programming model that will allow the application developer to generate a performance-portable application that will run well on many-core systems including current XK7 and future XC30 systems with Intel MIC or NVIDIA accelerators. This past year the number of OpenACC applications has grown to a point where an excellent foundation of techniques can be given. A wide variety of applications will be presented in the process of explaining the techniques used to develop efficient hybrid applications. Larkin and Levesque are currently writing a book which will contain the examples given in the tutorial. Tutorial Tutorial 2B Merlot / Syrah Configuration and Administration of Cray External Services Systems Jeff Keopp and Harold Longley (Cray Inc.) Cray External Services systems expand the functionality of the Cray XE/XK and Cray XC systems by providing more powerful external login (esLogin) nodes and an external Lustre file system (esFS). A management server (esMS) provides administration and monitoring functions as well as node provisioning and automated Lustre failover for the external Lustre file system. The esMS is available in a single-server or high-availability configuration. A great advantage of these systems is that the external Lustre file system remains available to the external login nodes regardless of the state of the Cray XE/XK or Cray XC system. Configuration and administration of Cray External Services Systems (esMS, esLogin and esFS) will be covered in a tutorial by Cray technical personnel. Topics will include esMS failover, Lustre failover, image management, node provisioning, secure configuration, system monitoring and troubleshooting. Tutorial Tutorial 2C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Debugging Heterogeneous HPC Applications with TotalView Chris Gottbrath (Rogue Wave Software) The new Cray XC series gives users the option of using either accelerators or coprocessors. Regardless of which path is chosen, truly utilizing the full power of Cray systems hosting accelerators and coprocessors, like NVIDIA® Kepler/Fermi or Intel® Xeon® Phi™, means leveraging several different levels of parallelism. In addition, developers need to juggle a variety of different technologies, from MPI and OpenMP to CUDA™, OpenACC, or Intel Language Extensions for Offloading (LEO) on Intel Xeon Phi coprocessors. While troubleshooting and debugging applications are a natural part of any development or porting process, these efforts become even more critical when working with multiple levels of parallelism and a variety of technologies.
This tutorial provides an introduction to key parallel debugging techniques, including MPI and subset debugging, process and thread sets, reverse debugging, comparative debugging, and techniques for CUDA, OpenACC, and Intel Xeon Phi coprocessor debugging. Tutorial 4:45pm-5:45pm Interactive 3A Zinfandel / Cabernet Colin McMurtrie System Support SIG Colin McMurtrie (Swiss National Supercomputing Centre) This is a meeting of the Systems Support Special Interest Group. Birds of a Feather Interactive 3B Merlot / Syrah Helen He Programming Environments, Applications and Documentation SIG Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) This is an interactive session to discuss topics within the Programming Environments, Applications and Documentation Special Interest Group. Birds of a Feather Interactive 3C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Birds of a Feather | Tuesday, May 7th 8:30am-10am General Session 4 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) Nick Cardo CUG Welcome Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group 2013 Welcome Why we need Exascale, and why we won't get there by 2020 Horst D. Simon (Lawrence Berkeley National Laboratory) It may come as a surprise to many who are currently deeply engaged in research and development activities that could lead us to exascale computing that it has already been exactly six years since the first set of community town hall meetings was convened in the U.S. to discuss the challenges for the next level of computing in science. It was in April and May 2007 that three meetings were held in Berkeley, Argonne and Oak Ridge, forming the basis for the first comprehensive look at exascale [1]. What is even more surprising is that in spite of numerous national and international initiatives that have been created in the last five years, the community has not made any significant progress towards reaching the goal of an Exaflops system. If one reflects and looks back at early projections, for example in 2010, it seemed to be possible to build at least a prototype of an exascale computer by 2020. This view was expressed in documents such as [2], [3]. I believe that the lack of progress in the intervening years has made it all but impossible to see a working exaflops system by 2020. Specifically, I do not expect a working Exaflops system to appear on the #1 spot of the TOP500 list with an Rmax performance exceeding 1 Exaflop/s by November 2019. In this talk I will explain why this is a regrettable lack of progress and what the major barriers are. References 1. Simon, H., Zacharia, T., and Stevens, R.: Modeling and Simulation at the Exascale for Energy and Environment, Berkeley, Oak Ridge, Argonne (2007), http://science.energy.gov/ascr/news-and-resources/program-documents/ 2. Stevens, R. and White, A.: Crosscutting Technologies for Computing at Exaflops, San Diego (2009), http://science.energy.gov/ascr/news-and-resources/workshops-and-conferences/grand-challenges/ 3. Shalf, J., Dosanjh, S., Morrison, J.: Exascale Computing Technology Challenges. VECPAR 2010: 1-25. Invited Talk 10:30am-12pm General Session 5 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Hancock Cray Corporate Update Peter Ungaro (Cray Inc.) Cray Corporate Update Cray in Supercomputing Peg Williams (Cray Inc.)
Cray in Supercomputing Invited Talk 1pm-2:30pm Technical Session 6A Zinfandel / Cabernet Tina Butler Cray System Software Road Map Image Management and Provisioning System Overview John Hesterberg (Cray Inc.) This paper provides an overview of the new Image Management and Provisioning System (IMPS) under development at Cray. IMPS is a new set of features that changes how software is installed, managed, provisioned, booted, and configured on Cray systems. It focuses on adopting common industry tools and procedures where possible, combined with scalable Cray technology, to produce an enhanced solution ultimately capable of effectively supporting all Cray systems, from the smallest to the largest. pdf, pdf Paper Technical Session 6B Merlot / Syrah Jason Hill Instrumenting IOR to Diagnose Performance Issues on Lustre File Systems Doug J. Petesch and Mark S. Swan (Cray Inc.) Large Lustre file systems are made of thousands of individual components, all of which have to perform nominally to deliver the designed I/O bandwidth. When the measured performance of a file system does not meet expectations, it is important to identify the slow pieces of such a complex infrastructure quickly. This paper will describe how Cray has instrumented IOR (a popular I/O benchmark program) to automatically generate pictures that show the relative performance of the many OSTs, servers, LNET routers and other components involved. The plots have been used to diagnose many unique problems with Lustre installations at Cray customer sites. pdf, pdf Taking Advantage of Multicore for the Lustre Gemini LND Driver James A. Simmons (Oak Ridge National Laboratory) and John Lewis (Cray Inc.) High performance computing systems have long embraced the move to multi-core processors, but parts of the operating system stack have only recently been optimized for this scenario. Lustre improved its performance on high core-count systems by keeping related work on a common set of cores, though low-level network drivers must be adapted to the new API. The multi-threaded Lustre network driver (LND) for the Cray Gemini high-speed network improved performance over its single-threaded implementation, but did not employ the benefits of the new API. In this paper, we describe the advantages of the new API and the performance gains achieved by modifying the Gemini LND to use it. pdf, pdf A file system utilization metric for I/O characterization Andrew Uselton and Nicholas Wright (Lawrence Berkeley National Laboratory) Today, an HPC platform's "scratch" file system typically represents 10-20% of its cost. However, disk performance is not keeping up with gains in processors, so keeping the same relative I/O performance will require an increasingly large fraction of the budget. It is therefore important to understand the I/O workload of HPC platforms in order to provision the file system correctly. Although it is relatively straightforward to measure the peak bandwidth of a file system, this accounts for only part of the overall load: the size of individual I/O transactions strongly affects performance. In this work we introduce a new metric for file system utilization that accounts for such effects and provides a better view of the overall load on the file system. We present a description of our model, our work to calibrate it, and early results from the file systems at NERSC.
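As a concrete illustration of why transaction size matters for such a metric, the toy model below (plain C with made-up numbers; it is not the model defined in the paper above) charges small transfers a larger share of the file system's peak bandwidth than their byte count alone would suggest.

    #include <stdio.h>

    /* Hypothetical efficiency curve: fraction of peak bandwidth achievable
     * at a given transfer size; the 1 MiB "knee" is an assumption made for
     * this sketch, not a measured value. */
    static double efficiency(double xfer_bytes)
    {
        const double knee = 1024.0 * 1024.0;
        return xfer_bytes / (xfer_bytes + knee);   /* simple saturating curve */
    }

    /* Toy utilization: the fraction of peak the observed traffic effectively
     * consumes once transfer-size overheads are accounted for. */
    static double utilization(double observed_MBps, double peak_MBps, double xfer_bytes)
    {
        return (observed_MBps / efficiency(xfer_bytes)) / peak_MBps;
    }

    int main(void)
    {
        /* 1 GB/s of 64 KiB transfers "occupies" far more of a 35 GB/s file
         * system than 1 GB/s of 8 MiB transfers does. */
        printf("64 KiB transfers: %.2f\n", utilization(1000, 35000, 64.0 * 1024));
        printf("8 MiB transfers:  %.2f\n", utilization(1000, 35000, 8.0 * 1024 * 1024));
        return 0;
    }

Under this toy model the same 1 GB/s of traffic corresponds to roughly 50% utilization when issued as 64 KiB requests but only a few percent when issued as 8 MiB requests, the kind of effect a peak-bandwidth number alone cannot capture.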
pdf, pdf Paper Technical Session 6C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Craig Stewart The Cray Programming Environment: Current Status and Future Directions Luiz DeRose (Cray Inc.) The scale of current and future high-end systems, as well as the increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to achieve high performance on petascale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high-end HPC systems. Users must be supported by intelligent compilers and runtime systems, automatic performance analysis tools, adaptive libraries, and debugging and porting tools. Moreover, this programming environment must be capable of supporting millions of processing elements in a heterogeneous environment. In this talk, I will present the recent activities and future directions of the Cray Programming Environment, which are being developed and deployed according to Cray's adaptive supercomputing strategy to improve users' productivity on Cray supercomputers. Enhancements to the Cray Performance Measurements and Analysis Tools Heidi Poxon (Cray Inc.) The Cray Performance Measurement and Analysis Tools offer performance measurement and analysis feedback for applications running on Cray multi-core and hybrid computing systems. As with any tool, using the Cray performance analysis toolset involves a learning curve. Recent work focuses on a new interface to obtain basic application performance information for users not familiar with the Cray performance tools. CrayPat-lite has been developed to provide performance statistics at the end of a job by simply loading a modulefile. After a program completes execution, output such as job size, wallclock time, MFLOPS, top time-consuming routines, etc. is automatically presented through stdout. Modifications to the "classic" performance tools interface have also been made to unify the two paths so that users who start with CrayPat-lite can easily transition to using CrayPat. This paper presents the CrayPat-lite enhancement to the toolset. Cray Compiling Environment Update Suzanne LaCroix and James Beyer (Cray Inc.) The Cray Compiling Environment (CCE) has evolved over the last several years to support high performance computing needs on Cray systems. New system architectures, new language standards, and ever-increasing performance and scaling requirements have driven this change. This talk will present an overview of current CCE capabilities and recently added features. Future plans and challenges will also be discussed. Paper 3pm-5pm Technical Session 7A Zinfandel / Cabernet Jeff Broughton New Member Talk: iVEC and the Pawsey Centre Charles Schwartz (iVEC) The Pawsey Centre is a supercomputing facility being built in Kensington, Western Australia, to be operated by iVEC, an unincorporated joint venture of four public universities and CSIRO. It is a research facility, specialising in radio-astronomy and geosciences, but available to the larger Australian academic research community as well.
In this talk, we will introduce ourselves as a new CUG member and present: the iVEC mission (who, what, why, where); an overview of the Pawsey Centre in Kensington, WA, Australia; a high-level description of our Cray system; the Cray system deployment schedule, progress, configuration and plans; and issues that have come up (or, depending on schedule, that we expect to arise). The Evolution of Cray Management Services Tara Fly, Alan Mutschelknaus, Andrew Barry and John Navitsky (Cray Inc.) Cray Management Services is quickly evolving to address the changing nature of Cray systems. NodeKARE adds advanced features to support gang scheduling, reservation- and application-level health checking, as well as other serviceability features. Lightweight Log Manager provides more complete and standardized log collection. Modular xtdumpsys will provide an extensible framework for system dumping. Resource utilization reporting provides a scalable, extensible framework for data collection, including power management, GPU utilization, and application resource utilization data. This paper presents these new features, including configuration, migration, and benefits. pdf, pdf CRAY XC30 Installation – A System Level Overview Nicola Bianchi, Colin McMurtrie and Sadaf Alam (Swiss National Supercomputing Centre) In this paper we detail the installation of the 12-cabinet Cray XC30 system at the Swiss National Supercomputing Centre (CSCS). At the time of writing, this is the largest such system worldwide, and hence the system-level challenges of this latest-generation Cray platform will be of interest to other sites. The intent is to present a systems and facilities point of view regarding the Cray XC30 installation and operational setup, and to identify key differences between the Cray XC30 and previous-generation Cray systems such as the Cray XE6. We identify key system configuration options and challenges when integrating the entire machine ecosystem into a complex operational environment: Sonexion 1600 Lustre storage appliance management and tuning, Lustre fine-grained routing, esLogin cluster installation and management using Bright Cluster Manager, IBM GPFS integration, Slurm installation, facility management and network considerations. pdf, pdf Cray External Services Systems Overview Harold Longley and Jeff Keopp (Cray Inc.) Cray External Services systems expand the functionality of the Cray XE/XK and Cray XC systems by providing more powerful external login (esLogin) nodes and an external Lustre file system (esFS). A management server (esMS) provides administration and monitoring functions as well as node provisioning and automated Lustre failover for the external file system. The esMS is available in a single-server or high-availability configuration. A great advantage of these systems is that the external Lustre file system remains available to the external login nodes regardless of the state of the Cray XE/XK or Cray XC system. External login nodes are the standard login nodes on Cray XC systems. This discussion will provide an overview of the Cray External Services system components, installation process and security update process. pdf, pdf Paper Technical Session 7B Merlot / Syrah Andrew Uselton Architecting Resilient Lustre Storage Solution John Fragalla (Xyratex) The concept of scratch HPC storage is quickly becoming less important than high availability (HA) and reliability.
In this presentation, Xyratex discusses architecting a resilient and reliable Lustre storage solution to increase availability and eliminate downtime within HPC environments for continual data access. Xyratex will discuss how solutions based on ClusterStor Technologies address the architectural challenges of HA and reliability without sacrificing performance, protecting against hardware faults, power failures, data loss, and potential software issues through tight integration, test processes, and an integrated Lustre storage platform. The presentation will also cover Xyratex's extensive multi-stage disk drive testing to reduce disk failures and decrease annual failure rates (AFR), the benefits of applying live software patches, updates, and revisions using failover and failback procedures, and the overall Xyratex ClusterStor-based solution, which leverages these concepts in its design. BlueWaters I/O Performance Mark S. Swan and Doug Petesch (Cray Inc.) The BlueWaters system, installed at NCSA, is a landmark achievement not only in computational capability but also in I/O capacity and performance. This paper will describe the I/O infrastructure of BlueWaters, achievements, challenges, and lessons learned. pdf, pdf Sonexion 1600 I/O Performance Nicholas P. Cardo (National Energy Research Scientific Computing Center) The Sonexion 1600 is the latest in Cray's storage products. An investigative look into the I/O performance of the new devices yields insights into the expected performance. Various I/O scenarios are explored by varying the number of readers and writers to files along with differing I/O patterns. These tests explore the performance characteristics of individual OSTs as well as the aggregate for the file system. Metadata performance is also investigated for creates, unlinks and stats. In both cases, metadata and data, the investigation attempts to identify the sustained and peak performance of the Sonexion 1600. The results can then be used to design a file system on the Sonexion 1600 to achieve desired I/O performance. OLCF's 1 TB/s, next-generation Spider file system David Dillow, Sarp Oral, Douglas Fuller, Jason Hill, Dustin Leverman, Sudharshan Vazhkudai, Feiyi Wang, Kim Youngjae, James H. Rogers, James Simmons and Ross G. Miller (Oak Ridge National Laboratory) The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) has a long history of deploying the world's fastest supercomputers to enable open science. At the time it was deployed in 2008, the Spider file system had a formatted capacity of 10 PB and sustained transfer speeds of 240 GB/s, which made it the fastest Lustre file system in the world. However, the addition of Titan, a 27 PFLOPS Cray XK7 system, along with other OLCF computational resources, has radically increased the I/O demand beyond the capabilities of the existing Spider parallel file system. The next-generation Spider Lustre file system is designed to provide 32 PB of capacity to open science users at OLCF, at an aggregate transfer rate of 1 TB/s. This paper details the architecture, design choices, and configuration of the next-generation Spider file system at OLCF. pdf, pdf Paper Technical Session 7C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Sadaf R. Alam Optimizing GPU to GPU Communication on Cray XK7 Jeff M.
Larkin (NVIDIA) When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must consider the effect of the GPU's distinct address space and the PCIe bus on application scalability. Without such considerations, applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper will demonstrate methods for optimizing GPU-to-GPU communication and present XK7 results for these methods. pdf, pdf Debugging and Optimizing Programs Accelerated with Intel® Xeon® Phi™ Coprocessors Chris Gottbrath (Rogue Wave Software) Intel® Xeon® Phi™ coprocessors present an exciting opportunity for Cray users to take advantage of many-core processor technology. Since the Intel Xeon Phi coprocessor shares many architectural features and much of the development tool chain with multi-core Intel Xeon processors, it is generally fairly easy to get a program running on the Intel Xeon Phi coprocessor. However, taking full advantage of the Intel Xeon Phi coprocessor requires expressing a level of parallelism that may require significant re-thinking of algorithms. Scientists need tools that allow them to debug and optimize hybrid MPI/OpenMP parallel applications that may have dozens or even hundreds of threads per node. This talk will discuss how recent improvements to TotalView® and ThreadSpotter™ are setting the stage so that Cray users will be able to adopt the Intel Xeon Phi coprocessor with confidence. pdf, pdf Portable and Productive Performance on Hybrid System with OpenACC Compilers and Tools Luiz DeRose (Cray Inc.) The current trend in the supercomputing industry is to provide hybrid systems with accelerators attached to multi-core processors. Some of the critical hurdles for the widespread adoption of accelerated computing in high performance computing are portability and programmability. In order to facilitate the migration to hybrid systems with accelerators attached to CPUs, users need a simple programming model that is portable across machine types. Moreover, to allow users to maintain a single code base, this programming model, and the required optimization techniques, should not be significantly different for "accelerated" nodes from the approaches used on current multi-core x86 processors. In this talk, I will present Cray's approach to accelerator programming, which is based on a high-level programming environment with tightly coupled OpenACC compilers, libraries, and tools that can interoperate and hide the complexity of the system. Ease of use is possible with the compiler making it feasible for users to write applications in Fortran, C, or C++ with OpenACC directives, and with tools to help users port, debug, and optimize for GPUs as well as conventional multi-core CPUs. In this programming environment, the compiler does the "heavy lifting" to split off the work destined for the accelerator and perform the necessary data transfers. In addition, it does optimizations to take advantage of the accelerator and the multi-core x86 hardware appropriately. A full debugger with integrated support for the CPU and the GPU is available with DDT from Allinea or TotalView from Rogue Wave Software. The Cray Performance Tools provide statistics for the whole application, which can be grouped by accelerator directive or mapped back to the high-level source by line number.
A single performance report can include statistics for both the host and the accelerator, including hardware performance counter information. The Cray Scientific Libraries use the Cray auto-tuning framework to select the best kernel for each task. With this scientific libraries interface, data copies and GPU or host execution placement are automatic. Finally, the Cray Programming Environment for accelerators supports experienced CUDA developers by providing interoperability of the compiler, performance tools, and debugger with existing CUDA codes. Tesla vs Xeon Phi vs Radeon: A Compiler Writer's Perspective Brent Leback, Douglas Miles and Michael Wolfe (The Portland Group) Today, most CPU+Accelerator systems incorporate NVIDIA GPUs. Intel Xeon Phi and the continued evolution of AMD Radeon GPUs make it likely we will soon see, and have to program, a wider variety of CPU+Accelerator systems. PGI already supports NVIDIA GPUs, and is working to add support for Xeon Phi and AMD Radeon. This talk explores the features common to all three types of accelerators, those unique to each, and the implications for programming models and performance portability from a compiler writer's and applications perspective. Paper 5:15pm-6pm Interactive 8A Zinfandel / Cabernet Nick Cardo Open discussion with CUG Board Nick Cardo (National Energy Research Scientific Computing Center) This interactive session is an open discussion with the CUG Board. Topics to be discussed include the CUG program, by-law changes, elections, and methods to broaden participation in CUG throughout the year. Birds of a Feather Interactive 8B Merlot / Syrah Duncan J. Poole OpenACC BOF Duncan Poole (NVIDIA) This BOF will discuss the status of OpenACC as an organization and as a specification. Topics of interest to CUG include the OpenACC 2.0 specification and member activities, including developing new products, benchmarks, example codes, and a profiling interface. Many OpenACC members will be present at CUG, and a lot of progress has been made, so this can be a lively interactive session. Birds of a Feather Interactive 8C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) John Hesterberg System Management Futures John Hesterberg (Cray Inc.) System Management futures. Discuss ideas about what comes next after the Installation, Image Management, Provisioning, and Configuration changes being planned at Cray. What is the right way to do system administration and management for Exascale? What are your best practices in system administration and management for large systems? Birds of a Feather | Wednesday, May 8th 8:30am-10am General Session 9 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) Nick Cardo CUG Business Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group business meeting and elections Big Bang, Big Data, Big Iron – Analyzing Data From The Planck Satellite Mission Julian Borrill (Lawrence Berkeley National Laboratory) On March 21st, 2013, the European Space Agency announced the first cosmology results from its billion-dollar Planck satellite mission. The culmination of 20 years of work, Planck's observations of the Cosmic Microwave Background – the faint echo of the Big Bang itself – provide profound insights into the foundations of cosmology and fundamental physics.
Planck has been making 10,000 observations of the CMB every second since mid-2009, providing a dataset of unprecedented richness and precision; however, the analysis of these data is an equally unprecedented computational challenge. For the last decade we have been developing the high performance computing tools needed to make this analysis tractable, and deploying them on Cray systems at supercomputing centers in the US and Europe. This first Planck data release required tens of millions of CPU-hours on the National Energy Research Scientific Computing Center's Hopper system. This included generating the largest Monte Carlo simulation set ever fielded in support of a CMB experiment, comprising 1,000 realizations of the mission reduced to 250,000 maps of the Planck sky. However, our work is far from done; future Planck data releases will require ten times as many simulations, and next-generation CMB experiments will gather up to a thousand times as much data as Planck. Invited Talk 10:30am-12pm General Session 10 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Hancock Introduction and CUG 2013 Best Paper Award David Hancock (Indiana University) Presentation of the CUG 2013 Best Paper award and introduction for the Intel Executive presentation. The Changing Face of High Performance Computing Rajeeb Hazra (Intel Corporation) The continuing growth of computer performance is delivering an unprecedented capability to solve increasingly complex problems. This growth in performance, along with the recent explosion of new devices, sensors, and social networks delivering real-time feeds over the web and into datacenters, is causing a flood of data and adding a new challenge for systems, software and applications development for organizations that are looking to convert this data into knowledge. For vendors catering to both HPC and big data, this trend of unprecedented data challenges is reinforcing the need for investment in high-end systems with high-performance storage, networks and applications if the potential of HPC is to continue to be realized. These systems, while addressing the needs of HPC, must also tailor the capability to address the requirements of new breeds of applications that emphasize rapid processing of unstructured and structured data that is being fired into corporate datacenters and research facilities at unprecedented rates. The industry has adopted three strategies to mitigate these challenges and increase the performance of systems and applications to address both the HPC and the big data space: parallel applications development; addition of accelerators to standard commodity compute nodes; and development of new purpose-built systems for the high end. In addition, at the high end of HPC, novel technologies are needed in the areas of memory subsystems, parallel system interconnects, and packaging. In this talk, Raj will discuss the current dynamics of the HPC market, how Intel is innovating to address the changing trends, and how the key acquisitions that Intel has made over the last year, along with our collaborations with key partners, will fit together to enable a complete and affordable solution for the entire HPC ecosystem.
Invited Talk 12pm-1pm Interactive 11A/B Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Henty HPC training and education David Henty (EPCC, The University of Edinburgh) Education and training activities have a crucial role in ensuring that the end-users of any HPC infrastructure are able to fully exploit the strengths of existing and future hardware and software resources. In this interactive session, we will discuss the status of HPC education and training activities around the globe, identify existing and potential challenges, and possibly find some solutions to them as well. The session will be prefaced by two short case studies, one about experiences in running an MSc programme in HPC at EPCC, the University of Edinburgh, and another about organizing training activities within the pan-European virtual research infrastructure for HPC, PRACE. The attendees are invited to contribute similar case studies. Birds of a Feather Interactive 11C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Jeff Keopp Cray External Services Systems Jeff Keopp (Cray Inc.) Customers of existing Cray External Services systems (esMS, esLogin and esFS) will have the opportunity to trade experiences, tips, and techniques and to provide feedback to Cray technical personnel in this "Birds of a Feather" session. Birds of a Feather 1pm-1:45pm General Session 12 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Hancock 1 on 100 or more Peter Ungaro (Cray Inc.) Open discussion with the Cray CEO. No other Cray employees or Cray partners are permitted during this session. Invited Talk 2pm-3:30pm Technical Session 13A Zinfandel / Cabernet Douglas W. Doerfler SeaStar Unchained: Multiplying the Performance of the Cray SeaStar Network David A. Dillow and Scott Atchley (Oak Ridge National Laboratory) The Cray SeaStar ASIC, with its programmable embedded processor, provides an excellent platform to investigate the properties of various network protocols and programming interfaces. This paper describes our native implementation of the Common Communication Interface (CCI) on the SeaStar platform, and details how we implemented full operating system (OS) bypass for common operations. We demonstrate a 30% to 50% reduction in latency, more than a six-fold increase in message injection rate, and an almost 7x improvement in bandwidth for small message sizes when compared to the generic Cray Portals implementation. pdf, pdf Intel Multicore, Manycore, and Fabric Integrated Parallel Computing Jim Jeffers (Intel Corporation) Dramatic increases in node-level parallelism are here with the introduction of many-core Intel® Xeon Phi™ coprocessors along with the continued generational core increases in multi-core Intel® Xeon® processors. Jim will discuss the impacts on software development for these platforms and the important considerations for scaling highly parallel applications both within the node and across clusters. He will also discuss Intel's current network fabric products and the future directions Intel is pursuing to address the next critical challenge: efficient internode communications for the next generation of HPC platforms. Understanding the Impact of Interconnect Failures on System Operation Matthew A. Ezell (Oak Ridge National Laboratory) Hardware failures are inevitable on large high performance computing systems. Faults or performance degradations in the high-speed network can reduce the entire system's performance.
Since the introduction of the Gemini interconnect, Cray systems have become resilient to many networking faults. These new network reliability and resiliency features have enabled higher uptimes on Cray systems by allowing them to continue running with reduced network performance. Oak Ridge National Laboratory has developed a set of user-level diagnostics that stresses the high-speed network and searches for components that are not performing as expected. Nearest-neighbor bandwidth tests check every network chip and network link in the system. Additionally, performance counters stored in the network ASIC's memory-mapped registers (MMRs) are used to get a fuller picture of the state of the network. Applications have also been characterized under various suboptimal network conditions to better understand what impact network problems have on user codes. pdf, pdf Paper Technical Session 13B Merlot / Syrah Jason Hill The Changing Face of Storage for Exascale Brent Gorda (Intel Corporation) Cray joins Intel (Whamcloud), The HDF Group, EMC and DDN as partners in the US Department of Energy FastForward program, which is aimed at spurring research in key technologies for exascale. This two-year program is mostly research, but does have proof-of-concept (and open source) code delivery attached. As we near the halfway point in the program, we will present the big Exascale picture, progress to date, and the view of the path forward at this point in time. Cray's Implementation of LNET Fine Grained Routing: Overview and Characteristics Mark S. Swan (Cray Inc.) and Nic Henke (Xyratex) As external Lustre file systems become larger and more complicated, configuring the Lustre network transport layer (LNET) can also become more complicated. This paper will focus on where Fine Grained Routing (FGR) came from, why Cray uses FGR, tools Cray has developed to aid in FGR configurations, analysis of FGR schemes, and performance characteristics. pdf, pdf Discovery in Big Data using a Graph Analytics Appliance Amar Shan and Ramesh Menon (Cray Inc.) Discovery, the uncovering of hidden relationships and unknown patterns, lies at the heart of advancing knowledge. Discovery has long been viewed as the province of human intellect, with automation difficult. However, things have to change: the explosion of Big Data has made automating the synthesis of insight from raw data mandatory. Graph analytics is particularly well suited to discovery challenges for a number of reasons. The explicit representation of relationships enables the rapid incremental addition of new sources of data. New relationships can be added as their importance is understood, and automated inferencing can be applied to augment the body of knowledge. And most importantly, the use of the high performance uRiKA graph analytics appliance enables ad hoc, pattern-based queries to be run in real time – enabling joint man-machine discovery. This talk will present some of the use cases for discovery that have emerged over the past year, including person-of-interest identification, cyber-intrusion detection, Medicare fraud analytics, drug discovery using a systems biology approach, operational risk assessment in financial organizations and a variety of others. Paper Technical Session 13C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Helen He Using the Cray Gemini Performance Counters Kevin Pedretti, Courtenay Vaughan, Richard Barrett, Karen Devine and K.
Scott Hemmert (Sandia National Laboratories) This paper describes our experience using the Cray Gemini performance counters to gain insight into the network resources being used by applications. The Gemini chip consists of two network interfaces and a common router core, each providing an extensive set of performance counters. Based on our experience, we have found some of these counters to be more enlightening than others. More importantly, we have performed a set of controlled experiments to better understand what the counters are actually measuring. These experiments led to several surprises, described in this paper. This supplements the documentation provided by Cray and is essential information for anybody wishing to make use of the Gemini performance counters. The MPI library and associated tools that we have developed for gathering Gemini performance counters are described and are available to other Cray users as open-source software. pdf, pdf Performance Measurements of the NERSC Cray Cascade System Harvey J. Wasserman, Nicholas J. Wright, Brian M. Austin and Matthew J. Cordery (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) We present preliminary performance results for NERSC's "Edison" system, one of the first Cray XC30 supercomputers. The primary new feature of the XC30 architecture is the Cray Aries interconnect. We use several network-centric "microbenchmarks" to measure the Aries' substantial improvements in bandwidth, latency, message rate, and scalability. The distinctive contribution of this work consists of performance results for the NERSC Sustained System Performance (SSP) application benchmarks. The SSP benchmarks span a wide range of science domains, algorithms and implementation choices, and provide a more holistic performance metric. We examine the performance and scalability of these benchmarks on the XC30 and compare performance with other state-of-the-art HPC platforms. Edison nodes are composed of two 8-core Intel "Sandy Bridge" processors, with two hyperthreads per core. With 32 hardware threads per node, multi-threading is essential for optimal performance. We report the OpenMP, core-specialization and hyperthreading settings that maximize SSP on the XC30. pdf, pdf From thousands to millions: visual and system scalability for debugging and profiling Mark O'Connor, David Lecomber, Ian Lumb and Jonathan Byrd (Allinea Software) Behind the achievements of double-digit Petaflop counts, million-core systems, and sustained Petaflop real-world applications, software tools have been the silent unsung heroes. Whether aiding software migration or solving a critical acceptance bug, tools such as Allinea DDT have been ready. We will explore how Allinea DDT has been prepared for today's hybrid Cray XK7s and the Cray XC30s. We will also detail the high performance achieved when debugging on the NCSA BlueWaters and ORNL Titan Cray systems at full scale. Looking ahead, with the Intel Xeon Phi entering the market, changes are being made to enable developers and researchers to run their applications successfully. We are able to show how two software tools, Allinea DDT and Allinea MAP, are changing the HPC software development environment, completing the lifecycle of HPC software with one common tool suite enabling both debugging and application profiling.
Paper 3:45pm-5:15pm Technical Session 14A Zinfandel / Cabernet Ashley Barker Investigating Topology Aware Scheduling David Jackson (Adaptive Computing) For many years, HPC applications have been able to assume good network support for all-to-all communications, meaning that no matter how workloads were placed across the network, the application would experience maximum performance. While all networks have some limitations associated with their underlying hardware and topology, the difference between the best possible allocation and the worst possible was often small enough to be in the realm of statistical noise, and thus any associated issues were generally ignored. Now, as systems and workloads grow into the petascale and exascale range, the communication within an application becomes massive and the difference between best-case and worst-case allocations becomes significant. The differences between one placement decision and another can now noticeably impact application efficiency, job run-time consistency, and even neighboring workloads. The integration of workload management solutions with network management infrastructure is the natural follow-on to this issue and allows the job scheduler to be aware of the configuration, topology, strengths, and limitations of a given network. With this knowledge, and with properly optimized placement algorithms, the scheduler can efficiently place workloads in a network-aware manner. Intelligent placement algorithms can help maximize task proximity, minimize bottlenecks, and significantly improve job performance, run-time consistency, and overall system throughput. In collaboration with Cray, NCSA, and other major Cray sites, Adaptive Computing has begun an ambitious research project to model Cray's Gemini 3D torus network and enable a highly advanced topology-aware scheduling algorithm. This research has matured beyond initial prototypes and has begun evaluating various approaches against actual workloads. This talk will discuss the problem space, the general approaches and considerations, and the benefits seen to date when tested against these real-world workloads. External Torque / Moab and Fairshare on the Cray XC30 Tina Declerck and Iwona Sakrejda (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) NERSC's new Cray XC30, Edison, utilizes a new capability in Adaptive Computing's Torque 4.x and Moab 7.x products which allows the Torque server and Moab to execute external to the mainframe. This configuration offloads the mainframe server database and provides a unified view of the workload. Additionally, it allows job submissions when the mainframe is unavailable/offline. This paper discusses the configuration process, differences between the old and new methods, troubleshooting techniques, fairshare experiences, and user feedback. While this capability addresses some of the needs of the NERSC community, it is not without tradeoffs and challenges. pdf, pdf Production Experiences with the Cray-Enabled TORQUE Resource Manager Matthew A. Ezell and Don Maxwell (Oak Ridge National Laboratory) and David Beer (Adaptive Computing) High performance computing resources utilize batch systems to manage the user workload. Cray systems differ from typical clusters due to Cray's Application Level Placement Scheduler (ALPS). ALPS manages binary transfer, job launch and monitoring, and error handling. Batch systems require special support to integrate with ALPS using an XML protocol called BASIL.
Previous versions of Adaptive Computing's TORQUE and Moab batch suite integrated with ALPS from within Moab, using Perl scripts to interface with BASIL. This would occasionally lead to problems when all the components became unsynchronized. Version 4.1 of the TORQUE Resource Manager introduced new features that allow it to integrate directly with ALPS using BASIL. This paper describes production experiences at Oak Ridge National Lab using the new TORQUE software versions. pdf, pdf Paper Technical Session 14B Merlot / Syrah Steve Simms Evaluation of A Flash Storage Filesystem on the Cray XE-6 Jay Srinivasan and Shane Canon (Lawrence Berkeley National Laboratory) This paper will discuss some of the approaches and show early results for a Flash file system mounted on a Cray XE-6 using high-performance PCIe-based cards. We also discuss some of the gaps and challenges in integrating flash into HPC systems and potential mitigations, as well as new solid-state storage technologies and their likely role in the future. pdf, pdf Analysis of the Blue Waters File System Architecture for Application I/O Performance Kalyana Chadalavada and Robert Sisneros (National Center for Supercomputing Applications, University of Illinois) The NCSA Blue Waters features one of the fastest file systems for scientific applications. Using the Lustre file system technology, Blue Waters provides over 1 TB/s of usable storage bandwidth. The underlying storage units are connected to the compute nodes in a unique fashion. The Blue Waters file system connects a subset of storage units to the high-speed torus network at distinct points. Utilizing standard benchmarks and scientific applications, we examine the impact of this architecture on application I/O performance. Given the size of the system and its intended applications, scaling I/O performance will be a challenge. Identifying the optimal I/O methodology can help alleviate a large number of application performance issues. All exercises are done in a production environment to ensure that beneficial results are directly applicable to Blue Waters users. pdf, pdf Trillion Particles, 120,000 cores, and 350 TBs: Lessons Learned from a Hero I/O Run on Hopper Suren Byna and Andrew Uselton (Lawrence Berkeley National Laboratory), Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), David Knaak (Cray Inc.) and Yun (Helen) He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Modern petascale applications can present a variety of configuration, runtime, and data management challenges when run at scale. In this paper, we describe our experiences in running a large-scale plasma physics simulation, called VPIC, on the NERSC Hopper Cray XE6 system. The simulation ran on 120,000 cores using ~80% of the computing resources, 90% of the available memory on each node, and 50% of a Lustre file system. Over two trillion particles were simulated for 23,000 timesteps, and 10 one-trillion-particle dumps, each ranging between 30 and 42 TB, were written to HDF5 files at a sustained rate of ~27 GB/s. To the best of our knowledge, this job represents the largest I/O undertaken by a NERSC application and the largest collective writes to single HDF5 files. We outline several obstacles that we overcame in the process of completing this run, and list lessons learned that are of potential interest to HPC practitioners. We will elaborate on the following insights in the paper: 1.
Collective writes to a single shared HDF5 file can work as well as file-per-process writes. We demonstrate that collective writes from 20,000 MPI processes to a single, shared ~40 TB HDF5 file using collective buffering can achieve a sustained performance of 27 GB/s on a Lustre file system. The peak performance of the system is ~35 GB/s, which is achieved by our code for a substantial fraction of the runtime. This outperformed the strategy where each process wrote a separate file, i.e., a total of 20,000 files, which achieved 24 GB/s. 2. Advance verification of file system hardware is important for obtaining peak performance. Our initial execution of VPIC achieved only 65% of Lustre peak performance. With the use of the Lustre Monitoring Toolkit (LMT), we pinpointed the problem to a small set of slow OSTs, which were exhibiting degraded performance. We temporarily excluded these OSTs from our tests, and were able to demonstrate ~80% of the peak I/O rates. Advance verification for slow OSTs can avoid performance pitfalls. 3. Advance verification of available resources for memory-intensive applications is important. Since the simulation requires 90% of the memory on each node, it was necessary to verify that each node reserved for executing this simulation had at least that much available memory. Unreleased memory from previous applications could cause out-of-memory errors. We will also discuss tuning multiple layers of the parallel I/O subsystem and emphasize the need for scalable tools for diagnosing software and hardware problems. pdf, pdf Paper Technical Session 14C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Helen He Performance Comparison of Scientific Applications on Cray Architectures Haihang You, Reuben D. Budiardja, Jeremy Logan, Lonnie D. Crosby, Vincent Betro, Pragneshkumar Patel, Bilel Hadri and Mark Fahey (National Institute for Computational Sciences) Current HPC architectures are changing drastically and rapidly, while mature scientific applications usually evolve at a much slower rate. New architectures almost certainly impact the performance of these heavily used scientific applications. Therefore, it is prudent to understand how the supposed performance benefits and improvements of new architectures translate to the applications. In this paper, we attempt to quantify the differences between theoretical performance improvements (due to changes in architecture) and "real-world" improvements in applications by gathering performance data for selected applications from the fields of chemistry, climate, weather, materials science, fusion, and astrophysics running on three different Cray architectures: XT5, XE6, and XC30. The performance evaluations of these selected applications on these three architectures may give users perspective on the potential benefits of each architecture. These evaluations are done by comparing the improvements of numerical (micro)-benchmarks to the improvements of the selected applications when run on these architectures. First 12-cabinet Cray XC30 System at CSCS: Scaling and Performance Efficiencies of Applications Sadaf Alam, Themis Athanassiadou, Tim Robinson, Gilles Fourestey, Andreas Jocksch, Luca Marsella, Jean-Guillaume Piccinali and Jeff Poznanovic (Swiss National Supercomputing Centre) CSCS has recently deployed one of the largest Cray XC30 systems, which is composed of 6 groups or 12 cabinets of dual-socket Intel Sandy Bridge nodes, and the new Aries network chips with a dragonfly topology.
With respect to earlier Cray XT and XE series platforms, the Cray XC30 has several unique features that have the potential to affect application performance: (1) Intel Xeon- vs. AMD Opteron-based nodes; (2) Aries vs. Gemini network and router chip; (3) PCIe vs. HyperTransport interface to the network chip; (4) dragonfly vs. 3D torus topology; (5) mixed optical and copper vs. all-copper cables; (6) number of compute nodes per communication NIC; (7) Hyper-Threading-enabled nodes; and (8) compute cabinet layouts. In this report, we compare scaling and performance efficiencies of a range of applications on the CSCS Cray XC30 and Cray XE6 platforms. pdf, pdf Effects of Hyper-Threading on the NERSC workload on Edison Zhengji Zhao, Nicholas J. Wright and Katie Antypas (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Edison, a Cray XC30, is NERSC's newest petascale supercomputer. Along with the Aries interconnect, Hyper-Threading (HT) is one of the new technologies available on the system. HT provides simultaneous multithreading capability on each core, with two hardware threads available. In this paper, we analyze the potential benefits of HT for the NERSC workload by investigating the performance implications of HT on a few selected applications among the top 15 codes at NERSC, which represent more than 60% of the workload. By connecting the observed HT results with more detailed profiling data, we discuss whether it is possible to predict how and when users should utilize HT in their production computations on Edison. pdf, pdf Paper | Thursday, May 9th 8:30am-10am Technical Session 15A Zinfandel / Cabernet Craig Stewart Preparing Slurm for use on the Cray XC30 Stephen Trofinoff and Colin McMurtrie (Swiss National Supercomputing Centre) In this paper we describe the technical details associated with the preparation of Slurm for use on the 12-cabinet XC30 system installed at the Swiss National Supercomputing Centre (CSCS). The system comprises internal and external login nodes and a new ALPS/BASIL version, so a number of technical challenges needed to be overcome in order to have Slurm working on the system. Thanks to a Cray-supplied emulator of the system interface, work was possible ahead of delivery, and this eased the installation when the system arrived. However, some problems were encountered, and their identification and resolution are described in detail. We also provide details of the work done to improve the Slurm task affinity bindings on a general-purpose Linux cluster so that they, as closely as possible, match the Cray bindings, thereby providing our users with some degree of consistency in application behaviour between these systems. pdf, pdf Lessons From 20 Continuous Years of Cray/HPC Systems Liam Forbes, Don Bahls, Gene McGill, Oralee Nudson and Gregory Newby (Arctic Region Supercomputing Center, UAF) The Arctic Region Supercomputing Center (ARSC) was founded in 1992/1993 with a Cray Y-MP (denali) and since then has operated or owned at least one Cray system, including most recently a Cray XK6m-200 (fish). For 20 years, ARSC has shared high performance computing (HPC) experiences, users, and problems with other university HPC centers, DoD HPC centers, and DoE HPC centers.
In this paper, we will document and present the user support and system administration lessons we have learned from the perspective of a smaller, regional University HPC center operating and supporting the same architectures as some of the largest systems in the world over that time. Comparisons to experiences with HPC hardware and software products from other vendors will be used to illustrate some of the points. pdf, pdf Cray Workload Management with PBS Professional 12.0 Scott Suchyta and Sam Goosen (Altair Engineering, Inc.) Changing requirements, trends and technologies in HPC computing are frequent, and workload managers like PBS Professional must continually evolve to accommodate them. One challenge sites have faced has been configuring PBS to address their individual requirements. Site-defined custom resources and configurable scheduling policies were introduced to help accomplish this, but are insufficient to address more complex scenarios. A more robust infrastructure is required to manage the dynamic resources and policies that are unique to modern HPC sites. Our discussion will include examples that customers may wish to adopt or customize to address their specific needs, including admission control, allocation management, and on-the-fly tuning. Independent of plugins, PBS Professional supports the multithreaded processors available on current Cray platforms. Additional enhancements will become available when integration with BASIL version 1.3 is complete. In the interim, details about configuring these systems for use with PBS Professional 12.0 will be presented. Paper Technical Session 15B Merlot / Syrah Robert Henschel Introduction to HSA Hardware, Software and HSAIL with an HPC Usage Example Vinod Tipparaju (AMD, Inc.) Heterogeneous systems have been around for several years, and accelerator-based heterogeneous systems (CPU-GPU) have become popular in the last five years. In particular, accelerating general-purpose computation using GPUs is gaining momentum in both academic research and industry. OpenCL and CUDA are the two most popular programming models that enable end-application programmers to take advantage of the GPGPU through the compiler, runtime, and driver tool chain. While the opportunity of GPGPU has been opened up to expert programmers, it has not yet reached a broad audience, primarily for the following reasons: (i) the CPU-GPU system has a distributed, asymmetric memory that needs to be explicitly managed for coherency and synchronization; (ii) two-way, high-latency memory copies and kernel dispatch; and (iii) lack of support for dynamic scheduling or load balancing, advanced debugging, system calls, exception handling, etc. The Heterogeneous System Architecture (HSA) is a new set of architectural features (to be standardized) to efficiently support a wide range of data-parallel and task-parallel programming models. HSA architectural features include: Unified Virtual Address Space, Architected User Mode Queuing, Fully-Coherent Memory Model, Architected Queuing Language (AQL), and several others. Thus, the overarching goal of HSA is to bring GPGPU to the masses by drastically improving the productivity, performance and energy-efficiency of the applications that may want to take advantage of GPU acceleration. HSA-enabled processors come with an associated software ecosystem to expose the architectural features, which includes the HSA Driver, the HSA Runtime, and the HSA Intermediate Language (HSAIL).
Specifically, the HSA Runtime exposes Coherent Memory, Architected User-Mode Queues, and architected low-latency dispatch through low-level APIs. These APIs are designed (and standardized) to be generic, and can be consumed by several high-level runtimes, programming models and languages (OpenCL, C++ AMP, Java, OpenMP, etc.). HSAIL is an abstract virtual machine language for HSA components, which will be standardized. Thus, each vendor of an HSA component will comply with the standard set of architectural features, provide a core runtime implementation (adhering to the standard), and a finalizer component that translates HSAIL into its vendor-specific ISA. Overall, using the new architectural features of HSA and its software ecosystem, it is possible to support several high-level programming models and languages and, at the same time, influence them to improve programmability, thereby bringing heterogeneous computing to the masses. Reliable Computation Using Unpredictable Components Joel O. Stevenson, Robert A. Ballance, Suzanne M. Kelly, John P. Noe and Jon R. Stearley (Sandia National Laboratories) and Michael E. Davis (Cray Inc.) Based on our experiences over the last year running large simulations on the DOE/ASC platform Cielo, we will discuss strategies that enable large, long-running simulations to make predictable progress despite platform component failures. From an application perspective, complex systems like Cielo have multiple sources of interrupts and slowdowns that combine to make the system appear unpredictable. We will discuss the component failures observed and identify those where application recovery has been possible. Users and application developers assist in mitigating stability issues. One approach is to employ scripting mechanisms that trap and identify failures and recover where possible (a minimal sketch of such a wrapper follows this session's abstracts). An important aspect of running simulations, data and metadata management, will be stressed. Simulations often require multiple restarts before completion. Strategies for maximizing application availability and for integrating huge volumes of data, in the context of still-evolving robust file systems, will be discussed. Specific recommendations for improving workflow will be provided. pdf, pdf Requirements Analysis for Adaptive Supercomputing using the Cray XK7 as a Case Study Sadaf R. Alam, Mauro Bianco, Ben Cumming, Gilles Fourestey, Jeffrey Poznanovic and Ugo Varetto (Swiss National Supercomputing Centre) In this report, we analyze the readiness of the code development and execution environment for adaptive supercomputers, where a processing node is composed of heterogeneous computing and memory architectures. Current instances of such a system are Cray XK6 and XK7 compute nodes, which are composed of x86_64 CPU and NVIDIA GPU devices and DDR3 and GDDR5 memories, respectively. Specifically, we focus on the integration of the CPU and accelerator programming environments, tools, MPI, and numerical libraries, as well as operational features such as resource monitoring, system maintainability and upgradability. We highlight portable, platform-independent technologies that exist for the Cray XE, XK, and XC30 platforms and discuss dependencies in the CPU, GPU and network tool chains that lead to current challenges for integrated solutions. This discussion enables us to formulate requirements for a future, adaptive supercomputing platform, which could contain a diverse set of node architectures.
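The Sandia abstract above mentions scripting mechanisms that trap failures and restart long-running simulations, but the proceedings listing does not show them. The following is a minimal, hypothetical sketch of such a restart wrapper, assuming an application (here called simulate) that exits non-zero on failure and can resume from its newest checkpoint via a --restart flag; the aprun launch line, checkpoint naming, and flags are all illustrative and not Cielo-specific.

#!/usr/bin/env python
"""Minimal restart wrapper: rerun a simulation from its latest checkpoint
after a failure, up to a fixed number of attempts. Illustrative only."""
import glob
import os
import subprocess
import sys

MAX_ATTEMPTS = 5                            # give up after this many failed attempts
CHECKPOINT_GLOB = "checkpoints/chkpt_*.h5"  # hypothetical checkpoint file naming

def latest_checkpoint():
    """Return the most recently written checkpoint file, or None."""
    files = glob.glob(CHECKPOINT_GLOB)
    return max(files, key=os.path.getmtime) if files else None

def run_once():
    """Launch one attempt, resuming from a checkpoint when one exists."""
    cmd = ["aprun", "-n", "1024", "./simulate"]   # placeholder launch line
    chkpt = latest_checkpoint()
    if chkpt:
        cmd += ["--restart", chkpt]
    return subprocess.call(cmd)

for attempt in range(1, MAX_ATTEMPTS + 1):
    rc = run_once()
    if rc == 0:
        sys.exit(0)                     # clean completion
    print("attempt %d failed with exit code %d; retrying" % (attempt, rc))
sys.exit(1)                             # persistent failure: surface it to the batch system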
pdf, pdf Paper Technical Session 15C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Douglas W. Doerfler Improving the Performance of the PSDNS Pseudo-Spectral Turbulence Application on Blue Waters using Coarray Fortran and Task Placement Robert A. Fiedler, Nathan Wichmann and Stephen Whalen (Cray Inc.) and Dmitry Pekurovsky (San Diego Supercomputer Center) The PSDNS turbulence application performs many 3D FFTs per time step, which entail frequently transposing distributed 3D arrays. These transposes are achieved via multiple concurrent All-to-All communication operations, which dominate the overall execution time at large scales. We improve the All-to-All times for benchmarks on 3072 to 12288 nodes using three main strategies: 1) eliminating off-node communication for one of the two sets of transposes by assigning one sheet of the 3D Cartesian grid to each node (35% speedup), 2) placing tasks on nodes that are distributed randomly throughout the Gemini network in order to maximize the All-to-All bandwidth that can be utilized by the job's nodes (21% speedup), and 3) reducing contention and overhead by replacing calls to MPI_Alltoall with a drop-in library written in Coarray Fortran (33% speedup). We also describe how this library is implemented and integrated efficiently in PSDNS. pdf, pdf A Review of the Challenges and Results of Refactoring the Community Climate Code COSMO for Hybrid Cray HPC Systems Benjamin Cumming (Swiss National Supercomputing Centre), Carlos Osuna (Center For Climate Systems Modeling ETHZ), Tobias Gysi (Supercomputing Systems AG), Mauro Bianco (Swiss National Supercomputing Centre), Xavier Lapillonne and Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss) and Thomas C. Schulthess (ETH Zurich) We summarize the results of porting the numerical weather simulation code COSMO to different hybrid Cray HPC systems. COSMO was written in Fortran with MPI, and the aim of the refactoring was to support both many-core systems and GPU-accelerated systems with minimal disruption to the user community. With this in mind, different approaches were taken to refactor the different components of the code: the dynamical core was refactored with a C++-based domain-specific language for structured grids, which provides both CUDA and OpenMP back ends; and the physical parameterizations were refactored by adding OpenACC and OpenMP directives to the original Fortran code. This report gives a detailed description of the challenges presented by such a large refactoring effort using different languages on Cray systems, along with performance results on three different Cray systems at CSCS: Rosa (XE6), Todi (XK7) and Daint (XC30). pdf, pdf CloverLeaf: Preparing Hydrodynamics Codes for Exascale Andrew C. Mallinson and David A. Beckingsale (University of Warwick), Wayne P. Gaudin and John A. Herdman (Atomic Weapons Establishment), John M. Levesque (Cray Inc.) and Stephen A. Jarvis (University of Warwick) In this work we directly evaluate five candidate programming models for future exascale applications (MPI, MPI+OpenMP, MPI+OpenACC, MPI+CUDA and CAF) using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE.
Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scale required to reach exascale levels of computation on future system architectures. We present our results and experiences of scaling these different approaches to high node counts on existing large-scale Cray systems (Titan and HECToR). We also examine the effect that improving the mapping between process layout and the underlying machine interconnect topology can have on performance and scalability, as well as highlighting several communication-focused optimisations. pdf, pdf Paper 10:30am-12pm Technical Session 16A Zinfandel / Cabernet John Noe Methods and Results for Measuring Kepler Utilization on a Cray XK7 Jim Rogers (Oak Ridge National Laboratory), Roger Green (NVIDIA) and Kevin Peterson (Cray Inc.) NVIDIA is providing an API as part of its official CUDA 5.5 release branch (R319) that Cray can use to provide specific and inherent utilization information from the Kepler GPU. NVIDIA and Cray will provide this capability as part of a feature release once the release cadence for both the NVIDIA driver and the Cray software is complete. The intent of the talk is to provide an early description of the driver changes, the API, the Cray interface, and some examples against the Titan workload using a pre-release version of both the NVIDIA driver/API and the Cray accounting software. Resource Utilization Reporting on Cray Systems Andrew P. Barry (Cray Inc.) Many Cray customers want to evaluate how their systems are being used, across a variety of metrics. Neither previous Cray accounting tools nor commercial server management software allow the collection of all the desirable statistics with minimal performance impact. Resource Utilization Reporting (RUR) is being developed by Cray to collect statistics on how systems are used. RUR provides a reliable, high-performance framework into which plugins may be inserted; each plugin collects data about the usage of a particular resource. RUR is configurable, extensible, and lightweight. Cray will supply plugins to support several sets of collected data, which will be useful to a wide array of Cray customers; customers can also implement plugins to collect data uniquely interesting to their own systems (an illustrative sketch of this collector pattern follows this session's abstracts). Plugins also support multiple methods of outputting collected data. Cray expects to release RUR in the second half of 2013. pdf, pdf The Complexity of Arriving at Useful Reports to Aid in the Successful Operation of an HPC Center Ashley Barker, Adam Carlyle, Chris Fuson, Mitch Griffith and Don Maxwell (Oak Ridge National Laboratory) While reporting may not be the first item to come to mind as one of the many challenges that HPC centers face, it is certainly a task that all of us have to devote resources to. One of the biggest problems with reporting is determining what information is needed in order to make impactful decisions that can influence everything from policies to purchasing decisions. There is also the problem of how frequently to review the data collected. For some data points, it is necessary to look at reports on a daily basis, while others are not useful unless examined over longer periods of time. This paper will look at the efforts the Oak Ridge Leadership Computing Facility has taken over the last few years to refine the data that is collected, reported, and reviewed.
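The RUR abstract above describes a framework that separates data collection from data output, but its actual plugin interface is not given in this listing. The sketch below is only an illustration of that collector/output split under assumed choices: it gathers a few per-job statistics with the standard POSIX getrusage call and appends them as JSON to a local file. It is not RUR code, and the record fields and file name are invented.

#!/usr/bin/env python
"""Illustrative data-collection pattern: gather per-job resource statistics
and emit them through an interchangeable output backend. This mimics the
collector/output separation described for RUR; it is not the RUR API."""
import json
import resource
import socket
import time

def collect():
    """Gather a few usage statistics for processes launched by this job script."""
    ru = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "utime_s": ru.ru_utime,        # user CPU time of child processes
        "stime_s": ru.ru_stime,        # system CPU time of child processes
        "maxrss_kb": ru.ru_maxrss,     # peak resident set size
    }

def output_to_file(record, path="job_usage.json"):
    """One possible output method: append a JSON record to a local file."""
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    output_to_file(collect())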
pdf, pdf Paper Technical Session 16B Merlot / Syrah Liz Sim Building Balanced Systems for the Cray Datacenter of the Future Keith Miller (DataDirect Networks) The top computing sites worldwide are faced with unique data access, management and protection challenges. In this talk, DDN, the leader in massively scalable storage solutions for Big Data applications, will discuss how joint DDN and Cray customers are achieving balanced, highly productive HPC environments today in the face of huge capacity, performance and reliability requirements, as well as directions in building the Cray data center of the future. The content will include DDN's recent developments and roadmap for block, file, object and analytics solutions and appliances, and will touch on Lustre performance testing. Surviving the Life Sciences Data Deluge using Cray Supercomputers Bhanu Rekapalli and Paul Giblock (National Institute for Computational Sciences) The growing deluge of data in the Life Science domains threatens to overwhelm computing architectures. This persistent trend necessitates the development of effective and user-friendly computational components for rapid data analysis and knowledge discovery. Bioinformatics, in particular, employs data-intensive applications driven by novel DNA-sequencing technologies, as do the high-throughput approaches that complement proteomics, genomics, metabolomics, and meta-genomics. We are developing massively parallel applications to analyze this rising flood of life sciences data for large-scale knowledge discovery. We have chosen to work with the desktop- or cluster-based applications most widely used by the scientific community, such as NCBI BLAST, HMMER, DOCK6, and MUSCLE. Our work extends highly scalable parallel applications that scale to tens of thousands of cores on Cray's XT architecture to Cray's next-generation XE, XK, and XC architectures, while also focusing on making them robust and optimized, as will be discussed in this paper. Early Experience on Crays with Genomic Applications Used as Part of Next Generation Sequencing Workflow Mikhail Kandel (University of Illinois), Steve Behling and Bill Long (Cray Inc.), Carlos P. Sosa (Cray Inc. and University of Minnesota Rochester), Sebastien Boisvert and Jacques Corbeil (Universite Laval) and Lorenzo Pesce (University of Chicago) Recent progress in DNA sequencing technology has yielded a new class of devices that allow for the analysis of genetic material with unprecedented speed and efficiency. These advances, styled under the name Next Generation Sequencing (NGS), are well suited for High-Performance Computing (HPC) systems. By breaking up DNA into millions of small strands (20 to 1000 bases) and reading them in parallel, the rate at which genetic material can be acquired has increased by several orders of magnitude. The technology to generate raw genomic data is becoming increasingly fast and inexpensive when compared to the rate at which this data can be analyzed. In general, assembling small reads into a useful form is done by either assembling individual reads (de novo) or mapping these pieces against a reference. In this paper we present our experience with these applications on Cray supercomputers, in particular with Ray, a parallel short-read assembler code. pdf, pdf Paper Technical Session 16C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Nicholas J.
Wright Measuring Sustained Performance on Blue Waters with the SPP Metric William Kramer (National Center for Supercomputing Applications) The Blue Waters Project developed the Sustained Petascale Performance (SPP) metric to assess the potential of the Blue Waters system to meet its goal of sustained petascale performance for a diverse set of science and engineering problems. The SPP, consisting of over 20 individual tests (code+input), is unique and truly representative of the ability of a system to support many areas of science and engineering. The SPP is a method that allows an accurate assessment of hybrid systems that have more than one type of node, which has not been possible before. This talk will cover 1) the underlying concepts of the SPP method, 2) the SPP implementation, 3) the selection of the codes and problem sets for the test cases, 4) the optimizations, 5) the improvements for each of the tests, 6) the results, and 7) the challenges, lessons and future improvements of the SPP. Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7 Donald K. Berry (Indiana University), Joseph Schuchart (Technische Universität Dresden) and Robert Henschel (Indiana University) GPU computing has rapidly gained popularity as a way to achieve higher performance for many scientific applications. In this paper we report on the experience of porting a hybrid MPI+OpenMP molecular dynamics code to a GPU-enabled Cray XK7 to produce a hybrid MPI+GPU code. The target machine, Indiana University's Big Red II, consists of a mix of nodes equipped with two 16-core Abu Dhabi x86-64 processors, and nodes equipped with one AMD Interlagos x86-64 processor and one NVIDIA Kepler K20 GPU board. The code, IUMD, is a Fortran program developed at Indiana University for modeling matter in compact stellar objects (white dwarf stars, neutron stars and supernovas). We compare experiences using CUDA and OpenACC. pdf, pdf Chasing Exascale: the Future of GPU Computing Steve Scott (NVIDIA) Changes in underlying silicon technology are creating a significant disruption to computer architectures and programming models. Power has become the primary constraint on processor performance, and threatens our ability to continue historic rates of performance improvement. With silicon technology no longer providing the rapid rate of improvement it once did, we must rely on advances in architectural efficiency. This has led to the creation of heterogeneous (or accelerated) architectures, and the rise of GPU computing. This talk will describe the motivations behind GPU computing, assess the current state of the art, and discuss how it is likely to evolve over the coming decade as we endeavor to build Exascale computers. It will discuss an architectural convergence that is unifying processor designs, literally from cell phones to supercomputers, and the implications for how we program these machines. Paper 12pm-1pm Interactive 17A Zinfandel / Cabernet John Hesterberg System Monitoring, Accounting and Metrics John Hesterberg (Cray Inc.) Let's talk about data collection! Cray Management Services will present our immediate and near-term roadmap in the areas of System Monitoring, Accounting, and Metrics collection as a starting point for conversation. For GPU-enabled Cray systems, NVIDIA and Cray are developing mechanisms for measuring GPU utilization, on-GPU memory bandwidth use, and high-water GPU memory usage.
Provided through feature releases from NVIDIA and Cray in Fall 2013, these metrics will be available on a per-job basis and can be integrated into your job and system accounting through a user-customizable plug-in. System monitoring, predictive failure analysis, job-based energy accounting, application resource utilization, user billing: the use cases keep increasing. Provide input on future data collection priorities and use cases. Birds of a Feather Interactive 17B Merlot / Syrah Jenett Tillotson Experiences with Moab and TORQUE Jenett Tillotson (Indiana University) This BoF will focus on administrator experiences with Moab and TORQUE, in particular the interface with ALPS, experiences with Moab 7 and TORQUE 4, and running Moab and/or TORQUE outside the Cray on an external scheduling node. Attendees will be asked to share their configurations, and we will discuss possible best practices for Moab and TORQUE configurations on Cray systems. Birds of a Feather Interactive 17C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Birds of a Feather 1pm-2:30pm Technical Session 18A Zinfandel / Cabernet Ashley Barker Blue Waters Acceptance: Challenges and Accomplishments Celso L. Mendes, Brett Bode, Gregory H. Bauer, Joseph R. Muggli, Cristina Beldica and William T. Kramer (National Center for Supercomputing Applications) Blue Waters, the largest supercomputer ever built by Cray, comprises an enormous amount of computational power. This paper describes some of the challenges encountered during the deployment and acceptance of Blue Waters, and presents how those challenges were handled by the NCSA team. After briefly reviewing our originally designed acceptance plans, we highlight the steps actually taken for that process, describe how those steps were conducted, and comment on lessons learned during that process. Besides listing the scope of the applied tests, we present an overview of their results and analyze the manner in which those results guided both the Cray and NCSA teams in tuning the system configuration. The Blue Waters acceptance testing process consisted of hundreds of tests, summarized in the paper, covering many areas directly related to the Cray system as well as other items, such as the near-line storage and the external user-support environment. pdf, pdf Saving Energy with “Free” Cooling and the Cray XC30 Brent Draney, Tina Declerck, Jeffrey Broughton and John Hutchings (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Located in Oakland, CA, NERSC is running its new XC30, Edison, using “free” cooling. Leveraging the benign San Francisco Bay Area environment, we are able to provide a year-round source of water from cooling towers alone (no chillers) to supply the innovative cooling system in the XC30. While this approach provides excellent energy efficiency (PUE < 1.1), it is not without its challenges. This paper describes our experience designing and operating such a system, the benefits that we have realized, and the trade-offs relative to conventional approaches. pdf, pdf Real-time mission critical supercomputing with Cray systems Jason Temple and Luc Corbeil (Swiss National Supercomputing Centre) System integrity and availability are essential for real-time scientific computing in mission-critical environments. Human lives rely on decisions derived from results provided by Cray supercomputers.
The tools used for science in general must be reliable and produce the same results every time, without fail, on demand, or the results will not be trustworthy or worthwhile. In this paper, we will describe the engineering challenges of providing a reliable and highly available system to the Swiss Weather Service using Cray solutions, and we will relate recent real-life experiences that led to specific design choices. pdf, pdf Paper Technical Session 18B Merlot / Syrah Jenett Tillotson High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6 Jim Brandt (Sandia National Laboratories), Tom Tucker (Open Grid Computing), Ann Gentile (Sandia National Laboratories), David Thompson (Kitware Inc.) and Victor Kuhns and Jason Repik (Cray Inc.) A common problem experienced by users of large-scale High Performance Computing (HPC) systems, including the Cray XE6, is the inability to gain insight into their computational environments. Our Lightweight Distributed Metric Service (LDMS) is intended to be run as a continuous system service providing low-overhead remote collection of, and on-node access to, high-fidelity data. It is capable of handling hundreds of data values per node per second, vastly exceeding the data collection sizes and rates typically handled by current HPC monitoring services, while still maintaining much lower overhead. We present a case study of utilizing LDMS on the Cray XE6 platform Cielo to enable remote storage of system resource data for post-run analysis and node-local access to data for run-time in-situ analysis and workload rebalancing. We also present information from deployment on an XK6 system at Sandia, where we leverage RDMA over the Gemini transport to further reduce LDMS overhead. pdf, pdf Production I/O Characterization on the Cray XE6 Philip Carns (Argonne National Laboratory), Yushu Yao (Lawrence Berkeley National Laboratory), Kevin Harms, Robert Latham and Robert Ross (Argonne National Laboratory) and Katie Antypas (Lawrence Berkeley National Laboratory) I/O performance is an increasingly important factor in the productivity and efficiency of large-scale HPC systems such as Hopper, a 153,216-core Cray XE6 system operated by the National Energy Research Scientific Computing Center (NERSC). The scientific workload diversity of such systems presents a challenge for I/O performance tuning, however. Applications vary in terms of data volume, I/O strategy, and access method, making it difficult to consistently evaluate and enhance their I/O performance. We have adapted and tuned the Darshan I/O characterization tool for use on Hopper in order to address this challenge. Darshan is an I/O instrumentation library that collects I/O access pattern information from large-scale production applications with minimal overhead. In this paper we present our experiences in both adapting Darshan to the unique challenges of the Cray XE6 platform and deploying Darshan for full-time production use. We validated our deployment strategy by measuring the overhead introduced by Darshan for a diverse collection of large-scale applications. Our results indicate that Darshan introduces minimal runtime overhead and is therefore suitable for transparent production deployment. Darshan was enabled for all Hopper users on November 15, 2012, and instruments over 5,000 jobs per day on Hopper as of April 2013.
We use the data collected with Darshan to explore how metrics can be applied to characterization data to automatically identify the applications that can benefit most from additional I/O tuning. pdf, pdf Improvement of TOMCAT-GLOMAP File Access with User Defined MPI Datatypes Mark Richardson (Numerical Algorithms Group) and Martyn Chipperfield (University of Leeds) This paper describes the modification of the file access patterns that occur throughout the simulation runs. The analysis identified several subroutines where the workload was not actually in accessing the data but in processing the data, either before writing it or after reading it. The main gains in the project have come from a change in the practice of overloading MPI task zero. The time overhead per iteration of the writing step has been reduced from 8 seconds to 0.01s for a small case, and from 38 seconds to 0.05s for a larger case. pdf, pdf Paper Technical Session 18C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Liam O. Forbes Cray’s Cluster Supercomputer Architecture John Lee, Susan Kraus and Maria McLaughlin (Cray Inc.) In the first half of this presentation, we will discuss Cray’s cluster supercomputer architecture designs, built upon industry-standard, optimized, modular server platforms. You will learn how platform selection is one of the key factors influencing today’s datacenter decisions for configuration flexibility, scalability and performance-per-watt based on the latest processing technologies. You will also learn how industry-standard high-performance network connectivity, streamlined I/O and diverse storage options can maximize system performance at a lower cost of ownership. We will share examples of different high-performance networking topologies, such as Fat Tree or 3D Torus (InfiniBand) with single- or dual-rail configurations, that are meeting a variety of HPC workload technical requirements. In the second half of this presentation, we will discuss the essential cluster software and management tools that are required to build and support a cluster architecture, combined with key compatibility features of the Advanced Cluster Engine™ (ACE) management software. pdf, pdf Performance Metrics and Application Experiences on a Cray CS300-AC™ Cluster Supercomputer Equipped with Intel® Xeon Phi™ Coprocessors Vincent C. Betro, Robert P. Harkness, Bilel Hadri, Haihang You, Ryan C. Hulguin, R. Glenn Brook and Lonnie D. Crosby (National Institute for Computational Sciences) Given the growing popularity of accelerator-based supercomputing systems, it is beneficial for applications software programmers to have cognizance of the underlying platform and its workings while writing or porting their codes to a new architecture. In this work, the authors highlight experiences and knowledge gained from porting such codes as ENZO, H3D, GYRO, a BGK Boltzmann solver, HOMME-CAM, PSC, AWP-ODC, TRANSIMS, and ASCAPE to the Intel Xeon Phi architecture running on a Cray CS300-AC™ Cluster Supercomputer named Beacon. Beacon achieved 2.449 GFLOP/W in High Performance LINPACK (HPL) testing and a number-one ranking on the November 2012 Green500 list.
Areas of optimization that yielded the most performance gain are highlighted, and a set of metrics for comparison and lessons learned by the team at the National Institute for Computational Sciences Application Acceleration Center of Excellence is presented, with the intention that it can give new developers a head start in porting as well as a baseline for comparing their own code's exploitation of fine- and medium-grained parallelism. pdf, pdf Paper 3pm-4:30pm Technical Session 19A Zinfandel / Cabernet Tina Butler Effect of Rank Placement on Cray XC30 Communication Cost Reuben D. Budiardja, Lonnie D. Crosby and Haihang You (National Institute for Computational Sciences) The newly released Cray XC30 supercomputer boasts the new Aries interconnect, which incorporates a Dragonfly network topology. This hierarchical network topology has obvious advantages with respect to local communication. However, as communication patterns extend further down the hierarchy and grow more separated, the overall impact of particular bottlenecks and the trade-offs between bandwidth and latency become less apparent. In particular, applications may be more or less latency sensitive based on their communication pattern. The dynamic routing options, as a result, may affect some applications more severely than others. In this paper, we investigate the effect of process placement on the communication costs associated with typical communication patterns shared by many scientific applications. Observations concerning the communication performance of benchmarks and selected applications are presented and discussed. Evaluating Node Orderings For Improved Compactness Carl Albing (Cray Inc.) This paper demonstrates an evaluation technique that provides guidance for site-specific selection of the node ordering related to application placement. Reasonable performance of parallel applications has been achieved through application placement in Cray XT/XE/XK 3D-torus systems using allocation strategies based on an ordered, one-dimensional sequence of nodes. Node ordering is a low (computation) cost way to incorporate topological information into application placement decisions. With several orderings from which to choose - and others that could be created - what is the basis for choosing one ordering over another? A method is described herein for the static evaluation of node orderings. Several orderings are evaluated for actual Cray systems, large and small. The results provide visually compelling guidance on the choice of node ordering. This comparison of characteristic curves provides useful guidance to sites for choosing a node ordering that might extract further performance improvements from their systems. pdf, pdf Improving Task Placement for Applications with 2D, 3D, and 4D Virtual Cartesian Topologies on 3D Torus Networks with Service Nodes Robert A. Fiedler and Stephen Whalen (Cray Inc.) We describe two new methods for mapping applications with multidimensional virtual Cartesian process topologies onto 3D torus networks with randomly distributed service nodes. The first method, “Adaptive Layout”, works for any number of processes and distributes the MILC (lattice QCD, 4D topology) workload to ensure that communicating processes are close together on the torus. This scheme reduces the run time by 2.7X compared to default placement. The second method, “Topaware”, selects a prism of nodes slightly larger than the ideal prism one would select if there were no service nodes.
The application’s processes are ordered to group neighboring processes on the same node and to place groups of neighbors onto nodes that are no more than a few hops apart. Up to 40% run time reductions are obtained for 2D and 3D virtual topologies. In dedicated mode, using Topaware with MILC reduces the run time by 3.7X compared to default placement. pdf, pdf Paper Technical Session 19B Merlot / Syrah Zhengji Zhao The State of the Chapel Union Bradford L. Chamberlain, Sung-Eun Choi, Martha B. Dumler, Thomas Hildebrandt, David Iten, Vassily Litvinov and Greg Titus (Cray Inc.) Chapel is an emerging parallel programming language that originated under the DARPA High Productivity Computing Systems (HPCS) program. Although the HPCS program is now complete, the Chapel language and project remain very much alive and well. Under the HPCS program, Chapel generated sufficient interest among HPC user communities to warrant continuing its evolution and development over the next several years. In this paper, we reflect on the progress that was made with Chapel under the auspices of the HPCS program, noting key decisions made during the project's history. We also summarize the current state of Chapel for programmers who are interested in using it today. Finally, we describe current and ongoing work to evolve it from prototype to production grade, and to make it better suited for execution on next-generation systems. pdf, pdf Recent enhancements to the Automatic Library Tracking Database infrastructure at the Swiss National Supercomputing Centre Timothy W. Robinson and Neil Stringfellow (Swiss National Supercomputing Centre) The Automatic Library Tracking Database (ALTD)—an infrastructure developed previously by staff at the National Institute for Computational Sciences (NICS)—is in production today on Cray XT, XE, XK, and XC30 systems at several Cray sites, including NICS, Oak Ridge National Laboratory, the National Energy Research Scientific Computing Center, and the Swiss National Supercomputing Centre (CSCS). ALTD automatically and transparently stores information about applications running on Cray systems and also records which libraries are linked to those applications; from these data, support staff at HPC centres can derive a wealth of information about software usage—such as the use or non-use of particular compiler suites or the uptake of numerical libraries and third-party applications—right down to the level of specific version numbers. The tool works by intercepting the GNU linker to gather information on compilers and libraries, and intercepting the job launcher to track the execution of applications at launch time. We have recently extended the ALTD framework deployed at CSCS to record more detailed information on the individual jobs executed on our machines: the job information recorded by the previous incarnation of ALTD was limited to user name, executable, (batch) job id, and run date; we have extended the tool to record many additional job characteristics such as begin and end times, requested versus used core counts, number of processing elements and threads per process, and mode of linking (e.g. static, dynamic).
In combination with custom post-processing scripts—which map executables to software codes, research domains or research groups—our ALTD implementation now delivers a far more complete picture of system usage, providing not only a list of running applications but also information on the way that these same applications are being run (an illustrative post-processing sketch appears further below). On a practical level, such information can be used, for example, to guide future hardware and software procurements, or to assess whether or not researchers are using our systems in the manner for which they were provided with resource allocations. pdf, pdf Comparing Compiler and Library Performance in Material Science Applications on Edison Jack Deslippe and Zhengji Zhao (National Energy Research Scientific Computing Center) Materials science and chemistry applications are expected to represent approximately 1/3 of the computational workload on NERSC's Cray XC30 system, Edison. The performance of these applications can often depend sensitively on the compiler and compiler options used at build time. For this reason, the NERSC user services group supplies users with optimized builds of the most commonly used materials science applications in order to ensure these cycles are used as efficiently as possible. In this paper, we compare the performance of various materials science and chemistry applications when built with the Cray, Intel and GNU compiler suites under various compiler options, as well as when linked against the MKL, LibSci and FFTW libraries. We compare the optimal compilers and libraries on Edison with those previously obtained on the NERSC Cray XE6 machine, Hopper. pdf, pdf Paper Technical Session 19C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) John Noe A Single Pane of Glass: Bright Cluster Manager for Cray Matthijs van Leeuwen, Mark Blessing and David Maples (Bright Computing) Bright Cluster Manager provides comprehensive cluster management for Cray systems in one integrated solution: deployment, provisioning, scheduling, monitoring, and management. Its intuitive GUI provides complete system visibility and ease of use for multiple systems and clusters simultaneously, including automated tasks and intervention. Bright also provides a powerful management shell for those who prefer to manage via a command-line interface. Bright Cluster Manager extends to cover the full range of Cray systems, spanning servers, clusters and mainframes, as well as external servers (large-scale Lustre file systems, login servers, data movers, and pre- and post-processing servers). Cray has also used Bright Cluster Manager to create additional services for its customers. Bright Computing also provides unique cloud-bursting capabilities as a standard feature of Bright Cluster Manager, automatically cloud-enabling clusters at no extra cost. Users can seamlessly extend their clusters, adding and managing cloud-based nodes as needed, or create entirely new clusters on the fly with a few mouse clicks. Either way, users benefit from complete visibility and management within the Bright environment. Bright Cluster Manager also removes the complexity of configuring and managing Hadoop servers, enabling administrators to configure a compute-ready system from bare metal, typically in less than an hour, and then fully manage it with Bright. This presentation is an overview of Bright Cluster Manager and its capabilities, with particular emphasis on the value Bright provides to Cray users.
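The ALTD abstract in Technical Session 19B above refers to custom post-processing scripts that map executables to application names; those scripts are site-specific and not shown in this listing. The following is a small, hypothetical sketch of that kind of mapping, aggregating job records by application using a hand-maintained table of executable patterns. The CSV record format (user, executable, jobid columns) and the application patterns are invented for illustration only.

#!/usr/bin/env python
"""Sketch of ALTD-style post-processing: map executable paths from job
records to application names and count jobs per application.
Record format and patterns are hypothetical."""
import csv
import re
from collections import Counter

# Hand-maintained mapping from executable-name patterns to application names.
APP_PATTERNS = [
    (re.compile(r"vasp", re.I), "VASP"),
    (re.compile(r"cp2k", re.I), "CP2K"),
    (re.compile(r"namd", re.I), "NAMD"),
]

def classify(executable):
    """Return the application name for an executable path, or 'other'."""
    for pattern, name in APP_PATTERNS:
        if pattern.search(executable):
            return name
    return "other"

def summarize(path):
    """Count jobs per application from a CSV with columns: user,executable,jobid."""
    counts = Counter()
    with open(path) as fh:
        for row in csv.DictReader(fh):
            counts[classify(row["executable"])] += 1
    return counts

if __name__ == "__main__":
    for app, njobs in summarize("altd_jobs.csv").most_common():
        print("%-10s %6d jobs" % (app, njobs))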
Supporting Multiple Workloads, Batch Systems, and Computing Environments on a Single Linux Cluster Larry Pezzaglia (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) A new Intel-based, InfiniBand-attached computing system from Cray Cluster Solutions (formerly Appro) at NERSC provides computational resources to transparently expand several existing NERSC production systems serving three different constituencies: a mixed serial/parallel mid-range workload; a serial, high-throughput High-Energy Physics/Nuclear Physics workload; and a mixed serial/parallel genomics workload. This is accomplished by leveraging a disciplined image-building system and CHOS, a software package written at NERSC comprising a Linux kernel module, a PAM module, and batch system integration, to concurrently support multiple Linux compute environments on a single Linux system. Using CHOS, this new computing system can run workloads built for the hosted clusters while simultaneously maintaining a single consistent, lean, and maintainable base OS across the entire platform. pdf, pdf Tools to Execute An Ensemble of Serial Jobs on a Cray Abhinav Thota, Scott Michael, Sen Xu, Thomas G. Doak and Robert Henschel (Indiana University) Traditionally, Cray supercomputers have been located at large supercomputing centers and were used to run highly parallel applications. The user base consisted mostly of researchers from the fields of physics, mathematics, astronomy and chemistry. But in recent times, Cray supercomputers have become available to a wider range of users from a variety of disciplines. Examples include the Kraken machine at the National Institute for Computational Sciences (NICS), Hopper at the National Energy Research Scientific Computing Center (NERSC), and Big Red II at Indiana University. Predictably, as the diversity of end users has grown, the workload has expanded to include a variety of workflows containing serial and hybrid applications, as well as complex workflows involving pilot jobs. Projects that employ a massive number of serial jobs, in an embarrassingly data-parallel manner, have not traditionally been targeted to run on Cray supercomputers. To accomplish such projects, it is usually necessary to bundle a large number of serial jobs into a much larger parallel job, via either a pilot-job framework, an MPI wrapper, or custom scripting (a minimal sketch of the MPI-wrapper approach follows this listing). In this article, we explore several of the current offerings for bundling serial jobs on a Cray supercomputer and discuss some of the benefits and shortcomings of each of the approaches. The approaches we evaluate include BigJob, PCP, and native aprun with scripts. pdf, pdf Paper 4:45pm-5:15pm Closing General Session 20 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) Nick Cardo CUG Closing Session Nick Cardo The closing general session will include an appreciation of the local arrangements committee and a preview of CUG 2014 from the host site. Invited Talk |
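None of the bundling tools compared in the Indiana University paper above (BigJob, PCP, aprun with scripts) are reproduced here; the sketch below only illustrates the generic "MPI wrapper" approach that the abstract mentions, using mpi4py to spread a list of independent serial commands across the ranks of a single large parallel job. The availability of mpi4py, the tasks.txt file name, and the aprun launch line are assumptions for illustration.

#!/usr/bin/env python
"""Illustrative MPI wrapper for bundling serial jobs: each rank of a single
large parallel job takes every size-th command from a task list and runs it
as an ordinary serial process. Launch with, e.g., aprun -n 256 python wrapper.py"""
import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# One shell command per line; "tasks.txt" is a placeholder file name.
with open("tasks.txt") as fh:
    tasks = [line.strip() for line in fh if line.strip()]

failures = 0
for task in tasks[rank::size]:        # simple static round-robin distribution
    rc = subprocess.call(task, shell=True)
    if rc != 0:
        failures += 1

# Gather failure counts so rank 0 can report an overall status.
total_failures = comm.reduce(failures, op=MPI.SUM, root=0)
if rank == 0:
    print("completed %d tasks with %d failures" % (len(tasks), total_failures))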