CUG2014 Final Proceedings | Created 2014-5-26 |
Sunday, May 4th | Monday, May 5th 8:30am-12:00pm Tutorial 1A (Half Day) Track A/Plenary - B1 John Levesque Reveal - A remarkable scoping facility for multi/many core systems Reveal - A remarkable scoping facility for multi/many core systems John Levesque and Heidi Poxon (Cray Inc.) We have entered an era when restructuring an application for the emerging HPC architectures is an absolute necessity. Most scientific applications today are MPI-only (NERSC's latest measurements indicate that 80% of their applications utilize only one thread per MPI task). The most difficult task in converting an application to use OpenMP is to parallelize the most important loops within the application, which involves scoping many variables through complex call chains. Over the past three years, Cray has developed a tool to assist the user in this most difficult task. Reveal offers an incredible breakthrough in parallelism assistance by employing “whole program” analysis, which gives it the ability to analyze looping structures that contain subroutine and function calls. Even with Reveal, there has to be an understanding of the issues related to inserting OpenMP directives. Reveal is like a multi-functional power tool that must be employed by a knowledgeable programmer. In this tutorial, Reveal will be demonstrated by the principal designer of the tool, to give attendees a good understanding of how best to navigate it, and by an experienced user of Reveal and OpenMP, who will explain the idiosyncrasies involved in adding OpenMP to an application based on Reveal’s feedback. Going forward, the importance of generating an efficient hybrid application cannot be over-emphasized, whether the target is OpenACC for the XC30 with GPUs or Intel's next-generation Phi systems. This tutorial will demonstrate the performance that can be obtained with Reveal on applications that are utilized in the community today. Tutorial Tutorial 1C (Half Day) Track C - B2 David Bigagli Slurm Workload Manager Use Slurm Workload Manager Use David Bigagli and Morris Jette (SchedMD LLC) Slurm is an open source workload manager used on half of the world's most powerful computers and provides a rich set of features including topology-aware optimized resource allocation, the ability to expand and shrink jobs on demand, failure management support for applications, hierarchical bank accounts with fair-share job prioritization, job profiling, and a multitude of plugins for easy customization. Recent changes to Cray software now permit Slurm to directly manage Cray XC30 network resources and directly launch MPI jobs, providing a richer job scheduling environment than previously possible. This tutorial will provide an overview for users and system administrators who are new to Slurm. Topics presented will include: 1) Slurm's architecture and design, including descriptions of the resources managed, plugins and daemons; 2) Monitoring the system and queues; 3) Submitting and running jobs; 4) Managing jobs; 5) File movement; 6) Accounting Tutorial 8:30am-4:30pm Tutorial 1B/2B (Full Day) Track B - B3 Jim Jeffers Optimizing for MPI/OpenMP on Intel® Xeon Phi™ Coprocessors Optimizing for MPI/OpenMP on Intel® Xeon Phi™ Coprocessors John Pennycook, Hans Pabst and Jim Jeffers (Intel Corporation) Although Intel® Xeon Phi™ coprocessors are based on the x86 architecture, making porting straightforward, existing codes will likely require additional tuning efforts to maximise performance. 
This all-day, hands-on tutorial will teach high-level techniques and design patterns for improving the performance and scalability of MPI and OpenMP applications while keeping source code maintainable and familiar. Attendees will run and optimize sample codes on a Cray CS300-AC supercomputer, and tuning examples for real-world codes using coprocessors will be discussed. Cluster administration considerations for coprocessors, based on experiences with Beacon at NICS, will be included. Morning Session The morning session will introduce Intel® Xeon Phi™ coprocessors and explore how to effectively leverage existing MPI and OpenMP parallelism, plus hybrid approaches that utilize both paradigms. Afternoon Session The afternoon session will focus on the best practices for using the new compiler-assisted offload and explicit vectorization constructs introduced in the recently unveiled OpenMP 4.0 standard. Tutorial 1:00pm-4:30pm Tutorial 2A (Half Day) Track A/Plenary - B1 Alistair Hart OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools Alistair Hart, Luiz DeRose and James Beyer (Cray Inc.) Portability and programming difficulty are critical obstacles to widespread adoption of accelerators (GPUs and coprocessors) in High Performance Computing. The dominant programming models for accelerator-based systems (CUDA and OpenCL) can extract high performance from accelerators, but at high cost in usability, maintenance, development and portability. To be an effective HPC platform, hybrid systems need a high-level programming environment to enable widespread porting and development of applications that run efficiently on either accelerators or CPUs. In this tutorial we present the high-level, cross-vendor OpenACC parallel programming model for accelerator-based systems. The tutorial provides a step-by-step introduction to OpenACC, illustrated by real worked examples. Using personal experience in porting large-scale HPC applications, we provide development guidance, practical tricks and tips to enable effective and efficient use of these hybrid systems, including the Cray XC30 and XK7. This tutorial will not only introduce the OpenACC parallel programming model for accelerator-based systems, but will also demonstrate the full development cycle when porting applications to OpenACC, covering compilers, libraries, and tools that are currently available and support this cross-vendor initiative. The presenters' experience will be used to concentrate on the parts of the OpenACC v1.0 and v2.0 standards that are most relevant to HPC developers. This tutorial will benefit users looking to develop, port and optimize applications for accelerator-based HPC systems using OpenACC, including first-time users and experienced developers. No familiarity with accelerators or parallel programming models is assumed. Tutorial Tutorial 2C (Half Day) Track C - B2 Shawn Hoopes Tackle Massive Big Data Challenges with Big Workflow - Advanced Training for Multi-Dimensional Policies that Accelerate Insights Tackle Massive Big Data Challenges with Big Workflow - Advanced Training for Multi-Dimensional Policies that Accelerate Insights Shawn Hoopes (Adaptive Computing) Today, Cray and Adaptive Computing power the world’s largest and most robust supercomputers such as Blue Waters, HLRN, NERSC, NOAA and many more. 
Adaptive Computing and its Moab scheduling and optimization software play a huge role in accelerating insights for Cray users with Big Workflow. Big Workflow is an industry term coined by Adaptive Computing to describe the acceleration of insights for IT professionals through more efficient processing of intense simulations and big data analysis. Adaptive’s Big Workflow solution unifies all available resources, optimizes the analysis process and monitors workflow status, allowing Cray systems to do what they do best: deliver massive compute resources to tackle Big Data challenges. During this tutorial, Adaptive Computing will deliver advanced training on configuring multi-dimensional policies that work together to process workflows at the optimal time across the ideal combination of diverse resources. This hands-on tutorial will allow you to troubleshoot policies and learn how to accelerate insights by getting the most out of your Cray cluster. Also during this tutorial, Adaptive Computing will give a status update on Topology Aware Scheduling. For the past year, Adaptive Computing, in collaboration with Cray, NCSA, and other major Cray sites, has undertaken an effort to model Cray’s Gemini 3D torus network and develop an advanced topology-aware scheduling algorithm. The implementation is maturing and has been tested with production workloads on real Cray systems. Adaptive will discuss the problem space, the general approaches and considerations, and the benefits seen to date when tested against these real-world workloads. Tutorial 4:45pm-6:00pm Interactive 3A Track A/Plenary - B1 Colin McMurtrie System Support SIG System Support SIG Colin McMurtrie (Swiss National Supercomputing Centre) This is a meeting of the System Support Special Interest Group. Birds of a Feather Interactive 3B Track B - B3 Ashley Barker Programming Environments, Applications and Documentation SIG Programming Environments, Applications and Documentation SIG Ashley Barker (Oak Ridge National Laboratory) This is a meeting of the Programming Environments, Applications and Documentation Special Interest Group. Birds of a Feather Interactive 3C Track C - B2 Nicholas Cardo Open discussion with CUG Board Open discussion with CUG Board Nick Cardo (National Energy Research Scientific Computing Center) Open discussion with the current CUG Board Birds of a Feather | Tuesday, May 6th 8:30am-10:00am General Session 4 Track A/Plenary - B1 Nicholas Cardo CUG Welcome CUG Welcome Nick Cardo (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Welcome from the CUG President Supercomputers: instruments for science or dinosaurs that haven’t gone extinct yet? Supercomputers: instruments for science or dinosaurs that haven’t gone extinct yet? Thomas Schulthess (Swiss National Supercomputing Centre) High-performance computing has dramatically improved scientific productivity over the past 50 years. It turned simulations into a commodity that all scientists can now use to produce knowledge and understanding about the world and the universe, using data from experiment and theoretical models that can be solved numerically. Since the beginnings of electronic computing, supercomputing – loosely defined as the most powerful scientific computing at any given time – has led the way in technology development. Yet, the way we interact with supercomputers today has not changed much since the days we stopped using punch cards. 
I do not claim to understand why, but nevertheless would like to propose a change in how we develop models and applications that run on supercomputers. General Session 10:30am-12:00pm General Session 5 Track A/Plenary - B1 Nicholas Cardo Industry Trends & Technologies Industry Trends & Technologies William Blake (Cray Inc.) Industry Trends & Technologies Cray Products - Shaping the Future Cray Products - Shaping the Future Barry Bolding (Cray Inc.) Cray Products - Shaping the Future General Session 1:00pm-2:30pm Technical Session 6A Track A/Plenary - B1 Hans-Hermann Frese Cray Management System Updates and Directions Cray Management System Updates and Directions John Hesterberg (Cray Inc.) Cray has made a number of updates to its system management in the last year, and we continue to evolve this area of our software. We will review the new capabilities that are available now, such as serial workloads on repurposed compute nodes, Resource Utilization Reporting (RUR) and the initial limited release of the Image Management and Provisioning System (IMPS). We will look at what is coming in the upcoming releases, including the next steps of IMPS capabilities, and the initial integration of new technologies from the OpenStack projects. Producing the Software that Runs the Most Powerful Machines in the World: the Inside Story on Cray Software Test and Release. Producing the Software that Runs the Most Powerful Machines in the World: the Inside Story on Cray Software Test and Release. Cray XC System Level Diagnosability: Commands, Utilities and Diagnostic Tools for the Next Generation of HPC Systems Cray XC System Level Diagnosability: Commands, Utilities and Diagnostic Tools for the Next Generation of HPC Systems Jeffrey J. Schutkoske (Cray Inc.) The Cray XC system is significantly different from the previous generation Cray XE system. The Cray XC system is built using new technologies including transverse cooling, Intel processor based nodes, PCIe interface from the node to the network ASIC, Aries Network ASIC and Dragonfly topology. The diagnosability of a Cray XC system has also been improved by a new set of commands, utilities and diagnostics. This paper describes how these tools are used to aid in system level diagnosability of the Cray XC system. Tech Paper - Systems Technical Session 6B Track B - B3 Scott Michael Lustre and PLFS Parallel I/O Performance on a Cray XE6 Lustre and PLFS Parallel I/O Performance on a Cray XE6 Brett M. Kettering, Alfred Torrez, David J. Bonnie and David L. Shrader (Los Alamos National Laboratory) Today’s computational science demands have resulted in larger, more complex parallel computers. Their PFSes (Parallel File Systems) generally perform well for N-N I/O (Input/Output), but often perform poorly for N-1 I/O. PLFS (Parallel Log-Structured File System) is a PFS layer under development that addresses the N-1 I/O shortcoming without requiring the application to rewrite its I/O. The PLFS concept has been covered in prior papers. In this paper, we will focus on an evaluation of PLFS with Lustre underlying it versus Lustre alone on a Cray XE6 system. We observed significant performance increases when using PLFS over these applications’ normal N-1 I/O implementations. While some work remains to make PLFS production-ready, it shows great promise to provide an application and underlying file system agnostic means of allowing programmers to use the N-1 I/O model and obtain near N-N I/O model performance without maintaining custom I/O implementations. 
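To make the N-1 I/O pattern discussed in the PLFS abstract above concrete, the following is a minimal sketch of an N-1 (single shared file) checkpoint write using MPI-IO. It is illustrative only and not taken from the paper; the file name and block size are assumptions. PLFS transparently remaps exactly this kind of concurrent shared-file access into per-process log files underneath, without changes to the application's I/O calls.

    /* Illustrative N-1 write: every rank writes its own block into one shared
     * file. The file name and block size are assumptions, not from the paper. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK_DOUBLES 1048576   /* 8 MiB of doubles per rank */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(BLOCK_DOUBLES * sizeof(double));
        for (int i = 0; i < BLOCK_DOUBLES; i++)
            buf[i] = (double)rank;

        /* One file shared by all ranks -- the N-1 model the paper evaluates. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank targets its own offset within the shared file. */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, BLOCK_DOUBLES, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

An N-N variant would instead open one file per rank (for example by suffixing the rank to the file name); the PLFS approach lets codes keep the simpler shared-file form above while approaching N-N performance.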
Addressing Emerging Issues of Data at Scale Addressing Emerging Issues of Data at Scale Keith Miller (DataDirect Networks) As the storage provider powering over 2/3 of the world's fastest
supercomputers, DataDirect Networks (DDN) is uniquely positioned to deliver solutions for the emerging data-centric computing era. DDN is developing and delivering cache-centric and NoFS approaches to storage to help tackle new issues of data scale. In this talk, Keith Miller, Vice President of Technical Sales, Services & Support at DataDirect Networks, will discuss how leading global supercomputing sites are leveraging cache-centric storage today to manage performance and cost at scale, and how this approach extends to pre-Exascale and Exascale deployments in the form of “burst buffer caching”. Object storage approaches will also be covered with site-specific use cases and deployment information. I/O Router Placement and Fine-Grained Routing on Titan to Support Spider II I/O Router Placement and Fine-Grained Routing on Titan to Support Spider II Matthew A. Ezell (Oak Ridge National Laboratory), David A. Dillow (N/A) and Sarp Oral, Feiyi Wang, Devesh Tiwari, Don E. Maxwell, Dustin Leverman and Jason Hill (Oak Ridge National Laboratory) The Oak Ridge Leadership Computing Facility (OLCF) introduced the concept of fine-grained routing in 2008 to improve I/O performance between the Jaguar supercomputer and Spider, the center-wide Lustre file system. Fine-grained routing organizes I/O paths to minimize congestion. Jaguar has since been upgraded to Titan, providing more than a ten-fold improvement in peak performance. To support the center’s increased computational capacity and I/O demand, the Spider file system has been replaced with Spider II.
Building on the lessons learned from Spider, an improved method for placing LNET routers was developed and implemented for Spider II. The fine-grained routing scripts and configuration have been updated to provide additional optimizations and better match the system setup.
This paper presents a brief history of fine-grained routing at OLCF, an introduction to the architectures of Titan and Spider II, methods for placing routers in Titan, and details about the fine-grained routing configuration. Tech Paper - Filesystems & I/O Technical Session 6C Track C - B2 Abhinav S. Thota The Cray Programming Environment: Current Status and Future Directions The Cray Programming Environment: Current Status and Future Directions Luiz DeRose (Cray Inc.) The scale of current and future high end systems, as well as the increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to be able to close the gap between observed and achievable performance on current and future supercomputer systems, application developers need a programming environment that can hide the issues of scale and complexity of high end HPC systems. In this talk I will present the recent activities and future directions of the Cray Programming Environment, which are being developed and deployed according to Cray’s adaptive supercomputing strategy, focusing on maximizing performance and programmability, as well as easing porting and tuning efforts, to improve users' productivity on Cray supercomputers. New Functionality in the Cray Performance Analysis and Porting Tools New Functionality in the Cray Performance Analysis and Porting Tools Heidi Poxon (Cray Inc.) The Cray performance analysis and porting tools are set on an evolutionary path to help address application performance challenges associated with the next generation of HPC systems. This toolset provides key porting and performance measurement and analysis functionality needed when parallelizing codes for better use of new processors, and when tuning codes that run on Cray multi-core and hybrid computing systems. The recent focus of the Cray tools has been on ease of use and more intuitive user interfaces, as well as on access to more of the information available from processors. This paper describes new functionality including AMD L3 cache counter support on Cray XE systems, a new GPU timeline for Cray systems with GPUs, additional OpenMP parallelization assistance through Reveal (the loop-scoping task this addresses is sketched after this session listing), and power metrics on Cray XC systems. It's all about the applications: how system owners and application developers can get more out of their Cray It's all about the applications: how system owners and application developers can get more out of their Cray David Lecomber (Allinea Software) Supercomputers are designed for the purpose of generating the results that enable scientific understanding and progress. More cores mean more computing power, yet not all applications receive the same benefits from a new system immediately. The success of a system relies on its applications: making them scale, and making them efficient. We will introduce Allinea Performance Reports, a simple, transparent tool that provides users and system owners with a one-page performance analysis of their applications - from memory use and vectorization to MPI. The reports can be used to target optimization to the applications that need the most help, improving efficiency. These lead naturally into Allinea Software's development tools - the unified debugging and performance profiling tools, Allinea DDT and Allinea MAP - which make scalable application development possible. We will show how the complete process of scalable HPC development is enhanced through the combined experience of Allinea's tools. 
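As context for the Reveal-assisted OpenMP work described in the tutorial and tools abstracts above, the following is a minimal sketch of the variable-scoping decisions a programmer (or Reveal) must make before a loop can safely carry an OpenMP directive. The code is illustrative only and not drawn from any of the presentations; the array names, sizes and computation are assumptions.

    /* Illustrative only: each variable referenced in the loop must be scoped
     * before the directive is safe -- the task Reveal's whole-program
     * analysis assists with. */
    #include <math.h>
    #include <stdio.h>

    #define N 100000

    int main(void)
    {
        static double x[N], y[N];
        double tmp, norm = 0.0;

        for (int i = 0; i < N; i++)
            x[i] = (double)i / N;

        /* x and y are shared; tmp must be private to each thread;
         * norm is a sum reduction. Mis-scoping tmp or norm creates a race. */
        #pragma omp parallel for default(none) shared(x, y) private(tmp) reduction(+:norm)
        for (int i = 0; i < N; i++) {
            tmp = sin(x[i]) * cos(x[i]);
            y[i] = tmp;
            norm += tmp * tmp;
        }

        printf("norm = %f\n", sqrt(norm));
        return 0;
    }

When the loop body spans deep call chains, classifying every variable this way by hand becomes the dominant cost of the port; that scoping task is what Reveal's whole-program analysis is designed to assist.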
Tech Paper - PE & Applications 3:00pm-5:00pm Technical Session 7A Track A/Plenary - B1 Jim Rogers Cray Hybrid XC30 Installation – Facilities Level Overview Cray Hybrid XC30 Installation – Facilities Level Overview Ladina Gilly, Colin McMurtrie and Tiziano Belotti (Swiss National Supercomputing Centre) In this paper we describe, from a facilities point of view, the installation of the 28-cabinet Cray hybrid XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS). This system was the outcome of a 12-month collaboration between CSCS and Cray and, as a consequence, is the first such system of its type worldwide. The focus of the paper is on the site preparation and integration of the system into CSCS' state-of-the-art HPC data centre. As with any new system architecture, the installation phase brings challenges at all levels. In order to achieve a quick turnaround of the initial bring-up it is essential to ensure that the site design is flexible enough to accommodate unforeseen variances in system environmental requirements. In the paper we detail some of the challenges encountered and the steps taken to ensure a quick and successful installation of the new system. Cray XC30 Power Monitoring and Management Cray XC30 Power Monitoring and Management Steven J. Martin and Matthew Kappel (Cray Inc.) Cray customers are increasingly demanding better performance per watt and finer-grained control of total power consumption in their data centers. Customers are requesting features that allow them to optimize application performance per watt, and to conduct research in support of future system and application power efficiency. New system procurements are increasingly constrained by site power and cooling limitations, the cost of power and cooling, or both. This paper describes features developed in support of system power monitoring and management for the Cray XC30 product line, expected use cases, and potential features and functions for the future. First Experiences With Validating and Using the Cray Power Management Database Tool First Experiences With Validating and Using the Cray Power Management Database Tool Gilles Fourestey, Benjamin Cumming, Ladina Gilly and Thomas C. Schulthess (Swiss National Supercomputing Centre) In October 2013 CSCS installed the first hybrid Cray XC30 system, dubbed Piz Daint. This system features the power management database (PMDB), which was recently introduced by Cray to collect detailed power consumption information in a non-intrusive manner. Power measurements are taken on each node, with additional measurements for the Aries network and blowers, and recorded in a database. This enables fine-grained reporting of power consumption that is not possible with external power meters, and is useful to both application developers and facility operators. This paper will show how benchmarks of representative applications at CSCS were used to validate the PMDB on Piz Daint. Furthermore we will elaborate, with the well-known HPL benchmark serving as a prototypical application, on how the PMDB streamlines the tuning for optimal power efficiency in production. Piz Daint is presently the most energy efficient petascale supercomputer in operation. Monitoring Cray Cooling Systems Monitoring Cray Cooling Systems Don Maxwell (Oak Ridge National Laboratory), Jeffrey Becklehimer (Cray Inc.) 
and Matthew Ezell, Matthew Donovan and Christopher Layton (Oak Ridge National Laboratory) While sites generally have systems in place to monitor the health of Cray computers themselves, often the cooling systems are ignored until a computer failure requires investigation into the source of the failure. The Liebert XDP units used to cool the Cray XE/XK models as well as the Cray proprietary cooling system used for the Cray XC30 models provide data useful for health monitoring. Unfortunately, this valuable information is often available only to custom solutions not accessible by a
center-wide monitoring system or is simply ignored entirely. In this paper, methods and tools used to harvest the monitoring data available are discussed, and the implementation needed to integrate the data into a center-wide monitoring system at the Oak Ridge National Laboratory is provided. Tech Paper - Systems Technical Session 7B Track B - B3 Sharif Islam Cray Data Management Platform: Cray Lustre File System Monitoring Cray Data Management Platform: Cray Lustre File System Monitoring Jeff Keopp and Harold Longley (Cray Inc.) The Cray Data Management Platform (formerly Cray External Services Systems) provides two external Lustre File System products – CLFS and Sonexion. The CLFS Lustre Monitor (esfsmon) keeps the CLFS file systems available by providing automated failover of Lustre assets in the event of MDS and/or OSS node failures. The Lustre Monitoring Toolkit (LMT) is now part of the standard ESF software release used by CLFS systems. This discussion will provide an overview of the latest version of esfsmon, which supports Lustre DNE and LMT for CLFS. Topics will include configuration, manual and automated Lustre failover and failback operations, plus integration of LMT into the CIMS monitoring framework. Tuning and Analyzing Sonexion Performance Tuning and Analyzing Sonexion Performance Mark S. Swan (Cray Inc.) This paper will present performance analysis techniques that Cray uses with Sonexion-based file systems. Topics will include Lustre client-side tuning parameters, Lustre server-side tuning parameters, the Lustre Monitoring Toolkit (LMT), Cray modifications to IOR, file fragmentation analysis, OST fragmentation analysis, and Sonexion-specific information. Building an Enterprise-class HPC storage system - Performance, Reliability and Management Building an Enterprise-class HPC storage system - Performance, Reliability and Management Torben Kling Petersen (Xyratex) As deployed HPC storage systems continue to grow in size, performance needs and complexity, a fully integrated and tested HPC storage solution is essential. Balanced system and environment components are crucial, including tuned software stacks, integrated management systems, faster rebuilds of RAID subsystems and the detection and maintenance of data integrity. This presentation covers developments in delivering a scalable HPC storage solution that addresses the most challenging problems for HPC users today. Sonexion Grid RAID Characteristics Sonexion Grid RAID Characteristics Tech Paper - Filesystems & I/O Technical Session 7C Track C - B2 Rolf Rabenseifner Scalability Analysis of Gleipnir: A Memory Tracing and Profiling Tool, on Titan Scalability Analysis of Gleipnir: A Memory Tracing and Profiling Tool, on Titan Tomislav Janjusic, Christos Kartsaklis and Dali Wang (Oak Ridge National Laboratory) Understanding application performance properties is facilitated with various performance profiling tools. Profiling tools vary in complexity, ease of deployment, profiling performance, and the detail of profiled information. Specifically, using profiling tools for performance analysis is a common task when optimizing and understanding scientific applications on complex and large-scale systems such as Cray's XK7.
Gleipnir is a memory tracing tool built as a plug-in tool for the Valgrind instrumentation framework. The goal of Gleipnir is to provide fine-grained trace information. The generated traces are a stream of executed memory transactions mapped to internal structures per process, thread, function, and finally the data structure or variable.
This paper describes the performance characteristics of Gleipnir, a memory tracing tool, on the Titan Cray XK7 system when instrumenting large applications such as the Community Earth System Model. Debugging scalable hybrid and accelerated applications on the Cray XC30, CS300 with TotalView Debugging scalable hybrid and accelerated applications on the Cray XC30, CS300 with TotalView Chris Gottbrath (Rogue Wave Software) TotalView provides users with a powerful way to analyze and understand their codes and is a key tool in developing, tuning, scaling, and troubleshooting HPC applications on the Cray XC30 Supercomputer and CS300 Cluster Supercomputer Series. As a source code debugger TotalView provides users with complete control over program execution and a view into their program at the source code and variable level. TotalView uses a scalable tree based architecture and can scale up to hundreds of thousands of processes. This talk will introduce new users to TotalView's capabilities and give experienced users an update on recent developments including the new MRnet communication tree. The talk will also highlight memory debugging with MemoryScape (which is now available for the Xeon Phi), deterministic reverse debugging with ReplayEngine, and scripting with TVScript. Integration of Intel Xeon Phi Servers into the HLRN-III Complex: Experiences, Performance and Lessons Learned Integration of Intel Xeon Phi Servers into the HLRN-III Complex: Experiences, Performance and Lessons Learned Florian Wende, Guido Laubender and Thomas Steinke (Zuse Institute Berlin) The third generation of the North German Supercomputing Alliance (HLRN) compute and storage facilities comprises a Cray XC30 architecture with exclusively Intel Ivy Bridge compute nodes. In the second phase, scheduled for November 2014, the HLRN-III configuration will undergo a substantial upgrade together with the option of integrating accelerator nodes into the system. To
support the decision-making process, a four-node Intel Xeon Phi cluster has been integrated into the present HLRN-III infrastructure at ZIB. This integration includes user/project management, file system access and job management via the HLRN-III batch system. For selected workloads, in-depth analysis, migration and optimization work on Xeon Phi is in progress. We will report our experiences and lessons learned from the Xeon Phi installation and integration process. For selected examples, initial results of the application evaluation on the Xeon Phi cluster platform will be discussed. Developing High Performance Intel® Xeon Phi™ Applications Developing High Performance Intel® Xeon Phi™ Applications Jim Jeffers (Intel Corporation) After introducing the Intel® Xeon Phi™ product family and roadmap, we will discuss how you can exploit the extensive parallel computing resources provided by Intel® Xeon Phi™ products while enhancing your development investment in industry-standards-based applications for next-generation systems. Performance optimization methods and tools will be surveyed with application examples. Tech Paper - PE & Applications | Wednesday, May 7th 7:30am-8:15am Interactive 8A Track A/Plenary - B1 John Hesterberg System Management System Management John Hesterberg (Cray Inc.) Open and interactive discussion about system administration and management of Cray systems. Possible topics could be experiences with the new Resource Utilization Reporting (RUR) capabilities, the Image Management and Provisioning System (IMPS), Advanced Cluster Engine (ACE), Cray XE upgrades to SLES11 SP3, serial workloads on repurposed compute nodes, problems and best practices for administering large scale systems, and experiences with OpenStack projects. Birds of a Feather Interactive 8B Track B - B3 Ashley Barker Developing Dashboards for HPC Centers to Enable Instantaneous and Informed Decisions to be made at a Glance Developing Dashboards for HPC Centers to Enable Instantaneous and Informed Decisions to be made at a Glance Ashley Barker (Oak Ridge National Laboratory) High Performance Computing Centers are collecting more data today than ever about every aspect of our operations including system performance, allocations, completed jobs, users, projects, trouble tickets, etc. It takes significant forethought and resources to turn the system data collected into knowledge. This knowledge in turn is used to make more impactful decisions that can influence everything from policies to purchasing decisions to user satisfaction. Based on the assumption that the scarce resource for many people in the world of today is not information but human attention, the challenge for future human-centered computer systems is not to deliver more information “to anyone, at anytime, and from anywhere,” but to provide “the ‘right’ information, at the ‘right’ time, in the ‘right’ place, in the ‘right’ way to the ‘right’ person”. Not having enough information available can lead to poor decision-making. However, with the tools available today the bigger problem does not lie with not having enough information; rather we are now faced with the issue of weeding out the insignificant data so that we do not get overwhelmed and ignore what needs our attention. A popular way to organize this valuable information is into dashboards, which make the information more easily available and reviewable. 
Dashboards provide at-a-glance views of information relevant to operational activities, enabling instantaneous and informed decisions. The purpose of this BOF is to discuss real-time dashboards from three centers and explore best practices for future dashboard work within the CUG community. Birds of a Feather Interactive 8C Track C - B2 Vincent Betro Creating a United Front in Taking HPC to the Forefront of Research: PRACE and XSEDE Training Partnerships and Roadmaps Creating a United Front in Taking HPC to the Forefront of Research: PRACE and XSEDE Training Partnerships and Roadmaps Vincent C. Betro (National Institute for Computational Sciences/University of Tennessee) In order for all scientific research to glean the benefits of high performance computing, researchers around the world must be given not only resources but also the tools and training to use those resources effectively. Both the NSF XSEDE project, the Extreme Science and Engineering Discovery Environment, in the United States and PRACE, the Partnership for Advanced Computing in Europe, in the European Union aim to supply both of these necessary elements to researchers. One area in which the two virtual organizations can cooperate very readily is training. Both face the same issues with researchers needing a just-in-time approach to applying the most modern computing resources to their research, the need for both synchronous and asynchronous training over several time zones, the need for language to be removed as a barrier, and the desire to keep the training relevant and up-to-date. Both the panelists (including Vince Betro from NICS, David Henty from EPCC and Maria Grazia-Giuffreda from CSCS) and participants will be discussing the myriad difficulties as well as the opportunities for growth in training programs through partnerships with industry and academia. Participants will be placed into small groups and given an opportunity to create a list of the most necessary training elements to share between organizations as well as contribute their knowledge of where many of these resources already exist so they may be cataloged and collected for members of both projects to utilize in growing their training programs. The resulting information will be disseminated to both projects. Birds of a Feather 8:30am-10:00am General Session 9 Track A/Plenary - B1 Nicholas Cardo Cray Products - Shaping the Future Cray Products - Shaping the Future Barry Bolding (Cray Inc.) Cray Products - Shaping the Future Adapting the COSMO weather and climate model for hybrid architectures Adapting the COSMO weather and climate model for hybrid architectures Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss) Higher grid resolution, larger ensembles and the growing complexity of weather and climate models demand ever-increasing compute power. Since 2013, several large hybrid high performance computers that contain traditional CPUs as well as some type of accelerator (e.g. GPUs) have been online and available to the user community. Early adopters of this technology trend may have considerable advantages in terms of available resources and energy-to-solution. On the downside, a substantial investment is required in order to adapt applications to such accelerator-based supercomputers. Within the COSMO Consortium and the Swiss HP2C Initiative, a version of the weather and regional climate prediction model able to run on GPUs is being developed. 
This contribution will give an overview of the status of this version and present a roadmap of further plans. The adaptations that have been made to the model (and why these adaptations will also benefit CPU-based hardware architectures) will be presented. While the physical parameterizations have been ported to GPUs using OpenACC compiler directives, the dynamical core was refactored with a C++-based domain-specific language for structured grids that provides both CUDA and OpenMP back ends. We will discuss our experience with, and the advantages and disadvantages of, these two porting approaches. This contribution will give a detailed description of the challenges presented by such a large refactoring effort using different languages on Cray systems, along with performance results on three different Cray systems at CSCS: Rosa (XE6), Todi (XK7), Daint (XC30). CUG Business Meeting CUG Business Meeting Nick Cardo (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) CUG Business Meeting: organizational updates, SIG presentations, and board elections. General Session 10:30am-12:00pm General Session 10 Track A/Plenary - B1 David Hancock CUG 2014 Best Paper Award CUG 2014 Best Paper Award David Hancock (Indiana University) Presentation of the CUG Best Paper Award by the CUG Vice President & Program Chair. Intel: Accelerating Insights … Together With You Intel: Accelerating Insights … Together With You Rajeeb Hazra (Intel Corporation) A major shift is underway in the technical computing industry across technology, business models, and market structure. First, the data explosion – both from proliferation of devices and the growth of current iterative HPC models – along with the desire to quickly turn it into valuable insight, is driving the demand for both predictive and real-time analytics - using HPC. Second, the transition of technical and high performance computing workloads to the cloud has begun, and will accelerate over the next few years as a function of the powerful economic, accessibility, usability, and scalability requirements and forces. Third, while the demand for more powerful compute will continue, the challenges associated with storing large volumes of data and then feeding it to the compute engines will spur innovation in storage and interconnect technology. Finally, technology challenges such as energy efficiency will require more attention (and innovation) across all components of the SW stack. In this talk, Raj will discuss the current dynamics of the HPC market, how Intel is innovating to address these changing trends – both in the short and long term – all while keeping an ecosystem view in mind, and how our collaborations with key partners will fit together to enable a complete and affordable solution for the entire HPC ecosystem. General Session 1:00pm-1:45pm General Session 11 Track A/Plenary - B1 David Hancock 1 on 100 or more 1 on 100 or more Peter Ungaro (Cray Inc.) Open discussion with Cray President and CEO. No other Cray employees or Cray partners are permitted during this session. General Session 2:00pm-3:30pm Technical Session 12A Track A/Plenary - B1 Tina Declerck Cray Lustre Roadmap Cray Lustre Roadmap Cory Spitz and Tom Sherman (Cray Inc.) Cray deploys ‘value-added’ Lustre offerings with modifications and enhancements over base Lustre releases. In order to supply the best technology, Cray has chosen to integrate upstream feature releases that may not have a community-supported maintenance branch. 
With that, we carefully plan integration of upstream community features. This paper will discuss how the community features will map to Cray releases and how we facilitate them via our participation with the community and OpenSFS. It will also highlight Cray-specific features and enhancements that will allow our Lustre offerings to fully exploit the scale and performance inherent to Cray HPC compute products. In addition, this paper will discuss the upgrade paths from prior Cray Lustre software releases. That discussion will explain the migration process from 1.8.x-based software to Cray’s current 2.x-based offerings, including details of support for ‘legacy’ so-called direct-attached Lustre for XE/XK that had previously been announced as EOL. Cray’s Tiered Adaptive Storage An engineered system for large scale archives, tiered storage and beyond. Cray’s Tiered Adaptive Storage An engineered system for large scale archives, tiered storage and beyond. Craig Flaskerud and Scott Donoho (Cray Inc.), Harriet Coverston (Versity Inc.) and Nathan Schumann (Cray Inc.) A technical overview of Cray® Tiered Adaptive Storage (TAS) and its capabilities, with emphasis on the Cray TAS system architecture, scalability options and manageability. We will also cover specific details of the Cray TAS architecture and configuration options; in addition, the Cray TAS software stack will be covered in detail, including specific innovations that differentiate Cray TAS from other archive systems. We will also characterize primary scalability factors of both the hardware and software layers of Cray TAS, as well as note the ease of Cray TAS integration with Lustre via HSM capabilities available in Lustre 2.5. Clearing the Obstacles to Backup PiB-Size Filesystems Clearing the Obstacles to Backup PiB-Size Filesystems Andy Loftus and Alex Parga (National Center for Supercomputing Applications/University of Illinois) How does computer game design relate to backups? What makes backups of a 2-PiB filesystem so hard? This paper answers these questions by taking a look at the major roadblocks to backing up PiB-size filesystems and offers a software solution to accomplish the task. It turns out that the design of an event management system for computer games is well suited to the task of backups. As for the challenges of scaling a backup system to a 2-PiB filesystem, the solution is to take advantage of all the parallelism that necessarily exists in a system of this size. Learn the details of how this software is built and help decide if this should become an open source project. Tech Paper - Filesystems & I/O Technical Session 12B Track B - B3 Hans-Hermann Frese Enhanced Job Accounting with PBS Works and Cray RUR: Better Access to Better Data Enhanced Job Accounting with PBS Works and Cray RUR: Better Access to Better Data Scott Suchyta (Altair Engineering, Inc.) Assessing how your system is being used can be a daunting task, especially when the data is spread across multiple sources or, even worse, the available data is sparse and unrelated to the true metrics you need to gather. Factor in the business requirements of reporting on user, group, and/or project usage of the system, and you find yourself creating a homegrown solution that you will need to maintain. 
This presentation will cover the installation and configuration details of PBS Analytics within a Cray environment -- including how to obtain useful metrics from Cray’s RUR within PBS Professional, visualize the data via PBS Analytics, and bring to light any issues encountered or opportunities to improve performance within the system. Measuring GPU Usage on Cray XK7 using NVIDIA's NML and Cray's RUR Measuring GPU Usage on Cray XK7 using NVIDIA's NML and Cray's RUR Jim Rogers and Mitchell Griffith (Oak Ridge National Laboratory) ORNL introduced a 27PF Cray XK7 into production in May 2013. This system provides users with 18,688 hybrid compute nodes, where each node couples an AMD 6274 Opteron with an NVIDIA GK110 (Kepler) GPU. Beginning with Cray’s OS version CLE 4.2UP02, new features available in the GK110 device driver, the NVIDIA Management Library, and Cray’s Resource Utilization software provide a mechanism for measuring GPU usage by applications on a per-job basis. By coupling this data with job data from the workload manager, fine-grained analysis of GPU use, by application, is possible. This method will supplement, and eventually supplant, an existing method for identifying GPU-enabled applications that detects, at link time, the libraries required by the resulting binary (ALTD, the Automatic Library Tracking Database). Analysis of the new mechanism for calculating per-application GPU usage is provided as well as results for a range of GPU-enabled application codes. Resource Management Analysis and Accounting Challenges Resource Management Analysis and Accounting Challenges Michael T. Showerman, Jeremy Enos, Mark Klein and Joshi Fullop (National Center for Supercomputing Applications/University of Illinois) Maximizing the return on investment in a large-scale computing resource requires policy that best enables the highest-value workloads. Measuring the impact of a given scheduling policy presents great challenges with a highly variable workload. Defining and measuring the separate components of scheduling and resource management overhead is critical in reaching a valuable conclusion about the effectiveness of the system’s availability for your workload. NCSA has developed tools for collecting and analyzing both user workload and system availability to measure the delivered impact of the Blue Waters resource. This publication presents solutions for displaying the scheduler’s past and present workloads, as well as an accounting of availability and usage for applications at the system and compute-node level. Tech Paper - Systems Technical Session 12C Track C - B2 Matt Allen Using HPC in Planning for Urban/Coastal Sustainability and Resiliency Using HPC in Planning for Urban/Coastal Sustainability and Resiliency Paul C. Muzio, Yauheni Dzedzits and Nikolaos Trikoupis (College of Staten Island/City University of New York) Population growth and the migration and concentration of people into urban areas are having a profound impact on the environment. In turn, climate change and rising sea levels are threatening the viability and sustainability of these large metropolitan areas, which are mainly located in coastal areas. Planning for urban sustainability and urban/coastal resiliency is increasingly dependent on extensive modeling activities using high-performance computing. We discuss specific examples of the use of high-performance computing in the development of “PlaNYC”, a comprehensive 25-year plan for the City of New York. 
PlaNYC was an initiative of former New York City Mayor Michael Bloomberg. The Cray Framework for Hadoop for the Cray XC30 The Cray Framework for Hadoop for the Cray XC30 Howard Pritchard, Jonathan Sparks and Martha Dumler (Cray Inc.) This paper describes the Cray Framework for Hadoop on the Cray XC30. This is a framework for supporting components of the Hadoop ecosystem on XC30 systems managed by widely used batch schedulers.
The paper further describes experiences encountered in running Hadoop workloads over typical Lustre deployments on the Cray XC30. Related work to enable Hadoop to better utilize the XC high-speed interconnect is discussed. HPCHadoop: A framework to run Hadoop on Cray X-series supercomputers HPCHadoop: A framework to run Hadoop on Cray X-series supercomputers Scott Michael, Abhinav Thota and Robert Henschel (Indiana University) The rise of Big Data in research and industry has seen the development of software frameworks to address many challenges, most notably the MapReduce framework. There are many implementations of the MapReduce framework; however, one of the most widely used open source implementations is the Apache Hadoop
framework. In this paper we present the design of the HPCHadoop framework. HPCHadoop is a framework developed at the Indiana University Pervasive Technology Institute, designed to enable users to run Hadoop workloads on HPC systems. The framework is specifically targeted to enable Hadoop workloads on the Cray X-series supercomputers (XE, XK, and XC), but can be used on any supercomputing platform. We also present the results of the Intel HiBench Hadoop benchmark suite for a variety of Hadoop workload sizes and levels of parallelism for several hardware configurations including Big Red II, a Cray XE/XK system, and Quarry, a traditional gigabit-connected cluster. Tech Paper - PE & Applications 3:45pm-5:15pm Technical Session 13A Track A/Plenary - B1 Tina Declerck Using Resource Utilization Reporting to Collect DVS Usage Statistics Using Resource Utilization Reporting to Collect DVS Usage Statistics Tina Butler (National Energy Research Scientific Computing Center) In recent releases of the Cray Linux Environment, a feature called Resource Utilization Reporting (RUR) has been introduced. RUR is designed as an extensible framework for collecting usage and monitoring statistics from compute nodes on a per-application basis. This paper will discuss the installation and configuration of RUR, and the design and implementation of a custom RUR plugin for collecting
DVS client-side statistics on compute nodes. Toward Understanding Congestion Protection Events on Blue Waters Via Visual Analytics Toward Understanding Congestion Protection Events on Blue Waters Via Visual Analytics Robert Sisneros and Kalyana Chadalavada (National Center for Supercomputing Applications/University of Illinois) For a system the scale of Blue Waters it is of primary importance to minimize high-speed network (HSN) congestion. We hypothesize that the ability to analyze the HSN in a system-wide manner will aid in the detection of network traffic patterns, thereby providing a clearer picture of HSN congestion. The benefit of this is obvious: we want to eliminate, or at least minimize, HSN congestion, and we have a better chance of doing so with a more complete understanding. To this end we have developed a visual analytics tool for viewing system-wide traffic patterns. Specifically, we employ a simple representation of Blue Waters’ torus network to visually show congested areas of the network. In this work we will describe the development of this tool and demonstrate its potential uses. I/O performance on Cray XC30 I/O performance on Cray XC30 Zhengji Zhao (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory), Doug Petesch and David Knaak (Cray Inc.) and Tina Declerck (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Edison is NERSC's newest petascale Cray XC30 system. Edison has three Lustre file systems deploying the Cray Sonexion storage systems. During the Edison acceptance test period, we measured the system I/O performance on a dedicated system with the IOR benchmark code from the NERSC-6 benchmark suite. After the system entered production, we observed a significant I/O performance degradation for some tests even on a dedicated system. While some performance change is expected due to file system fragmentation and system software and hardware changes, some of the performance degradation was more than expected. In this paper, we analyze the I/O performance we observed on Edison, focusing on understanding the performance change over time. We will also present what we have done to resolve the major performance issue. Ultimately, we want to detect and monitor the I/O performance issues proactively, to effectively mitigate I/O performance variance on a production system. Tech Paper - Filesystems & I/O Technical Session 13B Track B - B3 Liz Sim Accelerate Insights with Topology, High Throughput and Power Advancements Accelerate Insights with Topology, High Throughput and Power Advancements Wil Wellington and Michael Jackson (Adaptive Computing) Cray and Adaptive Computing power the world’s largest and most robust supercomputers with leading systems at NCSA, ORNL, HLRN, NERSC, NOAA and many more. Adaptive Computing and its Moab scheduling and optimization software help accelerate insights with Big Workflow. Big Workflow is an industry term coined by Adaptive Computing to describe the acceleration of insights through more efficient processing of intense simulations and big data analysis. Adaptive’s Big Workflow solution unifies all available resources, optimizes the analysis process and guarantees services to the business, allowing Cray systems to do what they do best: deliver massive compute resources to tackle today’s challenges. Adaptive Computing will highlight results from new product capabilities. 
These include topology-aware capabilities, which can improve application run-time consistency as well as shorten run-times; new high-throughput capability, which can improve job launch times and job volume; and new advancements in an upcoming release, such as per-application power optimization. Topology-Aware Job Scheduling Strategies for Torus Networks Topology-Aware Job Scheduling Strategies for Torus Networks Jeremy Enos (National Center for Supercomputing Applications), Greg Bauer, Robert Brunner and Sharif Islam (National Center for Supercomputing Applications/University of Illinois), Robert A. Fiedler (Cray Inc.) and Michael Steed and David Jackson (Adaptive Computing) Multiple sites with Cray systems that have a Gemini network in a 3D torus configuration have reported inconsistent application run times as a consequence of task placement and application interference on the torus. In 2013, a collaboration between Adaptive Computing, NCSA (Blue Waters project), and Cray was begun, which includes Adaptive’s plan to incorporate topology awareness into the Moab scheduler product to mitigate this problem. In this paper, we describe the new scheduler features, tests and results that helped shape its design, and enhancements of the Topaware node selection and task placement tool that enable users to best exploit these new capabilities. We also discuss multiple alternative mitigation strategies implemented on Blue Waters that have shown success in improving application performance and consistency. These include predefined optimally-shaped groups of nodes that can be targeted by jobs, and custom modifications of the ALPS node order scheme. TorusVis: A Topology Data Visualization Tool TorusVis: A Topology Data Visualization Tool Omar Padron and David Semeraro (National Center for Supercomputing Applications/University of Illinois) The ever-growing scope of extreme-scale supercomputers requires an increasing volume of component-local metrics to better understand their systemic behavior. The collection and analysis of these metrics have become data-intensive tasks in their own right, the products of which inform system support activities critical to ongoing operations. With recent emphasis being placed on topology-awareness as a step towards better coping with extreme scale, the ability to visualize complex topology data has become increasingly valuable, particularly for the visualization of multidimensional tori. Several independent efforts to produce similar visualizations exist, but they have typically been in-house developments tailor-made for very specific purposes, and are not trivially applicable to visualization needs not featured among those purposes. In contrast, a more general-purpose tool offers benefits that ease understanding of many interrelated aspects of a system's behavior, such as application performance, job node placement, and network traffic patterns. Perhaps more significantly, such a tool can offer analysts insight into the complex topological relationships shared among these considerations: relationships that are often difficult to quantify by any other means. We present TorusVis, a general-purpose visualization tool applicable to a wide variety of topology-related data presentation scenarios. Its general-purpose software architecture lends itself well to rapid prototyping of various data presentation concepts as well as publishing fully featured visualizations. 
We describe several key design elements and implementation strategies, and how they strike a balance between usability, generality, and simplicity. Furthermore, we present use case studies where the capabilities available in TorusVis aided understanding of system behavior in ways not otherwise possible. Tech Paper - Systems Technical Session 13C Track C - B2 Matt Allen Applications of the YarcData Urika in Drug Discovery and Healthcare Applications of the YarcData Urika in Drug Discovery and Healthcare Robert Henschel, David Wild, Ying Ding, Abhik Seal and Jeremy Yang (Indiana University), Bin Chen (Stanford University) and Abhinav Thota and Scott Michael (Indiana University) The Cheminformatics & Chemogenomics Research Group (CCRG) at Indiana University has been working on algorithms and tools for large-scale data mining of drug discovery, chemical and biological data using semantic technologies. The work includes finding new gene targets for drugs, identifying drug/drug interactions and pinpointing the cause of drug side effects. CCRG uses semantic web technologies like RDF triple stores and the SPARQL query language. The YarcData Urika appliance promises to radically speed up this specific type of research by implementing a SPARQL endpoint on specialized hardware. In this paper, we describe how a Urika system could be integrated into the workflow and report on its performance for specific problems. Discovery in Big Data: Success Stories, Best Practices and Analytics Techniques Discovery in Big Data: Success Stories, Best Practices and Analytics Techniques Ramesh Menon and Amar Shan (YarcData Inc.) Discovery, the uncovering of hidden relationships and unknown patterns, has always been a uniquely human endeavor with relatively little automated assistance. The advent of big data has changed that by enabling discovery to be performed “in silico”: • Model high-value customers to increase average service revenues • Discovery of drug re-purposing opportunities • Proactive identification of customers at risk of churn • Identification of non-compliance risks through Mobility Analytics Discovery involves iteration, where each set of questions leads to new questions and the need for new sources of data to answer them. Reasoning about relationships in the data is crucial, as is the ability to pose “fuzzy” questions and leverage a variety of sophisticated analytic techniques: statistical, clustering, clique detection, path analysis, visualization, … This talk will present the above use cases, and discuss analytics techniques and the best practices that have been found to be most effective. High-Level Analytics with R and pbdR on Cray Systems High-Level Analytics with R and pbdR on Cray Systems Pragneshkumar Patel and Drew Schmidt (National Institute for Computational Sciences/University of Tennessee), Wei-Chen Chen (University of Tennessee), George Ostrouchov (Oak Ridge National Laboratory) and Mark Fahey (National Institute for Computational Sciences) In this paper, we present the high-level analytics engine R, as well as the high-performance extension to R called pbdR (Programming with
Big Data in R). We intend to justify the need for high-level analytics in
supercomputing; in particular, we stress the importance of R, and the need
for pbdR. We also discuss build issues that arise with R and pbdR on
supercomputers, notably loading dynamic libraries, accessing the file system,
and executing R programs on Cray machines. We conclude with extensive performance benchmarking using Titan (ORNL), Darter (NICS), Mars (NICS) and Blue Waters (NCSA) HPC resources. Tech Paper - PE & Applications 5:30pm-6:15pm Interactive 14A Track A/Plenary - B1 Jean-Guillaume Piccinali Parallel Debugging OpenACC/2.0 and OpenMP/4.0 Applications on Hybrid Multicore Systems Parallel Debugging OpenACC/2.0 and OpenMP/4.0 Applications on Hybrid Multicore Systems Jean-Guillaume Piccinali (Swiss National Supercomputing Centre) Significant increases in performance of parallel applications have been achieved with hybrid multicore systems (such as GPGPU- and MIC-based systems). In order to improve programmer productivity, directive-based accelerator programming interfaces (such as OpenACC/2.0 and OpenMP/4.0) have been released for incremental development and porting of existing MPI and OpenMP applications. As scientists migrate their applications with these new parallel programming models in mind, they expect a new generation of parallel debugging tools that can seamlessly troubleshoot their algorithms on the current and new architectures. Developers of compilers and debuggers also rely on input from application developers to determine the optimal design for their tools to support the widest range of parallel programming paradigms and accelerated systems. The goal of this BOF is to bring together the community interested in sharing their debugging experiences: * understand the roadmaps of compiler and debug tool developers for these technologies; * share user experience and feedback on the current instances of these technologies and their shortcomings; * discuss and plan for a user group or forum to systematically share experiences, feedback, and regression tests among compiler developers, tool developers, HPC sites and end users. Specifically, the BOF participants will have the opportunity to discuss the following topics: * essential features needed for full debugging support of OpenACC and OpenMP accelerator features, * coordination of software stack upgrades, * requirements for a unified regression suite to help test new debugger releases. Speakers from the following institutions will give brief updates and presentations: CSCS, ORNL, EPCC, ALLINEA, CRAY, PGI. Birds of a Feather Interactive 14B Track B - B3 Sharif Islam Zen and the art of the Cray System Administration Zen and the art of the Cray System Administration Sharif Islam (National Center for Supercomputing Applications) System administrators have a unique and challenging role that requires a comprehensive knowledge of all the different components of the system, beyond just installing and maintaining various pieces of software. System administrators are also the link among users, application developers, storage and network administrators, help desk staff, vendors, and documentation writers, among other constituents using and supporting large, complex systems. This BOF will focus on tips, tools, and tricks that help achieve the zen-like comprehensiveness required of system admins in order to discover and solve problems and maintain a functioning, productive system for the users. In previous CUG sessions we have seen presentations focusing on how to interpret and correlate a large volume of log messages (such as Lustre and HSN logs) along with different tools to process the data. The goal of this BOF is to share that knowledge and to go beyond it, sharing the novel strategies, ideas, challenges, and discoveries that are part and parcel of day-to-day system administration. 
Each Cray site may have its own specific setup and issues, but there are underlying methods and techniques that are generic and worth sharing with CUG. Suggested topics: scheduler policies, Lustre tuning and job performance, diagnosing HSN issues and job failures, hardware and console log analysis, and decoding aprun failure codes. Birds of a Feather Interactive 14C Track C - B2 Mark Fahey Future needs for Understanding User-Level Activity with ALTD Future needs for Understanding User-Level Activity with ALTD Mark Fahey (National Institute for Computational Sciences) Let's talk real supercomputer analytics: drilling down to the level of individual batch submissions, users, and binaries. And we're not just targeting performance: we're after everything from which libraries and/or individual functions are in demand to preventing the problems that get in the way of successful science. This BoF will bring together those with experience and interest in present and future system tools and technologies that can provide this type of job-level insight, and will be the kickoff meeting for a new Special Interest Group (SIG) for those who want to explore this topic more deeply. Dr. Fahey is the author of ALTD, a tool that reports software and library usage at the individual job level, and principal investigator of a newly funded NSF grant to re-envision that infrastructure as XALT. ALTD is currently deployed at numerous major centers across the United States and Europe. Birds of a Feather | Thursday, May 8th 7:30am-8:15am Interactive 15A Track A/Plenary - B1 Jeff Keopp Cray Data Management Platform BOF Cray Data Management Platform BOF Jeff Keopp (Cray Inc.) Customers of existing Cray Data Management Platform systems (CIMS/esMS, CDL/esLogin, CLFS/esFS and Sonexion) will have the opportunity to trade experiences, best practices and techniques with Cray technical personnel in this "Birds of a Feather" session. Birds of a Feather Interactive 15B Track B - B3 Ian Bird Best practices in transitioning Hadoop into production Best practices in transitioning Hadoop into production Ian Bird (YarcData Inc.) Hadoop is rapidly making the transition from the laboratory to production use. As deployments grow in size, the total cost of ownership can grow even faster. The objective of this BOF is to discuss best practices for deploying Hadoop into production while minimizing TCO. Birds of a Feather Interactive 15C Track C - B2 Timothy W. Robinson OpenACC: CUG members' experiences and evolution of the standard OpenACC: CUG members' experiences and evolution of the standard Timothy Robinson (Swiss National Supercomputing Centre) OpenACC is an emerging parallel programming standard designed to simplify the programming of heterogeneous systems, where CPUs are combined with GPUs and/or other accelerator architectures. The API, developed principally by PGI, Cray, NVIDIA and CAPS, follows a directives-based approach to specify loops and/or regions of code for offloading from host to accelerator, providing portability across operating systems, host CPUs and accelerators. The model is particularly attractive to the HPC community because it allows application developers to port existing codes in Fortran, C or C++ without the need for additional programming languages, and without the need to explicitly initiate accelerator startup/shutdown or explicitly manage accelerator memory. OpenACC is currently supported by 18 member organizations, 11 of which are CUG sites and/or Cray partners.
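As an illustration of the directives-based model just described, the following minimal sketch (an editorial example, not drawn from any of the sessions listed here; the array names and sizes are invented) offloads a simple vector update with a single OpenACC directive and lets the compiler generate the accelerator kernel and the host-device data movement:

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        const float a = 2.0f;

        /* Initialise on the host. */
        for (int i = 0; i < N; ++i) {
            x[i] = (float)i;
            y[i] = 1.0f;
        }

        /* One directive marks the loop for offload; the compiler manages
           copying x and y to and from the accelerator memory. */
        #pragma acc parallel loop copyin(x) copy(y)
        for (int i = 0; i < N; ++i)
            y[i] = a * x[i] + y[i];

        printf("y[42] = %f\n", y[42]);
        return 0;
    }

Without the directive the same source still compiles and runs as ordinary C on the host, which is the portability property emphasised above.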
The purpose of this BOF is two-fold: first, it is designed to update the user community on recent developments in the OpenACC specification and its future roadmap, including the relationship between OpenACC and other closely related APIs (particularly OpenMP). Second, it will give OpenACC users an opportunity to describe their current experiences and needs for future releases. Discussions on language and construct-related issues will be led by Cray (in conjunction with its partners), while contributions from application developers will describe the porting of two scientific codes: RAMSES, an AMR code developed in Switzerland for the simulation of galaxy formation, and ICON, a general circulation model developed by the Max Planck Institute for Meteorology and the German Weather Service. Additional contributions will be solicited from CUG member sites. Birds of a Feather 8:30am-10:00am Technical Session 16A Track A/Plenary - B1 Hans-Hermann Frese Systems-level Configuration and Customisation of Hybrid Cray XC30 Systems-level Configuration and Customisation of Hybrid Cray XC30 Nicola Bianchi, Sadaf Alam, Roberto Aielli, Vincenzo Annaloro, Colin McMurtrie, Massimo Benini, Timothy Robinson and Fabio Verzolli (Swiss National Supercomputing Centre) In November 2013 the Swiss National Supercomputing Centre (CSCS) upgraded the 12-cabinet Cray XC30 system, Piz Daint, to 28 cabinets. Dual-socket Intel Xeon nodes were replaced with hybrid nodes containing one Intel Xeon E5-2670 CPU and one Nvidia K20X GPU. The new design resulted in several extensions to the system operating and management environment, in addition to user-driven customisation. These include integration of elements from the Tesla Deployment Kit (TDK) for Node Health Check (NHC) tests and the Nvidia Management Library (NVML). Cray extended the Resource Usage Reporting (RUR) tool to incorporate GPU usage statistics. Likewise, the Power Monitoring Database (PMDB) incorporated GPU power and energy usage data. Furthermore, custom configurations were introduced into the SLURM job scheduling system to support different GPU operating modes. In collaboration with Cray, we assessed the Cluster Compatibility Mode (CCM) with SLURM, which in turn allows for additional GPU usage scenarios that are currently under investigation. Piz Daint is currently the only hybrid XC30 system in production. To support robust operations, we invested in the development of: 1) a holistic regression suite that tests the sanity of various aspects of the system, ranging from the development environment to the system hardware; 2) a methodology for screening the live system for complex transient issues, which are likely to develop at scale. Slurm Native Workload Management on Cray Systems Slurm Native Workload Management on Cray Systems Danny Auble and David Bigagli (SchedMD LLC) Cray's Application Level Placement Scheduler (ALPS) software has recently been refactored to expose low-level network management interfaces in a new library. Slurm is the first workload manager to utilize this new Cray infrastructure to directly manage network resources and launch applications without ALPS. New capabilities provided by Slurm include the ability to execute multiple jobs per node, the ability to execute many applications within a single job allocation (ALPS reservation), greater flexibility in scheduling, and higher throughput without sacrificing the scalability and performance that Cray is famous for. 
This presentation includes a description of the ALPS refactoring, the new Slurm plugins for Cray systems, and the changes in functionality provided by this new architecture. Cori: A Cray XC Pre-Exascale System for NERSC Cori: A Cray XC Pre-Exascale System for NERSC Katie Antypas, Nicholas Wright, Nicholas Cardo and Matthew Cordery (National Energy Research Scientific Computing Center) and Allison Andrews (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) The next supercomputer for the National Energy Research Scientific Computing Center (NERSC) will be a next-generation Cray XC system. The system, named "Cori" after Nobel Laureate Gerty Cori, will bring together technological advances in processors, memory, and storage to enable the solution of the world's most challenging scientific problems. This next-generation Cray XC supercomputer will use Intel's next-generation Intel® Xeon Phi™ processor, code-named "Knights Landing", a self-hosted, manycore processor with on-package high-bandwidth memory that delivers more than 3 teraFLOPS of double-precision peak performance per single-socket node. Scheduled for delivery in mid-2016, the system will deliver 10x the sustained computing capability of NERSC's Hopper system, a Cray XE6 supercomputer. With the excitement of bringing new technology to bear on world-class scientific problems also come the many challenges in application development and system management. Strategies to overcome these challenges are key to successful deployment of this system. Tech Paper - Systems Technical Session 16B Track B - B3 Scott Michael Performance Analysis of Filesystem I/O using HDF5 and ADIOS on a Cray XC30 Performance Analysis of Filesystem I/O using HDF5 and ADIOS on a Cray XC30 Ruonan Wang (ICRAR, UWA), Christopher J. Harris (iVEC, UWA) and Andreas Wicenec (ICRAR, UWA) The Square Kilometer Array telescope will be one of the world's largest scientific instruments, and will provide an unprecedented view of the radio universe. However, to achieve its goals the Square Kilometer Array telescope will need to process massive amounts of data through a number of signal- and image-processing stages. For example, the correlation stage of SKA-Low Phase 1 will produce terabytes of data per second, and significantly more in the second phase. The use of shared filesystems, such as Lustre, between these stages provides the potential to simplify these workflows. This paper investigates writing correlator output to the Lustre filesystem of a Cray XC30 using the HDF5 and ADIOS high-performance I/O APIs. The results compare the performance of the two APIs, and identify key parameter optimisations for the application, the APIs and the Lustre configuration. Fan-In Communication On A Cray Gemini Interconnect Fan-In Communication On A Cray Gemini Interconnect Terry Jones and Bradley Settlemyer (Oak Ridge National Laboratory) Using the Cray Gemini interconnect as our platform, we present a study of an important class of communication operations: the fan-in communication pattern. By its nature, fan-in communications form 'hot spots' that present significant challenges for any interconnect fabric and communication software stack. 
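For readers unfamiliar with the term, the fragment below is a generic, editorial illustration of a fan-in pattern (it is not code from the paper): every rank contributes data that converges on a single root, which is exactly the many-to-one traffic that creates hot spots on the interconnect.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Every rank contributes one value; rank 0 is the fan-in target. */
        double local = (double)rank;
        double sum = 0.0;
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("fan-in of %d contributions, sum = %f\n", size, sum);

        MPI_Finalize();
        return 0;
    }

Client-server fan-in strategies of the kind the study evaluates differ mainly in how this many-to-one traffic is staged (direct sends, intermediate aggregators, and so on); those details are beyond the scope of this sketch.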
Yet despite the inherent challenges, these communication patterns are common in both applications (which often perform reductions and other collective operations that include fan-in communication, such as barriers) and system software (where they assume an important role within parallel file systems and other components requiring high-bandwidth or low-latency I/O). Our study determines the effectiveness of differing client-server fan-in strategies. We describe fan-in performance in terms of aggregate bandwidth in the presence of varying degrees of congestion, as well as several other key attributes. Comparison numbers are presented for the Cray Aries interconnect. Finally, we provide recommended communication strategies based on our findings. HPC's Pivot to Data HPC's Pivot to Data Suzanne Parete-Koon (Oak Ridge National Laboratory), Jason Hick (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Jason Hill (Oak Ridge National Laboratory), Shane Canon (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Blake Caldwell (Oak Ridge National Laboratory), David Skinner (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), Christopher Layton (Oak Ridge National Laboratory), Eli Dart (Lawrence Berkeley National Laboratory) and Jack Wells, Hai Ah Nam, Daniel Pelfrey and Galen Shipman (Oak Ridge National Laboratory) Computer centers such as NERSC and OLCF have traditionally focused on delivering computational capability that enables breakthrough innovation in a wide range of science domains. Accessing that computational power has required services and tools to move the data from input and output to computation and storage. A "pivot to data" is occurring in HPC. Data transfer tools and services that were previously peripheral are becoming integral to scientific workflows. Emerging requirements from high-bandwidth detectors, high-throughput screening techniques, highly concurrent simulations, increased focus on uncertainty quantification, and an emerging open-data policy posture toward published research are among the data drivers shaping the networks, file systems, databases, and overall HPC environment. In this paper we explain the pivot to data in HPC through user requirements and the changing resources provided by HPC, with a particular focus on data movement. For WAN data transfers we present the results of a study of network performance between centers. Tech Paper - Filesystems & I/O Technical Session 16C Track C - B2 Rebecca Hartman-Baker CP2K Performance from Cray XT3 to XC30 CP2K Performance from Cray XT3 to XC30 Iain Bethune and Fiona Reid (EPCC, The University of Edinburgh) and Alfio Lazzaro (Cray Inc.) CP2K is a powerful open-source program for atomistic simulation using a range of methods including classical potentials, Density Functional Theory based on the Gaussian and Plane Waves approach, and post-DFT methods. CP2K has been designed and optimised for large parallel HPC systems, including a mixed-mode MPI/OpenMP parallelisation, as well as CUDA kernels for particular types of calculations. Developed by an open-source collaboration including the University of Zurich, ETH Zurich, EPCC and others, CP2K has been well tested on several generations of Cray supercomputers, beginning with the XT3 in 2006 at CSCS, through the XT4, XT5, XT/XE6, and XK7, to Ivy Bridge- and Sandy Bridge-based XC30 systems in 2014. 
We present a systematic view of benchmark data spanning 9 years and 7 generations of the Cray architecture, and report on recent efforts to carry out comprehensive comparative benchmarking and performance analysis of CP2K on the XE6 and XC30 systems at EPCC. We also describe work to enable CP2K for accelerators, and show performance data from the XK7 and XC30 at CSCS. Optimising Hydrodynamics applications for the Cray XC30 with the application tool suite Optimising Hydrodynamics applications for the Cray XC30 with the application tool suite Wayne P. Gaudin (Atomic Weapons Establishment), Andrew C. Mallinson (University of Warwick), Oliver FJ Perks and John A. Herdman (Atomic Weapons Establishment), John M. Levesque (Cray Inc.), Stephen A. Jarvis (University of Warwick) and Simon McIntosh-Smith (University of Bristol) Due to power constraints, HPC systems continue to increase hardware concurrency. Efficiently scaling applications on future machines will be essential for improved science, and it is recognised that the "flat" MPI model will start to reach its scalability limits. The optimal approach is unknown, necessitating the use of mini-applications to rapidly evaluate new approaches. Reducing MPI task count through the use of shared memory programming models will likely be essential. We examine different strategies for improving the strong-scaling performance of explicit hydrodynamics applications, using the CloverLeaf mini-application at extreme scale across three generations of Cray platforms (XC30, XE6 and XK7). We show the utility of the hybrid approach and document our experiences with OpenMP, OpenACC, CUDA and OpenCL under both the PGI and CCE compilers. We also evaluate Cray Reveal as a tool for automatically hybridising HPC applications, and Cray's MPI rank-to-network-topology mapping tools for improving application performance. Using a Developing MiniApp to Compare Platform Characteristics on Cray Systems Using a Developing MiniApp to Compare Platform Characteristics on Cray Systems Bronson Messer (Oak Ridge National Laboratory) The use of reduced applications that share many of the performance and implementation features of large, fully-featured code bases ("MiniApps") has gained considerable traction in recent years, especially in the context of exascale planning exercises. We have recently developed a MiniApp, dubbed Ziz, designed to serve as a proxy for the CHIMERA code. As an initial foray, we have used the directionally-split hydro version of Ziz to quantify a handful of architectural impacts on Cray XK7 and XC30 platforms and have compared these impacts to results from a new InfiniBand-based cluster at the Oak Ridge Leadership Computing Facility (OLCF). We will describe these initial results, along with some observations about generating useful MiniApps from extant applications and what these artifacts might hope to capture. Tech Paper - PE & Applications 10:30am-12:00pm Technical Session 17A Track A/Plenary - B1 Thomas Leung Expanding Blue Waters with Improved Acceleration Capability Expanding Blue Waters with Improved Acceleration Capability Celso L. Mendes, Gregory H. Bauer and William T. Kramer (National Center for Supercomputing Applications) and Robert A. Fiedler (Cray Inc.) Blue Waters, the first open-science supercomputer to achieve a sustained rate of one petaflop/s on a broad mix of scientific applications, is the largest system ever built by Cray. 
It was originally deployed at NCSA with a configuration of 276 cabinets, containing a mix of XE (CPU) nodes and XK (CPU+GPU) nodes that share the same Gemini interconnection network. As a hybrid system, Blue Waters constitutes an excellent platform for developers of parallel applications who want to explore GPU acceleration. In 2013, Blue Waters was expanded with 12 additional cabinets of XK nodes, increasing the total system peak floating-point performance by 12%. This paper describes the expansion process, our analysis of multiple practical and performance-related issues leading to the final configurations, how the expanded system is being used by science teams, node failure rates, and our latest efforts toward monitoring system components associated with the GPUs. Accelerating Understanding: Data Analytics, Machine Learning, and GPUs Accelerating Understanding: Data Analytics, Machine Learning, and GPUs Steven M. Oberlin (NVIDIA) Amazing new applications and services employing machine learning algorithms to perform advanced analysis of massive streams and collections of structured and unstructured data are becoming quietly indispensable in our daily lives. Machine learning algorithms like deep learning neural networks are not new, but the rise of large-scale applications hosted in massive cloud computing data centers, collecting enormous volumes of data from and about their users, has provided unprecedented training sets and opportunities for machine learning algorithms. Recognizers, classifiers, and recommenders are only a few component capabilities providing valuable new services to users, but the training of extreme-scale learning systems is computationally intense. Fortunately, as in so many other areas of high-performance computing, great economies and speed-ups can be realized through the use of general-purpose GPU accelerators. This talk will explore a few advanced data analytics and machine learning applications, and the benefits and value of GPU acceleration. Unlocking the Full Potential of the Cray XK7 Accelerator Unlocking the Full Potential of the Cray XK7 Accelerator Mark Klein (National Center for Supercomputing Applications/University of Illinois) and John Stone (University of Illinois) The Cray XK7 includes NVIDIA GPUs for acceleration of
computing workloads, but the standard XK7
system software inhibits the GPUs from accelerating
OpenGL and related graphics-specific functions.
We have changed the operating mode of the XK7 GPU firmware,
developed a custom X11 stack, and worked with Cray to acquire an
alternate driver package from NVIDIA in order to allow
users to render and post-process their data directly on Blue Waters.
Users are able to use NVIDIA's
hardware OpenGL implementation, which has many features not
available in software rasterizers. By eliminating the transfer
of data to external visualization clusters, time-to-solution for users
has been improved tremendously. In one case, XK7 OpenGL rendering
has cut turnaround time from a month down to just one day.
We describe our approach for enabling graphics on the XK7,
discuss how the new capabilities are exposed to users, and
highlight their use by science teams. Tech Paper - PE & Applications Technical Session 17B Track B - B3 Nicholas Cardo The Value of Tape and Tiered Adaptive Storage The Value of Tape and Tiered Adaptive Storage Steve Mackey (Spectra Logic) High-performance environments require peak performance from computing equipment, including storage. Spectra's T-Series libraries help push the boundaries of operational objectives, giving cost-effective storage that meets all of your performance, growth, and environmental needs. Spectra's T-Series libraries are preconfigured, integrated, and tested as the archive tier in the Cray Tiered Adaptive Storage (TAS) offering. TAS provides data migration policies enabling transparent data movement across storage tiers to archive data to Spectra T-Series libraries. This easy-to-use, open archiving solution reduces costs and allows Cray High Performance Computing customers to preserve data indefinitely on Spectra T-Series libraries. Using Robinhood to Purge Lustre Filesystems Using Robinhood to Purge Lustre Filesystems Tina Declerck (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) NERSC purges local scratch filesystems to ensure end-user usability and availability, along with filesystem reliability. This is accomplished through quotas, and by destructively purging files that are older than a specified period. This paper will describe in detail how our new purge mechanism was developed and deployed based upon Robinhood's capabilities. The actual purge operation is a separate step to ensure data and metadata are consistent before a destructive purge operation takes place, since the state of the filesystem may have changed during Robinhood's sampling period. Other details of the purge will also be included, such as how long the purge takes, an analysis of the data being purged and its effect on the overall process, as well as work done to improve the time required to purge. Finally, we discuss the issues that we encountered and what was done to resolve them. Site Overview ECMWF Site Overview ECMWF Oliver Treiber (European Centre for Medium-Range Weather Forecasts) The European Centre for Medium-Range Weather Forecasts (www.ecmwf.int) is an intergovernmental organisation supported by 34 states, providing operational medium- and extended-range weather forecasts alongside access to its supercomputing facilities for scientific research. ECMWF is located in Reading, UK. After a competitive procurement in 2012-2013, ECMWF selected a proposal from Cray for a service based on two self-sufficient XC30 Ivy Bridge clusters with about 3500 nodes each, and four Lustre filesystems backed by a total of 100 Sonexion 1600 SSUs. This system will provide ECMWF's sole operational supercomputing facility from the summer of 2014, fully replacing the two currently used IBM Power7 iH clusters with multi-cluster GPFS filesystems. The Cray systems use PBS Pro for workload scheduling. This short CUG new member introduction will give an overview of ECMWF's activities and requirements as an operational weather site, followed by an outline of the Cray system configuration and ECMWF's idiosyncrasies in its use. Tech Paper - Filesystems & I/O Technical Session 17C Track C - B2 Luc Corbeil Cray CS300-LC Cluster Direct Liquid Cooling Architecture Cray CS300-LC Cluster Direct Liquid Cooling Architecture Roger Smith (Mississippi State University) and Giridhar Chukkapalli and Maria McLaughlin (Cray Inc.) 
In this white paper, you will learn about a hybrid, direct liquid cooling architecture developed for the Cray CS300-LC cluster supercomputer. First, we will review the pros and cons of a variety of direct liquid cooling solutions implemented by various competing vendors. Next, we will review the business and technical challenges of the CS300-LC cluster supercomputer, along with best practices and implementation details. The paper will also describe the challenges of adhering to open standards with this architecture, and its TCO.
In collaboration with MSU, we will provide a detailed energy-efficiency analysis of the CS300-LC system. Additional details of the resiliency, remote monitoring, and management of the hybrid cooling system are described, as well as potential future work toward making the CS300-LC close to 100% warm-water cooled. Mississippi State University High Performance Computing Collaboratory - A Brief Overview Mississippi State University High Performance Computing Collaboratory - A Brief Overview Trey Breckenridge (Mississippi State University) The High Performance Computing Collaboratory (HPC²), an evolution of the
MSU NSF Engineering Research Center (ERC) for Computational Field
Simulation, at Mississippi State University is a coalition of member
centers and institutes that share a common core objective of advancing the
state-of-the-art in computational science and engineering using high
performance computing; a common approach to research that embraces a
multi-disciplinary, team-oriented concept; and a commitment to a full
partnership between education, research, and service. The MSU HPC² has a
long and rich history in high performance computing dating back to the mid-1980s, with pioneering efforts in commodity clusters, low-latency interconnects,
grid generation, and the original implementation of MPICH. A Single Pane of Glass: Bright Cluster Manager for Cray A Single Pane of Glass: Bright Cluster Manager for Cray Matthijs van Leeuwen (Bright Computing) Bright Cluster Manager provides comprehensive cluster management for Cray systems in one integrated solution: deployment, provisioning, scheduling, monitoring, and management. Its intuitive GUI provides complete system visibility and ease of use for multiple systems and clusters simultaneously, including automated tasks and intervention. Bright also provides a powerful management shell for those who prefer to manage via a command-line interface. Bright Cluster Manager extends to cover the full range of Cray systems, spanning servers, clusters and mainframes, as well as external servers (large-scale Lustre file systems, login servers, data movers, and pre- and post-processing servers). Cray has also used Bright Cluster Manager to create additional services for its customers. This presentation is an overview of Bright Cluster Manager and its capabilities, with particular emphasis on the value Bright provides to Cray users. Tech Paper - Systems 1:00pm-2:30pm Technical Session 18A Track A/Plenary - B1 Douglas W. Doerfler Large Scale System Monitoring and Analysis on Blue Waters using OVIS Large Scale System Monitoring and Analysis on Blue Waters using OVIS Michael T. Showerman (National Center for Supercomputing Applications/University of Illinois), Jeremy Enos and Joseph Fullop (National Center for Supercomputing Applications), Paul Cassella (Cray Inc.), Nichamon Naksinehaboon, Narate Taerat and Tom Tucker (Open Grid Computing) and Jim M. Brandt, Ann C. Gentile and Benjamin Allan (Sandia National Laboratories) Understanding the complex interplay between applications competing for shared platform resources can be key to maximizing both platform and application performance. At the same time, use of monitoring tools on platforms designed to support extreme-scale applications presents a number of challenges with respect to scaling and impact on applications due to increased noise and jitter. In this paper, we present our approach to high-fidelity, whole-system monitoring of resource utilization, including High Speed Network link data, on NCSA's Cray XE/XK platform Blue Waters, utilizing the OVIS monitoring framework. We then describe architectural implementation details that make this monitoring system suitable for scalable monitoring within the Cray hardware and software environment. Finally, we present our methodologies for measuring impact, and the results. A Diagnostic Utility For Analyzing Periods Of Degraded Job Performance A Diagnostic Utility For Analyzing Periods Of Degraded Job Performance Joshi Fullop and Robert Sisneros (National Center for Supercomputing Applications) In this work we present a framework for identifying possible causes for observed differences in job performance from one run to another. Our approach is to contrast periods of time through the profiling of system log messages. On a large-scale system there are generally multiple, independently reporting subsystems, each capable of producing mountainous streams of events. The ability to sift through these logs and pinpoint events provides a direct benefit in managing HPC resources. This is particularly obvious when applied to diagnosing, understanding, and preventing system conditions that lead to overall performance degradation.
To this end, we have developed a utility with real-time access to the full history of Blue Waters' data, in which event sets from two jobs can be compared side by side. Furthermore, results are normalized and arranged to focus on those events with the greatest statistical divergence, thus separating the wheat from the chaff. A first glance at DWD's new Cray XC30 A first glance at DWD's new Cray XC30 Florian Prill (Deutscher Wetterdienst) In December 2013, the German Meteorological Service (DWD) installed two identical Cray XC30 clusters in Offenbach. The new systems serve as a supercomputing resource for the operational weather service and also provide sufficient capacity for DWD's research purposes. After overall completion of the second phase, the compute clusters will reach a peak performance of roughly 2x549 TF. This presentation summarizes the first experience with this hardware setup and the Cray programming environment from a user's perspective. The main focus will be on the Icosahedral Nonhydrostatic (ICON) model, which is the upcoming weather prediction code at DWD. The ICON model couples the different components of the earth system model, e.g. dynamics, soil, radiation, and ocean, and is a perfect example of the fast-growing demand for memory capacity and processing speed within computational meteorology and geophysics. Tech Paper - Systems Technical Session 18B Track B - B3 Robert M. Whitten "Piz Daint:" Application driven co-design of a supercomputer based on Cray's adaptive system design "Piz Daint:" Application driven co-design of a supercomputer based on Cray's adaptive system design Sadaf R. Alam (Swiss National Supercomputing Centre) and Thomas C. Schulthess (ETH Zurich) "Piz Daint" is a 28-cabinet Cray XC30 supercomputer that has been co-designed along with applications, and is presently the most energy-efficient petascale supercomputer. Starting from selected applications in climate science, geophysics, materials science, astrophysics, and biology that have been designed for distributed memory systems with massively multi-threaded nodes, a rigorous evaluation of node architecture was performed, yielding hybrid CPU-GPU nodes as the optimum. Two applications, the limited-area climate model COSMO and the electronic structure code CP2K, were selected for further co-development with the system. "Piz Daint" was deployed in two phases: first in a 12-cabinet standard multi-core configuration in fall 2012, which allowed testing of the network and development of the applications at scale. While hybrid nodes were being constructed for the second phase, applications as well as necessary extensions to the programming environment were co-developed. We discuss the co-design methodology and present performance results for the selected applications. Performance Portability and OpenACC Performance Portability and OpenACC Doug Miles, Dave Norton and Michael Wolfe (PGI / NVIDIA) Performance portability means a single program gives good performance
across a variety of systems, without modifying the program. OpenACC is
designed to offer performance portability across CPUs with SIMD extensions
and accelerators based on GPU or many-core architectures. Using a sequence
of examples, we explore the aspects of performance portability that are
well-addressed by OpenACC itself and those that require underlying compiler
optimization techniques. We introduce the concepts of forward and backward
performance portability, where the former means legacy codes optimized for
SIMD-capable CPUs can be compiled for optimal execution on accelerators and
the latter means the opposite. The goal of an OpenACC compiler should be
to provide both, and we uncover some interesting opportunities as we explore
the concept of backward performance portability. Transferring User Defined Types in OpenACC Transferring User Defined Types in OpenACC James C. Beyer, David Oehmke and Jeff Sandoval (Cray Inc.) A preeminent problem blocking the adoption of OpenACC by many programmers is support for user-defined types: classes and structures in C/C++ and derived types in Fortran. This problem is particularly challenging for data structures that involve pointer indirection, since transferring these data structures between the disjoint host and accelerator memories found on most modern accelerators requires deep-copy semantics. This paper will look at the mechanisms available in OpenACC 2.0 to allow the programmer to design transfer routines for OpenACC programs. Once these mechanisms have been explored, a new directive-based solution will be presented. Code examples will be used to compare the current state-of-the-art and the new proposed solution. Tech Paper - PE & Applications Technical Session 18C Track C - B2 Frank M. Indiviglio On the Current State of Open MPI on Cray Systems On the Current State of Open MPI on Cray Systems Nathan Hjelm and Samuel Gutierrez (Los Alamos National Laboratory) and Manjunath Venkata (Oak Ridge National Laboratory) Open MPI provides an implementation of the MPI standard supporting
native communication over a range of high-performance network
interfaces. Los Alamos National Laboratory (LANL) and Oak Ridge
National Laboratory (ORNL) collaborated on creating a port for Cray XE
and XK systems. That work has continued, and with the release of
version 1.8, Open MPI now conforms to MPI-2.2 and MPI-3.0 on Cray
XE, XK, and XC systems. The features introduced with this work include
dynamic process support (MPI_Comm_spawn()), important for implementing
fault-tolerant MPI systems; improved collective operations required
for scalability and performance of applications; and Aries support to
enable running Open MPI on Cray XC systems. In this paper, we
present an update on the design and implementation of Open MPI for
Cray systems and evaluate the performance and scaling characteristics
on both Gemini and Aries networks. User-level Power Monitoring and Application Performance on Cray XC30 supercomputers User-level Power Monitoring and Application Performance on Cray XC30 supercomputers Alistair Hart and Harvey Richardson (Cray Inc.), Jens Doleschal, Thomas Ilsche and Mario Bielert (Technische Universität Dresden) and Matthew Kappel (Cray Inc.) In this paper, we show how users can access and display new power measurement hardware counters on Cray XC30 systems (with and without accelerators), either directly or through extended prototypes of the Score-P performance measurement infrastructure and the Vampir application performance monitoring visualiser. This work leverages new power measurement and control features introduced in the Cray XC supercomputer range and targeted at both system administrators and users. We discuss how to use these counters to monitor energy consumption, both for complete jobs and also for application phases. We then use this information to investigate energy-efficient application placement options on Cray XC30 architectures, including mixed use of both CPU and GPU on accelerated nodes and interleaving processes from multiple applications on the same node. A Hybrid MPI/OpenMP 3D FFT for Plane Wave Ab-initio Materials Science Codes A Hybrid MPI/OpenMP 3D FFT for Plane Wave Ab-initio Materials Science Codes Andrew Canning (Lawrence Berkeley National Laboratory) Ab-initio materials science and chemistry codes based on density functional theory and a plane-wave (Fourier) expansion of the electron wavefunctions are the most commonly used approach for electronic structure calculations in materials and nanoscience. This approach has become the largest user of cycles at scientific computer centers around the world through codes such as VASP, Quantum Espresso, Abinit, PEtot, etc. Therefore, as in many other application areas (fluid mechanics, climate research, accelerator design, etc.), efficient, scalable parallel 3D FFTs are required. In this paper we show how our specialized hybrid MPI/OpenMP implementation of the 3D FFT on the Cray XE6 (Hopper) and XC30 (Edison) can significantly outperform and scale better than the pure MPI version, particularly at large core counts, by sending fewer, larger messages. Our 3D FFT has been implemented in the full electronic structure code PEtot, and results for PEtot scaling to tens of thousands of cores on Cray platforms will also be presented. Tech Paper - PE & Applications 3:00pm-4:30pm Technical Session 19A Track A/Plenary - B1 Liz Sim Workload Managers - A Flexible Approach Workload Managers - A Flexible Approach Blaine Ebeling (Cray Inc.) Workload Managers (WLMs) are the main user interfaces for running HPC jobs on Cray systems. The Application Level Placement Scheduler (ALPS) is a resource placement infrastructure provided on Cray systems to support WLMs. Until now, WLMs have interfaced with ALPS through the BASIL protocol for node reservations, and the aprun command (apinit daemon) for launching applications. Over the last several years, the requirement to support more platforms, processor capabilities, dynamic resource management, and new features led Cray to investigate alternative ways to provide more flexible methods for supporting and expanding WLM capabilities and new WLMs. This paper will highlight Cray's plans to expose low-level hardware interfaces by refactoring ALPS to allow 'native' WLM implementations that do not rely on the current ALPS interface mechanism. 
The process for Cray testing, certification and support of WLMs will be included. Partnering with both our vendors and customers is vital to our future direction. Analysis and reporting of Cray service data using the SAFE. Analysis and reporting of Cray service data using the SAFE. Stephen P. Booth (EPCC, The University of Edinburgh) The SAFE (Service Administration from EPCC) is a user services system
developed by EPCC that handles user management and report generation for all
our HPC services including the Cray services.
SAFE is used to administer both the HECToR (Cray XE6) service and its
successor, ARCHER (Cray XC30). An important function of this system is the
ingestion of accounting data into the database and the generation of usage
reports. In this paper we will present an overview of the design and
implementation of this reporting system. Designing Service-Oriented Tools for HPC Account Management and Reporting Designing Service-Oriented Tools for HPC Account Management and Reporting Adam G. Carlyle, Robert D. French and William A. Renaud (Oak Ridge National Laboratory) The User Assistance Group at the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) maintains detailed records and auxiliary data for thousands of HPC user accounts every year. These data are used across every aspect of center operations, e.g., within system administration scripts, written reports, and end-user communications. Record storage tools in use today evolved in an ad-hoc fashion as the center's needs changed; now they are at risk of becoming inflexible and unmaintainable. They also exhibit some scalability issues, both with respect to computational effort and, perhaps more importantly, with respect to staff effort and the triage of new development tasks. The solutions needed to address these issues must be strongly "service-oriented". This paper details recent efforts at NCCS to redesign the center's two primary record-management solutions into service-oriented applications capable of meeting these future challenges of scalability and maintainability. Tech Paper - Systems Technical Session 19B Track B - B3 Iain A. Bethune Performance of the fusion code GYRO on four generations of Cray computers Performance of the fusion code GYRO on four generations of Cray computers Mark Fahey (National Institute for Computational Sciences) GYRO is a code used for the direct numerical simulation of plasma microturbulence. Here we show the comparative performance and scaling on four generations of Cray supercomputers simultaneously, including the newest addition, the Cray XC30. We also show that the recently added hybrid OpenMP/MPI implementation shows a great deal of promise on traditional HPC systems that utilize fast CPUs and proprietary interconnects. Four machines of varying sizes were used in the experiment, all of which are located at the National Institute for Computational Sciences at the University of Tennessee at Knoxville and Oak Ridge National Laboratory. The advantages, limitations, and performance of using each system are discussed, as well as the direction of future optimizations. Time-dependent density-functional theory on massively parallel computers Time-dependent density-functional theory on massively parallel computers Jussi Enkovaara (CSC - IT Center for Science Ltd.) GPAW is a versatile open-source software package for various quantum mechanical simulations utilizing density functional theory (DFT) and time-dependent density functional theory (TD-DFT). GPAW is implemented in a combination of the Python and C programming languages. High-level algorithms are implemented in Python, while numerically intensive kernels are implemented in C or utilize libraries. The parallelization is done with MPI, and MPI calls can be made from both the Python and C parts of the code. The approach enables fast software development, due to the high-level nature of Python, while ensuring good performance from the compiled language. We present here the most important details related to the Python-based implementation. We discuss in detail the aspects related to the parallelization of linear response TD-DFT calculations. We present benchmark results showing good parallel scalability up to tens of thousands of CPU cores, as well as a recent use case in the simulation of optical spectra of gold nanoclusters. 
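Several of the abstracts above (the GYRO paper and the hybrid MPI/OpenMP 3D FFT work, among others) refer to hybrid MPI/OpenMP programming. The skeleton below is a generic, editorial illustration of that model, with OpenMP threads sharing work inside each MPI rank; it is not taken from any of the papers, and the loop is a placeholder computation.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Request thread support so OpenMP threads can coexist with MPI. */
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        /* Threads share the work within a rank; MPI combines across ranks. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; ++i)
            local += 1.0 / (double)(i + 1 + rank);

        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f (max threads per rank: %d)\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

On a Cray system such a binary is typically launched with one rank per NUMA domain or node and OMP_NUM_THREADS set to the remaining cores, although the best decomposition is application- and machine-dependent.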
Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows Toward Improved Support for Loosely Coupled Large Scale Simulation Workflows Swen Boehm, Wael R. Elwasif, Thomas Naughton and Geoffroy Vallee (Oak Ridge National Laboratory) High-performance computing (HPC) workloads are increasingly leveraging loosely coupled large scale simulations [1]. Unfortunately, most large-scale HPC platforms, including Cray/ALPS environments, are designed for the execution of long-running jobs based on coarse-grained launch capabilities (e.g., one MPI rank per core on all allocated compute nodes). This assumption limits capability-class workload campaigns [2] that require large numbers of discrete or loosely coupled simulations, and where time-to-solution is an untenable pacing issue. This paper describes the challenges related to the support of fine-grained launch capabilities that are necessary for the execution of loosely coupled large scale simulations on Cray/ALPS platforms. More precisely, we present the details of an enhanced runtime system to support this use case, and report on initial results from early testing on systems at Oak Ridge National Laboratory. Tech Paper - PE & Applications Technical Session 19C Track C - B2 Zhengji Zhao Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver on Cray's Manycore Architectures Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver on Cray's Manycore Architectures Xiaoye Sherry Li (Lawrence Berkeley National Laboratory) This paper presents the first hybrid MPI+OpenMP+CUDA algorithm and implementation
of a right-looking unsymmetric sparse LU factorization
with static pivoting for scalable heterogeneous architectures.
While BLAS calls can account for more than 40% of the overall factorization time,
the difficulty is that small problem sizes dominate the workload, making efficient
GPU utilization challenging.
This motivates our new algorithmic developments, which are to find ways
to aggregate collections of small BLAS operations into larger ones;
to schedule operations to achieve load balance and hide long-latency operations,
such as PCIe transfer;
and to exploit simultaneously all of a node's available CPU cores and GPUs.
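To make the first of these ideas concrete, the following editorial sketch (not code from the paper) aggregates several small GEMMs that share the same left operand A and a common column count into one larger BLAS call, using the identity A*(B1 | B2 | ... | Bk) = (A*B1 | A*B2 | ... | A*Bk); packing the right-hand operands side by side turns many tiny calls into a single larger, more accelerator-friendly one.

    #include <cblas.h>
    #include <stdlib.h>
    #include <string.h>

    /* C_i += A * B_i for i = 0..k-1, where A is m x kdim (column-major) and
       every B_i is kdim x n.  Instead of k small dgemm calls, pack the B_i
       side by side and issue one call on the concatenated matrix. */
    void aggregated_gemm(int m, int n, int kdim, int k,
                         const double *A,
                         const double *const *B,
                         double *const *C)
    {
        double *Bcat = malloc((size_t)kdim * n * k * sizeof *Bcat);
        double *Ccat = calloc((size_t)m * n * k, sizeof *Ccat);

        /* Column-wise concatenation of the right-hand operands. */
        for (int i = 0; i < k; ++i)
            memcpy(Bcat + (size_t)i * kdim * n, B[i],
                   (size_t)kdim * n * sizeof *Bcat);

        /* One large GEMM instead of k small ones. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n * k, kdim, 1.0, A, m, Bcat, kdim, 0.0, Ccat, m);

        /* Scatter the results back, accumulating into each C_i. */
        for (int i = 0; i < k; ++i)
            for (size_t j = 0; j < (size_t)m * n; ++j)
                C[i][j] += Ccat[(size_t)i * m * n + j];

        free(Bcat);
        free(Ccat);
    }

The scheme described in the paper is necessarily more involved, since it must also schedule the aggregated calls across CPU cores and the GPU and overlap PCIe transfers, but column-wise packing of this kind is the basic aggregation idea.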
We extensively evaluate this implementation to understand the strengths
and limits of our method. Extending the Capabilities of the Cray Programming Environment with CLang-LLVM Framework Integration Extending the Capabilities of the Cray Programming Environment with CLang-LLVM Framework Integration Ugo Varetto, Benjamin Cumming and Sadaf Alam (Swiss National Supercomputing Centre) Recent developments in programming for multi-core processors and accelerators using C++11, OpenCL and Domain Specific Languages (DSL) have prompted us to look into tools that offer compilers and both static and runtime analysis toolchains to complement the Cray Programming Environment capabilities. In this paper we report our preliminary experiences from using the CLang-LLVM framework on a hybrid Cray XC30 to perform tasks such as generating NVIDIA PTX code from C++ and OpenCL in a portable and flexible manner. Specifically we investigate how to overcome some of the limitations currently imposed by the standard tools such as the complete lack of C++11 support in CUDA C and outdated 32 bit versions of OpenCL. We also demonstrate how Clang-LLVM tools, for example, the static analyzer can bring additional capabilities to the Cray environment. Finally we describe how CLang-LLVM integrates with the standard Cray Programming Environment (PE), for instance, Cray MPI, perftools and libraries, and the steps required to properly install such tools on various Cray platforms. Tri-Hybrid Computational Fluid Dynamics on DOE’s Cray XK7, Titan Tri-Hybrid Computational Fluid Dynamics on DOE’s Cray XK7, Titan Aaron Vose (Cray Inc.), Brian Mitchell (General Electric) and John Levesque (Cray Inc.) A tri-hybrid port of General Electric's in-house, 3D, Computational Fluid Dynamics (CFD) code TACOMA is created utilizing MPI, OpenMP, and OpenACC technologies. This new port targets improved performance on NVidia Kepler accelerator GPUs, such as those installed in the world's second largest supercomputer, Titan, the Department of Energy's 27 petaFLOP Cray XK7 located at Oak Ridge National Laboratory. We demonstrate a 1.4x speed improvement on Titan when the GPU accelerators are enabled. We highlight key optimizations and techniques used to achieve these results. These optimizations enable larger and more accurate simulations than were previously possible with TACOMA, which not only improves GE's ability to create higher performing turbomachinery blade rows, but also provides "lessons learned" which can be applied to the process of optimizing other codes to take advantage of tri-hybrid technology with MPI, OpenMP, and OpenACC. Tech Paper - PE & Applications |