CUG 2013 Proceedings | Created 2013-08-06 |
Sunday, May 5th | Monday, May 6th 8:30am-12pm Tutorial 1A Zinfandel / Cabernet Programming Accelerators using OpenACC in the Cray Compilation Environment James C. Beyer (Cray Inc.) This tutorial will introduce the novice accelerator programmer to the OpenACC Application Programming Interface (API) as well as provide the more advanced programmer with ideas for extracting even more performance. The tutorial will start with an introduction to the OpenACC 2.0 specification. The specification will be presented in a user-centric manner intended to teach the novice user how to port code to heterogeneous systems such as the XE6 and XK7. The significance of the execution and memory models will be presented first. Once the groundwork has been laid, the parallel and kernels constructs will be introduced along with how they are inserted into the code. Examples will be used to introduce the rest of the API in situ. Special attention will be given to the new features in the 2.0 specification, covering the benefits and pitfalls. (Assuming all of these features make it into the specification, the following, and possibly more, will be covered.) The concept of unstructured data lifetimes will be discussed and use cases presented. The highly anticipated separate compilation unit ("call") support feature will be explained. The interaction between call support and nested parallelism will be explored, given its impact on the call support feature. Once the API has been covered, hints and tricks for using both the API itself and the Cray Compilation Environment (CCE) will be presented. Tutorial Tutorial 1B Merlot / Syrah System Administration for Cray XE and XK Systems Richard Slick (Cray Inc.) The Cray Linux Environment requires tasks and processes beyond what is required for managing basic Linux systems. This short seminar covers some system administration basics, as well as a collection of tools and procedures to enhance monitoring and logging and efficient command usage. The talk will include new capabilities in logging, Node Health Check, and ALPS. New features in recent releases will also be discussed. The session is geared towards new system administrators, as well as those with more experience. Tutorial Tutorial 1C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Lustre Troubleshooting and Tuning Brett Lee (Intel Corporation) Lustre is an open source, parallel file system that has earned a reputation in the High Performance Computing (HPC) community for its speed and scalability. Lustre, however, has also earned a reputation for being mysterious and thus hard to administer. The purpose of this talk is to pull back the curtain on some of the mystery and provide the fundamental knowledge necessary to administer, troubleshoot and tune a Lustre file system. The topics to be presented in this talk include: 1) Lustre functionality, a significant hurdle in learning to troubleshoot problems, 2) monitoring Lustre, a necessary skill for detecting problems, 3) some of the most commonly seen problems, 4) benchmarking Lustre performance, 5) configuring Lustre components for performance, and 6) tunable parameters available for Lustre. Tutorial 1pm-4:30pm Tutorial 2A Zinfandel / Cabernet Refactoring Applications for the XK7 John Levesque (Cray Inc.) and Jeff Larkin (NVIDIA) This tutorial will cover the process of porting an all-MPI application to the XK7. Numerous paths will be explored, including OpenACC, CUDA Fortran, and CUDA.
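For readers unfamiliar with the directive-based approach featured in these tutorials, the short sketch below (generic C written for this overview, not taken from the tutorial materials) illustrates what an OpenACC port of a simple loop looks like: the directive asks an OpenACC compiler such as CCE to build an accelerator kernel, and the data clauses describe the host-device transfers.

    #include <stdlib.h>

    /* Illustrative only: a simple vector update offloaded with OpenACC.
     * The parallel loop construct generates an accelerator kernel; the
     * copyin/copy clauses describe data movement between host and device. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

On hardware without an accelerator the same code still compiles as ordinary C, which is part of the performance-portability argument made in the refactoring tutorial below.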
Examples during the tutorial will be drawn from the applications that were developed for Titan over the past year. In the process of porting the application, one must first generate a good hybrid version of the application that uses OpenMP on the node and MPI between the nodes. The process of developing the hybrid code frequently ends up improving the overall performance of the application even before using the accelerator. In the process of developing the hybrid version of the application, significant code modifications may be necessary to restructure the application to exhibit high-level parallelism, keeping in mind that the accelerator will need large kernels of computation in order to achieve the best performance. OpenACC has progressed to a viable programming model that will allow the application developer to generate a performance-portable application that will run well on many-core systems including current XK7 and future XC30 systems with Intel MIC or NVIDIA accelerators. This past year the number of OpenACC applications has grown to a point where an excellent foundation of techniques can be given. A wide variety of applications will be presented in the process of explaining the techniques used to develop efficient hybrid applications. Larkin and Levesque are currently writing a book which will contain the examples given in the tutorial. Tutorial Tutorial 2B Merlot / Syrah Configuration and Administration of Cray External Services Systems Jeff Keopp and Harold Longley (Cray Inc.) Cray External Services systems expand the functionality of the Cray XE/XK and Cray XC systems by providing more powerful external login (esLogin) nodes and an external Lustre file system (esFS). A management server (esMS) provides administration and monitoring functions as well as node provisioning and automated Lustre failover for the external Lustre file system. The esMS is available in a single-server or high-availability configuration. A great advantage of these systems is that the external Lustre file system remains available to the external login nodes regardless of the state of the Cray XE/XK or Cray XC system. Configuration and administration of Cray External Services Systems (esMS, esLogin and esFS) will be covered in a tutorial by Cray technical personnel. Topics will include esMS failover, Lustre failover, image management, node provisioning, secure configuration, system monitoring and troubleshooting. Tutorial Tutorial 2C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Debugging Heterogeneous HPC Applications with TotalView Chris Gottbrath (Rogue Wave Software) The new Cray XC series gives users the option of using either accelerators or coprocessors. Regardless of which path is chosen, truly utilizing the full power of Cray systems hosting accelerators and coprocessors, like NVIDIA® Kepler/Fermi or Intel® Xeon® Phi™, means leveraging several different levels of parallelism. In addition, developers need to juggle a variety of different technologies, from MPI and OpenMP to CUDA™, OpenACC, or Intel Language Extensions for Offloading (LEO) on Intel Xeon Phi coprocessors. While troubleshooting and debugging applications are a natural part of any development or porting process, these efforts become even more critical when working with multiple levels of parallelism and a variety of technologies.
This tutorial provides an introduction to key parallel debugging techniques, including MPI and subset debugging, process and thread sets, reverse debugging, comparative debugging, and techniques for CUDA, OpenACC, and Intel Xeon Phi coprocessor debugging. Tutorial 4:45pm-5:45pm Interactive 3A Zinfandel / Cabernet Colin McMurtrie System Support SIG Colin McMurtrie (Swiss National Supercomputing Centre) This is a meeting of the Systems Support Special Interest Group. Birds of a Feather Interactive 3B Merlot / Syrah Helen He Programming Environments, Applications and Documentation SIG Helen He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) This is an interactive session to discuss topics within the Programming Environments, Applications and Documentation Special Interest Group. Birds of a Feather Interactive 3C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Birds of a Feather | Tuesday, May 7th 8:30am-10am General Session 4 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) Nick Cardo CUG Welcome Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group 2013 Welcome Why we need Exascale, and why we won't get there by 2020 Horst D. Simon (Lawrence Berkeley National Laboratory) It may come as a surprise to many who are currently deeply engaged in research and development activities that could lead us to exascale computing that it has already been exactly six years since the first set of community town hall meetings was convened in the U.S. to discuss the challenges for the next level of computing in science. It was in April and May 2007 that three meetings were held in Berkeley, Argonne and Oak Ridge, forming the basis for the first comprehensive look at exascale [1]. What is even more surprising is that in spite of numerous national and international initiatives that have been created in the last five years, the community has not made any significant progress towards reaching the goal of an Exaflops system. If one reflects and looks back at early projections, for example in 2010, it seemed to be possible to build at least a prototype of an exascale computer by 2020. This view was expressed in documents such as [2], [3]. I believe that the lack of progress in the intervening years has made it all but impossible to see a working exaflops system by 2020. Specifically, I do not expect a working Exaflops system to appear on the #1 spot of the TOP500 list with an Rmax performance exceeding 1 Exaflop/s by November 2019. In this talk I will explain why this is a regrettable lack of progress and what the major barriers are. References 1. Simon, H., Zacharia, T., and Stevens, R.: Modeling and Simulation at the Exascale for Energy and Environment, Berkeley, Oak Ridge, Argonne (2007), http://science.energy.gov/ascr/news-and-resources/program-documents/ 2. Stevens, R. and White, A.: Crosscutting Technologies for Computing at Exaflops, San Diego (2009), http://science.energy.gov/ascr/news-and-resources/workshops-and-conferences/grand-challenges/ 3. Shalf, J., Dosanjh, S., Morrison, J.: Exascale Computing Technology Challenges. VECPAR 2010: 1-25. Invited Talk 10:30am-12pm General Session 5 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Hancock Cray Corporate Update Peter Ungaro (Cray Inc.) Cray Corporate Update Cray in Supercomputing Peg Williams (Cray Inc.)
Cray in Supercomputing Invited Talk 1pm-2:30pm Technical Session 6A Zinfandel / Cabernet Tina Butler Cray System Software Road Map Image Management and Provisioning System Overview John Hesterberg (Cray Inc.) This paper provides an overview of the new Image Management and Provisioning System (IMPS) under development at Cray. IMPS is a new set of features that changes how software is installed, managed, provisioned, booted, and configured on Cray systems. It focuses on adopting common industry tools and procedures where possible, combined with scalable Cray technology, to produce an enhanced solution ultimately capable of effectively supporting all Cray systems, from the smallest to the largest. pdf, pdf Paper Technical Session 6B Merlot / Syrah Jason Hill Instrumenting IOR to Diagnose Performance Issues on Lustre File Systems Doug J. Petesch and Mark S. Swan (Cray Inc.) Large Lustre file systems are made of thousands of individual components, all of which have to perform nominally to deliver the designed I/O bandwidth. When the measured performance of a file system does not meet expectations, it is important to identify the slow pieces of such a complex infrastructure quickly. This paper will describe how Cray has instrumented IOR (a popular I/O benchmark program) to automatically generate pictures that show the relative performance of the many OSTs, servers, LNET routers and other components involved. The plots have been used to diagnose many unique problems with Lustre installations at Cray customer sites. pdf, pdf Taking Advantage of Multicore for the Lustre Gemini LND Driver James A. Simmons (Oak Ridge National Laboratory) and John Lewis (Cray Inc.) High performance computing systems have long embraced the move to multi-core processors, but parts of the operating system stack have only recently been optimized for this scenario. Lustre improved its performance on high core-count systems by keeping related work on a common set of cores, though low-level network drivers must be adapted to the new API. The multi-threaded Lustre network driver (LND) for the Cray Gemini high-speed network improved performance over its single-threaded implementation, but did not employ the benefits of the new API. In this paper, we describe the advantages of the new API and the performance gains achieved by modifying the Gemini LND to use it. pdf, pdf A file system utilization metric for I/O characterization Andrew Uselton and Nicholas Wright (Lawrence Berkeley National Laboratory) Today, an HPC platform's "scratch" file system typically represents 10-20% of its cost. However, disk performance is not keeping up with gains in processors, so keeping the same relative I/O performance will require an increasingly large fraction of the budget. It is therefore important to understand the I/O workload of HPC platforms in order to provision the file system correctly. Although it is relatively straightforward to measure the peak bandwidth of a file system, this accounts for only part of the overall load: the size of individual I/O transactions strongly affects performance. In this work we introduce a new metric for file system utilization that accounts for such effects and provides a better view of the overall load on the file system. We present a description of our model, our work to calibrate it, and early results from the file systems at NERSC.
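As a concrete illustration of why transaction size matters for such a metric, the toy model below (plain C with made-up numbers; it is not the model defined in the paper above) charges small transfers a larger share of the file system's peak bandwidth than their byte count alone would suggest.

    #include <stdio.h>

    /* Hypothetical efficiency curve: fraction of peak bandwidth achievable
     * at a given transfer size; the 1 MiB "knee" is an assumption made for
     * this sketch, not a measured value. */
    static double efficiency(double xfer_bytes)
    {
        const double knee = 1024.0 * 1024.0;
        return xfer_bytes / (xfer_bytes + knee);   /* simple saturating curve */
    }

    /* Toy utilization: the fraction of peak the observed traffic effectively
     * consumes once transfer-size overheads are accounted for. */
    static double utilization(double observed_MBps, double peak_MBps, double xfer_bytes)
    {
        return (observed_MBps / efficiency(xfer_bytes)) / peak_MBps;
    }

    int main(void)
    {
        /* 1 GB/s of 64 KiB transfers "occupies" far more of a 35 GB/s file
         * system than 1 GB/s of 8 MiB transfers does. */
        printf("64 KiB transfers: %.2f\n", utilization(1000, 35000, 64.0 * 1024));
        printf("8 MiB transfers:  %.2f\n", utilization(1000, 35000, 8.0 * 1024 * 1024));
        return 0;
    }

Under this toy model the same 1 GB/s of traffic corresponds to roughly 50% utilization when issued as 64 KiB requests but only a few percent when issued as 8 MiB requests, the kind of effect a peak-bandwidth number alone cannot capture.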
pdf, pdf Paper Technical Session 6C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Craig Stewart The Cray Programming Environment: Current Status and Future Directions Luiz DeRose (Cray Inc.) The scale of current and future high-end systems, as well as the increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to achieve high performance on petascale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high-end HPC systems. Users must be supported by intelligent compilers and runtime systems, automatic performance analysis tools, adaptive libraries, and debugging and porting tools. Moreover, this programming environment must be capable of supporting millions of processing elements in a heterogeneous environment. In this talk, I will present the recent activities and future directions of the Cray Programming Environment, which are being developed and deployed according to Cray's adaptive supercomputing strategy to improve users' productivity on Cray supercomputers. Enhancements to the Cray Performance Measurements and Analysis Tools Heidi Poxon (Cray Inc.) The Cray Performance Measurement and Analysis Tools offer performance measurement and analysis feedback for applications running on Cray multi-core and hybrid computing systems. As with any tool, using the Cray performance analysis toolset involves a learning curve. Recent work focuses on a new interface to obtain basic application performance information for users not familiar with the Cray performance tools. CrayPat-lite has been developed to provide performance statistics at the end of a job by simply loading a modulefile. After a program completes execution, output such as job size, wallclock time, MFLOPS, top time-consuming routines, etc. is automatically presented through stdout. Modifications to the "classic" performance tools interface have also been made to unify the two paths so that users who start with CrayPat-lite can easily transition to using CrayPat. This paper presents the CrayPat-lite enhancement to the toolset. Cray Compiling Environment Update Suzanne LaCroix and James Beyer (Cray Inc.) The Cray Compiling Environment (CCE) has evolved over the last several years to support high performance computing needs on Cray systems. New system architectures, new language standards, and ever-increasing performance and scaling requirements have driven this change. This talk will present an overview of current CCE capabilities and recently added features. Future plans and challenges will also be discussed. Paper 3pm-5pm Technical Session 7A Zinfandel / Cabernet Jeff Broughton New Member Talk: iVEC and the Pawsey Centre Charles Schwartz (iVEC) The Pawsey Centre is a supercomputing facility being built in Kensington, Western Australia, to be operated by iVEC, an unincorporated joint venture of four public universities and CSIRO. It is a research facility, specialising in radio-astronomy and geosciences, but available to the larger Australian academic research community as well.
In this talk, we will introduce ourselves as a new CUG member and present: the iVEC mission (who, what, why, where); an overview of the Pawsey Centre in Kensington, WA, Australia; a high-level description of our Cray system; the Cray system deployment schedule, progress, configuration and plans; and issues that have come up (or, depending on schedule, that we expect to arise). The Evolution of Cray Management Services Tara Fly, Alan Mutschelknaus, Andrew Barry and John Navitsky (Cray Inc.) Cray Management Services is quickly evolving to address the changing nature of Cray systems. NodeKARE adds advanced features to support gang scheduling, reservation- and application-level health checking, as well as other serviceability features. Lightweight Log Manager provides more complete and standardized log collection. Modular xtdumpsys will provide an extensible framework for system dumping. Resource utilization reporting provides a scalable, extensible framework for data collection, including power management, GPU utilization, and application resource utilization data. This paper presents these new features, including configuration, migration, and benefits. pdf, pdf CRAY XC30 Installation – A System Level Overview Nicola Bianchi, Colin McMurtrie and Sadaf Alam (Swiss National Supercomputing Centre) In this paper we detail the installation of the 12-cabinet Cray XC30 system at the Swiss National Supercomputing Centre (CSCS). At the time of writing, this is the largest such system worldwide, and hence the system-level challenges of this latest-generation Cray platform will be of interest to other sites. The intent is to present a systems and facilities point of view regarding the Cray XC30 installation and operational setup, and to identify key differences between the Cray XC30 and previous-generation Cray systems such as the Cray XE6. We identify key system configuration options and challenges when integrating the entire machine ecosystem into a complex operational environment: Sonexion 1600 Lustre storage appliance management and tuning, Lustre fine-grained routing, esLogin cluster installation and management using Bright Cluster Manager, IBM GPFS integration, Slurm installation, facility management and network considerations. pdf, pdf Cray External Services Systems Overview Harold Longley and Jeff Keopp (Cray Inc.) Cray External Services systems expand the functionality of the Cray XE/XK and Cray XC systems by providing more powerful external login (esLogin) nodes and an external Lustre file system (esFS). A management server (esMS) provides administration and monitoring functions as well as node provisioning and automated Lustre failover for the external file system. The esMS is available in a single-server or high-availability configuration. A great advantage of these systems is that the external Lustre file system remains available to the external login nodes regardless of the state of the Cray XE/XK or Cray XC system. External login nodes are the standard login nodes on Cray XC systems. This discussion will provide an overview of the Cray External Services system components, installation process and security update process. pdf, pdf Paper Technical Session 7B Merlot / Syrah Andrew Uselton Architecting Resilient Lustre Storage Solution John Fragalla (Xyratex) The concept of scratch HPC storage is quickly becoming less important than high availability (HA) and reliability.
In this presentation, Xyratex discusses architecting a resilient and reliable Lustre storage solution to increase availability and eliminate downtime within HPC environments for continual data access. Xyratex will discuss how solutions based on ClusterStor Technologies address the architectural challenges of HA and reliability without sacrificing performance, protecting against hardware faults, power failures, data loss, and potential software issues through tight integration, test processes, and an integrated Lustre storage platform. The presentation will also cover Xyratex's extensive multi-stage disk drive testing to reduce disk failures and decrease annual failure rates (AFR), the benefits of applying live software patches, updates, and revisions using failover and failback procedures, and the overall Xyratex ClusterStor-based solution, which leverages these concepts in its design. BlueWaters I/O Performance Mark S. Swan and Doug Petesch (Cray Inc.) The BlueWaters system, installed at NCSA, is a landmark achievement not only in computational capability but also in I/O capacity and performance. This paper will describe the I/O infrastructure of BlueWaters, achievements, challenges, and lessons learned. pdf, pdf Sonexion 1600 I/O Performance Nicholas P. Cardo (National Energy Research Scientific Computing Center) The Sonexion 1600 is the latest in Cray's storage products. An investigative look into the I/O performance of the new devices yields insights into the expected performance. Various I/O scenarios are explored by varying the number of readers and writers to files along with differing I/O patterns. These tests explore the performance characteristics of individual OSTs as well as the aggregate for the file system. Metadata performance is also investigated for creates, unlinks and stats. In both cases, metadata and data, the investigation attempts to identify the sustained and peak performance of the Sonexion 1600. The results can then be used to design a file system on the Sonexion 1600 to achieve desired I/O performance. OLCF's 1 TB/s, next-generation Spider file system David Dillow, Sarp Oral, Douglas Fuller, Jason Hill, Dustin Leverman, Sudharshan Vazhkudai, Feiyi Wang, Kim Youngjae, James H. Rogers, James Simmons and Ross G. Miller (Oak Ridge National Laboratory) The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) has a long history of deploying the world's fastest supercomputers to enable open science. At the time it was deployed in 2008, the Spider file system had a formatted capacity of 10 PB and sustained transfer speeds of 240 GB/s, which made it the fastest Lustre file system in the world. However, the addition of Titan, a 27 PFLOPS Cray XK7 system, along with other OLCF computational resources, has radically increased the I/O demand beyond the capabilities of the existing Spider parallel file system. The next-generation Spider Lustre file system is designed to provide 32 PB of capacity to open science users at OLCF, at an aggregate transfer rate of 1 TB/s. This paper details the architecture, design choices, and configuration of the next-generation Spider file system at OLCF. pdf, pdf Paper Technical Session 7C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Sadaf R. Alam Optimizing GPU to GPU Communication on Cray XK7 Jeff M.
Larkin (NVIDIA) When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must consider the effect of the GPU's distinct address space and the PCIe bus on application scalability. Without such considerations, applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper will demonstrate methods for optimizing GPU-to-GPU communication and present XK7 results for these methods. pdf, pdf Debugging and Optimizing Programs Accelerated with Intel® Xeon® Phi™ Coprocessors Chris Gottbrath (Rogue Wave Software) Intel® Xeon® Phi™ coprocessors present an exciting opportunity for Cray users to take advantage of many-core processor technology. Since the Intel Xeon Phi coprocessor shares many architectural features and much of the development tool chain with multi-core Intel Xeon processors, it is generally fairly easy to get a program running on the Intel Xeon Phi coprocessor. However, taking full advantage of the Intel Xeon Phi coprocessor requires expressing a level of parallelism that may require significant re-thinking of algorithms. Scientists need tools that allow them to debug and optimize hybrid MPI/OpenMP parallel applications that may have dozens or even hundreds of threads per node. This talk will discuss how recent improvements to TotalView® and ThreadSpotter™ are setting the stage so that Cray users will be able to adopt the Intel Xeon Phi coprocessor with confidence. pdf, pdf Portable and Productive Performance on Hybrid System with OpenACC Compilers and Tools Luiz DeRose (Cray Inc.) The current trend in the supercomputing industry is to provide hybrid systems with accelerators attached to multi-core processors. Some of the critical hurdles for the widespread adoption of accelerated computing in high performance computing are portability and programmability. In order to facilitate the migration to hybrid systems with accelerators attached to CPUs, users need a simple programming model that is portable across machine types. Moreover, to allow users to maintain a single code base, this programming model, and the required optimization techniques, should not be significantly different for "accelerated" nodes from the approaches used on current multi-core x86 processors. In this talk, I will present Cray's approach to accelerator programming, which is based on a high-level programming environment with tightly coupled OpenACC compilers, libraries, and tools that can interoperate and hide the complexity of the system. Ease of use is possible with the compiler making it feasible for users to write applications in Fortran, C, or C++ with OpenACC directives, and with tools to help users port, debug, and optimize for GPUs as well as conventional multi-core CPUs. In this programming environment, the compiler does the "heavy lifting" to split off the work destined for the accelerator and perform the necessary data transfers. In addition, it does optimizations to take advantage of the accelerator and the multi-core x86 hardware appropriately. A full debugger with integrated support for the CPU and the GPU is available with DDT from Allinea or TotalView from Rogue Wave Software. The Cray Performance Tools provide statistics for the whole application, which can be grouped by accelerator directive or mapped back to the high-level source by line number.
A single performance report can include statistics for both the host and the accelerator, including hardware performance counter information. The Cray Scientific Libraries use the Cray auto-tuning framework to select the best kernel for each task. With this scientific libraries interface, data copies and GPU or host execution placement are automatic. Finally, the Cray Programming Environment for accelerators supports experienced CUDA developers by providing interoperability of the compiler, performance tools, and debugger with existing CUDA codes. Tesla vs Xeon Phi vs Radeon: A Compiler Writer's Perspective Brent Leback, Douglas Miles and Michael Wolfe (The Portland Group) Today, most CPU+Accelerator systems incorporate NVIDIA GPUs. Intel Xeon Phi and the continued evolution of AMD Radeon GPUs make it likely we will soon see, and have to program, a wider variety of CPU+Accelerator systems. PGI already supports NVIDIA GPUs, and is working to add support for Xeon Phi and AMD Radeon. This talk explores the features common to all three types of accelerators, those unique to each, and the implications for programming models and performance portability from a compiler writer's and applications perspective. Paper 5:15pm-6pm Interactive 8A Zinfandel / Cabernet Nick Cardo Open discussion with CUG Board Nick Cardo (National Energy Research Scientific Computing Center) This interactive session is an open discussion with the CUG Board. Topics to be discussed include the CUG program, by-law changes, elections, and methods to broaden participation in CUG throughout the year. Birds of a Feather Interactive 8B Merlot / Syrah Duncan J. Poole OpenACC BOF Duncan Poole (NVIDIA) This BOF will discuss the status of OpenACC as an organization and as a specification. Topics of interest to CUG include the OpenACC 2.0 specification and member activities, including developing new products, benchmarks, example codes, and a profiling interface. Many OpenACC members will be present at CUG, and a lot of progress has been made, so this can be a lively interactive session. Birds of a Feather Interactive 8C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) John Hesterberg System Management Futures John Hesterberg (Cray Inc.) System Management futures. Discuss ideas about what comes next after the Installation, Image Management, Provisioning, and Configuration changes being planned at Cray. What is the right way to do system administration and management for Exascale? What are your best practices in system administration and management for large systems? Birds of a Feather | Wednesday, May 8th 8:30am-10am General Session 9 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) Nick Cardo CUG Business Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group business meeting and elections Big Bang, Big Data, Big Iron – Analyzing Data From The Planck Satellite Mission Julian Borrill (Lawrence Berkeley National Laboratory) On March 21st, 2013, the European Space Agency announced the first cosmology results from its billion-dollar Planck satellite mission. The culmination of 20 years of work, Planck's observations of the Cosmic Microwave Background – the faint echo of the Big Bang itself – provide profound insights into the foundations of cosmology and fundamental physics.
Planck has been making 10,000 observations of the CMB every second since mid-2009, providing a dataset of unprecedented richness and precision; however, the analysis of these data is an equally unprecedented computational challenge. For the last decade we have been developing the high performance computing tools needed to make this analysis tractable, and deploying them on Cray systems at supercomputing centers in the US and Europe. This first Planck data release required tens of millions of CPU-hours on the National Energy Research Scientific Computing Center's Hopper system. This included generating the largest Monte Carlo simulation set ever fielded in support of a CMB experiment, comprising 1,000 realizations of the mission reduced to 250,000 maps of the Planck sky. However, our work is far from done; future Planck data releases will require ten times as many simulations, and next-generation CMB experiments will gather up to a thousand times as much data as Planck. Invited Talk 10:30am-12pm General Session 10 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Hancock Introduction and CUG 2013 Best Paper Award David Hancock (Indiana University) Presentation of the CUG 2013 Best Paper award and introduction for the Intel Executive presentation. The Changing Face of High Performance Computing Rajeeb Hazra (Intel Corporation) The continuing growth of computer performance is delivering an unprecedented capability to solve increasingly complex problems. This growth in performance, along with the recent explosion of new devices, sensors, and social networks delivering real-time feeds over the web and into datacenters, is causing a flood of data and adding a new challenge for systems, software and applications development for organizations that are looking to convert this data into knowledge. For vendors catering to both HPC and big data, this trend of unprecedented data challenges is reinforcing the need for investment in high-end systems with high-performance storage, networks and applications if the potential of HPC is to continue to be realized. These systems, while addressing the needs of HPC, must also tailor the capability to address the requirements of new breeds of applications that emphasize rapid processing of unstructured and structured data that is being fired into corporate datacenters and research facilities at unprecedented rates. The industry has adopted three strategies to mitigate these challenges and increase the performance of systems and applications to address both the HPC and the big data space: parallel applications development; addition of accelerators to standard commodity compute nodes; and development of new purpose-built systems for the high end. In addition, at the high end of HPC, novel technologies are needed in the areas of memory subsystems, parallel system interconnects, and packaging. In this talk, Raj will discuss the current dynamics of the HPC market, how Intel is innovating to address the changing trends, and how the key acquisitions that Intel has made over the last year, along with our collaborations with key partners, will fit together to enable a complete and affordable solution for the entire HPC ecosystem.
Invited Talk 12pm-1pm Interactive 11A/B Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Henty HPC training and education David Henty (EPCC, The University of Edinburgh) Education and training activities have a crucial role in ensuring that the end-users of any HPC infrastructure are able to fully exploit the strengths of existing and future hardware and software resources. In this interactive session, we will discuss the status of HPC education and training activities around the globe, identify existing and potential challenges, and possibly find some solutions to them as well. The session will be prefaced by two short case studies, one about experiences in running an MSc programme in HPC at EPCC, the University of Edinburgh, and another about organizing training activities within the pan-European virtual research infrastructure for HPC, PRACE. The attendees are invited to contribute similar case studies. Birds of a Feather Interactive 11C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Jeff Keopp Cray External Services Systems Jeff Keopp (Cray Inc.) Customers of existing Cray External Services systems (esMS, esLogin and esFS) will have the opportunity to trade experiences, tips, and techniques and to provide feedback to Cray technical personnel in this "Birds of a Feather" session. Birds of a Feather 1pm-1:45pm General Session 12 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) David Hancock 1 on 100 or more Peter Ungaro (Cray Inc.) Open discussion with the Cray CEO. No other Cray employees or Cray partners are permitted during this session. Invited Talk 2pm-3:30pm Technical Session 13A Zinfandel / Cabernet Douglas W. Doerfler SeaStar Unchained: Multiplying the Performance of the Cray SeaStar Network David A. Dillow and Scott Atchley (Oak Ridge National Laboratory) The Cray SeaStar ASIC, with its programmable embedded processor, provides an excellent platform to investigate the properties of various network protocols and programming interfaces. This paper describes our native implementation of the Common Communication Interface (CCI) on the SeaStar platform, and details how we implemented full operating system (OS) bypass for common operations. We demonstrate a 30% to 50% reduction in latency, more than a six-fold increase in message injection rate, and an almost 7x improvement in bandwidth for small message sizes when compared to the generic Cray Portals implementation. pdf, pdf Intel Multicore, Manycore, and Fabric Integrated Parallel Computing Jim Jeffers (Intel Corporation) Dramatic increases in node-level parallelism are here with the introduction of many-core Intel® Xeon Phi™ coprocessors along with the continued generational core increases in multi-core Intel® Xeon® processors. Jim will discuss the impacts on software development for these platforms and the important considerations for scaling highly parallel applications both within the node and across clusters. He will also discuss Intel's current network fabric products and the future directions Intel is pursuing to address the next critical challenge: efficient internode communications for the next generation of HPC platforms. Understanding the Impact of Interconnect Failures on System Operation Matthew A. Ezell (Oak Ridge National Laboratory) Hardware failures are inevitable on large high performance computing systems. Faults or performance degradations in the high-speed network can reduce the entire system's performance.
Since the introduction of the Gemini interconnect, Cray systems have become resilient to many networking faults. These new network reliability and resiliency features have enabled higher uptimes on Cray systems by allowing them to continue running with reduced network performance. Oak Ridge National Laboratory has developed a set of user-level diagnostics that stresses the high-speed network and searches for components that are not performing as expected. Nearest-neighbor bandwidth tests check every network chip and network link in the system. Additionally, performance counters stored in the network ASIC's memory-mapped registers (MMRs) are used to get a fuller picture of the state of the network. Applications have also been characterized under various suboptimal network conditions to better understand what impact network problems have on user codes. pdf, pdf Paper Technical Session 13B Merlot / Syrah Jason Hill The Changing Face of Storage for Exascale Brent Gorda (Intel Corporation) Cray joins Intel (Whamcloud), The HDF Group, EMC and DDN as partners in the US Department of Energy FastForward program, which is aimed at spurring research in key technologies for exascale. This two-year program is mostly research, but does have proof-of-concept (and open source) code delivery attached. As we near the halfway point in the program, we will present the big Exascale picture, progress to date, and the view of the path forward at this point in time. Cray's Implementation of LNET Fine Grained Routing: Overview and Characteristics Mark S. Swan (Cray Inc.) and Nic Henke (Xyratex) As external Lustre file systems become larger and more complicated, configuring the Lustre network transport layer (LNET) can also become more complicated. This paper will focus on where Fine Grained Routing (FGR) came from, why Cray uses FGR, tools Cray has developed to aid in FGR configurations, analysis of FGR schemes, and performance characteristics. pdf, pdf Discovery in Big Data using a Graph Analytics Appliance Amar Shan and Ramesh Menon (Cray Inc.) Discovery, the uncovering of hidden relationships and unknown patterns, lies at the heart of advancing knowledge. Discovery has long been viewed as the province of human intellect, with automation difficult. However, things have to change: the explosion of Big Data has made automating the synthesis of insight from raw data mandatory. Graph analytics is particularly well suited to discovery challenges for a number of reasons. The explicit representation of relationships enables the rapid incremental addition of new sources of data. New relationships can be added as their importance is understood, and automated inferencing can be applied to augment the body of knowledge. And most importantly, the use of the high performance uRiKA graph analytics appliance enables ad hoc, pattern-based queries to be run in real time – enabling joint man-machine discovery. This talk will present some of the use cases for discovery that have emerged over the past year, including person-of-interest identification, cyber-intrusion detection, Medicare fraud analytics, drug discovery using a systems biology approach, operational risk assessment in financial organizations and a variety of others. Paper Technical Session 13C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Helen He Using the Cray Gemini Performance Counters Kevin Pedretti, Courtenay Vaughan, Richard Barrett, Karen Devine and K.
Scott Hemmert (Sandia National Laboratories) This paper describes our experience using the Cray Gemini performance counters to gain insight into the network resources being used by applications. The Gemini chip consists of two network interfaces and a common router core, each providing an extensive set of performance counters. Based on our experience, we have found some of these counters to be more enlightening than others. More importantly, we have performed a set of controlled experiments to better understand what the counters are actually measuring. These experiments led to several surprises, described in this paper. This supplements the documentation provided by Cray and is essential information for anybody wishing to make use of the Gemini performance counters. The MPI library and associated tools that we have developed for gathering Gemini performance counters are described and are available to other Cray users as open-source software. pdf, pdf Performance Measurements of the NERSC Cray Cascade System Harvey J. Wasserman, Nicholas J. Wright, Brian M. Austin and Matthew J. Cordery (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) We present preliminary performance results for NERSC's "Edison" system, one of the first Cray XC30 supercomputers. The primary new feature of the XC30 architecture is the Cray Aries interconnect. We use several network-centric "microbenchmarks" to measure the Aries' substantial improvements in bandwidth, latency, message rate, and scalability. The distinctive contribution of this work consists of performance results for the NERSC Sustained System Performance (SSP) application benchmarks. The SSP benchmarks span a wide range of science domains, algorithms and implementation choices, and provide a more holistic performance metric. We examine the performance and scalability of these benchmarks on the XC30 and compare performance with other state-of-the-art HPC platforms. Edison nodes are composed of two 8-core Intel "Sandy Bridge" processors, with two hyperthreads per core. With 32 hardware threads per node, multi-threading is essential for optimal performance. We report the OpenMP, core-specialization and hyperthreading settings that maximize SSP on the XC30. pdf, pdf From thousands to millions: visual and system scalability for debugging and profiling Mark O'Connor, David Lecomber, Ian Lumb and Jonathan Byrd (Allinea Software) Behind the achievements of double-digit Petaflop counts, million-core systems, and sustained Petaflop real-world applications, software tools have been the silent unsung heroes. Whether aiding software migration or solving a critical acceptance bug, tools such as Allinea DDT have been ready. We will explore how Allinea DDT has been prepared for today's hybrid Cray XK7s and the Cray XC30s. We will also detail the high performance achieved when debugging on the NCSA BlueWaters and ORNL Titan Cray systems at full scale. Looking ahead, with the Intel Xeon Phi entering the market, changes are being made to enable developers and researchers to run their applications successfully. We are able to show how two software tools, Allinea DDT and Allinea MAP, are changing the HPC software development environment, completing the lifecycle of HPC software with one common tool suite enabling both debugging and application profiling.
Paper 3:45pm-5:15pm Technical Session 14A Zinfandel / Cabernet Ashley Barker Investigating Topology Aware Scheduling David Jackson (Adaptive Computing) For many years, HPC applications have been able to assume good network support for all-to-all communications, meaning that no matter how workloads were placed across the network, the application would experience maximum performance. While all networks have some limitations associated with their underlying hardware and topology, the difference between the best possible allocation and the worst possible was often small enough to be in the realm of statistical noise, and thus any associated issues were generally ignored. Now, as systems and workloads grow into the petascale and exascale range, the communication within an application becomes massive and the difference between best-case and worst-case allocations becomes significant. The differences between one placement decision and another can now noticeably impact application efficiency, job run-time consistency, and even neighboring workloads. The integration of workload management solutions with network management infrastructure is the natural follow-on to this issue and allows the job scheduler to be aware of the configuration, topology, strengths, and limitations of a given network. With this knowledge, and with properly optimized placement algorithms, the scheduler can efficiently place workloads in a network-aware manner. Intelligent placement algorithms can help maximize task proximity, minimize bottlenecks, and significantly improve job performance, run-time consistency, and overall system throughput. In collaboration with Cray, NCSA, and other major Cray sites, Adaptive Computing has begun an ambitious research project to model Cray's Gemini 3D torus network and enable a highly advanced topology-aware scheduling algorithm. This research has matured beyond initial prototypes and has begun evaluating various approaches against actual workloads. This talk will discuss the problem space, the general approaches and considerations, and the benefits seen to date when tested against these real-world workloads. External Torque / Moab and Fairshare on the Cray XC30 Tina Declerck and Iwona Sakrejda (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) NERSC's new Cray XC30, Edison, utilizes a new capability in Adaptive Computing's Torque 4.x and Moab 7.x products which allows the Torque server and Moab to execute external to the mainframe. This configuration offloads the mainframe server database and provides a unified view of the workload. Additionally, it allows job submissions when the mainframe is unavailable/offline. This paper discusses the configuration process, differences between the old and new methods, troubleshooting techniques, fairshare experiences, and user feedback. While this capability addresses some of the needs of the NERSC community, it is not without tradeoffs and challenges. pdf, pdf Production Experiences with the Cray-Enabled TORQUE Resource Manager Matthew A. Ezell and Don Maxwell (Oak Ridge National Laboratory) and David Beer (Adaptive Computing) High performance computing resources utilize batch systems to manage the user workload. Cray systems differ from typical clusters due to Cray's Application Level Placement Scheduler (ALPS). ALPS manages binary transfer, job launch and monitoring, and error handling. Batch systems require special support to integrate with ALPS using an XML protocol called BASIL.
Previous versions of Adaptive Computing's TORQUE and Moab batch suite integrated with ALPS from within Moab, using Perl scripts to interface with BASIL. This would occasionally lead to problems when all the components became unsynchronized. Version 4.1 of the TORQUE Resource Manager introduced new features that allow it to integrate directly with ALPS using BASIL. This paper describes production experiences at Oak Ridge National Lab using the new TORQUE software versions. pdf, pdf Paper Technical Session 14B Merlot / Syrah Steve Simms Evaluation of A Flash Storage Filesystem on the Cray XE-6 Jay Srinivasan and Shane Canon (Lawrence Berkeley National Laboratory) This paper will discuss some of the approaches and show early results for a Flash file system mounted on a Cray XE-6 using high-performance PCIe-based cards. We also discuss some of the gaps and challenges in integrating flash into HPC systems and potential mitigations, as well as new solid-state storage technologies and their likely role in the future. pdf, pdf Analysis of the Blue Waters File System Architecture for Application I/O Performance Kalyana Chadalavada and Robert Sisneros (National Center for Supercomputing Applications, University of Illinois) The NCSA Blue Waters features one of the fastest file systems for scientific applications. Using the Lustre file system technology, Blue Waters provides over 1 TB/s of usable storage bandwidth. The underlying storage units are connected to the compute nodes in a unique fashion. The Blue Waters file system connects a subset of storage units to the high-speed torus network at distinct points. Utilizing standard benchmarks and scientific applications, we examine the impact of this architecture on application I/O performance. Given the size of the system and its intended applications, scaling I/O performance will be a challenge. Identifying the optimal I/O methodology can help alleviate a large number of application performance issues. All exercises are done in a production environment to ensure that beneficial results are directly applicable to Blue Waters users. pdf, pdf Trillion Particles, 120,000 cores, and 350 TBs: Lessons Learned from a Hero I/O Run on Hopper Suren Byna and Andrew Uselton (Lawrence Berkeley National Laboratory), Prabhat (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center), David Knaak (Cray Inc.) and Yun (Helen) He (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Modern petascale applications can present a variety of configuration, runtime, and data management challenges when run at scale. In this paper, we describe our experiences in running a large-scale plasma physics simulation, called VPIC, on the NERSC Hopper Cray XE6 system. The simulation ran on 120,000 cores using ~80% of the computing resources, 90% of the available memory on each node, and 50% of a Lustre file system. Over two trillion particles were simulated for 23,000 timesteps, and 10 one-trillion-particle dumps, each ranging between 30 and 42 TB, were written to HDF5 files at a sustained rate of ~27 GB/s. To the best of our knowledge, this job represents the largest I/O undertaken by a NERSC application and the largest collective writes to single HDF5 files. We outline several obstacles that we overcame in the process of completing this run, and list lessons learned that are of potential interest to HPC practitioners. We will elaborate on the following insights in the paper: 1.
Collective writes to a single shared HDF5 file can work as well as file-per-process writes. We demonstrate that collective writes from 20,000 MPI processes to a single, shared ~40 TB HDF5 file using collective buffering can achieve a sustained performance of 27 GB/s on a Lustre file system. The peak performance of the system is ~35 GB/s, which is achieved by our code for a substantial fraction of the runtime. This outperformed the strategy where each process wrote a separate file, i.e., a total of 20,000 files, which achieved 24 GB/s. 2. Advance verification of file system hardware is important for obtaining peak performance. Our initial execution of VPIC achieved only 65% of Lustre peak performance. With the use of the Lustre Monitoring Toolkit (LMT), we pinpointed the problem to a small set of slow OSTs, which were exhibiting degraded performance. We temporarily excluded these OSTs from our tests, and were able to demonstrate ~80% of the peak I/O rates. Advance verification for slow OSTs can avoid performance pitfalls. 3. Advance verification of available resources for memory-intensive applications is important. Since the simulation requires 90% of the memory on each node, it was necessary to verify that each node reserved for executing this simulation had at least that much available memory. Unreleased memory from previous applications could cause out-of-memory errors. We will also discuss tuning multiple layers of the parallel I/O subsystem and emphasize the need for scalable tools for diagnosing software and hardware problems. pdf, pdf Paper Technical Session 14C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Helen He Performance Comparison of Scientific Applications on Cray Architectures Haihang You, Reuben D. Budiardja, Jeremy Logan, Lonnie D. Crosby, Vincent Betro, Pragneshkumar Patel, Bilel Hadri and Mark Fahey (National Institute for Computational Sciences) Current HPC architectures are changing drastically and rapidly, while mature scientific applications usually evolve at a much slower rate. New architectures almost certainly impact the performance of these heavily used scientific applications. Therefore, it is prudent to understand how the supposed performance benefits and improvements of new architectures translate to the applications. In this paper, we attempt to quantify the differences between theoretical performance improvements (due to changes in architecture) and "real-world" improvements in applications by gathering performance data for selected applications from the fields of chemistry, climate, weather, materials science, fusion, and astrophysics running on three different Cray architectures: XT5, XE6, and XC30. The performance evaluations of these selected applications on these three architectures may give users perspective on the potential benefits of each architecture. These evaluations are done by comparing the improvements of numerical (micro)-benchmarks to the improvements of the selected applications when run on these architectures. First 12-cabinet Cray XC30 System at CSCS: Scaling and Performance Efficiencies of Applications Sadaf Alam, Themis Athanassiadou, Tim Robinson, Gilles Fourestey, Andreas Jocksch, Luca Marsella, Jean-Guillaume Piccinali and Jeff Poznanovic (Swiss National Supercomputing Centre) CSCS has recently deployed one of the largest Cray XC30 systems, which is composed of 6 groups or 12 cabinets of dual-socket Intel Sandy Bridge nodes, and the new Aries network chips with a dragonfly topology.
With respect to earlier Cray XT and XE series platforms, the Cray XC30 has several unique features that have the potential to affect application performance: (1) Intel Xeon- vs. AMD Opteron-based nodes; (2) Aries vs. Gemini network and router chip; (3) PCIe vs. HyperTransport interface to the network chip; (4) dragonfly vs. 3D torus topology; (5) mixed optical and copper vs. all-copper cables; (6) number of compute nodes per communication NIC; (7) Hyper-Threading-enabled nodes; and (8) compute cabinet layouts. In this report, we compare scaling and performance efficiencies of a range of applications on the CSCS Cray XC30 and Cray XE6 platforms. pdf, pdf Effects of Hyper-Threading on the NERSC workload on Edison Zhengji Zhao, Nicholas J. Wright and Katie Antypas (Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center) Edison, a Cray XC30, is NERSC's newest petascale supercomputer. Along with the Aries interconnect, Hyper-Threading (HT) is one of the new technologies available on the system. HT provides simultaneous multithreading capability on each core, with two hardware threads available. In this paper, we analyze the potential benefits of HT for the NERSC workload by investigating the performance implications of HT on a few selected applications among the top 15 codes at NERSC, which represent more than 60% of the workload. By connecting the observed HT results with more detailed profiling data, we discuss whether it is possible to predict how and when users should utilize HT in their production computations on Edison. pdf, pdf Paper | Thursday, May 9th 8:30am-10am Technical Session 15A Zinfandel / Cabernet Craig Stewart Preparing Slurm for use on the Cray XC30 Stephen Trofinoff and Colin McMurtrie (Swiss National Supercomputing Centre) In this paper we describe the technical details associated with the preparation of Slurm for use on the 12-cabinet XC30 system installed at the Swiss National Supercomputing Centre (CSCS). The system comprises internal and external login nodes and a new ALPS/BASIL version, so a number of technical challenges needed to be overcome in order to have Slurm working on the system. Thanks to a Cray-supplied emulator of the system interface, work was possible ahead of delivery, and this eased the installation when the system arrived. However, some problems were encountered, and their identification and resolution are described in detail. We also provide details of the work done to improve the Slurm task affinity bindings on a general-purpose Linux cluster so that they, as closely as possible, match the Cray bindings, thereby providing our users with some degree of consistency in application behaviour between these systems. pdf, pdf Lessons From 20 Continuous Years of Cray/HPC Systems Liam Forbes, Don Bahls, Gene McGill, Oralee Nudson and Gregory Newby (Arctic Region Supercomputing Center, UAF) The Arctic Region Supercomputing Center (ARSC) was founded in 1992/1993 with a Cray Y-MP (denali) and since then has operated or owned at least one Cray system, including most recently a Cray XK6m-200 (fish). For 20 years, ARSC has shared high performance computing (HPC) experiences, users, and problems with other university HPC centers, DoD HPC centers, and DoE HPC centers.
In this paper, we will document and present the user support and system administration lessons we have learned from the perspective of a smaller, regional University HPC center operating and supporting the same architectures as some of the largest systems in the world over that time. Comparisons to experiences with HPC hardware and software products from other vendors will be used to illustrate some of the points. pdf, pdf Cray Workload Management with PBS Professional 12.0 Scott Suchyta and Sam Goosen (Altair Engineering, Inc.) Changing requirements, trends and technologies in HPC computing are frequent, and workload managers like PBS Professional must continually evolve to accommodate them. One challenge sites have faced has been configuring PBS to address their individual requirements. Site-defined custom resources and configurable scheduling policies were introduced to help accomplish this, but are insufficient to address more complex scenarios. A more robust infrastructure is required to manage the dynamic resources and policies that are unique to modern HPC sites. Our discussion will include examples that customers may wish to adopt or customize to address their specific needs, including admission control, allocation management, and on-the-fly tuning. Independent of plugins, PBS Professional supports the multithreaded processors available on current Cray platforms. Additional enhancements will become available when integration with BASIL version 1.3 is complete. In the interim, details about configuring these systems for use with PBS Professional 12.0 will be presented. Paper Technical Session 15B Merlot / Syrah Robert Henschel Introduction to HSA Hardware, Software and HSAIL with an HPC Usage Example Vinod Tipparaju (AMD, Inc.) Heterogeneous systems have been around for several years, and accelerator-based heterogeneous systems (CPU-GPU) have become popular in the last five years. In particular, accelerating general-purpose computation using GPUs is gaining momentum in both academic research and industry. OpenCL and CUDA are the two most popular programming models that enable end-application programmers to take advantage of the GPGPU through the compiler, runtime, and driver tool chain. While the opportunity of GPGPU has been opened up to expert programmers, it has not yet reached a broad audience, primarily for the following reasons: (i) the CPU-GPU system has a distributed, asymmetric memory that needs to be explicitly managed for coherency and synchronization; (ii) two-way, high-latency memory copies and kernel dispatch; and (iii) lack of support for dynamic scheduling or load balancing, advanced debugging, system calls, exception handling, etc. The Heterogeneous System Architecture (HSA) is a new set of architectural features (to be standardized) to efficiently support a wide range of data-parallel and task-parallel programming models. HSA architectural features include: Unified Virtual Address Space, Architected User Mode Queuing, Fully-Coherent Memory Model, Architected Queuing Language (AQL), and several others. Thus, the overarching goal of HSA is to bring GPGPU to the masses by drastically improving the productivity, performance and energy-efficiency of the applications that may want to take advantage of GPU acceleration. HSA-enabled processors come with an associated software ecosystem to expose the architectural features, which includes the HSA Driver, the HSA Runtime, and the HSA Intermediate Language (HSAIL).
Specifically, the HSA Runtime exposes Coherent Memory, Architected User-Mode Queues, and architected low-latency dispatch through low-level APIs. These APIs are designed (and standardized) to be generic, and can be consumed by several high-level runtimes, programming models and languages (OpenCL, C++ AMP, Java, OpenMP, etc.). HSAIL is an abstract virtual machine language for HSA components, which will be standardized. Thus, each vendor of an HSA component will comply with the standard set of architectural features, provide a core runtime implementation (adhering to the standard), and a finalizer component that translates HSAIL into its vendor-specific ISA. Overall, using the new architectural features of HSA and its software ecosystem, it is possible to support several high-level programming models and languages and, at the same time, influence them to improve programmability, thereby bringing heterogeneous computing to the masses. Reliable Computation Using Unpredictable Components Joel O. Stevenson, Robert A. Ballance, Suzanne M. Kelly, John P. Noe and Jon R. Stearley (Sandia National Laboratories) and Michael E. Davis (Cray Inc.) Based on our experiences over the last year running large simulations on the DOE/ASC platform Cielo, we will discuss strategies that enable large, long-running simulations to make predictable progress despite platform component failures. From an application perspective, complex systems like Cielo have multiple sources of interrupts and slowdowns that combine to make the system appear unpredictable. We will discuss the component failures observed and identify those where application recovery has been possible. Users and application developers assist in mitigating stability issues. One approach is to employ scripting mechanisms that trap and identify failures and recover where possible (a minimal sketch of such a wrapper follows this session's abstracts). An important aspect of running simulations, data and metadata management, will be stressed. Simulations often require multiple restarts before completion. Strategies for maximizing application availability and for integrating huge volumes of data, in the context of still-evolving robust file systems, will be discussed. Specific recommendations for improving workflow will be provided. pdf, pdf Requirements Analysis for Adaptive Supercomputing using the Cray XK7 as a Case Study Sadaf R. Alam, Mauro Bianco, Ben Cumming, Gilles Fourestey, Jeffrey Poznanovic and Ugo Varetto (Swiss National Supercomputing Centre) In this report, we analyze the readiness of the code development and execution environment for adaptive supercomputers, where a processing node is composed of heterogeneous computing and memory architectures. Current instances of such a system are Cray XK6 and XK7 compute nodes, which are composed of x86_64 CPU and NVIDIA GPU devices and DDR3 and GDDR5 memories, respectively. Specifically, we focus on the integration of the CPU and accelerator programming environments, tools, MPI, and numerical libraries, as well as operational features such as resource monitoring, system maintainability and upgradability. We highlight portable, platform-independent technologies that exist for the Cray XE, XK, and XC30 platforms and discuss dependencies in the CPU, GPU and network tool chains that lead to current challenges for integrated solutions. This discussion enables us to formulate requirements for a future, adaptive supercomputing platform, which could contain a diverse set of node architectures.
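The Sandia abstract above mentions scripting mechanisms that trap failures and restart long-running simulations, but the proceedings listing does not show them. The following is a minimal, hypothetical sketch of such a restart wrapper, assuming an application (here called simulate) that exits non-zero on failure and can resume from its newest checkpoint via a --restart flag; the aprun launch line, checkpoint naming, and flags are all illustrative and not Cielo-specific.

#!/usr/bin/env python
"""Minimal restart wrapper: rerun a simulation from its latest checkpoint
after a failure, up to a fixed number of attempts. Illustrative only."""
import glob
import os
import subprocess
import sys

MAX_ATTEMPTS = 5                            # give up after this many failed attempts
CHECKPOINT_GLOB = "checkpoints/chkpt_*.h5"  # hypothetical checkpoint file naming

def latest_checkpoint():
    """Return the most recently written checkpoint file, or None."""
    files = glob.glob(CHECKPOINT_GLOB)
    return max(files, key=os.path.getmtime) if files else None

def run_once():
    """Launch one attempt, resuming from a checkpoint when one exists."""
    cmd = ["aprun", "-n", "1024", "./simulate"]   # placeholder launch line
    chkpt = latest_checkpoint()
    if chkpt:
        cmd += ["--restart", chkpt]
    return subprocess.call(cmd)

for attempt in range(1, MAX_ATTEMPTS + 1):
    rc = run_once()
    if rc == 0:
        sys.exit(0)                     # clean completion
    print("attempt %d failed with exit code %d; retrying" % (attempt, rc))
sys.exit(1)                             # persistent failure: surface it to the batch system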
pdf, pdf Paper Technical Session 15C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Douglas W. Doerfler Improving the Performance of the PSDNS Pseudo-Spectral Turbulence Application on Blue Waters using Coarray Fortran and Task Placement Robert A. Fiedler, Nathan Wichmann and Stephen Whalen (Cray Inc.) and Dmitry Pekurovsky (San Diego Supercomputer Center) The PSDNS turbulence application performs many 3D FFTs per time step, which entail frequently transposing distributed 3D arrays. These transposes are achieved via multiple concurrent All-to-All communication operations, which dominate the overall execution time at large scales. We improve the All-to-All times for benchmarks on 3072 to 12288 nodes using three main strategies: 1) eliminating off-node communication for one of the two sets of transposes by assigning one sheet of the 3D Cartesian grid to each node (35% speedup), 2) placing tasks on nodes that are distributed randomly throughout the Gemini network in order to maximize the All-to-All bandwidth that can be utilized by the job's nodes (21% speedup), and 3) reducing contention and overhead by replacing calls to MPI_Alltoall with a drop-in library written in Coarray Fortran (33% speedup). We also describe how this library is implemented and integrated efficiently in PSDNS. pdf, pdf A Review of the Challenges and Results of Refactoring the Community Climate Code COSMO for Hybrid Cray HPC Systems Benjamin Cumming (Swiss National Supercomputing Centre), Carlos Osuna (Center For Climate Systems Modeling ETHZ), Tobias Gysi (Supercomputing Systems AG), Mauro Bianco (Swiss National Supercomputing Centre), Xavier Lapillonne and Oliver Fuhrer (Federal Office of Meteorology and Climatology MeteoSwiss) and Thomas C. Schulthess (ETH Zurich) We summarize the results of porting the numerical weather simulation code COSMO to different hybrid Cray HPC systems. COSMO was written in Fortran with MPI, and the aim of the refactoring was to support both many-core systems and GPU-accelerated systems with minimal disruption to the user community. With this in mind, different approaches were taken to refactor the different components of the code: the dynamical core was refactored with a C++-based domain-specific language for structured grids, which provides both CUDA and OpenMP back ends; and the physical parameterizations were refactored by adding OpenACC and OpenMP directives to the original Fortran code. This report gives a detailed description of the challenges presented by such a large refactoring effort using different languages on Cray systems, along with performance results on three different Cray systems at CSCS: Rosa (XE6), Todi (XK7) and Daint (XC30). pdf, pdf CloverLeaf: Preparing Hydrodynamics Codes for Exascale Andrew C. Mallinson and David A. Beckingsale (University of Warwick), Wayne P. Gaudin and John A. Herdman (Atomic Weapons Establishment), John M. Levesque (Cray Inc.) and Stephen A. Jarvis (University of Warwick) In this work we directly evaluate five candidate programming models for future exascale applications (MPI, MPI+OpenMP, MPI+OpenACC, MPI+CUDA and CAF) using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE.
Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scale required to reach exascale levels of computation on future system architectures. We present our results and experiences of scaling these different approaches to high node counts on existing large-scale Cray systems (Titan and HECToR). We also examine the effect that improving the mapping between process layout and the underlying machine interconnect topology can have on performance and scalability, as well as highlighting several communication-focused optimisations. pdf, pdf Paper 10:30am-12pm Technical Session 16A Zinfandel / Cabernet John Noe Methods and Results for Measuring Kepler Utilization on a Cray XK7 Jim Rogers (Oak Ridge National Laboratory), Roger Green (NVIDIA) and Kevin Peterson (Cray Inc.) NVIDIA is providing an API as part of its official CUDA 5.5 release branch (R319) that Cray can use to provide specific and inherent utilization information from the Kepler GPU. NVIDIA and Cray will provide this capability as part of a feature release once the release cadence for both the NVIDIA driver and the Cray software is complete. The intent of the talk is to provide an early description of the driver changes, the API, the Cray interface, and some examples against the Titan workload using a pre-release version of both the NVIDIA driver/API and the Cray accounting software. Resource Utilization Reporting on Cray Systems Andrew P. Barry (Cray Inc.) Many Cray customers want to evaluate how their systems are being used, across a variety of metrics. Neither previous Cray accounting tools nor commercial server management software allow the collection of all the desirable statistics with minimal performance impact. Resource Utilization Reporting (RUR) is being developed by Cray to collect statistics on how systems are used. RUR provides a reliable, high-performance framework into which plugins may be inserted; each plugin collects data about the usage of a particular resource. RUR is configurable, extensible, and lightweight. Cray will supply plugins to support several sets of collected data, which will be useful to a wide array of Cray customers; customers can also implement plugins to collect data uniquely interesting to their own systems (an illustrative sketch of this collector pattern follows this session's abstracts). Plugins also support multiple methods of outputting collected data. Cray expects to release RUR in the second half of 2013. pdf, pdf The Complexity of Arriving at Useful Reports to Aid in the Successful Operation of an HPC Center Ashley Barker, Adam Carlyle, Chris Fuson, Mitch Griffith and Don Maxwell (Oak Ridge National Laboratory) While reporting may not be the first item to come to mind as one of the many challenges that HPC centers face, it is certainly a task that all of us have to devote resources to. One of the biggest problems with reporting is determining what information is needed in order to make impactful decisions that can influence everything from policies to purchasing decisions. There is also the problem of how frequently to review the data collected. For some data points, it is necessary to look at reports on a daily basis, while others are not useful unless examined over longer periods of time. This paper will look at the efforts the Oak Ridge Leadership Computing Facility has taken over the last few years to refine the data that is collected, reported, and reviewed.
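The RUR abstract above describes a framework that separates data collection from data output, but its actual plugin interface is not given in this listing. The sketch below is only an illustration of that collector/output split under assumed choices: it gathers a few per-job statistics with the standard POSIX getrusage call and appends them as JSON to a local file. It is not RUR code, and the record fields and file name are invented.

#!/usr/bin/env python
"""Illustrative data-collection pattern: gather per-job resource statistics
and emit them through an interchangeable output backend. This mimics the
collector/output separation described for RUR; it is not the RUR API."""
import json
import resource
import socket
import time

def collect():
    """Gather a few usage statistics for processes launched by this job script."""
    ru = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "utime_s": ru.ru_utime,        # user CPU time of child processes
        "stime_s": ru.ru_stime,        # system CPU time of child processes
        "maxrss_kb": ru.ru_maxrss,     # peak resident set size
    }

def output_to_file(record, path="job_usage.json"):
    """One possible output method: append a JSON record to a local file."""
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    output_to_file(collect())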
pdf, pdf Paper Technical Session 16B Merlot / Syrah Liz Sim Building Balanced Systems for the Cray Datacenter of the Future Keith Miller (DataDirect Networks) The top computing sites worldwide are faced with unique data access, management and protection challenges. In this talk, DDN, the leader in massively scalable storage solutions for Big Data applications, will discuss how joint DDN and Cray customers are achieving balanced, highly productive HPC environments today in the face of huge capacity, performance and reliability requirements, as well as directions in building the Cray data center of the future. The content will include DDN's recent developments and roadmap for block, file, object and analytics solutions and appliances, and will touch on Lustre performance testing. Surviving the Life Sciences Data Deluge using Cray Supercomputers Bhanu Rekapalli and Paul Giblock (National Institute for Computational Sciences) The growing deluge of data in the Life Science domains threatens to overwhelm computing architectures. This persistent trend necessitates the development of effective and user-friendly computational components for rapid data analysis and knowledge discovery. Bioinformatics, in particular, employs data-intensive applications driven by novel DNA-sequencing technologies, as do the high-throughput approaches that complement proteomics, genomics, metabolomics, and meta-genomics. We are developing massively parallel applications to analyze this rising flood of life sciences data for large-scale knowledge discovery. We have chosen to work with the desktop- or cluster-based applications most widely used by the scientific community, such as NCBI BLAST, HMMER, DOCK6, and MUSCLE. Our work extends highly scalable parallel applications that scale to tens of thousands of cores on Cray's XT architecture to Cray's next-generation XE, XK, and XC architectures, while also focusing on making them robust and optimized, as will be discussed in this paper. Early Experience on Crays with Genomic Applications Used as Part of Next Generation Sequencing Workflow Mikhail Kandel (University of Illinois), Steve Behling and Bill Long (Cray Inc.), Carlos P. Sosa (Cray Inc. and University of Minnesota Rochester), Sebastien Boisvert and Jacques Corbeil (Universite Laval) and Lorenzo Pesce (University of Chicago) Recent progress in DNA sequencing technology has yielded a new class of devices that allow for the analysis of genetic material with unprecedented speed and efficiency. These advances, styled under the name Next Generation Sequencing (NGS), are well suited for High-Performance Computing (HPC) systems. By breaking up DNA into millions of small strands (20 to 1000 bases) and reading them in parallel, the rate at which genetic material can be acquired has increased by several orders of magnitude. The technology to generate raw genomic data is becoming increasingly fast and inexpensive when compared to the rate at which this data can be analyzed. In general, assembling small reads into a useful form is done by either assembling individual reads (de novo) or mapping these pieces against a reference. In this paper we present our experience with these applications on Cray supercomputers, in particular with Ray, a parallel short-read assembler code. pdf, pdf Paper Technical Session 16C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Nicholas J.
Wright Measuring Sustained Performance on Blue Waters with the SPP Metric William Kramer (National Center for Supercomputing Applications) The Blue Waters Project developed the Sustained Petascale Performance (SPP) metric to assess the potential of the Blue Waters system to meet its goal of sustained petascale performance for a diverse set of science and engineering problems. The SPP, consisting of over 20 individual tests (code+input), is unique and truly representative of the ability of a system to support many areas of science and engineering. The SPP is a method that allows an accurate assessment of hybrid systems that have more than one type of node, which has not been possible before. This talk will cover 1) the underlying concepts of the SPP method, 2) the SPP implementation, 3) the selection of the codes and problem sets for the test cases, 4) the optimizations, 5) the improvements for each of the tests, 6) the results, and 7) the challenges, lessons and future improvements of the SPP. Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7 Donald K. Berry (Indiana University), Joseph Schuchart (Technische Universität Dresden) and Robert Henschel (Indiana University) GPU computing has rapidly gained popularity as a way to achieve higher performance for many scientific applications. In this paper we report on the experience of porting a hybrid MPI+OpenMP molecular dynamics code to a GPU-enabled Cray XK7 to produce a hybrid MPI+GPU code. The target machine, Indiana University's Big Red II, consists of a mix of nodes equipped with two 16-core Abu Dhabi x86-64 processors, and nodes equipped with one AMD Interlagos x86-64 processor and one NVIDIA Kepler K20 GPU board. The code, IUMD, is a Fortran program developed at Indiana University for modeling matter in compact stellar objects (white dwarf stars, neutron stars and supernovas). We compare experiences using CUDA and OpenACC. pdf, pdf Chasing Exascale: the Future of GPU Computing Steve Scott (NVIDIA) Changes in underlying silicon technology are creating a significant disruption to computer architectures and programming models. Power has become the primary constraint on processor performance, and threatens our ability to continue historic rates of performance improvement. With silicon technology no longer providing the rapid rate of improvement it once did, we must rely on advances in architectural efficiency. This has led to the creation of heterogeneous (or accelerated) architectures, and the rise of GPU computing. This talk will describe the motivations behind GPU computing, assess the current state of the art, and discuss how it is likely to evolve over the coming decade as we endeavor to build Exascale computers. It will discuss an architectural convergence that is unifying processor designs, literally from cell phones to supercomputers, and the implications for how we program these machines. Paper 12pm-1pm Interactive 17A Zinfandel / Cabernet John Hesterberg System Monitoring, Accounting and Metrics John Hesterberg (Cray Inc.) Let's talk about data collection! Cray Management Services will present our immediate and near-term roadmap in the areas of System Monitoring, Accounting, and Metrics collection as a starting point for conversation. For GPU-enabled Cray systems, NVIDIA and Cray are developing mechanisms for measuring GPU utilization, on-GPU memory bandwidth use, and high-water GPU memory usage.
Provided through feature releases from NVIDIA and Cray in Fall 2013, these metrics will be available on a per-job basis and can be integrated into your job and system accounting through a user-customizable plug-in. System monitoring, predictive failure analysis, job-based energy accounting, application resource utilization, user billing: the use cases keep increasing. Provide input on future data collection priorities and use cases. Birds of a Feather Interactive 17B Merlot / Syrah Jenett Tillotson Experiences with Moab and TORQUE Jenett Tillotson (Indiana University) This BoF will focus on administrator experiences with Moab and TORQUE, in particular the interface with ALPS, experiences with Moab 7 and TORQUE 4, and running Moab and/or TORQUE outside the Cray on an external scheduling node. Attendees will be asked to share their configurations, and we will discuss possible best practices for Moab and TORQUE configurations on Cray systems. Birds of a Feather Interactive 17C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Birds of a Feather 1pm-2:30pm Technical Session 18A Zinfandel / Cabernet Ashley Barker Blue Waters Acceptance: Challenges and Accomplishments Celso L. Mendes, Brett Bode, Gregory H. Bauer, Joseph R. Muggli, Cristina Beldica and William T. Kramer (National Center for Supercomputing Applications) Blue Waters, the largest supercomputer ever built by Cray, comprises an enormous amount of computational power. This paper describes some of the challenges encountered during the deployment and acceptance of Blue Waters, and presents how those challenges were handled by the NCSA team. After briefly reviewing our originally designed acceptance plans, we highlight the steps actually taken for that process, describe how those steps were conducted, and comment on lessons learned during that process. Besides listing the scope of the applied tests, we present an overview of their results and analyze the manner in which those results guided both the Cray and NCSA teams in tuning the system configuration. The Blue Waters acceptance testing process consisted of hundreds of tests, summarized in the paper, covering many areas directly related to the Cray system as well as other items, such as the near-line storage and the external user-support environment. pdf, pdf Saving Energy with “Free” Cooling and the Cray XC30 Brent Draney, Tina Declerck, Jeffrey Broughton and John Hutchings (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) Located in Oakland, CA, NERSC is running its new XC30, Edison, using “free” cooling. Leveraging the benign San Francisco Bay Area environment, we are able to provide a year-round source of water from cooling towers alone (no chillers) to supply the innovative cooling system in the XC30. While this approach provides excellent energy efficiency (PUE < 1.1), it is not without its challenges. This paper describes our experience designing and operating such a system, the benefits that we have realized, and the trade-offs relative to conventional approaches. pdf, pdf Real-time mission critical supercomputing with Cray systems Jason Temple and Luc Corbeil (Swiss National Supercomputing Centre) System integrity and availability are essential for real-time scientific computing in mission-critical environments. Human lives rely on decisions derived from results provided by Cray supercomputers.
The tools used for science in general must be reliable and produce the same results every time, without fail, on demand, or the results will not be trustworthy or worthwhile. In this paper, we will describe the engineering challenges of providing a reliable and highly available system to the Swiss Weather Service using Cray solutions, and we will relate recent real-life experiences that led to specific design choices. pdf, pdf Paper Technical Session 18B Merlot / Syrah Jenett Tillotson High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6 Jim Brandt (Sandia National Laboratories), Tom Tucker (Open Grid Computing), Ann Gentile (Sandia National Laboratories), David Thompson (Kitware Inc.) and Victor Kuhns and Jason Repik (Cray Inc.) A common problem experienced by users of large-scale High Performance Computing (HPC) systems, including the Cray XE6, is the inability to gain insight into their computational environments. Our Lightweight Distributed Metric Service (LDMS) is intended to be run as a continuous system service providing low-overhead remote collection of, and on-node access to, high-fidelity data. It is capable of handling hundreds of data values per node per second, vastly exceeding the data collection sizes and rates typically handled by current HPC monitoring services, while still maintaining much lower overhead. We present a case study of utilizing LDMS on the Cray XE6 platform Cielo to enable remote storage of system resource data for post-run analysis and node-local access to data for run-time in-situ analysis and workload rebalancing. We also present information from deployment on an XK6 system at Sandia, where we leverage RDMA over the Gemini transport to further reduce LDMS overhead. pdf, pdf Production I/O Characterization on the Cray XE6 Philip Carns (Argonne National Laboratory), Yushu Yao (Lawrence Berkeley National Laboratory), Kevin Harms, Robert Latham and Robert Ross (Argonne National Laboratory) and Katie Antypas (Lawrence Berkeley National Laboratory) I/O performance is an increasingly important factor in the productivity and efficiency of large-scale HPC systems such as Hopper, a 153,216-core Cray XE6 system operated by the National Energy Research Scientific Computing Center (NERSC). The scientific workload diversity of such systems presents a challenge for I/O performance tuning, however. Applications vary in terms of data volume, I/O strategy, and access method, making it difficult to consistently evaluate and enhance their I/O performance. We have adapted and tuned the Darshan I/O characterization tool for use on Hopper in order to address this challenge. Darshan is an I/O instrumentation library that collects I/O access pattern information from large-scale production applications with minimal overhead. In this paper we present our experiences in both adapting Darshan to the unique challenges of the Cray XE6 platform and deploying Darshan for full-time production use. We validated our deployment strategy by measuring the overhead introduced by Darshan for a diverse collection of large-scale applications. Our results indicate that Darshan introduces minimal runtime overhead and is therefore suitable for transparent production deployment. Darshan was enabled for all Hopper users on November 15, 2012, and instruments over 5,000 jobs per day on Hopper as of April 2013.
We use the data collected with Darshan to explore how metrics can be applied to characterization data to automatically identify the applications that can benefit most from additional I/O tuning. pdf, pdf Improvement of TOMCAT-GLOMAP File Access with User Defined MPI Datatypes Mark Richardson (Numerical Algorithms Group) and Martyn Chipperfield (University of Leeds) This paper describes the modification of the file access patterns that occur throughout the simulation runs. The analysis identified several subroutines where the workload was not actually in accessing the data but in processing the data, either before writing it or after reading it. The main gains in the project have come from a change in the practice of overloading MPI task zero. The time overhead per iteration of the writing step has been reduced from 8 seconds to 0.01s for a small case, and from 38 seconds to 0.05s for a larger case. pdf, pdf Paper Technical Session 18C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) Liam O. Forbes Cray’s Cluster Supercomputer Architecture John Lee, Susan Kraus and Maria McLaughlin (Cray Inc.) In the first half of this presentation, we will discuss Cray’s cluster supercomputer architecture designs, built upon industry-standard, optimized, modular server platforms. You will learn how platform selection is one of the key factors influencing today’s datacenter decisions for configuration flexibility, scalability and performance-per-watt based on the latest processing technologies. You will also learn how industry-standard high-performance network connectivity, streamlined I/O and diverse storage options can maximize system performance at a lower cost of ownership. We will share examples of different high-performance networking topologies, such as Fat Tree or 3D Torus (InfiniBand) with single- or dual-rail configurations, that are meeting a variety of HPC workload technical requirements. In the second half of this presentation, we will discuss the essential cluster software and management tools that are required to build and support a cluster architecture, combined with key compatibility features of the Advanced Cluster Engine™ (ACE) management software. pdf, pdf Performance Metrics and Application Experiences on a Cray CS300-AC™ Cluster Supercomputer Equipped with Intel® Xeon Phi™ Coprocessors Vincent C. Betro, Robert P. Harkness, Bilel Hadri, Haihang You, Ryan C. Hulguin, R. Glenn Brook and Lonnie D. Crosby (National Institute for Computational Sciences) Given the growing popularity of accelerator-based supercomputing systems, it is beneficial for applications software programmers to have cognizance of the underlying platform and its workings while writing or porting their codes to a new architecture. In this work, the authors highlight experiences and knowledge gained from porting such codes as ENZO, H3D, GYRO, a BGK Boltzmann solver, HOMME-CAM, PSC, AWP-ODC, TRANSIMS, and ASCAPE to the Intel Xeon Phi architecture running on a Cray CS300-AC™ Cluster Supercomputer named Beacon. Beacon achieved 2.449 GFLOP/W in High Performance LINPACK (HPL) testing and a number-one ranking on the November 2012 Green500 list.
Areas of optimization that yielded the most performance gain are highlighted, and a set of metrics for comparison and lessons learned by the team at the National Institute for Computational Sciences Application Acceleration Center of Excellence is presented, with the intention that it can give new developers a head start in porting as well as a baseline for comparing their own code's exploitation of fine- and medium-grained parallelism. pdf, pdf Paper 3pm-4:30pm Technical Session 19A Zinfandel / Cabernet Tina Butler Effect of Rank Placement on Cray XC30 Communication Cost Reuben D. Budiardja, Lonnie D. Crosby and Haihang You (National Institute for Computational Sciences) The newly released Cray XC30 supercomputer boasts the new Aries interconnect, which incorporates a Dragonfly network topology. This hierarchical network topology has obvious advantages with respect to local communication. However, as communication patterns extend further down the hierarchy and grow more separated, the overall impact of particular bottlenecks and the trade-offs between bandwidth and latency become less apparent. In particular, applications may be more or less latency sensitive based on their communication pattern. The dynamic routing options, as a result, may affect some applications more severely than others. In this paper, we investigate the effect of process placement on the communication costs associated with typical communication patterns shared by many scientific applications. Observations concerning the communication performance of benchmarks and selected applications are presented and discussed. Evaluating Node Orderings For Improved Compactness Carl Albing (Cray Inc.) This paper demonstrates an evaluation technique that provides guidance for site-specific selection of the node ordering related to application placement. Reasonable performance of parallel applications has been achieved through application placement in Cray XT/XE/XK 3D-torus systems using allocation strategies based on an ordered, one-dimensional sequence of nodes. Node ordering is a low (computation) cost way to incorporate topological information into application placement decisions. With several orderings from which to choose - and others that could be created - what is the basis for choosing one ordering over another? A method is described herein for the static evaluation of node orderings. Several orderings are evaluated for actual Cray systems, large and small. The results provide visually compelling guidance on the choice of node ordering. This comparison of characteristic curves provides useful guidance to sites for choosing a node ordering that might extract further performance improvements from their systems. pdf, pdf Improving Task Placement for Applications with 2D, 3D, and 4D Virtual Cartesian Topologies on 3D Torus Networks with Service Nodes Robert A. Fiedler and Stephen Whalen (Cray Inc.) We describe two new methods for mapping applications with multidimensional virtual Cartesian process topologies onto 3D torus networks with randomly distributed service nodes. The first method, “Adaptive Layout”, works for any number of processes and distributes the MILC (lattice QCD, 4D topology) workload to ensure that communicating processes are close together on the torus. This scheme reduces the run time by 2.7X compared to default placement. The second method, “Topaware”, selects a prism of nodes slightly larger than the ideal prism one would select if there were no service nodes.
The application’s processes are ordered to group neighboring processes on the same node and to place groups of neighbors onto nodes that are no more than a few hops apart. Up to 40% run time reductions are obtained for 2D and 3D virtual topologies. In dedicated mode, using Topaware with MILC reduces the run time by 3.7X compared to default placement. pdf, pdf Paper Technical Session 19B Merlot / Syrah Zhengji Zhao The State of the Chapel Union Bradford L. Chamberlain, Sung-Eun Choi, Martha B. Dumler, Thomas Hildebrandt, David Iten, Vassily Litvinov and Greg Titus (Cray Inc.) Chapel is an emerging parallel programming language that originated under the DARPA High Productivity Computing Systems (HPCS) program. Although the HPCS program is now complete, the Chapel language and project remain very much alive and well. Under the HPCS program, Chapel generated sufficient interest among HPC user communities to warrant continuing its evolution and development over the next several years. In this paper, we reflect on the progress that was made with Chapel under the auspices of the HPCS program, noting key decisions made during the project's history. We also summarize the current state of Chapel for programmers who are interested in using it today. Finally, we describe current and ongoing work to evolve it from prototype to production grade, and to make it better suited for execution on next-generation systems. pdf, pdf Recent enhancements to the Automatic Library Tracking Database infrastructure at the Swiss National Supercomputing Centre Timothy W. Robinson and Neil Stringfellow (Swiss National Supercomputing Centre) The Automatic Library Tracking Database (ALTD)—an infrastructure developed previously by staff at the National Institute for Computational Sciences (NICS)—is in production today on Cray XT, XE, XK, and XC30 systems at several Cray sites, including NICS, Oak Ridge National Laboratory, the National Energy Research Scientific Computing Center, and the Swiss National Supercomputing Centre (CSCS). ALTD automatically and transparently stores information about applications running on Cray systems and also records which libraries are linked to those applications; from these data, support staff at HPC centres can derive a wealth of information about software usage—such as the use or non-use of particular compiler suites or the uptake of numerical libraries and third-party applications—right down to the level of specific version numbers. The tool works by intercepting the GNU linker to gather information on compilers and libraries, and intercepting the job launcher to track the execution of applications at launch time. We have recently extended the ALTD framework deployed at CSCS to record more detailed information on the individual jobs executed on our machines: the job information recorded by the previous incarnation of ALTD was limited to user name, executable, (batch) job id, and run date; we have extended the tool to record many additional job characteristics such as begin and end times, requested versus used core counts, number of processing elements and threads per process, and mode of linking (e.g. static, dynamic).
In combination with custom post-processing scripts—which map executables to software codes, research domains or research groups—our ALTD implementation now delivers a far more complete picture of system usage, providing not only a list of running applications but also information on the way that these same applications are being run (an illustrative post-processing sketch appears further below). On a practical level, such information can be used, for example, to guide future hardware and software procurements, or to assess whether or not researchers are using our systems in the manner for which they were provided with resource allocations. pdf, pdf Comparing Compiler and Library Performance in Material Science Applications on Edison Jack Deslippe and Zhengji Zhao (National Energy Research Scientific Computing Center) Materials science and chemistry applications are expected to represent approximately 1/3 of the computational workload on NERSC's Cray XC30 system, Edison. The performance of these applications can often depend sensitively on the compiler and compiler options used at build time. For this reason, the NERSC user services group supplies users with optimized builds of the most commonly used materials science applications in order to ensure these cycles are used as efficiently as possible. In this paper, we compare the performance of various materials science and chemistry applications when built with the Cray, Intel and GNU compiler suites under various compiler options, as well as when linked against the MKL, LibSci and FFTW libraries. We compare the optimal compilers and libraries on Edison with those previously obtained on the NERSC Cray XE6 machine, Hopper. pdf, pdf Paper Technical Session 19C Atlas Peak / Castle Peak / Diamond Mountain (Vintners Ballroom) John Noe A Single Pane of Glass: Bright Cluster Manager for Cray Matthijs van Leeuwen, Mark Blessing and David Maples (Bright Computing) Bright Cluster Manager provides comprehensive cluster management for Cray systems in one integrated solution: deployment, provisioning, scheduling, monitoring, and management. Its intuitive GUI provides complete system visibility and ease of use for multiple systems and clusters simultaneously, including automated tasks and intervention. Bright also provides a powerful management shell for those who prefer to manage via a command-line interface. Bright Cluster Manager extends to cover the full range of Cray systems, spanning servers, clusters and mainframes, as well as external servers (large-scale Lustre file systems, login servers, data movers, and pre- and post-processing servers). Cray has also used Bright Cluster Manager to create additional services for its customers. Bright Computing also provides unique cloud-bursting capabilities as a standard feature of Bright Cluster Manager, automatically cloud-enabling clusters at no extra cost. Users can seamlessly extend their clusters, adding and managing cloud-based nodes as needed, or create entirely new clusters on the fly with a few mouse clicks. Either way, users benefit from complete visibility and management within the Bright environment. Bright Cluster Manager also removes the complexity of configuring and managing Hadoop servers, enabling administrators to configure a compute-ready system from bare metal, typically in less than an hour, and then fully manage it with Bright. This presentation is an overview of Bright Cluster Manager and its capabilities, with particular emphasis on the value Bright provides to Cray users.
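The ALTD abstract in Technical Session 19B above refers to custom post-processing scripts that map executables to application names; those scripts are site-specific and not shown in this listing. The following is a small, hypothetical sketch of that kind of mapping, aggregating job records by application using a hand-maintained table of executable patterns. The CSV record format (user, executable, jobid columns) and the application patterns are invented for illustration only.

#!/usr/bin/env python
"""Sketch of ALTD-style post-processing: map executable paths from job
records to application names and count jobs per application.
Record format and patterns are hypothetical."""
import csv
import re
from collections import Counter

# Hand-maintained mapping from executable-name patterns to application names.
APP_PATTERNS = [
    (re.compile(r"vasp", re.I), "VASP"),
    (re.compile(r"cp2k", re.I), "CP2K"),
    (re.compile(r"namd", re.I), "NAMD"),
]

def classify(executable):
    """Return the application name for an executable path, or 'other'."""
    for pattern, name in APP_PATTERNS:
        if pattern.search(executable):
            return name
    return "other"

def summarize(path):
    """Count jobs per application from a CSV with columns: user,executable,jobid."""
    counts = Counter()
    with open(path) as fh:
        for row in csv.DictReader(fh):
            counts[classify(row["executable"])] += 1
    return counts

if __name__ == "__main__":
    for app, njobs in summarize("altd_jobs.csv").most_common():
        print("%-10s %6d jobs" % (app, njobs))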
Supporting Multiple Workloads, Batch Systems, and Computing Environments on a Single Linux Cluster Larry Pezzaglia (National Energy Research Scientific Computing Center/Lawrence Berkeley National Laboratory) A new Intel-based, InfiniBand-attached computing system from Cray Cluster Solutions (formerly Appro) at NERSC provides computational resources to transparently expand several existing NERSC production systems serving three different constituencies: a mixed serial/parallel mid-range workload; a serial, high-throughput High-Energy Physics/Nuclear Physics workload; and a mixed serial/parallel genomics workload. This is accomplished by leveraging a disciplined image-building system and CHOS, a software package written at NERSC comprising a Linux kernel module, a PAM module, and batch system integration, to concurrently support multiple Linux compute environments on a single Linux system. Using CHOS, this new computing system can run workloads built for the hosted clusters while simultaneously maintaining a single consistent, lean, and maintainable base OS across the entire platform. pdf, pdf Tools to Execute An Ensemble of Serial Jobs on a Cray Abhinav Thota, Scott Michael, Sen Xu, Thomas G. Doak and Robert Henschel (Indiana University) Traditionally, Cray supercomputers have been located at large supercomputing centers and were used to run highly parallel applications. The user base consisted mostly of researchers from the fields of physics, mathematics, astronomy and chemistry. But in recent times, Cray supercomputers have become available to a wider range of users from a variety of disciplines. Examples include the Kraken machine at the National Institute for Computational Sciences (NICS), Hopper at the National Energy Research Scientific Computing Center (NERSC), and Big Red II at Indiana University. Predictably, as the diversity of end users has grown, the workload has expanded to include a variety of workflows containing serial and hybrid applications, as well as complex workflows involving pilot jobs. Projects that employ a massive number of serial jobs, in an embarrassingly data-parallel manner, have not traditionally been targeted to run on Cray supercomputers. To accomplish such projects, it is usually necessary to bundle a large number of serial jobs into a much larger parallel job, via either a pilot-job framework, an MPI wrapper, or custom scripting (a minimal sketch of the MPI-wrapper approach follows this listing). In this article, we explore several of the current offerings for bundling serial jobs on a Cray supercomputer and discuss some of the benefits and shortcomings of each of the approaches. The approaches we evaluate include BigJob, PCP, and native aprun with scripts. pdf, pdf Paper 4:45pm-5:15pm Closing General Session 20 Zinfandel / Cabernet / Merlot / Syrah (Grand Ballroom) Nick Cardo CUG Closing Session Nick Cardo The closing general session will include an appreciation of the local arrangements committee and a preview of CUG 2014 from the host site. Invited Talk |
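None of the bundling tools compared in the Indiana University paper above (BigJob, PCP, aprun with scripts) are reproduced here; the sketch below only illustrates the generic "MPI wrapper" approach that the abstract mentions, using mpi4py to spread a list of independent serial commands across the ranks of a single large parallel job. The availability of mpi4py, the tasks.txt file name, and the aprun launch line are assumptions for illustration.

#!/usr/bin/env python
"""Illustrative MPI wrapper for bundling serial jobs: each rank of a single
large parallel job takes every size-th command from a task list and runs it
as an ordinary serial process. Launch with, e.g., aprun -n 256 python wrapper.py"""
import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# One shell command per line; "tasks.txt" is a placeholder file name.
with open("tasks.txt") as fh:
    tasks = [line.strip() for line in fh if line.strip()]

failures = 0
for task in tasks[rank::size]:        # simple static round-robin distribution
    rc = subprocess.call(task, shell=True)
    if rc != 0:
        failures += 1

# Gather failure counts so rank 0 can report an overall status.
total_failures = comm.reduce(failures, op=MPI.SUM, root=0)
if rank == 0:
    print("completed %d tasks with %d failures" % (len(tasks), total_failures))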