CUG2012 Final Proceedings | Created 2012-5-13 |
Sunday 6:00 P.M. - 7:00 P.M., Maritim Lobby: Welcome Reception

Monday 8:00 A.M. - 10:30 A.M., Köln: Tutorial (1C). Chair: John Noe (Sandia National Laboratories)
Introduction to Debugging for the Cray Systems. David Lecomber (Allinea Software)
Abstract: The art of debugging HPC software has come a long way in recent years, and in this tutorial we will show how Allinea DDT can be used to debug MPI and accelerator code on the Cray XK6. There will be walk-throughs and hands-on opportunities to work with some typical bug scenarios, showing how easily debuggers provide the means of resolving software problems, and some of the "tricks of the trade" which will give users practical experience. There will also be demonstrations of debugging at petascale, exploring how debuggers can get to grips with software issues in applications at scale.

Monday 8:00 A.M. - 10:30 A.M., Bonn: Tutorial (1B). Chair: Jason Hill (Oak Ridge National Laboratory)
Lustre 2.x Architecture. Johann Lombardi (Whamcloud, Inc.)
Abstract: This tutorial will review the new architecture in Lustre 2.x, including the new features in Lustre 2.0 - 2.2. Some of the features covered will be Imperative Recovery, Wide Striping, Parallel Directory Operations, and a new metadata performance tool called mds-survey.

Monday 8:00 A.M. - 10:30 A.M., Hamburg: Tutorial (1A). Chair: Mark Fahey (National Institute for Computational Sciences)
Application development for the XK6. John Levesque and Jeff Larkin (Cray Inc.)
Abstract: This tutorial will address porting and optimizing an application for the Cray XK6. Given the heterogeneous architecture of this system, its effective utilization is only achieved by refactoring the application to exhibit three levels of parallelism. The first level is the typical inter-node MPI parallelism already present in applications run on XT systems. The other two levels, shared-memory parallelism on the node and vectorization or single instruction multiple threads (SIMT), will be where the major programming challenges arise. The tutorial will take the approach of starting from an all-MPI application and rewriting it to exhibit the other two levels of parallelism. The instruction will include the use of the Cray GPU programming environment to isolate important sections of code for parallelism on the node. Once the code has been hybridized with the introduction of OpenMP on the node, the accelerator can be utilized with the newly announced OpenACC directives and/or with CUDA or CUDA Fortran. Once an accelerated version of the code is developed, statistics-gathering tools can be used to identify bottlenecks and optimize data transfer, vectorization and memory utilization. Real-world examples will be employed to illustrate the advantages of the approach. Comparisons will be given between the use of the OpenACC directives and CUDA. Instructions will be given for using all the available tools from Cray and NVIDIA.

Monday 10:30 A.M. - 11:00 A.M., Maritim Foyer: Break. Whamcloud, Sponsor
Dan Ferber (Whamcloud)
Abstract: Whamcloud was established in 2010 by High-Performance Computing experts Brent Gorda and Eric Barton when they recognized that future advances in computational performance were going to require a revolutionary advance in parallel storage. Whamcloud's vision is to evolve the state of parallel storage by focusing strategically on high performance and cloud computing applications with demanding requirements for scalability.
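As a minimal sketch of the three levels of parallelism described in the XK6 application-development tutorial above (MPI between nodes, OpenMP within the node, and OpenACC directives for the accelerator): the vector-add kernel, array names and sizes below are invented for illustration and are not taken from the tutorial material.

    // Sketch only: three levels of parallelism on the XK6, per the tutorial above.
    // The vector-add kernel, array names and sizes are illustrative inventions.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                          // level 1: MPI between nodes
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
        double *pa = a.data(), *pb = b.data(), *pc = c.data();

        // Level 2: OpenMP shared-memory parallelism on the node.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            pc[i] = pa[i] + pb[i];

        // Level 3: the same loop offloaded with OpenACC directives; explicit data
        // clauses keep host-accelerator transfers visible for later tuning.
        #pragma acc parallel loop copyin(pa[0:n], pb[0:n]) copyout(pc[0:n])
        for (int i = 0; i < n; ++i)
            pc[i] = pa[i] + pb[i];

        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; ++i)
            local += pc[i];
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }

In the tutorial's workflow, the OpenMP and OpenACC levels would be introduced incrementally and then tuned with the Cray and NVIDIA tools mentioned above.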
Whamcloud provides: (1) a tested community Lustre release tree and releases to the entire Lustre community, (2) Level 3 support for Lustre to Whamcloud's customers and partners, (3) feature development under contracts, and (4) training and professional services.

Monday 11:00 A.M. - 12:00 P.M., Köln / Bonn / Hamburg: Opening General Session (2). Chair: Nick Cardo (National Energy Research Scientific Computing Center)
CUG Welcome. Nick Cardo (National Energy Research Scientific Computing Center)
The Future of HPC. Michael M. Resch (High Performance Computing Center Stuttgart)
Abstract: Scalability is considered to be the key factor for supercomputing in the coming years. A quick look at the TOP500 list shows that the level of parallelism has started to increase at a much faster pace than anticipated 20 years ago. However, there are a number of other issues that will have a substantial impact on HPC in the future. This talk will address some of these issues. It will look into current hardware and software strategies and evaluate the potential of concepts like co-design. Furthermore, it will investigate the potential use of HPC in non-traditional fields like business intelligence.

Monday 12:00 P.M. - 1:00 P.M., Restaurant Rôtisserie: Lunch. Xyratex Technology Ltd, Sponsor

Monday 1:00 P.M. - 2:30 P.M., Köln: Technical Sessions (3C). Chair: Liz Sim (EPCC, The University of Edinburgh)
Comparing One-Sided Communication With MPI, UPC and SHMEM. Christopher M. Maynard (University of Edinburgh)
Abstract: Two-sided communication, with its linked send and receive message construction, has been the dominant communication pattern of the MPI era. With the rise of multi-core processors and the consequent dramatic increase in the number of computing cores in a supercomputer, this dominance may be at an end. One-sided communication is often cited as part of the programming paradigm which would alleviate the punitive synchronisation costs of two-sided communication for an exascale machine. This paper compares the performance of one-sided communication in the form of put and get operations for MPI, UPC and Cray SHMEM on a Cray XE6, using the Cray C compiler. This machine has support for Remote Memory Access (RMA) in hardware, and the Cray C compiler supports UPC, as well as environment support for SHMEM. A distributed hash table application is used to test the performance of the different approaches, as this requires one-sided communication.

Balancing shared memory and messaging interactions in UPC on the XE6. Ahmad Anbar, Olivier Serres, Asila Wati, Lubomir Riha and Tarek El-Ghazawi (The George Washington University)
Abstract: While many-core processors have huge performance potential, it can be wasted if the programming is not done carefully. Because of its locality awareness, PGAS has been able to achieve scalability at the cluster level. We believe that as chips grow in core count, PGAS will remain a good fit for intra-node programming. One unclear area is which mechanism the PGAS model should rely upon for intra-node communication; there are basically two mechanisms: processes and threads. This is an important decision, as a large percentage of the overall communication cost of programs happens within the nodes. As a case study, we evaluated the performance of several UPC applications and synthetic micro-benchmarks. We evaluated the performance when the communication within nodes was based on threads, processes, or a mixture of the two.
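To make the put-style operations compared in the Maynard paper above concrete, here is a minimal hedged sketch of the same one-sided remote write expressed with MPI-2 RMA and, in the trailing comment, with Cray SHMEM. The buffer names, sizes and neighbour pattern are invented for illustration, and the paper's UPC variant is not shown.

    // Sketch only: a one-sided remote write in MPI-2 RMA, with the SHMEM
    // equivalent outlined in the trailing comment. Names are illustrative.
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long local_cell = -1;                      // each rank exposes one long
        MPI_Win win;
        MPI_Win_create(&local_cell, sizeof(long), sizeof(long),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        long value = rank;
        int right = (rank + 1) % size;
        MPI_Win_fence(0, win);                     // open an access epoch
        MPI_Put(&value, 1, MPI_LONG, right, 0, 1, MPI_LONG, win);  // one-sided put
        MPI_Win_fence(0, win);                     // close the epoch; data visible

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

    // Equivalent idea with Cray SHMEM (separate program; static globals are symmetric):
    //   static long local_cell;
    //   start_pes(0);
    //   long value = shmem_my_pe();
    //   shmem_long_put(&local_cell, &value, 1, (shmem_my_pe() + 1) % shmem_n_pes());
    //   shmem_barrier_all();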
Finally, we recommend general guidelines and draw the main conclusions.

Performance of Fortran Coarrays on the Cray XE6. David Henty (EPCC, The University of Edinburgh)
Abstract: Coarrays are a feature of the Fortran 2008 standard that enable parallelism using a small number of additional language elements. The execution model is that of a Partitioned Global Address Space (PGAS) language. The Cray XE architecture is particularly interesting for studying PGAS languages: it scales to very large numbers of processors; the underlying Gemini interconnect is ideally suited to the PGAS model of direct remote memory access; and the Cray compilers support PGAS natively. In this paper we present a detailed analysis of the performance of key coarray operations on XE systems, including the UK national supercomputer HECToR, a 90,000-core Cray XE6 operated by EPCC at the University of Edinburgh. The results include a wide range of communications patterns and synchronisation methods relevant to real applications. Where appropriate, these are compared to the equivalent operation implemented using MPI.

Monday 1:00 P.M. - 2:30 P.M., Bonn: Technical Sessions (3B). Chair: Rolf Rabenseifner (High Performance Computing Center Stuttgart)
Developing hybrid OpenMP/MPI parallelism for Fluidity-ICOM - next generation geophysical fluid modelling technology. Xiaohu Guo (Science and Technology Facilities Council), Gerard Gorman (Department of Earth Science and Engineering, Imperial College London, London SW7 2AZ, UK) and Andrew Sunderland and Mike Ashworth (Science and Technology Facilities Council)
Abstract: Most modern high performance computing platforms can be described as clusters of multi-core compute nodes. The trend for compute nodes is towards greater numbers of lower power cores, with a decreasing memory to core ratio. This is imposing a strong evolutionary pressure on numerical algorithms and software to efficiently utilise the available memory and network bandwidth.

Porting and Optimizing VERTEX-PROMETHEUS on the Cray XE6 at HLRS for Three-Dimensional Simulations of Core-Collapse Supernova Explosions of Massive Stars. Florian Hanke (Max-Planck-Institut fuer Astrophysik), Andreas Marek (Rechenzentrum Garching) and Bernhard Mueller and Hans-Thomas Janka (Max-Planck-Institut fuer Astrophysik)
Abstract: Supernova explosions are among the most powerful cosmic events, whose physical mechanism and consequences are still incompletely understood. We have developed a fully MPI-OpenMP parallelized version of our VERTEX-PROMETHEUS code in order to perform three-dimensional simulations of stellar core-collapse and explosion on Tier-0 systems such as Hermit at HLRS. Tests on up to 64,000 cores have shown excellent scaling behavior. In this paper we will present our progress in porting, optimizing, and performing production runs on a large variety of machines, starting from vector machines and reaching to modern systems such as the new Cray XE6 system in Stuttgart.

Monday 1:00 P.M. - 2:30 P.M., Hamburg: Technical Sessions (3A). Chair: Liam Forbes (Arctic Region Supercomputing Center)
Reliability and Resiliency of XE6 and XK6 Systems: Trends, Observations, Challenges. Steven J. Johnson (Cray Inc.)
Abstract: In 2011, Cray Inc. continued to observe improving reliability trends on XE6 systems and increased use of system resiliency capabilities such as Warm Swap. Late in the year, XK6-based systems began shipping to customers, as did XE6 systems with the next-generation AMD processors.
This paper will discuss the reliability trends observed on all of these systems through 2011 and into early 2012 and examine the major factors affecting system-wide outages and the occurrence of node drops in systems. The differences in site operations, such as the frequency of scheduled maintenance, will be explored to see what impact, if any, this may have on overall system reliability and availability. Finally, the paper will explore where, when and how Warm Swap is being used on Cray systems and its overall effectiveness in maximizing system availability.

Online Diagnostics at Scale. Don Maxwell (Oak Ridge National Laboratory) and Jeff Becklehimer (Cray Inc.)
Abstract: The Oak Ridge Leadership Computing Facility (OLCF) housed at the Oak Ridge National Laboratory recently acquired a 200-cabinet Cray XK6. The computer will primarily provide capability computing cycles to the U.S. Department of Energy (DOE) Office of Science INCITE program. The OLCF has a tradition of installing very large computer systems requiring unique methods in order to achieve production status in the most expeditious and efficient manner. This paper will explore some of the methods that have been used over the years at OLCF to eliminate both early-life hardware failures and ongoing failures, giving users a more stable machine for production.

Monday 2:30 P.M. - 3:00 P.M., Maritim Foyer: Break. Bright Computing, Sponsor
Mark Blessing (Bright Computing)
Abstract: Bright Computing specializes in management software for clusters, grids and clouds, including compute, storage, Hadoop and database systems. Bright Cluster Manager's fundamental approach and intuitive interface make cluster management easy, while providing powerful and complete management capabilities for increasing productivity. Bright Cluster Manager now provides cloud-bursting capabilities into Amazon EC2, managing these external nodes as if part of the on-site system.

Monday 3:00 P.M. - 4:00 P.M., Köln: Interactive Session (4C). Chair: Tara Fly (Cray)
Monday 3:00 P.M. - 4:00 P.M., Bonn: Interactive Session (4B). Chair: Helen He (National Energy Research Scientific Computing Center)
Monday 3:00 P.M. - 4:00 P.M., Hamburg: Interactive Session (4A). Chair: Nick Cardo (National Energy Research Scientific Computing Center)
Monday 4:30 P.M. - 10:00 P.M., Cannstatter Wasen: Das Stuttgarter Frühlingsfest. DataDirect Networks, Sponsor

Tuesday 8:30 A.M. - 10:00 A.M., Köln / Bonn / Hamburg: General Session (5). Chair: David Hancock (Indiana University)

Tuesday 10:00 A.M. - 10:30 A.M., Maritim Foyer: Break. Altair Corporation, Sponsor
Mary Bass (Altair Engineering, Inc.)
Abstract: PBS Works™, Altair's suite of on-demand cloud computing technologies, allows companies to maximize ROI on existing Cray systems. PBS Works is the most widely implemented software environment for managing grid, cloud, and cluster computing resources worldwide. The suite's flagship product, PBS Professional®, allows Cray users and administrators to easily share distributed computing resources across geographic boundaries. With additional tools for portal-based submission, analytics, and data management, the PBS Works suite is a comprehensive solution for optimizing your Cray environment. Leveraging a revolutionary "pay-for-use" unit-based business model, PBS Works delivers increased value and flexibility over conventional software-licensing models. To learn more, please visit www.pbsworks.com.

Tuesday 10:30 A.M. - 12:00 P.M.
Köln / Bonn / Hamburg: General Session (6). Chair: David Hancock (Indiana University)
From PetaScale to ExaScale: How to Improve Sustained Performance? Wolfgang E. Nagel (Technische Universität Dresden)
Abstract: Parallelism and scalability have become major issues in all areas of computing; nowadays pretty much everybody, even beyond the field of classical HPC, uses parallel codes. Nevertheless, the number of cores on a single chip, homogeneous as well as heterogeneous cores, is significantly increasing. Soon, we will have millions of cores in one HPC system. The ratios between flops and memory size, as well as bandwidth for memory, communication, and I/O, will worsen. At the same time, the need for energy might be extraordinary, and the best programming paradigm is still unclear.

Tuesday 12:00 P.M. - 1:00 P.M., Restaurant Rôtisserie: Lunch. Adaptive Computing, Sponsor
Starla Mehaffey (Adaptive Computing)
Abstract: Adaptive Computing manages the world's largest computing installations with its Moab® self-optimizing cloud management and HPC workload management solutions. The patented Moab multi-dimensional intelligence engine delivers policy-based governance, allowing customers to consolidate resources, allocate and manage services, optimize service levels and reduce operational costs. Our leadership in IT decision engine software has been recognized with over 45 patents and over a decade of battle-tested performance, resulting in a solid Fortune 500 and Top500 supercomputing customer base.

Tuesday 1:00 P.M. - 2:30 P.M., Köln: Technical Sessions (7C). Chair: Mark Fahey (National Institute for Computational Sciences)
Case Studies in Deploying Cluster Compatibility Mode. Tara Fly, David Henseler and John Navitsky (Cray Inc.)
Abstract: Cray's addition of the Data Virtualization Service (DVS) and Dynamic Shared Libraries (DSL) to the Cray Linux Environment (CLE) software stack provides the foundations necessary for shared library support. The Cluster Compatibility Mode (CCM) feature introduced with CLE 3 completes the picture and allows Cray to provide "out-of-the-box" support for independent software vendor (ISV) applications built for Linux-x86 clusters. Cluster Compatibility Mode enables far greater workload flexibility, including installation and execution of ISV applications and use of various third-party MPI implementations, which necessitates a corresponding increase in complexity in system administration and site integration. This paper explores the CCM architecture and a number of case studies from early deployment of CCM into user environments, sharing best practices learned, in the hope that sites can leverage these experiences for future CCM planning and deployment.

Cray Cluster Compatibility Mode on Hopper. Zhengji Zhao, Yun (Helen) He and Katie Antypas (Lawrence Berkeley National Laboratory)
Abstract: Cluster Compatibility Mode (CCM) is a Cray software solution that provides the services needed to run most cluster-based independent software vendor (ISV) applications on the Cray XE6. CCM is of importance to NERSC because it can enable user applications that require TCP/IP support, which are an important part of the NERSC workload, on NERSC's Cray XE6 machine, Hopper. Gaussian and NAMD replica exchange simulations are two important application examples that cannot run on Hopper without CCM. In this paper, we will present our CCM performance evaluation results on Hopper, and will describe how CCM has been explored and utilized at NERSC.
We will also discuss the benefits and issues of enabling CCM on the petascale production Hopper system.

My Cray can do that? Supporting Diverse workloads on the Cray XE-6. Richard S. Canon, Jay Srinivasan and Lavanya Ramakrishnan (Lawrence Berkeley National Laboratory)
Abstract: The Cray XE architecture has been optimized to support tightly coupled MPI applications, but there is an increasing need to run more diverse workloads in the scientific and technical computing domains. Can platforms like the Cray XE line play a role here? In this paper, we will describe tools we have developed to support genomic analysis and other data-intensive applications on NERSC's Hopper system. These tools include a custom task farmer framework, tools to create virtual private clusters on the Cray, and the use of Cray's Cluster Compatibility Mode (CCM) to support more diverse workloads. In addition, we will describe our experience with running Hadoop, a popular open-source implementation of MapReduce, on Cray systems. We will present our experiences with this work, including successes and challenges. Finally, we will discuss future directions and how the Cray platforms could be further enhanced to support this class of workloads.

Tuesday 1:00 P.M. - 2:30 P.M., Bonn: Technical Sessions (7B). Chair: Ashley Barker (ORNL)
Accelerated Debugging: Bringing Allinea DDT to OpenACC on the Cray XK6 beyond Petascale. David Lecomber (Allinea Software)
Abstract: The ability to debug at petascale is now a reality for homogeneous systems such as the Cray XE6, and is a vital part of producing software that works. Developers are using Allinea DDT to debug their MPI codes regularly at petascale, with an interface that is responsive and intuitive even at this extreme size.

Third Party Tools for Titan. Richard Graham, Oscar Hernandez, Christos Kartsaklis, Joshua Ladd and Jens Domke (Oak Ridge National Laboratory) and Jean-Charles Vasnier, Stephane Bihan and Georges-Emmanuel Moulard (CAPS Enterprise)
Abstract: Over the past few years, as part of the Oak Ridge Leadership Class Facility project (OLCF-3), Oak Ridge National Laboratory (ORNL) has been engaged with several third-party tools vendors with the aim of enhancing the tool offerings for ORNL's GPU-based platform, Titan. This effort has resulted in enhancements to CAPS' HMPP compiler, Allinea's DDT debugger, and the Vampir suite of performance analysis tools from the Technische Universität Dresden. In this paper we will discuss the latest enhancements to these tools, and their impact on applications as ORNL readies Titan for full-scale production as a GPU-based heterogeneous system.

The Eclipse Parallel Tools Platform: Toward an Integrated Development Environment for Improved Software Engineering on Crays. Jay Alameda and Jeffrey L. Overbey (National Center for Supercomputing Applications/University of Illinois)
Abstract: Eclipse is a widely used, open source integrated development environment that includes support for C, C++, Fortran, and Python. The Parallel Tools Platform (PTP) extends Eclipse to support development on high performance computers. PTP allows the user to run Eclipse on her laptop, while the code is compiled, run, debugged, and profiled on a remote HPC system. PTP provides development assistance for MPI, OpenMP, and UPC; it allows users to submit jobs to the remote batch system and monitor the job queue; and it provides a visual parallel debugger.

Tuesday 1:00 P.M. - 2:30 P.M.
Hamburg: Technical Sessions (7A). Chair: Jason Hill (Oak Ridge National Laboratory)
Cray's Lustre Support Model and Roadmap. Cory Spitz (Cray Inc.)
Abstract: Cray continues to deploy and support Lustre as the file system of choice for all of our systems. As such, Cray is committed to developing Lustre and ensuring its continued success on our platforms. This paper will discuss Cray's Lustre deployment model and how it both ensures a stable Lustre version and enables productivity. It will also outline how we work with the Lustre community through OpenSFS. Finally, it will roll out our updated Lustre roadmap, which includes Lustre 2.2 and Linux 3.0.

Lustre Roadmap and Releases. Dan Ferber (Whamcloud)
Abstract: Whamcloud, sponsored by OpenSFS, produces Lustre releases in addition to providing Lustre development and support. This includes patch landings, testing, packaging, and release for the Lustre community. As an OpenSFS board-level member and contributor, Cray plays a key role in helping support that activity. This presentation reviews the current Whamcloud Lustre roadmap, test reporting, and release schedules.

DDN Exascale Directions And Cray Product/Partnership Update. Keith Miller (DataDirect Networks)
Abstract: Very large compute environments are facing unprecedented challenges with respect to the storage systems that support them. In this talk, DDN - the world leader in massively scalable HPC storage technology - will discuss solutions to Petascale and Exascale I/O challenges and opportunities driven by the rise of trends such as: the continued expansion of file stripe sizes on larger pools of commodity technology, disk performance improvements which are disproportionate to CPU performance, scalable storage system usability, the advent of Big Data analytics for HPC, and the emergence of GeoDistributed Object Storage as a viable platform for next-generation computing and Big Data collaboration. Additionally, information will be provided on DDN's forthcoming product portfolio updates and deployment experience in massively scalable Cray environments.

Tuesday 2:30 P.M. - 3:00 P.M., Maritim Foyer: Break. Allinea Software, Sponsor

Tuesday 3:00 P.M. - 5:00 P.M., Köln: Technical Sessions (8C). Chair: Liz Sim (EPCC, The University of Edinburgh)
Porting and optimisation of the Met Office Unified Model on PRACE architectures. Pier Luigi Vidale (NCAS-Climate, Dept. of Meteorology, Univ. of Reading, UK), Malcolm Roberts and Matthew Mizielinski (Met Office Hadley Centre, UK), Simon Wilson (Met Office, UK / NERC CMS), Grenville Lister (NERC CMS, Univ. of Reading), Oliver Darbyshire (Met Office, UK) and Tom Edwards (Cray Centre of Excellence for HECToR)
Abstract: We present porting, optimisation and scaling results from our work with the United Kingdom's Unified Model on a number of massively parallel architectures: the UK MONSooN and HECToR systems, the German HERMIT and the French Curie supercomputer, part of PRACE.

Adaptive and Dynamic Load Balancing for Weather Forecasting Models. Celso L. Mendes (University of Illinois), Eduardo R. Rodrigues (IBM Research, Brazil), Jairo Panetta (CPTEC/INPE, Brazil) and Laxmikant V. Kale (University of Illinois)
Abstract: Climate and weather forecasting models require large processor counts on current supercomputers. However, load imbalance in these models may limit their scalability.
We address this problem using AMPI, an MPI implementation based on the Charm++ infrastructure, where MPI tasks are implemented as user-level threads that can dynamically migrate across processors. In this paper, we explore an advanced load balancer, based on an adaptive scheme that frequently monitors the degree of load imbalance but only takes corrective action (i.e. migrates work from one processor to another) when that action is expected to be profitable for subsequent time-steps in the execution. We present experimental results obtained on Cray systems with BRAMS, a mesoscale weather forecasting model. They reflect a trade-off between maintaining load balance and minimizing migration costs during rebalancing. Given the deployment of large systems at CPTEC and at Illinois, this novel load balancing mechanism will become a critical contribution to the effective use of those systems.

Porting the Community Atmosphere Model - Spectral Element Code to Utilize GPU Accelerators. Matthew Norman (Oak Ridge National Laboratory), Jeffrey Larkin (Cray Inc.), Richard Archibald (Oak Ridge National Laboratory), Ilene Carpenter (National Renewable Energy Laboratory), Valentine Anantharaj (Oak Ridge National Laboratory), Paulius Micikevicius (NVIDIA) and Katherine Evans (Oak Ridge National Laboratory)
Abstract: Here we describe our XK6 porting efforts for the Community Atmosphere Model - Spectral Element (CAM-SE), a large Fortran climate simulation code base developed by multiple institutions. Including more advanced physics and aerosols in future runs will address key climate change uncertainties and socioeconomic impacts. This, however, requires transporting up to order 100 quantities (called "tracers") used in new physics and chemistry packages, consuming upwards of 85% of the total CAM runtime. Thus, we focus our GPU porting efforts on the transport routines. In this paper, we discuss data structure changes that allowed sufficient thread-level parallelism, reduction in PCI-e traffic, tuning of the individual kernels, analysis of GPU efficiency metrics, timing comparison with best-case CPU code, and validation of accuracy. We believe these experiences are unique, interesting, and valuable to others undertaking similar porting efforts.

Performance Evaluation and Optimization of the ls1-MarDyn Molecular Dynamics code on the Cray XE6. Christoph Niethammer (High Performance Computing Center Stuttgart)
Abstract: Today, Molecular Dynamics (MD) simulations are a key tool in many research and industry areas: biochemistry, solid-state physics, and chemical engineering, to mention just a few. While in the past MD was a playground for some very simple problems, the ever-increasing compute power of supercomputers lets it handle more and more complex problems: it allows increasing numbers of particles and more sophisticated molecular models which were too compute-intensive in the past. In this paper we present performance studies and results obtained with the ls1-MarDyn MD code on the new Hermit system (Cray XE6) at HLRS. The code's scalability up to the full system with 100,000 cores is discussed, as well as a comparison to other platforms. Furthermore, we present detailed code analysis using the Cray software environment. From the obtained results we discuss further improvements which will be indispensable for upcoming systems in the post-petascale era.

Tuesday 3:00 P.M. - 5:00 P.M.
Bonn: Technical Sessions (8B). Chair: Rolf Rabenseifner (High Performance Computing Center Stuttgart)
The Cray Programming Environment: Current Status and Future Directions. Luiz DeRose (Cray Inc.)
Abstract: The scale of current and future high-end systems, as well as the increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to achieve high performance on peta-scale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high-end HPC systems. Users must be supported by intelligent compilers, automatic performance analysis tools, adaptive libraries, and scalable software. In this talk I will present the recent activities and future directions of the Cray Programming Environment that are being developed and deployed to improve users' productivity on the Cray XE and XK supercomputers.

Cray Performance Measurement and Analysis Tools for the Cray XK System. Heidi Poxon (Cray Inc.)
Abstract: The Cray Performance Measurement and Analysis Tools have been enhanced to support whole-program analysis on Cray XK systems. The focus of support is on the new directive-based OpenACC programming model, helping users identify key performance bottlenecks within their x86/GPU hybrid programs. Advantages of the Cray tools include summarized and consolidated performance data beneficial for analysis of programs that use a large number of nodes and GPUs, statistics for the whole program mapped back to user source by line number, GPU statistics grouped by accelerated region, as well as the x86 statistics traditionally provided by the Cray performance tools. This paper discusses these enhancements, including support to help users add increased levels of parallelism to their MPI applications through OpenMP or OpenACC.

Cray Scientific Libraries: New Features and Advanced Usage. Adrian Tate (Cray Inc.)
Abstract: Cray scientific libraries are relied upon to extract the maximum performance from a Cray system and so must be optimized for the Gemini network, the Interlagos and Magny-Cours processors, and now also for NVIDIA accelerators. In this talk I will discuss the scientific libraries that are available on each product, basic usage, how the different library components are optimized, and what advanced performance controls are available to the user. In particular I will describe the new CrayBLAS library, which has a radically different internal structure to previous BLAS libraries, and I will talk in detail about libsci for accelerators, which provides both simple usage and advanced hybrid performance on the XK6. I will detail some communications optimization of our FFT library using Co-array Fortran, and I will also discuss upcoming libsci features and improvements.

Applying Automated Optimisation Techniques to HPC Applications. Thomas Edwards (Cray Inc.)
Abstract: Porting and optimising applications to a new processor architecture, a different compiler, or the introduction of new features in the software or hardware environment can generate a large number of new parameters that have the potential to affect application performance. Vendors attempt to provide sensible defaults that perform well in general, for example by grouping compiler optimisations into flag groupings and setting the default values of environment variables, but these are inevitably based on the experience gained from, or the expected behaviour of, a typical application.
In many cases applications will exhibit some behaviour that differs from the norm, for example requiring identical floating-point results when changing MPI decompositions, or sending or receiving messages of unusual or irregular sizes. Manually finding the combination of flags and environment variables that provides optimum performance whilst maintaining a set of application-specific criteria can be time-consuming and tedious. In many cases programmers opt to automate the optimisation process, using the computer to find an optimal solution. There are, however, a wide variety of potential algorithms and techniques that can be employed to perform the search, each with various merits and suitability to the problem of optimising an HPC application. This paper explores, evaluates and compares techniques for the automated optimisation of HPC application parameters within fixed numbers of iterations, focusing specifically on the properties of HPC applications. Drawing on the author's practical experience with real-world applications, the cost in compute resources compared to the runtime improvements gained is evaluated and considered.

Tuesday 3:00 P.M. - 5:00 P.M., Hamburg: Technical Sessions (8A). Chair: Tina Butler (National Energy Research Scientific Computing Center)
Xyratex ClusterStor Architecture. Torben Kling Petersen (Xyratex)
Abstract: As the size, performance, and reliability requirements of HPC storage systems increase exponentially, building solutions utilizing practices and philosophies that have existed for over five years is no longer adequate or efficient. While some instability of HPC systems was tolerable in the past, commercial and lab HPC environments now require enterprise-level stability and reliability for their petascale systems. In order to meet these industry requirements, Xyratex architected an innovative Lustre-based HPC storage solution known as ClusterStor. The ClusterStor solution utilizes enterprise-grade storage and software components, fully automated installation procedures, and rigorous testing procedures prior to shipping out to customers in order to drive the highest levels of reliability for growing and evolving HPC environments.

Minimizing Lustre ping effects at scale on Cray systems. Cory Spitz, Nic Henke, Doug Petesch and Joe Glenski (Cray Inc.)
Abstract: Cray is committed to pushing the boundaries of scale of its deployed Lustre file systems, in terms of both client count and the number of Lustre server targets. However, scaling Lustre to such great heights presents a particular problem with the Lustre pinger, especially with the routed LNET configurations used on so-called external Lustre file systems. There is an even greater concern for LNETs with finely grained routing. The routing of small messages must be improved; otherwise Lustre pings have the potential to 'choke out' real bulk I/O, an effect we call 'dead time'. Pings also contribute to OS jitter, so it is important to minimize their impact even if a scale threshold that disrupts real I/O has not been met. Moreover, Lustre idle pings are an issue even for very busy systems because each client must ping every target.
This paper will discuss the techniques used to illustrate the problem and best practices for avoiding the effects of Lustre pings.

Cray Sonexion. Hussein Harake (CSCS)
Abstract: During SC11, Cray announced an innovative new HPC data storage solution named Cray Sonexion. CSCS installed an early Sonexion system in December 2011; the system is connected to a development Cray XE6 machine. The purpose of the study is to evaluate this product, covering installation, configuration and tuning, including the Lustre file system, and its integration with the Cray XE6.

A Next-Generation Parallel File System Environment for the OLCF. Galen Shipman, David Dillow, Douglas Fuller, Raghul Gunasekaran, Jason Hill, Youngjae Kim, Sarp Oral, Doug Reitz, James Simmons and Feiyi Wang (Oak Ridge National Laboratory)
Abstract: When deployed in 2008/2009, the Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) was the world's largest-scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Designed to support the parallel I/O requirements of the Jaguar XT5 system and other smaller-scale platforms at the OLCF, the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider's original design by mid 2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth, from over 240 GB/sec today to 1 TB/sec, and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles' heel of parallel file systems at leadership scale. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.

Tuesday 5:00 P.M. - 5:00 P.M., Maritim Foyer: Break
Tuesday 5:00 P.M. - 5:45 P.M., Köln: Interactive Session (9C). Chair: Jim Rogers (Oak Ridge National Laboratory)
Tuesday 5:00 P.M. - 5:45 P.M., Bonn: Interactive Session (9B). Chair: David Wallace (Cray, Inc.)
Removing Barriers to Application Performance. David Wallace (Cray Inc.)
Abstract: Application developers are often faced with having to work around hardware (or software) imposed system limitations. These compromises can require the adoption of sub-optimal algorithms or the use of approaches that hinder obtaining peak application performance. Cray is gathering requirements for implementation consideration for future systems. The intent of this moderated BoF session is to identify barriers in hardware and software that impact optimal application algorithms and affect achieving peak performance or impact application development productivity.

Tuesday 6:30 P.M. - 9:30 P.M.: Cray Social
Abstract: Cray invites all registered CUG 2012 attendees (badge required) and their guests to a dinner reception at the Vinum im Literaturhaus restaurant (http://www.vinum-im-literaturhaus.de).
Vinum is located within walking distance of the Maritim hotel and conference center, at Breitscheidstraße 4.

Wednesday 8:30 A.M. - 10:00 A.M., Köln / Bonn / Hamburg: General Session (10). Chair: Nick Cardo (National Energy Research Scientific Computing Center)
CUG Business. Nick Cardo (National Energy Research Scientific Computing Center)
PRACE for Science and Industry. Richard Kenway (University of Edinburgh)
Abstract: The Partnership for Advanced Computing in Europe was established as an international non-profit association, PRACE AISBL, in 2010 to create a pan-European supercomputing infrastructure for large-scale scientific and industrial research at the highest performance level. It has 24 member states and currently allocates petascale resources in France, Germany, Italy and Spain through world-wide open competition. This talk will describe the successes of PRACE so far and its vision for the future.

Wednesday 10:00 A.M. - 10:30 A.M., Maritim Foyer: Break. The Portland Group, Sponsor
Pat Brooks (The Portland Group)
Abstract: The Portland Group® (a.k.a. PGI®) is a premier supplier of software compilers and tools for parallel computing. PGI's goal is to provide the highest performance, production-quality compilers and software development tools.

Wednesday 10:30 A.M. - 12:00 P.M., Köln: Technical Sessions (11C). Chair: Ashley Barker (ORNL)
The PGI Fortran and C99 OpenACC Compilers. Brent Leback, Michael Wolfe and Douglas Miles (The Portland Group)
Abstract: This paper and talk provide an introduction to programming accelerators using the PGI OpenACC implementation in Fortran and C. It is suitable for application programmers who are not expert GPU programmers. The paper compares the use of the Parallel and Kernels constructs and provides guidelines for their use. Examples of inter-operating with lower-level explicit GPU languages will be shown. The material covers version 1.0 features of the language API, interpreting compiler feedback, and performance analysis and tuning. The talk includes a live component with a demo application running on a Windows laptop.

Performance Studies of A Co-Array Fortran Application Versus MPI. Mike Ashworth (Science and Technology Facilities Council)
Abstract: An open question is whether future applications targeting multi-Petaflop systems with many-core nodes will best be served by the conventional approach of the hybrid MPI-OpenMP programming model, or whether global address space languages, such as Co-Array Fortran (CAF), can offer equivalent performance with a simpler, more robust and maintainable programming interface. We will show performance results from a stand-alone but representative CFD code (the Shock Boundary Layer Interaction code) for which we have implementations using both programming models. Using the UK's HECToR Cray XE6 system, we shall investigate issues such as multi-threading scalability on the node and optimization of the numbers of OpenMP threads and MPI tasks for the hybrid code, as well as the efficiency of the CAF code, which is expected to benefit from the improved implementation of single-sided messaging in the Gemini network.

Tools for Benchmarking, Tracing, and Simulating SHMEM Applications. Mitesh R. Meswani, Laura Carrington and Allan Snavely (San Diego Supercomputer Center) and Stephen Poole (Oak Ridge National Laboratory)
Abstract: Cray's SHMEM communication library provides a low-latency, one-sided communication paradigm for parallel applications to co-ordinate their activity.
Hence a trace of SHMEM calls is an important tool towards understanding and tuning the communication performance of SHMEM applications. Towards this end we present a suite of tools to benchmark, trace, and simulate SHMEM communication speedily and accurately. Specifically, in this paper we present the following three tools: (1) ShmemBench, a benchmark generator that generates timed user-specified APIs and communication sizes to benchmark SHMEM communication; (2) ShmemTracer, a lightweight library to trace SHMEM calls in a running application; and (3) Shmem Simulator, a tool to accurately and speedily simulate SHMEM traces for different target Cray systems. Together, the three tools provide a powerful experimentation framework for Cray users to analyze and optimize the performance of SHMEM applications.

Wednesday 10:30 A.M. - 12:00 P.M., Bonn: Technical Sessions (11B). Chair: Helen He (National Energy Research Scientific Computing Center)
Open MPI for Cray XE/XK Systems. Manjunath Gorentla Venkata and Richard L. Graham (Oak Ridge National Laboratory) and Nathan T. Hjelm and Samuel K. Gutierrez (Los Alamos National Laboratory)
Abstract: Open MPI provides an implementation of the MPI standard supporting communications over a range of high-performance network interfaces. Recently, ORNL and LANL have collaborated on creating a port of Open MPI for Gemini, the network interface for Cray XE and XK systems. In this paper, we present our design and implementation of Open MPI's point-to-point and collective operations for Gemini, and the techniques we employ to provide good scaling and performance characteristics.

Early Results from the ACES Interconnection Network Project. Scott Hemmert (Sandia National Laboratories), Duncan Roweth (Cray Inc.) and Richard Barrett (Sandia National Laboratories)
Abstract: In spring 2010, the Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos and Sandia National Laboratories, initiated the ACES Interconnection Network Project, focused on a potential future interconnection network. The intent of the project is to analyze potential capabilities for inclusion in Pisces that would result in significant performance benefits for a suite of ASC applications. This paper will describe the simulation framework used for the project, as well as present a selection of initial research results. We show that the Dragonfly network topology is well suited to ASC applications and that adaptive routing provides significant performance benefits.

Analyses and Modeling of Applications Used to Demonstrate Sustained Petascale Performance on Blue Waters. Gregory H. Bauer (National Center for Supercomputing Applications), Torsten Hoefler (National Center for Supercomputing Applications/University of Illinois), William Kramer (National Center for Supercomputing Applications) and Robert A. Fiedler (Cray Inc.)
Abstract: The sustained petascale performance of the Blue Waters system, a US National Science Foundation (NSF) funded petascale computing resource, will be demonstrated using a suite of applications representing a wide variety of disciplines important to the science and engineering communities of the NSF: Lattice Quantum Chromodynamics (MILC), Materials Science (QMCPACK), Geophysical Science (H3D(M) and SPECFEM3D), Atmospheric Science (WRF), and Computational Chemistry (NWCHEM).
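As a rough illustration of the kind of event a SHMEM call tracer such as ShmemTracer (described in the tools abstract above) might record: the abstract does not describe the tool's interception mechanism, so the wrapper below is an invented, hand-inserted stand-in rather than the actual implementation, and its name and log format are illustrative only.

    // Sketch only: a hand-inserted wrapper recording the size and duration of a
    // SHMEM put, roughly the kind of event a tracing library would log. This is
    // NOT the ShmemTracer implementation; the name and log format are invented.
    #include <shmem.h>
    #include <sys/time.h>
    #include <cstdio>

    static double now_seconds() {
        timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    // Application code would call traced_putmem() in place of shmem_putmem().
    void traced_putmem(void *dest, const void *src, size_t nbytes, int pe) {
        double t0 = now_seconds();
        shmem_putmem(dest, src, nbytes, pe);       // the real one-sided transfer
        double t1 = now_seconds();
        std::fprintf(stderr, "PE %d: putmem %zu bytes to PE %d in %.3e s\n",
                     shmem_my_pe(), nbytes, pe, t1 - t0);
    }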
We will discuss the performance of these applications on the Blue Waters hardware and provide simple performance models that allow us to predict the sustained performance of the applications running at full scale. Several performance metrics will be used to identify optimization opportunities. Communication pattern analysis and topology mapping experiments will be used to characterize scalability.

Wednesday 10:30 A.M. - 12:00 P.M., Hamburg: Technical Sessions (11A). Chair: Jason Hill (Oak Ridge National Laboratory)
Lustre at Petascale: Experiences in Troubleshooting and Upgrading. Matthew A. Ezell (Oak Ridge National Laboratory) and Richard F. Mohr, Ryan Braby and John Wynkoop (National Institute for Computational Sciences)
Abstract: Some veterans in the HPC industry semi-facetiously define supercomputers as devices that convert compute-bound problems into I/O-bound problems. Effective utilization of large high performance computing resources often requires access to large amounts of fast storage. The National Institute for Computational Sciences (NICS) operates Kraken, a 1.17 PetaFLOPS Cray XT5, for the National Science Foundation (NSF). Kraken's primary file system has migrated from Lustre 1.6 to 1.8 and is currently being moved to servers external to the machine. Additional bandwidth will be made available by mounting the NICS-wide Lustre file system. Newer versions of Lustre, beyond what Cray provides, are under evaluation for stability and performance. Over the past several years of operation, Kraken's Lustre file system has evolved to be extremely stable in an effort to better serve Kraken's users.

NetApp E-Series Storage Systems: The Lego Approach to HPC Storage. Didier Gava (NetApp, Inc.)
Abstract: Every storage vendor offers storage systems based on performance and capacity, but some vendors force their customers into accepting minimum, monolithic configurations that typically exceed a customer's current demand by a factor of at least two to three times or more.

Integrated Simulation of Object-Based File System for High-Performance Computing. Hao Zhang (University of Tennessee) and Haihang You and Mark Fahey (National Institute for Computational Sciences)
Abstract: Besides requiring significant computational power, a large-scale scientific computing application in high-performance computing (HPC) usually involves a large quantity of data. An inappropriate I/O configuration might severely degrade the performance of an application, thereby decreasing overall user productivity. Moreover, tuning the I/O performance of an application on a real file system of a supercomputer can be dangerous, expensive and time-consuming. Even at the application level, an improper I/O configuration might hinder the entire supercomputer. Also, a tuning and testing process always takes a long time and uses considerable computation and storage resources.

Wednesday 12:00 P.M. - 1:00 P.M., Restaurant Rôtisserie: Lunch. ANSYS, Sponsor
Wim Slagter (ANSYS)
Abstract: ANSYS brings clarity and insight to customers' most complex design challenges through fast, accurate and reliable engineering simulation. Our technology enables organizations, no matter their industry, to predict with confidence that their products will thrive in the real world. Customers trust our software to help ensure product integrity and drive business success through innovation.
Founded in 1970, ANSYS employs more than 2,000 professionals, many of them experts in engineering fields such as finite element analysis, computational fluid dynamics, electronics and electromagnetics, and design optimization. ANSYS is passionate about pushing the limits of world-class technology, all so our customers can turn their design concepts into successful, innovative products. ANSYS users today scale their largest simulations across thousands of processing cores, conducting simulations with more than a billion cells. They create incredibly dense meshes, model complex geometries, and consider complicated multiphysics phenomena. ANSYS is committed to delivering HPC performance and capability to take customers to new heights of simulation fidelity, engineering insight and continuous innovation. ANSYS partners with key hardware vendors such as Cray to ensure customers can get the most accurate solution in the fastest amount of time. The collaboration helps customers in all industries navigate the rapidly changing high-performance computing (HPC) landscape. ANSYS HPC products support highly scalable use of HPC, providing virtually unlimited access to HPC capacity for high-fidelity simulation within a workgroup or across a distributed enterprise, using local workstations, department clusters, or enterprise servers, wherever resources and people are located. HPC solutions from ANSYS enable enhanced engineering productivity by accelerating simulation throughput, enabling customers to consider more design ideas and make efficient product development decisions based on enhanced understanding of performance tradeoffs. The ANSYS approach to HPC licensing is cross-physics, providing customers with a single solution that can be leveraged across disciplines. Customers can 'buy once' and 'deploy once', getting more value from their investment in ANSYS. Our leadership in HPC is a differentiator that will return significant value to customers. Over the years, our steady growth and financial strength reflect our commitment to innovation and R&D. We reinvest 15 percent of our revenues each year into research to continually refine our software. We are listed on the NASDAQ stock exchange. Headquartered south of Pittsburgh, U.S.A., ANSYS has more than 60 strategic sales locations throughout the world with a network of channel partners in 40+ countries. Visit www.ansys.com for more information.

Wednesday 1:00 P.M. - 2:30 P.M., Köln: Technical Sessions (12C). Chair: John Noe (Sandia National Laboratories)
A Heat Re-Use System for the Cray XE6 and Future Systems at PDC, KTH. Gert Svensson (KTH/PDC) and Johan Söderberg (Hifab)
Abstract: The installation of a 16-cabinet Cray XE6 in 2010 at PDC was expected to increase the total power consumption from around 800 kW by an additional 500 kW. The intention was to recover some of the power cost and become more environmentally friendly by re-using the energy from the Cray to heat nearby buildings.

Analysis and Optimization of a Molecular Dynamics Code using PAPI and the Vampir Toolchain. Thomas William (Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)) and Robert Henschel and D. K. Berry (Indiana University)
Abstract: A highly diverse molecular dynamics program for the study of dense matter in white dwarfs and neutron stars was ported and run on a Cray XT5m using MPI, OpenMP and hybrid parallelization.
The ultimate goal was to find the best configuration of available code blocks, compiler flags and runtime parameters for the given architecture. The serial code analysis provided the best candidates for parallel parameter sweeps using different MPI/OpenMP settings. Using PAPI counters and applying the Vampir toolchain, a thorough analysis of the performance behavior was done. This step led to changes in the OpenMP part of the code, yielding higher parallel efficiency to be exploited on machines providing larger core counts. The work was done in a collaboration between PTI (Indiana University) and ZIH (Technische Universität Dresden) on hardware provided by the NSF-funded FutureGrid project.

Simulating Laser-Plasma Interactions in Experiments at the National Ignition Facility on a Cray XE6. Steven H. Langer, Abhinav Bhatele, G. Todd Gamblin, Charles H. Still, Denise E. Hinkel, Michael E. Kumbera, A. Bruce Langdon and Edward A. Williams (Lawrence Livermore National Laboratory)
Abstract: The National Ignition Facility (NIF) [1] is a high energy density experimental facility run for the National Nuclear Security Administration (NNSA) by Lawrence Livermore National Laboratory. NIF houses the world's most powerful laser. The National Ignition Campaign (NIC) has a goal of using the NIF laser to ignite a fusion target by the end of FY12. Achieving fusion ignition in the laboratory will be a major step towards fusion energy.

Wednesday 1:00 P.M. - 2:30 P.M., Bonn: Technical Sessions (12B). Chair: Larry Kaplan (Cray Inc.)
The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications. Richard Graham, Joshua Hursey, Geoffroy Vallee, Thomas Naughton and Swen Bohm (Oak Ridge National Laboratory)
Abstract: Exascale-targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is an MPI standard that lacks support for ABFT. The Run-Through Stabilization (RTS) proposal, under consideration for MPI 3, allows an application to continue execution when processes fail. The requirements of scalable, fault tolerant MPI implementations and applications will stress the capabilities of many system services. System services must evolve to efficiently support such applications and libraries in the presence of system component failure. This paper discusses how the RTS proposal impacts system services, highlighting specific requirements. Early experimentation results from Cray systems at ORNL using prototype MPI and runtime implementations are presented. Additionally, this paper outlines fault tolerance techniques targeted at leadership-class applications.

Leveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on the Cray XE. Howard Pritchard, Duncan Roweth, David Henseler and Paul Cassella (Cray Inc.)
Abstract: Cray has enhanced the Linux operating system with a Core Specialization (CoreSpec) feature that allows for differentiated use of the processor cores available on Cray XE compute nodes. With CoreSpec, most cores on a node are dedicated to running the parallel application while one or more cores are reserved for OS and service threads.
The MPICH2 MPI implementation has been enhanced to make use of this CoreSpec feature to better support MPI asynchronous progress. In this paper, we describe how the MPI implementation uses CoreSpec along with hardware features of the XE Gemini Network Interface to obtain overlap of MPI communication with computation for micro-benchmarks and applications. Debugging and Optimizing Scalable Applications on the Cray Chris Gottbrath (Rogue Wave Software) Abstract Abstract Cray XE6 and XK6 systems can deliver record-breaking computational power but only to applications that are error free and are optimized to take advantage of the performance that the system can deliver. The cycle of development, debugging and tuning is a constant task, especially when custom application developers implement new algorithms, simulate new physical systems, port software to leverage higher core count nodes or take advantage of accelerators, and scale their code to high and higher node, core or thread counts. Rogue Wave offers a powerful set of tools to aid in these efforts. ThreadSpotter pinpoints cache inefficiencies, educates and guides scientists and developers through the cache optimization process while TotalView provides scalable, bi-directional, parallel source code and memory debugging. Wednesday 1:00 P.M. - 2:30 P.M. Hamburg Technical Sessions (12A) Chair: Liam Forbes (Arctic Region Supercomputing Center) Blue Waters - A Super System for Super Challenges William Kramer (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract Blue Waters is being deployed in 2012 for diverse science and engineering challenges that require huge amounts of sustained performance with 25 teams already selected to run. This talk explains the goals and expectations of the Blue Waters Project and how the new Cray XE/XK/Gemini/Sonexion technologies fulfill these expectations. The talk covers how NCSA is verifying the system meet is requirements for a more than a sustained petaflop/s for real science applications. It discusses a significant ideas on creating new methods and algorithms to improve application codes to take full advantage of systems like Blue Waters, with particular attention for the areas of scalability, use of accelerators, simultaneous use of x86 and accelerated nodes within single codes and application resiliency and discusses experiences and status of the "early science" use at the time of CUG. The final part of the talk discusses lessons learned from the co-design efforts. Early experiences with the Cray XK6 hybrid CPU and GPU MPP platform Sadaf Alam, Jeffrey Poznanovic, Ugo Varetto and Nicola Bianchi (Swiss National Supercomputing Centre), Antonio Penya (UJI) and Nina Suvanphim (Cray Inc.) Abstract Abstract We report on our experiences of deploying, operating and benchmarking a Cray XK6 system, which is composed of AMD Interlagos and NVIDIA X2090 nodes and Gemini interconnect. Specifically we outline features and issues that are unique to this system in terms of system setup, configuration, programming environment and tools as compared to a Cray XE6 system, which is based also on AMD Interlagos (dual-socket) nodes and the Gemini interconnect. Micro-benchmarking results characterizing hybrid CPU and GPU performance and MPI communication between the GPU devices are presented to identify parameters that could influence the achievable node and parallel efficiencies on this hybrid platform. Titan: Early experience with the Cray XK6 at Oak Ridge National Laboratory Arthur S. 
Bland, Jack C. Wells, Otis E. Messer, II, Oscar R. Hernandez and James H. Rogers (Oak Ridge National Laboratory) Abstract Abstract In 2011, Oak Ridge National Laboratory began an upgrade to Jaguar to convert it from a Cray XT5 to a Cray XK6 system named Titan. This is being accomplished in two phases. The first phase, completed in early 2012, replaced all of the XT5 compute blades with XK6 compute blades, and replaced the SeaStar interconnect with Cray’s new Gemini network. Each compute node is configured with an AMD Opteron™ 6274 16-core processor and 32 gigabytes of DDR3-1600 SDRAM. The system aggregate includes 600 terabytes of system memory. In addition, the first phase includes 960 NVIDIA X2090 Tesla processors. In the second phase, ORNL will add NVIDIA’s next generation Tesla processors to increase the combined system peak performance to over 20 PFLOPS. This paper describes the Titan system, the upgrade process from Jaguar to Titan, and the challenges of developing a programming strategy and programming environment for the system. We present initial results of application performance on XK6 nodes. Wednesday 2:30 P.M. - 3:00 P.M. Maritim Foyer Break. Rogue Wave Software, Sponsor Abstract Abstract Rogue Wave Software, Inc. is the largest independent provider of cross-platform software development tools and embedded components for the next generation of HPC applications. Rogue Wave products reduce the complexity of prototyping, developing, debugging, and optimizing multi-processor and data-intensive applications. Rogue Wave customers are industry leaders in the Global 2000, ISVs, OEMs, government laboratories and research institutions that leverage computationally-complex and data-intensive applications to enable innovation and outperform competitors. Developing parallel, data-intensive applications is hard. We make it easier. Wednesday 3:15 P.M. - 10:00 P.M. Schloss Solitude CUG Night Out Thursday 8:30 A.M. - 10:00 A.M. Köln Technical Sessions (13C) Chair: Liz Sim (EPCC, The University of Edinburgh) A fully distributed CFD framework for massively parallel systems Jens Zudrop, Harald Klimach, Manuel Hasert, Kannan Masilamani and Sabine Roller (Applied Supercomputing in Engineering, German Research School for Simulation Sciences GmbH and RWTH Aachen University) Abstract Abstract A solver framework based on a linearized octree is presented. It allows for fully distributed computations and avoids special processes with potential bottlenecks, while enabling simulations with complex geometries. Scaling results on the Cray XE6 Hermit system at HLRS in Stuttgart are presented, with runs up to 3072 nodes with 98304 MPI processes. Even with fully indirect addressing, a high sustained performance of more than 9% can be reached on the system, enabling very large simulations. Two flow simulation methods are shown: a Finite Volume Method for compressible flows, and a Lattice Boltzmann Method for incompressible flows in complex geometries. Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters Guochun Shi (National Center for Supercomputing Applications), Steve Gottlieb (Indiana University) and Michael Showerman (National Center for Supercomputing Applications) Abstract Abstract Graphics Processing Units (GPU) are becoming increasingly popular in high performance computing due to their high performance, high power efficiency, and low cost. Lattice QCD is one of the fields that have successfully adopted GPUs and scaled to hundreds of them. In this paper, we report our Cray XK6 experience in profiling and understanding performance for MILC, one of the Lattice QCD computation packages, running on multi-node Cray XK6 computers using a domain-specific GPU library called QUDA. QUDA is a library for accelerating Lattice QCD computations on GPUs. It started at Boston University and has evolved into a multi-institution project. It supports multiple quark actions and has been interfaced to many applications, including MILC and Chroma. The most time-consuming part of lattice QCD computation is a sparse matrix solver, and QUDA supports efficient Conjugate Gradient (CG) and other solvers. By partitioning in the 4-D space-time domain, the solvers in the QUDA library enable the applications to scale to hundreds of GPUs with high efficiency. The other computation-intensive components, such as link fattening, gauge force and fermion force computations, have also been actively ported to GPUs.
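Since the dominant kernel cited above is a Conjugate Gradient solve, a bare-bones CG iteration is sketched here for reference. This is a generic, dense-matrix illustration in plain C rather than QUDA code; the matrix size, test system, tolerance and iteration cap are invented, and a real lattice QCD solver would apply the sparse Dirac operator on the GPU instead of the dense matrix-vector product shown.

/* Generic Conjugate Gradient for a symmetric positive-definite system A x = b.
 * Illustrative only: QUDA runs this kind of iteration on GPUs against the
 * sparse lattice operator, not a small dense matrix as here. */
#include <stdio.h>

#define N 64    /* problem size, chosen arbitrarily for the example */

static void matvec(const double A[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

static double dot(const double a[N], const double b[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

static void cg_solve(const double A[N][N], const double b[N], double x[N]) {
    double r[N], p[N], Ap[N];
    for (int i = 0; i < N; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rr = dot(r, r);

    for (int it = 0; it < 1000 && rr > 1e-12; it++) {
        matvec(A, p, Ap);                            /* the expensive kernel */
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}

int main(void) {
    static double A[N][N], b[N], x[N];
    /* Simple symmetric positive-definite test matrix (tridiagonal). */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 4.0 : ((i - j == 1 || j - i == 1) ? -1.0 : 0.0);
        b[i] = 1.0;
    }
    cg_solve(A, b, x);
    printf("x[0] = %f\n", x[0]);
    return 0;
}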
High-productivity Software Development for Accelerators Thomas Bradley (NVIDIA) Abstract Abstract Often, the simplest approach to using an accelerator is to call a pre-existing library. This talk will provide an overview of GPU-enabled libraries, their advantages over their CPU equivalents, and how to call them from several languages. The talk will also address code development in C++ and how the emerging Thrust template library provides key programmer benefit. We will demonstrate how to decompose problems into flexible algorithms provided by Thrust, and how implementations are fast and can remain concise and readable. Thursday 8:30 A.M. - 10:00 A.M. Bonn Technical Sessions (13B) Chair: Tina Butler (National Energy Research Scientific Computing Center) Expose, Compile, Analyze, Repeat: How to make effective use of Titan without programming in CUDA Robert M. Whitten (Oak Ridge National Laboratory) Abstract Abstract Reworking existing codes for GPU-based architectures is a daunting task. The OLCF has developed a methodology in partnership with its software vendor partners to eliminate the need to program in CUDA. This methodology involved exposing parallelism, compiling with directive-based tools, analyzing performance, and repeating the process where necessary. This paper explores the methodology with specific details of that process. Software Usage on Cray Systems across Three Centers (NICS, ORNL and CSCS) Bilel Hadri and Mark Fahey (National Institute for Computational Sciences), Timothy W. Robinson (Swiss National Supercomputing Centre) and William Renaud (Oak Ridge National Laboratory) Abstract Abstract In an attempt to better understand library usage and address the need to measure and monitor software usage and forecast requests, an infrastructure named the Automatic Library Tracking Database (ALTD) was developed and put into production on Cray XT and XE systems at NICS, ORNL and CSCS. The ALTD infrastructure prototype automatically and transparently stores information about libraries linked into an application at compilation time and also tracks the executables launched in a batch job. With the data collected, we can generate an inventory of all libraries and third-party software used during compilation and execution, whether they were installed by the vendor, the center’s staff, or the users in their own directories. We will illustrate the usage of libraries and executables on several Cray XT and XE machines (namely Kraken, Jaguar and Rosa).
We consider that an improved understanding of library usage could benefit the wider HPC community by helping to focus software development efforts toward the Exascale era. Running Large Scale Jobs on a Cray XE6 System Yun (Helen) He and Katie Antypas (Lawrence Berkeley National Laboratory) Abstract Abstract Users face various challenges with running and scaling large scale jobs on peta-scale production systems. For example, certain applications may not have enough memory per core, the default environment variables may need to be adjusted, or I/O may dominate the run time. Using real application examples, this paper will discuss some of the run time tuning options for running large scale pure MPI and hybrid MPI/OpenMP jobs successfully and efficiently on Hopper, the NERSC production XE6 system. These tuning options include MPI environment settings, OpenMP threads, memory affinity choices, and I/O file striping settings.
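For reference, the hybrid MPI/OpenMP jobs discussed above share the basic structure sketched below. This is a generic skeleton rather than one of the NERSC applications, and the comment about launcher-controlled placement is an assumption about typical Cray XE6 usage, not a statement from the paper.

/* Minimal hybrid MPI+OpenMP skeleton of the kind whose runtime tuning
 * (thread counts, affinity, MPI environment settings) such papers discuss.
 * Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Request FUNNELED threading: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Each OpenMP thread reports where it runs; in practice the
         * placement would typically be controlled by the job launcher's
         * depth/affinity options rather than by the code itself. */
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}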
Thursday 8:30 A.M. - 10:00 A.M. Hamburg Technical Sessions (13A) Chair: Liam Forbes (Arctic Region Supercomputing Center) Application Workloads on the Jaguar Cray XT5 System Wayne Joubert (Oak Ridge National Laboratory) and Shiquan Su (National Institute for Computational Sciences) Abstract Abstract In this study we investigate computational workloads for the Jaguar system during its tenure as a 2.3 petaflop system at Oak Ridge National Laboratory. The study is based on a comprehensive analysis of MOAB and ALPS job logs over this period. We consider Jaguar utilization over time, usage patterns by science domain, most heavily used applications and their usage patterns, and execution characteristics of selected heavily-used applications. Implications of these findings for future HPC systems are also considered. Understanding the effects of process placement on application performance on an AMD Interlagos processor Kalyana Chadalavada and Manisha Gajbe (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract We conduct a low-level analysis of possible resource contention on the Interlagos core modules using a compute-intensive kernel to exemplify target workloads. We will also characterize the performance of OpenMP threads in packed and unpacked configurations. By using Cray PAT tools and PAPI counters, we attempt to quantify bottlenecks to full utilization of the processors. Demonstrating which code constructs can achieve high levels of concurrent performance on packed integer cores on the module and which code constructs fare poorly in a packed configuration can help tune petascale-class applications. We use this information and attempt to understand and optimize the performance profile of a full-scale scientific application on a Cray XE6 system. PBS Professional 11: A Walkthrough of Architecture Improvements for Cray Users and Administrators Scott J. Suchyta and Lisa Endrjukaitis (Altair Engineering, Inc.) and Jason Coverston (Cray Inc.) Abstract Abstract Beginning with version 11, Altair has re-architected the Cray port for PBS Professional, its industry-leading workload management and job scheduling product. As a result, PBS Professional now offers Cray users a wider range of capabilities to extract every ounce of performance from their systems. This presentation will walk Cray users and administrators through the detailed changes from previous versions, focusing on what users need to know for a seamless upgrade. Topics covered will include robustness and scalability improvements, usage examples and tips, and lessons learned from initial deployments. The presentation will cover PBS’s topology-aware scheduling and how Cray users can leverage this to improve system utilization and throughput. The session will also touch on other new capabilities available with PBS Professional 11, including scheduling, submission and cold start improvements. Thursday 10:00 A.M. - 10:30 A.M. Maritim Foyer Break. NVIDIA Corporation, Sponsor Liza Gabrielson (NVIDIA) Abstract Abstract NVIDIA is the world leader in visual computing technologies and inventor of the GPU. NVIDIA® serves the high performance computing market with its Tesla™ GPU computing products, available from resellers including Cray. Based on the CUDA™ parallel computing platform, NVIDIA Tesla GPU computing products are companion processors to the CPU and designed from the ground up for HPC - to accelerate application performance. To learn more, visit www.nvidia.com/tesla. Thursday 10:30 A.M. - 12:00 P.M. Köln Technical Sessions (14C) Chair: Tina Butler (National Energy Research Scientific Computing Center) Swift - a parallel scripting language for petascale many-task applications Ketan Maheshwari (Argonne National Laboratory), Mihael Hategan and David Kelly (University of Chicago), Justin Wozniak (Argonne National Laboratory), Jon Monette, Lorenzo Pesce and Daniel Katz (University of Chicago), Michael Wilde (Argonne National Laboratory) and David Strenski and Duncan Roweth (Cray Inc.) Abstract Abstract Important science, engineering and data analysis applications increasingly need to run thousands or millions of small jobs, each using a compute core for seconds to minutes, in a paradigm called many-task computing. These applications can readily have computation needs that extend into extreme scales. Most petascale systems, however, only schedule jobs at the node level. While it is possible to run multiple small tasks on the same node using manually-written ad-hoc scripts, this is not very convenient, making petascale systems unattractive to many-task applications. Shared Library Performance on Hopper Zhengji Zhao (Lawrence Berkeley National Laboratory), Mike Davis (Cray Inc.) and Katie Antypas, Yushu Yao, Rei Lee and Tina Butler (Lawrence Berkeley National Laboratory) Abstract Abstract NERSC's petascale machine, Hopper, a Cray XE6, supports dynamic shared libraries through the DVS projection of the shared root file system onto compute nodes. The performance of the dynamic shared libraries is crucial to some of the NERSC workload, especially for those large scale applications that use Python as their front-end interface. The work we will present in this paper was motivated by reports from NERSC users stating that the performance of dynamic shared libraries is very poor at large scale, and hence it is not possible for them to run large Python applications on Hopper. In this paper, we will present our performance test results on the shared libraries on Hopper, using the standard Python benchmark code Pynamic and a NERSC user application code, WARP, and will also present a few options which we have explored and developed to improve the shared library performance at scale on Hopper. Our effort has enabled WARP to start up in 7 minutes at 40K core concurrency.
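The startup cost examined above is dominated by many processes resolving and loading shared objects. A minimal sketch of that dynamic-loading pattern is shown below; it is not taken from Pynamic or WARP, and the library name and symbol are invented for illustration.

/* Minimal dynamic-loading pattern of the kind Pynamic-style benchmarks stress.
 * Illustrative only: "libexample.so" and "example_init" are invented names. */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Each of many processes opens many shared objects at startup; at scale,
     * the metadata and read traffic this generates is what limits performance
     * on a projected shared root. */
    void *handle = dlopen("libexample.so", RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Resolve and call one symbol from the library. */
    void (*example_init)(void) = (void (*)(void))dlsym(handle, "example_init");
    if (example_init)
        example_init();

    dlclose(handle);
    return 0;
}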
The Effects of Compiler Optimizations on Materials Science and Chemistry Applications at NERSC Megan Bowling, Zhengji Zhao and Jack Deslippe (Lawrence Berkeley National Laboratory) Abstract Abstract Materials science and chemistry applications consume around 1/3 of the computing cycles each allocation year at NERSC. To improve the scientific productivity of users, NERSC provides a large number of pre-compiled applications on the Cray XE6 machine Hopper. Depending on the compiler, compiler flags and libraries used to build the codes, applications can have large differences in performance. In this paper, we compare the performance differences arising from the use of different compilers, compiler optimization flags and libraries available on Hopper over a set of materials science and chemistry applications that are widely used at NERSC. The selected applications are written in Fortran, C, C++, or a combination of these languages, and use MPI or other message passing libraries as well as linear algebra, FFT, and global array libraries. The compilers explored are the PGI, GNU, Cray, Intel and Pathscale compilers. Thursday 10:30 A.M. - 12:00 P.M. Bonn Technical Sessions (14B) Chair: Larry Kaplan (Cray Inc.) uRiKA: Graph Appliance for Relationship Analytics in Big Data Amar Shan (Cray Inc.) Abstract Abstract The Big Data challenge is ubiquitous in HPC sites, which commonly have data storage measured in tens of petabytes, doubling every two to three years. Transforming this data into knowledge is critical to continued progress. Blue Waters Testing Environment Joseph Muggli, Brett Bode, Torsten Hoefler, William Kramer and Celso L. Mendes (National Center for Supercomputing Applications/University of Illinois) Abstract Abstract Acceptance and performance testing are critical elements of providing and optimizing HPC systems for scientific users. This paper will present the design and implementation of the testing harness for the Blue Waters Cray XE6/XK6 being installed at NCSA/University of Illinois. The Blue Waters system will be a leading-edge system in terms of computational power, on- and off-line storage size and performance, external networking performance, and the breadth of software needed to support a diverse NSF user community. Such a large and broad environment must not only be fully validated for system acceptance, but also continually retested over time to avoid regressions in performance following new software installations or hardware failures. This frequency of testing demands an automated means for running the tests and validating the results, as well as tracking the results over time. Optimizing HPC and IT Efficiency from an ISV Perspective Wim Slagter (ANSYS) Abstract Abstract This presentation will show how the ANSYS engineering simulation platform can contribute to HPC & IT efficiency, and how our current solutions, partnerships, and roadmap can enable scalable, global deployment of simulation on internal or cloud-based HPC infrastructure. In addition, some recent ANSYS software advances in parallel scaling performance on Cray systems will be presented. Thursday 10:30 A.M. - 12:00 P.M.
Hamburg Technical Sessions (14A) Chair: Jason Hill (Oak Ridge National Laboratory) NCRC Grid Allocation Management Frank Indiviglio and Ron Bewtra (National Oceanic and Atmospheric Administration) Abstract Abstract In support of the NCRC, NOAA has deployed an accounting system for the purpose of coordinating HPC system usage between NOAA user centers and the NCRC located at Oak Ridge National Laboratory. This system provides NOAA with a centralized location for reporting and management of allocations on all production resources located at the NCRC and at NOAA laboratories. This paper describes the design, deployment, and details of the first year of production using this system. We shall also discuss future plans for extending its deployment to other NOAA sites in order to provide centralized reporting and management of system utilization for all HPC resources. Speed Job Completion with Topology-Based Intelligent Scheduling David Hill (Adaptive Computing) Abstract Abstract Leverage the combined power and scale of Cray’s highly advanced systems architecture to speed the completion of multi-node, parallel-processing jobs with the Moab® intelligence engine. In this session, you will see how topology-based scheduling permits a cluster user to intelligently schedule jobs on inter-communicating nodes close to each other, minimizing the overhead of message or information passing and/or data transfer. This enables jobs to complete in a shorter period than they would if the workloads used nodes spread across the cluster. Practical Support Solutions for a Workflow-Oriented Cray Environment Adam G. Carlyle, Ross G. Miller, Dustin B. Leverman, William A. Renaud and Don E. Maxwell (Oak Ridge National Laboratory) Abstract Abstract The National Climate-Computing Research Center (NCRC), a joint computing center between Oak Ridge National Laboratory (ORNL) and the National Oceanic and Atmospheric Administration (NOAA), employs integrated workflow software and data storage resources to enable production climate simulations on the Cray XT6/XE6 named "Gaea". The use of highly specialized workflow software and a necessary premium on data integrity together create a support environment with unique challenges. This paper details recent support efforts to improve the NCRC end-user experience and to safeguard the corresponding scientific workflow. Thursday 12:00 P.M. - 1:00 P.M. Restaurant Rôtisserie Lunch. NetApp Inc, Sponsor Dennis Watts (NetApp, Inc.) Abstract Abstract The NetApp® E5400 is a high-performance storage system that meets an organization’s demanding performance and capacity requirements without sacrificing simplicity and efficiency. Designed to meet wide-ranging requirements, its balanced performance is equally adept at supporting high-performance file systems, bandwidth-intensive streaming applications, and transaction-intensive workloads. The E5400’s multiple drive shelf options enable custom configurations that can be tailored for any environment. Thursday 1:00 P.M. - 2:30 P.M. Köln Technical Sessions (15C) Chair: Liam Forbes (Arctic Region Supercomputing Center) Threat Management and Incident Coordination between National Data Centers for Scientific Computing Urpo Kaila and Joni Virtanen (CSC - IT Center for Science Ltd) Abstract Abstract National data centers for scientific computing provide IT services for researchers, who primarily want reliable and flexible access to high performance computing.
Information security is typically given lower priority, at least until a security incident endangers user data and credentials or the general availability of site services. Early Applications Experience on the Cray XK6 at the Oak Ridge Leadership Computing Facility Arnold Tharrington, Hai Ah Nam, Wayne Joubert, W. Michael Brown and Valentine G. Anantharaj (Oak Ridge National Laboratory) Abstract Abstract In preparation for Titan, the next-generation hybrid supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), the existing 2.3 petaflops Jaguar system was upgraded from the XT5 architecture to the new Cray XK6. This system combines AMD’s 16-core Opteron 6200 processors, NVIDIA’s Tesla X2090 accelerators, and the Gemini interconnect. We present an early evaluation of OLCF’s Cray XK6, including results for microbenchmarks and kernel and application benchmarks. In addition, we show preliminary results from GPU-enabled applications. Thursday 1:00 P.M. - 2:30 P.M. Bonn Technical Sessions (15B) Chair: John Noe (Sandia National Laboratories) Early Application Experiences with the Intel MIC Architecture in a Cray CX1 R. Glenn Brook, Bilel Hadri, Vincent C. Betro, Ryan C. Hulguin and Ryan Braby (National Institute for Computational Sciences) Abstract Abstract This work details the early efforts of the National Institute for Computational Sciences (NICS) to port and optimize scientific and engineering application codes to the Intel Many Integrated Core (Intel MIC) architecture in a Cray CX1. After the configuration of the CX1 is presented, the successful porting of several application codes is described, and scaling results for the codes on the Intel Knights Ferry (Intel KNF) software development platform are presented. High-Performance Exact Diagonalization Techniques Sergei Isakov (ETH Zurich), William Sawyer, Gilles Fourestey and Adrian Tineo (Swiss National Supercomputing Centre) and Matthias Troyer (ETH Zurich) Abstract Abstract In this work we analyze Cray XE6/XK6 performance and scalability of Exact Diagonalization (ED) techniques for an interacting quantum system. Typical models give rise to a relatively sparse Hamiltonian matrix H. The Lanczos algorithm is then used to determine a few eigenstates. The sparsity pattern is irregular, and the underlying matrix-vector operator exhibits only limited data locality. By grouping the basis states in a smart way, each node needs to communicate with only an order O(log(p)) subset of nodes. The resulting hybrid MPI/OpenMP C++ implementation scales to large CPU configurations. We have also investigated one-sided communication paradigms, such as MPI-2, SHMEM and UPC. We present the results for various communication paradigms on the Cray XE6 at CSCS. Developing Integrated Data Services for Cray Systems with a Gemini Interconnect Ron Oldfield (Sandia National Laboratories), Todd Kordenbrock (Hewlett Packard) and Gerald Lofstead (Sandia National Laboratories) Abstract Abstract Over the past several years, there has been increasing interest in injecting a layer of compute resources between a high-performance computing application and the end storage devices. For some projects, the objective is to present the parallel file system with a reduced set of clients, making it easier for file-system vendors to support extreme-scale systems. In other cases, the objective is to use these resources as “staging areas” to aggregate data or cache bursts of I/O operations.
Still others use these staging areas for “in-situ” analysis of data in transit between the application and the storage system. To simplify our discussion, we adopt the general term “Integrated Data Services” to represent these use cases. This paper describes how we provide user-level, integrated data services for Cray systems that use the Gemini Interconnect. In particular, we describe our implementation and performance results on the Cray XE6, Cielo, at Los Alamos National Laboratory. Thursday 1:00 P.M. - 2:30 P.M. Hamburg Technical Sessions (15A) Chair: Tina Butler (National Energy Research Scientific Computing Center) A Single Pane of Glass: Bright Cluster Manager for Cray Matthijs van Leeuwen and Martijn de Vries (Bright Computing, Inc.) Abstract Abstract Bright Cluster Manager provides comprehensive cluster management for Cray systems in one integrated solution: deployment, provisioning, scheduling, monitoring, and management. Its intuitive GUI provides complete system visibility and ease of use for multiple clusters simultaneously, including automated tasks and intervention. Bright also provides a powerful cluster management shell for those who prefer to manage via a command-line interface. Real Time Analysis and Event Prediction Engine Joseph 'Joshi' Fullop, Ana Gainaru and Joel Plutchak (National Center for Supercomputing Applications) Abstract Abstract The cost of operating extreme scale supercomputers such as Blue Waters is high and growing. Predicting failures and reacting accordingly can prevent the loss of compute hours and their associated power and cooling costs. Forecasting the general state of the system and predicting an exact failure event are two distinct ways to accomplish this. We have addressed the latter with a system that uses a self-modifying template algorithm to tag event occurrences. This enables fast mining and identification of correlated event sequences. The analysis is visually displayed using directed graphs to show the interrelationships between events across all subsystems. The system as a whole is self-updating, functions in real time, and is planned to be used as a core monitoring component of the Blue Waters supercomputer at NCSA. Node Health Checker Kent J. Thomson (Cray Inc.) Abstract Abstract The Node Health Checker (NHC) component runs after job failures to take out of service compute nodes that are likely to cause future jobs to fail. Before NHC can take nodes out of the availability pool, however, it must run some tests on them to assess their health. While these tests are running, the nodes being tested cannot have new jobs run on them. This period of time is known as "Normal Mode". By decreasing the average time of normal mode, job throughput can be increased. A performance investigation into the average run time of NHC normal mode showed that, instead of scaling logarithmically with the number of nodes being tested, it scaled linearly, which becomes much slower at larger node counts. By localizing and fixing the bug causing the improper scaling, the normal mode run time of node health was decreased by, in the best case, 100x. The analytical techniques involved in identifying the scaling behavior will be shown, including curve fitting and performance extrapolation using software tools. Additionally, the method of isolating the location of the bug by testing the different pieces of NHC separately will be discussed. Once the source of the poor scaling is revealed to be calls to an external program for each node being tested, the fix of caching the required information at NHC startup in an intelligent manner is explained.
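Distinguishing linear from logarithmic scaling in measurements like these comes down to fitting both models and comparing residuals. The sketch below illustrates that step generically; it is not the tooling used in the paper, and the sample node counts and timings are invented.

/* Fit measured run times t(n) against two candidate scaling models,
 *   t = a*n + b        (linear)
 *   t = a*log(n) + b   (logarithmic)
 * using ordinary least squares, and report which fits better.
 * The data points are invented for illustration. */
#include <math.h>
#include <stdio.h>

/* Least-squares fit of y = a*x + b; returns the sum of squared residuals. */
static double fit(const double *x, const double *y, int m, double *a, double *b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < m; i++) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *a = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    *b = (sy - *a * sx) / m;
    double ss = 0;
    for (int i = 0; i < m; i++) {
        double r = y[i] - (*a * x[i] + *b);
        ss += r * r;
    }
    return ss;
}

int main(void) {
    const double nodes[]  = { 128, 256, 512, 1024, 2048 };
    const double time_s[] = { 2.1, 4.0, 8.3, 16.5, 33.2 };   /* invented timings */
    const int m = 5;

    double logn[5], a, b;
    for (int i = 0; i < m; i++) logn[i] = log(nodes[i]);

    double ss_lin = fit(nodes, time_s, m, &a, &b);
    printf("linear fit:      t = %.4f*n + %.2f, residual %.3f\n", a, b, ss_lin);

    double ss_log = fit(logn, time_s, m, &a, &b);
    printf("logarithmic fit: t = %.4f*log(n) + %.2f, residual %.3f\n", a, b, ss_log);

    printf("better model: %s\n", ss_lin < ss_log ? "linear" : "logarithmic");
    return 0;
}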
Thursday 2:30 P.M. - 2:45 P.M. Maritim Foyer Break. Thursday 2:45 P.M. - 3:15 P.M. Köln / Bonn / Hamburg Closing General Session (16) Chair: Nick Cardo (National Energy Research Scientific Computing Center)