CUG 2012 Final Proceedings | Created 2012-05-13
Sunday, April 29th 6:00PM - 7:00PM Welcome Reception Maritim Lobby Special Event | Monday, April 30th 8:00AM - 10:30AM Tutorial (1A) Hamburg Mark Fahey Application developmen... John Levesque and Jeff Larkin (Cray Inc.) This tutorial will address porting and optimizing an application to the Cray XK6. Given the heterogeneous architecture of this system, its effective utilization is only achieved by refactoring the application to exhibit three levels of parallelism. The first level is the typical intra-node MPI parallelism which is already present in those applications being run on XT systems. The other two levels, shared memory parallelism on the node and vectorization or single instruction multiple threads (SIMT), will be where the major programming challenges arise. The tutorial will take the approach of starting from an all-MPI application and rewriting it to exhibit the other two levels of parallelism. The instruction will include the use of the Cray GPU programming environment to isolate important sections of code for parallelism on the node. Once the code has been hybridized with the introduction of OpenMP on the node, the utilization of the accelerator can be done with the newly announced OpenACC directives and/or using CUDA or CUDA Fortran. Once an accelerated version of the code is developed, statistics-gathering tools can be used to identify bottlenecks and optimize data transfer, vectorization and memory utilization. Real-world examples will be employed to illustrate the advantages of the approach. Comparisons will be given between the use of the OpenACC directives and CUDA. Instructions for using all of the available tools from Cray and NVIDIA will be given. Tutorial
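As a rough, generic illustration of the three levels of parallelism this tutorial targets (MPI between processes, OpenMP threading across the cores of a node, and OpenACC offload to the accelerator), a minimal C sketch might look as follows; the arrays, sizes and loop are hypothetical and are not taken from the tutorial material.

/* Hypothetical sketch: three levels of parallelism on a Cray XK6-style node.
 * Build with an MPI wrapper around an OpenMP/OpenACC-capable compiler. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);                  /* level 1: MPI parallelism */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double local = 0.0, global = 0.0;
    for (long i = 0; i < N; i++) { a[i] = rank + 1.0; b[i] = 0.5 * i; }

#ifdef _OPENACC
    /* Level 3: offload the loop to the accelerator with OpenACC. */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) reduction(+:local)
    for (long i = 0; i < N; i++) local += a[i] * b[i];
#else
    /* Level 2: shared-memory parallelism across the node's CPU cores. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < N; i++) local += a[i] * b[i];
#endif

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("dot product = %f\n", global);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}

In a real port the OpenMP and OpenACC versions would usually coexist in the same source, with profiling tools used afterwards to examine data transfers and occupancy, as the tutorial describes.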
Tutorial (1B) Bonn Jason Hill Lustre 2.x Architecture Johann Lombardi (Whamcloud, Inc.) This tutorial will review the new architecture in Lustre 2.x, including the new features in Lustre 2.0 - 2.2. Some of the features covered will be Imperative Recovery, Wide Striping, Parallel Directory Operations, and a new metadata performance tool called mds-survey. The architecture discussion will highlight the significant changes in Lustre since 1.8, and will briefly cover some of the important changes to be delivered in the 2.3 and 2.4 releases. This class will also present development and process guidelines that the community follows when contributing work to Lustre. Tutorial Tutorial (1C) Köln John Noe Introduction to Debugg... David Lecomber (Allinea Software) The art of debugging HPC software has come a long way in recent years - and in this tutorial, we will show how Allinea DDT can be used to debug MPI and accelerator code on the Cray XK6. There will be walk-throughs and hands-on opportunities to work with some typical bug scenarios - showing how easily debuggers can resolve software problems and passing on some of the "tricks of the trade" to give users practical experience. There will also be demonstrations of debugging at Petascale, exploring how debuggers can get to grips with software issues in applications at scale. Tutorial 10:30AM - 11:00AM Break. Whamcloud, Sponsor Maritim Foyer Dan Ferber (Whamcloud) Whamcloud was established in 2010 by High-Performance Computing experts Brent Gorda and Eric Barton when they recognized that future advances in computational performance were going to require a revolutionary advance in parallel storage. Whamcloud's vision is to evolve the state of parallel storage by focusing strategically on high performance and cloud computing applications with demanding requirements for scalability. 
Whamcloud provides: (1) a tested community Lustre release tree and releases to the entire Lustre community, (2) Level 3 support for Lustre to Whamcloud's customers and partners, (3) feature development under contracts, and (4) training and professional services. Break 11:00AM - 12:00PM Opening General Session (2) Köln / Bonn / Hamburg Nick Cardo CUG Welcome Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group 2012 Welcome The Future of HPC Michael M. Resch (High Performance Computing Center Stuttgart) Scalability is considered to be the key factor for supercomputing in the coming years. A quick look at the TOP500 list shows that the level of parallelism has started to increase at a much faster pace than anticipated 20 years ago. However, there are a number of other issues that will have a substantial impact on HPC in the future. This talk will address some of these issues. It will look into current hardware and software strategies and evaluate the potential of concepts like co-design. Furthermore, it will investigate the potential use of HPC in non-traditional fields like business intelligence. Invited Talk 12:00PM - 1:00PM Lunch. Xyratex Technology L... Restaurant Rôtisserie Lunch 1:00PM - 2:30PM Technical Sessions (3A) Hamburg Liam Forbes Cray OS Road Map Reliability and Resili... Steven J. Johnson (Cray Inc.) In 2011, Cray Inc. continued to observe improving reliability trends on XE6 systems and increased use of system resiliency capabilities such as Warm Swap. Late in the year, XK6-based systems began shipping to customers, as did XE6 systems with the next-generation AMD processors. This paper will discuss the reliability trends observed on all of these systems through 2011 and into early 2012 and examine the major factors affecting system-wide outages and the occurrence of node drops in systems. The differences in site operations, such as the frequency of scheduled maintenance, will be explored to see what impact, if any, this may have on overall system reliability and availability. Finally, the paper will explore where, when and how Warm Swap is being used on Cray systems and its overall effectiveness in maximizing system availability. pdf, pdf Online Diagnostics at... Don Maxwell (Oak Ridge National Laboratory) and Jeff Becklehimer (Cray Inc.) The Oak Ridge Leadership Computing Facility (OLCF) housed at the Oak Ridge National Laboratory recently acquired a 200-cabinet Cray XK6. The computer will primarily provide capability computing cycles to the U.S. Department of Energy (DOE) Office of Science INCITE program. The OLCF has a tradition of installing very large computer systems requiring unique methods in order to achieve production status in the most expeditious and efficient manner. This paper will explore some of the methods that have been used over the years at OLCF to eliminate both early-life hardware failures and ongoing failures, giving users a more stable machine for production. pdf, pdf Paper Technical Sessions (3B) Bonn Rolf Rabenseifner Developing hybrid Open... Xiaohu Guo (Science and Technology Facilities Council), Gerard Gorman (Department of Earth Science and Engineering, Imperial College London, London SW7 2AZ, UK) and Andrew Sunderland and Mike Ashworth (Science and Technology Facilities Council) Most modern high performance computing platforms can be described as clusters of multi-core compute nodes. The trend for compute nodes is towards greater numbers of lower-power cores, with a decreasing memory to core ratio. This is imposing a strong evolutionary pressure on numerical algorithms and software to efficiently utilise the available memory and network bandwidth. Unstructured finite element codes have long been effectively parallelised with domain decomposition methods using libraries such as the Message Passing Interface. However, there are many algorithmic and implementation optimisation opportunities when threading is used for intra-node parallelisation on the latest multi-core/many-core platforms, for example reduced memory requirements, cache sharing, a reduced number of partitions and less MPI communication. While OpenMP is promoted as being easy to use and allows incremental parallelisation of codes, naive implementations frequently yield poor performance. In practice, as with MPI, equal care and attention should be exercised over algorithm and hardware details when programming with OpenMP. In this paper, we report progress implementing hybrid OpenMP-MPI for finite element matrix assembly within the unstructured finite element application software named Fluidity. The OpenMP parallel algorithm uses graph colouring to identify independent sets of elements that can be assembled simultaneously with no race conditions. Unstructured finite element codes are well known to be memory bound; therefore, particular attention is paid to ccNUMA architectures, where data locality is particularly important to achieve good intra-node scaling characteristics. The profiling and benchmark results on the latest Cray platforms show that the best performance can be achieved by pure OpenMP within a node. Keywords: Fluidity; FEM; OpenMP; MPI; ccNUMA; Graph Colouring; pdf, pdf
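The colouring idea described in this abstract can be sketched briefly: if elements are grouped into colours so that no two elements of the same colour touch the same global degree of freedom, each colour can be assembled with an ordinary parallel loop and no atomic updates. The data layout below is a hypothetical simplification, not Fluidity's.

/* Hedged sketch of colour-by-colour finite element assembly with OpenMP.
 * colour_start/colour_elems (element ids grouped by colour), element_dof and
 * element_contribution are hypothetical structures; within one colour no two
 * elements share a global dof, so the scatter-add is race-free without atomics. */
#include <omp.h>

void assemble(int num_colours,
              const int *colour_start,            /* size num_colours + 1 */
              const int *colour_elems,            /* element ids, grouped by colour */
              const int *element_dof,             /* ndof_per_elem dofs per element */
              int ndof_per_elem,
              const double *element_contribution, /* per-element local vector */
              double *global_vector)
{
    for (int c = 0; c < num_colours; c++) {
        /* Elements of the same colour are independent: parallelise freely. */
        #pragma omp parallel for
        for (int k = colour_start[c]; k < colour_start[c + 1]; k++) {
            int e = colour_elems[k];
            for (int j = 0; j < ndof_per_elem; j++) {
                int dof = element_dof[e * ndof_per_elem + j];
                global_vector[dof] += element_contribution[e * ndof_per_elem + j];
            }
        }
        /* The implicit barrier at the end of the parallel loop keeps colours ordered. */
    }
}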
Porting and Optimizing... Florian Hanke (Max-Planck-Institut fuer Astrophysik), Andreas Marek (Rechenzentrum Garching) and Bernhard Mueller and Hans-Thomas Janka (Max-Planck-Institut fuer Astrophysik) Supernova explosions are among the most powerful cosmic events, whose physical mechanism and consequences are still incompletely understood. We have developed a fully MPI-OpenMP parallelized version of our VERTEX-PROMETHEUS code in order to perform three-dimensional simulations of stellar core-collapse and explosion on Tier-0 systems such as Hermit at HLRS. Tests on up to 64,000 cores have shown excellent scaling behavior. In this paper we will present our progress in porting, optimizing, and performing production runs on a large variety of machines, starting from vector machines and reaching to modern systems such as the new Cray XE6 system in Stuttgart. Paper Technical Sessions (3C) Köln Liz Sim Comparing One-Sided Co... Christopher M. Maynard (University of Edinburgh) Two-sided communication, with its linked send and receive message construction, has been the dominant communication pattern of the MPI era. With the rise of multi-core processors and the consequent dramatic increase in the number of computing cores in a supercomputer, this dominance may be at an end. One-sided communication is often cited as part of the programming paradigm which would alleviate the punitive synchronisation costs of two-sided communication for an exascale machine. This paper compares the performance of one-sided communication in the form of put and get operations for MPI, UPC and Cray SHMEM on a Cray XE6, using the Cray C compiler. This machine has support for Remote Memory Access (RMA) in hardware, and the Cray C compiler supports UPC as well as providing environment support for SHMEM. A distributed hash table application is used to test the performance of the different approaches, as this requires one-sided communication. pdf, pdf
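For readers unfamiliar with the one-sided style being compared here, a minimal MPI-2 remote-memory-access put looks roughly like the following (UPC and SHMEM express the same idea with shared arrays and put/get calls); the buffer names and the neighbour exchange are illustrative only.

/* Minimal illustration of one-sided communication with MPI-2 RMA:
 * every rank writes one value into its right neighbour's window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = -1;                       /* memory exposed to remote puts */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int right = (rank + 1) % size;
    int value = rank;

    MPI_Win_fence(0, win);                /* open an access epoch */
    MPI_Put(&value, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                /* complete all puts */

    printf("rank %d received %d from its left neighbour\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}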
Balancing shared memor... Ahmad Anbar, Olivier Serres, Asila Wati, Lubomir Riha and Tarek El-Ghazawi (The George Washington University) While many-core processors have huge performance potential, that potential can be wasted if the programming is not done carefully. Because of its locality awareness, PGAS has been able to achieve scalability at the cluster level. We believe that as chips grow bigger in terms of core count, PGAS will still be a good fit for intra-node programming. One unclear area is which mechanism is suitable for the PGAS model to rely upon for intra-node communication. Basically, there are two mechanisms: processes and threads. This is an important decision, as a large percentage of a program's overall communication cost occurs within the nodes. As a case study, we evaluated the performance of several UPC applications and synthetic micro-benchmarks. We evaluated the performance when the communication within nodes was based on threads, processes, or a mixture of the two. Finally, we recommend general guidelines and draw our main conclusions. Performance of Fortran... David Henty (EPCC, The University of Edinburgh) Coarrays are a feature of the Fortran 2008 standard that enable parallelism using a small number of additional language elements. The execution model is that of a Partitioned Global Address Space (PGAS) language. The Cray XE architecture is particularly interesting for studying PGAS languages: it scales to very large numbers of processors; the underlying GEMINI interconnect is ideally suited to the PGAS model of direct remote memory access; the Cray compilers support PGAS natively. In this paper we present a detailed analysis of the performance of key coarray operations on XE systems including the UK national supercomputer HECToR, a 90,000-core Cray XE6 operated by EPCC at the University of Edinburgh. The results include a wide range of communication patterns and synchronisation methods relevant to real applications. Where appropriate, these are compared to the equivalent operation implemented using MPI. pdf, pdf Paper 2:30PM - 3:00PM Break. Bright Computing, Sp... Maritim Foyer Mark Blessing (Bright Computing) Bright Computing specializes in management software for clusters, grids and clouds, including compute, storage, Hadoop and database systems. Bright Cluster Manager's fundamental approach and intuitive interface make cluster management easy, while providing powerful and complete management capabilities for increasing productivity. Bright Cluster Manager now provides cloud-bursting capabilities into Amazon EC2, managing these external nodes as if they were part of the on-site system. Cray is an authorized Bright reseller. This partnership enables Cray to resell Bright Cluster Manager, as well as include the product as an integral feature of its high-performance external solutions for large HPC installations. Bright Cluster Manager provides a centralized monitoring and management solution for power management, image management, trouble-shooting, system provisioning, workload management and system health monitoring. Cray leverages Bright Cluster Manager's capabilities to offer its customers a combination of these services and additional services, such as automated Lustre server failover. 
Bright Cluster Manager is the solution of choice for many research institutes, universities, and companies across the world, and is used to manage several Top500 installations. Bright Computing has its headquarters in San Jose, California. http://www.brightcomputing.com Break 3:00PM - 4:00PM Interactive Session (4A) Hamburg Nick Cardo Open discussion with C... Nick Cardo (National Energy Research Scientific Computing Center) This interactive session is an open discussion with the CUG Board. Birds of a Feather Interactive Session (4B) Bonn Helen He Programming Environmen... Helen He (National Energy Research Scientific Computing Center) This is an interactive session to discuss topics within the Programming Environments, Applications and Documentation Special Interest Group. Birds of a Feather Interactive Session (4C) Köln Tara Fly Getting Up and Running... Tara Fly (Cray Inc.) This Birds of a Feather session allows users and administrators of Cluster Compatibility Mode to trade experiences, tips, techniques and feedback to Cray technical personnel. Birds of a Feather 4:30PM - 10:00PM Das Stuttgarter Frühlingsfe... Cannstatter Wasen Special Event | Tuesday, May 1st 8:30AM - 10:00AM General Session (5) Köln / Bonn / Hamburg David Hancock Cray Corporate Update Peter Ungaro (Cray Inc.) Cray Corporate Update HPC Systems Peter Ungaro (Cray Inc.) Cray HPC Systems Update Storage and Data Manag... Barry Bolding (Cray Inc.) Cray Storage and Data Management Update Invited Talk 10:00AM - 10:30AM Break. Altair Corporation,... Maritim Foyer Mary Bass (Altair Engineering, Inc.) PBS Works(tm), Altair's suite of on-demand cloud computing technologies, allows companies to maximize ROI on existing Cray systems. PBS Works is the most widely implemented software environment for managing grid, cloud, and cluster computing resources worldwide. The suite's flagship product, PBS Professional(r), allows Cray users and administrators to easily share distributed computing resources across geographic boundaries. With additional tools for portal-based submission, analytics, and data management, the PBS Works suite is a comprehensive solution for optimizing your Cray environment. Leveraging a revolutionary "pay-for-use" unit-based business model, PBS Works delivers increased value and flexibility over conventional software-licensing models. To learn more, please visit www.pbsworks.com. Altair Engineering, Inc., empowers client innovation and decision-making through technology that optimizes the analysis, management and visualization of business and engineering information. With a 26-year-plus track record for high-end software and consulting services for engineering, computing and enterprise analytics, Altair consistently delivers a competitive advantage to customers in a broad range of industries. Altair has offices throughout North America, South America, Europe and Asia/Pacific. To learn more, please visit www.altair.com. Break 10:30AM - 12:00PM General Session (6) Köln / Bonn / Hamburg David Hancock From PetaScale to ExaS... Wolfgang E. Nagel (Technische Universität Dresden) Parallelism and scalability have become major issues in all areas of Computing — nowadays pretty much everybody, even beyond the field of classical HPC, uses parallel codes. Nevertheless, the number of cores on a single chip – homogeneous as well as heterogeneous cores – is significantly increasing. Soon, we will have millions of cores in one HPC system. 
The ratios between flops and memory size, as well as bandwidth for memory, communication, and I/O, will worsen. At the same time, the need for energy might be extraordinary, and the best programming paradigm is still unclear. Furthermore, we have reached a point where data becomes the primary challenge, be it complexity, size, or rate of the data acquisition. This talk will describe technology developments, software requirements, and other related issues to identify challenges for the HPC community, which have to be carefully addressed – and solved – within the next couple of years. 1 on 100 or More Peter Ungaro (Cray Inc.) Open discussion with Cray CEO. No other Cray employees or Cray partners are permitted during this session. Invited Talk 12:00PM - 1:00PM Lunch. Adaptive Computing,... Restaurant Rôtisserie Starla Mehaffey (Adaptive Computing) Adaptive Computing manages the world’s largest computing installations with its Moab® self-optimizing cloud management and HPC workload management solutions. The patented Moab multi-dimensional intelligence engine delivers policy-based governance, allowing customers to consolidate resources, allocate and manage services, optimize service levels and reduce operational costs. Our leadership in IT decision engine software has been recognized with over 45 patents and over a decade of battle-tested performance resulting in a solid Fortune 500 and Top500 supercomputing customer base. The Moab intelligence engine is unique in its ability to accelerate and automate both complex IT decisions and processes through multi-dimensional policies. Only Moab can automate decisions and processes across business priorities and SLAs, current and future time horizons, and heterogeneous physical and virtual resources and management tools, as well as many other dimensions. Adaptive Computing’s mission is to bring higher levels of decision, control, and self-optimization to the challenges of deploying and managing large and complex IT environments so they accelerate business performance at a reduced cost. Customers look to Adaptive Computing to solve today’s complex management problems so they can lower costs, improve efficiency and service levels, and accelerate the IT that powers their business. Adaptive Computing products offer solutions to key challenges including: • Speeding the delivery of IT services to the business • Improving IT flexibility to meet SLA’s and priorities • Reducing capital costs by maximizing resource utilization • Reducing operating costs by eliminating manual management across heterogeneous IT • Managing IT service and resource usage cost transparency • Reducing instability and disruptive errors in IT services Adaptive’s Moab products accelerate, automate, and self-optimize IT workloads. Built for high scale, Moab meets the challenges of today’s complex HPC and cloud computing environments. Moab acts as a brain on top of existing infrastructure, enabling computing systems to self-optimize and deliver higher return on investment. The Moab product family includes: • Moab Cloud Suite for self-optimizing cloud management • Moab HPC Suite for self-optimizing HPC workload management The company’s global headquarters is in Provo, Utah (USA), with European offices in the United Kingdom and Asia Pacific offices in Singapore. This enables Adaptive Computing to deliver products and solutions to customers around the globe with local region sales as well as consulting and support services to ensure success. 
The company currently has over 120 employees and has grown steadily every year since inception to meet growing customer demands and needs. Lunch 1:00PM - 2:30PM Technical Sessions (7A) Hamburg Jason Hill Cray's Lustre Support... Cory Spitz (Cray Inc.) Cray continues to deploy and support Lustre as the file system of choice for all of our systems. As such, Cray is committed to developing Lustre and ensuring its continued success on our platforms. This paper will discuss Cray's Lustre deployment model, and how it both ensures a stable Lustre version and enables productivity. It will also outline how we work with the Lustre community through OpenSFS. Finally, it will roll out our updated Lustre roadmap, which includes Lustre 2.2 and Linux 3.0. pdf, pdf Lustre Roadmap and Rel... Dan Ferber (Whamcloud) Whamcloud, sponsored by OpenSFS, produces Lustre releases in addition to providing Lustre development and support. This includes patch landings, testing, packaging, and release for the Lustre community. As an OpenSFS board-level member and contributor, Cray plays a key role in helping support that activity. This presentation reviews the current Whamcloud Lustre roadmap, test reporting, and release schedules. DDN Exascale Direction... Keith Miller (DataDirect Networks) Very large compute environments are facing unprecedented challenges with respect to the storage systems that support them. In this talk, DDN - the world leader in massively scalable HPC storage technology - will discuss solutions to Petascale & Exascale I/O challenges and opportunities driven by the rise of trends such as: the continued expansion of file stripe sizes on larger pools of commodity technology, disk performance improvements which are disproportionate to CPU performance, scalable storage system usability, the advent of Big Data analytics for HPC and the emergence of GeoDistributed Object Storage as a viable platform for next-generation computing and Big Data collaboration. Additionally, information will be provided on DDN's forthcoming product portfolio updates and deployment experience in massively scalable Cray environments. Paper Technical Sessions (7B) Bonn Ashley Barker Accelerated Debugging:... David Lecomber (Allinea Software) The ability to debug at Petascale is now a reality for homogeneous systems such as the Cray XE6, and is a vital part of producing software that works. Developers are using Allinea DDT to debug their MPI codes regularly at Petascale - with an interface that is responsive and intuitive even at this extreme size. With the arrival of the Cray XK6, applications are changing to involve GPU acceleration, and the need for debugging remains. This paper will discuss the results of work at Allinea to prepare for systems such as Titan, including adding support in Allinea DDT for the OpenACC model provided by the Cray Compiler Environment and for ensuring scalability in hybrid systems. Third Party Tools for... Richard Graham, Oscar Hernandez, Christos Kartsaklis, Joshua Ladd and Jens Domke (Oak Ridge National Laboratory) and Jean-Charles Vasnier, Stephane Bihan and Georges-Emmanuel Moulard (CAPS Enterprise) Over the past few years, as part of the Oak Ridge Leadership Class Facility project (OLCF-3), Oak Ridge National Laboratory (ORNL) has been engaged with several third-party tool vendors with the aim of enhancing the tool offerings for ORNL's GPU-based platform, Titan. 
This effort has resulted in enhancements to CAPS' HMPP compiler, Allinea's DDT debugger, and the Vampir suite of performance analysis tools from the Technische Universität Dresden. In this paper we will discuss the latest enhancements to these tools, and their impact on applications as ORNL readies Titan for full-scale production as a GPU-based heterogeneous system. pdf, pdf The Eclipse Parallel T... Jay Alameda and Jeffrey L. Overbey (National Center for Supercomputing Applications/University of Illinois) Eclipse is a widely used, open source integrated development environment that includes support for C, C++, Fortran, and Python. The Parallel Tools Platform (PTP) extends Eclipse to support development on high performance computers. PTP allows the user to run Eclipse on her laptop, while the code is compiled, run, debugged, and profiled on a remote HPC system. PTP provides development assistance for MPI, OpenMP, and UPC; it allows users to submit jobs to the remote batch system and monitor the job queue; and it provides a visual parallel debugger. In this talk, we will demonstrate the capabilities we have added to PTP to support Blue Waters, the Cray XE6/XK6 system being installed at NCSA. These capabilities include submission and monitoring of ALPS jobs, support for OpenACC, and integration with Cray compilers. We will describe ongoing work and directions for future collaboration, including integration with CrayPat, Loopmark compiler feedback, and parallel debugger integration. pdf, pdf Paper Technical Sessions (7C) Köln Mark Fahey Case Studies in Deploy... Tara Fly, David Henseler and John Navitsky (Cray Inc.) Cray's addition of the Data Virtualization Service (DVS) and Dynamic Shared Libraries (DSL) to the Cray Linux Environment (CLE) software stack provides the foundations necessary for shared library support. The Cluster Compatibility Mode (CCM) feature introduced with CLE 3 completes the picture and allows Cray to provide "out-of-the-box" support for independent software vendor (ISV) applications built for Linux-x86 clusters. Cluster Compatibility Mode enables far greater workload flexibility, including the installation and execution of ISV applications and the use of various third-party MPI implementations, which necessitates a corresponding increase in complexity in system administration and site integration. This paper explores the CCM architecture and a number of case studies from early deployment of CCM into user environments, sharing best practices learned, in the hope that sites can leverage these experiences for future CCM planning and deployment. pdf, pdf Cray Cluster Compatibi... Zhengji Zhao, Yun (Helen) He and Katie Antypas (Lawrence Berkeley National Laboratory) Cluster Compatibility Mode (CCM) is a Cray software solution that provides the services needed to run most cluster-based independent software vendor (ISV) applications on the Cray XE6. CCM is of importance to NERSC because it can enable user applications that require TCP/IP support, which form an important part of the NERSC workload, on NERSC's Cray XE6 machine Hopper. Gaussian and NAMD replica exchange simulations are two important application examples that cannot run on Hopper without CCM. In this paper, we will present our CCM performance evaluation results on Hopper and describe how CCM has been explored and utilized at NERSC. We will also discuss the benefits and issues of enabling CCM on the petascale production Hopper system. pdf, pdf My Cray can do that?... Richard S. 
Canon, Jay Srinivasan and Lavanya Ramakrishnan (Lawrence Berkeley National Laboratory) The Cray XE architecture has been optimized to support tightly coupled MPI applications, but there is an increasing need to run more diverse workloads in the scientific and technical computing domains. Can platforms like the Cray XE line play a role here? In this paper, we will describe tools we have developed to support genomic analysis and other data-intensive applications on NERSC's Hopper system. These tools include a custom task farmer framework, tools to create virtual private clusters on the Cray, and the use of Cray's Cluster Compatibility Mode (CCM) to support more diverse workloads. In addition, we will describe our experience with running Hadoop, a popular open-source implementation of MapReduce, on Cray systems. We will present our experiences with this work, including successes and challenges. Finally, we will discuss future directions and how the Cray platforms could be further enhanced to support this class of workloads. pdf, pdf Paper 2:30PM - 3:00PM Break. Allinea Software, Sp... Maritim Foyer Break 3:00PM - 5:00PM Technical Sessions (8A) Hamburg Tina Butler Xyratex ClusterStor Ar... Torben Kling Petersen (Xyratex) As the size, performance, and reliability requirements of HPC storage systems increase exponentially, building solutions utilizing practices and philosophies that have existed for over five years is no longer adequate or efficient. While some instability of HPC systems was tolerable in the past, commercial and lab HPC environments now require enterprise-level stability and reliability for their petascale systems. In order to meet these industry requirements, Xyratex architected an innovative Lustre-based HPC storage solution known as ClusterStor. The ClusterStor solution utilizes enterprise-grade storage and software components, fully automated installation procedures, and rigorous testing procedures prior to shipping out to customers in order to drive the highest levels of reliability for growing and evolving HPC environments. Minimizing Lustre ping... Cory Spitz, Nic Henke, Doug Petesch and Joe Glenski (Cray Inc.) Cray is committed to pushing the boundaries of scale of its deployed Lustre file systems, in terms of both client count and the number of Lustre server targets. However, scaling Lustre to such great heights presents a particular problem with the Lustre pinger, especially with routed LNET configurations used on so-called external Lustre file systems. There is an even greater concern for LNETs with finely grained routing. The routing of small messages must be improved; otherwise Lustre pings have the potential to 'choke out' real bulk I/O, an effect we call 'dead time'. Pings also contribute to OS jitter, so it is important to minimize their impact even if the scale threshold that disrupts real I/O has not been reached. Moreover, the Lustre idle pings are an issue even for very busy systems because each client must ping every target. This paper will discuss the techniques used to illustrate the problem and best practices for avoiding the effects of Lustre pings. pdf, pdf Cray Sonexion Hussein Harake (CSCS) During SC11 Cray announced a new innovative HPC data storage solution named Cray Sonexion. CSCS installed an early Sonexion system in December 2011; the system is connected to a development Cray XE6 machine. 
The purpose of the study is to evaluate this product, covering installation, configuration and tuning, including the Lustre file system, and its integration with the Cray XE6. pdf, pdf A Next-Generation Para... Galen Shipman, David Dillow, Douglas Fuller, Raghul Gunasekaran, Jason Hill, Youngjae Kim, Sarp Oral, Doug Reitz, James Simmons and Feiyi Wang (Oak Ridge National Laboratory) When deployed in 2008/2009 the Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) was the world's largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Spider was designed to support the parallel I/O requirements of the Jaguar XT5 system and other smaller-scale platforms at the OLCF, but the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of its original design by mid-2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth, from over 240 GB/sec today to 1 TB/sec, and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership scale. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies. pdf, pdf Paper Technical Sessions (8B) Bonn Rolf Rabenseifner The Cray Programming E... Luiz DeRose (Cray Inc.) The scale of current and future high end systems, as well as the increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to achieve high performance on peta-scale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high end HPC systems. Users must be supported by intelligent compilers, automatic performance analysis tools, adaptive libraries, and scalable software. In this talk I will present the recent activities and future directions of the Cray Programming Environment that are being developed and deployed to improve users' productivity on the Cray XE and XK Supercomputers. Cray Performance Measu... Heidi Poxon (Cray Inc.) The Cray Performance Measurement and Analysis Tools have been enhanced to support whole program analysis on Cray XK systems. The focus of support is on the new directive-based OpenACC programming model, helping users identify key performance bottlenecks within their X86/GPU hybrid programs. Advantages of the Cray tools include summarized and consolidated performance data beneficial for analysis of programs that use a large number of nodes and GPUs, statistics for the whole program mapped back to user source by line number, GPU statistics grouped by accelerated region, as well as the X86 statistics traditionally provided by the Cray performance tools. 
This paper discusses these enhancements, including support to help users add increased levels of parallelism to their MPI applications through OpenMP or OpenACC. Cray Scientific Librar... Adrian Tate (Cray Inc.) Cray scientific libraries are relied upon to extract the maximum performance from a Cray system and so must be optimized for the Gemini network, the Interlagos and Magny-Cours processors, and now also for NVIDIA accelerators. In this talk I will discuss the scientific libraries that are available on each product, basic usage, how the different library components are optimized and what advanced performance controls are available to the user. In particular I will describe the new CrayBLAS library, which has a radically different internal structure to previous BLAS libraries, and I will talk in detail about libsci for accelerators, which provides both simple usage and advanced hybrid performance on the XK6. I will detail some communications optimization of our FFT library using Co-array Fortran, and I will also discuss upcoming libsci features and improvements. Applying Automated Opt... Thomas Edwards (Cray Inc.) Porting and optimising applications to a new processor architecture, a different compiler or the introduction of new features in the software or hardware environment can generate a large number of new parameters that have the potential to affect application performance. Vendors attempt to provide sensible defaults that perform well in general, for example grouping compiler optimisations into flag groupings and setting the default values of environment variables, but these defaults are inevitably based on the experience gained from, or the expected behaviour of, a typical application. In many cases applications will exhibit some behaviour that differs from the norm, for example requiring identical floating point results when changing MPI decompositions, or sending or receiving messages of unusual or irregular sizes. Manually finding the combination of flags and environment variables that provides optimum performance whilst maintaining a set of application-specific criteria can be time-consuming and tedious, so in many cases programmers opt to automate the optimisation process, using the computer to find an optimal solution. There are, however, a wide variety of potential algorithms and techniques that can be employed to perform the search, each with various merits and suitability to the problem of optimising an HPC application. This paper explores, evaluates and compares techniques for the automated optimisation of HPC application parameters within fixed numbers of iterations, focusing specifically on the properties of HPC applications. Drawing on the author's practical experience with real-world applications, the cost in compute resources compared to the runtime improvements gained is evaluated and considered. pdf, pdf Paper
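As a generic illustration of the kind of fixed-budget search such a framework might perform (the flag strings, the measure_runtime() hook and the budget below are hypothetical and are not the author's implementation):

/* Hedged sketch: choose the best-performing flag combination within a fixed
 * number of trial builds/runs. measure_runtime() stands in for rebuilding and
 * timing the real application; here it only fakes a measurement. */
#include <stdio.h>
#include <stdlib.h>

static double measure_runtime(const char *flags)
{
    (void)flags;                       /* placeholder for build + run + time */
    return 100.0 + (rand() % 1000) / 100.0;
}

int main(void)
{
    const char *candidates[] = {       /* illustrative flag strings only */
        "-O2", "-O3", "-O3 -funroll-loops", "-O3 -ffast-math", "-Ofast"
    };
    const int ncand = (int)(sizeof(candidates) / sizeof(candidates[0]));
    const int budget = 4;              /* fixed number of iterations */
    int best = 0;
    double best_time = 1e300;

    srand(12345);
    for (int iter = 0; iter < budget; iter++) {
        int c = rand() % ncand;        /* random search; hill climbing or genetic
                                          algorithms are obvious alternatives */
        double t = measure_runtime(candidates[c]);
        if (t < best_time) { best_time = t; best = c; }
    }
    printf("best flags within budget: %s (%.2f s)\n", candidates[best], best_time);
    return 0;
}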
Technical Sessions (8C) Köln Liz Sim Porting and optimisati... Pier Luigi Vidale (NCAS-Climate, Dept. of Meteorology, Univ. of Reading, UK), Malcolm Roberts and Matthew Mizielinski (Met Office Hadley Centre, UK), Simon Wilson (Met Office, UK / NERC CMS), Grenville Lister (NERC CMS, Univ. of Reading), Oliver Darbyshire (Met Office, UK) and Tom Edwards (Cray Centre of Excellence for HECToR) We present porting, optimisation and scaling results from our work with the United Kingdom's Unified Model on a number of massively parallel architectures: the UK MONSooN and HECToR systems, the German HERMIT and the French Curie supercomputer, part of PRACE. The model code used for this project is a configuration of the Met Office Unified Model (MetUM) called Global Atmosphere GA3.0, in its climate mode (HadGEM3). Initial development occurred on a NERC-MO joint facility, MONSooN, with 29 IBM-P6 nodes, using 12 nodes. In parallel with this activity, we have tested the model on the NERC/EPSRC supercomputer, HECToR (Cray XE6), using 1,536 to 24,576 cores, on a special DEISA account. The scaling breakthroughs came after implementing hybrid parallelism: OpenMP and MPI. The model scales effectively up to 12,244 cores and has now been successfully ported to Curie and HERMIT. Further optimisation will focus on turnaround. Adaptive and Dynamic L... Celso L. Mendes (University of Illinois), Eduardo R. Rodrigues (IBM-Research, Brazil), Jairo Panetta (CPTEC/INPE, Brazil) and Laxmikant V. Kale (University of Illinois) Climate and weather forecasting models require large processor counts on current supercomputers. However, load imbalance in these models may limit their scalability. We address this problem using AMPI, an MPI implementation based on the Charm++ infrastructure, where MPI tasks are implemented as user-level threads that can dynamically migrate across processors. In this paper, we explore an advanced load balancer, based on an adaptive scheme that frequently monitors the degree of load imbalance, but only takes corrective action (i.e. migrates work from one processor to another) when that action is expected to be profitable for subsequent time-steps in the execution. We present experimental results obtained on Cray systems with BRAMS, a mesoscale weather forecasting model. They reflect a trade-off between maintaining load balance and minimizing migration costs during rebalancing. Given the deployment of large systems at CPTEC and at Illinois, this novel load balancing mechanism will become a critical contribution to the effective use of those systems. pdf, pdf Porting the Community... Matthew Norman (Oak Ridge National Laboratory), Jeffrey Larkin (Cray Inc.), Richard Archibald (Oak Ridge National Laboratory), Ilene Carpenter (National Renewable Energy Laboratory), Valentine Anantharaj (Oak Ridge National Laboratory), Paulius Micikevicius (NVIDIA) and Katherine Evans (Oak Ridge National Laboratory) Here we describe our XK6 porting efforts for the Community Atmosphere Model – Spectral Element (CAM-SE), a large Fortran climate simulation code base developed by multiple institutions. Including more advanced physics and aerosols in future runs will address key climate change uncertainties and socioeconomic impacts. This, however, requires transporting up to order 100 quantities (called "tracers") used in new physics and chemistry packages, consuming upwards of 85% of the total CAM runtime. Thus, we focus our GPU porting efforts on the transport routines. In this paper, we discuss data structure changes that allowed sufficient thread-level parallelism, reduction in PCI-e traffic, tuning of the individual kernels, analysis of GPU efficiency metrics, timing comparison with best-case CPU code, and validation of accuracy. We believe these experiences are unique, interesting, and valuable to others undertaking similar porting efforts. pdf, pdf
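To make the PCI-e point concrete, the general technique of keeping arrays resident on the accelerator across repeated kernel launches can be sketched with an OpenACC data region, as below; the arrays, sizes and toy update are hypothetical and are not CAM-SE code (CAM-SE itself is Fortran).

/* Hedged sketch: keep tracer data resident on the GPU across time steps so
 * that only the initial and final copies cross the PCI-e bus. */
#include <stdlib.h>

#define NTRACER 100
#define NCELL   4096

void transport(double *q, const double *flux, int nsteps)
{
    /* One data region for the whole integration: q is copied in once and
     * copied out once; flux is read-only on the device. */
    #pragma acc data copy(q[0:NTRACER*NCELL]) copyin(flux[0:NCELL])
    for (int step = 0; step < nsteps; step++) {
        #pragma acc parallel loop collapse(2) present(q, flux)
        for (int t = 0; t < NTRACER; t++)
            for (int i = 0; i < NCELL; i++)
                q[t * NCELL + i] += 0.1 * flux[i];   /* toy update only */
    }
}

int main(void)
{
    double *q = calloc((size_t)NTRACER * NCELL, sizeof(double));
    double *flux = calloc(NCELL, sizeof(double));
    transport(q, flux, 10);
    free(q); free(flux);
    return 0;
}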
Performance Evaluation... Christoph Niethammer (High Performance Computing Center Stuttgart) Today Molecular Dynamics (MD) simulations are a key tool in many research and industry areas: biochemistry, solid-state physics and chemical engineering, to mention just a few. While in the past MD was a playground for some very simple problems, the ever-increasing compute power of supercomputers allows more and more complex problems to be handled: larger numbers of particles and more sophisticated molecular models that were too compute-intensive in the past. In this paper we present performance studies and results obtained with the ls1-MarDyn MD code on the new Hermit system (Cray XE6) at HLRS. The code's scalability up to the full system with 100,000 cores is discussed, as well as a comparison to other platforms. Furthermore, we present a detailed code analysis using the Cray software environment. From the obtained results we discuss further improvements which will be indispensable for upcoming systems in the post-petascale era. pdf, pdf Paper 5:00PM - 5:00PM Break Maritim Foyer Break 5:00PM - 5:45PM Interactive Session (9A) Hamburg Joni Virtanen System Support SIG Cary Whitney (National Energy Research Scientific Computing Center) This is a meeting of the Systems Support Special Interest Group. Birds of a Feather Interactive Session (9B) Bonn David Wallace Removing Barriers to A... David Wallace (Cray Inc.) Application developers are often faced with having to work around hardware- (or software-) imposed system limitations. These compromises can require the adoption of sub-optimal algorithms or the use of approaches that prevent peak application performance from being obtained. Cray is gathering requirements for implementation consideration in future systems. The intent of this moderated BoF session is to identify barriers in hardware and software that impact optimal application algorithms and affect achieving peak performance or impact application development productivity. Birds of a Feather Interactive Session (9C) Köln Jim Rogers (Invitation Only) Meth... Jim Rogers (Oak Ridge National Laboratory) An invitation-only BoF for XK6 owners, Cray, and NVIDIA to discuss methods and mechanisms for measuring the use and utilization of accelerators in XK6 systems. Birds of a Feather 6:30PM - 9:30PM Cray Social Cray invites all registered CUG 2012 attendees (badge required) and their guests to a dinner reception at the Vinum im Literaturhaus restaurant (http://www.vinum-im-literaturhaus.de). Vinum is located within walking distance of the Maritim hotel and conference center at Breitscheidstraße 4. Special Event | Wednesday, May 2nd 8:30AM - 10:00AM General Session (10) Köln / Bonn / Hamburg Nick Cardo CUG Business Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group Business & Elections PRACE for Science and... Richard Kenway (University of Edinburgh) The Partnership for Advanced Computing in Europe was established as an international non-profit association, PRACE AISBL, in 2010 to create a pan-European supercomputing infrastructure for large-scale scientific and industrial research at the highest performance level. It has 24 member states and currently allocates petascale resources in France, Germany, Italy and Spain through world-wide open competition. This talk will describe the successes of PRACE so far and its vision for the future. 
CUG Business Nick Cardo (National Energy Research Scientific Computing Center) Cray User Group Business & Elections Invited Talk 10:00AM - 10:30AM Break. The Portland Group,... Maritim Foyer Pat Brooks (The Portland Group) The Portland Group® (a.k.a. PGI®) is a premier supplier of software compilers and tools for parallel computing. PGI's goal is to provide the highest performance, production-quality compilers and software development tools. The Portland Group offers high performance scalar and parallel Fortran, C and C++ compilers and tools for workstations, servers and clusters based on: • 64-bit x86 (x64) processors from Intel (Intel 64) and AMD (AMD64) • NVIDIA CUDA-enabled GPGPUs • Linux, MacOS and Windows operating systems PGI offers native scalar and parallelizing compiler products for the following high-level languages: • Fortran 2003, OpenMP 3.0 compliant, GPU-enabled • ANSI C99 extensions, OpenMP 3.0 compliant, GPU-enabled • ANSI/ISO C++, OpenMP 3.0 compliant PGI Unified Binary™ technology enables applications built with PGI compilers to execute efficiently and produce accurate results on either Intel or AMD CPU-based systems, and to dynamically detect and use NVIDIA GPU accelerators when available. With uniform features and capabilities across operating systems, PGI products enable application development and optimization on platforms ranging from mobile laptops to the world's fastest supercomputers. GPU Programming --------------- PGI is the only independent supplier of compilers to provide all of the following capabilities for performing optimized integrated native compilation for all x86+NVIDIA accelerator platforms: • Global optimization, inter-procedural optimization, vectorization, shared-memory parallelization. • Profile-feedback optimization and heterogeneous parallel code-generation capabilities. • No external pre-processor dependence. In addition, the PGI Fortran compiler includes support for CUDA Fortran extensions. Co-defined by NVIDIA and PGI, CUDA Fortran enables explicit GPU accelerator programming through direct control of all aspects of data movement and offloading of compute-intensive functions. The PGI Fortran and C compilers also include support for the PGI Accelerator programming model, an implicit high-level model where offloading of compute-intensive code regions from a host CPU to an accelerator is accomplished using Fortran directives or C pragmas. The PGI Accelerator programming model includes support for the OpenACC 1.0 standard for directive-based GPU programming. Programs written using directives retain portability to other platforms and other compilers. PGI Products ------------ • PGI Workstation™ – single-user node-locked license • PGI Server™ – multi-user network-floating license • PGI CDK® Cluster Development Kit® – multi-user network-floating license with scalable MPI debugger and profiler • PGI Visual Fortran® – PGI Fortran integrated with Microsoft Visual Studio; available in single-user and multi-user licenses, and as part of the PGI CDK for Windows. PGI Tools --------- In addition to the full suite of parallel language compilers, all PGI products contain the PGDBG® OpenMP/MPI graphical parallel debugger and the PGPROF® OpenMP/MPI/GPU performance profiler. PGI offers the only multi-core x64 parallel compilers, debugger and profiler available with parallelization support integrated directly into the compilers, debugger and profiler. This enables faster development, higher performance and much higher reliability for the programmer. 
Further Information ------------------- PGI offers an unrestricted free trial license. Registration is required. Follow this link to get started now: https://www.pgroup.com/account/register.php. Break 10:30AM - 12:00PM Technical Sessions (11A) Hamburg Jason Hill Lustre at Petascale: E... Matthew A. Ezell (Oak Ridge National Laboratory) and Richard F. Mohr, Ryan Braby and John Wynkoop (National Institute for Computational Sciences) Some veterans in the HPC industry semi-facetiously define supercomputers as devices that convert compute-bound problems into I/O-bound problems. Effective utilization of large high performance computing resources often requires access to large amounts of fast storage. The National Institute for Computational Sciences (NICS) operates Kraken, a 1.17 PetaFLOPS Cray XT5 for the National Science Foundation (NSF). Kraken's primary file system has migrated from Lustre 1.6 to 1.8 and is currently being moved to servers external to the machine. Additional bandwidth will be made available by mounting the NICS-wide Lustre file system. Newer versions of Lustre, beyond what Cray provides, are under evaluation for stability and performance. Over the past several years of operation, Kraken's Lustre file system has evolved to be extremely stable in an effort to better serve Kraken's users. pdf, pdf NetApp E-Series Storag... Didier Gava (NetApp, Inc.) Every storage vendor offers storage systems based on performance and capacity, but some vendors force their customers into accepting minimum, monolithic configurations that typically exceed a customer's current demand by a factor of two to three or more. Cray offers a proven formula for forecasting actual storage performance and capacity for Lustre-based systems, allowing the customer to expand the resulting configurations just in time, meeting required current performance and capacity levels while protecting available budgets and power envelopes. To reach current performance requirements with less gear, we offer a specific, calculated throughput per drive, which represents the highest performance per drive on the market today. Customers can grow with cost-effective, small, modular building blocks based on actual needs rather than carrying expensive, unused, large minimum configurations imposed by some other storage vendors. Integrated Simulation... Hao Zhang (University of Tennessee) and Haihang You and Mark Fahey (National Institute for Computational Sciences) Besides requiring significant computational power, a large-scale scientific computing application in high-performance computing (HPC) usually involves a large quantity of data. An inappropriate I/O configuration might severely degrade the performance of an application, thereby decreasing overall user productivity. Moreover, tuning the I/O performance of an application on a real file system of a supercomputer can be dangerous, expensive and time-consuming. Even at the application level, an improper I/O configuration might hinder the entire supercomputer. Also, a tuning and testing process always takes a long time and uses considerable computation and storage resources. In order to allow a user to evaluate the I/O performance of a job before its execution, an integrated simulator is developed in this work to simulate an object-based parallel file system, such as the Lustre file system, along with its workload. 
Our ultimate objective is to achieve automatic tuning of a job's I/O configuration at the application level, by running a parameter optimization framework over the file system simulator, in order to provide specific information, such as the number of processors that perform I/O, to a user to improve the I/O performance of the job. In this work, an integrated object-based parallel file system simulator is implemented, which integrates both an object-based parallel file system simulation (OBPFS) and a virtual client generator (VCG). The OBPFS is designed as a collection of abstract functional models, which work in a coordinated and concurrent fashion to simulate important behaviors of a real object-based file system. The VCG is developed to continuously provide virtual clients to the OBPFS with a pattern similar to a real-world supercomputer workload. When developing the integrated simulator, we tried to balance realism and simplicity, which allows the simulator to model a massively parallel file system with millions of I/O operations from hundreds of clients concurrently, and to obtain an acceptable simulation result within an acceptable amount of time. We also tried to implement the simulator to be modular, extensible, scalable and portable, making it easier to understand and to adapt to simulate other similar systems. Although the proposed simulator is designed based on the architecture of the Lustre file system, it should be applicable to other file systems with similar properties. Experimental results using the proposed simulator are presented in this paper and compared with actual test results from the Kraken supercomputer, a Cray XT5 system running the Lustre file system. Paper Technical Sessions (11B) Bonn Helen He Open MPI for Cray XE/X... Manjunath Gorentla Venkata and Richard L. Graham (Oak Ridge National Laboratory) and Nathan T. Hjelm and Samuel K. Gutierrez (Los Alamos National Laboratory) Open MPI provides an implementation of the MPI standard supporting communications over a range of high-performance network interfaces. Recently, ORNL and LANL have collaborated on creating a port of Open MPI for Gemini, the network interface for Cray XE and XK systems. In this paper, we present our design and implementation of Open MPI's point-to-point and collective operations for Gemini, and the techniques we employ to provide good scaling and performance characteristics. pdf, pdf Early Results from the... Scott Hemmert (Sandia National Laboratories), Duncan Roweth (Cray Inc.) and Richard Barrett (Sandia National Laboratories) In spring 2010, the Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos and Sandia National Laboratories, initiated the ACES Interconnection Network Project focused on a potential future interconnection network. The intent of the project is to analyze potential capabilities for inclusion in Pisces that would result in significant performance benefits for a suite of ASC applications. This paper will describe the simulation framework used for the project, as well as present a selection of initial research results. We show that the Dragonfly network topology is well suited to ASC applications and that adaptive routing provides significant performance benefits. Analyses and Modeling... Gregory H. Bauer (National Center for Supercomputing Applications), Torsten Hoefler (National Center for Supercomputing Applications/University of Illinois), William Kramer (National Center for Supercomputing Applications) and Robert A. Fiedler (Cray Inc.) The sustained petascale performance of the Blue Waters system, a US National Science Foundation (NSF) funded petascale computing resource, will be demonstrated using a suite of applications representing a wide variety of disciplines important to the science and engineering communities of the NSF: Lattice Quantum Chromodynamics (MILC), Materials Science (QMCPACK), Geophysical Science (H3D(M) and SPECFEM3D), Atmospheric Science (WRF), and Computational Chemistry (NWCHEM). We will discuss the performance of these applications on the Blue Waters hardware and provide simple performance models that allow us to predict the sustained performance of the applications running at full scale. Several performance metrics will be used to identify optimization opportunities. Communication pattern analysis and topology mapping experiments will be used to characterize scalability. pdf, pdf Paper
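As a generic illustration of the kind of simple model such studies rely on (this is not the authors' actual model), sustained performance is often estimated by separating computation from communication:

T(P) \approx \frac{W}{P\,r} + \alpha m + \frac{V}{\beta},
\qquad
S(P) = \frac{W}{T(P)},

where W is the useful floating-point work, P the number of cores, r the effective per-core rate, m the number of messages with per-message latency \alpha, V the communicated volume and \beta the effective bandwidth; if communication is overlapped with computation, the sum is replaced by a maximum.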
Technical Sessions (11C) Köln Ashley Barker The PGI Fortran and C9... Brent Leback, Michael Wolfe and Douglas Miles (The Portland Group) Abstract: This paper and talk provide an introduction to programming accelerators using the PGI OpenACC implementation in Fortran and C. It is suitable for application programmers who are not expert GPU programmers. The paper compares the use of the Parallel and Kernels constructs and provides guidelines for their use. Examples of interoperating with lower-level explicit GPU languages will be shown. The material covers version 1.0 features of the language API, interpreting compiler feedback, and performance analysis and tuning. This talk includes a live component with a demo application running on a Windows laptop. pdf, pdf Performance Studies of... Mike Ashworth (Science and Technology Facilities Council) An open question is whether future applications targeting multi-petaflop systems with many-core nodes will best be served by the conventional approach of the hybrid MPI-OpenMP programming model or whether global address space languages, such as Co-Array Fortran (CAF), can offer equivalent performance with a simpler, more robust and maintainable programming interface. We will show performance results from a stand-alone but representative CFD code (the Shock Boundary Layer Interaction code) for which we have implementations using both programming models. Using the UK's HECToR Cray XE6 system, we shall investigate issues such as multi-threading scalability on the node and the optimization of the numbers of OpenMP threads and MPI tasks for the hybrid code, as well as the efficiency of the CAF code, which is expected to benefit from the improved implementation of single-sided messaging in the Gemini network. Tools for Benchmarking... Mitesh R. Meswani, Laura Carrington and Allan Snavely (San Diego Supercomputer Center) and Stephen Poole (Oak Ridge National Laboratory) Cray's SHMEM communication library provides a low-latency one-sided communication paradigm for parallel applications to coordinate their activity. Hence a trace of SHMEM calls is an important tool for understanding and tuning the communication performance of SHMEM applications. Towards this end we present a suite of tools to benchmark, trace, and simulate SHMEM communication speedily and accurately. 
Paper 12:00PM - 1:00PM Lunch. ANSYS, Sponsor Restaurant Rôtisserie Wim Slagter (ANSYS) ANSYS brings clarity and insight to customers' most complex design challenges through fast, accurate and reliable engineering simulation. Our technology enables organizations ― no matter their industry ― to predict with confidence that their products will thrive in the real world. Customers trust our software to help ensure product integrity and drive business success through innovation. Founded in 1970, ANSYS employs more than 2,000 professionals, many of them experts in engineering fields such as finite element analysis, computational fluid dynamics, electronics and electromagnetics, and design optimization. ANSYS is passionate about pushing the limits of world-class technology, all so our customers can turn their design concepts into successful, innovative products. ANSYS users today scale their largest simulations across thousands of processing cores, conducting simulations with more than a billion cells. They create incredibly dense meshes, model complex geometries, and consider complicated multiphysics phenomena. ANSYS is committed to delivering HPC performance and capability to take customers to new heights of simulation fidelity, engineering insight and continuous innovation. ANSYS partners with key hardware vendors such as Cray to ensure customers can get the most accurate solution in the shortest amount of time. The collaboration helps customers in all industries navigate the rapidly changing high-performance computing (HPC) landscape. ANSYS HPC products support highly scalable use of HPC - providing virtually unlimited access to HPC capacity for high-fidelity simulation within a workgroup or across a distributed enterprise, using local workstations, department clusters, or enterprise servers, wherever resources and people are located. HPC solutions from ANSYS enable enhanced engineering productivity by accelerating simulation throughput, enabling customers to consider more design ideas and make efficient product development decisions based on enhanced understanding of performance tradeoffs. The ANSYS approach to HPC licensing is cross-physics, providing customers with a single solution that can be leveraged across disciplines. Customers can ‘buy once’ and ‘deploy once’, getting more value from their investment in ANSYS. Our leadership in HPC is a differentiator that will return significant value to customers. Over the years, our steady growth and financial strength reflect our commitment to innovation and R&D. We reinvest 15 percent of our revenues each year into research to continually refine our software. We are listed on the NASDAQ stock exchange. Headquartered south of Pittsburgh, U.S.A., ANSYS has more than 60 strategic sales locations throughout the world with a network of channel partners in 40+ countries. Visit www.ansys.com for more information. Lunch 1:00PM - 2:30PM Technical Sessions (12A) Hamburg Liam Forbes Blue Waters - A Super...
William Kramer (National Center for Supercomputing Applications/University of Illinois) Blue Waters is being deployed in 2012 for diverse science and engineering challenges that require huge amounts of sustained performance, with 25 teams already selected to run. This talk explains the goals and expectations of the Blue Waters Project and how the new Cray XE/XK/Gemini/Sonexion technologies fulfill these expectations. The talk covers how NCSA is verifying that the system meets its requirement of more than a sustained petaflop/s for real science applications. It discusses significant ideas for creating new methods and algorithms to improve application codes to take full advantage of systems like Blue Waters, with particular attention to scalability, the use of accelerators, the simultaneous use of x86 and accelerated nodes within single codes, and application resiliency, and it reviews experiences and the status of the "early science" use at the time of CUG. The final part of the talk discusses lessons learned from the co-design efforts. Early experiences with... Sadaf Alam, Jeffrey Poznanovic, Ugo Varetto and Nicola Bianchi (Swiss National Supercomputing Centre), Antonio Penya (UJI) and Nina Suvanphim (Cray Inc.) We report on our experiences of deploying, operating and benchmarking a Cray XK6 system, which is composed of AMD Interlagos and NVIDIA X2090 nodes and the Gemini interconnect. Specifically, we outline features and issues that are unique to this system in terms of system setup, configuration, programming environment and tools as compared to a Cray XE6 system, which is also based on AMD Interlagos (dual-socket) nodes and the Gemini interconnect. Micro-benchmarking results characterizing hybrid CPU and GPU performance and MPI communication between the GPU devices are presented to identify parameters that could influence the achievable node and parallel efficiencies on this hybrid platform. pdf, pdf Titan: Early experien... Arthur S. Bland, Jack C. Wells, Otis E. Messer, II, Oscar R. Hernandez and James H. Rogers (Oak Ridge National Laboratory) In 2011, Oak Ridge National Laboratory began an upgrade to Jaguar to convert it from a Cray XT5 to a Cray XK6 system named Titan. This is being accomplished in two phases. The first phase, completed in early 2012, replaced all of the XT5 compute blades with XK6 compute blades, and replaced the SeaStar interconnect with Cray’s new Gemini network. Each compute node is configured with an AMD Opteron™ 6274 16-core processor and 32 gigabytes of DDR3-1600 SDRAM. The system aggregate includes 600 terabytes of system memory. In addition, the first phase includes 960 NVIDIA X2090 Tesla processors. In the second phase, ORNL will add NVIDIA’s next generation Tesla processors to increase the combined system peak performance to over 20 PFLOPS. This paper describes the Titan system, the upgrade process from Jaguar to Titan, and the challenges of developing a programming strategy and programming environment for the system. We present initial results of application performance on XK6 nodes. pdf, pdf Paper Technical Sessions (12B) Bonn Larry Kaplan The Impact of a Fault... Richard Graham, Joshua Hursey, Geoffroy Vallee, Thomas Naughton and Swen Bohm (Oak Ridge National Laboratory) Exascale-targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution.
Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is the MPI standard, which currently lacks support for ABFT. The Run-Through Stabilization (RTS) proposal, under consideration for MPI 3, allows an application to continue execution when processes fail. The requirements of scalable, fault-tolerant MPI implementations and applications will stress the capabilities of many system services. System services must evolve to efficiently support such applications and libraries in the presence of system component failure. This paper discusses how the RTS proposal impacts system services, highlighting specific requirements. Early experimentation results from Cray systems at ORNL using prototype MPI and runtime implementations are presented. Additionally, this paper outlines fault tolerance techniques targeted at leadership class applications. pdf, pdf Leveraging the Cray Li... Howard Pritchard, Duncan Roweth, David Henseler and Paul Cassella (Cray Inc.) Cray has enhanced the Linux operating system with a Core Specialization (CoreSpec) feature that allows for differentiated use of the processor cores available on Cray XE compute nodes. With CoreSpec, most cores on a node are dedicated to running the parallel application while one or more cores are reserved for OS and service threads. The MPICH2 MPI implementation has been enhanced to make use of this CoreSpec feature to better support MPI asynchronous progress. In this paper, we describe how the MPI implementation uses CoreSpec along with hardware features of the XE Gemini Network Interface to obtain overlap of MPI communication with computation for micro-benchmarks and applications. pdf, pdf
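As a hedged illustration of the communication/computation overlap that asynchronous progress is meant to improve (a generic sketch, not code from the paper; the function and buffer names are invented), an application typically posts non-blocking MPI operations, computes on data that does not depend on them, and only then waits:

/* Generic overlap pattern that benefits from asynchronous progress.
 * Without a progress engine (for example, a core reserved via CoreSpec),
 * much of the transfer may only advance inside MPI_Waitall.
 */
#include <mpi.h>

/* Placeholders for application-specific work (assumed, not from the paper). */
static void compute_interior(void) { /* work that needs no halo data */ }
static void compute_boundary(void) { /* work that needed the halo    */ }

void exchange_and_compute(double *halo_out, double *halo_in, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Post non-blocking communication first... */
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* ...then compute on data that does not depend on it... */
    compute_interior();

    /* ...and only wait once the overlapping work is done. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary();
}

The intent is that, with a core reserved for the MPI progress engine, the transfers posted above can advance while compute_interior() runs rather than being deferred to MPI_Waitall.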
Debugging and Optimizi... Chris Gottbrath (Rogue Wave Software) Cray XE6 and XK6 systems can deliver record-breaking computational power but only to applications that are error-free and optimized to take advantage of the performance that the system can deliver. The cycle of development, debugging and tuning is a constant task, especially when custom application developers implement new algorithms, simulate new physical systems, port software to leverage higher core count nodes or take advantage of accelerators, and scale their code to higher and higher node, core or thread counts. Rogue Wave offers a powerful set of tools to aid in these efforts. ThreadSpotter pinpoints cache inefficiencies and educates and guides scientists and developers through the cache optimization process, while TotalView provides scalable, bi-directional, parallel source code and memory debugging. pdf, pdf Paper Technical Sessions (12C) Köln John Noe A Heat Re-Use System f... Gert Svensson (KTH/PDC) and Johan Söderberg (Hifab) The installation of a 16-cabinet Cray XE6 in 2010 at PDC was expected to increase the total power consumption from around 800 kW by an additional 500 kW. The intention was to refund some of the power cost and become more environmentally friendly by re-using the energy from the Cray to heat nearby buildings. The custom-made system, which makes it possible to heat nearby buildings at the campus without using heat pumps, is described in detail. The basis of the system is that hot air from the Cray is sent through industrial heat exchangers placed above the Cray racks. This makes it possible to heat the water to more than 30 °C. The problems encountered and the experiences gained are described, as well as projections of the savings. A method of describing a mix of different cooling requirements points the way toward future improvements and the addition of future systems. pdf, pdf Analysis and Optimizat... Thomas William (Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)) and Robert Henschel and D. K. Berry (Indiana University) A highly diverse molecular dynamics program for the study of dense matter in white dwarfs and neutron stars was ported and run on a Cray XT5m using MPI, OpenMP and hybrid parallelization. The ultimate goal was to find the best configuration of available code blocks, compiler flags and runtime parameters for the given architecture. The serial code analysis provided the best candidates for parallel parameter sweeps using different MPI/OpenMP settings. Using PAPI counters and the Vampir toolchain, a thorough analysis of the performance behavior was carried out. This step led to changes in the OpenMP part of the code, yielding higher parallel efficiency that can be exploited on machines with larger core counts. The work was done in a collaboration between PTI (Indiana University) and ZIH (Technische Universität Dresden) on hardware provided by the NSF-funded FutureGrid project. pdf, pdf Simulating Laser-Plasm... Steven H. Langer, Abhinav Bhatele, G. Todd Gamblin, Charles H. Still, Denise E. Hinkel, Michael E. Kumbera, A. Bruce Langdon and Edward A. Williams (Lawrence Livermore National Laboratory) The National Ignition Facility (NIF) [1] is a high energy density experimental facility run for the National Nuclear Security Administration (NNSA) by Lawrence Livermore National Laboratory. NIF houses the world’s most powerful laser. The National Ignition Campaign (NIC) has a goal of using the NIF laser to ignite a fusion target by the end of FY12. Achieving fusion ignition in the laboratory will be a major step towards fusion energy. NIC is currently considering several possible ignition target designs. The NIF laser can fire a limited number of shots, so simulations play a major role in selecting the designs to be used in experiments. The NIF laser beams reach intensities of 10^15 W/cm^2 in spots. That is high enough that interactions between the laser beams and fluctuations in the density of the ions and electrons may scatter laser light away from the target. pF3D is a laser-plasma interaction code used to assess proposed experimental designs for expected levels of scattering and to help understand measurements of scattered light in NIF experiments. NIF experiments have shown that laser-plasma interactions transfer significant amounts of energy between beams and increase the amount of backscattered light relative to what would occur without energy transfer. pF3D has run several simulations with two interacting beams and is starting to run simulations with three interacting beams. These simulations require over 200 billion zones and run for several weeks. The pF3D simulations presented in this paper were run on Cielo, a Cray XE6 at Los Alamos National Laboratory. These simulations help us understand key experiments currently being carried out with the NIF laser. This paper reports on several modifications we have made to pF3D in the past year. These changes help pF3D run better on Cielo, and they are also a step in preparing for future exascale computers. pdf, pdf Paper 2:30PM - 3:00PM Break. Rogue Wave Software,... Maritim Foyer Rogue Wave Software, Inc.
is the largest independent provider of cross-platform software development tools and embedded components for the next generation of HPC applications. Rogue Wave products reduce the complexity of prototyping, developing, debugging, and optimizing multi-processor and data-intensive applications. Rogue Wave customers are industry leaders in the Global 2000, ISVs, OEMs, government laboratories and research institutions that leverage computationally-complex and data-intensive applications to enable innovation and outperform competitors. Developing parallel, data-intensive applications is hard. We make it easier. Many of Cray’s customers utilize TotalView to debug software on their systems, as well as the IMSL Numerical Libraries to implement advanced mathematics and statistics capabilities. TotalView has been used on Cray systems for more than 20 years, has been certified on the latest Gemini™ Interconnect and we work closely to ensure compatibility with each major system revision. Cray’s latest offering, the Cray XK6™, brings the power of NVIDIA processors to bear, with TotalView fully supporting the use of CUDA in these systems. Rogue Wave’s products have demonstrated years of consistent reliability, have been thoroughly tested on Cray equipment, and are fully supported on a worldwide basis. TotalView® is a highly scalable debugger that provides troubleshooting for a wide variety of applications including: serial, parallel, multi-threaded, multiprocess, and remote applications. A GUI-based source code defect analysis tool for C, C++ and Fortran applications, TotalView gives you unprecedented control over processes and thread execution and visibility into program state and variables. It allows you to debug one or many processes and/or threads with complete control over program execution. You can reproduce and troubleshoot difficult problems that can occur in concurrent programs that take advantage of threads, OpenMP, MPI, or GPUs. TotalView enables efficient debugging of memory errors and leaks and diagnosis of subtle problems like deadlocks and race conditions. It includes sophisticated memory debugging and analysis, reverse debugging and CUDA debugging capabilities. The IMSL Numerical Libraries are a comprehensive set of mathematical and statistical functions that programmers can embed into their software applications. The libraries can be embedded into C, C# for .NET, Java™ and Fortran applications, and can be used in a broad range of applications -- including programs that help airplanes fly, predict the weather, enable innovative study of the human genome, predict stock market behavior and provide risk management and portfolio optimization. Break 3:15PM - 10:00PM CUG Night Out Schloss Solitude Special Event | Thursday, May 3rd 8:30AM - 10:00AM Technical Sessions (13A) Hamburg Liam Forbes Application Workloads... Wayne Joubert (Oak Ridge National Laboratory) and Shiquan Su (National Institute for Computational Sciences) In this study we investigate computational workloads for the Jaguar system during its tenure as a 2.3 petaflop system at Oak Ridge National Laboratory. The study is based on a comprehensive analysis of MOAB and ALPS job logs over this period. We consider Jaguar utilization over time, usage patterns by science domain, most heavily used applications and their usage patterns, and execution characteristics of selected heavily-used applications. Implications of these findings for future HPC systems are also considered. Understanding the effe... 
Kalyana Chadalavada and Manisha Gajbe (National Center for Supercomputing Applications/University of Illinois) We conduct a low-level analysis of possible resource contention on the Interlagos core modules using a compute-intensive kernel to exemplify target workloads. We will also characterize the performance of OpenMP threads in packed and unpacked configurations. By using CrayPat tools and PAPI counters, we attempt to quantify bottlenecks to full utilization of the processors. Demonstrating which code constructs can achieve high levels of concurrent performance on packed integer cores on the module and which code constructs fare poorly on a packed configuration can help tune petascale-class applications. We use this information to understand and optimize the performance profile of a full-scale scientific application on a Cray XE6 system. PBS Professional 11: A... Scott J. Suchyta and Lisa Endrjukaitis (Altair Engineering, Inc.) and Jason Coverston (Cray Inc.) Beginning with version 11, Altair has re-architected the Cray port for PBS Professional, its industry-leading workload management and job scheduling product. As a result, PBS Professional now offers Cray users a wider range of capabilities to extract every ounce of performance from their systems. This presentation will walk Cray users and administrators through the detailed changes from previous versions, focusing on what users need to know for a seamless upgrade. Topics covered will include robustness and scalability improvements, usage examples and tips, and lessons learned from initial deployments. The presentation will cover PBS’s topology-aware scheduling and how Cray users can leverage this to improve system utilization and throughput. The session will also touch on other new capabilities available with PBS Professional 11, including scheduling, submission and cold-start improvements. Paper Technical Sessions (13B) Bonn Tina Butler Expose, Compile, Analy... Robert M. Whitten (Oak Ridge National Laboratory) Reworking existing codes for GPU-based architectures is a daunting task. The OLCF has developed a methodology in partnership with its software vendors to eliminate the need to program in CUDA. This methodology involved exposing parallelism, compiling with directive-based tools, analyzing performance, and repeating the process where necessary. This paper explores the methodology with specific details of that process. pdf, pdf Software Usage on Cray... Bilel Hadri and Mark Fahey (National Institute for Computational Sciences), Timothy W. Robinson (Swiss National Supercomputing Centre) and William Renaud (Oak Ridge National Laboratory) In an attempt to better understand library usage and address the need to measure and monitor software usage and forecast requests, an infrastructure named the Automatic Library Tracking Database (ALTD) was developed and put into production on Cray XT and XE systems at NICS, ORNL and CSCS. The ALTD infrastructure prototype automatically and transparently stores information about libraries linked into an application at compilation time and also tracks the executables launched in a batch job. With the data collected, we can generate an inventory of all libraries and third-party software used during compilation and execution, whether they be installed by the vendor, the center’s staff, or the users in their own directories. We will illustrate the usage of libraries and executables on several Cray XT and XE machines (namely Kraken, Jaguar and Rosa). We consider that an improved understanding of library usage could benefit the wider HPC community by helping to focus software development efforts toward the Exascale era. pdf, pdf
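To make the idea of transparent link-time tracking concrete, here is a minimal sketch of one way such interception can work (this is not the ALTD implementation; the log path, wrapper name and location of the real linker are assumptions): a thin wrapper is installed in place of the linker, records the link line, then hands control to the real linker.

/* Minimal sketch of link-line interception in the spirit of ALTD
 * (not the actual ALTD code). Installed ahead of the real linker in
 * PATH under the name "ld", it appends the full link command to a
 * per-user log and then execs the real linker. The log path and the
 * path of the real linker are assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *home = getenv("HOME");
    char logpath[4096];
    snprintf(logpath, sizeof logpath, "%s/.link_tracking.log",
             home ? home : "/tmp");

    FILE *log = fopen(logpath, "a");
    if (log) {
        for (int i = 0; i < argc; i++)        /* record the link line:     */
            fprintf(log, "%s ", argv[i]);     /* objects, -l and -L flags  */
        fputc('\n', log);
        fclose(log);
    }

    argv[0] = "/usr/bin/ld.real";             /* assumed real linker path  */
    execv(argv[0], argv);                     /* replace ourselves with it */
    perror("execv");                          /* only reached on failure   */
    return 127;
}

A post-processing step could then load the logged link lines into a database and join them against job launch records, roughly the kind of inventory the abstract describes.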
Running Large Scale Jo... Yun (Helen) He and Katie Antypas (Lawrence Berkeley National Laboratory) Users face various challenges with running and scaling large-scale jobs on petascale production systems. For example, certain applications may not have enough memory per core, the default environment variables may need to be adjusted, or I/O may dominate the run time. Using real application examples, this paper will discuss some of the run-time tuning options for running large-scale pure MPI and hybrid MPI/OpenMP jobs successfully and efficiently on Hopper, the NERSC production XE6 system. These tuning options include MPI environment settings, OpenMP threads, memory affinity choices, and I/O file striping settings. pdf, pdf Paper Technical Sessions (13C) Köln Liz Sim A fully distributed CF... Jens Zudrop, Harald Klimach, Manuel Hasert, Kannan Masilamani and Sabine Roller (Applied Supercomputing in Engineering, German Research School for Simulation Sciences GmbH and RWTH Aachen University) A solver framework based on a linearized octree is presented. It allows for fully distributed computations and avoids special processes with potential bottlenecks, while enabling simulations with complex geometries. Scaling results on the Cray XE6 Hermit system at HLRS in Stuttgart are presented with runs up to 3072 nodes with 98304 MPI processes. Even with fully indirect addressing, a high sustained performance of more than 9% can be reached on the system, enabling very large simulations. Two flow simulation methods are shown, a Finite Volume Method for compressible flows, and a Lattice Boltzmann Method for incompressible flows in complex geometries. pdf, pdf Tuning And Understandi... Guochun Shi (National Center for Supercomputing Applications), Steve Gottlieb (Indiana University) and Michael Showerman (National Center for Supercomputing Applications) Graphics Processing Units (GPUs) are becoming increasingly popular in high performance computing due to their high performance, high power efficiency, and low cost. Lattice QCD is one of the fields that has successfully adopted GPUs and scaled to hundreds of them. In this paper, we report our Cray XK6 experience in profiling and understanding performance for MILC, one of the Lattice QCD computation packages, running on multi-node Cray XK6 computers using a domain-specific GPU library called QUDA. QUDA is a library for accelerating Lattice QCD computations on GPUs. It started at Boston University and has evolved into a multi-institution project. It supports multiple quark actions and has been interfaced to many applications, including MILC and Chroma. The most time-consuming part of lattice QCD computation is a sparse matrix solver, and QUDA supports efficient Conjugate Gradient (CG) and other solvers. By partitioning in the 4-D space-time domain, the solvers in the QUDA library enable the applications to scale to hundreds of GPUs with high efficiency. The other computation-intensive components, such as link fattening, gauge force and fermion force computations, have also been actively ported to GPUs. pdf, pdf High-productivity Soft... Thomas Bradley (NVIDIA) Often, the simplest approach to using an accelerator is to call a pre-existing library.
This talk will provide an overview of GPU-enabled libraries, their advantages over their CPU equivalents, and how to call them from several languages. The talk will also address code development in C++ and how the emerging Thrust template library provides key programmer benefits. We will demonstrate how to decompose problems into flexible algorithms provided by Thrust, and how implementations are fast, and can remain concise and readable. Paper 10:00AM - 10:30AM Break. NVIDIA Corporation,... Maritim Foyer Liza Gabrielson (NVIDIA) NVIDIA is the world leader in visual computing technologies and inventor of the GPU. NVIDIA® serves the high performance computing market with its Tesla™ GPU computing products, available from resellers including Cray. Based on the CUDA™ parallel computing platform, NVIDIA Tesla GPU computing products are companion processors to the CPU and designed from the ground up for HPC - to accelerate application performance. To learn more, visit www.nvidia.com/tesla. Break 10:30AM - 12:00PM Technical Sessions (14A) Hamburg Jason Hill NCRC Grid Allocation M... Frank Indiviglio and Ron Bewtra (National Oceanic and Atmospheric Administration) In support of the NCRC, NOAA has deployed an accounting system for the purpose of coordinating HPC system usage between NOAA user centers and the NCRC located at Oak Ridge National Laboratory. This system provides NOAA with a centralized location for reporting and management of allocations on all production resources located at the NCRC and at NOAA laboratories. This paper describes the design, deployment, and details of the first year of production using this system. We shall also discuss future plans for extending its deployment to other NOAA sites in order to provide centralized reporting and management of system utilization for all HPC resources. pdf, pdf Speed Job Completion w... David Hill (Adaptive Computing) Leverage the combined power and scale of Cray’s highly advanced systems architecture to speed the completion of multi-node, parallel-processing jobs with the Moab® intelligence engine. In this session, you will see how topology-based scheduling will permit a cluster user to intelligently schedule jobs on inter-communicating nodes close to each other to minimize the overhead for message or information passing and/or data transfer. This enables jobs to complete in a shorter period than they would if the workloads used nodes spread across the cluster. Practical Support Solu... Adam G. Carlyle, Ross G. Miller, Dustin B. Leverman, William A. Renaud and Don E. Maxwell (Oak Ridge National Laboratory) The National Climate-Computing Research Center (NCRC), a joint computing center between Oak Ridge National Laboratory (ORNL) and the National Oceanic and Atmospheric Administration (NOAA), employs integrated workflow software and data storage resources to enable production climate simulations on the Cray XT6/XE6 named "Gaea". The use of highly specialized workflow software and a necessary premium on data integrity together create a support environment with unique challenges. This paper details recent support efforts to improve the NCRC end-user experience and to safeguard the corresponding scientific workflow. Monitoring and reporting of disk usage on Lustre filesystems can be a resource-intensive task, and can affect metadata performance if not done in a centralized and scalable way.
LustreDU is a non-intrusive tool that was developed at ORNL to address this issue by providing an end-user utility that queries a daily-populated database to report disk utilization on directories in the NCRC Lustre file systems. The NCRC system is housed at ORNL, and has sets of geographically remote end-users at three separate sites, with a corresponding support staff team at each location. Conveying system status information to each remote center in a timely manner became important early in the project. The NCRC System Dashboard is a web interface and a set of corresponding system checks created by ORNL support staff to concisely and expediently inform those operational teams remote from the main data center of changes in system status. Filesystem issues and outages cause disruption to the automated workflow employed by NCRC end-users. Lustre-aware Moab is our response to this issue. By integrating knowledge of the filesystem state into the system's job scheduler, the workflow can be paused when a file system issue is detected. When the issue is resolved, affected jobs can be rerun, effectively rolling back the workflow's progression to a valid state. pdf, pdf Paper Technical Sessions (14B) Bonn Larry Kaplan uRiKA: Graph Applian... Amar Shan (Cray Inc.) The Big Data challenge is ubiquitous at HPC sites, which commonly have data storage measured in tens of petabytes, doubling every two to three years. Transforming this data into knowledge is critical to continued progress. Semantic Networks are one of the most promising approaches to Knowledge Discovery in Big Data. Ontologies add meaning, tremendously increasing the expressive power of queries and the ability to extract meaningful results. However, performance and scalability are problematic with semantic networks. The reasons lie in the architecture of modern computer systems: Microprocessor performance has advanced exponentially faster than memory performance. Caching attempts to address this imbalance, but semantic algorithms are typically cache-busting, resulting in poor performance and scalability. This talk explores Cray’s combination of custom hardware and semantic software to deliver an extensible platform for semantic data analysis, with very good performance and scaling. Several real-world applications and results will be discussed. Blue Waters Testing En... Joseph Muggli, Brett Bode, Torsten Hoefler, William Kramer and Celso L. Mendes (National Center for Supercomputing Applications/University of Illinois) Acceptance and performance testing are critical elements of providing and optimizing HPC systems for scientific users. This paper will present the design and implementation of the testing harness for the Blue Waters Cray XE6/XK6 being installed at NCSA/University of Illinois. The Blue Waters system will be a leading-edge system in terms of computational power, on- and off-line storage size and performance, external networking performance, and the breadth of software needed to support a diverse NSF user community. Such a large and broad environment must not only be fully validated for system acceptance, but also continually retested over time to avoid regressions in performance following new software installations or hardware failures. This frequency of testing demands an automated means for running the tests and validating the results as well as tracking the results over time. The INCA testing package was selected as the main framework because it provides much of the desired functionality for a test harness.
Some of INCA's featured abilities are the straightforward wrapping of individual tests by researchers who might not be familiar with the harness API, the ability to perform periodic regression testing for monitoring and checking software updates, version control of tests, the hierarchical grouping of individual tests, and a dashboard feature to provide a succinct overview of current acceptance and performance test results. In addition to describing the testing framework, the paper will also present an overview of the set of software and hardware tests being implemented for Blue Waters. These tests range from core performance (CPU, network, and storage), to the functionality of software layers (standards compliance and interoperability of MPI, OpenMP, Co-Array Fortran, UPC, etc.), to the functionality of external tools, such as Eclipse, within the user environment. Differing test versions will validate functionality, do full performance characterization, or be suitable for a regression test suite. The regression test suite will ensure that Blue Waters not only satisfies all of the requirements for acceptance, but also maintains those characteristics throughout its production lifetime. pdf, pdf Optimizing HPC and IT... Wim Slagter (ANSYS) This presentation will show how the ANSYS engineering simulation platform can contribute to HPC & IT efficiency, and how our current solutions, partnerships, and roadmap can enable scalable, global deployment of simulation on internal or cloud-based HPC infrastructure. In addition, some recent ANSYS software advances in parallel scaling performance on Cray systems will be presented. Paper Technical Sessions (14C) Köln Tina Butler Swift - a parallel scr... Ketan Maheshwari (Argonne National Laboratory), Mihael Hategan and David Kelly (University of Chicago), Justin Wozniak (Argonne National Laboratory), Jon Monette, Lorenzo Pesce and Daniel Katz (University of Chicago), Michael Wilde (Argonne National Laboratory) and David Strenski and Duncan Roweth (Cray Inc.) Important science, engineering and data analysis applications increasingly need to run thousands or millions of small jobs, each using a compute core for seconds to minutes, in a paradigm called many-task computing. These applications can readily have computation needs that extend into extreme scales. Most petascale systems, however, only schedule jobs to the node level. While it is possible to run multiple small tasks on the same node using manually-written ad-hoc scripts, this is not very convenient, making petascale systems unattractive to many-task applications. Swift is a parallel scripting language that makes such many-task applications easy to express and run, using highly portable and system-independent scripts. The Swift language is implicitly parallel, high-level and functional. Swift's runtime system automatically manages the execution of tens of thousands of small single-core or multi-core jobs, and dynamically packs those jobs tightly onto multiple nodes, thus fully utilizing node-scheduled systems. In this paper, we present our experience in running many-task science applications under Swift on Cray XT and XE systems. Shared Library Perform... Zhengji Zhao (Lawrence Berkeley National Laboratory), Mike Davis (Cray Inc.) and Katie Antypas, Yushu Yao, Rei Lee and Tina Butler (Lawrence Berkeley National Laboratory) NERSC's petascale machine, Hopper, a Cray XE6, supports dynamic shared libraries through the DVS projection of the shared root file system onto compute nodes.
The performance of the dynamic shared libraries is crucial to part of the NERSC workload, especially for large-scale applications that use Python as the front-end interface. The work we will present in this paper was motivated by reports from NERSC users stating that the performance of dynamic shared libraries is very poor at large scale, and hence that it is not possible for them to run large Python applications on Hopper. In this paper, we will present our performance test results on the shared libraries on Hopper, using the standard Python benchmark code Pynamic and a NERSC user application code WARP, and will also present a few options which we have explored and developed to improve the shared library performance at scale on Hopper. Our effort has enabled WARP to start up in 7 minutes at 40K core concurrency. pdf, pdf The Effects of Compile... Megan Bowling, Zhengji Zhao and Jack Deslippe (Lawrence Berkeley National Laboratory) Materials science and chemistry applications consume around 1/3 of the computing cycles each allocation year at NERSC. To improve the scientific productivity of users, NERSC provides a large number of pre-compiled applications on the Cray XE6 machine Hopper. Depending on the compiler, compiler flags and libraries used to build the codes, applications can have large differences in performance. In this paper, we compare the performance differences arising from the use of different compilers, compiler optimization flags and libraries available on Hopper over a set of materials science and chemistry applications that are widely used at NERSC. The selected applications are written in Fortran, C, C++, or a combination of these languages, and use MPI or other message passing libraries as well as linear algebra, FFT, and global array libraries. The compilers explored are the PGI, GNU, Cray, Intel and PathScale compilers. pdf, pdf Paper 12:00PM - 1:00PM Lunch. NetApp Inc, Sponsor Restaurant Rôtisserie Dennis Watts (NetApp, Inc.) The NetApp® E5400 is a high-performance storage system that meets an organization’s demanding performance and capacity requirements without sacrificing simplicity and efficiency. Designed to meet wide-ranging requirements, its balanced performance is equally adept at supporting high-performance file systems, bandwidth-intensive streaming applications, and transaction-intensive workloads. The E5400's multiple drive shelf options enable custom configurations that can be tailored for any environment. With over 20 years of storage development experience, the E5400 is based on a field-proven architecture designed to provide the highest reliability and 99.999% availability. Its redundant components, automated path failover, and online administration keep organizations productive 24/7/365. And its advanced protection features and extensive diagnostic capabilities consistently achieve high levels of data integrity. NetApp is one of the world's leading OEM storage providers, developing and delivering robust, high-performance storage system technology. We enable our OEM partners to add value and to differentiate their products to meet their customers’ storage needs. http://www.netapp.com Lunch 1:00PM - 2:30PM Technical Sessions (15A) Hamburg Tina Butler A Single Pane of Glass... Matthijs van Leeuwen and Martijn de Vries (Bright Computing, Inc.) Bright Cluster Manager provides comprehensive cluster management for Cray systems in one integrated solution: deployment, provisioning, scheduling, monitoring, and management.
Its intuitive GUI provides complete system visibility and ease of use for multiple clusters simultaneously, including automated tasks and intervention. Bright also provides a powerful cluster management shell for those who prefer to manage via a command-line interface. Bright Cluster Manager extends to cover the full range of Cray systems, spanning clusters/mainframes, external servers (large-scale Lustre file systems, login servers, data movers, pre- and post-processing servers), and the new Cray storage solutions. Bright Computing also provides unique cloud bursting capabilities as a standard feature of Bright Cluster Manager, automatically cloud-enabling clusters at no extra cost. Users can seamlessly extend their clusters, adding and managing cloud-based nodes as needed, or create entirely new clusters on the fly with a few mouse clicks. Real Time Analysis and... Joseph 'Joshi' Fullop, Ana Gainaru and Joel Plutchak (National Center for Supercomputing Applications) The cost of operating extreme-scale supercomputers such as Blue Waters is high and growing. Predicting failures and reacting accordingly can prevent the loss of compute hours and their associated power and cooling costs. Forecasting the general state of the system and predicting an exact failure event are two distinct ways to accomplish this. We have addressed the latter with a system that uses a self-modifying template algorithm to tag event occurrences. This enables fast mining and identification of correlated event sequences. The analysis is visually displayed using directed graphs to show the interrelationships between events across all subsystems. The system as a whole is self-updating and functions in real time and is planned to be used as a core monitoring component on the Blue Waters supercomputer at NCSA. pdf, pdf Node Health Checker Kent J. Thomson (Cray Inc.) The Node Health Checker (NHC) component runs after job failures to take out of service compute nodes that are likely to cause future jobs to fail. Before NHC can take nodes out of the availability pool, however, it must run some tests on them to assess their health. While these tests are running, the nodes being tested cannot have new jobs run on them. This period of time is known as 'Normal Mode'. By decreasing the average time of normal mode, job throughput can be increased. Performance investigation into the average run time of NHC normal mode showed that instead of scaling logarithmically with the number of nodes being tested, it scaled linearly, which becomes much slower at larger node counts. By localizing and fixing the bug causing the improper scaling, the normal mode run time of NHC was decreased by up to 100x in the best case. The analytical techniques involved in identifying the scaling behavior will be shown, including curve fitting and performance extrapolation using software tools. Additionally, the method of isolating the location of the bug by testing the different pieces of NHC separately will be discussed. Once the source of the poor scaling is revealed as calls to an external program for each node being tested, the fix of caching the required information at NHC startup in an intelligent manner is explained. Additionally, the new automatic dump and reboot feature of NHC is discussed. An architectural overview is given, along with common usage scenarios. pdf, pdf
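The caching fix described above follows a common pattern; the sketch below is a generic illustration (not Cray's NHC code; the external command name and its output format are invented) of replacing one external query per node with a single query at startup whose output is kept in a lookup table:

/* Generic sketch of the caching pattern: run the external query once
 * at startup, cache its per-node output, and have later health checks
 * consult the cache instead of re-invoking the program for every node.
 * "node_status" and its output format are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 65536

static char *node_info[MAX_NODES];     /* cached line per node id */

/* One external call for the whole system, instead of one per node. */
static int build_cache(void)
{
    FILE *p = popen("node_status --all", "r");   /* hypothetical command */
    if (!p) return -1;

    char line[512];
    while (fgets(line, sizeof line, p)) {
        int nid;
        if (sscanf(line, "%d", &nid) == 1 && nid >= 0 && nid < MAX_NODES)
            node_info[nid] = strdup(line);       /* keep the node's record */
    }
    return pclose(p);
}

/* Per-node check now reads the cache: no fork/exec per node. */
static const char *lookup_node(int nid)
{
    return (nid >= 0 && nid < MAX_NODES) ? node_info[nid] : NULL;
}

int main(void)
{
    if (build_cache() < 0) return 1;
    const char *info = lookup_node(42);
    if (info) fputs(info, stdout);
    return 0;
}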
Paper Technical Sessions (15B) Bonn John Noe Early Application Expe... R. Glenn Brook, Bilel Hadri, Vincent C. Betro, Ryan C. Hulguin and Ryan Braby (National Institute for Computational Sciences) This work details the early efforts of the National Institute for Computational Sciences (NICS) to port and optimize scientific and engineering application codes to the Intel Many Integrated Core (Intel MIC) architecture in a Cray CX1. After the configuration of the CX1 is presented, the successful porting of several application codes is described, and scaling results for the codes on the Intel Knights Ferry (Intel KNF) software development platform are presented. pdf, pdf High-Performance Exact... Sergei Isakov (ETH Zurich), William Sawyer, Gilles Fourestey and Adrian Tineo (Swiss National Supercomputing Centre) and Matthias Troyer (ETH Zurich) In this work we analyze Cray XE6/XK6 performance and scalability of Exact Diagonalization (ED) techniques for an interacting quantum system. Typical models give rise to a relatively sparse Hamiltonian matrix H. The Lanczos algorithm is then used to determine a few eigenstates. The sparsity pattern is irregular, and the underlying matrix-vector operator exhibits only limited data locality. By grouping the basis states in a smart way, each node needs to communicate with only an order O(log(p)) subset of nodes. The resulting hybrid MPI/OpenMP C++ implementation scales to large CPU configurations. We have also investigated one-sided communication paradigms, such as MPI-2, SHMEM and UPC. We present the results for various communication paradigms on the Cray XE6 at CSCS. Depending on the model chosen, the matrix-vector operator can be computationally expensive and therefore applicable to GPUs. An initial accelerator-directive implementation has been developed, and we report results on a Cray XK6. pdf, pdf Developing Integrated... Ron Oldfield (Sandia National Laboratories), Todd Kordenbrock (Hewlett Packard) and Gerald Lofstead (Sandia National Laboratories) Over the past several years, there has been increasing interest in injecting a layer of compute resources between a high-performance computing application and the end storage devices. For some projects, the objective is to present the parallel file system with a reduced set of clients, making it easier for file-system vendors to support extreme-scale systems. In other cases, the objective is to use these resources as “staging areas” to aggregate data or cache bursts of I/O operations. Still others use these staging areas for “in-situ” analysis on data in transit between the application and the storage system. To simplify our discussion, we adopt the general term “Integrated Data Services” to represent these use cases. This paper describes how we provide user-level, integrated data services for Cray systems that use the Gemini Interconnect. In particular, we describe our implementation and performance results on the Cray XE6, Cielo, at Los Alamos National Laboratory. pdf, pdf Paper Technical Sessions (15C) Köln Liam Forbes The year in review (in... Wendy L. Palm (Cray Inc.) This presentation will provide a review of the notable vulnerabilities, hits & misses of the past year, as well as an update on any changes to the Cray Security Update process. Threat Management and... Urpo Kaila and Joni Virtanen (CSC - IT Center for Science Ltd) National data centers for scientific computing provide IT services for researchers, who primarily want reliable and flexible access to high performance computing.
Information security is typically given lower priority, at least until a security incident endangers user data and credentials or the general availability of site services. Many incidents affect several sites with similar computing platforms or user bases. There seems to be a growing demand for structured cooperation between sites, for both proactive threat management and reactive incident coordination and crisis communication to stakeholders. In this paper we will show how data centers currently identify common threats and coordinate information security incidents among sites and other players, such as vendors, open source software providers and Computer Security Incident Response Teams. The study is based on current research and on a site survey. The conclusions will suggest improvements to current best practices for threat management and incident coordination. Early Applications Exp... Arnold Tharrington, Hai Ah Nam, Wayne Joubert, W. Michael Brown and Valentine G. Anantharaj (Oak Ridge National Laboratory) In preparation for Titan, the next-generation hybrid supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), the existing 2.3 petaflops Jaguar system was upgraded from the XT5 architecture to the new Cray XK6. This system combines AMD’s 16-core Opteron 6200 processors, NVIDIA’s Tesla X2090 accelerators, and the Gemini interconnect. We present an early evaluation of OLCF’s Cray XK6, including results for microbenchmarks and kernel and application benchmarks. In addition, we show preliminary results from GPU-enabled applications. Paper 2:30PM - 2:45PM Break Maritim Foyer Break 2:45PM - 3:15PM Closing General Session (16) Köln / Bonn / Hamburg Nick Cardo Invited Talk |