
CUG 2010 Final Program and Proceedings

Links to papers and slides are in blue below. To read the abstract of a presentation, go to the abstracts page and search for the presenter's name or the presentation title. Click here for the final printed program in pdf format.

Welcome from the Program Committee

Program Notes
Monday Tuesday Wednesday Thursday

Abstracts of Papers

Author(s) Title Abstract
Mark Adams, Columbia University; C-S Chang and Seung-Hoe Ku, New York University; Eduardo D'Azevedo, Collin McCurdy, and Patrick Worley, Oak Ridge National Laboratory (ORNL) XGC1: Performance on the 8-core and 12-core Cray XT5 Systems at ORNL The XGC1 code is used to model multiscale tokamak plasma turbulence dynamics in realistic edge geometry. In June 2009, XGC1 demonstrated nearly linear weak and strong scaling out to 150,000 cores on a Cray XT5 with 8-core nodes when solving problems of relevance to running experiments on the ITER tokamak. Here we compare performance, and discuss further performance optimizations, when running XGC1 on an XT5 with 12-core nodes on up to 224,000 cores.
Sadaf Alam, Matthew Cordery, William Sawyer, Tim Stitt, and Neil Stringfellow, CSCS-Swiss National Supercomputing Centre (CSCS) Evaluation of Productivity and Performance Characteristics of CCE CAF and UPC Compilers The Co-Array Fortran (CAF) and Unified Parallel C (UPC) functional compilers available with the Cray Compiler Environment (CCE) on the Cray XT5 platform offer an integrated framework for code development and execution for the Partitioned Global Address Space (PGAS) programming paradigm, together with the message-passing MPI and shared-memory OpenMP programming models. Using micro-benchmarks, conformance test cases and micro-kernels of representative scientific calculations, we attempt to evaluate the following characteristics of the CCE PGAS compilers: (1) usability of the framework for code development and execution; (2) completeness and integrity of code generation; (3) efficiency of the generated code, particularly usage of the communication layer (GASNet on SeaStar2); and (4) availability of tools for performance measurement and diagnostics. Our initial results show that the current version of the compiler provides a highly productive code development environment for CAF or UPC on our target Cray XT5 platform. At the same time, however, we observe that the code transformation and generation processes are unable to aggregate remote memory accesses for simple access patterns, causing a significant slowdown. We will compare and contrast code generation with two multi-platform PGAS compilers: the Berkeley UPC environment, which uses the Intrepid UPC compiler, and the g95 CAF compiler extensions. In the full paper, we will also include comparative results using the Rice CAF 2.0 compiler, if it becomes available in due time.
Carl Albing, Cray Inc. ALPS, Topology, and Performance Application performance can be improved or reduced depending on the compactness of the set of nodes on which an application is placed (as demonstrated convincingly by PSC at a recent CUG). This paper describes the approach to placements that ALPS now uses based on the underlying node topology, the reasons for this approach, and the variations that sites can use to optimize for their specific machine and workload.
Dario Alfe, University College London and Lucian Anton, Numerical Algorithms Group Mixed Mode Computation in CASINO CASINO is a quantum Monte Carlo code that solves many-particle Schroedinger equations with the help of configurations of random walkers. This method is well suited to parallel computation because it has a very good computation/communication ratio. The standard parallel algorithm increases the computation speed by distributing the configurations equally among the available processors. For a computation with P processing elements, the computation time for Nc configurations is proportional to Nc*tc/P, where tc is the average time taken for one configuration step. On petascale computers one can have more processing elements than configurations; moreover, for models with more than 1000 electrons, tc increases significantly. We present a mixed mode implementation of CASINO that takes advantage of architectures with large numbers of multicore processors to improve computation speed by using multiple OpenMP threads for the computation of each configuration step.
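As an illustration of the cost model sketched in this abstract, the pure-MPI and mixed-mode run times can be written as follows; the hybrid form and the thread-speedup factor S(T) are assumptions for illustration, not results quoted from the paper.

    % Illustrative cost model: Nc configurations, P MPI processes,
    % T OpenMP threads per process, tc = average time per configuration step.
    \begin{align*}
      T_{\text{MPI}}    &\;\propto\; \frac{N_c\, t_c}{P}, \\
      T_{\text{hybrid}} &\;\propto\; \frac{N_c\, t_c}{P\, S(T)}, \qquad S(T) \le T,
    \end{align*}
    % where S(T) is the speedup of a single configuration step on T threads.
    % The hybrid form remains useful once P approaches Nc, since additional
    % cores contribute threads to each step instead of sitting idle.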
Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, and James A. Ang, Sandia National Laboratories (SNLA) The Graph 500 New large-scale informatics applications require radically different architectures from those optimized for 3D physics. The 3D physics community is represented in the Top 500 list by LINPACK, a single, simple, dense algebra benchmark. Informally, the Cray XMT performs significantly better than other known architectures on large-scale graph problems, a core informatics application kernel. The Graph 500 list, to be introduced at Supercomputing 2010, will formalize a single, unified graph benchmark for the informatics community to rally around and to precipitate innovation in the informatics space. This paper will discuss the need for this kind of benchmark, the benchmark itself, an initial set of results on a small subset of platforms (including the XMT), and why those platforms are fundamentally different from other classes of supercomputer.
James Ang and Sudip Dosanjh, Sandia National Laboratories (SNLA); Ken Koch and John Morrison, Los Alamos National Laboratory An Alliance for Computing at the Extreme Scale Los Alamos and Sandia National Laboratories have formed a new high performance computing center, the Alliance for Computing at the Extreme Scale (ACES). The two labs will jointly architect, develop, procure and operate capability systems for DOE's Advanced Simulation and Computing Program. This presentation will discuss (1) a petascale production capability system, Cielo, that will be deployed in late 2010, (2) a technology roadmap for exascale computing and (3) a new partnership with Cray on advanced interconnect technologies.
Katie Antypas and Andrew Uselton, National Energy Research Scientific Computing Center (NERSC), Daniela Ushizima, Computational Research Division (CRD), Lawrence Berkeley National Lab and Jefferey Sukharev, University of California, Davis File System Monitoring as a Window Into User I/O Requirements The effective management of HPC I/O resources requires an understanding of user requirements, so the National Energy Research Scientific Computing Center (NERSC) annually surveys its project leads for their anticipated needs. With the advent of detailed monitoring on the Lustre parallel file system of the Franklin Cray XT, it becomes possible to compare actual experience with the expectations presented in the surveys. A correlation of the Lustre Monitoring Tool (LMT) data with job log statistics reveals I/O behavior on a per-project basis. This feedback for both the users and the center enhances NERSC's ability to manage and provision Franklin's I/O subsystem as well as to plan for future I/O requirements.
Katie Antypas, Tina Butler, and Jonathan Carter, National Energy Research Scientific Computing Center (NERSC) External Services on the Cray XT5 System Hopper Cray External Service offerings, such as login nodes, data mover nodes, and file systems that are external to the main XT system, provide an opportunity to make Cray XT High Performance Computing resources more robust and accessible to end users. This paper will discuss our experiences using external services on Hopper, a Cray XT5 system at the National Energy Research Scientific Computing (NERSC) Center. It will describe the motivation for externalizing services, early design decisions, security issues, implementation challenges and production feedback from NERSC users.
Edoardo Apra and Vinod Tipparaju, Oak Ridge National Laboratory (ORNL); Ryan Olson, Cray Inc. What is a 200,000 CPUs Petaflop Computer Good For (a Theoretical Chemist Perspective)? We describe the efforts undertaken to efficiently parallelize the computational chemistry code NWChem on the Cray XT hardware using the Global Arrays/ARMCI middleware. We show how we can now use 200K+ processors to address complex scientific problems.
Mike Ashworth and Andrew Porter, STFC Daresbury Laboratory Configuring and Optimising the Weather Research and Forecast Model on the Cray XT The Weather Research and Forecast (WRF) Model is a well-established and widely used application. Designed and written to be highly scalable, the code has a large number of configuration options at both compile- and run-time. We report the results of an investigation into the effect of these options on the performance of WRF on a Cray XT4 with a typical scientific use-case. Covering areas such as MPI/OpenMP comparison, cache usage and I/O performance, we discuss the implications for both regular WRF users and the authors of other application codes.
Mike Ashworth, Xiaohu Guo, STFC Daresbury Laboratory; Andrew Sunderland, EPCC (EPCC) and Gerard Gorman, Stephan Kramer, and Matthew Piggott, Imperial College London High Performance Computing Driven Software Development for Next-Generation Modeling of the World's Oceans The Imperial College Ocean Model (ICOM) is an open-source, next-generation ocean model built upon finite element methods and anisotropic unstructured adaptive meshing. Since 2009, a project has been funded by EPSRC to optimize ICOM for the UK national HPC service, HECToR. Extensive use of profiling tools such as CrayPAT and Vampir has been made in order to understand performance issues of the code on the Cray XT4. Of particular interest is the scalability of the sparse linear solvers and the algebraic multigrid preconditioners required to solve the system of equations. Scalability of model I/O has also been examined, and we have implemented a parallel I/O strategy in the code for the Lustre filesystem.
Troy Baer, National Institute for Computational Sciences (NICS) Using Quality of Service for Scheduling on Cray XT Systems The University of Tennessee's National Institute for Computational Sciences (NICS) operates two Cray XT systems for the U.S. National Science Foundation (NSF): Kraken, an 88-cabinet XT5 system, and Athena, a 48-cabinet XT4 system. Access to Kraken is allocated through the NSF's Teragrid allocations process, while Athena is currently being dedicated to individual projects on a quarterly basis; as a result, the two systems have somewhat different scheduling goals. However, user projects on both systems have sometimes required the use of quality of service (QoS) levels for scheduling of certain sets of jobs. We will present case studies of three situations where QoS levels were used to fulfill specific requirements: two on Kraken in fully allocated production service, and one on Athena while dedicated to an individual project. These case studies will include lessons learned about impact on other users and unintended side effects.
Troy Baer, Lonnie Crosby, and Mike McCarty, National Institute for Computational Sciences (NICS) Regression Testing on Petaflop Computational Resources As the complexity of supercomputers increases, it is becoming more difficult to measure how system performance changes over time. Routine system checks performed after scheduled maintenance or emergency downtime give administrators an instantaneous glimpse of system performance; however, rigorous testing, such as that performed for machine acceptance, provides more in-depth information on system performance. Both routine and rigorous testing are necessary to fully characterize system performance, and a mechanism to store and compare previous results is needed to determine the change in system performance over time. A regression testing framework has been developed at the National Institute for Computational Sciences (NICS) which provides a mechanism to measure the change in system performance over time. These performance results can also be correlated to system events such as downtimes, system upgrades, or any other documented system change. We will describe the design and implementation of the regression testing framework, including the development of test suites, interfaces to the batch system, and the extraction of performance data. The import of extracted data into a relational database for long-term storage, report generation, and real-time analysis will also be discussed.
Jeff Becklehimer and Jeff Larkin, Cray Inc.; Dave Dillow, Don Maxwell, Ross Miller, Sarp Oral, Galen Shipman, and Feiyi Wang, Oak Ridge National Laboratory (ORNL) Reducing Application Runtime Variability on Jaguar XT5 Operating system (OS) noise is defined as interference generated by the OS that prevents the compute core from performing useful work. Compute node kernel daemons, network interfaces, and other OS related services are major sources of such interference. This interference on individual compute cores can vary in duration and frequency and can cause de-synchronization (jitter) in collective communication tasks and thus results in variable (degraded) overall parallel application performance. This behavior is more observable in large-scale applications using certain types of collective communication primitives, such as MPI_Allreduce. This paper presents our efforts towards reducing the overall effect of OS noise on our large-scale parallel applications. Our tests were performed on the quad-core Jaguar, the Cray XT5 at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF). At the time of these tests, Jaguar was a 1.4 PFLOPS supercomputer with 144,000 compute cores and 8 cores per node. The technique we used was to aggregate and merge all OS noise sources onto a single compute core for each node. The scientific application was then run on the remaining seven cores in each node. Our results show that we were able to improve the MPI_Allreduce performance by two orders of magnitude and to boost the Parallel Ocean Program (POP) performance over 30% using this technique.
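As a rough, user-level illustration of the core-specialization idea described above (the production mechanism lives in Cray's system software; this sketch is only an assumption of how the effect can be approximated on a generic Linux node), a process can confine its work to cores 1-7 of an 8-core node and leave core 0 free to absorb OS daemons and kernel work:

    /* Hypothetical sketch of the core-specialization idea: restrict this
     * process to cores 1..7, leaving core 0 free for OS noise sources.
     * The paper's actual mechanism is implemented in system software,
     * not in user code. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int core = 1; core < 8; core++)   /* skip core 0 */
            CPU_SET(core, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("application work now restricted to cores 1-7\n");
        return 0;
    }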
Iain Bethune, EPCC (EPCC) Improving the Performance of CP2K on the Cray XT CP2K is a freely available and increasingly popular Density Functional Theory code for the simulation of a wide range of systems. It is heavily used on many Cray XT systems, including 'HECToR' in the UK and 'Monte Rosa' in Switzerland. We describe performance optimisations made to the code in several key areas, including 3D Fourier Transforms, and present the implementation of a load balancing scheme for multi-grids. These result in performance gains of around 30% on 256 cores (for a generally representative benchmark) and up to 300% on 1024 cores (for non-homogeneous systems). Early results from the implementation of hybrid MPI/OpenMP parallelism in the code are also presented.
Arthur Bland, Oak Ridge National Laboratory (ORNL) Jaguar—The World's Most Powerful Computer System At the SC'09 conference in November 2009, Jaguar was crowned as the world's fastest computer by the web site www.Top500.org. In this paper, we will describe Jaguar, present results from a number of benchmarks and applications, and talk about future computing in the Oak Ridge Leadership Computing Facility.
John Blondin, North Carolina State University; Stephen Bruenn, Florida Atlantic University; Raph Hix, Bronson Messer, and Anthony Mezzacappa, Oak Ridge National Laboratory (ORNL) The Evolution of a Petascale Application: Work on CHIMERA CHIMERA is a multi-dimensional radiation hydrodynamics code designed to study core-collapse supernovae. We will review several recent enhancements to CHIMERA designed to better exploit features of the Cray XT architecture, as well as some forward-looking work to take advantage of the next generation of Cray supercomputers.
Gregory Brown, Florida State University; Markus Eisenbach and Donald Nicholson, Oak Ridge National Laboratory (ORNL); Jeff Larkin, Cray Inc.; Thomas Schulthess, CSCS-Swiss National Supercomputing Centre (CSCS), and Chengang Zhou, JPMorgan Chase & Co. Thermodynamics of Magnetic Systems from First Principles: WL-LSMS We describe a method to combine classical thermodynamic Monte Carlo calculations (the Wang-Landau method) with a first principles electronic structure calculation, specifically our locally self-consistent multiple scattering (LSMS) code. The combined code shows superb scaling behavior on massively parallel computers and is able to calculate the transition temperature of Fe without external parameters. The code was the recipient of the 2009 Gordon Bell Prize for peak performance.
Ian Bush, Numerical Algorithms Group and William Smith and Ilian Todorov, STFC Daresbury Laboratory Optimisation of the I/O for Distributed Data Molecular Dynamics Applications With the increase in size of HPC facilities it is not only the parallel performance of applications that is preventing greater exploitation, in many cases it is the I/O which is the bottleneck. This is especially the case for distributed data algorithms. In this paper we will discuss how the I/O in the distributed data molecular dynamics application DL_POLY_3 has been optimised. In particular we shall show that extensive data redistribution specifically to allow best use of the I/O subsystem can result in a code that scales to many more processors, despite the large increase in communications required.
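A minimal sketch of the redistribution-plus-collective-write strategy this abstract describes, assuming a hypothetical helper write_block() and not DL_POLY_3's actual routines: once the data have been redistributed so that each rank owns one large contiguous block of the output, a single collective MPI-IO call lets the I/O layer aggregate the requests into large, well-aligned transfers.

    /* Hedged sketch (not DL_POLY_3's code): each rank writes its contiguous
     * block of the shared output file with one collective MPI-IO call. */
    #include <mpi.h>

    void write_block(const char *fname, const double *block, long nlocal)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Offset (in elements) of this rank's block within the shared file. */
        long long nloc = nlocal, offset_elems = 0;
        MPI_Exscan(&nloc, &offset_elems, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0) offset_elems = 0;   /* MPI_Exscan leaves rank 0 undefined */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, (char *)fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)offset_elems * sizeof(double),
                              (void *)block, (int)nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }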
Kurt Carlson, Arctic Region Supercomputing Center (ARSC) Tools, Tips and Tricks for Managing Cray XT Systems Managing large complex systems requires processes beyond what is taught in vendor training classes. Many sites must manage multiple systems from different vendors. This paper covers a collection of techniques to enhance the usability, reliability and security of Cray XT systems. A broad range of activities, from complex tasks like security, integrity and environmental checks of the Cray Linux Environment, to relatively simple things like making 'rpm -qa' available to users will be discussed. Some techniques will be XT specific, such as monitoring L0/L1 environment, but others will be generic, such as security tools adapted from other systems and re-spun as necessary for the XT Cray Linux Environment.
Charlie Carroll, Cray Inc. Cray OS Road Map This paper will discuss Cray's operating system road map. This includes the compute node OS, the service node OS, the network stack, file systems, and administrative tools. Coming changes will be previewed, and themes of future releases will be discussed.
Professor Richard Catlow, Dean, Mathematical and Physical Sciences Faculty, University College London Massively Parallel Molecular Modelling in Materials Chemistry We describe the development and application of a massively parallel version of the ChemShell code for the study of catalysis. This package combines classical potential modeling and density functional theory and allows us to study the influence of metal oxide surfaces on the chemistry of adsorbed molecules. We have developed a number of massively parallel schemes which compile parallel versions of the component codes with a task farming approach to divide the problem further and improve efficiency on very large numbers of processors. The work has been performed as part of the dCSE support for the HECToR service.
Brad Chamberlain, Sung-Eun Choi, Steve Deitz, David Iten, and Lee Prokowich, Cray Inc. Five Powerful Chapel Idioms The Chapel parallel programming language, under development at Cray Inc., has the potential to deliver high performance to more programmers with less effort than current practices provide. This is especially the case with the many-core architectures that are already becoming more and more prevalent. This paper presents five reasons why: 1. Chapel supports easy-to-use asynchronous and synchronous remote tasks, 2. Chapel supports local and remote transactions, 3. Chapel supports simple data-parallel abstractions when applicable, 4. Chapel supports user-defined data distributions, and 5. Chapel supports arbitrarily nested parallelism.
Mathew Cordery, Will Sawyer, and Ulrich Schaettler, CSCS-Swiss National Supercomputing Centre (CSCS) Improving the Performance of COSMO-CLM The COSMO-Model, originally developed by Deutscher Wetterdienst, is a non-hydrostatic regional atmospheric model which can be used for numerical weather prediction and climate simulations and is now in use by a number of weather services for operational forecasting (e.g. MeteoSwiss). One current software engineering goal is to improve its scaling characteristics on multicore architectures by making it a hybrid MPI-OpenMP code. We will present hybridization strategies for different components of the model, show some first performance results, and discuss the impact on further development of the model.
David Cownie, AMD Performance on AMD Opteron™ 6000 Series Platforms (Socket G34) We will cover the feature set of the new 12-core AMD Opteron™ 6100 Series processor, show key industry performance benchmark results, and share insight into achieving optimal performance with the new processors.
Steve Deitz, Cray Inc. An Overview of the Chapel Programming Language and Implementation Chapel is a new parallel programming language under development at Cray Inc. as part of the DARPA High Productivity Computing Systems (HPCS) program. Chapel has been designed to improve the productivity of parallel programmers working on large-scale supercomputers as well as small-scale, multicore computers and workstations. It aims to vastly improve programmability over current parallel programming models while supporting performance and portability at least as good as today's technologies. In this tutorial, we will present an introduction to Chapel, from context and motivation to a detailed description of Chapel via many example computations. This tutorial will focus on writing Chapel programs for both multi-core and distributed-memory computers. We will explore the optimizations added to the Chapel implementation this past year that helped with the most recent Chapel HPCC entry.
Luiz DeRose, John Levesque, and Bob Moench, Cray Inc. Scaling Applications on Cray XT Systems In this tutorial we will present tools and techniques for application performance tuning on the Cray XT system, with focus on multi-core processors. Attendees will learn about the Cray XT architecture and its programming environment. They will have an initial understanding of potential causes of application performance bottlenecks, and how to identify some of these bottlenecks using the Cray Performance tools. In addition, attendees will learn advanced techniques to deal with scaling problems and how to access the on-line documentation for user help. Attendees will also have some exposure to the Cray debugging support tools, which provide innovative techniques to debug applications at scale.
Luiz DeRose, Cray Inc. The Cray Programming Environment: Current Status and Future Directions The Cray Programming Environment has been designed to address issues of scale and complexity of high end HPC systems. Its main goal is to hide the complexity of the system, such that applications can achieve the highest possible performance from the hardware. In this talk I will present the recent activities and future directions of the Cray Programming Environment, which consists of a state-of-the-art compiler, tools, and libraries supporting a wide range of programming models.
Luiz DeRose, Cray Inc.; Barton Miller, University of Wisconsin; and Philip Roth, Oak Ridge National Laboratory (ORNL) MRNet: A Scalable Infrastructure for Development of Parallel Tools and Applications MRNet is a customizable, high-throughput communication software system for parallel tools and applications. It reduces the cost of these tools' activities by incorporating a tree-based overlay network (TBON) of processes between the tool's front-end and back-ends. MRNet was recently ported and released for Cray XT systems. In this talk we describe the main features that make MRNet well-suited as a general facility for building scalable parallel tools. We present our experiences with MRNet and examples of its use.
Gava Didier, LSI LSI Storage Best Practices for Deploying and Maintaining Large Scale Parallel File Systems Environments This BoF session will focus on the challenges and rewards of deploying and implementing a parallel file system to improve cluster performance. The discussion will focus on the impact that deployment and ongoing support has in terms of system performance, system availability and pain for support personnel. Discussion will highlight experiences with different file systems, different types of platform approaches, and balancing vendor support vs. in-house support. We will discuss best practices and ask for audience participation to try to refine those best practices to help users understand how to leverage parallel file systems successfully.
David Dillow, Jason Hill, Don Maxwell, Ross Miller, Sarp Oral, Galen Shipman, and Feiyi Wang, Oak Ridge National Laboratory (ORNL) Monitoring Tools for Large Scale Systems Operating computing systems, file systems, and associated networks at unprecedented scale offers unique challenges for fault monitoring, performance monitoring and problem diagnosis. Conventional system monitoring tools are insufficient to process the increasingly large and diverse volume of performance and status log data produced by the world's largest systems. In addition to the large data volume, the wide variety of systems employed by the largest computing facilities presents diverse information from multiple sources, further complicating analysis efforts. At leadership scale, new tool development is required to acquire, condense, correlate, and present status and performance data to systems staff for timely evaluation. This paper details a set of system monitoring tools developed by the authors and utilized by systems staff at Oak Ridge National Laboratory's Leadership Computing Facility, including the Cray XT5 Jaguar. These tools include utilities to correlate I/O performance and event data with specific systems, resources, and jobs. Where possible, existing utilities are incorporated to reduce development effort and increase community participation. Future work may include additional integration among tools and implementation of fault-prediction tools.
David Dillow, Jason Hill, Dustin Leverman, Don Maxwell, Ross Miller, Sarp Oral, Galen Shipman, James Simmons, and Feiyi Wang, Oak Ridge National Laboratory (ORNL) Lessons Learned in Deploying the World's Largest Scale Lustre File System The Spider parallel file system at Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) is the world's largest scale Lustre file system. It has nearly 27,000 file system clients, 10 PB of capacity, and over 240 GB/s of demonstrated I/O bandwidth. In full-scale production for over 6 months, Spider provides a high performance parallel I/O environment to a diverse portfolio of computational resources. These range from the high-end, multi-petaflop Jaguar XT5 and the mid-range, 260-teraflop Jaguar XT4 to the low end, with numerous systems supporting development, visualization, and data analytics. Throughout this period we have had a number of critical design points reinforced while learning a number of lessons on designing, deploying, managing, and using a system of this scale. This paper details our operational experience with the Spider file system, focusing on observed reliability (including MTTI and MTTF), manageability, and system performance under a diverse workload.
Douglas Doerfler and Courtenay Vaughan, Sandia National Laboratories (SNLA) Analyzing Multicore Characteristics for a Suite of Applications on an XT5 System In this paper, we will explore the performance of applications important to Sandia on an XT5 system with dual socket AMD 6 core Istanbul nodes. We will explore scaling as a function of the number of cores used on each node and determine the effective core utilization as core count increases. We will then analyze these results using profiling to better understand resource contention within and between nodes.
Thomas Edwards and Kevin Roy, Cray Inc. Using I/O Servers to Improve Performance on Cray XT Technology Amdahl's Law proposes that parallel codes are combinations of parallel and serial tasks. In many cases these tasks are inherently parallel and can be decomposed and performed asynchronously. Each task operates on a dedicated subset of processors, with highly scalable tasks operating on very large numbers of processors and less scalable tasks (like I/O) operating on a smaller number. By moving to this Multiple Instruction Multiple Data paradigm, codes can achieve greater parallel efficiency and scale further. This paper specifically addresses the implementation and experiences of adapting several codes important to HECToR to offload writing output data onto a set of dedicated server processors.
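A minimal sketch of the I/O-server pattern described above, assuming a hypothetical rank layout in which the last NIO ranks act as writers; the codes adapted for HECToR will differ in detail.

    /* Hedged sketch of the I/O-server pattern (not the authors' code):
     * split MPI_COMM_WORLD into compute ranks and dedicated I/O servers. */
    #include <mpi.h>
    #include <stdio.h>

    #define NIO 4   /* number of dedicated I/O server ranks (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int is_io = (rank >= size - NIO);   /* last NIO ranks serve I/O */
        MPI_Comm work;                      /* compute or I/O sub-communicator */
        MPI_Comm_split(MPI_COMM_WORLD, is_io, rank, &work);

        if (is_io) {
            /* I/O servers: receive output buffers from compute ranks over
             * MPI_COMM_WORLD and write them to disk, asynchronously with
             * respect to the computation. */
            printf("rank %d acting as I/O server\n", rank);
        } else {
            /* Compute ranks: send output data to an I/O server and return
             * immediately to the time-stepping loop. */
            printf("rank %d computing\n", rank);
        }

        MPI_Comm_free(&work);
        MPI_Finalize();
        return 0;
    }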
David Emerson and Vincenzo Fico, STFC Daresbury Laboratory and Jason Reese, University of Strathclyde A Hybrid MPI/OpenMP Code Employing a High-Order Compact Scheme for the Simulation of Hypersonic Aerodynamics High-order compact schemes are excellent candidates for Direct Numerical Simulation and Large Eddy Simulation of flow fields. We have devised a high-order compact scheme suitable for the simulation of hypersonic flows, to exploit both shared and distributed memory paradigms. Our hybrid application, employing both MPI and OpenMP standards, has been tested on HECToR.
Tom Engel, National Center for Atmospheric Research (NCAR) HPC at NCAR: Past, Present and Future The history of high-performance computing at NCAR is reviewed from Control Data Corporation's 3600 through the current IBM p575 cluster, but with special recognition of NCAR's relationship with Seymour Cray and Cray Research, Inc. The recent acquisition of a Cray XT5m is discussed, along with the rationale for that acquisition. NCAR's plans for the new NCAR-Wyoming Supercomputing Center in Cheyenne, Wyoming, and the current status of that construction project, are also described.
Matthew Ezell, National Institute for Computational Sciences (NICS) Collecting Application-Level Job Completion Statistics Job failures are common on large high performance computing systems, but logging, analyzing, and understanding the low-level error messages can be difficult on Cray XT systems. This paper describes a set of tools to log and analyze applications in real-time as they run on the system. By obtaining more information about typical error scenarios, system administrators can work to resolve the underlying issues and educate users.
Jihan Kim, Alice Koniges, Robert Preissl, and John Shalf, National Energy Research Scientific Computing Center (NERSC); David Eder, Aaron Fisher, Nathan Masters, and Mlaker Velimir, Lawrence Livermore National Laboratory; Stephan Ethier and Weixing Wang, Princeton Plasma Physics Laboratory; Martin Head-Gordon, University of California, Berkeley; Nathan Wickman, Cray Inc. Application Acceleration on Current and Future Cray Platforms Application codes in a variety of areas are being updated for performance on the latest architectures. We describe current bottlenecks and performance improvement areas for applications including plasma physics, chemistry related to carbon capture and sequestration, and material science.
Mark Fahey, Bilel Hadri and Nicholas Jones, National Institute for Computational Sciences (NICS) Automatic Library Tracking Database The National Institute for Computational Sciences and the National Center for Computational Sciences (both located at Oak Ridge National Laboratory) have been working on an automatic library tracking database whose purpose is to track which libraries are used on their Cray XT5 Supercomputers. The database stores the libraries that are used at link time and it records which executable is run during a batch job. With this data, many operationally important questions can be answered like which libraries are most frequently used and who is using deprecated libraries or applications. The infrastructure design and reporting mechanisms will be presented with production data to this point.
Al Geist, Guruprasad Kora, and Byung-Hoon Park, Oak Ridge National Laboratory (ORNL) and Junseong Heo, National Institute for Computational Sciences (NICS) RAVEN: RAS Data Analysis Through Visually Enhanced Navigation Supercomputer RAS data contain various signatures regarding system status and thus are routinely examined to detect and diagnose faults. However, due to the voluminous sizes of the logs generated during faulty situations, a comprehensive investigation that requires comparisons of different types of RAS logs over both spatial and temporal dimensions is often beyond the capacity of human operators, which leaves a cursory look as the only feasible option. In an effort to better embrace informative but huge supercomputer RAS data in a fault diagnosis/detection process, we present a GUI tool called RAVEN that visually overlays various types of RAS logs on a physical system map where correlations between different fault types can be easily observed in terms of their quantities and locations at a given time. RAVEN also provides an intuitive fault navigation mechanism that helps examine logs by clustering them to their common locations, types, or user applications. By tracing down notable fault patterns reflected on the map and their clustered logs, and superimposing user application data, RAVEN, which has been adopted at the National Institute for Computational Sciences (NICS) at the University of Tennessee, identified the root causes of several system failures logged on the Kraken XT5.
Al Geist, Raghul Gunasekaran, Byung Park, and Galen Shipman, Oak Ridge National Laboratory (ORNL) Correlating Log Messages for System Diagnostics In large-scale computing systems, the sheer volume of logs generated has challenged the interpretation of log messages for debugging and monitoring purposes. For a non-trivial event, the Jaguar XT5 at the Oak Ridge Leadership Computing Facility, with more than eighteen thousand compute nodes, would generate a few hundred thousand log entries in less than a minute. Determining the root cause of such events requires analyzing and understanding these log messages. Most often, these log messages are best understood when they are interpreted collectively rather than being read as individual messages. In this paper, we present our approach to interpreting log messages by identifying commonalities and grouping them into clusters. Given a set of log messages within a time interval, we parse and group the messages based on source, target, and/or error type, and correlate the messages with hardware and application information. We monitor the XT5's console, netwatch, and syslog logs and show how such grouping of log messages helps in detecting system events. By intelligent grouping and correlation of events from multiple sources we are able to provide system administrators with meaningful information in a concise format for root cause analysis.
David Gigrich, CUG President, The Boeing Company (BOEING) Open Discussion with the CUG Board The CUG Board members will be present to hold open discussions with member institutions. This is an excellent opportunity to learn more about decisions that have been made or to share your concerns with the Board.
Forest Godfrey, Cray Inc. Resiliency Features in the Next Generation Cray Gemini Network As system sizes scale to ever increasing numbers of nodes and network links, network failures become an increasingly important problem to address. With its next generation high speed network (code named Gemini), Cray will introduce a number of new resiliency features in this area. These features, including network link failover, are discussed in this paper as well as a comparison to other, more familiar, network technologies such as Ethernet and Infiniband.
Josh Goldenhar and Francesco Torricelli, DataDirect Networks, Inc. DataDirect™ Networks' Storage Fusion Architecture High performance storage systems usually fall into one of two categories: those with high IOPS capability or those with high throughput capability. In the world of supercomputing where the focus is usually on massive scale, the preference (and need) has traditionally favored storage systems with high throughput capabilities. The move to multi-core processors, coupled with the ever-increasing number of nodes in supercomputing clusters, has fundamentally changed the data patterns scalable storage must handle. Traditional storage systems cannot deliver both high IOPS and high throughput. This paper presents a new storage architecture that can adapt to modern compute environments and the unique data storage challenges they present. Additionally, it outlines how the architecture allows for embedding clustered file systems directly into the storage, resulting in reductions in complexity and latency.
Chris Gottbrath, TotalView Technologies Improving the Productivity of Scalable Application Development with TotalView Scientists and engineers who set out to solve grand computing challenges need TotalView at their side. The TotalView debugger provides a powerful and scalable tool for analyzing, diagnosing, debugging and troubleshooting a wide variety of problems that might come up in the course of such work. These teams, and teams of scientists pursuing a wide range of computationally complex problems on Cray XT systems, are frequently diverse and geographically distributed. These groups work collaboratively on complex applications in a computational environment that they access through a batch resource management system. This talk will explore the productivity challenges faced by scientists and engineers in this environment -- highlighting both long-standing (but perhaps unfamiliar) and recently introduced capabilities that TotalView users on Cray can take advantage of to boost their productivity. The list of capabilities will include the CLI, subset attach, Remote Display Client, TVScript, MemoryScape's reporting, and ReplayEngine.
Richard Graham and Joshua Ladd, Oak Ridge National Laboratory (ORNL) Hierarchy Aware Blocking and Nonblocking Collective Communications—The Effects of Shared Memory and Torus Topologies in the Cray XT5 Environment MPI Collective operations tend to play a large role in limiting the scalability of high-performance scientific simulation codes. As such, developing methods for improving the scalability of these operations is critical to improving the scalability of such applications. Using infrastructure recently developed in the context of the FASTOS program, we will study the performance of blocking collective operations, as well as that of the recently added MPI nonblocking collective operations, taking into account both shared memory and network topologies.
Richard L. Graham and Rainer Keller, Oak Ridge National Laboratory (ORNL) MPI Queue Characteristics of Large-scale Applications Applications running at scale have varying communication characteristics. By employing the PERUSE introspection interface of Open MPI, this paper evaluates several large-scale simulations running production-level input data-sets on the Jaguar installation at ORNL. The maximum number of queued messages, the average duration of unexpected receives, and late-sender and late-receiver information are presented as a function of job size.
Xu Guo and Joachim Hein, EPCC (EPCC) PRACE Application Enabling Work at EPCC The Partnership for Advanced Computing in Europe (PRACE) created the prerequisites for a pan-European HPC service, consisting of several tier-0 centres. PRACE's aim has now moved to the implementation of this service. The now-completed work looked into all aspects of the pan-European service, including the contractual and organisational issues, the system management, application enabling and future computer technologies. This talk discusses the work done by EPCC on the application codes HELIUM (from Queen's University Belfast, UK) and NAMD (from the University of Illinois at Urbana-Champaign, US), with a particular focus on the work carried out for the PRACE prototype Louhi, which is a Cray XT5 at CSC in Finland. We will also include a performance comparison with non-Cray systems available to PRACE.
David Hancock and Gregor von Laszewski, Indiana University (INDIANAU) FutureGrid: Design and Implementation of a National Grid Test-Bed Indiana University is leading the creation of a grid test-bed for the National Science Foundation with nine partner institutions. FutureGrid is a high performance grid test-bed that will allow scientists to work collaboratively to develop and test novel approaches to parallel, grid, and cloud computing.
Trev Harmon, Adaptive Computing Things to Consider When Developing and Deploying a Petascale System to Ensure Optimum Scheduling of Resources As Petascale systems are establishing themselves as the standard for high-end Top 500 machines, there are many important design and operational considerations to be resolved. In this paper, Adaptive Computing will discuss how scalability issues are being addressed at some of the leading sites and which outstanding issues are on the horizon for future systems.
Yun (Helen) He, Hwa-Chun Wendy Lin, and Woo-Sun Yang, National Energy Research Scientific Computing Center (NERSC) Franklin Job Completion Analysis The NERSC Cray XT4 machine Franklin has been in production for 3000+ users since October 2007, with about 1800 jobs run each day. There has been an on-going effort to better understand how well these jobs run, whether failed jobs are due to application errors or system issues, and to further reduce system related job failures. In this paper, we will talk about the progress we made in tracking job completion status, in identifying job failure root causes, and in expediting resolution of job failures, such as hung jobs, that are caused by system issues. In addition, we will present some Cray software design enhancements we requested to help us track application progress and identify errors.
Scott Hemmert, Sandia National Laboratories (SNLA) Multi-core Programming Paradigms and MPI Message Rates—A Growing Concern? The continued growth in per-node core count in high performance computing platforms has led the community to investigate alternatives to an MPI-everywhere programming environment. A hybrid programming environment, in which MPI is used for coarse-grained, inter-node parallelism and a threaded environment (pthreads, OpenMP, etc.) is used for fine-grained, intra-node parallelism presents an appealing target for future applications. At the same time, memory and network bandwidth both continue to grow at a significantly slower pace than processor performance. This trend, combined with increased parallelism due to larger machine sizes, will drive applications away from the bandwidth-limited BSP model to one with a higher number of smaller messages, which avoids unnecessary memory-to-memory copies inside a single node. The increase in small message transfers requires a higher message rate from a single node. Current network designs rely on a number of tasks on a single node injecting messages into the network in order to achieve optimal message rates. This paper quantifies the impact of local process count on node-level message rate for Cray XT5 hardware. The results are an important metric in designing both MPI implementations and applications for the hybrid programming future.
Nic Henke, Chris Horn, and Cory Spitz, Cray Inc. Imperative Recovery for Lustre Recovery times for Lustre failover are mainly a function of the overriding bulk data timeout because clients must time out to a server twice before initiating contact with its backup. As a result, failover completion times exceeding ten minutes are common. During failover and recovery, all I/O operations stall and the long duration can lead to job timeouts, poor system utilization, and increased administrator load. To improve overall failover times we are implementing Imperative Recovery, the framework by which Lustre can initiate and finish failover without waiting for long timeouts. Imperative Recovery directs clients to switch server connections based on automatic processing of node health data. With these changes and Version Based Recovery, it is possible to begin recovery very fast, reducing overall failover times to a few minutes. This paper discusses Imperative Recovery from a system perspective and characterizes the speedup achieved.
Chris January, David Lecomber, and Mark O'Connor, Allinea Software Petascale Debugging The need for debugging at scale is well known—yet machine sizes have raced ahead of the levels reachable by debuggers for many years. This paper outlines major development of Allinea's DDT debugging tool to introduce production-grade petascale debugging on the Oak Ridge Jaguar XT5 system. The resulting scalable architecture is raising the bar of usability and performance in a debugger by multiple orders of magnitude—and has already achieved record 225,000 core debugging at ORNL.
Alice Koniges, John Shalf, Hongzhang Shan, and Nick Wright, National Energy Research Scientific Computing Center (NERSC); Haoqiang Jin, NAS Systems Division (NAS); and Seung-Jai Min, Lawrence Berkeley National Laboratory Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper, we will examine the effect of the choice of programming model upon performance and overall memory usage. We will study how to make efficient use of the memory system and explore the advantages and disadvantages of MPI, OpenMP, and UPC on the Cray XT5 multicore platforms for several synthetic and application benchmarks.
Steve Johnson, Cray Inc. XT System Reliability: Metrics, Trends, and Actions In 2009, the XT product family saw a significant improvement in overall reliability as measured by Cray's support organization. This paper will discuss the reliability trends that have been observed and the main reasons for the improvements. We will also discuss the tools used to collect the field data, the metrics generated by Cray to evaluate XT product reliability and the actions taken as a result of this analysis.
Sergey Karabasov, University of Cambridge and Phil Ridley, Numerical Algorithms Group A Scalable Boundary Adjusting High-Resolution Technique for Turbulent Flows To accurately resolve turbulent flow structures, high-fidelity simulations require the use of millions of grid points. The Compact Accurately Boundary Adjusting High-Resolution Technique (CABARET) is capable of producing accurate results with at least 10 times more efficiency than conventional schemes. CABARET is based on a local second-order finite difference scheme which lends itself extremely well to large scale distributed systems. For Reynolds numbers of 10^4, the method gives rapid convergence without requiring additional preconditioning for Mach numbers as low as 0.05. In this paper we shall discuss the implementation and performance of the CABARET method on the HECToR XT4/6 system. We shall describe the development and optimization of an irregular parallel decomposition for the hexahedral numerical grid structure. Scalability of the code will be discussed in relation to (i) the effectiveness of the load balancing for grids generated from the partitioning method, (ii) compiler performance, and (iii) efficient use of MPI and memory utilisation.
Sylvain Laizet and Ning Li, Numerical Algorithms Group 2DECOMP&FFT: A Highly Scalable 2D Decomposition Library and FFT Interface As part of a HECToR distributed CSE support project, a general-purpose 2D decomposition (also known as 'pencil' or 'drawer' decomposition) communication library has been developed. This Fortran library provides a powerful and flexible framework to build applications based on 3D Cartesian data structures and spatially implicit numerical schemes (such as compact finite difference method or spectral method). The library also supports shared-memory architectures, which are becoming increasingly popular. A user-friendly FFT interface has been built on top of the communication library to perform distributed multi-dimensional FFTs. Both the decomposition library and the FFT interface scale well to tens of thousands of cores on Cray XT systems. The library has been applied to Incompact3D, a CFD application performing large-scale Direct Numerical Simulations of turbulence, enabling exciting scientific studies to be conducted.
Brent Leback, Douglas Miles, and Michael Wolfe, The Portland Group CUDA Fortran 2003 In the past year, The Portland Group has brought to market a low-level, explicit, Fortran GPU programming language, a higher-level, implicit, directive-based GPU programming model and implementation, and object-oriented features from the Fortran 2003 standard. Together, these provide a rich environment for programming today's and tomorrow's many-core systems. In this paper we will present some of the latest features available in the PGI Fortran compiler from these three areas, and explain how they can be combined to access the performance of CPUs and GPUs while hiding many of the messy details from application developers.
William Lu, Platform Computing Advanced Job Scheduling Features for Cray Systems with Platform LSF On large Cray systems where all simulation jobs are running through workload management, visibility of the system and jobs is critical for users and administrators to troubleshoot problems. Features in Platform LSF such as scheduling performance, resource reservation and job-level data display help simulation users and system administrators easily overcome this challenge. Benchmark data will show how Platform LSF outperforms other workload schedulers. We will also discuss additional technologies from Platform including Platform MPI and its integration with Platform LSF.
Glenn Luecke and Olga Weiss, Iowa State University Performance Analysis of Pure MPI Versus MPI+OpenMP for Jacobi Iteration and a 3D FFT on the Cray XT5 Today many high performance computers are collections of shared memory compute nodes with each compute node having one or more multi-core processors. When writing parallel programs for these machines, one can use pure MPI or various hybrid approaches using MPI and OpenMP. Since OpenMP threads are lighter weight than MPI processes, one would expect that hybrid approaches will achieve better performance and scalability than pure MPI. In practice this is not always the case. This paper investigates the performance and scalability of pure MPI versus hybrid MPI+OpenMP for Jacobi iteration and a 3D FFT on the Cray XT5.
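For illustration, a hedged sketch of the kind of hybrid Jacobi sweep being compared here (the benchmark's actual decomposition and problem sizes are assumptions); the pure-MPI variant simply omits the OpenMP pragma and runs one rank per core.

    /* Hedged sketch of a hybrid Jacobi sweep: MPI exchanges ghost rows with
     * neighbouring ranks, OpenMP threads the local 5-point update.
     * up/down may be MPI_PROC_NULL at the domain boundaries. */
    #include <mpi.h>
    #include <omp.h>

    /* One Jacobi iteration on a local (n+2) x (m+2) block with ghost rows. */
    void jacobi_sweep(double *u, double *unew, int n, int m,
                      int up, int down, MPI_Comm comm)
    {
        /* Exchange ghost rows with the ranks above and below. */
        MPI_Sendrecv(&u[1 * (m + 2)],       m + 2, MPI_DOUBLE, up,   0,
                     &u[(n + 1) * (m + 2)], m + 2, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[n * (m + 2)],       m + 2, MPI_DOUBLE, down, 1,
                     &u[0],                 m + 2, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);

        /* Threaded interior update. */
        #pragma omp parallel for
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                unew[i * (m + 2) + j] = 0.25 * (u[(i - 1) * (m + 2) + j] +
                                                u[(i + 1) * (m + 2) + j] +
                                                u[i * (m + 2) + j - 1] +
                                                u[i * (m + 2) + j + 1]);
    }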
Graham Mann, University of Leeds and Mark Richardson, Numerical Algorithms Group Combining OpenMP and MPI within GLOMAP Mode to Take Advantage of Multiple Core Processors: An Example of Legacy Software Keeping Pace with Hardware Developments The MPI version of GLOMAP MODE is being used in production runs for research into atmospheric science. The memory requirement prohibits the use of high-resolution scenarios, so 32 MPI tasks is the usual decomposition. One way to attempt higher resolution simulations is to under-populate the nodes, making more memory available per MPI task. Although this is wasteful of resources, it does provide a shorter time per existing simulation. The NAG Ltd DCSE service has examined the code and introduced OpenMP so that the otherwise "idle" cores can contribute to the MPI task. This improves the performance so that the additional cost of a simulation is reduced.
Pekka Manninen and Ari Turunen, CSC-Scientific Computing Ltd. (CSC) Towards a European Training Network in Computational Science The implementation phase of The Partnership for Advanced Computing in Europe (PRACE) project will develop and maintain a European training network in the field of computational science. Its key ingredients are solid contacts between the partner organisations and European research centres, as well as establishing new links to universities. In this talk, I will review the completed training-related activities of the preparatory phase of PRACE as well as plans for the implementation phase.
Kenneth Matney, Sr. and Galen Shipman, Oak Ridge National Laboratory (ORNL) Parallelism in System Tools The Cray XT, when employed in conjunction with the Lustre filesystem, provides the ability to generate huge amounts of data in the form of many files. This is accommodated by satisfying the requests of multiple Lustre clients in parallel. In contrast, a single service node (Lustre client) cannot provide timely management for such datasets. Consequently, as the dataset enters the 10+ TB range and/or hundreds of thousands of files, using traditional UNIX tools like cp, tar, or find . -exec ... \; to manage these datasets causes the impact to user productivity to become substantial. For example, it would take about 12 hours to copy a 10 TB dataset from the service node via cp if dedicated resources were employed. In general, it is not practical to schedule dedicated resources for a data copy and, as a result, a typical duty factor of 4X is incurred. This means that, in practice, it would take 48 hours to perform a serial copy of a 10 TB dataset. Over the next three to four years, datasets are likely to grow by a factor of 4X. At that point, the simple copy of a dataset may be expected to take over a week and represents a significant impediment to the investigation of science. In this paper, we introduce the Lustre User Toolkit for Cray XT, developed at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF) and demonstrate that, by optimizing and parallelizing system tools, an order of magnitude performance increase or more can be achieved, thereby reducing or eliminating the bottleneck. The conclusion is self-evident: parallelism in system tools is vital to managing large datasets.
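A minimal sketch of the parallelism such tools exploit, assuming a hypothetical copy_one() helper; this is not the Lustre User Toolkit itself.

    /* Hedged sketch of a parallel copy: distribute a file list round-robin
     * over MPI ranks so that many Lustre clients copy at once instead of a
     * single service node copying serially. */
    #include <mpi.h>
    #include <stdio.h>

    /* copy_one() is a placeholder for an ordinary buffered read/write loop. */
    extern int copy_one(const char *src, const char *dst);

    void parallel_copy(char **src, char **dst, int nfiles)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = rank; i < nfiles; i += size)   /* round-robin assignment */
            if (copy_one(src[i], dst[i]) != 0)
                fprintf(stderr, "rank %d: failed to copy %s\n", rank, src[i]);

        MPI_Barrier(MPI_COMM_WORLD);   /* all copies complete before returning */
    }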
Ian Miller, Cray Inc. Overview of the Current and Future Cray CX Product Family Please join us for a detailed product briefing of an exciting new product from Cray. This new product will be a significant enhancement to the Cray portfolio, and will expand the range of capabilities and programming models available to our customers and prospects.
Lawrence Mitchell, EPCC (EPCC) Near Real-time Simulations of Cardiac Arrhythmias CARP is a code for studying cardiac arrhythmias by discretizing an MRI scan and simulating the electric potential in the heart tissue. The aim is to use simulations to try out possible surgical interventions before using them on a patient. In this paper we report on work carried out to improve both the absolute and scaling performance on HECToR, the Cray XT at Edinburgh. We study the effects of different decomposition techniques, including a communication-hierarchy aware decomposition that minimizes inter-node communication at the expense of intra-node messages. These and further output optimizations have improved the scaling from 512 to 8192 cores on HECToR, allowing simulation of a single heartbeat in under 5 minutes.
Bob Moench, Cray Inc. Cray Debugging Support Tools for Petascale Applications As HPC systems have gotten ever larger, the amount of information associated with debugging a failing parallel application has grown beyond what the beleaguered application developer has the time, resources, and wherewithal to analyze. With the release of the Cray Debugging Support package, Cray introduces several innovative methods of attacking this vexing problem. FTD (Fast Track Debugging) achieves debugging at fully optimized speeds. STAT (Stack Trace Analysis Tool) facilitates the evaluation and study of hung applications. ATP (Abnormal Termination Processing) captures a STAT-like view of applications that have taken a fatal trap. And Guard, the Cray comparative debugger, delivers an automated search for the location of program errors by comparing a working version of an application against a failing version. This paper describes and explores each of the above technologies.
Jace Mogil and David Hadin, Pacific Northwest National Laboratory (PNNL) A Comparison of Shared Memory Parallel Programming Models The dominant parallel programming models for shared memory computers, Pthreads and OpenMP, are both "thread-centric" in that they are based on explicit management of tasks and enforce data dependencies through task management. By comparison, the Cray XMT programming model is data-centric where the primary concern of the programmer is managing data dependencies, allowing threads to progress in a data flow fashion. The XMT implements this programming model by associating tag bits with each word of memory, affording efficient fine grained synchronization of data independent of the number of processors or how tasks are scheduled. When task management is implicit and synchronization is abundant, efficient, and easy to use, programmers have viable alternatives to traditional thread-centric algorithms. In this paper we compare the amount of available parallelism in a variety of different algorithms and data structures when synchronization does not need to be rationed, as well as identify opportunities for platform and performance portability of the data-centric programming model on multi-core processors.
Don Morton and Oralee Nudson, Arctic Region Supercomputing Center (ARSC) Use of the Cray XT5 Architecture to Push the Limits of WRF Beyond One Billion Gridpoints The Arctic Region Supercomputing Center (ARSC) Weather Research and Forecasting (WRF) model benchmark suite continues to push the limits of the software and the available hardware by successfully running a 1 km resolution case study composed of more than one billion grid points. Simulations of this caliber are important for providing detailed weather forecasts over the rugged Alaska terrain and are intended for benchmarking on systems with tens of thousands of cores. In pursuing these large-scale simulations, we have encountered numerical, software, and hardware limitations that have required us to use various parallel I/O schemes and to explore different ALPS "aprun" placement options. In this paper we discuss issues encountered while gradually expanding the problem sizes at which WRF can operate, and our solutions for running high-resolution and/or large-scale WRF simulations on the Cray XT5 architecture.
John Noe, Sandia National Laboratories (SNLA) Dark Storm: Further Adventures in XT Architecture Flexibility The Cray/Sandia XT3 architecture instantiated in the Red Storm computer system at Sandia National Laboratories provides many opportunities for upgrade, modification, and customization. The original 40 TF platform has seen several upgrades over the years, resulting in its current 284 TF peak configuration. The system has also received updated disk subsystems, reaching over 2 PB of storage split between the two original network heads. Recently, Sandia was tasked with supporting efforts within DOE/NNSA aimed at extending the reach of traditional supercomputing into non-traditional areas of interest to national security organizations. To support this new initiative, Sandia and Cray combined concepts to create, deploy, and demonstrate an externally networked Lustre file system routed to Red Storm via multiple 10 GbE connections. File system access from Catamount compute nodes is routed through service nodes running the LNET router protocol. This concept was deployed, and initial demonstrations showed file system throughput of over 12 GB/s and support for over 12,000 nodes. Utilizing an innovative Woven network switch that provides load-balanced message handling, the concept can be extended to support multiple customer-specific file systems, all with equal access to the entire Red Storm system. Such separate file systems would connect serially to Red Storm, but their data would remain available to customers whether or not they are connected to Red Storm. This talk describes the testing, validation, and debugging efforts that resulted in the successful deployment of this national security resource, and some of the issues associated with debugging in a high-security environment.
Bill Nitzberg, Altair Engineering, Inc. High Performance Computing with Clouds: Past, Present, and Future Cloud Computing has its roots in technologies spanning the past 30 years: network operating systems, distributed systems, metacomputing, clusters, and Grids. The promise of Clouds, seamless on-demand access to computing power and information resources, is delivering substantial benefits in real production settings for datacenter applications. Although private Clouds are having success in HPC, the key characteristics that make public Clouds a compelling solution for the datacenter are orders of magnitude larger in HPC. Whether this gap can be crossed in the coming years is an open question; Clouds for HPC are in their infancy.
Stephen Pickles, STFC Daresbury Laboratory Multi-Core Aware Performance Optimization of Halo Exchanges in Ocean Simulations The advent of multi-core brings new opportunities for performance optimization in MPI codes. For example, the cost of performing a halo exchange in a finite-difference simulation can be reduced by choosing a partition into sub-domains that takes advantage of the faster shared-memory mechanisms available for communication between MPI tasks on the same node. I have implemented these ideas in the Proudman Oceanographic Laboratory Coastal-Ocean Modelling System, and find that multi-core aware optimizations can offer significant performance benefit, especially on systems built from hex-core chips. I also review several multi-core agnostic techniques for improving halo exchange performance.
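As a hedged illustration of the idea (not the Proudman Oceanographic Laboratory code), the sketch below builds a 2-D Cartesian decomposition whose fast-varying dimension equals an assumed cores-per-node count, so that, with consecutive rank placement on each node, one of the two halo-exchange directions stays on-node; the PPN value, array sizes, and the assumption that the job size is a multiple of PPN are all illustrative.

    #include <mpi.h>

    #define PPN  6     /* cores per node on a hex-core XT node: an assumption */
    #define NLOC 128   /* local sub-domain edge length: purely illustrative   */

    int main(int argc, char **argv)
    {
        int rank, size, dims[2], periods[2] = {0, 0};
        int north, south, west, east, i, j;
        double field[NLOC + 2][NLOC + 2];   /* interior plus one halo cell per side */
        MPI_Comm cart;
        MPI_Datatype col;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* assumes size is a multiple of PPN and that consecutive ranks share a node */
        dims[0] = size / PPN;   /* slow dimension: crosses node boundaries */
        dims[1] = PPN;          /* fast dimension: stays within a node     */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
        if (cart == MPI_COMM_NULL) { MPI_Finalize(); return 0; }

        MPI_Cart_shift(cart, 0, 1, &north, &south);   /* off-node neighbours       */
        MPI_Cart_shift(cart, 1, 1, &west, &east);     /* mostly on-node neighbours */

        for (i = 0; i < NLOC + 2; i++)
            for (j = 0; j < NLOC + 2; j++)
                field[i][j] = (double)rank;           /* dummy initial data */

        /* exchange one halo row in the off-node direction
           (the opposite direction is handled analogously) */
        MPI_Sendrecv(&field[1][1],        NLOC, MPI_DOUBLE, north, 0,
                     &field[NLOC + 1][1], NLOC, MPI_DOUBLE, south, 0,
                     cart, MPI_STATUS_IGNORE);

        /* exchange one (strided) halo column in the on-node direction */
        MPI_Type_vector(NLOC, 1, NLOC + 2, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);
        MPI_Sendrecv(&field[1][1],        1, col, west, 1,
                     &field[1][NLOC + 1], 1, col, east, 1,
                     cart, MPI_STATUS_IGNORE);
        MPI_Type_free(&col);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }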
Fiona Reid, EPCC (EPCC) The NEMO Ocean Modelling Code: A Case Study We present a case study of a popular ocean modelling code, NEMO, on the Cray XT4 HECToR system. HECToR is the UK's high-end computing resource for academic users. Two different versions of NEMO have been investigated. The performance and scaling of the code have been evaluated and optimised by investigating the choice of grid dimensions, by examining the use of land versus ocean grid cells, and by checking for memory-bandwidth problems. The code was profiled, and the time spent on file input/output was identified as a potential bottleneck. We present a solution to this problem which gives a significant saving in both runtime and disk space usage.
James Rosinski, NOAA/ESRL General Purpose Timing Library (GPTL): A Tool for Characterizing Performance of Parallel and Serial Applications GPTL is an open-source profiling library that reports a variety of performance statistics. Target codes may be parallel via threads and/or MPI. The code regions to be profiled can be hand-specified by the user, or GPTL can define them automatically at function-level granularity if the target application is built with an appropriate compiler flag. Output is presented in a hierarchical fashion that preserves the parent-child relationships of the profiled regions. If the PAPI library is available, GPTL utilizes it to gather hardware performance counter data. GPTL built with PAPI support is installed on the Jaguar machine at ORNL.
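A minimal hand-instrumented example of the kind the abstract describes is sketched below; the GPTLinitialize/GPTLstart/GPTLstop/GPTLpr calls follow the commonly documented GPTL interface, and any detail beyond those names should be treated as an assumption rather than a reference.

    #include <stdio.h>
    #include "gptl.h"

    static void do_work(void)
    {
        double s = 0.0;
        int i;
        for (i = 1; i < 1000000; i++)
            s += 1.0 / i;
        printf("sum = %f\n", s);
    }

    int main(void)
    {
        GPTLinitialize();            /* set up the library */

        GPTLstart("total");          /* user-defined, hand-specified regions */
        GPTLstart("do_work");
        do_work();
        GPTLstop("do_work");
        GPTLstop("total");

        GPTLpr(0);                   /* write the hierarchical timing report, id/rank 0 */
        GPTLfinalize();
        return 0;
    }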
Jason Schildt, Cray Inc. Dynamic Shared Libraries and Virtual Cluster Environment Cray is expanding system functionality to support Dynamic Shared Libraries (DSL) on compute nodes and the ability to run a wide range of packaged ISV applications there. The DSL capability is built upon the Data Virtualization Service (DVS): a more standard Linux runtime environment is distributed across the system from DVS server nodes to the compute node clients. The CLE Virtual Cluster Environment (VCE) adds a further layer of functionality by supporting natively installed and executed ISV applications. Together, this three-component solution allows customers to meet a wide range of runtime-environment demands with limited impact and complexity while increasing productivity.
Jason Schildt, Cray Inc. Dynamic Shared Libraries and Virtual Cluster Environment–BoF As a follow-up to the Dynamic Shared Libraries and Virtual Cluster Environment paper and presentation, Jason Schildt will chair a technical BoF covering deployment, configuration, and functionality questions from the participants.
Jason Schildt, Cray Inc. A Quick-start and HOWTO Guide for CMS Tools The CMS team will lead an open discussion for customers interested in deploying and using CMS Tools in the CLE-2.2 / SMW-4.0 releases. Topics will include tool overview and configuration, followed by CMS Tool capabilities and use cases.
Bill Sellers, Manchester University Virtual Paleontology: Gait Reconstruction of Extinct Vertebrates Using High Performance Computing Gait reconstruction of extinct animals requires the integration of paleontological information obtained from fossils with biological knowledge of the anatomy, physiology and biomechanics of extant animals. Computer simulation provides a methodology for combining multimodal information to produce concrete predictions that can be evaluated and used to assess the likelihood of competing locomotor hypotheses. However, with the advent of much faster supercomputers, such simulations can also explore a wider range of possibilities, allowing the generation of gait hypotheses de novo.
Adrian Tate, Cray Inc. Overview and Performance Evaluation of Cray LibSci Products This talk serves both as an introduction to the Cray scientific library suite and as a tutorial on obtaining advanced performance from applications that use scientific libraries. The talk will include a thorough and frank performance evaluation of all scientific library products on Cray XT systems: dense kernels on single and multiple cores, dense linear solvers and eigensolvers in serial and parallel, serial and distributed Fourier transforms, and sparse kernels within sparse iterative solvers. The emphasis will be on usage and on how to increase performance by using different algorithms or libraries, better configurations, or advanced controls of the scientific libraries.
Monika ten Bruggencate, Cray Inc. DMAPP: An API for One-sided Program Models on Baker Systems Baker and follow-on systems will deliver a network with advanced remote memory access capabilities. A new API, DMAPP, has been developed to expose these capabilities to one-sided program models. This paper presents the DMAPP API along with some preliminary performance data.
Robert Whitten, Jr., Oak Ridge National Laboratory (ORNL) A Pedagogical Approach to User Assistance This presentation will focus on a pedagogical approach to providing user assistance. By making user education the central theme in training, outreach, and user assistance activities, a set of competencies can be developed that encompasses the knowledge required for productive use of leadership-class computing resources such as the Cray XT5 Jaguar system.
Brian Wylie, Juelich Supercomputing Centre Scalable Performance Analysis of Large-scale Parallel Applications on Cray XT Systems with Scalasca The open-source Scalasca toolset [www.scalasca.org] supports integrated runtime summarization and automated trace analysis on a diverse range of HPC computer systems. An HPC-Europa2 visit to EPCC in 2009 resulted in significantly enhanced support for Cray XT systems, particularly for the auxiliary programming environments and hybrid OpenMP/MPI. Combined with its previously demonstrated extreme scalability and its portable performance-analysis comparison capabilities, Scalasca has been used to analyse and tune numerous key applications (and benchmarks) on Cray XT and other PRACE prototype systems, and experience with a representative selection of these is reviewed.
