
CUG 08 Proceedings



Abstracts of Papers

Presentation Titles and Authors
Keynote Address: Computers Crunching Lipids–From Cell Membranes to Lipoproteins, Professor Ilpo Vattulainen, Leader of the Biological Physics & Soft Matter Group, Tampere University of Technology Complex biological systems are characterized by a variety of length and time scales. The computational challenge of accounting for both atomistic and larger scales is particularly evident in lipid systems, where the length scales range from nano-sized lipids to 10-25 nm sized lipoproteins and further to micron-scale cell membranes. Here, we discuss means to bridge these scales through examples from a number of lipid systems, including, e.g., domain formation in many-component membranes and the structural aspects of lipoproteins, known as carriers of cholesterol.
Keynote Address: Dark Energy–A Mystery?, Kari Enqvist, Professor of Cosmology, University of Helsinki; Theory Programme Director, Helsinki Institute of Physics According to cosmological observations, the dominant energy component in the universe appears to be dark energy, the energy of the vacuum. I consider the theoretical challenges raised by such dark energy, and discuss the issues related to deducing the existence of dark energy from the observations.
Interactive Session: Legacy Systems SIG, Chair: Jim Kasdorf (PITTSC) The purpose of this Interactive Session is to provide an open forum for discussions among owners and operators of Cray systems which are operating in the field but which are no longer offered for sale. Cray's support commitment to these systems is well known and appreciated, yet the nature of platforms without a follow-on product offering naturally generates concerns and questions. This meeting is where these and other questions should be addressed. This session is open to everyone with an interest in Legacy platforms.
Interactive Session: User Support SIG, Chair: Jim Glidewell (BOEING) The purpose of this Interactive Session is to provide an open forum for discussion of any issues related to the services or processes relating to User Support, whether supplied by Cray or provided internally by operating sites. This is an opportunity to help each other and learn from the collective experience. This session is open to everyone wanting to learn about the area of User Support.
Interactive Session: Applications and Programming Environment SIG, Chair: Robert Ballance (SNLA), Deputy Chair: Rolf Rabenseifner (HLRS) The purpose of this Interactive Session is to provide an open forum for discussion of any issues related to software products or the interactions of these products. If you have an issue with a 3rd party software package, perhaps someone has had a similar experience and found a fix or workaround that would benefit your users. What can you do to improve performance of your favorite code? What new and powerful tool will help you provide better service to your program developers or analysts? This session is open to everyone wanting to learn and share more about Applications and Programming Environments.
Interactive Session: Systems Support SIG, Chair: Nick Cardo (NERSC) The purpose of this Interactive Session is to provide an open forum for discussion of any and all issues related to the installation, integration, maintenance and operation of major computing resources. Operations managers are familiar with the day to day challenges presented by complex computing systems and the constant goal is to improve all aspects of performance, reliability, ease of use, and maintainability, while providing secure reliable service to a demanding set of users. Listen to what others have tried in this area and contribute your ideas for improvement. Especially helpful are hints at what not to do. This session is open to everyone with an interest in Systems Support.
Migrating a Scientific Application from MPI to Co-Arrays,
John Ashby and John Reid, HPCX Consortium (HPCX)
MPI is a de facto standard for portable parallel programming using a message-passing paradigm. Interest is growing in other paradigms, in particular Partitioned Global Address Space (PGAS) languages such as UPC and Titanium. Most Computational Science and Engineering codes are still written in Fortran, and the 2008 Fortran standard will include Co-Arrays, a Cray-initiated PGAS extension of the language. We report on the experience of taking a moderately large CFD program and migrating it to a Cray X1 using Co-Arrays rather than MPI. Some comparison of the MPI and Co-Array versions will be given. We will discuss the transformation process, identifying any possibilities for automation, and the ease of programming and maintenance that Co-Arrays offer.
Exploring Extreme Scalability of Scientific Applications,
Mike Ashworth, Ian Bush, Ning Li, Andrew Sunderland, and Ilian Todorov, HPCX Consortium (HPCX)
With processor clock speeds now only slowly increasing, it is clear that in order to progress to the next level of high-end systems - 'Petascale' systems - we will need to work with systems comprising hundreds of thousands of nodes, with shared-memory nodes built from multi-core processor chips. This presents applications developers with significant challenges. Will existing algorithms scale to such extremely high numbers of tasks? How will the performance and memory requirements scale as we increase the problem sizes? Will it be necessary to switch to new programming models?

The availability of Cray XT4 systems with more than 10,000 cores at HECToR and ORNL allows us to explore the performance of current scientific application codes at much higher levels than hitherto possible, both in terms of the number of processors and of problem size.
We present performance and scalability results from codes from a range of disciplines, including CFD, molecular dynamics, and ocean modeling. Performance bottlenecks that limit scalability are analyzed using profiling tools.
Application Monitoring,
Robert Ballance, Sandia National Laboratories (SNLA) and John Daly and Sarah E. Michalak, Los Alamos National Laboratory
Application monitoring is required to determine the true performance of an application while it is running. Part I of this paper presents a light-weight monitoring system for first-order determination of whether a job is making progress. However, full application monitoring requires deeper characterization and accurate measurement of the phases of an application's processing. Using the data gathered through application monitoring we can also derive information about the underlying system reliability. This paper will show how a straightforward maximum likelihood estimate (MLE) can be used to accurately and effectively derive information about platform component reliability by monitoring the progress of the application workload.
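As a minimal sketch of the statistical idea (not the authors' code), suppose times between job interrupts are assumed exponentially distributed; the MLE of the failure rate then has a closed form, and the MTBF estimate is its reciprocal:

```python
# Illustrative sketch: maximum likelihood estimate of a platform failure
# rate from observed application progress, assuming exponentially
# distributed times between failures (an assumption for illustration).

def mle_failure_rate(interrupt_times):
    """interrupt_times: observed times (hours) between job interrupts.
    Returns (lambda_hat, mtbf_hat) where lambda_hat = n / sum(t_i)."""
    n = len(interrupt_times)
    total = sum(interrupt_times)
    lam = n / total        # MLE of the exponential rate parameter
    return lam, total / n  # MTBF estimate is the reciprocal of lambda_hat

# Example: five observed interrupt intervals, in hours
lam, mtbf = mle_failure_rate([12.0, 30.0, 8.0, 22.0, 28.0])
```

With these sample intervals the estimated rate is 0.05 failures/hour, i.e. an MTBF of 20 hours.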
Exploring the Performance Potential of Chapel in Scientific Computations,
Richard Barrett and Stephen Poole, Oak Ridge National Laboratory (ORNL)
The Chapel Programming Language is being developed as part of the Cray CASCADE program. In this paper we report on our investigations into the potential of Chapel to deliver performance while adhering to its code development and maintenance goals. Experiments, executed on a Cray X1E, and AMD dual-core and Intel quad-core processor based systems, reveal that with the appropriate architecture and runtime support, the Chapel model can achieve performance equal to the best Fortran/MPI, Co-Array Fortran, and OpenMP implementations, while substantially easing the burden on the application code developer.
Reaching a Computational Summit: The 1 PFLOP Cray XT at the Oak Ridge National Laboratory, Arthur Bland, James Hack, and James Rogers (ORNL) The ORNL Leadership Computing Facility (LCF) is preparing for the delivery of the first Cray XT system with a peak performance of more than 1000 TFLOPs. Based on the XT5 compute module and the next-generation chilled-water cooling system, this system will contain more than 27,000 quad-core AMD Opteron processors and 200TB of memory. The introduction of the system late in 2008 will complete significant upgrades to the facilities and infrastructure that include an expansion of the electrical power distribution system to 14MW, expanded cooling capacity to 6600 tons, and a new high-bandwidth Lustre parallel file system.
FFT Libraries on Cray XT: Current Performance and Future Plans for Adaptive FFT Libraries,
Jonathan Bentz, Cray Inc.
FFT libraries are one large focus of the Cray Math Software Team. Cray provides both ACML and FFTW for calculation of Fast Fourier Transforms. This talk will showcase current performance results for both ACML and FFTW, as well as the future plans for FFT on Cray XT architecture. These plans include creating a generic and simplified interface to FFT and allowing the FFT library itself to choose the best algorithm "on-the-fly" from a number of different choices, e.g., FFTW, ACML or some other tuned algorithm which we may provide.
Sorting Using the Virtex-4 Field Programmable Gate Arrays on the Cray XD1,
Stephen Bique, Robert Rosenberg, and Wendell Anderson, Naval Research Laboratory, (NRL) and Marco Lanzagorta, ITT Industries
Sorting of lists is required by many scientific applications. Doing so efficiently raises many questions. Which is faster for sorting many lists interactively: sorting many lists in parallel using a sequential algorithm, or sorting a few lists in parallel using a parallel algorithm? How large a list can be sorted using various data structures, and what is the relative performance? This paper presents case studies using the Mitrion-C programming language and results in the context of a program running on the cores of a Cray XD1 node with a Virtex-4 FPGA as a coprocessor to do the actual sorting of the data. We introduce a novel implementation that is linear (with a small constant) for sorting relatively large lists and that returns the indices for the sorted array.
Recent Improvements to Open MPI for the Cray XT,
Ron Brightwell, Sandia National Laboratories (SNLA); Richard Graham and Galen Shipman, Oak Ridge National Laboratory (ORNL); Joshua Hursey, Indiana University; and Brian Barrett, Sandia National Laboratories (SNLA)
Recently several improvements have been made to the Open MPI implementation for the Cray XT series of platforms. Enhancements and optimizations have been made to nearly all types of MPI communication -- point-to-point, collective, as well as one-sided. Runtime support for using either Accelerated or Generic Portals has been added, and configuration support for both Catamount and CNL has also been added. This paper provides an overview of these additions and improvements and offers a detailed performance comparison and analysis of the different flavors of Open MPI using micro-benchmarks and applications.
Exploring Memory Management Strategies in Catamount,
Ron Brightwell, Kurt Ferreira, and Kevin Pedretti, Sandia National Laboratories (SNLA)
This paper describes recent work involving Catamount's memory management strategy. We characterize memory performance with respect to alternative page mapping strategies and discuss the impact that these strategies can have on network performance. We also describe a strategy for address space mapping that can potentially benefit intra-node data movement for multi-core systems.
An InfiniBand Compatibility Library for Portals,
Ron Brightwell, Sandia National Laboratories (SNLA) and Lisa Glendenning, HP
This paper describes the design, implementation, and performance of an InfiniBand compatibility library for Portals on the Cray XT series of machines. This library provides support for the reliable connection mode of the libibverbs interface as defined by the Open Fabrics Alliance. We will discuss the motivations for this work and present initial performance results.
The Need for Parallel I/O in Classical Molecular Dynamics,
Ian Bush, Ilian Todorov, and Andrew Porter, HPCX Consortium (HPCX)
One of the most commonly used computational techniques on capability computer systems is classical Molecular Dynamics (MD), which has been shown to scale excellently on such machines. However, the computational performance of modern top end machines is such that while the simulations may be performed in a reasonable time, writing the results of the simulation can become the time limiting step, especially when the I/O is performed in serial. Here, we examine the performance of the I/O within DL_POLY_3, a general purpose MD program from STFC Daresbury Laboratory, and show that for large systems and/or large processor counts it is crucial to adopt a parallel I/O strategy so as to be able to perform the desired science.
Detecting System Problems with Application Exit Codes,
Nicholas Cardo, National Energy Research Scientific Computing Center (NERSC)
With today's large systems, it is often difficult to detect system problems until a user reports an unexpected event. By analyzing application exit codes and batch job stderr/stdout files during batch job exit processing, it is possible to detect and track system related problems. A methodology was developed at the National Energy Research Scientific Computing Center and implemented through custom utilities on the XT4 to detect and track system problems. The details of this methodology along with the tools used will be discussed in detail in this paper.
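The core idea can be illustrated with a hypothetical sketch (field names and threshold are invented, not NERSC's actual log format): aggregate job-exit records per node and flag nodes whose failure count crosses a threshold, since repeated failures on one node suggest a system problem rather than a user error.

```python
# Hypothetical illustration: flag nodes with repeated nonzero exit codes.
from collections import Counter

def suspect_nodes(records, threshold=3):
    """records: iterable of (node, exit_code) tuples from job exit processing.
    Returns nodes whose nonzero-exit count reaches the threshold."""
    failures = Counter(node for node, code in records if code != 0)
    return sorted(node for node, n in failures.items() if n >= threshold)

records = [("nid00012", 0), ("nid00012", 139), ("nid00012", 139),
           ("nid00012", 139), ("nid00047", 0), ("nid00047", 1)]
flagged = suspect_nodes(records)  # nid00012 fails repeatedly, nid00047 does not
```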
Highly Scalable Networking Configuration Management For Highly Scalable Systems,
Nicholas Cardo, National Energy Research Scientific Computing Center (NERSC)

Today's systems have large numbers of specialized nodes, each requiring unique network configurations. On the XT4 at the National Energy Research Scientific Computing Center, there are 56 such nodes, each requiring unique network addresses and network routes. The usual approach to network management would be to maintain separate configuration files for each node. A simpler mechanism with increased flexibility was needed. With very little initial setup, all network addresses and routes can be maintained through two common files, one for routes and one for addresses. Changes to the configuration now only require simple edits to these two files. The details of configuring this environment and the simplicity of its management will be discussed in this paper.
Cray Operating System Plans and Status,
Charlie Carroll, Cray Inc.
Cray continues to improve and advance its system software. This paper and presentation will discuss new and imminent features with an emphasis on increased stability, robustness and performance.
Reverse Debugging with the TotalView Debugger,
Chris Gottbrath and Jill Colna, TotalView Technologies
Some of the most vexing software bugs to solve, in both serial and parallel applications, are those where the failure, such as a crash or implausible output data, happens long after, or in a completely unrelated section of the program from, the programming error that is the root cause of the bug. Scientists and engineers working "backwards" from the crash to the root cause often have to apply tedious and unreliable tricks to examine the program, because they are proceeding "against the grain" of most debuggers. This talk will explore how the reverse debugging capability being developed by TotalView Technologies will radically improve the speed and accuracy, and reduce the difficulty, of troubleshooting this class of defects, which is both common and challenging.
Cray XT5h (X2 blade) Performance Results,
Jef Dawson, Cray Inc.
The Cray XT5h offers multiple processor architectures in a single system. In this talk we will give a brief overview of the XT5h system, and then describe in more detail the novel features and performance of the Cray X2 Vector Processing Blade.
Application Performance Tuning on Cray XT Systems,
Adrian Tate, John Levesque, and Luiz DeRose, Cray Inc.
This tutorial addresses tools and techniques for application performance tuning on the Cray XT system. We will briefly discuss the system and architecture, focusing on aspects that are important for understanding the performance behavior of applications. The main part of the tutorial will describe compiler optimization flags and present the Cray performance measurement and analysis tools, as well as the highly optimized Cray scientific libraries. We will conclude the tutorial with optimization techniques, including discussions on MPI and I/O.
The Cray Programming Environment: Current Status & Future Directions,
Luiz DeRose, Mark Pagel, Heidi Poxon, Adrian Tate, Brian Johnson, and Suzanne LaCroix, Cray Inc.
In this paper we will present the Cray programming environment, focusing on its current status, recent features, and future directions. The paper will cover compilers, programming models, scientific libraries, and tools.
Performance Analysis and Optimization of the Trilinos Epetra Package on the Quad-Core AMD Opteron Processor,
Douglas Doerfler, Mike Heroux, and Sudip Dosanjh, Sandia National Laboratories (SNLA) and Brent Leback, The Portland Group
The emergence of multi-core microprocessor architectures is forcing application library developers to re-evaluate coding techniques, data structures and the impact of compiler optimizations to ensure efficient performance. In addition to the complexities of multiple cores per chip, SIMD word length is also increasing. Packages employed in the Trilinos framework, and in particular the Epetra package, must support this transition efficiently. This paper evaluates the performance of Epetra on the AMD Barcelona processor and investigates optimizations to exploit the performance potential of multiple cores and the double-wide SIMD unit.
Beyond Red Storm at Sandia,
Sudip Dosanjh and David Rogers, Sandia National Laboratories (SNLA)
This presentation describes architecture research and systems development activities at Sandia. The U.S. Congress established the Institute for Advanced Architectures in 2008 with Centers of Excellence in Albuquerque, NM and Knoxville, TN. A goal for the center is to develop key technologies that will enable useful exascale computing in the next decade in collaboration with industry and academia. Initial technology focus areas are memory subsystems, interconnect technologies, power and resilience. Also described is a new partnership with LANL in which the two labs will jointly architect, develop, procure and operate capability systems for DOE’s Advanced Simulation and Computing Program. The partnership is initially focused on a petascale production capability system that will be deployed in late 2009.
Massively Parallel Electronic Structure Calculations with Python Software,
Jussi Enkovaara, CSC–Scientific Computing Ltd. (CSC)
We have developed a Python program, GPAW, for electronic structure calculations. We show how a Python program can achieve high performance by using C extensions for the computationally most intensive parts. We also use optimized BLAS and LAPACK libraries for linear algebra and MPI for parallelization. We present examples of the efficiency and good parallel scaling of the program.
Design, Implementation, and Experiences of Third-Party Software Administration Policies at the ORNL NCCS,
Mark Fahey, Tom Barron, Nick Jones, and Arnold Tharrington, Oak Ridge National Laboratory (ORNL)
At the ORNL NCCS, we recently redesigned the structure and policies surrounding how we install third-party applications, most notably for use on our quad-core Cray XT4 (Jaguar) computer. Of particular interest is the addition of many scripts to automate installation and testing of these applications, as well as some reporting mechanisms. We will present an overview of the design and implementation, and also present some of the experiences we have had to date (good and bad).
Reducing Human Intervention in the Maintenance of Mass Storage Systems,
David Fellinger, Data Direct Networks, Inc.
The I/O bandwidth requirements of simulation and visualization clusters are growing in scale with the compute requirements of petascale and exascale architectures. Demands of multiple terabytes per second will be required to minimize the I/O cycles of these large machines and this will require higher density storage systems housing larger numbers of mechanical components. A hardware and software solution will be discussed that greatly reduces the need for component attention and replacement by leveraging long term mass storage system experience.
Optimization of a Suite of Parallel Finite Element Routines for a Cray XT4,
Jonathan Gibson, Lee Margetts, Francisco Calvo Plaza, and Vendel Szeremi, University of Manchester (MCC)
ParaFEM is a library of finite element codes being developed at the University of Manchester. This paper will describe the optimization of these routines for the Cray XT4.
The Dynamic PBS Scheduler,
Jim Glidewell, The Boeing Company (Boeing)
Defining multiple queues can help a site to control the mix of system resources consumed by running jobs. Unfortunately, static limits associated with such queues can lead to underutilization of the overall system. This paper will describe our method for adjusting queue limits dynamically, based on the priority and resource requirements of the current mix of jobs.
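A minimal sketch of the general idea (queue names and the rebalancing policy are invented for illustration, not Boeing's actual method): instead of fixed per-queue run limits, recompute limits from the current job mix so that slots unused by an idle queue can temporarily be lent to queues with waiting work.

```python
# Illustrative sketch: shift unused run-limit slots from idle queues
# to queues with waiting jobs, preserving the total limit.

def dynamic_limits(base_limits, queued):
    """base_limits: {queue: static run limit}; queued: {queue: jobs waiting}.
    Returns adjusted limits with idle queues' slots lent to busy queues."""
    spare = sum(limit for q, limit in base_limits.items() if queued.get(q, 0) == 0)
    busy = [q for q in base_limits if queued.get(q, 0) > 0]
    adjusted = dict(base_limits)
    for q in base_limits:
        if queued.get(q, 0) == 0:
            adjusted[q] = 0          # idle queue lends its slots away
    for q in busy:
        adjusted[q] += spare // len(busy)  # split spare slots over busy queues
    return adjusted

# "large" is empty, so its 4 slots are lent to "small":
limits = dynamic_limits({"small": 4, "large": 4}, {"small": 10, "large": 0})
```

In a real scheduler the adjusted limits would then be pushed back via the batch system's administrative interface, and priority would factor into the split.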
Overview of the Activities of the MPI Forum,
Richard Graham, Oak Ridge National Laboratory (ORNL)
The MPI Forum is currently meeting to consider changes to the MPI standard. The forum is considering creating versions 2.1, 2.2, and 3.0 of the standard. MPI version 3.0 is targeted for new features, with support for items such as Fault-Tolerance and improved one-sided communications being considered for addition, features that are important for applications running at the peta-scale and beyond. This talk will present a status report on the Forum discussions, and will provide an opportunity for user feedback on the importance of these features, and for feedback on other shortcomings of the current MPI standard.
Open MPI's Dynamic Process Control and Shared Memory Optimizations for the Cray XT4,
Richard Graham, Oak Ridge National Laboratory (ORNL)
Recent work porting Open MPI on the Cray XT4 to use Cray's ALPS scheduler and the TCP/IP functionality available with CNL has made it possible to enable more of Open MPI's functionality on this platform. This paper will discuss the benefits to applications of enabling shared memory communications on the XT4, bypassing the Portals network layer, and new collective operation optimizations made to take advantage of these capabilities. Data will be presented on the effect of these capabilities on several applications running on the National Center for Computational Sciences' Jaguar system at Oak Ridge National Laboratory. In addition, this paper will discuss the MPI dynamic process management capabilities enabled by this work.
Leadership Computing at the National Center for Computational Science: An Enabling Partner for Breakthrough Science,
James Hack, Arthur Bland, Doug Kothe, and James Rogers, Oak Ridge National Laboratory (ORNL)
In May 2004, the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory was selected as the Leadership Computing Facility by the U.S. Department of Energy. This talk will review the evolution of Leadership Computing at the NCCS and how it has grown to be an important partner in the pursuit of breakthrough computational science. A large fraction of the computational resources, 145 million processor hours in 2008, are now provided to investigators selected for participation in the US Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, which supports computationally intensive, large-scale scientific research projects. This talk will discuss several of the INCITE projects supported by the NCCS, and will provide a preview of plans for continuing to expand the capabilities of the center.
CNL–The Evolution of a Compute OS,
Jim Harrell and Dave Wallace, Cray Inc.
Compute Node Linux (CNL) was delivered this past year. We will review the status of CNL at sites and discuss migration decision points. The talk will cover the direction of CNL including the plans for features and functions in the next several versions.
Franklin: User Experiences,
Yun (Helen) He and Nicholas Cardo, National Energy Research Scientific Computing Center (NERSC)
The newest workhorse of the National Energy Research Scientific Computing Center is a Cray XT4 with 9736 dual-core nodes. While there have been many challenges with migrating from an IBM platform to the Cray, the overall results have been quite successful. Details of these challenges and successes will be discussed in this paper. The evolution of the user environment will also be discussed.
Parallel 3D-FFTs for Multi-processing Core Nodes on a Meshed Communication Network,
Joachim Hein, Alan Simpson, and Arthur Trew, HPCX Consortium (HPCX); Heike Jagode, ZIH, Dresden University; and Ulrich Sigrist, EPCC, University of Edinburgh
Parallel fast Fourier transformations are important for many scientific applications and are difficult to parallelize efficiently onto a large number of processors due to the all-to-all nature of the communications. We will discuss how the topology of a meshed communication network can affect the communication performance and explore how nodes offering several processing cores can be exploited to improve the communication performance. The presentation includes benchmarking results from the UK's national supercomputing services HECToR (Cray XT4 with dual core Opterons) and HPCx (IBM p575 cluster with 16-way SMP nodes and HPS interconnect) as well as the University of Edinburgh Blue Gene/L.
Lustre Scaling on the Cray XT,
Nicholas Henke, Cray Inc.
Our experiences with providing a Lustre-based parallel file system solution have exposed unique challenges at the scale of the Cray XT systems. This paper describes these distinct problems, how they need to be addressed at scale, and how the increased knowledge base will enhance future Lustre products and Cray systems.
Parallel In-Line Finite-Element Mesh Generation,
David Hensinger, Sandia National Laboratories (SNLA)
Generation of large finite-element meshes is a serious bottleneck for parallel simulations. When mesh generation is limited to serial machines and element counts approach a billion, this bottleneck becomes a roadblock. To surmount this barrier the ALEGRA shock and multi-physics project has developed a parallel mesh generation library that allows on-the-fly scalable generation of finite element meshes for several simple geometries. It has been used to generate more than 1.1 billion elements on 17,576 processors. The library operates on the assumption (and constraint) that the mesh generation process is deterministic. Each processor in a parallel simulation is provided with a complete specification of the mesh, but it only creates a full representation of the elements that will be local to that processor. Because of this, no inter-processor communication is performed.
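The principle above can be sketched in a few lines (a toy 1D decomposition, not the ALEGRA library): every rank runs the same deterministic code against the full mesh specification and instantiates only its own elements, so the ranges tile the mesh with no communication.

```python
# Illustrative sketch of deterministic, communication-free mesh partitioning:
# each rank derives its local element range from the global spec alone.

def local_elements(n_elements, rank, nranks):
    """Block-partition a 1D structured mesh of n_elements over nranks.
    Every rank computes the same answer independently."""
    per, extra = divmod(n_elements, nranks)
    start = rank * per + min(rank, extra)       # earlier ranks absorb remainder
    end = start + per + (1 if rank < extra else 0)
    return list(range(start, end))

# Simulate 3 "ranks" each running the same code; together they cover the mesh:
all_elems = [e for r in range(3) for e in local_elements(10, r, 3)]
```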
Moab and TORQUE Achieve High Utilization on Flagship NERSC XT4 System,
Michael Jackson, Cluster Resources, Inc.
Moab and TORQUE keep NERSC's Cray XT4 "Franklin" humming with ultra high utilization, high availability and rich policy controls. This document will use the NERSC leadership class deployment as a case study to highlight the features and capabilities implemented by Cluster Resources to achieve consistent utilization in the high 90s and improve manageability and usability of the Cray XT4 system leveraging ALPS on compute node Linux (CNL). Moab integrates Cray's monitoring and management toolset with its scheduling and reservation engine for a holistically optimized solution.
Managing MPI Event Queue & Message Buffer Space,
Geir Johansen, Cray Inc.
A potential factor that limits the performance of MPI applications on the Cray XT system is the amount of space available for MPI event queues and message buffers. The Cray XT implementation of MPI allows the user to configure the size of the MPI event queues and message buffers. The paper will outline the MPI buffer configurations that are available and how they can be used to improve the application's performance and scalability. Common MPI buffer error messages will be described along with their potential resolutions. Finally, the MPI configuration settings for several MPI applications will be discussed.
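For illustration, the buffer sizes described above are controlled through environment variables set before application launch; the values below are examples only, and the variable names and defaults should be checked against the intro_mpi man page for the installed MPT release:

```shell
# Illustrative settings in a batch script (example values, not recommendations):
export MPICH_UNEX_BUFFER_SIZE=120M      # buffer space for unexpected messages
export MPICH_MAX_SHORT_MSG_SIZE=32000   # eager/rendezvous cutoff in bytes
export MPICH_PTL_UNEX_EVENTS=40000      # Portals unexpected event queue entries
```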
Best Practices for Security Management in Supercomputing,
Urpo Kaila, CSC–Scientific Computing Ltd. (CSC)
In all areas of IT we can see increasing threats endangering the three classical objectives of information security: confidentiality, integrity, and the availability of systems, data, and services. At the same time, governments are increasing the pressure to comply with proactive security measures and security-related legislation. Trouble also arises from the increasing complexity of technology, customer organizations, and services provided, and from the demand for better efficiency and ease of use.

How does all this apply to Data Centers providing supercomputing services? How does and how should supercomputing security differ from security for "normal" computing?

All the basic security principles apply to supercomputing as well. Risk analyses should be made, requirements should be understood, and physical, technical, and administrative security controls should be implemented and audited.

We present a top-down overview on how to implement good practices of information security management in supercomputing and suggest more international security-related collaboration in comparing and benchmarking information security practices for supercomputing.
ALPS: Ascent,
Michael Karo, Cray Inc.
The Application Level Placement Scheduler (ALPS) provides application placement and launch services for Cray systems employing the Cray Linux Environment (CLE). This tutorial is entitled "Ascent", and extends upon last year's "Base Camp" presentation. Aspects of ALPS administration and troubleshooting will be discussed. We will also explore recent developments in the ALPS suite, including checkpoint/restart, comprehensive system accounting (CSA), and huge page memory support.
Adaptive IO System (ADIOS),
Scott Klasky, Chen Jin, Stephen Hodson, James B. (Trey) White III and Weikuan Yu, Oak Ridge National Laboratory (ORNL); Jay Lofstead, Karsten Schwan, and Matthew Wolf, Georgia Tech; Wei-keng Liao and Alok Choudhary, Northwestern University; and Manish Parashar and Ciprian Docan, Rutgers University
ADIOS is a state-of-the-art componentization of the IO system that has demonstrated impressive IO performance results on the Cray XT system at ORNL. ADIOS separates the selection and implementation of any particular IO routines from the scientific code, offering unprecedented flexibility in the choices for processing and storing data. The API was modeled on F90 IO routines, emphasizing simplicity and clarity, using external metadata for richness. The metadata is described in a stand-alone XML file that is parsed once on code startup and determines what IO routines and parameters are used by the client code for each grouping of data elements generated by the code. By employing this API, a simple change to an entry in the XML file switches the code to use synchronous MPI-IO, collective MPI-IO, parallel HDF5, pnetcdf, NULL (no output), or asynchronous transports such as the Rutgers DART implementation and the Georgia Tech DataTap method. Simply by restarting the code, the new IO routines selected in the XML file will be employed. Furthermore, we have been defining additional metadata tags to support in-situ visualization solely through changes in the XML metadata file. The power of this technique is demonstrated on the GTC, GTC_S, XGC1, S3D, Chimera, and Flash codes. We show that when these codes run on a large number of processors, they can sustain high I/O bandwidth when they write out their restart and analysis files.
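A sketch of what such an external metadata file might look like (element and attribute names here follow the general shape described above but should be treated as illustrative, not as the exact ADIOS schema): the transport method for a group of variables is chosen by a single line in the XML, not in the source code.

```xml
<!-- Illustrative sketch: one group of output variables and its transport. -->
<adios-config host-language="F90">
  <adios-group name="restart">
    <var name="zion" type="double" dimensions="nparam,mi"/>
  </adios-group>
  <!-- Switching from MPI-IO to, e.g., parallel HDF5 means editing only
       the method attribute below, then restarting the code. -->
  <method group="restart" method="MPI"/>
</adios-config>
```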
Exascale Computing: Science Prospects and Application Requirements,
Sean Ahern, Sadaf Alam, Mark Fahey, Rebecca Hartman-Baker, Ricky Kendall, Douglas Kothe, O. E. Bronson Messer, Richard Mills, Ramanan Sankaran, Arnold Tharrington, and James B. White III, Oak Ridge National Laboratory (ORNL)
The US Department of Energy recently sponsored a series of workshops on the potential for exascale computing in the next decade. The National Center for Computational Sciences followed these up by interviewing some of the top scientists now using our Crays. We report our findings from these interviews, including the scientific goals, potential breakthroughs, and expected requirements of exascale computing.
HPC in 2016: A Viewpoint from NERSC,
Deborah Agarwal, Michael Banda, Wes Bethel, John Hules, Juan Meza, Horst Simon, and Michael Wehner, Lawrence Berkeley National Laboratory; William Kramer, Leonid Oliker, John Shalf, David Skinner, Francesca Verdier, Howard Walter, and Katherine Yelick, National Energy Research Scientific Computing Center (NERSC)
NERSC and Berkeley Lab have created a vision of the attributes needed for high-performance computing environments 10 to 15 years from now. This vision, spanning all areas of science and engineering, comes from an in-depth understanding of current and future applications, algorithms, science requirements, and technology directions. Whether at exascale or very high petascale, there are significant challenges for vendors, facility providers, stakeholders, and users. This talk will present the NERSC 2016 vision and discuss its challenges and opportunities for success.
Petaflop Computing in the European HPC Ecosystem,
Kimmo Koski, Managing Director, CSC–Scientific Computing Ltd. (CSC)
During the last few years, Europe has put together a joint initiative aimed at providing European researchers access to top-end computing facilities exceeding a petaflop/s of performance. This Partnership for Advanced Computing in Europe (PRACE) collaboration has started a preparatory-phase project, partly funded by the European Union, and aims to launch the first petaflop center in 2010. To serve a user community with varying needs efficiently, it is important to link the future top-of-the-pyramid resources to the full European HPC ecosystem, which includes not only various levels of computing but also supporting infrastructure, competence in software development, and collaboration among the various stakeholders.
A Micro-Benchmark Evaluation of Catamount and Cray Linux Environment (CLE) Performance,
Jeff Larkin, Cray Inc. and Jeff Kuehn, Oak Ridge National Laboratory (ORNL)
Over the course of 2007, Cray put significant effort into optimizing the Linux kernel for large-scale supercomputers. Many sites have already replaced Catamount with CLE on their XT3/XT4 systems, and many more will likely make the transition in 2008. In this paper we present results from several micro-benchmarks, including HPCC and IMB, to characterize the performance differences between Catamount and CLE. The purpose of this paper is to give users and developers a better understanding of the effect that migrating from Catamount to CLE will have on their applications.
Debugging at Scale with Allinea DDT on the Cray XT4/5,
David Lecomber, Allinea Software
In this presentation we explore how Allinea DDT handles the challenge of debugging thousands of processes, and how the architectures of the Cray hybrid parallel XT systems and of DDT have enabled a fast and scalable debugging environment. We will cover how DDT presents debugging to the user in an easy and intuitive manner, simplifying this usually complex but essential part of the development process.
Initial Performance Results from the Quad Core XT4,
John Levesque, Cray Inc.
By May 2008 many of us will have upgraded our XT systems to quad-core processors. Many questions need to be answered about the performance of these new sockets. This paper will examine performance issues ranging from core performance, to OpenMP on the socket, to using either straight MPI across all cores or the system as a distributed shared memory (DSM) machine.
With vectorization we should see factors of two in flops/clock; however, is the memory bandwidth sufficient to sustain this increased performance? How does twice the number of cores on a node impact the interconnect performance?
Can we use OpenMP on the node and MPI between nodes (DSM) to mitigate the interconnect and memory bandwidth issues?
All of these questions and more will be discussed, and results will be presented to shed some light on the performance issues with quad cores.
Crossing the Boundaries of Materials Characterization Using the Cray XT4,
Lee Margetts, Francisco Calvo, and Vendel Szeremi, University of Manchester (MCC)
Materials with complex architectures, such as composites and foams, offer advantages over traditional engineering materials, including increased strength and reduced weight. Nowadays, samples can be routinely imaged using X-ray tomography and converted to robust three-dimensional models comprising billions of finite elements. This paper presents recent efforts to characterize the bulk properties of these materials using an 11,000-core Cray XT4.
Restoring the CPA to CNL,
Don Maxwell and David Dillow, Oak Ridge National Laboratory (ORNL); Jeff Becklehimer and Cathy Willis, Cray Inc.
Job and processor accounting information, useful primarily for workload and failure analysis, is available in the Catamount release of UNICOS/lc but is not currently available in the CNL release. This paper describes work that was done to introduce accounting functionality into CNL, and subsequent work to provide failure analysis for jobs.
First Experiences with CHIMERA on Quad-Core Processors,
Bronson Messer, Raph Hix, and Anthony Mezzacappa, Oak Ridge National Laboratory (ORNL) and Stephen Bruenn, Florida Atlantic University
The advent of quad-core processors and methods for thread-level parallelism on the Cray XT4 have allowed us to begin exposing previously untapped levels of parallelism in our supernova simulation code CHIMERA. Computationally intensive modules evolving nuclear reaction networks and neutrino transport are our first targets for hybridization. We will describe our initial forays in this endeavor.
Large Scale Visualization on the Cray XT3 Using ParaView,
Kenneth Moreland, David Rogers, and John Greenfield, Sandia National Laboratories (SNLA); Berk Geveci, Alex Neundorf, and Pat Marion, Kitware, Inc.; Kent Eschenberg, Pittsburgh Supercomputing Center (PITTSCC)
Post-processing and visualization are key components to understanding any simulation. Porting ParaView, a scalable visualization tool, to the Cray XT3 allows our analysts to leverage the same supercomputer they use for simulation to perform post-processing. Visualization tools traditionally rely on a variety of rendering, scripting, and networking resources; the challenge of running ParaView on the Lightweight Kernel is to provide and use the visualization and post-processing features in the absence of many OS resources. We have successfully accomplished this at Sandia National Laboratories and the Pittsburgh Supercomputing Center.
A Technical Look at the New Scheduling and Reporting Features in PBS Professional,
Bill Nitzberg, Altair Engineering, Inc.
The newest release of PBS Professional includes three key features specifically designed for managing large systems with diverse user bases. The new submission filtering "qsub hook" provides a centralized method for admission control, allocation management, and on-the-fly tuning of job parameters. A totally new formula-based prioritization scheme allows full flexibility to mathematically define job priorities based on any combination of job-related parameters (e.g., eligible_time + WC*ncpus + WQ*q_priority + admin_adjust). And the new GridWorks Analytics package provides business-intelligence-style graphical reports with multi-dimensional slicing and dicing as well as drill-through capabilities.
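The example formula can be read as a plain weighted sum. This small sketch evaluates it with invented weight values; WC, WQ, and all inputs here are illustrative assumptions, not PBS defaults:

```python
# Illustrative evaluation of a formula-based job priority like the
# example in the abstract. WC and WQ are site-chosen weights; the
# values here are arbitrary assumptions for demonstration.
def job_priority(eligible_time, ncpus, q_priority, admin_adjust,
                 WC=10.0, WQ=100.0):
    """Larger result = scheduled earlier."""
    return eligible_time + WC * ncpus + WQ * q_priority + admin_adjust

# A big parallel job that has waited an hour vs. a fresh small job:
big = job_priority(eligible_time=3600, ncpus=1024, q_priority=1, admin_adjust=0)
small = job_priority(eligible_time=60, ncpus=4, q_priority=1, admin_adjust=0)
```

Because the formula is an ordinary arithmetic expression over job attributes, an administrator can re-weight it without touching scheduler code.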
Zest: The Maximum Reliable TBytes/sec/$ for Petascale Systems,
Paul Nowoczynski, Nathan Stone, Jared Yanovich, and Jason Sommerfield, Pittsburgh Supercomputing Center (PITTSCC)
PSC has developed a prototype distributed file system infrastructure that vastly accelerates application checkpointing by maximizing per-disk bandwidth efficiency. We have prototyped a scalable solution on the Cray XT3 compute platform that will be directly applicable to future petascale compute platforms having on the order of 10^6 cores. Our design emphasizes highly sequential write patterns, client-side request aggregation (for small IOs), client-side parity generation, and a unique model of load-balancing outgoing I/O onto high-speed intermediate storage. This design aims to achieve 90% efficiency from every disk drive in the IO system.
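Client-side parity generation can be illustrated in miniature: a client XORs the blocks it writes into a parity block, so any one lost block is recoverable from the survivors. This toy sketch is an illustration of the idea only, not Zest's code:

```python
# Toy illustration of client-side XOR parity, in the spirit of (but
# not taken from) Zest: a parity block over N data blocks lets any
# one lost block be rebuilt from the other N-1 plus the parity.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)                         # written alongside the data
rebuilt = xor_blocks([data[0], data[2], parity])  # block 1 lost: rebuild it
```

Generating the parity on the client keeps the storage servers on their fast, sequential write path, which is the property the design above exploits.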
Domain Decomposition Performance on ELMFIRE Plasma Simulation Code,
Francisco Ogando, Jukka Heikkinen, Salomon Janhunen, Timo Kiviniemi, Susan Leerink, and Markus Nora, CSC–Scientific Computing Ltd. (CSC)
ELMFIRE is a gyrokinetic full-f plasma simulation code, based on a particle-in-cell algorithm and parallelized with MPI. The coupled calculation of the electrostatic field and plasma particle dynamics leads to huge memory demands that can be split among processors by the introduction of domain decomposition. This technique modifies the MPI process topology, which influences collective operations. This paper shows the performance change due to the new algorithm, as well as the extension of capabilities with the Louhi cluster at CSC.
Modeling the Impact of Checkpoints on Next-Generation Systems,
Ron Oldfield and Rolf Riesen, Sandia National Laboratories (SNLA); Sarala Arunagiri, Patricia Teller, and Maria Ruiz Varela, University of Texas at El Paso; Seetharami Seelam, IBM; Philip Roth, Oak Ridge National Laboratory (ORNL)
The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to checkpoint/restart that allow continuous computing with minimal impact on the scalability of applications.
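A classic first-order model in this family is Young's approximation for the optimal checkpoint interval; the paper's lower-bound analysis is more detailed, and this is only the standard back-of-the-envelope version. Writing $\delta$ for the time to write one checkpoint and $M$ for the mean time to interrupt,

\[
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M},
\qquad
\text{overhead fraction} \approx \frac{\delta}{\tau_{\mathrm{opt}}} = \sqrt{\frac{\delta}{2M}} .
\]

As systems scale up, $M$ shrinks while $\delta$ grows with the checkpoint size, so the overhead fraction rises; this is the qualitative trend behind checkpointing consuming over half of total execution time.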
The Lustre Centre of Excellence at ORNL,
Makia Minich, Sarp Oral, Galen Shipman, Shane Canon, and Al Geist, Oak Ridge National Laboratory (ORNL) and Oleg Drokin, Tom Wang, Peter Bojanic, Bryon Neitzel, and Eric Barton, Sun Microsystems, Inc.
A Lustre Centre of Excellence (LCE) was established at Oak Ridge National Laboratory (ORNL) in 2007. The ORNL LCE is a joint effort between Sun Microsystems, Inc. and ORNL to solve existing and potential file system I/O problems ahead of the deployment of the petaflop system at ORNL. The three goals of the ORNL LCE are to develop risk-mitigation approaches related to the file system for the Baker system; to assist ORNL science teams in tuning and improving their applications' file system I/O performance on NCCS supercomputers; and to develop a knowledge base of Lustre at ORNL. This presentation provides an overview of the ORNL LCE goals and efforts. One such effort, presented in this paper, is our analysis and improvement of the end-to-end file system I/O performance of the Parallel Ocean Program (POP) on the NCCS Cray XT4 system, Jaguar. Our experiments show a 13x increase in end-to-end POP file system I/O performance. Also described in this presentation are our ongoing enhancements to the Lustre-specific driver in the ROMIO implementation of MPI-IO.
The Harwest Compiling Environment: Accessing the FPGA World Through ANSI-C Programs,
Paolo Palazzari and Alessandro Marongiu, ENEA and Ylichron Srl
The Harwest Compiling Environment (HCE) is a set of compilation tools which transform an ANSI C program into equivalent, optimized VHDL ready to be compiled and run on a predefined target board (DRC blades are among the supported targets).
The HCE, which is embedded within the VisualStudio IDE, adopts the following design flow:
1. The starting C program is debugged and tested with standard C tools.
2. Once the code is debugged, it is transformed into a Control and Data Flow Graph (CDFG), which is analyzed to identify a parallel architecture that fits the parallelism of the computation.
3. The CDFG is scheduled onto the parallel architecture so as to minimize its completion time.
4. The VHDL code is generated; this code enforces the schedule (step 3) on the parallel architecture (step 2) and instantiates all the interfaces necessary to activate the code from a host node.
In summary, through the HCE flow it is possible to implement an optimized computation for FPGA blades simply by specifying its behavior in the well-known C language. Some explanatory results will be reported to illustrate the benefits that can be achieved when adopting the HCE design flow.
Application Sensitivity to Link and Injection Bandwidth on a Cray XT4 System,
Kevin Pedretti, Scott Hemmert, and Brian Barrett, Sandia National Laboratories (SNLA)
This paper describes our efforts to characterize application sensitivity to link and injection bandwidth on a Cray XT4 system. Link bandwidth is controlled by modifying the number of rails activated per network link. Injection bandwidth is controlled by modifying the speed of the HyperTransport connection between the Opteron and the SeaStar. A suite of micro-benchmarks and applications is evaluated at several different operating points. The results of this study will be useful for designing and evaluating future computer architectures.
Detecting Application Load Imbalance on Cray Systems,
Dean Johnson, Steve Kaufmann, Bill Homer, Luiz DeRose, and Heidi Poxon, Cray Inc.
Scientific applications should be well balanced to scale on current and future high end parallel systems. However, the identification of sources of load imbalance in such applications is not trivial. In this paper we present the extensions that were made to the Cray performance tools to help users identify, measure, and correct application load imbalance.
Parallel Analysis and Visualization on Cray Compute Node Linux,
David Pugmire, Oak Ridge National Laboratory (ORNL)
Capability computer systems are deployed to give researchers the computational power required to investigate and solve key challenges facing the scientific community. As the power of these computer systems increases, the computational problem domain typically increases in size, complexity, and scope. These increases strain the ability of commodity analysis and visualization clusters to effectively perform post-processing tasks and provide critical insight and understanding into the computed results. An alternative to purchasing increasingly larger, separate analysis and visualization commodity clusters is to use the computational system itself to perform post-processing tasks. In this paper, the recent successful port of VisIt, a parallel, open-source analysis and visualization tool, to Compute Node Linux on the Cray is detailed. Additionally, the unprecedented analysis and visualization capacity of this resource is discussed and a report on obtained results is presented.
Seshat Simulates Red Storm on a Cluster,
Neil Pundit and Rolf Riesen, Sandia National Laboratories (SNLA)
Seshat is a framework for a discrete event simulator that couples an application at the MPI level with a network simulator for the Cray XT3 Red Storm network. The simulator is not very detailed yet, but it is accurate enough to produce valid application timing results. Eventually we want to simulate a complete system and use Seshat plus various simulators to model new system features or even new hardware that does not exist yet.

At the moment, Seshat only models the Red Storm network. In order to test its predictive capability, we are running it on a cluster (Thunderbird at Sandia) and predicting the performance of applications running on Red Storm. Since the node hardware is quite different between the two systems, we need to adjust the compute time. Seshat runs the application inside a virtual time environment and can easily adjust the time perceived by the application. We use this capability to adjust the times between MPI calls to more closely match the execution times on Red Storm.

With this simple adjustment we hope to model the Red Storm network accurately on a different machine without having to simulate a complete Red Storm node. In this paper we report early results of our attempts to do so.
Application Performance on the UK's New HECToR Service,
Fiona Reid, Mike Ashworth, Thomas Edwards, Alan Gray, Joachim Hein, Alan Simpson, and Michele Weiland, HPCX Consortium (HPCX)
HECToR is the UK's new high-end computing resource available to researchers at UK universities. The HECToR Cray XT4 system began user service in October 2007. The service offers 5,664 dual-core 2.8 GHz AMD Opteron processors, with each dual-core socket sharing 6 GB of memory; a total of 11,328 processing cores are available to users. The results of benchmarking a number of popular application codes used by the UK academic community are presented. These include molecular dynamics, fusion, materials science, and environmental science codes. The results are compared with those obtained on the UK's HPCx service, which comprises 160 IBM e-Server p575 16-way SMP nodes. Where appropriate, benchmark results from the Blue Gene service are also included.
HECToR, the CoE and Large-Scale Application Performance on CLE,
Jason Beech-Brandt and Kevin Roy, Cray Inc.
HECToR, a Cray XT4 MPP system with 5,664 dual-core AMD Opteron processors, is the UK's new high-end computing resource for academic and research HPC users. This system provides a step change in computing power for the UK academic community and is one of the first worldwide to adopt the Cray Linux Environment (CLE) as its operating system. The role of the Cray Center of Excellence for HECToR is to work closely with researchers on scalability, optimization, and algorithm development, with the aim of helping them do breakthrough science by exploiting the massive scalability and performance available on HECToR. The CoE has been working with the capability challenge projects within the UK to improve scalability and performance and further drive scientific output. This talk focuses on that work and explores the techniques used to increase performance for these important applications at the largest scales on CLE.
Performance Comparison of Cray XT4 with Altix 4700 Systems (Bandwidth and Density), SGI ICE cluster, IBM POWER5+, IBM POWER6, and NEC SX-8 using HPCC and NPB Benchmarks,
Subhash Saini and Dale Talcott, NAS Systems Division (NAS) and Rolf Rabenseifner, Michael Schliephake, and Katharina Benkert, High Performance Computing Center Stuttgart (HLRS)
The HPC Challenge (HPCC) benchmark suite and the NAS Parallel Benchmarks (NPB) are used to compare and evaluate the combined performance of the processor, memory subsystem, and interconnect fabric of seven leading supercomputers: Cray XT4, SGI Altix 4700-Bandwidth, SGI Altix 4700-Density, SGI ICE cluster, IBM POWER5+, IBM POWER6, and NEC SX-8.
Customer Access to Cray Problem Information,
Dan Shaw, Cray Inc.
Cray is in the process of changing the way customers access information. The main intent is to allow customers greater access to the information they deem important. To do this, Cray has embarked on a project to replace multiple tools specific to customer information: customer call tracking, software and hardware defect tracking, and the customer portal. The discussion will give an overview of how these new tools work together to give customers greater and easier access to their important data and information.
Spider and SION: Supporting the I/O Demands of a Peta-scale Environment,
Galen Shipman, Shane Canon, Makia Minich, and Sarp Oral, Oak Ridge National Laboratory (ORNL)
The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) is scheduled to deploy a petascale system based on the Cray Baker architecture in 2008. This system will join a number of existing systems varying in size from just a few teraflops to our 100-teraflop (currently being upgraded to 250-teraflop) XT4. In order to effectively support the I/O demands of these systems, a 240 GByte/sec, 10 Petabyte center-wide file system code-named "Spider" will also be deployed. This system will differ from the traditional Cray XT I/O architecture of direct-attached storage: Spider will use Lustre routers to bridge between the internal high-performance messaging network and a Scalable I/O Network (SION). SION is a high-performance InfiniBand network providing I/O services among Leadership Computing Facility (LCF) computational systems and storage systems. A prototype testbed of this infrastructure is currently in limited production use at our facility. This paper will detail the overall infrastructure and highlight several key areas of integration effort. A performance evaluation of the testbed will then be discussed, including an extrapolation to the larger-scale Spider and SION systems.
An Individual Tree Simulator for Assessment of Forest Management Methods,
Artur Signell, Johan Schoering, Jan Westerholm, and Mats Aspnäs, Åbo Akademi University
Suswood is a parallel forest simulator capable of simulating massive numbers of trees, in multiple polygon-shaped compartments, at the level of individual trees. Simulating at the single-tree level allows the application of much more detailed management systems than stand-level simulation, as tree selection can be based on a tree's properties, such as height and diameter, or on its surroundings. The Suswood simulator is fully parallelized, able to take advantage of the computational power of 1000+ compute nodes for a single simulation, and designed to simulate forests containing billions of trees using the Cray XT4 at CSC.
Common Administration Infrastructure,
Jim Harrell and Bill Sparks, Cray Inc.
Cray is merging the System Administration tools from all the products into a single structure that can support the direction of Adaptive Supercomputing. One of the next phases is to bring the newer functionality developed on the X2 product line into use on the XT and Baker products. This talk will describe the new features, how they are used, and when features are scheduled for release.
7X Performance Results Final Report: ASCI Red vs. Red Storm,
Michael Davis, Thomas Gardiner, and Dennis Dinge, Cray Inc.; Joel Stevenson, Robert Ballance, Karen Haskell, and John Noe, Sandia National Laboratories (SNLA)
The goal of the 7X performance testing was to assure Sandia National Laboratories, Cray Inc., and the Department of Energy that Red Storm would achieve its performance requirements, which were defined as a comparison between ASCI Red and Red Storm. Our approach was to identify one or more problems for each application in the 7X suite, run those problems at two or three processor sizes in the capability computing range, and compare the results between ASCI Red and Red Storm. The first part of this paper describes the two computer systems, the 10 applications in the 7X suite, the 25 test problems, and the results of the performance tests on ASCI Red and Red Storm. During the course of the testing on Red Storm, we had the opportunity to run the test problems in both single-core and dual-core modes; the second part of this paper describes those results. Finally, we reflect on lessons learned in undertaking a major head-to-head benchmark comparison.
A Compiler Performance Study on the Cray XT Architecture Exploiting Catamount and Compute Node Linux (CNL) Compute Nodes,
Timothy Stitt and Jean-Guillaume Piccinali, CSCS–Swiss National Supercomputing Centre (CSCS)
Choosing the right compiler during the build process can critically influence the performance obtained from the final executable. While investment in algorithm design and library selection can account for the majority of performance gains at runtime, the correct choice of compiler technology may also significantly influence the performance that is ultimately obtained. To muddy the waters further, compiler performance (as measured by the wall-clock time of the resultant executable) may be tightly coupled to a given programming language and/or target architecture or operating system. In this study we investigate the performance of a set of compiler suites routinely used in the HPC community. We quantify the performance of a set of compiled scientific application kernels on the Cray XT architecture in two environments: with compute nodes running Catamount and with compute nodes running Compute Node Linux (CNL). In each case we also compare the performance of kernels written in the C, C++, Fortran 90/95, and legacy Fortran 77 programming languages.
Speeding Genomic Searches over 1000X Relative to a Single Opteron Using Multiple FPGAs on a Cray XD1,
Olaf O. Storaasli, Philip F. LoCascio, and Weikuan Yu, Oak Ridge National Laboratory (ORNL) and Dave Strenski, Cray Inc.
Our CUG07 paper demonstrated that the Cray Smith-Waterman FPGA design could achieve 50-100X speedup over a single Opteron. This paper extends that work to demonstrate that loosely coupled human chromosome-to-chromosome searches, run on hundreds of FPGAs, can exceed 1000X speedup. We examine how this design performs on future FPGA-enabled Cray systems.
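For reference, the dynamic-programming recurrence that such FPGA designs accelerate is standard Smith-Waterman local alignment. This deliberately naive scalar sketch uses illustrative scoring parameters, not Cray's:

```python
# Minimal scalar Smith-Waterman local-alignment score: the recurrence
# that FPGA designs pipeline along anti-diagonals. Scoring values
# (match/mismatch/gap) are illustrative choices, not Cray's parameters.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # local: never go negative
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best
```

On an FPGA, every cell along an anti-diagonal of H is independent and can be computed in the same clock cycle, which is the source of the large per-chip speedups.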
Data Virtualization Service,
David Wallace and Stephen Sugiyama, Cray Inc.
DVS, the Data Virtualization Service, is a new capability being added to the XT software environment with the 2.1 release. DVS is a configurable service that provides compute-node access to a variety of file systems across the Cray high-speed network. The flexibility of DVS makes it a useful solution for many common issues at XT sites, such as providing I/O to NFS file systems on compute nodes. A limited set of use cases will be supported in the initial release, but additional features will be added in the future.
Investigating the Performance of Parallel Eigensolvers on High-end Systems,
Andrew Sunderland, HPCX Consortium (HPCX)
Eigenvalue and eigenvector computations arise in a wide range of scientific and engineering applications and usually represent a huge computational challenge. It is therefore imperative that appropriate, highly efficient and scalable parallel eigensolver methods are used in order to facilitate the solution of the most demanding scientific problems. This presentation will analyze and compare the performance of several of the latest eigensolver algorithms, including pre-release ScaLAPACK routines, on contemporary high-end systems such as the 11,328 core Cray XT4 system HECToR. The analysis will involve symmetric matrix examples obtained from current problems of interest for a range of large-scale scientific applications.
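As a small reminder of the problem class these libraries attack at scale, the classical Jacobi rotation method for a dense symmetric matrix fits in a few lines. This toy serial sketch is purely illustrative and is not one of the ScaLAPACK algorithms under study:

```python
import math

# Toy serial Jacobi eigenvalue iteration for a symmetric matrix: an
# illustration of the problem class only, not one of the ScaLAPACK
# algorithms compared in the talk.
def jacobi_eigenvalues(A, max_rotations=100, tol=1e-12):
    n = len(A)
    A = [row[:] for row in A]          # work on a copy
    for _ in range(max_rotations):
        # the largest off-diagonal element is the next pivot
        val, p, q = max((abs(A[i][j]), i, j)
                        for i in range(n) for j in range(i + 1, n))
        if val < tol:
            break
        theta = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):             # rotate rows p and q
            Apk, Aqk = A[p][k], A[q][k]
            A[p][k] = c * Apk - s * Aqk
            A[q][k] = s * Apk + c * Aqk
        for k in range(n):             # rotate columns p and q
            Akp, Akq = A[k][p], A[k][q]
            A[k][p] = c * Akp - s * Akq
            A[k][q] = s * Akp + c * Akq
    return sorted(A[i][i] for i in range(n))
```

Production eigensolvers replace this O(n^3)-per-sweep serial scheme with blocked, distributed reductions to tridiagonal form, which is precisely where the scaling comparisons in the talk become interesting.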
Cray XT in Second Life,
Jason Tan, Western Australia Supercomputer Program (WASP)
Virtual social networking allows multiple users to interact with one another in a shared virtual world. Second Life provides a platform for delivering Cray XT series MPP education to potential and current users at The University of Western Australia. Users can virtually lift off and fly into the Cray, launch jobs, monitor jobs, and learn how to program in parallel, providing an intuitive environment for user training and visualization.
Cray Adaptive Sparse Kernels (CASK) Project,
Adrian Tate, Cray Inc.
CASK is an auto-tuned library framework that improves parallel iterative solver performance via a sophisticated Sparse BLAS implementation. CASK is available on XT systems, and sits beneath large solver packages like PETSc. At run-time, CASK ascertains certain matrix characteristics, allowing the adaptive selection of a specifically tuned sparse kernel, which gives a considerable performance improvement at the solver level. This talk describes the auto-tuning and testing framework involved in the development of CASK, shows performance improvements on XT systems, and outlines the future direction of the project.
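The adaptive idea can be caricatured in a few lines: gather cheap matrix statistics at run time, then dispatch to a specialized kernel. The kernel names and threshold below are invented for illustration; the abstract does not disclose CASK's actual heuristics:

```python
# Caricature of run-time adaptive sparse-kernel selection in the
# spirit of CASK; kernel names and the threshold are invented for
# illustration and are not CASK's actual heuristics.
def select_spmv_kernel(indptr):
    """Choose a sparse matrix-vector kernel from CSR row-pointer stats."""
    nnz_per_row = [indptr[i + 1] - indptr[i] for i in range(len(indptr) - 1)]
    avg = sum(nnz_per_row) / len(nnz_per_row)
    if max(nnz_per_row) == min(nnz_per_row):
        return "fixed-row-length"   # uniform rows: fully unrollable kernel
    if avg < 8:                     # invented threshold
        return "short-row"          # many short rows: minimize loop overhead
    return "general-csr"            # fall back to the generic kernel
```

The inspection cost is O(rows) on data the solver already holds, which is why such selection can pay for itself inside an iterative solve with thousands of matrix-vector products.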
Efficient Scaling Up of Parallel Graph Algorithms for Genome-Scale Biological Problems on Cray XT Systems,
Byung-Hoon Park, Oak Ridge National Laboratory (ORNL); Matthew Schmidt and Nagiza F. Samatova, North Carolina State University & ORNL; Kevin Thomas, Cray Inc.
Many problems in biology can be represented as problems on graphs, but the exact solutions for these problems have a computational burden that often grows exponentially with increasing graph size. Because of this exponential growth, high-performance implementations of graph-theoretic algorithms are of great interest. We are developing pGraph, a library of parallel graph algorithms of relevance to biological problems. This paper discusses the implementation decisions made during the development of a class of data-intensive enumeration algorithms on graphs. The data-intensive nature and highly irregular structure of the search space make efficient scaling up of such algorithms quite challenging. A specific scalable and efficient implementation of the maximal clique enumeration problem, ubiquitous in biology, is discussed. Performance results on real biological problems on the Cray XT are presented.
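Maximal clique enumeration is commonly organized around the Bron-Kerbosch recursion. This minimal serial sketch omits the pivoting and parallel load balancing that a scalable implementation such as the one described would need:

```python
# Minimal serial Bron-Kerbosch maximal-clique enumeration: the textbook
# recursion, without the pivoting and parallel load balancing that a
# scalable library would require.
def maximal_cliques(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    cliques = []

    def expand(R, P, X):
        # R: current clique, P: candidates, X: already-explored vertices
        if not P and not X:
            cliques.append(frozenset(R))   # R cannot be extended: maximal
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)

    expand(set(), set(adj), set())
    return cliques
```

The irregularity the abstract mentions is visible even here: subtree sizes depend entirely on the graph's structure, which is what makes static work partitioning ineffective at scale.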
Development of a Centre-wide Parallel File System at CSCS,
Dominik Ulmer and Hussein Harake, CSCS–Swiss National Supercomputing Centre (CSCS)
Massively parallel simulation applications require the handling of large amounts of data. As the data is often either re-read by the same application or used by other applications for pre- and post-processing, data analysis, visualisation, or coupled simulations, a large data-management space, to which all production systems of the HPC centre provide fast and parallel access, must be provided to the user. CSCS has investigated different file system and networking solutions for this purpose during 2007 and is in the process of putting the selected solution into production during early 2008.
Optimizing FFT for HPCC,
Mark Sears and Courtenay Vaughan, Sandia National Laboratories (SNLA)
For the High Performance Computing Challenge (HPCC) competition at SC2007, we implemented an optimized algorithm for the one-dimensional Fast Fourier Transform (FFT) test on Red Storm (a Sandia-Cray XT3/XT4 system), which resulted in Red Storm being awarded the #1 position (Red Storm also did well in the other portions of the competition). We will present highlights of our optimizations and comparisons between HPCC Benchmark Suite versions 1.0 and 1.2.
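The recurrence at the heart of a radix-2 one-dimensional FFT is compact enough to show. This textbook serial version is only a reference point for the heavily tuned, distributed implementation the benchmark entry required:

```python
import cmath

# Textbook radix-2 Cooley-Tukey FFT (power-of-two length): a serial
# reference for the recurrence that a tuned, distributed HPCC FFT
# implementation optimizes and parallelizes.
def fft(x):
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])   # split into half-size DFTs
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

In the distributed setting, the interesting work is not this butterfly but the global transposes between butterfly stages, which is where interconnect-specific tuning pays off.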
Application Performance under Different XT Operating Systems,
Courtenay Vaughan, John VanDyke, and Suzanne Kelly, Sandia National Laboratories (SNLA)
Under the sponsorship of DOE's Office of Science, Sandia has extended Catamount to support multiple CPUs per node on XT systems, while Cray has developed Compute Node Linux (CNL), which also supports multiple CPUs per node. This paper will present results from several applications run under both operating systems, including preliminary results with quad-core processors. We will present some details of the implementation of N-way virtual node mode in Catamount and how it differs from CNL to help explain the differences in the results.
Enabling Contiguous Node Scheduling on the Cray,
Chad Vizino, Pittsburgh Supercomputing Center (PITTSC)
PSC has enabled contiguous node scheduling to help applications achieve better performance. The implementation and application performance benefits will be discussed.
First Experiences in Hybrid Parallel Programming in Quad-core Cray XT4 Architecture,
Sebastian von Alfthan and Pekka Manninen, CSC–Scientific Computing Ltd. (CSC)
To date it has been difficult to obtain any major performance gains by employing hybrid shared-memory/message-passing parallel programming models as compared to pure message-passing parallelization. However, even with the extremely fast node interconnect of the Cray XT4 architecture, the quad-core CPUs may require novel parallelization strategies to avoid saturating the interconnect, and the situation will most probably be even more severe in the forthcoming XT5 architecture. The thread-enabled Cray Linux Environment should be a promising platform for employing three-level hybrid MPI-OpenMP-vectorization programming strategies, and the number of SMP-like CPU cores may now be large enough for true hardware-thread-based parallelization. In this contribution, we study different hybrid programming schemes in a quad-core XT4 environment and compare them with flat-MPI results.
Cray XT Software Annual Report,
Dave Wallace and Kevin Petersen, Cray Inc.
The Cray XT Software Annual Report will highlight the accomplishments of the last twelve months. The presentation will provide detailed status on software development projects, including XT4 Quad Core and Cray XT5, statistics on Software Problem Reports, and software releases. The presentation will also provide a brief view of the software roadmap and plans for the next twelve months.
Acceleration of Time Integration,
Richard Archibald, Kate Evans, and James B. White III, Oak Ridge National Laboratory (ORNL)
We describe our computational experiments to test strategies for accelerating time integration for long-running simulations, such as those for global climate modeling. The experiments target the Cray XT systems at the National Center for Computational Sciences at Oak Ridge National Laboratory. Our strategies include fully implicit, parallel-in-time, and curvelet methods.
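Of the strategies named above, parallel-in-time integration is perhaps the least familiar; the best-known variant is the Parareal scheme, sketched below for the scalar test problem y' = lam*y. This is only an illustration of the idea under stated assumptions, not code from the paper; in a real climate run the fine propagations over different time slices would execute concurrently on many nodes.

```python
# Minimal serial sketch of the Parareal parallel-in-time idea for
# y' = lam*y (an illustration of the strategy, not the authors' code).
# A cheap coarse propagator sweeps sequentially; an expensive fine
# propagator (parallel over slices in a real code) corrects it.
def parareal(lam, y0, T, n_slices, n_fine, n_iters):
    dt = T / n_slices

    def coarse(y, t0, t1):          # one explicit Euler step per slice
        return y * (1.0 + lam * (t1 - t0))

    def fine(y, t0, t1):            # many small explicit Euler steps
        h = (t1 - t0) / n_fine
        for _ in range(n_fine):
            y = y * (1.0 + lam * h)
        return y

    # initial sequential coarse sweep
    U = [y0]
    for n in range(n_slices):
        U.append(coarse(U[-1], n * dt, (n + 1) * dt))

    for _ in range(n_iters):
        # fine solves over all slices are independent -> parallel in time
        F = [fine(U[n], n * dt, (n + 1) * dt) for n in range(n_slices)]
        G_old = [coarse(U[n], n * dt, (n + 1) * dt) for n in range(n_slices)]
        newU = [y0]
        for n in range(n_slices):  # cheap sequential correction sweep
            g_new = coarse(newU[-1], n * dt, (n + 1) * dt)
            newU.append(g_new + F[n] - G_old[n])
        U = newU
    return U[-1]
```

After as many iterations as there are time slices, the scheme reproduces the sequential fine solution exactly; the payoff comes when it converges in far fewer iterations.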
Preparing for Petascale,
Robert Whitten Jr, Oak Ridge National Laboratory (ORNL)
The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) is executing an aggressive upgrade schedule on its Cray XT3/XT4 system. This schedule has seen a progression of technology from single-core to quad-core processors. Providing a production system to users during upgrades, while simultaneously preparing training and documentation to help users make full use of the system, will continue to be an interesting challenge, one equally applicable to the new petascale system to be delivered in 2008. This paper will focus on the tools and techniques employed to prepare and educate users for both current systems and petascale computing.

Web page design and support by Cray User Group (CUG) Conference Services