Final Program

Presentation Titles and Authors Abstracts
Keynote Address: Science and Modeling of Puget Sound–A View From Beneath the Surface, Jan A. Newton, Ph.D. (bio), Applied Physics Laboratory, School of Oceanography, University of Washington This address will offer a view of the local oceanography of Puget Sound, how a water environment “works,” and what makes Puget Sound different from many of the nation’s other estuaries. I will also explain how scientists use computer models to represent its circulation, chemical, and biological processes, highlighting the University of Washington’s initiative PRISM, the Puget Sound Regional Synthesis Model. In this talk I will use examples of how PRISM models are increasing our scientific understanding of this environment as well as helping regional decision makers. For more information on PRISM, see
I/O for 25,000 Clients-Lustre & Red Storm, Peter Braam, Cluster File Systems, Inc. Our paper will discuss the structure of the Lustre file IO subsystem, and how it is capable of handling IO at huge scales. We will illustrate with measurements what implementation choices were encountered and what results were achieved in the area of performance and scalability.
Running InfiniBand on the Cray XT3, Makia Minich, Oak Ridge National Laboratory (ORNL) In an effort to utilize the performance and cost benefits of the InfiniBand interconnect, this paper will discuss what was needed to install and load a single data rate InfiniBand host channel adapter into a service node on the Cray XT3. In addition to describing the procedure, this paper will provide some performance numbers achieved from this connection to a remote system.
Redefining the State of the Art in Performance Tools with CrayPat and Cray Apprentice2, Heidi Poxon, Luiz DeRose, Bill Homer, Dean Johnson, and Steve Kaufman, Cray Inc. In order to tune and optimize applications on high end parallel systems, users need a new generation of performance tools to help detect and understand the large number of potential application performance bottlenecks, such as load imbalance, synchronization and I/O overhead, and memory hierarchy related issues. In this paper we present the new features of the Cray performance measurement and analysis infrastructure: CrayPat and Cray Apprentice2, which provide the data and insights needed to tune and optimize applications with a simple to use interface.
Strategies for Solving Linear Systems of Equations Using Chapel, Richard Barrett and Stephen Poole, Oak Ridge National Laboratory (ORNL) The requirement of solving systems of linear equations lies at the heart of many scientific application domains, such as astrophysics, biology, chemistry, fusion energy, power system networks, and structural engineering. In this talk we discuss how Chapel, the global view programming language being developed as part of the Cray Cascade, may be used to express algorithms for solving these systems, within the context of large scale scientific application programs.
Center for Computational Sciences (CCS) User Assistance and Outreach Group, Robert Whitten Jr., Oak Ridge National Laboratory (ORNL) The User Assistance and Outreach Group at the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) is the primary interface to user support at the facility. The user assistance model used by the CCS will be discussed. Tools and techniques will be presented.
The Effects of System Options on Code Performance, Courtenay Vaughan, Sandia National Laboratories (SNLA) There are several options that can be used to run codes on a Cray XT3. In this paper, we will examine the effect that the choice of page size, eager or non-eager communication protocol, and malloc implementation has on the performance of several codes at different numbers of processors. We will also characterize the communication patterns of these codes and correlate them with the differences in performance.
Extending Catamount for Multi-Core Processors, John VanDyke, Suzanne Kelly, and Courtenay Vaughan, Sandia National Laboratories (SNLA) The XT3 Catamount Virtual Node (CVN) implementation was based on dual processor support in ASCI Red's Cougar Light Weight Kernel Operating System. That solution was limited to no more than 2 virtual nodes per physical node. This paper describes the design for extending Catamount to support more CPUs per node. It identifies the areas to be revamped and the resolution. Some preliminary performance results will be provided.
Bringing Up Red Storm-Lessons to Remember, Robert Balance and John P. Noe, Sandia National Laboratories (SNLA) Capability computer systems are designed, developed, and operated to provide the computational power to investigate and solve some of the most difficult and compelling problems facing humanity. These unique resources can be extremely helpful in scientific endeavors, but present exciting challenges for the operators and customers who utilize their power. While many of the difficulties in getting capability jobs run are inherent in the job itself, others intrude by way of operational issues. Consider Red Storm: in Jumbo mode, its largest configuration, it provides 13280 dual-core processors; 79,680 links; 3500 embedded Linux sensors; 88 RAID controllers; 50 10GigE network connections; plus cooling, power, and environmental sensors. For each component, the only definitely known state is "down." This technical presentation looks at capability computing from the aspect of operations, but draws many conclusions about the effects of system design on the effectiveness of system operation.
Performance Impact of the Red Storm Upgrade, Ron Brightwell, Sandia National Laboratories (SNLA) The Cray Red Storm system at Sandia National Laboratories recently completed an upgrade of the processor and network hardware. Single-core 2.0 GHz AMD Opteron processors were replaced with dual-core 2.4 GHz AMD Opterons, while the network interface hardware was upgraded from a sustained rate of 1.1 GB/s to 2.0 GB/s. This paper provides an analysis of the impact of this upgrade on the performance of several applications and micro-benchmarks. We show scaling results for applications out to thousands of processors and include an analysis of the impact of using dual-core processors on this system.
Performance of a Direct Numerical Simulation Solver for Turbulent Combustion on the Cray XT3/XT4, Ramanan Sankaran and Mark R. Fahey, Oak Ridge National Laboratory (ORNL); and Jacqueline H. Chen, Sandia National Laboratories (SNLA) Direct numerical simulation (DNS) is a valuable computational research tool for gaining a fundamental understanding of turbulent combustion. The insights obtained from these simulations will help in developing better predictive models for the design and optimization of combustion devices. The DNS code named 'S3D' has been successfully ported and optimized for the Cray XT3 system and has made several landmark simulations possible on the XT3 system at the National Center for Computational Sciences (NCCS). Performance and parallel scaling characteristics of the S3D code will be presented along with details of the optimization strategies.
Implementing the Operational Weather Forecast Model of MeteoSwiss on a Cray XT3, Tricia Balle and Neil Stringfellow, CSCS-Swiss National Supercomputing Centre (CSCS) MeteoSwiss uses the computer facilities of CSCS to carry out the twice daily weather forecasts for Switzerland using a specialized alpine model called aLMo based on the LM code from Deutscher Wetterdienst. This paper describes the implementation of the aLMo suite on the Cray XT3 at CSCS and outlines some of the issues faced in porting and configuring a complex operational suite on a new architecture. Performance figures for the suite are given, and an outline is presented of some of the challenges to be faced in carrying out the next generation of high resolution forecasting.
Accelerating NCBI BLAST—FPGA Supercomputing Coming of Age, Stefan Möhl and Henrik Abelsson Mitrionics AB Using field programmable gate arrays (FPGAs) as coprocessors has long been a promising solution for accelerating software algorithms. With the development of an accelerated version of NCBI's BLAST application, FPGA supercomputing takes the step from proof-of-concept programs to the acceleration of full-blown production applications. We will present the open source port of NCBI BLAST for the Mitrion Virtual Processor and discuss the further development of bioinformatics applications within the framework of the Mitrion-C Open Bio Project.
Comparison of XT3 and XT4 Scalability, Patrick Worley, Oak Ridge National Laboratory (ORNL) Kernels and application codes are used to evaluate and compare the scalability of the XT3 and XT4 architectures. The focus is on interprocessor communication and the impact of the numerous MPI protocol options and system runtime options (e.g., MPI_COLL_OPT_ON, MPICH_RANK_REORDER_METHOD) on performance. Results include advice on how to use the systems most efficiently.
Large-Scale Meta-Population Patch Models of Infectious Diseases on Cray Machines, Iain Barrass and Steve Leach, Health Protection Agency; Kevin Roy, University of Manchester (MCC) Estimating the spatial spread of an infectious disease through a mobile human population is an important aspect in guiding public health policy in the event of a large scale outbreak. In this paper we consider ensembles of large stochastic meta-population patch models for the spread of an infectious disease in the United Kingdom and the role of high performance computing in the analysis of such models. In particular, we compare the use of two Cray machines for these types of model at varying levels of model detail and conclude that high performance computing can be a significant tool in assessing the impact of a newly emerging disease and any available mitigation strategies.
Python Based Applications on Red Storm: Porting a Python Based Application to the Lightweight Kernel, John Greenfield and Daniel Sands, Sandia National Laboratories (SNLA) It is desirable to be able to run analysis and validation applications as pre- and post-processing stages on large systems like Red Storm. Many analysis and validation applications are based on Python, and porting Python-based applications to the Lightweight Kernel presents some challenges. The example of porting the Python-based Data Services Toolkit (DSTK) application to Red Storm is used to illustrate the challenges and their solutions.
The Naval Research Laboratory Cray XD1, Wendell Anderson, Jeanie Osburn, and Robert Rosenberg, Naval Research Laboratory (NRL) and Marco Lanzagorta, ITT Industries In June 2006, the Naval Research Laboratory expanded its Cray XD1 to 432 dual-core Opteron 275 CPUs and 150 FPGAs (144 Virtex-II and 6 Virtex-4), making it the largest XD1 in the world. This paper will examine our experiences with porting NRL codes to the XD1 and our efforts to bring FPGA technology into the high performance computing environment.
Illuminating the Shadow Mesh, Warren Hunt and Jon Goldman, Sandia National Laboratories (SNLA) As massively parallel supercomputers approach the peta-flop performance mark with tens of thousands of compute cores, understanding and managing the system becomes even more difficult. The challenge is in distilling the massive amounts of system meta-data to enable real time decision making by administrators. A visual analysis facilitates faster comprehension of performance, status, and anomaly detection. We employ scientific and information visualization techniques to help understand aspects of Sandia National Laboratories' Red Storm XT3 such as temporal job allocation, routing, and system physical characteristics.
Performance Evaluation of Biological Applications that Use FPGAs, Olaf Storaasli and Weikuan Yu, Oak Ridge National Laboratory (ORNL); Dave Strenski, Cray Inc.; Jim Maltby, Mitrionics This paper presents scalable, parallel, FPGA-accelerated results for the Smith-Waterman computational biology algorithm used for DNA, RNA and protein sequencing. Results are presented using the search program from the Fasta suite and using the OpenFPGA benchmark data. The results will use single and multiple Virtex-II Pro 50s and Virtex-4 LX160s on the Cray XD1 system, and single and multiple Virtex-4 LX60s and Virtex-4 LX200s on the DRC development system, as compared to the Opteron microprocessor.
Performance Impact of Accelerated Portals, Ron Brightwell, Sandia National Laboratories (SNLA) This paper provides a detailed analysis of the performance impact of the accelerated Portals implementation on the upgraded Red Storm system at Sandia. Our evaluation includes MPI micro-benchmarks and applications, as well as results for the Berkeley GASNet/UPC package and the ARMCI/Global Arrays package from Pacific Northwest National Lab.
Efficiency Evaluation of Cray XT Parallel IO Stack, Weikuan Yu, Jeffrey Vetter, H. Sarp Oral, and Richard Barrett, Oak Ridge National Laboratory (ORNL) To gain insights into the efficiency of parallel IO on LCF XT3/XT4 computing platforms, this paper presents an evaluation of software stacks including MPI-IO, HDF5 and Parallel GridFTP. We employ a set of benchmarks to derive the achievable bandwidth and its efficiency as the I/O percolates through various layers of the software stacks. We present the performance and efficiency results under various run-time scenarios including single UNIX and liblustre client, hot-spot fan-in/fan-out, as well as the typical collective read/write.
Minimizing the I/O Cycle Time in Simulation Clusters Through the Use of High Performance Storage, Dave Fellinger, DataDirect Networks A high performance storage architecture will be presented that is optimized for capture with a capability to write data with consistent performance regardless of the operational status of the attached mechanical storage devices. This storage architecture has been deployed on the fastest clusters in existence to enable the maximization of the compute versus I/O duty cycle of these machines.
Design, Implementation, and Performance of a Portals Collective Communication Library, Ron Brightwell and James Schutt, Sandia National Laboratories (SNLA) In this paper, we describe the design and implementation of a Portals collective communication library. We have integrated this library into the production MPICH2 implementation from Cray, and we provide performance and scaling comparisons of MPI collective operations using the native MPICH2 implementation with our implementation.
The Performance Effect of Multi-Core on Scientific Applications, Jonathan Carter, Yun (Helen) He, John Shalf, Hongzhang Shan, and Harvey Wasserman, National Energy Research Scientific Computing Center (NERSC) The historical trend of increasing single CPU performance has given way to a roadmap of increasing core count. The challenge of effectively utilizing these multi-core chips is just starting to be explored by vendors and application developers alike. In this study, we present some performance measurements of several complete scientific applications on single and dual core Cray XT3 and XT4 systems with a view to characterizing the effects of switching to multi-core chips. We consider effects within a node by using applications run at low concurrencies, and also effects on node-interconnect interaction using higher concurrency results. Finally, we construct a simple performance model based on the principal on-chip shared resource (memory bandwidth) and use this to predict the performance of the forthcoming quad-core system.
Using IOR to Analyze the I/O Performance of XT3, Hongzhang Shan and John Shalf, National Energy Research Scientific Computing Center (NERSC) I/O performance has become a major concern on petascale platforms. In this study, we analyze the I/O performance of the XT3 using IOR, an ASCI I/O benchmark, and compare it with other HPC systems.
Prioritization and Scheduling on a Cluster of XT3 Systems, Neil Stringfellow, CSCS-Swiss National Supercomputing Centre (CSCS) CSCS has implemented a scheduling policy on its cluster of XT3 machines which allows multiple project types and runtime restrictions to be incorporated while maintaining utilization of the primary system at around 90% of the available CPU Hours. This paper discusses the prioritization and scheduling policies across the systems in an environment which mixes academic research with operational weather forecasting.
Cray XT3 I/O Performance on Real Applications, Neil Stringfellow, CSCS-Swiss National Supercomputing Centre (CSCS) The increasing computational power provided by modern supercomputers allows the investigation of new scientific problems, but this power needs to be matched by efficient methods of accessing the file system if data input and output are not to become a major bottleneck. This paper investigates I/O performance and optimization techniques for a selection of real applications.
Summary of Results from Running on the ORNL Hood System, John Levesque, Cray Inc. During the bring-up of the new Hood system at ORNL, we were able to make numerous large scale runs on the 11,000 core system. The applications that were run varied from high energy physics to regional weather forecasting. This paper will summarize the results, indicating what features of the Hood architecture facilitated the impressive sustained performance that was achieved in these runs.
XT Series System Hardware Monitoring and Management with the XTGUI, Jim Robanske, Cray Inc. A new GUI to simplify monitoring and management of XT series hardware has been released with XT v1.5 software. The xtgui is a Java client/server based application, with the server side running on the SMW server, and the client side running on network attached workstations.
Real Time Health Monitoring of the Cray XT Series Using the Simple Event Correlator (SEC), Jeff Becklehimer and Cathy Willis, Cray Inc. and Josh Lothian, Don Maxwell, and David Vasil, Oak Ridge National Laboratory (ORNL) The log files produced by the Cray XT contain all of the known events which have occurred on the system. Examples of logged events include a memory fault, a node panic, a link failure, or the successful launch and execution of a user's application. When the system is experiencing problems, the standard procedure is to perform a system dump and then examine the dump in the hope of finding the cause. We have constructed a framework using the Simple Event Correlator (SEC) to monitor the log files in real time and to report events which are known to cause problems. The result is that many problems are now known before the users ever report them. This framework can be easily customized by the site to search for events of local interest.
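The idea of watching a log stream for known problem events can be illustrated with a minimal Python sketch. This is not SEC and not the authors' framework: the rule names, regular expressions, and sample log lines below are all hypothetical and do not reflect the actual Cray XT log format.

```python
import re

# Hypothetical rules: (event name, pattern). Real sites would define their own.
RULES = [
    ("memory_fault", re.compile(r"memory fault", re.IGNORECASE)),
    ("node_panic", re.compile(r"panic", re.IGNORECASE)),
    ("link_failure", re.compile(r"link fail", re.IGNORECASE)),
]


def scan(lines):
    """Yield (event name, log line) for each line matching a known rule."""
    for line in lines:
        for name, pattern in RULES:
            if pattern.search(line):
                yield name, line.rstrip("\n")
                break  # report each line at most once


# Made-up sample lines standing in for a live log stream.
sample = [
    "node c0-0c1s2n3: memory fault detected",
    "application 1234 launched successfully",
    "link failure on c1-0c0s4",
]
alerts = list(scan(sample))
for name, line in alerts:
    print(f"ALERT {name}: {line}")
```

A site could extend the hypothetical `RULES` list with patterns for events of local interest, which mirrors the customization the abstract mentions.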
Cray XT Series New Frontiers: Software Status and Plans, Luiz DeRose and Dave Wallace, Cray Inc. This presentation will discuss the current status of software and development plans for the CRAY XT3 and XT4 systems, including the Cray XT4 programming environment status and future direction (compilers, programming models, scientific libraries, and tools). A review of major milestones and accomplishments over the past year will be presented.
Cray BlackWidow System Update, Bob Murphy and Jeff Becklehimer, Cray Inc. This presentation will provide an update on the BlackWidow project, including hardware and software features, preliminary performance results, and product availability.
Design Support for Reconfigurable Accelerators on Future Cray Systems, Clay Marr, DRC Computer Cray and DRC announced a relationship several months ago where DRC would be providing FPGA acceleration technology for some of Cray’s future systems. DRC is now providing a development platform to allow Cray’s customers to write or port applications to run on these accelerators and test them on this relatively inexpensive workstation.
Dubbed the DS/XT, the development system is designed to emulate the acceleration node in a Cray system as closely as possible.
• It provides the same architecture in hardware and software, the same CPU/RPU communication and node-to-node communication environment as the eventual system implementation.
• It also provides DRC’s RPSysCore programming interface, which abstracts many of the most difficult issues in programming an FPGA.
• Optional limited licenses for 3rd party software tools are available for introductory prices, providing users an inexpensive way to try them.
This system will enable Cray’s end customers to begin development early and continue development and full debug of accelerated algorithms or applications without consuming cycles of the main system once installed.
This presentation will discuss some of these details as well as describe the balance of the tools and options which provide a complete stand-alone development environment.
High-performance Hybrid NAS/SAN Architecture from BlueArc: Tiered Storage Solutions for a Wide Range of I/O Requirements, James Reaney, BlueArc Corporation This presentation will highlight the hardware-based architectural features of the BlueArc Titan storage platform, with comparisons made to traditional software-based NAS filer architectures and standalone SAN solutions. A comparison of the performance of the Titan solution will be given for small-file random I/O versus large-file streaming I/O, with example configurations given to show how a single tiered storage solution can deliver performance tailored for both requirements simultaneously.
Mazama System Administration Infrastructure, Bill Sparks, Cray Inc. Mazama is Cray's project to improve system administration for large scale MPP systems. It was developed under contract for the BlackWidow project and is being extended to Cray's other MPP systems.
Performance, Reliability, and Operational Issues for High Performance NAS Storage on Cray Platforms, John Kaitschuck, Cray Inc., James Reaney, BlueArc Inc., and Chris Hertel and Mathew O'Keefe, Alvarri, Inc. Cray is now deploying fast NAS appliances (Network Attached Storage) with its supercomputers. NAS provides a shared storage pool that plays an important role in staging data into the Cray supercomputer for initial processing, and staging data out of the supercomputer after a calculation completes. The shared NAS pool will also typically contain shared home directories, and can be used as a centralized local file store for all machines in the data center.

Since supercomputer workloads require the processing of large numbers of both large and small files and large amounts of data, the speed and reliability of these NAS systems are paramount. In this paper, we summarize our performance and failure analysis results for testing performed on the BlueArc Titan NFS server cluster located in Cray's Chippewa Falls facility. The tests vary in the number of client nodes, number of threads per node, number of directories and their structure, number and size of files, block sizes, and sequential versus random access patterns. In general, the tests progressively add more configuration features (varying node counts, varying block sizes, larger files, more files, more directories, different OS and NFS clients, etc.) to find regressions against our four operational requirements for storage systems (reliability, uniformity, performance and scalability).

In this paper, we: (1) characterize the file transfer performance and fault recovery behavior of the BlueArc under light and heavy loads, with varying file sizes and system access patterns; (2) measure the BlueArc Titan NAS server for performance, scalability, reliability, and uniformity in operational scenarios; and (3) provide configuration guidelines and best practices to achieve the highest performance, most efficient utilization and effective data management using the BlueArc platform capabilities.

We also compare BlueArc performance and reliability to commodity NAS servers using Opteron processors and SATA storage.
Bringsel: A Tool for Measuring Storage System Reliability, Uniformity, Performance and Scalability, John Kaitschuck, Cray Inc. and Mathew O'Keefe, Alvarri, Inc. Bringsel is a benchmarking and storage measurement software tool developed by John Kaitschuck, a senior field analyst at Cray. Bringsel is a primary I/O testing program that enables the use of either POSIX or MPI-IO calls to perform benchmarking and evaluation to measure the reliability, uniformity, performance and scalability of file systems and storage technologies. It enables the creation of a large number of directories and files using both a threading model (POSIX) and the MPI library for multiple nodes to coordinate testing activity.

Bringsel has run on a variety of large-scale computing platforms, including Cray XTs, SGI Origin systems, Sun enterprise-scale SMP systems and Linux clusters. Bringsel's present feature set includes:

• A flexible benchmarking setup and measurement environment enabled via a configuration file parser and command line interface

• An environment variable interface for parameter passing

• An ability to perform file operations in parallel to determine performance in this critical (but often overlooked) area

• The ability to checksum (via the Haval algorithm) to verify the integrity of a given file and the associated writing or reading operations within the test space

Other IO performance measurement tools include Bonnie++, ior, iozone and xdd. Bringsel differs from these other tools in its emphasis on reliability, uniformity, and scalability. For all files written, a checksum is computed and checked when a read access is performed (reliability). Each test can be run several times to verify that bandwidth and latency are uniform across different nodes and at different run times for the same node (uniformity). Of course, bandwidth and operations per second of a single node and across many parallel nodes can be measured (scalability). Bringsel's focus is to determine if a particular storage system, either serial or parallel, can meet operational requirements for stability, robustness, predictability and reliability, in addition to good performance. Cray is using Bringsel to test commercial storage systems to ensure they can meet Cray customer requirements, with an emphasis on reliability.
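The checksum-on-write, verify-on-read idea described above can be sketched in a few lines of Python. This is only an illustration, not Bringsel's implementation: it substitutes SHA-256 from the standard library for the Haval algorithm, and the function names and file name are hypothetical.

```python
import hashlib
import os
import tempfile


def write_with_checksum(path, data):
    """Write data to path and return the checksum recorded at write time."""
    with open(path, "wb") as f:
        f.write(data)
    # SHA-256 is a stand-in here; Bringsel itself is described as using Haval.
    return hashlib.sha256(data).hexdigest()


def verify(path, expected):
    """Re-read the file in chunks and compare against the recorded checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "testfile.bin")  # hypothetical test file
    digest = write_with_checksum(path, os.urandom(4096))
    ok_clean = verify(path, digest)

    # Simulate silent corruption by inverting the first byte on disk.
    with open(path, "r+b") as f:
        first = f.read(1)
        f.seek(0)
        f.write(bytes([first[0] ^ 0xFF]))
    ok_corrupt = verify(path, digest)

print("clean read verified:", ok_clean)
print("corrupted read verified:", ok_corrupt)
```

Repeating such a write-and-verify cycle across many files, nodes, and run times is one plausible way to exercise the reliability and uniformity criteria the abstract lists.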
Scalable Collection of Large MPI Traces on Red Storm, Rolf Riesen, Sandia National Laboratories (SNLA) Gathering large MPI traces and statistics is important for performance analysis and troubleshooting of applications. Traces, with detailed information about every message an application has sent, are crucial for characterizing the message passing behavior of an application. On massively parallel systems like Red Storm, the amount of data collected affects the performance and behavior of the application, so conventional trace collection is often not feasible. We present a new tool to enable the scalable collection of large amounts of data on Red Storm class systems.
How to Choose a Compiler that Is Best Suited for Your Application, Geir Johansen, Cray Inc. The Cray XT3 and Cray XT4 support compilers from the Portland Group, PathScale, and the GNU Compiler Collection. The goal of the paper is to give Cray XT4 users the information they need to choose the compiler that is best suited for their application. Discussion will include a description of language and optimization features for each compiler, and performance results for various types of codes.
Performance and Functional Improvements in MPT Software for the Cray XT4 System, Mark Pagel, Cray Inc. Since the introduction of the Cray XT3, many performance improvements have been made in the MPT and portals software stack especially for dual-core support. Many of these are enabled by default but some require users to enable them. This paper will discuss these optimizations and show the measured performance improvements. In addition, key functional improvements will also be discussed. Planned MPT optimizations and functional improvements will be presented as well, including those being planned for Compute Node Linux.
Chapel, Brad Chamberlain, Cray Inc. As part of its research for DARPA's HPCS program, Cray has been developing a new programming language called Chapel - the Cascade High-Productivity Language. Chapel aims to improve parallel productivity by supporting high-level abstractions for expressing parallel computation and tuning its performance. Chapel also supports features designed to narrow the gap between mainstream programming languages like Java, Matlab, and Perl and those currently used by the HPC community. In this talk we will demonstrate Chapel's use in a few standard benchmarks as an introduction to the language and its features.
Compute Node Linux: Overview, Roadmap & Progress to Date, Dave Wallace, Cray Inc. This presentation will provide an overview of Compute Node Linux (CNL) for the CRAY XT machine series. Compute Node Linux is the next generation of compute node operating systems for the scalar line of Cray systems. This presentation will discuss the current status of Compute Node Linux development including results of scaling and performance testing. At the time of CUG, Cray will have shipped limited access versions of CNL to customers. Early customer experiences will be discussed, as well as a vision of the long-term objectives for CNL.
Compute Node Linux: New Frontiers in Compute Node Operating System, Dave Wallace, Cray Inc. This presentation will discuss some of the challenges in adapting Linux as the compute node OS on the Cray XT series. The discussion will present a brief view of the approach taken to scale Linux in a system that is targeted to support over 10K sockets.
Porting of VisIt Parallel Visualization Tool to the Cray XT3 System, Kevin Thomas, Cray Inc. VisIt, the popular visualization tool developed by the U.S. Department of Energy, was recently ported to run on the Cray XT3 system. While the porting of many types of applications to the Cray XT3 is straightforward, VisIt, with its extensible and distributed architecture, presents a unique challenge. The Catamount OS lacks support for critical features used by VisIt, but the visualization of large datasets requires parallel computation. Through a variety of strategies, the port was completed without requiring source code changes to the application.
Optimized Virtual Channel Assignment in the Cray XT3 System, Dennis Abts, Bob Alverson, Jim Nowicki, and Deborah Weisser, Cray Inc. The Cray XT3 is an MPP system that scales up to 32K nodes using a bidirectional 3-dimensional torus interconnection network. Four virtual channels are used to provide point-to-point flow control and deadlock avoidance. Using virtual channels avoids unnecessary head-of-line (HoL) blocking for different network traffic flows; however, the extent to which virtual channels improve network utilization depends on the distribution of packets among the virtual channels. This paper investigates the virtual channel balance (the relative traffic carried on each virtual channel) and its importance for network utilization. We discuss the routing algorithm and use of virtual channel datelines to avoid deadlocks around the torus links, and heuristics to balance the packet load across the virtual channels. We present network performance results from an 11-12-16 3D torus network.
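The dateline idea mentioned in this abstract can be illustrated with a small Python sketch of the textbook scheme on a single ring (one dimension of a torus). This is not the Cray XT3 routing algorithm: it assumes only two virtual channels, routes in the increasing direction only, and the function name and parameters are hypothetical.

```python
def route_ring(src, dst, k, dateline=0):
    """Route from src to dst in the + direction on a k-node ring.

    Returns a list of (from_node, to_node, vc) hops. A packet starts on
    virtual channel 0 and moves to virtual channel 1 on the link that
    enters the dateline node, staying there for the rest of the route.
    Because no packet uses VC 0 on the dateline link, the cyclic channel
    dependency around the ring is broken.
    """
    hops = []
    vc = 0
    node = src
    while node != dst:
        nxt = (node + 1) % k
        if nxt == dateline:
            vc = 1  # crossing the dateline: switch virtual channel
        hops.append((node, nxt, vc))
        node = nxt
    return hops


wrap_route = route_ring(src=2, dst=1, k=4)    # wraps around the ring
direct_route = route_ring(src=0, dst=2, k=4)  # never crosses the dateline
print(wrap_route)
print(direct_route)
```

In this simplified scheme only wrapping routes ever use virtual channel 1, which hints at why the balance of traffic across virtual channels, the subject of the paper, becomes a concern.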
An Efficient Technique for Parallel I/O on Cray XT Systems, Jeff Larkin, Cray Inc.; Mark Fahey, Chen Jin, and Scott Klasky, Oak Ridge National Laboratory (ORNL) Over the past year the Cray XT series has proven itself as a highly scalable computing architecture. With the increasing number of sockets per system and migration to multi-core processors, the already difficult I/O problem is only getting worse. Application developers must begin to give I/O more attention than they have in the past. I will show a simple technique that distributes program I/O to maximize bandwidth and minimize the effect of I/O on program execution.
Tools and Techniques for Application Performance Tuning on the Cray XT4 System, Luiz DeRose and John Levesque, Cray Inc. In this tutorial we will present tools and techniques for application performance tuning on the Cray XT4. We will briefly discuss the system and architecture, focusing on aspects that are important for understanding the performance behavior of applications; we will discuss compiler optimization flags and will present the Cray performance measurement and analysis tools; we will conclude the tutorial with optimization techniques, including discussions on MPI, numerical libraries and I/O.
ALPS-The Swiss Grand Challenge Programme on the Cray XT3, Dominik Ulmer and Neil Stringfellow, CSCS-Swiss National Supercomputing Centre (CSCS) With the installation of the first Cray XT3 in Europe in 2005, the Swiss National Supercomputing Centre CSCS made a strategic change towards large-scale massively-parallel computing. Consequently, CSCS started a new resource allocation scheme called "Advanced Large projects in Supercomputing (ALPS)" in 2006. Four projects targeting scientific breakthroughs by means of large-scale computing were accepted, covering molecular dynamics, climate modeling, and physics of the earth's magnetic field.
A Comparison of Application Performance Using Open MPI and Cray MPI, Richard Graham, Oak Ridge National Laboratory (ORNL) Open MPI has recently been ported to the Cray XT platform. The performance of several applications from the National Center for Computational Sciences at Oak Ridge National Laboratory will be presented, using both Open MPI and Cray MPI. The impact of the messaging library on application performance will be discussed.
Scaling Computational Biology Applications on the Petascale, Pratul Agarwal, Oak Ridge National Laboratory (ORNL) We present performance and scaling results for two highly scalable computational biology applications, LAMMPS and NAMD, on the dual-core XT4 system. We demonstrate unprecedented time-to-solution results achieved on the ORNL early-evaluation XT4 system. Moreover, our detailed analysis has identified the scaling limitations and the phases of calculation that can potentially benefit from SMP optimization targeted at multi-core processors. Strategies to overcome these limitations and allow scaling on petascale machines will be discussed.
Fast and Asynchronous Data Transfer on the Cray XT3, Ciprian Docan and Manish Parashar, Rutgers University; and Scott Klasky, Oak Ridge National Laboratory (ORNL) High-performance computing is playing an important role in science and engineering and is enabling highly accurate simulations of complex phenomena. However, as computing systems grow in scale and computational capability, achieving the desired computational efficiency becomes increasingly challenging. One of the key challenges is getting large amounts of data off the compute nodes at runtime and over to service nodes or another cluster for real-time analysis. This IO should not impose additional synchronization requirements, should have minimal impact on computational performance, should maintain overall Quality of Service, and should ensure that no data is lost. Technologies such as RDMA and the Portals library allow fast memory access into the address space of an application without interrupting the computational process, and provide a mechanism that can support these IO requirements.

The objective of the DART (Decoupled Asynchronous Remote Transfers) project described in this paper is to build a thin communication layer on top of the Portals library that allows fast, low-overhead access to data at the compute elements and supports high-throughput, low-latency asynchronous IO. This layer is part of a high-throughput data streaming substrate that uses metadata-rich outputs to support in-transit data processing and data redistribution for coupled simulations. The paper will describe the design, implementation, and performance evaluation of the communication layer, and demonstrate how it can be used to offload expensive IO operations to dedicated service nodes, allowing more efficient utilization of compute elements. Finally, we will examine the IO performance using this layer as part of XGC, a Particle In Cell code that is part of the Fusion SciDAC.
Using ALPS on CNL Systems, Michael Karo, Cray Inc. The Application Level Placement Scheduler (ALPS) is Cray's next generation application launch mechanism. It works cooperatively with the batch system to schedule, allocate, and launch applications on Cray systems comprised of heterogeneous compute resources. This tutorial will focus on ALPS features, usage, configuration, and performance. We will discuss our experiences with ALPS at scale, and provide examples that demonstrate the flexibility of node selection in the scheduling process.
Scaling Into Tomorrow, Nicholas P. Cardo, National Energy Research Scientific Computing Center (NERSC) NERSC's recent addition of a 102-cabinet Cray XT4 system greatly enhances the computational capability of the center. An overview of the system's balanced configuration will be presented along with early performance measurements and system metrics. In addition, the challenges of facility preparation will be discussed. This paper covers the full process, from preparation through production.
HPCC Results and Analysis from ORNL's XT3/XT4 System, Jeffery Kuehn, Oak Ridge National Laboratory (ORNL); and Jeff Larkin and Nathan Wichmann, Cray Inc. The ORNL XT3 system has had numerous upgrades since it was built in 2005, taking it from a 56-cabinet, single-core XT3 system to a 124-cabinet, dual-core XT3/XT4 combined system. Along the way the HPCC benchmark suite has been used to evaluate system performance. We will show and analyze the results from these benchmark runs and discuss the implications for application writers.
Supernova Simulation with CHIMERA, Bronson Messer, Ralph Hix, and Anthony Mezzacappa, Oak Ridge National Laboratory (ORNL); and Stephen Bruenn, Florida Atlantic University; and Nathan Wichmann, Cray Inc. CHIMERA is a multi-dimensional radiation hydrodynamics code to study core-collapse supernovae developed by the PRISM collaboration. The code is made up of three essentially independent parts: hydrodynamics, nuclear burning, and neutrino transport solvers combined within an operator-split approach. I will discuss many of the design characteristics of the code and the path forward to Baker.
Application Requirement Analysis for the NLCF, Bronson Messer, Ricky Kendall, and Doug Kothe, Oak Ridge National Laboratory (ORNL) We have attempted to organize, analyze, and interpret several sets of empirical data regarding NLCF applications with an eye towards using actual science requirements to guide future system design. Much of this analysis has centered on the application of many different methodologies to the same sets of data. Perhaps surprisingly, many of these attempts have provided a somewhat consistent picture of what future NLCF platforms must be capable of to satisfy most application requirements.
An Overview of NCCS XT3/4 Acceptance Testing, Arnold Tharrington, Oak Ridge National Laboratory (ORNL) This paper explains the design and implementation of an acceptance test harness and a system-level regression test framework used at ORNL's National Center for Computational Sciences. The test harness is used to automatically manage a large number of acceptance applications for current and future system expansion, including expansion to petascale. The regression test framework has been developed to provide an easily extendable, automated framework to perform numerous acceptance tests on the file system, network system, and domain-based applications with each upgrade of the system software.
Debugging Heap Memory Problems on Cray Supercomputers with TotalView, Chris Gottbrath and Ariel Burton, Etnus, LLC; Robert Moench and Luiz DeRose, Cray Inc. Memory bugs resulting from mismanagement of heap memory can be extremely frustrating to debug even in the best of conditions. They can be truly daunting when they occur in the kind of long-running, resource-intensive, large-scale simulations that scientists find themselves running on machines like the Cray XT3. This paper will outline how Cray XT3 users can solve these demanding problems using the TotalView debugger.
Enabling Moab’s Adaptive Computing for Cray XT3/XT4, Michael Jackson, Cluster Resources MOAB Moab offers an extensive suite of facilities that allow sites to intelligently and automatically adapt resources, workload, and policies to a changing environment, ranging from QoS/SLA enforcement to automated failure recovery to machine-learning-based optimization. We will discuss how Moab can help you apply next-generation technologies today.
Moab & TORQUE on Cray Architectures, Michael Jackson, Cluster Resources MOAB Moab and TORQUE provide a powerful resource and workload management solution, combining a highly optimized and policy-rich scheduler with a widely adopted, industry-standard open source resource manager. We will review this solution and how it allows users and admins to best utilize Cray resources.
Introducing the Cray XMT, Petr Konecny, Cray Inc. Cray recently introduced the Cray XMT system, which builds on the parallel architecture of the Cray XT3 system and the multithreaded architecture of the Cray MTA system with a Cray-designed multithreaded processor. This talk will highlight not only the details of the architecture, but also the application areas for which it is best suited.
HPC Computing Data Center Energy and Environmental Requirements, Steve Johnson, Cray Inc. Many global and domestic environmental initiatives (e.g., Energy Star) are being introduced to address growing concerns over the availability and use of materials, chemicals, and energy in both government and commercial sectors. One rapidly growing concern is the rate of energy consumption for power and cooling within computing data centers. This is also important to Cray Inc., as we are incorporating these initiatives into new designs and know the importance of making our products as efficient as possible while delivering world-class computing power and reliability. Cray Inc. values its users' requirements and recommendations. Therefore, we wish to provide a forum at CUG where our customers can voice their views, concerns, and requirements to help Cray Inc. plan for the future needs of state-of-the-art computer data centers.
Disk-Based Technology for Multi-Petabyte Archives, Matthew O'Keefe, Alvarri Inc. and Aloke Guha, COPAN Systems Tape-based hierarchical storage management has been used for decades in large supercomputing centers to store large (defined as approximately 1 to 10 Petabytes in 2007), persistent archives. Cray supercomputers both create data for and ingest data from these large data archives. In the past, the datasets in these archives were so large that placing them on rotating magnetic disk storage was out of the question due to the cost, space, and power requirements of disk. These factors heavily outweighed the lack of random access and slow speeds of tape.

Recently, disk drive density increases (2x every 18-24 months) combined with price decreases have put disks in the same capacity and cost-per-gigabyte range as tape. Tape is more space-efficient and requires much less power than traditional disk-array-based storage. However, new disk array technology, including Massive Arrays of Idle Disks (MAID), addresses differences with tape in space efficiency and power by keeping most disks powered-down while using aggressive error detection and prevention mechanisms to reduce drive failures, increase data reliability, and increase drive longevity.

In this paper, we describe current tape-based HSM and other data migration technologies, and the market requirements that drive their usage. An alternative strategy using disk-based MAID storage for deep archiving is proposed to speed access and improve data management in these systems.
PERCU Results in a Reawakened Relationship for NERSC and Cray, William Kramer, National Energy Research Scientific Computing Center (NERSC) This talk will cover NERSC's experience with the largest XT4 system in the world. It will include the evaluation criteria and experience, the expected use of the system, and the plans that Cray and NERSC have for improving the system.
New Advances in the Gyrokinetic Toroidal Code and Their Impact on Performance on the Cray XT Series, Nathan Wichmann, Cray Inc.; and Mark Adams and Stephane Ethier, Princeton Plasma Physics Laboratory With the recent agreement to build the multi-billion dollar international burning plasma experiment known as ITER, fusion simulations will be growing dramatically in both complexity and size. The Gyrokinetic Toroidal Code (GTC) is a 3D Particle-In-Cell code used for studying the impact of fine-scale plasma turbulence on energy and particle confinement in the core of tokamak fusion reactors. To tackle global ITER-size simulations with full kinetic ion and electron physics, GTC will require new algorithms, and supercomputers will have to grow in capability. We will review recent code modifications made to prepare for these new, exciting simulations and examine the performance of GTC on both the Cray XT3 and Cray XT4 systems using the latest Cray tools. Finally, we will look to the future and discuss plans for GTC and how they will affect its performance on future computers.
Experiences with the Use of CrayPat in Performance Analysis, Mahesh Rajan, Sandia National Laboratories (SNLA) Performance analysis, tuning, and modeling of applications running on thousands of processors on the Sandia Red Storm/XT3 is facilitated by the use of the CrayPat tool. This investigation will discuss the successful use of the tool with a variety of applications and also discuss some of the challenges encountered in its use. Performance data is compared against other tools and against measurements on other HPC systems to fully understand serial bottlenecks and parallel scaling limitations.
Diagnostic Capabilities of the Red Storm Compliance Test Suite, Mike Davis and Chris Weber, Cray Inc. The Red Storm Compliance Test Suite, originally developed to demonstrate the system's compliance with the requirements of the Red Storm Statement of Work (SOW), has also proven to be useful for diagnosing system hardware and software problems and for exposing regressions in performance following system hardware and software upgrades. This paper will describe the tests that comprise the suite and their specific diagnostic capabilities.
It's About Time: Multi-Resolution Timers for Scalable Performance Debugging, James B. White III, Oak Ridge National Laboratory (ORNL) Traditional performance profiling of highly parallel applications does not always give enough information to diagnose performance bugs, particularly those caused by load imbalances and performance variability, yet the data files for such profiling can grow linearly with process count. In response to these limitations, I have developed application timers designed to limit data and reporting volumes at high process counts without dispersing the signals of load imbalance and performance variability. I will describe the use of these timers to diagnose actual performance bugs in applications running on a Cray XT4.
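The abstract above does not say how its timers bound the reported data, so the following Python sketch is only a generic illustration and not the author's design. It assumes per-process elapsed times for one timer are already gathered into a list, and it uses a hypothetical function name, `summarize_timings`. The summary has a fixed size regardless of process count while still exposing load imbalance through the max-to-mean ratio and the slowest rank.

```python
from statistics import mean


def summarize_timings(per_rank_seconds):
    """Reduce one timer's per-rank elapsed times to a fixed-size summary.

    per_rank_seconds: list of elapsed times, indexed by process rank.
    The field names below are illustrative assumptions.
    """
    # Rank holding the largest elapsed time (first one on ties).
    slowest = max(range(len(per_rank_seconds)),
                  key=per_rank_seconds.__getitem__)
    avg = mean(per_rank_seconds)
    return {
        "min": min(per_rank_seconds),
        "mean": avg,
        "max": per_rank_seconds[slowest],
        "slowest_rank": slowest,
        # A ratio well above 1.0 suggests load imbalance in this region.
        "imbalance": per_rank_seconds[slowest] / avg if avg > 0 else 0.0,
    }


if __name__ == "__main__":
    print(summarize_timings([1.0, 1.0, 1.0, 3.0]))
```

In a real MPI code the same quantities would typically be obtained with reductions rather than by gathering every rank's time to one process.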
XT7? Integrating and Operating a Conjoined XT3+XT4 System, R. Shane Canon, Don Maxwell, Josh Lothian, Kenneth Matney, Makia Minich, and H. Sarp Oral, Oak Ridge National Laboratory (ORNL); and Jeffrey Becklehimer and Cathy Willis, Cray Inc. The Center for Computational Sciences at Oak Ridge National Laboratory runs a single Cray XT system of directly connected XT3 and XT4 cabinets. We describe the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including novel application of Lustre routing capabilities. We also describe the ongoing operation and use of the system, including batch configuration and scheduling of the heterogeneous computing resources.
Shared Object-Based Storage and the HPC Data Center, Jim Glidewell, The Boeing Company (BOEING) Providing a high performance storage solution is an essential requirement for most HPC data centers. This talk will focus on our experiences with the Panasas object-based storage system in a shared Cray X1 and Linux cluster environment. We will present an overview of our strategy for employing Panasas storage and the results we have obtained.
Application Performance Profiling on the Cray XD1 Using HPCToolkit, Jan Odegard, John Mellor-Crummey, Nathan Tallent, and Mike Fagan, Rice University (RICEU) HPCToolkit is an open-source suite of multi-platform tools for profile-based performance analysis of applications. The toolkit consists of components for collecting performance measurements of fully optimized executables without adding instrumentation, analysis of application binaries to understand the structure of optimized code, correlation of measurements with program structure, and a user interface that supports top-down exploration of performance measurements in a hierarchical presentation. This talk will provide an overview of HPCToolkit and describe our experiences using it to assist in application performance analysis and tuning on a Cray XD1.
Tuning for the Latest Generation of X86-64 Processors, Douglas Doerfler and David Hensinger, Sandia National Laboratories (SNLA); Douglas Miles and Brent Leback, The Portland Group At CUG 2006, a cache-oblivious implementation of a two-dimensional Lagrangian hydrodynamics model of a single ideal gas material was presented. This paper looks at further optimizations to the code: packed, consecutive-element storage of vectors, some restructuring of loops containing neighborhood operations, and the addition of type qualifiers to some C++ pointer declarations to improve performance. In addition, compiler studies led to further improvements to the PGI C++ compiler in the areas of loop-carried redundancy elimination in C++, resolution of pointer aliasing conflicts, and vectorization of loops containing min and max reductions. With these optimizations, speedups of 1.25 to 1.85 were realized on the latest generation of X86-64 processors.

Web page design and support by Cray User Group (CUG) Conference Services