CUG 2006 Abstracts


Monday

Session Number, Time, and Track

Paper Title and Author(s)

Abstract

1A 8:00 Tutorials

Portable Performance with Vectorization, Mark Fahey and James B. White III (ORNL)

Getting high performance is more than just finding the right compiler options. Modifying a code to tune for a specific system, such as a vector system, can slow the code down on other systems and make it more difficult to maintain. This tutorial uses examples from real applications to show how to program for portable performance while preserving maintainability.

1B 8:00 Tutorials

Portals Programming on the XT3, Ron Brightwell and Rolf Riesen (SNLA); Arthur B. Maccabe, University of New Mexico

This tutorial will describe the Portals network programming interface as it is implemented on the Cray XT3. We will describe the fundamental building blocks of Portals and how they can be combined to support upper-layer network transport protocols and semantics. We will also cover performance and usability issues that are specific to the XT3 system.

1C and 2C 8:00 and 10:20 Tutorials

Lustre Tutorial for Cray XD1 and Cray XT3 (continued at 10:20), Rick Slick, Cray Inc. and Jeff Denworth, CFS

In this tutorial, we will give an overview of Lustre, including architecture and common vocabulary, and describe differences between Cray XD1 and Cray XT3 Lustre implementations. We will discuss Lustre configuration guidelines and samples for both Cray XD1 and Cray XT3. We will also discuss Lustre administration, including initialization, monitoring, starting, and stopping. We will present the common user-level commands with special attention to the lfs command and striping. The current state and future of Lustre will be presented by CFS.
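
As a rough illustration of the striping idea mentioned in this abstract, the sketch below maps a byte offset in a file to a stripe object under simple round-robin (RAID-0 style) striping. The function name `stripe_location` and the default stripe size and count are assumptions chosen for illustration; they are not taken from the tutorial or from any site configuration.

```python
def stripe_location(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file byte offset to (object index, offset inside that object).

    Assumes plain round-robin striping: consecutive stripe_size chunks of
    the file go to objects 0, 1, ..., stripe_count - 1, then wrap around.
    """
    stripe_number = offset // stripe_size          # which chunk of the file
    obj_index = stripe_number % stripe_count       # which object holds it
    # Each full pass over all objects adds one chunk to every object.
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset


if __name__ == "__main__":
    for off in (0, 1 << 20, 4 << 20, (5 << 20) + 7):
        print(off, stripe_location(off))
```

With a larger stripe count, consecutive chunks of a large file land on more objects, which is why striping choices affect parallel I/O bandwidth.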

2A 10:20 Tutorials

Performance Measurement, Tuning, and Optimization on the Cray XT3, Luiz DeRose and John Levesque, Cray Inc.

In this tutorial we will discuss techniques and tools for application performance tuning on the Cray XT3. We will briefly review the system and architecture, focusing on aspects that are important to understand in order to realize performance; we will discuss compiler optimization flags and numerical libraries; we will present the Cray performance measurement and analysis tools; and will conclude with optimization techniques.

2B 10:20 Tutorials

PBS Tutorial, Michael Karo, Cray Inc.

We will have a PBS tutorial this year based on responses from last year's CUG. Joining us this year will be representatives from Altair to assist in the tutorial (at the suggestion of the CUG group last year). We will discuss PBS in terms of the X1/X1E, the XT3, and the XD1 systems. We would expect a fair amount of interaction with the audience for this tutorial.

3A 2:20 Opening–General Session

Leadership Computing at the NCCS, Arthur Bland, Ann Baker, R. Shane Canon, Ricky Kendall, Jeffrey Nichols, and Julia White (ORNL)

We will describe the Leadership Computing Facility (LCF) at the National Center for Computational Sciences (NCCS), both the current facility and our aggressive plans for upgrade. Project allocations were announced in January, and we will introduce some of the groundbreaking science enabled by the LCF. We will also describe the unique service model of the NCCS, designed for capability computing and breakthrough science.

4A 2:55 Development of Large Scale Systems/Applications

Implementing the Operational Weather Forecast Model of MeteoSwiss on a Cray XT3, Angelo Mangili, Emanuele Zala, Jean-Marie Bettems, Mauro Ballabio, and Neil Stringfellow (CSCS)

MeteoSwiss uses the computer facilities of CSCS to carry out the twice daily weather forecasts for Switzerland using a specialized alpine model called aLMo based on the LM code from Deutscher Wetterdienst. This paper describes the implementation of the aLMo suite on the Cray XT3 at CSCS and outlines some of the issues faced in porting and configuring a complex operational suite on a new architecture. Performance figures for the suite are given, and an outline is presented of some of the challenges to be faced in carrying out the next generation of high resolution forecasting.

4B 2:55 Chapel/Black Widow

Chapel: Cray Cascade's High Productivity Language, Mary Beth Hribar, Steven Deitz, and Bradford L. Chamberlain, Cray Inc.

In 2002, Cray joined DARPA's High Productivity Computing Systems (HPCS) program, with the goal of improving user productivity on High-End Computing systems for the year 2010. As part of Cray's research efforts in this program, we have been developing a new parallel language named Chapel, designed to support a global view of parallel programming with the ability to tune for locality; to support general parallel programming, including codes with data parallelism, task parallelism, and nested parallelism; and to help narrow the gulf between mainstream and parallel languages. We will introduce the motivations and foundations for Chapel, describe several core language concepts, and show some sample computations written in Chapel.

4C 2:55 System Operations–Schedulers & Monitoring Tools

Application Level Placement Scheduler (ALPS), Michael Karo, Richard Lagerstrom, Marlys Kohnke, and Carl Albing, Cray Inc.

The Application Level Placement Scheduler (ALPS) is being introduced with the vector follow-on product to the X1/X1E. It is planned for use with future Cray products. ALPS is the replacement product to psched. It will manage scheduling and placement for distributed memory applications and support predictable application performance. It will integrate with system configuration and workload management components with an emphasis on extensibility, scalability, and maintainability.

4C 3:30 System Operations–Schedulers & Monitoring Tools

Moab Workload Manager on Cray XT3, Michael Jackson, Don Maxwell, and Scott Jackson, Cluster Resources Inc.

Cluster Resources Inc. has ported its Moab Workload Manager to work with Cray XT3 systems (including PBS Pro, runtime, and other resource manager environments), and the National Center for Computational Sciences (NCCS) is now using Moab to schedule production work. We will discuss the Moab port, how NCCS is using it, and how it helps provide added control over XT3 systems in terms of advanced policies, fine-grained scheduling, diagnostics, visualization of resources, and improved resource utilization. We will also review plans and possibilities for Moab technologies on other Cray products.

4A 3:40 Development of Large Scale Systems/Applications

Analysis of an Application on Red Storm, Courtenay Vaughan and Sue Goudy (SNLA)

CTH is a widely used shock hydrodynamics code developed at Sandia. We will investigate scaling on Red Storm to 10,000 processors and will use those results to compare with an execution time model of the code.

4B 3:40 Chapel/Black Widow

The Cray Programming Environment for Black Widow, Luiz DeRose, Mary Beth Hribar, Terry Greyzck, Brian Johnson, Bill Long, and Mark Pagel, Cray Inc.

In this paper we will present the Cray programming environment for Black Widow, the next generation of Cray vector systems. The paper will cover programming models, scientific libraries, and tools. In addition, the paper will describe the new features from the Fortran standard that will be available on the Black Widow compiler.

4C 4:00 System Operations–Schedulers & Monitoring Tools

CrayViz: A Tool for Visualizing Job Status and Routing in 3D on the Cray XT3, John Biddiscombe and Neil Stringfellow (CSCS)

When initially installed, the Cray XT3 at CSCS showed node failures which appeared to occur in a pattern. The tools available for the analysis of the failures were limited to 2D printouts of node status and were difficult to interpret given the 3-dimensional connectivity between the processors. A visualization tool capable of displaying the status of compute nodes and the routes between them was developed to assist in the diagnosis of the failures. It has the potential to assist in the analysis of the working system and the development of improved allocation and job scheduling algorithms.

5A 4:30 X1/E-Interactive Session(s)

Cray X1/E Special Interest Group Meeting (SIG), Mark Fahey (ORNL), Chair
Systems, Jim Glidewell (BOEING) and Brad Blasing (NCS-MINN), Focus Chairs
Users, Mark Fahey (ORNL) and Rolf Rabenseifner (HLRS), Focus Chairs

The purpose of this Interactive Session is to provide a forum for open discussions on the Cray X1/E system and its environment. The “Systems” area focuses on issues related to Operations, Systems & Integration, and Operating Systems. The “Users” area covers issues related to Applications, Programming Environments, and User Services. Cray Inc. liaisons will be on hand to provide assistance and address questions. This session is open to everyone who has an interest in the Cray X1/E and wishes to participate.

5C 4:30 BoF

Cray Service Today and Tomorrow, Charlie Clark and Don Grooms, Cray Inc.

Cray will present its current support model and reporting tools used for XT3, X1, and other legacy products. The discussion will then explore possibilities in which Cray could provide service differently in the future, and discuss the potential changes required to both its products and to its service offerings to accommodate these changes. The session will start with a short presentation then move to Q&A.

Tuesday

7A 10:40 Physics

Quantum Mechanical Simulation of Nanocomposite Magnets on Cray XT3, Yang Wang (PITTSCC); G.M. Stocks, D.M.C. Nicholson, A. Rusanu, and M. Eisenbach (ORNL)

In this presentation, we demonstrate our capability of performing the quantum mechanical simulation of nanocomposite magnets using a Cray XT3 supercomputing system and the Locally Self-consistent Multiple Scattering (LSMS) method, a linear scaling ab initio method capable of treating tens of thousands of atoms. The simulation is intended to study the physical properties of magnetic nanocomposites made of FePt and Fe nanoparticles. We will discuss our results on the electronic and magnetic structure and the magnetic exchange coupling between nanoparticles embedded in metallic matrices.

7B 10:40 Mass Storage & Grid Computing

Storage Architectures for Cray-based Supercomputing Environments, Matthew O'Keefe, Cray Inc.

HPC data centers process, produce, store and archive large amounts of data. In this paper, existing storage architectures and technologies, including file systems, HSM, and large tape archives will be reviewed in light of current experience. The biggest storage problem facing Cray users may not be attaining peak IO and storage processing speeds, but instead may be sharing that data out to the rest of the data center from the Cray supercomputer. We will discuss technical problems in making this work efficiently and suggest possible solutions.

7C 10:40 System Operations

XT3 Operational Enhancements, Chad Vizino, Nathan Stone, and J. Ray Scott (PITTSCC)

The Pittsburgh Supercomputing Center has developed a set of operational enhancements to the supplied XT3 environment. These enhancements facilitate the operation of the machine by allowing it to be run efficiently and by providing timely and relevant information upon failure. Custom scheduling, job-specific console log gathering, event handling, and graphical monitoring of the machine will be reviewed and discussed in depth.

7A 11:20 Physics

Exploring Novel Quantum Phases on Cray X1E, Evgeni Burovski, Nikolay Prokofev, and Boris Svistunov, University of Massachusetts; Matthias Troyer, ETH Zürich

At temperatures close to absolute zero (-273 °C), some metals become superconducting, i.e., lose their electrical resistance [Nobel Prizes 1913, 1972]. As a related phenomenon, some liquids become superfluid at sufficiently low temperatures [Nobel Prizes 1978, 1996, 2001]. These phenomena are well understood in the two limiting cases, namely the Bardeen-Cooper-Schrieffer (BCS) limit and the Bose-Einstein condensation (BEC) limit. Using the Cray X1E Phoenix at Oak Ridge National Laboratory, we performed Quantum Monte Carlo simulations to study the universal features of the crossover between these two extremes and, for the first time, obtained an accurate and quantitative description. The QMC code is highly parallelizable and ideally suited to vector machines, with the most computationally expensive procedure being the vector outer product. The vector processors of the X1E were indispensable for the project, which would otherwise have been impossible to complete.
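
To make the kernel named in this abstract concrete, here is a minimal sketch of a vector outer product in plain Python. The function name `outer_product` is an assumption for illustration, and nothing here is taken from the authors' QMC code. The point is only that each output row is an independent scaling of the same vector, which is the kind of long, regular inner loop that maps well onto vector hardware.

```python
def outer_product(u, v):
    """Return the matrix M with M[i][j] = u[i] * v[j]."""
    # The inner comprehension over v is a simple scale-by-constant loop
    # with no dependencies between iterations.
    return [[ui * vj for vj in v] for ui in u]


if __name__ == "__main__":
    print(outer_product([1, 2], [3, 4, 5]))
```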

7B 11:20 Mass Storage & Grid Computing

Current Progress of the GRID Project at KMA, Hee-Sik Kim, Tae-Hun Kim, Cray Korea; Ha-Young Oh and Dongil Lee (KMA)

KMA is one of the pioneering groups implementing grid projects in meteorology in Korea. Cray has supported the establishment of a grid infrastructure on the largest Cray X1E. We will introduce the effort and describe current progress.

7C 11:20 System Operations

A Preliminary Report on Red Storm RAS Performance, Robert A. Ballance and Jon Stearley (SNLA)

The Cray XT3 provides a solid infrastructure for implementing and measuring reliability, availability, and serviceability (RAS). A formal model interpreting RAS information in the context of Red Storm was presented at CUG 2005. This paper presents an implementation of that model–including measurements, their implementation, and lessons learned.

7B 11:45 Mass Storage & Grid Computing

PDIO: Interface, Functionality and Performance Enhancements, Nathan Stone, D. Balog, B. Gill, B. Johanson, J. Marsteller, P. Nowoczynski, R. Reddy, J.R. Scott, J. Sommerfield, K. Vargo, and C. Vizino (PITTSCC)

PSC's Advanced Systems Group has created Portals Direct I/O ("PDIO"), a special-purpose middleware infrastructure for sending data from compute processor memory on Cray XT3 compute nodes to remote agents anywhere on the WAN in real-time. The prototype provided a means for aggregation of outgoing data through multiple load-balanced PDIO daemons, end-to-end parallel data streams through XT3 SIO nodes, and a bandwidth feedback mechanism for stability and robustness. However, it was limited by the special-purpose nature of its initial design: adopters had to modify their application source code and each write was bound to an independent remote file. This version was used by one research group, demonstrated live at several conferences and shown to deliver bandwidths of up to 800 Mbits/sec.

PSC is advancing a beta release with a number of interface, functionality and performance enhancements. First, in order to make PDIO a general-purpose solution, developers are re-implementing the API to make it transparent. Users will invoke it through the standard C API, obviating the need for any changes to application source code. Second, developers already have prototypes for more scalable and robust portals IPC and support for broader file semantics. This includes a more general dynamic allocation mechanism to allow the PDIO daemons to be a truly shared resource for all users on the XT3 system. Finally, together with an updated WAN communication protocol, the goal of these enhancements is to continue to deliver performance limited only by the WAN connectivity and file systems of the remote host.

7C 11:45 System Operations

Extending the Connectivity of XT3 System Nodes, Davide Tacchella (CSCS)

XT3 system service nodes provide limited connectivity to the outside world. The boot node can be accessed only from the control workstation and the sdb is connected only to the systar network. CSCS has extended the connectivity of both nodes on its XT3. These extensions allow us to back up the file system of the boot node at run time, and to query the system database without being logged on to the machine. The talk will present the chosen solutions for extending the node connectivity and the experience gained with them.

8A 1:30 Performance/Evaluations

Performance Evaluations of User Applications on NCCS's Cray XT3 and X1E, Arnold Tharrington (ORNL)

The National Center for Computational Sciences has granted select projects the use of its Cray XT3 and X1E for fiscal year 2006. We present highlights of the projects and their performance evaluations.

8B 1:30 BlackWidow Overview

BlackWidow Hardware System Overview, Brick Stephenson, Cray Inc.

The BlackWidow system is a second-generation scalable vector processor closely coupled to an AMD Opteron based Cray XT3 system. Major hardware improvements over the Cray X1 system will be discussed including: single thread scalar performance enhancement features, high performance fat tree interconnect, high speed I/O bridge to the Cray XT3 system, and the high density packaging used on the BlackWidow system. Prototype hardware is starting to arrive in Chippewa Falls; a complete status of the system checkout progress will be given.

8C 1:30 FPGAs

FPGA-accelerated Finite-Difference Time-Domain Simulation on the Cray XD1 Using Impulse C, Peter Messmer, David Smithe, and Paul Schoessow, Tech-X Corporation; Ralph Bodenner, Impulse Accelerated Technologies

The Finite-Difference Time-Domain (FDTD) algorithm is a well-established tool for modeling transient electromagnetic phenomena. Using a spatially staggered grid, the algorithm alternately advances the electric and magnetic fields, according to Faraday's and Ampere's laws. These simulations often contain millions of grid cells and run for thousands of timesteps, requiring highly efficient grid update algorithms and high-performance hardware. Here we report on the implementation of the FDTD algorithm on the application accelerator of a Cray XD1. The FPGA is programmed using the Impulse C tool suite. These tools translate C code into VHDL and therefore enable the domain scientist to develop FPGA-enhanced applications. Different optimization strategies, ranging from algorithmic changes to exploitation of the high degree of parallelism on the FPGA, will be presented.
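
As a rough illustration of the alternating field update described in this abstract, here is a minimal one-dimensional sketch in plain Python. The grid size, the normalized units, the Courant number of 0.5, and the function name `fdtd_1d` are all assumptions made for illustration; the paper itself targets the XD1 FPGA through Impulse C, which this sketch does not attempt to reproduce.

```python
def fdtd_1d(n_cells=200, n_steps=50, courant=0.5):
    """Advance a 1D staggered-grid FDTD simulation in normalized units.

    ez[i] lives on integer grid points; hy[i] lives halfway between
    ez[i] and ez[i + 1]. Both ends are held at ez = 0.
    """
    ez = [0.0] * n_cells
    hy = [0.0] * (n_cells - 1)
    ez[n_cells // 2] = 1.0  # initial pulse in the middle of the grid

    for _ in range(n_steps):
        # Faraday's law: advance the magnetic field from the spatial
        # difference of the electric field.
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Ampere's law: advance the electric field from the spatial
        # difference of the magnetic field (boundary points stay fixed).
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
    return ez, hy


if __name__ == "__main__":
    ez, hy = fdtd_1d()
    print(max(ez), min(ez))
```

Each update loop touches only nearest neighbours and has no dependencies between iterations, which is what makes the scheme a candidate for the fine-grained parallelism the abstract mentions.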

8A 2:00 Performance/Evaluations

Evaluation of the Cray XT3 at ORNL: A Status Report, Richard Barrett, Jeffrey Vetter, Sadaf Alam, Tom Dunigan, Mark Fahey, Bronson Messer, Richard Mills, Philip Roth, and Patrick Worley (ORNL)

Over the last year, ORNL has completed installation of its 25 TF Cray XT3, named Jaguar, and has started running leadership-scale applications on this new platform. In this paper, we describe the current state of Jaguar and provide an overview of the capabilities of this system on numerous strategic DOE application areas that include astrophysics, fusion, climate, materials, groundwater flow/transport, and combustion.

8B 2:00 BlackWidow Overview

BlackWidow Software Overview, Don Mason, Cray Inc.

This paper gives an overview of the BlackWidow software. The OS for BlackWidow is Linux-based, unlike that of previous vector architectures. OS and I/O software, including scheduling and administration, is common to both BlackWidow and the scalar product line. The compiler and library software is an evolution of software developed for the Cray X1.

8C 2:00 FPGAs

Performance of Cray Systems—Kernels, Applications & Experiences, Mike Ashworth, Miles Deegan, Martyn Guest, Christine Kitchen, Igor Kozin, and Richard Wain (DARESBURYLAB)

We present an overview of our experience with the XD1 at CCLRC Daresbury Laboratory. Data for a number of synthetic and real applications will be discussed along with data gathered on other x86-64 based systems, e.g., Pathscale's Infinipath. In addition we will cover RAS and general usability issues we have encountered. Time permitting we will go on to discuss our initial efforts to exploit the Virtex IV FPGA as a co-processor on the XD1, along with details of benchmarking exercises carried out on XT3 and X1E systems at collaborating sites.

8A 2:30 Performance/Evaluations

Optimizing Job Placement on the Cray XT3, Deborah Weisser, Nick Nystrom, Shawn Brown, Jeff Gardner, Dave O'Neal, John Urbanic, Junwoo Lim, R. Reddy, Rich Raymond, Yang Wang, and Joel Welling (PITTSCC)

Application performance on the XT3 has increased as our detailed understanding of factors affecting performance has improved. Through copious acquisition and careful examination of data obtained through CrayPat, PAPI, and Apprentice2, we have identified (and subsequently eliminated) performance impediments in key applications. This paper presents widely applicable strategies we have employed to improve performance of these applications.

8B 2:30 BlackWidow Overview

Use of Common Technologies Between XT3 and BlackWidow, Jim Harrell and Jonathan Sparks, Cray Inc.

This talk will present how Cray uses common technologies between XT3 and BlackWidow. It will also show how proven past technologies and newly emerging technologies help in the administration and configuration of large-scale systems.

9A 3:20 Architecture–Catamount

Overview of the XT3 Systems Experience at CSCS, Richard Alexander (CSCS)

In July 2005 CSCS took delivery of an 1100+ node XT3 from Cray. This system completed acceptance successfully on 22 December 2005. This paper describes most aspects of the work needed to bring this machine from loading dock to production status. We will also briefly mention the uncompleted goals, including batch scheduling, scalability of systems administration tools, and deployment of the Lustre file system.

9B 3:20 Tuning with Tools

Performance Tuning and Optimization with CrayPat and Cray Apprentice2, Luiz DeRose, Bill Homer, Dean Johnson, and Steve Kaufmann, Cray Inc.

In this paper we will present the Cray performance measurement and analysis infrastructure, which consists of the CrayPat Performance Collector and the Cray Apprentice2 Performance Analyzer. These performance measurement and analysis tools are available on all Cray platforms and provide an intuitive and easy to use interface for performance tuning of scientific applications. The paper will describe, with examples of use, the main features that are available for understanding the performance behavior of applications, as well as to help identify sources of performance bottlenecks.

9C 3:20 Simulation & Research Opportunities

Supercomputer System Design Through Simulation, Rolf Riesen (SNLA)

Changes to an existing supercomputer such as the Cray XT3, or planning and designing the next generation, are very complex tasks. Even small changes in network topology, network interface hardware, or system software can have a large impact on application performance and scalability. We are working on a simulator that will be able to answer questions about system performance and behavior before actually changing or improving hardware or software components. This paper gives an overview of our project, its approach, and some early results.

9A 4:00 Architecture–Catamount

Catamount Software Architecture with Dual Core Extensions, Ron Brightwell, Suzanne M. Kelly, and John P. VanDyke (SNLA)

Catamount is the lightweight kernel operating system running on the compute nodes of Cray XT3 systems. It is designed to be a low overhead operating system for a parallel computing environment. Functionality is limited to the minimum set needed to run a scientific computation. The design choices and implementations will be presented. This paper is a reprise of the CUG 2005 paper, but includes a discussion of how dual core support was added to the software in the fall/winter of 2005.

9B 4:00 Tuning with Tools

Performance and Memory Evaluation Using TAU, Sameer Shende, Allen D. Malony and Alan Morris, University of Oregon; Pete Beckman, Argonne National Laboratory

The TAU performance system is an integrated performance instrumentation, measurement, and analysis toolkit offering support for profiling and tracing modes of measurement. This paper introduces memory introspection capabilities of TAU featured on the Cray XT3 Catamount compute node kernel. One technique that TAU employs examines the memory headroom, or the amount of heap memory available, at routine entry, and correlates it to the program's callstack as an atomic event.

10A 4:30 XT3-Interactive Session(s)

Cray XT3 Special Interest Group Meeting (SIG), Robert A. Ballance (SNLA), Chair
Systems, Robert A. Ballance & Keith Underwood (SNLA), Focus Chairs
Users, Thomas Oppe (CSC-VBURG) & James Kasdorf (PITTSCC), Focus Chairs

The purpose of this Interactive Session is to provide a forum for open discussions on the Cray XT3 system and its environment. The “Systems” area covers issues related to Operations, Systems & Integration, and Operating Systems. The “Users” area deals with issues related to Applications, Programming Environments, and User Services. Cray Inc. liaisons will be on hand to provide assistance and address questions. This session is open to everyone who has an interest in the Cray XT3 and wishes to participate.

10C 4:30 BoF

OpenFPGA BoF—The Role of Reconfigurable Technology in HPC, Eric Stahlberg (OSC)

Advances in reconfigurable computing technology, such as in the Cray XD1, have reached a performance level where they can rival and exceed the performance of general purpose processors in the right situations. As present general purpose processors undergo transformation to increase throughput without increasing clock speeds, the opportunity for broader use of reconfigurable technology arises. OpenFPGA has emerged as an international effort aiming to ease the software transition required for applications to take advantage of reconfigurable technology. This BoF session will provide an opportunity to share viewpoints and opinions regarding the future incorporation of reconfigurable technology in computing, challenges faced, and approaches that can be taken within the context of a community-wide effort to make progress.

Wednesday

11A 8:20 Performance

CAM Performance on the XT3 and X1E, Patrick Worley (ORNL)

The Community Atmosphere Model (CAM) is currently transitioning from a spectral Eulerian solver for the atmospheric dynamics to a finite volume-based solver that has the conservation properties required for atmospheric chemistry. We analyze the performance of CAM using the new dynamics on the Cray XT3 and Cray X1E systems at Oak Ridge National Laboratory, describing the performance sensitivities of the two systems to the numerous tuning options available in CAM. We then compare performance on the Cray systems with that on a number of other systems, including the Earth Simulator, an IBM POWER4 cluster, and an IBM POWER5 cluster.

11B 8:20 UPC

Using Unified Parallel C to Enable New Types of CFD Applications on the Cray X1E, Andrew Johnson (NCS-MINN)

We are using partitioned global address-space (PGAS) programming models, supported at the hardware level on the Cray X1E, that allow for the development of more complex parallel algorithms and methods and are enabling us to create new types of computational fluid dynamics codes that have advanced capabilities. These capabilities include the ability to solve complex fluid-structure interaction applications and those involving moving mechanical components or changing domain shapes. This is a result of coupling automatic mesh generation tools and techniques with parallel flow solver technology. Several complex CFD applications are currently being simulated and studied using this method on the Cray X1E and will be presented in this paper.

11C 8:20 Performance–XD1/FPGAs

Early Experiences with the Naval Research Laboratory XD1, Wendell Anderson, Marco Lanzagorta, Robert Rosenberg, and Jeanie Osburn (NRL)

The Naval Research Laboratory has recently obtained a two-cabinet XD1 that has 288 dual-core Opteron 275 CPUs and 144 Virtex-II FPGAs. This paper will examine our experiences with running scientific applications that take advantage of these two new technologies.

11A 9:00 Performance

HPCC Update and Analysis, Jeff Kuehn (ORNL); Nathan Wichmann, Cray Inc.

The last year has seen significant updates in the programming environment and operating systems on the Cray X1E and Cray XT3 as well as the much anticipated release of version 1.0 of the HPCC Benchmark. This paper will provide an update and analysis of the HPCC Benchmark results for the Cray XT3 and X1E as well as a comparison against last year's results.

11B 9:00 UPC

Evaluation of UPC on the Cray X1E, Richard Barrett (ORNL); Tarek El-Ghazawi and Yiyi Yao, George Washington University; Jeffrey Vetter (ORNL)

Last year we presented our experiences with UPC on the Cray X1. This year we update our results, in consideration of the upgrade to X1E of the machine at Oak Ridge National Laboratory as well as modifications to the UPC implementation of the NAS Parallel Benchmarks.

11C 9:00 Performance–XD1/FPGAs

Molecular Dynamics Acceleration with Reconfigurable Hardware on Cray XD1: A System Level Approach, Brice Tsakam-Sotch, XLBiosim

Molecular dynamics simulations are most realistic with explicit solvent molecules. More than 90% of the computing time is consumed in the evaluation of force fields for simulation runs that last months on standard computers. Cray XD1 systems allow designing improved solutions for these problems by means of reconfigurable hardware acceleration and high performance buses. We present the performance analysis of the molecular dynamics software application GROMACS and the resulting optimizations at the system level, including hardware acceleration of force field evaluation and optimization of communications.

11A 9:25 Performance

Comparing Optimizations of GTC for the Cray X1E and XT3, James B. White III (ORNL); Nathan Wichmann, Cray Inc.; Stephane Ethier, Princeton Plasma Physics Laboratory

We describe a variety of optimizations of GTC, a 3D gyrokinetic particle-in-cell code, for the Cray X1E. Using performance data from multiple points in the optimization process, we compare the relative importance of different optimizations on the Cray X1E and their effects on XT3 performance. We also describe the performance difference between small pages and large pages on the XT3, a difference that is particularly significant for GTC.

11B 9:25 UPC

Performance of WRF Using UPC, Hee-Sik Kim and Jong-Gwan Do, Cray Korea Inc.

The RSL_LITE communication layer of WRF was modified so that it can use UPC and/or SHMEM. Its performance will be compared with that of the MPI version.

11C 9:25 Performance–XD1/FPGAs

High Level Synthesis of Scientific Algorithms for the Cray XD1 System, Paolo Palazzari, Giorgio Brusco and Paolo Verrecchia, Ylichron Srl; Alessandro Marongiu and Vittorio Rosato, Ylichron Srl and ENEA; Paolo Migliori, ENEA; Philip LoCascio (ORNL)

The rapid increase in integrated circuit densities, together with the slowing of the increase in commodity CPU clock rates, has motivated the investigation of hybrid architectures that use FPGA technologies to provide specialized accelerated computational capability, such as the Cray XD1 system. In order to take full advantage of the FPGA-enabled nodes of the Cray XD1 system while avoiding the long development times typical of custom digital circuit design, High Level Synthesis (HLS) methodologies appear to be a promising research avenue. HLS allows one to obtain a synthesizable VHDL description in a nearly automated way, starting from a specific algorithm given in some high-level language (such as C or Fortran). In this presentation we describe the HLS methodology developed by Ylichron S.r.l. [a spin-off company of the Italian Agency for the New Technologies, the Energy and the Environment (ENEA)] in the framework of the HARWEST project and used by ENEA researchers to implement core algorithms utilizing the FPGA module of the XD1 system. A case study based on the ENEA-ORNL collaboration, in the context of the implementation of some illustrative biological and molecular modeling kernels, will be presented to describe the whole design flow, from the initial algorithm up to a working parallel prototype that employs both Opteron CPUs and FPGA resources.

11A 9:50 Performance

Performance Comparison of Cray X1E, XT3, and NEC SX8, Hongzhang Shan and Erich Strohmaier, Lawrence Berkeley National Laboratory

In this paper, we focus on the performance comparison of Cray X1E, XT3, and NEC SX8 using synthetic benchmarks and SciDAC applications. Further, we will analyze the relationship between performance differences and hardware features.

11B 9:50 UPC

Parallel Position-Specific Iterated Smith-Waterman Algorithm Implementation, Maciej Cytowski, Witold Rudnicki, Rafal Maszkowski, Lukasz Bolikowski, Maciej Dobrzynski, and Maria Fronczak (WARSAWU)

We have implemented a parallel version of Position-Specific Iterated Smith-Waterman algorithm using UPC. The core Smith-Waterman implementation is vectorized for Cray X1E, but it is possible to use an FPGA core instead. The quality of results and performance are discussed.
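For readers unfamiliar with the underlying recurrence, the following is a minimal serial sketch of plain Smith-Waterman local-alignment scoring. It is not the authors' UPC or vectorized code: the function name, the fixed match/mismatch scores, and the linear gap penalty are illustrative assumptions, and the position-specific variant would replace the fixed scores with a per-position scoring profile.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score of strings a and b.

    Hypothetical scoring parameters; linear gap penalty for simplicity.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell is never negative: a local alignment may start anywhere.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbors, so cells along an anti-diagonal are independent of one another, which is the usual starting point for vector or FPGA implementations.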

11C 9:50 Performance–XD1/FPGAs

Compiling Software Code to FPGA-based Application Accelerator Processors in the XD1, Doug Johnson, Chris Sullivan, and Stephen Chappell, Celoxica

We present a programming environment that enables developers to quickly map and implement compute-intensive software functions as hardware accelerated components in an FPGA. As part of the XD1 application acceleration subsystem, these reconfigurable FPGA devices provide parallel processing performance that can deliver superlinear speed up for targeted applications. We will demonstrate how supercomputer developers can quickly harness this potential using their familiar design methodologies, languages and techniques.

11B Performance–XD1/FPGAs
Paper only; no presentation

Atomic Memory Operations, Phillip Merkey (MTU)

The Atomic Memory Operations provided by the X1 are the foundation for lock-free data structures and synchronization techniques that help users make the most of the UPC programming model. This paper focuses on applications that either could not be written without AMOs, or were significantly improved by the use of AMOs on the X1. We will cover basic data structures like shared queues and heaps and show how their use improves scheduling and load balancing. We will give examples of graph algorithms that can be intuitively parallelized only if AMOs are available.

12A 10:30 User Services

Case Study: Providing Customer Support in the High Performance Computing Environment, Barbara Jennings (SNLA)

This study presents the experience gained through the design and implementation of a tool to deliver HPC customer support. The tool is used by developers, consultants, and customers, and the results provide insights into developing useful online tools that help this caliber of user complete their work. This presentation will showcase the actual product, demonstrating the model that was discussed at CUG 2005 for delivering collaboration, learning, information, and knowledge.

12B 10:30 and 10:55 Optimizing Performance (Compilers & Tools)

Optimizing Application Performance on Cray Systems with PGI Compilers and Tools, Douglas Miles, Brent Leback, and David Norton, The Portland Group

PGI Fortran, C and C++ compilers and tools are available on most Cray XT3 and Cray XD1 systems. Optimizing performance of the AMD Opteron processors in these systems often depends on maximizing SSE vectorization, ensuring alignment of vectors, and minimizing the number of cycles the processors are stalled waiting on data from main memory. The PGI compilers support a number of directives and options that allow the programmer to control and guide optimizations including vectorization, parallelization, function inlining, memory prefetching, interprocedural optimization, and others. In this paper we provide detailed examples of the use of several of these features as a means for extracting maximum single-node performance from these Cray systems using PGI compilers and tools.

12C 10:30 and 10:55 Performance—MTA

Characterizing Applications on the MTA2 Multithreading Architecture, Richard Barrett, Jeffrey Vetter, Sadaf Alam, Collin McCurdy, and Philip Roth (ORNL)

ORNL is currently evaluating various strategic applications on the MTA2 architecture in order to better understand massively parallel threading as an architectural choice beyond scalar and vector architectures. In this paper, we describe our initial experiences with several applications including molecular dynamics, sparse matrix vector kernels, finite difference methods, satisfiability, and discrete event simulation.

12A 10:55 User Services

Beyond Books: New Directions for Cray Customer Documentation, Nola Van Vugt and Lynda Feng, Cray Inc.

Last year you told us what you liked and what you didn't like about our customer documentation. We listened, we responded. Come see a demonstration of the new online help system that's part of the XT3 SMW GUI and see what else we are doing to make our documentation more accessible and up-to-date.

13 11:40 General Session

CUG and the Cray CEO: AKA, "100 on 1," Pete Ungaro, CEO and President, Cray Inc.

Cray Inc. will provide the Cray User Group (CUG) with an opportunity to go one on one with its CEO and President, Pete Ungaro. This will give attendees a chance to reflect on what they have heard at the conference and have a direct line to the CEO without any concern of upsetting others at Cray. The expectation is that CUG members will share their feelings and concerns with Cray Inc. openly, without having to temper their feedback. Pete will offer his thoughts and comments, and hopes that this session will provide Cray Inc. with valuable insight into the Cray User Group.

14A 1:45 and 2:10 XT3 Performance Analysis

Enabling Computational Science on BigBen, Nick Nystrom, Shawn Brown, Jeff Gardner, Roberto Gomez, Junwoo Lim, David O'Neal, R. Reddy, John Urbanic, Yang Wang, Deborah Weisser, and Joel Welling (PITTSCC)

BigBen, PSC's 2090-processor Cray XT3, entered production on October 1, 2005 and now hosts a wide range of scientific inquiry. The XT3's high interconnect bandwidth and Catamount operating system support highly scalable applications, with the Linux-based SIO nodes providing valuable flexibility. In this talk and its associated paper, we survey applications, performance, utilization, and TeraGrid interoperability, and we present status updates on current PSC XT3 applications initiatives.

14B 1:45 Co-Array Fortran

Scaling to New Heights with Co-Array Fortran, Jef Dawson (NCS-MINN)

Supercomputing users continually seek to solve more demanding problems by increasing the scalability of their applications. These efforts are often severely hindered by programming complexity and inter-processor communication costs. Co-Array Fortran (CAF) offers innate advantages over previous parallel programming models, both in programmer productivity and performance. This presentation will discuss the fundamental advantages of CAF, the importance of underlying hardware support, and several code examples showing how CAF is being effectively used to improve scalability.

14C 1:45 System Support/Management

Job-based Accounting for UNICOS/mp, Jim Glidewell (BOEING)

With the absence of CSA (Cray System Accounting) from UNICOS/mp, the ability of sites to provide job-based accounting is limited. We have developed tools for utilizing UNICOS process accounting to provide both end-of-job and end-of-day reporting based on jobid. We will provide a look at our strategies for doing so, a detailed look at the contents of the UNICOS/mp process record, and some suggestions for how sites might make use of this data.

14B 2:10 Co-Array Fortran

Porting Experience of Open64-based Co-Array Fortran Compiler on XT3, Akira Nishida (UTOKYO)

Co-Array Fortran (CAF) is one of the most promising technologies for seamless parallel computing on multicore distributed memory machines. In this presentation, we report our experience of porting an open source CAF compiler to the Cray XT3.

14C 2:10 System Support/Management

Resource Allocation and Tracking System (RATS) Deployment on the Cray X1E, XT3, and XD1 Platforms, Robert Whitten, Jr. and Tom Barron (ORNL); David L. Hulse, Computer Associates International, Inc.; Phillip E. Pfeiffer, East Tennessee State University; Stephen L. Scott (ORNL)

The Resource Allocation and Tracking System (RATS) is a suite of software components designed to provide resource management capabilities for high performance computing environments. RATS currently supports resource allocation management on the Cray X1E, XT3, and XD1 platforms at Oak Ridge National Laboratory. In this paper, the design and implementation of RATS are described with respect to the Cray platforms, with an emphasis on the flexibility of the design. Future directions will also be explored.

14A 2:35 XT3 Performance Analysis

Boundary Element–Finite Element Method for Solid-Fluid Coupling on the XT3, Klaus Regenauer-Lieb (WASP)

An Implicit Finite Element (FEM) solver is employed for the solid and an Implicit Boundary Elements (BEM) code is used for calculating only the drag caused by the surrounding fluid. The BEM approach is applied along horizontal planes (assumption validated by 3-D flow model). This application is implemented on the XT3 and results are presented.

14B 2:35 Co-Array Fortran

Co-Array Fortran Experiences with Finite Differencing Schemes, Richard Barrett (ORNL)

A broad range of physical phenomena in science and engineering can be described mathematically using partial differential equations. Determining the solution of these equations on computers is commonly accomplished by mapping the continuous equation to a discrete representation and applying a finite differencing scheme. In this presentation we report on our experiences with several different implementations using Co-Array Fortran, with experiments run on the Cray X1E at ORNL.

14C 2:35 System Support/Management

Red Storm Systems Management: Topics on Extending Current Capabilities, Robert A. Ballance (SNLA); Dave Wallace, Cray Inc.

This paper will discuss how current management capabilities can be extended and/or supplemented by using an extensible System Description Language. We will discuss two implementations on Red Storm hardware and how these concepts can be easily extended to address other management issues.

15A 3:20 Performance via Eldorado & MTA

Eldorado Programming Environment, John Feo, Cray Inc.

Cray's Eldorado programming model supports extremely concurrent computation on unpartitioned, shared data: hundreds of thousands of hardware streams may simultaneously execute instructions affecting arbitrary locations within large, dynamic data structures. While this flexible model enables programmers to use the most efficient parallel algorithms available, to do so induces demands on the system software that bottleneck other large parallel systems. This talk presents the Eldorado programming model, identifying and elaborating on some specific shared-memory challenges and how Eldorado uniquely addresses them.

15B 3:20 and 4:00 Libraries

Cray and AMD Scientific Libraries, Chip Freitag, AMD; Mary Beth Hribar, Bracy Elton, and Adrian Tate, Cray Inc.

Cray provides optimized scientific libraries to support the fast numerical computations that Cray's customers require. For the Cray X1 and X1E, LibSci is the library package that has been tuned to make the best use of the multistreamed vector processor based system. For the Cray XT3 and Cray XD1, AMD's Core Math Library (ACML) and Cray XT3/XD1 LibSci together provide the tuned scientific library routines for these Opteron based systems. This talk will summarize the current and planned features and optimizations for these libraries. We will also present library plans for future Cray systems.

15C 3:20 Operations

Providing a Shared Computing Resource for Advancing Science and Engineering Using the XD1 at Rice University, Jan E. Odegard, Kim B. Andrews, Franco Bladilo, Kiran Thyagaraja, Roger Moye, and Randy Crawford (RICEU)

In this paper we will discuss the recent procurement and deployment of one of the largest XD1 systems in the world. The system, deployed by the Computer and Information Technology Institute at Rice University, is a major addition to the HPC infrastructure and will serve as the primary HPC resource supporting cutting edge research in engineering, bioX, science, and social science. At this point we envision that the majority of the paper will focus on hardware and software procurement, benchmarking, and deployment. The paper will also provide a view into how the system will be managed and will discuss performance, application porting, tuning, operation, and support.

15A 4:00 Performance via Eldorado & MTA

A Micro-kernel Benchmark Suite for Multithreaded Architectures, Fabrizio Petrini and Allan Snavely (SDSC); John Feo, Cray Inc.

The micro-benchmarks used to assess the performance of conventional parallel architectures, such as computing clusters, measure network bandwidth and latency and other network communication characteristics such as message startup and throughput. These benchmarks are not appropriate for shared-memory, multithreaded architectures such as the Cray MTA2 and Eldorado that have no cache and do not support message passing. In this paper we define a more appropriate suite of benchmarks for these machines. We discuss how the suite of kernels may be different for the MTA2 (a UMA machine) and Eldorado (a NUMA machine). Finally, we use the kernels to measure the performance of the MTA2 and predict the performance of Eldorado.

15C 4:00 Operations

Safeguarding the XT3, Katherine Vargo (PITTSCC)

Service nodes and the management server provide critical services for the XT3 and its users. Using open-source software, these critical services can be monitored and safeguarded from common Linux attacks. Patch management, configuration verification, integrity checking, and benchmarking software provide a solid foundation for insulating services from assaults and enable system administrators to increase system availability and reliability.

16A 4:30 XD1 Interactive Session(s)

Cray XD1 Special Interest Group Meeting (SIG), Liam Forbes (ARSC), Chair
Systems, Liam Forbes (ARSC), Focus Chair
Users, Peter Cebull (INL), Focus Chair

The purpose of this Interactive Session is to provide a forum for open discussions on the Cray XD1 system and its environment. The “Systems” area covers issues related to Operations, Systems & Integration, and Operating Systems. The “Users” area deals with issues related to Applications, Programming Environments, and User Services. Cray Inc. liaisons will be on hand to provide assistance and address questions. This session is open to everyone who has an interest in the Cray XD1 and wishes to participate.

16C 4:30 BoF

Batch Schedulers, Richard Alexander (CSCS)

The objective of this BoF will be to discuss the various batch schedulers. It will not discuss batch systems per se, but rather, the scheduling philosophies and implementations of each of the following sites:
A) CSCS with a TCL based scheduler for PBS
B) PSC with original work on external PBS scheduling
C) ORNL with MOAB, the commercial version of MAUI, for PBS
D) ERDC with LSF

Thursday

Session Number, Track Name, Session Time

Paper Title and Author(s)

Abstract

17A 8:20 FPGAs & Supercomputers

Turning FPGAs Into Supercomputers—Debunking the Myths About FPGA-based Software Acceleration, Anders Dellson, Mitrionics AB

When considering whether to use field programmable gate arrays (FPGAs) as co-processors to accelerate software algorithms, it is easy to get the impression that this is an option limited to a small elite group of highly skilled and heavily funded hardware savvy researchers. In this presentation, we will show that this is no longer true and that the 10x to 30x performance benefit and the low power consumption of FPGA-based software acceleration are now readily available to all researchers and software developers. As an example, we will show an actual application that has been accelerated using the FPGAs in the Cray XD1 and walk through the steps needed for the implementation using the Mitrion Platform.

17B 8:20 Parallel Programming

Hybrid Programming Fun: Making Bzip2 Parallel with MPICH2 and pthreads on the XD1, Charles Wright (ASA)

The author shares his programming experience and performance analysis in making a parallel file compression program for the XD1.

17C 8:20 and 9:00 Mass Storage

Lustre File System Plans and Performance on Cray Systems, Charlie Carroll and Branislav Radovanovic, Cray Inc.

Lustre is an open source, high-performance, distributed file system from Cluster File Systems, Inc. designed to address performance, availability, and scalability issues. A number of high-performance parallel Lustre file systems for supercomputers and computer clusters are available today. This talk will discuss Cray's history with Lustre, the current state of Lustre, and Cray's Lustre plans for the future. We will also provide recent performance results on a Cray XT3 supercomputer, showing both scalability and achieved performance.

17A 9:00 FPGAs & Supercomputers

Experiences with High-Level Programming of FPGAs on Cray XD1, Thomas Steinke, Konrad Zuse-Zentrum für Informationstechnik Berlin; Thorsten Schuett and Alexander Reinefeld, Zuse-Institut Berlin

We recently started to explore the potential of FPGAs for some application kernels that are important for our user community at ZIB. We give an overview of our efforts and present results achieved so far on a Cray XD1 using the Mitrion-C programming environment.

17B 9:00 Parallel Programming

Parallel Performance Analysis on Cray Systems, Craig Lucas and Kevin Roy (MCC)

Solving serial performance problems is very different from solving the performance problems of parallel codes. In many cases the behaviour of an application can depend on the number of processors used. Some parts of the code scale well and others don't; when we look at parallel codes it is these nonscaling parts that are of most interest. Identifying these components, and the point at which they become a problem, is often not an easy task. In this talk we present some software that interfaces to existing Cray performance analysis tools and presents clear and easy to view information.

17A 9:25 FPGAs & Supercomputers

Experiences Harnessing Cray XD1 FPGAs and Comparisons to other FPGA High Performance Computing (HPC) Systems, Olaf Storaasli, Melissa C. Smith, and Sadaf R. Alam (ORNL)

The Future Technologies Group at Oak Ridge National Laboratory (ORNL) is exploring advanced HPC architectures including FPGA-based systems. This paper describes experiences with applications and tools (including Mitrion C) on ORNL's Cray XD1 and comparisons with results from other FPGA-based systems (SRC, Nallatech, Digilent) and tools (VIVA, Carte) at ORNL.

17B 9:25 Parallel Programming

An Evaluation of Eigensolver Performance on Cray Systems, Adrian Tate and John G. Lewis, Cray Inc.; Jason Slemons, University of Washington

Solutions to the eigenvalue equation are the computational essence of many large-scale applications, especially in engineering and physical chemistry. Eigensolver performance is expected to vary according to the physical properties of the underlying data, but when run on parallel systems, performance is also heavily dependent on the subtleties of the interconnect network and the message passing implementation of the particular system. We will compare the performance of various eigensolvers across each of Cray's products. We will show which systems are better suited to certain methods and show how the performance of some real applications can be maximized by careful selection of method. We will also describe what research has been carried out, both inside and outside of Cray, to improve performance of parallel eigensolvers.

17C 9:25 and 9:50 Mass Storage

A Center Wide File System Using Lustre, Shane Canon and H. Sarp Oral (ORNL)

The National Leadership Computing Facility at Oak Ridge National Laboratory is currently deploying a Lustre based center wide file system. This file system will span multiple architectures and must meet demanding user requirements. The objectives for this file system will be presented along with an overview of the architecture. A discussion of issues that have been encountered during the deployment, as well as current performance numbers, will also be provided. The presentation will conclude with future plans and goals.

17A 9:50 FPGAs & Supercomputers

Status Report of the OpenFPGA Initiative: Efforts in FPGA Application Standardization, Eric Stahlberg, Kevin Wohlever (OSC); Dave Strenski, Cray Inc.

The OpenFPGA initiative began officially in early 2005. From early formative discussions, the effort has matured to become a community resource to foster and advance the adoption of FPGA technology for high-level applications. Participation includes vendors, application providers, and application users across academic, commercial and governmental organizations. With widespread and international participation by hundreds of participants, an active discussion list on FPGA related topics, and emerging working groups, the OpenFPGA effort is building a strong foundation for broad incorporation of FPGA accelerated high-level applications. This presentation will cover insight from early discussions, current organizational overview, future directions towards standardization, and information for becoming part of the OpenFPGA effort.

17B 9:50 Parallel Programming

Symmetric Pivoting in ScaLAPACK, Craig Lucas (MCC)

While the ScaLAPACK parallel numerical linear algebra library supports symmetric matrices, there is no symmetric (or "complete") pivoting in its algorithms. Here we look at an implementation of the pivoted Cholesky factorization of semi-definite matrices, which uses symmetric pivoting and is built on existing PBLAS and BLACS routines. We examine the performance of the code which is written in ScaLAPACK style and therefore uses block cyclic data distribution.
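As a point of reference, the following is a minimal serial sketch of a Cholesky factorization with symmetric (diagonal) pivoting for a symmetric positive semi-definite matrix. It is not the authors' code: it is unblocked, uses plain Python lists instead of PBLAS/BLACS and a block cyclic distribution, and the function name and the relative tolerance `tol` are illustrative assumptions.

```python
import math

def pivoted_cholesky(a, tol=1e-12):
    """Return (L, perm, rank) such that A[perm[i]][perm[j]] ~= (L L^T)[i][j].

    `a` is a symmetric positive semi-definite matrix as a list of rows.
    `tol` is a hypothetical relative threshold for deciding numerical rank.
    """
    n = len(a)
    perm = list(range(n))
    L = [[0.0] * n for _ in range(n)]
    d = [a[i][i] for i in range(n)]   # diagonal of the remaining Schur complement
    d0 = max(d) if n else 0.0
    rank = 0
    for k in range(n):
        # Symmetric pivoting: bring the largest remaining diagonal entry to position k.
        p = max(range(k, n), key=lambda i: d[i])
        if d[p] <= tol * d0:
            break                     # the remainder is numerically zero
        perm[k], perm[p] = perm[p], perm[k]
        d[k], d[p] = d[p], d[k]
        L[k], L[p] = L[p], L[k]       # rows hold only columns < k at this point
        L[k][k] = math.sqrt(d[k])
        for i in range(k + 1, n):
            s = a[perm[i]][perm[k]] - sum(L[i][j] * L[k][j] for j in range(k))
            L[i][k] = s / L[k][k]
            d[i] -= L[i][k] ** 2      # update the Schur complement diagonal
        rank += 1
    return L, perm, rank
```

Because the same permutation is applied to rows and columns, symmetry is preserved, and the loop stops early when the remaining diagonal is negligible, which is how the numerical rank of a semi-definite matrix is detected.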

18A 10:40 Operating Systems—UNICOS

Recent Trends in Operating Systems and their Applicability to HPC, Rolf Riesen and Ron Brightwell (SNLA); Patrick Bridges and Arthur Maccabe, University of New Mexico

In this paper we consider recent trends in operating systems and discuss their applicability to high performance computing systems. In particular, we will consider the relationship between lightweight kernels, hypervisors, microkernels, modular kernels, and approaches to building systems with a single system image. We will also give a brief overview of the approaches being considered in the DoE FAST-OS program.

18B 10:40 Libraries—MPI

Message Passing Toolkit (MPT) Software on XT3, Howard Pritchard, Doug Gilmore, and Mark Pagel, Cray Inc.

This talk will focus on using the Message Passing Toolkit (MPT) for XT3. An overview of the MPI and SHMEM implementations on XT3 will be given. Techniques for improving the performance of message passing and SHMEM applications on XT3 will also be discussed. In addition, new features in the MPT package for XT3 1.3 and 1.4 releases will be presented.

18C 10:40 Eldorado/MTA2

Graph Software Development and Performance on the MTA-2 and Eldorado, Jonathan Berry and Bruce Hendrickson (SNLA)

We will discuss our experiences in designing and using a software infrastructure for processing semantic graphs on massively multithreaded computers. We have developed implementations of several algorithms for connected components and subgraph isomorphism; we will discuss their performance on the existing Cray MTA-2 and their predicted performance on the upcoming Cray Eldorado. We will also describe ways in which the underlying architecture and programming model have informed algorithm design and coding paradigms.

18A 11:20 Operating Systems—UNICOS

Compute Node OS for XT3, Jim Harrell, Cray Inc.

Cray is working on different kernels for the compute node operating system. This talk will describe the rationale, requirements, and progress.

18B 11:20 Libraries—MPI

Open MPI on the XT3, Brian Barrett, Jeff Squyres, and Andrew Lumsdaine, Indiana University; Ron Brightwell (SNLA)

The Open MPI implementation provides a high performance MPI-2 implementation for a wide variety of platforms. Open MPI has recently been ported to the Cray XT3 platform. This paper discusses the challenges of porting and describes important implementation decisions. A comparison of performance results between Open MPI and the Cray supported implementation of MPICH2 is also presented.

18C 11:20 Eldorado/MTA2

Evaluation of Active Graph Applications on the Cray Eldorado Architecture, Jay Brockman, Matthias Scheutz, Shannon Kuntz, and Peter Kogge, University of Notre Dame; Gary Block and Mark James, NASA JPL; John Feo, Cray Inc.

In this paper, we discuss an approach to such problems called an "active graph," where each node in the graph is tied to a distinct thread and where the flow of data over the edges is expressed by producer/consumer exchanges between threads. We will show how several cognitive applications, including a neural network and a production system, may be expressed as active graph problems and provide results on how active graph problems scale on both the MTA-2 and Eldorado architectures.

18A 11:45 Operating Systems—UNICOS

XT3 Status and Plans, Charlie Carroll and David Wallace, Cray Inc.

This paper will discuss the current status of XT3 software and development plans for 2006.

18B 11:45 Libraries—MPI

What if MPI Collectives Were Instantaneous? Rolf Riesen and Courtenay Vaughan (SNLA)

MPI collectives, such as broadcasts or reduce operations, play an important role in the performance of many applications. How much would these applications benefit if we could improve collectives performance by using better data distribution and collection algorithms, or by moving some functionality into the SeaStar firmware, closer to the network? We answer these questions by simulating a Cray XT3 system with hypothetical, instantaneous MPI collectives.

18C 11:45 Eldorado/MTA2

Scalability of Graph Algorithms on Eldorado, Keith Underwood, Megan Vance, Jonathan Berry, and Bruce Hendrickson (SNLA)

The Eldorado platform is a successor to the Cray MTA-2 that is being built within the Cray XT3 infrastructure. The XT3 infrastructure brings with it a number of limitations in terms of network bisection bandwidth and random access memory access that are not present in the MTA-2. While the MTA-2 is an exceptional performer on graph algorithms, the new limitations being introduced by the Eldorado platform could have negative implications for scaling. This paper analyzes Eldorado to explore the domain of applications requirements to enable scaling. By mapping current graph algorithms into this domain, we find that, while Eldorado will not scale quite as well as the MTA-2, it will scale sufficiently to offer orders of magnitude better graph performance than any other platform that is available.

19A 1:30 Sizing: Page, Cache, & Meshes

The Effect of Page Size and TLB Entries on XT3 Application Performance, Neil Stringfellow (CSCS)

The AMD Opteron processor allows for two page sizes to be selected for memory allocation, with a small page size of 4 Kilobytes being used for most standard Linux systems, and a larger 2 Megabyte page size which was selected for the Catamount lightweight kernel. Although the larger page size appears more attractive for HPC applications running under a lightweight kernel, the increase in page size comes at a cost in that there are very few entries in the TLB available to store references to these pages. This paper reports on work which highlighted problems with the use of the large page size and small number of TLB entries and shows some of the performance improvements which became possible with the introduction of a small page option to the yod job launcher.
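The trade-off described above can be illustrated with a toy model: a single fully associative TLB with LRU replacement, driven by a loop that reads round-robin from several widely spaced arrays. This is only a sketch. The entry counts (512 for small pages, 8 for large pages), the array spacing, and all function names are hypothetical, and a real processor has multi-level, set-associative TLBs that this model ignores.

```python
from collections import OrderedDict

KIB = 1024
MIB = 1024 * KIB

def count_tlb_misses(addresses, entries, page_size):
    """Count misses in a toy fully associative LRU TLB with `entries` slots."""
    tlb = OrderedDict()
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            tlb[page] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses

def round_robin_addresses(num_arrays, elements, spacing, elem_size=8):
    """Yield addresses for a loop that reads element k of every array in turn."""
    for k in range(elements):
        for a in range(num_arrays):
            yield a * spacing + k * elem_size

# Hypothetical configuration: 16 arrays placed 64 MiB apart, 4096 doubles read from each.
addrs = list(round_robin_addresses(16, 4096, 64 * MIB))
small_page_misses = count_tlb_misses(addrs, entries=512, page_size=4 * KIB)
large_page_misses = count_tlb_misses(addrs, entries=8, page_size=2 * MIB)
```

In this model, a loop that touches more concurrent data streams than there are large-page entries misses on every access, while the same loop with small pages misses only once per 4 KiB page. With eight or fewer streams the result reverses and large pages give fewer misses.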

19B 1:30 XD1 Applications

Alef Formal Verification and Planning System, Samuel Luckenbill, James R. Ezick, Donald D. Nguyen, Peter Szilagyi, and Richard A. Lethin, Reservoir Labs, Inc.

We have developed a parallel SAT solver using algorithms and heuristics that improve performance over existing approaches. We have also developed compiler technologies to transform problems from several problem domains to our solver. Our system runs on the Cray XD1 machine and particularly makes use of the high performance communication hardware and interface library.

19C 1:30 Benchmarking/Comparison

Performance Comparison of Cray X1 and Cray Opteron Cluster with Other Leading Platforms Using HPCC and IMB Benchmarks, Subhash Saini (NAS); Rolf Rabenseifner (HLRS); Brian T. N. Gunney, Thomas E. Spelce, Alice Koniges, and Don Dossa (LLNL); Panagiotis Adamidis (HLRS); Robert Ciotti (NAS); Sunil R. Tiyyagura (HLRS); Matthias Müller, Dresden University of Technology; Rod Fatoohi, San Jose State University

The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of six leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, NEC SX-8 and IBM Blue Gene/L. These six systems also use six different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, NEC IXS and IBM Blue Gene/L Torus). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on five of these systems.

19A 2:00 Sizing: Page, Cache, & Meshes

Performance Characteristics of Cache-Insensitive Implementation Strategies for Hyperbolic Equations on Opteron Based Super Computers, David Hensinger and Chris Luchini (SNLA)

For scientific computation, moving data from RAM into processor cache incurs significant latency that comes at the expense of useful calculation. Cache-insensitive strategies attempt to amortize the cost of moving high-latency data by carrying out as many operations as possible using in-cache data. For explicit methods applied to hyperbolic equations, cache-insensitive algorithms advance regions of the solution domain through multiple time steps. Success of these strategies requires the performance advantage of computing in-cache to offset the overhead of managing multiple solution states. The performance characteristics of several cache-insensitive implementation strategies were tested. The implications for multiprocessor simulations were also examined.

19B 2:00 XD1 Applications

XD1 Implementation of a SMART Coprocessor for Fuzzy Matching in Bioinformatics Applications, Eric Stahlberg and Harrison Smith (OSC)

Efficiently matching or locating small nucleotide sequences on large genomes is a critical step in the Serial Analysis of Gene Expression method. This challenge is made increasingly difficult as a result of experimental assignment errors introduced in the match and target sequence data. Early versions of SAGESpy were developed incorporating the Cray Bioinformatics Libraries to enable large-scale pattern matching of this type. This presentation describes the design of an FPGA-based scalable fuzzy DNA sequence matching algorithm implemented specifically for the XD1. Results of this implementation will be compared to performance on earlier Cray SV1 and current Cray X1 systems.

19C 2:00 Benchmarking/Comparison

Performance Analysis of Cray X1 and Cray Opteron Cluster, Panagiotis Adamidis (HLRS); Rod Fatoohi, San Jose State University; Johnny Chang and Robert Ciotti (NAS)

We study the performance of two Cray systems (Cray X1 and Cray Opteron Cluster) and compare their performance with an SGI Altix 3700 system. All three systems are located at NASA Ames Research Center. We mainly focus on network performance using different numbers of communicating processors and communication patterns—such as point-to-point communication, collective communication, and dense communication patterns. Our results show the impact of the network bandwidth and topology on the overall performance of each system.

19A 2:30 Sizing: Page, Cache, & Meshes

Investigations on Scaling Performance and Mesh Convergence with Sandia's ASC FUEGO Code for Fire Model Predictions of Heat Flux, Courtenay Vaughn, Mahesh Rajan and Amalia Black (SNLA)

Performance characteristics for coupled fire/thermal response prediction simulations are investigated using coarse, medium and fine finite-element meshes and running the simulation on the Red Storm/XT3. These simulations have leveraged computationally demanding mesh convergence studies to obtain detailed timings of the various phases of the computation and will be helpful in performance tuning.

19B 2:30 XD1 Applications

Simulating Alzheimer on the XD1, Jan H. Meinke and Ulrich H. E. Hansmann (KFA)

Misfolding and aggregation of proteins cause Alzheimer's disease, BSE, and other neurodegenerative diseases. We simulate the aggregation of a seven-amino-acid fragment of the protein Abeta, responsible for the formation of toxic fibrils in Alzheimer's disease. We use parallel tempering, an advanced Monte Carlo algorithm that scales nearly optimally.
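For readers unfamiliar with parallel tempering, the sketch below shows the general idea on a toy one-dimensional double-well energy: several replicas run Metropolis Monte Carlo at different inverse temperatures and periodically attempt to exchange configurations with a neighbour. The energy function, step size, and all names are assumptions for illustration; the protein model and code used by the authors are not described in the abstract.

```python
import math
import random


def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def swap_probability(beta_i, beta_j, e_i, e_j):
    """Standard replica-exchange acceptance: min(1, exp((bi - bj)(Ei - Ej)))."""
    arg = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if arg >= 0.0 else math.exp(arg)


def parallel_tempering(betas, sweeps, rng):
    """Run one replica per inverse temperature; return final positions."""
    xs = [rng.uniform(-2.0, 2.0) for _ in betas]
    for _ in range(sweeps):
        # Independent Metropolis move for each replica.
        for k, beta in enumerate(betas):
            trial = xs[k] + rng.uniform(-0.5, 0.5)
            delta = energy(trial) - energy(xs[k])
            if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
                xs[k] = trial
        # Attempt to exchange configurations between neighbouring replicas.
        for k in range(len(betas) - 1):
            p = swap_probability(betas[k], betas[k + 1],
                                 energy(xs[k]), energy(xs[k + 1]))
            if rng.random() < p:
                xs[k], xs[k + 1] = xs[k + 1], xs[k]
    return xs
```

The Metropolis moves of different replicas are independent and only the occasional swap step needs communication, which is one reason the method parallelizes well.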

19C 2:30 Benchmarking/Comparison

An Accelerated Implementation of Portals on the Cray SeaStar, Ron Brightwell, Trammell Hudson, Kevin Pedretti, and Keith D. Underwood (SNLA)

This paper describes an accelerated implementation of the Portals data movement layer on the Cray SeaStar used in the XT3 platform. The currently supported implementation of Portals is interrupt-driven and does not take full advantage of the embedded processor on the SeaStar. The accelerated implementation offloads a significant portion of the network protocol stack to the SeaStar, providing significant improvements in network performance. This paper will describe this new implementation and show results for several network micro-benchmarks as well as applications.

20 3:10 General Session

The AWE HPC Benchmark 2005, Ron Bell, S. Hudson, and N. Munday (AWE)

Near the end of 2005, the UK's Atomic Weapons Establishment (AWE) placed an order for a Cray XT3 system with 3936 dual-core nodes (over 40 Tflops peak) to replace its existing HPC system. This paper describes the design of the benchmark used during the competitive procurement preceding this order and presents details of the evaluation process and the results. These include a speedup of more than a factor of two that Cray obtained by tuning the source code of the most important application.

Copyright © Cray User Group, Incorporated