Session day |
Session number |
Track name |
Session time |
Author(s) |
Paper title |
Abstract |
Monday |
1A |
Tutorials |
8:30 |
Mark Swan and Steve Sjoquist, Cray Inc. |
Red Storm Management System (RSMS): The Central Nervous System of the Cray XT3 and Cray Red Storm Computer Systems |
This tutorial will focus on the architecture and features of the RSMS. We will cover such topics as movement of events in the system, manager applications that process events, where log information is stored, etc. |
Monday |
1B |
Tutorials |
8:30 |
Amar Shan, Cray Inc. |
Active Manager for Cray XD1 |
Active Manager is Cray's solution to the challenges of maintaining large, multi-processor systems, cited by system administrators as the single biggest impediment to implementing production clusters. Active Manager combines single system management with task-centric wizards and the ability to collect, correlate, respond to and report on a wide variety of occurrences within the system. It also lets administrators tap into these capabilities and build custom behaviors. Active Manager is a key part of Cray's strategy to help customers build and maintain future generations of systems of tens of thousands of nodes. This tutorial will discuss the technical foundations for Active Manager, its capabilities, features and future directions. |
Monday |
2 |
General Session |
11:45 |
Erik P. DeBenedictis (SNLA) |
Petaflops, Exaflops, and Zettaflops for Science and Defense |
Supercomputing has become a driver for science, defense, and other problems for society. This talk will give examples of important problems solvable with supercomputers at Petaflops, Exaflops, and Zettaflops levels and some approaches to building such computers. While these levels go substantially beyond current Cray products, the talk is aimed at fostering a discussion of future products and their use. |
Monday |
3A |
Reconfigurable Computing (FPGA) |
2:00 |
Craig Ulmer and David Thompson (SNLA) |
Reconfigurable Computing Characteristics of the Cray XD1 |
The Cray XD1 is one of the first commercial HPC architectures to embed Field-Programmable Gate Arrays (FPGAs) into the memory fabric of a system's compute nodes. The tight coupling between FPGAs and CPUs in the XD1 provides an attractive platform for reconfigurable computing research, where FPGAs are used as a means of providing hardware support for accelerating application kernels. In this paper we examine the communication characteristics of the XD1 architecture for reconfigurable computing operations, and describe our initial experiences in utilizing the XD1's FPGAs for offloading computations. |
Monday |
3B |
Red Storm Performance & Visualization |
2:00 |
Mark Taylor and Bill Spotz (SNLA) |
Performance of the Spectral Element Method on Red Storm |
We hope to present Red Storm performance and scalability results for a spectral element atmospheric model (SEAM), which is part of NCAR's High Order Multi-scale Modeling Environment (HOMME). Results on other mesh interconnect supercomputers (ASCI Red and BlueGene/L) strongly suggest that SEAM, at resolutions of 20km and higher, will easily scale to the full 10,000 CPUs of Red Storm. HOMME/SEAM is being developed as a highly scalable dynamical core for use in the US's Community Climate System Model. |
Monday |
3C |
Benchmarking |
2:00 |
Nathan Wichmann, Cray Inc. |
Cray and HPCC: Benchmark Developments and Results from the Past Year |
The High Performance Computing Challenge (HPCC) benchmark continued to evolve in 2004 and 2005. HPCC developments will be discussed with particular emphasis on what has changed since CUG 2004. Results will be given for a number of different Cray machines. |
Monday |
3A |
Reconfigurable Computing (FPGA) |
2:30 |
Geert Wenes, Jim Maltby, and David Strenski, Cray Inc. |
Applications Development with FPGAs: Simulate but Verify |
Reconfigurable Computing (RC) exploits hardware, such as FPGAs, that can be reconfigured to implement specific functionality better suited to specially tailored hardware than to a general-purpose processor. However, applications development and performance optimization on these systems typically depend on the skill and experience of hardware designers, not necessarily software engineers. This has prevented more widespread use of RC. Hence, a current challenge in this area is the establishment of a methodology targeted at the software engineer and application developer. We will present such a methodology, with examples. |
Monday |
3B |
Red Storm Performance & Visualization |
2:30 |
Steve Plimpton (SNLA) |
Performance of Biological Modeling Applications on Red Storm |
I'm a developer and user of two applications used at Sandia and elsewhere: a molecular dynamics package called LAMMPS (www.cs.sandia.gov/~sjplimp/lammps.html) and a Monte Carlo cell simulator called ChemCell. I'll present performance and scalability results for these two codes running several benchmark problems on the new Red Storm machine, and compare the results to other machines. One of the molecular dynamics kernels is 3d FFTs, which are a good stress-test of a parallel architecture, so I'll also highlight their performance. |
Monday |
3C |
Benchmarking |
2:30 |
Mike Ashworth, Ian J. Bush, and Martyn F. Guest, CCLRC Daresbury Laboratory |
Vector vs. Scalar Processors: A Performance Comparison Using a Set of Computational Science Applications |
Vector processors have not gone away and there is still some considerable debate over the performance, cost-effectiveness and ease-of-use of vector versus scalar architectures. We have run a number of capability application benchmarks from CFD, environmental modelling, computational chemistry and molecular simulation on the Cray X1 in comparison with high-performance MPP systems such as the IBM p690 and the SGI Altix. We find a range of performance where the Cray X1 is clearly extremely effective for some codes but less well-suited to others, and we discuss how this arises from the computational requirements of each code. |
Monday |
3A |
Reconfigurable Computing (FPGA) |
3:00 |
Joseph Fernando, Dennis Dalessandro, and Ananth Devulapalli (OSC) |
Accelerated FPGA Based Encryption |
As the demand for accessible data increases, encryption is necessary to protect the integrity and security of that data. Due to the highly parallel nature of the AES encryption algorithm, an FPGA-based approach provides the potential for up to an order of magnitude increase in performance over a traditional CPU. Our effort will showcase the capability of FPGA-based encryption on the Cray XD1 as well as other FPGA-related efforts at OSC's center for data intensive computing. |
Monday |
3B |
Red Storm Performance & Visualization |
3:00 |
Constantine Pavlakos and David White (SNLA) |
Red Storm’s Data Analysis and Visualization Environment |
The Red Storm platform will provide Sandia National Laboratories with a substantial increase in supercomputing capability. In order to support real computational science and engineering applications, Red Storm's supercomputing capability must be complemented with commensurate hardware infrastructure and high performance software tools that enable effective data analysis and visualization, for comprehension of supercomputing results. This presentation will describe the data analysis and visualization environment that is being implemented for Red Storm. |
Monday |
3C |
Benchmarking |
3:00 |
Jeff Candy and Mark Fahey (ORNL) |
GYRO Performance on a Variety of MPP Systems |
The performance of GYRO, both routine-specific and aggregate, on a collection of modern MPP platforms is given. We include data for a variety of IBM, Cray, AMD/Intel and other systems. The performance of a new FFT method for the Poisson bracket evaluation is also detailed, comparing performance of FFTW, ESSL and Cray SciLib. |
Tuesday |
7A |
Architectures |
11:00 |
Scott Studham, Arthur S. Bland, Robert J. Harrison, Thomas H. Dunigan, Mark R. Fahey, Thomas C. Schulthess, Robert J. Silvia, Jeffrey S. Vetter, James B. White III, and Patrick H. Worley (ORNL) |
Leadership Computing at Oak Ridge National Laboratory |
Oak Ridge National Laboratory is running the world's largest Cray X1 and the world's largest Cray XD1, and we have begun to take delivery of a 20+TF Cray XT3. In this report we provide an overview of the performance of these platforms using performance and scalability results from applications in the areas of chemistry, climate, and materials. We then discuss ways in which we are working with Cray to establish a roadmap that will provide hundreds of teraflops of sustained performance while integrating a balance of vector and scalar processors. |
Tuesday |
7B |
Modern FORTRAN |
11:00 |
Bill Long, Cray Inc. |
Fortran 2003 and Beyond |
Fortran 2003 has replaced f95 as the official Fortran standard. Cray's progress in implementing the new standard will be reviewed. Features of the evolving Fortran 2008 standard proposal will be discussed. |
Tuesday |
7C |
Scheduling |
11:00 |
Chad Vizino (PITTSC) |
Batch Scheduling on the Cray XT3 |
The Pittsburgh Supercomputing Center has implemented a custom batch scheduler, Simon, to operate with PBS on one of its massively parallel systems. The design and implementation of an early custom scheduler on the XT3 as well as a functional overview and adaptation of Simon to the XT3 will be discussed. |
Tuesday |
7A |
Architectures |
11:45 |
Simon Kahan, John Feo, David Harper, and Petr Konecny, Cray Inc. |
Eldorado |
Eldorado is Cray's upcoming shared-memory, multithreaded computing platform. Sharing all but the processor itself with the XT3 and Red Storm platforms, Eldorado's design achieves high reliability at low cost. Leveraging the hardware infrastructure of these large-scale distributed-memory systems, Eldorado scales shared memory to thousands of processors and over one hundred terabytes of memory. Eldorado's processor is similar to that of the MTA-2, tolerating memory latency by virtue of multithreading and supporting a straightforward programming model. We present Eldorado by way of comparison to the MTA-2, highlighting architectural differences and illustrating with simple programming examples how and why high performance is achieved on problems that thwart SMP clusters due to their high memory and synchronization latencies. |
Tuesday |
7B |
Modern FORTRAN |
11:45 |
Bill Long and Ted Stern, Cray Inc. |
Programming for High Performance Computing in Modern Fortran |
We will discuss general programming guidelines for high performance computing in any language. We will then show that these guidelines are easy to follow using the capabilities of modern Fortran, and we will demonstrate this using several real world applications. |
Tuesday |
7C |
Scheduling |
11:45 |
James Ang, Robert A. Ballance, Lee Ann Fisk, Jeanette R. Johnston, and Kevin T. Pedretti (SNLA) |
Red Storm Capability Computing Queuing Policy |
Red Storm will be the first Tri-Lab [Sandia National Laboratories (SNL), Los Alamos National Laboratory (LANL), and Lawrence Livermore National Laboratory (LLNL)], U.S. Department of Energy/National Nuclear Security Administration, Advanced Simulation and Computing (ASC) platform to be managed under an explicit capability computing policy directive. Instead of allocating nodes among SNL:LANL:LLNL in the 2:1:1 ratio, Red Storm will use PBS-Pro (the commercial version of the Portable Batch System) to manage fairshare priority among the labs so that, in the long run, their node-hours of usage will follow the 2:1:1 ratio. The basic queuing policy design will be described along with extensions to handle switching between classified and unclassified computing, use by ASC University Partners, priority access, etc. |
Tuesday |
8A |
XD1/AMD |
2:00 |
Mark Fahey, Thomas H. Dunigan, Jr., Nina Hathaway, R. Scott Studham, Jeffrey S. Vetter, and Patrick H. Worley (ORNL) |
Early Evaluation of the Cray XD1 |
Oak Ridge National Laboratory recently received 12 chassis of early access Cray XD1 nodes, where each chassis has 12 AMD Opteron processors. This paper describes our initial experiences with the system, including micro-benchmark, kernel, and application benchmark results. |
Tuesday |
8B |
X1 Compiler Performance |
2:00 |
Terry Greyzck, Cray Inc. |
X1 Tuning for Performance Using Compilation Options |
The X1 compilers' command-line options and source code directives provide significant user control over program optimization. This paper describes how commonly used options and directives influence specific optimizations, affect compile time, and sometimes cause unexpected side effects. |
Tuesday |
8C |
XT3 Experience |
2:00 |
David M. Hensinger and Robert J. Mackinnon (SNLA) |
Performance Characteristics of Multi-Block Structured ALEGRA on Red Storm |
Preliminary performance results were collected from simulations using the multi-block structured capability of the ALEGRA code. The simulations involve detonations in a geologic material and the subsequent response of buried structures. Comparisons are made to the performance of similar runs on Janus. |
Tuesday |
8A |
XD1/AMD |
2:30 |
Douglas Doerfler and Courtenay Vaughan (SNLA) |
Characterizing Compiler Performance for the AMD Opteron Processor on a Parallel Platform |
Application performance on a high-performance parallel platform depends on a variety of factors, the most important being the performance of the high-speed interconnect and the compute node processor. The performance of the compute processor depends on how well the compiler optimizes for a given processor architecture and how well it optimizes the application's source code. An analysis of uni-processor and parallel performance using different AMD Opteron compilers on key SNL application codes is presented. |
Tuesday |
8B |
X1 Compiler Performance |
2:30 |
Lee Higbie, Tom Baring, and Ed Kornkven (ARSC) |
Simple Loop Performance on the Cray X1 |
We analyzed the performance of the Cray X1 on two sets of nested loops iterating over triply subscripted arrays and found performance variations from 1.8 to 818 M(64-bit)FLOPS (a 457:1 ratio). The interaction between the multi-level cache, vector operations, memory bank conflicts, and "multi-streaming" is the apparent cause of most of the large speed variations. We used only two simple computations but varied many parameters such as the order of the loops, the subscripting style, compilation options, precision, and language. With the inclusion of 32-bit and 128-bit floating point arithmetic, the speed range increased to about 40,000 to 1, from 0.04 to almost 1500 MFLOPS. We found that compiler feedback on loop vectorization and multistreaming was a poor performance predictor. We conclude the paper with some observations on estimating and improving Cray X1 application performance. |
Tuesday |
8C |
XT3 Experience |
2:30 |
Nick Nystrom, Roberto Gomez, David O'Neal, Richard Raymond, R. Reddy, John Urbanic, and Yang Wang (PITTSC) |
Early Applications Experience on the Cray XT3 |
Following initial efforts that resulted in executing applications on the XT3 in earthquake simulation, materials science, weather modeling, and cosmology for SC2004, PSC is now working with over 25 research groups to enable a broad range of scientific achievements. In this paper, we survey PSC's XT3 results to date, emphasizing applications' architectural and performance characteristics. |
Tuesday |
8A |
XD1/AMD |
3:00 |
Don Morton (ARSC), Eugene M. Petrescu, National Weather Service, and Bryce L. Nordgren, United States Forest Service |
Real-Time High-Resolution Weather Modeling in "Rugged" Regions |
Numerical weather prediction in mountainous regions presents unique challenges for capturing the important small-scale dynamics of these complex environments. The stakes are high, as expeditious, localized forecasts are required for activities such as firefighting and aviation. This work presents a performance evaluation of the Cray XD1 (in comparison with an IBM p655+) execution of the Weather Research and Forecasting (WRF) model applied to "rugged" regions with resolutions of 7.5km, 2.5km and 830m. |
Tuesday |
8B |
X1 Compiler Performance |
3:00 |
Hongzhang Shan and Erich Strohmaier, Lawrence Berkeley National Laboratory |
MPI and SHMEM Performance on the Cray X1: A Case Study Using APEX-Map |
APEX-Map is a synthetic performance probe focusing on global data movement. By integrating temporal locality and spatial locality into its design, it can be used to analyze the performance characteristics of a platform and compare performance across different architectures or different programming paradigms. In this paper, we present the results of APEX-Map on the Cray X1 and use them to analyze this specific platform. We will also discuss several performance problems we have found in the current MPI implementation. |
Tuesday |
8C |
XT3 Experience |
3:00 |
Jeffrey Vetter, Sadaf Alam, Richard Barrett, Thomas H. Dunigan, Jr., Mark R. Fahey, Jeff Kuehn, James B. White III, and Patrick H. Worley (ORNL) |
Early Evaluation of the Cray XT3 at ORNL |
Oak Ridge National Laboratory recently received delivery of a Cray XT3. This paper describes our initial experiences with the system, including micro-benchmark, kernel, and application benchmark results. In particular, we provide performance results for important DoE application areas including climate, materials, and fusion. |
Wednesday |
11A |
Electronic Structure/Molecular Dynamics |
8:30 |
Yang Wang (PITTSC), Malcolm Stocks (ORNL), Aurelian Rusanu, Florida Atlantic University, D.M.C. Nicholson and Markus Eisenbach (ORNL), and J.S. Faulkner, Florida Atlantic University |
Towards Petacomputing in Nanotechnology |
Nanoscience holds significant future scientific and technological opportunities. Realizing this potential requires realistic quantum mechanical simulation of real nanostructures, which usually involves very large problem sizes (in the range of thousands to tens of thousands of atoms). Interestingly, recent advances in our locally self-consistent multiple scattering (LSMS) method, a first-principles order-N scaling technique specifically implemented to exploit massively parallel computing, are making the direct quantum simulation of nanostructures a not too distant possibility. In this presentation we will show that this effectively accomplishes the first step towards understanding the electronic and magnetic structure of a 5nm cube of Fe, which contains approximately 12,000 atoms, and will indicate to what extent future petaflop computing systems may enable the study of the dynamics of the magnetic switching process. |
Wednesday |
11B |
UPC: X1, T3E, and ... |
8:30 |
Andrew Johnson (NCS-MINN) |
Evaluation and Performance of a UPC-Based CFD Application on the Cray X1 |
We present some recent results of the performance of our in-house computational fluid dynamics (CFD) codes on the Cray X1 for use in large-scale scientific and engineering applications. Both the vectorization/multi-streaming performance, and the code's implementation and use of Unified Parallel C (UPC) as the main inter-processor communication mechanism, will be discussed. Comparisons of UPC performance and behavior with the traditional MPI-based implementation will be shown through CFD benchmarking and other special communication benchmark codes. |
Wednesday |
11C |
Facilities and Systems Operations |
8:30 |
Robert A. Ballance, Milton Clauser, Barbara J. Jennings, David J. Martinez, John H. Naegle, John P. Noe, and Leonard Stans (SNLA) |
Red Storm Infrastructure at Sandia National Laboratories |
Large computing systems, like Sandia National Laboratories' Red Storm, are not deployed in isolation. The computer, its infrastructure support, and the operations of both must be designed to support the mission of the organization. This talk and paper will describe the infrastructure, operations, and support requirements for Red Storm along with a discussion of how Sandia is meeting those requirements. It will include overviews of the facilities and processes that surround and support Red Storm: shelter, power, cooling, security, networking, interactions with other Sandia facilities, operations support, and user services. |
Wednesday |
11A |
Electronic Structure/Molecular Dynamics |
9:15 |
Roberto Ansaloni, Cray Inc., and Carlo Cavazzoni and Giovanni Erbacci (CINECA) |
Quantum-Espresso Performance on Cray Systems |
Quantum-ESPRESSO (opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization) is a Total Energy and Molecular Dynamics simulation package based on Density-Functional Theory, using a Plane-Wave basis set and Pseudopotentials. The Quantum-ESPRESSO package is composed of three main simulation engines: PWscf for Total Energy and Ground State properties, FPMD for Car-Parrinello molecular dynamics with norm-conserving pseudopotentials, and CP for Car-Parrinello molecular dynamics with ultra-soft pseudopotentials. The Quantum-ESPRESSO suite has been ported and optimized on Cray vector and MPP systems. This talk will discuss some porting and optimization issues and will report on the performance achieved. |
Wednesday |
11B |
UPC: X1, T3E, and ... |
9:15 |
Philip Merkey (MTU) |
AMOs in UPC |
We have been focusing on trying to understand the performance implications of using one model on machines with completely different communication capabilities and on the use of atomic memory operations within the UPC programming model to control synchronization and to hide latency. Examples from runs on Beowulfs, the T3E, and the X1 will be presented. |
Wednesday |
11C |
Facilities and Systems Operations |
9:15 |
Jon Stearley (SNLA) |
Towards a Specification for the Definition and Measurement of Supercomputer Reliability, Availability, and Serviceability (RAS) |
The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved and hinders their solution. This paper provides a survey of existing practices, and proposes standardized definitions and measurements. These are modeled after the SEMI-E10 specification which is widely used in the semiconductor manufacturing industry. |
Wednesday |
11A |
Electronic Structure/Molecular Dynamics |
10:00 |
Konrad Wawruch, Witold Rudnicki, Wojciech Burakiewicz, Joanna Slowinska, Franciszek Rakowski, Michal Lopuszynski, Lukasz Bolikowski, Mariusz Kozakiewicz, Maciej Cytowski, and Maria Fronczak (WARSAWU) |
Tuning Vector and Parallel Performance for Molecular Dynamics Codes on the Cray X1E |
The ICM development team has ported and optimized several molecular dynamics codes for the Cray vector architecture. We will present a Cray X1E performance comparison with a PC Opteron cluster for these codes, along with a description of porting issues and of vectorization and parallelism tuning. The applications we have worked on are CHARMM, DFTB, VASP, CPMD, Siesta, and Gromos. |
Wednesday |
11B |
UPC: X1, T3E, and ... |
10:00 |
Tarek El-Ghazawi, Francois Cantonnet, and Yiyi Yao, George Washington University, and Jeffrey Vetter (ORNL) |
Evaluation of UPC on the Cray X1 |
UPC is a parallel programming language that enables programmers to expose parallelism and data locality in applications with an efficient syntax. Recently, UPC has been gaining attention from vendors and users as an alternative programming model for distributed memory applications. Therefore, it is important to understand how such a potentially powerful language interacts with one of today's most powerful contemporary architectures: the Cray X1. In this paper, we evaluate UPC on the Cray X1 and examine how the compiler exploits the important features of this architecture, including the use of the vector processors and multi-streaming. Our experimental results on several benchmarks, such as STREAM, RandomAccess, and selected workloads from the NAS Parallel Benchmark suite, show that UPC can provide a high-performance, scalable programming model, and we show users how to leverage the power of the X1 for their applications. However, we also identify areas where compiler analysis could be more aggressive, as well as potential performance caveats. |
Wednesday |
11C |
Facilities and Systems Operations |
10:00 |
Arthur S. Bland and Richard Alexander (ORNL), Jeffrey Becklehimer, Cray Inc., Nina Hathaway, Don Maxwell, and Robert Silvia (ORNL), and Cathy Wills, Cray Inc. |
System Integration Experience Across the Cray Product Line |
Two years ago, the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory installed an early Cray X1, and the CCS has since upgraded this system to 512 MSPs. The CCS recently acquired substantial Cray XT3 and XD1 systems, which with the X1 cover the full spectrum of the Cray product line. We describe our experiences with installing, integrating, maintaining, and expanding these systems, along with plans for further expansion and system development. |
Wednesday |
12 |
General Session |
11:30 |
Barbara Jennings (SNLA) |
Support for High Performance Computing Users: Developing an On-line Knowledge Community |
Sandia faces a unique situation in addressing the support needs of the users of our high performance computer systems. The support required for the HPC (High Performance Computing) environment is unique in that: 1) the landscape is dynamic, constantly changing and being created on a daily basis, 2) the customers are often the experts, 3) problem solving requires research into areas that may or may not have a previous record, 4) the users are local as well as remote, and 5) support must be interoperative among the support teams from local sites as well as the remote host locations. Our model, supported by advanced academic research, is based on information gathering, dissemination, and management, and results in a collaborative knowledge resource that doubles as a tool to be utilized by users who need to research information and consultants who need to research, collaborate, and compile technical solutions. |
Wednesday |
13A |
XD1 - FPGA (Bioinformatics) |
2:00 |
Eric Stahlberg, Joseph Fernando, and Kevin Wohlever (OSC) |
Accelerated Biological Meta-Data Generation and Indexing on the Cray XD1 |
The volume and diversity of biological information to be integrated in comparative bioinformatics studies continues to grow. Increasingly, the information is unstructured and lacks the annotation needed to make the associations required for comparative analysis. FPGAs, and the Cray XD1 in particular, provide a means to rapidly generate the dynamic metadata needed to enhance the value of unstructured and semi-structured data. As with compression and encryption efforts, the presentation will showcase work at the OSC center for data intensive computing to apply FPGA technology to the challenge of generalized biological sequence indexing as a foundation for comparative analysis and subsequent predictive inference. |
Wednesday |
13B |
Communication/Bandwidth |
2:00 |
Patrick H. Worley, Thomas H. Dunigan, Jr., Mark R. Fahey, Jeffrey S. Vetter, and James B. White III (ORNL) |
Comparative Analysis of Interprocess Communication on the X1, XD1, and XT3 |
All three of Cray's most recent products, X1, XD1, and XT3, are built around high performance interconnects. In this talk we contrast and compare interprocess communication performance for these three systems, using both MPI and other messaging layers. We describe peak communication performance, how it can be achieved, and how the optimal programming models and styles differ among the three systems. |
Wednesday |
13C |
Networking |
2:00 |
Kevin Pedretti and Trammell Hudson (SNLA) |
Developing Custom Firmware for the Red Storm SeaStar Network Interface |
The Red Storm SeaStar network interface contains an embedded PowerPC 440 CPU that, out-of-the-box, is used solely to handle network protocol processing. Since this is a fully programmable CPU, power users may wish to program it to perform additional tasks such as NIC-based collective operations and NIC-level computation. This paper describes our experiences with developing custom firmware for the SeaStar. In order to make the SeaStar more accessible to nonexperts, we have developed a C version of the assembly based firmware provided by Cray. While this firmware results in slightly lower performance, it should be much easier to understand and to quickly extend with new features. A detailed overview of the SeaStar programming environment will be presented along with optimization techniques that we have found beneficial. |
Wednesday |
13A |
XD1 - FPGA (Bioinformatics) |
2:30 |
Jim Maltby, Jeff Chow and Steve Margerm, Cray Inc. |
FPGA Acceleration of Bioinformatics on the XD1: A Case Study |
Bioinformatics algorithms are particularly well-suited for FPGA acceleration because of their natural parallelism and use of subword data types. The Smith-Waterman algorithm is widely used in Bioinformatics, and several FPGA-based accelerators have been marketed. This talk will describe our experiences porting the Smith-Waterman algorithm to the XD1's FPGA Application Acceleration Coprocessor. |
Wednesday |
13B |
Communication/Bandwidth |
2:30 |
Rolf Rabenseifner (HLRS) |
Balance of HPC Systems Based on HPCC Benchmark Results |
Based on results reported by the HPC Challenge benchmark suite (HPCC), the balance between computational speed, communication bandwidth, and memory bandwidth is analyzed for HPC systems from Cray, NEC, IBM, and other vendors, and for clusters with various network interconnects. Strengths and weaknesses of the communication interconnects are examined for three communication patterns. |
Wednesday |
13C |
Networking |
2:30 |
Steven Carter (ORNL) |
High-Speed Networking with Cray Supercomputers at ORNL |
With modern supercomputers consuming and producing more data than can be transferred over networks in a reasonable amount of time, network performance is becoming an increasingly important aspect to understand and improve. This paper reviews the networking aspects of the Cray X1, the XD1, and the XT3 and presents Oak Ridge National Laboratory's work in integrating them into its high-speed networking environment. |
Wednesday |
13A |
XD1 - FPGA (Bioinformatics) |
3:00 |
Ronald Minnich and Matthew J. Sottile (LANL) |
Early Results from a Lattice-Gas Simulation on the Cray XD1 FPGA |
Lattice Gas Automata (LGA) simulate the behavior of certain types of complex physical flows using massively parallel cellular automata. For example, modeling flow through porous media has been shown to be a successful application of the LGA method, producing results comparable to traditional CFD methods. In this talk, we describe a first implementation of a simple 2D LGA on the Cray XD1 FPGAs. We first describe a structural design, then a streaming design, and discuss results for each. |
Wednesday |
13B |
Communication/Bandwidth |
3:00 |
John Levesque, Cray Inc. |
Comparisons of the X1 to X1E on Major Scientific Applications |
This talk will cover recent work running applications such as POP, CAM, GYRO, CTH, PARATEC and other important DOE applications on the X1 and X1E. Analysis of memory bandwidth considerations, network bandwidth and MPI operations will be covered. |
Wednesday |
13C |
Networking |
3:00 |
Ron Brightwell, Tramm Hudson, Kevin Pedretti, Rolf Riesen, and Keith Underwood (SNLA) |
Portals 3.3 on the Sandia/Cray Red Storm System |
The Portals 3.3 data movement interface was developed at Sandia National Laboratories in collaboration with the University of New Mexico over the last ten years. The latest version of Portals is the lowest-level network transport layer on the Sandia/Cray Red Storm platform. In this paper, we describe how Portals are implemented for Red Storm and discuss many of the important features and benefits that Portals provide for supporting various services in the Red Storm environment. |
Thursday |
16A |
FPGAs/XD1 |
8:30 |
Dave Strenski, Cray Inc., Michael Babst and Roderick Swift, DSPLogic |
Evaluation of Running FFTs on the Cray XD1 with Attached FPGAs |
FFTW is a subroutine library for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data. FFTW has been benchmarked on several platforms and is the FFT library of choice for most applications. This paper will present the results from the development of a hardware-accelerated FFT using an FPGA attached to the Cray XD1. The results are compared to the FFTW benchmarks. |
Thursday |
16B |
Compilers and Tools |
8:30 |
Luiz DeRose, Bill Homer, Dean Johnson, and Steve Kaufmann, Cray Inc. |
The New Generation of Cray Performance Tools |
In order to achieve and sustain high performance on today's supercomputer systems, users require the help of state-of-the-art tools to measure and understand the performance behavior of applications. In this paper we will present the new generation of Cray performance tools and discuss our plans and future directions. |
Thursday |
16C |
X1 Performance and Experiences |
8:30 |
George R. Carr, Jr. (ORNL), Matthew J. Cordery, Cray Inc., John B. Drake, Michael W. Ham, Forrest M. Hoffman, and Patrick H. Worley (ORNL) |
Porting and Performance of the Community Climate System Model (CCSM3) on the Cray X1 |
The Community Climate System Model (CCSM3) is the primary model for global climate research in the United States and is supported on a variety of computer systems. We present our porting experiences and the initial performance of the CCSM3 on the Cray X1. We include the status of work in progress on other systems in the Cray product line. |
Thursday |
16A |
FPGAs/XD1 |
9:15 |
Tyler Reed and Mladen Markov, Koan Corporation |
Middleware Challenges for Reconfigurable Computing |
Lack of robust middleware for integrating reconfigurable computers (RC) like the Cray XD1 into production environments can reduce the velocity of technology adoption. Many challenges in creating such middleware lie in the uniqueness of RC platforms. The opportunities resulting from meeting these challenges can add whole new dimensions of implementation. We will highlight these by reviewing the development of the Cray XD1 interface within our middleware product. |
Thursday |
16B |
Compilers and Tools |
9:15 |
Rich Collier, Etnus, Luiz DeRose, Cray Inc., John DelSignore, Etnus, Robert W. Moench and Bob Clark, Cray Inc. |
TotalView Happenings on Cray Platforms |
TotalView is uniquely well-suited to debugging the applications developed on Cray supercomputers for high-end modeling and simulation. These applications are very large and complex, with long run times. This talk will provide an update on currently available functionality and future plans for Etnus' TotalView Debugger on the Sandia Red Storm, Cray Red Storm, Cray X1, Cray XT3, and Cray XD1 machines. In addition, general TotalView product directions for the Cray platforms will be discussed, including plans for Cray Rainier. |
Thursday |
16C |
X1 Performance and Experiences |
9:15 |
Rupak Biswas, Subhash Saini, Sharad Gavali, Henry Jin, Dennis C. Jespersen, M. Jahed Djomehri, and Nateri Madavan (NAS) |
NAS Experience with the Cray X1 |
Our recent experience at NAS with porting and executing micro-benchmarks, synthetic application kernels, and full-scale scientific applications on the NAS Cray X1 will be presented. Parallel performance results will be compared with other supercomputing platforms such as the IBM Power4 and the SGI Altix. |
Thursday |
16A |
FPGAs/XD1 |
10:00 |
Justin Tripp, Anders A. Hansson, Henning S. Mortve, and Maya B. Gokhale (LANL) |
Road Traffic Simulation on the Cray XD1 |
We have developed a very large scale road network simulator to simulate traffic patterns of major cities in the US. The simulator runs on the Cray XD1 and uses the FPGAs to accelerate the simulation. We will describe the traffic simulation application, hardware/software partitioning issues, and the hardware circuits, as well as overall performance compared to a software-only implementation. |
Thursday |
16B |
Compilers and Tools |
10:00 |
Douglas Miles, Steven Nakamoto, and Michael Wolfe, The Portland Group |
Fortran, C and C++ Compilers and Tools for AMD Opteron Processor-based Supercomputers |
The Portland Group and Cray Inc. have been cooperating since mid-2003 to support the PGI optimizing parallel Fortran/C/C++ compilers and tools on AMD Opteron processor-based Cray systems. Current features, benchmark results, tuning tips, engineering challenges and the PGI compilers and tools roadmap will be presented. |
Thursday |
16C |
X1 Performance and Experiences |
10:00 |
Tony Meys (NCS-MINN) |
Performance Results for the Weather Research and Forecast (WRF) Model on AHPCRC HPC Systems |
The Army High Performance Computing Research Center (AHPCRC) supports research and development for many applications, including weather models. In recent months, the Weather Research and Forecast (WRF) Model has been implemented and tested on the AHPCRC Cray X1 and on an Atipa Opteron Linux cluster. This paper will discuss recent benchmarking results for configurations of WRF that have used mid-sized and large domains. In addition to performance analysis, this paper will also relate our recent experiences with post-processing and visualizing large, high-resolution weather simulations. |
Thursday |
17A |
Mining and Modeling with Cray Supercomputers |
11:00 |
Andrew Jones, Mike Pettipher, and Firat Tekiner (MCC) |
Datamining Using Cray Supercomputers |
In this paper, we report on the introduction of a large scale datamining code to the supercomputing environment, and to Cray supercomputers in particular. The codes come from a community of researchers who are not traditional scientific computing programmers, do not normally use parallel computing, and develop their codes rapidly without regard to the requirements of high-end scientific computing. Thus, the challenge is to port these C++ codes, written for PCs, to specialist computers such as the X1 and XD1 and to extract the enhanced performance, whilst retaining a common code base with the datamining researchers. This investigation is thus an evaluation of the programming environment on the Cray supercomputers as applied to this problem, and of the significant data management issues involved. |
Thursday |
17B |
Compilers and Scientific Libraries |
11:00 |
Terry Greyzck, Cray Inc. |
X1 Compiler Challenges (And How We Solved Them) |
The X1 architecture presents many new technological challenges for producing optimal code, including a decoupled scalar and vector memory architecture, full support for predicated vector execution, hardware support for efficient four-processor shared memory parallelism, and increases in memory latency. This paper describes how these issues were addressed through the development of new compiler optimizations, noting what is done well and what needs improvement. |
Thursday |
17C |
X1 Solvers |
11:00 |
Thomas Oppe and Fred T. Tracy (CSC-VBURG) |
A Comparison of Several Direct Sparse Linear Equation Solvers for CGWAVE on the Cray X1 |
A number of sparse direct linear equation solvers are compared for the solution of sets of linear equations arising from the wave climate analysis program CGWAVE developed at the Coastal and Hydraulics Laboratory at the U.S. Army Engineer Research and Development Center (ERDC). CGWAVE is a general-purpose, state-of-the-art wave-prediction model. It is applicable to estimates of wave fields in harbors, open coastal regions, coastal inlets, around islands, and around fixed or floating structures. CGWAVE generates systems of simultaneous linear equations with complex coefficients that are difficult to solve by iterative methods. The vendor-supplied solver SSGETRF/SSGETRS is compared with the SuperLU package, in-core band solvers from LINPACK and LAPACK, and a direct, banded, out-of-core solver that utilizes reordering of the nodes to reduce the size of the bandwidth. The latter solver was optimized for the ERDC Major Shared Resource Center (MSRC) Cray X1. This paper will describe the details of how this optimization was done and give performance results for each of the solvers on the X1. |
Thursday |
17C |
X1 Solvers |
11:30 |
Richard Tran Mills, Mark Fahey, and Eduardo D'Azevedo (ORNL) |
Optimization of the PETSc Toolkit and Application Codes on the Cray X1 |
The Portable, Extensible Toolkit for Scientific Computation (PETSc) from Argonne National Laboratory is a suite of iterative solvers and data management routines for the solution of partial differential equations on parallel computers using MPI communication. PETSc has become a popular tool, but because it was designed with scalar architectures in mind, its out-of-the-box performance on vector machines such as the Cray X1 is unacceptably poor. This talk will detail some of the algorithmic changes we have made to some PETSc numerical kernels to improve their performance on the X1, and will describe our experiences running two applications, M3D (a nonlinear extended resistive magnetohydrodynamics code) and PFLOTRAN (a groundwater flow and transport code), that rely heavily on PETSc. |
Thursday |
17A |
Mining and Modeling with Cray Supercomputers |
11:45 |
Wendell Anderson, Guy Norton, Jorge C. Novarini, Planning Systems Inc., Robert Rosenberg and Marco Lanzagorta (NRL) |
Modeling Pulse Propagation and Scattering in a Dispersive Medium Using the Cray MTA-2 |
Accurate modeling of pulse propagation and scattering in a dispersive medium requires the inclusion of attenuation and its causal companion, dispersion. In this work a fourth order in time and space 2-D Finite Difference Time Domain (FDTD) scheme is used to solve the modified linear wave equation with a convolutional propagation operator to incorporate attenuation and dispersion. The MTA-2 with its multithreaded architecture and large shared memory provides an ideal platform for solving the equation, especially in the case when only a very small part of the medium is dispersive. |
Thursday |
17B |
Compilers and Scientific Libraries |
11:45 |
Mary Beth Hribar, Cray Inc., Chip Freitag and Tim Wilkens, AMD |
Scientific Libraries for Cray Systems: Current Features and Future Plans |
Cray continues to provide support and performance for the basic scientific computations that Cray's customers need. On the vector-based systems, LibSci is the only scientific library package provided by Cray. On the XD1 and the Cray XT3 systems, Cray is distributing AMD's Core Math Library (ACML) along with a much smaller Opteron LibSci package. This talk will describe the current and future scientific library support for each Cray system. In addition, performance results will be given for LibSci and ACML. |
Thursday |
17C |
X1 Solvers |
12:00 |
Lukasz Bolikowski, Witold Rudnicki, Rafal Maszkowski, Maciej Dobrzynski, Maciej Cytowski, and Maria Fronczak (WARSAWU) |
Implementation of the Vectorized Position-Specific Iterated Smith-Waterman Algorithm with a Heuristic Filtering Algorithm on the Cray Architecture |
Smith-Waterman is an algorithm for finding optimal local alignments of two amino acid sequences, but it is too slow for routine large-database scanning. We have implemented the SW algorithm on the Cray X1 architecture, exploiting vectorization and parallelization. In addition, we have proposed a heuristic database filtering algorithm which is particularly well suited to the Cray architecture, exploiting the possibilities offered by the BMM unit. |
Thursday |
18A |
Operating System Optimization |
2:00 |
Suzanne Kelly and Ron Brightwell (SNLA) |
Software Architecture of the Lightweight Kernel, Catamount |
Catamount is designed to be a low overhead operating system for a parallel computing environment. Functionality is limited to the minimum set needed to run a scientific computation. The design choices and implementations will be presented. |
Thursday |
18B |
XT3 Programming |
2:00 |
Monika ten Bruggencate, Cray Inc. |
SHMEM on XT3 |
The SHMEM communication library originated as a thin low-level layer over hardware on T3D to deliver excellent communication performance. Since then SHMEM has been popular with application developers who can use it to optimize portions of their MPI based applications. SHMEM has been supported on all Cray architectures, including the XT3. This talk will provide an overview of SHMEM for XT3. It will include brief discussions of the functionality supported, implementation details, strengths and weaknesses of the library on the specific architecture, preliminary performance data and future work. |
Thursday |
18C |
Servers and Services |
2:00 |
Jason Sommerfield, Paul Nowoczinski, J. Ray Scott, and Nathan Stone (PITTSC) |
Integrating External Storage Servers with the XT3 |
This paper discusses an extension of the Lustre parallel file system implementation provided on the Cray XT3. A group of external or outboard servers will be used as Lustre storage servers (Lustre OSTs) over an InfiniBand network. This external configuration will be compared with the standard XT3 SIO-based configuration on the basis of cost, scalability, functionality, and failure modes. |
Thursday |
18B |
XT3 Programming |
2:30 |
Geir Johansen, Cray Inc. |
C and C++ Programming for the Cray XT3 and Cray Red Storm Systems |
This paper discusses C and C++ programming for Cray XT3 and Cray Red Storm systems. Topics include unique features of the PGI C/C++ compiler, differences between Catamount system libraries and Linux system libraries, optimization techniques for the Cray XT3 and Red Storm systems, and use of other C/C++ compilers. The paper and talk will also discuss, from a Cray Inc. perspective, the experiences of porting C and C++ code to the Sandia National Laboratories' Red Storm system. |
Thursday |
18C |
Servers and Services |
2:30 |
Lee Ward (SNLA) |
Administration and Programming for the Red Storm IO Subsystem |
A practical description of the initialization, run-time configuration, and application programming interface of the Red Storm IO system as found on the Cray XT3 compute partition is presented. A discussion of compatibility with POSIX and ASCI Red is given, together with an in-depth description of the initialization, configuration, and Red Storm-specific API calls, intended as a usable management and programming reference. |