CUG T3E Workshop Final Technical Program (with Slides)

Cray User Group T3E Workshop
October 7-8, 1999
Princeton, New Jersey

Final Program

with links to Slide Presentations
(See those marked with a blue arrow below.)

Download a copy of the workshop attendee list here.

If you do not already have it, you can download the Acrobat plugin for your browser.


October 6

8:00-10:00 PM

Opening Reception at Nassau Inn in the Princeton Room
(sponsored by Cray Research/SGI)


October 7


Continental Breakfast (provided)
Thursday Morning Session Chair: Helene Kulsrud, CCR-P


Welcome, Sally Haerer, CUG President, NCAR and
Helene Kulsrud, Workshop Chair, CCR-P


The New Corporate Viewpoint, Steve Oberlin, Cray Research/SGI


Software Roadmap Changes due to the Divestiture of Cray,
Mike Booth, Cray Research/SGI




T3E Performance at NASA Goddard, Thomas Clune and Spencer Swift, NASA/SGI


Scalability 101: What did our experience with the Cray T3E teach us?, Jeff Brooks, Cray Research/SGI


Improving Performance on the T3E or How far are you willing to go?, Mike Merrill, DOD


Lunch (provided)
Thursday Afternoon Session Chair: Sergiu Sanielevici, PSC


Performance Evaluation of the Cray T3E for AEGIS Signal Processing, James Lebak, MIT Lincoln Laboratory


Approaches to the parallelization of a Spectral Atmosphere Model, V. Balaji, GFDL/SGI


Getting a Thumb-Lock on AIDS: T3E Simulations Show Joint-Like Motion in Key HIV Protein, Marcela Madrid, PSC




Cray T3E Update and Future Products, William White, Cray Research/SGI


Discussion and Q & A


SNIA as a Successor to the T3E, Steve Reinhardt, SGI


Dinner at Prospect House, Princeton University (provided by CUG)


October 8


Continental Breakfast (provided)
Friday Morning Session Chair: Guy Robinson, ARSC


Tutorial on Co-arrays, Robert Numrich, Cray Research/SGI


Introduction to UPC, William Carlson and Jesse Draper, CCS




HPF and the T3E: Applications in Geophysical Processing, Douglas Miles, The Portland Group, Inc.


First Principles Simulation of Complex Magnetic Systems: Beyond One Teraflop, Wang, PSC; Ujfalussy, Wang, Nicholson, Shelton, Stocks, Oak Ridge National Laboratory; Canning, NERSC; Gyorffy, Univ. of Bristol


Massively Parallel Simulations of Plasma Turbulence, Zhihong Lin, Princeton University


Early Stages of Folding of Small Proteins Observed in Microsecond-Scale Molecular Dynamics Simulations, Peter Kollman & Yong Duan, University of California, San Francisco (presentation canceled)


Lunch (provided)
Friday Afternoon Session Chair: James Craw, NERSC


Update on NERSC PScheD Experiences (link to the Slides or to the Paper), Tina Butler and Michael Welcome, NERSC


Programming Tools on the T3E: ARSC Real Experiences and Benefits, Guy Robinson, ARSC


Bad I/O and Approaches to Fixing It, Jonathan Carter, NERSC




Achieving Maximum Disk Performance on Large T3E's: Hardware, Software and Users, John Urbanic and Chad Vizino, PSC


Running the UK Met. Office's Unified Model on the Cray T3E, Paul Burton, UKMET


I/O and File System Balance (link to the Slides or to the Paper), Tina Butler, NERSC





9:15 Software Roadmap Changes due to the Divestiture of Cray (no abstract)
Mike Booth, Cray Research/SGI

10:30 T3E Performance at NASA Goddard
Thomas Clune and Spencer Swift, NASA/SGI

Round II of the NASA Earth and Space Sciences (ESS) HPCC Program was a milestone-driven, three-way collaboration between government, industry, and academia. In this three-year project, NASA acted as a funding agency to drive technology development in high-performance hardware and scientific software. SGI (formerly Cray Research) was selected as the hardware vendor, providing a suitable high-performance computer, the Cray T3E, and expertise in developing software on that platform. Nine internationally recognized Principal Investigators (providing roughly 15 separate codes) from a variety of disciplines were selected to develop and publish research-caliber scientific software capable of running at a sustained 100 gigaflops (GF). This talk will present a description of the NASA project and a survey of the applications and optimizations.

11:00 Scalability 101: What did our experience with the Cray T3E teach us?
Jeff Brooks,
Cray Research/SGI

11:30 Improving Performance on the T3E or How far are you willing to go?
Mike Merrill, DOD

We have had esoteric programming requirements for our T3E over the past several years. General problems faced and some lessons learned will be discussed.

1:30 Performance Evaluation of the Cray T3E for AEGIS Signal Processing
James Lebak, MIT Lincoln Laboratory

MIT Lincoln Laboratory is investigating potential upgrades to the shipborne AEGIS signal processor in support of the Navy's theater-wide ballistic missile defense program (TBMD). As part of this investigation, the Laboratory is examining commercial off-the-shelf (COTS) test bed processors to permit experimentation with different radar algorithms for AEGIS. The Cray T3E can be scaled to easily meet or exceed future throughput requirements that are beyond the capabilities of more conventional embedded COTS products. However, the real-time application has very stringent, deterministic latency requirements. In addition, the shipborne application has very stressing shock, vibration, and packaging requirements that the T3E cannot meet in its current commercial configuration.

We demonstrate the performance of the T3E running the AEGIS signal processing application and compare it to the performance predicted by benchmark-based models. We also characterize the performance variation in the machine to evaluate its acceptability for use in a real-time system. Our results show that the required throughput is achievable on the T3E; however, significant performance variation occurs in the current system. We also describe a re-packaging scheme that allows the T3E to meet the AEGIS shock and vibration requirements and achieves a processing density closer to that typically achieved by embedded multicomputers.

Note: This work was sponsored by the Department of the Navy under Air Force contract F19628-95-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Air Force or Navy.

2:00 Approaches to the parallelization of a Spectral Atmosphere Model
V. Balaji, GFDL/SGI

The spectral transform has proved a robust method for the treatment of the non-linear adiabatic Navier-Stokes equations of fluid flow on a sphere, and the intrinsic shortcomings of spectral methods (dispersion and truncation, and the formation of Gibbs ripples in the presence of steep gradients) are well understood. It is therefore one of the preferred forms for global atmospheric models. In the spectral transform method each field has a representation in spectral coefficients of spherical harmonics, and a corresponding grid field. The linear dynamics is generally treated in spectral space, while the physics (i.e. gridscale parameterization of sub-gridscale processes) and non-linear terms are treated in grid space.

Certain key features of the spectral transform method include: 1) There is a natural expression of the semi-implicit treatment of gravity waves in spectral space. 2) For other quantities (such as tracers where there is a need to maintain positive-definiteness), spectral advection has distinct disadvantages. It would be wise in the design of a code to retain the possibility of keeping certain quantities grid-primary, and advecting them on the grid. Parallelization must then be consistent with the requirements of advection schemes, e.g. semi-Lagrangian schemes. 3) At the resolutions currently conceivable, the physics of the atmosphere can be treated as being entirely within grid columns.

The current paper considers software design issues in the implementation of a parallel spectral model. Modularity is a key ingredient in design. Within the GFDL Flexible Modeling System, physics modules deal with data columns and are treated as being entirely interchangeable between spectral and grid models. A flexible 2D domain decomposition and message-passing module has been implemented that is equally at ease with the requirements of the spectral transform and of grid-domain data dependencies. The same module is used in other GFDL models as well. This permits various strategies to minimize inter-processor communication in the two stages of the spectral transform. We demonstrate different methods of implementing this within the dynamic decomposition approach, and discuss performance of the parallel spectral dynamical core on a Held-Suarez climate benchmark.
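The split the abstract describes, with derivatives evaluated in spectral space and the time step taken in grid space, can be sketched (purely as a 1D toy illustration, not the model's code) with a pseudo-spectral advection step in NumPy:

```python
import numpy as np

# Toy pseudo-spectral step for 1D linear advection du/dt + c*du/dx = 0
# on a periodic domain: the derivative is evaluated in spectral space,
# the time step is taken in grid space (explicit Euler, illustrative only).
n = 64
c = 1.0
dt = 0.001
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
ik = 1j * np.fft.fftfreq(n, d=1.0 / n)    # spectral derivative operator i*k

u = np.sin(x)                             # grid-space field
for _ in range(1000):                     # advance to t = 1
    dudx = np.fft.ifft(ik * np.fft.fft(u)).real   # grid -> spectral -> grid
    u = u - dt * c * dudx

# u now approximates the exact solution sin(x - c*t) at t = 1
```

A real spherical model uses spherical-harmonic transforms rather than FFTs, but the data dependencies (global in the transform, local in grid space) are the same, which is what drives the 2D decomposition strategy described above.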

2:30 Getting a Thumb-Lock on AIDS: T3E Simulations Show Joint-Like Motion in Key HIV Protein
Marcela Madrid, PSC

HIV-1 reverse transcriptase (RT) is the enzyme that transcribes HIV's single-strand RNA to form double-stranded DNA, making it possible for HIV to replicate itself inside the immune system. Many AIDS drugs, including AZT, work by interacting with RT to block reproduction of the virus. Crystallographic studies have shown that the 3D structure of the RT region that interacts with RNA resembles a hand -- with palm, fingers and thumb subdomains. These studies also suggest that the thumb opens when RT transcribes RNA, making space for DNA to fit into the palm. Blocking this joint-like movement, it's believed, may be the key to designing more effective RT-inhibitor drugs.

Recent molecular dynamics simulations performed at the PSC verify that the thumb closes when DNA is absent, and they identify what parts of the protein are involved in this large-scale motion. The success of these simulations shows that computational simulations can be a valuable aid in evaluating the potential for success of new RT-inhibitor drugs.

3:30 Cray T3E Update and Future Products
William White,
Cray Research/SGI

A discussion of the current status of the T3E, including news and project update and a detailed presentation of future products.

4:30 Discussion and Q & A

Attendees get to ask questions and elicit answers from the SGI representatives and each other.

5:00 SNIA as a Successor to the T3E, Steve Reinhardt, SGI

The T3E system has been very successful because of its low-latency, high-bandwidth interconnect in large configurations, fast processor, and highly scalable, robust OS (key features being high resilience and flexible job-level scheduling). All of this was achieved with much greater use of commodity parts than in previous systems.

The SNIA (scalable node IA-64) products from SGI will have similar characteristics, though with significant implementation differences. The interconnect will be the next-generation routers from SGI, with appropriate improvements in bandwidth and latency. The IA-64 processors will be Merced, McKinley, and Madison, with very high off-chip bandwidths and processing rates. The OS will be 'Linux/mk', a global OS which builds on the standard Linux kernel as a 'microkernel' on a modest number of processors, to provide an extremely scalable system supporting 2K processors in a standard configuration. Significant improvements in price-performance will be gained through widespread use of commodity parts.


8:00 Co-arrays in Fortran Tutorial
Robert Numrich, Cray Research/SGI

9:00 Introduction to UPC
William Carlson and Jesse Draper, CCS

UPC is a parallel extension of the C programming language intended for multiprocessors with a common global address space. A descendant of Split-C [CDG 93], AC [CaDr 95], and PCP [BrWa 95], UPC has two primary objectives: 1) to provide efficient access to the underlying machine, and 2) to establish a common syntax and semantics for explicitly parallel programming in C. The quest for high performance means in particular that UPC tries to minimize the overhead involved in communication among cooperating threads. When the underlying hardware enables a processor to read and write remote memory without intervention by the remote processor (as in the SGI/Cray T3D and T3E), UPC provides the programmer with a direct and easy mapping from the language to low-level machine instructions. At the same time, UPC's parallel features can be mapped onto existing message-passing software or onto physically shared memory to make its programs portable from one parallel architecture to another. As a consequence, vendors who wish to implement an explicitly parallel C could use the syntax and semantics of UPC as a basis for a standard.

10:00 HPF and the T3E: Applications in Geophysical Processing
Douglas Miles, The Portland Group, Inc.

The increasing quality and resolution of seismic data recordable today in oil exploration and prospecting is providing the industry with an unprecedented wealth of information in the form of multiple seismic attributes of the subsurface. When these attributes are correlated via neural networks with existing information at the wellbore (e.g. well-logs), a three-dimensional picture of the distribution of rock and fluid properties in the subsurface can be computed. Mobil Technology Company, using HPF applications developed on the CRAY T3E, is using these predictions of hydrocarbon distribution to optimally exploit reservoirs, targeting and drilling high-angle/horizontal wells with pinpoint accuracy and ranking a field's reserves potential ahead of the drill-bit. This has reduced the number of dry-holes, optimized primary and secondary recoveries and uncovered oil reserves that would otherwise have gone untapped. This talk in part highlights these breakthroughs and their significant business impact at Mobil.

10:30 First Principles Simulation of Complex Magnetic Systems: Beyond One Teraflop
Yang Wang, PSC, B. Ujfalussy, Xindong Wang, Xiaoguang Zhang, D. M. C. Nicholson, W.A. Shelton and G. M. Stocks, Oak Ridge National Laboratory, A. Canning, NERSC; B.L. Gyorffy, University of Bristol

The understanding of metallic magnetism is of fundamental importance for a wide range of technological applications ranging from thin film disc drive read heads to bulk magnets used in motors and power generation. In this presentation, we demonstrate the use of the power of massively parallel processing (MPP) computers for performing first principles calculations of large system models of non-equilibrium magnetic states in metallic magnets. The calculations are based on the massively parallel locally self-consistent multiple scattering (LSMS) method extended to treat general non-collinear arrangements of the magnetic moments. A general algorithm has been developed for self-consistently finding the constraining fields which are introduced into local spin density approximation (LSDA) in order to maintain a prescribed magnetic moment orientation configuration. The LSMS method we have developed exploits the locality in the physics of the problem to produce an algorithm that has only local and limited communications on parallel computers leading to very good scale-up to large processor counts and linear scaling of the number of operations with the number of atoms in the system. The computationally intensive step of inversion of a dense complex matrix is largely reduced to matrix-matrix multiplies which are implemented in BLAS. Throughout the code attention is paid to minimizing both the total operation count and total execution time, with primacy given to the latter. Full 64-bit arithmetic is used throughout. Due to linear scaling, this code has sustained 1.02 Teraflops on a 1480-processor T3E-1200. This breakthrough, which won the 1998 Gordon Bell Prize, opens up the possibility of realistic simulations of materials with complex structures (for which simulations are inadequate unless enough atoms are included in the unit cell).

11:00 Massively Parallel Simulations of Plasma Turbulence
Zhihong Lin, Princeton University

Turbulence is of fundamental interest in plasma physics research both as a complex nonlinear phenomenon and a paradigm for transport processes. Significant progress in understanding the turbulence behavior has been made through large scale particle-in-cell simulations which are now feasible because of recent advances in theory, algorithms, and massively parallel computing. These simulations self-consistently evolve the dynamics of more than 100 million particles for thousands of time steps to resolve disparate spatial and temporal scales. The parallel simulation code is portable using standard Fortran and MPI, and achieves nearly perfect scalability on various massively parallel computers (e.g., CRAY-T3E and Origin-2000). Domain decomposition is implemented to minimize inter-processor communication. A non-spectral Poisson solver has been developed using an iterative method. Single-processor optimization mainly concerns the data structures for local gather-scatter operations. Comparisons of performance on the T3E and other machines will be discussed.

11:30 Early Stages of Folding of Small Proteins Observed in Microsecond-Scale Molecular Dynamics Simulations
Peter Kollman and Yong Duan, University of California, San Francisco

A new approach in implementing classical molecular dynamics simulation for parallel computers has enabled a simulation to be carried out on a protein with explicit representation of water for 1 microsecond, about two orders of magnitude longer than the longest simulation of a protein in water reported to date. We have used this approach to study the folding of the villin headpiece subdomain, a 36-residue small protein consisting of three helices, and BBA1, a 28-residue alpha/beta protein, from unfolded structures. The villin headpiece reached a marginally stable state, which has a lifetime of about 150 nanoseconds, a favorable solvation free energy, and shows significant resemblance to the native structure; two pathways to this state have been observed. The process can be seen as a 60-nsec "burst" phase followed by a slow "conformational adjustment" phase. We found that the burial of the hydrophobic surface dominated the early phase of the folding process and appeared to be the primary driving force of the reduction of the radius of gyration in that phase. The BBA1 reached states that closely resemble the native secondary structures.

1:30 Update on NERSC PScheD Experiences (link to the Slides or to the Paper)
Tina Butler and Michael Welcome, NERSC

2:00 Programming Tools on the T3E: ARSC Real Experiences and Benefits
Guy Robinson, ARSC

The Arctic Region Supercomputing Center (ARSC) strongly encourages its users to apply tools to investigate the performance of codes running on the site's T3E and J systems. This talk will describe experiences with tools and discuss mechanisms to ensure users are motivated to spend time and effort in applying tools to investigate code performance. This is done in part by personal contact, even for off-site users, and by means of a T3E newsletter which describes tools and gives examples of the rewards of tool application by other users as encouragement. Examples will be drawn from a wide range of applications, from established MPP codes to the challenges of teaching parallel programming on both a cluster and a T3E system.

2:30 Bad I/O and Approaches to Fixing It
Jonathan Carter, NERSC

During the past months we have observed that some types of application I/O lead to poor performance both for the application and for our T3E as a whole. Often, restructuring the application, or the underlying I/O, can lead to dramatic improvements in performance. I will discuss a handful of case studies that illustrate some of our findings.
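One common pattern behind such fixes (a generic illustration, not a case study from the talk) is aggregating many tiny writes into large buffered ones:

```python
import os
import tempfile

# Generic illustration of restructuring I/O: the same 100,000 8-byte
# records written one unbuffered syscall at a time vs. aggregated into
# large buffered writes. The second pattern is far kinder to a shared
# filesystem, though the exact payoff depends on the system.
def write_unbuffered(path, records):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for r in records:          # one write() syscall per tiny record
            os.write(fd, r)
    finally:
        os.close(fd)

def write_buffered(path, records, bufsize=1 << 20):
    with open(path, "wb", buffering=bufsize) as f:
        for r in records:          # aggregated into ~1 MB buffers
            f.write(r)

records = [b"x" * 8] * 100_000
with tempfile.TemporaryDirectory() as d:
    p_small = os.path.join(d, "unbuffered.dat")
    p_big = os.path.join(d, "buffered.dat")
    write_unbuffered(p_small, records)
    write_buffered(p_big, records)
    sizes = (os.path.getsize(p_small), os.path.getsize(p_big))
```

Both files are byte-identical; the difference is purely in how many operations reach the I/O subsystem, which is the kind of restructuring the abstract refers to.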

3:20 Achieving Maximum Disk Performance on Large T3E's: Hardware, Software and Users
John Urbanic and Chad Vizino, PSC

Default I/O configurations on large T3E's are often far from optimal for both system and application performance. We will discuss all three phases of reconfiguring our 1 TB T3E disk system to deliver over 1 GB/s of sustainable bandwidth to actual applications. The first phase, hardware, involved the rearrangement of disk and GigaRing layout. The second phase involved software changes at the system level to optimize for the hardware by modifying such things as PE types, striping, and primary and secondary disk areas. Finally, application software changes to best exploit this new configuration are illustrated. Benchmark timings for system tasks and application run-times, as well as more standardized measurements, will be considered at each step. "Before and after" comparisons from our scientific user community will be shown.

3:50 Running the UK Met. Office's Unified Model on the Cray T3E
Paul Burton, UKMET

The UK Met. Office has an 880 processor Cray T3E-900, which is soon to be joined by a second 640 processor T3E-1200. As the largest production MPP site within Europe, we have considerable experience of running a complex mixture of jobs with different resource requirements, while achieving an efficient utilisation of the machine.

The main modelling code run on the machine is the Met. Office's Unified Model (UM). The UM is a grid point model supporting both prediction and data assimilation for the atmosphere and ocean. It is used for operational global and limited area forecasting as well as climate modelling. Both single processor and parallel optimisation have significantly improved the performance of the UM on the T3E, and some of the techniques used will be presented. We will also discuss the issues involved in achieving an efficient utilisation of the machine when running a large number of jobs, often having very different resource requirements.

4:20 I/O and File System Balance (link to the Slides or to the Paper)
Tina Butler, NERSC

Questions and information: contact

Helene Kulsrud, Workshop Chair
Thanet Drive
Princeton, N.J. 08540 USA
(1-609) 279-6243
Cray User Group Office, Office Manager
Cray User Group Office
2911 Knoll Road
Shepherdstown, WV 25443 USA
(1-304) 263-1756
Fax: (1-304) 263-4841

Revised: December 7, 1999