CRAY T90 versus Tera MTA: The Old Champ Faces a New Challenger

Jay Boisseau, Larry Carter, Allan Snavely, Amitava Majumdar, David Callahan, John Feo, Simon Kahan, and Zhijun Wu
We compare the performance of the Tera MTA to that of the CRAY T90 on the NPB 2.3-serial benchmarks and a sample of T90 workload applications. We start by characterizing issues in performance programming for both architectures. Next, we provide an update on previously published NPB results that were obtained using a processor with a slower clock frequency. We compare single-processor performance of the MTA and T90 on a flux-corrected transport code (LCPFCT), a finite element program (LS-DYNA3D), and a molecular dynamics application (AMBER). We also observe how the MTA performs on two processors on these applications, subject to the constraint that the current network board is not up to specification. Finally, we discuss performance issues, characterize the porting effort involved in moving vector code to the MTA, and discuss how parallelism is exploited in the MTA.

High-performance I/O on Cray T3E

Ulrich Detert
A great number of hardware and software features allow for potentially high I/O performance on Cray T3E systems. In the following, the influence of RAID technology, file striping, global I/O FFIO layers, and pcache on I/O performance is discussed. The I/O performance achieved with striped SCSI disks is compared to RAID disk performance, and tuning opportunities at the system and user levels are evaluated.

Performance Co-Pilot and Large Systems Performance

Ken McDonell
A large computer system, like a high-end Silicon Graphics Origin(TM) 2000 running IRIX(TM), presents interesting challenges for system-level performance monitoring and performance management. We shall investigate some of the issues and explore how the Performance Co-Pilot(TM) (PCP) helps address these. Effective performance management in this environment relies on being able to easily extend the base capabilities to reflect the local requirements of the processing environment, the application mix and the criteria by which performance is judged. A case study outlines how the PCP components were used to develop a real-time 3-D visualization of the performance of MPI-based applications.

Thread-Parallel Job Performance in a Time Sharing Batch Environment on Origin 2000 Systems

David McWilliams
At the National Center for Supercomputing Applications (NCSA), we found that thread-parallel, gang-scheduled Origin 2000 jobs consumed more than 5 times as much CPU time when the load average on a system was high (96 on a 64-processor system). We worked with Silicon Graphics (SGI) as they made changes in the IRIX process scheduler to fix the problem. We will discuss improvements to the scheduler and the general problem of scheduling parallel jobs when processors are over-allocated. NCSA's Load Limiter ensures that the load average does not exceed the number of processors. Parallel jobs now have much more consistent performance.

Experiences with Industrial Applications on Massively Parallel Computers

Jörg Stadler
HWW (Höchstleistungsrechenzentrum für Wissenschaft und Wirtschaft GmbH) is a large German supercomputing facility that offers its compute resources to scientific as well as industrial users. This article reports on the applications that industrial users run on the HWW equipment and the computer architectures they use. The current situation is characterized by an increasing demand for compute resources from commercial customers, which eventually led to the installation of an additional machine in March 1998. However, all of this demand is still met exclusively by vector machines, whereas massively parallel machines (MPPs) are still being evaluated by the industrial user community.

