Performance

A CMOS Vector Processor with a Custom Streaming Cache
Greg Faanes
The Cray SV1 processor is a CMOS chipset that represents an affordable approach to a custom CPU architecture for vector applications, one that is interesting in both its micro-architecture and its design methodology. The design includes several new features not found in existing vector processors. The talk will describe the micro-architecture of the scalar processor, the vector processor, and the custom streaming cache.

Performance Evaluation of MPI and MPICH on the CRAY T3E
Hans-Hermann Frese
MPI is a portable message passing interface for parallel applications on distributed memory machines. In this paper we present performance results for the native SGI/Cray MPI implementation and the portable ANL/MSU MPICH implementation on the CRAY T3E.
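
As a concrete point of reference for how such measurements are typically made, the following is a minimal ping-pong sketch in C using only standard MPI calls; the message size and repetition count are arbitrary illustrative choices, not values taken from the paper.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong sketch: rank 0 sends to rank 1 and waits for the echo.
       Half the average round-trip time approximates the one-way latency;
       bytes moved per second approximate the sustained bandwidth.
       NBYTES and REPS are illustrative values only. */
    #define NBYTES (1 << 20)
    #define REPS   100

    int main(int argc, char **argv) {
        int rank;
        MPI_Status status;
        char *buf = malloc(NBYTES);
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way time %g s, bandwidth %g MB/s\n",
                   (t1 - t0) / (2.0 * REPS),
                   2.0 * REPS * NBYTES / (t1 - t0) / 1.0e6);
        MPI_Finalize();
        free(buf);
        return 0;
    }

Running the same binary against both libraries yields directly comparable latency and bandwidth figures, which is the kind of comparison the paper reports.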

Performance Tuning for Large Systems
Edward Hayes-Hall
A set of tuning strategies and methods useful for large Origin systems will be presented, covering existing mechanisms and the means to monitor the benefits of the changes.

How Moderate-Sized RISC-Based SMPs Can Outperform Much Larger Distributed Memory MPPs
Daniel Pressel, Walter B. Sturek, J. Sahu, and K.R. Heavey
Historically, comparisons of computer systems were based primarily on theoretical peak performance. Today, comparisons are frequently based on delivered levels of performance instead. This of course raises a whole host of questions about how such comparisons should be made. Even this approach, however, has a fundamental problem: it assumes that all FLOPS are of equal value. As long as one uses only vector machines or only large distributed memory MIMD MPPs, this is probably a reasonable assumption. However, when comparing the algorithms of choice on these two classes of platforms, one frequently finds a significant difference in the number of FLOPS required to obtain a solution with the desired level of precision. While troubling, this dichotomy has been largely unavoidable. Recent advances involving moderate-sized RISC-based SMPs have allowed us to solve this problem. The net result is that for some problems a 128-processor Origin 2000 can outperform much larger MPPs.
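
The arithmetic behind this claim can be made explicit with a hypothetical illustration (the numbers below are invented, not the paper's). Time to solution is

\[ T = \frac{F_{\text{required}}}{R_{\text{delivered}}}, \]

so a machine sustaining \( R = 40 \) Gflop/s on an algorithm needing \( F = 10^{14} \) flops takes 2500 s, while a machine sustaining only 10 Gflop/s on a better-suited algorithm needing \( 10^{13} \) flops takes 1000 s. The machine delivering one quarter of the FLOPS wins, which is why treating all FLOPS as equal can misrank the systems.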

Performance Metrics for Parallel Systems
Daniel Pressel
One frequently needs to compare the performance of two or more parallel computers, but how should this be done? The most straightforward way would be to rely upon a suite of benchmarks. Unfortunately, there are many problems and limitations with this approach, so one is frequently forced to rely upon a combination of approaches. Many of the alternatives, however, are all too frequently based on excessively simplistic approximations and extrapolations. This paper discusses some of these issues so that the problems they can cause may be avoided in the future.
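
One such simplistic extrapolation, given here purely as an illustration rather than as an example from the paper, is fitting timings at small processor counts to Amdahl's law,

\[ S(p) = \frac{T(1)}{T(p)} = \frac{1}{(1 - f) + f/p}, \]

where \( f \) is the parallelizable fraction of the work, and then projecting \( S(p) \) to large \( p \). The fit silently assumes that per-processor overheads such as communication do not grow with \( p \), so the projection flatters any machine whose network would saturate first.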

Performance Characteristics of Messaging Systems on the T3E and the Origin 2000
Mark Reed, Eric Sills, Sandy Henriquez, Steve Thorpe, and Lee Bartolotti
Interprocessor communication time has a large impact upon how parallel applications scale with processor count. To assess the effectiveness of the T3E and the Origin 2000 in using messages to move data to remote memory, the performance characteristics of PVM, MPI and SHMEM are investigated, and performance curves for transfer time as a function of message size are presented. In addition, latencies, bandwidths, and the message size required to achieve half of peak bandwidth are reported. Performance is also compared with a cluster of O2 workstations connected via Ethernet. Differences for many types of point-to-point calls are presented and discussed. A variety of collective communication calls representing a diverse set of traffic patterns is investigated and the results reported.
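
The reported quantities correspond to the usual linear timing model (standard definitions, not notation from the paper): the time to transfer an \( n \)-byte message is modeled as

\[ T(n) \approx t_0 + \frac{n}{B_\infty}, \qquad B_{\text{eff}}(n) = \frac{n}{T(n)}, \]

where \( t_0 \) is the latency and \( B_\infty \) the asymptotic peak bandwidth; the effective bandwidth reaches half of peak at the message size \( n_{1/2} = t_0 B_\infty \).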

Optimizing AMBER for the CRAY T3E
Robert Sinkovits and Jerry Greenberg
AMBER is a widely used suite of programs for studying the dynamics of biomolecules and is the single most heavily used application on the SDSC CRAY T3E. In this paper, we describe the cache, intrinsic function, and other optimizations applied to tune the molecular dynamics module of AMBER, resulting in single-processor speedups of more than 70% for the pairlist and 25% for the particle mesh Ewald calculations.
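
AMBER itself is written in Fortran and its source is not reproduced here; the following self-contained C sketch only illustrates the general kind of cache blocking a pairlist build benefits from, with the atom count, tile size, cutoff, and coordinates all invented for the example.

    #include <stdio.h>

    #define N     1024        /* number of atoms (illustrative) */
    #define TILE  64          /* j-tile sized to keep coordinates cache-resident */
    #define CUT2  (9.0 * 9.0) /* squared cutoff distance (illustrative) */

    static double x[N], y[N], z[N];

    int main(void) {
        long npairs = 0;
        /* deterministic fake coordinates on a loose 32x32 grid */
        for (int i = 0; i < N; i++) {
            x[i] = (double)(i % 32) * 1.5;
            y[i] = (double)((i / 32) % 32) * 1.5;
            z[i] = 0.0;
        }
        /* Tile the j loop so each block of j coordinates is reused from
           cache across many i iterations instead of being streamed from
           main memory on every pass. Each i<j pair is visited once. */
        for (int jb = 0; jb < N; jb += TILE) {
            int jend = jb + TILE < N ? jb + TILE : N;
            for (int i = 0; i < jend; i++) {
                int jstart = (i + 1 > jb) ? i + 1 : jb;
                for (int j = jstart; j < jend; j++) {
                    double dx = x[i] - x[j];
                    double dy = y[i] - y[j];
                    double dz = z[i] - z[j];
                    if (dx * dx + dy * dy + dz * dz < CUT2)
                        npairs++;
                }
            }
        }
        printf("pairs within cutoff: %ld\n", npairs);
        return 0;
    }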

High Performance First Principles Method of Magnetic Properties: Breaking the Tflop Barrier
Balazs Ujfalussy, Xindong Wang, D. M. C. Nicholson, W. A. Shelton, G. M. Stocks, A. Canning, and Yang Wang
The understanding of metallic magnetism is of fundamental importance for a wide range of technological applications, ranging from thin film disc drive read heads to bulk magnets used in motors and power generation. We use the power of massively parallel processing (MPP) computers to perform first principles calculations of large system models of non-equilibrium magnetic states in metallic magnets. The constrained LSMS method we have developed exploits the locality in the physics of the problem to produce an algorithm with only local and limited communications on parallel computers, leading to very good scale-up to large processor counts and linear scaling of the number of operations with the number of atoms in the system. The computationally intensive step of inverting a dense complex matrix is largely reduced to matrix-matrix multiplies, which are implemented in BLAS. Throughout the code attention is paid to minimizing both the total operation count and the total execution time, with primacy given to the latter. Full 64-bit arithmetic is used throughout. The code shows near-linear scale-up to 1458 processing elements (PEs) and attained a performance of 1.02 Tflops on a Cray T3E1200 LC1500 computer at Cray Research. Additional performance figures of 657 and 276 Gflops have also been obtained on a Cray T3E1200 LC1024 at a US Government site and on a T3E900 machine at the National Energy Research Scientific Computing Center (NERSC), respectively. All performance figures include necessary I/O.
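
The reduction of a dense inverse to matrix-matrix multiplies follows from the standard block (Schur complement) identity, quoted here as textbook linear algebra rather than as the paper's specific formulation:

\[
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} + A^{-1} B\, S^{-1} C A^{-1} & -A^{-1} B\, S^{-1} \\
-S^{-1} C A^{-1} & S^{-1}
\end{pmatrix},
\qquad S = D - C A^{-1} B.
\]

Apart from the two smaller inversions, every entry is a product of blocks, so almost all of the work can be expressed as calls to the BLAS matrix-matrix multiply (e.g. ZGEMM for complex double precision), which is what lets such codes approach the machine's peak floating-point rate.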

