Simple Loop Performance on the Cray X1

Lee Higbie
Tom Baring
Ed Kornkven
Arctic Region Supercomputing Center
University of Alaska Fairbanks
P.O. Box 756020
Fairbanks, AK 99775-6020
USA
+1-970-450-8688
Fax: +1-907-450-8601

Abstract

We analyzed the performance of the Cray X1 on two sets of nested loops iterating over triply subscripted arrays and found speeds ranging from 1.8 to 818 MFLOPS in 64-bit arithmetic (a 457:1 ratio). Interactions among the multi-level cache, vector operations, memory bank conflicts, and multistreaming appear to cause most of the large speed variations. We used only two simple computations but varied many parameters, such as the order of the loops, the subscripting style, compilation options, precision, and language. With the inclusion of 32-bit and 128-bit floating-point arithmetic, the speed range increased to about 40,000 to 1, from 0.04 to almost 1500 MFLOPS. We found that compiler feedback on loop vectorization and multistreaming was a poor predictor of performance. We conclude the paper with some observations on estimating and improving Cray X1 application performance.
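
The abstract does not reproduce the loops that were measured. As a rough illustration only, the sketch below shows what varying the loop order over a triply subscripted array can look like in C. The array sizes (NI, NJ, NK), the function names, and the summation itself are hypothetical and are not taken from the paper. Both functions compute the same result; they differ only in which subscript the innermost loop varies, and therefore in memory stride. C stores arrays in row-major order, so the last subscript is the unit-stride one; in Fortran, which is column-major, the roles are reversed.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only to keep the example small. */
#define NI 8
#define NJ 8
#define NK 8

/* Fill the array with a simple, checkable pattern: a[i][j][k] = i + j + k. */
void fill(double a[NI][NJ][NK])
{
    for (size_t i = 0; i < NI; i++)
        for (size_t j = 0; j < NJ; j++)
            for (size_t k = 0; k < NK; k++)
                a[i][j][k] = (double)(i + j + k);
}

/* Innermost loop runs over the last subscript: unit stride in C. */
double sum_ijk(double a[NI][NJ][NK])
{
    double s = 0.0;
    for (size_t i = 0; i < NI; i++)
        for (size_t j = 0; j < NJ; j++)
            for (size_t k = 0; k < NK; k++)
                s += a[i][j][k];
    return s;
}

/* Innermost loop runs over the first subscript: stride of NJ*NK elements in C. */
double sum_kji(double a[NI][NJ][NK])
{
    double s = 0.0;
    for (size_t k = 0; k < NK; k++)
        for (size_t j = 0; j < NJ; j++)
            for (size_t i = 0; i < NI; i++)
                s += a[i][j][k];
    return s;
}
```

Timing two such variants at realistic array sizes is one way to observe an ordering effect on a given machine; the sketch makes no claim about the magnitude of that effect on the Cray X1.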


(last revised 23 Feb 05)