Abstract
We analyzed the performance of the Cray X1 on two sets of nested loops iterating over triply subscripted arrays and found performance variations from 1.8 to 818 MFLOPS in 64-bit arithmetic (a 457:1 ratio). The interaction among the multi-level cache, vector operations, memory bank conflicts, and "multistreaming" is the apparent cause of most of the large speed variations. We used only two simple computations but varied many parameters, such as the order of the loops, the subscripting style, compilation options, precision, and language. With the inclusion of 32-bit and 128-bit floating-point arithmetic, the speed range increased to about 40,000 to 1, from 0.04 to almost 1500 MFLOPS. We found that compiler feedback on loop vectorization and multistreaming was a poor predictor of performance. We conclude the paper with some observations on estimating and improving Cray X1 application performance.
Click here for this paper in PDF format.
Click here for the presentation slides in PDF format.
(last revised 23 Feb 05)