Neal Gaarder, System Libraries Group Cray Research, Inc. 655-F Lone Oak Drive Eagan, Minnesota, USA
Introduction

A new capability of Cray's Programming Environment 3.0 is optional "vectorization" of selected intrinsic routines such as SQRT and POPCNT. Applications that currently spend a significant portion of their time computing math intrinsic functions can speed up significantly. One customer application ran twice as fast on the T3E with vectorization, although this is not typical.

Scalar intrinsics, such as SQRT and EXP, are functions that return one result per call. As one might expect, a "vectorized" intrinsic returns more than one result per call. On the MPP, these intrinsics operate on arrays ("vectors") in memory and return their results as arrays in memory. For example, a vectorized SQRT intrinsic processing an array of NARGS arguments, ARG_VEC, and returning an array of results, RSLT_VEC, might be called as SQRT_V( NARGS, ARG_VEC, RSLT_VEC ).

Because they are called with an array of arguments, vector intrinsics can be optimized in ways that are not profitable or possible when given one argument. The vector intrinsics pipeline, unroll, prefetch, avoid call overhead, and overlap more operations. In many cases the remaining code is also more efficient, because a function call inside a loop can trigger register spills and preclude other optimizations. On the Cray T3E, the vector routines themselves are typically 2-3 times faster than the equivalent scalar loop for loops with 100 iterations, and can be as much as five times faster in certain cases.
Vector intrinsics on the MPP

Over the years, Benchmarking's Tom Hewitt wrote many vector intrinsics for the T3D for internal use. These were collected into a library (bnchlib). Silicon Graphics also developed vector math intrinsics. Without compiler support, vector intrinsics had to be called directly by user codes. This sometimes required a significant amount of recoding (as will be seen), so their application was limited. These routines were not generally available and not supported through normal channels. Also, while they are typically fully accurate over normal ranges, there were regions of lesser accuracy than libm and sometimes additional restrictions on the arguments.

With Cray's Programming Environment Release 3.0, the C and Fortran compilers automatically use vector intrinsics. They not only translate F90 vector syntax into vector calls, but also replace scalar intrinsic calls inside simple and complex loops. This can involve loop splitting, loop stripmining (processing a loop in chunks), and creating temporary arrays as needed for intermediate quantities.
What will vectorize?

Both complex and simple loops will vectorize. The rules for vectorization are similar to those for Cray vector machines. In PE 3.0 the following intrinsics vectorize:
Note: LOG, COS, COSS, SIN, and 32 bit 1/SQRT and EXP require the -lmfastv link option (explained later) for significant savings.
Examples of vectorizable loops
Fortran 90 array syntax loops vectorize:
y(1:n) = sqrt( x(1:n) )
| Array Size | Speedup Factor |
|------------|----------------|
| 27         | 2.5x           |
| 81         | 2.9x           |
| 243        | 3.8x           |
| 729        | 4.1x           |
| Array Size | Speedup Factor |
|------------|----------------|
| 27         | 1.7x           |
| 81         | 2.4x           |
| 243        | 2.5x           |
| 729        | 2.5x           |
The experience in field test was limited. Most of the applications for which we have data did not use intrinsics much and thus profited minimally (0% to 3%) from vectorization. However, one customer application, described as a DNA code that used intrinsics heavily, ran twice as fast, a 100% speedup.
This code may have benefited from more than just the intrinsic speedups, since in many cases the restructuring allows other optimizations to occur.
Suppose an application spends a fraction i of its run time in vectorizable intrinsics, and vectorization shrinks that intrinsic time to a fraction i*v of the run time, where v is the reciprocal of the intrinsic speedup. Then an overall speedup of 1/(1 - i + i*v) should be seen. For example, assume a code spends 50% of its time in vectorizable intrinsics. If the average intrinsic speedup is 2.0 (v = 0.5), the application will be 33% faster (1.33x). If the average intrinsic speedup is 3.0 (v = 1/3), the application will be 50% faster (1.50x).
Most users need only specify the command line options to get good results with vectorization. In some cases, a user may need a better result. This section discusses several advanced topics that might help: controlling vectorization with directives, loop length considerations, stripmining and temporary arrays, and avoiding conditional calls.
In some cases, it is desirable to be able to control vectorization within a code. Directives have been provided to do this: CDIR$ or !DIR$ in Fortran 90, and #pragma _cri in C.
Multiple calls to RANF within a loop will not normally vectorize because the sequence of random numbers will not have the same order as with the scalar loop. If this is acceptable, use CDIR$ IVDEP to make the compiler vectorize as shown below.
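A minimal sketch of this pattern (the array names x and y are illustrative; IVDEP asserts that reordering the RANF calls is acceptable):

```fortran
CDIR$ IVDEP
      do i = 1, n
         x(i) = ranf()
         y(i) = ranf()
      enddo
```

With the directive, the compiler may batch the RANF calls, so x and y receive the random numbers in a different order than the scalar loop would produce.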
In some cases, it may not be desirable to vectorize a particular loop, perhaps because it is too short to do profitably. Use CDIR$ NEXTSCALAR to inhibit vectorization of a single loop as shown below.
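For example, a short loop such as the following (the loop bound and array names are illustrative) can be kept scalar:

```fortran
CDIR$ NEXTSCALAR
      do i = 1, 8
         y(i) = sqrt( x(i) )
      enddo
```

NEXTSCALAR affects only the loop that immediately follows it; later loops still vectorize normally.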
CDIR$ directives will not inhibit vectorization of Fortran 90 array syntax when the vectorization option is specified.
In some cases, it may be desirable to vectorize only one section in a code. For this, place a CDIR$ NOVECTOR at the beginning of the code, a CDIR$ VECTOR in front of the loop, and a CDIR$ NOVECTOR after the loop, as shown below.
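A sketch of this arrangement (the loop contents are illustrative):

```fortran
CDIR$ NOVECTOR
C     ... earlier loops compile without vectorization ...
CDIR$ VECTOR
      do i = 1, n
         y(i) = sqrt( x(i) )
      enddo
CDIR$ NOVECTOR
C     ... later loops again compile without vectorization ...
```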
As explained earlier, for every function there is a breakeven iteration count below which the overhead of vectorization outweighs its efficiencies. For libm routines like SQRT and 64-bit EXP this is typically 10-30 iterations. For libmfastv routines it is typically 30-50 iterations.
Currently, the compiler does not consider the loop count before vectorizing.
Some codes will have many short loops and it may pay to selectively vectorize only the sections with long loops. This can be done with NOVECTOR and VECTOR directives.
Over the years Cray has studied the loop characteristics of its customers. A large fraction of loops fall in the 20-250 class. On the MPP, even codes from vector machines with very long loops tend to use much smaller loops as the work is divided among hundreds of PEs. For this reason, Cray has devoted much effort to reducing the overhead of the most frequently used intrinsics.
Loops over 100 iterations generally vectorize well. If a temporary array is created, the compiler will process the loop in "strips" of up to 256 iterations. This is called stripmining. If the temporary array can be avoided, that is best. Otherwise an array of size n can be defined to hold the temporary data if that is all that prevents full vectorization.
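Conceptually, the stripmined code the compiler generates resembles the following hand-written sketch (a strip length of 256 and the names x, y, z and temp are illustrative):

```fortran
      do istart = 1, n, 256
         m = min( 256, n - istart + 1 )
C        compute the argument expression into a short temporary
         do i = 1, m
            temp(i) = x(istart+i-1) + y(istart+i-1)
         enddo
C        the intrinsic then processes the strip as a vector
         do i = 1, m
            z(istart+i-1) = sqrt( temp(i) )
         enddo
      enddo
```

Only a 256-element temporary is needed regardless of n, which helps keep the temporary data in cache.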
The vector intrinsics do not need to have the data in the first level cache for near peak performance, but having as much data in second level cache will improve performance. Temporary arrays increase the probability of cache conflicts and so should be avoided.
One of the most efficient techniques uses the same array for result and argument. For example, y(1:n) = sqrt( y(1:n) ).
Sometimes the compiler needs to allocate a temporary array to hold the argument and/or the result of a vector intrinsic. Examples are when an intrinsic is called with an expression as an argument, or the argument and/or result are scalars in the loop. Such a loop might be as follows:
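For instance, a loop that passes an expression as the argument (the names are illustrative) requires a temporary:

```fortran
      do i = 1, n
         result_vector(i) = sqrt( x(i) + y(i) )
      enddo
```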
Such a loop can be vectorized by creating a temporary array to hold the scalar computation as follows:
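A sketch of the split form, in which a temporary array (here temp, of length n) holds the argument expression so the intrinsic sees a plain array:

```fortran
      do i = 1, n
         temp(i) = x(i) + y(i)
      enddo
      do i = 1, n
         result_vector(i) = sqrt( temp(i) )
      enddo
```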
The temporary array can be avoided by assigning the argument to result_vector (the compiler may do this) as follows:
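A sketch of the temporary-free form: result_vector first receives the argument values and is then overwritten in place (names are illustrative):

```fortran
      do i = 1, n
         result_vector(i) = x(i) + y(i)
      enddo
      do i = 1, n
         result_vector(i) = sqrt( result_vector(i) )
      enddo
```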
In this form, SQRT has shown up to 5x speedup for larger loops, due to reduced cache conflict, versus 4.5x speedup with different argument and result arrays.
In general it is good practice to avoid branches on RISC machines, such as the T3E. If a call to an intrinsic only occurs when some condition is met, vectorization is inhibited. For example, the following loop will not vectorize.
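An illustrative example (mask, x and y are hypothetical names); the conditional call inhibits vectorization:

```fortran
      do i = 1, n
         if ( mask(i) ) y(i) = sqrt( x(i) )
      enddo
```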
Once the call is made unconditional, this loop vectorizes. Even though a few more arguments may be processed, the great savings of vectorization will offset the cost. The following loop will vectorize.
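A sketch of the unconditional form (mask, x, y and temp are hypothetical names; every x(i) must be a valid argument for this to be safe):

```fortran
      do i = 1, n
         temp(i) = sqrt( x(i) )
      enddo
      do i = 1, n
         if ( mask(i) ) y(i) = temp(i)
      enddo
```

The SQRT loop now vectorizes, and only the cheap masked copy remains conditional.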
The vectorized intrinsic feature in Release 3.0 can significantly improve the performance of select applications, those which now spend a significant portion of their time in loops computing intrinsic functions such as SQRT and EXP. This once required changes to the user code, but with Cray Programming Environment 3.0 codes can be optimized with command line options. Optional vectorization directives can be inserted in the code for finer control.
Loops dominated by vectorizable intrinsics and with over 100 iterations will typically be two to three times faster with vectorization. Intrinsic intensive applications can be up to twice as fast, but the speedup varies with the application.
Neal Gaarder develops Math Library (libm) software for Cray Research, a Silicon Graphics company. He developed the vector intrinsic library routines discussed here.
nealg@cray.com
www.cray.com