CRAY X1 SCIENTIFIC LIBRARIES
Mary Beth Hribar
Cray Inc.
411 First Ave S, Ste 600
Seattle WA 98104
ABSTRACT:
The scientific libraries provided on the Cray X1 system, LibSci, deliver tuned, high-performance numerical routines for Cray X1 applications. The LibSci interface supports Cray customers who are used to programming Cray systems as well as Cray customers who have programmed other platforms. This paper describes the functionality and performance of LibSci, including a few programming tips. It also discusses plans for future releases of LibSci.
KEYWORDS:
LibSci, BLAS, FFTs, LAPACK, ScaLAPACK, BLACS, PE, optimization
The set of scientific libraries, LibSci, is a collection of numerical routines tuned for performance on the Cray X1 system. Users call these routines to optimize the performance of applications.
The routines provided in LibSci support the various programming models and features provided by the Cray X1 System.
The Cray X1 system combines powerful vector processors with both shared and distributed memory within a highly scalable configuration. It is constructed of nodes within a node interconnection network, each of which contains four multistreaming processors (MSPs) and globally addressable shared memory. Each MSP in turn contains four single-streaming processors (SSPs), plus a 2-MB cache shared among the four SSPs. Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit.
Both 32-bit and 64-bit arithmetic are supported on the Cray X1 system. A single MSP provides 12.8 GFLOPS of peak computational power for 64-bit data, and 25.6 GFLOPS for 32-bit data. The user, however, should be aware that although 32-bit arithmetic operations execute at twice the rate of 64-bit operations, it is difficult to attain this 32-bit peak speed in actual applications, as there is less computational time with which to overlap delays due to memory latency, functional unit latency and instruction issue.
Applications may execute on the Cray X1 system either in MSP mode or in SSP mode. For programs compiled in MSP mode, the compiler controls the streaming across the SSPs contained within each MSP. In this case, the MSP is the processing unit. For programs compiled in SSP mode, there is no automatic streaming across the SSPs since the SSP is the elemental processing unit in the Cray X1 system. The decision to execute in either of these modes is made at compile time: the executable is scheduled at runtime according to how it was compiled.
LibSci is included in the Cray Programming Environment (PE), which provides the compilers, libraries and tools. The LibSci in the PE 5.0 release (June 2003) provides the following features:
· Support for MSP and SSP modes
· Support for 32-bit and 64-bit data types
· Fortran interfaces for all routines
· Single processor Fast Fourier Transform (FFT), filter and convolution routines tuned for MSP and SSP modes
· Single processor Basic Linear Algebra Subprograms (BLAS) tuned for MSP mode
· Single processor LAPACK version 3.0 (linear system and eigenvalue solvers)
· Version 1.7 ScaLAPACK (distributed memory parallel set of LAPACK routines)
· Basic Linear Algebra Communication Subprograms (BLACS)
The performance and functionality of LibSci are still evolving and not yet complete. More functionality and optimizations are planned for future releases, which will give users the most up-to-date software and improved performance.
The Cray X1 system supports 32-bit and 64-bit arithmetic in hardware. The 32-bit support represents a new development for a Cray system; thus, it is a new feature of LibSci. LibSci includes two separate sets of libraries: the first (default) library is implemented for both 32-bit and 64-bit data types; the second library is implemented for all 64-bit data types.
The default LibSci contains both 32-bit and 64-bit data types. The default LibSci is automatically available with the modules feature (that is, via the module load command), so no link options are required to use it.
The 64-bit version of LibSci contains all 64-bit data types. This library is provided to better support existing Cray customers whose programs are written with 64-bit data types. It is selected by using the –s default64 Fortran compiler option when linking, or, if using the C/C++ compiler, by specifying –lsci64 when linking.
In the PE 5.0 release of LibSci, the ScaLAPACK and BLACS routines are available only in the default library; there are no 64-bit versions of these routines. All other listed routines are available in both the default LibSci and the 64-bit version of LibSci.
The following table summarizes the data types available in each library. The Type Declaration columns list Fortran type declarations in the star format. These star formats are not affected by the –s default32 or –s default64 Fortran compiler options.
Table 2-1 Sizes of Data Types in LibSci (default) and LibSci (64-bit)

                      LibSci (default)              LibSci (64-bit)
 Data Type            Size      Type Declaration    Size      Type Declaration
 INTEGER              32 bits   INTEGER*4           64 bits   INTEGER*8
 REAL                 32 bits   REAL*4              64 bits   REAL*8
 DOUBLE PRECISION     64 bits   REAL*8              N/A       N/A
 COMPLEX              64 bits   COMPLEX*8           128 bits  COMPLEX*16
 DOUBLE COMPLEX       128 bits  COMPLEX*16          N/A       N/A
LibSci provides both single and double precision routines in the default library. Single precision routines use REAL and/or COMPLEX data types. Double precision routines use DOUBLE PRECISION and/or DOUBLE COMPLEX data types. For example, the level-1 BLAS routines which copy the contents of one vector, x, to another vector, y, are shown below.
Single precision real:

    scopy(n, x, incx, y, incy)
    integer n, incx, incy
    real x(*), y(*)

Single precision complex:

    ccopy(n, x, incx, y, incy)
    integer n, incx, incy
    complex x(*), y(*)

Double precision real:

    dcopy(n, x, incx, y, incy)
    integer n, incx, incy
    double precision x(*), y(*)

Double precision complex:

    zcopy(n, x, incx, y, incy)
    integer n, incx, incy
    double complex x(*), y(*)
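The incx and incy arguments select the stride through each vector. As a rough illustration of these semantics (a Python sketch, not the LibSci implementation, and assuming positive increments only; the BLAS also define behavior for negative increments, which this sketch omits):

```python
def strided_copy(n, x, incx, y, incy):
    """Sketch of the BLAS *copy semantics: copy n elements of x into y,
    stepping through each array by its increment (positive strides only)."""
    for i in range(n):
        y[i * incy] = x[i * incx]
    return y

# Copy every other element of x into consecutive elements of y:
x = [1.0, 9.0, 2.0, 9.0, 3.0, 9.0]
y = [0.0, 0.0, 0.0]
strided_copy(3, x, 2, y, 1)   # y becomes [1.0, 2.0, 3.0]
```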
For easy identification, single precision routine names begin with the letter ‘s’ or the letter ‘c’, and double precision routine names begin with the letter ‘d’ or the letter ‘z.’ This naming convention is a defined standard for the BLAS, LAPACK, ScaLAPACK and BLACS routines. The LibSci FFT routines also use this naming convention.
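The prefix convention can be captured in a small helper (a Python sketch for illustration only; the helper name is hypothetical, and prefixes such as the 'i' in IDAMAX, which denotes an integer result, are not covered):

```python
def blas_precision(routine):
    """Map the standard BLAS/LAPACK name prefix to (precision, domain)."""
    prefixes = {
        's': ('single', 'real'),
        'c': ('single', 'complex'),
        'd': ('double', 'real'),
        'z': ('double', 'complex'),
    }
    return prefixes[routine.lower()[0]]

blas_precision('SCOPY')   # ('single', 'real')
blas_precision('ZGEMM')   # ('double', 'complex')
```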
In LibSci for Cray PVP and Cray T3E systems, only single precision routine names were included. In the default LibSci for the Cray X1 system, the double precision names are available. The following tables list the double precision names that are now available for the BLAS, FFT, filter and convolution routines. Note that the double precision names are available for LAPACK, ScaLAPACK and BLACS routines also, but there are too many to list here.
Table 3-1 Double Precision BLAS Routines on UNICOS/mp systems

 DASUM, DZASUM        DNRM2, DZNRM2          DSYR2K, ZSYR2K
 DAXPY, ZAXPY         DROT, DROTG, ZROTG     DSYRK, ZSYRK
 DCABS1               DSBMV, ZHBMV           DTBMV, ZTBMV
 DCOPY, ZCOPY         DSCAL, ZDSCAL, ZSCAL   DTBSV, ZTBSV
 DDOT, ZDOTC, ZDOTU   DSPMV, ZHPMV           DTPMV, ZTPMV
 DGBMV, ZGBMV         DSPR, ZHPR             DTPSV, ZTPSV
 DGEMM, ZGEMM         DSPR2, ZHPR2           DTRMM, ZTRMM
 ZHEMM                DSWAP, ZSWAP           DTRMV, ZTRMV
 DGEMV, ZGEMV         DSYMM, ZSYMM           DTRSM, ZTRSM
 ZHEMV                DSYMV                  DTRSV, ZTRSV
 DGER, ZGERC, ZGERU   DSYR                   IDAMAX, IZAMAX
 ZHER, ZHER2K, ZHERK  DSYR2
Table 3-2 Double Precision FFT routines

 Dimensions                  Complex-to-complex   Real-to-complex   Complex-to-real
 One-dimensional (single)    zzfft (zfft)         dzfft             zdfft
 One-dimensional (multiple)  zzfftm (mzfft)       dzfftm            zdfftm
 Two-dimensional             zzfft2d (zfft2d)     dzfft2d           zdfft2d
 Three-dimensional           zzfft3d (zfft3d)     dzfft3d           zdfft3d
Table 3-3 Double precision filter routines

 Name       Purpose
 dfilterg   Computes a correlation of two vectors
 dfilters   Computes a correlation of two vectors (assuming the filter coefficient vector is symmetric)
 dopfilt    Solves the Wiener-Levinson linear equations

Table 3-4 Double precision convolution routines

 Name     Purpose
 zcnvl    Computes a standard complex convolution
 zcnvlf   Computes a convolution using FFTs
The single precision FFT routines in the default LibSci need special attention by the user. The table array argument must always be an array of 64-bit words, regardless of the size of the other data types within the routine. Please see the intro_fft man page for more information.
Note that the 64-bit LibSci does not contain the double precision routines. This library is intended to support applications written for Cray PVP and T3E systems and thus only needs to provide the single precision names. The single precision routines in the 64-bit library use 64-bit real and complex data types.
The single precision routines in the default LibSci and the 64-bit LibSci have the same names, but are defined to use data types of different sizes. The default LibSci is loaded automatically. Users who wish to use the 64-bit single precision routines must explicitly link to the 64-bit library by using the –s default64 Fortran compiler option or by specifying –lsci64. Failing to do so links the 32-bit single precision routines instead, which results in runtime errors.
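The underlying danger is that argument sizes no longer match what the library expects. That effect can be mimicked in a few lines of Python (an illustration of the size mismatch only, not of LibSci itself): bytes written as one 64-bit float, when read back as 32-bit floats, produce meaningless values.

```python
import struct

# A caller writes a 64-bit value...
buf = struct.pack('d', 1.0)            # 8 bytes, one IEEE double

# ...but a routine expecting 32-bit data reads 32-bit words:
as_singles = struct.unpack('ff', buf)  # two IEEE singles from the same bytes

print(as_singles)  # neither value is 1.0: the bit patterns do not line up
```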
It is difficult to predict whether an application will execute faster in MSP mode or in SSP mode. Thus, since both modes are provided, Cray recommends that applications be tested in both. In fact, the Cray benchmarking group is currently working with Cray Technical Publications to publish guidelines for determining which mode works best with which types of applications.
Using the Fortran compiler option –O ssp or the C/C++ compiler option –h ssp will compile an executable in SSP mode and link to the LibSci for SSP mode. Using the default compiler options will create an executable in MSP mode, linked to the LibSci for MSP mode.
In the PE 5.0 release (June, 2003), LibSci for SSP and MSP modes will be functional. As for performance, routines in LibSci have been tuned to perform well in MSP mode, ensuring that the computations are partitioned efficiently across the four SSPs. The FFT routines are the only routines in LibSci which have been tuned specifically for SSP mode. More performance enhancements for both MSP and SSP modes will be coming in future releases.
LibSci for the Cray X1 system does not support all routines that were provided by LibSci for the Cray PVP and Cray T3E systems. Namely, the non-standard BLAS, the sparse iterative solvers and the out-of-core solver routines are no longer supported. Tables 5-1, 5-2 and 5-3 list the level-1, level-2 and level-3 BLAS routines that are unavailable on the Cray X1 system. Table 5-4 lists the iterative and out-of-core solver routines that are no longer supported. This information is also given in the Cray X1 User Environment Differences manual.
Also, LINPACK and EISPACK routines are not included in LibSci for the Cray X1 system. Users of these routines are encouraged to update to the LAPACK routines. LAPACK is the successor to LINPACK and EISPACK and it contains faster algorithms.
Table 5-1 Unsupported level-1 BLAS routines for the Cray X1 system

 Name                 Purpose
 HAXPY (GAXPY)        Adds a scalar multiple of a real or complex vector to another real or complex vector (32-bit version)
 HDOT (GDOTC, GDOTU)  Computes a dot product (inner product) of two real or complex vectors (32-bit version)
 SAXPBY (CAXPBY)      Adds a scalar multiple of a real or complex vector to a scalar multiple of another vector
 SHAD                 Computes the Hadamard product of two vectors
 SPAXPBY              Adds a scalar multiple of a vector to a sparse vector
 SPDOT                Computes a dot product of a real vector and a sparse real vector
 CSROT                Applies a real plane rotation to a pair of complex vectors
 CROT                 Applies a real plane rotation to a pair of complex vectors
 SROTM                Applies a modified Givens plane rotation
 SROTMG               Constructs a modified Givens plane rotation
 SSUM (CSUM)          Sums the elements of a real or complex vector
Table 5-2 Unsupported level-2 BLAS routines for the Cray X1 system

 Name      Purpose
 SGESUM    Adds a scalar multiple of a real or complex matrix to a scalar multiple of another real or complex matrix
 CSPMV     Multiplies a complex vector by a complex symmetric packed matrix
 CSPR      Performs a symmetric rank-1 update of a complex symmetric packed matrix
 CSSPR12   Performs two simultaneous symmetric rank-1 updates of a real symmetric packed matrix
 CSYMV     Multiplies a complex vector by a complex symmetric matrix
 CSYR      Performs a symmetric rank-1 update of a complex symmetric matrix
 CSSYR2    Performs a symmetric rank-2 update of a real symmetric matrix
Table 5-3 Unsupported level-3 BLAS routines for the Cray X1 system

 Name             Purpose
 SCOPY2 (CCOPY2)  Copies a real matrix into another real matrix; copies a complex matrix into another complex matrix (used by the out-of-core routines)
 SGEMMS (CGEMMS)  Multiplies a real general matrix by a real general matrix; multiplies a complex general matrix by a complex general matrix; uses Strassen's algorithm
Table 5-4 Unsupported iterative and out-of-core solver routines for the Cray X1 system

 Name                 Purpose
 SITRSOL              Solves a real general sparse system, using a preconditioned conjugate gradient-like method (iterative solver)
 VBEGIN               Initializes out-of-core routine data structures
 VEND                 Handles terminal processing for the out-of-core routines
 VSTORAGE             Declares packed storage mode for a triangular, symmetric, or Hermitian virtual matrix
 SCOPY2RV (CCOPY2RV)  Copies a submatrix of a real (in memory) matrix to a virtual matrix
 SCOPY2VR (CCOPY2VR)  Copies a submatrix of a virtual matrix to a real (in memory) matrix
 VSGETRF (VCGETRF)    Computes an LU factorization of a virtual general matrix, using partial pivoting with row interchanges
 VSGETRS (VCGETRS)    Solves a system of linear equations AX = B; A is a virtual general matrix whose LU factorization has been computed by VSGETRF
 VSPOTRF              Computes the Cholesky factorization of a virtual real symmetric positive definite matrix
 VSPOTRS              Solves a system of linear equations AX = B; A is a virtual real symmetric positive definite matrix
 VSGEMM (VCGEMM)      Multiplies a virtual general matrix by a virtual general matrix
 VSTRSM (VCTRSM)      Solves a virtual triangular system of equations with multiple right-hand sides
 VSSYRK               Performs a symmetric rank-k update of a virtual symmetric matrix
LibSci performance has improved in each successive release and will continue to improve in coming releases. A more thorough report of the performance of LibSci will be published at a later date. Performance of a few routines is given here.
The Cray X1 system is a new architecture, so an understanding of how a variety of applications perform on it is still forthcoming. This information will be included in future releases of manuals and training materials. Specifically, LibSci man pages and manuals will be constantly updated with guidelines for getting the most performance from the routines.
As with the Cray PVP systems, certain memory strides on the Cray X1 system perform better than others. For the BLAS routines, the best performance is obtained when the leading dimension of two-dimensional arrays is an odd multiple of four. For the FFT routines, it suffices to use odd leading dimensions for the arrays. The following results use these optimal strides.
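These stride rules can be applied mechanically when sizing arrays. The sketch below (Python, helper names hypothetical) rounds a requested leading dimension up to the nearest odd multiple of four for the BLAS, and to the nearest odd number for the FFTs:

```python
def blas_leading_dim(n):
    """Smallest odd multiple of 4 that is >= n (i.e., 4, 12, 20, 28, ...)."""
    m = -(-n // 4)        # ceiling of n/4
    if m % 2 == 0:        # keep only odd multipliers of 4
        m += 1
    return 4 * m

def fft_leading_dim(n):
    """Smallest odd number that is >= n."""
    return n if n % 2 == 1 else n + 1

blas_leading_dim(256)   # 260: pad the leading dimension past the power of two
fft_leading_dim(256)    # 257
```

Declaring the matrix with the padded leading dimension while operating on the original n rows trades a little memory for better stride behavior.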
Table 6-1 shows the performance of SGEMV and CGEMV, the single precision real and complex matrix-vector multiply routines. Table 6-2 gives the performance of SGEMM and CGEMM, the single precision real and complex matrix-matrix multiply routines. Results are given for the 32-bit and 64-bit versions on the Cray X1 system, and for the 64-bit version on the Cray SV1ex.
Note that the 32-bit routines are only about 40 to 50% faster than the 64-bit routines on the Cray X1 system. Also note that the Cray X1 system (64-bit routines) is about eight times faster than the Cray SV1ex for the matrix-matrix multiply, and about five times faster for the matrix-vector multiply.
Further, the BLAS routines in these two tables were tuned only in Fortran, not in Cray Assembly Language (CAL); this performance was obtained from the Fortran compiler. (Currently, only a set of FFT routines is implemented in CAL. The majority of LibSci routines are written in Fortran or C.)
The performance results reported here are for the most current LibSci, to be released in PE 5.0. As the performance of routines in LibSci continues to improve, these results will soon be outdated.
Table 6-1 Performance of matrix-vector multiply in LibSci (PE 5.0, leading dimension an odd multiple of 4, MSP mode for X1 results)

            SGEMV (TRANS = 'N') (MFLOPS)        CGEMV (TRANS = 'N') (MFLOPS)
 Size N=M   X1 32-bit   X1 64-bit   SV1ex       X1 32-bit   X1 64-bit   SV1ex
 256        9769        7006        851         10572       8115        1112
 512        12031       4551        870         13463       8336        1125
 768        7567        4633        869         13151       8555        1128
 1024       7129        4646        879         13324       8695        1130
Table 6-2 Performance of matrix-matrix multiply in LibSci (PE 5.0, leading dimension an odd multiple of 4, MSP mode for X1 results)

             SGEMM (TRANS = 'N') (MFLOPS)        CGEMM (TRANS = 'N') (MFLOPS)
 Size N=M=K  X1 32-bit   X1 64-bit   SV1ex       X1 32-bit   X1 64-bit   SV1ex
 256         14850       10845       1302        16643       10677       1323
 512         15842       10970       1319        17059       11013       1330
 768         15944       11108       1326        17272       11040       1333
 1024        15315       10959       1329        17305       11037       1334
Cray has recently published results for the High Performance LINPACK benchmark. The High Performance LINPACK benchmark determines the list of the Top500 Supercomputer Sites, and the first Cray X1 systems will rank on this list in June 2003. Table 7-1 shows the submitted results for the benchmark. Note that the single-MSP result is given for reference only, and was not submitted. The benchmark performs at about 90% of the peak of the machine, and scales well.
This benchmark was implemented with a highly tuned matrix-matrix multiply routine which obtains approximately 12 GFLOPS. That routine was written in Cray Assembly Language (CAL), and is faster than the matrix-matrix multiply routine in LibSci. In future releases of LibSci, there may be matrix-matrix multiply routines implemented partially in CAL for faster performance.
Shmem was used for the communication routines in the benchmark. Currently, ScaLAPACK and the BLACS are implemented with MPI. Future releases of LibSci will have some of the communication in ScaLAPACK and the BLACS replaced with Shmem.
While the LINPACK benchmark demonstrates the computational power of the Cray X1 system, it does not represent the current performance of LibSci. The performance of LibSci will improve, but the corresponding linear solvers in LAPACK and ScaLAPACK will never run as fast as a benchmark highly tuned for a specific case.
Table 7-1 HP LINPACK performance results for the Cray X1 system (* 1-MSP result given for reference only and was not submitted)

 Processors (MSPs)   Rmax (GFLOPS)   Rpeak (GFLOPS)   Nmax     N1/2
 60                  675.5           768.0            168960   20160
 49                  550.5           627.2            150528   16128
 36                  404.3           460.8            129024   13824
 28                  318.1           358.4            114688   11302
 16                  182.3           204.8            81920    8242
 12                  137.6           153.6            73728    6294
 8                   92.4            102.4            61440    4996
 4                   46.5            51.2             41984    3048
 1*                  11.8            12.8             20992    1280
Table 7-2 compares the 8-MSP Cray X1 result against other systems of comparable computational power. The vector systems, such as the Cray X1 system and the NEC SX6, achieve over 90% of peak performance; the other systems achieve less.
Table 7-2 Comparison of similar systems to the 8-MSP Cray X1 system

 Computer         CPUs   Rmax (GFLOPS)   Rpeak (GFLOPS)   % of peak
 Cray X1          8      92.4            102.4            90%
 IBM P690 Turbo   32     91.3            166.4            55%
 HP Superdome     64     86.4            141.3            61%
 Cray T3E 1200E   112    90.4            134.0            67%
 NEC SX6          8      63.2            64.0             99%
Users should refer to LibSci documentation for further information. Each LibSci routine is documented in a man page, plus the following man pages provide introductory information: intro_libsci, intro_fft, intro_blas1, intro_blas2, intro_blas3, intro_lapack, intro_scalapack, and intro_blacs. These introductory man pages are updated frequently, and they provide current information about functionality and performance.
For more information on the differences between LibSci for the Cray X1 system and LibSci for other Cray PVP and Cray T3E systems, please refer to the Cray X1 User Environment Differences manual. This manual describes the new features of LibSci for the Cray X1 system, and lists the routines that are no longer supported.
The Migrating Applications to the Cray X1 System manual contains information useful to LibSci users. It has helpful information for calling Fortran routines from C or C++ programs, and it includes instructions for correctly linking to the 64-bit LibSci.
In later releases of the Optimizing Applications on the Cray X1 System, there will be a chapter for using LibSci. This chapter will describe how to use routines in LibSci to get the best performance.
Also, there will be a reference manual for LibSci coming later in 2003.
The following list gives helpful tips for using LibSci effectively.
· To use the 64-bit single precision routines, use the –s default64 Fortran compiler option or specify –lsci64 when linking. This links to the 64-bit version of LibSci instead of the default library. Also, use the 64-bit library if using 64-bit integers; only the 64-bit library supports them.
· Reference the intro_fft man page for instructions on how to declare the table array when using the default LibSci FFT routines. This array must always contain 64-bit words, regardless of the size of the other data types in the routine.
· If porting an application with 32-bit data types that has previously been ported to a Cray PVP or Cray T3E system, there may be CRAY macro definitions that change the size of the data types to 64 bits. To disable these definitions, use the –U CRAY compiler option. Passing 64-bit integer and real variables to the default LibSci routines will result in runtime errors.
· For best performance from the BLAS routines, set the leading dimensions of the arrays (lda, ldb, ldc, etc.) to odd multiples of four.
· For best performance from the FFT routines, use odd leading dimensions for the arrays. Reference the intro_fft man page for more information.
· There are Fortran interfaces for all LibSci routines. To call these routines from C or C++ programs, follow the standard conventions. For more information on this topic, reference the Migrating Applications to the Cray X1 System manual. (Note: There are C interfaces for the ScaLAPACK and BLACS routines.)
· Reference the Cray X1 User Environment Differences manual for lists of routines that are no longer supported in LibSci for the Cray X1 system.
More performance and functionality will be added to LibSci in releases following PE 5.0. The PE 5.1 release of LibSci will add more optimizations, especially for SSP mode support. By the end of 2003, there should be distributed memory parallel versions of the FFTs and sparse direct solvers.
There are also plans to support other numerical library ports to the Cray X1 system. More details will follow.
LibSci is the product of the Cray Scientific Libraries group. This group includes Mary Beth Hribar, manager; Bracy Elton, FFTs; Chao Yang, BLAS, LAPACK, sparse solvers and benchmarking work; and Rick Hangartner, BLAS.
Other Cray employees have also contributed to LibSci: Neal Gaarder, BLAS and CAL support; and Wendy Thrash, FFTs. In addition, two contractors have helped the LibSci effort: Kitrick Sheets, LAPACK testing; and Jim Hoekstra of Iowa State University, ScaLAPACK port and LibSci testing.