CRAY X1 SCIENTIFIC LIBRARIES
Mary Beth Hribar
Cray Inc.
411 First Ave S, Ste 600
Seattle WA 98104
ABSTRACT:
The scientific libraries provided on the Cray X1 system, LibSci, deliver tuned, high-performance numerical routines for Cray X1 applications. The LibSci interface supports Cray customers who are used to programming Cray systems as well as Cray customers who have programmed other platforms. This paper describes the functionality and performance of LibSci, including a few programming tips. It also discusses plans for future releases of LibSci.
KEYWORDS:
LibSci, BLAS, FFTs, LAPACK, ScaLAPACK, BLACS, PE, optimization
The set of scientific libraries, LibSci, is a collection of numerical routines tuned for performance on the Cray X1 system. Users call these routines to optimize the performance of applications.
The routines provided in LibSci support the various programming models and features provided by the Cray X1 System.
The Cray X1 system combines powerful vector processors with both shared and distributed memory within a highly scalable configuration. It is constructed of nodes within a node interconnection network, each of which contains four multistreaming processors (MSPs) and globally addressable shared memory. Each MSP in turn contains four single-streaming processors (SSPs), plus a 2-MB cache shared among the four SSPs. Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit.
Both 32-bit and 64-bit arithmetic are supported on the Cray X1 system. A single MSP provides 12.8 GFLOPS of peak computational power for 64-bit data, and 25.6 GFLOPS for 32-bit data. The user, however, should be aware that although 32-bit arithmetic operations execute at twice the rate of 64-bit operations, it is difficult to attain this 32-bit peak speed in actual applications, as there is less computational time with which to overlap delays due to memory latency, functional unit latency and instruction issue.
Applications may execute on the Cray X1 system either in MSP mode or in SSP mode. For programs compiled in MSP mode, the compiler controls the streaming across the SSPs contained within each MSP. In this case, the MSP is the processing unit. For programs compiled in SSP mode, there is no automatic streaming across the SSPs since the SSP is the elemental processing unit in the Cray X1 system. The decision to execute in either of these modes is made at compile time: the executable is scheduled at runtime according to how it was compiled.
LibSci is included in the Cray Programming Environment (PE), which provides the compilers, libraries and tools. The LibSci in the PE 5.0 release (June 2003) provides the following features:
· Support for MSP and SSP modes
· Support for 32-bit and 64-bit data types
· Fortran interfaces for all routines
· Single processor Fast Fourier Transform (FFT), filter and convolution routines tuned for MSP and SSP modes
· Single processor Basic Linear Algebra Subprograms (BLAS) tuned for MSP mode
· Single processor LAPACK version 3.0 (linear system and eigenvalue solvers)
· Version 1.7 ScaLAPACK (distributed memory parallel set of LAPACK routines)
· Basic Linear Algebra Communication Subprograms (BLACS)
The performance and functionality of LibSci are still evolving and not yet complete. More functionality and optimizations are planned for future releases, which will give users the most up-to-date software and improved performance.
The Cray X1 system supports 32-bit and 64-bit arithmetic in hardware. The 32-bit support represents a new development for a Cray system; thus, it is a new feature of LibSci. LibSci includes two separate sets of libraries: the first (default) library is implemented for both 32-bit and 64-bit data types; the second library is implemented for all 64-bit data types.
The default LibSci contains both 32-bit and 64-bit data types. The default LibSci is automatically available with the modules feature (that is, via the module load command), so no link options are required to use it.
The 64-bit version of LibSci contains all 64-bit data types. This library is provided to better support existing Cray customers whose programs are written with 64-bit data types. It is selected by using the –s default64 Fortran compiler option when linking, or, if using the C/C++ compiler, by specifying –lsci64 when linking.
In the PE 5.0 release of LibSci, the ScaLAPACK and BLACS routines are available only in the default library; there are no 64-bit versions of these routines. All other listed routines are available in both the default LibSci and the 64-bit version of LibSci.
The following table summarizes the data types available in each library. The Type Declaration columns list Fortran type declarations in the star format. These star formats are not affected by the –s default32 or –s default64 Fortran compiler options.
Table 2-1 Sizes of Data Types in LibSci (default) and LibSci (64-bit)

                      LibSci (default)              LibSci (64-bit)
 Data Type            Size      Type Declaration    Size      Type Declaration
 INTEGER              32 bits   INTEGER*4           64 bits   INTEGER*8
 REAL                 32 bits   REAL*4              64 bits   REAL*8
 DOUBLE PRECISION     64 bits   REAL*8              N/A       N/A
 COMPLEX              64 bits   COMPLEX*8           128 bits  COMPLEX*16
 DOUBLE COMPLEX       128 bits  COMPLEX*16          N/A       N/A
LibSci provides both single and double precision routines in the default library. Single precision routines use REAL and/or COMPLEX data types. Double precision routines use DOUBLE PRECISION and/or DOUBLE COMPLEX data types. For example, the level-1 BLAS routines which copy the contents of one vector, x, to another vector, y, are shown below.
Single precision real:

    scopy(n, x, incx, y, incy)
    integer n, incx, incy
    real x(*), y(*)

Single precision complex:

    ccopy(n, x, incx, y, incy)
    integer n, incx, incy
    complex x(*), y(*)

Double precision real:

    dcopy(n, x, incx, y, incy)
    integer n, incx, incy
    double precision x(*), y(*)

Double precision complex:

    zcopy(n, x, incx, y, incy)
    integer n, incx, incy
    double complex x(*), y(*)
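The incx and incy arguments select the stride through each vector. As a rough illustration of these semantics (a Python sketch, not the LibSci implementation, and assuming positive increments only; the BLAS also define behavior for negative increments, which this sketch omits):

```python
def strided_copy(n, x, incx, y, incy):
    """Sketch of the BLAS *copy semantics: copy n elements of x into y,
    stepping through each array by its increment (positive strides only)."""
    for i in range(n):
        y[i * incy] = x[i * incx]
    return y

# Copy every other element of x into consecutive elements of y:
x = [1.0, 9.0, 2.0, 9.0, 3.0, 9.0]
y = [0.0, 0.0, 0.0]
strided_copy(3, x, 2, y, 1)   # y becomes [1.0, 2.0, 3.0]
```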
For easy identification, single precision routine names begin with the letter ‘s’ or the letter ‘c’, and double precision routine names begin with the letter ‘d’ or the letter ‘z.’ This naming convention is a defined standard for the BLAS, LAPACK, ScaLAPACK and BLACS routines. The LibSci FFT routines also use this naming convention.
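The prefix convention can be captured in a small helper (a Python sketch for illustration only; the helper name is hypothetical, and prefixes such as the 'i' in IDAMAX, which denotes an integer result, are not covered):

```python
def blas_precision(routine):
    """Map the standard BLAS/LAPACK name prefix to (precision, domain)."""
    prefixes = {
        's': ('single', 'real'),
        'c': ('single', 'complex'),
        'd': ('double', 'real'),
        'z': ('double', 'complex'),
    }
    return prefixes[routine.lower()[0]]

blas_precision('SCOPY')   # ('single', 'real')
blas_precision('ZGEMM')   # ('double', 'complex')
```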
In LibSci for Cray PVP and Cray T3E systems, only single precision routine names were included. In the default LibSci for the Cray X1 system, the double precision names are available. The following tables list the double precision names that are now available for the BLAS, FFT, filter and convolution routines. Note that the double precision names are available for LAPACK, ScaLAPACK and BLACS routines also, but there are too many to list here.
Table 3-1 Double Precision BLAS Routines on UNICOS/mp systems

 DASUM, DZASUM        DNRM2, DZNRM2          DSYR2K, ZSYR2K
 DAXPY, ZAXPY         DROT, DROTG, ZROTG     DSYRK, ZSYRK
 DCABS1               DSBMV, ZHBMV           DTBMV, ZTBMV
 DCOPY, ZCOPY         DSCAL, ZDSCAL, ZSCAL   DTBSV, ZTBSV
 DDOT, ZDOTC, ZDOTU   DSPMV, ZHPMV           DTPMV, ZTPMV
 DGBMV, ZGBMV         DSPR, ZHPR             DTPSV, ZTPSV
 DGEMM, ZGEMM         DSPR2, ZHPR2           DTRMM, ZTRMM
 ZHEMM                DSWAP, ZSWAP           DTRMV, ZTRMV
 DGEMV, ZGEMV         DSYMM, ZSYMM           DTRSM, ZTRSM
 ZHEMV                DSYMV                  DTRSV, ZTRSV
 DGER, ZGERC, ZGERU   DSYR                   IDAMAX, IZAMAX
 ZHER, ZHER2K, ZHERK  DSYR2
Table 3-2 Double Precision FFT routines

 Dimensions                  Complex-to-complex   Real-to-complex   Complex-to-real
 One-dimensional (single)    zzfft (zfft)         dzfft             zdfft
 One-dimensional (multiple)  zzfftm (mzfft)       dzfftm            zdfftm
 Two-dimensional             zzfft2d (zfft2d)     dzfft2d           zdfft2d
 Three-dimensional           zzfft3d (zfft3d)     dzfft3d           zdfft3d
Table 3-3 Double precision filter routines

 Name       Purpose
 dfilterg   Computes a correlation of two vectors
 dfilters   Computes a correlation of two vectors (assuming the filter coefficient vector is symmetric)
 dopfilt    Solves the Wiener-Levinson linear equations

Table 3-4 Double precision convolution routines

 Name     Purpose
 zcnvl    Computes a standard complex convolution
 zcnvlf   Computes a convolution using FFTs
The single precision FFT routines in the default LibSci need special attention by the user. The table array argument must always be an array of 64-bit words, regardless of the size of the other data types within the routine. Please see the intro_fft man page for more information.
Note that the 64-bit LibSci does not contain the double precision routines. This library is intended to support applications written for Cray PVP and T3E systems and thus only needs to provide the single precision names. The single precision routines in the 64-bit library use 64-bit real and complex data types.
The single precision routines in the default LibSci and the 64-bit LibSci have the same names, but are defined to use data types of different sizes. The default LibSci is loaded automatically. Users who wish to use the 64-bit single precision routines must explicitly link to the 64-bit library by using the –s default64 Fortran compiler option or by specifying –lsci64. Failing to do so links the 32-bit single precision routines instead, which results in runtime errors.
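The underlying danger is that argument sizes no longer match what the library expects. That effect can be mimicked in a few lines of Python (an illustration of the size mismatch only, not of LibSci itself): bytes written as one 64-bit float, when read back as 32-bit floats, produce meaningless values.

```python
import struct

# A caller writes a 64-bit value...
buf = struct.pack('d', 1.0)            # 8 bytes, one IEEE double

# ...but a routine expecting 32-bit data reads 32-bit words:
as_singles = struct.unpack('ff', buf)  # two IEEE singles from the same bytes

print(as_singles)  # neither value is 1.0: the bit patterns do not line up
```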
It is difficult to predict whether an application will execute faster in MSP mode or in SSP mode. Thus, since both modes are provided, Cray recommends that applications be tested in both. In fact, the Cray benchmarking group is currently working with Cray Technical Publications to publish guidelines for determining which mode works best with which types of applications.
Using the Fortran compiler option –O ssp or the C/C++ compiler option –h ssp will compile an executable in SSP mode and link to the LibSci for SSP mode. Using the default compiler options will create an executable in MSP mode, linked to the LibSci for MSP mode.
In the PE 5.0 release (June, 2003), LibSci for SSP and MSP modes will be functional. As for performance, routines in LibSci have been tuned to perform well in MSP mode, ensuring that the computations are partitioned efficiently across the four SSPs. The FFT routines are the only routines in LibSci which have been tuned specifically for SSP mode. More performance enhancements for both MSP and SSP modes will be coming in future releases.
LibSci for the Cray X1 system does not support all routines that were provided by LibSci for the Cray PVP and Cray T3E systems. Namely, the non-standard BLAS, the sparse iterative solvers and the out-of-core solver routines are no longer supported. Tables 5-1, 5-2 and 5-3 list the level-1, level-2 and level-3 BLAS routines that are unavailable on the Cray X1 system. Table 5-4 lists the iterative and out-of-core solver routines that are no longer supported. This information is also given in the Cray X1 User Environment Differences manual.
Also, LINPACK and EISPACK routines are not included in LibSci for the Cray X1 system. Users of these routines are encouraged to update to the LAPACK routines. LAPACK is the successor to LINPACK and EISPACK and it contains faster algorithms.
Table 5-1 Unsupported level-1 BLAS routines for the Cray X1 system

 Name                 Purpose
 HAXPY (GAXPY)        Adds a scalar multiple of a real or complex vector to another real or complex vector (32-bit version)
 HDOT (GDOTC, GDOTU)  Computes a dot product (inner product) of two real or complex vectors (32-bit version)
 SAXPBY (CAXPBY)      Adds a scalar multiple of a real or complex vector to a scalar multiple of another vector
 SHAD                 Computes the Hadamard product of two vectors
 SPAXPBY              Adds a scalar multiple of a vector to a sparse vector
 SPDOT                Computes a dot product of a real vector and a sparse real vector
 CSROT                Applies a real plane rotation to a pair of complex vectors
 CROT                 Applies a real plane rotation to a pair of complex vectors
 SROTM                Applies a modified Givens plane rotation
 SROTMG               Constructs a modified Givens plane rotation
 SSUM (CSUM)          Sums the elements of a real or complex vector
Table 5-2 Unsupported level-2 BLAS routines for the Cray X1 system

 Name      Purpose
 SGESUM    Adds a scalar multiple of a real or complex matrix to a scalar multiple of another real or complex matrix
 CSPMV     Multiplies a complex vector by a complex symmetric packed matrix
 CSPR      Performs a symmetric rank-1 update of a complex symmetric packed matrix
 CSSPR12   Performs two simultaneous symmetric rank-1 updates of a real symmetric packed matrix
 CSYMV     Multiplies a complex vector by a complex symmetric matrix
 CSYR      Performs a symmetric rank-1 update of a complex symmetric matrix
 CSSYR2    Performs a symmetric rank-2 update of a real symmetric matrix
Table 5-3 Unsupported level-3 BLAS routines for the Cray X1 system

 Name             Purpose
 SCOPY2 (CCOPY2)  Copies a real matrix into another real matrix; copies a complex matrix into another complex matrix (used by the out-of-core routines)
 SGEMMS (CGEMMS)  Multiplies a real general matrix by a real general matrix; multiplies a complex general matrix by a complex general matrix; uses Strassen's algorithm
Table 5-4 Unsupported iterative and out-of-core solver routines for the Cray X1 system

 Name                 Purpose
 SITRSOL              Solves a real general sparse system, using a preconditioned conjugate gradient-like method (iterative solver)
 VBEGIN               Initializes out-of-core routine data structures
 VEND                 Handles terminal processing for the out-of-core routines
 VSTORAGE             Declares packed storage mode for a triangular, symmetric, or Hermitian virtual matrix
 SCOPY2RV (CCOPY2RV)  Copies a submatrix of a real (in memory) matrix to a virtual matrix
 SCOPY2VR (CCOPY2VR)  Copies a submatrix of a virtual matrix to a real (in memory) matrix
 VSGETRF (VCGETRF)    Computes an LU factorization of a virtual general matrix, using partial pivoting with row interchanges
 VSGETRS (VCGETRS)    Solves a system of linear equations AX = B; A is a virtual general matrix whose LU factorization has been computed by VSGETRF
 VSPOTRF              Computes the Cholesky factorization of a virtual real symmetric positive definite matrix
 VSPOTRS              Solves a system of linear equations AX = B; A is a virtual real symmetric positive definite matrix
 VSGEMM (VCGEMM)      Multiplies a virtual general matrix by a virtual general matrix
 VSTRSM (VCTRSM)      Solves a virtual triangular system of equations with multiple right-hand sides
 VSSYRK               Performs a symmetric rank-k update of a virtual symmetric matrix
LibSci performance has improved in each successive release and will continue to improve in coming releases. A more thorough report of the performance of LibSci will be published at a later date. Performance of a few routines is given here.
The Cray X1 system is a new architecture, so an understanding of how a variety of applications perform on it is still forthcoming. This information will be included in future releases of manuals and training materials. Specifically, LibSci man pages and manuals will be constantly updated with guidelines for getting the most performance from the routines.
As with the Cray PVP systems, certain memory strides on the Cray X1 system perform better than others. For the BLAS routines, the best performance is obtained when the leading dimension of two-dimensional arrays is an odd multiple of four. For the FFT routines, it suffices to use odd leading dimensions for the arrays. The following results use these optimal strides.
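These stride rules can be applied mechanically when sizing arrays. The sketch below (Python, helper names hypothetical) rounds a requested leading dimension up to the nearest odd multiple of four for the BLAS, and to the nearest odd number for the FFTs:

```python
def blas_leading_dim(n):
    """Smallest odd multiple of 4 that is >= n (i.e., 4, 12, 20, 28, ...)."""
    m = -(-n // 4)        # ceiling of n/4
    if m % 2 == 0:        # keep only odd multipliers of 4
        m += 1
    return 4 * m

def fft_leading_dim(n):
    """Smallest odd number that is >= n."""
    return n if n % 2 == 1 else n + 1

blas_leading_dim(256)   # 260: pad the leading dimension past the power of two
fft_leading_dim(256)    # 257
```

Declaring the matrix with the padded leading dimension while operating on the original n rows trades a little memory for better stride behavior.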
Table 6-1 shows the performance of SGEMV and CGEMV, the single precision real and complex matrix-vector multiply routines. Table 6-2 gives the performance of SGEMM and CGEMM, the single precision real and complex matrix-matrix multiply routines. Results are given for the 32-bit and 64-bit versions on the Cray X1 system, and for the 64-bit version on the Cray SV1ex.
Note that the 32-bit routines are only about 40 to 50% faster than the 64-bit routines on the Cray X1 system. Also note that the Cray X1 system (64-bit routines) is about eight times faster than the Cray SV1ex for the matrix-matrix multiply, and about five times faster for the matrix-vector multiply.
Further, the BLAS routines in these two tables were tuned only in Fortran, not in Cray Assembly Language (CAL); this performance was obtained from the Fortran compiler. (Currently, only a set of FFT routines is implemented in CAL. The majority of LibSci routines are written in Fortran or C.)
The performance results reported here are for the most current LibSci, to be released in PE 5.0. As the performance of routines in LibSci continues to improve, these results will soon be outdated.
Table 6-1 Performance of matrix-vector multiply in LibSci (PE 5.0, leading dimension an odd multiple of 4, MSP mode for X1 results)

            SGEMV (TRANS = 'N') (MFLOPS)        CGEMV (TRANS = 'N') (MFLOPS)
 Size N=M   X1 32-bit   X1 64-bit   SV1ex       X1 32-bit   X1 64-bit   SV1ex
 256        9769        7006        851         10572       8115        1112
 512        12031       4551        870         13463       8336        1125
 768        7567        4633        869         13151       8555        1128
 1024       7129        4646        879         13324       8695        1130
Table 6-2 Performance of matrix-matrix multiply in LibSci (PE 5.0, leading dimension an odd multiple of 4, MSP mode for X1 results)

             SGEMM (TRANS = 'N') (MFLOPS)        CGEMM (TRANS = 'N') (MFLOPS)
 Size N=M=K  X1 32-bit   X1 64-bit   SV1ex       X1 32-bit   X1 64-bit   SV1ex
 256         14850       10845       1302        16643       10677       1323
 512         15842       10970       1319        17059       11013       1330
 768         15944       11108       1326        17272       11040       1333
 1024        15315       10959       1329        17305       11037       1334
Cray has recently published results for the High Performance LINPACK benchmark. The High Performance LINPACK benchmark determines the list of the Top500 Supercomputer Sites, and the first Cray X1 systems will rank on this list in June 2003. Table 7-1 shows the submitted results for the benchmark. Note that the single-MSP result is given for reference only, and was not submitted. The benchmark performs at about 90% of the peak of the machine, and scales well.
This benchmark was implemented with a highly tuned matrix-matrix multiply routine which obtains approximately 12 GFLOPS. That routine was written in Cray Assembly Language (CAL), and is faster than the matrix-matrix multiply routine in LibSci. In future releases of LibSci, there may be matrix-matrix multiply routines implemented partially in CAL for faster performance.
Shmem was used for the communication routines in the benchmark. Currently, ScaLAPACK and the BLACS are implemented with MPI. Future releases of LibSci will have some of the communication in ScaLAPACK and the BLACS replaced with Shmem.
While the LINPACK benchmark demonstrates the computational power of the Cray X1 system, it does not represent the current performance of LibSci. The performance of LibSci will improve, but the corresponding linear solvers in LAPACK and ScaLAPACK will never run as fast as a benchmark highly tuned for a specific case.
Table 7-1 HP LINPACK performance results for the Cray X1 system (* 1-MSP result given for reference only and was not submitted)

 Processors (MSPs)   Rmax (GFLOPS)   Rpeak (GFLOPS)   Nmax     N1/2
 60                  675.5           768.0            168960   20160
 49                  550.5           627.2            150528   16128
 36                  404.3           460.8            129024   13824
 28                  318.1           358.4            114688   11302
 16                  182.3           204.8            81920    8242
 12                  137.6           153.6            73728    6294
 8                   92.4            102.4            61440    4996
 4                   46.5            51.2             41984    3048
 1*                  11.8            12.8             20992    1280
Table 7-2 compares the 8-MSP Cray X1 result against other systems of comparable computational power. The vector systems, such as the Cray X1 system and the NEC SX6, achieve over 90% of peak performance; the other systems achieve less.
Table 7-2 Comparison of similar systems to the 8-MSP Cray X1 system

 Computer         CPUs   Rmax (GFLOPS)   Rpeak (GFLOPS)   % of peak
 Cray X1          8      92.4            102.4            90%
 IBM P690 Turbo   32     91.3            166.4            55%
 HP Superdome     64     86.4            141.3            61%
 Cray T3E 1200E   112    90.4            134.0            67%
 NEC SX6          8      63.2            64.0             99%
Users should refer to LibSci documentation for further information. Each LibSci routine is documented in a man page, plus the following man pages provide introductory information: intro_libsci, intro_fft, intro_blas1, intro_blas2, intro_blas3, intro_lapack, intro_scalapack, and intro_blacs. These introductory man pages are updated frequently, and they provide current information about functionality and performance.
For more information on the differences between LibSci for the Cray X1 system and LibSci for other Cray PVP and Cray T3E systems, please refer to the Cray X1 User Environment Differences manual. This manual describes the new features of LibSci for the Cray X1 system, and lists the routines that are no longer supported.
The Migrating Applications to the Cray X1 System manual contains information useful to LibSci users. It has helpful information for calling Fortran routines from C or C++ programs, and it includes instructions for correctly linking to the 64-bit LibSci.
In later releases of the Optimizing Applications on the Cray X1 System, there will be a chapter for using LibSci. This chapter will describe how to use routines in LibSci to get the best performance.
Also, there will be a reference manual for LibSci coming later in 2003.
The following list gives helpful tips for using LibSci effectively.
· To use the 64-bit single precision routines, use the –s default64 Fortran compiler option or specify –lsci64 when linking. This links to the 64-bit version of LibSci instead of the default library. Also, use the 64-bit library if using 64-bit integers; only the 64-bit library supports them.
· Reference the intro_fft man page for instructions on how to declare the table array when using the default LibSci FFT routines. This array must always contain 64-bit words, regardless of the size of the other data types in the routine.
· If porting an application with 32-bit data types that has previously been ported to a Cray PVP or Cray T3E system, there may be CRAY macro definitions that change the size of the data types to 64 bits. To disable these definitions, use the –U CRAY compiler option. Passing 64-bit integer and real variables to the default LibSci routines will result in runtime errors.
· For best performance from the BLAS routines, set the leading dimensions of the arrays (lda, ldb, ldc, etc.) to odd multiples of four.
· For best performance from the FFT routines, use odd leading dimensions for the arrays. Reference the intro_fft man page for more information.
· There are Fortran interfaces for all LibSci routines. To call these routines from C or C++ programs, follow the standard conventions. For more information on this topic, reference the Migrating Applications to the Cray X1 System manual. (Note: There are C interfaces for the ScaLAPACK and BLACS routines.)
· Reference the Cray X1 User Environment Differences manual for lists of routines that are no longer supported in LibSci for the Cray X1 system.
More performance and functionality will be added to LibSci in releases following PE 5.0. The PE 5.1 release of LibSci will add more optimizations, especially for SSP mode support. By the end of 2003, there should be distributed memory parallel versions of the FFTs and sparse direct solvers.
There are also plans to support other numerical library ports to the Cray X1 system. More details will follow.
LibSci is the product of the Cray Scientific Libraries group. This group includes Mary Beth Hribar, manager; Bracy Elton, FFTs; Chao Yang, BLAS, LAPACK, sparse solvers and benchmarking work; and Rick Hangartner, BLAS.
Other Cray employees have also contributed to LibSci: Neal Gaarder, BLAS and CAL support; and Wendy Thrash, FFTs. In addition, two contractors have helped the LibSci effort: Kitrick Sheets, LAPACK testing; and Jim Hoekstra of Iowa State University, ScaLAPACK port and LibSci testing.