CRAY X1 SCIENTIFIC LIBRARIES
Mary Beth Hribar
Cray Inc.
411 First Ave S, Ste 600
Seattle WA 98104
ABSTRACT:
The scientific libraries provided on the Cray X1 system, LibSci, deliver tuned, high-performance numerical routines for Cray X1 applications. The LibSci interface supports Cray customers who are used to programming Cray systems as well as Cray customers who have programmed other platforms. This paper describes the functionality and performance of LibSci, including a few programming tips. It also discusses plans for future releases of LibSci.
KEYWORDS:
LibSci, BLAS, FFTs, LAPACK, ScaLAPACK, BLACS, PE, optimization
The set of scientific libraries, LibSci, is a collection of numerical routines tuned for performance on the Cray X1 system. Users call these routines to optimize the performance of applications.
The routines provided in LibSci support the various programming models and features provided by the Cray X1 System.
The Cray X1 system combines powerful vector processors with both shared and distributed memory within a highly scalable configuration. It is constructed of nodes within a node interconnection network, each of which contains four multistreaming processors (MSPs) and globally addressable shared memory. Each MSP in turn contains four single-streaming processors (SSPs), plus a 2-MB cache shared among the four SSPs. Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit.
Both 32-bit and 64-bit arithmetic are supported on the Cray X1 system. A single MSP provides 12.8 GFLOPS of peak computational power for 64-bit data, and 25.6 GFLOPS for 32-bit data. The user, however, should be aware that although 32-bit arithmetic operations execute at twice the rate of 64-bit operations, it is difficult to attain this 32-bit peak speed in actual applications, as there is less computational time with which to overlap delays due to memory latency, functional unit latency and instruction issue.
Applications may execute on the Cray X1 system either in MSP mode or in SSP mode. For programs compiled in MSP mode, the compiler controls the streaming across the SSPs contained within each MSP. In this case, the MSP is the processing unit. For programs compiled in SSP mode, there is no automatic streaming across the SSPs since the SSP is the elemental processing unit in the Cray X1 system. The decision to execute in either of these modes is made at compile time: the executable is scheduled at runtime according to how it was compiled.
LibSci is included in the Cray Programming Environment (PE), which provides the compilers, libraries and tools. The LibSci in the PE 5.0 release (June 2003) provides the following features:
· Support for MSP and SSP modes
· Support for 32-bit and 64-bit data types
· Fortran interfaces for all routines
· Single processor Fast Fourier Transform (FFT), filter and convolution routines tuned for MSP and SSP modes
· Single processor Basic Linear Algebra Subprograms (BLAS) tuned for MSP mode
· Single processor LAPACK version 3.0 (linear system and eigenvalue solvers)
· Version 1.7 ScaLAPACK (distributed memory parallel set of LAPACK routines)
· Basic Linear Algebra Communication Subprograms (BLACS)
The performance and functionality of LibSci are still evolving and not yet complete. More functionality and optimizations are planned for future releases, which will give users the most up-to-date software and improved performance.
The Cray X1 system supports 32-bit and 64-bit arithmetic in hardware. The 32-bit support represents a new development for a Cray system; thus, it is a new feature of LibSci. LibSci includes two separate sets of libraries: the first (default) library is implemented for both 32-bit and 64-bit data types; the second library is implemented for all 64-bit data types.
The default LibSci contains both 32-bit and 64-bit data types. The default LibSci is automatically available with the modules feature (that is, via the module load command), so no link options are required to use it.
The 64-bit version of LibSci contains all 64-bit data types. This library is provided to better support existing Cray customers whose programs are written with 64-bit data types. It is selected by using the –s default64 Fortran compiler option when linking, or, if using the C/C++ compiler, by specifying –lsci64 when linking.
In the PE 5.0 release of LibSci, the ScaLAPACK and BLACS routines are available only in the default library; there are no 64-bit versions of these routines. All other listed routines are available in both the default LibSci and the 64-bit version of LibSci.
The following table summarizes the data types available in each library. The Type Declaration columns list Fortran type declarations in the star format. These star formats are not affected by the –s default32 or –s default64 Fortran compiler options.
Table 2-1 Sizes of Data Types in LibSci (default) and LibSci (64-bit)

                      LibSci (default)              LibSci (64-bit)
 Data Type            Size      Type Declaration    Size      Type Declaration
 INTEGER              32 bits   INTEGER*4           64 bits   INTEGER*8
 REAL                 32 bits   REAL*4              64 bits   REAL*8
 DOUBLE PRECISION     64 bits   REAL*8              N/A       N/A
 COMPLEX              64 bits   COMPLEX*8           128 bits  COMPLEX*16
 DOUBLE COMPLEX       128 bits  COMPLEX*16          N/A       N/A
LibSci provides both single and double precision routines in the default library. Single precision routines use REAL and/or COMPLEX data types. Double precision routines use DOUBLE PRECISION and/or DOUBLE COMPLEX data types. For example, the level-1 BLAS routines which copy the contents of one vector, x, to another vector, y, are shown below.
Single precision real:

    scopy(n, x, incx, y, incy)
    integer n, incx, incy
    real x(*), y(*)

Single precision complex:

    ccopy(n, x, incx, y, incy)
    integer n, incx, incy
    complex x(*), y(*)

Double precision real:

    dcopy(n, x, incx, y, incy)
    integer n, incx, incy
    double precision x(*), y(*)

Double precision complex:

    zcopy(n, x, incx, y, incy)
    integer n, incx, incy
    double complex x(*), y(*)
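The incx and incy arguments select the stride through each vector. As a rough illustration of these semantics (a Python sketch, not the LibSci implementation, and assuming positive increments only; the BLAS also define behavior for negative increments, which this sketch omits):

```python
def strided_copy(n, x, incx, y, incy):
    """Sketch of the BLAS *copy semantics: copy n elements of x into y,
    stepping through each array by its increment (positive strides only)."""
    for i in range(n):
        y[i * incy] = x[i * incx]
    return y

# Copy every other element of x into consecutive elements of y:
x = [1.0, 9.0, 2.0, 9.0, 3.0, 9.0]
y = [0.0, 0.0, 0.0]
strided_copy(3, x, 2, y, 1)   # y becomes [1.0, 2.0, 3.0]
```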
For easy identification, single precision routine names begin with the letter ‘s’ or the letter ‘c’, and double precision routine names begin with the letter ‘d’ or the letter ‘z.’ This naming convention is a defined standard for the BLAS, LAPACK, ScaLAPACK and BLACS routines. The LibSci FFT routines also use this naming convention.
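The prefix convention can be captured in a small helper (a Python sketch for illustration only; the helper name is hypothetical, and prefixes such as the 'i' in IDAMAX, which denotes an integer result, are not covered):

```python
def blas_precision(routine):
    """Map the standard BLAS/LAPACK name prefix to (precision, domain)."""
    prefixes = {
        's': ('single', 'real'),
        'c': ('single', 'complex'),
        'd': ('double', 'real'),
        'z': ('double', 'complex'),
    }
    return prefixes[routine.lower()[0]]

blas_precision('SCOPY')   # ('single', 'real')
blas_precision('ZGEMM')   # ('double', 'complex')
```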
In LibSci for Cray PVP and Cray T3E systems, only single precision routine names were included. In the default LibSci for the Cray X1 system, the double precision names are available. The following tables list the double precision names that are now available for the BLAS, FFT, filter and convolution routines. Note that the double precision names are available for LAPACK, ScaLAPACK and BLACS routines also, but there are too many to list here.
Table 3-1 Double Precision BLAS Routines on UNICOS/mp systems

 DASUM, DZASUM        DNRM2, DZNRM2          DSYR2K, ZSYR2K
 DAXPY, ZAXPY         DROT, DROTG, ZROTG     DSYRK, ZSYRK
 DCABS1               DSBMV, ZHBMV           DTBMV, ZTBMV
 DCOPY, ZCOPY         DSCAL, ZDSCAL, ZSCAL   DTBSV, ZTBSV
 DDOT, ZDOTC, ZDOTU   DSPMV, ZHPMV           DTPMV, ZTPMV
 DGBMV, ZGBMV         DSPR, ZHPR             DTPSV, ZTPSV
 DGEMM, ZGEMM         DSPR2, ZHPR2           DTRMM, ZTRMM
 ZHEMM                DSWAP, ZSWAP           DTRMV, ZTRMV
 DGEMV, ZGEMV         DSYMM, ZSYMM           DTRSM, ZTRSM
 ZHEMV                DSYMV                  DTRSV, ZTRSV
 DGER, ZGERC, ZGERU   DSYR                   IDAMAX, IZAMAX
 ZHER, ZHER2K, ZHERK  DSYR2
Table 3-2 Double Precision FFT routines

 Dimensions                  Complex-to-complex   Real-to-complex   Complex-to-real
 One-dimensional (single)    zzfft (zfft)         dzfft             zdfft
 One-dimensional (multiple)  zzfftm (mzfft)       dzfftm            zdfftm
 Two-dimensional             zzfft2d (zfft2d)     dzfft2d           zdfft2d
 Three-dimensional           zzfft3d (zfft3d)     dzfft3d           zdfft3d
Table 3-3 Double precision filter routines

 Name       Purpose
 dfilterg   Computes a correlation of two vectors
 dfilters   Computes a correlation of two vectors (assuming the filter coefficient vector is symmetric)
 dopfilt    Solves the Wiener-Levinson linear equations

Table 3-4 Double precision convolution routines

 Name     Purpose
 zcnvl    Computes a standard complex convolution
 zcnvlf   Computes a convolution using FFTs
The single precision FFT routines in the default LibSci need special attention by the user. The table array argument must always be an array of 64-bit words, regardless of the size of the other data types within the routine. Please see the intro_fft man page for more information.
Note that the 64-bit LibSci does not contain the double precision routines. This library is intended to support applications written for Cray PVP and T3E systems and thus only needs to provide the single precision names. The single precision routines in the 64-bit library use 64-bit real and complex data types.
The single precision routines in the default LibSci and the 64-bit LibSci have the same names, but are defined to use data types of different sizes. The default LibSci is loaded automatically. Users who wish to use the 64-bit single precision routines must explicitly link to the 64-bit library by using the –s default64 Fortran compiler option or by specifying –lsci64. Failing to do so links the 32-bit single precision routines instead, which results in runtime errors.
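The underlying danger is that argument sizes no longer match what the library expects. That effect can be mimicked in a few lines of Python (an illustration of the size mismatch only, not of LibSci itself): bytes written as one 64-bit float, when read back as 32-bit floats, produce meaningless values.

```python
import struct

# A caller writes a 64-bit value...
buf = struct.pack('d', 1.0)            # 8 bytes, one IEEE double

# ...but a routine expecting 32-bit data reads 32-bit words:
as_singles = struct.unpack('ff', buf)  # two IEEE singles from the same bytes

print(as_singles)  # neither value is 1.0: the bit patterns do not line up
```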
It is difficult to predict whether an application will execute faster in MSP mode or in SSP mode. Thus, since both modes are provided, Cray recommends that applications be tested in both. In fact, the Cray benchmarking group is currently working with Cray Technical Publications to publish guidelines for determining which mode works best with which types of applications.
Using the Fortran compiler option –O ssp or the C/C++ compiler option –h ssp will compile an executable in SSP mode and link to the LibSci for SSP mode. Using the default compiler options will create an executable in MSP mode, linked to the LibSci for MSP mode.
In the PE 5.0 release (June, 2003), LibSci for SSP and MSP modes will be functional. As for performance, routines in LibSci have been tuned to perform well in MSP mode, ensuring that the computations are partitioned efficiently across the four SSPs. The FFT routines are the only routines in LibSci which have been tuned specifically for SSP mode. More performance enhancements for both MSP and SSP modes will be coming in future releases.
LibSci for the Cray X1 system does not support all routines that were provided by LibSci for the Cray PVP and Cray T3E systems. Namely, the non-standard BLAS, the sparse iterative solvers and the out-of-core solver routines are no longer supported. Tables 5-1, 5-2 and 5-3 list the level-1, level-2 and level-3 BLAS routines that are unavailable on the Cray X1 system. Table 5-4 lists the iterative and out-of-core solver routines that are no longer supported. This information is also given in the Cray X1 User Environment Differences manual.
Also, LINPACK and EISPACK routines are not included in LibSci for the Cray X1 system. Users of these routines are encouraged to update to the LAPACK routines. LAPACK is the successor to LINPACK and EISPACK and it contains faster algorithms.
Table 5-1 Unsupported level-1 BLAS routines for the Cray X1 system

 Name                 Purpose
 HAXPY (GAXPY)        Adds a scalar multiple of a real or complex vector to another real or complex vector (32-bit version)
 HDOT (GDOTC, GDOTU)  Computes a dot product (inner product) of two real or complex vectors (32-bit version)
 SAXPBY (CAXPBY)      Adds a scalar multiple of a real or complex vector to a scalar multiple of another vector
 SHAD                 Computes the Hadamard product of two vectors
 SPAXPBY              Adds a scalar multiple of a vector to a sparse vector
 SPDOT                Computes a dot product of a real vector and a sparse real vector
 CSROT                Applies a real plane rotation to a pair of complex vectors
 CROT                 Applies a real plane rotation to a pair of complex vectors
 SROTM                Applies a modified Givens plane rotation
 SROTMG               Constructs a modified Givens plane rotation
 SSUM (CSUM)          Sums the elements of a real or complex vector
Table 5-2 Unsupported level-2 BLAS routines for the Cray X1 system

 Name      Purpose
 SGESUM    Adds a scalar multiple of a real or complex matrix to a scalar multiple of another real or complex matrix
 CSPMV     Multiplies a complex vector by a complex symmetric packed matrix
 CSPR      Performs a symmetric rank-1 update of a complex symmetric packed matrix
 CSSPR12   Performs two simultaneous symmetric rank-1 updates of a real symmetric packed matrix
 CSYMV     Multiplies a complex vector by a complex symmetric matrix
 CSYR      Performs a symmetric rank-1 update of a complex symmetric matrix
 CSSYR2    Performs a symmetric rank-2 update of a real symmetric matrix
Table 5-3 Unsupported level-3 BLAS routines for the Cray X1 system

 Name             Purpose
 SCOPY2 (CCOPY2)  Copies a real matrix into another real matrix; copies a complex matrix into another complex matrix (used by the out-of-core routines)
 SGEMMS (CGEMMS)  Multiplies a real general matrix by a real general matrix; multiplies a complex general matrix by a complex general matrix; uses Strassen's algorithm
Table 5-4 Unsupported iterative and out-of-core solver routines for the Cray X1 system

 Name                 Purpose
 SITRSOL              Solves a real general sparse system, using a preconditioned conjugate gradient-like method (iterative solver)
 VBEGIN               Initializes out-of-core routine data structures
 VEND                 Handles terminal processing for the out-of-core routines
 VSTORAGE             Declares packed storage mode for a triangular, symmetric, or Hermitian virtual matrix
 SCOPY2RV (CCOPY2RV)  Copies a submatrix of a real (in memory) matrix to a virtual matrix
 SCOPY2VR (CCOPY2VR)  Copies a submatrix of a virtual matrix to a real (in memory) matrix
 VSGETRF (VCGETRF)    Computes an LU factorization of a virtual general matrix, using partial pivoting with row interchanges
 VSGETRS (VCGETRS)    Solves a system of linear equations AX = B; A is a virtual general matrix whose LU factorization has been computed by VSGETRF
 VSPOTRF              Computes the Cholesky factorization of a virtual real symmetric positive definite matrix
 VSPOTRS              Solves a system of linear equations AX = B; A is a virtual real symmetric positive definite matrix
 VSGEMM (VCGEMM)      Multiplies a virtual general matrix by a virtual general matrix
 VSTRSM (VCTRSM)      Solves a virtual triangular system of equations with multiple right-hand sides
 VSSYRK               Performs a symmetric rank-k update of a virtual symmetric matrix
LibSci performance has improved in each successive release and will continue to improve in coming releases. A more thorough report of the performance of LibSci will be published at a later date. Performance of a few routines is given here.
The Cray X1 system is a new architecture, so an understanding of how a variety of applications perform on it is still forthcoming. This information will be included in future releases of manuals and training materials. Specifically, LibSci man pages and manuals will be constantly updated with guidelines for getting the most performance from the routines.
As with the Cray PVP systems, certain memory strides on the Cray X1 system perform better than others. For the BLAS routines, the best performance is obtained when the leading dimension of two-dimensional arrays is an odd multiple of four. For the FFT routines, it suffices to use odd leading dimensions for the arrays. The following results use these optimal strides.
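These stride rules can be applied mechanically when sizing arrays. The sketch below (Python, helper names hypothetical) rounds a requested leading dimension up to the nearest odd multiple of four for the BLAS, and to the nearest odd number for the FFTs:

```python
def blas_leading_dim(n):
    """Smallest odd multiple of 4 that is >= n (i.e., 4, 12, 20, 28, ...)."""
    m = -(-n // 4)        # ceiling of n/4
    if m % 2 == 0:        # keep only odd multipliers of 4
        m += 1
    return 4 * m

def fft_leading_dim(n):
    """Smallest odd number that is >= n."""
    return n if n % 2 == 1 else n + 1

blas_leading_dim(256)   # 260: pad the leading dimension past the power of two
fft_leading_dim(256)    # 257
```

Declaring the matrix with the padded leading dimension while operating on the original n rows trades a little memory for better stride behavior.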
Table 6-1 shows the performance of SGEMV and CGEMV, the single precision real and complex matrix-vector multiply routines. Table 6-2 gives the performance of SGEMM and CGEMM, the single precision real and complex matrix-matrix multiply routines. Results are given for the 32-bit and 64-bit versions on the Cray X1 system, and for the 64-bit version on the Cray SV1ex.
Note that the 32-bit routines are only about 40 to 50% faster than the 64-bit routines on the Cray X1 system. Also note that the Cray X1 system (64-bit routines) is about eight times faster than the Cray SV1ex for the matrix-matrix multiply, and about five times faster for the matrix-vector multiply.
Further, the BLAS routines in these two tables were tuned only in Fortran, not in Cray Assembly Language (CAL); this performance was obtained from the Fortran compiler. (Currently, only a set of FFT routines is implemented in CAL. The majority of LibSci routines are written in Fortran or C.)
The performance results reported here are for the most current LibSci, to be released in PE 5.0. As the performance of routines in LibSci continues to improve, these results will soon be outdated.
Table 6-1 Performance of matrix-vector multiply in LibSci (PE 5.0, leading dimension an odd multiple of 4, MSP mode for X1 results)

            SGEMV (TRANS = 'N') (MFLOPS)        CGEMV (TRANS = 'N') (MFLOPS)
 Size N=M   X1 32-bit   X1 64-bit   SV1ex       X1 32-bit   X1 64-bit   SV1ex
 256        9769        7006        851         10572       8115        1112
 512        12031       4551        870         13463       8336        1125
 768        7567        4633        869         13151       8555        1128
 1024       7129        4646        879         13324       8695        1130
Table 6-2 Performance of matrix-matrix multiply in LibSci (PE 5.0, leading dimension an odd multiple of 4, MSP mode for X1 results)

             SGEMM (TRANS = 'N') (MFLOPS)        CGEMM (TRANS = 'N') (MFLOPS)
 Size N=M=K  X1 32-bit   X1 64-bit   SV1ex       X1 32-bit   X1 64-bit   SV1ex
 256         14850       10845       1302        16643       10677       1323
 512         15842       10970       1319        17059       11013       1330
 768         15944       11108       1326        17272       11040       1333
 1024        15315       10959       1329        17305       11037       1334
Cray has recently published results for the High Performance LINPACK benchmark. The High Performance LINPACK benchmark determines the list of the Top500 Supercomputer Sites, and the first Cray X1 systems will rank on this list in June 2003. Table 7-1 shows the submitted results for the benchmark. Note that the single-MSP result is given for reference only, and was not submitted. The benchmark performs at about 90% of the peak of the machine, and scales well.
This benchmark was implemented with a highly tuned matrix-matrix multiply routine which obtains approximately 12 GFLOPS. That routine was written in Cray Assembly Language (CAL), and is faster than the matrix-matrix multiply routine in LibSci. In future releases of LibSci, there may be matrix-matrix multiply routines implemented partially in CAL for faster performance.
Shmem was used for the communication routines in the benchmark. Currently, ScaLAPACK and the BLACS are implemented with MPI. Future releases of LibSci will have some of the communication in ScaLAPACK and the BLACS replaced with Shmem.
While the LINPACK benchmark demonstrates the computational power of the Cray X1 system, it does not represent the current performance of LibSci. The performance of LibSci will improve, but the corresponding linear solvers in LAPACK and ScaLAPACK will never run as fast as a benchmark highly tuned for a specific case.
Table 7-1 HP LINPACK performance results for the Cray X1 system (* 1-MSP result given for reference only and was not submitted)

 Processors (MSPs)   Rmax (GFLOPS)   Rpeak (GFLOPS)   Nmax     N1/2
 60                  675.5           768.0            168960   20160
 49                  550.5           627.2            150528   16128
 36                  404.3           460.8            129024   13824
 28                  318.1           358.4            114688   11302
 16                  182.3           204.8            81920    8242
 12                  137.6           153.6            73728    6294
 8                   92.4            102.4            61440    4996
 4                   46.5            51.2             41984    3048
 1*                  11.8            12.8             20992    1280
Table 7-2 compares the 8-MSP Cray X1 result against other systems of comparable computational power. The vector systems, such as the Cray X1 system and the NEC SX6, achieve over 90% of peak performance; the other systems achieve less.
Table 7-2 Comparison of similar systems to the 8-MSP Cray X1 system

 Computer         CPUs   Rmax (GFLOPS)   Rpeak (GFLOPS)   % of peak
 Cray X1          8      92.4            102.4            90%
 IBM P690 Turbo   32     91.3            166.4            55%
 HP Superdome     64     86.4            141.3            61%
 Cray T3E 1200E   112    90.4            134.0            67%
 NEC SX6          8      63.2            64.0             99%
Users should refer to LibSci documentation for further information. Each LibSci routine is documented in a man page, plus the following man pages provide introductory information: intro_libsci, intro_fft, intro_blas1, intro_blas2, intro_blas3, intro_lapack, intro_scalapack, and intro_blacs. These introductory man pages are updated frequently, and they provide current information about functionality and performance.
For more information on the differences between LibSci for the Cray X1 system and LibSci for other Cray PVP and Cray T3E systems, please refer to the Cray X1 User Environment Differences manual. This manual describes the new features of LibSci for the Cray X1 system, and lists the routines that are no longer supported.
The Migrating Applications to the Cray X1 System manual contains information useful to LibSci users. It has helpful information for calling Fortran routines from C or C++ programs, and it includes instructions for correctly linking to the 64-bit LibSci.
In later releases of the Optimizing Applications on the Cray X1 System, there will be a chapter for using LibSci. This chapter will describe how to use routines in LibSci to get the best performance.
Also, there will be a reference manual for LibSci coming later in 2003.
The following list gives helpful tips for using LibSci effectively.
· To use the 64-bit single precision routines, use the –s default64 Fortran compiler option or specify –lsci64 when linking. This links to the 64-bit version of LibSci instead of the default library. Also, use the 64-bit library if using 64-bit integers; only the 64-bit library supports them.
· Reference the intro_fft man page for instructions on how to declare the table array when using the default LibSci FFT routines. This array must always contain 64-bit words, regardless of the size of the other data types in the routine.
· If porting an application with 32-bit data types that has previously been ported to a Cray PVP or Cray T3E system, there may be CRAY macro definitions that change the size of the data types to 64 bits. To disable these definitions, use the –U CRAY compiler option. Passing 64-bit integer and real variables to the default LibSci routines will result in runtime errors.
· For best performance from the BLAS routines, set the leading dimensions of the arrays (lda, ldb, ldc, etc.) to odd multiples of four.
· For best performance from the FFT routines, use odd leading dimensions for the arrays. Reference the intro_fft man page for more information.
· There are Fortran interfaces for all LibSci routines. To call these routines from C or C++ programs, follow the standard conventions. For more information on this topic, reference the Migrating Applications to the Cray X1 System manual. (Note: There are C interfaces for the ScaLAPACK and BLACS routines.)
· Reference the Cray X1 User Environment Differences manual for lists of routines that are no longer supported in LibSci for the Cray X1 system.
More performance and functionality will be added to LibSci in releases following PE 5.0. The PE 5.1 release of LibSci will add more optimizations, especially for SSP mode support. By the end of 2003, there should be distributed memory parallel versions of the FFTs and sparse direct solvers.
There are also plans to support other numerical library ports to the Cray X1 system. More details will follow.
LibSci is the product of the Cray Scientific Libraries group. This group includes Mary Beth Hribar, manager; Bracy Elton, FFTs; Chao Yang, BLAS, LAPACK, sparse solvers and benchmarking work; and Rick Hangartner, BLAS.
Other Cray employees have also contributed to LibSci: Neal Gaarder, BLAS and CAL support; and Wendy Thrash, FFTs. In addition, two contractors have helped the LibSci effort: Kitrick Sheets, LAPACK testing; and Jim Hoekstra of Iowa State University, ScaLAPACK port and LibSci testing.