CRAY X1 SCIENTIFIC LIBRARIES

 

 

Mary Beth Hribar

Cray Inc.

411 First Ave S, Ste 600

Seattle WA 98104

 

marybeth@cray.com

 

ABSTRACT:

LibSci, the set of scientific libraries provided on the Cray X1 system, supplies numerical routines tuned for the performance of Cray X1 applications. The LibSci interface supports Cray customers who are accustomed to programming Cray systems as well as those who have programmed other platforms. This paper describes the functionality and performance of LibSci, including a few programming tips, and discusses plans for future releases of LibSci.

 

KEYWORDS:

LibSci, BLAS, FFTs, LAPACK, ScaLAPACK, BLACS, PE, optimization

 

1         Introduction

The set of scientific libraries, LibSci, is a collection of numerical routines tuned for performance on the Cray X1 system. Users call these routines to optimize the performance of applications.

The routines provided in LibSci support the various programming models and features provided by the Cray X1 System.

1.1      Brief Overview of the Cray X1 System

The Cray X1 system combines powerful vector processors with both shared and distributed memory within a highly scalable configuration. It is constructed of nodes within a node interconnection network, each of which contains four multistreaming processors (MSPs) and globally addressable shared memory. Each MSP in turn contains four single-streaming processors (SSPs), plus a 2-MB cache shared among the four SSPs. Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit.

Both 32-bit and 64-bit arithmetic are supported on the Cray X1 system. A single MSP provides 12.8 GFLOPS of peak computational power for 64-bit data and 25.6 GFLOPS for 32-bit data. The user should be aware, however, that although 32-bit arithmetic operations execute at twice the rate of 64-bit operations, it is difficult to attain this 32-bit peak speed in actual applications because there is less computational time in which to overlap delays due to memory latency, functional unit latency and instruction issue.

Applications may execute on the Cray X1 system either in MSP mode or in SSP mode. For programs compiled in MSP mode, the compiler controls the streaming across the SSPs contained within each MSP. In this case, the MSP is the processing unit. For programs compiled in SSP mode, there is no automatic streaming across the SSPs since the SSP is the elemental processing unit in the Cray X1 system. The decision to execute in either of these modes is made at compile time: the executable is scheduled at run-time according to how it was compiled.

1.2      Summary of LibSci Features

LibSci is included in the Cray Programming Environment (PE) which provides the compilers, libraries and tools. The LibSci provided in PE release 5.0 (June 2003) provides the following features:

- Support for MSP and SSP modes

- Support for 32-bit and 64-bit data types

- Fortran interfaces for all routines

- Single processor Fast Fourier Transform (FFT), filter and convolution routines tuned for MSP and SSP modes

- Single processor Basic Linear Algebra Subprograms (BLAS) tuned for MSP mode

- Single processor version 3.0 LAPACK (linear system and eigenvalue solvers)

- Version 1.7 ScaLAPACK (distributed memory parallel set of LAPACK routines)

- Basic Linear Algebra Communication Subprograms (BLACS)

The performance and functionality of LibSci are evolving and not yet complete. More functionality and optimizations are planned for future releases, which will give users the most up-to-date software and improved performance.

2         Data Types in LibSci

The Cray X1 system supports 32-bit and 64-bit arithmetic in hardware. The 32-bit support is a new development for a Cray system and thus a new feature of LibSci. LibSci includes two separate libraries: the first (default) library is implemented for both 32-bit and 64-bit data types; the second is implemented entirely for 64-bit data types.

The default LibSci contains both 32-bit and 64-bit data types. It is automatically available with the modules feature (that is, via the module load command), so no link options are required to use it.

The 64-bit version of LibSci contains all 64-bit data types. This library is provided to better support existing Cray customers whose programs are written with 64-bit data types. It is available by using the -s default64 Fortran compiler option when linking or, if using the C/C++ compiler, by specifying -lsci64 when linking.
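For example, assuming the standard ftn and cc compiler drivers and illustrative source file names, the 64-bit library would be selected as follows:

```shell
# Fortran: select the 64-bit LibSci by compiling and linking with -s default64
ftn -s default64 -o myprog myprog.f90

# C/C++: link explicitly against the 64-bit library
cc -o myprog myprog.c -lsci64
```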

In the PE 5.0 release of LibSci, ScaLAPACK and BLACS routines are available only in the default library. There are no 64-bit versions of these routines. All other listed routines are available in both the default LibSci and the 64-bit version of LibSci.

The following table summarizes the data types available in each library. Type Declaration lists Fortran type declarations in the star (*) format. These star formats are not affected by the -s default32 or -s default64 Fortran compiler options.

 

Table 2-1 Sizes of Data Types in LibSci (default) and LibSci (64-bit)

                    LibSci (default)             LibSci (64-bit)
Data Type           Size       Type Declaration  Size       Type Declaration
INTEGER             32 bits    INTEGER*4         64 bits    INTEGER*8
REAL                32 bits    REAL*4            64 bits    REAL*8
DOUBLE PRECISION    64 bits    REAL*8            N/A        N/A
COMPLEX             64 bits    COMPLEX*8         128 bits   COMPLEX*16
DOUBLE COMPLEX      128 bits   COMPLEX*16        N/A        N/A

 

3         Single and Double Precision Routines

LibSci provides both single and double precision routines in the default library. Single precision routines use REAL and/or COMPLEX data types; double precision routines use DOUBLE PRECISION and/or DOUBLE COMPLEX data types. For example, the level-1 BLAS routines that copy the contents of one vector, x, to another vector, y, are defined below.

Single precision real:

scopy (n, x, incx, y, incy)

integer n, incx, incy

real x, y

Single precision complex:

ccopy (n, x, incx, y, incy)

integer n, incx, incy

complex x, y

Double precision real:

dcopy (n, x, incx, y, incy)

integer n, incx, incy

double precision x, y

Double precision complex:

zcopy (n, x, incx, y, incy)

integer n, incx, incy

double complex x, y

For easy identification, single precision routine names begin with the letter s or the letter c, and double precision routine names begin with the letter d or the letter z. This naming convention is a defined standard for the BLAS, LAPACK, ScaLAPACK and BLACS routines. The LibSci FFT routines also use this naming convention.

In LibSci for Cray PVP and Cray T3E systems, only the single precision routine names were included. In the default LibSci for the Cray X1 system, the double precision names are also available. The following tables list the double precision names now available for the BLAS, FFT, filter and convolution routines. Note that double precision names are also available for the LAPACK, ScaLAPACK and BLACS routines, but there are too many to list here.

 

Table 3-1 Double Precision BLAS Routines on UNICOS/mp systems

DASUM, DZASUM          DNRM2, DZNRM2           DSYR2K, ZSYR2K
DAXPY, ZAXPY           DROT, DROTG, ZROTG      DSYRK, ZSYRK
DCABS1                 DSBMV, ZHBMV            DTBMV, ZTBMV
DCOPY, ZCOPY           DSCAL, ZDSCAL, ZSCAL    DTBSV, ZTBSV
DDOT, ZDOTC, ZDOTU     DSPMV, ZHPMV            DTPMV, ZTPMV
DGBMV, ZGBMV           DSPR, ZHPR              DTPSV, ZTPSV
DGEMM, ZGEMM           DSPR2, ZHPR2            DTRMM, ZTRMM
ZHEMM                  DSWAP, ZSWAP            DTRMV, ZTRMV
DGEMV, ZGEMV           DSYMM, ZSYMM            DTRSM, ZTRSM
ZHEMV                  DSYMV                   DTRSV, ZTRSV
DGER, ZGERC, ZGERU     DSYR                    IDAMAX, IZAMAX
ZHER, ZHER2K, ZHERK    DSYR2

 

 

 

 

Table 3-2 Double Precision FFT routines

Dimensions                   Complex-to-complex   Real-to-complex   Complex-to-real
One-dimensional (single)     zzfft (zfft)         dzfft             zdfft
One-dimensional (multiple)   zzfftm (mzfft)       dzfftm            zdfftm
Two-dimensional              zzfft2d (zfft2d)     dzfft2d           zdfft2d
Three-dimensional            zzfft3d (zfft3d)     dzfft3d           zdfft3d

 

Table 3-3 Double precision filter routines

Name        Purpose
dfilterg    Computes a correlation of two vectors
dfilters    Computes a correlation of two vectors (assuming the filter coefficient vector is symmetric)
dopfilt     Solves the Wiener-Levinson linear equations

 

Table 3-4 Double precision convolution routines

Name      Purpose
zcnvl     Computes a standard complex convolution
zcnvlf    Computes a convolution using FFTs

The single precision FFT routines in the default LibSci require special attention from the user: the table array argument must always be an array of 64-bit words, regardless of the size of the other data types within the routine. Please see the intro_fft man page for more information.

Note that the 64-bit LibSci does not contain the double precision routines. This library is intended to provide support for applications written for Cray PVP and T3E systems and thus only needs to provide the single precision names. The single precision routines in the 64-bit library use 64-bit real and complex data types.

The single precision routines in the default LibSci and the 64-bit LibSci share the same names but are defined to use data types of different sizes. The default LibSci is loaded automatically. Users who wish to use the 64-bit single precision routines must explicitly link to the 64-bit library by using the -s default64 Fortran compiler option or by specifying -lsci64. Failing to do so will link in the 32-bit single precision routines, which will cause run-time errors.

4         MSP and SSP Modes

It is difficult to predetermine whether an application will execute faster in MSP mode or SSP mode. Thus, as both modes are provided to users, Cray recommends that applications be tested in both. In fact, the Cray benchmarking group is currently working with Cray Technical Publications to publish guidelines for determining which mode works best with which types of applications.

Using the Fortran compiler option -O ssp or the C/C++ compiler option -h ssp compiles an executable in SSP mode and links to the LibSci for SSP mode. Using the default compiler options creates an executable in MSP mode, linked to the LibSci for MSP mode.
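As a sketch, again assuming the standard ftn and cc drivers and illustrative file names:

```shell
# SSP mode: compile and link with the SSP libraries
ftn -O ssp -o prog_ssp prog.f90     # Fortran
cc  -h ssp -o prog_ssp prog.c       # C/C++

# MSP mode (the default): no mode option required
ftn -o prog_msp prog.f90
```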

In the PE 5.0 release (June, 2003), LibSci for SSP and MSP modes will be functional. As for performance, routines in LibSci have been tuned to perform well in MSP mode, ensuring that the computations are partitioned efficiently across the four SSPs. The FFT routines are the only routines in LibSci which have been tuned specifically for SSP mode. More performance enhancements for both MSP and SSP modes will be coming in future releases.

5         Routines No Longer Supported in LibSci

LibSci for the Cray X1 system does not support all routines that were provided by LibSci for the Cray PVP and Cray T3E systems. Namely, the non-standard BLAS, the sparse iterative solvers and out-of-core solver routines are no longer supported. Tables 5-1, 5-2 and 5-3 list the level-1, level-2 and level-3 BLAS routines that are unavailable on the Cray X1 system. Table 5-4 lists the iterative and out-of-core solver routines that are no longer supported on the Cray X1 system. This information is also given in the Cray X1 User Environment Difference manual.

 

Also, LINPACK and EISPACK routines are not included in LibSci for the Cray X1 system. Users of these routines are encouraged to update to the LAPACK routines. LAPACK is the successor to LINPACK and EISPACK and it contains faster algorithms.

Table 5-1 Unsupported level-1 BLAS routines for the Cray X1 system

Name                   Purpose
HAXPY (GAXPY)          Adds a scalar multiple of a real or complex vector to another real or complex vector (32-bit version)
HDOT (GDOTC, GDOTU)    Computes a dot product (inner product) of two real or complex vectors (32-bit version)
SAXPBY (CAXPBY)        Adds a scalar multiple of a real or complex vector to a scalar multiple of another vector
SHAD                   Computes the Hadamard product of two vectors
SPAXPBY                Adds a scalar multiple of a vector to a sparse vector
SPDOT                  Computes a dot product of a real vector and a sparse real vector
CSROT                  Applies a real plane rotation to a pair of complex vectors
CROT                   Applies a real plane rotation to a pair of complex vectors
SROTM                  Applies a modified Givens plane rotation
SROTMG                 Constructs a modified Givens plane rotation
SSUM (CSUM)            Sums the elements of a real or complex vector

 

Table 5-2 Unsupported level-2 BLAS routines for the Cray X1 system

Name       Purpose
SGESUM     Adds a scalar multiple of a real or complex matrix to a scalar multiple of another real or complex matrix
CSPMV      Multiplies a complex vector by a complex symmetric packed matrix
CSPR       Performs a symmetric rank 1 update of a complex symmetric packed matrix
CSSPR12    Performs two simultaneous symmetric rank 1 updates of a real symmetric packed matrix
CSYMV      Multiplies a complex vector by a complex symmetric matrix
CSYR       Performs a symmetric rank 1 update of a complex symmetric matrix
CSSYR2     Performs a symmetric rank 2 update of a real symmetric matrix

 

Table 5-3 Unsupported level-3 BLAS routines for the Cray X1 system

Name               Purpose
SCOPY2 (CCOPY2)    Copies a real matrix into another real matrix; copies a complex matrix into another complex matrix (used by the out-of-core routines)
SGEMMS (CGEMMS)    Multiplies a real general matrix by a real general matrix; multiplies a complex general matrix by a complex general matrix; uses Strassen's algorithm

 

Table 5-4 Unsupported iterative and out-of-core solver routines for the Cray X1 system

Name                   Purpose
SITRSOL                Solves a real general sparse system, using a preconditioned conjugate gradient-like method (iterative solver)
VBEGIN                 Initializes out-of-core routine data structures
VEND                   Handles terminal processing for the out-of-core routines
VSTORAGE               Declares packed storage mode for a triangular, symmetric, or Hermitian virtual matrix
SCOPY2RV (CCOPY2RV)    Copies a submatrix of a real (in memory) matrix to a virtual matrix
SCOPY2VR (CCOPY2VR)    Copies a submatrix of a virtual matrix to a real (in memory) matrix
VSGETRF (VCGETRF)      Computes an LU factorization of a virtual general matrix, using partial pivoting with row interchanges
VSGETRS (VCGETRS)      Solves a system of linear equations AX = B, where A is a virtual general matrix whose LU factorization has been computed by VSGETRF
VSPOTRF                Computes the Cholesky factorization of a virtual real symmetric positive definite matrix
VSPOTRS                Solves a system of linear equations AX = B, where A is a virtual real symmetric positive definite matrix
VSGEMM (VCGEMM)        Multiplies a virtual general matrix by a virtual general matrix
VSTRSM (VCTRSM)        Solves a virtual triangular system of equations with multiple right-hand sides
VSSYRK                 Performs a symmetric rank-k update of a virtual symmetric matrix

6         Performance of LibSci

LibSci performance has improved in each successive release and will continue to improve in coming releases. A more thorough report on the performance of LibSci will be published at a later date; the performance of a few routines is given here.

 

The Cray X1 system is a new architecture, so an understanding of how a variety of applications perform on it is still forthcoming. This information will be included in future releases of manuals and training materials. Specifically, LibSci man pages and manuals will be constantly updated with guidelines for getting the most performance from the routines.

 

 

As with the Cray PVP systems, certain memory strides perform better than others on the Cray X1 system. For the BLAS routines, the best strides are obtained when the leading dimension of two-dimensional arrays is an odd multiple of four. For the FFT routines, it suffices to use odd leading dimensions for the arrays. The following results use these optimal strides.

 

Table 6-1 shows the performance of SGEMV and CGEMV, the single precision real and complex matrix-vector multiply routines. Table 6-2 gives the performance of SGEMM and CGEMM, the single precision real and complex matrix-matrix multiply routines. Results are given for the 32-bit and 64-bit versions on the Cray X1 system and for the 64-bit version on the Cray SV1ex.

 

Note that the 32-bit routines are only about 40-50% faster than the 64-bit routines on the Cray X1 system. Also note that the Cray X1 system (64-bit routines) is about eight times faster than the Cray SV1ex for the matrix-matrix multiply and about five times faster for the matrix-vector multiply.

 

Further, the BLAS routines in LibSci for the Cray X1 system shown in these two tables were tuned only in Fortran, not in Cray Assembly Language (CAL); this performance was therefore obtained from the Fortran compiler. (Currently, only a set of FFT routines is implemented in CAL. The majority of LibSci routines are written in Fortran or C.)

 

The performance results reported here are for the most current LibSci, to be released in PE 5.0. As the performance of routines in LibSci continues to improve, these results will soon be outdated.

Table 6-1 Performance of matrix-vector multiply in LibSci (PE 5.0, leading dimension odd multiple of 4, MSP mode for X1 results)

            SGEMV (TRANS = N) (MFLOPS)        CGEMV (TRANS = N) (MFLOPS)
Size N=M    X1 32-bit   X1 64-bit   SV1ex     X1 32-bit   X1 64-bit   SV1ex
256         9769        7006        851       10572       8115        1112
512         12031       4551        870       13463       8336        1125
768         7567        4633        869       13151       8555        1128
1024        7129        4646        879       13324       8695        1130

 

Table 6-2 Performance of matrix-matrix multiply in LibSci (PE 5.0, leading dimension odd multiple of 4, MSP mode for X1 results)

              SGEMM (TRANS = N) (MFLOPS)        CGEMM (TRANS = N) (MFLOPS)
Size N=M=K    X1 32-bit   X1 64-bit   SV1ex     X1 32-bit   X1 64-bit   SV1ex
256           14850       10845       1302      16643       10677       1323
512           15842       10970       1319      17059       11013       1330
768           15944       11108       1326      17272       11040       1333
1024          15315       10959       1329      17305       11037       1334

7         LINPACK Benchmark Results

Cray has recently published results for the High Performance LINPACK benchmark, which determines the rankings of the TOP500 Supercomputer Sites list. The first Cray X1 systems will appear on this list in June 2003.

 

Table 7-1 shows the submitted results for the benchmark. Note that the single-MSP result is given for reference only and was not submitted. The benchmark performs at about 90% of the machine's peak, and scales well.

 

This benchmark was implemented with a highly tuned matrix-matrix multiply routine that obtains approximately 12 GFLOPS. That routine was written in Cray Assembly Language (CAL) and is faster than the matrix-matrix multiply routine in LibSci. In future releases of LibSci, matrix-matrix multiply routines may be implemented partially in CAL for faster performance.

 

SHMEM was used for the communication routines in the benchmark. Currently, ScaLAPACK and the BLACS are implemented with MPI; future releases of LibSci will replace some of the communication in ScaLAPACK and the BLACS with SHMEM.

 

While the LINPACK benchmark demonstrates the computational power of the Cray X1 system, it does not represent the current performance of LibSci. The performance of LibSci will improve, but the corresponding linear solvers in LAPACK and ScaLAPACK will never run as fast as a benchmark highly tuned for a specific case.

 

Table 7-1 HP LINPACK performance results for the Cray X1 system (* 1-MSP results given for reference only and were not submitted)

Processors (MSPs)   Rmax (GFLOPS)   Rpeak (GFLOPS)   Nmax     N1/2
60                  675.5           768.0            168960   20160
49                  550.5           627.2            150528   16128
36                  404.3           460.8            129024   13824
28                  318.1           358.4            114688   11302
16                  182.3           204.8            81920    8242
12                  137.6           153.6            73728    6294
8                   92.4            102.4            61440    4996
4                   46.5            51.2             41984    3048
1*                  11.8            12.8             20992    1280

 

Table 7-2 gives comparative performance results for an 8-MSP Cray X1 system against other systems of comparable computational power. Vector systems like the Cray X1 system and the NEC SX-6 achieve over 90% of peak performance; the other systems achieve less.

 

Table 7-2 Comparison of similar systems to an 8-MSP Cray X1 system

Computer          CPUs   Rmax (GFLOPS)   Rpeak (GFLOPS)   % of peak
Cray X1           8      92.4            102.4            90%
IBM P690 Turbo    32     91.3            166.4            55%
HP Superdome      64     86.4            141.3            61%
Cray T3E 1200E    112    90.4            134.0            67%
NEC SX-6          8      63.2            64.0             99%

 

8         Documentation

Users should refer to LibSci documentation for further information. Each LibSci routine is documented in a man page, and the following man pages provide introductory information: intro_libsci, intro_fft, intro_blas1, intro_blas2, intro_blas3, intro_lapack, intro_scalapack, and intro_blacs. These introductory man pages are updated frequently, and they provide current information about functionality and performance.

 

For more information on the differences between LibSci for the Cray X1 system and LibSci for other Cray PVP and Cray T3E systems, please refer to the Cray X1 User Environment Difference manual. This manual describes the new features of LibSci for the Cray X1 system, and lists the routines that are no longer supported.

 

The Migrating Applications to the Cray X1 System manual contains useful information for LibSci users, including guidance on calling Fortran routines from C or C++ programs and on correctly linking to the 64-bit LibSci.

 

Later releases of the Optimizing Applications on the Cray X1 System manual will include a chapter on using LibSci, describing how to use its routines to get the best performance.

 

Also, a reference manual for LibSci will be available later in 2003.

9         Tips for Using LibSci

The following list gives helpful tips for using LibSci effectively.

 

- To use the 64-bit single precision routines, use the -s default64 Fortran compiler option or specify -lsci64 when linking. This links to the 64-bit version of LibSci instead of the default library. Also use the 64-bit library if using 64-bit integers; only the 64-bit library supports them.

- Reference the intro_fft man page for instructions on how to declare the table array when using the default LibSci FFT routines. This array must always contain 64-bit words, regardless of the size of the other data types in the routine.

- When porting to the Cray X1 system an application that has 32-bit data types and has previously been ported to a Cray PVP or Cray T3E system, there may be CRAY macro definitions that change the size of the data types to 64 bits. To disable such a definition, use the -U CRAY compiler option. Using 64-bit integer and real variables when calling the default LibSci routines will result in run-time errors.

- For the best performance of the BLAS routines, set the leading dimensions of the arrays (lda, ldb, ldc, etc.) to odd multiples of four.

- For the best performance of the FFT routines, use odd leading dimensions in the arrays. Reference the intro_fft man page for more information.

- There are Fortran interfaces for all LibSci routines. To call these routines from C or C++ programs, follow the standard conventions; for more information, reference the Migrating Applications to the Cray X1 System manual. (Note: there are C interfaces for the ScaLAPACK and BLACS routines.)

- Reference the Cray X1 User Environment Difference manual for lists of routines that are no longer supported in LibSci for the Cray X1 system.

10   Future Plans for LibSci

More performance and functionality will be added to LibSci in releases following PE 5.0. The PE 5.1 release of LibSci will add more optimizations, especially for SSP mode. By the end of 2003, there should be distributed memory parallel versions of the FFTs and sparse direct solvers.

 

There are also plans to support other numerical library ports to the Cray X1 system. More details will follow.

11   Acknowledgements

LibSci is the product of the Cray Scientific Libraries group. This group includes Mary Beth Hribar, manager; Bracy Elton, FFTs; Chao Yang, BLAS, LAPACK, sparse solvers and benchmarking work; and Rick Hangartner, BLAS.

Other Cray employees have also contributed to LibSci: Neal Gaarder, BLAS and CAL support; Wendy Thrash, FFTs. Two contractors have also helped the LibSci effort: Kitrick Sheets, LAPACK testing; and Jim Hoekstra of Iowa State University, ScaLAPACK port and LibSci testing.