## LINPACK Benchmark Optimizations on a Virtual Processor Grid

Ed Anderson (eanderson@na-net.ornl.gov)<br>Maynard Brandt (retired)<br>Chao Yang (cwy@cray.com)



## Outline

- Organization of the Cray LINPACK Benchmark code
- Kernel optimizations on the CRAY X1
- The virtual processor grid


## The LINPACK benchmark

Solve a linear system

$$
A x=b
$$

using Gaussian elimination with partial pivoting.
The problem size and implementation are not specified.
The algorithm factors $\mathrm{A}=\mathrm{LU}$ and solves

$$
\mathrm{y}=\mathrm{L}^{-1} \mathrm{~b} ; \mathrm{x}=\mathrm{U}^{-1} \mathrm{y}
$$

Performance results are specified in
Gflop/s (billions of floating-point operations per second) or Tflop/s (trillions of floating-point operations per second)

## Block LU factorization

Right-looking algorithm:

I. Factor block column into $\mathrm{A}=\mathrm{LU}$
II. Exchange rows and update block row
III. Update the rest of the matrix using matrix multiplication

## Optimizing the panel factorization

Sub- blocking or recursion is used within the block column so that more work is done in the optimized MM kernel.


## Software choices

HPL (High Performance LINPACK)
http://www.netlib.org/benchmark/hpl

- Block-cyclic distribution
- Column-wise storage of blocks
- MPI communication

Cray's LINPACK Benchmark code

- Block-cyclic distribution
- Row-wise storage of blocks
- MPI or SHMEM communications


## 2-D block cyclic data distribution

Example on a $2 \times 3$ processor grid:

| 0 | 2 | 4 | 0 | 2 | 4 | 0 | 2 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 3 | 5 | 1 | 3 | 5 | 1 | 3 |
| 0 | 2 | 4 | 0 | 2 | 4 | 0 | 2 |
| 1 | 3 | 5 | 1 | 3 | 5 | 1 | 3 |
| 0 | 2 | 4 | 0 | 2 | 4 | 0 | 2 |
| 1 | 3 | 5 | 1 | 3 | 5 | 1 | 3 |
|  | 0 | 2 | 4 | 0 | 2 | 4 | 0 |

## Local view: ScaLAPACK/HPL

Blocks stored by columns, on processor 0:

| $\mathrm{A}_{1,1}$ | $\mathrm{~A}_{1,4}$ | $\mathrm{~A}_{1,7}$ |
| :---: | :---: | :---: |
| $\mathrm{~A}_{3,1}$ | $\mathrm{~A}_{3,4}$ | $\mathrm{~A}_{3,7}$ |
| $\mathrm{~A}_{5,1}$ | $\mathrm{~A}_{5,4}$ | $\mathrm{~A}_{5,7}$ |
| $\mathrm{~A}_{7,1}$ | $\mathrm{~A}_{7,4}$ | $\mathrm{~A}_{7,7}$ |

A(mb*nrblks, nb*ncblks)

Advantage: With abstraction of BLAS and LAPACK routines, we can maintain much of the LAPACK design.
Disadvantage: As matrix size gets large, distribution blocks get spread out through memory.

## Local view: Cray LBM code

Blocks stored by rows, processor 0:


Advantage: Distribution blocks and block rows are contiguous. Disadvantage: More indexing with the 4-D array.

## Main computation kernel

After communication of the column and row blocks, each processor performs a matrix-matrix multiplication of the form:


In ScaLAPACK/HPL, one call to SGEMM is required for this operation.

## Optimizing data layout for SGEMM

With row-wise storage of blocks, one call to SGEMM is needed to update each local block row.

(all the same block)

11

## Internals of SGEMM on T3E

Basic operation: nb-by-nb matrix times nb-by-n matrix

A

B
n

C

Innermost kernel: $12 \times n b$ matrix-vector multiply, 12-element result The nb-by-nb block of $A$ is used repeatedly and will reside in cache. The columns of $B$ are streams (if $L D B=n b, B$ is one long stream). The result vector is held in registers until combined into $C$.

## Transition of SGEMM: T3E $\rightarrow$ X1



T3E: Result vector was 12 elements long
X1: Result vector can be VL elements long (to 64)

## Transition of SGEMM: T3E $\rightarrow$ X1



## Transition of SGEMM: T3E $\rightarrow$ X1



## Transition of SGEMM: T3E $\rightarrow$ X1



## Transition of SGEMM: T3E $\rightarrow$ X1



## Transition of SGEMM: T3E $\rightarrow$ X1



## Internals of SGEMM on X1

Basic operation: nb-by-nb matrix times nb-by-n matrix

|  |  |
| :---: | :---: |
| nb |  |
|  | SSP0 |
| nb | SSP1 |
|  | SSP2 |
|  | SSP3 |

A

B
n

C

The nb-by-nb block of $A$ is used repeatedly and will reside in cache. $B$ is read 4 columns at a time and is shared by the 4 SSPs. The result vector is held in registers until combined into C .

## Prototype matrix multiply code

```
    subroutine sgemmnn( m, n, k, alpha, a, inra, b, inrb,
    &
                            beta, c, inrc )
    integer m, n, k, inra, inrb, inrc
    real alpha, beta, a(inra,*), b(inrb,*), c(inrc,*)
    integer i, js, m4
cdir$ SSP_PRIVATE dmmnn
    m4 = (m+3)/4
    do i = 0, 3
        js = min( m4, max( m-i*m4, 0 ) )
        call dmmnn( js, n, k, alpha, a(1+i*m4,1), inra,
&
                    b, inrb, beta, c(1+i*m4,1), inrc )
end do
return
end
```

Compile with:
ftn -Oaggress -03 -s default64 -c sgemmnn.f

## Optimizing the row exchanges

A row exchange is performed at each step of the block column factorization to put the largest element (in absolute value) on the diagonal. The vector IPIV records the exchanges.

Example:

$$
\begin{aligned}
& \operatorname{IPIV}(1)=20 \\
& \operatorname{IPIV}(2)=6 \\
& \operatorname{IPIV}(3)=9 \\
& \operatorname{IPIV}(4)=31 \\
& \operatorname{IPIV}(5)=5 \\
& \operatorname{IPIV}(6)=22 \\
& \operatorname{IPIV}(7)=20 \\
& \operatorname{IPIV}(8)=20
\end{aligned}
$$

## Gather/scatter indices

We can avoid synchronizing after every exchange by translating IPIV into gather and scatter permutation vectors.

| scatter | gather |
| :---: | :---: |
| $1 \rightarrow 7$ | $1 \leftarrow 20$ |
| $2 \rightarrow 22$ | $2 \leftarrow 6$ |
| $3 \rightarrow 9$ | $3 \leftarrow 9$ |
| $4 \rightarrow 31$ | $4 \leftarrow 31$ |
| $5 \rightarrow 5$ | $5 \leftarrow 5$ |
| $6 \rightarrow 2$ | $6 \leftarrow 22$ |
| $7 \rightarrow 8$ | $7 \leftarrow 1$ |
| $8 \rightarrow 20$ | $8 \leftarrow 7$ |

## Optimizing the communication

To optimize the communication, rows to be exchanged are first copied locally into a contiguous buffer.


## When a $p \times q$ grid isn't enough

Shortcomings of existing algorithm:

- Many processors are idle during column factorization.
- Not every processor count factors neatly into $N_{p}=p \times q$.

Example:

$$
\begin{aligned}
& 124=4 \times 31 \\
& 123=3 \times 41 \\
& 122=2 \times 61 \\
& 121=11 \times 11
\end{aligned}
$$

- On some systems it is better to leave one or two processors idle ( $\mathrm{N}_{\mathrm{p}}-1$ may not factor neatly).


## The Virtual Processor Grid

Generalize the 2-D grid factorization to $p \times q=k \times N_{p}$ where $k \geq 1, p \leq N_{p}$, and $\operatorname{lcm}\left(p, N_{p}\right)=k \times N_{p}$.
Example: 6 processors in a $4 \times 3$ virtual grid
$\left.\begin{array}{|l|l|l|}\hline 0 & 4 & 2 \\ \hline 1 & 5 & 3 \\ \hline 2 & 0 & 4 \\ \hline 3 & 1 & 5 \\ \hline\end{array}\right\}$

This becomes the tiling pattern for the distributed 2-D matrix

## Data structure for VPG

Recall the 4-D array used for row-wise storage of blocks: A( mb, nb, ncblks, nrblks )

Now add another dimension for the virtual processor index:
A ( mb, nb, ncblks, nrblks, nvpi )

Maintains contiguousness of distribution blocks and row blocks.
Need to add a loop over the virtual processor indices.
No extra storage except for some extra buffers for each virtual processor.

## Coding issues for VPG

Loop doesn't always go from 1 to nvpi.
Example: Send second column of following matrix across rows.

| 0 | 4 | 2 | 0 | 4 | 2 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 5 | 3 | 1 | 5 | 3 |
| 2 | 0 | 4 | 2 | 0 | 4 |
| 3 | 1 | 5 | 3 | 1 | 5 |
| 0 | 4 | 2 | 0 | 4 | 2 |
| 1 | 5 | 3 | 1 | 5 | 3 |

Send is initiated with virtual processor index 1 on $\{4,5\}$, with v.p. index 2 on $\{0,1\}$.

Recv is initiated with virtual processor index 2 on $\{2,3,4,5\}$ and with v.p. index 1 on $\{0,1\}$

do $\mathrm{i}=0$, nvpi- 1<br>\{work on block numbered<br>$1+\bmod ($ start-1+i, nvpi)\}<br>end do

27


## Summary

- Storage order of distribution blocks was optimized for the cache.
- Leading dimension was padded from 256 to 260 to optimize the matrix-multiply kernel.
- Main computational kernel uses SSP parallelism.
- Communication was optimized using SHMEM.
- No barriers! Communication was extensively overlapped with computation.
- Virtual processor grid improves parallelism of column and row operations.

