Example: Vector Code

5 multiplies, 3 adds, 2 reciprocals per iteration

For n of order 2000

vector lengths near maximum
all arrays fit in cache

For n of order 7000, data does not all fit in cache

do j=1,n
   c(j)=1.0/a(j) + f(j) + (1.0/b(j) + g(j)) * e
end do


Notes:

Let us look at a simple example that shows some of the characteristics of the cache.

The kernel on this slide vectorizes well and spreads its work evenly across the functional units.

Because the cache is write-through and write-allocate, a store to memory also brings the target line into the cache. The assignment to c(j) therefore occupies cache locations, just as the four arrays read on the right-hand side (a, b, f and g) do.

Given this, for n up to approximately 6000 we should see good performance, because all five arrays fit in the cache.
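The threshold can be estimated by counting the bytes one pass through the loop touches. Assuming 8-byte elements (the slide does not state the precision) and ignoring the scalar e, the five arrays a, b, f, g and c occupy

$$
W(n) = 5 \times 8n\ \text{bytes} = 40n\ \text{bytes}, \qquad W(2000) = 80{,}000, \quad W(6000) = 240{,}000, \quad W(7000) = 280{,}000 ,
$$

so the arrays can stay resident only while $40n \le C$, where $C$ is the cache capacity in bytes. With 4-byte elements every figure halves. This is an optimistic bound: mapping conflicts between the arrays can cause evictions before the full capacity is used.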

As n increases beyond this point, performance should begin to taper off, because cache locations are overwritten with new data before they can be reused on the next pass through the loop.
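One way to look for this taper is to time repeated passes over the kernel for a range of n and watch the time per element. The sketch below is a minimal, hypothetical harness, not the measurement used for these slides. It assumes double precision (real64) arrays, and the names vec_kernel, kernel, working_set_bytes and sweep, the list of sizes and the repetition count are all illustrative choices. Modern processors have different cache sizes and policies, so the knee will appear at a different n than on the machine described here.

```fortran
! Hypothetical harness: module, procedure names, sizes and reps are illustrative.
! Assumes 8-byte real64 elements; the slide does not state the precision.
module vec_kernel
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  private
  public :: kernel, working_set_bytes, sweep

contains

  ! The slide's loop: c(j) = 1/a(j) + f(j) + (1/b(j) + g(j)) * e
  subroutine kernel(n, a, b, f, g, e, c)
    integer, intent(in) :: n
    real(real64), intent(in) :: a(n), b(n), f(n), g(n), e
    real(real64), intent(out) :: c(n)
    integer :: j
    do j = 1, n
      c(j) = 1.0_real64/a(j) + f(j) + (1.0_real64/b(j) + g(j)) * e
    end do
  end subroutine kernel

  ! Bytes touched per pass: four input arrays plus the output array c,
  ! which also occupies cache under a write-allocate policy.
  pure function working_set_bytes(n) result(bytes)
    integer, intent(in) :: n
    integer(int64) :: bytes
    bytes = 5_int64 * int(n, int64) * storage_size(1.0_real64) / 8
  end function working_set_bytes

  ! Time repeated passes for several n and print nanoseconds per element.
  subroutine sweep()
    integer, parameter :: sizes(7) = [1000, 2000, 4000, 6000, 7000, 10000, 20000]
    integer, parameter :: reps = 2000
    real(real64), allocatable :: a(:), b(:), f(:), g(:), c(:)
    real(real64) :: e, seconds
    integer(int64) :: t0, t1, rate
    integer :: i, r, n

    e = 0.5_real64
    do i = 1, size(sizes)
      n = sizes(i)
      allocate(a(n), b(n), f(n), g(n), c(n))
      call random_number(a)
      call random_number(b)
      call random_number(f)
      call random_number(g)
      a = a + 1.0_real64            ! shift into [1,2) so 1/a(j) stays finite
      b = b + 1.0_real64
      call kernel(n, a, b, f, g, e, c)   ! warm-up pass before timing
      call system_clock(t0, rate)
      do r = 1, reps
        call kernel(n, a, b, f, g, e, c)
      end do
      call system_clock(t1)
      seconds = real(t1 - t0, real64) / real(rate, real64)
      ! Printing sum(c) keeps the result live so the loop is not discarded.
      print '(a,i6,a,i7,a,f8.3,a,es10.3)', 'n =', n, '  bytes =', working_set_bytes(n), &
            '  ns/element =', 1.0e9_real64 * seconds / (real(n, real64) * reps), &
            '  checksum =', sum(c)
      deallocate(a, b, f, g, c)
    end do
  end subroutine sweep

end module vec_kernel
```

To run it, compile the module together with a small main program that does use vec_kernel and then call sweep(). A flat time per element for small n followed by a rise once the bytes column exceeds the cache size would be the behaviour these notes predict.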