Single Processor Performance Issues: Lessons Learned
The SV1 is still a vector machine!
Basic cache optimization rules apply:
- best to structure application so that data fits in cache (cache blocking)
- best to access data with small memory strides
- especially avoid strides that are a large power of two
Probably best to focus on these first, rather than on long vector lengths (VLs)
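A minimal sketch of cache blocking, using a tiled matrix transpose as the example. The function name and the tile edge `BLOCK` are hypothetical and chosen only for illustration; the right tile size depends on the cache and would have to be tuned.

```c
#include <stddef.h>

/* Hypothetical tile edge (an assumption, not an SV1 figure): pick it so
 * that one source tile plus one destination tile fit in cache. */
#define BLOCK 32

/* Transpose the rows x cols row-major matrix src into the cols x rows
 * row-major matrix dst, one BLOCK x BLOCK tile at a time. Within a tile
 * the large-stride accesses to dst stay inside a small region, so data
 * brought into cache is reused before it is evicted. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols)
{
    for (size_t ib = 0; ib < rows; ib += BLOCK) {
        for (size_t jb = 0; jb < cols; jb += BLOCK) {
            /* Clip the tile at the matrix edges. */
            size_t imax = (ib + BLOCK < rows) ? ib + BLOCK : rows;
            size_t jmax = (jb + BLOCK < cols) ? jb + BLOCK : cols;
            for (size_t i = ib; i < imax; i++) {
                for (size_t j = jb; j < jmax; j++) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```

The inner loop reads `src` with stride one; only the writes to `dst` are strided, and those are confined to the current tile.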
Unlike on caches with line sizes greater than one word, thrashing here is no worse than computing without the cache
The cache can hide the effects of bank conflicts
- Bank conflicts are also at their worst for power-of-two strides
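Why power-of-two strides are the worst case can be seen in a simplified model, sketched below. It assumes word-interleaved memory where a word's bank is its address modulo the bank count; the bank count used in the usage note is hypothetical, and the helper names are invented for illustration.

```c
#include <stddef.h>

/* Greatest common divisor (Euclid). */
static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks touched by a constant-stride access stream,
 * under the assumed model bank = address % nbanks. */
size_t banks_touched(size_t stride, size_t nbanks)
{
    return nbanks / gcd_size(stride % nbanks, nbanks);
}

/* Hypothetical helper: pad a leading dimension to an odd value, so the
 * resulting stride shares no factor with a power-of-two bank count. */
size_t pad_leading_dim(size_t n)
{
    return (n % 2 == 0) ? n + 1 : n;
}
```

With an assumed 64 banks, a stride of 1024 words maps every access to a single bank, while padding the leading dimension to 1025 spreads the same access pattern over all 64.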
Acceptable cache hit rates are in the 40-60% range, depending on application
Many codes can achieve 50% of T90 performance
Notes:
- Program for vectorization
- Optimize for best cache use to maximize memory bandwidth
- Floating-point performance scales with memory bandwidth