Some Lessons
MSP performance is often as good as or better than that of the T90
For long-vector code, MSP execution can approach a factor-of-four speedup over SSP
- Even more in some cases, due to larger effective cache
Can be 30% - 200% better than "naïve" autotasking over four SSPs
- Tighter hardware integration
- Less contention for memory bandwidth
Single-SSP loop nest optimizations generally still seem to be best
- Must think about the cache!
Eliminate conditions that (currently) inhibit streaming, e.g.
- Data dependencies
- I/O
- Reduction operations
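
As an illustration of the last item, here is a minimal sketch in C of one common way to restructure a reduction so that its iterations no longer depend on a single running total. It is not taken from the slides or from any Cray compiler documentation: the function names and the `NSTREAMS` constant are hypothetical, with the value 4 chosen only to mirror the four SSPs in an MSP. Whether a given compiler needs this rewrite, or performs it automatically, is an assumption to check against the compiler's own loop report.

```c
#include <stddef.h>

/* Hypothetical: number of independent partial sums, chosen here to
 * mirror the four SSPs that make up one MSP. */
#define NSTREAMS 4

/* Straightforward reduction: every iteration updates the same scalar,
 * so each iteration depends on the one before it. */
double sum_serial(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Restructured reduction: the index range is split into NSTREAMS
 * contiguous chunks, each with its own accumulator. The chunks are
 * independent of one another, and the only serial work left is the
 * short final combine. */
double sum_partial(const double *x, size_t n)
{
    double part[NSTREAMS] = {0.0};
    size_t chunk = (n + NSTREAMS - 1) / NSTREAMS;  /* ceiling division */

    for (size_t k = 0; k < NSTREAMS; k++) {
        size_t lo = k * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)   /* empty when lo >= n */
            part[k] += x[i];
    }

    double s = 0.0;
    for (size_t k = 0; k < NSTREAMS; k++)
        s += part[k];
    return s;
}
```

Note that the two versions add the elements in a different order, so for general floating-point data the results can differ in the last few bits.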