Lessons Learned
MSP performance is often as good as or better than the T90
MSP execution can approach a factor-of-four speedup over a single SSP
- Even more in some cases, due to larger effective cache
MSP mode can be 30%-200% faster than “naïve” autotasking over four SSPs
- Tighter hardware integration
- Less contention for memory bandwidth
Loop nest optimizations developed for a single SSP generally seem to work best
Must think about the cache!
Eliminate conditions that (currently) inhibit streaming, e.g.
- Data dependencies
- I/O
- Reduction operations
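
As an illustration of the reduction item above, the sketch below shows one common way to restructure a sum so that the outer loop has no cross-iteration dependence. It is a minimal example of the general technique only, not the compiler's own transformation: `NUM_STREAMS`, `sum_serial`, and `sum_partial` are hypothetical names, and the choice of four chunks is an assumption meant to mirror the four SSPs in an MSP.

```c
#include <stddef.h>

/* Assumption: one partial sum per SSP in an MSP. */
#define NUM_STREAMS 4

/* Straightforward reduction: every iteration updates the same scalar,
 * so iterations cannot be handed out independently. */
double sum_serial(const double *x, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
}

/* Restructured reduction: the array is split into NUM_STREAMS contiguous
 * chunks, each accumulating into its own partial sum. The iterations of
 * the outer loop over s are independent of one another; only the short
 * final loop combines the partial results. */
double sum_partial(const double *x, size_t n)
{
    double partial[NUM_STREAMS] = {0.0};
    size_t chunk = (n + NUM_STREAMS - 1) / NUM_STREAMS;  /* ceiling division */

    for (int s = 0; s < NUM_STREAMS; s++) {
        size_t lo = (size_t)s * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)  /* empty when lo >= n */
            partial[s] += x[i];
    }

    double total = 0.0;
    for (int s = 0; s < NUM_STREAMS; s++)
        total += partial[s];
    return total;
}
```

Note that the two versions add the elements in a different order, so for general floating-point data the results may differ in the last bits. Whether a given compiler streams the outer loop of `sum_partial` still depends on its own analysis and directives.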