Lessons Learned

MSP execution can approach a factor-of-four speedup over a single SSP
- Even more in some cases, due to larger effective cache
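
One way to see why the speedup approaches four rather than reaching it is an Amdahl's-law estimate, sketched below. This is an assumed model, not something measured here: the function name `estimated_msp_speedup` and the sample fractions are hypothetical, and the model ignores the larger effective cache, which is what can push real codes past four.

```python
def estimated_msp_speedup(streamed_fraction, ssps=4):
    """Amdahl's-law estimate of MSP speedup over one SSP.

    streamed_fraction is the (assumed) share of single-SSP run time
    spent in code that streams across all SSPs; the rest stays serial.
    """
    if not 0.0 <= streamed_fraction <= 1.0:
        raise ValueError("streamed_fraction must be between 0 and 1")
    serial_fraction = 1.0 - streamed_fraction
    # Only the streamed part is divided among the SSPs.
    return 1.0 / (serial_fraction + streamed_fraction / ssps)


if __name__ == "__main__":
    # Hypothetical fractions, chosen only to show the trend.
    for fraction in (0.50, 0.90, 0.99, 1.00):
        speedup = estimated_msp_speedup(fraction)
        print(f"{fraction:.2f} streamed -> {speedup:.2f}x")
```

Under this model, even a small unstreamed remainder keeps the estimate noticeably below four, which is why removing the inhibitors of streaming matters.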

MSP execution can be 30% to 200% faster than “naïve” autotasking over four SSPs
- Tighter hardware integration
- Less contention for memory bandwidth

Eliminate conditions that (currently) inhibit streaming, e.g.
- Data dependencies
- I/O
- Reduction operations
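
A reduction is a useful example of such a condition: a single accumulator makes every iteration depend on the one before it. A common restructuring, sketched below under stated assumptions, is to keep one partial sum per stream and combine them at the end. Python stands in here for the Fortran or C loop; the names `serial_sum` and `partial_sums` are hypothetical, the four-way split simply mirrors the four SSPs, and whether a given compiler then streams the loop is not something this sketch demonstrates.

```python
def serial_sum(values):
    """Single accumulator: each iteration depends on the previous one."""
    total = 0.0
    for v in values:
        total += v
    return total


def partial_sums(values, streams=4):
    """One accumulator per stream; the outer iterations are independent."""
    partials = [0.0] * streams
    for s in range(streams):
        # Stream s handles elements s, s + streams, s + 2*streams, ...
        for v in values[s::streams]:
            partials[s] += v
    # The only remaining dependency is this short final combine.
    return sum(partials)


if __name__ == "__main__":
    data = [float(i) for i in range(1, 101)]
    print(serial_sum(data), partial_sums(data))
```

Note that the restructured loop adds the values in a different order, so for general floating-point data the result may differ from the serial sum in the last few bits.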

MSP performance is often as good as or better than that of the T90