The Multi-Stream Processor
Multi-streaming works on loops
In simple cases it acts as a pipe/register/cache multiplier
- Can result in super-linear speedups for appropriately sized problems
Loops that don’t vectorize (e.g. outer loops) can be streamed
- Can stream outer loops while vectorizing inner loops
Must eliminate conditions that inhibit streaming
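Sketch: the loop nest below illustrates streaming an outer loop while vectorizing the inner one. It is a minimal example in plain C, not Cray-specific code. The function name scale_rows is hypothetical, and the OpenMP pragma is only an analogy for streaming (it is ignored when OpenMP is not enabled). The outer iterations touch disjoint rows, so nothing in the loop would be expected to inhibit streaming.

```c
#include <stddef.h>

/* Scale each row of a row-major n_rows x n_cols matrix by its own factor.
 * Outer loop: iterations are independent (each touches one row), so it is
 * a candidate for streaming. The pragma is an OpenMP analogy only.
 * Inner loop: unit stride with no dependences, so it is a candidate for
 * vectorization. */
void scale_rows(double *a, const double *factor, size_t n_rows, size_t n_cols)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n_rows; ++i) {
        for (size_t j = 0; j < n_cols; ++j) {
            a[(size_t)i * n_cols + j] *= factor[i];
        }
    }
}
```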
Notes:
- In its simplest form, MSP code implements loop-level parallelism similar to what can be achieved with OpenMP or auto-tasking.
- For example, when a generic loop is encountered, its iterations are partitioned over the individual SSPs that make up the MSP.
- The MSP has certain performance advantages relative to traditional auto-tasking (tighter hardware integration, lower overhead).
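Sketch: the partitioning described in the notes can be pictured as follows. This is an illustration only, under two assumptions not stated above: that an MSP consists of four SSPs, and that iterations are split into contiguous blocks. The helper ssp_range and the constant NUM_SSPS are hypothetical names, and the split a real compiler chooses may differ.

```c
#include <stddef.h>

#define NUM_SSPS 4  /* assumption: four SSPs per MSP */

/* Hypothetical helper: half-open range [lo, hi) of iterations handed to one
 * SSP under a simple block partition. The first n % NUM_SSPS SSPs each get
 * one extra iteration so the whole range is covered exactly once. */
void ssp_range(size_t n, size_t ssp, size_t *lo, size_t *hi)
{
    size_t base = n / NUM_SSPS;
    size_t extra = n % NUM_SSPS;
    *lo = ssp * base + (ssp < extra ? ssp : extra);
    *hi = *lo + base + (ssp < extra ? 1 : 0);
}

/* Emulates the partitioned loop sequentially: each pass of the outer loop
 * stands in for the work one SSP would perform concurrently in hardware. */
void add_vectors(const double *x, const double *y, double *z, size_t n)
{
    for (size_t ssp = 0; ssp < NUM_SSPS; ++ssp) {
        size_t lo, hi;
        ssp_range(n, ssp, &lo, &hi);
        for (size_t i = lo; i < hi; ++i) {
            z[i] = x[i] + y[i];
        }
    }
}
```

With n = 10, the four ranges are [0, 3), [3, 6), [6, 8) and [8, 10).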