Slide 31 of 41
Notes:
- Recalling that V1, V2 and V3 are the different stages of optimization, we can see several trends in the floating-point performance:
- 1) After vectorization, memory bank conflicts were a significant bottleneck. The SV1 cache helped to offset this, and we saw a respectable 50% of T90 performance.
- 2) Eliminating the memory bank conflicts in V2 significantly improved T90 performance, and we dropped to ~30% SSP.
- 3) At this point we would stop on older Cray platforms. Adding the third step, removing serial dependencies, allows the MSP code to match T90 performance (30% V2 -> V3).
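
As a rough illustration of why the bank conflicts in points 1 and 2 matter, the sketch below counts how many memory banks a constant-stride access stream touches. It assumes a simple word-interleaved memory, and the bank count of 64 is a hypothetical value chosen only for the example, not a figure for the T90 or SV1. The function name `banks_touched` is likewise made up here and is not from the code being tuned.

```python
from math import gcd


def banks_touched(stride_words: int, num_banks: int) -> int:
    """Distinct banks hit by a constant-stride access stream.

    Assumes consecutive words are interleaved across banks, so word i
    lives in bank (i mod num_banks).
    """
    return num_banks // gcd(stride_words, num_banks)


NUM_BANKS = 64  # hypothetical bank count, for illustration only

for stride in (1, 2, 64, 65):
    print(f"stride {stride:3d}: {banks_touched(stride, NUM_BANKS)} banks")
```

Under these assumptions, a stride equal to the bank count keeps hitting one bank, while padding the stride by one word spreads the accesses over all banks again. This is the usual reasoning behind padding an array's leading dimension, though the notes do not say which fix was applied in V2.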