Computational Performance
- On the T90, V2 is the better-performing code
- When targeting MSPs, the entire code must be examined, as in V3, to reduce the amount of code with serial dependencies
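As a rough illustration of what "code with serial dependencies" means here, the sketch below contrasts a loop that carries a value from one iteration to the next with a rewritten form in which every iteration is independent. It is a hypothetical example, not taken from the code discussed on this slide; the function names and the quantity being computed are assumptions chosen for illustration only.

```python
import math


def positions_serial(x0, dx, n):
    """Loop-carried form: each value depends on the previous iteration."""
    out = []
    x = x0
    for _ in range(n):
        out.append(x)
        x += dx  # carried dependency: iteration i needs the result of i-1
    return out


def positions_independent(x0, dx, n):
    """Rewritten form: each value depends only on its own index i,
    so iterations could be computed in any order or side by side."""
    return [x0 + i * dx for i in range(n)]


if __name__ == "__main__":
    a = positions_serial(1.0, 0.25, 8)
    b = positions_independent(1.0, 0.25, 8)
    # Both forms should agree to within floating-point rounding.
    print(all(math.isclose(p, q) for p, q in zip(a, b)))
```

Python is used only to show the dependency structure; the performance effect itself depends on the compiler and hardware and is not demonstrated by this sketch.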
Notes:
- Recalling that V1, V2, and V3 are successive stages of optimization, we can see several trends in the floating-point performance:
- 1) After vectorization, memory bank conflicts were a significant bottleneck. The SV1 cache helped to offset this, and we saw a respectable 50% of T90 performance.
- 2) Eliminating the memory bank conflicts in V2 significantly improved T90 performance, and we dropped to ~30% on the SSP.
- 3) At this point we would stop on older Cray platforms. Adding the third step, removing serial dependencies, allows the MSP code to match T90 performance (30%, V2 -> V3).
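The memory bank conflicts in items 1 and 2 can be pictured with a simple word-interleaved memory model, sketched below. The bank count of 64 and the array sizes are hypothetical values chosen for illustration, not the actual T90, SV1, or MSP figures, and `banks_touched` is a made-up helper. In this model, a stride that is a multiple of the bank count sends every access to the same bank, while padding the leading dimension by one word spreads the accesses across all banks.

```python
def banks_touched(stride_words, n_accesses, n_banks):
    """Count the distinct banks hit by a strided access pattern,
    assuming word i*stride maps to bank (i*stride) mod n_banks."""
    return len({(i * stride_words) % n_banks for i in range(n_accesses)})


N_BANKS = 64  # hypothetical bank count, for illustration only

if __name__ == "__main__":
    # Stride equal to the bank count: every access lands in one bank.
    print(banks_touched(64, 64, N_BANKS))
    # Leading dimension padded from 64 to 65: accesses spread over all banks.
    print(banks_touched(65, 64, N_BANKS))
```

Real memory systems are more involved than this model (bank busy times, multiple ports, caches such as the SV1's), so this is only a sketch of why certain strides are costly.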