Parallelising scatter
!$OMP ATOMIC seemed to be the answer.
- Allows the update to be performed in a controlled and parallel manner
- However, synchronisation costs with the associated locks are significant; this approach has not yet improved performance.
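As an illustration of the ATOMIC approach, here is a minimal sketch of a scatter-add loop. The module, routine and array names (scatter_atomic_mod, scatter_atomic, idx, x, y) are hypothetical and not taken from the code under discussion; the sketch only assumes that several loop iterations may update the same element of the target array.

```fortran
module scatter_atomic_mod
  implicit none
contains
  ! Hypothetical scatter-add: y(idx(i)) = y(idx(i)) + x(i) for every i.
  ! Different i may share the same idx(i), so concurrent updates to one
  ! element of y must be protected.
  subroutine scatter_atomic(idx, x, y)
    integer, intent(in) :: idx(:)
    double precision, intent(in) :: x(:)
    double precision, intent(inout) :: y(:)
    integer :: i

    !$OMP PARALLEL DO PRIVATE(i)
    do i = 1, size(x)
      ! Each individual update is made atomic; this per-update
      ! synchronisation is the cost discussed above.
      !$OMP ATOMIC
      y(idx(i)) = y(idx(i)) + x(i)
    end do
    !$OMP END PARALLEL DO
  end subroutine scatter_atomic
end module scatter_atomic_mod
```

Every update pays the synchronisation cost whether or not two threads actually collide, so the overhead grows with the number of scattered elements rather than with the number of conflicts.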
An alternative approach uses PRIVATE temporary arrays for each thread, and then combines them in a REDUCTION-type operation.
- This does remove synchronisation costs of each update, but requires additional memory and array copies.
- Some, but limited, improvement in performance was achieved.
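The private-array approach might look roughly like the sketch below. Again, all names (scatter_private_mod, scatter_private, y_priv, merge_scatter) are hypothetical, and this is one possible arrangement rather than the code's actual implementation: each thread accumulates into its own copy without synchronisation, and the copies are then summed into the shared array one thread at a time.

```fortran
module scatter_private_mod
  implicit none
contains
  ! Hypothetical scatter-add using one private accumulator per thread.
  subroutine scatter_private(idx, x, y)
    integer, intent(in) :: idx(:)
    double precision, intent(in) :: x(:)
    double precision, intent(inout) :: y(:)
    double precision, allocatable :: y_priv(:)
    integer :: i

    !$OMP PARALLEL PRIVATE(i, y_priv)
    ! Extra memory: one full-size temporary array per thread.
    allocate(y_priv(size(y)))
    y_priv = 0.0d0

    ! No synchronisation is needed here: each thread writes only
    ! to its own y_priv.
    !$OMP DO
    do i = 1, size(x)
      y_priv(idx(i)) = y_priv(idx(i)) + x(i)
    end do
    !$OMP END DO

    ! Reduction-type step: threads add their copies into y one at a time.
    !$OMP CRITICAL (merge_scatter)
    y = y + y_priv
    !$OMP END CRITICAL (merge_scatter)

    deallocate(y_priv)
    !$OMP END PARALLEL
  end subroutine scatter_private
end module scatter_private_mod
```

The merge step costs one full pass over the target array per thread, which is the additional copying noted above. OpenMP also accepts Fortran arrays in a REDUCTION(+:y) clause, which expresses the same idea more compactly but still creates a private copy per thread.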
This is the major outstanding issue with this code.