A five-year profiling study in production mode
at the University of Stuttgart has shown that more than 40 %
of the execution time of Message Passing Interface (MPI) routines
is spent in the collective communication routines
MPI_Allreduce and MPI_Reduce.
Although MPI implementations have been available for about 10 years
and all vendors are committed to the Message Passing Interface standard,
the vendors' and publicly available reduction algorithms
could be accelerated with new algorithms by a factor between
3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors.
This paper presents five algorithms optimized for different
choices of vector size and number of processes.
The focus is on bandwidth-dominated protocols for power-of-two
and non-power-of-two numbers of processes, optimizing the load balance
in communication and computation.
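To make the operation being optimized concrete, the following is a minimal sketch, not taken from the paper: it simulates the semantics of an allreduce with the sum operation via recursive doubling over a power-of-two number of simulated ranks, so every rank finishes with the elementwise sum of all input vectors in log2(P) exchange rounds. The function name and structure are illustrative assumptions, not the paper's optimized algorithms.

```python
def allreduce_sum(vectors):
    """Simulate a recursive-doubling allreduce (sum) over len(vectors)
    ranks; rank i starts with vectors[i]. Returns the buffer each rank
    holds at the end: the elementwise sum of all inputs.
    Illustrative sketch only; assumes a power-of-two rank count."""
    p = len(vectors)
    assert p & (p - 1) == 0 and p > 0, "power-of-two rank count assumed"
    bufs = [list(v) for v in vectors]
    dist = 1
    while dist < p:
        # In each round, every rank exchanges its partial result with
        # the partner at XOR distance `dist` and combines elementwise.
        new_bufs = [
            [a + b for a, b in zip(bufs[rank], bufs[rank ^ dist])]
            for rank in range(p)
        ]
        bufs = new_bufs
        dist <<= 1
    return bufs
```

After log2(P) rounds each simulated rank holds the identical reduced vector, which is the postcondition of MPI_Allreduce; the paper's contribution lies in protocols that reach this state with better bandwidth and load balance than such a textbook scheme.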
The new algorithms are also compared on the Cray X1 against the
current development version of Cray's MPI library (mpt.2.4.0.0.13).