Global Reductions
1 Overview
In a parallel application like OMEGA, it is often necessary to perform averages, sums or other global collective operations on the distributed data. These reductions are described here.
2 Requirements
2.1 Requirement: Global sums
The ability to perform the sum of all elements of a distributed array is required.
2.2 Requirement: Global min/max
The ability to find the minimum or maximum value in a distributed array is required.
2.3 Requirement: Data types
There must be reductions for all supported arithmetic types (integers, floating point) and all array dimensions from 0-d (scalars) to 5-d.
2.4 Requirement: Sums with product
A global sum is often performed on a product of two arrays. For example, a sum can be performed with a multiplicative mask, or a sum may be a dot product of two vectors within a solver. A global sum interface that multiplies two arrays before the sum is required.
2.5 Requirement: Restricted address space
In order to compute reductions over only owned and active cells/edges or for regional diagnostics, there must be an option to restrict the address space over which the local portion of a sum is computed.
2.6 Requirement: Multiple fields/arrays
For performance reasons, it is often beneficial to compute reductions for multiple arrays at the same time to reduce the number of messages and associated message latency. An interface for computing independent reductions for multiple arrays at once is needed. Note that the size of the arrays and any associated options must be the same for all arrays in a multi-field reduction. It would be possible to relax this restriction in the future.
2.7 Requirement: Other environments/decompositions
Because OMEGA supports multiple machine environments (communicators and processor sets) and decompositions, the reductions must be callable from environments other than the default.
2.8 Requirement: Reproducibility
The results (especially sums) must be bit-for-bit reproducible on a given machine with respect to processor and thread count.
2.9 Requirement: Accelerators and location
When running on accelerated architectures, the ability to specify whether the reduction occurs on host or device is needed.
2.10 Requirement: CPU threading
For reductions performed on the CPU, the operations must support an MPI+OpenMP model while still meeting reproducibility requirements.
2.11 Desired: Other reduction operations
It will be desirable in the future to add other reduction operations like minloc/maxloc.
3 Algorithmic Formulation
For integers, the sum is straightforward. For single-precision floating point, we will perform sums in double precision and convert back to single before returning. In each of the above cases, the local Kokkos sum, minval and maxval reductions will be used for the on-rank result before accumulating the sum across MPI ranks.
For double-precision sums, we will begin with a version of the double-double (DDPDD) algorithm of Donald Knuth, further improved by David H. Bailey and outlined in He and Ding (2001, J. of Supercomputing, 18, 259). In the future, we will also support the more robust version by Pat Worley in the E3SM reprosum routines.
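At the core of DDPDD is Knuth's two-sum update, sketched below (a minimal sketch; the function and variable names are illustrative, not the final OMEGA API). Each local value is folded into a (hi, lo) accumulator pair so the rounding error of every addition is retained rather than discarded; the per-rank pairs are then combined across ranks with a user-defined MPI reduction operator that applies the same update.

   // Fold a new value x into a double-double accumulator (hi, lo).
   inline void ddAdd(double &hi, double &lo, const double x) {
      double t1 = hi + x;                           // rounded sum
      double e  = t1 - hi;                          // rounded-off piece
      double t2 = ((x - e) + (hi - (t1 - e))) + lo; // exact error + carry
      hi = t1 + t2;                                 // renormalize the pair
      lo = t2 - (hi - t1);
   }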
4 Design
4.1 Data types and parameters
4.1.1 Parameters
In the future, if we support multiple reproducible algorithms, we will need an enum to define choice of algorithm.
4.1.2 Class/structs/data types
There are no new classes or data types associated with this functionality.
4.2 Methods
4.2.1 Global sum
For each supported array type ArrayTTDD (where TT is the data type I4, I8, R4, R8, Real and DD is the dimension 1D through 5D) or supported scalar data type, there will be a sum function aliased to the globalSum interface:
   sum = globalSum(const ArrayTTDD array,
                   const I4 mpiComm,
                   const std::vector<I4> indxRange);

where mpiComm is the MPI communicator extracted from the relevant MachEnv (e.g. defaultEnv.comm) and indxRange is a std::vector of length 2x the number of array dimensions. The entries of this vector are the min, max index of each array dimension. For example, an array dimensioned (nCellsAll+1, nVertLevels+1) might need to be summed over the indxRange {0, nCellsOwned-1, 0, nVertLevels-1}. Note that indxRange supports constant scalar values only, so it does not support a variable index range like minLevelCell/maxLevelCell. That case is best managed either by masking with the sum-product interface below or by ensuring the array has been set to zero in non-active entries.
For scalar sums, the interface is the same with the indxRange argument absent.
We will assume that Kokkos host arrays will be summed on the host and device arrays will be summed on the device.
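As a usage sketch (the array and variable names are illustrative), summing the owned, active portion of the 2-d array from the example above might look like:

   std::vector<I4> indxRange{0, nCellsOwned - 1, 0, nVertLevels - 1};
   R8 volumeSum = globalSum(layerVolume,     // an ArrayR82D(nCellsAll+1, nVertLevels+1)
                            defaultEnv.comm, // communicator from the default MachEnv
                            indxRange);      // owned cells, active levels only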
4.2.2 Global sum with product
There will be a similar interface for a sum with product:
   sum = globalSum(const ArrayTTDD array1,
                   const ArrayTTDD array2,
                   const I4 mpiComm,
                   const std::vector<I4> indxRange);

that returns the sum of array1*array2. The two arrays must be of the same data type. It is not required that the two arrays be of exactly the same size, but both must be large enough to accommodate the indxRange and the indices must align appropriately. This typically means that if the sizes differ, any extra entries occur at the end of the respective dimension.
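For example (names illustrative), a sum restricted to active entries can multiply by an R8 mask of ones and zeros so inactive entries contribute nothing:

   // cellMask holds 1.0 for active and 0.0 for inactive entries and must
   // match the data type of tracer per the restriction above.
   R8 maskedSum = globalSum(tracer, cellMask, defaultEnv.comm, indxRange);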
4.2.3 Global sum multi-field
To sum multiple fields at once, a std::vector of arrays must be constructed and the result will be stored in a std::vector of the appropriate type:

   std::vector<type> sum =
      globalSum(const std::vector<ArrayTTDD> arrays,
                const I4 mpiComm,
                const std::vector<I4> indxRange);

As with the sum with product, each array must be of the same data type but the arrays can have slightly different sizes, as long as the dimension lengths can accommodate the indxRange and the indices align appropriately with the requested indxRange.
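As a sketch (field names illustrative), two fields sharing the same index range are reduced with a single call and a single MPI message:

   std::vector<ArrayR82D> arrays{saltContent, heatContent};
   std::vector<R8> sums = globalSum(arrays, defaultEnv.comm, indxRange);
   // sums[0] and sums[1] hold the two independent field sums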
4.2.4 Global sum multi-field with product
The multi-field sum with product is still to be determined, but should have an interface like:
   std::vector<type> sum =
      globalSum(const std::vector<ArrayTTDD> arrays1,
                const std::vector<ArrayTTDD> arrays2,
                const I4 mpiComm,
                const std::vector<I4> indxRange);
The implementation detail still to be determined is how to manage two use cases. A typical use case is for the second array in the product to be the same for all fields (e.g. a mask or an area array), but we may wish to support a case where each product has a different array for both operands. Because Kokkos arrays contain metadata and a pointer to the data, it may be possible to create a std::vector of the same array, allowing both the case of a fixed array used for all products and the case where each product has its own array, as sketched below. This requires some testing of approaches.
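A minimal sketch of that idea (ArrayR82D follows the ArrayTTDD naming above; cellMask, maskA, maskB and arrays1 are illustrative):

   // Kokkos arrays copy shallowly, so a vector filled with the same mask
   // stores lightweight handles to a single allocation, not full copies.
   std::vector<ArrayR82D> arrays2(arrays1.size(), cellMask);
   // For distinct second operands, supply different arrays instead:
   // std::vector<ArrayR82D> arrays2{maskA, maskB};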
4.2.5 Global minval
The global minval will support interfaces similar to the global sums, but will return the minimum value instead.
   varMin = globalMinval(const ArrayTTDD array,
                         const I4 mpiComm,
                         const std::vector<I4> indxRange);

   varMin = globalMinval(const ArrayTTDD array1,
                         const ArrayTTDD array2,
                         const I4 mpiComm,
                         const std::vector<I4> indxRange);

   std::vector<type> varMin =
      globalMinval(const std::vector<ArrayTTDD> arrays,
                   const I4 mpiComm,
                   const std::vector<I4> indxRange);
Note that the minval routine will not be cognizant of any special values, so any inactive entries should either be masked or set to an appropriately high value, as sketched below.
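A hypothetical preparation step (layerThickness, maxLevelCell and the sentinel value are illustrative, not an OMEGA API; assumes Kokkos is initialized):

   // Overwrite entries below the active range with a large sentinel so
   // they cannot win the minimum.
   Kokkos::parallel_for(
      "maskForMinval", nCellsAll, KOKKOS_LAMBDA(const int iCell) {
         for (int k = maxLevelCell(iCell) + 1; k <= nVertLevels; ++k)
            layerThickness(iCell, k) = 1.e30;
      });
   varMin = globalMinval(layerThickness, defaultEnv.comm, indxRange);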
4.2.6 Global maxval
The maxval will have a form identical to minval but will return the maximum value. The same restrictions and comments on special values apply, with inactive entries set to an appropriately low value instead.
5 Verification and Testing
5.1 Test basics
For each data type, a set of reference arrays will be created with random numbers that extend across the entire range of that type (for later reproducibility tests). A reference sum will be computed serially using the same reproducible algorithm as the global implementation. The basic test will compare the global sum with the reference serial sum. Because the data types include both host and device arrays, this will also test reductions on either location.
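A sketch of the comparison (refArray, serialRefSum and errCount are illustrative test names): because the requirement is bit-for-bit reproducibility, the test uses exact equality rather than a tolerance.

   R8 globSum = globalSum(refArray, defaultEnv.comm, indxRange);
   if (globSum != serialRefSum)
      ++errCount; // any bit-for-bit mismatch is a failure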
tests requirements 2.1, 2.3, part of 2.8, and 2.9
5.2 Reproducibility
A MachEnv and decomposition for a subset of MPI ranks will be created and the reference arrays above will be distributed across the new decomposition. Global sums in each case will be compared to the serial reference value and the full MPI rank case for bit reproducibility. Similarly, a test with CPU threading on will test reproducibility under threading.
tests requirements 2.7, 2.8, 2.10
5.3 Restricted index range
The sums will be tested with a reduced index range on the same reference arrays from above and compared with the similarly-restricted serial reference sums.
tests requirement 2.5
5.4 Sum with product
A mask array will be defined for each type and used with the sum-with-product interface and compared against a serial version of the same.
tests requirement 2.4
5.5 Multi-field sums
Additional arrays similar to the reference arrays will be created with reference serial sums. Then the multi-field interface will be tested against these reference sums.
tests requirement 2.6
5.6 Min/Max
A similar approach to the sums above will be used to test the min and max functions in all combinations of interfaces.
tests requirement 2.2