Global Reductions
1 Overview
In a parallel application like OMEGA, it is often necessary to perform averages, sums or other global collective operations on the distributed data. These reductions are described here.
2 Requirements
2.1 Requirement: Global sums
The ability to perform the sum of all elements of a distributed array is required.
2.2 Requirement: Global min/max
The ability to find the minimum or maximum value in a distributed array is required.
2.3 Requirement: Data types
There must be reductions for all supported arithmetic types (integers, floating point) and all array dimensions from 0-d (scalars) to 5-d.
2.4 Requirement: Sums with product
A global sum is often performed on a product of two arrays. For example, a sum can be performed with a multiplicative mask, or a sum may be a dot product of two vectors within a solver. A global sum interface that multiplies two arrays before the sum is required.
2.5 Requirement: Restricted address space
In order to compute reductions over only owned and active cells/edges or for regional diagnostics, there must be an option to restrict the address space over which the local portion of a sum is computed.
2.6 Requirement: Multiple fields/arrays
For performance reasons, it is often beneficial to compute reductions for multiple arrays at the same time to reduce the number of messages and associated message latency. An interface for computing independent reductions for multiple arrays at once is needed. Note that the size of the arrays and any associated options must be the same for all arrays in a multi-field reduction. It would be possible to relax this restriction in the future.
2.7 Requirement: Other environments/decompositions
Because OMEGA supports multiple machine environments (communicators and processor sets) and decompositions, the reductions must be callable from environments other than the default.
2.8 Requirement: Reproducibility
The results (especially sums) must be bit-for-bit reproducible on a given machine with respect to processor and thread count.
2.9 Requirement: Accelerators and location
When running on accelerated architectures, the ability to specify whether the reduction occurs on host or device is needed.
2.10 Requirement: CPU threading
For reductions performed on the CPU, the operations must support an MPI+OpenMP model while still meeting reproducibility requirements.
2.11 Desired: Other reduction operations
It will be desirable in the future to add other reduction operations like minloc/maxloc.
3 Algorithmic Formulation
For integers, the sum is straightforward. For single-precision floating point, we will perform sums in double precision and convert back to single before returning. In each of the above cases, the local Kokkos sum, minval and maxval reductions will be used for the on-rank result before accumulating the sum across MPI ranks.
For double-precision sums, we will begin with a version of the double-double (DDPDD) algorithm of Donald Knuth, further improved by David H. Bailey and outlined in He and Ding (2001, J. of Supercomputing, 18, 259). In the future, we will also support the more robust version by Pat Worley in the E3SM reprosum routines.
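At the core of DDPDD is Knuth's two-sum update, sketched below (a minimal sketch; the function and variable names are illustrative, not the final OMEGA API). Each local value is folded into a (hi, lo) accumulator pair so the rounding error of every addition is retained rather than discarded; the per-rank pairs are then combined across ranks with a user-defined MPI reduction operator that applies the same update.

   // Fold a new value x into a double-double accumulator (hi, lo).
   inline void ddAdd(double &hi, double &lo, const double x) {
      double t1 = hi + x;                           // rounded sum
      double e  = t1 - hi;                          // rounded-off piece
      double t2 = ((x - e) + (hi - (t1 - e))) + lo; // exact error + carry
      hi = t1 + t2;                                 // renormalize the pair
      lo = t2 - (hi - t1);
   }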
4 Design
4.1 Data types and parameters
4.1.1 Parameters
In the future, if we support multiple reproducible algorithms, we will need an enum to define choice of algorithm.
4.1.2 Class/structs/data types
There are no new classes or data types associated with this functionality.
4.2 Methods
4.2.1 Global sum
For each supported array type ArrayTTDD (where TT is the data type I4, I8, R4, R8, Real and DD is the dimension 1D through 5D) or supported scalar data type, there will be a sum function aliased to the globalSum interface:
   sum = globalSum(const ArrayTTDD array,
                   const I4 mpiComm,
                   const std::vector<I4> indxRange);

where mpiComm is the MPI communicator extracted from the relevant MachEnv (e.g. defaultEnv.comm) and indxRange is a std::vector of length 2x the number of array dimensions. The entries of this vector are the min, max index of each array dimension. For example, an array dimensioned (nCellsAll+1, nVertLevels+1) might need to be summed over the indxRange {0, nCellsOwned-1, 0, nVertLevels-1}. Note that indxRange supports constant scalar values only, so it does not support a variable index range like minLevelCell/maxLevelCell. That case is best managed either by masking with the sum-product interface below or by ensuring the array has been set to zero in non-active entries.
For scalar sums, the interface is the same with the indxRange argument absent.
We will assume that Kokkos host arrays will be summed on the host and device arrays will be summed on the device.
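As a usage sketch (the array and variable names are illustrative), summing the owned, active portion of the 2-d array from the example above might look like:

   std::vector<I4> indxRange{0, nCellsOwned - 1, 0, nVertLevels - 1};
   R8 volumeSum = globalSum(layerVolume,     // an ArrayR82D(nCellsAll+1, nVertLevels+1)
                            defaultEnv.comm, // communicator from the default MachEnv
                            indxRange);      // owned cells, active levels only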
4.2.2 Global sum with product
There will be a similar interface for a sum with product:
   sum = globalSum(const ArrayTTDD array1,
                   const ArrayTTDD array2,
                   const I4 mpiComm,
                   const std::vector<I4> indxRange);

that returns the sum of array1*array2. The two arrays must be of the same data type. It is not required that the two arrays be of exactly the same size, but both must be large enough to accommodate the indxRange and the indices must align appropriately. This typically means that if the sizes differ, any extra entries occur at the end of the respective dimension.
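For example (names illustrative), a sum restricted to active entries can multiply by an R8 mask of ones and zeros so inactive entries contribute nothing:

   // cellMask holds 1.0 for active and 0.0 for inactive entries and must
   // match the data type of tracer per the restriction above.
   R8 maskedSum = globalSum(tracer, cellMask, defaultEnv.comm, indxRange);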
4.2.3 Global sum multi-field
To sum multiple fields at once, a std::vector of arrays must be constructed and the result will be stored in a std::vector of the appropriate type:

   std::vector<type> sum =
      globalSum(const std::vector<ArrayTTDD> arrays,
                const I4 mpiComm,
                const std::vector<I4> indxRange);

As with the sum with product, each array must be of the same data type but the arrays can have slightly different sizes, as long as the dimension lengths can accommodate the indxRange and the indices align appropriately with the requested indxRange.
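As a sketch (field names illustrative), two fields sharing the same index range are reduced with a single call and a single MPI message:

   std::vector<ArrayR82D> arrays{saltContent, heatContent};
   std::vector<R8> sums = globalSum(arrays, defaultEnv.comm, indxRange);
   // sums[0] and sums[1] hold the two independent field sums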
4.2.4 Global sum multi-field with product
The multi-field sum with product is still to be determined, but should have an interface like:
   std::vector<type> sum =
      globalSum(const std::vector<ArrayTTDD> arrays1,
                const std::vector<ArrayTTDD> arrays2,
                const I4 mpiComm,
                const std::vector<I4> indxRange);
The implementation detail still to be determined is how to manage two use cases. A typical use case is for the second array in the product to be the same for all fields (e.g. a mask or an area array), but we may wish to support a case where each product has a different array for both operands. Because Kokkos arrays contain metadata and a pointer to the data, it may be possible to create a std::vector of the same array, allowing both the case of a fixed array used for all products and the case where each product has its own array, as sketched below. This requires some testing of approaches.
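A minimal sketch of that idea (ArrayR82D follows the ArrayTTDD naming above; cellMask, maskA, maskB and arrays1 are illustrative):

   // Kokkos arrays copy shallowly, so a vector filled with the same mask
   // stores lightweight handles to a single allocation, not full copies.
   std::vector<ArrayR82D> arrays2(arrays1.size(), cellMask);
   // For distinct second operands, supply different arrays instead:
   // std::vector<ArrayR82D> arrays2{maskA, maskB};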
4.2.5 Global minval
The global minval will support interfaces similar to the global sums, but will return the minimum value instead.
   varMin = globalMinval(const ArrayTTDD array,
                         const I4 mpiComm,
                         const std::vector<I4> indxRange);

   varMin = globalMinval(const ArrayTTDD array1,
                         const ArrayTTDD array2,
                         const I4 mpiComm,
                         const std::vector<I4> indxRange);

   std::vector<type> varMin =
      globalMinval(const std::vector<ArrayTTDD> arrays,
                   const I4 mpiComm,
                   const std::vector<I4> indxRange);
Note that the minval routine will not be cognizant of any special values, so any inactive entries should either be masked or set to an appropriately high value, as sketched below.
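A hypothetical preparation step (layerThickness, maxLevelCell and the sentinel value are illustrative, not an OMEGA API; assumes Kokkos is initialized):

   // Overwrite entries below the active range with a large sentinel so
   // they cannot win the minimum.
   Kokkos::parallel_for(
      "maskForMinval", nCellsAll, KOKKOS_LAMBDA(const int iCell) {
         for (int k = maxLevelCell(iCell) + 1; k <= nVertLevels; ++k)
            layerThickness(iCell, k) = 1.e30;
      });
   varMin = globalMinval(layerThickness, defaultEnv.comm, indxRange);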
4.2.6 Global maxval
The maxval will have a form identical to minval but will return the maximum value. The same restrictions and comments on special values apply, with inactive entries set to an appropriately low value instead.
5 Verification and Testing
5.1 Test basics
For each data type, a set of reference arrays will be created with random numbers that extend across the entire range of that type (for later reproducibility tests). A reference sum will be computed serially using the same reproducible algorithm as the global implementation. The basic test will compare the global sum with the reference serial sum. Because the data types include both host and device arrays, this will also test reductions on either location.
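A sketch of the comparison (refArray, serialRefSum and errCount are illustrative test names): because the requirement is bit-for-bit reproducibility, the test uses exact equality rather than a tolerance.

   R8 globSum = globalSum(refArray, defaultEnv.comm, indxRange);
   if (globSum != serialRefSum)
      ++errCount; // any bit-for-bit mismatch is a failure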
tests requirements 2.1, 2.3, part of 2.8, and 2.9
5.2 Reproducibility
A MachEnv and decomposition for a subset of MPI ranks will be created and the reference arrays above will be distributed across the new decomposition. Global sums in each case will be compared to the serial reference value and the full MPI rank case for bit reproducibility. Similarly, a test with CPU threading on will test reproducibility under threading.
tests requirements 2.7, 2.8, 2.10
5.3 Restricted index range
The sums will be tested with a reduced index range on the same reference arrays from above and compared with the similarly-restricted serial reference sums.
tests requirement 2.5
5.4 Sum with product
A mask array will be defined for each type and used with the sum-with-product interface and compared against a serial version of the same.
tests requirement 2.4
5.5 Multi-field sums
Additional arrays similar to the reference arrays will be created with reference serial sums. Then the multi-field interface will be tested against these reference sums.
tests requirement 2.6
5.6 Min/Max
A similar approach to the sums above will be used to test the min and max functions in all combinations of interfaces.
tests requirement 2.2