Halo

1 Overview

In a distributed memory parallel environment, it is necessary to regularly exchange data on the interfaces between adjacent partitions of the mesh (i.e. halo elements). These halo exchanges will be accomplished utilizing the MPI library.

2 Requirements

2.1 Requirement: Compatible with all types of arrays

Handle halo exchanges for arrays of each supported data type and dimensionality.

2.2 Requirement: Exchange at all types of mesh elements

Meshes are composed of three types of elements (cells, edges, and vertices) at which values can be defined; we must be able to handle exchanges at each type of element.

2.3 Requirement: Variable halo depth

The default minimum required halo depth for most of the algorithms we will use is three cells. We should also be able to handle different halo depths.

2.4 Requirement: Exchange lists

Each MPI rank needs a send list and a receive list for every neighboring rank and each type of mesh element (cells, edges, and vertices). Send lists contain the local indices of the owned elements to be packed and sent to a neighboring rank; receive lists contain the local indices at which elements owned by that neighbor are received and unpacked into local field arrays. Exchange lists should be organized by halo layer.

2.5 Requirement: Send and receive buffers

Each MPI rank needs two arrays for every neighboring rank to serve as a send buffer and a receive buffer to be passed to the MPI send and receive routines. Because each rank communicates with all of its neighbors concurrently, a separate pair of send and receive buffers is needed for each neighbor.

2.6 Requirement: Non-blocking communication

To efficiently handle exchanges between each rank and all of its neighbors, we will use the non-blocking MPI routines MPI_Isend and MPI_Irecv. The MPI_Test routine will be used to determine when each communication is completed. Boolean variables will be needed to track when messages have been received and when buffers have been unpacked.
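
For illustration, the basic non-blocking pattern for a single neighbor looks roughly like the following sketch (buffer names, counts, tags, and the communicator are placeholders):

MPI_Request RecvReq, SendReq;
int RecvDone = 0;
MPI_Irecv(RecvBuf, Count, MPI_DOUBLE, NghbrRank, Tag, Comm, &RecvReq);
// ... pack SendBuf ...
MPI_Isend(SendBuf, Count, MPI_DOUBLE, NghbrRank, Tag, Comm, &SendReq);
while (!RecvDone) {
   MPI_Test(&RecvReq, &RecvDone, MPI_STATUS_IGNORE); // poll until the message arrives
}
// ... unpack RecvBuf ...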

2.7 Requirement: Compatible with device-resident arrays

In environments with GPU devices, many arrays will reside solely on the device, and we will need to be able to manage halo exchanges for these arrays, either via GPU-aware MPI or by transferring arrays to the host.

2.8 Desired: Communicate subset of halo layers

We may want the ability to communicate individual halo layers as opposed to always exchanging the entire halo for a given array. Exchange lists should be organized by halo layer to facilitate this.

2.9 Desired: Exchange multiple arrays simultaneously

Communications in MPAS are generally latency limited, so it may be advantageous to exchange multiple arrays at the same time, packing the halo elements of multiple arrays into the same buffer to reduce the number of messages.

2.10 Desired: Multiple environments/decompositions

OMEGA runs may contain multiple communication environments or mesh decompositions; halo exchanges would need to be supported in these circumstances as well.

2.11 Desired: OpenMP threading

If we allow OpenMP threading, halo exchanges would need to be carried out in a thread-safe manner, and a process to carry out local exchanges may be needed.

2.12 Desired: Minimize buffer memory allocation/deallocation

It might be advantageous to maintain persistent buffer arrays that are large enough to handle the largest halo exchange in a particular run, as opposed to calculating the needed buffer size and reallocating the buffer arrays for each halo exchange. Only a subset of the buffer array would need to be communicated for smaller exchanges. A method for determining the largest necessary buffer size for each neighbor would be needed; this depends on which arrays are active, the number of halo elements in each index space, the number of vertical layers, and the number of tracers.

3 Algorithmic Formulation

No specific algorithms are needed beyond those provided by the standard MPI library.

4 Design

The mesh decomposition will be performed during initialization via the Decomp class based on the input mesh file and options defined in the configuration file. A set of class objects described below will take the Decomp object and store information about the decomposition as well as allocate buffer memory necessary for performing halo exchanges.

4.1 Data types and parameters

4.1.1 Parameters

An enum defining mesh element types would be helpful to control which exchange lists to use for a particular field array:

enum meshElement{onCell, onEdge, onVertex};

An MPI datatype MPI_RealKind will be defined based on whether the default real data type is single or double precision via the SINGLE_PRECISION compile time switch:

#ifdef SINGLE_PRECISION
MPI_Datatype MPI_RealKind = MPI_FLOAT;
#else
MPI_Datatype MPI_RealKind = MPI_DOUBLE;
#endif

Halo exchanges will depend on the haloWidth parameter defined in the decomp configuration group.

4.1.2 Class/structs/data types

The ExchList class will hold both send and receive lists. Lists are separated by halo layer to allow for the possibility of exchanging individual layers. Buffer offsets are needed to pack/unpack all the halo layers into/from the same buffer.

class ExchList {

   private:

      I4 nList[haloWidth];        ///< number of mesh elements in each
                                  ///<   layer of list
      I4 nTot;                    ///< number of elements summed over layers
      I4 offsets[haloWidth];      ///< offsets for each halo layer

      std::vector<I4> indices[haloWidth];   ///< list of local indices

   friend class Neighbor;
   friend class Halo;
};

The Neighbor class contains all the information and buffer memory needed to carry out each type of halo exchange with one neighboring rank. It contains the ID of the neighboring MPI rank, ExchList objects for sends and for receives in each mesh index space, a send and a receive buffer, the MPI request handles returned by MPI_Irecv and MPI_Isend that are needed to test for completion of an individual message transfer, and boolean switches to control the progress of a Halo exchange. To avoid the need for separate integer and floating-point buffers, integer values can be recast as reals in a bit-preserving manner using reinterpret_cast during the packing process and then recast as integers on the receiving end (a sketch of this recast follows the class definition below).

class Neighbor {

   private:

      I4 rankID;                           ///< ID of neighboring MPI rank
      ExchList sendLists[3], recvLists[3]; ///< 0 = onCell, 1 = onEdge,
                                           ///< 2 = onVertex
      std::vector<Real> sendBuffer, recvBuffer;
      MPI_Request rReq, sReq;              ///< MPI request handles
      bool received = false;
      bool unpacked = false;

   friend class Halo;
};
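
A minimal sketch of the bit-preserving recast mentioned above, assuming the default Real is double precision (R8) so that integer values are widened to 8 bytes before the cast (names like IVal and IBuf are illustrative; std::memcpy would be a strictly conforming alternative):

// Pack: store the bit pattern of an integer value in a Real buffer slot
I8 ITmp = static_cast<I8>(IVal);                  // widen to match the 8-byte Real
sendBuffer[IBuf] = reinterpret_cast<R8 &>(ITmp);  // reinterpret bits, no conversion

// Unpack: recover the original integer value on the receiving end
R8 RTmp = recvBuffer[IBuf];
I4 IOut = static_cast<I4>(reinterpret_cast<I8 &>(RTmp));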

The Halo class collects all Neighbor objects needed by an MPI rank to perform a full halo exchange with each of its neighbors. The total number of neighbors, the local rank, and the MPI communicator are saved here, along with the mesh element type for the array currently being transferred. This class will be the user interface for halo exchanges; it is a friend class to the subordinate classes so that it has access to all of their private data.

class Halo {

   private:

      I4 nNghbr;                           ///< number of neighboring ranks
      I4 myRank;                           ///< local MPI rank ID
      MPI_Comm myComm;                     ///< MPI communicator handle
      meshElement elemType;                ///< index space of current array
      std::vector<Neighbor> neighbors;

   public:

      // methods

};

4.2 Methods

4.2.1 Constructors

The constructor for the Halo class will be the interface for declaring an instance of the Halo class and instances of all associated member classes, and requires info from the Decomp and MachEnv objects.

Halo(Decomp inDecomp, MachEnv inEnv);
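
For example, once the Decomp and MachEnv objects have been created during initialization, a default Halo object could be declared as follows (the names DefDecomp and DefEnv are placeholders for those objects):

Halo DefHalo(DefDecomp, DefEnv);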

The constructors for Neighbor and ExchList will be called from within the Halo constructor and will create instances of these classes based on info from the Decomp object.

4.2.2 Array halo exchange

The primary use for the Halo class will be a member function called exchangeFullArrayHalo, which will exchange halo elements of the input array with each of the rank's neighbors across all layers of the halo. The index space the array is defined on (cells, edges, or vertices) is passed as an argument and stored in the Halo object to determine which exchange lists to use. It will return an integer error code to catch errors (as will each subprocess below).

int exchangeFullArrayHalo(ArrayLocDDTT &array, meshElement elemType);

This will be an interface to different exchange functions for each type of ArrayLocDDTT (where Loc is the array location, i.e. device or host, DD is the dimensionality, and TT is the data type) as defined in the DataTypes doc. If the array is located on a GPU device, this function may transfer arrays from device to host. The ordering of steps for completing a halo exchange is (a sketch of the sequence follows the list):

  1. start receives

  2. pack buffers

  3. start sends

  4. unpack buffers
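
A rough sketch of this sequence for one array type, assuming a 2D double-precision host array (HostArray2DR8, following the naming convention above) and the member functions described in the following subsections:

int Halo::exchangeFullArrayHalo(HostArray2DR8 &array, meshElement inElemType) {

   int err = 0;
   elemType = inElemType;                    // index space of this array

   err = startReceives();                    // 1. start receives
   for (int iNeighbor = 0; iNeighbor < nNghbr; ++iNeighbor)
      err = packBuffer(array, iNeighbor);    // 2. pack buffers
   err = startSends();                       // 3. start sends

   I4 numUnpacked = 0;                       // 4. unpack each buffer as it arrives
   while (numUnpacked < nNghbr) {
      for (int iNeighbor = 0; iNeighbor < nNghbr; ++iNeighbor) {
         Neighbor &nghbr = neighbors[iNeighbor];
         if (!nghbr.received) {
            int flag = 0;
            MPI_Test(&nghbr.rReq, &flag, MPI_STATUS_IGNORE);
            nghbr.received = flag;
         }
         if (nghbr.received && !nghbr.unpacked) {
            err = unpackBuffer(array, iNeighbor);
            nghbr.unpacked = true;
            ++numUnpacked;
         }
      }
   }
   return err;
}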

4.2.3 Start receive/send

A startReceives function will loop over all member Neighbor objects and call MPI_Irecv for each neighbor. It takes no arguments because all the info needed is already contained in the Halo object and its member objects.

int startReceives();

Likewise, startSends will loop over all Neighbor objects and call MPI_Isend to send the buffers to each neighbor.

int startSends();
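
For example, startReceives might look roughly like the following, assuming the receive buffers have already been sized for the incoming messages and a single fixed message tag (startSends would mirror this with MPI_Isend and the send buffers):

int Halo::startReceives() {
   int err = 0;
   for (int iNeighbor = 0; iNeighbor < nNghbr; ++iNeighbor) {
      Neighbor &nghbr = neighbors[iNeighbor];
      err = MPI_Irecv(nghbr.recvBuffer.data(),
                      static_cast<int>(nghbr.recvBuffer.size()), MPI_RealKind,
                      nghbr.rankID, 0, myComm, &nghbr.rReq); // tag 0 assumed
      if (err != MPI_SUCCESS) break;
   }
   return err;
}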

4.2.4 Buffer pack/unpack

For each type of ArrayDDTT, there will be a buffer pack function aliased to a packBuffer interface:

int packBuffer(ArrayDDTT array, int iNeighbor);

where halo elements of the potentially multidimensional array ArrayDDTT are packed into the 1D send buffer of a particular Neighbor in the member std::vector neighbors, using the ExchList associated with the meshElement the array is defined on. The exchangeFullArrayHalo function will loop over the member neighbors and call packBuffer for each Neighbor.
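
For example, the pack function for a 2D double-precision host array might look roughly like the following, where the second array dimension is assumed to be the vertical index and the buffer offset arithmetic is illustrative (the matching unpackBuffer would reverse the assignment using the receive list):

int Halo::packBuffer(const HostArray2DR8 array, int iNeighbor) {

   Neighbor &nghbr = neighbors[iNeighbor];
   ExchList &list = nghbr.sendLists[elemType];          // lists for this index space
   int nVertLevels = static_cast<int>(array.extent(1)); // assumes a Kokkos-style extent

   for (int iLayer = 0; iLayer < haloWidth; ++iLayer) {
      for (int iList = 0; iList < list.nList[iLayer]; ++iList) {
         int iElem = list.indices[iLayer][iList];       // local index of owned element
         int iBuf = (list.offsets[iLayer] + iList) * nVertLevels;
         for (int k = 0; k < nVertLevels; ++k)
            nghbr.sendBuffer[iBuf + k] = array(iElem, k);
      }
   }
   return 0;
}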

Similarly, buffer unpack functions for each ArrayDDTT type will be aliased to an unpackBuffer interface:

int unpackBuffer(ArrayDDTT &array, int iNeighbor);

5 Verification and Testing

5.1 Halo exchange tests

A unit test where a set of halo exchanges is performed for each type of ArrayDDTT and each index space could verify all requirements except 2.3. The test should use a mesh decomposition across multiple MPI ranks and define an array for each type of ArrayDDTT and index space. The array elements could be filled based on the global IDs of the cells, edges, or vertices. After an exchange, all the elements in an array (owned and halo elements) would be checked to ensure they have the expected value (as sketched below). A similar test utilizing a mesh decomposition with a haloWidth other than the default value would satisfy requirement 2.3 as well.
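
For the cell index space, the fill-and-check portion of such a test might look roughly like the following, assuming owned cells are stored before halo cells and that nCellsOwned, nCellsAll, and the global cell ID array cellIDH are available from the Decomp object (names are illustrative):

// Fill owned cells from their global IDs and halo cells with a junk value
for (int iCell = 0; iCell < nCellsOwned; ++iCell)
   testArray(iCell) = static_cast<Real>(cellIDH(iCell));
for (int iCell = nCellsOwned; iCell < nCellsAll; ++iCell)
   testArray(iCell) = -1.0;

int err = myHalo.exchangeFullArrayHalo(testArray, onCell);

// After the exchange every element, owned or halo, should match its global ID
int errCount = 0;
for (int iCell = 0; iCell < nCellsAll; ++iCell)
   if (testArray(iCell) != static_cast<Real>(cellIDH(iCell)))
      ++errCount;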