Halo
1 Overview
In a distributed-memory parallel environment, it is necessary to regularly exchange data on the interfaces between adjacent partitions of the mesh (i.e., halo elements). These halo exchanges will be accomplished using the MPI library.
2 Requirements
2.1 Requirement: Compatible with all types of arrays
Handle halo exchanges for arrays of each supported data type and dimensionality.
2.2 Requirement: Exchange at all types of mesh elements
Meshes are composed of three types of elements (cells, edges, and vertices) at which values can be defined; we must be able to handle exchanges at each type of element.
2.3 Requirement: Variable halo depth
The default minimum required halo depth for most of the algorithms we will use is three cells. We should also be able to handle different halo depths.
2.4 Requirement: Exchange lists
Each MPI rank needs a send list and a receive list for every neighboring rank and each type of mesh element (cells, edges, and vertices). Send lists contain the local indices of the owned elements to be packed and sent to neighboring ranks; receive lists contain the local indices of the elements owned by neighboring ranks to be received and unpacked into local field arrays. Exchange lists should be organized by halo layer.
2.5 Requirement: Send and receive buffers
Each MPI rank needs two arrays for every neighboring rank to serve as a send buffer and a receive buffer to be passed to the MPI send and receive routines. Because one buffer per neighbor will be in flight concurrently on each rank, a separate send buffer and a separate receive buffer array are required for each neighbor.
2.6 Requirement: Non-blocking communication
To efficiently handle exchanges between each rank and all of its neighbors, we will use the non-blocking MPI routines MPI_Isend and MPI_Irecv. The MPI_Test routine will be used to determine when each communication is completed. Boolean variables will be needed to track when messages have been received and when buffers have been unpacked.
2.7 Requirement: Compatible with device-resident arrays
In environments with GPU devices, many arrays will reside solely on the device, and we will need to be able to manage halo exchanges for these arrays, either via GPU-aware MPI or by handling array transfers to the host.
2.8 Desired: Communicate subset of halo layers
We may want the ability to communicate individual halo layers as opposed to always exchanging the entire halo for a given array. Exchange lists should be organized by halo layer to facilitate this.
2.9 Desired: Exchange multiple arrays simultaneously
Communications in MPAS are generally latency limited, so it may be advantageous to exchange multiple arrays at the same time, packing the halo elements of multiple arrays into the same buffer to reduce the number of messages.
2.10 Desired: Multiple environments/decompositions
OMEGA runs may contain multiple communication environments or mesh decompositions; we would need to be able to perform halo exchanges in these circumstances.
2.11 Desired: OpenMP threading
If we allow OpenMP threading, halo exchanges would need to be carried out in a thread-safe manner, and a process to carry out local exchanges may be needed.
2.12 Desired: Minimize buffer memory allocation/deallocation
It might be advantageous to have persistent buffer arrays that are large enough to handle the largest halo exchange in a particular run, as opposed to calculating the needed buffer size and reallocating the buffer arrays for each halo exchange. Only a subset of the buffer array would need to be communicated for smaller exchanges. A method for determining the largest necessary buffer size for each neighbor would be needed, which would depend on which arrays are active, the number of halo elements in each index space, and the number of vertical layers and tracers.
3 Algorithmic Formulation
No specific algorithms are needed beyond those provided by the standard MPI library.
4 Design
The mesh decomposition will be performed during initialization via the Decomp class based on the input mesh file and options defined in the configuration file. A set of class objects described below will take the Decomp object and store information about the decomposition as well as allocate buffer memory necessary for performing halo exchanges.
4.1 Data types and parameters
4.1.1 Parameters
An enum defining mesh element types would be helpful to control which exchange lists to use for a particular field array:
enum meshElement{onCell, onEdge, onVertex};
An MPI datatype MPI_RealKind will be defined based on whether the default real data type is single or double precision, via the SINGLE_PRECISION compile-time switch:
#ifdef SINGLE_PRECISION
MPI_Datatype MPI_RealKind = MPI_FLOAT;
#else
MPI_Datatype MPI_RealKind = MPI_DOUBLE;
#endif
Halo exchanges will depend on the haloWidth parameter defined in the decomp configuration group.
4.1.2 Class/structs/data types
The ExchList class will hold both send and receive lists. Lists are separated by halo layer to allow for the possibility of exchanging individual layers. Buffer offsets are needed to pack/unpack all the halo layers into/from the same buffer.
class ExchList {
 private:
   I4 nList[haloWidth];   ///< number of mesh elements in each layer of list
   I4 nTot;               ///< number of elements summed over layers
   I4 offsets[haloWidth]; ///< offsets for each halo layer

   std::vector<I4> indices[haloWidth]; ///< list of local indices

   friend class Neighbor;
   friend class Halo;
};
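As an illustration of how the offsets are used, the hedged sketch below shows the calculation that might fill offsets and nTot once nList has been set from the Decomp information (this fragment is assumed to live in the ExchList constructor or a friend class; it is not itself part of the design above):

// Each halo layer occupies a contiguous segment of the buffer starting at
// offsets[iLayer]; nTot is the total number of elements over all layers.
nTot = 0;
for (int iLayer = 0; iLayer < haloWidth; ++iLayer) {
   offsets[iLayer] = nTot;  // start of this layer within the buffer segment
   nTot += nList[iLayer];   // running total of list elements
}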
The Neighbor class contains all the information and buffer memory needed for carrying out each type of halo exchange with one neighboring rank. It contains the ID of the neighboring MPI rank, an object of the ExchList class for sends and receives for each mesh index space, a send and a receive buffer, MPI request handles that are returned by MPI_Irecv and MPI_Isend and are needed to test for completion of an individual message transfer, and boolean switches to control the progress of a Halo exchange. To avoid the need for separate integer and floating-point buffers, during the packing process integer values can be recast as reals in a bit-preserving manner using reinterpret_cast and then recast as integers on the receiving end (see the sketch after the class definition below).
class Neighbor {
 private:
   I4 rankID; ///< ID of neighboring MPI rank

   ExchList sendLists[3], recvLists[3]; ///< 0 = onCell, 1 = onEdge,
                                        ///< 2 = onVertex

   std::vector<Real> sendBuffer, recvBuffer;
   MPI_Request rReq, sReq; ///< MPI request handles

   bool received = false;
   bool unpacked = false;

   friend class Halo;
};
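A minimal sketch of the bit-preserving recast described above, assuming sizeof(Real) >= sizeof(I4) and the same endianness on the sending and receiving ranks (variable names are illustrative only):

// Packing: copy the integer bits into a Real buffer slot without conversion
I4 intVal = intArray(iLoc);
*reinterpret_cast<I4 *>(&sendBuffer[iBuf]) = intVal;

// Unpacking on the receiving rank: recover the original integer bits
I4 restored = *reinterpret_cast<I4 *>(&recvBuffer[iBuf]);
intArray(iLoc) = restored;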
The Halo class collects all Neighbor objects needed by an MPI rank to perform a full halo exchange with each of its neighbors. The total number of neighbors, the local rank, and the MPI communicator are saved here, as is the mesh element type for the current array being transferred. This class will be the user interface for halo exchanges; it is a friend class to the subordinate classes so that it has access to all of their private data.
class Halo {
 private:
   I4 nNghbr;            ///< number of neighboring ranks
   I4 myRank;            ///< local MPI rank ID
   MPI_Comm myComm;      ///< MPI communicator handle
   meshElement elemType; ///< index space of current array

   std::vector<Neighbor> neighbors;

 public:
   // methods
};
4.2 Methods
4.2.1 Constructors
The constructor for the Halo class will be the interface for declaring an instance of the Halo class and instances of all associated member classes, and requires info from the Decomp and MachEnv objects.
Halo(Decomp inDecomp, MachEnv inEnv);
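For example, during initialization the default Halo might be created as follows (defaultDecomp and defaultEnv are placeholder names for the Decomp and MachEnv instances created earlier in the initialization sequence):

Halo defaultHalo(defaultDecomp, defaultEnv);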
The constructors for Neighbor and ExchList will be called from within the Halo constructor and will create instances of these classes based on info from the Decomp object.
4.2.2 Array halo exchange
The primary use for the Halo class will be a member function called exchangeFullArrayHalo, which will exchange halo elements for the input array with each of its neighbors across all layers of the halo. The index space the array is defined on (cells, edges, or vertices) needs to be determined from the metadata associated with the array and stored in the Halo object to select the proper exchange lists. It will return an integer error code to catch errors (likewise for each subprocess below).
int exchangeFullArrayHalo(ArrayLocDDTT &array, meshElement elemType);
This will be an interface for different exchange functions for each type of ArrayLocDDTT (where Loc is the array location, i.e. device or host, DD is the dimensionality, and TT is the data type) as defined in the DataTypes doc. If the array is located on a GPU device, this function may transfer arrays from device to host. The ordering of steps for completing a halo exchange is as follows (a sketch of this sequence appears after the list):
1. start receives
2. pack buffers
3. start sends
4. unpack buffers
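A hedged sketch of how one instance of exchangeFullArrayHalo might sequence these steps for a host-resident 2D real array is shown below (HostArray2DR8 stands in for one hypothetical ArrayLocDDTT instantiation; error handling is omitted and member names follow the class definitions above):

int Halo::exchangeFullArrayHalo(HostArray2DR8 &array, meshElement inElemType) {
   elemType = inElemType;         // index space of the current array
   startReceives();               // 1. post MPI_Irecv for every neighbor
   for (int iNghbr = 0; iNghbr < nNghbr; ++iNghbr)
      packBuffer(array, iNghbr);  // 2. fill each neighbor's send buffer
   startSends();                  // 3. post MPI_Isend for every neighbor
   // 4. as each receive completes, unpack that neighbor's buffer
   I4 nUnpacked = 0;
   while (nUnpacked < nNghbr) {
      for (int iNghbr = 0; iNghbr < nNghbr; ++iNghbr) {
         Neighbor &nghbr = neighbors[iNghbr];
         if (!nghbr.received) {
            int flag = 0;
            MPI_Test(&nghbr.rReq, &flag, MPI_STATUS_IGNORE);
            if (flag) nghbr.received = true;
         }
         if (nghbr.received && !nghbr.unpacked) {
            unpackBuffer(array, iNghbr);
            nghbr.unpacked = true;
            ++nUnpacked;
         }
      }
   }
   return 0;
}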
4.2.3 Start receive/send
A startReceives
function will loop over all member Neighbor objects and
call MPI_Irecv
for each neighbor. It takes no arguments because all the
info needed is already contained in the Halo object and its member objects.
int StartReceives();
Likewise, startSends
will loop over all Neighbor objects and call
MPI_Isend
to send the buffers to each neighbor.
int StartSends();
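A hedged sketch of startReceives follows; startSends would use the same pattern with MPI_Isend and the send buffers. A fixed message tag of 0 is assumed here for illustration; a real implementation might encode more information in the tag.

int Halo::startReceives() {
   for (int iNghbr = 0; iNghbr < nNghbr; ++iNghbr) {
      Neighbor &nghbr = neighbors[iNghbr];
      // post a non-blocking receive for this neighbor's full buffer
      int ierr = MPI_Irecv(nghbr.recvBuffer.data(),
                           static_cast<int>(nghbr.recvBuffer.size()),
                           MPI_RealKind, nghbr.rankID, 0, myComm, &nghbr.rReq);
      if (ierr != MPI_SUCCESS) return ierr;
   }
   return 0;
}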
4.2.4 Buffer pack/unpack
For each type of ArrayDDTT, there will be a buffer pack function aliased to a packBuffer interface:

int packBuffer(ArrayDDTT array, int iNeighbor);

where halo elements of the potentially multidimensional array ArrayDDTT are packed into the 1D send buffer for a particular Neighbor in the member std::vector neighbors, using the associated ExchList based on the meshElement the array is defined on. The exchangeFullArrayHalo function will loop over the member neighbors and call packBuffer for each Neighbor.
Similarly, buffer unpack functions for each ArrayDDTT type will be aliased to an unpackBuffer interface:

int unpackBuffer(ArrayDDTT &array, int iNeighbor);
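A hedged sketch of one packBuffer instance for a 2D host array of reals is given below (HostArray2DR8, its extent accessor, and element access via operator() are assumptions about the array types from the DataTypes design; the second dimension, e.g. vertical levels, is packed contiguously for each halo element):

int Halo::packBuffer(const HostArray2DR8 &array, int iNghbr) {
   Neighbor &nghbr = neighbors[iNghbr];
   ExchList &list  = nghbr.sendLists[elemType]; // lists for this index space
   I4 nLevels = array.extent(1);                // size of the second dimension
   for (int iLayer = 0; iLayer < haloWidth; ++iLayer) {
      for (int iElem = 0; iElem < list.nList[iLayer]; ++iElem) {
         I4 iLoc = list.indices[iLayer][iElem];  // local index of owned element
         I4 iBuf = (list.offsets[iLayer] + iElem) * nLevels;
         for (int k = 0; k < nLevels; ++k)
            nghbr.sendBuffer[iBuf + k] = array(iLoc, k);
      }
   }
   return 0;
}

The matching unpackBuffer would invert this loop, copying from recvBuffer into the halo elements listed in recvLists.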
5 Verification and Testing
5.1 Halo exchange tests
A unit test in which a set of halo exchanges is performed for each type of ArrayDDTT and each index space could verify all requirements except 2.3. The test should use a mesh decomposition across multiple MPI ranks and define an array for each type of ArrayDDTT and index space. The array elements could be filled based on the global IDs of the cells, edges, or vertices. After an exchange, all the elements in an array (owned + halo elements) would be checked to ensure they have the expected value to verify a successful test. A similar test utilizing a mesh decomposition with a haloWidth other than the default value would satisfy requirement 2.3 as well.
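As an illustration, the check for a cell-based array after an exchange might look like the sketch below (testArray and cellIDH are hypothetical names for the test field and the local-to-global cell ID array obtained from Decomp, and nCellsAll counts both owned and halo cells):

I4 errCount = 0;
for (int iCell = 0; iCell < nCellsAll; ++iCell) {
   if (testArray(iCell) != cellIDH(iCell))
      ++errCount; // halo value missing or unpacked incorrectly
}
// the test passes only if every owned and halo element has the expected value
if (errCount > 0)
   std::cout << "Halo exchange test FAIL: " << errCount << " mismatches" << std::endl;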