(omega-design-halo)=
# Halo

## 1 Overview

In a distributed memory parallel environment, it is necessary to regularly exchange data on the interfaces between adjacent partitions of the mesh (i.e., halo elements). These halo exchanges will be accomplished using the MPI library.

## 2 Requirements

### 2.1 Requirement: Compatible with all types of arrays

Handle halo exchanges for arrays of each supported data type and dimensionality.

### 2.2 Requirement: Exchange at all types of mesh elements

Meshes are composed of three types of elements (cells, edges, and vertices) at which values can be defined, so we must be able to handle exchanges at each type of element.

### 2.3 Requirement: Variable halo depth

The default minimum required halo depth for most of the algorithms we will use is three cells. We should also be able to handle different halo depths.

### 2.4 Requirement: Exchange lists

Each MPI rank needs a send list and a receive list for every neighboring rank and each type of mesh element (cells, edges, and vertices). Send lists contain the local indices of owned elements to be packed and sent to a neighboring rank; receive lists contain the local indices of elements owned by a neighboring rank to be received and unpacked into local field arrays. Exchange lists should be organized by halo layer.

### 2.5 Requirement: Send and receive buffers

Each MPI rank needs two arrays for every neighboring rank, a send buffer and a receive buffer, to be passed to the MPI send and receive routines. On each rank, one message per neighbor will be in flight concurrently, so a dedicated send buffer and receive buffer are needed for each neighbor to facilitate communication.

### 2.6 Requirement: Non-blocking communication

To efficiently handle exchanges between each rank and all of its neighbors, we will use the non-blocking MPI routines `MPI_Isend` and `MPI_Irecv`. The `MPI_Test` routine will be used to determine when each communication is complete. Boolean variables will be needed to track when messages have been received and when buffers have been unpacked.

### 2.7 Requirement: Compatible with device-resident arrays

In environments with GPU devices, many arrays will reside solely on the device, and we will need to be able to manage halo exchanges for these arrays, either via GPU-aware MPI or by handling array transfers to the host.

### 2.8 Desired: Communicate subset of halo layers

We may want the ability to communicate individual halo layers as opposed to always exchanging the entire halo for a given array. Exchange lists should be organized by halo layer to facilitate this.

### 2.9 Desired: Exchange multiple arrays simultaneously

Communications in MPAS are generally latency limited, so it may be advantageous to exchange multiple arrays at the same time, packing the halo elements of multiple arrays into the same buffer to reduce the number of messages.

### 2.10 Desired: Multiple environments/decompositions

OMEGA runs may contain multiple communication environments or mesh decompositions, so we would need to be able to perform halo exchanges in these circumstances.

### 2.11 Desired: OpenMP threading

If we allow OpenMP threading, halo exchanges would need to be carried out in a thread-safe manner, and a process to carry out local exchanges may be needed.

### 2.12 Desired: Minimize buffer memory allocation/deallocation

It might be advantageous to have persistent buffer arrays that are large enough to handle the largest halo exchange in a particular run, as opposed to calculating the needed buffer size and reallocating the buffer arrays for each halo exchange.
Only a subset of the buffer array would need to be communicated for smaller exchanges. A method for determining the largest necessary buffer size for each neighbor would be needed, and it would depend on which arrays are active, the number of halo elements in each index space, and the number of vertical layers and tracers.

## 3 Algorithmic Formulation

No specific algorithms are needed beyond those provided by the standard MPI library.

## 4 Design

The mesh decomposition will be performed during initialization via the Decomp class, based on the input mesh file and options defined in the configuration file. A set of class objects described below will take the Decomp object, store information about the decomposition, and allocate the buffer memory necessary for performing halo exchanges.

### 4.1 Data types and parameters

#### 4.1.1 Parameters

An enum defining mesh element types would be helpful to control which exchange lists to use for a particular field array:

```c++
enum meshElement{onCell, onEdge, onVertex};
```

An MPI datatype `MPI_RealKind` will be defined based on whether the default real data type is single or double precision via the SINGLE_PRECISION compile-time switch:

```c++
#ifdef SINGLE_PRECISION
MPI_Datatype MPI_RealKind = MPI_FLOAT;
#else
MPI_Datatype MPI_RealKind = MPI_DOUBLE;
#endif
```

Halo exchanges will depend on the `haloWidth` parameter defined in the decomp configuration group.

#### 4.1.2 Class/structs/data types

The ExchList class will hold both send and receive lists. Lists are separated by halo layer to allow for the possibility of exchanging individual layers. Buffer offsets are needed to pack/unpack all the halo layers into/from the same buffer.

```c++
class ExchList {

   private:

      I4 nList[haloWidth];   ///< number of mesh elements in each
                             ///< layer of list
      I4 nTot;               ///< number of elements summed over layers
      I4 offsets[haloWidth]; ///< offsets for each halo layer

      std::vector<I4> indices[haloWidth]; ///< list of local indices

      friend class Neighbor;
      friend class Halo;

};
```

The Neighbor class contains all the information and buffer memory needed for carrying out each type of halo exchange with one neighboring rank. It contains the ID of the neighboring MPI rank, ExchList objects for sends and receives in each mesh index space, a send and a receive buffer, MPI request handles returned by `MPI_Irecv` and `MPI_Isend`, which are needed to test for completion of an individual message transfer, and boolean switches to control the progress of a Halo exchange. To avoid the need for separate integer and floating-point buffers, integer values can be recast as reals in a bit-preserving manner using `reinterpret_cast` during the packing process and then recast as integers on the receiving end.

```c++
class Neighbor {

   private:

      I4 rankID; ///< ID of neighboring MPI rank

      ExchList sendLists[3], recvLists[3]; ///< 0 = onCell, 1 = onEdge,
                                           ///< 2 = onVertex

      std::vector<Real> sendBuffer, recvBuffer; ///< buffers of default real type

      MPI_Request rReq, sReq; ///< MPI request handles

      bool received = false;
      bool unpacked = false;

      friend class Halo;

};
```
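For illustration, a minimal sketch of the bit-preserving recast described above is given below. The `packI4`/`unpackI4` helper names, and taking `I4` and `Real` to be `int` and `double`, are assumptions made only for this sketch:

```c++
#include <cstring>
#include <vector>

using I4   = int;    // stand-ins for the OMEGA DataTypes definitions
using Real = double;

// Sketch only: store the bit pattern of an I4 value in one element of a
// real-valued send buffer, then recover it unchanged on the receiving rank.
void packI4(std::vector<Real> &sendBuffer, int iBuf, I4 value) {
   sendBuffer[iBuf] = 0.0;                           // clear unused bytes
   reinterpret_cast<I4 &>(sendBuffer[iBuf]) = value; // bit-preserving store
   // std::memcpy(&sendBuffer[iBuf], &value, sizeof(I4)) is the strictly
   // well-defined equivalent of the cast above
}

I4 unpackI4(const std::vector<Real> &recvBuffer, int iBuf) {
   return reinterpret_cast<const I4 &>(recvBuffer[iBuf]); // recover the bits
}
```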
The Halo class collects all Neighbor objects needed by an MPI rank to perform a full halo exchange with each of its neighbors. The total number of neighbors, the local rank, and the MPI communicator are saved here, as is the mesh element type for the array currently being transferred. This class will be the user interface for halo exchanges; it is a friend class to the subordinate classes so that it has access to all of their private data.

```c++
class Halo {

   private:

      I4 nNghbr;            ///< number of neighboring ranks
      I4 myRank;            ///< local MPI rank ID
      MPI_Comm myComm;      ///< MPI communicator handle
      meshElement elemType; ///< index space of current array

      std::vector<Neighbor> neighbors;

   public:

      // methods

};
```

### 4.2 Methods

#### 4.2.1 Constructors

The constructor for the Halo class will be the interface for declaring an instance of the Halo class and instances of all associated member classes, and requires info from the Decomp and MachEnv objects.

```c++
Halo(Decomp inDecomp, MachEnv inEnv);
```

The constructors for Neighbor and ExchList will be called from within the Halo constructor and will create instances of these classes based on info from the Decomp object.

#### 4.2.2 Array halo exchange

The primary use of the Halo class will be a member function called `exchangeFullArrayHalo`, which will exchange the halo elements of the input array with each of the rank's neighbors across all layers of the halo. The index space the array is located in (cells, edges, or vertices) needs to be fetched from the metadata associated with the array and stored in the Halo object to determine which exchange lists to use. It will return an integer error code to catch errors (likewise for each subprocess below).

```c++
int exchangeFullArrayHalo(ArrayLocDDTT &array, meshElement elemType);
```

This will be an interface to different exchange functions for each type of ArrayLocDDTT (where Loc is the array location, i.e. device or host, DD is the dimensionality, and TT is the data type) as defined in the DataTypes doc. If the array is located on a GPU device, this function may transfer arrays from device to host. The ordering of steps for completing a halo exchange is:

1. start receives
2. pack buffers
3. start sends
4. unpack buffers

A schematic sketch combining these steps is given at the end of section 4.2.4.

#### 4.2.3 Start receive/send

A `startReceives` function will loop over all member Neighbor objects and call `MPI_Irecv` for each neighbor. It takes no arguments because all the info needed is already contained in the Halo object and its member objects.

```c++
int startReceives();
```

Likewise, `startSends` will loop over all Neighbor objects and call `MPI_Isend` to send the buffers to each neighbor.

```c++
int startSends();
```

#### 4.2.4 Buffer pack/unpack

For each type of ArrayDDTT, there will be a buffer pack function aliased to a `packBuffer` interface:

```c++
int packBuffer(ArrayDDTT array, int iNeighbor);
```

where the halo elements of the potentially multidimensional array are packed into the 1D send buffer of a particular Neighbor in the member `std::vector` neighbors, using the ExchList associated with the meshElement the array is defined on. `exchangeFullArrayHalo` will loop over the member neighbors and call `packBuffer` for each Neighbor. Similarly, buffer unpack functions for each ArrayDDTT type will be aliased to an `unpackBuffer` interface:

```c++
int unpackBuffer(ArrayDDTT &array, int iNeighbor);
```
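For illustration, one instance of `packBuffer` for a host-resident two-dimensional real array might look like the sketch below. The `Array2DReal` type, its Kokkos-style `extent`/`operator()` accessors, the visibility of `haloWidth` inside the Halo class, and the buffer resizing are all assumptions made only for this sketch:

```c++
// Sketch only: pack the halo elements of a 2D real array (mesh element,
// vertical layer) into the 1D send buffer of one neighbor, using the send
// list for the array's index space. The per-layer offsets place all halo
// layers contiguously in the same buffer.
int Halo::packBuffer(const Array2DReal &array, int iNeighbor) {

   Neighbor &ngh       = neighbors[iNeighbor];
   ExchList &sendList  = ngh.sendLists[elemType];
   const int nVertLvls = array.extent(1);

   // resized here for simplicity; persistent buffers (see 2.12) would avoid this
   ngh.sendBuffer.resize(sendList.nTot * nVertLvls);

   for (int iLayer = 0; iLayer < haloWidth; ++iLayer) {
      for (int iList = 0; iList < sendList.nList[iLayer]; ++iList) {
         const int iElem = sendList.indices[iLayer][iList];
         const int iBuf  = (sendList.offsets[iLayer] + iList) * nVertLvls;
         for (int k = 0; k < nVertLvls; ++k)
            ngh.sendBuffer[iBuf + k] = array(iElem, k);
      }
   }

   return 0;
}
```

The corresponding `unpackBuffer` would reverse the copy, writing values from `recvBuffer` into the halo elements listed in the receive list. Putting the pieces of sections 4.2.2 through 4.2.4 together, one possible structure for a full halo exchange is sketched below; again `Array2DReal` is a hypothetical stand-in for one of the ArrayLocDDTT types, and error codes are simply accumulated rather than handled individually:

```c++
#include <mpi.h>

// Schematic only: orchestration of a full halo exchange, assuming the Halo
// and Neighbor members sketched above.
int Halo::exchangeFullArrayHalo(Array2DReal &array, meshElement elemType) {

   int err = 0;
   this->elemType = elemType; // index space of the current array

   for (int iNgh = 0; iNgh < nNghbr; ++iNgh) { // reset completion flags
      neighbors[iNgh].received = false;
      neighbors[iNgh].unpacked = false;
   }

   err += startReceives();             // 1. post MPI_Irecv for each neighbor

   for (int iNgh = 0; iNgh < nNghbr; ++iNgh)
      err += packBuffer(array, iNgh);  // 2. pack each send buffer

   err += startSends();                // 3. post MPI_Isend for each neighbor

   int nDone = 0;                      // 4. unpack buffers as messages arrive
   while (nDone < nNghbr) {
      for (int iNgh = 0; iNgh < nNghbr; ++iNgh) {
         Neighbor &ngh = neighbors[iNgh];
         if (!ngh.received) {
            int flag = 0;
            MPI_Test(&ngh.rReq, &flag, MPI_STATUS_IGNORE);
            ngh.received = (flag != 0);
         }
         if (ngh.received && !ngh.unpacked) {
            err += unpackBuffer(array, iNgh);
            ngh.unpacked = true;
            ++nDone;
         }
      }
   }

   for (int iNgh = 0; iNgh < nNghbr; ++iNgh) // ensure sends have completed
      MPI_Wait(&neighbors[iNgh].sReq, MPI_STATUS_IGNORE);

   return err;
}
```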
## 5 Verification and Testing

### 5.1 Halo exchange tests

A unit test in which a set of halo exchanges is performed for each type of ArrayDDTT and each index space could verify all of the requirements except 2.3. The test should use a mesh decomposition across multiple MPI ranks, and define an array for each type of ArrayDDTT and index space. The array elements could be filled based on the global IDs of the cells, edges, or vertices. After an exchange, all the elements in an array (owned + halo elements) would be checked to ensure they have the expected value to verify a successful test. A similar test utilizing a mesh decomposition with a `haloWidth` other than the default value would satisfy requirement 2.3 as well.
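A sketch of the kind of check such a test could perform for a cell-based integer field is given below. The `cellIDGlobal` argument and the use of a plain `std::vector<I4>` as a stand-in for one of the ArrayDDTT types are assumptions made only for this sketch:

```c++
#include <vector>

// Sketch only: fill owned cells from their global IDs, leave halo cells at a
// dummy value, perform a full halo exchange, then check every owned and halo
// cell against its expected value.
int testCellHaloExchange(Halo &myHalo, const std::vector<I4> &cellIDGlobal,
                         int nCellsOwned, int nCellsAll) {

   std::vector<I4> field(nCellsAll, -1);

   for (int iCell = 0; iCell < nCellsOwned; ++iCell)
      field[iCell] = cellIDGlobal[iCell];      // owned cells get valid data

   int err = myHalo.exchangeFullArrayHalo(field, onCell);

   for (int iCell = 0; iCell < nCellsAll; ++iCell)
      if (field[iCell] != cellIDGlobal[iCell]) // halo cells must now match too
         ++err;

   return err; // zero indicates a successful exchange
}
```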