Decomp
1 Overview
For parallel execution, OMEGA will use hierarchical parallelism, with the coarse-grained parallelism achieved via message passing. An OMEGA mesh will be decomposed into subdomains so that computations on each subdomain can proceed in parallel, with messages passed to communicate data between subdomains. This document describes the requirements and design of this domain decomposition.
2 Requirements
2.1 Requirement: Mesh partition
The primary requirement is to be able to partition a mesh across a distributed-memory machine. Many standard tools exist to perform this partitioning (e.g., Metis, ParMetis, Zoltan) and their use is encouraged. This normally requires an input file with the primary mesh indices and adjacency (see Requirement 2.4).
2.2 Requirement: Metis option
For backward compatibility and to facilitate bit-for-bit comparisons with MPAS, at least one partition option must use the Metis library.
2.3 Requirement: Online partitioning
We require the partitioning to occur online (i.e., as part of model initialization rather than as a separate pre-processing step). This is meant to avoid the past difficulty of maintaining a large number of partition input files, one for each MPI configuration. Experience has shown that even high-resolution partitioning is inexpensive enough to be performed as part of the model run, so this off-line step is not needed.
2.4 Requirement: Input mesh information
As input to the partitioning, we require a file with information on cell adjacency. There are at least two potential design approaches for this. The first option is to use the global graph.info file, which is already in Metis input format with a line for each cell listing the global indices of that cell's neighbors. A second option is to use the input mesh netCDF file, which has the same information in the global cellsOnCell array. This second option would enable us to eliminate all graph.info files after the mesh file has been created.
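As a concrete illustration (the mesh here is hypothetical), a graph.info file for a four-cell mesh whose cells form a ring would contain a header line giving the cell and graph-edge counts, followed by one line per cell listing the one-based global indices of that cell's neighbors:

4 4
2 4
1 3
2 4
3 1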
2.5 Requirement: All mesh locations
While the primary partition is based on the cells and cell adjacency, the decomposition must also partition the edge and vertex index spaces.
2.6 Requirement: Halo
For parallel efficiency, a halo containing copies of nearest-neighbor points will be defined. A separate design document describes the capability needed for updating those halo points for field data. Within the Decomp class, we simply keep the halo index information.
2.7 Requirement: Local to global index mapping
Once the mesh is decomposed, each parallel process has a set of local indices renumbered for the local space. The decomposition must provide the ability to return the global index given the processor’s local index. It must also be able to provide the location (MPI rank and local index) for all the neighbor or halo points to facilitate later setup of halos and other infrastructure. Some of this information may only be needed for initialization (see Requirement 2.8 on release of memory).
2.8 Requirement: Local sorting
At a minimum, the local addresses must be sorted with owned indices first and each level of halo following. This is needed both for optimal communication and for selectively computing only what is needed. Additional sorting within those blocks may be desirable for optimizing memory locality.
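As an illustrative sketch (using the nCellsOwned and nCellsHalo names introduced with the Decomp class in Section 4, and assuming nCellsHalo[k] holds the cumulative owned-plus-halo count through halo level k+1), this ordering lets simple loop bounds select exactly the cells a computation needs:

// Illustrative only: loop bounds enabled by the required sorting
// compute on owned cells only
for (int cell = 0; cell < nCellsOwned; ++cell) {
   // ... owned-cell work ...
}
// compute on owned cells plus the first halo level
for (int cell = 0; cell < nCellsHalo[0]; ++cell) {
   // ... owned plus level-1 halo work ...
}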
2.9 Requirement: Release memory
Once the partition information has been used to set up halos, mesh variables, I/O and other quantities that depend on the partition, all or parts of the partitioning data may be released to free memory. Note that some of the information may be transferred to mesh variables as part of later mesh initialization.
2.10 Desired: Weighted partitioning option
For better load balancing, support for supplying or computing weights for each cell index based on estimated workload is desirable.
2.11 Desired: Multiple partitions of same mesh
In the future, it may be desirable to perform parts of the calculation (e.g., the communication-dominated barotropic mode) on a smaller partition. We may need to support multiple partitions of the same mesh on different numbers of MPI ranks.
2.12 Desired: Multiple domains, sub-blocking or coloring
For some future capabilities (e.g., local timestepping in regionally focused mesh locations), it may be desirable to decompose the global domain into smaller subdomains and over-subscribe subdomains to a rank, combining low- and high-resolution subdomains for load balancing. It is unclear whether this is best achieved with a multiple sub-block structure (as in older MPAS) or by accommodating this within a weighted partitioning. Identifying the mesh regions using a coloring of the mesh may be helpful.
3 Algorithmic Formulation
Most of the algorithms required are embedded and documented in the relevant packages (e.g., the Metis multi-level k-way partitioning).
4 Design
The design presented here initially decomposes the global domain into a single block per MPI rank and does not yet include weighted decompositions; however, it should not prevent those options from being added later. The ParMetis package will be used initially to provide backward compatibility with Metis while reducing memory use relative to a serial Metis approach. The cell mesh indices and adjacency will be read from the non-decomposed graph.info files in Metis format, allowing the adjacency information to be read serially before the parallel I/O is set up.
Once the primary (cell) mesh is partitioned, the edge and vertex index spaces will be assigned based on their adjacency to the local cells, as in the current MPAS model. Because much of this information will later be replicated in a mesh class with Kokkos arrays, this decomposition structure will be destroyed after initialization and any related setup of halos and other infrastructure.
4.1 Data types and parameters
4.1.1 Parameters
The decomposition includes a public enum to define the supported partitioning method. Initially, this will support only the default Metis method, but others can be added:
enum partMethod {
   partMethodUnknown,   ///< undefined for error checking
   partMethodMetisKWay, ///< default KWay method in Metis
};
Other methods, like the Metis GeomKWay (combination of space-filling curve with KWay) or Zoltan options may be added later.
Eventually, an enum defining options for computing weights for weighted partitions may also be required, but will not appear in this initial implementation.
Another parameter is the haloWidth, which defines the width of the halo needed to provide neighbor information for all local operators. The minimum is typically 3 for current algorithms.
These parameters and the mesh input filename will be read as part of the input configuration file in a decomp configuration group:
decomp:
   meshInputFilename: 'omegaMeshFile.nc'
   partitionMethod: 'MetisKWay'
   haloWidth: 3
The Metis KWay partition method will be the default and the haloWidth will initially default to 3.
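A small, hypothetical helper (the function name is illustrative, not part of the design) could translate the configured partitionMethod string into the partMethod enum:

#include <string>

// Hypothetical sketch: map the configured string onto the partMethod enum,
// returning the Unknown value so the caller can report a configuration error
partMethod getPartMethodFromStr(const std::string &methodStr) {
   if (methodStr == "MetisKWay")
      return partMethodMetisKWay;
   return partMethodUnknown;
}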
4.1.2 Class/structs/data types
There will be a Decomp class that stores and defines a partitioning of the address space.
class Decomp {

 private:
   partMethod method; ///< method to use for partitioning

 public:
   I4 haloDepth;    ///< depth of the halo
   I4 nCellsGlobal; ///< num cells in the global domain
   I4 nCellsAll;    ///< num cells on local rank (own+halos)
   I4 nCellsOwned;  ///< num cells exclusively owned by local rank
   std::vector<I4> nCellsHalo;   ///< num of owned+halo points
                                 ///< for each level of halo
   std::vector<I4> cellGlobalID; ///< global index for each
                                 ///< local cell (own+halo)

   /// The adjacency (neighbor) info will be stored in the
   /// compact form that Metis uses. In this form the neighbor
   /// cell location for each owned and ghost cell in local cell
   /// order is stored as a linear vector of integer pairs
   /// (partition/processor number and local address). A second
   /// vector stores the starting address in this list for the
   /// neighbors associated with each local cell address.
   std::vector<I4> nbrStartCell; ///< start addr in nbr array for each cell
   std::vector<I4> nbrLocCell;   ///< list of nbr (rank, local addr) pairs

   // [ repeat identical variables for edge and vertex index spaces ]

   // methods described below
};
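For example, retrieving a cell's global ID and walking its neighbor list might look like the following sketch, where myDecomp is a Decomp instance and iCell a local cell index. This is illustrative only and assumes the (rank, local address) pairs are flattened into nbrLocCell with nbrStartCell indexing pairs; the final layout may differ:

// Illustrative use of the Decomp arrays (not final OMEGA code)
I4 globalID = myDecomp.cellGlobalID[iCell]; // global ID of local cell iCell

// Walk the neighbors of local cell iCell
for (I4 nbr = myDecomp.nbrStartCell[iCell];
     nbr < myDecomp.nbrStartCell[iCell + 1]; ++nbr) {
   I4 nbrRank  = myDecomp.nbrLocCell[2 * nbr];     // owning MPI rank
   I4 nbrLocal = myDecomp.nbrLocCell[2 * nbr + 1]; // local address on that rank
   // ... use (nbrRank, nbrLocal) when setting up halos, etc. ...
}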
4.2 Methods
Because most class members are public, only constructors and destructors are really needed.
4.2.1 Constructor
The main method will be a constructor that decomposes the mesh and creates the public variables. The default decomposition will require no arguments and will read options from the input config file:
Decomp mainDecomp;
For multiple decompositions of the same mesh, a second constructor will be needed that starts from the default decomposition and uses arguments for determining decomposition options. The exact form will be determined later, but might look something like:
Decomp newDecomp(oldDecomp, nParts, method, haloWidth);
4.2.2 Destructor
After most of the initialization is complete and the decomposition information has been used to initialize meshes, halos, I/O, etc., we will supply a destructor to release the memory; for a heap-allocated decomposition it would be invoked explicitly with:
delete myDecomp;
5 Verification and Testing
5.1 Test Metis
We will create a small sample graph/mesh for partitioning, similar to examples in the Metis documentation, and create a partition. The resulting partition will be compared to that generated by the standalone gpmetis partitioner.
This tests requirements 2.1, 2.2, 2.3, and 2.4.
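A minimal sketch of such a test, using a made-up six-cell ring graph and the Metis C API, might look like the following (the graph and part count are chosen only for illustration):

#include <metis.h>
#include <iostream>
#include <vector>

int main() {
   idx_t nvtxs = 6; // number of cells (graph vertices)
   idx_t ncon  = 1; // one balance constraint
   // CSR adjacency for a ring: cell i neighbors cells (i-1) and (i+1) mod 6
   std::vector<idx_t> xadj   = {0, 2, 4, 6, 8, 10, 12};
   std::vector<idx_t> adjncy = {5, 1, 0, 2, 1, 3, 2, 4, 3, 5, 4, 0};
   idx_t nparts = 2; // number of partitions (MPI ranks)
   idx_t objval = 0; // edge cut returned by Metis
   std::vector<idx_t> part(nvtxs);

   int ierr = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                  nullptr, nullptr, nullptr, &nparts,
                                  nullptr, nullptr, nullptr, &objval,
                                  part.data());
   if (ierr != METIS_OK)
      return 1;
   for (idx_t cell = 0; cell < nvtxs; ++cell)
      std::cout << "cell " << cell << " -> partition " << part[cell] << "\n";
   return 0;
}

The resulting part array can then be compared against the partition file written by gpmetis for the same graph.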
5.2 Back compatibility
We will create a decomposition based on an MPAS input mesh and verify that the decomposition is the same as that produced by the older MPAS Fortran code. This test should only be needed after initial development and will not need to be repeated as part of routine testing.
5.3 Check global IDs
To make sure all cells, etc. are accounted for, we can perform a global sum of the number of cells as well as a sum of the global cell IDs across the partition and ensure they match the expected values.
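A hedged sketch of this check (assuming one-based global cell IDs, so that the IDs across all owned cells sum to nCellsGlobal*(nCellsGlobal+1)/2, and that MPI has been initialized and <mpi.h> is included):

// Illustrative only: verify cell counts and global IDs across all ranks
long long localCount = myDecomp.nCellsOwned;
long long localIDSum = 0;
for (I4 iCell = 0; iCell < myDecomp.nCellsOwned; ++iCell)
   localIDSum += myDecomp.cellGlobalID[iCell];

long long globalCount = 0;
long long globalIDSum = 0;
MPI_Allreduce(&localCount, &globalCount, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(&localIDSum, &globalIDSum, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

long long nGlobal     = myDecomp.nCellsGlobal;
long long expectedSum = nGlobal * (nGlobal + 1) / 2;
bool pass = (globalCount == nGlobal) && (globalIDSum == expectedSum);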