# Tridiagonal Solvers
Omega provides batched solvers for tridiagonal systems of equations. Here, batched means that many systems of the same size are solved at the same time, typically one system per ocean column. In addition to solvers for general tridiagonal systems, specialized solvers with better stability properties are available for diffusion-type systems. To obtain portable performance, different solution algorithms are used on CPUs and GPUs. The solvers expose the same user interface on both architectures, so users can ignore which algorithm is used. Two type aliases are provided that resolve to the optimal solver for a given architecture:
- `TriDiagSolver` for general systems
- `TriDiagDiffSolver` for diffusion-type systems
## Representations of tridiagonal systems
### General tridiagonal system
A general tridiagonal system is represented by three vectors `DL`, `D`, and `DU`, holding the lower, main, and upper diagonal, respectively, and a right-hand-side vector `X`.
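With these definitions, row $i$ of a general system reads (assuming one-based indexing; the first and last rows omit the lower- and upper-diagonal terms, respectively)

$$
DL_i \, x_{i-1} + D_i \, x_i + DU_i \, x_{i+1} = X_i ,
$$

where $x$ is the vector of unknowns.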
### Diffusion-type tridiagonal system
A diffusion-type system is represented by two vectors `G` and `H`, and a right-hand-side vector `X`.
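The exact matrix that Omega assembles from `G` and `H` is defined by the solver implementation; as an illustrative sketch only (the sign and indexing conventions below are assumptions, not the library's definitive form), a symmetric diffusion-type system couples neighboring rows through `G` and contributes to the diagonal through `H`:

$$
-G_{i-1} \, x_{i-1} + \left( H_i + G_{i-1} + G_i \right) x_i - G_i \, x_{i+1} = X_i ,
$$

so only two coefficient vectors are needed instead of three.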
## Using tridiagonal solvers
There are two main ways to use Omega batched tridiagonal solvers:
1. Create global arrays storing the coefficients of all systems and call the top-level `solve` method
2. Use a team-level solver inside a parallel loop with a Kokkos team policy

While the first method is simpler, the second is preferable for computational performance, because it allows one to define a system, solve it, and post-process the solution in a single parallel loop. With the first method this would require multiple parallel loops, as well as allocating storage for the coefficients of all systems, which are typically no longer needed once the solution is obtained.
### Top-level method
To solve `NBatch` systems of size `NRow` using the general solver, four arrays `DL`, `D`, `DU`, and `X` need to be defined. All arrays are of size (`NBatch`, `NRow`) and contain the coefficients for each of the systems.
```c++
Array2DReal DL("DL", NBatch, NRow); // DL(n, :) = lower diagonal of system n
Array2DReal D("D", NBatch, NRow);   // D(n, :)  = main diagonal of system n
Array2DReal DU("DU", NBatch, NRow); // DU(n, :) = upper diagonal of system n
Array2DReal X("X", NBatch, NRow);   // X(n, :)  = rhs of system n
```
Once these arrays are filled with the coefficients, the systems are solved by calling the static `TriDiagSolver::solve` method:
```c++
TriDiagSolver::solve(DL, D, DU, X);
```
After this call, the rhs array `X` contains the solution.
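For a complete illustration, the sketch below fills the arrays with a simple constant-coefficient system and calls the top-level solver. The specific values, sizes, and the use of a Kokkos `MDRangePolicy` for filling are illustrative assumptions; only `Array2DReal` and `TriDiagSolver::solve` come from the interface described above.

```c++
const int NBatch = 1000; // e.g. number of ocean columns
const int NRow   = 60;   // e.g. number of vertical levels

Array2DReal DL("DL", NBatch, NRow);
Array2DReal D("D", NBatch, NRow);
Array2DReal DU("DU", NBatch, NRow);
Array2DReal X("X", NBatch, NRow);

// fill every system with the same simple -1 / 2 / -1 coefficients
parallel_for(
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {NBatch, NRow}),
    KOKKOS_LAMBDA(int N, int K) {
       DL(N, K) = (K == 0) ? 0 : -1;        // first row has no lower-diagonal entry
       D(N, K)  = 2;
       DU(N, K) = (K == NRow - 1) ? 0 : -1; // last row has no upper-diagonal entry
       X(N, K)  = 1;                        // right-hand side
    });

// solve all systems at once; X now holds the solutions
TriDiagSolver::solve(DL, D, DU, X);
```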
Solving diffusion-type systems is similar, except that the input arrays contain the coefficients in the form presented above and the static `TriDiagDiffSolver::solve` method is called:
```c++
Array2DReal G("G", NBatch, NRow); // G(n, :) = coefficients G_i of system n
Array2DReal H("H", NBatch, NRow); // H(n, :) = coefficients H_i of system n
Array2DReal X("X", NBatch, NRow); // X(n, :) = rhs of system n

// fill G, H, X

TriDiagDiffSolver::solve(G, H, X);
```
### Team-level method
Note: since the steps to use the team-level general and specialized solvers are similar, this subsection first shows how to use the general solver. At the end, an example of using the specialized solver is presented.
The team-level solvers need to be used inside a parallel loop with a Kokkos team policy. To create a team policy for solving `NBatch` systems of size `NRow`, the static member function `makeTeamPolicy` is used:
```c++
TeamPolicy Policy = TriDiagSolver::makeTeamPolicy(NBatch, NRow);
```
Every team of threads, except possibly the last, is responsible for solving `VecLength` systems. Since the total number of systems is generally not evenly divisible by `VecLength`, the last team may solve fewer than `VecLength` systems, and this case needs to be handled with `if` statements.
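For illustration, a sketch of the resulting bookkeeping (the concrete numbers and the ceiling division are assumptions used only to explain the pattern; the actual `VecLength` is chosen by Omega for the target architecture):

```c++
// illustrative only: how many teams cover NBatch systems when each
// team handles up to VecLength of them
constexpr int NBatch    = 10;
constexpr int VecLength = 4;
constexpr int NTeams    = (NBatch + VecLength - 1) / VecLength; // = 3
// teams 0 and 1 each solve 4 systems; team 2 solves only the remaining 2
```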
Typically, the parallel loop consists of three main parts:
- defining the system
- solving the system
- using the solution
At the beginning of the loop, temporary scratch storage for the coefficients of the `VecLength` systems handled by the team needs to be allocated. This is done by creating a `TriDiagScratch` struct, which contains four member arrays `DL`, `D`, `DU`, and `X` of size (`NRow`, `VecLength`). Note that, contrary to the global-arrays approach, the system dimension comes first. These arrays need to be filled with the coefficients of the `VecLength` systems solved by the team. This is done in a parallel loop over the system size (using `TeamThreadRange`) and a serial loop over `VecLength`. Overall, the code that defines the system coefficients typically looks like this:
```c++
parallel_for(Policy, KOKKOS_LAMBDA (TeamMember &Member) {
   // create scratch data
   TriDiagScratch Scratch(Member, NRow);
   int IStart = Member.league_rank() * VecLength;

   // define the system coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            Scratch.DL(K, IVec) = ...;
            Scratch.D(K, IVec)  = ...;
            Scratch.DU(K, IVec) = ...;
            Scratch.X(K, IVec)  = ...;
         }
      }
   });
   ...
});
```
To perform a team-level solve, a different overload of the static member function `TriDiagSolver::solve` needs to be called. This overload takes the team member `Member` and the filled scratch struct `Scratch`. The call needs to be surrounded by barriers, to ensure that the parallel loop setting the coefficients has finished and to make the computed solution available to all threads.
```c++
parallel_for(Policy, KOKKOS_LAMBDA (TeamMember &Member) {
   // previous code

   // solve the system
   Member.team_barrier();
   TriDiagSolver::solve(Member, Scratch);
   Member.team_barrier();
   ...
});
```
After the call to `solve`, the solution is available in `Scratch.X`. It can be manipulated further or simply stored.
```c++
parallel_for(Policy, KOKKOS_LAMBDA (TeamMember &Member) {
   // previous code

   // multiply the solution by 2 (just an example)
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         Scratch.X(K, IVec) *= 2;
      }
   });
   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});
```
The team-level specialized diffusion solver can be used similarly; the only substantial difference is the form of the coefficients. A minimal complete example of its use is shown below.
```c++
// create team policy
TeamPolicy Policy = TriDiagDiffSolver::makeTeamPolicy(NBatch, NRow);

parallel_for(Policy, KOKKOS_LAMBDA (TeamMember &Member) {
   // create scratch data
   TriDiagDiffScratch Scratch(Member, NRow);
   int IStart = Member.league_rank() * VecLength;

   // define the system coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            Scratch.G(K, IVec) = ...;
            Scratch.H(K, IVec) = ...;
            Scratch.X(K, IVec) = ...;
         }
      }
   });

   // solve the system
   Member.team_barrier();
   TriDiagDiffSolver::solve(Member, Scratch);
   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});
```