Tridiagonal Solvers

Omega provides batched solvers for tridiagonal systems of equations. Here, batched means that many systems of the same size are solved at the same time, typically because one system needs to be solved per ocean column. In addition to solvers for general tridiagonal systems, specialized solvers with better stability properties are available for diffusion-type systems. To obtain portable performance, different solution algorithms are used on CPUs and GPUs. The solvers implement the same user interface on both, so users do not need to know which algorithm is used. Two type aliases are provided that resolve to the optimal solver for a given architecture:

  • TriDiagSolver for general systems

  • TriDiagDiffSolver for diffusion-type systems

Representations of tridiagonal systems

General tridiagonal system

A general tridiagonal system is represented by three vectors DL, D, and DU, containing the lower, main, and upper diagonals, respectively, and a right-hand-side vector X:

\[ (DL)_i Y_{i - 1} + D_i Y_i + (DU)_i Y_{i + 1} = X_i. \]
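In matrix form, a single system reads (assuming the common convention that the first entry of DL and the last entry of DU are not referenced)

\[ \begin{pmatrix} D_0 & (DU)_0 & & \\ (DL)_1 & D_1 & \ddots & \\ & \ddots & \ddots & (DU)_{N - 2} \\ & & (DL)_{N - 1} & D_{N - 1} \end{pmatrix} \begin{pmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_{N - 1} \end{pmatrix} = \begin{pmatrix} X_0 \\ X_1 \\ \vdots \\ X_{N - 1} \end{pmatrix}, \]

where N is the system size.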

Diffusion-type tridiagonal system

A diffusion-type system is represented by two vectors G and H, and a right-hand-side vector X. It has the form

\[ -G_{i - 1} Y_{i - 1} + (G_{i - 1} + G_i + H_i) Y_i - G_{i} Y_{i + 1} = X_i. \]
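As an illustrative example of where this form arises (a sketch only; the diffusivity, layer thickness, and time step below are not part of the solver interface), a backward-Euler discretization of vertical diffusion,

\[ h_i \frac{Y_i - Y_i^n}{\Delta t} = \frac{\kappa_{i + 1/2} (Y_{i + 1} - Y_i)}{\Delta z_{i + 1/2}} - \frac{\kappa_{i - 1/2} (Y_i - Y_{i - 1})}{\Delta z_{i - 1/2}}, \]

can be rearranged into exactly this form with

\[ G_i = \frac{\kappa_{i + 1/2}}{\Delta z_{i + 1/2}}, \quad H_i = \frac{h_i}{\Delta t}, \quad X_i = H_i Y_i^n. \]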

Using tridiagonal solvers

There are two main ways to use Omega batched tridiagonal solvers:

  • Create global arrays storing the coefficients of all systems and call the top-level solve method

  • Use a team-level solver inside a parallel loop using Kokkos team policy

While the first method is simpler, the second typically performs better: it allows one to define a system, solve it, and post-process the solution within a single parallel loop. The first method would require multiple parallel loops, as well as allocating storage for the coefficients of all systems, which are typically no longer needed once the solution is obtained.

Top-level method

To solve NBatch systems of size NRow using the general solver, four arrays DL, D, DU, and X need to be defined. All arrays have size (NBatch, NRow) and contain the coefficients of each of the systems:

Array2DReal DL("DL", NBatch, NRow); // DL(n, :) = lower diagonal of system n
Array2DReal D("D", NBatch, NRow);   // D(n, :)  = main diagonal of system n
Array2DReal DU("DU", NBatch, NRow); // DU(n, :) = upper diagonal of system n
Array2DReal X("X", NBatch, NRow);   // X(n, :)  = rhs of system n

Once these arrays are filled with the coefficients, the systems are solved by calling the static TriDiagSolver::solve method:

TriDiagSolver::solve(DL, D, DU, X);

After this call, the rhs array X contains the solution.
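For illustration, here is a minimal sketch that sets up and solves NBatch copies of a simple second-difference system (the coefficient values and the plain-Kokkos fill are arbitrary choices for this example, not part of the solver interface):

// minimal sketch: NBatch copies of -Y_{i-1} + 2 Y_i - Y_{i+1} = 1
Array2DReal DL("DL", NBatch, NRow);
Array2DReal D("D", NBatch, NRow);
Array2DReal DU("DU", NBatch, NRow);
Array2DReal X("X", NBatch, NRow);

Kokkos::parallel_for(
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {NBatch, NRow}),
    KOKKOS_LAMBDA(int N, int K) {
       DL(N, K) = -1;
       D(N, K)  = 2;
       DU(N, K) = -1;
       X(N, K)  = 1;
    });

TriDiagSolver::solve(DL, D, DU, X);
// X(N, :) now contains the solution of system N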

Solving diffusion-type systems is similar, except that the input arrays contain the coefficients in the form presented above and the static TriDiagDiffSolver::solve method is called:

Array2DReal G("G", NBatch, NRow); // G(n, :) = coefficients G_i of system n
Array2DReal H("H", NBatch, NRow); // H(n, :) = coefficients H_i of system n
Array2DReal X("X", NBatch, NRow); // X(n, :) = rhs of system n

// fill G, H, X

TriDiagDiffSolver::solve(G, H, X);

Team-level method

Note: since the steps to use the team-level general and specialized solvers are similar, this subsection first shows how to use the general solver. At the end, an example of using the specialized solver is presented.

The team-level solvers need to be used inside a parallel loop with a Kokkos team policy. To create a team policy for solving NBatch systems of size NRow, the static member function makeTeamPolicy is used:

TeamPolicy Policy = TriDiagSolver::makeTeamPolicy(NBatch, NRow);

Every team of threads is responsible for solving VecLength systems. Since VecLength likely does not evenly divide the total number of systems, the last team may be responsible for fewer than VecLength systems, which needs to be handled with if statements.
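In other words, the mapping from teams to systems looks like this (a sketch of the indexing only; the league size is presumably computed along these lines inside makeTeamPolicy):

// number of teams, rounding up so all NBatch systems are covered
int NTeams = (NBatch + VecLength - 1) / VecLength;
// team T is responsible for systems I with
// T * VecLength <= I < min((T + 1) * VecLength, NBatch)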

Typically, the parallel loop consists of three main parts:

  • defining the system

  • solving the system

  • using the solution

At the beginning of the loop, temporary scratch storage for the coefficients of VecLength systems needs to be allocated. This is done by creating a TriDiagScratch struct, which contains four member arrays DL, D, DU, and X of size (NRow, VecLength). Note that, contrary to the global arrays approach, the dimension indexing rows within a system comes first.
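Conceptually, the scratch struct can be pictured as follows (a sketch only, not the actual definition; ScratchArray is a placeholder name for the team-scratch view type used internally):

// conceptual sketch of TriDiagScratch, not the actual definition
struct TriDiagScratch {
   ScratchArray DL; // (NRow, VecLength) lower diagonals
   ScratchArray D;  // (NRow, VecLength) main diagonals
   ScratchArray DU; // (NRow, VecLength) upper diagonals
   ScratchArray X;  // (NRow, VecLength) rhs, overwritten with the solution
   TriDiagScratch(const TeamMember &Member, int NRow); // allocates team scratch
};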

These arrays need to be filled with the coefficients of the VecLength systems handled by the team. This is done with a parallel loop over the system size (using TeamThreadRange) and a serial loop over VecLength. Overall, the code that defines the system coefficients typically looks like this:

parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {
   // create scratch data
   TriDiagScratch Scratch(Member, NRow);

   // index of the first system handled by this team
   const int IStart = Member.league_rank() * VecLength;

   // define the system coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle the last team having fewer systems
         if (I < NBatch) {
            Scratch.DL(K, IVec) = ...;
            Scratch.D(K, IVec)  = ...;
            Scratch.DU(K, IVec) = ...;
            Scratch.X(K, IVec)  = ...;
         }
      }
   });

   ...
});

To perform a team-level solve, a different overload of the static member function TriDiagSolver::solve needs to be called. This overload takes the team member Member and the filled scratch struct Scratch. The call needs to be surrounded by team barriers: the first ensures that the parallel loop setting the coefficients has finished, and the second makes the computed solution available to all threads.

parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {
   // previous code

   // solve the system
   Member.team_barrier();
   TriDiagSolver::solve(Member, Scratch);
   Member.team_barrier();

   ...
});

After the call to solve, the solution is available in Scratch.X. It can be manipulated further or simply stored.

parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {
   // previous code

   // multiply the solution by 2 (just an example)
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
        Scratch.X(K, IVec) *= 2;
      }
   });

   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});

The team-level specialized diffusion solver is used in the same way; the only substantial difference is the form of the coefficients. A minimal complete example is shown below.

// create team policy
TeamPolicy Policy = TriDiagDiffSolver::makeTeamPolicy(NBatch, NRow);

parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {
   // create scratch data
   TriDiagDiffScratch Scratch(Member, NRow);

   int IStart = Member.league_rank() * VecLength;

   // define the system coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle the last team having fewer systems
         if (I < NBatch) {
            Scratch.G(K, IVec) = ...;
            Scratch.H(K, IVec) = ...;
            Scratch.X(K, IVec) = ...;
         }
      }
   });

   // solve the system
   Member.team_barrier();
   TriDiagDiffSolver::solve(Member, Scratch);
   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});