Shared steps
date: 2023/08/18
Contributors: Carolyn Begeman, Xylar Asay-Davis
Summary
The capability designed here is the ability to share steps across tasks. In this design document, “shared steps” refers to any step which may be used by multiple tasks that are available in polaris.
The main motivation behind this capability is the computational expense of repeatedly running steps that could be shared across tasks. To make it clear to the user that steps are shared, we present a new design for the working directory structure. The design is successful insofar as it guarantees that shared steps are run once per Slurm job and that the role of shared steps is clear to users.
Requirements
Requirement: Shared steps are run once.
Shared steps should be run once per invocation of polaris serial or polaris run.
Requirement: Shared steps are run before steps that depend on their output.
Requirement: Shared steps are not daughters of a task
A shared step’s class attributes do not include any task-related information such as a task it belongs to.
Requirement: Working directory structure is intuitive.
Shared step directories should be located at the highest level in the working directory structure where all tasks that use that step are run at or below that level.
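One way to read this requirement is that a shared step belongs under the deepest directory that contains every task using it. The following is a minimal sketch of that reading, assuming work-directory paths are POSIX-style strings. The helper name shared_step_root is hypothetical and is not part of polaris.

```python
import posixpath


def shared_step_root(task_paths):
    """Return the deepest directory that contains all of the given tasks.

    Hypothetical helper for illustration only; it is not part of polaris.
    """
    return posixpath.commonpath(task_paths)


tasks = [
    'ocean/spherical/qu/cosine_bell',
    'ocean/spherical/qu/cosine_bell/with_viz',
]
root = shared_step_root(tasks)
```

For these two tasks the result is ocean/spherical/qu/cosine_bell, the level at or below which both tasks run.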
Requirement: Working directory step paths are easily discoverable by users.
There should be a way to list the paths within the work directory of all steps in each task. There should also be a way for a user to find the steps in a task from the task’s work directory.
Requirement: The output of shared steps may be used by multiple tasks.
A step may only be shared across multiple tasks if its output would be identical for each task.
Requirement: tasks do not rely on outputs from steps in other tasks
All tasks are self-contained and rely only on either shared steps or steps they contain.
Implementation
Implementation: Shared steps are set up once.
As before, setup of either a list of tasks or a suite proceeds by iterating through the tasks and then through the steps in each task. An attribute setup_complete has been added to Step and is initialized to False. In the setup_task() function, setup is skipped for any steps where step.setup_complete == True, and this attribute is set to True once setup of a step has completed.
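The skip logic can be illustrated with a toy example. This is only a sketch: the Step class below is a stand-in rather than the polaris class, the setup_task() signature is simplified for illustration, and setup_count exists only so the example can show how often setup ran.

```python
class Step:
    """Toy stand-in for the polaris Step class (illustration only)."""

    def __init__(self, name):
        self.name = name
        # New attribute: a freshly created step has not been set up yet
        self.setup_complete = False
        self.setup_count = 0  # bookkeeping for this example only

    def setup(self):
        self.setup_count += 1


def setup_task(steps):
    """Set up each step of one task, skipping steps already set up."""
    for step in steps:
        if step.setup_complete:
            continue
        step.setup()
        step.setup_complete = True


base_mesh = Step('qu_base_mesh_60km')  # shared by both tasks
init = Step('qu_init_60km')  # shared by both tasks
viz = Step('qu_viz_60km')  # used by the second task only

# Two tasks that share the base-mesh and init steps
for task_steps in ([base_mesh, init], [base_mesh, init, viz]):
    setup_task(task_steps)
```

Each shared step is set up exactly once, even though it appears in both tasks.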
Implementation: Shared steps are run before steps that depend on their output.
This requirement is already satisfied by the task parallelism design, which makes use of file dependencies. When running in task-serial mode, shared steps will be added to the dictionary of steps before any other steps that rely on them.
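For the task-serial case, the ordering can be sketched with a plain dictionary, which preserves insertion order in Python. This sketch assumes each task lists its steps in the order they should run. The function name collect_steps and the data layout are hypothetical and do not reflect the actual polaris internals.

```python
def collect_steps(tasks):
    """Merge the steps of all tasks into one dict in run order.

    ``tasks`` maps a task path to an ordered list of step paths.
    A step shared by several tasks is added only the first time it is seen.
    """
    steps_to_run = {}
    for task_path, step_paths in tasks.items():
        for step_path in step_paths:
            # dicts preserve insertion order, so a shared step keeps the
            # position at which it was first added
            steps_to_run.setdefault(step_path, task_path)
    return steps_to_run


tasks = {
    'ocean/spherical/qu/cosine_bell': [
        'ocean/spherical/qu/base_mesh/60km',
        'ocean/spherical/qu/cosine_bell/init/60km',
        'ocean/spherical/qu/cosine_bell/forward/60km',
    ],
    'ocean/spherical/qu/cosine_bell/with_viz': [
        'ocean/spherical/qu/base_mesh/60km',
        'ocean/spherical/qu/cosine_bell/init/60km',
        'ocean/spherical/qu/cosine_bell/forward/60km',
        'ocean/spherical/qu/cosine_bell/with_viz/map/60km',
    ],
}
order = list(collect_steps(tasks))
```

The shared base-mesh, init and forward steps appear once, ahead of the map step that depends on them.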
Implementation: Shared steps are not daughters of a task
The task attribute and constructor argument of the Step class have been replaced by the component attribute. The step’s subdir attribute is now relative to the component’s work directory, rather than a parent task’s work directory.
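The effect on a step's location can be sketched as follows. The classes are toy stand-ins, not the polaris classes, and the path property is an assumption about how the component name and subdir combine, chosen to match the paths shown in this document.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Component:
    """Toy stand-in for a polaris component (illustration only)."""
    name: str


@dataclass
class Step:
    """Toy stand-in for a step that belongs to a component, not a task."""
    component: Component
    name: str
    subdir: str  # relative to the component's work directory

    @property
    def path(self):
        # Path relative to the base work directory (assumed convention)
        return posixpath.join(self.component.name, self.subdir)


ocean = Component(name='ocean')
step = Step(component=ocean, name='qu_base_mesh_60km',
            subdir='spherical/qu/base_mesh/60km')
```

Here step.path is ocean/spherical/qu/base_mesh/60km, with no task directory in it.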
Implementation: Working directory structure is intuitive.
Shared steps reside inside a task’s work directory only when another task also lies within that work directory. The only such tasks at the moment are the cosine_bell/with_viz tasks, which reside inside the cosine_bell tasks. The cosine_bell/with_viz tasks share all of the steps of the cosine_bell tasks (base mesh, init and forward for each resolution, and a single analysis step) and also add remapping and visualization steps that are not shared with any other tasks:
cosine_bell:
ocean
  spherical
    qu
      base_mesh
        60km
        90km
        120km
        150km
        180km
        210km
        240km
      cosine_bell
        init
          60km
          90km
          120km
          150km
          180km
          210km
          240km
        forward
          60km
          90km
          120km
          150km
          180km
          210km
          240km
        analysis
cosine_bell/with_viz:
ocean
  spherical
    qu
      base_mesh
        60km
        90km
        120km
        150km
        180km
        210km
        240km
      cosine_bell
        init
          60km
          90km
          120km
          150km
          180km
          210km
          240km
        forward
          60km
          90km
          120km
          150km
          180km
          210km
          240km
        analysis
        with_viz
          map
            60km
            90km
            120km
            150km
            180km
            210km
            240km
          viz
            60km
            90km
            120km
            150km
            180km
            210km
            240km
Implementation: Working directory step paths are easily discoverable by users.
This is implemented in two ways.
First, polaris list --verbose now lists the work-directory relative path of steps, rather than their path relative to the task’s work directory:
$ polaris list --verbose
...
10: path: ocean/spherical/qu/cosine_bell/with_viz
name: cosine_bell
component: ocean
subdir: spherical/qu/cosine_bell/with_viz
steps:
- qu_base_mesh_60km: ocean/spherical/qu/base_mesh/60km
- qu_init_60km: ocean/spherical/qu/cosine_bell/init/60km
- qu_forward_60km: ocean/spherical/qu/cosine_bell/forward/60km
- qu_map_60km: ocean/spherical/qu/cosine_bell/with_viz/map/60km
- qu_viz_60km: ocean/spherical/qu/cosine_bell/with_viz/viz/60km
- qu_base_mesh_90km: ocean/spherical/qu/base_mesh/90km
- qu_init_90km: ocean/spherical/qu/cosine_bell/init/90km
- qu_forward_90km: ocean/spherical/qu/cosine_bell/forward/90km
- qu_map_90km: ocean/spherical/qu/cosine_bell/with_viz/map/90km
- qu_viz_90km: ocean/spherical/qu/cosine_bell/with_viz/viz/90km
- qu_base_mesh_120km: ocean/spherical/qu/base_mesh/120km
- qu_init_120km: ocean/spherical/qu/cosine_bell/init/120km
- qu_forward_120km: ocean/spherical/qu/cosine_bell/forward/120km
- qu_map_120km: ocean/spherical/qu/cosine_bell/with_viz/map/120km
- qu_viz_120km: ocean/spherical/qu/cosine_bell/with_viz/viz/120km
- qu_base_mesh_150km: ocean/spherical/qu/base_mesh/150km
- qu_init_150km: ocean/spherical/qu/cosine_bell/init/150km
- qu_forward_150km: ocean/spherical/qu/cosine_bell/forward/150km
- qu_map_150km: ocean/spherical/qu/cosine_bell/with_viz/map/150km
- qu_viz_150km: ocean/spherical/qu/cosine_bell/with_viz/viz/150km
- qu_base_mesh_180km: ocean/spherical/qu/base_mesh/180km
- qu_init_180km: ocean/spherical/qu/cosine_bell/init/180km
- qu_forward_180km: ocean/spherical/qu/cosine_bell/forward/180km
- qu_map_180km: ocean/spherical/qu/cosine_bell/with_viz/map/180km
- qu_viz_180km: ocean/spherical/qu/cosine_bell/with_viz/viz/180km
- qu_base_mesh_210km: ocean/spherical/qu/base_mesh/210km
- qu_init_210km: ocean/spherical/qu/cosine_bell/init/210km
- qu_forward_210km: ocean/spherical/qu/cosine_bell/forward/210km
- qu_map_210km: ocean/spherical/qu/cosine_bell/with_viz/map/210km
- qu_viz_210km: ocean/spherical/qu/cosine_bell/with_viz/viz/210km
- qu_base_mesh_240km: ocean/spherical/qu/base_mesh/240km
- qu_init_240km: ocean/spherical/qu/cosine_bell/init/240km
- qu_forward_240km: ocean/spherical/qu/cosine_bell/forward/240km
- qu_map_240km: ocean/spherical/qu/cosine_bell/with_viz/map/240km
- qu_viz_240km: ocean/spherical/qu/cosine_bell/with_viz/viz/240km
- analysis: ocean/spherical/qu/cosine_bell/analysis
Second, we add symlinks within the task to the shared steps. In the listing that follows, these subdirectories are links to shared steps that reside elsewhere up the directory tree: each resolution in base_mesh, init and forward, and also analysis.
cosine_bell/with_viz:
ocean
  spherical
    qu
      cosine_bell
        with_viz
          base_mesh
            60km
            90km
            120km
            150km
            180km
            210km
            240km
          init
            60km
            90km
            120km
            150km
            180km
            210km
            240km
          forward
            60km
            90km
            120km
            150km
            180km
            210km
            240km
          map
            60km
            90km
            120km
            150km
            180km
            210km
            240km
          viz
            60km
            90km
            120km
            150km
            180km
            210km
            240km
          analysis
Thus, a structure similar to what we had before shared steps is maintained locally, which should make debugging easier.
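A local link of this kind could be created roughly as follows. This is a sketch under assumptions: the helper link_shared_step is hypothetical, and the use of relative links is an illustrative choice rather than a statement about how polaris creates them.

```python
import os
import tempfile


def link_shared_step(work_dir, task_path, step_path, link_name):
    """Create ``<task>/<link_name>`` as a relative symlink to a shared step.

    Hypothetical helper; task_path and step_path are relative to work_dir.
    """
    target = os.path.join(work_dir, step_path)
    link = os.path.join(work_dir, task_path, link_name)
    os.makedirs(target, exist_ok=True)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    # A relative link keeps working if the whole work directory is moved
    os.symlink(os.path.relpath(target, os.path.dirname(link)), link)
    return link


work_dir = tempfile.mkdtemp()
link = link_shared_step(
    work_dir,
    task_path='ocean/spherical/qu/cosine_bell/with_viz',
    step_path='ocean/spherical/qu/base_mesh/60km',
    link_name='base_mesh/60km',
)
```

After this call, the with_viz task directory contains base_mesh/60km as a link to the shared base-mesh step higher up the tree.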
Implementation: The output of shared steps may be used by multiple tasks.
Task steps that use the output of shared steps will make use of symbolic links as before.
Implementation: tasks do not rely on outputs from steps in other tasks
There were not any polaris tasks that relied on outputs from other tasks even before the implementation of shared steps. There are tasks in Compass, though, such as global ocean mesh, init and dynamic_adjustment, that do allow outputs from one task to be inputs of another. As these are ported to Polaris, we will make sure they use shared steps instead.
Testing
Testing And Validation: Shared steps are run once.
Output from running a series of tasks or a suite indicates when shared steps are skipped because they already ran (already completed):
ocean/spherical/icos/cosine_bell
  * step: icos_base_mesh_60km
      execution: SUCCESS
      runtime: 0:01:00
  * step: icos_init_60km
      execution: SUCCESS
      runtime: 0:00:00
  * step: icos_forward_60km
      execution: SUCCESS
      runtime: 0:00:38
  ...
  * step: analysis
      execution: SUCCESS
      runtime: 0:00:02
  task execution: SUCCESS
  task runtime: 0:02:59
ocean/spherical/icos/cosine_bell/with_viz
  * step: icos_base_mesh_60km
      already completed
  * step: icos_init_60km
      already completed
  * step: icos_forward_60km
      already completed
  * step: icos_map_60km
      execution: SUCCESS
      runtime: 0:00:20
  * step: icos_viz_60km
      execution: SUCCESS
      runtime: 0:00:06
  ...
  * step: analysis
      already completed
  task execution: SUCCESS
  task runtime: 0:03:23
Testing And Validation: Shared steps are run before steps that depend on their output.
As before, steps are added to tasks in the order they are to be run, ensuring that shared steps run before steps that require their output when running in task-serial mode (polaris serial). Task parallelism already has mechanisms to prevent steps from running before their dependencies are available, and this is not expected to be affected by shared steps. However, no testing with task parallelism will be performed at this time.
Testing And Validation: Shared steps are not daughters of a task
Steps run successfully even after we have removed the task attribute from them, indicating that they no longer rely on information about a task they formerly belonged to.
Testing And Validation: Working directory structure is intuitive.
The intuitive work directory structure will need to be maintained by developers as new tasks and steps are added, as it is not enforced by the framework. The proposed implementation ensures that shared steps either reside closer to the root of the directory structure than the tasks that use them or live inside of those tasks, which we have deemed an intuitive structure.
Testing And Validation: Working directory step paths are easily discoverable by users.
Between polaris list --verbose and the local symlinks to shared steps within each task, we think the shared steps will be discoverable by users and developers.
Testing And Validation: The output of shared steps may be used by multiple tasks.
We have implemented shared steps for base meshes, initial conditions and forward runs, and shown that multiple tasks can make use of their output.
Testing And Validation: tasks do not rely on outputs from steps in other tasks
This is not enforced; it will simply need to be maintained as the preferred convention for future development. Currently, all tasks can be run independently and do not rely on any other tasks.