(users-parallel)=

# Parallel execution with `mache.parallel`

`mache.parallel` provides a machine-aware interface for launching parallel workloads based on each machine's config file.

## Typical downstream workflow

Downstream software (for example, Polaris) can:

1. Load the machine config with `MachineInfo`.
2. Build a parallel-system object with `get_parallel_system()`.
3. Query available resources (`cores`, `nodes`, `gpus`, and `mpi_allowed`).
4. Build a machine-correct launcher command with `get_parallel_command()`.
5. Use the command in generated job scripts or in direct subprocess calls.

## Example: build a launcher command

```python
from mache import MachineInfo
from mache.parallel import get_parallel_system

machine_info = MachineInfo()
parallel_system = get_parallel_system(machine_info.config)

args = ["python", "-m", "your_package.run_task", "--case", "smoke"]
command = parallel_system.get_parallel_command(
    args=args,
    ntasks=4,
    cpus_per_task=2,
    gpus_per_task=0,
)
print(" ".join(command))
```

On a batch allocation, this returns an `srun`/`mpiexec` command using the machine's configured launcher and resource flags. On login nodes for `slurm` or `pbs` systems, `get_parallel_system()` falls back to `login`, where MPI is intentionally disabled.

## GPU-per-task flags

When `gpus_per_task > 0` is passed to `get_parallel_command()`:

- `slurm` systems add `--gpus-per-task` by default. This can be overridden with `gpus_per_task_flag` in the machine's `[parallel]` config.
- `pbs` systems require a machine-specific `gpus_per_task_flag` to be set in config before a GPU-per-task argument is added.

## Hyperthreading

`mache.parallel` does not currently have a dedicated `hyperthreading = true/false` switch. Instead, hyperthreading behavior is controlled through the machine's `[parallel]` config and the resource values passed to `get_parallel_command()`. The most important config knobs are `cores_per_node`, `max_mpi_tasks_per_node`, and `cpu_bind`.

The default convention in mache's machine configs is to describe CPU resources in terms of physical cores, not hardware threads. For E3SM itself, and for most downstream software, this means:

- `cores_per_node` should usually be the number of physical CPU cores per node
- `max_mpi_tasks_per_node` should usually reflect the intended non-hyperthreaded MPI rank count per node
- `cpu_bind = cores` is the recommended default when the launcher supports it
- `cpus_per_task` should usually be sized assuming physical cores

This is why several shipped machine configs explicitly document `cores_per_node` as the count "without hyperthreading".

If a downstream application wants to take advantage of hyperthreading, it should opt in by overriding the relevant parallel config values for that use case. In practice, that usually means switching from physical-core counts to hardware-thread counts and adjusting binding accordingly. For example, on a machine with 64 physical cores and 2 hardware threads per core:

```ini
[parallel]
cores_per_node = 128
max_mpi_tasks_per_node = 128
cpu_bind = threads
```

Calls to `get_parallel_command()` should then use `cpus_per_task` and `ntasks` values that match that threaded layout, as in the sketch below.

The important point is that hyperthreading is opt-in. Mache's default machine configs should generally preserve the physical-core layout that is appropriate for E3SM and most downstream tools, while still allowing downstream users to provide a config override when they intentionally want thread-level placement.
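To make the opt-in concrete, here is a minimal sketch of a launch call sized for the threaded layout above. It assumes the threaded `[parallel]` override shown above is active on the machine; the `your_package.run_task` module, the case name, and the specific rank/thread counts are placeholders, not part of `mache`.

```python
from mache import MachineInfo
from mache.parallel import get_parallel_system

machine_info = MachineInfo()
parallel_system = get_parallel_system(machine_info.config)

# Hypothetical threaded layout on the 64-core, 2-thread-per-core example:
# 32 MPI ranks per node, each bound to 4 hardware threads (32 * 4 = 128).
command = parallel_system.get_parallel_command(
    args=["python", "-m", "your_package.run_task", "--case", "smoke"],
    ntasks=32,
    cpus_per_task=4,
    gpus_per_task=0,
)
print(" ".join(command))
```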
## Using this in generated job scripts

A common pattern is to generate scheduler directives separately, then use `mache.parallel` only for launch lines. For example:

- Use `MachineInfo.get_account_defaults()` to populate the account, partition, and QOS.
- Use `MachineInfo.get_queue_specs()`, `MachineInfo.get_partition_specs()`, or `MachineInfo.get_qos_specs()` for optional scheduler-target policy metadata (`min_nodes`, `max_nodes`, `max_wallclock`) when available.
- Render scheduler headers (`#SBATCH` or `#PBS`) in your template logic.
- Use `get_parallel_command()` to build the executable line.

This keeps scheduler policy in your tool while reusing machine-specific launch behavior from `mache`.

## Selecting scheduler options by node count

`mache.parallel` also provides helpers for selecting a queue, partition, or QOS from machine metadata:

- `ParallelSystem.get_scheduler_target(config, target_type, nodes)` selects one of `queue`, `partition`, or `qos`.
- `ParallelSystem.resolve_submission(config, nodes, target_type, min_nodes_allowed=None)` returns a `SubmissionResolution` with fields `target`, `requested_nodes`, `effective_nodes`, and `adjustment` (`exact`, `decrease`, or `increase`).
- `SlurmSystem.get_slurm_options(config, nodes, min_nodes_allowed=None)` returns `(partition, qos, constraint, gpus_per_node, max_wallclock, effective_nodes)`.
- `PbsSystem.get_pbs_options(config, nodes, min_nodes_allowed=None)` returns `(queue, constraint, gpus_per_node, max_wallclock, filesystems, effective_nodes)`.

When a requested node count falls in a gap between valid scheduler ranges, it is adjusted to the nearest valid value, preferring a lower adjustment when feasible. If `min_nodes_allowed` disallows lowering, resolution moves up to the next valid range. If no feasible target exists, these functions raise `ValueError`.
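A minimal sketch of node-count resolution follows. It assumes the object returned by `get_parallel_system()` exposes `resolve_submission()` with the signature listed above and that `target_type` accepts the string `"qos"`; both are assumptions based only on the signatures documented on this page.

```python
from mache import MachineInfo
from mache.parallel import get_parallel_system

machine_info = MachineInfo()
config = machine_info.config
parallel_system = get_parallel_system(config)

# Ask which QOS a 12-node job should target; the helper may adjust the node
# count if 12 falls in a gap between the configured QOS ranges.
resolution = parallel_system.resolve_submission(
    config, nodes=12, target_type="qos"
)
print(resolution.target, resolution.effective_nodes, resolution.adjustment)
```

When `adjustment` is `decrease` or `increase`, `effective_nodes` is the value to render into the scheduler header rather than the originally requested node count.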