Adding a New Machine to Mache

Adding an E3SM-known machine to mache requires adding a new config file, as well as updating the list of machines in discover.py.

Note

Only machines that are included in mache’s machine config list can be added to mache. This list is a copy of the E3SM CIME machine config list, which we try to keep up to date. If you wish to add a machine that is not in this list, contact the E3SM-Project developers to have it added.

Adding a new config file

Adding a new config file is usually straightforward if you follow the format of an existing config file.

For machines with a known E3SM inputdata location, also add an [inputdata] section with:

  • base_path: base directory for the shared E3SM inputdata tree

When the machine also appears in mache/cime_machine_config/config_machines.xml, this value should match that machine’s DIN_LOC_ROOT entry.
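As a minimal sketch, the section could look like the fragment below, which is read here with Python's standard configparser. It assumes the INI-style syntax implied by the section names in this guide, and the path is a hypothetical placeholder rather than the value for any real machine.

```python
import configparser

# Hypothetical config fragment; the path is a placeholder, not a real value.
CONFIG_TEXT = """
[inputdata]
# base directory for the shared E3SM inputdata tree
base_path = /path/to/e3sm/inputdata
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Look up the shared inputdata root defined by the machine config.
base_path = config.get("inputdata", "base_path")
print(base_path)
```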

Parallel execution settings

Machine config files now include parallel-resource settings that are consumed by mache.parallel. At minimum, each machine should define a [parallel] section with:

  • system: one of slurm, pbs, single_node, or login

  • parallel_executable: launcher command (for example, srun --label or mpiexec --label)

Depending on the parallel system, the following options are typically required:

  • cores_per_node

  • gpus_per_node (if GPUs are available)

  • max_mpi_tasks_per_node

  • cpus_per_task_flag (primarily for PBS launchers)

  • cpu_bind, gpu_bind, mem_bind, placement (optional launcher tuning)

  • login_cores, login_gpus (for the login system)
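The options above might be combined as in the following sketch, which parses a [parallel] section with Python's standard configparser and checks the system value against the four choices listed earlier. The numeric values are illustrative assumptions and do not describe any real machine; the validation step is not mache's own code.

```python
import configparser

# Illustrative values only; they do not describe any real machine.
CONFIG_TEXT = """
[parallel]
system = slurm
parallel_executable = srun --label
cores_per_node = 64
max_mpi_tasks_per_node = 64
cpu_bind = cores
"""

# The four systems named in this guide.
VALID_SYSTEMS = {"slurm", "pbs", "single_node", "login"}

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
parallel = config["parallel"]

system = parallel["system"]
if system not in VALID_SYSTEMS:
    raise ValueError(f"unsupported parallel system: {system}")

# Numeric options are stored as text, so convert them explicitly.
cores_per_node = parallel.getint("cores_per_node")
max_tasks = parallel.getint("max_mpi_tasks_per_node")
print(system, cores_per_node, max_tasks)
```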

For machines with hyperthreading, mache’s convention is that cores_per_node should normally be the physical-core count, not the hardware-thread count. Likewise, max_mpi_tasks_per_node should normally reflect the default non-hyperthreaded layout used by E3SM and most downstream software, and cpu_bind = cores is the preferred default when supported by the launcher. Downstream projects that intentionally want hyperthreading can override these settings in their own config to use hardware-thread counts and thread binding. In other words, hyperthreading is controlled through these config fields rather than through a dedicated boolean option.
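To illustrate the convention, the sketch below layers a downstream override on top of machine defaults by reading two config fragments in order, so later values replace earlier ones. The core counts are hypothetical (a node with 64 physical cores and 2 hardware threads per core), cpu_bind = threads is assumed to be a value the launcher accepts, and layering with configparser is only one way a downstream project might apply its own config.

```python
import configparser

# Machine defaults following the convention: physical cores, core binding.
MACHINE_DEFAULTS = """
[parallel]
cores_per_node = 64
max_mpi_tasks_per_node = 64
cpu_bind = cores
"""

# Hypothetical downstream override that opts in to hyperthreading.
DOWNSTREAM_OVERRIDE = """
[parallel]
cores_per_node = 128
max_mpi_tasks_per_node = 128
cpu_bind = threads
"""

config = configparser.ConfigParser()
config.read_string(MACHINE_DEFAULTS)
# Reading a second source updates options that were already defined.
config.read_string(DOWNSTREAM_OVERRIDE)

print(dict(config["parallel"]))
```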

Compiler-specific overrides can be provided in optional [parallel.<compiler>] sections, e.g. [parallel.gnu].
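One way to picture how such an override could be resolved is sketched below. The helper resolve_parallel_options is hypothetical and is not part of mache; it simply overlays the options in [parallel.<compiler>] on those in [parallel]. The values are illustrative.

```python
import configparser

# Illustrative values only.
CONFIG_TEXT = """
[parallel]
system = slurm
parallel_executable = srun --label
max_mpi_tasks_per_node = 64

[parallel.gnu]
max_mpi_tasks_per_node = 32
"""


def resolve_parallel_options(config, compiler):
    """Hypothetical helper: overlay [parallel.<compiler>] on [parallel]."""
    options = dict(config["parallel"])
    override = f"parallel.{compiler}"
    if config.has_section(override):
        # Compiler-specific values take precedence over the generic ones.
        options.update(config[override])
    return options


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

print(resolve_parallel_options(config, "gnu"))
```

A compiler without its own section, such as a hypothetical "intel" here, would fall back to the plain [parallel] values.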

For machines with scheduler-target policy limits, you can also define optional sections for queue-, partition-, or QOS-based scheduler targets:

  • [queue.<name>] sections corresponding to entries in parallel.queues

  • [partition.<name>] sections corresponding to entries in parallel.partitions

  • [qos.<name>] sections corresponding to entries in parallel.qos

Supported keys are:

  • min_nodes: minimum node count for this scheduler target

  • max_nodes: maximum node count for this scheduler target (leave unset for no upper bound)

  • max_wallclock: maximum allowed wall-clock time (for example, 01:00:00)

Downstream software can query these values with MachineInfo.get_queue_specs(), MachineInfo.get_partition_specs(), MachineInfo.get_qos_specs() or MachineInfo.get_scheduler_specs().
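The sketch below shows what partition sections with these keys might look like and how a job request could be checked against them. It reads the config directly with configparser rather than through MachineInfo, whose return format is not described here. The partition names, the limits, the comma-separated form of the partitions entry, and the helper check_request are all assumptions made for illustration.

```python
import configparser

# Illustrative limits; the comma-separated "partitions" format is an assumption.
CONFIG_TEXT = """
[parallel]
system = slurm
partitions = debug, compute

[partition.debug]
min_nodes = 1
max_nodes = 4
max_wallclock = 01:00:00

[partition.compute]
min_nodes = 1
max_wallclock = 48:00:00
"""


def wallclock_to_seconds(text):
    """Convert an HH:MM:SS string to seconds (other formats are not handled)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_request(config, partition, nodes, wallclock):
    """Hypothetical check of a request against [partition.<name>] limits."""
    section = config[f"partition.{partition}"]
    problems = []
    min_nodes = section.getint("min_nodes", fallback=None)
    max_nodes = section.getint("max_nodes", fallback=None)
    max_wallclock = section.get("max_wallclock", fallback=None)
    if min_nodes is not None and nodes < min_nodes:
        problems.append(f"needs at least {min_nodes} node(s)")
    # An unset max_nodes means there is no upper bound.
    if max_nodes is not None and nodes > max_nodes:
        problems.append(f"allows at most {max_nodes} node(s)")
    if max_wallclock is not None and (
        wallclock_to_seconds(wallclock) > wallclock_to_seconds(max_wallclock)
    ):
        problems.append(f"allows at most {max_wallclock} of wall-clock time")
    return problems


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

print(check_request(config, "debug", 8, "02:00:00"))
```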

The parallel settings described in this section are used to:

  • detect available resources on the current allocation,

  • construct launcher commands via mache.parallel, and

  • enforce machine-specific limits like max MPI tasks per node.
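As a rough illustration of the last two uses, the sketch below computes how many nodes a job needs under a per-node task limit and assembles a launcher command from parallel_executable. Both functions are hypothetical and do not reproduce how mache.parallel builds commands; the only launcher flag used is -n for the task count, and ./model is a placeholder program.

```python
import shlex


def nodes_required(ntasks, max_mpi_tasks_per_node):
    """Smallest node count that keeps each node within the task limit."""
    if ntasks < 1 or max_mpi_tasks_per_node < 1:
        raise ValueError("task counts must be positive")
    # Ceiling division without floating-point arithmetic.
    return -(-ntasks // max_mpi_tasks_per_node)


def build_command(parallel_executable, ntasks, program):
    """Hypothetical launcher command: executable, task count, then program."""
    return shlex.split(parallel_executable) + ["-n", str(ntasks)] + list(program)


print(nodes_required(129, 64))
print(build_command("srun --label", 129, ["./model"]))
```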

Adding the new machine to discover.py

You will need to amend the list of machine names in discover.py so that mache can identify the new machine via its hostname. This is typically done with a regular expression, which works whenever the machine’s hostnames follow a standardized format. For example, we can identify known machines from their hostnames with the following regular expressions:

r'^chr-\d{4}'    # Chrysalis compute nodes with hostnames chr-0000 to chr-9999
r'^compy'        # Compy nodes with hostnames beginning with compy
r'^dane\d{1,4}'  # Dane nodes with hostnames dane0 to dane9999
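A minimal sketch of how such patterns can be applied is shown below. The table HOSTNAME_PATTERNS and the function match_hostname are hypothetical names chosen for illustration, and the machine names on the right are assumed; the actual structure of discover.py may differ.

```python
import re

# Hypothetical lookup table: (hostname pattern, machine name).
HOSTNAME_PATTERNS = [
    (r"^chr-\d{4}", "chrysalis"),
    (r"^compy", "compy"),
    (r"^dane\d{1,4}", "dane"),
]


def match_hostname(hostname):
    """Return the first machine whose pattern matches, or None."""
    for pattern, machine in HOSTNAME_PATTERNS:
        # re.match anchors at the start of the string.
        if re.match(pattern, hostname):
            return machine
    return None


print(match_hostname("chr-0123"))
```

In practice the hostname to test would usually come from socket.gethostname().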

In some cases, the hostname assigned to a machine is too generic to differentiate it from other machines. In these cases, we must identify the machine by its environment variables. However, this is not the recommended procedure and should only be done as a last resort. For example, we identify Frontier by its LMOD_SYSTEM_NAME environment variable:

if machine is None and 'LMOD_SYSTEM_NAME' in os.environ:
    hostname = os.environ['LMOD_SYSTEM_NAME']
    if hostname == 'frontier':
        # frontier's hostname is too generic to detect, so relying on
        # LMOD_SYSTEM_NAME
        machine = 'frontier'
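The same fallback can be written as a small standalone function, which makes it easy to exercise without changing the real environment. This is a sketch only: machine_from_environment is a hypothetical name, and passing the environment as an argument is a choice made here for testability.

```python
import os


def machine_from_environment(environ=None):
    """Hypothetical last-resort detection based on LMOD_SYSTEM_NAME."""
    environ = os.environ if environ is None else environ
    # Only a known value is accepted; anything else yields None.
    if environ.get("LMOD_SYSTEM_NAME") == "frontier":
        return "frontier"
    return None


print(machine_from_environment({"LMOD_SYSTEM_NAME": "frontier"}))
```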

Note

Identifying the machine by environment variables is not recommended unless absolutely necessary.