Adding a New Machine to Mache
Adding an E3SM-known machine to mache requires adding a new config file, as
well as updating the list of machines in discover.py.
Note
Only machines that are included in mache’s machine config list can be added to mache. This list is a copy of the E3SM CIME machine config list, which we try to keep up to date. If you wish to add a machine that is not included in this list, you must contact the E3SM-Project developers to add your machine.
Adding a new config file
Adding a new config file is usually straightforward if you follow the format of an existing config file.
For machines with a known E3SM inputdata location, also add an [inputdata]
section with:
base_path: base directory for the shared E3SM inputdata tree
When the machine also appears in
mache/cime_machine_config/config_machines.xml, this value should match that
machine’s DIN_LOC_ROOT entry.
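As a minimal sketch of what such a section might look like, the snippet below parses a hypothetical fragment with Python's standard `configparser`. The path is a placeholder rather than any real machine's value, and the `key = value` layout is assumed from the style of existing config files.

```python
import configparser

# Hypothetical config fragment; the path is a placeholder, not a real value.
CONFIG_TEXT = """
[inputdata]
# base directory for the shared E3SM inputdata tree
base_path = /path/to/e3sm/inputdata
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Look up the shared inputdata root the way any config consumer could.
base_path = config.get("inputdata", "base_path")
print(base_path)
```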
Parallel execution settings
Machine config files now include parallel-resource settings that are consumed
by mache.parallel. At minimum, each machine should define a [parallel]
section with:
system: one of slurm, pbs, single_node, or login
parallel_executable: launcher command (for example, srun --label or mpiexec --label)
Depending on the parallel system, the following options are typically required:
cores_per_node
gpus_per_node (if GPUs are available)
max_mpi_tasks_per_node
cpus_per_task_flag (primarily for PBS launchers)
cpu_bind, gpu_bind, mem_bind, placement (optional launcher tuning)
login_cores, login_gpus (for the login system)
For machines with hyperthreading, mache’s convention is that cores_per_node
should normally be the physical-core count, not the hardware-thread count.
Likewise, max_mpi_tasks_per_node should normally reflect the default
non-hyperthreaded layout used by E3SM and most downstream software, and
cpu_bind = cores is the preferred default when supported by the launcher.
Downstream projects that intentionally want hyperthreading can override these
settings in their own config to use hardware-thread counts and thread binding.
In other words, these config fields are the hyperthreading controls rather
than a dedicated boolean option.
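The following sketch illustrates the convention for a hypothetical node with 64 physical cores, 2 hardware threads per core and 4 GPUs. All values are made up for illustration, and the `cpu_bind = threads` override is an assumed spelling of "thread binding", not a value confirmed by mache. It uses only the standard `configparser` to show how a downstream config read second would replace just the keys it defines.

```python
import configparser

# Machine defaults: counts reflect physical cores, not hardware threads.
MACHINE_CFG = """
[parallel]
system = slurm
parallel_executable = srun --label
cores_per_node = 64
gpus_per_node = 4
max_mpi_tasks_per_node = 64
cpu_bind = cores
"""

# Hypothetical downstream override that opts in to hyperthreading.
DOWNSTREAM_CFG = """
[parallel]
cores_per_node = 128
max_mpi_tasks_per_node = 128
cpu_bind = threads
"""

config = configparser.ConfigParser()
config.read_string(MACHINE_CFG)
default_cores = config.getint("parallel", "cores_per_node")

# Reading a second source replaces only the keys that it defines.
config.read_string(DOWNSTREAM_CFG)
ht_cores = config.getint("parallel", "cores_per_node")

print(default_cores, ht_cores, config.get("parallel", "system"))
```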
Compiler-specific overrides can be provided in optional
[parallel.<compiler>] sections, e.g. [parallel.gnu].
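One plausible way to read such overrides is to overlay the compiler-specific section on top of the base `[parallel]` section. The helper below, `parallel_options`, is hypothetical and is not part of mache; the merge semantics and the example values are assumptions made for illustration.

```python
import configparser

# Illustrative fragment: the gnu override changes only cpu_bind.
CONFIG_TEXT = """
[parallel]
system = slurm
parallel_executable = srun --label
cpu_bind = cores

[parallel.gnu]
cpu_bind = none
"""


def parallel_options(config, compiler=None):
    """Return [parallel] options overlaid with [parallel.<compiler>], if any."""
    options = dict(config["parallel"])
    section = f"parallel.{compiler}"
    if compiler is not None and config.has_section(section):
        # Keys in the compiler section win over the base section.
        options.update(config[section])
    return options


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

print(parallel_options(config, "gnu"))
print(parallel_options(config, "intel"))
```

A compiler without its own section simply falls back to the base `[parallel]` values.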
For machines with scheduler-target policy limits, you can also define optional sections for queue- or partition-based schedulers:
[queue.<name>] sections corresponding to entries in parallel.queues
[partition.<name>] sections corresponding to entries in parallel.partitions
[qos.<name>] sections corresponding to entries in parallel.qos
Supported keys are:
min_nodes: minimum node count for this scheduler target
max_nodes: maximum node count for this scheduler target (leave unset for no upper bound)
max_wallclock: maximum allowed wall-clock time (for example, 01:00:00)
Downstream software can query these values with
MachineInfo.get_queue_specs(), MachineInfo.get_partition_specs(),
MachineInfo.get_qos_specs() or
MachineInfo.get_scheduler_specs().
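To show the shape of these sections without relying on the details of the `MachineInfo` methods, here is a standalone sketch that collects the supported keys with `configparser`. The partition names `debug` and `regular`, their limits, and the helpers `read_specs` and `wallclock_to_seconds` are all hypothetical; this is not mache's implementation.

```python
import configparser

# Hypothetical partitions; "regular" leaves max_nodes unset (no upper bound).
CONFIG_TEXT = """
[partition.debug]
min_nodes = 1
max_nodes = 4
max_wallclock = 01:00:00

[partition.regular]
min_nodes = 1
max_wallclock = 48:00:00
"""


def wallclock_to_seconds(text):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def read_specs(config, kind):
    """Collect limits from every [<kind>.<name>] section, keyed by name."""
    prefix = f"{kind}."
    specs = {}
    for section in config.sections():
        if not section.startswith(prefix):
            continue
        name = section[len(prefix):]
        specs[name] = {
            "min_nodes": config.getint(section, "min_nodes", fallback=None),
            "max_nodes": config.getint(section, "max_nodes", fallback=None),
            "max_wallclock": config.get(section, "max_wallclock", fallback=None),
        }
    return specs


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
specs = read_specs(config, "partition")
print(specs)
```

Representing an unset `max_nodes` as `None` keeps "no upper bound" distinct from a limit of zero.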
These options are used to:
detect available resources on the current allocation,
construct launcher commands via mache.parallel, and
enforce machine-specific limits like max MPI tasks per node.
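To make the last two uses more concrete, the sketch below builds a Slurm launch command from a few `[parallel]` values and respects the per-node task limit. It is an illustration only, not the logic in mache.parallel: the function `build_launch_command` is hypothetical, the option values are made up, and the default of `-c` for `cpus_per_task_flag` is an assumption. The `srun` flags used (`-N`, `-n`, `-c`, `--cpu-bind`) are standard Slurm options.

```python
import math
import shlex


def build_launch_command(parallel, ntasks, cpus_per_task=1):
    """Build an srun-style command list from [parallel]-like options."""
    cores = int(parallel["cores_per_node"])
    max_tasks = int(parallel["max_mpi_tasks_per_node"])

    # Tasks per node are capped by both the machine limit and available cores.
    tasks_per_node = min(max_tasks, cores // cpus_per_task)
    if tasks_per_node < 1:
        raise ValueError("cpus_per_task exceeds cores_per_node")
    nodes = math.ceil(ntasks / tasks_per_node)

    # Assumed default flag; a machine config could supply a different one.
    cpus_flag = parallel.get("cpus_per_task_flag", "-c")

    command = shlex.split(parallel["parallel_executable"])
    command += ["-N", str(nodes), "-n", str(ntasks)]
    command += [cpus_flag, str(cpus_per_task)]
    if "cpu_bind" in parallel:
        command.append(f"--cpu-bind={parallel['cpu_bind']}")
    return command


# Illustrative values for a node with 64 physical cores.
parallel = {
    "parallel_executable": "srun --label",
    "cores_per_node": "64",
    "max_mpi_tasks_per_node": "64",
    "cpu_bind": "cores",
}

command = build_launch_command(parallel, ntasks=128)
print(" ".join(command))
```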
Adding the new machine to discover.py
You will need to amend the list of machine names in discover.py so that mache
can identify the new machine via its hostname. This is typically done with a
regular expression, which works well when the machine’s hostname follows a
standardized format. For example, we can identify known
machines from hostnames with the following regular expressions:
'^chr-\d{4}' # Chrysalis compute nodes with hostnames chr-0000 to chr-9999
'^compy' # Compy nodes with hostnames starting with compy
'^dane\d{1,4}' # Dane nodes with hostnames dane0 to dane9999
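As a rough sketch of how such patterns can be applied, the snippet below checks a hostname against each expression in turn. The function name `machine_from_hostname` and the table layout are illustrative assumptions and do not reproduce the actual structure of discover.py.

```python
import re

# (pattern, machine name) pairs, using the expressions shown above.
HOSTNAME_PATTERNS = [
    (r"^chr-\d{4}", "chrysalis"),
    (r"^compy", "compy"),
    (r"^dane\d{1,4}", "dane"),
]


def machine_from_hostname(hostname):
    """Return the first machine whose pattern matches, or None."""
    for pattern, machine in HOSTNAME_PATTERNS:
        if re.match(pattern, hostname):
            return machine
    return None


for name in ["chr-0123", "compy01", "dane42", "login1"]:
    print(name, machine_from_hostname(name))
```

A hostname that matches no pattern yields `None`, which is the case where a fallback such as an environment variable would be needed.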
In some cases, the hostname assigned to a machine is too generic to
differentiate it from other machines. In these cases, we must identify the
machine by its environment variables. However, this is not the recommended
procedure and should only be done as a last resort. For example, we identify
frontier by its LMOD_SYSTEM_NAME environment variable:
if machine is None and 'LMOD_SYSTEM_NAME' in os.environ:
    hostname = os.environ['LMOD_SYSTEM_NAME']
    if hostname == 'frontier':
        # frontier's hostname is too generic to detect, so relying on
        # LMOD_SYSTEM_NAME
        machine = 'frontier'
Note
Identifying the machine by environment variables is not recommended unless absolutely necessary.