ACE2 Inference Tutorial

This guide walks you through running inference using the pre-trained ACE2-EAMv3 model.

Prerequisites for this guide

uv installed to set up the environment, including PyTorch. See our Python Environment Setup for more details.
Access to the internet to clone repositories

Steps

1. Clone the ACE Repository

Clone the ACE repository from GitHub:

git clone https://github.com/E3SM-Project/ace

2. Clone the Model Repository

Clone the ACE2-EAMv3 model repository from Hugging Face:

git clone https://huggingface.co/allenai/ACE2-EAMv3

git lfs

If you run into issues related to git lfs (for example, the checkpoint file is missing or far smaller than expected), install Git LFS and clone again.
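One way to tell whether Git LFS did its job is to check if the checkpoint is a real file or only an LFS pointer. The sketch below relies on the fact that Git LFS pointer files are small text files beginning with a fixed version line. The function name is illustrative, and the checkpoint file name in the usage note is taken from the configuration shown later in this guide.

```python
from pathlib import Path

# Git LFS pointer files start with this line (per the Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with Path(path).open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

For example, if is_lfs_pointer("ACE2-EAMv3/ace2_EAMv3_ckpt.tar") returns True, the weights were not downloaded; installing Git LFS and running git lfs pull inside the model repository should fetch them.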

3. Run Inference

Navigate to the ace repository directory:

cd ace

Create a configuration file named config-inference.yaml with the following content. Make sure to update the paths to match your environment.

experiment_dir: /pscratch/sd/m/mahf708/ACE2-EAMv3/test1 # (1)!
n_forward_steps: 1458 # (2)!
forward_steps_in_memory: 80 # (3)!
checkpoint_path: /pscratch/sd/m/mahf708/ACE2-EAMv3/ace2_EAMv3_ckpt.tar # (4)!
initial_condition: # (5)!
  path: /pscratch/sd/m/mahf708/ACE2-EAMv3/initial_conditions/1971010100.nc
  start_indices:
    n_initial_conditions: 2
    first: 0
    interval: 1
forcing_loader: # (6)!
  dataset:
    data_path: /pscratch/sd/m/mahf708/ACE2-EAMv3/forcing_data/
  num_data_workers: 2
logging: # (7)!
  log_to_screen: true
  log_to_wandb: false
  log_to_file: true
data_writer: # (8)!
  save_prediction_files: true

(1) Output directory: All inference outputs (predictions, diagnostics, logs) are saved here. Create this directory before running.
(2) Number of forward steps: Total timesteps to run. Each step is 6 hours, so 1458 steps cover 364.5 days, roughly one year. See the "maximum steps" note below for the upper limit.
(3) Steps in memory: Batch size for GPU memory. Lower this if you run into OOM errors. 80 is a good default.
(4) Model checkpoint: Path to the pretrained ACE2-EAMv3 weights (.tar file from Hugging Face).
(5) Initial conditions: Starting atmospheric state. n_initial_conditions runs multiple ensemble members; first and interval control which samples to use from the IC file.
(6) Forcing data: External forcing (SST, solar, GHGs, etc.). The loader reads Zarr/NetCDF files from this path. num_data_workers controls parallel I/O.
(7) Logging options: log_to_screen prints progress; log_to_wandb sends metrics to Weights & Biases (requires login); log_to_file saves to inference_out.log.
(8) Output writer: Set save_prediction_files: true to write NetCDF outputs. Set it to false for validation-only runs. Additionally, you can set names: [T_4, T_5] to request only T_4 and T_5 in the output.
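The relationship between n_forward_steps and simulated time can be checked with a few lines of Python. This is a minimal sketch that assumes only the 6-hour step stated above; the function names are illustrative and not part of the ACE code base.

```python
from datetime import timedelta

STEP = timedelta(hours=6)  # step length stated in this guide


def simulated_duration(n_forward_steps, step=STEP):
    """Total simulated time covered by n_forward_steps."""
    return n_forward_steps * step


def steps_for(duration, step=STEP):
    """Number of whole steps that fit into the given duration."""
    return duration // step


d = simulated_duration(1458)
print(d.days, "days and", d.seconds // 3600, "hours")  # 364 days and 12 hours
print(steps_for(timedelta(days=365)))  # 1460
```

A full 365-day year is 1460 steps, so the 1458 steps in the example fall two steps short of it.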

maximum steps

Note that n_forward_steps is limited by the temporal length of the forcing data (in the example above, one year; see forcing_data) and by the initial-condition settings (in the example above, 2 initial conditions separated by a single time step, starting from index 0). That is why the prediction is 2 steps short of a full year. With a single initial condition, the maximum number of forward steps would be 1459. The general formula is: max steps allowed = length of data - (first + interval * (n_initial_conditions - 1))
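The formula can be written as a small helper to check a configuration before launching a run. This is a sketch, not ACE code: the function name is hypothetical, and the data length of 1459 is not measured from the forcing files but is the value implied by the two cases quoted in the note (1458 steps for two initial conditions, 1459 for one).

```python
def max_forward_steps(data_length, first, interval, n_initial_conditions):
    """Largest n_forward_steps allowed, following the formula in this guide.

    data_length is the "length of data" term of the formula, in time steps.
    """
    if n_initial_conditions < 1:
        raise ValueError("n_initial_conditions must be at least 1")
    # Index of the last initial condition taken from the IC file.
    last_start = first + interval * (n_initial_conditions - 1)
    return data_length - last_start


# Assumed data length, inferred from the guide's own examples.
DATA_LENGTH = 1459

print(max_forward_steps(DATA_LENGTH, first=0, interval=1, n_initial_conditions=2))  # 1458
print(max_forward_steps(DATA_LENGTH, first=0, interval=1, n_initial_conditions=1))  # 1459
```

Increasing first, interval, or n_initial_conditions pushes the last start index later, which shrinks the number of steps every ensemble member can run.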

Run the inference using the following command:

uv run python -m fme.ace.inference config-inference.yaml

compute node

The above command takes about 10 minutes on a single compute node on pm-gpu (4xA100). The command to get a pm-gpu compute node is:

salloc --nodes 1 --qos interactive --time 04:00:00 --constraint gpu --account=e3sm_g

uv cache

Sometimes you will need to set the environment variable UV_CACHE_DIR. For example, on NERSC: export UV_CACHE_DIR="$PSCRATCH/.cache/uv"

4. Results

The results will be saved in the experiment_dir specified in the config file. The output directory structure will look like this:

> ls /pscratch/sd/m/mahf708/ACE2-EAMv3/test1 -1 
annual_diagnostics.nc
autoregressive_predictions.nc
autoregressive_target.nc
config.yaml
inference_out.log
initial_condition.nc
mean_diagnostics.nc
monthly_mean_predictions.nc
monthly_mean_target.nc
restart.nc
time_mean_diagnostics.nc

where the autoregressive_predictions.nc file has the following header:

ncdump -h /pscratch/sd/m/mahf708/ACE2-EAMv3/test1/autoregressive_predictions.nc
netcdf autoregressive_predictions {
dimensions:
        time = UNLIMITED ; // (1458 currently)
        sample = 2 ;
        lat = 180 ;
        lon = 360 ;
variables:
        int64 time(time) ;
                time:units = "microseconds" ;
        int64 init_time(sample) ;
                init_time:units = "microseconds since 1970-01-01 00:00:00" ;
                init_time:calendar = "noleap" ;
        ...
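The time variable is stored as int64 values in microseconds. The sketch below converts between such values and 6-hour steps using only the standard library. It assumes that time holds the elapsed time since each sample's init_time; the header suggests this because the units string has no "since" reference date, but it is not confirmed here. The function names are illustrative.

```python
from datetime import timedelta

STEP = timedelta(hours=6)  # step length stated in this guide
ONE_MICROSECOND = timedelta(microseconds=1)


def lead_time(microseconds):
    """Convert a raw int64 time value (microseconds) to a timedelta."""
    return timedelta(microseconds=int(microseconds))


def microseconds_for_steps(n_steps, step=STEP):
    """Elapsed time after n_steps, expressed in microseconds."""
    return (n_steps * step) // ONE_MICROSECOND


print(lead_time(21_600_000_000))  # 6:00:00
print(lead_time(microseconds_for_steps(4)))  # 1 day, 0:00:00
```

Note that init_time uses the noleap calendar, so converting it to calendar dates needs a calendar-aware library rather than Python's standard datetime, which always follows the Gregorian calendar.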

These are the available variables:

> ncdump -h /pscratch/sd/m/mahf708/ACE2-EAMv3/test1/autoregressive_predictions.nc | grep "float\|int64" | awk '{print $2}' | cut -d'(' -f1
time
init_time
valid_time
lat
lon
TS
net_energy_flux_sfc_into_atmosphere
T_4
V_2
surface_pressure_due_to_dry_air_absolute_tendency
T_3
T_0
T_6
V_4
U_3
U_4
tendency_of_total_water_path_due_to_advection
specific_total_water_3
specific_total_water_4
U_1
V_7
U_5
FLUT
surface_precipitation_rate
surface_upward_longwave_flux
top_of_atmos_upward_shortwave_flux
specific_total_water_0
T_7
U_2
net_energy_flux_into_atmospheric_column
specific_total_water_1
specific_total_water_6
V_0
total_water_path
V_1
V_6
FLDS
FSDS
PS
total_energy_ace2_path
T_1
V_3
surface_pressure_due_to_dry_air
U_6
specific_total_water_5
V_5
T_5
U_0
SHFLX
LHFLX
specific_total_water_7
surface_upward_shortwave_flux
T_2
U_7
specific_total_water_2
total_energy_ace2_path_tendency
total_water_path_budget_residual
implied_tendency_of_total_energy_ace2_path_due_to_advection
net_energy_flux_toa_into_atmosphere
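Many names in this list differ only by a numeric suffix (T_0 through T_7, U_0 through U_7, and so on). The sketch below groups them by base name, which is handy when building a names list for the data writer. Reading the suffix as a vertical level index is an assumption; this guide does not yet document the levels or their ordering. The function name is illustrative.

```python
import re
from collections import defaultdict

# A trailing "_<digits>" is treated as a level index (assumption).
LEVEL_PATTERN = re.compile(r"^(?P<base>.+)_(?P<level>\d+)$")


def group_by_level(names):
    """Split variable names into per-level groups and single-level variables."""
    levels = defaultdict(list)
    single = []
    for name in names:
        match = LEVEL_PATTERN.match(name)
        if match:
            levels[match.group("base")].append(int(match.group("level")))
        else:
            single.append(name)
    return {base: sorted(idx) for base, idx in levels.items()}, single


sample = ["T_4", "T_0", "U_3", "PS", "specific_total_water_7", "total_energy_ace2_path"]
by_level, single = group_by_level(sample)
print(by_level)  # {'T': [0, 4], 'U': [3], 'specific_total_water': [7]}
print(single)    # ['PS', 'total_energy_ace2_path']
```

Names such as total_energy_ace2_path contain a digit but do not end in an underscore followed by digits, so they are kept as single-level variables.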

Remaining tasks

Prepare forcing data for longer time period (e.g., 10 years)
Explain the variables and files produced
Document how to perform a restart run (a restart.nc file is recorded)
Provide more information about the variables and levels
Explore the performance space and produce larger ensembles