ACE2-ERA5 Training Workflow
This guide provides a complete workflow for training the ACE (AI2 Climate Emulator) model using the ACE2-ERA5 dataset.
Overview
ACE is a machine learning model for climate emulation developed by AI2. This workflow will guide you through setting up the environment and running training on a GPU compute node.
Prerequisites
- Access to a compute cluster with GPU nodes (e.g., pm-gpu)
- Storage space for the dataset (ACE2-ERA5)
- Network access to clone repositories
Resources
- Code Repository: E3SM-Project/ace
- Dataset: ACE2-ERA5 on Hugging Face
- Documentation: ACE Training Configuration Guide
Setup Instructions
1. Clone the Code Repository
Start by cloning the ACE repository and checking out the main branch:
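For example, assuming the E3SM-Project/ace repository listed above is hosted on GitHub:

```bash
# Clone the ACE code and switch to the main branch
git clone https://github.com/E3SM-Project/ace.git
cd ace
git checkout main
```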
2. Download the Dataset
Clone the ACE2-ERA5 dataset from Hugging Face:
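For example (the URL below assumes the dataset is published under the allenai organization on Hugging Face; adjust it to match the dataset page you are using):

```bash
# Cloning pulls the large LFS-tracked data files, so this can take a while
git clone https://huggingface.co/datasets/allenai/ACE2-ERA5
```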
Dataset Size
The ACE2-ERA5 dataset is large; make sure you have sufficient storage space before cloning. If you run into Git LFS errors, you may need to install Git LFS first.
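Assuming Git LFS is installed on your system, initializing it and re-fetching the objects usually resolves missing-file issues:

```bash
git lfs install   # enable Git LFS for your user account
cd ACE2-ERA5
git lfs pull      # fetch any LFS-tracked files that were skipped during the clone
```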
3. Install uv Package Manager
Install the uv package manager, which is used to manage Python dependencies:
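The standalone installer from the uv documentation works on most Linux systems:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```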
After installation, you may need to restart your console or source your profile to use uv.
4. Configure uv Cache
Set up the uv cache directory in an accessible location with sufficient storage:
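For example (the exact path is illustrative; see the note below):

```bash
# Store uv's package cache on scratch; add this line to your shell profile to make it persistent
export UV_CACHE_DIR=$PSCRATCH/uv_cache
```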
Cache Location
Adjust $PSCRATCH to match your system's scratch directory path. This ensures the cache is stored in a location with adequate space.
5. Pin Python Version
Pin the Python version to 3.11:
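From the repository root:

```bash
# Writes a .python-version file so uv uses Python 3.11 for this project
uv python pin 3.11
```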
Running Training
1. Request a GPU Compute Node
Request an interactive GPU node on your cluster:
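One way to do this on a Slurm system such as Perlmutter (the queue, time limit, and GPU count below are illustrative; adjust them for your site):

```bash
salloc --nodes 1 --constraint gpu --gpus 4 --qos interactive --time 02:00:00 --account=<your_account>
```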
Account Settings
Adjust the --account parameter to match your allocation account.
2. Prepare Training Configuration
Create a training configuration file named config-train.yaml in the repository root. You can start with a template from the ACE training configuration documentation, or use the sample below:
Sample Configuration (config-train.yaml)
experiment_dir: /path/to/your/ACE2-ERA5/train_output
save_checkpoint: true
validate_using_ema: true
max_epochs: 80
n_forward_steps: 1
inference:
  n_forward_steps: 300  # ~75 days (adjust based on your needs)
  forward_steps_in_memory: 1
  loader:
    start_indices:
      first: 0
      n_initial_conditions: 4
      interval: 300  # adjusted to fit within dataset
    dataset:
      data_path: /path/to/your/ACE2-ERA5/training_validation_data/training_validation
    num_data_workers: 4
logging:
  log_to_screen: true
  log_to_wandb: false
  log_to_file: true
  project: ace
  entity: your_wandb_entity
train_loader:
  batch_size: 4
  num_data_workers: 2
  prefetch_factor: 2
  dataset:
    concat:
      - data_path: /path/to/your/ACE2-ERA5/training_validation_data/training_validation
validation_loader:
  batch_size: 4
  num_data_workers: 2
  prefetch_factor: 2
  dataset:
    data_path: /path/to/your/ACE2-ERA5/training_validation_data/training_validation
    subset:
      step: 5
optimization:
  enable_automatic_mixed_precision: false
  lr: 0.0001
  optimizer_type: AdamW
  # can also set kwargs: fused: true for performance if using GPU
stepper:
  loss:
    type: MSE
  step:
    type: single_module
    config:
      builder:
        type: SphericalFourierNeuralOperatorNet
        config:
          embed_dim: 16
          filter_type: linear
          hard_thresholding_fraction: 1.0
          use_mlp: true
          normalization_layer: instance_norm
          num_layers: 2
          operator_type: dhconv
          scale_factor: 1
          separable: false
      normalization:
        network:
          global_means_path: /path/to/your/ACE2-ERA5/training_validation_data/normalization/centering.nc
          global_stds_path: /path/to/your/ACE2-ERA5/training_validation_data/normalization/scaling-full-field.nc
        loss:
          global_means_path: /path/to/your/ACE2-ERA5/training_validation_data/normalization/centering.nc
          global_stds_path: /path/to/your/ACE2-ERA5/training_validation_data/normalization/scaling-residual.nc
      in_names:
        - land_fraction
        - ocean_fraction
        - sea_ice_fraction
        - DSWRFtoa
        - HGTsfc
        - PRESsfc
        - surface_temperature
        - air_temperature_0  # _0 denotes the top most layer of the atmosphere
        - air_temperature_1
        - air_temperature_2
        - air_temperature_3
        - air_temperature_4
        - air_temperature_5
        - air_temperature_6
        - air_temperature_7
        - specific_total_water_0
        - specific_total_water_1
        - specific_total_water_2
        - specific_total_water_3
        - specific_total_water_4
        - specific_total_water_5
        - specific_total_water_6
        - specific_total_water_7
        - eastward_wind_0
        - eastward_wind_1
        - eastward_wind_2
        - eastward_wind_3
        - eastward_wind_4
        - eastward_wind_5
        - eastward_wind_6
        - eastward_wind_7
        - northward_wind_0
        - northward_wind_1
        - northward_wind_2
        - northward_wind_3
        - northward_wind_4
        - northward_wind_5
        - northward_wind_6
        - northward_wind_7
      out_names:
        - PRESsfc
        - surface_temperature
        - air_temperature_0
        - air_temperature_1
        - air_temperature_2
        - air_temperature_3
        - air_temperature_4
        - air_temperature_5
        - air_temperature_6
        - air_temperature_7
        - specific_total_water_0
        - specific_total_water_1
        - specific_total_water_2
        - specific_total_water_3
        - specific_total_water_4
        - specific_total_water_5
        - specific_total_water_6
        - specific_total_water_7
        - eastward_wind_0
        - eastward_wind_1
        - eastward_wind_2
        - eastward_wind_3
        - eastward_wind_4
        - eastward_wind_5
        - eastward_wind_6
        - eastward_wind_7
        - northward_wind_0
        - northward_wind_1
        - northward_wind_2
        - northward_wind_3
        - northward_wind_4
        - northward_wind_5
        - northward_wind_6
        - northward_wind_7
        - LHTFLsfc
        - SHTFLsfc
        - PRATEsfc
        - ULWRFsfc
        - ULWRFtoa
        - DLWRFsfc
        - DSWRFsfc
        - USWRFsfc
        - USWRFtoa
        - tendency_of_total_water_path_due_to_advection
Important: Make sure to update the following in your config-train.yaml:
- experiment_dir: Set this to a writable directory where training outputs will be saved
- data_path: Point this to your downloaded ACE2-ERA5 dataset location
Fast Iteration
For faster iteration during initial testing, consider:
- Reducing the number of training epochs
- Using a smaller batch size
- Limiting the dataset size
3. Launch Training
From the repository root, launch the training job using torchrun:
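For example, on a node with four GPUs:

```bash
uv run torchrun --nproc_per_node=4 -m fme.ace.train config-train.yaml
```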
This command will:
- Use uv run to manage dependencies automatically
- Launch torchrun with 4 processes (one per GPU)
- Execute the training module with your configuration
Training Parameters
The torchrun command accepts several parameters:
- --nproc_per_node=4: Number of processes per node (typically matches the number of GPUs)
- -m fme.ace.train: The Python module to run
- config-train.yaml: Your training configuration file
Monitoring Training
During training, monitor:
- GPU utilization: Use nvidia-smi to check GPU usage
- Training logs: Check the output directory specified in experiment_dir
- Checkpoints: Models will be saved periodically based on your configuration
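For example, from the compute node (the output path below matches the experiment_dir used in the sample configuration):

```bash
# Refresh GPU utilization every 10 seconds
watch -n 10 nvidia-smi

# Inspect logs and checkpoints written under experiment_dir
ls -lh /path/to/your/ACE2-ERA5/train_output
```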
Troubleshooting
Common Issues
Out of Memory Errors
- Reduce batch_size in config-train.yaml
- Decrease model size parameters (e.g., embed_dim or num_layers)
- Use fewer GPUs with --nproc_per_node
Cache Directory Issues
- Ensure UV_CACHE_DIR has sufficient space
- Check write permissions on the cache directory
Module Import Errors
- Verify Python version is pinned to 3.11
- Ensure you're running from the repository root
- Check that dependencies are properly installed by uv