Troubleshooting Deployment
Even with well-maintained tools, deployment of E3SM-Unified on HPC systems often encounters system-specific or environment-specific problems. This page outlines common categories of issues and how to diagnose and resolve them.
This is an evolving list. Please make PRs to add descriptions of issues you have encountered and solutions you have found.
1. 🛠️ Spack Build Failures
Common Causes
Missing or incompatible system modules (cmake, perl, bison, etc.)
Outdated Spack package definitions in the spack_for_mache_<version> branch on the E3SM fork
Spack build cache pollution
Environment not set correctly for Spack to detect compilers/libraries
Solutions
If Spack is attempting to build common system tools (cmake, tar, etc.), add their system versions to the Spack templates in mache with buildable: false instead, to save time and prevent build problems.
Check with spack find, spack config get compilers, and spack config get modules
Load required modules manually before re-running
Rebuild: spack uninstall -y <package> or delete the full deployment directory
Double-check you are using the correct spack_for_mache_<version> branch
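As an illustration of the buildable: false suggestion above, the following minimal sketch renders an entry in the generic Spack packages.yaml layout (externals, spec, prefix, buildable). The tool versions and prefixes are hypothetical placeholders, and the Spack templates in mache may be organized differently, so treat the output as a shape to adapt rather than a drop-in file.

```python
def external_entry(name: str, version: str, prefix: str) -> str:
    """Render one packages.yaml-style entry that marks a system tool as external."""
    return (
        f"  {name}:\n"
        f"    externals:\n"
        f"    - spec: {name}@{version}\n"
        f"      prefix: {prefix}\n"
        f"    buildable: false\n"
    )


def render_packages(tools) -> str:
    """Render a packages: section from (name, version, prefix) tuples."""
    return "packages:\n" + "".join(external_entry(*tool) for tool in tools)


# Hypothetical versions and prefixes; replace with what the system provides.
print(render_packages([("cmake", "3.27.9", "/usr"), ("tar", "1.34", "/usr")]))
```

With buildable: false set, Spack uses the listed external installation and does not attempt to build that package from source.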
2. 🔢 Activation Script or Module Issues
Symptoms
Scripts not found or symlinks broken
Compute node not detected
Fixes
Inspect Jinja2 templates for logic errors (especially for new systems)
Re-run deployment with --recreate
Validate compute node detection logic ($SLURM_JOB_ID, $COBALT_JOBID, etc.)
For new schedulers (e.g., PBS), extend template logic accordingly
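The detection logic above can be sketched in Python as a lookup over scheduler job-ID variables. This is only an illustration of the idea, not the logic in the actual templates, which generate shell scripts. SLURM_JOB_ID and COBALT_JOBID come from the list above; PBS_JOBID is added here as an assumed extension for PBS systems.

```python
import os

# Environment variables that schedulers set inside a running job.
# The PBS entry is an assumed extension, not taken from the existing templates.
JOB_ID_VARS = {
    "slurm": "SLURM_JOB_ID",
    "cobalt": "COBALT_JOBID",
    "pbs": "PBS_JOBID",
}


def detect_scheduler(env=None):
    """Return the scheduler whose job-ID variable is set and non-empty, else None."""
    env = os.environ if env is None else env
    for scheduler, var in JOB_ID_VARS.items():
        if env.get(var):
            return scheduler
    return None


if detect_scheduler() is None:
    print("No job ID variable found; this looks like a login node.")
```

Running this on both a login node and inside a job is a quick way to confirm which variable the system sets before editing the templates.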
3. 🚫 Conda Environment Problems
Symptoms
Conda fails to resolve dependencies
Environments install but are missing key packages
Fixes
Run with --recreate to force a rebuild
Inspect logs carefully for root cause messages
Use recipes/e3sm-unified/conda_first_failure.py to bisect failing specs
Check for channel mismatches or conflicting dev-label dependencies
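To make the idea of bisecting failing specs concrete, here is a generic sketch of a binary search for the first spec that breaks a solve. It does not describe how conda_first_failure.py is implemented. It assumes an empty spec list always solves and that once a prefix of the list fails, every longer prefix fails too. The solves callable and the package names are hypothetical; in practice the callable might wrap a call such as conda create --dry-run.

```python
from typing import Callable, Optional, Sequence


def first_failing_spec(
    specs: Sequence[str], solves: Callable[[Sequence[str]], bool]
) -> Optional[str]:
    """Return the first spec whose addition makes the solve fail, or None."""
    if not specs or solves(specs):
        return None
    # Invariant: specs[:lo] solves and specs[:hi] fails.
    lo, hi = 0, len(specs)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if solves(specs[:mid]):
            lo = mid
        else:
            hi = mid
    return specs[hi - 1]


# Hypothetical stand-in for a real solver call.
def fake_solver(subset):
    return "badpkg=0.1" not in subset


specs = ["python=3.12", "numpy", "badpkg=0.1", "scipy"]
print(first_failing_spec(specs, fake_solver))
```

The search needs roughly log2(n) solver calls for n specs, which matters when each solve takes minutes.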
4. 💾 Filesystem and Permission Issues
Symptoms
Scripts not executable by collaborators
Environment directories not group-readable
Fixes
Run chmod -R g+rx and chgrp -R <group> as needed
Confirm deployment messages show permission updates succeeded
Use ls -l to inspect group ownership and mode bits
You may need to coordinate with administrators or previous maintainers to set permissions (e.g., if you do not have write permission to contents under the E3SM-Unified base environment)
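For large environments, reading ls -l output by eye is tedious. The sketch below walks a directory tree and lists entries that lack group read permission, or group execute permission on directories. The function name is hypothetical, it only reports and changes nothing, and it does not check group ownership.

```python
import os
import stat


def find_group_unreadable(root):
    """Return sorted paths under root that the group cannot read (or enter, for dirs)."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in paths:
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue  # the mode bits of a symlink itself are not meaningful
            needed = stat.S_IRGRP
            if stat.S_ISDIR(mode):
                needed |= stat.S_IXGRP  # directories also need g+x to be traversed
            if mode & needed != needed:
                problems.append(path)
    return sorted(problems)
```

Calling find_group_unreadable on the deployment directory gives a list that can be handed to whoever owns the files if you cannot fix the permissions yourself.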
5. 🧰 mache Configuration Problems
Symptoms
Unknown machine error during deployment
Spack fails to load environment due to incorrect module list
Fixes
Ensure the correct mache version or branch is being installed
Ensure that the machine has been added to mache both under machine config files and in the logic for machine discovery
Validate updates to config_machines.xml and spack YAML templates
Use utils/update_cime_machine_config.py to compare against upstream E3SM config
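When validating config_machines.xml, a quick first check is whether the machine appears in the file at all. The sketch below assumes the CIME layout in which each machine is a machine element with a MACH attribute; the sample XML and machine names are illustrative only. It does not exercise the machine discovery logic in mache, which may be the actual source of an unknown machine error.

```python
import xml.etree.ElementTree as ET


def machine_names(xml_text):
    """Return the sorted MACH attribute values of all <machine> elements."""
    root = ET.fromstring(xml_text)
    return sorted(m.get("MACH") for m in root.iter("machine") if m.get("MACH"))


# Illustrative sample; a real file has many more elements per machine.
SAMPLE = """
<config_machines version="2.0">
  <machine MACH="chrysalis"><DESC>example</DESC></machine>
  <machine MACH="anvil"><DESC>example</DESC></machine>
</config_machines>
"""

print(machine_names(SAMPLE))
```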
6. 🪖 Spack Caching and Environment Contamination
Symptoms
Builds complete but produce incorrect or stale binaries
Environment behaves inconsistently between deploys
Fixes
Clear Spack caches manually if needed
Always deploy from a clean $TMPDIR and fresh clone if unsure
Delete the entire directory:
rm -rf spack/e3sm_unified_<version>_<machine>_<compiler>_<mpi>
7. ⚠️ Common Fix: Full Clean + Re-run
When in doubt, remove and rebuild everything:
rm -rf <base_path>/spack/e3sm_unified_<version>_<machine>_<compiler>_<mpi>
./deploy_e3sm_unified.py --conda ~/miniforge3 --recreate
This often resolves cases where previous state is interfering with a clean build.
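Because rm -rf is unforgiving, it can help to build the directory name programmatically and confirm it before deleting. The sketch below follows the e3sm_unified_<version>_<machine>_<compiler>_<mpi> naming pattern shown above and defaults to a dry run. The helper names are hypothetical and not part of the deployment scripts.

```python
import shutil
from pathlib import Path


def spack_env_dir(base_path, version, machine, compiler, mpi):
    """Build the Spack environment path following the naming pattern above."""
    name = f"e3sm_unified_{version}_{machine}_{compiler}_{mpi}"
    return Path(base_path) / "spack" / name


def remove_spack_env(base_path, version, machine, compiler, mpi, dry_run=True):
    """Remove the Spack environment directory if it exists; return its path."""
    target = spack_env_dir(base_path, version, machine, compiler, mpi)
    if target.is_dir() and not dry_run:
        shutil.rmtree(target)
    return target
```

Call remove_spack_env with the default dry_run=True first and check the returned path, then repeat with dry_run=False before re-running the deployment with --recreate.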