Production
This guide walks users through the steps necessary to run a production simulation.
Running the Model
To prepare for the long production simulation, edit the run_e3sm script and set readonly run='production'. In addition, you may need to customize some variables in the code block below to configure run options. Below is an example of how you might configure these variables:
# Production simulation
readonly PELAYOUT="L"
readonly WALLTIME="34:00:00"
readonly STOP_OPTION="nyears"
readonly STOP_N="50"
readonly REST_OPTION="nyears"
readonly REST_N="5"
readonly RESUBMIT="9"
readonly DO_SHORT_TERM_ARCHIVING=false

- PELAYOUT: 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Production simulations typically use M or L. The size determines how many nodes will be used; the exact number of nodes differs amongst machines.
- WALLTIME: maximum wall clock time requested for the batch jobs.
- STOP_OPTION and STOP_N: units and length of each segment (i.e., each batch job). E.g., the current configuration stops after 50 years.
- REST_OPTION and REST_N: units and frequency for writing restart files (make sure STOP_N is a multiple of REST_N, otherwise the model will stop without writing a restart file at the end). E.g., the current configuration saves restart files every 5 years; 10 restart files will be saved, since STOP_N=50.
- RESUBMIT: number of resubmissions beyond the original segment. This simulation would run for a total of 500 years (= initial 50 + 9x50).
- DO_SHORT_TERM_ARCHIVING: leave set to false if you want to manually run the short term archive.
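As an optional sanity check, the segment/restart arithmetic can be reproduced directly in the shell. This is only a sketch; the values below are copied from the example above, not read from the script:

# Optional sanity check of the example settings above
STOP_N=50; REST_N=5; RESUBMIT=9
echo "Total simulated years: $(( STOP_N * (RESUBMIT + 1) ))"   # 500
echo "Restart files per segment: $(( STOP_N / REST_N ))"       # 10
# STOP_N should be a multiple of REST_N so the final restart is actually written
(( STOP_N % REST_N == 0 )) || echo "Warning: STOP_N is not a multiple of REST_N"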
Since the code has already been fetched and compiled for the short tests, the toggle flags can be set to:
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true
Finally, execute the script:
cd <run_scripts_dir>
./run.<case_name>.sh
The script will automatically submit the first job. New jobs will automatically be resubmitted at the end of each segment until the total number of segments has been run.
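If you want to check how many resubmissions are still pending, one option (assuming standard CIME behavior, in which RESUBMIT counts down as segments are resubmitted) is to query the case:

cd <simulations_dir>/<case_name>/case_scripts
./xmlquery RESUBMIT   # resubmissions remaining beyond the currently queued segment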
Looking at Results
ls <simulations_dir>/<case_name>
Explanation of directories:
- build: all the stuff to compile. The executable (e3sm.exe) is also there.
- case_scripts: the files for your particular simulation.
- run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
  - atm: atmosphere
  - cpl: coupler
  - ice: sea ice
  - lnd: land
  - ocn: ocean
  - rof: river runoff
Run tail -f run/<component>.log.<latest log file> to keep up with a log in real time.
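For example, one way to follow the most recently written atmosphere log (a sketch; atm is just one of the component prefixes listed above):

cd <simulations_dir>/<case_name>/run
# Pick the most recently modified atm log and follow it as the model writes to it
tail -f "$(ls -t atm.log.* | head -n 1)"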
You can use the sq alias defined in Useful Aliases to check on the status of the job. The NODE field in the output indicates the number of nodes used and depends on the processor_config / PELAYOUT size.
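If the sq alias is not defined in your environment, a plain scheduler query works too (assuming the machine uses SLURM, as Compy and Chrysalis do):

squeue -u $USER   # lists your queued and running jobs, including node counts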
Note
When running on two different machines (such as Compy and Chrysalis) and/or with two different compilers, the answers will not be the same, bit-for-bit. It is not possible, using floating point operations, to get bit-for-bit identical results across machines/compilers.
Compression of the logs into .gz files is one of the last steps before the job is done and indicates successful completion of the segment. less <log>.gz will let you look at a gzipped log directly.
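To search the compressed logs without unpacking them, zgrep can be used (a sketch; the search string is only an example of something you might look for):

cd <simulations_dir>/<case_name>/run
zgrep -i "error" e3sm.log.*.gz   # case-insensitive search across the gzipped top-level logs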
Re-Submitting a Job After a Crash
If a job crashes, you can rerun with:
cd <simulations_dir>/<case_name>/case_scripts
# Make any changes necessary to avoid the crash
./case.submit
If you need to change an XML value, the following commands in the case_scripts directory are useful:
> ./xmlquery <variable> # Get value of a variable
> ./xmlchange -id <variable> -val <value> # Set value of a variable
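For example, a hedged illustration of temporarily shortening the next segment while debugging (STOP_N is one of the case's XML variables; the value 5 is arbitrary):

> ./xmlquery STOP_N              # check the current segment length
> ./xmlchange -id STOP_N -val 5  # run a shorter segment on the next submission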
Before re-submitting:
- Check that the rpointer files all point to the last restart. On very rare occasions, there might be some inconsistency if the model crashed at the end. Run head -n 1 rpointer.* to see the restart date.
- gzip all the *.log files from the faulty segment so that they get moved during the next short-term archiving. To gzip log files from failed jobs, run gzip *.log.<job ID>* (where <job ID> has no periods/dots in it).
- Delete core or error files, if there are any. MPAS components will sometimes produce a large number of them. The following commands are useful for checking for these files:
ls | grep -in core
ls | grep -in err
- If you are re-submitting the initial job, you will need to run ./xmlchange -id CONTINUE_RUN -val TRUE.
One way to run through these checks in sequence is sketched below.
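The following is a minimal sketch bundling the checks above into one pass. It assumes the rpointer and log files live in the run directory and that <job ID> is the numeric ID of the failed segment; adjust paths for your setup:

cd <simulations_dir>/<case_name>/run
head -n 1 rpointer.*        # confirm all components point to the same restart date
gzip *.log.<job ID>*        # compress logs from the failed segment
ls | grep -in core          # check for core files left by the crash
ls | grep -in err           # check for error files (MPAS components can produce many)
cd ../case_scripts
./xmlchange -id CONTINUE_RUN -val TRUE   # only if re-submitting the initial job
./case.submit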
Performance Information
Model throughput is the number of simulated years per wall-clock day (SYPD). You can find this with:
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm*
PACE provides detailed performance information. Go to PACE and enter your username to search for your jobs. You can also simply search by providing the JobID appended to log files (NNNNN.yymmdd-hhmmss, where NNNNN is the SLURM job id). Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm, ocn, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.