Usage

If running on Cori, it is preferable to run from $CSCRATCH rather than /global/homes. Running from the latter may result in a ‘Resource temporarily unavailable’ error.

Note

If you have not logged into HSI before, you will have to do so before running zstash with HPSS. On NERSC machines, just run hsi on the command line and enter your credentials.

Warning

When specifying files, wildcards should be enclosed in double quotes (e.g., "a*").

Warning

Specifying a high number for --workers slows the download of each tar file, since all workers share your limited bandwidth. Use discretion.

Create

To create a new zstash archive:

$ zstash create --hpss=<path to HPSS> <local path>

where

  • --hpss=<path to HPSS> specifies the destination path on the HPSS file system where the archive files will be stored. This directory should be unique for each zstash archive. If --hpss=none, then files will be archived locally instead of being transferred to HPSS. The none option should be used when running zstash on a machine without HPSS.

  • <local path> specifies the path to the local directory that should be archived.

Additional optional arguments:

  • --cache to use a cache directory other than the default of zstash. If --hpss=none, then the cache directory itself will be the archive.

  • --exclude a comma-separated list of file patterns to exclude.

  • --keep to keep a copy of the tar files on the local file system after they have been transferred to HPSS. Normally, they are deleted after successful transfer.

  • --maxsize MAXSIZE specifies the maximum size (in GB) for tar files. The default is 256 GB. Zstash will create tar files that are smaller than MAXSIZE except when individual input files exceed MAXSIZE (as individual files are never split up between different tar files).

  • -v increases output verbosity.
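The --maxsize rule described above (files are never split, so an oversized file still ends up whole in a tar file of its own) can be illustrated with a simple greedy grouping. This is a hypothetical sketch, not zstash's actual implementation: the function name group_into_tars, the reading of GB as 1024**3 bytes, and the choice to ignore tar header overhead are all assumptions made for illustration.

```python
# Hypothetical sketch of size-limited grouping; NOT zstash's real code.
GIB = 1024 ** 3  # assumption: "GB" for --maxsize means 1024**3 bytes


def group_into_tars(files, maxsize_bytes):
    """Assign (name, size) pairs to consecutive groups, one group per tar file.

    A new group starts whenever adding the next file would push the
    current group past maxsize_bytes. A file larger than maxsize_bytes
    is still stored whole, which makes its group exceed the limit.
    """
    groups = []
    current, current_size = [], 0
    for name, size in files:
        if current and current_size + size > maxsize_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Made-up file names and sizes, purely for demonstration.
files = [
    ("a.nc", 100 * GIB),
    ("b.nc", 100 * GIB),
    ("c.nc", 100 * GIB),
    ("huge.nc", 300 * GIB),
    ("d.nc", 1 * GIB),
]
groups = group_into_tars(files, 256 * GIB)
for index, group in enumerate(groups):
    print(index, group)
```

In this sketch huge.nc exceeds the limit on its own, so it occupies a group by itself, while every other group stays at or below the limit.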

Local tar files as well as the sqlite3 index database (index.db) will be stored under <local path>/zstash.

After you run zstash create, it’s highly recommended that you run zstash check, detailed in the section below.

Basic example

To archive output from an E3SM simulation located under $CSCRATCH/ACME_simulations/20170731.F20TR.ne30_ne30.edison:

$ cd $CSCRATCH/ACME_simulations/20170731.F20TR.ne30_ne30.edison
$ zstash create --hpss=test/E3SM_simulations/20170731.F20TR.ne30_ne30.edison .

Once done, you should see the archive files in HPSS, which you can browse with hsi:

$ hsi
> cd test/E3SM_simulations/20170731.F20TR.ne30_ne30.edison
> ls
000000.tar   index.db

The data from this test simulation is small, so in this case there is only a single tar file (000000.tar) and the index database (index.db).

Examples excluding some files

You may decide that certain files do not need to be archived. For example, if you want to exclude *.o and *.mod files under the build subdirectory:

$ cd $CSCRATCH/ACME_simulations/20170731.F20TR.ne30_ne30.edison
$ zstash create --hpss=test/ACME_simulations/20170731.F20TR.ne30_ne30.edison \
  --exclude="build/*/*.o","build/*/*.mod" .

Or you may decide that you only want to archive restart files every 5 years to conserve storage space:

$ cd $CSCRATCH/ACME_simulations/20170731.F20TR.ne30_ne30.edison
$ zstash create --hpss=test/ACME_simulations/20170731.F20TR.ne30_ne30.edison \
  --exclude="archive/rest/???[!05]-*/" .

This exclude pattern will skip all restart subdirectories under the short-term archive, except for those with years ending in ‘0’ or ‘5’.
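To see how the character class [!05] selects years, the directory-name part of the pattern can be tried with Python's fnmatch module. This is only a sketch: it assumes zstash applies shell-style matching equivalent to fnmatch, and the restart directory names below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Directory-name part of the exclude pattern from the example above.
pattern = "???[!05]-*"

# Hypothetical restart subdirectory names (year-month-day-seconds).
restart_dirs = [
    "0001-01-01-00000",
    "0005-01-01-00000",
    "0010-01-01-00000",
    "0012-01-01-00000",
]

# A directory is excluded when the fourth digit of the year is NOT 0 or 5.
excluded = [d for d in restart_dirs if fnmatchcase(d, pattern)]
kept = [d for d in restart_dirs if not fnmatchcase(d, pattern)]

print("excluded:", excluded)
print("kept:    ", kept)
```

Years 0001 and 0012 match the pattern and would be skipped, while years 0005 and 0010 do not match and would be archived.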

Check

Note: Most of the options for this command are shared with zstash extract and zstash ls.

To verify that your files were uploaded to HPSS successfully, go to a new, empty directory and run:

$ zstash check --hpss=<path to HPSS> [--workers=<num of processes>] [--cache=<cache>] [--keep] [-v] [files]

where

  • --hpss=<path to HPSS> specifies the path of the archive on the HPSS file system.

  • --workers=<num of processes> an optional argument which specifies the number of processes to use, resulting in checking being done in parallel. Using a high number will result in slow downloads for each of the tars since your bandwidth is limited. User discretion is advised.

  • --cache to use a cache other than the default of zstash.

  • --keep to keep a copy of the tar files on the local file system after they have been extracted from the archive. Normally, they are deleted after successful transfer.

  • -v increases output verbosity.

  • [files] is a list of files to check (standard wildcards supported).

    • Leave empty to check all the files.

    • List of files with support for wildcards. Please note that any expression containing wildcards should be enclosed in double quotes (“…”) to avoid shell substitution.

    • Names of specific tar archives to check all files within these tar archives.

zstash check will download the tar archives to the local disk cache (under the zstash/ subdirectory) and verify the md5 checksum against the checksum stored in the index database (index.db).
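The checksum comparison itself is easy to reproduce by hand. The sketch below shows the general technique with Python's hashlib; the helper names md5_of_file and is_intact are hypothetical, and the temporary file merely stands in for an archived file. It is not taken from zstash's source.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Compare a freshly computed checksum with one recorded earlier."""
    return md5_of_file(path) == expected_md5


# Throwaway file standing in for an archived file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"This is a new file...\n")
    demo_path = tmp.name

recorded = md5_of_file(demo_path)            # checksum stored at archive time
ok_before = is_intact(demo_path, recorded)   # unchanged file

with open(demo_path, "ab") as handle:        # simulate corruption
    handle.write(b"x")
ok_after = is_intact(demo_path, recorded)    # modified file

os.remove(demo_path)
print(ok_before, ok_after)
```

Reading in fixed-size chunks matters here because archived files can be far larger than available memory.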

After the check is complete, zstash lists all corrupted files in the HPSS archive, along with the tar archive each one belongs to. Below is an example:

INFO: Opening tar archive zstash/000000.tar
INFO: Checking archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-01.nc
DEBUG: Valid md5: cfb388d9c4ffe3bf45985fa470855801 archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-01.nc
INFO: Checking archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-02.nc
DEBUG: Valid md5: ce9bb79fb60fdef2ca4c2c29afc54776 archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-02.nc
...
ERROR: Encountered an error for files:
ERROR: archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0214-06.nc in 00000a.tar
ERROR: archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0214-07.nc in 00000a.tar
ERROR: archive/atm/hist/20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0214-08.nc in 00000a.tar
...
ERROR: archive/ocn/hist/mpaso.hist.am.timeSeriesStatsMonthly.0085-08-01.nc in 000029.tar
ERROR: archive/ocn/hist/mpaso.hist.am.timeSeriesStatsMonthly.0085-09-01.nc in 000029.tar
ERROR: The following tar archives had errors:
ERROR: 00000a.tar
ERROR: 000029.tar

If you encounter an error, save your original data. You may need to reupload it via zstash create. Please contact the zstash development team; we’re working on identifying what causes these issues.

Update

An existing zstash archive can be updated to add new or modified files:

$ cd <mydir>
$ zstash update --hpss=<path to HPSS> [--cache=<cache>] [--dry-run] [--exclude] [--keep] [-v]

where

  • --hpss=<path to HPSS> specifies the destination path on the HPSS file system.

  • --cache to use a cache other than the default of zstash.

  • --dry-run an optional argument to perform a dry run, which only lists the files that would be updated in the archive.

  • --exclude an optional argument giving a comma-separated list of file patterns to exclude.

  • --keep to keep a copy of the tar files on the local file system after they have been transferred to HPSS. Normally, they are deleted after successful transfer.

  • -v increases output verbosity.

Example

Following the ‘zstash create’ example above, we now run zstash again with the ‘update’ functionality:

$ cd $CSCRATCH/ACME_simulations/20170731.F20TR.ne30_ne30.edison
$ zstash update --hpss=test/ACME_simulations/20170731.F20TR.ne30_ne30.edison

Since nothing has changed, zstash simply returns:

INFO: Nothing to update

Now, let’s add a new file:

$ mkdir new
$ echo "This is a new file..." > new/file.txt

and rerun zstash update:

$ zstash update --hpss=test/ACME_simulations/20170731.F20TR.ne30_ne30.edison

Zstash recognizes the presence of a new file and adds it to the archive:

INFO: Gathering list of files to archive
INFO: Creating new tar archive 000001.tar
INFO: Archiving new/file.txt
DEBUG: Closing tar archive 000001.tar
INFO: Transferring file to HPSS: zstash/000001.tar
INFO: Transferring file to HPSS: zstash/index.db

Note that the new file is added into a new archive tar file (000001.tar) even though the first archive tar file (000000.tar) is smaller than the target size and therefore could potentially hold more data. This is a design choice that was made out of caution to avoid the risk of damaging an existing tar file by appending to it.
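The append-only behavior implies that each update only needs to pick the next unused tar file name. Judging by names such as 000001.tar here and 00000a.tar in the check output, the names appear to be six-digit hexadecimal counters. The sketch below relies on that reading; the function next_tar_name is hypothetical and is not zstash's actual code.

```python
def next_tar_name(existing):
    """Return the name that follows the highest-numbered existing tar file.

    Assumes names are six hexadecimal digits followed by ".tar".
    """
    if not existing:
        return "000000.tar"
    highest = max(int(name[: -len(".tar")], 16) for name in existing)
    return f"{highest + 1:06x}.tar"


print(next_tar_name([]))
print(next_tar_name(["000000.tar"]))
print(next_tar_name(["000008.tar", "000009.tar"]))
```

Existing tar files are never reopened in this scheme, which mirrors the cautious design choice described above.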

Extract

Note: Most of the options for this command are shared with zstash check and zstash ls.

To extract files from an existing zstash archive into the current directory <mydir>:

$ cd <mydir>
$ zstash extract --hpss=<path to HPSS> [--workers=<num of processes>] [--cache=<cache>] [--keep] [-v] [files]

where

  • --hpss=<path to HPSS> specifies the path of the archive on the HPSS file system. Note that if --hpss=none, then --keep is automatically set to True.

  • --workers=<num of processes> an optional argument which specifies the number of processes to use, resulting in extracting being done in parallel. Using a high number will result in slow downloads for each of the tars since your bandwidth is limited. User discretion is advised.

  • --cache to use a cache other than the default of zstash.

  • --keep to keep a copy of the tar files on the local file system after they have been extracted from the archive. Normally, they are deleted after successful transfer.

  • -v increases output verbosity.

  • [files] is a list of files to be extracted (standard wildcards supported).

    • Leave empty to extract all the files.

    • List of files with support for wildcards. Please note that any expression containing wildcards should be enclosed in double quotes (“…”) to avoid shell substitution.

    • Names of specific tar archives to extract all files within these tar archives.

You must pass in the path relative to the top level for the file(s). For help finding path names, you can use zstash ls as documented below.

A few words about performance. All of the files are grouped into 256 GB tar archives by default. (See the --maxsize argument for zstash create for more information.) If the tar file is not already present in the local disk cache (under the zstash/ subdirectory), it must first be downloaded from HPSS before the desired file can be extracted.

  • Downloading a 256 GB file on Cori/Edison takes about 30 minutes (or more, depending on load).

  • Using NERSC data transfer nodes (DTN) may be about 3x faster, according to some users.

  • Again, to see which of your files are in what tar archives, use zstash ls -l.

    • Note the -l argument.

    • The sixth column is the tar archive that the file is in.

    • Please see the List documentation below for more information.
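For planning purposes, the figures above translate into a rough throughput estimate. The arithmetic below is a back-of-the-envelope sketch: it assumes GB means 1024**3 bytes (which matches the Max size value of 274877906944 that zstash extract logs), and it takes the 30 minute and 3x figures at face value.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

maxsize_bytes = 256 * GIB          # default tar archive size
download_seconds = 30 * 60         # "about 30 minutes" on Cori/Edison

# Implied sustained download rate for one full-size tar archive.
throughput_mib_s = maxsize_bytes / download_seconds / MIB

# If data transfer nodes really are about 3x faster, as some users report.
dtn_minutes = 30 / 3

print(f"default maxsize: {maxsize_bytes} bytes")
print(f"implied throughput: {throughput_mib_s:.1f} MiB/s")
print(f"estimated DTN download time: {dtn_minutes:.0f} minutes")
```

Actual rates depend on system load, so these numbers are only an order-of-magnitude guide.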

Examples

Extracting a single file by its full path, archive/logs/atm.log.8229335.180130-143234.gz:

$ zstash extract --hpss=/home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison archive/logs/atm.log.8229335.180130-143234.gz
DEBUG: Opening index database
DEBUG: Running zstash extract
DEBUG: Local path : /global/cscratch1/sd/golaz/ACME_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison
DEBUG: HPSS path  : /home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison
DEBUG: Max size  : 274877906944
DEBUG: Keep local tar files  : False
INFO: Opening tar archive zstash/000018.tar
INFO: Extracting archive/logs/atm.log.8229335.180130-143234.gz
DEBUG: Valid md5: e8161bba53500848dc917258d1d8f56a archive/logs/atm.log.8229335.180130-143234.gz
DEBUG: Closing tar archive zstash/000018.tar
DEBUG: Closing index database

If the index database is already in the local disk cache (zstash/index.db), you can leave out the --hpss path. For example:

$ zstash extract archive/logs/atm.log.8229335.180130-143234.gz

However, recall that wildcards are supported, so the full path isn’t needed when using them. Instead, you could download files matching "*atm.log.8229335.180130-143234.gz*". Note the use of double quotes (") to avoid shell-level substitution.

$ zstash extract --hpss=/home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison "*atm.log.8229335.180130-143234.gz*"
DEBUG: Opening index database
DEBUG: Running zstash extract
DEBUG: Local path : /global/cscratch1/sd/golaz/ACME_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison
DEBUG: HPSS path  : /home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison
DEBUG: Max size  : 274877906944
DEBUG: Keep local tar files  : False
INFO: Opening tar archive zstash/000018.tar
INFO: Extracting archive/logs/atm.log.8229335.180130-143234.gz
DEBUG: Valid md5: e8161bba53500848dc917258d1d8f56a archive/logs/atm.log.8229335.180130-143234.gz
DEBUG: Closing tar archive zstash/000018.tar
INFO: Opening tar archive zstash/000047.tar
INFO: Extracting case_scripts/logs/atm.log.8229335.180130-143234.gz
DEBUG: Valid md5: e8161bba53500848dc917258d1d8f56a case_scripts/logs/atm.log.8229335.180130-143234.gz
DEBUG: Closing tar archive zstash/000047.tar
DEBUG: Closing index database

In this particular example, the pattern matches two specific files, one under archive/logs/ and another one under case_scripts/logs/. If you didn’t intend to retrieve both of them, a more efficient approach would have been to first identify the desired files with ‘zstash ls’.

Another example of wildcards would be to retrieve all cam.h0 files (monthly atmosphere output) between years 0030 and 0069 for the DECKv1 piControl simulation. The zstash command would be:

$ zstash extract --hpss=/home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison \
         "*.cam.h0.00[3-6]?-??.nc"

You may specify the cache with the --cache option. Notice that there is no need to include --keep when not using HPSS.

$ zstash extract --hpss=none \
--cache=/p/user_pub/e3sm/archive/1_1/BGC-v1/20181217.BCRC_CNPCTC20TR_OIBGC.ne30_oECv3.edison \
"*cam.h3.1906-01-*-*.nc"

List

Note: Most of the options for this command are shared with zstash extract and zstash check.

You can view the files in an existing zstash archive:

$ zstash ls --hpss=<path to HPSS> [-l] [--cache=<cache>] [-v] [files]

where

  • --hpss=<path to HPSS> specifies the path of the archive on the HPSS file system.

  • -l an optional argument to display more information.

  • --cache to use a cache other than the default of zstash.

  • -v increases output verbosity.

  • [files] is a list of files to be listed (standard wildcards supported).

    • Leave empty to list all the files.

    • List of files with support for wildcards. Please note that any expression containing wildcards should be enclosed in double quotes (“…”) to avoid shell substitution.

    • Names of specific tar archives to list all files within these tar archives.

Below is an example. Note the names of the columns:

$ zstash ls -l --hpss=/home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison "*atm.log.8229335.180130-143234.gz*"
DEBUG: Opening index database
DEBUG: Running zstash ls
DEBUG: HPSS path  : /home/g/golaz/2018/E3SM_simulations/20180129.DECKv1b_piControl.ne30_oEC.edison
id   name    size    mtime   md5     tar     offset
30482        archive/logs/atm.log.8229335.180130-143234.gz   20156521        2018-02-01 10:02:35     e8161bba53500848dc917258d1d8f56a        000018.tar      131697281536
51608        case_scripts/logs/atm.log.8229335.180130-143234.gz      20156521        2018-02-01 10:02:52     e8161bba53500848dc917258d1d8f56a        000047.tar      202381473280
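Because the index is a sqlite3 database, the same information can in principle be queried directly. The sketch below builds an in-memory database populated with the two rows shown above and finds which tar archives a pattern would touch. The table name files and the column types are assumptions made for illustration; inspect your own index.db before relying on them.

```python
import sqlite3

# In-memory stand-in for index.db; table name and types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, size INTEGER, '
    'mtime TEXT, md5 TEXT, tar TEXT, "offset" INTEGER)'
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (30482, "archive/logs/atm.log.8229335.180130-143234.gz", 20156521,
         "2018-02-01 10:02:35", "e8161bba53500848dc917258d1d8f56a",
         "000018.tar", 131697281536),
        (51608, "case_scripts/logs/atm.log.8229335.180130-143234.gz", 20156521,
         "2018-02-01 10:02:52", "e8161bba53500848dc917258d1d8f56a",
         "000047.tar", 202381473280),
    ],
)

# SQLite's GLOB operator uses shell-style wildcards and is case-sensitive.
pattern = "*atm.log.8229335.180130-143234.gz*"
rows = conn.execute(
    "SELECT name, tar FROM files WHERE name GLOB ? ORDER BY id", (pattern,)
).fetchall()

# Distinct tar archives that would need to be downloaded for this pattern.
tars_needed = sorted({tar for _, tar in rows})

for name, tar in rows:
    print(tar, name)
print("tar archives needed:", tars_needed)
conn.close()
```

Checking the list of tar archives before extracting helps avoid downloading a large tar file for a match you did not intend.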

Version

Starting with version 0.3, you can check the version of zstash from the command line:

$ zstash version
v0.3.0