Tutorial
Warning
This tutorial is specifically for zstash v0.4.2.
Some features have been added/changed since.
Paths for E3SM Unified have also changed.
Some statements on this tutorial have been pulled from other pages of the documentation. This is done to provide a comprehensive tutorial on a single page. There is also an accompanying video tutorial.
What is Zstash?
Zstash is an HPSS long-term archiving solution for E3SM.
Zstash is written entirely in Python using standard libraries. Its design is intentionally minimalistic to provide an effective long-term HPSS archiving solution without creating an overly complicated (and hard to maintain) tool. For more information, see Zstash documentation, https://e3sm.org/resources/tools/data-management/zstash-hpss-archive/, or https://e3sm.org/new-zstash-capabilities.
Terminology
zstash archive: A set of files {index.db, nnnnnn.tar where nnnnnn=000000, 000001, …}
HPSS: A long-term tape storage system.
HPSS archive: The zstash archive on HPSS.
Local archive (or cache): The zstash archive on our local file system. The default name is zstash.
Source directory: the directory whose contents we are archiving.
Commands
Create: create a new cache (local archive), create a tar file of the source directory’s contents, and if using HPSS, then store it on the HPSS archive.
Check: verify integrity of zstash archive (e.g., if using HPSS, check that files were uploaded on HPSS successfully.)
Update: add new or modified files to an existing zstash archive (ignoring unmodified files).
Extract: extract files from an existing zstash archive into current directory.
List (ls): view files in an existing zstash archive.
Version: show version number.
Options
--cache, --hpss, and --keep are covered in this tutorial.
A “x” indicates that the zstash command supports the given option.
Option |
Create |
Check |
Update |
Extract |
List |
|---|---|---|---|---|---|
|
x |
x |
x |
x |
x |
|
x |
||||
|
x |
x |
|||
|
x |
x |
x |
x |
x |
|
x |
x |
x |
x |
|
|
x |
||||
|
x |
||||
|
x |
x |
x |
x |
x |
|
x |
x |
Installation
Below is a basic guide to installation. For more information, see Getting started.
E3SM Unified Environment
Zstash is available in the E3SM unified environment. This is the recommended method of installation.
On Cori/NERSC:
$ source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh
On Compy:
$ source /share/apps/E3SM/conda_envs/load_latest_e3sm_unified.sh
For more information, see Getting started.
Installation with Conda
On Cori/NERSC:
$ module load python/3.7-anaconda-2019.10
$ source /global/common/cori_cle7/software/python/3.7-anaconda-2019.10/etc/profile.d/conda.sh
$ conda create -n zstash_env -c e3sm -c conda-forge zstash
$ conda activate zstash_env
On Compy:
$ module load anaconda3/2019.03
$ source /share/apps/anaconda3/2019.03/etc/profile.d/conda.sh
$ conda create -n zstash_env -c e3sm -c conda-forge zstash
$ conda activate zstash_env
On either machine, to install in an existing environment:
$ conda install zstash -c e3sm -c conda-forge
HPSS
NERSC machines have HPSS access. We can choose to use HPSS by setting
--hpss (or --hpss=none if we do not wish to archive to HPSS).
Note
Before using zstash with HPSS for the first time, run hsi on NERSC
and enter your credentials. Then, zstash will be able to access HPSS.
Compy and Anvil do not have HPSS access. Therefore, you’ll need to use a Globus URL
that points to a remote HPSS storage --hpss=globus://<Globus endpoint UUID/<path>
or set --hpss=none.
Examples
We will use NERSC so that we have access to HPSS. The commands remain the same on Compy,
with the exception that --hpss will have to be set to none in all cases.
Open a NERSC terminal in two windows.
Setup
We’ll create a tutorial directory in $CSCRATCH on NERSC. We can create our tutorial
directory anywhere but CSCRATCH is a useful workspace.
$ cd $CSCRATCH
$ mkdir tutorial
$ cd tutorial
Each example that follows is independent of the others. Therefore, we’ll need to set up the
directory structure before each example. We’ll call the following script setup.sh:
mkdir zstash_demo
mkdir zstash_demo/empty_dir
mkdir zstash_demo/dir
echo 'file0 stuff' > zstash_demo/file0.txt
echo '' > zstash_demo/file_empty.txt
echo 'file1 stuff' > zstash_demo/dir/file1.txt
In one NERSC window, create this script and run it (./setup.sh).
That will create the following directory structure:
zstash_demo/
dir/
file1.txt (contains ‘file1 stuff’)
empty_dir/
file0.txt (contains ‘file0 stuff’)
file_empty.txt (empty)
In some examples, we’ll also want to add files after running some zstash commands.
We’ll call the following script add_files.sh:
mkdir zstash_demo/dir2
echo 'file2 stuff' > zstash_demo/dir2/file2.txt
echo 'file1 stuff with changes' > zstash_demo/dir/file1.txt
If this is run (./add_files.sh) immediately after running setup.sh, then we would have the
following directory structure:
zstash_demo/
dir/
file1.txt (contains ‘file1 stuff with changes’)
dir2/
file2.txt (contains ‘file2 stuff’)
empty_dir/
file0.txt (contains ‘file0 stuff’)
file_empty.txt (empty)
In our second window, we’ll log into HPSS with hsi.
Note
If this is your first time using HPSS, you’ll have to enter your credentials.
If you haven’t used HPSS before, ls should print nothing.
After every example, we’ll want to remove the directories we created both in our workspace
($CSCRATCH/tutorial) and on HPSS.
Simple Case
This simple case will illustrate create, update, and extract.
Set up the directory structure:
$ ./setup.sh
Create:
Now, we will create the zstash archives – both locally and on HPSS.
Note that the path passed to --hpss can be a relative path from /home/f/<username> or
an absolute path on HPSS, not the local file system.
zstash_demo is the source directory whose contents we want to archive.
This is just a directory path; it can be a relative path and can contain .. or
it can be an absolute path.
$ zstash create --hpss=zstash_archive zstash_demo
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000000.tar
INFO: Archiving file0.txt
INFO: Archiving file_empty.txt
INFO: Archiving dir/file1.txt
INFO: Archiving empty_dir
INFO: Transferring file to HPSS: zstash/000000.tar
INFO: Transferring file to HPSS: zstash/index.db
The “cache” is our local archive. HPSS is our long-term archive.
Our cache is given the default name zstash and can be found in the source directory,
zstash_demo. We’ve also created an archive in HPSS named zstash_archive.
The cache zstash_demo/zstash now contains index.db,
which is the sqlite database used by zstash.
In our HPSS window, ls zstash_archive will show 000000.tar in addition to index.db.
The former is the tar file created from the files and directories in zstash_demo.
Note that the maximum size of tar files will default to 256 GB, unless --maxsize is passed in
with a different value to zstash create.
Update:
Now, let’s add some files to our source directory, zstash_demo.
We want to update our HPSS archive with these
changes, so we will enter the directory zstash_demo and tell zstash to update it.
Note that we have to be inside the source directory to run zstash update.
We also have to specify --hpss so that zstash will know which HPSS archive to update.
$ ./add_files.sh
$ cd zstash_demo
$ zstash update --hpss=zstash_archive
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000001.tar
INFO: Archiving dir/file1.txt
INFO: Archiving dir2/file2.txt
INFO: Transferring file to HPSS: zstash/000001.tar
INFO: Transferring file to HPSS: zstash/index.db
In our HPSS window, notice that ls zstash_archive now includes a new tar file 000001.tar.
zstash update produces a new tar file for each update.
Nothing will have changed in zstash_demo/zstash, however.
Extract:
Now, we will extract the files from the HPSS archive. We will extract into a new directory
because zstash will not extract files that are already present. So, we will exit the
zstash_demo directory and create a new directory. Extraction will always occur
in the current directory.
$ cd ..
$ mkdir zstash_extraction
$ cd zstash_extraction
$ zstash extract --hpss=zstash_archive
This will output the following:
INFO: Transferring file from HPSS: zstash/index.db
INFO: Transferring file from HPSS: zstash/000000.tar
INFO: Opening tar archive zstash/000000.tar
INFO: Extracting file0.txt
INFO: Extracting file_empty.txt
INFO: Extracting empty_dir
INFO: Transferring file from HPSS: zstash/000001.tar
INFO: Opening tar archive zstash/000001.tar
INFO: Extracting dir/file1.txt
INFO: Extracting dir2/file2.txt
INFO: No failures detected when extracting the files.
The contents of zstash_extraction will be identical to those of zstash_demo.
Cleanup:
Now, we will clean up for the next example.
In our NERSC window:
$ cd ..
$ rm -r zstash_demo/ zstash_extraction/
In our HPSS window (rm -r doesn’t work on HPSS):
$ rm zstash_archive/000000.tar zstash_archive/000001.tar zstash_archive/index.db
$ rmdir zstash_archive
No HPSS
This example will again illustrate create, update, and extract,
but this time without using HPSS.
Set up the directory structure:
$ ./setup.sh
Create:
Now, we will create the local zstash archive, but skip creating a HPSS archive.
$ zstash create --hpss=none zstash_demo
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000000.tar
INFO: Archiving file0.txt
INFO: Archiving file_empty.txt
INFO: Archiving dir/file1.txt
INFO: Archiving empty_dir
INFO: put: HPSS is unavailable
INFO: put: Keeping tar files locally and removing write permissions
INFO: zstash/000000.tar original mode=b"'660'"
INFO: zstash/000000.tar new mode=b"'440'"
INFO: put: HPSS is unavailable
zstash_demo/zstash (the cache, located in the source directory) is our only archive now.
It contains both index.db and 000000.tar.
In our HPSS window, ls will not show zstash_archive.
Our cache has completely replaced the HPSS archive we used in the simple case.
Update:
Now, we can add some more files and update this local archive.
$ ./add_files.sh
$ cd zstash_demo
$ zstash update --hpss=none
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000001.tar
INFO: Archiving dir/file1.txt
INFO: Archiving dir2/file2.txt
INFO: put: HPSS is unavailable
INFO: put: Keeping tar files locally and removing write permissions
INFO: zstash/000001.tar original mode=b"'660'"
INFO: zstash/000001.tar new mode=b"'440'"
INFO: put: HPSS is unavailable
000001.tar has been added to zstash_demo/zstash.
In our HPSS window, ls still shows no archive created from this example.
Extract:
Now, we will extract the files from this local archive. Again, we will extract into a new directory
because zstash will not extract files that are already present. So, we will exit the
zstash_demo directory and create a new directory. Unlike in the simple case however,
we will need to copy the cache zstash_demo/zstash into this new directory. We have
to do this because with --hpss=none, zstash simply extracts from the cache
in the current directory, rather than extracting from an HPSS archive. We could also
pass in --cache=../zstash_demo/zstash; --cache will be discussed in the next section.
$ cd ..
$ mkdir zstash_extraction
$ cd zstash_extraction
$ cp -r ../zstash_demo/zstash/ zstash
$ zstash extract --hpss=none
This will output the following:
INFO: Opening tar archive zstash/000000.tar
INFO: Extracting file0.txt
INFO: Extracting file_empty.txt
INFO: Extracting empty_dir
INFO: Opening tar archive zstash/000001.tar
INFO: Extracting dir/file1.txt
INFO: Extracting dir2/file2.txt
INFO: No failures detected when extracting the files.
As with the simple case, the contents of zstash_extraction
will be identical to those of zstash_demo. We still had to specify --hpss=none because
if we had kept --hpss=hpss_archive, then zstash would have tried to extract from a HPSS
archive that doesn’t exist.
Cleanup:
Now, we will clean up for the next example.
In our NERSC window (use the -f option to force deletion of tar files):
$ cd ..
$ rm -rf zstash_demo/ zstash_extraction/
There’s nothing to clean up in our HPSS window.
Change Cache Name
This example will again illustrate create, update, and extract,
but this time using a different cache.
Set up the directory structure:
$ ./setup.sh
Create:
Now, we will create the zstash archives – both locally and on HPSS.
This time, however, our local archive will be named my_cache instead zstash.
--cache just takes a directory path; it can
be a relative path and can contain .. or it can be an absolute path. However, it
is recommended that however the path is written, it places the cache in the source
directory (in this case, zstash_demo).
$ zstash create --hpss=zstash_archive --cache=my_cache zstash_demo
zstash create --hpss=zstash_archive --cache=/global/cscratch1/sd/forsyth/tutorial/zstash_demo/my_cache zstash_demo
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000000.tar
INFO: Archiving file0.txt
INFO: Archiving file_empty.txt
INFO: Archiving dir/file1.txt
INFO: Archiving empty_dir
INFO: Transferring file to HPSS: my_cache/000000.tar
INFO: Transferring file to HPSS: my_cache/index.db
zstash_demo/my_cache will now contain index.db just as zstash_demo/zstash
did in the simple case.
As in the simple case, in our HPSS window, ls zstash_archive shows both
000000.tar and index.db.
Update:
We will now add more files and update the archives.
$ ./add_files.sh
$ cd zstash_demo
$ zstash update --hpss=zstash_archive --cache=my_cache
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000001.tar
INFO: Archiving dir/file1.txt
INFO: Archiving dir2/file2.txt
INFO: Transferring file to HPSS: my_cache/000001.tar
INFO: Transferring file to HPSS: my_cache/index.db
In our HPSS window, notice that ls zstash_archive now includes a new tar file 000001.tar.
As in the simple case, nothing will have changed in the cache
(named zstash_demo/my_cache here).
Extract:
Now we will extract the files:
$ cd ..
$ mkdir zstash_extraction
$ cd zstash_extraction
$ zstash extract --hpss=zstash_archive --cache=my_cache
This will output the following:
INFO: Transferring file from HPSS: my_cache/index.db
INFO: Transferring file from HPSS: my_cache/000000.tar
INFO: Opening tar archive my_cache/000000.tar
INFO: Extracting file0.txt
INFO: Extracting file_empty.txt
INFO: Extracting empty_dir
INFO: Transferring file from HPSS: my_cache/000001.tar
INFO: Opening tar archive my_cache/000001.tar
INFO: Extracting dir/file1.txt
INFO: Extracting dir2/file2.txt
INFO: No failures detected when extracting the files.
As in the simple case, the contents of zstash_extraction will be identical to those of
zstash_demo. The difference is that this time the cache will be named my_cache in both
zstash_demo and zstash_extraction.
Cleanup:
Now, we will clean up for the next example.
In our NERSC window:
$ cd ..
$ rm -r zstash_demo/ zstash_extraction/
In our HPSS window (rm -r doesn’t work on HPSS):
$ rm zstash_archive/000000.tar zstash_archive/000001.tar zstash_archive/index.db
$ rmdir zstash_archive
Keep Files
This example will again illustrate create, update, and extract,
but this time storing the tar files in the local archive (the cache) in addition to the
HPSS archive. In the --hpss=none example, we saw that when there was no HPSS archive
being used, the cache contained tar files. The --keep option allows us to store these files
locally even if we’re using HPSS. If --keep is used with --hpss=none, there won’t be a
noticeable effect since the latter keeps tar files by default.
Set up the directory structure:
$ ./setup.sh
Create:
Now, we will create the zstash archives – both locally and on HPSS.
This time, however, we will specify that the cache should keep the tar file created
by zstash create.
$ zstash create --hpss=zstash_archive --keep zstash_demo
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000000.tar
INFO: Archiving file0.txt
INFO: Archiving file_empty.txt
INFO: Archiving dir/file1.txt
INFO: Archiving empty_dir
INFO: Transferring file to HPSS: zstash/000000.tar
INFO: Transferring file to HPSS: zstash/index.db
zstash_demo/zstash and our HPSS archive zstash_archive now have identical contents:
000000.tar and index.db. In the simple case, the former would not have contained
the tar file.
Update:
Now, we will add more files and update the archives.
$ ./add_files.sh
$ cd zstash_demo
$ zstash update --hpss=zstash_archive --keep
This will output the following:
INFO: Gathering list of files to archive
INFO: Creating new tar archive 000001.tar
INFO: Archiving dir/file1.txt
INFO: Archiving dir2/file2.txt
INFO: Transferring file to HPSS: zstash/000001.tar
INFO: Transferring file to HPSS: zstash/index.db
Again, zstash_demo/zstash and our HPSS archive zstash_archive
will have identical contents: 000001.tar in addition to 000000.tar and index.db.
In the simple case, the former would not have the tar file added.
Extract:
Now, we will extract the files from the HPSS archive.
$ cd ..
$ mkdir zstash_extraction
$ cd zstash_extraction
$ zstash extract --hpss=zstash_archive --keep
This will output the following:
INFO: Transferring file from HPSS: zstash/index.db
INFO: Transferring file from HPSS: zstash/000000.tar
INFO: Opening tar archive zstash/000000.tar
INFO: Extracting file0.txt
INFO: Extracting file_empty.txt
INFO: Extracting empty_dir
INFO: Transferring file from HPSS: zstash/000001.tar
INFO: Opening tar archive zstash/000001.tar
INFO: Extracting dir/file1.txt
INFO: Extracting dir2/file2.txt
INFO: No failures detected when extracting the files.
As in the simple case, the contents of zstash_extraction will be identical to those of
zstash_demo. The difference is that this time zstash_demo/zstash contains tar files
and thus so too does zstash_extraction/zstash, whereas this was not the case in
the simple case.
Cleanup:
Now, we will clean up for the next example.
In our NERSC window:
$ cd ..
$ rm -r zstash_demo/ zstash_extraction/
In our HPSS window (rm -r doesn’t work on HPSS):
$ rm zstash_archive/000000.tar zstash_archive/000001.tar zstash_archive/index.db
$ rmdir zstash_archive
Check File Integrity
This example will demonstrate how to check that files haven’t been damaged during archive or transfer.
Create:
First, let’s create local and HPSS archives as in the simple case:
$ ./setup.sh
$ zstash create --hpss=zstash_archive zstash_demo
Check:
Now, we will check the file integrity. Note that we’ll want to enter the directory where our
local cache (zstash) is. If we run zstash check in another directory,
a local cache (named zstash) will be created in that directory.
$ cd zstash_demo
$ zstash check --hpss=zstash_archive
This will output the following:
INFO: Transferring file from HPSS: zstash/000000.tar
INFO: Opening tar archive zstash/000000.tar
INFO: Checking file0.txt
INFO: Checking file_empty.txt
INFO: Checking dir/file1.txt
INFO: Checking empty_dir
INFO: No failures detected when checking the files.
No failures were detected, so the file integrity has been confirmed. This was done by verifying their md5 checksums against the original ones stored in the database.
zstash_demo/zstash will still only contain index.db and no files will have been added to
zstash_demo.
In our HPSS window, ls zstash archive will show 000000.tar and index.db.
Cleanup:
Now, we will clean up for the next example.
In our NERSC window:
$ cd ..
$ rm -r zstash_demo/
In our HPSS window (rm -r doesn’t work on HPSS):
$ rm zstash_archive/000000.tar zstash_archive/index.db
$ rmdir zstash_archive
List Files in HPSS Archive
This example will demonstrate how to list the files contained in the tar files in the HPSS archive.
Create:
First, let’s create local and HPSS archives as in the simple case:
$ ./setup.sh
$ zstash create --hpss=zstash_archive zstash_demo
Recall from the simple case that the cache zstash_demo/zstash now contains index.db and
that zstash_archive will contain 000000.tar in addition to index.db.
List:
We can check the contents of the tar file by running the following.
Note that it’s good to be in the directory with the cache
(in this case zstash_demo contains the cache zstash).
If we run zstash ls in another directory, a local cache (named zstash) will be created
in that directory.
$ cd zstash_demo
$ zstash ls --hpss=zstash_archive
This will output the following:
file0.txt
file_empty.txt
dir/file1.txt
empty_dir
Compare to the output of ls -R . (i.e., list all files in zstash_demo):
dir empty_dir file0.txt file_empty.txt zstash
./dir:
file1.txt
./empty_dir:
./zstash:
index.db
zstash ls lists the same files but excludes the cache (in this case, named zstash).
Extract:
Now, let’s extract the files.
$ cd ..
$ mkdir zstash_extraction
$ cd zstash_extraction
$ zstash extract --hpss=zstash_archive
This will output the following:
INFO: Transferring file from HPSS: zstash/index.db
INFO: Transferring file from HPSS: zstash/000000.tar
INFO: Opening tar archive zstash/000000.tar
INFO: Extracting file0.txt
INFO: Extracting file_empty.txt
INFO: Extracting dir/file1.txt
INFO: Extracting empty_dir
INFO: No failures detected when extracting the files.
List:
Now that we have extracted files, let’s list the files in the archive again.
$ cd ..
$ cd zstash_demo
$ zstash ls --hpss=zstash_archive
This will output the following:
file0.txt
file_empty.txt
dir/file1.txt
empty_dir
We can see that extraction did not alter what zstash ls displays, since extraction does not
change the contents of the HPSS archive.
Cleanup:
Now, we will clean up for the next example.
In our NERSC window:
$ cd ..
$ rm -r zstash_demo/ zstash_extraction/
In our HPSS window (rm -r doesn’t work on HPSS):
$ rm zstash_archive/000000.tar zstash_archive/index.db
$ rmdir zstash_archive
Archiving E3SM Data Output
Now, instead of using placeholder data, let’s archive actual data from E3SM. Archiving large amounts
of data can take several days, so we’ll use a data transfer node to improve performance and
screen so that we can detatch and reattach without stopping zstash. For this example,
however, we’ll use a small amount of data so zstash finishes in a reasonable amount of time.
We’ll call our directory e3sm_output this time instead of zstash_demo. We’ll enter this
directory and copy over a couple nc files from the data repository on NERSC.
$ mkdir e3sm_output
$ cd e3sm_output
$ cp /global/cfs/cdirs/e3smpub/E3SM_simulations/20180215.DECKv1b_H1.ne30_oEC.edison/archive/atm/hist/20180215.DECKv1b_H1.ne30_oEC.edison.cam.h0.1850-01.nc .
$ cp /global/cfs/cdirs/e3smpub/E3SM_simulations/20180215.DECKv1b_H1.ne30_oEC.edison/archive/atm/hist/20180215.DECKv1b_H1.ne30_oEC.edison.cam.h0.1850-02.nc .
Create:
Now, let’s enter the data transfer node, use screen, install zstash, and begin archiving.
$ ssh dtn01.nersc.gov
$ screen
$ bash
$ source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh
$ cd $CSCRATCH/tutorial
$ zstash create --hpss=zstash_archive e3sm_output
If the archiving is taking a long time, we could enter the keys CTRL-A and then CTRL-D to
detach and exit to get out of the data transfer node. We could reattach with:
$ ssh dtn01.nersc.gov
$ screen -r
For more information on archiving E3SM output, see Best practices for E3SM.
Cleanup:
Now, we will clean up.
In our NERSC window (not on the data transfer node):
$ cd ..
$ rm -r e3sm_output
In our HPSS window (rm -r doesn’t work on HPSS):
$ rm zstash_archive/000000.tar zstash_archive/index.db
$ rmdir zstash_archive