Update Data Catalog page

In particular, added Parallel_TarGz.batch link
This commit is contained in:
Spencer Bliven 2023-09-19 17:36:37 +02:00
parent 94b1b6c4d1
commit 03f36bf45c
2 changed files with 52 additions and 40 deletions

BIN
images/scicat_token.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 70 KiB

View File

@ -79,14 +79,12 @@ central procedures. Alternatively, if you do not know how to do that, follow the
**[Requesting extra Unix groups](/merlin6/request-account.html#requesting-extra-unix-groups)** procedure, or open
a **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.
### Installation
### Documentation
Accessing the Data Catalog is done through the [SciCat software](https://melanie.gitpages.psi.ch/SciCatPages/).
Documentation is here: [ingestManual.pdf](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf).
Documentation is here: [ingestManual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html).
#### (Merlin systems) Loading datacatalog tools
This is the ***official supported method*** for archiving the Merlin cluster.
#### Loading datacatalog tools
The latest datacatalog software is maintained in the PSI module system. To access it from the Merlin systems, run the following command:
@ -96,37 +94,28 @@ module load datacatalog
It can be done from any host in the Merlin cluster accessible by users. Usually, login nodes will be the nodes used for archiving.
#### (Non-standard systems) Installing datacatalog tools
### Finding your token
***This method is not supported by the Merlin admins***. However, we provide a small recipe for archiving from any host at PSI.
On any problems, Central AIT should be contacted.
As of 2022-04-14 a secure token is required to interact with the data catalog. This is a long random string that replaces the previous user/password authentication (allowing access for non-PSI use cases). **This string should be treated like a password and not shared.**
If you do not have access to PSI modules (for instance, when archiving from Ubuntu systems), then you can install the
datacatalog software yourself. These tools require 64-bit linux. To ingest from Windows systems, it is suggested to
transfer the data to a linux system such as Merlin.
1. Go to discovery.psi.ch
1. Click 'Sign in' in the top right corner. Click the 'Login with PSI account' and log in on the PSI login1. page.
1. You should be redirected to your user settings and see a 'User Information' section. If not, click on1. your username in the top right and choose 'Settings' from the menu.
1. Look for the field 'Catamel Token'. This should be a 64-character string. Click the icon to copy the1. token.
We suggest storing the SciCat scripts in ``~/bin`` so that they can be easily accessed.
![SciCat website](/images/scicat_token.png)
You will need to save this token for later steps. To avoid including it in all the commands, I suggest saving it to an environmental variable (Linux):
```bash
mkdir -p ~/bin
cd ~/bin
/usr/bin/curl -O https://intranet.psi.ch/pub/Daas/WebHome/datasetIngestor
chmod +x ./datasetIngestor
/usr/bin/curl -O https://intranet.psi.ch/pub/Daas/WebHome/datasetRetriever
chmod +x ./datasetRetriever
```
$ SCICAT_TOKEN=RqYMZcqpqMJqluplbNYXLeSyJISLXfnkwlfBKuvTSdnlpKkU
```
When the scripts are updated you will be prompted to re-run some of the above commands to get the latest version.
(Hint: prefix this line with a space to avoid saving the token to your bash history.)
You can call the ingestion scripts using the full path (``~/bin/datasetIngestor``) or else add ``~/bin`` to your unix PATH.
To do so, add the following line to your ``~/.bashrc`` file:
Tokens expire after 2 weeks and will need to be fetched from the website again.
```bash
export PATH="$HOME/bin:$PATH"
```
### Ingestion
### Ingestion
The first step to ingesting your data into the catalog is to prepare a file describing what data you have. This is called
**``metadata.json``**, and can be created with a text editor (e.g. *``vim``*). It can in principle be saved anywhere,
@ -172,20 +161,22 @@ section below. An example follows:
}
```
It is recommended to use the [ScicatEditor](https://bliven_s.gitpages.psi.ch/SciCatEditor/) for creating metadata files. This is a browser-based tool specifically for ingesting PSI data. Using the tool avoids syntax errors and provides templates for common data sets and options. The finished JSON file can then be downloaded to merlin or copied into a text editor.
Another option is to use the SciCat graphical interface from NoMachine. This provides a graphical interface for selecting data to archive. This is particularly useful for data associated with a DUO experiment and p-group. Type `SciCat`` to get started after loading the `datacatalog`` module. The GUI also replaces the the command-line ingestion described below.
The following steps can be run from wherever you saved your ``metadata.json``. First, perform a "dry-run" which will check the metadata for errors:
```bash
datasetIngestor metadata.json
datasetIngestor --token $SCICAT_TOKEN metadata.json
```
It will ask for your PSI credentials and then print some info about the data to be ingested. If there are no errors, proceed to the real ingestion:
```bash
datasetIngestor --ingest --autoarchive metadata.json
datasetIngestor --token $SCICAT_TOKEN --ingest --autoarchive metadata.json
```
For particularly important datasets, you may also want to use the parameter **``--tapecopies 2``** to store **redundant copies** of the data.
You will be asked whether you want to copy the data to the central system:
* If you are on the Merlin cluster and you are archiving data from ``/data/user`` or ``/data/project``, answer 'no' since the data catalog can
@ -240,7 +231,7 @@ find . -name '*~' -delete
find . -name '*#autosave#' -delete
```
### Troubleshooting & Known Bugs
#### Troubleshooting & Known Bugs
* The following message can be safely ignored:
@ -259,8 +250,22 @@ step will take a long time and may appear to have hung. You can check what files
where UID is the dataset ID (12345678-1234-1234-1234-123456789012) and PATH is the absolute path to your data. Note that rsync creates directories first and that the transfer order is not alphabetical in some cases, but it should be possible to see whether any data has transferred.
* There is currently a limit on the number of files per dataset (technically, the limit is from the total length of all file paths). It is recommended to break up datasets into 300'000 files or less.
* There is currently a limit on the number of files per dataset (technically, the limit is from the total length of all file paths). It is recommended to break up datasets into 300'000 files or less.
* If it is not possible or desirable to split data between multiple datasets, an alternate work-around is to package files into a tarball. For datasets which are already compressed, omit the -z option for a considerable speedup:
```
tar -f [output].tar [srcdir]
```
Uncompressed data can be compressed on the cluster using the following command:
```
sbatch /data/software/Slurm/Utilities/Parallel_TarGz.batch -s [srcdir] -t [output].tar -n
```
Run /data/software/Slurm/Utilities/Parallel_TarGz.batch -h for more details and options.
#### Sample ingestion output (datasetIngestor 1.1.11)
<details>
<summary>[Show Example]: Sample ingestion output (datasetIngestor 1.1.11)</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
@ -353,15 +358,22 @@ user_n@pb-archive.psi.ch's password:
</pre>
</details>
### Publishing
After datasets are are ingested they can be assigned a public DOI. This can be included in publications and will make the datasets on http://doi.psi.ch.
For instructions on this, please read the ['Publish' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-8).
### Retrieving data
The retrieval process is still a work in progress. For more info, read the ingest manual.
Retrieving data from the archive is also initiated through the Data Catalog. Please read the ['Retrieve' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-6).
## Further Information
* **[PSI Data Catalog](https://discovery.psi.ch)**
* **[Full Documentation](https://melanie.gitpages.psi.ch/SciCatPages/)**: **[PDF](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf)**.
* Data Catalog **[Official Website](https://www.psi.ch/photon-science-data-services/data-catalog-and-archive)**
* Data catalog **[SciCat Software](https://scicatproject.github.io/)**
* **[FAIR](https://www.nature.com/articles/sdata201618)** definition and **[SNF Research Policy](http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx#FAIR%20Data%20Principles%20for%20Research%20Data%20Management)**
* **[Petabyte Archive at CSCS](https://www.cscs.ch/fileadmin/user_upload/contents_publications/annual_reports/AR2017_Online.pdf)**
* [PSI Data Catalog](https://discovery.psi.ch)
* [Full Documentation](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html)
* [Published Datasets (doi.psi.ch)](https://doi.psi.ch)
* Data Catalog [PSI page](https://www.psi.ch/photon-science-data-services/data-catalog-and-archive)
* Data catalog [SciCat Software](https://scicatproject.github.io/)
* [FAIR](https://www.nature.com/articles/sdata201618) definition and [SNF Research Policy](http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx#FAIR%20Data%20Principles%20for%20Research%20Data%20Management)
* [Petabyte Archive at CSCS](https://www.cscs.ch/fileadmin/user_upload/contents_publications/annual_reports/AR2017_Online.pdf)