Update Data Catalog page
In particular, added Parallel_TarGz.batch link
This commit is contained in:
parent
94b1b6c4d1
commit
03f36bf45c
BIN
images/scicat_token.png
Normal file
BIN
images/scicat_token.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 70 KiB |
@ -79,14 +79,12 @@ central procedures. Alternatively, if you do not know how to do that, follow the
|
||||
**[Requesting extra Unix groups](/merlin6/request-account.html#requesting-extra-unix-groups)** procedure, or open
|
||||
a **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.
|
||||
|
||||
### Installation
|
||||
### Documentation
|
||||
|
||||
Accessing the Data Catalog is done through the [SciCat software](https://melanie.gitpages.psi.ch/SciCatPages/).
|
||||
Documentation is here: [ingestManual.pdf](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf).
|
||||
Documentation is here: [ingestManual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html).
|
||||
|
||||
#### (Merlin systems) Loading datacatalog tools
|
||||
|
||||
This is the ***official supported method*** for archiving the Merlin cluster.
|
||||
#### Loading datacatalog tools
|
||||
|
||||
The latest datacatalog software is maintained in the PSI module system. To access it from the Merlin systems, run the following command:
|
||||
|
||||
@ -96,37 +94,28 @@ module load datacatalog
|
||||
|
||||
It can be done from any host in the Merlin cluster accessible by users. Usually, login nodes will be the nodes used for archiving.
|
||||
|
||||
#### (Non-standard systems) Installing datacatalog tools
|
||||
### Finding your token
|
||||
|
||||
***This method is not supported by the Merlin admins***. However, we provide a small recipe for archiving from any host at PSI.
|
||||
On any problems, Central AIT should be contacted.
|
||||
As of 2022-04-14 a secure token is required to interact with the data catalog. This is a long random string that replaces the previous user/password authentication (allowing access for non-PSI use cases). **This string should be treated like a password and not shared.**
|
||||
|
||||
If you do not have access to PSI modules (for instance, when archiving from Ubuntu systems), then you can install the
|
||||
datacatalog software yourself. These tools require 64-bit linux. To ingest from Windows systems, it is suggested to
|
||||
transfer the data to a linux system such as Merlin.
|
||||
1. Go to discovery.psi.ch
|
||||
1. Click 'Sign in' in the top right corner. Click the 'Login with PSI account' and log in on the PSI login1. page.
|
||||
1. You should be redirected to your user settings and see a 'User Information' section. If not, click on1. your username in the top right and choose 'Settings' from the menu.
|
||||
1. Look for the field 'Catamel Token'. This should be a 64-character string. Click the icon to copy the1. token.
|
||||
|
||||
We suggest storing the SciCat scripts in ``~/bin`` so that they can be easily accessed.
|
||||

|
||||
|
||||
You will need to save this token for later steps. To avoid including it in all the commands, I suggest saving it to an environmental variable (Linux):
|
||||
|
||||
```bash
|
||||
mkdir -p ~/bin
|
||||
cd ~/bin
|
||||
/usr/bin/curl -O https://intranet.psi.ch/pub/Daas/WebHome/datasetIngestor
|
||||
chmod +x ./datasetIngestor
|
||||
/usr/bin/curl -O https://intranet.psi.ch/pub/Daas/WebHome/datasetRetriever
|
||||
chmod +x ./datasetRetriever
|
||||
```
|
||||
$ SCICAT_TOKEN=RqYMZcqpqMJqluplbNYXLeSyJISLXfnkwlfBKuvTSdnlpKkU
|
||||
```
|
||||
|
||||
When the scripts are updated you will be prompted to re-run some of the above commands to get the latest version.
|
||||
(Hint: prefix this line with a space to avoid saving the token to your bash history.)
|
||||
|
||||
You can call the ingestion scripts using the full path (``~/bin/datasetIngestor``) or else add ``~/bin`` to your unix PATH.
|
||||
To do so, add the following line to your ``~/.bashrc`` file:
|
||||
Tokens expire after 2 weeks and will need to be fetched from the website again.
|
||||
|
||||
```bash
|
||||
export PATH="$HOME/bin:$PATH"
|
||||
```
|
||||
|
||||
### Ingestion
|
||||
### Ingestion
|
||||
|
||||
The first step to ingesting your data into the catalog is to prepare a file describing what data you have. This is called
|
||||
**``metadata.json``**, and can be created with a text editor (e.g. *``vim``*). It can in principle be saved anywhere,
|
||||
@ -172,20 +161,22 @@ section below. An example follows:
|
||||
}
|
||||
```
|
||||
|
||||
It is recommended to use the [ScicatEditor](https://bliven_s.gitpages.psi.ch/SciCatEditor/) for creating metadata files. This is a browser-based tool specifically for ingesting PSI data. Using the tool avoids syntax errors and provides templates for common data sets and options. The finished JSON file can then be downloaded to merlin or copied into a text editor.
|
||||
|
||||
Another option is to use the SciCat graphical interface from NoMachine. This provides a graphical interface for selecting data to archive. This is particularly useful for data associated with a DUO experiment and p-group. Type `SciCat`` to get started after loading the `datacatalog`` module. The GUI also replaces the the command-line ingestion described below.
|
||||
|
||||
The following steps can be run from wherever you saved your ``metadata.json``. First, perform a "dry-run" which will check the metadata for errors:
|
||||
|
||||
```bash
|
||||
datasetIngestor metadata.json
|
||||
datasetIngestor --token $SCICAT_TOKEN metadata.json
|
||||
```
|
||||
|
||||
It will ask for your PSI credentials and then print some info about the data to be ingested. If there are no errors, proceed to the real ingestion:
|
||||
|
||||
```bash
|
||||
datasetIngestor --ingest --autoarchive metadata.json
|
||||
datasetIngestor --token $SCICAT_TOKEN --ingest --autoarchive metadata.json
|
||||
```
|
||||
|
||||
For particularly important datasets, you may also want to use the parameter **``--tapecopies 2``** to store **redundant copies** of the data.
|
||||
|
||||
You will be asked whether you want to copy the data to the central system:
|
||||
|
||||
* If you are on the Merlin cluster and you are archiving data from ``/data/user`` or ``/data/project``, answer 'no' since the data catalog can
|
||||
@ -240,7 +231,7 @@ find . -name '*~' -delete
|
||||
find . -name '*#autosave#' -delete
|
||||
```
|
||||
|
||||
### Troubleshooting & Known Bugs
|
||||
#### Troubleshooting & Known Bugs
|
||||
|
||||
* The following message can be safely ignored:
|
||||
|
||||
@ -259,8 +250,22 @@ step will take a long time and may appear to have hung. You can check what files
|
||||
|
||||
where UID is the dataset ID (12345678-1234-1234-1234-123456789012) and PATH is the absolute path to your data. Note that rsync creates directories first and that the transfer order is not alphabetical in some cases, but it should be possible to see whether any data has transferred.
|
||||
|
||||
* There is currently a limit on the number of files per dataset (technically, the limit is from the total length of all file paths). It is recommended to break up datasets into 300'000 files or less.
|
||||
* There is currently a limit on the number of files per dataset (technically, the limit is from the total length of all file paths). It is recommended to break up datasets into 300'000 files or less.
|
||||
* If it is not possible or desirable to split data between multiple datasets, an alternate work-around is to package files into a tarball. For datasets which are already compressed, omit the -z option for a considerable speedup:
|
||||
|
||||
```
|
||||
tar -f [output].tar [srcdir]
|
||||
```
|
||||
|
||||
Uncompressed data can be compressed on the cluster using the following command:
|
||||
|
||||
```
|
||||
sbatch /data/software/Slurm/Utilities/Parallel_TarGz.batch -s [srcdir] -t [output].tar -n
|
||||
```
|
||||
|
||||
Run /data/software/Slurm/Utilities/Parallel_TarGz.batch -h for more details and options.
|
||||
|
||||
#### Sample ingestion output (datasetIngestor 1.1.11)
|
||||
<details>
|
||||
<summary>[Show Example]: Sample ingestion output (datasetIngestor 1.1.11)</summary>
|
||||
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
|
||||
@ -353,15 +358,22 @@ user_n@pb-archive.psi.ch's password:
|
||||
</pre>
|
||||
</details>
|
||||
|
||||
### Publishing
|
||||
|
||||
After datasets are are ingested they can be assigned a public DOI. This can be included in publications and will make the datasets on http://doi.psi.ch.
|
||||
|
||||
For instructions on this, please read the ['Publish' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-8).
|
||||
|
||||
### Retrieving data
|
||||
|
||||
The retrieval process is still a work in progress. For more info, read the ingest manual.
|
||||
Retrieving data from the archive is also initiated through the Data Catalog. Please read the ['Retrieve' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-6).
|
||||
|
||||
## Further Information
|
||||
|
||||
* **[PSI Data Catalog](https://discovery.psi.ch)**
|
||||
* **[Full Documentation](https://melanie.gitpages.psi.ch/SciCatPages/)**: **[PDF](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf)**.
|
||||
* Data Catalog **[Official Website](https://www.psi.ch/photon-science-data-services/data-catalog-and-archive)**
|
||||
* Data catalog **[SciCat Software](https://scicatproject.github.io/)**
|
||||
* **[FAIR](https://www.nature.com/articles/sdata201618)** definition and **[SNF Research Policy](http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx#FAIR%20Data%20Principles%20for%20Research%20Data%20Management)**
|
||||
* **[Petabyte Archive at CSCS](https://www.cscs.ch/fileadmin/user_upload/contents_publications/annual_reports/AR2017_Online.pdf)**
|
||||
* [PSI Data Catalog](https://discovery.psi.ch)
|
||||
* [Full Documentation](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html)
|
||||
* [Published Datasets (doi.psi.ch)](https://doi.psi.ch)
|
||||
* Data Catalog [PSI page](https://www.psi.ch/photon-science-data-services/data-catalog-and-archive)
|
||||
* Data catalog [SciCat Software](https://scicatproject.github.io/)
|
||||
* [FAIR](https://www.nature.com/articles/sdata201618) definition and [SNF Research Policy](http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx#FAIR%20Data%20Principles%20for%20Research%20Data%20Management)
|
||||
* [Petabyte Archive at CSCS](https://www.cscs.ch/fileadmin/user_upload/contents_publications/annual_reports/AR2017_Online.pdf)
|
||||
|
Loading…
x
Reference in New Issue
Block a user