---
title: Archive & PSI Data Catalog
#tags:
keywords: linux, archive, data catalog, archiving, lts, tape, long term storage, ingestion, datacatalog
last_updated: 31 January 2020
summary: "This document describes how to use the PSI Data Catalog for archiving Merlin7 data."
sidebar: merlin7_sidebar
permalink: /merlin7/archive.html
---

# Archive & PSI Data Catalog

## PSI Data Catalog as a PSI Central Service

The PSI Data Catalog is a central PSI service for archiving and publishing research data; datasets registered
in the catalog are archived on the PetaByte Archive at CSCS.

The Data Catalog and Archive is suitable for:

* Derived data produced by processing some inputs
* Data required to reproduce PSI research and publications

The Data Catalog is part of PSI's effort to conform to the FAIR principles for data management.
In accordance with this policy, ***data will be publicly released under CC-BY-SA 4.0 after an
embargo period expires.***

The Merlin cluster is connected to the Data Catalog. Hence, users archive data stored in the
Merlin storage under the ``/data`` directories (currently, ``/data/user`` and ``/data/project``).
Archiving from other directories is also possible; however, the process is much slower, as the data
can not be directly retrieved by the PSI archive central servers (**central mode**) and must instead
be copied to them first (**decentral mode**).

Archiving can be done from any node accessible by the users (usually from the login nodes).

Below are the main steps for using the Data Catalog.

* Ingest the dataset into the Data Catalog. This makes the data known to the Data Catalog system at PSI:
  * Prepare a metadata file describing the dataset
  * Run the **``datasetIngestor``** script
  * If necessary, the script will copy the data to the PSI archive servers. Usually this is necessary
    when archiving from directories other than **``/data/user``** or **``/data/project``**. It would
    also be necessary when the Merlin export server (**``merlin-archive.psi.ch``**) is down for any reason.
* Archive the dataset:
  * Visit [https://discovery.psi.ch](https://discovery.psi.ch)
  * Click **``Archive``** for the dataset
  * The system will now copy the data to the PetaByte Archive at CSCS
* Retrieve data from the catalog:
  * Find the dataset on [https://discovery.psi.ch](https://discovery.psi.ch) and click **``Retrieve``**
  * Wait for the data to be copied to the PSI retrieval system
  * Run the **``datasetRetriever``** script

Since large data sets may take a lot of time to transfer, some steps are designed to happen in the
background. The discovery website can be used to track the progress of each step.

### Account Registration

Two types of account permit access to the Data Catalog. If your data was collected at a ***beamline***, you may
have been assigned a **``p-group``** (e.g. ``p12345``) for the experiment. Other users are assigned an **``a-group``**
(e.g. ``a-12345``).

Groups are usually assigned to a PI, and individual user accounts are then added to the group. This must be done
upon user request through PSI Service Now. For existing **a-groups** and **p-groups**, you can follow the standard
central procedures. Alternatively, if you do not know how to do that, follow the Merlin7
**[Requesting extra Unix groups](../01-Quick-Start-Guide/requesting-accounts.md)** procedure, or open
a **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.
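
Once the group has been granted, you can check on a login node that your account is a member of it. A minimal check (the group names mentioned above are placeholders):

```bash
# List the Unix groups of the current account; your archive group (e.g. p12345 or a-12345) should appear
groups
```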
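
The command-line tools authenticate against the Data Catalog with a token fetched from the [https://discovery.psi.ch](https://discovery.psi.ch) website. A minimal sketch of storing it in the shell variable used by the commands below (the value shown is only a placeholder):

```bash
# Token copied from the Data Catalog website; replace the placeholder with your own token
SCICAT_TOKEN=RqYMZcqpqMJqluplbNYXLeSyJISLXfnkwlfBKuvTSdnlpKkU
```
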
Tokens expire after 2 weeks and will need to be fetched from the website again.

### Ingestion

The first step to ingesting your data into the catalog is to prepare a file describing what data you have. This file is called
**``metadata.json``** and can be created with a text editor (e.g. *``vim``*). It can in principle be saved anywhere,
but keeping it with your archived data is recommended. For more information about the format, see the 'Bio metadata'
section below.
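
A sketch of what such a file can look like is shown below. The field names follow the SciCat dataset model and all values are placeholders; consult the 'Bio metadata' section and the ingest manual for the exact fields required for your data:

```yaml
{
    "principalInvestigator": "firstname.lastname@psi.ch",
    "creationLocation": "/PSI/instrument-or-facility",
    "sourceFolder": "/data/project/bio/myproject/archive",
    "owner": "Firstname Lastname",
    "ownerEmail": "firstname.lastname@psi.ch",
    "ownerGroup": "a-12345",
    "type": "raw",
    "description": "Short description of the dataset",
    "scientificMetadata": {
        "sample": "free-form, searchable metadata describing the experiment"
    }
}
```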

Then run the **``datasetIngestor``** script. It will ask for your PSI credentials and then print some info about the data to
be ingested:

```bash
datasetIngestor --token $SCICAT_TOKEN --ingest --autoarchive metadata.json
```

You will be asked whether you want to copy the data to the central system:

* If you are on the Merlin cluster and you are archiving data from ``/data/user`` or ``/data/project``, answer 'no', since the data catalog can
  directly read the data.
* If the data is in a directory other than ``/data/user`` and ``/data/project``, or you are on a desktop computer, answer 'yes'. Copying large datasets
  to the PSI archive system may take quite a while (minutes to hours).

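In the decentral case, the copy can also be requested up front with the ``-copy`` flag, combined with the other options exactly as in the example near the end of this page:

```bash
# Decentral ingestion: explicitly copy the data to the rsync cache server during ingestion
datasetIngestor --token $SCICAT_TOKEN -copy -autoarchive -ingest metadata.json
```
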
If there are no errors, your data has been accepted into the data catalog! From now on, no changes should be made to the ingested data.
This is important, since the next step is for the system to copy all the data to the CSCS Petabyte archive. Writing to tape is slow, so
this process may take several days, and it will fail if any modifications are detected.

If using the ``--autoarchive`` option as suggested above, your dataset should now be in the queue. Check the data catalog:
[https://discovery.psi.ch](https://discovery.psi.ch). Your job should have status 'WorkInProgress'. You will receive an email when the ingestion
is complete.

If you didn't use ``--autoarchive``, you need to manually move the dataset into the archive queue. From **discovery.psi.ch**, navigate to the 'Archive'
tab. You should see the newly ingested dataset. Check the dataset and click **``Archive``**. You should see the status change from **``datasetCreated``** to
**``scheduleArchiveJob``**. This indicates that the data is in the process of being transferred to CSCS.

After a few days the dataset's status will change to **``datasetOnArchive``**, indicating the data is stored. At this point it is safe to delete your local copy of the data.

#### Useful commands

Running the ``datasetIngestor`` in dry mode (**without** ``--ingest``) finds most errors. However, it is sometimes convenient to find potential errors
yourself with simple unix commands.

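For example, a dry run over the same metadata file simply omits ``--ingest`` and reports what would be ingested without changing anything:

```bash
# Dry run: validate metadata.json and show what would be ingested (no changes are made)
datasetIngestor --token $SCICAT_TOKEN metadata.json
```
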
Find problematic filenames before ingesting (special characters, spaces, and editor autosave files are common culprits).
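
A sketch of such checks, assuming GNU ``find``; adjust the patterns to your data:

```bash
# List names containing characters outside a conservative portable set
find . -regextype posix-extended -regex '.*[^a-zA-Z0-9_/.-].*'
# Remove editor autosave files
find . -name '*#autosave#' -delete
```
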
* If you see a warning like

  ```
  Certificate invalid: name is not a listed principal
  ```

  it indicates that no Kerberos token was provided for authentication. You can avoid the warning by first running ``kinit`` (on PSI Linux systems).
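
  For example:

  ```bash
  kinit    # obtain a Kerberos ticket using your PSI credentials
  klist    # verify that a valid ticket is now present
  ```
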
* For decentral ingestion cases, the copy step is indicated by a message ``Running [/usr/bin/rsync -e ssh -avxz ...``. It is expected that this
  step will take a long time and may appear to have hung. You can check which files have already been transferred using ``rsync``, as sketched below:
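
  A minimal sketch, assuming the data is staged under your dataset's ID on the cache server (the exact remote path may differ on your system):

  ```bash
  # Dry-run comparison (-n) between the local data and what has arrived on the archive cache server
  rsync -e ssh -avnx PATH/ user_n@pb-archive.psi.ch:archive/UID/PATH/
  ```
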
  where UID is the dataset ID (12345678-1234-1234-1234-123456789012) and PATH is the absolute path to your data. Note that rsync creates directories first and that the transfer order is not alphabetical in some cases, but it should be possible to see whether any data has transferred.

* There is currently a limit on the number of files per dataset (technically, the limit comes from the total length of all file paths). It is recommended to break up datasets into 300'000 files or less.
* If it is not possible or desirable to split data between multiple datasets, an alternative work-around is to package the files into a tarball. For datasets which are already compressed, omit the ``-z`` option for a considerable speedup:

  ```bash
  tar -czf [output].tar.gz [srcdir]   # with compression
  tar -cf [output].tar [srcdir]       # without compression, for already-compressed data
  ```

The following (abridged) output shows a decentral ingestion run, where the source folder is not centrally available and the data must first be copied to the rsync cache server:

```
/data/project/bio/myproject/archive $ datasetIngestor -copy -autoarchive -allowexistingsource -ingest metadata.json
2019/11/06 11:04:43 Latest version: 1.1.11
2019/11/06 11:04:43 Your version of this program is up-to-date
2019/11/06 11:04:43 You are about to add a dataset to the === production === data catalog environment...
2019/11/06 11:04:43 Your username:
...
user_n@pb-archive.psi.ch's password:
...
2019/11/06 11:05:04 The source folder /data/project/bio/myproject/archive is not centrally available (decentral use case).
The data must first be copied to a rsync cache server.
2019/11/06 11:05:04 Do you want to continue (Y/n)?
Y
2019/11/06 11:05:09 Created dataset with id 12.345.67890/12345678-1234-1234-1234-123456789012
...
user_n@pb-archive.psi.ch's password:
...
```

### Publishing

After datasets are ingested they can be assigned a public DOI. This DOI can be included in publications and will make the datasets available on [http://doi.psi.ch](http://doi.psi.ch).

For instructions on this, please read the ['Publish' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-8).