Added archive.md

This commit is contained in:
caubet_m 2020-01-31 11:13:15 +01:00
parent b3be875d02
commit 5bf1a50c63
2 changed files with 359 additions and 0 deletions

@@ -35,6 +35,8 @@ entries:
    url: /merlin6/storage.html
  - title: Transferring Data
    url: /merlin6/transfer-data.html
  - title: Archive & PSI Data Catalog
    url: /merlin6/archive.html
  - title: Remote Desktop Access
    url: /merlin6/nomachine.html
  - title: Job Submission


@@ -0,0 +1,357 @@
---
title: Archive & PSI Data Catalog
#tags:
keywords: Linux, archive, DataCatalog
last_updated: 31 January 2020
summary: "This document describes how to use the PSI Data Catalog for archiving Merlin6 data."
sidebar: merlin6_sidebar
permalink: /merlin6/archive.html
---
## PSI Data Catalog as a PSI Central Service
PSI provides access to the ***Data Catalog*** for **long-term data storage and retrieval**. Data is
stored on the ***PetaByte Archive*** at the **Swiss National Supercomputing Centre (CSCS)**.
The Data Catalog and Archive are suitable for:
* Raw data generated by PSI instruments
* Derived data produced by processing some inputs
* Data required to reproduce PSI research and publications
The Data Catalog is part of PSI's effort to conform to the FAIR principles for data management.
In accordance with this policy, ***data will be publicly released under CC-BY-SA 4.0 after an
embargo period expires.***
The Merlin cluster is connected to the Data Catalog, so users can archive data stored in the
Merlin storage under the ``/data`` directories (currently ``/data/user`` and ``/data/project``).
Archiving from other directories is also possible; however, the process is much slower, since the
data cannot be read directly by the central PSI archive servers (**central mode**) and instead
needs to be copied to them first (**decentral mode**).
Archiving can be done from any node accessible by the users (usually from the login nodes).
## Procedure
### Overview
Below are the main steps for using the Data Catalog.
* Ingest the dataset into the Data Catalog. This makes the data known to the Data Catalog system at PSI:
  * Prepare a metadata file describing the dataset
  * Run the **``datasetIngestor``** script
  * If necessary, the script will copy the data to the PSI archive servers
    * Usually this is necessary when archiving from directories other than **``/data/user``** or
      **``/data/project``**. It is also necessary when the Merlin export server (**``merlin-archive.psi.ch``**)
      is down for any reason.
* Archive the dataset:
  * Visit [https://discovery.psi.ch](https://discovery.psi.ch)
  * Click **``Archive``** for the dataset
  * The system will now copy the data to the PetaByte Archive at CSCS
* Retrieve data from the catalog:
  * Find the dataset on [https://discovery.psi.ch](https://discovery.psi.ch) and click **``Retrieve``**
  * Wait for the data to be copied to the PSI retrieval system
  * Run the **``datasetRetriever``** script
Since large data sets may take a lot of time to transfer, some steps are designed to happen in the
background. The discovery website can be used to track the progress of each step.
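In condensed form, the command-line part of the steps above looks roughly as follows (the archive and retrieve steps themselves are triggered on the website); this is a sketch, with the details explained in the sections below:
```bash
# Load the data catalog tools (on Merlin; see 'Installation' below)
module load datacatalog

# Dry run: validate metadata.json without ingesting anything
datasetIngestor metadata.json

# Real ingestion; --autoarchive also places the dataset in the archive queue
datasetIngestor --ingest --autoarchive metadata.json
```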
### Account Registration
Two types of account permit access to the Data Catalog. If your data was collected at a ***beamline***, you may
have been assigned a **``p-group``** (e.g. ``p12345``) for the experiment. Other users are assigned an **``a-group``**
(e.g. ``a-12345``).
Groups are usually assigned to a PI, and individual user accounts are then added to the group. This must be
requested through PSI Service Now. For existing **a-groups** and **p-groups**, you can follow the standard
central procedures. Alternatively, if you do not know how to do that, follow the Merlin6
**[Requesting extra Unix groups](/merlin6/request-account.html#requesting-extra-unix-groups)** procedure, or open
a **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.
### Installation
Accessing the Data Catalog is done through the [SciCat software](https://melanie.gitpages.psi.ch/SciCatPages/).
Documentation is here: [ingestManual.pdf](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf).
#### (Merlin systems) Loading datacatalog tools
This is the ***officially supported method*** for archiving from the Merlin cluster.
The latest datacatalog software is maintained in the PSI module system. To access it from the Merlin systems, run the following command:
```bash
module load datacatalog
```
This can be done from any host in the Merlin cluster that is accessible to users; usually, the login nodes are used for archiving.
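To check that the tools were picked up correctly, you can verify that they are on your ``PATH``:
```bash
command -v datasetIngestor datasetRetriever
```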
#### (Non-standard systems) Installing datacatalog tools
***This method is not supported by the Merlin admins***. However, we provide a short recipe for archiving from any host at PSI.
In case of problems, central AIT should be contacted.
If you do not have access to PSI modules (for instance, when archiving from Ubuntu systems), then you can install the
datacatalog software yourself. These tools require 64-bit Linux. To ingest from Windows systems, it is suggested to
transfer the data to a Linux system such as Merlin.
We suggest storing the SciCat scripts in ``~/bin`` so that they can be easily accessed.
```bash
mkdir -p ~/bin
cd ~/bin
/usr/bin/curl -O https://intranet.psi.ch/pub/Daas/WebHome/datasetIngestor
chmod +x ./datasetIngestor
/usr/bin/curl -O https://intranet.psi.ch/pub/Daas/WebHome/datasetRetriever
chmod +x ./datasetRetriever
```
When the scripts are updated you will be prompted to re-run some of the above commands to get the latest version.
You can call the ingestion scripts using the full path (``~/bin/datasetIngestor``) or else add ``~/bin`` to your Unix ``PATH``.
To do so, add the following line to your ``~/.bashrc`` file:
```bash
export PATH="$HOME/bin:$PATH"
```
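After editing the file, reload it in your current shell (or simply open a new one):
```bash
source ~/.bashrc
```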
### Ingestion
The first step to ingesting your data into the catalog is to prepare a file describing what data you have. This is called
**``metadata.json``**, and can be created with a text editor (e.g. *``vim``*). It can in principle be saved anywhere,
but keeping it with your archived data is recommended. For more information about the format, see the 'Bio metadata'
section below. An example follows:
```json
{
    "principalInvestigator": "albrecht.gessler@psi.ch",
    "creationLocation": "/PSI/EMF/JEOL2200FS",
    "dataFormat": "TIFF+LZW Image Stack",
    "sourceFolder": "/gpfs/group/LBR/pXXX/myimages",
    "owner": "Wilhelm Tell",
    "ownerEmail": "wilhelm.tell@psi.ch",
    "type": "raw",
    "description": "EM micrographs of amygdalin",
    "ownerGroup": "a-12345",
    "scientificMetadata": {
        "description": "EM micrographs of amygdalin",
        "sample": {
            "name": "Amygdalin beta-glucosidase 1",
            "uniprot": "P29259",
            "species": "Apple"
        },
        "dataCollection": {
            "date": "2018-08-01"
        },
        "microscopeParameters": {
            "pixel size": {
                "v": 0.885,
                "u": "A"
            },
            "voltage": {
                "v": 200,
                "u": "kV"
            },
            "dosePerFrame": {
                "v": 1.277,
                "u": "e/A2"
            }
        }
    }
}
```
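Since the file must be valid JSON, it can be worth syntax-checking it before running the ingestion. A minimal check using Python's standard library (available on most Linux systems):
```bash
# Prints the parsed JSON on success; prints a parse error with line number otherwise
python -m json.tool metadata.json
```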
The following steps can be run from wherever you saved your ``metadata.json``. First, perform a "dry-run" which will check the metadata for errors:
```bash
datasetIngestor metadata.json
```
It will ask for your PSI credentials and then print some info about the data to be ingested. If there are no errors, proceed to the real ingestion:
```bash
datasetIngestor --ingest --autoarchive metadata.json
```
For particularly important datasets, you may also want to use the parameter **``--tapecopies 2``** to store **redundant copies** of the data (see the example after the list below).
You will be asked whether you want to copy the data to the central system:
* If you are on the Merlin cluster and you are archiving data from ``/data/user`` or ``/data/project``, answer 'no' since the data catalog can
directly read the data.
* If you are in a directory other than ``/data/user`` or ``/data/project``, or you are on a desktop computer, answer 'yes'. Copying large datasets
to the PSI archive system may take quite a while (minutes to hours).
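For example, an ingestion of a particularly important dataset, combining the options mentioned above, would look like:
```bash
datasetIngestor --ingest --autoarchive --tapecopies 2 metadata.json
```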
If there are no errors, your data has been accepted into the data catalog! From now on, no changes should be made to the ingested data.
This is important, since the next step is for the system to copy all the data to the CSCS Petabyte archive. Writing to tape is slow, so
this process may take several days, and it will fail if any modifications are detected.
If using the ``--autoarchive`` option as suggested above, your dataset should now be in the queue. Check the data catalog:
[https://discovery.psi.ch](https://discovery.psi.ch). Your job should have status 'WorkInProgress'. You will receive an email when the ingestion
is complete.
If you didn't use ``--autoarchive``, you need to manually move the dataset into the archive queue. From **discovery.psi.ch**, navigate to the 'Archive'
tab. You should see the newly ingested dataset. Check the dataset and click **``Archive``**. You should see the status change from **``datasetCreated``** to
**``scheduleArchiveJob``**. This indicates that the data is in the process of being transferred to CSCS.
After a few days the dataset's status will change to **``datasetOnArchive``**, indicating the data is stored. At this point it is safe to delete the original data.
#### Useful commands
Running the **``datasetIngestor``** in dry-run mode (**without** ``--ingest``) finds most errors. However, it is sometimes convenient to find potential errors
yourself with simple Unix commands.
Find problematic filenames:
```bash
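# matches paths containing characters other than a-z, A-Z, 0-9, '_', ' ', '.', '/' and '-'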
find . -iregex '.*/[^/]*[^a-zA-Z0-9_ ./-][^/]*'
```
Find broken links:
```bash
find -L . -type l
```
Find outside links:
```bash
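# prints 'link -> target' for symlinks whose targets resolve outside the current directory tree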
find . -type l -exec bash -c 'realpath --relative-base "`pwd`" "$0" 2>/dev/null |egrep "^[./]" |sed "s|^|$0 ->|" ' '{}' ';'
```
Delete certain files (use with caution):
```bash
# Empty directories
find . -type d -empty -delete
# Backup files
find . -name '*~' -delete
find . -name '*#autosave#' -delete
```
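To preview what would be removed before actually deleting anything, replace ``-delete`` with ``-print``, e.g.:
```bash
find . -type d -empty -print
```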
### Troubleshooting & Known Bugs
* The following message can be safely ignored:
```bash
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
```
It indicates that no Kerberos token was provided for authentication. You can avoid the warning by first running ``kinit`` (on PSI Linux systems).
* For decentral ingestion cases, the copy step is indicated by a message ``Running [/usr/bin/rsync -e ssh -avxz ...``. It is expected that this
step will take a long time and may appear to have hung. You can check which files have been successfully transferred using rsync:
```bash
rsync --list-only user_n@pb-archive.psi.ch:archive/UID/PATH/
```
where ``UID`` is the dataset ID (12345678-1234-1234-1234-123456789012) and ``PATH`` is the absolute path to your data. Note that rsync creates directories first and that the transfer order is not alphabetical in some cases, but it should be possible to see whether any data has transferred.
* There is currently a limit on the number of files per dataset (technically, the limit comes from the total length of all file paths). It is recommended to break up datasets into 300'000 files or less (a quick way to count files is sketched below).
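For the last two points, a quick sketch (assuming the placeholder ``UID``/``PATH`` from above and a hypothetical dataset directory):
```bash
# Count the files already visible on the archive server
rsync --list-only user_n@pb-archive.psi.ch:archive/UID/PATH/ | wc -l

# Count the files in a local dataset before ingestion (aim for <= 300'000)
find /data/user/$USER/mydataset -type f | wc -l
```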
<details>
<summary>[Show Example]: Sample ingestion output (datasetIngestor 1.1.11)</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
/data/project/bio/myproject/archive $ datasetIngestor -copy -autoarchive -allowexistingsource -ingest metadata.json
2019/11/06 11:04:43 Latest version: 1.1.11
2019/11/06 11:04:43 Your version of this program is up-to-date
2019/11/06 11:04:43 You are about to add a dataset to the === production === data catalog environment...
2019/11/06 11:04:43 Your username:
user_n
2019/11/06 11:04:48 Your password:
2019/11/06 11:04:52 User authenticated: XXX
2019/11/06 11:04:52 User is member in following a or p groups: XXX
2019/11/06 11:04:52 OwnerGroup information a-XXX verified successfully.
2019/11/06 11:04:52 contactEmail field added: XXX
2019/11/06 11:04:52 Scanning files in dataset /data/project/bio/myproject/archive
2019/11/06 11:04:52 No explicit filelistingPath defined - full folder /data/project/bio/myproject/archive is used.
2019/11/06 11:04:52 Source Folder: /data/project/bio/myproject/archive at /data/project/bio/myproject/archive
2019/11/06 11:04:57 The dataset contains 100000 files with a total size of 50000000000 bytes.
2019/11/06 11:04:57 creationTime field added: 2019-07-29 18:47:08 +0200 CEST
2019/11/06 11:04:57 endTime field added: 2019-11-06 10:52:17.256033 +0100 CET
2019/11/06 11:04:57 license field added: CC BY-SA 4.0
2019/11/06 11:04:57 isPublished field added: false
2019/11/06 11:04:57 classification field added: IN=medium,AV=low,CO=low
2019/11/06 11:04:57 Updated metadata object:
{
"accessGroups": [
"XXX"
],
"classification": "IN=medium,AV=low,CO=low",
"contactEmail": "XXX",
"creationLocation": "XXX",
"creationTime": "2019-07-29T18:47:08+02:00",
"dataFormat": "XXX",
"description": "XXX",
"endTime": "2019-11-06T10:52:17.256033+01:00",
"isPublished": false,
"license": "CC BY-SA 4.0",
"owner": "XXX",
"ownerEmail": "XXX",
"ownerGroup": "a-XXX",
"principalInvestigator": "XXX",
"scientificMetadata": {
...
},
"sourceFolder": "/data/project/bio/myproject/archive",
"type": "raw"
}
2019/11/06 11:04:57 Running [/usr/bin/ssh -l user_n pb-archive.psi.ch test -d /data/project/bio/myproject/archive].
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
user_n@pb-archive.psi.ch's password:
2019/11/06 11:05:04 The source folder /data/project/bio/myproject/archive is not centrally available (decentral use case).
The data must first be copied to a rsync cache server.
2019/11/06 11:05:04 Do you want to continue (Y/n)?
Y
2019/11/06 11:05:09 Created dataset with id 12.345.67890/12345678-1234-1234-1234-123456789012
2019/11/06 11:05:09 The dataset contains 108057 files.
2019/11/06 11:05:10 Created file block 0 from file 0 to 1000 with total size of 413229990 bytes
2019/11/06 11:05:10 Created file block 1 from file 1000 to 2000 with total size of 416024000 bytes
2019/11/06 11:05:10 Created file block 2 from file 2000 to 3000 with total size of 416024000 bytes
2019/11/06 11:05:10 Created file block 3 from file 3000 to 4000 with total size of 416024000 bytes
...
2019/11/06 11:05:26 Created file block 105 from file 105000 to 106000 with total size of 416024000 bytes
2019/11/06 11:05:27 Created file block 106 from file 106000 to 107000 with total size of 416024000 bytes
2019/11/06 11:05:27 Created file block 107 from file 107000 to 108000 with total size of 850195143 bytes
2019/11/06 11:05:27 Created file block 108 from file 108000 to 108057 with total size of 151904903 bytes
2019/11/06 11:05:27 short dataset id: 0a9fe316-c9e7-4cc5-8856-e1346dd31e31
2019/11/06 11:05:27 Running [/usr/bin/rsync -e ssh -avxz /data/project/bio/myproject/archive/ user_n@pb-archive.psi.ch:archive
/0a9fe316-c9e7-4cc5-8856-e1346dd31e31/data/project/bio/myproject/archive].
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
user_n@pb-archive.psi.ch's password:
Permission denied, please try again.
user_n@pb-archive.psi.ch's password:
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
...
2019/11/06 12:05:08 Successfully updated {"pid":"12.345.67890/12345678-1234-1234-1234-123456789012",...}
2019/11/06 12:05:08 Submitting Archive Job for the ingested datasets.
2019/11/06 12:05:08 Job response Status: okay
2019/11/06 12:05:08 A confirmation email will be sent to XXX
12.345.67890/12345678-1234-1234-1234-123456789012
</pre>
</details>
### Retrieving data
The retrieval process is still a work in progress. For more information, read the [ingest manual](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf).
## Further Information
* **[PSI Data Catalog](https://discovery.psi.ch)**
* **[Full Documentation](https://melanie.gitpages.psi.ch/SciCatPages/)**: **[PDF](https://melanie.gitpages.psi.ch/SciCatPages/ingestManual.pdf)**.
* Data Catalog **[Official Website](https://www.psi.ch/photon-science-data-services/data-catalog-and-archive)**
* Data Catalog **[SciCat Software](https://scicatproject.github.io/)**
* **[FAIR](https://www.nature.com/articles/sdata201618)** definition and **[SNF Research Policy](http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx#FAIR%20Data%20Principles%20for%20Research%20Data%20Management)**
* **[Petabyte Archive at CSCS](https://www.cscs.ch/fileadmin/user_upload/contents_publications/annual_reports/AR2017_Online.pdf)**