diff --git a/.markdownlint.yaml b/.markdownlint.yaml index b9a88ff..4830523 100644 --- a/.markdownlint.yaml +++ b/.markdownlint.yaml @@ -1,10 +1,15 @@ default: true -line-length: - line_length: 88 - tables: false +line-length: false +# line-length: +# line_length: 88 +# tables: false no-trailing-punctuation: true heading-style: style: atx no-missing-space-atx: true single-title: false -fenced-code-language: true \ No newline at end of file +fenced-code-language: true +code-block-style: + style: fenced +no-duplicate-heading: + siblings_only: true \ No newline at end of file diff --git a/docs/assets/images/screenshots/PublishingData1.png b/docs/assets/images/screenshots/PublishingData1.png new file mode 100644 index 0000000..f075e63 Binary files /dev/null and b/docs/assets/images/screenshots/PublishingData1.png differ diff --git a/docs/assets/images/screenshots/PublishingData2.png b/docs/assets/images/screenshots/PublishingData2.png new file mode 100644 index 0000000..1538fb0 Binary files /dev/null and b/docs/assets/images/screenshots/PublishingData2.png differ diff --git a/docs/assets/images/screenshots/PublishingData3.png b/docs/assets/images/screenshots/PublishingData3.png new file mode 100644 index 0000000..1251897 Binary files /dev/null and b/docs/assets/images/screenshots/PublishingData3.png differ diff --git a/docs/assets/images/screenshots/metadata.png b/docs/assets/images/screenshots/metadata.png new file mode 100644 index 0000000..680a047 Binary files /dev/null and b/docs/assets/images/screenshots/metadata.png differ diff --git a/docs/assets/images/screenshots/pgroup_selection.png b/docs/assets/images/screenshots/pgroup_selection.png new file mode 100644 index 0000000..1d6697c Binary files /dev/null and b/docs/assets/images/screenshots/pgroup_selection.png differ diff --git a/docs/assets/images/screenshots/proposal_found.png b/docs/assets/images/screenshots/proposal_found.png new file mode 100644 index 0000000..de93258 Binary files /dev/null and b/docs/assets/images/screenshots/proposal_found.png differ diff --git a/docs/assets/images/screenshots/proposal_not_found.png b/docs/assets/images/screenshots/proposal_not_found.png new file mode 100644 index 0000000..23bd541 Binary files /dev/null and b/docs/assets/images/screenshots/proposal_not_found.png differ diff --git a/docs/assets/presentations/SciCatGettingStartedSLS.pptx b/docs/assets/presentations/SciCatGettingStartedSLS.pptx new file mode 100644 index 0000000..66ea375 Binary files /dev/null and b/docs/assets/presentations/SciCatGettingStartedSLS.pptx differ diff --git a/docs/assets/presentations/SciCatGettingStartedSLSSummary.pdf b/docs/assets/presentations/SciCatGettingStartedSLSSummary.pdf new file mode 100644 index 0000000..668dacc Binary files /dev/null and b/docs/assets/presentations/SciCatGettingStartedSLSSummary.pdf differ diff --git a/docs/assets/presentations/SciCatGettingStartedSLSSummary.pptx b/docs/assets/presentations/SciCatGettingStartedSLSSummary.pptx new file mode 100644 index 0000000..be6bccc Binary files /dev/null and b/docs/assets/presentations/SciCatGettingStartedSLSSummary.pptx differ diff --git a/docs/index.md b/docs/index.md index 2103952..5f72d10 100644 --- a/docs/index.md +++ b/docs/index.md @@ -10,3 +10,4 @@ principles](https://force11.org/info/the-fair-data-principles/). 
- Browse the Data Catalog at [discovery.psi.ch](https://discovery.psi.ch)
- See published datasets at [doi.psi.ch](https://doi.psi.ch)
- Read the [Ingestor Manual](ingestorManual.md) to get started adding your datasets
diff --git a/docs/ingestorManual.md b/docs/ingestorManual.md
new file mode 100644
index 0000000..5dd8a60
--- /dev/null
+++ b/docs/ingestorManual.md
@@ -0,0 +1,1672 @@
---
title: Ingestor Manual
---

## Overview and Concepts

PSI offers a Data Catalog Service for annotated long-term data storage, retrieval and publishing. The annotation information, i.e. the metadata, is stored in a central database to allow fast queries for the data. The raw data itself is stored on the PetaByte Archive at the Swiss National Supercomputing Centre (CSCS). The Data Catalog and Archive are designed to be suitable for:

- Raw data generated by PSI instruments or simulations
- Derived data produced by processing the raw input data
- Data required to reproduce PSI research and publications, e.g. FAIR data

All data added to the data catalog must either not be classified or have a classification level of "normal". You are not allowed to add any personal or private data. You are not allowed to use the data catalog as a backup system. Data must come from scientific activities pursued at PSI. If data from external partner institutes is to be stored, this requires a dedicated contract signed by the management.

The service is based on the catalog system SciCat, documented at  and , which is an open source system that allows datasets to be ingested and retrieved in different ways, matching the requirements of the respective use cases. The use cases differ in the level of automation provided.

Data is always stored in terms of `datasets`, which you can think of as a collection of files combined with administrative and scientific metadata.

This manual describes how you can use this service by following the main steps in the lifecycle of data management:

- Definition and ingestion of metadata
- Archiving of the datasets
- Retrieving of datasets
- Publishing of datasets
- Retention of datasets

Note: as of today (June 2021) the services can only be used from within the PSI intranet, with the exception of the published data, which is by definition publicly available. Although the service itself can be used from any operating system, the command line and GUI tools currently offered are available only for Linux and Windows platforms.

## The Concept of Datasets

For the following it is useful to have a better understanding of the concept of a dataset. A dataset is a logical grouping of potentially many files. It is up to the scientist to define datasets from the files. When defining datasets, take the following conditions into account:

- a dataset is the smallest unit for adding metadata
- a dataset is the smallest unit for data handling (archiving and retrieval)
- a dataset is the smallest unit for publication (DOI assignment)

Therefore you need to find a compromise between putting too few or too many files into a single dataset.

`Ingestion` of datasets means that you make data known to the data catalog by providing both metadata about the dataset and the file listing comprising the dataset. For each dataset a persistent identifier (PID) is automatically created.

It is important to note that the data catalog is a "passive" system in the sense that it has to be told if new data arrives.
The data catalog has no direct access to the file systems containing the actual files. In contrast, the **datasetIngestor** program is run from systems which have access to the data files.

The datasets always belong to a so-called ownerGroup. Only members of this group have access to the data, unless the dataset is published. At PSI there are two types of ownerGroups:

- p-groups, starting with the letter "p". They are used for experimental
  data linked to a proposal system. They are managed by the digital
  user office DUO
- a-groups, starting with "a-", for any other data to be archived

Once data is contained in the data catalog, this information is considered to be stored permanently. However, after a retention period the connected raw data files may actually be deleted. In this case the dataset is **marked** as deleted in the data catalog, but the data catalog entry persists, in agreement with the FAIR principles.

Warning: you should not modify the files which make up your dataset after the dataset has been ingested into the data catalog. This means that you should only ingest the data once you are sure that no further modifications of the files will take place. The subsequent archive job will only take care of the files which existed at ingest time and will otherwise return an error message and not archive the data at all.

## Getting started

You will need a PSI account, and this account needs to be a member of so-called `p-groups`, which are managed by the PSI digital user office proposal system and are usually linked to a principal investigator (PI). This is required to define the authorization for the data, i.e. who is allowed to see which datasets.

In addition to these so-called `user accounts` there are a couple of pre-defined `functional accounts` which are used for automated processes. In particular, each beamline has one such functional account, e.g. called sls-tomcat, which can be used for automated ingestion of new data or to query all data generated at a given beamline. These accounts are only defined in the data catalog system and are given to the respective beamline managers.

If your data cannot be linked to this proposal system you can still use the services, but you may need to ask for the creation of a so-called `a-group` and become a member of this group. You can order an `a-group` via ServiceNow under `PSI Service Catalog` -> `IT` -> `Identity & Access Management` -> `Order Group / Project drive`. Under `Account Type` choose `Archive Group / Project Drive`. You will be asked about the group members. The group owner is not automatically added to the group members.

To use some of the software you may need to install it first. Installation is described in the appendix Installation of Tools.

## Ingest

### Important Update since April 14th 2022

For all command line tools, like the datasetIngestor, datasetRetriever etc., used with your own user account you **have** to use the `-token` option with a predefined API token SCICAT-TOKEN. Specifying username/password is not possible for normal users (this limitation is caused by the switch to a new authentication protocol). The easiest way to get such an API token is to sign in at , then follow the "Login with PSI account" button. This will bring you to the user settings page, from where you can copy the token with a click on the corresponding copy button.
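
As a minimal sketch of how such a token can then be passed to the command line tools (the environment variable name and the token value are placeholders, not something provided by the tools themselves):

```sh
# hypothetical placeholder: keep the token copied from the web GUI in a shell variable
export SCICAT_TOKEN="<token copied from the user settings page>"

# use it instead of username/password, e.g. for a dry run followed by the real ingestion
datasetIngestor -token "$SCICAT_TOKEN" metadata.json
datasetIngestor -token "$SCICAT_TOKEN" --ingest metadata.json
```
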
+ +For functional accounts, like beamline accounts you can +however continue to use username/password authentication instead. + +### Definition of input files + +First you need to specify the location of the files that you want to +have stored as one dataset. A typically example would be all the files +taken during a measurement, a scan etc or all output data from an +analysis of raw data files. In the simplest case it is sufficient to +define only one location. i.e. the **sourceFolder**, which should +contain all the files (and only those files) that make up the +dataset. In a more general case you can also specify an explicit list +of files and/or directories that you want to have assembled to a +dataset. See the datasetIngestor command options for details. The +appendix has a Recommended file structure for raw datasets on +disk. Please take note of the limitations of a dataset, as +defined in the appendix Dataset limitations. + +### Definition of metadata + +There are two types of metadata which need to be provided: + +- administrative metadata: specifies when and where the data is taken, + who is the owner etc. There are both mandatory and optional fields + and the fields depend on the type of the dataset + (generic/raw/derived), see Section 11.4 + below. The most important metadata field for ownership is the value + of the "ownerGroup" field, which defines a group name, whose member + have access to the data. +- scientific metadata: this depends on the scientific discipline and + can be defined in a flexible way by respective research group. It is + up to the research groups to define the format(s) of their data that + they want to support, ideally on an international level. See also + the section About Scientific Values and Units . + +Therefore the next step to ingesting your data into the catalog is to +prepare a file describing what data you have. This is called +metadata.json, and can be created with any text editor. It can in +principle be saved anywhere, but keeping it with your archived data is +recommended. + +Here is a minimalistic example the file metadata.json for raw data: + +```json +{ + "creationLocation": "/PSI/SLS/TOMCAT", + "sourceFolder": "/data/p16/p16623/June2020", + "type": "raw", + "ownerGroup":"p16623" +} +``` + +In the Appendix Use Case Examples you find many more examples for +metadata.json files, both for raw and derived data. Here is a more +real life example from Bio department: + +```json +{ + "principalInvestigator": "albrecht.gessler@psi.ch", + "creationLocation": "/PSI/EMF/JEOL2200FS", + "dataFormat": "TIFF+LZW Image Stack", + "sourceFolder": "/gpfs/group/LBR/pXXX/myimages", + "datasetName": "myimages", + "owner": "Wilhelm Tell", + "ownerEmail": "wilhelm.tell@psi.ch", + "type": "raw", + "description": "EM micrographs of amygdalin", + "ownerGroup": "a-12345", + "scientificMetadata": { + "sample": { + "name": "Amygdalin beta-glucosidase 1", + "uniprot": "P29259", + "species": "Apple" + }, + "dataCollection": { + "date": "2018-08-01" + }, + "microscopeParameters": { + "pixel size": { + "value": 0.885, + "unit": "A" + }, + "voltage": { + "value": 200, + "unit": "kV" + }, + "dosePerFrame": { + "value": 1.277, + "unit": "e/A2" + } + } + } +} +``` + +For manual creation of this file there are various helper tools +available. One option is to use the ScicatEditor + for creating these +metadata files. This is a browser-based tool specifically for +ingesting PSI data. Using the tool avoids syntax errors and provides +templates for common data sets and options. 
The finished JSON file can then be downloaded or copied into a text editor.

Another option for datasets on Ra or Merlin is to use the SciCat graphical interface from NoMachine, which provides a graphical way of selecting data to archive. This is particularly useful for data associated with a DUO experiment and p-group. Type `SciCat` to get started after loading the datacatalog module. The GUI also replaces the command-line ingestion described below.

After preparing your metadata.json file, run the following steps to ingest the data. First, perform a "dry run" that will check the metadata for errors. (Please note that in the following only the Linux-style notation is used; for the changes which apply to Windows see the separate section below.)

```sh
datasetIngestor metadata.json
```

It will ask for your PSI credentials and then print some info about the data to be ingested. This command will scan the files, make checks and extract further metadata information from the files and from the DUO system, unless the corresponding metadata fields are already provided in the metadata.json file. If there are no errors, proceed to the real ingestion:

```sh
datasetIngestor --ingest metadata.json
```

For particularly important datasets, you may also want to use the parameter `-tapecopies 2` to store redundant copies of the data. To give some numbers, 0.2–0.4% of the tapes get damaged, so there is a chance that archiving with only one copy will result in lost data in very few cases. Keep in mind that archiving with redundancy doubles the cost, which is billed to the responsible department.

You may be asked whether you want to copy the data first to a central system. This step is needed for all files which do not reside on one of the central file servers at PSI. In particular, local (Windows) workstations/PCs are likely to fall into this category.

There are more options for this command; just type

```sh
datasetIngestor
```

to see a list of available options. In particular, you can define an explicit list of files to be combined into a dataset, which can come from many different folders, by providing a filelisting.txt file containing this information in addition to the metadata.json file. The section Using the datasetIngestor Tool in the Appendix has more details.

### Special notes for the decentral use case

#### For Windows

For Windows you need to execute the corresponding commands inside a PowerShell and use the binary files ending in .exe, e.g.

```sh
datasetIngestor.exe -token SCICAT-TOKEN -user username:password -copy metadata.json
```

For Windows systems you can only use personal accounts and the data is always handled as the `decentral` case, i.e. the data will first be copied from the Windows machine to a central file server via scp. Therefore you need to specify all of the above parameters -token, -user and -copy.

Please also note the syntax that has to be used for the definition of the sourceFolder inside the metadata.json file; it has to be in the following form:

```json
"sourceFolder": "/C/Somefolder/etc",
```

i.e. **forward slashes** and **no colon** ":" after the drive letter like "C:" in this case.

#### For Linux

You must have a valid Kerberos ticket in order to be able to copy the data to the intermediate storage server. You can use the kinit command to get this ticket.
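
As a rough sketch of the Linux decentral workflow (reusing the placeholder SCICAT_TOKEN variable from the earlier sketch; file names are placeholders):

```sh
# check for a valid Kerberos ticket and obtain one if none is listed
klist
kinit

# decentral case: dry run, then ingest with copy to the central cache server
datasetIngestor -token "$SCICAT_TOKEN" -copy metadata.json
datasetIngestor -token "$SCICAT_TOKEN" -copy --ingest metadata.json
```
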
+ +### Summary of the different use cases + +The following table summarizes the different use cases + +| OS | sourceLocation | Account-Type | Neededed parameters | Comment | +|---------|--------------------|--------------|---------------------|------------------------------------------------| +| Linux | central | User | token | Fetch token via Web GUI discovery.psi.ch | +| Linux | central | Functional | username/pw | The tool fetches token from API server | +| Linux | anywhere/decentral | User | token + Kerb ticket | Token for API, Kerb ticket for copying data | +| Linux | anywhere/decentral | Functional | not supported | Functional accounts not existing on ssh server | +| Windows | central | User | (token) | Needs mounting of Windows FS to Arema | +| Windows | central | Functional | (username/pw) | dito | +| Windows | anywhere/decentral | User | token + username/pw | Token for API, username/pw for copying data | +| Windows | anywhere/decentral | Functional | not supported | Functional accounts not existing on ssh server | +|---------|--------------------|--------------|---------------------|------------------------------------------------| + +## Archive + +If there are no errors, your data has been accepted into the data +catalog! From now on, no changes should be made to the ingested +data. This is important, since the next step is for the system to copy +all the data to the CSCS Petabyte archive. Writing to tape is slow, so +this process may take some time, and it will fail if any +modifications are detected. + +Triggering the copy to tape can be done in 3 ways. Either you do it +automatically as part of the ingestion + +```sh +datasetIngestor --ingest --autoarchive metadata.json +``` + +In this case directly after ingestion a job is created to copy the +data to tape. Your dataset should now be in the queue. Check the data +catalog: . Your job should have status +'WorkInProgress'. You will receive an email when the ingestion is +complete. + +The second method is to use the discovery.psi.ch to interactively +start the archive job: click on the "Archivable" button. You should +see the newly ingested datasets. Select all the datasets you want to +have archived and click 'Archive'. You should see the status change +from 'datasetCreated' to 'scheduleArchiveJob'. This indicates that the +data is in the process of being transferred to CSCS. After some time +the dataset's status will change to 'datasetOnAchive' indicating the +data is stored. + +A third option is to use a command line version datasetArchiver. + +```console +datasetArchiver [options] (ownerGroup | space separated list of datasetIds) + +You must choose either an ownerGroup, in which case all archivable datasets +of this ownerGroup not yet archived will be archived. +Or you choose a (list of) datasetIds, in which case all archivable datasets +of this list not yet archived will be archived. + +List of options: + + -devenv + Use development environment instead or production + -localenv + Use local environment (local) instead or production + -noninteractive + Defines if no questions will be asked, just do it - make sure you know what you are doing + -tapecopies int + Number of tapecopies to be used for archiving (default 1) + -testenv + Use test environment (qa) instead or production + -token string + Defines optional API token instead of username:password + -user string + Defines optional username and password +``` + +## Retrieve + +Here we describe the retrieval via the command line tools. 
A retrieve +process via a desktop GUI application is described in the section SciCatArchiver GUI . + +Retrieving is two-step process: first the data is copied from tape to a +central retrieve server. From there the data needs to be copied to the +final destination system of your choice. + +### First Step + +For the first step: login to , find the +datasets you want to retrieve and selected all "Retrievable" datasets +by clicking the corresponding button. Finally click the retrieve +button. This will create a retrieve job. Once it is finshed you will +get an email. Depending on the size of your datasets this may take +minutes (e.g. for 1GB) up to days (e.g for 100TB) + +### Second Step (for Linux) + +#### Standard commands + +For the second step you can use the **datasetRetriever** command, which +uses the rsync protocol to copy the data to your destination. + +```console +Tool to retrieve datasets from the intermediate cache server of the tape archive +to the destination path on your local system. +Run script with 1 argument: + +datasetRetriever [options] local-destination-path + +Per default all available datasets on the retrieve server will be fetched. +Use option -dataset or -ownerGroup to restrict the datasets which should be fetched. + + -chksum + Switch on optional chksum verification step (default no checksum tests) + -dataset string + Defines single dataset to retrieve (default all available datasets) + -devenv + Use development environment (default is to use production system) + -ownergroup string + Defines to fetch only datasets of the specified ownerGroup (default is to fetch all available datasets) + -retrieve + Defines if this command is meant to actually copy data to the local system (default nothing is done) + -testenv + Use test environment (qa) (default is to use production system) + -token string + Defines optional API token instead of username:password + -user string + Defines optional username and password (default is to prompt for username and password) +``` + +For the program to check which data is available on the cache server +and if the catalog knows about these datasets, you can use: + +```console +datasetRetriever my-local-destination-folder + +======Checking for available datasets on archive cache server ebarema4in.psi.ch: + +Dataset ID Size[MB] Owner SourceFolder +=================================================================== +0f6fe8b3-d3f1-4cfb-a1af-0464c901a24f 1895 p16371 /sls/MX/Data10/e16371/20171017_E2/cbfs/2017-10-17_22-28-30_Na108_thau7_100degs_dtz60_f_500_Hz_Eth0_6200_eV +58f2037e-3f9b-4e08-8963-c70c3d29c068 1896 p16371 /sls/MX/Data10/e16371/20171017_E2/cbfs/2017-10-17_21-41-02_cca385a_lyso8_100degs_f_500_Hz_Eth0_6200_eV +cf8e5b25-9c76-49a7-80d9-fd38a71e0ef8 3782 p16371 /sls/MX/Data10/e16371/20171017_E2/cbfs/2017-10-18_10-15-41_na108_thau6_50degs_lowdose_pos1_f_500_Hz_Eth0_6200_eV +df1c7a17-2caa-41ee-af6e-c3cf4452af17 1893 p16371 /sls/MX/Data10/e16371/20171017_E2/cbfs/2017-10-17_20-58-34_cca385a_lyso3_100degs_f_500_Hz_Eth0_6200_eV +``` + +If you want you can skip the previous step and +directly trigger the file copy by adding the -retrieve flag: + +```sh +datasetRetriever -retrieve +``` + +This will copy the files into the destinationFolder using the original +sourceFolder path beneath the destinationFolder. This is especially +useful if you want to retrieve many datasets, which you expect to +appear in the same folder structure as originally. 
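
To illustrate, here is a sketch with a made-up destination folder, reusing a sourceFolder from the listing above and the placeholder SCICAT_TOKEN variable:

```sh
# retrieve all staged datasets into a local destination folder (placeholder path)
datasetRetriever -token "$SCICAT_TOKEN" -retrieve /scratch/restore

# a dataset with sourceFolder /sls/MX/Data10/e16371/20171017_E2/... would then
# appear under /scratch/restore/sls/MX/Data10/e16371/20171017_E2/...
```
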
+ +Optionally you can also verify the consistency of the copied data by +using the `-chksum` flag + +```sh +datasetRetriever -retrieve -chksum +``` + +If you just want to retrieve a single dataset do the following: + +```sh +datasetRetriever -retrieve -dataset +``` + +If you want to retrieve all datasets of a given **ownerGroup** do the following: + +```sh +datasetRetriever -retrieve -ownergroup +``` + +#### Expert commands + +If you prefer to have more control over the file transfer you are free +to type your own rsync commands, e.g. to simply the folders available + +in the retrieve cache do: + +```sh +rsync -e ssh --list-only pb-retrieve.psi.ch:retrieve/ +``` + +To actually copy the data over use: + +```sh +rsync -e ssh -av pb-retrieve.psi.ch:retrieve/{shortDatasetId} your-destination-target/ +``` + +In this case the shortDatsetId is the dataseid id without the PSI +prefix, e.g. for dataset PID +20.500.11935/08bc2944-e09e-48da-894d-0c5c47977553 the shortDatasetId +is 08bc2944-e09e-48da-894d-0c5c47977553 + +### Second Step (for Windows) + +The second step for Windows is instead using the sftp +protocol. Therefore any sftp client for Windows, like e.g. Filezilla, +can then be used to retrieve the data to your local Windows PC. The +following connection information must be provided, taking the command +line client access via powershell as an example + +```powershell +# for the production system +sftp -P 4222 your-username@pb-retrieve.psi.ch +# or for the test system +sftp -P 4222 your-username@pbt-retrieve.psi.ch +``` + +After the connection is built up you can copy files recursively, +e.g. using the "get -r \*" command. With the filezilla GUI you can +achieve the same via drag and drop operations + +## Ingest, Archive and Retrieve with QT desktop application SciCat + +### Important Update since April 14th 2022 + +You currently first need to get a token before you can use SciCat: the +easiest to get such an API token is to sign it at +, then follow the "Login with PSI account" +button. This will bring you to the user settings page, from where you +can copy the token with a click on the corresponding copy button. + +### General considerations + +`SciCat` is a GUI based tool designed to make initial +ingests easy. It is especially useful, to ingest data, which can not +be ingested automatically. Therefore it is designed in particular to +assist you when archiving derived datasets. Often, the archival of +derived data cannot be scheduled in advance, nor does it follow a +strict file structure. The `SciCat` GUI can help you to ingest such +datasets more easily. Yet, the ingestion of raw datasets is also +supported. Additionally, the tool also allows for the convenient +retrieval of datasets. + +### Getting started + +Currently, `SciCat` is supported on PSI-hosted **Linux** and **Windows** +systems and is accessible on the Ra cluster as part of the datacatalog +module: just type + +```sh +module load datacatalog +``` + +Then the software can be started with + +```sh +SciCat +``` + +On the SLS beamline consoles the software is also pre-installed in the +/work/sls/bin folder, which is part of the standard PATH variable. 
If you are not working on the Ra cluster you can download the software on Linux:

```sh
/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/SciCat;chmod +x ./SciCat
```

On Windows the executable can be downloaded from

```sh
https://gitlab.psi.ch/scicat/tools/-/blob/master/windows/SciCatGUI_Win10.zip
```

To start the GUI, unzip the archive and execute SciCat.exe.

### Login and permissions

After starting the GUI, you will be asked for a username and password. Please enter your PSI credentials. Functional accounts are not supported.

### Pgroup selection

The first step is always to select the pgroup. If there is no proposal assigned to this account, you will have to specify the information about the PI manually.

![img](./screenshots/proposal_found.png "Pgroup selection")

### Archiving

After selecting the files, you will be prompted with a metadata editor, where you can modify the general info, such as the dataset name, description, etc. Please make sure that you select the correct data type (raw or derived). As a general rule of thumb, it is a derived dataset if you can specify a raw dataset as input. If you want to ingest a derived dataset, you can specify the corresponding raw datasets on the "Input datasets" tab. To edit scientific metadata, switch to the "Scientific metadata" tab.

### Retrieval

Retrieving successfully archived datasets from SciCat is a two-step process. First you retrieve the data to an intermediate server. Once the data is there, you will be notified by email. The final step is to copy the data to the final destination on your machine. Both steps can be steered from within the GUI.

On the retrieve page, all datasets of your pgroup are listed. If the data has been archived successfully, the cell in the column "retrievable" is set to "true". To retrieve the data to the intermediate file server, select the datasets that you want to retrieve and click on "Retrieve". After the retrieval, the column "retrieved" is set to true. You are now able to start copying the data to your local machine by selecting the desired datasets and clicking on "Save".

### Settings

Additional settings, such as the default values for certain fields, can be modified in the settings panel (button in the lower left corner).

## Publish

As part of a publication workflow, datasets must become citable via a digital object identifier (DOI). This assignment is done as part of the publication workflow described below. The publication can then link to these published datasets using this DOI. The DOIs can link to both raw and/or derived datasets. The published data and therefore the DOI usually refer to a **set** of datasets, thus avoiding the need to list potentially thousands of individual dataset identifiers in a journal publication.

You publish data in the following way: go to , log in and select all the datasets that you want to publish under a new DOI.

![img](./screenshots/PublishingData1.png "Selecting Datasets to be published")

Then you add these datasets to a "shopping cart" by using the "add to Cart" button. You can repeat this as often as needed. Once finished with the selection you can "check out" the cart (click on the cart in the top bar) and pick the "Publish" action.

![img](./screenshots/PublishingData2.png "Check out cart")

This opens a form with prefilled information derived from the connected proposal data. This data can then be edited by the user and finally saved.
![img](./screenshots/PublishingData3.png "Defining metadata of published data")

This marks the data as to be published and makes it known to the data catalog, but the corresponding DOI is not yet made globally available. For this last step to happen, someone with access to this newly generated published data definition (e.g. the person defining the published data or the PI) has to hit the "register" button. This will trigger the global publication of the DOI. The links on  are usually updated within one day, so wait one day before following these links or searching for the DOI via the DOI resolver.

All published data definitions are then openly available via the so-called "Landing Pages", which are hosted on .

The file data itself becomes available via the normal data export system of the Ra cluster, which however requires a PSI account. If you want to make the file data anonymously available you need to send a corresponding request to  for now. This process is planned to be automated in the future.

For now all publications are triggered explicitly by a scientist, whenever necessary. In the future, an automated publication after the embargo period (default: 3 years after data taking) will be implemented in addition (details to be defined).

## Cleanup and Retention

This part is not yet defined.

## Troubleshooting

### Locale error message

If you get error messages like the following (so far this has only happened on Mac computers)

```console
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
....
```

then you need to prevent the Mac ssh client from sending the `LC_CTYPE` variable. Just follow the description in: 

### Invalid certificate messages

The following message can be safely ignored:

```console
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
```

It indicates that no Kerberos token was provided for authentication. You can avoid the warning by first running kinit (PSI Linux systems).

### Long Running copy commands

For decentral ingestion cases, the copy step is indicated by a message 'Running [/usr/bin/rsync -e ssh -avxz …'. It is expected that this step will take a long time and may appear to have hung. You can check which files have been successfully transferred using rsync:

```sh
rsync --list-only user_n@pb-archive.psi.ch:archive/UID/PATH/
```

where UID is the dataset ID (12345678-1234-1234-1234-123456789012) and PATH is the absolute path to your data. Note that rsync creates directories first and that the transfer order is not alphabetical in some cases, but it should be possible to see whether any data has transferred.

### Kerberos tickets

As a normal user you should have a valid Kerberos ticket. This is usually the case automatically on the centrally provided Linux machines. You can verify its existence with the "klist" command. In case no valid ticket is returned you have to get one using the "kinit" command. (Note: beamline accounts do not need this.)

```sh
klist
# if no Ticket listed get one by
kinit
```

### Instructions to set ACLs in AFS

In the AFS file system the user has to permit access to the sourceFolder by setting read and lookup ACL permissions for the AFS group "pb-archive".
The easiest way to achieve is to run the following +script with the sourceFolder as an argunent + +```sh +/afs/psi.ch/service/bin/pb_setacl.sh sourceFolder +``` + +This script must be run by a person who has the rights to modify the +access rights in AFS. + +## Appendix + +### Installation of Tools + +#### Access to the SciCat GUI + +For the access to the SciCat web-based user interface no software +needs to be installed, simply use your browser to go to +. + +#### Loading datacatalog tools on Clusters + +The latest datacatalog software is maintained in the PSI module system +on the main clusters (Ra, Merlin). To access it from PSI linux +systems, run the following command: + +```sh +module load datacatalog +``` + +#### (Non-standard Linux systems) Installing datacatalog tools + +If you do not have access to PSI modules (for instance, when archiving +from Ubuntu systems), then you can install the datacatalog software +yourself. These tools require 64-bit linux. + +I suggest storing the SciCat scripts in ~/bin so that they can be +easily accessed. + +```sh +mkdir -p ~/bin +cd ~/bin +/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/datasetIngestor +chmod +x ./datasetIngestor +/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/datasetRetriever +chmod +x ./datasetRetriever +/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/SciCat +chmod +x ./SciCat +``` + +When the scripts are updated you will be prompted to re-run some of +the above commands to get the latest version. + +You can call the ingestion scripts using the full path +(~/bin/datasetIngestor) or else add ~/bin to your unix PATH. To do so, +add the following line to your ~/.bashrc file: + +```sh +export PATH="$HOME/bin:$PATH" +``` + +#### Installation on Windows Systems + +On Windows the executables can be downloaded from the following URL, +just enter the address in abrowser and download the file + +```sh +https://gitlab.psi.ch/scicat/tools/-/blob/master/windows/datasetIngestor.exe +https://gitlab.psi.ch/scicat/tools/-/blob/master/windows/SciCatGUI_Win10.zip +``` + +#### Online work stations in beamline hutches + +The command line tools are pre-installed in /work/sls/bin. No further +action needed + +### Dataset limitations + +#### Size limitations + +- a single dataset should currently not have more than 400k files +- a single dataset should not be larger than 50 TB +- recommended size of a single dataset: between 1GB and 1TB + +#### SourceFolder and file names limitations + +The sourceFolder metadata and the name of the files can contain the following special characters: + +- \% +- \# +- \- +- \+ +- \. +- \: +- \= +- \@ +- \_ + +Any other special characters are not guaranteed to work. + +### Recommended file structure for raw datasets + +One recommended way of structuring your data on disk is the following: + +```txt +e12345 <--- user's group e-account, linked to a DUO proposal + + - sampleName <-- contains measurement for a given sample + - datasetfolder1 <-- name can be anything + ... in here all the files, and only the files + ... which make up a measurement + - datasetfolder2 <-- name can be anything + ... dito + - etc... + - derived-dataset1 (optional, for online processed data + name should contain "derived") + ... in here all the files and only the files + ... which make up the derived data + - derived-dataset2 + ... dito + + - nextSampleName... 
+ +e12375 <--- next user's group e-account +``` + +### Metadata Field Definitions + +The following table defines the mandatory and optional fields for the +administrative metadata, which have to be provided (status June +2021). All fields marked "m" are mandatory, the rest is optional. Some +fields are filled automatically if possible, see comments. For the +most recent status see this URL + and follow the link +called "Model" for the respective datamodel (e.g. Dataset), visible +e.g. inside the GET API call section. Or see the model definitions as +defined in the SciCat backend, see the json files in + + +All "Date" fields must follow the date/time format defined in RFC +3339, section 5.6, see + +#### Metadata field definitions for datasets of type "base" + +| field | type | must | comment | +|------------------|---------------|------|------------------------------------------------------| +| pid | string | m | filled by API automatically, do *not* provide this | +| owner | string | m | filled by datasetIngestor if missing | +| ownerEmail | string | | filled by datasetIngestor if missing | +| orcidOfOwner | string | | | +| contactEmail | string | m | filled by datasetIngestor if missing | +| datasetName | string | | set to "tail" of sourceFolder path if missing | +| sourceFolder | string | m | | +| size | number | | autofilled when OrigDataBlock created | +| packedSize | number | | autofilled when DataBlock created | +| creationTime | date | m | filled by API if missing | +| type | string | m | (raw, derived...) | +| validationStatus | string | | | +| keywords | Array[string] | | | +| description | string | | | +| classification | string | | filled by API or datasetIngestor if missing | +| license | string | | filled by datasetIngestor if missing (CC By-SA 4.0) | +| version | string | | autofilled by API | +| doi | string | | filled as part of publication workflow | +| isPublished | boolean | | filled by datasetIngestor if missing (false) | +| ownerGroup | string | m | must be filled explicitly | +| accessGroups | Array[string] | | filled by datasetIngestor to beamline specific group | +| | | | derived from creationLocation | +| | | | e.g. /PSI/SLS/TOMCAT -> accessGroups=["slstomcat"] | + +#### Additional fields for type="raw" + +| field | type | must | comment | +|-----------------------|--------|------|------------------------------------------------------------| +| principalInvestigator | string | m | filled in datasetIngestor if missing (proposal must exist) | +| endTime | date | | filled from datasetIngetor if missing | +| creationLocation | string | m | see known Instrument list below | +| dataFormat | string | | | +| scientificMetadata | object | | | +| proposalId | string | | filled by API automatically if missing | + +#### Additional fields for type="derived" + +| field | type | must | comment | +|--------------------|---------------|------|---------| +| investigator | string | m | | +| inputDatasets | Array[string] | m | | +| usedSoftware | string | m | | +| jobParameters | object | | | +| jobLogData | string | | | +| scientificMetadata | object | | | + +### About Scientific Values and Units + +It is strongly recommended that physical quantities are stored in the + following format (the field names are just examples, the structure + with the two fields "value" and "unit" is important here) + +```json +"scientificMetadata": { + ... 
+ "beamlineParameters": { + "Ring current": { + "value": 402.246, + "unit": "mA" + }, + "Beam energy": { + "value": 22595, + "unit": "eV" + } + } + ... +} +``` + +In future for such quantities the data catalog will automatically add +two additional fields "valueSI" and "unitSI" with the corresponding +SI units. The rationale for this is to support value queries in a +reliable manner across datasets with potentially different units +chosen for the same quantity: + +```json +"scientificMetadata": { + ... + "beamlineParameters": { + "Ring current": { + "value": 402.246, + "unit": "mA", + "valueSI": 0.402246, + "unitSI": "A" + }, + "Beam energy": { + "value": 22595, + "unit": "eV", + "valueSI": 3.6201179E-15 + "unitSI":"J" + } + } + ... +} +``` + +### Use Case Examples + +#### Use Case: Manual ingest using datasetIngestor program + +1. Overview + + Data owners may want to define in an adhoc manner the creation of + datasets in order to allow a subsequent archiving of the data. The + most important use cases are + + - raw data from a beamline + - derived data created by a scientist + - archiving of historic data + - archiving of data stored on local (decentral) file storage systems + + For this purpose a command line client **datasetIngestor** is provided + which allows to + + - ingest the meta data and files + - optionally copy the data to a central cache file server + + The necessary steps to use this tool are now described: + +2. Preparation of the meta data + + You need to create a file metadata.json defining at least the + administrative metadata + +3. Example of minimal json file for raw data: + + ```json + { + "creationLocation": "/PSI/SLS/TOMCAT", + "sourceFolder": "/scratch/devops", + "type": "raw", + "ownerGroup":"p16623" + } + ``` + +4. Example for raw data including scientific metadata + + ```json + { + "principalInvestigator": "egon.meier@psi.ch", + "creationLocation": "/PSI/SLS/TOMCAT", + "dataFormat": "Tomcat pre HDF5 format 2017", + "sourceFolder": "/sls/X02DA/data/e12345/Data10/disk3/817b_B2_", + "owner": "Egon Meier", + "ownerEmail": "egon.meier@psi.ch", + "type": "raw", + "description": "Add a short description here for this dataset ...", + "ownerGroup": "p12345", + "scientificMetadata": { + "beamlineParameters": { + "Monostripe": "Ru/C", + "Ring current": { + "value": 0.402246, + "unit": "A" + }, + "Beam energy": { + "value": 22595, + "unit": "eV" + } + }, + "detectorParameters": { + "Objective": 20, + "Scintillator": "LAG 20um", + "Exposure time": { + "value": 0.4, + "unit": "s" + } + }, + "scanParameters": { + "Number of projections": 1801, + "Rot Y min position": { + "value": 0, + "unit": "deg" + }, + "Inner scan flag": 0, + "File Prefix": "817b_B2_", + "Sample In": { + "value": 0, + "unit": "m" + }, + "Number of darks": 10, + "Rot Y max position": { + "value": 180, + "unit": "deg" + }, + "Angular step": { + "value": 0.1, + "unit": "deg" + }, + "Number of flats": 120, + "Sample Out": { + "value": -0.005, + "unit": "m" + }, + "Flat frequency": 0, + "Number of inter-flats": 0 + } + } + } + ``` + +5. 
Example of minimal json file for derived data: + + ```json + { + "sourceFolder": "/data/test/myExampleData", + "type": "derived", + "ownerGroup": "p12345", + "investigator": "federika.marone@psi.ch", + "inputDatasets": [ + "/data/test/input1.dat", + "20.500.11935/000031f3-0675-4d30-b5ca-b9c674bcf027" + ], + "usedSoftware": [ + "https://gitlab.psi.ch/MyAnalysisRepo/tomcatScripts/commit/60629a1cbef493a26aac626602ba8f1a6c9e14d2" + ] + } + ``` + + - owner and contactEmail will be filled automatically + - important: in case you ingest derived datasets with a **beamline + account** , such as slstomcat (instead of a personal account), you **have** to add the beamline account + to the accessGroups field like this: + + ```json + { + "sourceFolder": "/data/test/myExampleData", + "type": "derived", + "ownerGroup": "p12345", + "accessGroups": [ + "slstomcat" + ], + "investigator": "", + "inputDatasets": [ + "/data/test/input1.dat", + "20.500.11935/000031f3-0675-4d30-b5ca-b9c674bcf027" + ], + "usedSoftware": [ + "https://gitlab.psi.ch/MyAnalysisRepo/tomcatScripts/commit/60629a1cbef493a26aac626602ba8f1a6c9e14d2" + ] + } + ``` + + 1. Extended derived example + + ```json + { + "sourceFolder": "/some/folder/containg/the/derived/data", + "owner": "Thomas Meier", + "ownerEmail": "thomas.meier@psi.ch", + "contactEmail": "eugen.mueller@psi.ch", + "type": "derived", + "ownerGroup": "p13268", + "creationTime": "2011-09-14T12:08:25.000Z", + "investigator": "thomas.meier@psi.ch", + "inputDatasets": [ + "20.500.11935/000031f3-0675-4d30-b5ca-b9c674bcf027", + "20.500.11935/000031f3-0675-4d30-b5ca-b9c674bcf028" + ], + "usedSoftware": [ + "https://gitlab.psi.ch/MyAnalysisRepo/tomcatScripts/commit/60629a1cbef493a26aac626602ba8f1a6c9e14d2" + ] + } + ``` + +6. Optionally: preparation of a file listing file + + **Please note**: The following is only needed, if you do not want to + store all files in a source Folder, but just a **subset**. In this case + you can specify an explicit list of files and directories. Only the + files specified in this list will be stored as part of the + dataset. For the directories in this list it is implied that they are + recursively descended and all data contained in the directory is taken + Here is an example for a filelisting.txt file. All entries in this + textfiles are path names **relativ** to the sourceFolder specified in + the metadata.json file + + Example of filelisting.txt + + ```txt + datafile1 + datafile2 + specialStuff/logfile1.log + allFilesInThisDirectory + ``` + +7. Optionally: for multiple datasets to be created + + If you have many sourceFolders containing data, each to be turned into + a dataset then the easiest method is to define a 'folderlisting.txt' + file. (the file must have exactly this name). This is a useful option + to archive large amounts of "historic" data. + + Each line in this file is the absolute path to the sourceFolder In + this case it is assumed, that the metadata.json file is valid for all + datasets and that **all** files inside the sourceFolder are part of the + dataset (i.e. you can **not** combine the filelisting.txt option with the + folderlisting.txt option) + + Example of folderlisting.txt + + ```txt + /some/folder/containg/the/data/raw/sample1 + /some/folder/containg/the/data/raw/sample2 + /some/folder/containg/the/data/derived + ``` + +8. Starting the ingest + + Just run the following command in a terminal as a first test if + everything is okay. 
This is a so called "dry run" and nothing will + actually be stored, but the consistency of the data will be checked + and the folders will be scanned for files + + ```sh + datasetIngestor metadata.json [filelisting.txt | 'folderlisting.txt'] + ``` + + You will be prompted for your username and password. + + If everything looks as expected you should now repeat the command with + the "–ingest" flag to actually store the dataset(s) in the data + catalog + + ```sh + datasetIngestor --ingest metadata.json [filelisting.txt | 'folderlisting.txt'] + ``` + + When the job is finshed all needed metadata will be ingested into the + data catalog (and for decentral data the data will be copied to the + central cache file server). + + In addition you have the option to directly trigger the archiving of + the data to tape by adding the –autoarchive flag. Do this only if you + sure that this data is worth to be archived + +#### Use Case: Automated ingest of raw datasets from beamline or instruments + +1. Using the datasetIngestor Tool + + This method usually requires a fully automatic ingestion procedure, + since data is produced at regular times and in a predictable way. + + For each beamline this automation is done together with the experts + from the data catalog group and potentially with the help from the + controls /detector-integration groups. Please contact + to get in touch. + + The recommended method is to define preparation scripts, which + automatically produce the files metadata.json and optionally + filelisting.txt or folderlisting.txt (for multiple datasets) as you + would do in the manual case described in the previous section. + Example of such scripts can be provided by the data catalog team, + please contact for further help. The effort to + implement such a system depends very much on the availability of the + meta data as well as on the effort to convert the existing metadata to + the data catalog format inside the converter processes. If the meta + data is already available in some form in a file an estimate of the + order of magnitude of work needed per instrument is 1-2 person-weeks + of work, including test runs etc. But efforts may also be considerably + smaller or larger in some cases. + + Then you run the datasetIngestor program usually under a beamline + specic account. In order to run fully automatic all potential + questions asked interactively by the program must be pre-answered + through a set of command line options: + + ```console + datasetIngestor [options] metadata-file [filelisting-file|'folderlisting.txt'] + + -allowexistingsource + Defines if existing sourceFolders can be reused + -autoarchive + Option to create archive job automatically after ingestion + -copy + Defines if files should be copied from your local system to a central server before ingest. 
+ -devenv + Use development environment instead of production environment (developers only) + -ingest + Defines if this command is meant to actually ingest data + -linkfiles string + Define what to do with symbolic links: (keep|delete|keepInternalOnly) (default "keepInternalOnly") + -noninteractive + If set no questions will be asked and the default settings for all undefined flags will be assumed + -tapecopies int + Number of tapecopies to be used for archiving (default 1) + -testenv + Use test environment (qa) instead of production environment + -user string + Defines optional username:password string + ``` + + - here is a typical example using the MX beamline at SLS as an example + and ingesting a singel dataset with meta data defined in + metadata.json + + ```sh + datasetIngestor -ingest \ + -linkfiles keepInternalOnly \ + -allowexistingsource \ + -user slsmx:XXXXXXXX \ + -noninteractive \ + metadata.json + ``` + + This command must be called by the respective data acquisition systems + at a proper time, i.e. after all the files from the measurement run + have been written to disk and all metadata became available (often + this meta data is collected by the controls system). + +2. HDF5 Files + + If the raw data exists in form of HDF5 files, there is a good chance + that the meta data can be extracted from the HDF5 files' meta data. In + such a case the meta data extraction must be done as part of the part + beamline preparation scripts. Example of such HDF5 extraction scripts + exist which can the basis of a beamline specific solution, again + please contact . These scripts will mostly need + minimal adjustments for each beamline, mainly specifying the filter + conditions defining which of the meta data in the HDF5 file are to be + considered meta data for the data catalog. + + Very often the whole dataset will only consist of one HDF5 file, thus + also simplifying the filelisting definition. + +#### Use Case: Ingest datasets stored on decentral systems + +These are data that you want to have archived for some reason, but are +not available on central file systems. Data from the old PSI archiv +system fall in this category or data from local PCs, Laptops or +instruments. If this data is not assigned to a p-group (given via the +DUO digital user office, usually linked to a proposal) then you must +assign this data to an a-group. The allocation of an "a-group" for +this kind of data must be done beforehand by a tool currently in +preparation at AIT. The "a-group" will define the ownership and +therefor the access to the data by listing a number of users onside the +group. + +Otherwise just follow the description in the section "Manual ingest +using datasetIngestor program" and use the option -copy, e.g. + +```sh +datasetIngestor -autoarchive -copy -ingest metadata.json +``` + +This command will copy the data to a central rsync server, from where +the archive system can then copy the files to tape, in this case +(option -autoarchive) the copy to archive tapes will happen automatically + +On recent versions of the datasetIngestor program the program detects +automatically,if your data lies on central or decentral systems. In +the latter case it will, after a confirmation by the user, copy the +data automatically to the rsync cache server, even if the copy flag is +not provided. + +#### Use Case: Ingest datasets from simulations/model calculations + +These can be treated like datasets of type "base" or "raw". 
In the +latter case specify the field "creationLocation" as the name of the +server or cluster which produced the simulation files. Otherwise the +procedure is identical to the previous use case. + +### Policy settings and email notifications + +The archiving process can further be configured via **policy** +parameters, e.g. if you require a second tape copy for very +precious data. Also the details about the notification settings by +email for both archive and retrieve jobs can be set here. You reach +the menu to set the policy values via the submenu `Policies` +in the dropdown menu to the top right of the GUI. + +Emails are automatically sent at the start of every archive and +retrieve jobs as well as when the job finishes. The email is sent to +the person creating the jobs. In addition it is sent the list of +emails defined in the policy settings. Per default this list is empty +but can be extended by you. In the policy one can also switch off the +email notification. However emails about error conditions (which can +be either user caused or system caused) can not be switched off. Such +error messages are always sent to the user as well as the archive +administrators. + +For user caused errors the user has to take action to repair the +situation. Typically error cases are, that the user has moved or +removed part or all of the files before archiving them. System errors +on the other hand have their reason inside the catalog and archive +system (e.g. a network connection problem or similar) and will be +taken care of by the archive managers. In such a case the user +creating the job will be informed manually, when the problem is fixed +again. + +Policy parameters can be defined at site level or at ownerGroup +level. For each ownerGroup at least one manager must be defined +(e.g. a principal investigator (PI) via the linked proposal +information) in the policy model (field "manager") . Only the manager +can change the policy settings at ownerGroup level, but all group +mebers can see them. + +Changes to this policy settings only effect future dataset creation +and archiving + +| Parameter | Allowed Values | Default | Level | +|-------------------------------|---------------------------------|---------|------------| +| policyPublicationShiftInYears | small positive integer, e.g. 3 | 3 | Site (ro) | +| policyRetentionShiftInYears | small positive integer, e.g. 10 | 10 | Site (ro) | +|-------------------------------|---------------------------------|---------|------------| +| autoArchive | true/false | false | ownerGroup | +| tapeRedundancy | low/medium/(high) | low | ownerGroup | +| archiveEmailNotification | true/false | false | ownerGroup | +| archiveEmailsToBeNotified | Array of additional emails | [] | ownerGroup | +| retrieveEmailNotification | true/false | false | ownerGroup | +| retrieveEmailsToBeNotified | Array of additional emails | [] | ownerGroup | +| (archiveDelayInDays) | small positive integer, e.g. 7 | 0 | ownerGroup | + +The job Initiator always gets an email unless email notification is disabled. + +### Analyzing Metadata Statistics + +Note: This service is currently (summer 2021) out of order due to the +missing JupyterHub environment. + +#### Overview + +It is possible to analyze the information about datasets amd jobs etc, +e.g. for statistical purposes. A Jupyterhub based solution was chosen +as a tool for allowing to do this analysis in a flexible and +interactive manner. 
+
+### Using datasetIngestor inside wrapper scripts (for developers)
+
+The command datasetIngestor returns with a return code of zero if the
+command could be executed successfully. If the program fails for some
+reason, the return code will be one. Typical examples of failures are
+that files cannot be found or cannot be accessed. Other possibilities
+are that the catalog system is not available, e.g. during scheduled
+maintenance periods. All output describing the reason for the failure
+is written to STDERR. Please have a look at this output to understand
+what the reason for the failure was. If you need help please contact
+
+Please note: it is the task of the wrapper script to test the return
+code and to repeat the command once all conditions for a successful
+execution are fulfilled.
+
+In case the ingest finishes successfully, the persistent identifiers
+(PID) of the resulting dataset(s) are written to STDOUT, one line per
+dataset.
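+
+As an illustration only, a minimal bash wrapper along these lines could
+look as follows. The retry count, waiting time and metadata file name
+are arbitrary placeholders, and a real acquisition system will have its
+own scheduling logic:
+
+```sh
+#!/bin/bash
+# Illustrative sketch: repeat the ingest a few times and collect the PIDs
+# that datasetIngestor prints to STDOUT on success (one line per dataset).
+METADATA=metadata.json
+
+for attempt in 1 2 3; do
+    if pids=$(datasetIngestor -ingest -noninteractive "$METADATA"); then
+        echo "Ingest succeeded, resulting PIDs:"
+        echo "$pids"
+        exit 0
+    fi
+    echo "Ingest attempt $attempt failed (see STDERR output above), retrying in 10 minutes" >&2
+    sleep 600
+done
+
+echo "Giving up after 3 failed attempts" >&2
+exit 1
+```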
+
+### Ingestion of datasets which should never be published
+
+For datasets which should never be published you should add the
+following fields to your metadata.json file at ingest time:
+
+```json
+"datasetlifecycle": {
+    "publishable":false,
+    "dateOfPublishing":"2099-12-31T00:00:00.000Z",
+    "archiveRetentionTime":"2099-12-31T00:00:00.000Z"
+}
+```
+
+- this will move the time of publication to a date in the far future
+  (end of 2099 in this case)
+
+### Retrieving proposal information
+
+In case you need information about the principal investigator you can
+use the command datasetGetProposal, which returns the proposal
+information for a given ownerGroup. The tool can be downloaded as
+follows:
+
+```sh
+/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/datasetGetProposal;chmod +x ./datasetGetProposal
+```
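+
+The exact command line options of datasetGetProposal may differ between
+versions, so the following is only a hypothetical sketch; check the
+built-in help for the options your copy actually supports. The
+ownerGroup "p12345" and the account are placeholders:
+
+```sh
+# Hypothetical usage sketch: first show the available options, then query
+# the proposal belonging to a (placeholder) ownerGroup with a functional account.
+./datasetGetProposal -help
+./datasetGetProposal -user slsmx:XXXXXXXX p12345
+```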
+
+### Link to Group specific descriptions
+
+- BIO department:
+
+### List of known creationLocation for raw data
+
+The following values for the creationLocation should be used for the
+respective beamlines. They are derived from the identifiers used
+inside the digital user office DUO.
+
+#### SLS
+
+| Beamline           | creationLocation         | Ingest Account     |
+|--------------------|--------------------------|--------------------|
+| Adress-RIXS        | /PSI/SLS/ADRESS-RIXS     | slsadress-rixs     |
+| Adress-SX-ARPES    | /PSI/SLS/ADRESS-SX-ARPES | slsadress-sx-arpes |
+| cSAXS              | /PSI/SLS/CSAXS           | slscsaxs           |
+| Micro-XAS          | /PSI/SLS/MICRO-XAS       | slsmicro-xas       |
+| Micro-XAS-Femto    | /PSI/SLS/MICRO-XAS-FEMTO | slsmicro-xas-femto |
+| MS-Powder          | /PSI/SLS/MS-POWDER       | slsms-powder       |
+| MS-Surf-Diffr      | /PSI/SLS/MS-SURF-DIFFR   | slsms-surf-diffr   |
+| Nano-XAS           | /PSI/SLS/NANOXAS         | slsnanoxas         |
+| Pearl              | /PSI/SLS/PEARL           | slspearl           |
+| Phoenix            | /PSI/SLS/PHOENIX         | slsphoenix         |
+| Pollux             | /PSI/SLS/POLLUX          | slspollux          |
+| MX (PX,PXII,PXIII) | /PSI/SLS/MX              | slsmx              |
+| SIM                | /PSI/SLS/SIM             | slssim             |
+| Sis-Cophee         | /PSI/SLS/SIS-COPHEE      | slssis-cophee      |
+| Sis-Hrpes          | /PSI/SLS/SIS-HRPES       | slssis-hrpes       |
+| Super-XAS          | /PSI/SLS/SUPER-XAS       | slssuper-xas       |
+| Tomcat             | /PSI/SLS/TOMCAT          | slstomcat          |
+| VUV                | /PSI/SLS/VUV             | slsvuv             |
+| XIL-II             | /PSI/SLS/XIL-II          | slsxil-ii          |
+| Xtreme             | /PSI/SLS/XTREME          | slsxtreme          |
+
+The connected email distribution lists are {ingestAccount}@psi.ch
+
+#### SwissFEL
+
+| Beamline    | creationLocation                 | Ingest Account             |
+|-------------|----------------------------------|----------------------------|
+| Alvra       | /PSI/SWISSFEL/ARAMIS-ALVRA       | swissfelaramis-alvra       |
+| Bernina     | /PSI/SWISSFEL/ARAMIS-BERNINA     | swissfelaramis-bernina     |
+| Cristallina | /PSI/SWISSFEL/ARAMIS-CRISTALLINA | swissfelaramis-cristallina |
+| Furka       | /PSI/SWISSFEL/ATHOS-FURKA        | swissfelathos-furka        |
+| Maloja      | /PSI/SWISSFEL/ATHOS-MALOJA       | swissfelathos-maloja       |
+
+The connected email distribution lists are {ingestAccount}@psi.ch
+
+#### SINQ
+
+| Instrument | creationLocation   | Ingest Account |
+|------------|--------------------|----------------|
+| AMOR       | /PSI/SINQ/AMOR     | sinqamor       |
+| DMC        | /PSI/SINQ/DMC      | sinqdmc        |
+| EIGER      | /PSI/SINQ/EIGER    | sinqeiger      |
+| FOCUS      | /PSI/SINQ/FOCUS    | sinqfocus      |
+| HRPT       | /PSI/SINQ/HRPT     | sinqhrpt       |
+| ICON       | /PSI/SINQ/ICON     | sinqicon       |
+| Morpheus   | /PSI/SINQ/MORPHEUS | sinqmorpheus   |
+| NARZISS    | /PSI/SINQ/NARZISS  | sinqnarziss    |
+| NEUTRA     | /PSI/SINQ/NEUTRA   | sinqneutra     |
+| POLDI      | /PSI/SINQ/POLDI    | sinqpoldi      |
+| RITA-II    | /PSI/SINQ/RITA-II  | sinqrita-ii    |
+| SANS-I     | /PSI/SINQ/SANS-I   | sinqsans-i     |
+| SANS-II    | /PSI/SINQ/SANS-II  | sinqsans-ii    |
+| TASP       | /PSI/SINQ/TASP     | sinqtasp       |
+| ZEBRA      | /PSI/SINQ/ZEBRA    | sinqzebra      |
+
+The connected email distribution lists are {ingestAccount}@psi.ch
+
+#### SmuS
+
+| Instrument | creationLocation   | Ingest Account |
+|------------|--------------------|----------------|
+| Dolly      | /PSI/SMUS/DOLLY    | smusdolly      |
+| GPD        | /PSI/SMUS/GPD      | smusgpd        |
+| GPS        | /PSI/SMUS/GPS      | smusgps        |
+| HAL-9500   | /PSI/SMUS/HAL-9500 | smushal-9500   |
+| LEM        | /PSI/SMUS/LEM      | smuslem        |
+| FLAME      | /PSI/SMUS/FLAME    | smusflame      |
+
+The connected email distribution lists are {ingestAccount}@psi.ch
+
+## Update History of Ingest Manual
+
+| Date               | Updates                                                                       |
+|--------------------|-------------------------------------------------------------------------------|
+| 10. September 2018 | Initial Release                                                               |
+| 6. October 2018    | Added warning section to not modify data after ingest                         |
+| 10. October 2018   | ownerGroup field must be defined explicitly                                   |
+| 28. October 2018   | Added section on datasetRetriever tool                                        |
+| 20. November 2018  | Remove ssh key handling description (use Kerberos)                            |
+| 3. December 2018   | Restructure archive step, add autoarchive flag                                |
+| 17. January 2019   | Update on automatically filled values, more options for datasetIngestor      |
+| 22. January 2019   | Added description for API access for script developers, 2 new commands       |
+|                    | datasetArchiver and datasetGetProposal                                        |
+| 22. February 2019  | Added known beamlines/instruments (creationLocation) value list               |
+| 24. February 2019  | datasetIngestor use cases for automated ingests using beamline accounts      |
+| 23. April 2019     | Added AFS infos and available central storage, need for Kerberos tickets      |
+| 23. April 2019     | Availability of commands on RA cluster via pmodules                           |
+| 3. May 2019        | Added size limitation infos                                                   |
+| 9. May 2019        | Added hints for accessGroups definition for derived data                      |
+|                    | Added infos about email notifications                                         |
+| 10. May 2019       | Added ownerGroup filtered retrieve option, decentral case auto detect         |
+| 7. June 2019       | Feedback from Manuel added                                                    |
+| 21. Oct 2019       | New version of CLI tools to deal with edge cases (blanks in sourcefolder,     |
+|                    | dangling links, ingest for other person, need for kerberos ticket as user)    |
+| 14. November 2019  | Restructuring of manual, new CLI tools, auto kinit login                      |
+|                    | Progress indicators, checksum test updated                                    |
+| 20. January 2020   | Auto fill principalInvestigator if missing                                    |
+| 3. March 2020      | Added Jupyter notebook analysis section                                       |
+| 5. March 2020      | Add hint for datasets not to be published                                     |
+| 19. March 2020     | Added hint that analysis Jupyter tool is in pilot phase only                  |
+| 19. March 2020     | Added recommendation concerning unit handling for physical quantities         |
+| 9. July 2020       | Added GUI tool SciCatArchiver (developer: Klaus Wakonig)                      |
+| 11. July 2020      | Installation of SciCatArchiver on non-Ra system                               |
+| 14. July 2020      | Added publication workflow and recommended file structure chapter             |
+| 16. July 2020      | Updated SciCat GUI deployment information                                     |
+| 31. July 2020      | New deploy location, added policy parameters, new recommended file structure  |
+| 27. August 2020    | Added Windows Support information                                             |
+| 10. Sept 2020      | Corrected example JSON syntax in one location                                 |
+| 23. November 2020  | Corrected instructions for using the SciCat GUI on Windows 10                 |
+| 19. February 2020  | Added info about proposalId link                                              |
+| 24. June 2021      | Major restructuring of full document for easier readability                   |
+| 9. Dec 2021        | Corrected spelling of value/units convention                                  |
+| 23. April 2022     | Added hint to use -token option for CLI and SciCat GUI as normal user         |
+| 2. Dec 2022        | Extended ingest use cases description of needed parameters Win+Linux          |
+| 21. Dec 2023       | Include redundancy risks and costs and file name limitations                  |
diff --git a/mkdocs.yml b/mkdocs.yml
index 1f6a16d..38505bc 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -13,6 +13,7 @@ markdown_extensions:
 - admonition
 - toc:
     permalink: true
+- pymdownx.superfences
 
 # Configuration
 theme: