From b0338eace0e9b10ae8dd0e1e475370c44be63b9b Mon Sep 17 00:00:00 2001 From: minotti_c Date: Wed, 18 Feb 2026 16:00:46 +0100 Subject: [PATCH] fix: align docs with major new backend changes --- docs/ingestorManual.md | 184 ++++++++++++++--------------------------- 1 file changed, 60 insertions(+), 124 deletions(-) diff --git a/docs/ingestorManual.md b/docs/ingestorManual.md index b6e8087..97b0905 100644 --- a/docs/ingestorManual.md +++ b/docs/ingestorManual.md @@ -42,13 +42,6 @@ main steps in the lifecycle of the data management: - Publishing of datasets - Retention of datasets -Note: as of today (June 2021) the services can be only be used from -within the PSI intranet with the exception of the published data, -which is by definition publicly available. Although the service itself -can be used from any operating system, the command line and -GUI tools currently offered are available only for Linux and Windows -platforms. - ## The Concept of Datasets For the following it is useful to have a better understanding of the @@ -127,6 +120,25 @@ first. Installation is described in the appendix Installation of Tools ## Ingest +### Important Update since January 2025 + +The SciCat stack has gone through a major upgrade, thus the command +line syntax has changed. + +The separate executables (like datasetIngestor, datasetRetriever...) +were combined into one scicat-cli executable, with each executable's +features available as commands given as the first parameter to this executable. + +These commands bear the same names as the former executables. +The general syntax change is that if you called +./[COMMAND] [flags] before, now it's ./scicat-cli [COMMAND] [flags]. + +Furthermore, the use of single hyphen, multi-letter flags is now discontinued, +as it went against general convention. So, in practical terms, -[long_flag_name] +and --[long_flag_name] were both accepted, but now only the latter is accepted. 
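As an illustration of the two syntax changes described above, here is a minimal sketch of a shell helper that prints the new-style equivalent of a legacy command line. The function name `to_scicat_cli` is hypothetical and is not part of scicat-cli; the helper only rewrites text and does not run anything. It assumes that every argument starting with a single hyphen followed by two or more characters is a long flag, so a negative number such as `-10` passed as a value would also be rewritten.

```sh
#!/bin/sh
# Hypothetical helper (not part of scicat-cli): prints the new-style
# equivalent of a legacy command line without executing it.
to_scicat_cli() {
    cmd=${1##*/}      # drop a leading ./ or directory from the old executable name
    shift
    out="scicat-cli $cmd"
    for arg in "$@"; do
        case $arg in
            --*)  ;;                  # already a double-hyphen flag, keep as is
            -??*) arg="-$arg" ;;      # -longflag becomes --longflag
        esac
        out="$out $arg"
    done
    printf '%s\n' "$out"
}

to_scicat_cli ./datasetIngestor -ingest -autoarchive metadata.json
# prints: scicat-cli datasetIngestor --ingest --autoarchive metadata.json
```

Single-letter flags such as `-h` are left untouched, since only single-hyphen, multi-letter flags were discontinued.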
+ + There are backward compatible scripts in the [GitHub repo](https://github.com/paulscherrerinstitute/scicat-cli?tab=readme-ov-file#backwards-compatibility-with-v2). + ### Important Update since April 14th 2022 For all commandline tools, like the datasetIngestor, datasetRetriever @@ -237,7 +249,7 @@ real life example from Bio department: For manual creation of this file there are various helper tools available. One option is to use the ScicatEditor - for creating these + for creating these metadata files. This is a browser-based tool specifically for ingesting PSI data. Using the tool avoids syntax errors and provides templates for common data sets and options. The finished JSON file can @@ -257,7 +269,7 @@ Linux type notation is used. For the changes which apply to Windows see the separate section below) ```sh -datasetIngestor metadata.json +scicat-cli datasetIngestor metadata.json ``` It will ask for your PSI credentials and then print some info @@ -268,7 +280,7 @@ already provided in the metadata.json file. If there are no errors, proceed to the real ingestion: ```sh -datasetIngestor --ingest metadata.json +scicat-cli datasetIngestor --ingest metadata.json ``` For particularly important datasets, you may also want to use the @@ -286,7 +298,7 @@ workstations/PCs are likely to fall in this category. There are more options for this command, just type ```sh -datasetIngestor +scicat-cli datasetIngestor ``` to see a list of available options. In particular you can define @@ -303,7 +315,7 @@ For Windows you need execute the corresponding commands inside a powershell and use the binary files ending in .exe, e.g. ```sh -datasetIngestor.exe -token SCICAT-TOKEN -user username:password -copy metadata.json +scicat-cli.exe datasetIngestor --token SCICAT-TOKEN --user username:password --copy metadata.json ``` For Windows systems you can only use personal accounts and the data is @@ -358,7 +370,7 @@ Triggering the copy to tape can be done in 3 ways. 
Either you do it automatically as part of the ingestion ```sh -datasetIngestor --ingest --autoarchive metadata.json +scicat-cli datasetIngestor --ingest --autoarchive metadata.json ``` In this case directly after ingestion a job is created to copy the @@ -379,31 +391,14 @@ data is stored. A third option is to use a command line version datasetArchiver. ```console -datasetArchiver [options] (ownerGroup | space separated list of datasetIds) +scicat-cli datasetArchiver [options] (ownerGroup | space separated list of datasetIds) +``` You must choose either an ownerGroup, in which case all archivable datasets of this ownerGroup not yet archived will be archived. Or you choose a (list of) datasetIds, in which case all archivable datasets of this list not yet archived will be archived. -List of options: - - -devenv - Use development environment instead or production - -localenv - Use local environment (local) instead or production - -noninteractive - Defines if no questions will be asked, just do it - make sure you know what you are doing - -tapecopies int - Number of tapecopies to be used for archiving (default 1) - -testenv - Use test environment (qa) instead or production - -token string - Defines optional API token instead of username:password - -user string - Defines optional username and password -``` - ## Retrieve Here we describe the retrieval via the command line tools. A retrieve @@ -429,39 +424,22 @@ minutes (e.g. for 1GB) up to days (e.g for 100TB) For the second step you can use the **datasetRetriever** command, which uses the rsync protocol to copy the data to your destination. -```console Tool to retrieve datasets from the intermediate cache server of the tape archive to the destination path on your local system. Run script with 1 argument: -datasetRetriever [options] local-destination-path +```console +scicat-cli datasetRetriever [options] local-destination-path +``` Per default all available datasets on the retrieve server will be fetched. 
-Use option -dataset or -ownerGroup to restrict the datasets which should be fetched. - - -chksum - Switch on optional chksum verification step (default no checksum tests) - -dataset string - Defines single dataset to retrieve (default all available datasets) - -devenv - Use development environment (default is to use production system) - -ownergroup string - Defines to fetch only datasets of the specified ownerGroup (default is to fetch all available datasets) - -retrieve - Defines if this command is meant to actually copy data to the local system (default nothing is done) - -testenv - Use test environment (qa) (default is to use production system) - -token string - Defines optional API token instead of username:password - -user string - Defines optional username and password (default is to prompt for username and password) -``` +Use option --dataset or --ownergroup to restrict the datasets which should be fetched. For the program to check which data is available on the cache server and if the catalog knows about these datasets, you can use: ```console -datasetRetriever my-local-destination-folder +scicat-cli datasetRetriever my-local-destination-folder ======Checking for available datasets on archive cache server ebarema4in.psi.ch: @@ -477,7 +455,7 @@ If you want you can skip the previous step and directly trigger the file copy by adding the -retrieve flag: ```sh -datasetRetriever -retrieve +scicat-cli datasetRetriever --retrieve ``` This will copy the files into the destinationFolder using the original @@ -489,19 +467,19 @@ Optionally you can also verify the consistency of the copied data by using the `-chksum` flag ```sh -datasetRetriever -retrieve -chksum +scicat-cli datasetRetriever --retrieve --chksum ``` If you just want to retrieve a single dataset do the following: ```sh -datasetRetriever -retrieve -dataset +scicat-cli datasetRetriever --retrieve --dataset ``` If you want to retrieve all datasets of a given **ownerGroup** do the following: ```sh 
-datasetRetriever -retrieve -ownergroup +scicat-cli datasetRetriever --retrieve --ownergroup ``` #### Expert commands @@ -559,7 +537,7 @@ easiest to get such an API token is to sign it at button. This will bring you to the user settings page, from where you can copy the token with a click on the corresponding copy button. -### General considerations + ## Publish @@ -822,42 +800,23 @@ module load datacatalog If you do not have access to PSI modules (for instance, when archiving from Ubuntu systems), then you can install the datacatalog software -yourself. These tools require 64-bit linux. +yourself. Linux, macOS and Windows versions are available. I suggest storing the SciCat scripts in ~/bin so that they can be easily accessed. -```sh -mkdir -p ~/bin -cd ~/bin -/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/datasetIngestor -chmod +x ./datasetIngestor -/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/datasetRetriever -chmod +x ./datasetRetriever -/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/SciCat -chmod +x ./SciCat -``` +To download and install the binaries, please follow these steps: -When the scripts are updated you will be prompted to re-run some of -the above commands to get the latest version. +1. Go to the [GitHub releases page](https://github.com/paulscherrerinstitute/scicat-cli/releases) -You can call the ingestion scripts using the full path -(~/bin/datasetIngestor) or else add ~/bin to your unix PATH. To do so, -add the following line to your ~/.bashrc file: +2. Choose the release of interest (the latest release is recommended) -```sh -export PATH="$HOME/bin:$PATH" -``` +3. Download the file from the Assets of the chosen release, making sure to select the one compatible with your OS -#### Installation on Windows Systems +4. Decompress the asset -On Windows the executables can be downloaded from the following URL, -just enter the address in abrowser and download the file +5. 
Open the folder and run the required APP (grant execute permissions if required) -```sh -https://gitlab.psi.ch/scicat/tools/-/blob/master/windows/datasetIngestor.exe -https://gitlab.psi.ch/scicat/tools/-/blob/master/windows/SciCatGUI_Win10.zip -``` #### Online work stations in beamline hutches @@ -1245,7 +1204,7 @@ chosen for the same quantity: and the folders will be scanned for files ```sh - datasetIngestor metadata.json [filelisting.txt | 'folderlisting.txt'] + scicat-cli datasetIngestor metadata.json [filelisting.txt | 'folderlisting.txt'] ``` You will be prompted for your username and password. @@ -1255,7 +1214,7 @@ chosen for the same quantity: catalog ```sh - datasetIngestor --ingest metadata.json [filelisting.txt | 'folderlisting.txt'] + scicat-cli datasetIngestor --ingest metadata.json [filelisting.txt | 'folderlisting.txt'] ``` When the job is finshed all needed metadata will be ingested into the @@ -1295,31 +1254,11 @@ chosen for the same quantity: Then you run the datasetIngestor program usually under a beamline specic account. In order to run fully automatic all potential questions asked interactively by the program must be pre-answered - through a set of command line options: + through a set of command line options. The command below shows all + available options: ```console - datasetIngestor [options] metadata-file [filelisting-file|'folderlisting.txt'] - - -allowexistingsource - Defines if existing sourceFolders can be reused - -autoarchive - Option to create archive job automatically after ingestion - -copy - Defines if files should be copied from your local system to a central server before ingest. 
- -devenv - Use development environment instead of production environment (developers only) - -ingest - Defines if this command is meant to actually ingest data - -linkfiles string - Define what to do with symbolic links: (keep|delete|keepInternalOnly) (default "keepInternalOnly") - -noninteractive - If set no questions will be asked and the default settings for all undefined flags will be assumed - -tapecopies int - Number of tapecopies to be used for archiving (default 1) - -testenv - Use test environment (qa) instead of production environment - -user string - Defines optional username:password string + scicat-cli datasetIngestor [options] metadata-file [filelisting-file|'folderlisting.txt'] ``` - here is a typical example using the MX beamline at SLS as an example @@ -1327,11 +1266,11 @@ chosen for the same quantity: metadata.json ```sh - datasetIngestor -ingest \ - -linkfiles keepInternalOnly \ - -allowexistingsource \ - -user slsmx:XXXXXXXX \ - -noninteractive \ + scicat-cli datasetIngestor --ingest \ + --linkfiles keepInternalOnly \ + --allowexistingsource \ + --user slsmx:XXXXXXXX \ + --noninteractive \ metadata.json ``` @@ -1372,7 +1311,7 @@ Otherwise just follow the description in the section "Manual ingest using datasetIngestor program" and use the option -copy, e.g. 
```sh -datasetIngestor -autoarchive -copy -ingest metadata.json +scicat-cli datasetIngestor --autoarchive --copy --ingest metadata.json ``` This command will copy the data to a central rsync server, from where @@ -1500,13 +1439,10 @@ following curl command: ```sh # for "functional" accounts -curl -X POST --header 'Content-Type: application/json' -d '{"username":"YOUR-LOGIN","password":"YOUR-PASSWORD"}' 'https://dacat-qa.psi.ch/api/v3/Users/login' - -# for normal user accounts -curl -X POST --header 'Content-Type: application/json' -d '{"username":"YOUR-LOGIN","password":"YOUR-PASSWORD"}' 'https://dacat-qa.psi.ch/auth/msad' +curl -X POST --header 'Content-Type: application/json' -d '{"username":"YOUR-LOGIN","password":"YOUR-PASSWORD"}' 'https://dacat-qa.psi.ch/api/v3/auth/login' # reply if succesful: -{"id":"NQhe3...","ttl":1209600,"created":"2019-01-22T07:03:21.422Z","userId":"5a745bde4d12b30008020843"} +{"access_token": "NQhe3...", "id":"NQhe3...","ttl":1209600,"created":"2019-01-22T07:03:21.422Z","userId":"5a745bde4d12b30008020843"} ``` The "id" field contains the access token, which you copy in to the corresponding field at the top of the explorer page. @@ -1559,7 +1495,7 @@ use the command datasetGetProposal, which returns the proposal information for a given ownerGroup ```sh -/usr/bin/curl -O https://gitlab.psi.ch/scicat/tools/raw/master/linux/datasetGetProposal;chmod +x ./datasetGetProposal +scicat-cli datasetGetProposal ``` ### Link to Group specific descriptions