Major changes

- Separate CPU and GPU lanes
- Add --cluster parameter to all lanes (fixes job submission on the new
  gmerlin6 cluster; see the sketch below)
- Rename all lanes to mirror (somewhat) the partition names
- Add remove_old_lanes.sh to clean up the old lanes after the rename
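The --cluster sketch (templates abridged from the cluster_info.json diffs below):

# before: sbatch submits to the login node's default cluster
"qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
# after: name the cluster explicitly; --parsable prints "jobid;cluster",
# so the cut keeps only the job id for cryosparc
"qsub_cmd_tpl" : "bash -c 'sbatch --parsable --cluster=gmerlin6 \"{{ script_path_abs }}\" | cut -d \";\" -f 1'",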
2021-06-01 21:24:50 +02:00
parent c2f1ba9000
commit 51569c5eff
19 changed files with 240 additions and 129 deletions

README.md

@@ -4,10 +4,11 @@ This repository collects cryosparc 'lanes' suitable for running jobs at PSI.
## Lanes
- *merlin6*. Default lane. 7 day time limit, uses all resources.
- *merlin6-big*. Requests exclusive node usage, with full memory. If GPUs are
needed, requests 4 GPUs with 11GB of video memory
- *merlin6-short*. Short requests. 1 hour time limit.
- *gpu*. Default lane. 7 day time limit, uses all resources.
- *gpu-big*. Requests GPUs with 11GB of video memory, plus 100GB of RAM.
- *gpu-short*. Short requests. 2 hour time limit.
- *gpu-rtx2080ti*. Specialty queue for jobs that benefit from the RTX 2080Ti cards.
- *cpu-daily*. CPU-only lane. 1 day time limit.
## Installing
@@ -23,6 +24,9 @@ automatically:
dev/install_filters.sh
Check that the cuda version in cluster_script.sh is correct. If not, update the
files as described in 'Change CUDA versions' below.
Finally, connect the newly modified scripts to cryosparc. This should be done
on the same machine cryosparc runs on. To connect all lanes:
@@ -40,10 +44,31 @@ version, update git:
git pull
(Optional) If you want to remove old lanes (e.g. when updating from scripts
v1.5.0 to v2.0.0), run
./remove_old_lanes.sh
Then, connect the lanes to your cryosparc cluster as in installation:
./connect_all.sh
## Change CUDA versions
The scripts load cuda/10.0.130 by default. Newer cryosparc versions may require
using a newer cuda version. This can be changed by either loading a different
cuda module while installing/upgrading the worker, or using the following
command:
module load cuda/10.0.130   # or whichever cuda version you are switching to
$CRYOSPARC_HOME/cryosparc2_worker/bin/cryosparcw newcuda $CUDA_PREFIX
The submission scripts then need to be updated to match this cuda version:
module load cuda/10.0.130   # loading the module sets $CUDA_VERSION
sed -ri "s|cuda/\S+|cuda/$CUDA_VERSION|" */cluster_script.sh
./connect_all.sh
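To double-check that every lane picked up the new version (a sanity check, not
part of the original procedure):

grep -h "module load cuda" */cluster_script.sh | sort -u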
## Developers
If you plan on committing changes to this repository, make sure you run

connect_all.sh

@@ -8,7 +8,7 @@ if [[ ! -x "$CRYOSPARCM" ]]; then
exit 1
fi
LANES=("merlin6" "merlin6-big" "merlin6-short" "merlin6-rtx2080ti")
LANES=("gpu" "gpu-big" "gpu-short" "gpu-rtx2080ti" "cpu-daily")
success=1
for lane in "${LANES[@]}"; do
@@ -32,4 +32,5 @@ done
if [[ $success == 0 ]]; then
echo "Errors occurred. See above." >&2
exit 1
fi
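connect_all.sh iterates over this list and registers each lane. To (re)connect
a single lane by hand, the equivalent is roughly the following sketch;
cryosparcm cluster connect picks up cluster_info.json and cluster_script.sh
from the working directory:

cd gpu-short
/data/user/$USER/cryosparc/cryosparc2_master/bin/cryosparcm cluster connect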

cpu-daily/cluster_info.json Normal file

@@ -0,0 +1,11 @@
{
"name" : "cpu-daily",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "bash -c 'sbatch --parsable --cluster=merlin6 \"{{ script_path_abs }}\" | cut -d \";\" -f 1'",
"qstat_cmd_tpl" : "squeue --cluster=merlin6 -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel --cluster=merlin6 {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo --cluster=merlin6",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}
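After jinja substitution, the qsub template above becomes an ordinary shell
command; with a hypothetical script path it would run:

# --parsable prints "jobid;cluster" on multi-cluster setups; cut keeps the job id
bash -c 'sbatch --parsable --cluster=merlin6 "/path/to/project/J1/queue_sub_script.sh" | cut -d ";" -f 1'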

cpu-daily/cluster_script.sh Normal file

@@ -0,0 +1,70 @@
#!/usr/bin/env bash
# cryoSPARC cluster submission script template for SLURM
# Lane: cpu-daily v1.5.0 (2021-05-28)
#
# If you edit this file, run 'cryosparcm cluster connect'
{# This template uses jinja2 syntax. #}
# Available variables:
# script_path_abs={{ script_path_abs }}
# - the absolute path to the generated submission script
# run_cmd={{ run_cmd }}
# - the complete command-line string to run the job
# num_cpu={{ num_cpu }}
# - the number of CPUs needed
# num_gpu={{ num_gpu }}
# - the number of GPUs needed. Note: the code will use this many GPUs
# starting from dev id 0. The cluster scheduler or this script have the
# responsibility of setting CUDA_VISIBLE_DEVICES so that the job code
# ends up using the correct cluster-allocated GPUs.
# ram_gb={{ ram_gb }}
# - the amount of RAM needed in GB
# job_dir_abs={{ job_dir_abs }}
# - absolute path to the job directory
# project_dir_abs={{ project_dir_abs }}
# - absolute path to the project dir
# job_log_path_abs={{ job_log_path_abs }}
# - absolute path to the log file for the job
# worker_bin_path={{ worker_bin_path }}
# - absolute path to the cryosparc worker command
# run_args={{ run_args }}
# - arguments to be passed to cryosparcw run
# project_uid={{ project_uid }}
# - uid of the project
# job_uid={{ job_uid }}
# - uid of the job
# job_creator={{ job_creator }}
# - name of the user that created the job (may contain spaces)
# cryosparc_username={{ cryosparc_username }}
# - cryosparc username of the user that created the job (usually an email)
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}.out
#SBATCH --error={{ job_log_path_abs }}.err
#SBATCH --ntasks=1
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu={{ ((ram_gb*1000)/num_cpu)|int }}M
#SBATCH --time=1-00:00:00
#SBATCH --partition=daily
#SBATCH --cpus-per-task={{ num_cpu }}
{%- if num_gpu > 0 %}
# Use GPU cluster
echo "Error: GPU requested. Use a GPU lane instead." >&2
exit 1
{%- else %}
# Print hostname, for debugging
echo "Job Id: $SLURM_JOBID"
echo "Host: $SLURM_NODELIST"
# CPU-only lane: no cuda module needed
module purge
srun {{ run_cmd }}
EXIT_CODE=$?
echo "Exit code: $EXIT_CODE"
exit $EXIT_CODE
{%- endif %}
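A note on the --mem-per-cpu line above: Slurm wants memory per CPU while
cryosparc reports a job total (ram_gb), hence the division. A worked example
with hypothetical values:

# ram_gb=24, num_cpu=6  ->  ((24*1000)/6)|int = 4000
#SBATCH --mem-per-cpu=4000M   # 6 CPUs x 4000M = the requested 24GB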

gpu-big/cluster_info.json Normal file

@@ -0,0 +1,11 @@
{
"name" : "gpu-big",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "bash -c 'sbatch --parsable --cluster=gmerlin6 \"{{ script_path_abs }}\" | cut -d \";\" -f 1'",
"qstat_cmd_tpl" : "squeue --cluster=gmerlin6 -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel --cluster=gmerlin6 {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo --cluster=gmerlin6",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

gpu-big/cluster_script.sh

@@ -1,17 +1,13 @@
#!/usr/bin/env bash
# cryoSPARC cluster submission script template for SLURM
# Lane: merlin6-big v1.5.0 (2021-05-28)
# Lane: gpu-big v1.5.0 (2021-05-28)
#
# This is the 'big' GPU configuration, meaning it requests exclusive access to
# the 4x GTX1080Ti/RTX2080Ti nodes (with 11GB video RAM). Please use the normal gpu
# lane if possible.
# This is the 'big' GPU configuration, meaning it reserves additional memory
# and uses the 11GB video cards. Please use the normal gpu lane if possible.
#
# If you edit this file, run 'cryosparcm cluster connect'
{# This template uses jinja2 syntax. #}
{%- macro _min(a, b) -%}
{%- if a <= b %}{{a}}{% else %}{{b}}{% endif -%}
{%- endmacro -%}
# Available variables:
# script_path_abs={{ script_path_abs }}
@@ -49,23 +45,21 @@
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}.out
#SBATCH --error={{ job_log_path_abs }}.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --threads-per-core=1
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --mem=102400
#SBATCH --time=7-00:00:00
#SBATCH --partition=gpu
#SBATCH --cluster=gmerlin6
#SBATCH --gpus={{ num_gpu }}
#SBATCH --cpus-per-gpu=4
#SBATCH --constraint=gpumem_11gb
{%- if num_gpu == 0 %}
# Use CPU cluster
#SBATCH --partition=general
#SBATCH --cpus-per-task={{ num_cpu }}
echo "Error: No GPU requested. Use a CPU lane instead." >&2
exit 1
{%- else %}
# Use GPU cluster
#SBATCH --partition=gpu
#SBATCH --cluster=gmerlin6
#SBATCH --gpus=4
#SBATCH --constraint=gpumem_11gb
{%- endif %}
# Print hostname, for debugging
echo "Job Id: $SLURM_JOBID"
@@ -73,10 +67,11 @@ echo "Host: $SLURM_NODELIST"
# Make sure this matches the version of cuda used to compile cryosparc
module purge
module load cuda/10.0.130
module load cuda/10.0.130 gcc/10.3.0
srun {{ run_cmd }}
EXIT_CODE=$?
echo "Exit code: $EXIT_CODE"
exit $EXIT_CODE
{%- endif %}
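Compared to the old exclusive-node request (--exclusive, --mem=0), the new
directives size the job explicitly; under standard Slurm semantics a 4-GPU
gpu-big job gets:

# --gpus=4 with --cpus-per-gpu=4  ->  16 CPUs
# --mem=102400                    ->  100GB of RAM
# --constraint=gpumem_11gb        ->  scheduled only on nodes with 11GB cards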

gpu-rtx2080ti/cluster_info.json Normal file

@@ -0,0 +1,11 @@
{
"name" : "gpu-rtx2080ti",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "bash -c 'sbatch --parsable --cluster=gmerlin6 \"{{ script_path_abs }}\" | cut -d \";\" -f 1'",
"qstat_cmd_tpl" : "squeue --cluster=gmerlin6 -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel --cluster=gmerlin6 {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo --cluster=gmerlin6",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

gpu-rtx2080ti/cluster_script.sh

@@ -5,9 +5,6 @@
# If you edit this file, run 'cryosparcm cluster connect'
{# This template uses jinja2 syntax. #}
{%- macro _min(a, b) -%}
{%- if a <= b %}{{a}}{% else %}{{b}}{% endif -%}
{%- endmacro -%}
# Available variables:
# script_path_abs={{ script_path_abs }}
@@ -49,25 +46,17 @@
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu={{ ((ram_gb*1000)/num_cpu)|int }}M
#SBATCH --time=7-00:00:00
{%- if num_gpu == 0 %}
# Use CPU cluster
#SBATCH --partition=general
#SBATCH --cpus-per-task={{ num_cpu }}
{%- else %}
# Use GPU cluster
#SBATCH --partition=gpu
#SBATCH --cluster=gmerlin6
#SBATCH --gpus=geforce_rtx_2080_ti:{{ num_gpu }}
#SBATCH --gres=gpu:geforce_rtx_2080_ti:{{ num_gpu }}
{%- if num_gpu <= 2 %}
#SBATCH --cpus-per-task={{ _min(num_cpu, 8) }}
{%- else %}
{# Slurm requests too many CPU sometimes; restrict to 20 per machine #}
{%- set num_nodes = (num_gpu/4) | round(0, 'ceil') | int %}
#SBATCH --cpus-per-task={{ _min(num_cpu, 20*num_nodes) }}
{%- endif %}
{%- endif %}
#SBATCH --cpus-per-gpu=4
{%- if num_gpu == 0 %}
# Use CPU cluster
echo "Error: No GPU requested. Use a CPU lane instead." >&2
exit 1
{%- else %}
# Print hostname, for debugging
echo "Job Id: $SLURM_JOBID"
@@ -75,10 +64,11 @@ echo "Host: $SLURM_NODELIST"
# Make sure this matches the version of cuda used to compile cryosparc
module purge
module load cuda/10.0.130
module load cuda/10.0.130 gcc/10.3.0
srun {{ run_cmd }}
EXIT_CODE=$?
echo "Exit code: $EXIT_CODE"
exit $EXIT_CODE
{%- endif %}
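The switch from --gpus=geforce_rtx_2080_ti:N to --gres=gpu:geforce_rtx_2080_ti:N
keeps the typed request but uses Slurm's per-node gres form; a two-GPU job, for
example, renders to:

#SBATCH --gres=gpu:geforce_rtx_2080_ti:2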

gpu-short/cluster_info.json Normal file

@@ -0,0 +1,11 @@
{
"name" : "gpu-short",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "bash -c 'sbatch --parsable --cluster=gmerlin6 \"{{ script_path_abs }}\" | cut -d \";\" -f 1'",
"qstat_cmd_tpl" : "squeue --cluster=gmerlin6 -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel --cluster=gmerlin6 {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo --cluster=gmerlin6",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

gpu-short/cluster_script.sh

@@ -1,16 +1,13 @@
#!/usr/bin/env bash
# cryoSPARC cluster submission script template for SLURM
# Lane: merlin6-short v1.5.0 (2021-05-28)
# Lane: gpu-short v1.5.0 (2021-05-28)
#
# This is the 'short' configuration, intended for interactive jobs and rapid
# experimentation. Jobs are limited to 1 hour.
# experimentation. Jobs are limited to 2 hours.
#
# If you edit this file, run 'cryosparcm cluster connect'
{# This template uses jinja2 syntax. #}
{%- macro _min(a, b) -%}
{%- if a <= b %}{{a}}{% else %}{{b}}{% endif -%}
{%- endmacro -%}
# Available variables:
# script_path_abs={{ script_path_abs }}
@@ -51,26 +48,17 @@
#SBATCH --ntasks=1
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu={{ ((ram_gb*1000)/num_cpu)|int }}M
#SBATCH --time=2:00:00
#SBATCH --partition=gpu-short
#SBATCH --cluster=gmerlin6
#SBATCH --gpus={{ num_gpu }}
#SBATCH --cpus-per-gpu=4
{%- if num_gpu == 0 %}
# Use CPU cluster
#SBATCH --partition=hourly
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task={{ num_cpu }}
echo "Error: No GPU requested. Use a CPU lane instead." >&2
exit 1
{%- else %}
# Use GPU cluster
#SBATCH --partition=gpu-short
#SBATCH --cluster=gmerlin6
#SBATCH --time=2:00:00
#SBATCH --gpus={{ num_gpu }}
{%- if num_gpu <= 2 %}
#SBATCH --cpus-per-task={{ _min(num_cpu, 8) }}
{%- else %}
{# Slurm requests too many CPU sometimes; restrict to 20 per machine #}
{%- set num_nodes = (num_gpu/4) | round(0, 'ceil') | int %}
#SBATCH --cpus-per-task={{ _min(num_cpu, 20*num_nodes) }}
{%- endif %}
{%- endif %}
# Print hostname, for debugging
echo "Job Id: $SLURM_JOBID"
@@ -78,10 +66,11 @@ echo "Host: $SLURM_NODELIST"
# Make sure this matches the version of cuda used to compile cryosparc
module purge
module load cuda/10.0.130
module load cuda/10.0.130 gcc/10.3.0
srun {{ run_cmd }}
EXIT_CODE=$?
echo "Exit code: $EXIT_CODE"
exit $EXIT_CODE
{%- endif %}

gpu/cluster_info.json Normal file

@@ -0,0 +1,11 @@
{
"name" : "gpu",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "bash -c 'sbatch --parsable --cluster=gmerlin6 \"{{ script_path_abs }}\" | cut -d \";\" -f 1'",
"qstat_cmd_tpl" : "squeue --cluster=gmerlin6 -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel --cluster=gmerlin6 {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo --cluster=gmerlin6",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

gpu/cluster_script.sh

@@ -1,13 +1,10 @@
#!/usr/bin/env bash
# cryoSPARC cluster submission script template for SLURM
# Lane: merlin6 v1.5.0 (2021-05-28)
# Lane: gpu v1.5.0 (2021-05-28)
#
# If you edit this file, run 'cryosparcm cluster connect'
{# This template uses jinja2 syntax. #}
{%- macro _min(a, b) -%}
{%- if a <= b %}{{a}}{% else %}{{b}}{% endif -%}
{%- endmacro -%}
# Available variables:
# script_path_abs={{ script_path_abs }}
@@ -49,24 +46,16 @@
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu={{ ((ram_gb*1000)/num_cpu)|int }}M
#SBATCH --time=7-00:00:00
{%- if num_gpu == 0 %}
# Use CPU cluster
#SBATCH --partition=general
#SBATCH --cpus-per-task={{ num_cpu }}
{%- else %}
# Use GPU cluster
#SBATCH --partition=gpu
#SBATCH --cluster=gmerlin6
#SBATCH --gpus={{ num_gpu }}
{%- if num_gpu <= 2 %}
#SBATCH --cpus-per-task={{ _min(num_cpu, 8) }}
{%- else %}
{# Slurm requests too many CPU sometimes; restrict to 20 per machine #}
{%- set num_nodes = (num_gpu/4) | round(0, 'ceil') | int %}
#SBATCH --cpus-per-task={{ _min(num_cpu, 20*num_nodes) }}
{%- endif %}
{%- endif %}
#SBATCH --cpus-per-gpu=4
{%- if num_gpu == 0 %}
# Use CPU cluster
echo "Error: No GPU requested. Use a CPU lane instead." >&2
exit 1
{%- else %}
# Print hostname, for debugging
echo "Job Id: $SLURM_JOBID"
@@ -74,10 +63,11 @@ echo "Host: $SLURM_NODELIST"
# Make sure this matches the version of cuda used to compile cryosparc
module purge
module load cuda/10.0.130
module load cuda/10.0.130 gcc/10.3.0
srun {{ run_cmd }}
EXIT_CODE=$?
echo "Exit code: $EXIT_CODE"
exit $EXIT_CODE
{%- endif %}

gpu/slurm-3228.out Normal file

@@ -0,0 +1 @@
merlin-g-001.psi.ch

gpu/slurm-380113.out Normal file

@@ -0,0 +1 @@
merlin-c-215.psi.ch

merlin6-big/cluster_info.json (deleted)

@@ -1,11 +0,0 @@
{
"name" : "merlin6-big",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

merlin6-rtx2080ti/cluster_info.json (deleted)

@@ -1,11 +0,0 @@
{
"name" : "merlin6-rtx2080ti",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

merlin6-short/cluster_info.json (deleted)

@@ -1,11 +0,0 @@
{
"name" : "merlin6-short",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

merlin6/cluster_info.json (deleted)

@@ -1,11 +0,0 @@
{
"name" : "merlin6",
"worker_bin_path" : "/data/user/USERNAME/cryosparc/cryosparc2_worker/bin/cryosparcw",
"cache_path" : "/scratch/",
"send_cmd_tpl" : "{{ command }}",
"qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
"qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
"qinfo_cmd_tpl" : "sinfo",
"transfer_cmd_tpl" : "cp {{ src_path }} {{ dest_path }}"
}

remove_old_lanes.sh Executable file

@@ -0,0 +1,38 @@
#!/bin/bash
# Remove old lanes from cryosparc
# usage: remove_old_lanes.sh [lanes]
#
# If no lanes are specified, defaults to the lanes installed by connect_all.sh
: "${CRYOSPARCM:=/data/user/$USER/cryosparc/cryosparc2_master/bin/cryosparcm}"
if [[ ! -x "$CRYOSPARCM" ]]; then
echo "ERROR: Unable to find cryosparcm at $CRYOSPARCM" >&2
exit 1
fi
if [[ "$#" -gt 0 ]]; then
LANES=("$@")
else
LANES=("gpu" "gpu-big" "gpu-short" "gpu-rtx2080ti" "cpu-daily")
# old lane names
LANES+=("merlin6" "merlin6-big" "merlin6-short" "merlin6-rtx2080ti")
fi
success=1
for lane in "${LANES[@]}"; do
echo "$CRYOSPARCM" cluster remove "$lane" >&2
"$CRYOSPARCM" cluster remove "$lane"
if [[ $? != 0 ]]; then
echo "ERROR connecting $lane" >&2
success=0
fi
echo >&2
done
if [[ $success == 0 ]]; then
echo "Errors occured. See above." >&2
exit 1
fi
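Typical usage after upgrading (lane names as defined above):

# remove all default lanes, old and new names alike:
./remove_old_lanes.sh
# or name only the pre-rename lanes:
./remove_old_lanes.sh merlin6 merlin6-big merlin6-short merlin6-rtx2080ti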