add alphafold module

2024-07-18 10:49:35 +02:00
parent 6fe7d41d22
commit 30c60ca532
9 changed files with 536 additions and 0 deletions

alphafold/README.md Normal file

@@ -0,0 +1,119 @@
# Alphafold
Alphafold consists of three parts:
1. A conda environment containing the dependencies
2. The alphafold module itself, containing the current code and submission scripts
3. The database
## Database Data
All the download scripts work from merlin except the pdb-mmcif script, which uses rsync.
The port provided by alphafold is closed by PSI and the US mirror does not work reliably. An alternative that works:
`rsync -rlpt -v -z --info=progress2 --delete rsync.ebi.ac.uk::pub/databases/pdb/data/structures/divided/mmCIF/ $DIR`
Tip: Use tmux sessions for the downloads.
Tip: Double-check the read permissions for users after copying/downloading the database; missing permissions caused errors last time!
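The permission check from the tip above could be automated. The following is a minimal sketch, assuming the database is a plain directory tree that all users should be able to read; `find_unreadable` is a hypothetical helper name, not part of the module.

```python
import os
import stat
from pathlib import Path


def find_unreadable(root):
    """Return paths under root that other users cannot read (or, for directories, enter)."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            mode = path.lstat().st_mode
            if stat.S_ISLNK(mode):
                continue  # permissions of the symlink itself are not meaningful
            needed = stat.S_IROTH
            if stat.S_ISDIR(mode):
                needed |= stat.S_IXOTH  # directories also need the execute bit
            if mode & needed != needed:
                problems.append(path)
    return sorted(problems)
```

Calling `find_unreadable` on the database directory after a download should return an empty list; any returned path is a candidate for `chmod`.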
## Conda Environment
Alphafold is installed based on Spencer's instructions from older installs and the original git repo.
Change: There is no central conda env anymore; instead, each version gets its own conda env.
The conda env should be installed from the environment.yml file, which combines conda-forge, bioconda and pip installations. Unfortunately, alphafold (DeepMind) does not provide an environment.yml file so far. Miniconda is used for the installation, because the central conda installation might cause problems (openAFS hardlink issues). If the central version is used, make sure to install from an openAFS host (e.g. pmod7).
Also, the central conda is very old; does it need to be updated?
After creating the env from the yml file, jax and jaxlib need to be installed into it separately; installing them directly from the environment.yml file does not work so far.
Also, the original git repo currently contains many contradictory descriptions of the jaxlib versions.
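Because the jax and jaxlib versions are installed in a separate step, it can help to confirm what actually ended up in the env. A minimal sketch using only the standard library, to be run inside the activated env; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("jax", "jaxlib")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```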
```
# OLD VERSIONS
conda create --name alphafold python==3.8
conda update -n base conda
source miniconda3/etc/profile.d/conda.sh
conda activate alphafold
conda install -y -c conda-forge openmm==7.5.1 cudnn==8.2.1.32 cudatoolkit==11.0.3 pdbfixer==1.7
conda install -y -c bioconda hmmer==3.3.2 hhsuite==3.3.0 kalign2==2.04
pip install absl-py==0.13.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.4 \
dm-tree==0.1.6 immutabledict==2.0.0 jax==0.2.14 ml-collections==0.1.0 \
numpy==1.19.5 scipy==1.7.0 tensorflow==2.5.0 pandas==1.3.4
pip install --upgrade jax jaxlib==0.1.69+cuda111 \
-f https://storage.googleapis.com/jax-releases/jax_releases.html
```
NEW VERSION (2.3.2, current state)
Create the conda env from the environment.yml file. Content:
```
channels:
  - pytorch
  - conda-forge
  - defaults
  - anaconda
  - bioconda
dependencies:
  - python==3.10
  - pip
  - openmm==7.7.0
  - cudnn  # Change version if not compatible with current system
  - cudatoolkit
  - pdbfixer
  - hmmer==3.4
  - hhsuite==3.3.0
  - kalign2==2.04
  - pip:
      - absl-py==1.0.0
      - biopython==1.79
      - chex==0.0.7
      - dm-haiku==0.0.10
      - immutabledict==2.0.0
      - ml-collections==0.1.0
      - numpy==1.24.3
      - scipy==1.11.1
      - tensorflow-cpu==2.13.0
      - jax==0.4.14
      - pandas==2.0.3
      - dm-tree==0.1.8
```
## Alphafold code
In the file run_alphafold.py, the default of the flag --use_gpu_relax needs to be set to True; so far this is done manually!
Not sure if this is really necessary.
```
# change
flags.DEFINE_boolean('use_gpu_relax', None, 'Whether to relax on GPU. '
# to
flags.DEFINE_boolean('use_gpu_relax', True, 'Whether to relax on GPU. '
```
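An alternative to patching the default is to treat the flag as tri-state and fill in True only when the user gave nothing. The sketch below illustrates the idea with the standard library `argparse` as a stand-in for absl flags (so the negative spelling is `--no-use_gpu_relax`, which differs from absl's); `resolve_gpu_relax` is a hypothetical name.

```python
import argparse


def resolve_gpu_relax(argv):
    """Return the effective use_gpu_relax value: the user's choice, else True."""
    parser = argparse.ArgumentParser()
    # default=None distinguishes "flag not given" from an explicit false
    parser.add_argument(
        "--use_gpu_relax",
        action=argparse.BooleanOptionalAction,  # requires Python 3.9+
        default=None,
    )
    args = parser.parse_args(argv)
    return True if args.use_gpu_relax is None else args.use_gpu_relax
```

With this approach an explicit opt-out is respected, while runs without the flag still relax on the GPU.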
## Alphafold module
Add version to files/variants. The version number should match a github tag
(e.g. `v2.0.1`) or else have the commit hash as `$V_RELEASE`.
As admin user:
```
cd MX/alphafold
./build <version>
```
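The version rule above can be sketched as a small function. This is an illustration only: it follows the rule as stated here (tag `v<version>`, or a 7-character commit hash in `$V_RELEASE`), whereas the build script in this commit currently falls back to `main` instead of the tag. `pick_git_ref` is a hypothetical name.

```python
def pick_git_ref(v_pkg, v_release=""):
    """Choose the git ref to check out for a module version.

    A 7-character release string is treated as an abbreviated commit hash;
    otherwise the ref is assumed to be the tag 'v<version>'.
    """
    if len(v_release) == 7:
        return v_release
    return f"v{v_pkg}"


if __name__ == "__main__":
    print(pick_git_ref("2.0.1"))             # tag-based variant
    print(pick_git_ref("2.0.0", "b88f8da"))  # hash-based variant
```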
## Testing
Here's an example sequence:
```
mkdir example
cd example
cat > query.fasta <<EOF
>dummy_sequence
GWSTELEKHREELKEFLKKEGITNVEIRIDNGRLEVRVEGGTERLKRFLEELRQKLEKKGYTVDIKIE
EOF
module use MX unstable
module load alphafold/2.1.1
sbatch alphafold_merlin.sh query.fasta
```
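The wrapper chooses `--model_preset` from the number of sequences in the fasta file, so the single-entry example above runs as a monomer. A simplified sketch of that rule, operating on lines of text rather than on a file path (the function names here are illustrative):

```python
def count_fasta_entries(lines):
    """Count header lines (those starting with '>') in FASTA text."""
    return sum(1 for line in lines if line.startswith(">"))


def guess_model_preset(lines):
    """Return 'multimer' for more than one sequence, else 'monomer'."""
    return "multimer" if count_fasta_entries(lines) > 1 else "monomer"


if __name__ == "__main__":
    query = [
        ">dummy_sequence",
        "GWSTELEKHREELKEFLKKEGITNVEIRIDNGRLEVRVEGGTERLKRFLEELRQKLEKKGYTVDIKIE",
    ]
    print(guess_model_preset(query))
```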


@@ -0,0 +1,44 @@
#!/bin/bash
#SBATCH -p gpu
#SBATCH -J alphafold
#SBATCH -M gmerlin6
#SBATCH --gpus=1
#SBATCH -n 1
#SBATCH -c 10
# Slurm options can be overridden by specifying them in the sbatch command!
#
# Alphafold submission script for the merlin cluster
# Usage: sbatch [slurm_opts] alphafold_merlin.sh [options] fasta_file
#
# OPTIONS
# All alphafold options are set automatically, but can be overwritten.
# Some common options:
#
# --max_template_date=YYYY-MM-DD (default: today)
# --output_dir (default: current directory)
# --helpfull List all options
#
# 2024-02-08 Spencer Bliven, Greta Assmann
#
export ALPHAFOLD_DATA=/data/project/bio/shared/alphafold/versions/v2.3.2/latest
source /opt/psi/MX/alphafold/ALPHAFOLD_VERSION/miniconda/etc/profile.d/conda.sh
conda activate "${ALPHAFOLD_ENV:?"Error: ALPHAFOLD_ENV not set. Try 'module use MX unstable; module load alphafold'"}"
export ALPH_TMP_DIR=/scratch/${SLURM_JOBID}
mkdir -p $ALPH_TMP_DIR
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
echo "hostname=$(hostname)"
echo "python=$(which python)"
echo "ALPHAFOLD_DATA=$(realpath "$ALPHAFOLD_DATA")"
echo "Alphafold version=$(which alphafold_merlin.sh)"
echo "$ALPHAFOLD_ENV"
python "${ALPHAFOLD_DIR:?Error loading module}/bin/alphafold_runner.py" -v 0 "$@"
squeue -j $SLURM_JOB_ID
rm -rf "${ALPH_TMP_DIR:?}"

alphafold/bin/alphafold_ra.sh Executable file

@@ -0,0 +1,43 @@
#!/bin/bash
#SBATCH -p gpu-week
#SBATCH -t 2-00:00:00
#SBATCH -J alphafold
#SBATCH --gpus=1
#SBATCH -n 1
#SBATCH -c 10
# Alphafold submission script for the RA cluster
# Usage: sbatch [slurm_opts] alphafold_ra.sh [options] fasta_file
#
# OPTIONS
# All alphafold options are set automatically, but can be overwritten.
# Some common options:
#
# --max_template_date=YYYY-MM-DD (default: today)
# --output_dir (default: current directory)
# --helpfull List all options
#
# 2024-02-09 Spencer Bliven, Greta Assmann
#
export ALPHAFOLD_DATA=/das/work/common/opt/alphafold/data_2.3.2/latest
source /opt/psi/MX/alphafold/2.3.2/miniconda/etc/profile.d/conda.sh
conda activate "${ALPHAFOLD_ENV:?"Error: ALPHAFOLD_ENV not set. Try 'module use MX unstable; module load alphafold'"}"
export ALPH_TMP_DIR=/scratch/${SLURM_JOBID}
mkdir -p $ALPH_TMP_DIR
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
echo "hostname=$(hostname)"
echo "python=$(which python)"
echo "ALPHAFOLD_DATA=$(realpath "$ALPHAFOLD_DATA")"
echo "Alphafold version=$(which alphafold_ra.sh)"
echo "$ALPHAFOLD_ENV"
python "${ALPHAFOLD_DIR:?Error loading module}/bin/alphafold_runner.py" -v 0 "$@"
squeue -j $SLURM_JOB_ID
rm -rf "${ALPH_TMP_DIR:?}"

alphafold/bin/alphafold_runner.py Executable file

@@ -0,0 +1,132 @@
#!/usr/bin/env python
"""
Wrapper script for Alphafold 2, with automatic setting of common options
usage: python alphafold_runner.py [alphafold options] input.fa
"""
import sys
import os
import importlib
from datetime import date
from pathlib import Path

from absl import app
from absl.flags import FLAGS
from absl import logging
def import_alphafold():
    "Import run_alphafold.py from ALPHAFOLD_HOME"
    # Default: the alphafold checkout next to this script's bin/ directory
    default_home = Path(__file__).resolve().parent.parent / "alphafold"
    home = os.environ.get('ALPHAFOLD_HOME', str(default_home))
    sys.path.append(home)
    try:
        return importlib.import_module("run_alphafold")
    except ImportError:
        sys.stderr.write("Unable to find run_alphafold.py\n")
        sys.stderr.write(f"path:{', '.join(sys.path)}\n")
        sys.exit(1)


af = import_alphafold()
def multi_fasta(fasta_path):
    "Return True if the fasta file contains more than one sequence"
    entries = 0
    with open(fasta_path, 'r') as fasta:
        for line in fasta:
            if line and line[0] == '>':
                entries += 1
                if entries > 1:
                    return True
    return False
def guess_model_preset(fasta_paths):
    if any(multi_fasta(f) for f in fasta_paths):
        logging.info("Input appears to be multimer")
        return "multimer"
    logging.info("Input appears to be monomer")
    return "monomer"
def main(argv):
    """Set some option defaults and then call alphafold's main method

    Most alphafold options have defaults set automatically:
    - database files are set from the ALPHAFOLD_DATA variable or the --data_dir option
      (assuming the versioned layout, which differs slightly from the default)
    - `--model_preset` is set to either monomer or multimer depending on the number of sequences in the fasta file
    - `--max_template_date` defaults to the current date
    """
    if len(argv) > 2:
        raise app.UsageError('Too many command-line arguments.')

    # Accept positional fasta_paths
    if len(argv) > 1:
        if FLAGS["fasta_paths"].present:
            raise app.UsageError("Both the --fasta_paths option and a fasta file argument were given")
        FLAGS["fasta_paths"].parse(argv[1])
    elif not FLAGS.fasta_paths:
        raise app.UsageError("No fasta file specified")

    # Database flags
    if FLAGS["data_dir"].present:
        data_dir = FLAGS.data_dir
    elif "ALPHAFOLD_DATA" in os.environ:
        data_dir = os.environ["ALPHAFOLD_DATA"]
        logging.info(f"Using ALPHAFOLD_DATA={data_dir}")
        FLAGS['data_dir'].value = data_dir
    else:
        raise app.UsageError("Specify --data_dir or set ALPHAFOLD_DATA")

    if not FLAGS["model_preset"].present:
        FLAGS.model_preset = guess_model_preset(FLAGS.fasta_paths)

    use_small_bfd = FLAGS.db_preset == 'reduced_dbs'
    if use_small_bfd:
        if not FLAGS.small_bfd_database_path:
            FLAGS.small_bfd_database_path = os.path.join(data_dir, "small_bfd", "bfd-first_non_consensus_sequences.fasta")
    else:
        if not FLAGS.bfd_database_path:
            FLAGS.bfd_database_path = os.path.join(data_dir, "bfd", "bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt")
        if not FLAGS.uniref30_database_path:
            FLAGS.uniref30_database_path = os.path.join(data_dir, "uniref30", "UniRef30_2021_03")

    run_multimer_system = 'multimer' in FLAGS.model_preset
    if run_multimer_system:
        if not FLAGS.pdb_seqres_database_path:
            FLAGS.pdb_seqres_database_path = os.path.join(data_dir, "pdb_seqres", "pdb_seqres.txt")
        if not FLAGS.uniprot_database_path:
            FLAGS.uniprot_database_path = os.path.join(data_dir, "uniprot", "uniprot_sprot.fasta")
    else:
        if not FLAGS.pdb70_database_path:
            FLAGS.pdb70_database_path = os.path.join(data_dir, "pdb70", "pdb70")

    if not FLAGS.mgnify_database_path:
        FLAGS.mgnify_database_path = os.path.join(data_dir, "mgnify", "mgy_clusters_2022_05.fa")
    if not FLAGS.obsolete_pdbs_path:
        FLAGS.obsolete_pdbs_path = os.path.join(data_dir, "pdb_mmcif", "obsolete.dat")
    if not FLAGS.template_mmcif_dir:
        FLAGS.template_mmcif_dir = os.path.join(data_dir, "pdb_mmcif", "mmcif_files")
    if not FLAGS.uniref90_database_path:
        FLAGS.uniref90_database_path = os.path.join(data_dir, "uniref90", "uniref90.fasta")

    if not FLAGS.output_dir:
        FLAGS.output_dir = os.getcwd()
    if not FLAGS.max_template_date:
        FLAGS["max_template_date"].parse(date.today().isoformat())
    # Default to GPU relaxation only when the flag was not given,
    # so that an explicit --use_gpu_relax=false is respected
    if FLAGS.use_gpu_relax is None:
        FLAGS.use_gpu_relax = True

    af.main(argv[0:1])
if __name__ == "__main__":
    app.run(main)

alphafold/bin/submit.sh Executable file

@@ -0,0 +1,91 @@
#!/bin/bash
# Generic alphafold submission script.
# Set the ALPHAFOLD_DATA variable before running.
# Usage: sbatch [slurm_opts] $ALPHAFOLD_DIR/bin/submit.sh fasta_file [max_template_date]
#
# Output will be in the same directory as the fasta_file.
# Slurm logs will be in the current directory.
#
# 2021-08-09 Spencer Bliven, D.Ozerov
#
# Bash strict mode
set -euo pipefail
IFS=$'\n\t'
usage () {
echo "Usage: sbatch [slurm_opts] \$ALPHAFOLD_DIR/bin/submit.sh fasta_file [max_template_date]"
}
# Parse parameters
if [ "$#" -lt 1 ]
then
    echo "No fasta_file name" >&2
    usage >&2
    exit 1
fi
FASTA_FILE=$(readlink -f "$1")
if [ -z "$FASTA_FILE" ] || [ ! -e "$FASTA_FILE" ]
then
    echo "${FASTA_FILE} is not reachable (input argument was $1)" >&2
    exit 1
fi
DIR_QUERY=$(dirname "${FASTA_FILE}")
LOG="${DIR_QUERY}/alphafold.out"
if [ "$#" -ge 2 ]
then
MAX_TEMPLATE_DATE=$2
else
MAX_TEMPLATE_DATE=$(date '+%Y-%m-%d')
fi
date > "$LOG"
hostname >> "$LOG"
set +u # Allow unset variables in activate commands
module purge
module use MX unstable
module load alphafold/ALPHAFOLD_VERSION 2>> "$LOG"
conda activate "${ALPHAFOLD_ENV:?"Error: ALPHAFOLD_ENV not set. Try 'module use MX unstable; module load alphafold'"}"
set -u
# Check the module loaded correctly
if ! [ -d "${ALPHAFOLD_HOME:-}" ]; then
echo "Error: ALPHAFOLD_HOME not available (${ALPHAFOLD_HOME:-unset})" >&2
exit 1
fi
# Data dir
if ! [ -d "${ALPHAFOLD_DATA:?Set ALPHAFOLD_DATA before running}" ]; then
echo "Error: ALPHAFOLD_DATA directory not available ($ALPHAFOLD_DATA)" >&2
exit 1
fi
echo "GPUs: ${CUDA_VISIBLE_DEVICES:-None}" >> "$LOG"
echo "Detecting GPUs with Tensorflow:" >> "$LOG"
python -c 'import tensorflow as tf; tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))' 2>&1 |
sed -rn 's/^.* (Created TensorFlow device.*)$/\1/p' >> "$LOG"
echo >> "$LOG"
echo "Running alphafold from $PWD for fasta sequence : " >> "$LOG"
cat "${FASTA_FILE}" >> "$LOG"
echo "and max_template_date : ${MAX_TEMPLATE_DATE} " >> "$LOG"
echo >> "$LOG"
cd "${ALPHAFOLD_HOME}"
CMD=("./run_alphafold.sh" -p full_dbs -d "${ALPHAFOLD_DATA}" -o "${DIR_QUERY}" -m model_1,model_2,model_3,model_4,model_5 -f "${FASTA_FILE}" -t "${MAX_TEMPLATE_DATE}")
if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]
then
CMD+=(-g false)
else
CMD+=(-a "$CUDA_VISIBLE_DEVICES")
fi
echo "Run: ${CMD[@]}" >> "$LOG"
echo >> "$LOG"
( ( time "${CMD[@]}" ) 2>&1 ) >> "$LOG"

alphafold/build Executable file

@@ -0,0 +1,57 @@
#!/usr/bin/env modbuild
pbuild::add_to_group 'MX'
pbuild::prep() {
:
}
pbuild::configure() {
#BUILD CONDA ENV with miniconda installation:
#
#MINICONDA INSTALL
mkdir -p "$PREFIX/miniconda"
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O "$PREFIX/miniconda/miniconda.sh"
bash "$PREFIX/miniconda/miniconda.sh" -b -u -p "$PREFIX/miniconda/"
#FOR DEBUGGING AND CHECKING MINICONDA INSTALLATION
"$PREFIX/miniconda/condabin/conda" config --show
#CREATE ENV , make sure to source the "correct" conda.sh
"$PREFIX/miniconda/condabin/conda" env create --name "alphafold_$V" -f "$BUILDBLOCK_DIR/environment.yml"
source "$PREFIX/miniconda/etc/profile.d/conda.sh"
conda activate "alphafold_$V"
pip3 install --upgrade --no-cache-dir jax==0.3.25 jaxlib==0.3.25+cuda11.cudnn805 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
conda deactivate
}
pbuild::compile() {
ALPHAFOLD_HOME="$PREFIX/alphafold"
#local BRANCH
if [[ "${#V_RELEASE}" -eq 7 ]]; then
#Release looks like a git hash
BRANCH="${V_RELEASE}"
else
#choose the given version(e.g. a tag) as branch or choose main
#BRANCH="v${V_PKG}"
BRANCH="main"
fi
git clone --depth=1 -b "$BRANCH" https://github.com/deepmind/alphafold.git "$ALPHAFOLD_HOME" || return $?
if ! [ -f "$ALPHAFOLD_HOME/alphafold/common/stereo_chemical_props.txt" ]; then
curl -fLsS -o "$ALPHAFOLD_HOME/alphafold/common/stereo_chemical_props.txt" \
https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
fi
}
pbuild::install() {
cp -r "$BUILDBLOCK_DIR/bin" "$PREFIX/"
sed -i "s/ALPHAFOLD_VERSION/$V/g" "$PREFIX/bin/"*
}

alphafold/environment.yml Normal file

@@ -0,0 +1,31 @@
name: alphafold_2.3.2
channels:
  - pytorch
  - conda-forge
  - defaults
  - anaconda
  - bioconda
dependencies:
  - python==3.10
  - pip
  - openmm==7.7.0
  - cudnn  # Change version if not compatible with current system
  - cudatoolkit
  - pdbfixer
  - hmmer==3.4
  - hhsuite==3.3.0
  - kalign2==2.04
  - pip:
      - absl-py==1.0.0
      - biopython==1.79
      - chex==0.0.7
      - dm-haiku==0.0.10
      - immutabledict==2.0.0
      - ml-collections==0.1.0
      - numpy==1.24.3
      - scipy==1.11.1
      - tensorflow-cpu==2.13.0
      - jax==0.4.14
      - pandas==2.0.3
      - dm-tree==0.1.8

alphafold/files/variants Normal file

@@ -0,0 +1,4 @@
alphafold/2.0.0-b88f8da unstable anaconda/2019.07 b:gcc/10.3.0 cuda/11.0.3
alphafold/2.0.1 stable anaconda/2019.07 b:gcc/10.3.0 cuda/11.0.3
alphafold/2.1.1 unstable anaconda/2019.07 b:gcc/10.3.0 cuda/11.0.3
alphafold/2.3.2 unstable b:gcc/10.3.0 cuda/11.8.0

alphafold/modulefile Normal file

@@ -0,0 +1,15 @@
#%Module1.0
module-whatis "AlphaFold"
module-url "https://github.com/deepmind/alphafold/"
module-license "Code: Apache 2.0 License. Parameters: Noncommercial CC-BY-NC 4.0"
module-maintainer "Greta Assmann <greta.assmann@psi.ch>"
module-help "The AlphaFold 2 protein structure prediction method by Google DeepMind.
Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021). https://doi.org/10.1038/s41586-021-03819-2
"
setenv ALPHAFOLD_HOME "$PREFIX/alphafold"
setenv ALPHAFOLD_ENV "$PREFIX/miniconda/envs/alphafold_$V"