AlphaFold on Tufts HPC Cluster
General Tufts HPC Cluster Access Info
Please review https://tufts.box.com/v/Pax-User-Guide before proceeding.
Login and Allocate Computing Resources
Tips:
- Login. If you have a Mac, use the Terminal app. If you have a
- If you need GPU resources and don't have access to contrib node partitions, the "preempt" partition is the best option.
AlphaFold
The AlphaFold script is available to everyone at /cluster/tufts/hpc/tools/alphafold/2.2.0/runaf2test.sh
Make a copy of the file to your own folder (e.g. your home directory):
$ cp /cluster/tufts/hpc/tools/alphafold/2.2.0/runaf2test.sh /your/own/directory
Open your own copy of the script and edit it as needed:
#!/bin/bash
#SBATCH -p preempt #if you DO have ccgpu access, use "ccgpu"
#SBATCH -n 8 # 8 cpu cores
#SBATCH --mem=64g #64GB of RAM
#SBATCH --time=2-0 #run 2 days, up to 7 days "7-00:00:00"
#SBATCH -o output.%j
#SBATCH -e error.%j
#SBATCH -N 1
#SBATCH --gres=gpu:a100:1 # number of GPUs. Follow the instructions in the "Pax User Guide" when submitting jobs to a different partition or selecting a different GPU architecture.
module load alphafold/2.2.0
module list
nvidia-smi
module help alphafold/2.2.0 # this command will print out all input options for "runaf2" command
#Use your own paths/values for the following variables
#Make sure outputpath points to a directory where you have write permission
outputpath=/cluster/tufts/hpc/tools/alphafold/2.2.0/test
fastapath=/cluster/tufts/hpc/tools/alphafold/2.2.0/T1050.fasta
maxtemplatedate=2020-06-10
source activate alphafold2.2.0
#running alphafold 2.2.0
runaf2 -o $outputpath -f $fastapath -t $maxtemplatedate
- Make sure outputpath points to a directory where you have write permission.
- Make sure fastapath points to a FASTA file containing the protein sequence whose structure you wish to predict.
- Make sure maxtemplatedate is set to a date before the release date of the structure.
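The three requirements above can be checked before submitting. The following is a minimal sketch, assuming bash; the function name check_af_inputs is hypothetical and is not part of the alphafold module. It only verifies that the output directory is writable, that the FASTA file is readable and starts with a ">" header line, and that the date has the form YYYY-MM-DD. It cannot tell whether the date really precedes the release date of the structure.

```shell
#!/bin/bash
# Hypothetical helper (not provided by the alphafold module):
# sanity-check the three variables set in runaf2test.sh.
check_af_inputs() {
    local outputpath=$1 fastapath=$2 maxtemplatedate=$3

    # Output directory must exist and be writable by you.
    if [ ! -d "$outputpath" ] || [ ! -w "$outputpath" ]; then
        echo "outputpath is not a writable directory: $outputpath" >&2
        return 1
    fi

    # FASTA file must be readable and begin with a '>' header line.
    if [ ! -r "$fastapath" ] || ! head -n 1 "$fastapath" | grep -q '^>'; then
        echo "fastapath is not a readable FASTA file: $fastapath" >&2
        return 1
    fi

    # Date must look like YYYY-MM-DD (format check only).
    if ! [[ $maxtemplatedate =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then
        echo "maxtemplatedate is not YYYY-MM-DD: $maxtemplatedate" >&2
        return 1
    fi
    echo "inputs look OK"
}

# Example usage (paths are placeholders):
# check_af_inputs "$HOME/af_out" "$HOME/T1050.fasta" 2020-06-10
```

Running a check like this on the login node before calling sbatch avoids waiting in the queue only to have the job fail on a bad path.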
Module Help
Please see $ module help alphafold/2.2.0 for additional input options (required and optional parameters) for the runaf2 command.
----------- Module Specific Help for 'alphafold/2.2.0' ------------
This module adds AlphaFold 2.2.0 to the PATH
Run AlphaFold 2.2.0 with:
runaf2 <Required parameters> <Optional Parameters>
Please make sure all REQUIRED parameters are given
Required Parameters:
-o <output_dir> Path to a directory that will store the results, make sure the user has write permission to the directory.
-f <fasta_path> Path to a FASTA file containing sequence. If a FASTA file contains multiple sequences, then it will be folded as a multimer
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-g <use_gpu> Enable NVIDIA runtime to run with GPUs (default: true)
-n <openmm_threads> OpenMM threads (default: all available cores)
-a <gpu_devices> Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-m <model_preset> Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')
-c <db_preset> Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')
-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')
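As an illustration of combining required and optional parameters, the sketch below assembles a runaf2 command line for a FASTA file with more than one sequence. The flags are the ones listed in the help text above. The paths are placeholders, and the preset value multimer is an assumption based on the upstream AlphaFold preset names, since the help text describes the multimer model without spelling out the accepted string. The value 8 for -n simply mirrors the 8 cores requested in runaf2test.sh. The command is printed rather than executed so it can be reviewed first.

```shell
#!/bin/bash
# Placeholder paths: replace with your own.
outputpath="$HOME/alphafold_out"
fastapath="$HOME/complex.fasta"     # FASTA with more than one sequence
maxtemplatedate=2020-06-10

# Required parameters first, then optional ones.
cmd=(runaf2
     -o "$outputpath"
     -f "$fastapath"
     -t "$maxtemplatedate"
     -m multimer    # assumed spelling of the multimer preset
     -n 8)          # OpenMM threads, matching "#SBATCH -n 8"

# Print the command for review. To run it inside the batch script,
# replace the next line with:  "${cmd[@]}"
printf '%s ' "${cmd[@]}"; echo
```

Keeping the command in an array keeps paths with spaces intact when the command is finally executed.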
Submit Job
To submit your job, go to the folder that contains runaf2test.sh.
From the command line, submit it with $ sbatch runaf2test.sh
Then follow the instructions in https://tufts.box.com/v/Pax-User-Guide to check your job status.
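The submission step can be wrapped so that the job ID and log file names are reported right away. This is a minimal sketch, assuming a standard Slurm installation; submit_af_job is a hypothetical name. It relies on the --parsable option of sbatch, which prints only the job ID, and on the output.%j and error.%j file names set in runaf2test.sh.

```shell
#!/bin/bash
# Hypothetical wrapper: submit a batch script and report where to look.
submit_af_job() {
    local script=$1 jobid

    # --parsable makes sbatch print just the job ID
    # (optionally followed by ";cluster").
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}

    echo "Submitted job $jobid"
    # These names follow the "#SBATCH -o output.%j" and
    # "#SBATCH -e error.%j" lines in runaf2test.sh.
    echo "stdout: output.$jobid"
    echo "stderr: error.$jobid"
    echo "Check status with: squeue -j $jobid"
}

# Example usage, from the folder containing the script:
# submit_af_job runaf2test.sh
```

The Pax User Guide remains the reference for how job status should be checked on this cluster; squeue is shown here only as the generic Slurm command.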
AlphaFold output
The outputs will be in a subfolder of the output_dir that you specified in runaf2test.sh (the outputpath variable passed to -o).
They include the computed MSAs, unrelaxed structures, relaxed structures, ranked structures, raw model outputs, prediction metadata, and section timings. The output_dir directory will have the following structure:
output_dir/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
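Once a job has finished, the listing above can serve as a checklist. The sketch below, assuming bash, reports which of the expected top-level files are missing or empty in a results folder. The function name check_af_outputs is hypothetical, and the file names assume the default monomer preset; other presets may name their model files differently.

```shell
#!/bin/bash
# Hypothetical helper: report which expected AlphaFold outputs are missing.
# File names assume the default monomer preset shown in the listing.
check_af_outputs() {
    local dir=$1 missing=0 f i
    local files=(features.pkl ranking_debug.json timings.json)

    for i in 1 2 3 4 5; do
        files+=("unrelaxed_model_$i.pdb" "relaxed_model_$i.pdb" "result_model_$i.pkl")
    done
    for i in 0 1 2 3 4; do
        files+=("ranked_$i.pdb")
    done

    for f in "${files[@]}"; do
        if [ ! -s "$dir/$f" ]; then      # missing or empty
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    echo "$missing of ${#files[@]} expected files missing"
    [ "$missing" -eq 0 ]
}

# Example usage (placeholder path to the results subfolder):
# check_af_outputs "$HOME/af_out/T1050"
```

A job that hit its time limit typically leaves some of the later files, such as ranked_*.pdb, missing, which this check makes easy to spot.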
The contents of each output file are as follows:
- features.pkl: A pickle file containing the input feature Numpy arrays used by the models to produce the structures.
- unrelaxed_model_*.pdb: A PDB format text file containing the predicted structure, exactly as outputted by the model.
- relaxed_model_*.pdb: A PDB format text file containing the predicted structure, after performing an Amber relaxation procedure on the unrelaxed structure prediction, see Jumper et al. 2021, Suppl. Methods 1.8.6 for details.
- ranked_*.pdb: A PDB format text file containing the relaxed predicted structures, after reordering by model confidence. Here ranked_0.pdb should contain the prediction with the highest confidence, and ranked_4.pdb the prediction with the lowest confidence. To rank model confidence, we use predicted LDDT (pLDDT), see Jumper et al. 2021, Suppl. Methods 1.9.6 for details.
- ranking_debug.json: A JSON format text file containing the pLDDT values used to perform the model ranking, and a mapping back to the original model names.
- timings.json: A JSON format text file containing the times taken to run each section of the AlphaFold pipeline.
- msas/: A directory containing the files describing the various genetic tool hits that were used to construct the input MSA.
- result_model_*.pkl: A pickle file containing a nested dictionary of the various Numpy arrays directly produced by the model. In addition to the output of the structure module, this includes auxiliary outputs such as distograms and pLDDT scores. If using the pTM models then the pTM logits will also be contained in this file.
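To get a quick feel for the confidence of each ranked structure without opening the pickle files, the per-residue pLDDT can be averaged directly from a PDB file. This is a minimal sketch, assuming bash and awk, and assuming that pLDDT is stored in the B-factor field (columns 61-66) of the output PDB files, as documented for upstream AlphaFold but not stated in this section. The function name mean_plddt is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: mean pLDDT over C-alpha atoms of one PDB file.
# Assumes pLDDT is stored in the B-factor field (columns 61-66).
mean_plddt() {
    awk '
        substr($0, 1, 4) == "ATOM" {
            name = substr($0, 13, 4)      # atom name, columns 13-16
            gsub(/ /, "", name)
            if (name == "CA") {           # one value per residue
                sum += substr($0, 61, 6)  # B-factor field = pLDDT
                n++
            }
        }
        END {
            if (n > 0) printf "%.2f\n", sum / n
            else { print "no CA atoms found" > "/dev/stderr"; exit 1 }
        }
    ' "$1"
}

# Example usage: compare the five ranked structures.
# for f in ranked_*.pdb; do echo "$f $(mean_plddt "$f")"; done
```

For a monomer run, the values should decrease from ranked_0.pdb to ranked_4.pdb and should be close to the pLDDT values recorded in ranking_debug.json, which remains the authoritative source for the ranking.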