Protein Workshop

This repository provides the code for the protein structure representation learning benchmark detailed in the paper Evaluating Representation Learning on the Protein Structure Universe.
In the benchmark, we implement numerous featurisation schemes, datasets for self-supervised pre-training and downstream evaluation, pre-training tasks, and auxiliary tasks.
The benchmark can be used as a working template for a protein representation learning research project, a library of drop-in components for use in your projects, or as a CLI tool for quickly running protein representation learning evaluation and pre-training configurations.
Processed datasets and pre-trained weights are made available. Downloading datasets is not required; upon first run all datasets will be downloaded and processed from their respective source.
Configuration files to run the experiments described in the manuscript are provided in the proteinworkshop/config/sweeps/ directory.
Contents

Installation
  From PyPI
  Building from source
Tutorials
Quickstart
  Downloading datasets
  Training a model
  Finetuning a model
  Running a sweep/experiment
  Embedding a dataset
  Visualising a dataset's embeddings
  Performing attribution of a pre-trained model
  Verifying a config
  Using proteinworkshop modules functionally
Models
  Invariant Graph Encoders
  Equivariant Graph Encoders
    (Vector-type)
    (Tensor-type)
  Sequence-based Encoders
Datasets
  Structure-based Pre-training Corpuses
  Supervised Datasets
Tasks
  Self-Supervised Tasks
  Generic Supervised Tasks
Featurisation Schemes
  Invariant Node Features
  Equivariant Node Features
  Edge Construction
  Invariant Edge Features
  Equivariant Edge Features
For Developers
  Dependency Management
  Code Formatting
  Documentation

Installation
Below, we outline how to set up a virtual environment for proteinworkshop. Note that these installation instructions currently target Linux-like systems with NVIDIA CUDA support; Windows and macOS are not officially supported.
From PyPI
proteinworkshop is available for install from PyPI. This enables training of specific configurations via the CLI or using individual components from the benchmark, such as datasets, featurisers, or transforms, as drop-ins to other projects. Make sure to install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired.
# install `proteinworkshop` from PyPI
pip install proteinworkshop

# install PyTorch Geometric using the (now-installed) CLI
workshop install pyg

# set a custom data directory for file downloads; otherwise, all data will be downloaded to `site-packages`
export DATA_PATH="where/you/want/data/" # e.g., `export DATA_PATH="proteinworkshop/data"`

However, for full exploration we recommend cloning the repository and building from source.
Building from source
With a local virtual environment activated (e.g., one created with conda create -n proteinworkshop python=3.10):


Clone and install the project
git clone https://github.com/a-r-j/ProteinWorkshop
cd ProteinWorkshop
pip install -e .



Install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired
# e.g., to install PyTorch with CUDA 11.8 support on Linux:
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118



Then use the newly-installed proteinworkshop CLI to install PyTorch Geometric
workshop install pyg



Configure paths in .env (optional; any paths set there override the defaults). See .env.example for an example, and the sketch below.
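A hypothetical example of its contents (DATA_PATH is the variable used in the quickstart above; see .env.example for the full set of supported variables):
DATA_PATH="where/you/want/data/" # e.g., `DATA_PATH="proteinworkshop/data"`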


Download PDB data:
python proteinworkshop/scripts/download_pdb_mmtf.py



Tutorials
We provide a five-part series of Jupyter notebook tutorials with examples of how to use and extend proteinworkshop, as outlined below.

Training a new model
Customizing an existing dataset
Adding a new dataset
Adding a new model
Adding a new task

Quickstart
Downloading datasets
Datasets can either be built from the source structures or downloaded from Zenodo. A dataset is built from source the first time it is used in a run (or by calling the appropriate setup() method in the corresponding datamodule; a programmatic sketch is given at the end of this subsection). We provide a CLI tool for downloading datasets:
workshop download <DATASET_NAME>
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc..

If you wish to build datasets from source, we recommend downloading the entire PDB first (in MMTF format, c. 24 Gb) so that shared PDB data can be reused as much as possible:
workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py
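If you prefer to build datasets from Python rather than the CLI, the following is a minimal sketch (it reuses the CATHDataModule example shown later in this README; paths are placeholders and the exact setup() arguments may vary between datamodules):
from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()    # fetch the data if it is not already present
datamodule.setup("fit")  # process the structures so that dataloaders can be constructed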

Training a model
Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu/gpu):
workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu # or trainer=gpu

This command uses the default configurations in configs/train.yaml, which can be overwritten by equivalently named options. For instance, you can use a different input featurisation using the features option, or set the display name of your experiment on wandb using the name option:
workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu # or trainer=gpu

Finetuning a model
Finetuning a model additionally requires specification of a checkpoint.
workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu # or trainer=gpu

Running a sweep/experiment
We can make use of the hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, architectures, pre-training/auxiliary tasks and datasets.
See proteinworkshop/config/sweeps/ for examples.

Create the sweep with Weights & Biases

wandb sweep proteinworkshop/config/sweeps/my_new_sweep_config.yaml


Launch job workers

With wandb:
wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 8

Or an example SLURM submission script:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --array=0-32

source ~/.bashrc
source $(conda info --base)/envs/proteinworkshop/bin/activate

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 1

Reproduce the sweeps performed in the manuscript:
# reproduce the baseline tasks sweep (i.e., those performed without pre-training each model)
wandb sweep proteinworkshop/config/sweeps/baseline_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2awtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2bwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2cwtt7oy --count 8

# reproduce the model pre-training sweep
wandb sweep proteinworkshop/config/sweeps/pre_train.yaml
wandb agent mywandbgroup/proteinworkshop/2dwtt7oy --count 8

# reproduce the pre-trained tasks sweep (i.e., those performed after pre-training each model)
wandb sweep proteinworkshop/config/sweeps/pt_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2ewtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2fwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2gwtt7oy --count 8

Embedding a dataset
We provide a utility in proteinworkshop/embed.py for embedding a dataset using a pre-trained model.
To run it:
python proteinworkshop/embed.py ckpt_path=PATH/TO/CHECKPOINT collection_name=COLLECTION_NAME

See the embed section of proteinworkshop/config/embed.yaml for additional parameters.
Visualising pre-trained model embeddings for a given dataset
We provide a utility in proteinworkshop/visualise.py for visualising the UMAP embeddings of a pre-trained model for a given dataset.
To run it:
python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=VISUALISATION/FILEPATH.png

See the visualise section of proteinworkshop/config/visualise.yaml for additional parameters.
Performing attribution of a pre-trained model
We provide a utility in proteinworkshop/explain.py for performing attribution of a pre-trained model using integrated gradients.
This will write PDB files for all the structures in a dataset for a supervised task with residue-level attributions in the b_factor column. To visualise the attributions, we recommend using the Protein Viewer VSCode extension and changing the 3D representation to colour by Uncertainty/Disorder.
To run the attribution:
python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY

See the explain section of proteinworkshop/config/explain.yaml for additional parameters.
Verifying a config
python proteinworkshop/validate_config.py dataset=cath features=full_atom task=inverse_folding

Using proteinworkshop modules functionally
The modules of proteinworkshop (e.g., datasets, models, featurisers, and utilities) can also be used functionally by importing them directly. When the package is installed from PyPI, this makes building on top of proteinworkshop's assets straightforward and convenient.
For example, to use any datamodule available in proteinworkshop:
from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()

train_dl = datamodule.train_dataloader()
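
# Usage sketch: iterate over training batches (purely illustrative; the exact batch
# attributes depend on the dataset and on how the batch is featurised downstream).
for batch in train_dl:
    print(batch)
    break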

To use any model or featuriser available in proteinworkshop:
from proteinworkshop.models.graph_encoders.dimenetpp import DimeNetPPModel
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

model = DimeNetPPModel(hidden_channels=64, num_layers=3)
ca_featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16"],
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

example_batch = create_example_batch()
batch = ca_featuriser(example_batch)

model_outputs = model(example_batch)

Read the docs for a full list of modules available in proteinworkshop.
Models
Invariant Graph Encoders



Name | Source
GearNet | Zhang et al.
DimeNet++ | Gasteiger et al.
SchNet | Schütt et al.
CDConv | Fan et al.




Equivariant Graph Encoders
(Vector-type)



Name | Source
GCPNet | Morehead et al.
GVP-GNN | Jing et al.
EGNN | Satorras et al.




(Tensor-type)



Name | Source
Tensor Field Network | Corso et al.
Multi-ACE | Batatia et al.




Sequence-based Encoders



Name | Source
ESM2 | Lin et al.




Datasets
To download a (processed) dataset from Zenodo, you can run
workshop download <DATASET_NAME>

where <DATASET_NAME> is given in the first column of the tables below.
Otherwise, simply starting a training run will download and process the data from source.
Structure-based Pre-training Corpuses
Pre-training corpuses (with the exception of pdb, cath, and astral) are provided in FoldComp database format. This format is highly compressed, resulting in very small disk-space requirements despite the large number of structures. pdb is provided as a collection of MMTF files, which are significantly smaller than conventional .pdb or .cif files.
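FoldComp databases can also be read directly. Below is a minimal sketch that assumes the foldcomp Python package and its setup/open helpers; proteinworkshop's datamodules handle decompression internally, so this is only needed if you want to inspect a database yourself:
import foldcomp

# download a FoldComp database (here, the AlphaFold2 SwissProt predictions)
foldcomp.setup("afdb_swissprot_v4")

# iterate over entries; each entry decompresses to a PDB-formatted string
with foldcomp.open("afdb_swissprot_v4") as db:
    for name, pdb in db:
        print(name, len(pdb))
        break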



Name | Description | Source | Size | Disk Size | License
astral | SCOPe domain structures | SCOPe/ASTRAL | — | 1 - 2.2 Gb | Publicly available
afdb_rep_v4 | Representative structures identified from the AlphaFold database by FoldSeek structural clustering | Barrio-Hernandez et al. | 2.27M Chains | 9.6 Gb | GPL-3.0
afdb_rep_dark_v4 | Dark proteome structures identified by structural clustering of the AlphaFold database | Barrio-Hernandez et al. | ~800k | 2.2 Gb | GPL-3.0
afdb_swissprot_v4 | AlphaFold2 predictions for SwissProt/UniProtKB | Kim et al. | 542k Chains | 2.9 Gb | GPL-3.0
afdb_uniprot_v4 | AlphaFold2 predictions for UniProt | Kim et al. | 214M Chains | 1 Tb | GPL-3.0 / CC-BY 4.0
cath | CATH 4.2 40% split by CATH topologies | Ingraham et al. | ~18k chains | 4.3 Gb | CC-BY 4.0
esmatlas | ESMAtlas predictions (full) | Kim et al. | — | 1 Tb | GPL-3.0 / CC-BY 4.0
esmatlas_v2023_02 | ESMAtlas predictions (v2023_02 release) | Kim et al. | — | 137 Gb | GPL-3.0 / CC-BY 4.0
highquality_clust30 | ESMAtlas High Quality predictions | Kim et al. | 37M Chains | 114 Gb | GPL-3.0 / CC-BY 4.0
igfold_paired_oas | IGFold predictions for Paired OAS | Ruffolo et al. | 104,994 paired Ab chains | — | CC-BY 4.0
igfold_jaffe | IGFold predictions for Jaffe2022 data | Ruffolo et al. | 1,340,180 paired Ab chains | — | CC-BY 4.0
pdb | Experimental structures deposited in the RCSB Protein Data Bank | wwPDB consortium | ~800k Chains | 23 Gb | CC0 1.0




Additionally, we provide several species-specific compilations (mostly reference species)



Name | Description | Source
a_thaliana | Arabidopsis thaliana (thale cress) proteome | AlphaFold2
c_albicans | Candida albicans (a fungus) proteome | AlphaFold2
c_elegans | Caenorhabditis elegans (roundworm) proteome | AlphaFold2
d_discoideum | Dictyostelium discoideum (slime mold) proteome | AlphaFold2
d_melanogaster | Drosophila melanogaster (fruit fly) proteome | AlphaFold2
d_rerio | Danio rerio (zebrafish) proteome | AlphaFold2
e_coli | Escherichia coli (a bacterium) proteome | AlphaFold2
g_max | Glycine max (soybean) proteome | AlphaFold2
h_sapiens | Homo sapiens (human) proteome | AlphaFold2
m_jannaschii | Methanocaldococcus jannaschii (an archaeon) proteome | AlphaFold2
m_musculus | Mus musculus (mouse) proteome | AlphaFold2
o_sativa | Oryza sativa (rice) proteome | AlphaFold2
r_norvegicus | Rattus norvegicus (brown rat) proteome | AlphaFold2
s_cerevisiae | Saccharomyces cerevisiae (brewer's yeast) proteome | AlphaFold2
s_pombe | Schizosaccharomyces pombe (a fungus) proteome | AlphaFold2
z_mays | Zea mays (corn) proteome | AlphaFold2





Supervised Datasets



Name | Description | Source | License
antibody_developability | Antibody developability prediction | Chen et al. | CC-BY 3.0
atom3d_msp | Mutation stability prediction | Townshend et al. | MIT
atom3d_ppi | Protein-protein interaction prediction | Townshend et al. | MIT
atom3d_psr | Protein structure ranking | Townshend et al. | MIT
atom3d_res | Residue identity prediction | Townshend et al. | MIT
ccpdb_ligands | Ligand binding residue prediction | Agrawal et al. | Publicly Available
ccpdb_metal | Metal ion binding residue prediction | Agrawal et al. | Publicly Available
ccpdb_nucleic | Nucleic acid binding residue prediction | Agrawal et al. | Publicly Available
ccpdb_nucleotides | Nucleotide binding residue prediction | Agrawal et al. | Publicly Available
deep_sea_proteins | Gene Ontology prediction (Biological Process) | Sieg et al. | Public domain
go-bp | Gene Ontology prediction (Biological Process) | Gligorijevic et al. | CC-BY 4.0
go-cc | Gene Ontology prediction (Cellular Component) | Gligorijevic et al. | CC-BY 4.0
go-mf | Gene Ontology prediction (Molecular Function) | Gligorijevic et al. | CC-BY 4.0
ec-reaction | Enzyme Commission (EC) number prediction | Hermosilla et al. | MIT
fold-fold | Fold prediction, split at the fold level | Hou et al. | CC-BY 4.0
fold-family | Fold prediction, split at the family level | Hou et al. | CC-BY 4.0
fold-superfamily | Fold prediction, split at the superfamily level | Hou et al. | CC-BY 4.0
masif-site | Protein-protein interaction site prediction | Gainza et al. | Apache 2.0
metal_3d | Zinc binding site prediction | Duerr et al. | MIT
ptm | Post-translational modification site prediction | Yan et al. | CC-BY 4.0



Tasks
Self-Supervised Tasks



Name | Description | Source
inverse_folding | Predict amino acid sequence given structure | —
residue_prediction | Masked residue type prediction | —
distance_prediction | Masked edge distance prediction | Zhang et al.
angle_prediction | Masked triplet angle prediction | Zhang et al.
dihedral_angle_prediction | Masked quadruplet dihedral prediction | Zhang et al.
multiview_contrast | Contrastive learning with multiple crops and InfoNCE loss | Zhang et al.
structural_denoising | Denoising of atomic coordinates with SE(3) decoders | —




Generic Supervised Tasks
Generic supervised tasks can be applied broadly across datasets. The labels are directly extracted from the PDB structures.
These are likely to be most frequently used with the pdb dataset class which wraps the PDB Dataset curator from Graphein.



Name | Description | Requires
binding_site_prediction | Predict ligand binding residues | HETATM ligands (for training)
ppi_site_prediction | Predict protein binding residues | graph_y attribute in data objects specifying the desired chain to select interactions for (for training)



Featurisation Schemes
Part of the goal of the proteinworkshop benchmark is to investigate the degree to which increasing granularity of structural detail affects performance. To achieve this, we provide several featurisation schemes for protein structures.
Invariant Node Features
N.B. All angular features are provided in [sin, cos]-transformed form, e.g. dihedrals = [sin(ϕ), cos(ϕ), sin(ψ), cos(ψ), sin(ω), cos(ω)]; hence their dimensionality is double the number of angles.
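As a standalone illustration of this convention (not code from the library, whose internal ordering of the sin/cos terms may differ), the three backbone dihedrals of a single residue expand into a six-dimensional feature vector:
import torch

# hypothetical backbone dihedrals (phi, psi, omega) for one residue, in radians
dihedrals = torch.tensor([-1.05, 2.27, 3.10])

# [sin, cos] transform: each angle contributes two features, so 3 angles -> 6 dimensions
features = torch.stack([torch.sin(dihedrals), torch.cos(dihedrals)], dim=-1).flatten()
print(features.shape)  # torch.Size([6])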



Name | Description | Dimensionality
residue_type | One-hot encoding of amino acid type | 21
positional_encoding | Transformer-like positional encoding of sequence position | 16
alpha | Virtual torsion angle defined by the four Cα atoms of residues i−1, i, i+1, i+2 | 2
kappa | Virtual bond (bend) angle defined by the three Cα atoms of residues i−2, i, i+2 | 2
dihedrals | Backbone dihedral angles (ϕ, ψ, ω) | 6
sidechain_torsions | Sidechain torsion angles (χ1–χ4) | 8



Equivariant Node Features



Name | Description | Dimensionality
orientation | Forward and backward node orientation vectors (unit-normalized) | 2



Edge Construction
We predominantly support two types of edges: k-NN and ϵ edges.
Edge types can be specified as follows:
python proteinworkshop/train.py ... features.edge_types=[knn_16, knn_32, eps_16]

The suffix after knn or eps specifies k (the number of neighbours) or ϵ (the distance threshold in Ångströms).
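The same edge types can be passed to the featuriser when using the library functionally. A minimal sketch, mirroring the ProteinFeaturiser example above (the particular combination of edge types is illustrative):
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16", "eps_16"],  # k-NN edges with k=16 plus radius edges with ϵ=16 Å
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

batch = featuriser(create_example_batch())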
Invariant Edge Features



Name | Description | Dimensionality
edge_distance | Euclidean distance between source and target nodes | 1
node_features | Concatenated scalar node features of the source and target nodes | Number of scalar node features × 2
edge_type | Type annotation for each edge | 1
sequence_distance | Sequence-based distance between source and target nodes | 1
pos_emb | Structured Transformer-inspired positional embedding of i−j for source node i and target node j | 16



Equivariant Edge Features



Name | Description | Dimensionality
edge_vectors | Edge directional vectors (unit-normalized) | 1



For Developers
Dependency Management
We use poetry to manage the project's underlying dependencies and to push updates to the project's PyPI package. To make changes to the project's dependencies, follow the instructions below to (1) install poetry on your local machine; (2) customize the dependencies; or (3) (de)activate the project's virtual environment using poetry:


Install poetry for platform-agnostic dependency management using its installation instructions
After installing poetry, to avoid potential keyring errors, disable its keyring usage by adding PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring to your shell's startup configuration and restarting your shell environment (e.g., echo 'export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring' >> ~/.bashrc && source ~/.bashrc for a Bash shell environment and likewise for other shell environments).


Install, add, or upgrade project dependencies
poetry install # install the latest project dependencies
# or
poetry add XYZ # add dependency `XYZ` to the project
# or
poetry show # list all dependencies currently installed
# or
poetry lock # standardize the (now-)installed dependencies



Activate the newly-created virtual environment following poetry's usage documentation
# activate the environment on a `posix`-like (e.g., macOS or Linux) system
source $(poetry env info --path)/bin/activate

# activate the environment on a `Windows`-like system
& ((poetry env info --path) + "\Scripts\activate.ps1")

# if desired, deactivate the environment
deactivate



Code Formatting
To keep with the code style of the proteinworkshop repository, please format your commits using the following commands before opening a pull request:
# assuming you are located in the `ProteinWorkshop` top-level directory
isort .
autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
black --config=pyproject.toml .

Documentation
To build a local version of the project's Sphinx documentation web pages:
# assuming you are located in the `ProteinWorkshop` top-level directory
pip install -r docs/.docs.requirements # one-time only
rm -rf docs/build/ && sphinx-build docs/source/ docs/build/ # NOTE: errors can safely be ignored
