
MuthusaravananS/PINPOINT

PINPOINT: Protease INhibitor PredictiOn at plant–pathogen INTerface using large language models and structural modeling.

An open pipeline for discovering novel protease inhibitors at the plant-pathogen interface

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

This repository contains the code and notebooks for PINPOINT, a deep learning pipeline that identifies small secretory proteins (SSPs) with no known Pfam domain as protease inhibitors based on protein sequence and structure. All modules run directly in Google Colab, with no local installation required. The free-tier GPU can handle up to ~4000 SSPs efficiently; for larger datasets, we recommend Colab Pro (for extended runtime/GPU) or a local installation with GPU acceleration.

Fine-Tuned Models (Openly Available on Hugging Face)

The core prediction models from this study are publicly hosted on Hugging Face and free to use or download.

All models focus on small proteins (<250 AA) and are trained on curated MEROPS/UniProt/AlphaFold data.
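Because the models target proteins shorter than 250 residues, it can help to pre-filter a FASTA file by length before screening. The following is a minimal sketch using only the Python standard library. The function names `read_fasta` and `filter_short` and the `max_len` parameter are illustrative and are not part of the PINPOINT notebooks, which may already handle this step themselves.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the token before the first space as the ID; fall back to a counter.
            header = parts[0] if parts else f"seq{len(records) + 1}"
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def filter_short(records: Dict[str, str], max_len: int = 250) -> Dict[str, str]:
    """Keep non-empty sequences strictly shorter than max_len residues."""
    return {h: s for h, s in records.items() if 0 < len(s) < max_len}


# Toy input: one short sequence split over two lines, one sequence that is too long.
example = [">sp1 toy protein", "MKT", "AAA", ">sp2", "M" * 300]
short = filter_short(read_fasta(example))
```

With a real file, the same call would be `filter_short(read_fasta(open("candidates.fasta")))`, where the file name is a placeholder.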

Available Models on Hugging Face

| Model | Type | Base | Key Features | Link |
| PIPES-M | Binary sequence classifier | ESM-2 (150M) | Large-scale sequence-based screening | Open PIPES-M |
| PIP-BERT | Binary sequence classifier | ProtBERT | Large-scale sequence-based screening | Open PIP-BERT |
| structuralmodule-protease_inhibitors | Unsupervised one-class autoencoder (PyOD) | RCSB embeddings | Filters non-inhibitor protein structures by reconstruction error; trained on ~18k PI structures | Open Structural Module |

Quick Overview

  • PIPES-M
    Fine-tuned ESM-2 (150M params) binary classifier. Predicts if a protein sequence is a potential protease inhibitor using only the primary sequence. Ideal for fast, structure-free screening of small proteins (<250 AA).

  • PIP-BERT
    Fine-tuned ProtBERT binary classifier. Predicts if a protein sequence is a potential protease inhibitor using only the primary sequence. Ideal for fast, structure-free screening of small proteins (<250 AA).

  • structuralmodule-protease_inhibitors
    Unsupervised one-class autoencoder (PyOD/PyTorch) for structural filtering. Detects non-PI-like structures via high reconstruction error. Trained on ~18k curated protease inhibitor structures from MEROPS + AlphaFold.
    → Input: standardized RCSB embeddings calculated from input .cif (not raw PDB/CIF). Use the provided scaler.
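The filtering idea behind the structural module can be sketched as follows: a one-class model trained only on inhibitor structures assigns each new structure a reconstruction error, and structures whose error exceeds a cutoff derived from the training errors are discarded. The code below illustrates only that thresholding logic in plain Python; it is not the repository's implementation. The `contamination` value, the function names, and the toy error values are assumptions. The cutoff rule follows the PyOD convention of placing the threshold at the (1 - contamination) quantile of the training scores.

```python
from typing import Dict, List, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile for 0 <= q <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def keep_inhibitor_like(
    train_errors: Sequence[float],
    new_errors: Dict[str, float],
    contamination: float = 0.1,  # assumed value, not taken from the repository
) -> List[str]:
    """Return names of structures whose error does not exceed the training cutoff."""
    cutoff = quantile(train_errors, 1.0 - contamination)
    return [name for name, err in new_errors.items() if err <= cutoff]


# Toy reconstruction errors (hypothetical numbers for illustration only).
train = [0.1, 0.2, 0.3, 0.4, 0.5]
candidates = {"candidate_a": 0.35, "candidate_b": 0.9}
kept = keep_inhibitor_like(train, candidates, contamination=0.25)
```

In the actual pipeline the errors come from the trained autoencoder applied to scaled embeddings, so only the final comparison against a cutoff carries over from this sketch.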

If you use these in your work, please cite the repo / Hugging Face pages.

Star ⭐ the repo if you find it useful!

Authors

  • Muthusaravanan S
    Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, India

  • Balakumaran Chandrasekar
    Indian Institute of Technology, Jodhpur, India

Abstract

To be included...

Usage Instructions

To use protease-inhibitor-prediction, run the Sequence Module Colab notebook to screen candidates with the fine-tuned PIPES-M and PIP-BERT models (it handles up to ~4000 sequences on the free Colab GPU). For promising hits, obtain 3D structures either via the ESMFold API notebook (de novo prediction) or the AlphaFold Fetch notebook (downloads precomputed models from UniProt IDs uploaded in a .txt file). Next, apply the Structural Module notebook to filter candidates with the dedicated autoencoder model and retain only those with protease inhibitor-like structural features. Finally, model the retained candidates as heterodimers with the target immune proteases of interest using your preferred multimeric protein structure modeling platform (GPU-accelerated local ColabFold is strongly recommended for large-scale screening). All steps except multimeric complex modeling run directly in Google Colab with no installation required; open the notebooks and enable the GPU runtime for best performance, then follow the usage instructions in each notebook.

Colab Notebooks (Pipeline Modules)

  1. Sequence Module: initial screening with the fine-tuned sequence-based models
    Open in Colab

  2. ESMFold API: de novo structure prediction for screened candidates, output in PDB format (requires the mature sequences of positive hits as a .fasta file)
    Open in Colab

  3. AlphaFold Fetch: downloads precomputed AlphaFold structures in .cif format (requires the UniProt IDs of positive hits as a .txt file)
    Open in Colab

  4. Structural Module: filters out structures lacking protease inhibitor-like features (requires protein structure files in .cif format, compressed as .zip or .rar)
    Open in Colab

  5. Heterodimer Modeling: recommended tool for final interaction prediction
    GPU-accelerated ColabFold
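The notebook inputs listed above are plain files: a .txt file of UniProt IDs for AlphaFold Fetch and a compressed archive of .cif files for the Structural Module. A minimal sketch for preparing both with the Python standard library is shown below. The one-ID-per-line layout, the file names, and the helper names are assumptions; check the usage notes inside each notebook for the exact format it expects.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def write_id_list(ids: Iterable[str], out_path: Path) -> int:
    """Write unique, non-empty UniProt IDs one per line (assumed layout)."""
    cleaned = (i.strip() for i in ids)
    unique = list(dict.fromkeys(i for i in cleaned if i))  # keeps first-seen order
    out_path.write_text("\n".join(unique) + "\n")
    return len(unique)


def zip_cif_files(structure_dir: Path, out_zip: Path) -> List[str]:
    """Bundle every .cif file in structure_dir into a flat .zip archive."""
    names: List[str] = []
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for cif in sorted(structure_dir.glob("*.cif")):
            zf.write(cif, arcname=cif.name)  # store without the directory prefix
            names.append(cif.name)
    return names
```

A typical call would be `write_id_list(hit_ids, Path("hits.txt"))` followed by `zip_cif_files(Path("structures"), Path("structures.zip"))`, with both paths chosen by the user.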

Note: Free Colab has session limits (~12h runtime, occasional disconnects); Colab Pro removes most restrictions for heavy use.

Computational Requirements

  • Google Colab (free tier sufficient for <4000 sequences; Pro recommended for larger batches or longer runs).
  • GPU acceleration enabled (Runtime → Change runtime type → GPU).
  • No local install needed, but for very large-scale screening local GPU setup is recommended.

If you use this tool or models, please cite:
Muthusaravanan S et al. (in preparation).

Key tools and resources used in this pipeline:

  • ESM-2 & ProtBERT base models — Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6638); Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7112-7127.
  • ESMFold / ESMAtlas API — Lin et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. DOI: 10.1126/science.ade2574
  • AlphaFold (precomputed models & Multimer) — Jumper et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature. DOI: 10.1038/s41586-021-03819-2; Evans et al. (2022). Protein complex prediction with AlphaFold-Multimer. bioRxiv. DOI: 10.1101/2021.10.04.463034; Fleming J. et al. AlphaFold Protein Structure Database and 3D-Beacons: New Data and Capabilities. Journal of Molecular Biology, (2025).
  • ColabFold (GPU-accelerated implementation) — Mirdita et al. (2022). ColabFold: making protein folding accessible to all. Nature Methods. DOI: 10.1038/s41592-022-01488-1.
  • PyOD (for one-class autoencoder in structural filtering) — Zhao et al. (2019). PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research.
  • MEROPS database — Rawlings et al. (2018). The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Research. DOI: 10.1093/nar/gkx1134.
  • UNIPROT database — UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D609–D617, https://doi.org/10.1093/nar/gkae1010.

Have data to contribute?


We welcome datasets of known/putative protease inhibitors or SSPs to improve accuracy. Contact us!

Open models and access to data democratize scientific discovery. We encourage sharing sequences/structures in public repositories for open use.

For inquiries

Contact


Muthusaravanan S - @Muthu_Sivaram - muthusaravanan.ind@gmail.com
