Open pipeline proposed for discovering novel protease inhibitors at the plant-pathogen interface

PINPOINT: Protease INhibitor PredictiOn at plant–pathogen INTerface using large language models and structural modeling.

This repository contains the code and notebooks for PINPOINT, a deep learning pipeline for identifying small secretory proteins (SSPs) with no known Pfam domain as protease inhibitors, based on protein sequence and structure. All modules run directly in Google Colab, with no local installation required. The free-tier GPU can handle up to ~4000 SSPs efficiently; for larger datasets, we recommend Colab Pro (for extended runtime/GPU) or a local installation with GPU acceleration.

Fine-Tuned Models (Openly Available on Hugging Face)

The core prediction models from this study are publicly hosted on Hugging Face and free to use and download.

All models focus on small proteins (<250 AA) and are trained on curated MEROPS/UniProt/AlphaFold data.
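Because the models target small proteins, it can help to drop longer sequences before screening. The following is a minimal sketch, assuming a plain FASTA input; the function names and the strict "shorter than 250 residues" handling are illustrative choices, not code taken from the PINPOINT notebooks.

```python
from typing import Dict, Iterable

MAX_LENGTH = 250  # the models are described as targeting proteins shorter than 250 AA


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def keep_small_proteins(records: Dict[str, str], max_length: int = MAX_LENGTH) -> Dict[str, str]:
    """Keep only non-empty sequences strictly shorter than max_length residues."""
    return {rid: seq for rid, seq in records.items() if 0 < len(seq) < max_length}


# Toy input: one short record split over two lines, one 300-residue record.
example = [">ssp1 hypothetical", "MKTLLVAAL", "CSSA", ">big1", "M" * 300]
small = keep_small_proteins(parse_fasta(example))
print(sorted(small))  # ['ssp1']
```

Filtering up front also keeps the input closer to the ~4000-sequence budget quoted for the free Colab tier.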

Available Models on Hugging Face

Model	Type	Base	Key Features
PIPES-M	Binary sequence classifier	ESM-2 (150M)	Large-scale sequence-based screening
PIP-BERT	Binary sequence classifier	ProtBERT	Large-scale sequence-based screening
structuralmodule-protease_inhibitors	Unsupervised one-class autoencoder (PyOD)	RCSB embeddings	Filters non-inhibitor protein structures by reconstruction error; trained on ~18k PI structures
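The two sequence classifiers sit on base models with different input conventions. ProtBERT tokenizers expect uppercase residues separated by spaces, with the rare residues U, Z, O and B mapped to X, while ESM-2 tokenizers take the contiguous sequence. The sketch below shows only that string preparation; whether the PINPOINT notebooks already do this internally is not stated here, and the function names are hypothetical.

```python
import re


def format_for_protbert(sequence: str) -> str:
    """Return a ProtBERT-style input: uppercase, rare residues as X, space-separated."""
    cleaned = re.sub(r"\s+", "", sequence).upper()
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    return " ".join(cleaned)


def format_for_esm2(sequence: str) -> str:
    """Return a contiguous uppercase sequence, the form ESM-2 tokenizers take."""
    return re.sub(r"\s+", "", sequence).upper()


print(format_for_protbert("mkUa"))  # M K X A
print(format_for_esm2(" mk ta\n"))  # MKTA
```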

Quick Overview

PIPES-M
Fine-tuned ESM-2 (150M params) binary classifier. Predicts if a protein sequence is a potential protease inhibitor using only the primary sequence. Ideal for fast, structure-free screening of small proteins (<250 AA).
PIP-BERT
Fine-tuned ProtBERT binary classifier. Predicts if a protein sequence is a potential protease inhibitor using only the primary sequence. Ideal for fast, structure-free screening of small proteins (<250 AA).
structuralmodule-protease_inhibitors
Unsupervised one-class autoencoder (PyOD/PyTorch) for structural filtering. Detects non-PI-like structures via high reconstruction error. Trained on ~18k curated protease inhibitor structures from MEROPS + AlphaFold.
→ Input: standardized RCSB embeddings computed from the input .cif files, not the raw PDB/CIF files themselves. Standardize the embeddings with the provided scaler.
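The structural filter rests on one idea: an autoencoder trained only on protease inhibitor structures reconstructs PI-like embeddings well and others poorly. The sketch below illustrates that decision rule in plain Python. It is not the released model: `toy_reconstruct` is a stand-in for the trained autoencoder, and the scaler parameters and threshold are hypothetical values chosen for the example.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def standardize(x: Vector, mean: Vector, scale: Vector) -> List[float]:
    """Apply a fitted standard scaler: (x - mean) / scale, per feature."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, scale)]


def reconstruction_error(x: Vector, reconstructed: Vector) -> float:
    """Euclidean distance between an input vector and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, reconstructed)))


def is_pi_like(
    x: Vector,
    mean: Vector,
    scale: Vector,
    reconstruct: Callable[[List[float]], Vector],
    threshold: float,
) -> bool:
    """Retain a structure when its reconstruction error stays within the threshold."""
    z = standardize(x, mean, scale)
    return reconstruction_error(z, reconstruct(z)) <= threshold


def toy_reconstruct(z: Vector) -> List[float]:
    """Toy stand-in for a trained autoencoder: clips each feature to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in z]


MEAN, SCALE, THRESHOLD = [1.0, 2.0], [0.5, 2.0], 1.0  # hypothetical values
typical = is_pi_like([1.25, 3.0], MEAN, SCALE, toy_reconstruct, THRESHOLD)
outlier = is_pi_like([4.0, 2.0], MEAN, SCALE, toy_reconstruct, THRESHOLD)
print(typical, outlier)  # True False
```

In practice the scaler and threshold should come from the files shipped with the model rather than being re-estimated on the query set, since the filter is only meaningful relative to the training distribution.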

If you use these in your work, please cite the repo / Hugging Face pages.

Star ⭐ the repo if you find it useful!

Authors

Muthusaravanan S
Department of Biological Sciences, Birla Institute of Technology and Science, Pilani, India
Balakumaran Chandrasekar
Indian Institute of Technology, Jodhpur, India

Abstract

To be included...

Usage Instructions

To use protease-inhibitor-prediction, work through the modules in order:

1. Run the Sequence Module Colab notebook to screen candidates with the fine-tuned PIPES-M and PIP-BERT models (handles up to ~4000 sequences on the free Colab GPU).
2. For promising hits, obtain 3D structures either via the ESMFold API notebook (de novo prediction) or the AlphaFold Fetch notebook (downloads precomputed models for UniProt IDs uploaded in a .txt file).
3. Apply the Structural Module notebook to filter candidates with the dedicated autoencoder model, retaining only those with protease inhibitor-like structural features.
4. Model the retained candidates as heterodimers with the target immune proteases of interest, using your preferred multimeric protein structure modeling platform. GPU-accelerated local ColabFold is strongly recommended for large-scale screening.

All steps except multimeric complex modeling can be run directly in Google Colab with no installation required; simply open the notebooks and enable the GPU runtime for optimal performance. Follow the usage instructions in each notebook.
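The AlphaFold Fetch step takes UniProt IDs in a .txt file. A minimal sketch of preparing that file is shown here, assuming one accession per line (the exact layout the notebook expects is an assumption). The regular expression is the accession pattern documented by UniProt; `write_uniprot_ids` is a hypothetical helper name.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterable, List

# Accession number pattern as documented in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def write_uniprot_ids(ids: Iterable[str], path: str) -> List[str]:
    """Write valid, de-duplicated accessions one per line; return what was written."""
    seen = set()
    kept: List[str] = []
    for raw in ids:
        acc = raw.strip().upper()
        if acc and acc not in seen and UNIPROT_ACCESSION.match(acc):
            seen.add(acc)
            kept.append(acc)
    Path(path).write_text("".join(f"{acc}\n" for acc in kept))
    return kept


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "positive_hits.txt"
    written = write_uniprot_ids(["P12345", "p12345 ", "A0A024R161", "not_an_id", ""], str(out))
    content = out.read_text()

print(written)  # ['P12345', 'A0A024R161']
```

Rejecting malformed IDs before upload avoids spending Colab session time on downloads that cannot succeed.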

Colab Notebooks (Pipeline Modules)

Sequence Module: initial screening with the fine-tuned sequence-based models
Open in Colab
ESMFold API: de novo structure prediction for screened candidates, output in PDB format (requires the mature sequences of positive hits as .fasta)
Open in Colab
AlphaFold Fetch: downloads precomputed AlphaFold structures in .cif format (requires the UniProt IDs of positive hits as .txt)
Open in Colab
Structural Module: filters out structures lacking protease inhibitor-like features (requires protein structure files in .cif format, compressed as .zip or .rar)
Open in Colab
Heterodimer Modeling: recommended tool for final interaction prediction
GPU-accelerated ColabFold
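Two of the notebooks listed above expect prepared upload files: a FASTA of mature sequences for ESMFold and a compressed archive of .cif files for the Structural Module. The sketch below shows one way to produce both with the Python standard library. It assumes the signal peptide length of each hit is already known (for example from a signal peptide predictor, which this README does not specify), and both helper names are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict, List


def write_mature_fasta(sequences: Dict[str, str], signal_lengths: Dict[str, int], path: Path) -> None:
    """Write each sequence with its signal peptide (first N residues) removed.

    IDs missing from signal_lengths are written unchanged.
    """
    with open(path, "w") as handle:
        for rid, seq in sequences.items():
            mature = seq[signal_lengths.get(rid, 0):]
            handle.write(f">{rid}\n{mature}\n")


def zip_cif_files(structure_dir: Path, archive: Path) -> List[str]:
    """Bundle every .cif file in structure_dir into one ZIP; return archived names."""
    names: List[str] = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for cif in sorted(structure_dir.glob("*.cif")):
            zf.write(cif, arcname=cif.name)  # flat archive, no directory prefix
            names.append(cif.name)
    return names


# Demo with toy files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    fasta = tmp_path / "mature_hits.fasta"
    write_mature_fasta({"ssp1": "MKTLLAAQPE"}, {"ssp1": 6}, fasta)
    fasta_text = fasta.read_text()

    structures = tmp_path / "structures"
    structures.mkdir()
    for name in ("b.cif", "a.cif", "notes.txt"):
        (structures / name).write_text("data_demo\n")
    archived = zip_cif_files(structures, tmp_path / "structures.zip")

print(archived)  # ['a.cif', 'b.cif']
```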

Note: Free Colab has session limits (~12h runtime, occasional disconnects); Colab Pro removes most restrictions for heavy use.

Computational Requirements

Google Colab (free tier sufficient for <4000 sequences; Pro recommended for larger batches or longer runs).
GPU acceleration enabled (Runtime → Change runtime type → GPU).
No local installation needed, but a local GPU setup is recommended for very large-scale screening.
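For datasets above the free-tier budget, one option is to split the input into batches and run each in its own session. This is a small sketch of that splitting; the 4000 figure is the approximate capacity quoted in this README, and `batched` is an illustrative helper rather than part of the pipeline.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
FREE_TIER_BATCH = 4000  # approximate free-tier capacity quoted in this README


def batched(items: Sequence[T], batch_size: int = FREE_TIER_BATCH) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 9500 placeholder records split into free-tier-sized batches.
sizes = [len(batch) for batch in batched(list(range(9500)))]
print(sizes)  # [4000, 4000, 1500]
```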

If you use this tool or models, please cite:
Muthusaravanan S et al. (in preparation).

Key tools and resources used in this pipeline:

ESM-2 & ProtBERT base models: Lin et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6638). DOI: 10.1126/science.ade2574; Elnaggar et al. (2022). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 7112-7127.
ESMFold / ESMAtlas API: Lin et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. DOI: 10.1126/science.ade2574.
AlphaFold (precomputed models & Multimer): Jumper et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature. DOI: 10.1038/s41586-021-03819-2; Evans et al. (2022). Protein complex prediction with AlphaFold-Multimer. bioRxiv. DOI: 10.1101/2021.10.04.463034; Fleming et al. (2025). AlphaFold Protein Structure Database and 3D-Beacons: New Data and Capabilities. Journal of Molecular Biology.
ColabFold (GPU-accelerated implementation): Mirdita et al. (2022). ColabFold: making protein folding accessible to all. Nature Methods. DOI: 10.1038/s41592-022-01488-1.
PyOD (for the one-class autoencoder in structural filtering): Zhao et al. (2019). PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research.
MEROPS database: Rawlings et al. (2018). The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Research. DOI: 10.1093/nar/gkx1134.
UniProt database: The UniProt Consortium (2025). UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research, 53(D1), D609-D617. DOI: 10.1093/nar/gkae1010.

Have data to contribute?

We welcome datasets of known/putative protease inhibitors or SSPs to improve accuracy. Contact us!

Open models and access to data democratize scientific discovery. We encourage sharing sequences/structures in public repositories for open use.

For inquiries

Contact

Muthusaravanan S - @Muthu_Sivaram - muthusaravanan.ind@gmail.com
