# DNA-TF-Binding-Predictor

Deep learning model that predicts CTCF transcription factor binding sites in DNA sequences, built with a PyTorch CNN and validated via In Silico Mutagenesis.
- Project Overview
- What is CTCF?
- Problem Statement
- Dataset
- Model Architecture
- Performance Metrics
- In Silico Mutagenesis
- Project Structure
- Key Features
- Installation
- Usage
- Results
- Technical Implementation
- Future Enhancements
- Technologies Used
- Citation & Acknowledgments
- Contributing
- License
- Author
## Project Overview

This project implements a Convolutional Neural Network (CNN) in PyTorch to predict CTCF transcription factor binding sites in DNA sequences. The model achieves 89.61% test accuracy and 96.27% AUC-ROC on a dataset of over 61,000 DNA sequences.

Highlights:
- 89.61% test accuracy on DNA binding prediction
- 96.27% AUC-ROC, demonstrating excellent discrimination
- 461,953 trainable parameters in an optimized CNN architecture
- ~3.5 minutes of training time on GPU with early stopping
- Biological validation through In Silico Mutagenesis analysis
- Comprehensive metrics tracking for model interpretability
## What is CTCF?

CTCF (CCCTC-binding factor) is a highly conserved zinc finger protein that plays a crucial role in gene regulation:
- Chromatin Architecture: organizes 3D genome structure by creating chromatin loops
- Insulator Function: acts as a boundary element to prevent inappropriate enhancer-promoter interactions
- Gene Expression: critical regulator of gene expression patterns across cell types
- Binding Specificity: recognizes a specific DNA motif at thousands of sites across the genome (this project uses 200bp windows around candidate sites)

Understanding where CTCF binds is essential for:
- Epigenetics Research: studying gene regulation mechanisms
- Medical Genomics: identifying disease-associated regulatory variants
- Drug Development: targeting regulatory elements for therapeutic intervention
## Problem Statement

Predicting transcription factor binding sites from DNA sequence alone is challenging:
- Sequence Complexity: DNA contains 4 nucleotides (A, C, G, T), so a 200bp window has 4^200 possible sequences
- Context Dependency: binding depends on sequence motifs, but also on the surrounding context
- Class Imbalance: most genomic regions do not bind CTCF
- Interpretability: we need to understand what the model learns beyond prediction accuracy

Our CNN-based approach:
- One-Hot Encoding: converts DNA sequences into 4-channel tensors (A, C, G, T)
- Convolutional Layers: automatically learn sequence motifs relevant for binding
- Binary Classification: predicts binding vs non-binding with high accuracy
- ISM Validation: In Silico Mutagenesis reveals the biological plausibility of learned patterns
## Dataset

| Attribute | Value |
|---|---|
| Total Sequences | 61,083 |
| Sequence Length | 200 base pairs (bp) |
| Training Set | 42,758 sequences (70%) |
| Validation Set | 9,162 sequences (15%) |
| Test Set | 9,163 sequences (15%) |
| Positive Class | CTCF binding sites |
| Negative Class | Non-binding regions |
| Source | O'Reilly "Deep Learning for Biology" |
The dataset is stored as TSV (tab-separated values) with the following columns:

| Column | Description | Example |
|---|---|---|
| `sequence` | DNA sequence (200bp) | `ATCGATCG...` |
| `label` | Binary label (0 or 1) | `1` (binding) |
| `transcription_factor` | TF name | `CTCF` |
| `subset` | Data split | `train` / `val` / `test` |
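Loading the TSV and recovering the splits might look like the sketch below (illustrative; the real file is `data/train.tsv`, and the sample rows here are made up and shortened for display):

```python
import io
import pandas as pd

# A few illustrative rows in the TSV layout described above
# (real sequences are 200bp; these are shortened stand-ins).
sample = io.StringIO(
    "sequence\tlabel\ttranscription_factor\tsubset\n"
    "ATCGATCG\t1\tCTCF\ttrain\n"
    "GGGCCCAA\t0\tCTCF\tval\n"
    "TTAACCGG\t1\tCTCF\ttest\n"
)

# For the real data: pd.read_csv("data/train.tsv", sep="\t")
df = pd.read_csv(sample, sep="\t")

# The `subset` column already encodes the train/val/test split
train_df = df[df["subset"] == "train"]
val_df = df[df["subset"] == "val"]
test_df = df[df["subset"] == "test"]
```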
DNA sequences are converted to one-hot encoding:

```
A → [1, 0, 0, 0]
C → [0, 1, 0, 0]
G → [0, 0, 1, 0]
T → [0, 0, 0, 1]
N → [0.25, 0.25, 0.25, 0.25]  # Unknown base
```

Result: a 200bp sequence becomes a tensor of shape (4, 200).
## Model Architecture

```
Input: DNA Sequence (200 base pairs)
        ↓
One-Hot Encoding → Tensor (4 × 200)
        ↓
┌──────────────────────────────────────┐
│ Conv1D Layer                         │
│  • Filters: 64                       │
│  • Kernel Size: 10                   │
│  • Activation: ReLU                  │
│  • Output: (64 × 191)                │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ MaxPool1D                            │
│  • Kernel Size: 3                    │
│  • Output: (64 × 63)                 │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ Dropout (p=0.3)                      │
└──────────────────────────────────────┘
        ↓
Flatten → (4032,)
        ↓
┌──────────────────────────────────────┐
│ Dense Layer                          │
│  • Units: 128                        │
│  • Activation: ReLU                  │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ Dropout (p=0.3)                      │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ Output Dense Layer                   │
│  • Units: 1 (raw logit)              │
└──────────────────────────────────────┘
        ↓
Sigmoid (applied at inference) → Binding Probability [0, 1]
```

Note: because training uses `BCEWithLogitsLoss`, the network outputs a raw logit and the sigmoid is applied outside the model at inference time.
| Parameter | Value |
|---|---|
| Total Parameters | 461,953 |
| Conv Filters | 64 |
| Kernel Size | 10 |
| Dense Units | 128 |
| Dropout Rate | 0.3 |
| Optimizer | Adam (lr=1e-5) |
| Loss Function | BCEWithLogitsLoss |
| Scheduler | CosineAnnealingLR |
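The hyperparameters above imply a model along these lines. This is a sketch reconstructed from the architecture diagram, not the notebook's exact class: the class and layer names are illustrative, and the parameter count of this reconstruction may differ slightly from the reported 461,953.

```python
import torch
import torch.nn as nn

class DNACNNSketch(nn.Module):
    """CNN over one-hot DNA: (batch, 4, 200) -> 1 binding logit."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 64, kernel_size=10)  # (4, 200) -> (64, 191)
        self.pool = nn.MaxPool1d(kernel_size=3)       # (64, 191) -> (64, 63)
        self.drop = nn.Dropout(0.3)
        self.fc1 = nn.Linear(64 * 63, 128)            # flatten 4032 -> 128
        self.fc2 = nn.Linear(128, 1)                  # raw logit for BCEWithLogitsLoss

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = self.drop(x.flatten(start_dim=1))
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)

model_sketch = DNACNNSketch()
out = model_sketch(torch.zeros(1, 4, 200))
print(out.shape)  # torch.Size([1, 1])
```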
## Performance Metrics

| Dataset | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Train | 94.99% | 94.69% | 95.32% | 95.01% | 98.92% |
| Validation | 88.94% | 88.63% | 89.35% | 88.99% | 96.04% |
| Test | 89.61% | 89.07% | 90.31% | 89.68% | 96.27% |
| Metric | Value |
|---|---|
| Total Parameters | 461,953 |
| Training Samples | 42,758 sequences |
| Test Accuracy | 89.61% |
| Test AUC-ROC | 96.27% |
| Sequence Length | 200 base pairs |
| Training Time | ~211 seconds (GPU) |
| Epochs Trained | 28 (early stopping) |
| Best Val Loss | 0.2633 |
| Device Used | CUDA (GPU) |
The model shows excellent generalization with minimal overfitting:
| Metric | Train | Test | Gap |
|---|---|---|---|
| Accuracy | 94.99% | 89.61% | 5.38% |
| AUC-ROC | 98.92% | 96.27% | 2.65% |
| F1-Score | 95.01% | 89.68% | 5.33% |

Good generalization: the small gap indicates the model learned genuine biological patterns rather than memorizing the training data.
## In Silico Mutagenesis

In Silico Mutagenesis (ISM) is a computational technique for understanding how DNA sequence changes affect model predictions, providing biological interpretability. How it works:

1. Select a test DNA sequence (binding or non-binding)
2. Mutate each position systematically to all 4 nucleotides (A, C, G, T)
3. Predict the binding probability for each mutant
4. Analyze how the mutations shift the predictions
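The steps above can be sketched as follows. This is illustrative, not the notebook's code: `model` can be any callable mapping a `(1, 4, L)` tensor to a logit, and a small untrained stand-in is used here so the sketch is self-contained.

```python
import numpy as np
import torch

BASES = "ACGT"

def one_hot(seq):
    """Minimal one-hot encoder: string -> (4, len(seq)) float array."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for j, b in enumerate(seq):
        x[idx[b], j] = 1.0
    return x

def ism_scores(model, seq):
    """Score every single-base substitution.

    Returns a (len(seq), 4) array of mutant probability minus
    wild-type probability; position (p, k) is the effect of
    substituting BASES[k] at position p.
    """
    with torch.no_grad():
        ref = torch.from_numpy(one_hot(seq)).unsqueeze(0)
        wt = torch.sigmoid(model(ref)).item()
        scores = np.zeros((len(seq), 4), dtype=np.float32)
        for pos in range(len(seq)):
            for k, base in enumerate(BASES):
                mutant = seq[:pos] + base + seq[pos + 1:]
                x = torch.from_numpy(one_hot(mutant)).unsqueeze(0)
                scores[pos, k] = torch.sigmoid(model(x)).item() - wt
    return scores

# Untrained stand-in model, just to demonstrate the mechanics
dummy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4 * 12, 1))
scores = ism_scores(dummy, "ACGTACGTACGT")
print(scores.shape)  # (12, 4)
```

Substituting a position with its wild-type base leaves the sequence unchanged, so those entries are exactly zero, which is a handy sanity check.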
What ISM reveals:
- Critical Nucleotides: which positions matter most for binding
- Substitution Effects: how changing A→T affects binding differently than A→G
- Motif Discovery: identifies the sequence patterns the CNN learned
- Biological Validation: confirms the model learns realistic CTCF binding motifs

The ISM analysis notebook produces:
- Heatmaps showing mutation impact across all positions
- Importance plots highlighting critical binding positions
- Difference maps comparing positive vs negative sequences
## Project Structure

```
DNA-TF-Binding-Predictor/
├── README.md                  # Project documentation
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── LICENSE                    # MIT License
│
├── notebooks/
│   ├── 01_model_training.ipynb   # Complete training pipeline
│   └── 02_ism_analysis.ipynb     # In Silico Mutagenesis analysis
│
├── data/
│   ├── train.tsv              # 61K+ CTCF binding sequences
│   └── README.md              # Dataset documentation
│
├── models/
│   ├── dna_cnn_model_20250926_234649.pth           # Trained model weights (5.4MB)
│   ├── dna_cnn_model_20250926_234649_config.json   # Model architecture config
│   └── dna_cnn_model_20250926_234649_metrics.json  # Training metrics & history
│
└── docs/
    └── CTCF_Binding_Presentation.pptx  # Project presentation slides
```
## Key Features

Data pipeline:
- Custom PyTorch Dataset: efficient DNA sequence loading and encoding
- DataLoader: batched training with automatic shuffling
- Train/Val/Test Split: stratified split maintaining class balance
- One-Hot Encoding: handles standard (ACGT) and unknown (N) nucleotides

Training:
- Early Stopping: prevents overfitting (patience=8 epochs)
- Learning Rate Scheduling: CosineAnnealingLR for smooth convergence
- GPU Acceleration: CUDA support for faster training

Evaluation and interpretability:
- Comprehensive Metrics: accuracy, precision, recall, F1, AUC-ROC
- Training History: loss and accuracy tracked across all epochs
- In Silico Mutagenesis: systematic mutation analysis
- Biological Validation: sanity checks with known binding/non-binding sequences
- Visualization Suite: heatmaps, importance plots, performance curves

Engineering:
- Saved Metadata: model config, metrics, and training details
- Model Checkpointing: save/load models with full metadata
- Configuration Management: JSON config for reproducibility
- Metrics Persistence: detailed performance metrics saved
- Inference Pipeline: easy prediction on new sequences
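The custom Dataset mentioned above might look like the following sketch (class and argument names are illustrative, not the notebook's; it reuses the one-hot scheme described earlier and encodes sequences on the fly):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DNADataset(Dataset):
    """Wraps raw sequences and labels; one-hot encodes on access."""
    MAPPING = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
               'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1],
               'N': [0.25, 0.25, 0.25, 0.25]}  # unknown base

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        # (L, 4) -> transpose to the (4, L) channel-first layout
        x = np.array([self.MAPPING[b] for b in self.sequences[i]],
                     dtype=np.float32).T
        y = np.float32(self.labels[i])
        return torch.from_numpy(x), torch.tensor(y)

# Toy example with two 20bp sequences
ds = DNADataset(["ACGTN" * 4, "TTTTT" * 4], [1, 0])
loader = DataLoader(ds, batch_size=2, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([2, 4, 20]) torch.Size([2])
```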
## Installation

Prerequisites:
- Python 3.8 or higher
- CUDA-capable GPU (recommended) or CPU
- 8GB+ RAM
- Jupyter Notebook

1. Clone the repository:

```bash
git clone https://github.com/sys0507/DNA-TF-Binding-Predictor.git
cd DNA-TF-Binding-Predictor
```

2. Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

4. Verify the PyTorch installation:

```bash
python -c "import torch; print(f'PyTorch {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

5. Launch Jupyter Notebook:

```bash
jupyter notebook
```

6. Open and run the notebooks:
   - Start with `notebooks/01_model_training.ipynb` for the full training pipeline
   - Then explore `notebooks/02_ism_analysis.ipynb` for mutation analysis
## Usage

### Training the Model

Open `notebooks/01_model_training.ipynb` and run all cells. The notebook covers:

1. Data Loading: loads the 61K+ DNA sequences from `data/train.tsv` and performs the train/validation/test split
2. Preprocessing: one-hot encodes the DNA sequences and creates PyTorch DataLoaders
3. Model Training: initializes the CNN architecture, trains with early stopping and LR scheduling, and tracks comprehensive metrics
4. Evaluation: calculates test-set performance, generates training curves, and saves the model and metrics
5. Biological Validation: tests on known binding/non-binding sequences and verifies the model learned relevant patterns
### Running ISM Analysis

Open `notebooks/02_ism_analysis.ipynb` and run all cells:

1. Load Trained Model: loads the saved model from `models/`
2. Select Test Sequences: chooses positive (binding) and negative (non-binding) examples
3. Systematic Mutation: mutates each position to all 4 nucleotides and predicts binding for each mutant
4. Visualization: generates heatmaps of mutation effects, plots nucleotide importance scores, and compares positive vs negative sequences
### Making Predictions

```python
import torch
import numpy as np

# Load the trained model
from notebooks.model_utils import load_model  # Helper function
model = load_model('models/dna_cnn_model_20250926_234649.pth')

# Define your DNA sequence (200bp)
sequence = "ATCGATCG..." * 25  # Replace with a real 200bp sequence

# One-hot encode
encoded = dna_to_one_hot(sequence)                # (4, 200)
tensor = torch.FloatTensor(encoded).unsqueeze(0)  # (1, 4, 200)

# Predict (the model outputs a raw logit; apply sigmoid for a probability)
model.eval()
with torch.no_grad():
    output = model(tensor)
    probability = torch.sigmoid(output).item()

print(f"CTCF Binding Probability: {probability:.4f}")
print(f"Prediction: {'Binding' if probability > 0.5 else 'Non-binding'}")
```

## Results

- 89.61% test accuracy: high reliability for binding-site prediction
- 96.27% AUC-ROC: excellent discrimination between binding and non-binding
- Low overfitting: only a 5.38% accuracy gap between train and test
- Balanced performance: similar precision (89.07%) and recall (90.31%)
- Fast training: converged in 28 epochs with early stopping
Loss Curves: Smooth convergence with minimal overfitting
- Final Training Loss: 0.136
- Final Validation Loss: 0.269
- Test Loss: 0.259
Accuracy Progression:
- Started at ~65% (epoch 1)
- Reached ~89% (epoch 10)
- Stabilized at ~90% (epoch 20+)
- Early stopped at epoch 28
The model successfully:
- Predicts high probability for known CTCF binding sequences
- Predicts low probability for random/non-binding sequences
- Learns motifs (revealed by ISM) consistent with known CTCF binding preferences
- Highlights critical positions that align with experimentally validated binding sites
## Technical Implementation

### One-Hot Encoding

```python
import numpy as np

def dna_to_one_hot(sequence: str) -> np.ndarray:
    """
    Convert a DNA sequence to a one-hot encoded array.

    Returns:
        np.ndarray: shape (4, sequence_length);
        channels represent [A, C, G, T].
    """
    mapping = {
        'A': [1, 0, 0, 0],
        'C': [0, 1, 0, 0],
        'G': [0, 0, 1, 0],
        'T': [0, 0, 0, 1],
        'N': [0.25, 0.25, 0.25, 0.25],  # Unknown base
    }
    return np.array([mapping[base] for base in sequence]).T
```

Key design choices:
- Kernel Size 10: captures ~10bp motifs typical of TF binding sites
- 64 Filters: sufficient capacity to learn diverse sequence patterns
- Dropout 0.3: prevents overfitting while maintaining performance
- Dense 128: sufficient for combining learned sequence features
### Training Setup

The helper functions `train_epoch`, `validate`, and `EarlyStopping` are defined in the training notebook:

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.nn import BCEWithLogitsLoss

# Optimizer
optimizer = Adam(model.parameters(), lr=1e-5)

# Loss function (expects raw logits)
criterion = BCEWithLogitsLoss()

# Learning rate scheduler
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Early stopping on validation loss
early_stopping = EarlyStopping(patience=8, mode='min')

# Training loop
for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_metrics = validate(model, val_loader, criterion)
    scheduler.step()

    if early_stopping(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
```

## Future Enhancements

- Attention Mechanisms: add self-attention to capture long-range dependencies
- Bidirectional LSTM: Combine CNN with recurrent layers for sequence context
- Deeper Architectures: Experiment with ResNet-style connections
- Ensemble Methods: Combine multiple models for improved predictions
- Data Augmentation: Reverse complement sequences, random mutations
- Multi-Task Learning: Predict binding for multiple transcription factors
- Transfer Learning: Pre-train on larger genomic datasets
- Class Weighting: Handle potential class imbalance more explicitly
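Reverse-complement augmentation, mentioned above, is worth a quick sketch: CTCF sites sit on double-stranded DNA, so the reverse complement of a bound window is typically also bound and can double the effective training data.

```python
# Translation table for complementing bases (N stays N)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement, e.g. for augmenting training data."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATCG"))  # CGAT
```

Applying the function twice returns the original sequence, which is a convenient unit test for the augmentation step.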
- Motif Visualization: Extract and visualize learned sequence motifs
- Feature Importance: Use GradCAM or integrated gradients
- Sequence Logo: Generate position weight matrices from learned features
- 3D Structure Integration: Incorporate DNA shape features
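One simple way to approach the sequence-logo idea above: the first-layer convolution weights have shape (64 filters × 4 channels × 10 positions) and can be softmax-normalized per position into position-probability matrices. This is a rough sketch using random stand-in weights (real logos are usually built from activation-weighted subsequences instead); `filters_to_ppms` and `temperature` are illustrative names.

```python
import numpy as np

def filters_to_ppms(weights: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax each filter column over the 4 base channels so every
    position becomes a probability distribution over A/C/G/T."""
    w = weights / temperature
    e = np.exp(w - w.max(axis=1, keepdims=True))  # stable softmax over channels
    return e / e.sum(axis=1, keepdims=True)

# Stand-in for model.conv.weight.detach().numpy(): (64, 4, 10)
rng = np.random.default_rng(0)
fake_weights = rng.normal(size=(64, 4, 10))
ppms = filters_to_ppms(fake_weights)
print(ppms.shape)  # (64, 4, 10); each column sums to 1
```

The resulting matrices can be fed to logo-plotting tools that accept position probability matrices.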
- Web Application: Flask/FastAPI for online predictions
- REST API: Enable programmatic access
- Batch Processing: Efficient genome-wide prediction pipeline
- Containerization: Docker for reproducible deployment
## Technologies Used

| Category | Technologies |
|---|---|
| Deep Learning | PyTorch 2.0+ |
| Language | Python 3.8+ |
| Numerical Computing | NumPy, SciPy |
| Data Processing | Pandas |
| Machine Learning | scikit-learn |
| Visualization | Matplotlib, Seaborn |
| Development | Jupyter Notebook |
| Version Control | Git, GitHub |
| Hardware | CUDA (GPU acceleration) |
## Citation & Acknowledgments

This project is based on methodology from:

Ravarani, C., & Latysheva, N. (2025). *Deep Learning for Biology: Harness AI to Solve Real-World Biology Problems*. O'Reilly Media.

Note: the training data and model architectures are adapted from this excellent educational resource.

Acknowledgments:
- Chiara Ravarani & Natasha Latysheva for the foundational methodology, dataset, and educational framework
- O'Reilly Media for publishing cutting-edge bioinformatics and deep learning content
- The computational biology community for advancing machine learning applications in genomics
- The PyTorch team for an exceptional deep learning framework
- The CTCF research community for experimental validation of transcription factor binding
```bibtex
@book{ravarani2025deep,
  title={Deep Learning for Biology: Harness AI to Solve Real-World Biology Problems},
  author={Ravarani, Chiara and Latysheva, Natasha},
  year={2025},
  publisher={O'Reilly Media}
}
```

## Contributing

Contributions are welcome! Whether you're interested in:
- Bug Fixes: report or fix issues
- New Features: add model improvements or analysis tools
- Documentation: improve explanations and examples
- Testing: add validation or biological benchmarks

To contribute:

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License; see the LICENSE file for details. You are free to:
- Use the code for research and education
- Modify and distribute the code
- Use it for commercial purposes

with attribution to the original authors and citation of the O'Reilly book.
## Author

Created by sys0507
- Passionate about computational biology and deep learning
- Interested in genomics, transcription factor analysis, and regulatory networks
- Open to collaborations and discussions

Feel free to reach out with questions, suggestions, or collaboration ideas!

Made with ❤️ for the computational biology community

Note: this is an educational project based on the O'Reilly "Deep Learning for Biology" book. The dataset and model architectures are adapted from that resource for learning purposes. For production use, consider additional validation and biological benchmarking.