# DNA-TF-Binding-Predictor

Deep learning model that predicts CTCF transcription factor binding sites in DNA sequences, built with a PyTorch CNN and validated via In Silico Mutagenesis.
- Project Overview
- What is CTCF?
- Problem Statement
- Dataset
- Model Architecture
- Performance Metrics
- In Silico Mutagenesis
- Project Structure
- Key Features
- Installation
- Usage
- Results
- Technical Implementation
- Future Enhancements
- Technologies Used
- Citation & Acknowledgments
- Contributing
- License
- Author
## Project Overview

This project implements a Convolutional Neural Network (CNN) in PyTorch to predict CTCF transcription factor binding sites in DNA sequences. The model achieves 89.61% test accuracy and 96.27% AUC-ROC on a dataset of over 61,000 DNA sequences.

Highlights:
- 89.61% test accuracy on DNA binding prediction
- 96.27% AUC-ROC, demonstrating excellent discrimination
- 461,953 trainable parameters in an optimized CNN architecture
- ~3.5 minutes of training time on GPU with early stopping
- Biological validation through In Silico Mutagenesis analysis
- Comprehensive metrics tracking for model interpretability
## What is CTCF?

CTCF (CCCTC-binding factor) is a highly conserved zinc finger protein that plays a crucial role in gene regulation:
- Chromatin Architecture: organizes 3D genome structure by creating chromatin loops
- Insulator Function: acts as a boundary element to prevent inappropriate enhancer-promoter interactions
- Gene Expression: critical regulator of gene expression patterns across cell types
- Binding Specificity: recognizes a specific DNA motif at thousands of sites across the genome (this project uses 200bp windows around candidate sites)

Understanding where CTCF binds is essential for:
- Epigenetics Research: studying gene regulation mechanisms
- Medical Genomics: identifying disease-associated regulatory variants
- Drug Development: targeting regulatory elements for therapeutic intervention
## Problem Statement

Predicting transcription factor binding sites from DNA sequence alone is challenging:
- Sequence Complexity: DNA contains 4 nucleotides (A, C, G, T), so a 200bp window has 4^200 possible sequences
- Context Dependency: binding depends on sequence motifs, but also on the surrounding context
- Class Imbalance: most genomic regions do not bind CTCF
- Interpretability: we need to understand what the model learns beyond prediction accuracy

Our CNN-based approach:
- One-Hot Encoding: converts DNA sequences into 4-channel tensors (A, C, G, T)
- Convolutional Layers: automatically learn sequence motifs relevant for binding
- Binary Classification: predicts binding vs non-binding with high accuracy
- ISM Validation: In Silico Mutagenesis reveals the biological plausibility of learned patterns
## Dataset

| Attribute | Value |
|---|---|
| Total Sequences | 61,083 |
| Sequence Length | 200 base pairs (bp) |
| Training Set | 42,758 sequences (70%) |
| Validation Set | 9,162 sequences (15%) |
| Test Set | 9,163 sequences (15%) |
| Positive Class | CTCF binding sites |
| Negative Class | Non-binding regions |
| Source | O'Reilly "Deep Learning for Biology" |
The dataset is stored as TSV (tab-separated values) with the following columns:

| Column | Description | Example |
|---|---|---|
| `sequence` | DNA sequence (200bp) | `ATCGATCG...` |
| `label` | Binary label (0 or 1) | `1` (binding) |
| `transcription_factor` | TF name | `CTCF` |
| `subset` | Data split | `train` / `val` / `test` |
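Loading the TSV and recovering the splits might look like the sketch below (illustrative; the real file is `data/train.tsv`, and the sample rows here are made up and shortened for display):

```python
import io
import pandas as pd

# A few illustrative rows in the TSV layout described above
# (real sequences are 200bp; these are shortened stand-ins).
sample = io.StringIO(
    "sequence\tlabel\ttranscription_factor\tsubset\n"
    "ATCGATCG\t1\tCTCF\ttrain\n"
    "GGGCCCAA\t0\tCTCF\tval\n"
    "TTAACCGG\t1\tCTCF\ttest\n"
)

# For the real data: pd.read_csv("data/train.tsv", sep="\t")
df = pd.read_csv(sample, sep="\t")

# The `subset` column already encodes the train/val/test split
train_df = df[df["subset"] == "train"]
val_df = df[df["subset"] == "val"]
test_df = df[df["subset"] == "test"]
```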
DNA sequences are converted to one-hot encoding:

```
A → [1, 0, 0, 0]
C → [0, 1, 0, 0]
G → [0, 0, 1, 0]
T → [0, 0, 0, 1]
N → [0.25, 0.25, 0.25, 0.25]  # Unknown base
```

Result: a 200bp sequence becomes a tensor of shape (4, 200).
## Model Architecture

```
Input: DNA Sequence (200 base pairs)
        ↓
One-Hot Encoding → Tensor (4 × 200)
        ↓
┌──────────────────────────────────────┐
│ Conv1D Layer                         │
│  • Filters: 64                       │
│  • Kernel Size: 10                   │
│  • Activation: ReLU                  │
│  • Output: (64 × 191)                │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ MaxPool1D                            │
│  • Kernel Size: 3                    │
│  • Output: (64 × 63)                 │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ Dropout (p=0.3)                      │
└──────────────────────────────────────┘
        ↓
Flatten → (4032,)
        ↓
┌──────────────────────────────────────┐
│ Dense Layer                          │
│  • Units: 128                        │
│  • Activation: ReLU                  │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ Dropout (p=0.3)                      │
└──────────────────────────────────────┘
        ↓
┌──────────────────────────────────────┐
│ Output Dense Layer                   │
│  • Units: 1 (raw logit)              │
└──────────────────────────────────────┘
        ↓
Sigmoid (applied at inference) → Binding Probability [0, 1]
```

Note: because training uses `BCEWithLogitsLoss`, the network outputs a raw logit and the sigmoid is applied outside the model at inference time.
| Parameter | Value |
|---|---|
| Total Parameters | 461,953 |
| Conv Filters | 64 |
| Kernel Size | 10 |
| Dense Units | 128 |
| Dropout Rate | 0.3 |
| Optimizer | Adam (lr=1e-5) |
| Loss Function | BCEWithLogitsLoss |
| Scheduler | CosineAnnealingLR |
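The hyperparameters above imply a model along these lines. This is a sketch reconstructed from the architecture diagram, not the notebook's exact class: the class and layer names are illustrative, and the parameter count of this reconstruction may differ slightly from the reported 461,953.

```python
import torch
import torch.nn as nn

class DNACNNSketch(nn.Module):
    """CNN over one-hot DNA: (batch, 4, 200) -> 1 binding logit."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 64, kernel_size=10)  # (4, 200) -> (64, 191)
        self.pool = nn.MaxPool1d(kernel_size=3)       # (64, 191) -> (64, 63)
        self.drop = nn.Dropout(0.3)
        self.fc1 = nn.Linear(64 * 63, 128)            # flatten 4032 -> 128
        self.fc2 = nn.Linear(128, 1)                  # raw logit for BCEWithLogitsLoss

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = self.drop(x.flatten(start_dim=1))
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)

model_sketch = DNACNNSketch()
out = model_sketch(torch.zeros(1, 4, 200))
print(out.shape)  # torch.Size([1, 1])
```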
## Performance Metrics

| Dataset | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Train | 94.99% | 94.69% | 95.32% | 95.01% | 98.92% |
| Validation | 88.94% | 88.63% | 89.35% | 88.99% | 96.04% |
| Test | 89.61% | 89.07% | 90.31% | 89.68% | 96.27% |
| Metric | Value |
|---|---|
| Total Parameters | 461,953 |
| Training Samples | 42,758 sequences |
| Test Accuracy | 89.61% |
| Test AUC-ROC | 96.27% |
| Sequence Length | 200 base pairs |
| Training Time | ~211 seconds (GPU) |
| Epochs Trained | 28 (early stopping) |
| Best Val Loss | 0.2633 |
| Device Used | CUDA (GPU) |
The model shows excellent generalization with minimal overfitting:
| Metric | Train | Test | Gap |
|---|---|---|---|
| Accuracy | 94.99% | 89.61% | 5.38% |
| AUC-ROC | 98.92% | 96.27% | 2.65% |
| F1-Score | 95.01% | 89.68% | 5.33% |

Good generalization: the small gap indicates the model learned genuine biological patterns rather than memorizing the training data.
## In Silico Mutagenesis

In Silico Mutagenesis (ISM) is a computational technique for understanding how DNA sequence changes affect model predictions, providing biological interpretability. How it works:

1. Select a test DNA sequence (binding or non-binding)
2. Mutate each position systematically to all 4 nucleotides (A, C, G, T)
3. Predict the binding probability for each mutant
4. Analyze how the mutations shift the predictions
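The steps above can be sketched as follows. This is illustrative, not the notebook's code: `model` can be any callable mapping a `(1, 4, L)` tensor to a logit, and a small untrained stand-in is used here so the sketch is self-contained.

```python
import numpy as np
import torch

BASES = "ACGT"

def one_hot(seq):
    """Minimal one-hot encoder: string -> (4, len(seq)) float array."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for j, b in enumerate(seq):
        x[idx[b], j] = 1.0
    return x

def ism_scores(model, seq):
    """Score every single-base substitution.

    Returns a (len(seq), 4) array of mutant probability minus
    wild-type probability; position (p, k) is the effect of
    substituting BASES[k] at position p.
    """
    with torch.no_grad():
        ref = torch.from_numpy(one_hot(seq)).unsqueeze(0)
        wt = torch.sigmoid(model(ref)).item()
        scores = np.zeros((len(seq), 4), dtype=np.float32)
        for pos in range(len(seq)):
            for k, base in enumerate(BASES):
                mutant = seq[:pos] + base + seq[pos + 1:]
                x = torch.from_numpy(one_hot(mutant)).unsqueeze(0)
                scores[pos, k] = torch.sigmoid(model(x)).item() - wt
    return scores

# Untrained stand-in model, just to demonstrate the mechanics
dummy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4 * 12, 1))
scores = ism_scores(dummy, "ACGTACGTACGT")
print(scores.shape)  # (12, 4)
```

Substituting a position with its wild-type base leaves the sequence unchanged, so those entries are exactly zero, which is a handy sanity check.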
What ISM reveals:
- Critical Nucleotides: which positions matter most for binding
- Substitution Effects: how changing A→T affects binding differently than A→G
- Motif Discovery: identifies the sequence patterns the CNN learned
- Biological Validation: confirms the model learns realistic CTCF binding motifs

The ISM analysis notebook produces:
- Heatmaps showing mutation impact across all positions
- Importance plots highlighting critical binding positions
- Difference maps comparing positive vs negative sequences
## Project Structure

```
DNA-TF-Binding-Predictor/
├── README.md                  # Project documentation
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── LICENSE                    # MIT License
│
├── notebooks/
│   ├── 01_model_training.ipynb   # Complete training pipeline
│   └── 02_ism_analysis.ipynb     # In Silico Mutagenesis analysis
│
├── data/
│   ├── train.tsv              # 61K+ CTCF binding sequences
│   └── README.md              # Dataset documentation
│
├── models/
│   ├── dna_cnn_model_20250926_234649.pth           # Trained model weights (5.4MB)
│   ├── dna_cnn_model_20250926_234649_config.json   # Model architecture config
│   └── dna_cnn_model_20250926_234649_metrics.json  # Training metrics & history
│
└── docs/
    └── CTCF_Binding_Presentation.pptx  # Project presentation slides
```
## Key Features

Data pipeline:
- Custom PyTorch Dataset: efficient DNA sequence loading and encoding
- DataLoader: batched training with automatic shuffling
- Train/Val/Test Split: stratified split maintaining class balance
- One-Hot Encoding: handles standard (ACGT) and unknown (N) nucleotides

Training:
- Early Stopping: prevents overfitting (patience=8 epochs)
- Learning Rate Scheduling: CosineAnnealingLR for smooth convergence
- GPU Acceleration: CUDA support for faster training

Evaluation and interpretability:
- Comprehensive Metrics: accuracy, precision, recall, F1, AUC-ROC
- Training History: loss and accuracy tracked across all epochs
- In Silico Mutagenesis: systematic mutation analysis
- Biological Validation: sanity checks with known binding/non-binding sequences
- Visualization Suite: heatmaps, importance plots, performance curves

Engineering:
- Saved Metadata: model config, metrics, and training details
- Model Checkpointing: save/load models with full metadata
- Configuration Management: JSON config for reproducibility
- Metrics Persistence: detailed performance metrics saved
- Inference Pipeline: easy prediction on new sequences
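The custom Dataset mentioned above might look like the following sketch (class and argument names are illustrative, not the notebook's; it reuses the one-hot scheme described earlier and encodes sequences on the fly):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DNADataset(Dataset):
    """Wraps raw sequences and labels; one-hot encodes on access."""
    MAPPING = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
               'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1],
               'N': [0.25, 0.25, 0.25, 0.25]}  # unknown base

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        # (L, 4) -> transpose to the (4, L) channel-first layout
        x = np.array([self.MAPPING[b] for b in self.sequences[i]],
                     dtype=np.float32).T
        y = np.float32(self.labels[i])
        return torch.from_numpy(x), torch.tensor(y)

# Toy example with two 20bp sequences
ds = DNADataset(["ACGTN" * 4, "TTTTT" * 4], [1, 0])
loader = DataLoader(ds, batch_size=2, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([2, 4, 20]) torch.Size([2])
```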
## Installation

Prerequisites:
- Python 3.8 or higher
- CUDA-capable GPU (recommended) or CPU
- 8GB+ RAM
- Jupyter Notebook

1. Clone the repository:

```bash
git clone https://github.com/sys0507/DNA-TF-Binding-Predictor.git
cd DNA-TF-Binding-Predictor
```

2. Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

4. Verify the PyTorch installation:

```bash
python -c "import torch; print(f'PyTorch {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

5. Launch Jupyter Notebook:

```bash
jupyter notebook
```

6. Open and run the notebooks:
   - Start with `notebooks/01_model_training.ipynb` for the full training pipeline
   - Then explore `notebooks/02_ism_analysis.ipynb` for mutation analysis
## Usage

### Training the Model

Open `notebooks/01_model_training.ipynb` and run all cells. The notebook covers:

1. Data Loading: loads the 61K+ DNA sequences from `data/train.tsv` and performs the train/validation/test split
2. Preprocessing: one-hot encodes the DNA sequences and creates PyTorch DataLoaders
3. Model Training: initializes the CNN architecture, trains with early stopping and LR scheduling, and tracks comprehensive metrics
4. Evaluation: calculates test-set performance, generates training curves, and saves the model and metrics
5. Biological Validation: tests on known binding/non-binding sequences and verifies the model learned relevant patterns
### Running ISM Analysis

Open `notebooks/02_ism_analysis.ipynb` and run all cells:

1. Load Trained Model: loads the saved model from `models/`
2. Select Test Sequences: chooses positive (binding) and negative (non-binding) examples
3. Systematic Mutation: mutates each position to all 4 nucleotides and predicts binding for each mutant
4. Visualization: generates heatmaps of mutation effects, plots nucleotide importance scores, and compares positive vs negative sequences
### Making Predictions

```python
import torch
import numpy as np

# Load the trained model
from notebooks.model_utils import load_model  # Helper function
model = load_model('models/dna_cnn_model_20250926_234649.pth')

# Define your DNA sequence (200bp)
sequence = "ATCGATCG..." * 25  # Replace with a real 200bp sequence

# One-hot encode
encoded = dna_to_one_hot(sequence)                # (4, 200)
tensor = torch.FloatTensor(encoded).unsqueeze(0)  # (1, 4, 200)

# Predict (the model outputs a raw logit; apply sigmoid for a probability)
model.eval()
with torch.no_grad():
    output = model(tensor)
    probability = torch.sigmoid(output).item()

print(f"CTCF Binding Probability: {probability:.4f}")
print(f"Prediction: {'Binding' if probability > 0.5 else 'Non-binding'}")
```

## Results

- 89.61% test accuracy: high reliability for binding-site prediction
- 96.27% AUC-ROC: excellent discrimination between binding and non-binding
- Low overfitting: only a 5.38% accuracy gap between train and test
- Balanced performance: similar precision (89.07%) and recall (90.31%)
- Fast training: converged in 28 epochs with early stopping
Loss Curves: Smooth convergence with minimal overfitting
- Final Training Loss: 0.136
- Final Validation Loss: 0.269
- Test Loss: 0.259
Accuracy Progression:
- Started at ~65% (epoch 1)
- Reached ~89% (epoch 10)
- Stabilized at ~90% (epoch 20+)
- Early stopped at epoch 28
The model successfully:
- Predicts high probability for known CTCF binding sequences
- Predicts low probability for random/non-binding sequences
- Learns motifs (revealed by ISM) consistent with known CTCF binding preferences
- Highlights critical positions that align with experimentally validated binding sites
## Technical Implementation

### One-Hot Encoding

```python
import numpy as np

def dna_to_one_hot(sequence: str) -> np.ndarray:
    """
    Convert a DNA sequence to a one-hot encoded array.

    Returns:
        np.ndarray: shape (4, sequence_length);
        channels represent [A, C, G, T].
    """
    mapping = {
        'A': [1, 0, 0, 0],
        'C': [0, 1, 0, 0],
        'G': [0, 0, 1, 0],
        'T': [0, 0, 0, 1],
        'N': [0.25, 0.25, 0.25, 0.25],  # Unknown base
    }
    return np.array([mapping[base] for base in sequence]).T
```

Key design choices:
- Kernel Size 10: captures ~10bp motifs typical of TF binding sites
- 64 Filters: sufficient capacity to learn diverse sequence patterns
- Dropout 0.3: prevents overfitting while maintaining performance
- Dense 128: sufficient for combining learned sequence features
### Training Setup

The helper functions `train_epoch`, `validate`, and `EarlyStopping` are defined in the training notebook:

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.nn import BCEWithLogitsLoss

# Optimizer
optimizer = Adam(model.parameters(), lr=1e-5)

# Loss function (expects raw logits)
criterion = BCEWithLogitsLoss()

# Learning rate scheduler
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Early stopping on validation loss
early_stopping = EarlyStopping(patience=8, mode='min')

# Training loop
for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_metrics = validate(model, val_loader, criterion)
    scheduler.step()

    if early_stopping(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
```

## Future Enhancements

- Attention Mechanisms: add self-attention to capture long-range dependencies
- Bidirectional LSTM: Combine CNN with recurrent layers for sequence context
- Deeper Architectures: Experiment with ResNet-style connections
- Ensemble Methods: Combine multiple models for improved predictions
- Data Augmentation: Reverse complement sequences, random mutations
- Multi-Task Learning: Predict binding for multiple transcription factors
- Transfer Learning: Pre-train on larger genomic datasets
- Class Weighting: Handle potential class imbalance more explicitly
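Reverse-complement augmentation, mentioned above, is worth a quick sketch: CTCF sites sit on double-stranded DNA, so the reverse complement of a bound window is typically also bound and can double the effective training data.

```python
# Translation table for complementing bases (N stays N)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement, e.g. for augmenting training data."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATCG"))  # CGAT
```

Applying the function twice returns the original sequence, which is a convenient unit test for the augmentation step.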
- Motif Visualization: Extract and visualize learned sequence motifs
- Feature Importance: Use GradCAM or integrated gradients
- Sequence Logo: Generate position weight matrices from learned features
- 3D Structure Integration: Incorporate DNA shape features
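One simple way to approach the sequence-logo idea above: the first-layer convolution weights have shape (64 filters × 4 channels × 10 positions) and can be softmax-normalized per position into position-probability matrices. This is a rough sketch using random stand-in weights (real logos are usually built from activation-weighted subsequences instead); `filters_to_ppms` and `temperature` are illustrative names.

```python
import numpy as np

def filters_to_ppms(weights: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax each filter column over the 4 base channels so every
    position becomes a probability distribution over A/C/G/T."""
    w = weights / temperature
    e = np.exp(w - w.max(axis=1, keepdims=True))  # stable softmax over channels
    return e / e.sum(axis=1, keepdims=True)

# Stand-in for model.conv.weight.detach().numpy(): (64, 4, 10)
rng = np.random.default_rng(0)
fake_weights = rng.normal(size=(64, 4, 10))
ppms = filters_to_ppms(fake_weights)
print(ppms.shape)  # (64, 4, 10); each column sums to 1
```

The resulting matrices can be fed to logo-plotting tools that accept position probability matrices.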
- Web Application: Flask/FastAPI for online predictions
- REST API: Enable programmatic access
- Batch Processing: Efficient genome-wide prediction pipeline
- Containerization: Docker for reproducible deployment
## Technologies Used

| Category | Technologies |
|---|---|
| Deep Learning | PyTorch 2.0+ |
| Language | Python 3.8+ |
| Numerical Computing | NumPy, SciPy |
| Data Processing | Pandas |
| Machine Learning | scikit-learn |
| Visualization | Matplotlib, Seaborn |
| Development | Jupyter Notebook |
| Version Control | Git, GitHub |
| Hardware | CUDA (GPU acceleration) |
## Citation & Acknowledgments

This project is based on methodology from:

Ravarani, C., & Latysheva, N. (2025). *Deep Learning for Biology: Harness AI to Solve Real-World Biology Problems*. O'Reilly Media.

Note: the training data and model architectures are adapted from this excellent educational resource.

Acknowledgments:
- Chiara Ravarani & Natasha Latysheva for the foundational methodology, dataset, and educational framework
- O'Reilly Media for publishing cutting-edge bioinformatics and deep learning content
- The computational biology community for advancing machine learning applications in genomics
- The PyTorch team for an exceptional deep learning framework
- The CTCF research community for experimental validation of transcription factor binding
```bibtex
@book{ravarani2025deep,
  title={Deep Learning for Biology: Harness AI to Solve Real-World Biology Problems},
  author={Ravarani, Chiara and Latysheva, Natasha},
  year={2025},
  publisher={O'Reilly Media}
}
```

## Contributing

Contributions are welcome! Whether you're interested in:
- Bug Fixes: report or fix issues
- New Features: add model improvements or analysis tools
- Documentation: improve explanations and examples
- Testing: add validation or biological benchmarks

To contribute:

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License; see the LICENSE file for details. You are free to:
- Use the code for research and education
- Modify and distribute the code
- Use it for commercial purposes

with attribution to the original authors and citation of the O'Reilly book.
## Author

Created by sys0507
- Passionate about computational biology and deep learning
- Interested in genomics, transcription factor analysis, and regulatory networks
- Open to collaborations and discussions

Feel free to reach out with questions, suggestions, or collaboration ideas!

Made with ❤️ for the computational biology community

Note: this is an educational project based on the O'Reilly "Deep Learning for Biology" book. The dataset and model architectures are adapted from that resource for learning purposes. For production use, consider additional validation and biological benchmarking.