
🧬 DNA Transcription Factor Binding Predictor

A deep learning model that predicts CTCF transcription factor binding sites in DNA sequences, built with a PyTorch CNN and validated by In Silico Mutagenesis.

📊 Project Overview

This project implements a Convolutional Neural Network (CNN) using PyTorch to predict CTCF transcription factor binding sites in DNA sequences. The model achieves 89.61% test accuracy and 96.27% AUC-ROC on a dataset of over 61,000 DNA sequences.

🎯 Key Achievements

  • ⭐ 89.61% test accuracy on DNA binding prediction
  • 🎯 96.27% AUC-ROC, demonstrating excellent discrimination
  • 🧬 461,953 trainable parameters in an optimized CNN architecture
  • ⚡ ~3.5 minutes training time on GPU with early stopping
  • 🔬 Biological validation through In Silico Mutagenesis analysis
  • 📊 Comprehensive metrics tracking for model interpretability

🧬 What is CTCF?

CTCF (CCCTC-Binding Factor) is a highly conserved zinc finger protein that plays a crucial role in gene regulation:

  • 🧬 Chromatin Architecture: CTCF organizes 3D genome structure by creating chromatin loops
  • 🔀 Insulator Function: Acts as a boundary element to prevent inappropriate enhancer-promoter interactions
  • 📖 Gene Expression: Critical regulator of gene expression patterns across cell types
  • 🎯 Binding Specificity: Recognizes a short core sequence motif (~19 bp) at sites throughout the genome; this project classifies 200 bp windows around such sites

Understanding where CTCF binds is essential for:

  • 🔬 Epigenetics Research: Studying gene regulation mechanisms
  • 🏥 Medical Genomics: Identifying disease-associated regulatory variants
  • 💊 Drug Development: Targeting regulatory elements for therapeutic intervention

🎯 Problem Statement

Challenge

Predicting transcription factor binding sites from DNA sequence alone is challenging because:

  1. 📏 Sequence Complexity: DNA contains 4 nucleotides (A, C, G, T), so a 200 bp window has 4^200 possible combinations
  2. 🎯 Context Dependency: Binding depends not only on sequence motifs but also on the surrounding context
  3. ⚖️ Class Imbalance: Most genomic regions don't bind CTCF
  4. 🔍 Interpretability: We need to understand what the model learns beyond prediction accuracy

Solution

Our CNN-based approach:

  • 🔄 One-Hot Encoding: Converts DNA sequences into 4-channel tensors (A, C, G, T)
  • 🧠 Convolutional Layers: Automatically learns sequence motifs relevant for binding
  • 🎯 Binary Classification: Predicts binding vs. non-binding with high accuracy
  • 🔬 ISM Validation: In Silico Mutagenesis reveals the biological plausibility of learned patterns

๐Ÿ“ Dataset

๐Ÿ“Š Dataset Statistics

Attribute Value
Total Sequences 61,083
Sequence Length 200 base pairs (bp)
Training Set 42,758 sequences (70%)
Validation Set 9,162 sequences (15%)
Test Set 9,163 sequences (15%)
Positive Class CTCF binding sites
Negative Class Non-binding regions
Source O'Reilly "Deep Learning for Biology"

๐Ÿ“ Data Format

The dataset is stored as TSV (Tab-Separated Values) with the following columns:

Column Description Example
sequence DNA sequence (200bp) ATCGATCG...
label Binary label (0 or 1) 1 (binding)
transcription_factor TF name CTCF
subset Data split train / val / test
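This format makes loading and splitting straightforward; here is a minimal sketch using pandas (the helper name and path are illustrative, not from the repository):

```python
import pandas as pd

def load_splits(tsv_path):
    """Read the TSV and partition rows by the `subset` column
    (train / val / test, as documented above)."""
    df = pd.read_csv(tsv_path, sep="\t")
    return {name: part.reset_index(drop=True)
            for name, part in df.groupby("subset")}
```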

🧬 DNA Encoding

DNA sequences are converted to one-hot encoding:

A → [1, 0, 0, 0]
C → [0, 1, 0, 0]
G → [0, 0, 1, 0]
T → [0, 0, 0, 1]
N → [0.25, 0.25, 0.25, 0.25]  # Unknown base

Result: 200bp sequence → Tensor of shape (4, 200)


🤖 Model Architecture

🏗️ ConvModelV2 Architecture

Input: DNA Sequence (200 base pairs)
    ↓
One-Hot Encoding → Tensor (4 × 200)
    ↓
┌─────────────────────────────────────┐
│ Conv1D Layer                        │
│  • Filters: 64                      │
│  • Kernel Size: 10                  │
│  • Activation: ReLU                 │
│  • Output: (64 × 191)               │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ MaxPool1D                           │
│  • Kernel Size: 3                   │
│  • Output: (64 × 63)                │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Dropout (p=0.3)                     │
└─────────────────────────────────────┘
    ↓
Flatten → (4032,)
    ↓
┌─────────────────────────────────────┐
│ Dense Layer                         │
│  • Units: 128                       │
│  • Activation: ReLU                 │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Dropout (p=0.3)                     │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Output Dense Layer                  │
│  • Units: 1                         │
│  • Output: raw logit                │
└─────────────────────────────────────┘
    ↓
Sigmoid (applied at inference; training uses BCEWithLogitsLoss on the logit)
    ↓
Output: Binding Probability [0-1]
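The diagram above can be expressed in PyTorch roughly as follows. This is a simplified reconstruction from the diagram, not the released implementation, so exact details (and the parameter count) may differ slightly from the saved checkpoint:

```python
import torch
import torch.nn as nn

class ConvModelV2(nn.Module):
    """CNN sketch matching the architecture diagram: Conv1d → ReLU →
    MaxPool1d → Dropout → Flatten → Dense → ReLU → Dropout → Dense(1)."""

    def __init__(self, seq_len=200, n_filters=64, kernel_size=10,
                 pool_size=3, dense_units=128, dropout=0.3):
        super().__init__()
        conv_out = seq_len - kernel_size + 1   # 200 - 10 + 1 = 191
        pooled = conv_out // pool_size         # 191 // 3 = 63
        self.net = nn.Sequential(
            nn.Conv1d(4, n_filters, kernel_size),   # (B, 64, 191)
            nn.ReLU(),
            nn.MaxPool1d(pool_size),                # (B, 64, 63)
            nn.Dropout(dropout),
            nn.Flatten(),                           # (B, 4032)
            nn.Linear(n_filters * pooled, dense_units),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dense_units, 1),  # raw logit; pair with BCEWithLogitsLoss
        )

    def forward(self, x):
        return self.net(x)
```

At inference, apply `torch.sigmoid` to the returned logit to obtain a binding probability.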

📊 Model Specifications

| Parameter | Value |
|---|---|
| Total Parameters | 461,953 |
| Conv Filters | 64 |
| Kernel Size | 10 |
| Dense Units | 128 |
| Dropout Rate | 0.3 |
| Optimizer | Adam (lr=1e-5) |
| Loss Function | BCEWithLogitsLoss |
| Scheduler | CosineAnnealingLR |

📊 Performance Metrics

🎯 Model Performance Summary

| Dataset | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Train | 94.99% | 94.69% | 95.32% | 95.01% | 98.92% |
| Validation | 88.94% | 88.63% | 89.35% | 88.99% | 96.04% |
| Test ⭐ | 89.61% | 89.07% | 90.31% | 89.68% | 96.27% |

📊 Quick Stats

| Metric | Value |
|---|---|
| Total Parameters | 461,953 |
| Training Samples | 42,758 sequences |
| Test Accuracy | 89.61% ⭐ |
| Test AUC-ROC | 96.27% |
| Sequence Length | 200 base pairs |
| Training Time | ~211 seconds (GPU) |
| Epochs Trained | 28 (early stopping) |
| Best Val Loss | 0.2633 |
| Device Used | CUDA (GPU) |

📈 Generalization Analysis

The model generalizes well, with only a modest train-test gap:

| Metric | Train | Test | Gap |
|---|---|---|---|
| Accuracy | 94.99% | 89.61% | 5.38% ✅ |
| AUC-ROC | 98.92% | 96.27% | 2.65% ✅ |
| F1-Score | 95.01% | 89.68% | 5.33% ✅ |

✅ Good generalization: the small gap indicates the model learned genuine biological patterns rather than memorizing the training data.


🔬 In Silico Mutagenesis

In Silico Mutagenesis (ISM) is a computational technique for measuring how DNA sequence changes affect model predictions, providing biological interpretability.

🧪 How ISM Works

  1. Select a test DNA sequence (binding or non-binding)
  2. Mutate each position systematically to all 4 nucleotides (A, C, G, T)
  3. Predict binding probability for each mutant
  4. Analyze how mutations affect predictions
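The four steps above can be sketched framework-agnostically; `predict_fn` below is a hypothetical stand-in for the trained model (a callable mapping a one-hot array to a binding probability), not a function from the repository:

```python
import numpy as np

def in_silico_mutagenesis(one_hot, predict_fn):
    """Score every single-base substitution of a one-hot sequence.

    one_hot    : np.ndarray of shape (4, L), channels = [A, C, G, T]
    predict_fn : callable mapping a (4, L) array to a binding probability
    Returns a (4, L) array of prediction deltas vs. the reference sequence.
    """
    ref = predict_fn(one_hot)
    n_bases, length = one_hot.shape
    deltas = np.zeros((n_bases, length))
    for pos in range(length):
        for base in range(n_bases):
            mutant = one_hot.copy()
            mutant[:, pos] = 0.0      # clear the original base at this position
            mutant[base, pos] = 1.0   # substitute the candidate base
            deltas[base, pos] = predict_fn(mutant) - ref
    return deltas
```

Positions where any substitution causes a large negative delta are candidate critical nucleotides; plotting `deltas` as a heatmap gives the visualizations described below.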

📊 ISM Analysis Reveals:

  • 🎯 Critical Nucleotides: Which positions are most important for binding
  • 🔄 Substitution Effects: How changing A→T affects binding differently than A→G
  • 🧬 Motif Discovery: Identifies sequence patterns the CNN learned
  • ✅ Biological Validation: Confirms the model learns realistic CTCF binding motifs

🎨 Visualizations

The ISM analysis notebook produces:

  • 🗺️ Heatmaps: Showing mutation impact across all positions
  • 📊 Importance Plots: Highlighting critical binding positions
  • 🔄 Difference Maps: Comparing positive vs. negative sequences

📂 Project Structure

DNA-TF-Binding-Predictor/
├── 📄 README.md                                    # Project documentation
├── 📋 requirements.txt                             # Python dependencies
├── 🚫 .gitignore                                   # Git ignore rules
├── 📝 LICENSE                                      # MIT License
│
├── 📓 notebooks/
│   ├── 01_model_training.ipynb                     # Complete training pipeline
│   └── 02_ism_analysis.ipynb                       # In Silico Mutagenesis analysis
│
├── 📁 data/
│   ├── train.tsv                                   # 61K+ CTCF binding sequences
│   └── README.md                                   # Dataset documentation
│
├── 🤖 models/
│   ├── dna_cnn_model_20250926_234649.pth           # Trained model weights (5.4MB)
│   ├── dna_cnn_model_20250926_234649_config.json   # Model architecture config
│   └── dna_cnn_model_20250926_234649_metrics.json  # Training metrics & history
│
└── 📚 docs/
    └── CTCF_Binding_Presentation.pptx              # Project presentation slides

✨ Key Features

🔧 1. Robust Data Pipeline

  • ✅ Custom PyTorch Dataset: Efficient DNA sequence loading and encoding
  • ✅ DataLoader: Batched training with automatic shuffling
  • ✅ Train/Val/Test Split: Stratified split maintaining class balance
  • ✅ One-Hot Encoding: Handles standard (ACGT) and unknown (N) nucleotides
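A custom Dataset of this kind might look like the sketch below (names and details are illustrative, with the one-hot mapping inlined; the repository's actual class may differ):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# One-hot mapping per the DNA Encoding section; N is uniform over A/C/G/T
_MAPPING = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
            'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1],
            'N': [0.25, 0.25, 0.25, 0.25]}

class DNADataset(Dataset):
    """Wrap (sequence, label) pairs as ((4, L) float tensor, float label)."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx].upper()
        # Unrecognized characters are treated as N
        one_hot = np.array([_MAPPING.get(b, _MAPPING['N']) for b in seq],
                           dtype=np.float32).T
        label = torch.tensor(self.labels[idx], dtype=torch.float32)
        return torch.from_numpy(one_hot), label
```

Wrapping this in a `DataLoader` with `shuffle=True` gives the batched training described above.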

🎯 2. Advanced Training Setup

  • ✅ Early Stopping: Prevents overfitting (patience=8 epochs)
  • ✅ Learning Rate Scheduling: CosineAnnealingLR for smooth convergence
  • ✅ GPU Acceleration: CUDA support for faster training
  • ✅ Comprehensive Metrics: Accuracy, Precision, Recall, F1, AUC-ROC
  • ✅ Training History: Loss and accuracy tracking across all epochs

🔬 3. Model Interpretability

  • ✅ In Silico Mutagenesis: Systematic mutation analysis
  • ✅ Biological Validation: Sanity checks with known binding/non-binding sequences
  • ✅ Visualization Suite: Heatmaps, importance plots, performance curves
  • ✅ Saved Metadata: Model config, metrics, and training details

💾 4. Production-Ready Features

  • ✅ Model Checkpointing: Save/load models with full metadata
  • ✅ Configuration Management: JSON config for reproducibility
  • ✅ Metrics Persistence: Detailed performance metrics saved
  • ✅ Inference Pipeline: Easy prediction on new sequences

🚀 Installation

📋 Prerequisites

  • Python 3.8 or higher
  • CUDA-capable GPU (recommended) or CPU
  • 8GB+ RAM
  • Jupyter Notebook

⚡ Quick Setup

  1. Clone the repository:

git clone https://github.com/sys0507/DNA-TF-Binding-Predictor.git
cd DNA-TF-Binding-Predictor

  2. Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies:

pip install -r requirements.txt

  4. Verify the PyTorch installation:

python -c "import torch; print(f'PyTorch {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

  5. Launch Jupyter Notebook:

jupyter notebook

  6. Open and run the notebooks:
    • Start with notebooks/01_model_training.ipynb for the full training pipeline
    • Then explore notebooks/02_ism_analysis.ipynb for the mutation analysis

💻 Usage

🎬 Training the Model

Open notebooks/01_model_training.ipynb and run all cells. The notebook includes:

  1. Data Loading 📥
    • Loads 61K+ DNA sequences from data/train.tsv
    • Performs the train/validation/test split
  2. Preprocessing 🔧
    • One-hot encodes DNA sequences
    • Creates PyTorch DataLoaders
  3. Model Training 🏋️
    • Initializes the CNN architecture
    • Trains with early stopping and LR scheduling
    • Tracks comprehensive metrics
  4. Evaluation 📊
    • Calculates test set performance
    • Generates training curves
    • Saves the model and metrics
  5. Biological Validation 🔬
    • Tests on known binding/non-binding sequences
    • Verifies the model learned relevant patterns

🧪 Running In Silico Mutagenesis

Open notebooks/02_ism_analysis.ipynb and run all cells:

  1. Load Trained Model 📂
    • Loads the saved model from models/
  2. Select Test Sequences 🧬
    • Choose positive (binding) and negative (non-binding) examples
  3. Systematic Mutation 🔄
    • Mutates each position to all 4 nucleotides
    • Predicts binding for each mutant
  4. Visualization 📊
    • Generates heatmaps showing mutation effects
    • Plots nucleotide importance scores
    • Compares positive vs. negative sequences

🔮 Making Predictions on New Sequences

import torch

# Load the trained model
from notebooks.model_utils import load_model  # Helper function
model = load_model('models/dna_cnn_model_20250926_234649.pth')

# Define your DNA sequence (must be exactly 200bp)
sequence = "ATCGATCG" * 25  # toy 200bp sequence

# One-hot encode (dna_to_one_hot is defined in the Technical Implementation section)
encoded = dna_to_one_hot(sequence)  # (4, 200)
tensor = torch.FloatTensor(encoded).unsqueeze(0)  # (1, 4, 200)

# Predict: the model outputs a logit, so apply sigmoid to get a probability
with torch.no_grad():
    output = model(tensor)
    probability = torch.sigmoid(output).item()

print(f"CTCF Binding Probability: {probability:.4f}")
print(f"Prediction: {'Binding' if probability > 0.5 else 'Non-binding'}")

📈 Results

🏆 Model Performance Highlights

  • ✅ 89.61% Test Accuracy: High reliability for binding site prediction
  • ✅ 96.27% AUC-ROC: Excellent discrimination between binding/non-binding
  • ✅ Low Overfitting: Only a 5.38% accuracy gap between train and test
  • ✅ Balanced Performance: Similar precision (89.07%) and recall (90.31%)
  • ✅ Fast Training: Converged in 28 epochs with early stopping

📊 Training Dynamics

Loss Curves: Smooth convergence with minimal overfitting

  • Final Training Loss: 0.136
  • Final Validation Loss: 0.269
  • Test Loss: 0.259

Accuracy Progression:

  • Started at ~65% (epoch 1)
  • Reached ~89% (epoch 10)
  • Stabilized at ~90% (epoch 20+)
  • Early stopped at epoch 28

🔬 Biological Validation

The model successfully:

  • ✅ Predicts high probability for known CTCF binding sequences
  • ✅ Predicts low probability for random/non-binding sequences
  • ✅ ISM reveals learned motifs consistent with known CTCF binding preferences
  • ✅ Critical positions align with experimentally validated binding sites

โš™๏ธ Technical Implementation

๐Ÿงฌ DNA One-Hot Encoding

def dna_to_one_hot(sequence: str) -> np.ndarray:
    """
    Convert DNA sequence to one-hot encoded tensor.

    Returns:
        np.ndarray: Shape (4, sequence_length)
                    Channels represent [A, C, G, T]
    """
    mapping = {
        'A': [1, 0, 0, 0],
        'C': [0, 1, 0, 0],
        'G': [0, 0, 1, 0],
        'T': [0, 0, 0, 1],
        'N': [0.25, 0.25, 0.25, 0.25]
    }
    return np.array([mapping[base] for base in sequence]).T

๐Ÿ—๏ธ CNN Architecture Details

Key Design Choices:

  • Kernel Size 10: Captures ~10bp motifs typical for TF binding
  • 64 Filters: Sufficient capacity to learn diverse sequence patterns
  • Dropout 0.3: Prevents overfitting while maintaining performance
  • Dense 128: Sufficient for combining learned sequence features

🎯 Training Pipeline

# train_epoch, validate, and EarlyStopping are helper utilities
# defined in the training notebook
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.nn import BCEWithLogitsLoss

# Optimizer
optimizer = Adam(model.parameters(), lr=1e-5)

# Loss function (operates on raw logits)
criterion = BCEWithLogitsLoss()

# Learning rate scheduler
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Early stopping (custom helper, not a PyTorch built-in)
early_stopping = EarlyStopping(patience=8, mode='min')

# Training loop
for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_metrics = validate(model, val_loader, criterion)
    scheduler.step()

    if early_stopping(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
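The loop above relies on an `EarlyStopping` helper, which is not a PyTorch built-in; a minimal version of such a helper might look like this (illustrative, not the notebook's exact implementation):

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without
    improvement in the monitored value (val loss with mode='min')."""

    def __init__(self, patience=8, mode='min'):
        self.patience = patience
        self.sign = 1 if mode == 'min' else -1  # normalize to minimization
        self.best = None
        self.bad_epochs = 0

    def __call__(self, value):
        score = self.sign * value
        if self.best is None or score < self.best:
            self.best = score       # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1    # no improvement this epoch
        return self.bad_epochs >= self.patience
```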

🔮 Future Enhancements

🧠 Model Improvements

  • Attention Mechanisms: Add self-attention to capture long-range dependencies
  • Bidirectional LSTM: Combine the CNN with recurrent layers for sequence context
  • Deeper Architectures: Experiment with ResNet-style connections
  • Ensemble Methods: Combine multiple models for improved predictions

📊 Data & Training

  • Data Augmentation: Reverse-complement sequences, random mutations
  • Multi-Task Learning: Predict binding for multiple transcription factors
  • Transfer Learning: Pre-train on larger genomic datasets
  • Class Weighting: Handle potential class imbalance more explicitly
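As an illustration of the reverse-complement augmentation mentioned above (label-preserving, since a bound window read on the opposite strand is still a bound window):

```python
# Complement table for DNA bases; N maps to itself
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string, e.g. for
    doubling the training set with the same labels."""
    return seq.translate(COMPLEMENT)[::-1]
```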

🔬 Biological Features

  • Motif Visualization: Extract and visualize learned sequence motifs
  • Feature Importance: Use Grad-CAM or integrated gradients
  • Sequence Logos: Generate position weight matrices from learned features
  • 3D Structure Integration: Incorporate DNA shape features

🌐 Deployment

  • Web Application: Flask/FastAPI for online predictions
  • REST API: Enable programmatic access
  • Batch Processing: Efficient genome-wide prediction pipeline
  • Containerization: Docker for reproducible deployment

๐Ÿ› ๏ธ Technologies Used

Category Technologies
Deep Learning PyTorch 2.0+
Language Python 3.8+
Numerical Computing NumPy, SciPy
Data Processing Pandas
Machine Learning scikit-learn
Visualization Matplotlib, Seaborn
Development Jupyter Notebook
Version Control Git, GitHub
Hardware CUDA (GPU acceleration)

📚 Citation & Acknowledgments

📖 Based on Research From

This project is based on methodology from:

Ravarani, C., & Latysheva, N. (2025). Deep Learning for Biology: Harness AI to Solve Real-World Biology Problems. O'Reilly Media.

Note: The training data and model architectures are adapted from this excellent educational resource.

🙏 Acknowledgments

  • 📚 Chiara Ravarani & Natasha Latysheva for the foundational methodology, dataset, and educational framework
  • 🔬 O'Reilly Media for publishing cutting-edge bioinformatics and deep learning content
  • 🧬 The Computational Biology Community for advancing machine learning applications in genomics
  • 💻 The PyTorch Team for an exceptional deep learning framework
  • 🔬 The CTCF Research Community for experimental validation of transcription factor binding

📄 BibTeX Citation

@book{ravarani2025deep,
  title={Deep Learning for Biology: Harness AI to Solve Real-World Biology Problems},
  author={Ravarani, Chiara and Latysheva, Natasha},
  year={2025},
  publisher={O'Reilly Media}
}

๐Ÿค Contributing

Contributions are welcome! Whether you're interested in:

  • ๐Ÿ› Bug Fixes: Report or fix issues
  • โœจ New Features: Add model improvements or analysis tools
  • ๐Ÿ“š Documentation: Improve explanations and examples
  • ๐Ÿงช Testing: Add validation or biological benchmarks

How to Contribute

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

You are free to:

  • โœ… Use the code for research and education
  • โœ… Modify and distribute the code
  • โœ… Use for commercial purposes

With attribution to the original authors and citation of the O'Reilly book.


👤 Author

Created by sys0507

  • 🔬 Passionate about computational biology and deep learning
  • 🧬 Interested in genomics, transcription factor analysis, and regulatory networks
  • 💻 Open to collaborations and discussions

Feel free to reach out with questions, suggestions, or collaboration ideas!


โญ If you find this project helpful, please consider giving it a star!

Made with โค๏ธ for the computational biology community


๐Ÿ“Œ Note: This is an educational project based on the O'Reilly "Deep Learning for Biology" book. The dataset and model architectures are adapted from this resource for learning purposes. For production use, consider additional validation and biological benchmarking.
