🧬 QueryTwin AI

Intelligent Duplicate Question Detection powered by XGBoost, NLP Feature Engineering, and Transformer Embeddings

📋 Overview

QueryTwin AI detects whether two natural-language questions are semantic duplicates, that is, whether they ask the same thing in different words. The system feeds 22 handcrafted NLP features, TF-IDF representations, and question frequency statistics into an XGBoost classifier that is tuned with Bayesian hyper-parameter optimisation (Optuna) and explained via SHAP.

Built on the Quora Question Pairs dataset (400K+ labelled pairs).

Key Results

Metric	Score
Accuracy	85.88%
Precision	82.98%
Recall	77.69%
F1 Score	80.25%
ROC-AUC	93.74%
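
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two rows. The sketch below is illustrative only; the function name is not part of the project.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the Key Results table.
f1 = f1_from_precision_recall(0.8298, 0.7769)
print(f"{f1:.2%}")  # prints 80.25%
```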

✨ Features

🚀 XGBoost Classifier with Bayesian HPO via Optuna
📝 TF-IDF (1,2)-gram representations (2 × 5000 dims)
🔢 22 Handcrafted NLP Features — fuzzy matching, token overlap, length stats
📊 3 Frequency Features — question appearance counts
🧠 Sentence-BERT semantic similarity (optional transformer layer)
🔍 SHAP Explainability — per-feature contribution for every prediction
📈 Comprehensive Evaluation — ROC, PR curve, calibration, threshold analysis
🎨 Premium Streamlit Dashboard — dark glassmorphism UI with Plotly
📦 Batch Prediction — CSV upload → bulk results download
🔄 Stratified K-Fold Cross-Validation
🪵 Structured Logging with rotating file handler
✅ Full Test Suite with pytest
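
The frequency features in the list above are described as question appearance counts. The exact three features are not spelled out here, so the sketch below assumes one plausible choice: how often each question of a pair occurs in the corpus, plus the sum of the two. The function and key names are hypothetical.

```python
from collections import Counter


def frequency_features(pairs):
    """Count how often each question text appears anywhere in the corpus."""
    counts = Counter()
    for q1, q2 in pairs:
        counts[q1] += 1
        counts[q2] += 1
    # Hypothetical feature names; the project's own names may differ.
    return [
        {
            "q1_freq": counts[q1],
            "q2_freq": counts[q2],
            "freq_sum": counts[q1] + counts[q2],
        }
        for q1, q2 in pairs
    ]


pairs = [
    ("how do i learn python", "best way to learn python"),
    ("how do i learn python", "how to learn java"),
    ("what is xgboost", "how to learn java"),
]
for row in frequency_features(pairs):
    print(row)
```

Because the counts come from a fixed corpus, a question never seen during training would get a count of 0 at inference time under this scheme.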

🏗️ Architecture

Raw Questions
    │
    ▼
┌──────────────────────┐
│  Text Preprocessing  │  lowercase · HTML strip · contractions · symbols
└────────┬─────────────┘
         │
         ├──────────────────┬──────────────────┐
         ▼                  ▼                  ▼
  22 Handcrafted     3 Frequency        TF-IDF (1,2)-gram
  NLP Features       Features           2 × 5,000 dims
         │                  │                  │
         └──────────────────┴──────────────────┘
                            │
                            ▼
                ┌───────────────────┐
                │  XGBoost (Optuna) │
                └─────────┬─────────┘
                          │
             ┌────────────┼────────────┐
             ▼            ▼            ▼
         Prediction   Confidence   SHAP Explanation
          (0/1)        (0-100%)    (per-feature)
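
The preprocessing stage in the diagram names four steps: lowercasing, HTML stripping, contraction expansion, and symbol handling. The following is a minimal standard-library sketch of such a pipeline, not the contents of src/preprocessing.py. The function name `normalize_question` and both lookup tables are assumptions made for illustration.

```python
import html
import re

# Hypothetical, deliberately tiny lookup tables; a real pipeline would use larger ones.
CONTRACTIONS = {"what's": "what is", "can't": "cannot", "don't": "do not", "i'm": "i am"}
SYMBOLS = {"%": " percent ", "$": " dollar ", "&": " and ", "@": " at "}


def normalize_question(text: str) -> str:
    """Lowercase, strip HTML, expand contractions, and spell out symbols."""
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML tags with a space
    text = text.lower().strip()
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    # Expand contractions token by token before punctuation is removed.
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    text = re.sub(r"[^\w\s]", " ", text)  # remaining punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()


print(normalize_question("<b>What's</b> 50% of $20?"))
# prints: what is 50 percent of dollar 20
```

Contractions are expanded before punctuation is removed, since stripping the apostrophe first would leave tokens like "what s" that no longer match the table.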

📂 Project Structure

QueryTwin AI/
│
├── app.py                      # Streamlit dashboard (entry point)
├── setup.py                    # Package configuration
├── requirements.txt            # Python dependencies
├── .gitignore
│
├── src/
│   ├── __init__.py
│   ├── config.py               # Central configuration & paths
│   ├── logger.py               # Structured logging (rotating files)
│   ├── data_loader.py          # Dataset loading & cleaning
│   ├── preprocessing.py        # Text normalisation pipeline
│   ├── feature_engineering.py  # 22 handcrafted features + SBERT
│   ├── train.py                # Training: XGBoost + Optuna + K-Fold
│   ├── evaluate.py             # 8 evaluation charts + metrics
│   └── inference.py            # Prediction engine + SHAP + batch
│
├── tests/
│   ├── conftest.py             # Shared pytest fixtures
│   ├── test_preprocessing.py   # Preprocessing unit tests
│   ├── test_features.py        # Feature engineering tests
│   └── test_inference.py       # Inference integration tests
│
├── data/
│   ├── raw/                    # train.csv, test.csv
│   └── processed/              # Parquet intermediates
│
├── artifacts/                  # Trained model + TF-IDF + metrics
├── reports/                    # Evaluation charts (PNG)
└── logs/                       # Structured log files

🚀 Quick Start

1. Clone & Install

git clone https://github.com/your-username/QueryTwin-AI.git
cd QueryTwin-AI

python -m venv venv
venv\Scripts\activate          # Windows
# source venv/bin/activate     # Linux/Mac

pip install -r requirements.txt

2. Download Data

Download the Quora Question Pairs dataset (available on Kaggle) and place train.csv in data/raw/.

3. Train the Model

python -m src.train

This runs the full pipeline: load → clean → featurise → train → evaluate → save.

4. Launch the Dashboard

streamlit run app.py

5. Run Tests

python -m pytest tests/ -v

⚙️ Advanced Usage

Bayesian Hyper-Parameter Optimisation

from src.train import ModelTrainer

trainer = ModelTrainer()
df = trainer.load_dataset()
X = trainer.create_training_matrix(df)
y = trainer.get_target(df)

X_train, X_test, y_train, y_test = trainer.split_data(X, y)
best_params = trainer.hyperparameter_search(X_train, y_train, X_test, y_test)

Stratified K-Fold Cross-Validation

# Reuses trainer, X, and y from the previous snippet.
cv_results = trainer.cross_validate(X, y)
# → {'accuracy': 0.8590, 'f1_score': 0.8031, 'roc_auc': 0.9375}
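
Stratification keeps the ratio of duplicate to non-duplicate pairs roughly equal in every fold, which matters when the classes are imbalanced. The sketch below illustrates the idea with the standard library only. It is not the project's `cross_validate` implementation, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to a fold so every fold keeps the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so fold sizes differ by at most one per class.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Toy labels: 10 duplicates (1) and 20 non-duplicates (0).
labels = [1] * 10 + [0] * 20
folds = stratified_folds(labels)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # every fold: 6 samples, 2 of them duplicates
```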

SHAP Explanations

from src.inference import DuplicateQuestionPredictor

predictor = DuplicateQuestionPredictor()
explanation = predictor.explain(
    "How do I learn Python?",
    "What is the best way to learn Python?"
)
# → {'fuzz_token_set_ratio': +0.1234, 'common_word_ratio': +0.0891, ...}
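
SHAP values for tree models are additive: a base value plus the sum of the per-feature contributions equals the model's raw output, which for a binary XGBoost classifier is a log-odds score by default. The sketch below shows how a dictionary like the one above could be ranked and converted to a probability. It assumes that `explain` returns log-odds contributions; the base value, the third feature name, and all numbers are made up for illustration.

```python
import math


def sigmoid(log_odds: float) -> float:
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative numbers only, not real model output.
base_value = -0.5  # hypothetical expected log-odds over the training set
contributions = {
    "fuzz_token_set_ratio": 0.1234,
    "common_word_ratio": 0.0891,
    "abs_len_diff": -0.0312,  # hypothetical feature name
}

# Additivity: base value + sum of contributions = raw model output (log-odds).
log_odds = base_value + sum(contributions.values())
probability = sigmoid(log_odds)

# Rank features by absolute impact, largest first.
ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
for name, value in ranked:
    print(f"{name:>22}: {value:+.4f}")
print(f"P(duplicate) = {probability:.3f}")
```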

Sentence-BERT Similarity (Optional)

pip install sentence-transformers

from src.feature_engineering import SBERTFeatureExtractor

sbert = SBERTFeatureExtractor()
similarity = sbert.compute_similarity(
    "How do I learn Python?",
    "What is the best way to learn Python?"
)
# → {'cosine_similarity': 0.8912, 'euclidean_distance': 0.4321, ...}
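
The two metrics named in the output above can be computed directly from a pair of embedding vectors. The sketch below uses plain Python and toy 3-dimensional vectors as stand-ins for real sentence embeddings. The function names are illustrative and are not taken from `SBERTFeatureExtractor`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]
print(round(cosine_similarity(u, v), 4))  # 0.7071
print(euclidean_distance(u, v))           # 1.0
```

For unit-length embeddings the squared Euclidean distance equals 2 − 2 × cosine similarity, so the two metrics rank pairs identically in that case.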

🔬 Feature Engineering Details

Category	Count	Examples
Basic	7	Character length, word count, common word ratio
Token	8	Stopword overlap ratios, first/last word equality
Length	3	Token diff, avg length, LCS ratio
Fuzzy	4	QRatio, partial ratio, token sort/set ratio
Frequency	3	Question appearance count in training corpus
TF-IDF	10,000	Unigram + bigram (2 × 5,000 features)
Total	10,025	22 handcrafted + 3 frequency + 10,000 TF-IDF
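
To make the handcrafted categories in the table concrete, the sketch below computes a small subset of pairwise features with the standard library. It is an illustration rather than the code in src/feature_engineering.py. The feature definitions (for example, common words divided by the total unique words of both questions) are assumptions, and `difflib.SequenceMatcher` is used as a stand-in for the FuzzyWuzzy ratios.

```python
from difflib import SequenceMatcher


def pair_features(q1: str, q2: str) -> dict:
    """Compute a small, illustrative subset of pairwise features."""
    tokens1, tokens2 = q1.lower().split(), q2.lower().split()
    both_non_empty = bool(tokens1) and bool(tokens2)
    common = set(tokens1) & set(tokens2)
    total = len(set(tokens1)) + len(set(tokens2))
    return {
        "q1_len": len(q1),
        "q2_len": len(q2),
        # Assumed definition: shared unique words over total unique words.
        "common_word_ratio": len(common) / total if total else 0.0,
        "first_word_equal": int(both_non_empty and tokens1[0] == tokens2[0]),
        "last_word_equal": int(both_non_empty and tokens1[-1] == tokens2[-1]),
        "abs_token_diff": abs(len(tokens1) - len(tokens2)),
        # Stand-in for a fuzzy ratio: difflib similarity scaled to 0-100.
        "similarity_ratio": round(
            100 * SequenceMatcher(None, q1.lower(), q2.lower()).ratio()
        ),
    }


features = pair_features(
    "How do I learn Python?",
    "What is the best way to learn Python?",
)
for name, value in features.items():
    print(f"{name}: {value}")
```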

📊 Evaluation Suite

The evaluation module generates 8 visualisations, saved as PNG files in reports/:

Confusion Matrix — True/false positive/negative breakdown
ROC Curve — TPR vs FPR with AUC
Precision-Recall Curve — Precision vs Recall with PR-AUC
Calibration Curve — Reliability diagram
Threshold Analysis — Metrics vs decision threshold
Feature Importance — Top 20 features by XGBoost gain
Metrics Bar Chart — Visual metric comparison
Classification Report — Per-class precision/recall/F1
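
The threshold analysis in the list above plots metrics against the decision threshold. The sketch below shows the underlying computation on toy data: precision, recall, and F1 are recomputed for each candidate threshold. The function name and the sample values are illustrative and are not taken from src/evaluate.py.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Precision, recall, and F1 when predicting 1 for probabilities >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1, 0.7, 0.2]

for threshold in (0.3, 0.5, 0.7):
    p, r, f = metrics_at_threshold(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision from 0.67 to 1.00 while recall falls from 1.00 to 0.75, which is the trade-off the threshold chart is meant to expose.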

🛠️ Tech Stack

Layer	Technology
ML Framework	XGBoost, scikit-learn
HPO	Optuna (Bayesian)
Explainability	SHAP (TreeExplainer)
NLP	NLTK, FuzzyWuzzy, TF-IDF
Transformers	Sentence-BERT (optional)
Visualisation	Plotly, Matplotlib, Seaborn
Web App	Streamlit
Testing	pytest
Logging	Python logging (rotating)

📄 License

This project is licensed under the MIT License.

Built with 🧬 by Satvik
