7vik2005/QueryTwin-AI

🧬 QueryTwin AI

Intelligent Duplicate Question Detection powered by XGBoost, NLP Feature Engineering, and Transformer Embeddings

Python XGBoost Streamlit SHAP Optuna


📋 Overview

QueryTwin AI detects whether two natural-language questions are semantic duplicates, that is, whether they ask the same thing in different words. The system combines 22 handcrafted NLP features, TF-IDF representations, and question frequency statistics into an XGBoost classifier tuned with Bayesian hyper-parameter optimisation (Optuna) and explained via SHAP.

Built on the Quora Question Pairs dataset (400K+ labelled pairs).

Key Results

| Metric    | Score  |
|-----------|--------|
| Accuracy  | 85.88% |
| Precision | 82.98% |
| Recall    | 77.69% |
| F1 Score  | 80.25% |
| ROC-AUC   | 93.74% |
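
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the precision and recall figures above. The snippet is illustrative arithmetic only and is not part of the project code.

```python
# Recompute F1 from the reported precision and recall (harmonic mean).
precision = 0.8298
recall = 0.7769

f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 80.25%"
```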

✨ Features

  • 🚀 XGBoost Classifier with Bayesian HPO via Optuna
  • 📝 TF-IDF (1,2)-gram representations (2 × 5,000 dims)
  • 🔢 22 Handcrafted NLP Features: fuzzy matching, token overlap, length stats
  • 📊 3 Frequency Features: question appearance counts
  • 🧠 Sentence-BERT semantic similarity (optional transformer layer)
  • 🔍 SHAP Explainability: per-feature contribution for every prediction
  • 📈 Comprehensive Evaluation: ROC, PR curve, calibration, threshold analysis
  • 🎨 Premium Streamlit Dashboard: dark glassmorphism UI with Plotly
  • 📦 Batch Prediction: CSV upload → bulk results download
  • 🔄 Stratified K-Fold Cross-Validation
  • 🪵 Structured Logging with rotating file handler
  • ✅ Full Test Suite with pytest
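
The structured logging with a rotating file handler listed above could look roughly like the sketch below. It uses only the standard library; `build_logger`, the log format, and the size and backup limits are assumptions for illustration, not the contents of `src/logger.py`.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to a size-rotated file (hypothetical helper)."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            Path(log_dir) / f"{name}.log",
            maxBytes=1_000_000,  # rotate after roughly 1 MB (assumed value)
            backupCount=3,       # keep three rotated files (assumed value)
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `build_logger("train").info("loaded dataset")` would then append a timestamped line to `logs/train.log` and roll the file over once it grows past the size limit.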

🏗️ Architecture

Raw Questions
    │
    ▼
┌──────────────────────┐
│  Text Preprocessing  │  lowercase · HTML strip · contractions · symbols
└────────┬─────────────┘
         │
         ├──────────────────┬──────────────────┐
         ▼                  ▼                  ▼
  22 Handcrafted     3 Frequency        TF-IDF (1,2)-gram
  NLP Features       Features           2 × 5,000 dims
         │                  │                  │
         └──────────────────┴──────────────────┘
                            │
                            ▼
                ┌───────────────────┐
                │  XGBoost (Optuna) │
                └─────────┬─────────┘
                          │
             ┌────────────┼────────────┐
             ▼            ▼            ▼
         Prediction   Confidence   SHAP Explanation
          (0/1)        (0-100%)    (per-feature)
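
The text preprocessing stage in the diagram (lowercase, HTML strip, contractions, symbols) can be sketched with the standard library as follows. The contraction and symbol maps here are small made-up samples and `preprocess` is a hypothetical name; the actual rules live in `src/preprocessing.py` and may differ.

```python
import html
import re

# Small illustrative contraction map; the real pipeline may cover far more.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "what's": "what is",
    "i'm": "i am",
    "don't": "do not",
}

# Illustrative symbol replacements (assumed, not taken from the project).
SYMBOLS = {"%": " percent ", "$": " dollar ", "@": " at ", "&": " and "}

TAG_RE = re.compile(r"<[^>]+>")


def preprocess(text: str) -> str:
    """Lowercase, strip HTML, expand contractions, and spell out symbols."""
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)                # drop HTML tags
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    text = " ".join(CONTRACTIONS.get(tok, tok) for tok in text.split())
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove remaining punctuation
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("<b>What's</b> the best way to save 10% &amp; more?"))
# prints "what is the best way to save 10 percent and more"
```

Normalising both questions this way before feature extraction keeps token-overlap and fuzzy-matching features from being thrown off by casing, markup, or punctuation.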

📂 Project Structure

QueryTwin AI/
│
├── app.py                      # Streamlit dashboard (entry point)
├── setup.py                    # Package configuration
├── requirements.txt            # Python dependencies
├── .gitignore
│
├── src/
│   ├── __init__.py
│   ├── config.py               # Central configuration & paths
│   ├── logger.py               # Structured logging (rotating files)
│   ├── data_loader.py          # Dataset loading & cleaning
│   ├── preprocessing.py        # Text normalisation pipeline
│   ├── feature_engineering.py  # 22 handcrafted features + SBERT
│   ├── train.py                # Training: XGBoost + Optuna + K-Fold
│   ├── evaluate.py             # 8 evaluation charts + metrics
│   └── inference.py            # Prediction engine + SHAP + batch
│
├── tests/
│   ├── conftest.py             # Shared pytest fixtures
│   ├── test_preprocessing.py   # Preprocessing unit tests
│   ├── test_features.py        # Feature engineering tests
│   └── test_inference.py       # Inference integration tests
│
├── data/
│   ├── raw/                    # train.csv, test.csv
│   └── processed/              # Parquet intermediates
│
├── artifacts/                  # Trained model + TF-IDF + metrics
├── reports/                    # Evaluation charts (PNG)
└── logs/                       # Structured log files

🚀 Quick Start

1. Clone & Install

git clone https://github.com/7vik2005/QueryTwin-AI.git
cd QueryTwin-AI

python -m venv venv
venv\Scripts\activate          # Windows
# source venv/bin/activate     # Linux/Mac

pip install -r requirements.txt

2. Download Data

Download the Quora Question Pairs dataset and place train.csv into data/raw/.

3. Train the Model

python -m src.train

This runs the full pipeline: load → clean → featurise → train → evaluate → save.

4. Launch the Dashboard

streamlit run app.py

5. Run Tests

python -m pytest tests/ -v

⚙️ Advanced Usage

Bayesian Hyper-Parameter Optimisation

from src.train import ModelTrainer

trainer = ModelTrainer()
df = trainer.load_dataset()
X = trainer.create_training_matrix(df)
y = trainer.get_target(df)

X_train, X_test, y_train, y_test = trainer.split_data(X, y)
best_params = trainer.hyperparameter_search(X_train, y_train, X_test, y_test)

Stratified K-Fold Cross-Validation

cv_results = trainer.cross_validate(X, y)
# → {'accuracy': 0.8590, 'f1_score': 0.8031, 'roc_auc': 0.9375}
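
For readers unfamiliar with stratification, the sketch below shows the idea in plain Python: each class is dealt round-robin across the folds so that every fold keeps roughly the same duplicate rate as the full dataset. `stratified_kfold` is a hypothetical helper written for illustration; how `trainer.cross_validate` splits the data internally is not shown here, and a real pipeline would more likely rely on scikit-learn's `StratifiedKFold`.

```python
from collections import defaultdict
import random


def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror `labels`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


labels = [1] * 40 + [0] * 60   # toy labels, 40% duplicates
for train_idx, test_idx in stratified_kfold(labels, n_splits=5):
    dup_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    print(len(train_idx), len(test_idx), f"{dup_rate:.2f}")  # 80 20 0.40 on every fold
```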

SHAP Explanations

from src.inference import DuplicateQuestionPredictor

predictor = DuplicateQuestionPredictor()
explanation = predictor.explain(
    "How do I learn Python?",
    "What is the best way to learn Python?"
)
# → {'fuzz_token_set_ratio': +0.1234, 'common_word_ratio': +0.0891, ...}

Sentence-BERT Similarity (Optional)

pip install sentence-transformers

from src.feature_engineering import SBERTFeatureExtractor

sbert = SBERTFeatureExtractor()
similarity = sbert.compute_similarity(
    "How do I learn Python?",
    "What is the best way to learn Python?"
)
# → {'cosine_similarity': 0.8912, 'euclidean_distance': 0.4321, ...}
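
The two metrics named in the output above have standard definitions, sketched here in plain Python on toy vectors. This only illustrates the arithmetic; how `SBERTFeatureExtractor` computes them internally is not shown in this document, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 3-dimensional "embeddings", made up for illustration.
q1_vec = [0.2, 0.7, 0.1]
q2_vec = [0.25, 0.6, 0.2]
print(round(cosine_similarity(q1_vec, q2_vec), 4))
print(round(euclidean_distance(q1_vec, q2_vec), 4))
```

Cosine similarity ignores vector length and rewards pointing in the same direction, while Euclidean distance also reflects differences in magnitude, which is one reason to expose both as features.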

🔬 Feature Engineering Details

| Category  | Count  | Examples                                             |
|-----------|--------|------------------------------------------------------|
| Basic     | 7      | Character length, word count, common word ratio      |
| Token     | 8      | Stopword overlap ratios, first/last word equality    |
| Length    | 3      | Token diff, avg length, LCS ratio                    |
| Fuzzy     | 4      | QRatio, partial ratio, token sort/set ratio          |
| Frequency | 3      | Question appearance count in training corpus         |
| TF-IDF    | 10,000 | Unigram + bigram (2 × 5,000 features)                |
| Total     | 10,025 |                                                      |
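
To make the categories above concrete, here is a minimal sketch of a handful of pair features using only the standard library. Everything in it is an assumption for illustration: `pair_features` and the feature names are hypothetical, "LCS" is read as longest common substring, and `difflib` stands in for FuzzyWuzzy, so its scores will not match the project's fuzzy features.

```python
from collections import Counter
from difflib import SequenceMatcher


def pair_features(q1: str, q2: str, question_counts: Counter) -> dict:
    """Compute a few illustrative pair features (names are hypothetical)."""
    tokens1, tokens2 = q1.split(), q2.split()
    set1, set2 = set(tokens1), set(tokens2)
    common = len(set1 & set2)
    total = len(set1) + len(set2)
    both_non_empty = bool(tokens1) and bool(tokens2)

    # Longest common substring length, relative to the shorter question.
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    longest = matcher.find_longest_match(0, len(q1), 0, len(q2)).size

    return {
        "q1_len": len(q1),
        "q2_len": len(q2),
        "common_words": common,
        "common_word_ratio": common / total if total else 0.0,
        "first_word_equal": int(both_non_empty and tokens1[0] == tokens2[0]),
        "last_word_equal": int(both_non_empty and tokens1[-1] == tokens2[-1]),
        "token_count_diff": abs(len(tokens1) - len(tokens2)),
        "longest_substr_ratio": longest / min(len(q1), len(q2)) if q1 and q2 else 0.0,
        # difflib stand-in for a fuzzy ratio; FuzzyWuzzy scores will differ.
        "fuzzy_ratio": round(100 * matcher.ratio()),
        # Frequency features: how often each question appears in the corpus.
        "q1_freq": question_counts[q1],
        "q2_freq": question_counts[q2],
    }


corpus = [
    "how do i learn python",
    "what is the best way to learn python",
    "how do i learn python",
]
counts = Counter(corpus)
features = pair_features(corpus[0], corpus[1], counts)
print(features["common_words"], features["last_word_equal"], features["q1_freq"])  # 2 1 2
```

Dense features like these would sit alongside the sparse TF-IDF columns in the final training matrix.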

📊 Evaluation Suite

The evaluation module generates 8 publication-quality visualisations:

  1. Confusion Matrix: true/false positive/negative breakdown
  2. ROC Curve: TPR vs FPR with AUC
  3. Precision-Recall Curve: precision vs recall with PR-AUC
  4. Calibration Curve: reliability diagram
  5. Threshold Analysis: metrics vs decision threshold
  6. Feature Importance: top 20 features by XGBoost gain
  7. Metrics Bar Chart: visual metric comparison
  8. Classification Report: per-class precision/recall/F1
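
The confusion matrix and threshold analysis items above rest on the same counting step, sketched below in plain Python. `metrics_at_threshold` and the toy data are made up for illustration and are not taken from `src/evaluate.py`.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Return confusion counts plus precision/recall/F1 at one threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy labels and predicted probabilities, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1, 0.7, 0.2]

for threshold in (0.3, 0.5, 0.7):
    m = metrics_at_threshold(y_true, y_prob, threshold)
    print(threshold, m["precision"], m["recall"])
```

Sweeping the threshold this way exposes the usual trade-off: lowering it raises recall at the cost of precision, which helps when choosing an operating point other than 0.5.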

🛠️ Tech Stack

| Layer          | Technology                  |
|----------------|-----------------------------|
| ML Framework   | XGBoost, scikit-learn       |
| HPO            | Optuna (Bayesian)           |
| Explainability | SHAP (TreeExplainer)        |
| NLP            | NLTK, FuzzyWuzzy, TF-IDF    |
| Transformers   | Sentence-BERT (optional)    |
| Visualisation  | Plotly, Matplotlib, Seaborn |
| Web App        | Streamlit                   |
| Testing        | pytest                      |
| Logging        | Python logging (rotating)   |

📄 License

This project is licensed under the MIT License.


Built with 🧬 by Satvik

