Intelligent Duplicate Question Detection powered by XGBoost, NLP Feature Engineering, and Transformer Embeddings
QueryTwin AI detects whether two natural-language questions are semantically duplicate: asking the same thing in different words. The system combines 22 handcrafted NLP features, TF-IDF representations, and question frequency statistics into an XGBoost classifier tuned with Bayesian hyper-parameter optimisation (Optuna) and explained via SHAP.
Built on the Quora Question Pairs dataset (400K+ labelled pairs).
| Metric | Score |
|---|---|
| Accuracy | 85.88% |
| Precision | 82.98% |
| Recall | 77.69% |
| F1 Score | 80.25% |
| ROC-AUC | 93.74% |
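The reported F1 score can be cross-checked against the precision and recall above, since F1 is their harmonic mean. The snippet below is only a sanity check written for this README; the function name `f1_from_precision_recall` is made up here and is not part of the project.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the metrics table.
f1 = f1_from_precision_recall(0.8298, 0.7769)
print(f"F1 = {f1:.2%}")  # F1 = 80.25%
```

The result agrees with the F1 row of the table to two decimal places.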
- XGBoost Classifier with Bayesian HPO via Optuna
- TF-IDF (1,2)-gram representations (2 × 5,000 dims)
- 22 Handcrafted NLP Features: fuzzy matching, token overlap, length stats
- 3 Frequency Features: question appearance counts
- Sentence-BERT semantic similarity (optional transformer layer)
- SHAP Explainability: per-feature contribution for every prediction
- Comprehensive Evaluation: ROC, PR curve, calibration, threshold analysis
- Premium Streamlit Dashboard: dark glassmorphism UI with Plotly
- Batch Prediction: CSV upload → bulk results download
- Stratified K-Fold Cross-Validation
- Structured Logging with rotating file handler
- Full Test Suite with pytest
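To make the frequency features concrete, here is a minimal sketch of how per-question appearance counts could be derived. It rests on assumptions: the README does not say which three statistics are used, so the feature names (`q1_freq`, `q2_freq`, `freq_sum`) and both helper functions are hypothetical.

```python
from collections import Counter


def build_question_counts(pairs):
    """Count how often each question string appears anywhere in the corpus."""
    counts = Counter()
    for q1, q2 in pairs:
        counts[q1] += 1
        counts[q2] += 1
    return counts


def frequency_features(q1, q2, counts):
    """Return three illustrative frequency features for one question pair."""
    f1, f2 = counts[q1], counts[q2]  # unseen questions count as 0
    return {"q1_freq": f1, "q2_freq": f2, "freq_sum": f1 + f2}


# Toy corpus of already-cleaned question pairs, invented for illustration.
corpus = [
    ("how do i learn python", "what is the best way to learn python"),
    ("how do i learn python", "how can i learn java"),
]
counts = build_question_counts(corpus)
features = frequency_features(
    "how do i learn python", "how can i learn java", counts
)
print(features)  # {'q1_freq': 2, 'q2_freq': 1, 'freq_sum': 3}
```

Counts like these should be built from the training corpus only, so that evaluation data does not leak into the features.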
Architecture overview:

                       Raw Questions
                             │
                             ▼
                   ┌────────────────────┐
                   │ Text Preprocessing │  lowercase · HTML strip · contractions · symbols
                   └─────────┬──────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
    22 Handcrafted      3 Frequency    TF-IDF (1,2)-gram
     NLP Features        Features       2 × 5,000 dims
           │                 │                 │
           └─────────────────┼─────────────────┘
                             │
                             ▼
                   ┌──────────────────┐
                   │ XGBoost (Optuna) │
                   └─────────┬────────┘
                             │
             ┌───────────────┼───────────────┐
             ▼               ▼               ▼
        Prediction      Confidence   SHAP Explanation
           (0/1)         (0-100%)      (per-feature)
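The preprocessing stage in the diagram can be sketched with the standard library alone. This is a rough illustration, not the code in `src/preprocessing.py`: the contraction map, the symbol replacements, and the function name `preprocess` are all assumptions chosen for the example.

```python
import html
import re

# A tiny, illustrative contraction map; a real pipeline would use a larger one.
CONTRACTIONS = {
    "what's": "what is",
    "can't": "cannot",
    "don't": "do not",
    "i'm": "i am",
}

# Illustrative symbol expansions (assumed, not taken from the project).
SYMBOLS = {"%": " percent ", "$": " dollar ", "&": " and "}


def preprocess(text: str) -> str:
    """Lowercase, strip HTML, expand contractions and symbols, tidy spaces."""
    text = html.unescape(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove leftover punctuation
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("<b>What's</b> the best way to save 10% &amp; more?"))
# what is the best way to save 10 percent and more
```

Both questions in a pair would go through the same function before any features are computed, so that overlap and fuzzy scores compare like with like.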
Project structure:

    QueryTwin AI/
    │
    ├── app.py                      # Streamlit dashboard (entry point)
    ├── setup.py                    # Package configuration
    ├── requirements.txt            # Python dependencies
    ├── .gitignore
    │
    ├── src/
    │   ├── __init__.py
    │   ├── config.py               # Central configuration & paths
    │   ├── logger.py               # Structured logging (rotating files)
    │   ├── data_loader.py          # Dataset loading & cleaning
    │   ├── preprocessing.py        # Text normalisation pipeline
    │   ├── feature_engineering.py  # 22 handcrafted features + SBERT
    │   ├── train.py                # Training: XGBoost + Optuna + K-Fold
    │   ├── evaluate.py             # 8 evaluation charts + metrics
    │   └── inference.py            # Prediction engine + SHAP + batch
    │
    ├── tests/
    │   ├── conftest.py             # Shared pytest fixtures
    │   ├── test_preprocessing.py   # Preprocessing unit tests
    │   ├── test_features.py        # Feature engineering tests
    │   └── test_inference.py       # Inference integration tests
    │
    ├── data/
    │   ├── raw/                    # train.csv, test.csv
    │   └── processed/              # Parquet intermediates
    │
    ├── artifacts/                  # Trained model + TF-IDF + metrics
    ├── reports/                    # Evaluation charts (PNG)
    └── logs/                       # Structured log files
Clone the repository, create a virtual environment, and install the dependencies:

    git clone https://github.com/your-username/QueryTwin-AI.git
    cd QueryTwin-AI
    python -m venv venv
    venv\Scripts\activate          # Windows
    # source venv/bin/activate     # Linux/Mac
    pip install -r requirements.txt

Download the Quora Question Pairs dataset and place `train.csv` into `data/raw/`.

    python -m src.train

This runs the full pipeline: load → clean → featurise → train → evaluate → save.
Launch the dashboard:

    streamlit run app.py

Run the test suite:

    python -m pytest tests/ -v

Train and tune the model from Python:

    from src.train import ModelTrainer

    trainer = ModelTrainer()
    df = trainer.load_dataset()
    X = trainer.create_training_matrix(df)
    y = trainer.get_target(df)
    X_train, X_test, y_train, y_test = trainer.split_data(X, y)
    best_params = trainer.hyperparameter_search(X_train, y_train, X_test, y_test)

Cross-validate with the same trainer:

    cv_results = trainer.cross_validate(X, y)
    # → {'accuracy': 0.8590, 'f1_score': 0.8031, 'roc_auc': 0.9375}

Explain a single prediction with SHAP:

    from src.inference import DuplicateQuestionPredictor

    predictor = DuplicateQuestionPredictor()
    explanation = predictor.explain(
        "How do I learn Python?",
        "What is the best way to learn Python?"
    )
    # → {'fuzz_token_set_ratio': +0.1234, 'common_word_ratio': +0.0891, ...}

Install the optional Sentence-BERT dependency:

    pip install sentence-transformers

Compute transformer-based similarity:

    from src.feature_engineering import SBERTFeatureExtractor

    sbert = SBERTFeatureExtractor()
    similarity = sbert.compute_similarity(
        "How do I learn Python?",
        "What is the best way to learn Python?"
    )
    # → {'cosine_similarity': 0.8912, 'euclidean_distance': 0.4321, ...}

| Category | Count | Examples |
|---|---|---|
| Basic | 7 | Character length, word count, common word ratio |
| Token | 8 | Stopword overlap ratios, first/last word equality |
| Length | 3 | Token diff, avg length, LCS ratio |
| Fuzzy | 4 | QRatio, partial ratio, token sort/set ratio |
| Frequency | 3 | Question appearance count in training corpus |
| TF-IDF | 10,000 | Unigram + bigram (2 × 5,000 features) |
| Total | 10,025 | |
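As an illustration of the handcrafted categories in the table, the sketch below computes a handful of pairwise features. It is not the project's implementation: `pair_features` and its key names are hypothetical, the word-overlap feature is a plain Jaccard ratio, and `difflib.SequenceMatcher` from the standard library stands in for the FuzzyWuzzy scores so the example runs without extra dependencies.

```python
from difflib import SequenceMatcher


def pair_features(q1: str, q2: str) -> dict:
    """Compute a few illustrative similarity features for two cleaned questions."""
    t1, t2 = q1.split(), q2.split()
    common = set(t1) & set(t2)
    total = set(t1) | set(t2)
    both_non_empty = bool(t1) and bool(t2)
    return {
        "len_diff": abs(len(q1) - len(q2)),        # character length gap
        "token_diff": abs(len(t1) - len(t2)),      # word count gap
        "jaccard_word_overlap": len(common) / len(total) if total else 0.0,
        "first_word_eq": int(both_non_empty and t1[0] == t2[0]),
        "last_word_eq": int(both_non_empty and t1[-1] == t2[-1]),
        # Stand-in for a fuzzy ratio, scaled to 0-100 like FuzzyWuzzy scores.
        "fuzzy_ratio": round(100 * SequenceMatcher(None, q1, q2).ratio()),
    }


features = pair_features(
    "how do i learn python",
    "what is the best way to learn python",
)
print(features["token_diff"], features["last_word_eq"])  # 3 1
```

Each pair yields one small dictionary of numbers; in the pipeline described here, values of this kind would sit alongside the frequency and TF-IDF columns in the final feature matrix.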
The evaluation module generates 8 publication-quality visualisations:
- Confusion Matrix: True/false positive/negative breakdown
- ROC Curve: TPR vs FPR with AUC
- Precision-Recall Curve: Precision vs Recall with PR-AUC
- Calibration Curve: Reliability diagram
- Threshold Analysis: Metrics vs decision threshold
- Feature Importance: Top 20 features by XGBoost gain
- Metrics Bar Chart: Visual metric comparison
- Classification Report: Per-class precision/recall/F1
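Threshold analysis, one of the charts listed above, amounts to recomputing the metrics while sweeping the decision threshold. A minimal sketch in plain Python follows; the labels and probabilities are invented toy values, and `metrics_at_threshold` is a hypothetical helper rather than a function from `src/evaluate.py`.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Return precision, recall and F1 when predicting 1 for prob >= threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels and predicted probabilities, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

for threshold in (0.3, 0.5, 0.7):
    p, r, f1 = metrics_at_threshold(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
# threshold=0.3  precision=0.67  recall=1.00  f1=0.80
# threshold=0.5  precision=0.75  recall=0.75  f1=0.75
# threshold=0.7  precision=1.00  recall=0.50  f1=0.67
```

Raising the threshold trades recall for precision, which is the trade-off the threshold chart is meant to expose.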
| Layer | Technology |
|---|---|
| ML Framework | XGBoost, scikit-learn |
| HPO | Optuna (Bayesian) |
| Explainability | SHAP (TreeExplainer) |
| NLP | NLTK, FuzzyWuzzy, TF-IDF |
| Transformers | Sentence-BERT (optional) |
| Visualisation | Plotly, Matplotlib, Seaborn |
| Web App | Streamlit |
| Testing | pytest |
| Logging | Python logging (rotating) |
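For the logging layer, a rotating file handler from the standard library might be configured roughly as follows. The function name `get_logger`, the size limit, the backup count, and the log format are assumptions for illustration; `src/logger.py` may differ.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to a size-rotated file under log_dir."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        return logger
    handler = RotatingFileHandler(
        Path(log_dir) / f"{name}.log",
        maxBytes=1_000_000,  # rotate after roughly 1 MB (assumed value)
        backupCount=3,       # keep three old files (assumed value)
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


# The demo writes to a temporary directory; the project keeps logs under logs/.
logger = get_logger("querytwin_demo", log_dir=tempfile.mkdtemp())
logger.info("training started")
```

Calling `get_logger` again with the same name returns the already configured logger, so modules can request it freely without duplicating output.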
This project is licensed under the MIT License.