A comprehensive study of NLP tasks using BERT and BART: sentiment analysis, paraphrase detection, and controlled text generation.
Overview · Setup · Methodology · Results · Team
| Field | Details |
|---|---|
| Group Code | 06 |
| Tutor | Corinna Wegner |
| Members | Khalid Tariq · Shakhzod Bakhodirov · Alina Amanbayeva |
This project explores five core NLP tasks through fine-tuning of BERT and BART foundation models. Our work goes beyond standard baselines by introducing advanced loss functions, architectural improvements, and training strategies:
| Task | Model | Key Innovation |
|---|---|---|
| Semantic Textual Similarity (STS) | BERT | Siamese network + Triplet Loss |
| Sentiment Analysis (SST) | BERT | CLS + Mean Pooling + LayerNorm |
| Paraphrase Detection (QQP) | BERT | Focal Loss + Adversarial Training (FGM) |
| Paraphrase Type Detection (PTD) | BART | Custom Head + Weighted Sampling |
| Paraphrase Type Generation (PTG) | BART | Multi-Objective Loss + Bayesian Optimization |
```bash
git clone https://github.com/WeskerPRO/NLP_Project.git
cd NLP_Project
```

Option A – Automatic (recommended):

```bash
bash setup.sh
```

Option B – Manual (conda):

```bash
conda create -n dnlp python=3.10 -y
conda activate dnlp
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y tqdm requests scikit-learn numpy pandas matplotlib seaborn scipy -c conda-forge
conda install -y protobuf transformers tokenizers pyarrow=20.0.0 spacy -c conda-forge
conda install -y sentence-transformers=5.0.0 -c conda-forge
pip install explainaboard-client==0.1.4 sacrebleu==2.5.1 optuna==3.6.0 importlib_metadata
```

```bash
# Multitask classifier (SST, STS, QQP, ETPC)
python multitask_classifier.py --option finetune --task=[sst, sts, qqp, etpc] --epochs=15 --use_gpu

# Paraphrase Type Generation
python bart_generation.py --use_gpu

# Paraphrase Type Detection
python bart_detection.py --use_gpu

# Optional: T5 model
python bart_generation.py --model=T5 --use_gpu
```

A Siamese network architecture maps sentence pairs into a shared embedding space using a single weight-shared model.
Key techniques:
- Triplet Loss with Hard Negative Mining – pulls similar embeddings together and pushes dissimilar ones apart; hard negatives are mined within each batch for maximum discrimination.
- MSE Regression Loss – directly optimizes prediction of the ground-truth similarity score in the [0, 5] range.
- Hybrid Loss – $L = \alpha \cdot L_{\text{triplet}} + (1 - \alpha) \cdot L_{\text{regression}}$, with optimal $\alpha = 0.8$.
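As a minimal sketch of the hybrid objective, assuming squared Euclidean distances and an illustrative margin (not necessarily the project's exact values):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors (margin value is illustrative)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def hybrid_loss(anchor, positive, negative, pred_score, true_score, alpha=0.8):
    """L = alpha * L_triplet + (1 - alpha) * L_regression, with alpha = 0.8 per the report."""
    l_triplet = triplet_loss(anchor, positive, negative)
    l_reg = (pred_score - true_score) ** 2   # MSE on the [0, 5] similarity score
    return alpha * l_triplet + (1 - alpha) * l_reg

a = np.array([1.0, 0.0]); p = np.array([0.9, 0.1]); n = np.array([-1.0, 0.0])
print(hybrid_loss(a, p, n, pred_score=3.5, true_score=4.0))  # 0.05
```

Here the triplet term is already zero (the negative is far enough away), so only the down-weighted regression error contributes.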
The vanilla CLS-only baseline suffered from limited expressiveness and overfitting on this small dataset. Three improvements were introduced:
| Improvement | Description |
|---|---|
| Advanced Pooling | Concatenates [CLS] embedding with mean of all last-layer hidden states |
| Layer Normalization | Stabilizes the concatenated representation before the classifier |
| Strong Dropout (p=0.5) | Acts as ensemble regularization, critical for SST's limited size |
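The pooling head above can be sketched as follows; shapes and the absence of learned LayerNorm parameters are simplifications for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift here)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def pooled_representation(hidden_states):
    """Concatenate the [CLS] embedding with the mean of all last-layer hidden states.

    hidden_states: (seq_len, hidden_dim) array from the encoder's last layer;
    position 0 is assumed to hold the [CLS] token.
    """
    cls_vec = hidden_states[0]
    mean_vec = hidden_states.mean(axis=0)
    return layer_norm(np.concatenate([cls_vec, mean_vec]))  # (2 * hidden_dim,)

h = np.random.randn(12, 768)        # 12 tokens, BERT-base hidden size
rep = pooled_representation(h)
print(rep.shape)  # (1536,)
```

The concatenated vector doubles the classifier's input width, which is why the normalization step before the classifier matters for training stability.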
Five targeted improvements over the baseline:
| # | Method | Purpose |
|---|---|---|
| 1 | Advanced Feature Engineering | Element-wise difference, product, cosine similarity between embeddings |
| 2 | Custom Paraphrase Head | Learns nonlinear decision boundaries from interaction features |
| 3 | Dropout Regularization | Prevents over-reliance on any single interaction type |
| 4 | Focal Loss | Down-weights easy examples so training focuses on hard, ambiguous pairs |
| 5 | Adversarial Training (FGM) | Adds worst-case embedding perturbations to boost robustness |
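For reference, the binary focal loss of Lin et al. (2017) can be sketched as below; the focusing parameter γ = 2 is the paper's common default, not necessarily the value used in this project:

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: down-weights well-classified examples.

    p: predicted probability of the positive class; y: true label (0 or 1).
    FL = -(1 - p_t)^gamma * log(p_t), where p_t = p if y == 1 else 1 - p.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(max(p_t, 1e-12))

# An easy example (p_t = 0.95) contributes far less than a hard one (p_t = 0.3):
print(focal_loss(0.95, 1))  # ~1.3e-4
print(focal_loss(0.30, 1))  # ~0.59
```

With γ = 0 this reduces to ordinary cross-entropy; increasing γ shifts the gradient budget toward the hard pairs that dominate QQP errors.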
Multi-label classification across 26 paraphrase categories using the ETPC dataset, evaluated with Matthews Correlation Coefficient (MCC).
Improvements:
- Class-wise Threshold Tuning – each of the 26 classes gets its own optimal threshold on the dev set.
- Weighted Random Sampler – rare paraphrase types get higher sampling weight: $w_i = \frac{1}{\text{count}(i) + \epsilon}$.
- L2 Regularization – weight decay in AdamW: $\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \|\theta\|_2^2$.
- Custom Classifier Head – Linear → SiLU → BatchNorm → Dropout → Output Linear.
- BCEWithLogitsLoss – handles multi-label prediction with per-class binary cross-entropy.
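The sampler weights above can be computed directly from class counts; the ε smoothing value and the toy labels here are hypothetical:

```python
from collections import Counter

def sampling_weights(labels, eps=1e-6):
    """w_i = 1 / (count(class_i) + eps): examples of rare classes are drawn more often."""
    counts = Counter(labels)
    return [1.0 / (counts[y] + eps) for y in labels]

labels = ["addition", "addition", "addition", "negation"]  # toy paraphrase-type labels
w = sampling_weights(labels)
print(w)  # the single "negation" example gets ~3x the weight of each "addition"
```

In a PyTorch pipeline, a per-example weight list like this is what one would hand to `torch.utils.data.WeightedRandomSampler` to rebalance the batches.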
A controlled generation framework trained with a multi-objective loss.
Key contributions:
- Stochastic Sampling – `top_p` and `temperature` encourage diverse, non-repetitive generation.
- Multi-Objective Loss – simultaneously optimizes semantic similarity, lexical variation, and syntactic variation.
- Bayesian Hyperparameter Optimization (Optuna) – intelligently searches learning rate, weight decay, loss weights, and generation parameters.
- BART vs. T5 Comparison – T5 starts strong (instruction-following pre-training), but BART learns to de-copy and generate novel paraphrases over epochs.
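To make the `top_p` (nucleus) filtering step concrete, here is a minimal standalone sketch over a toy next-token distribution; the vocabulary size and probabilities are invented for illustration:

```python
import numpy as np

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p,
    then renormalize. Sampling from this set discards low-probability junk while
    staying more diverse than greedy decoding."""
    order = np.argsort(probs)[::-1]                 # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1        # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])      # toy next-token distribution
print(top_p_filter(probs, top_p=0.9))               # tail tokens are zeroed out
```

In the actual pipeline these knobs correspond to the `top_p` and `temperature` arguments of the Hugging Face `generate()` call with `do_sample=True`.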
| Task | Model | Metric | Score |
|---|---|---|---|
| SST | Baseline | Accuracy | 52.20% |
| SST | Advanced Pooling + Regularization | Accuracy | 55.20% |
| QQP | Baseline | Accuracy | 75.00% |
| QQP | Feature Eng. + Focal Loss + FGM | Accuracy | 85.70% |
| STS | Baseline (MSE) | Pearson r | 0.352 |
| STS | Simple Regression + Regularization | Pearson r | 0.375 |
| STS | Siamese BERT + Contrastive Learning | Pearson r | 0.672 |
Paraphrase Type Detection (PTD)
| Configuration | Accuracy | MCC |
|---|---|---|
| Baseline | 87.50% | 0.180 |
| Threshold Tuning | 87.50% | 0.206 |
| WeightedSampler + BCE + L2 | 78.00% | 0.318 |
| Custom Classifier Head | 80.00% | 0.400 |
| Optuna + Layer Freezing | 56.50% | 0.350 |
Paraphrase Type Generation (PTG)
| Configuration | BLEU | Negative BLEU | Penalized BLEU |
|---|---|---|---|
| Baseline | 48.44 | 2.84 | 2.64 |
| Stochastic Sampling | 45.22 | 20.93 | 18.20 |
| Multi-Objective Loss | 44.08 | 22.61 | 19.16 |
| Optimized Controlled Generation | 42.02 | 29.79 | 24.08 |
Extended Typology Paraphrase Corpus (ETPC) β Bonus
| Configuration | Accuracy | Micro F1 | Macro F1 | Macro MCC |
|---|---|---|---|---|
| Baseline | 0.00% | 0.000 | 0.000 | 0.000 |
| Multi-label + Concatenated Inputs | 85.60% | 0.674 | 0.147 | 0.162 |
Training loss decreases consistently. Dev loss flattens around epoch 8, where Macro MCC peaks, confirming the value of early stopping for model selection.
Figure 1: Stochastic Sampling approach
Figure 2: Optimized Controlled Generation
The controlled generation model does not just generate diverse text: it purposefully balances semantic, lexical, and syntactic objectives through the multi-objective loss, making it superior to pure stochastic sampling.
| Parameter | Value |
|---|---|
| Mode | finetune |
| Epochs | 15 |
| Learning Rate | 1e-05 |
| Weight Decay (L2) | 1e-2 |
| Dropout | 0.3 |
| Batch Size | 16 |
| Optimizer | AdamW |
| Parameter | Value |
|---|---|
| Epochs | 12 |
| Learning Rate | 5e-5 |
| Weight Decay | 1e-2 |
| Dropout | 0.1 |
| Patience | 3 |
| Batch Size | 16 |
| Parameter | Range |
|---|---|
| Learning Rate | 1e-5 – 1e-4 |
| Weight Decay | 1e-5 – 1e-2 |
| Epochs | 10 – 15 |
| Alpha (sem/lex/syn) | 0.1 – 1.0 each |
| Temperature | 0.5 – 1.0 |
| Top-p | 0.80 – 0.95 |
| Repetition Penalty | 1.5 – 3.0 |
| Patience | 4 – 7 |
| Batch Size | 8 |
Shakhzod Bakhodirov · @WeskerPRO · Matriculation: 18749742
Responsible for the three most technically demanding tasks in the project. Designed and implemented the full controlled generation pipeline for PTG including multi-objective loss and Bayesian optimization, built the Siamese BERT architecture with contrastive learning for STS, and resolved the critical preprocessing failure in ETPC to deliver a working multi-label classifier.
| Task | Contribution |
|---|---|
| Paraphrase Type Generation (PTG) | Multi-objective loss · Stochastic sampling · Bayesian optimization · BART vs. T5 comparison |
| Semantic Textual Similarity (STS) | Siamese BERT · Triplet loss · Hard negative mining · Hybrid loss ($\alpha = 0.8$) |
| ETPC Bonus Task | Concatenated input design · Multi-label classification · Full pipeline fix |
AI support was documented in our AI Usage Card.
- Devlin et al. (2018) – BERT: Pre-training of Deep Bidirectional Transformers
- Liu et al. (2019) – RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Reimers & Gurevych (2019) – Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Kumar et al. (2021) – Controlled Text Generation as Continuous Optimization
- Lin et al. (2017) – Focal Loss for Dense Object Detection (RetinaNet)
- Mushava & Murray (2022) – Flexible Loss Functions for Binary Classification
- Loshchilov & Hutter (2017) – Decoupled Weight Decay Regularization
- Ioffe & Szegedy (2015) – Batch Normalization
- Srivastava et al. (2014) – Dropout: Preventing Neural Networks from Overfitting
- Müller, Kornblith & Hinton (2019) – When Does Label Smoothing Help?
- Ramachandran, Zoph & Le (2017) – Searching for Activation Functions (SiLU)