# 🧠 DNLP SS25 – Gradient Descenders

A comprehensive study of NLP tasks using BERT and BART: sentiment analysis, paraphrase detection, and controlled text generation.

📖 Overview · ⚙️ Setup · 🔬 Methodology · 📊 Results · 👥 Team


## 📌 Project Info

| Field | Details |
| --- | --- |
| 🏷️ Group Code | 06 |
| 👩‍🏫 Tutor | Corinna Wegner |
| 👥 Members | Khalid Tariq · Shakhzod Bakhodirov · Alina Amanbayeva |

πŸ—ΊοΈ Overview

This project explores five core NLP tasks through fine-tuning of BERT and BART foundation models. Our work goes beyond standard baselines by introducing advanced loss functions, architectural improvements, and training strategies:

Task Model Key Innovation
Semantic Textual Similarity (STS) BERT Siamese network + Triplet Loss
Sentiment Analysis (SST) BERT CLS + Mean Pooling + LayerNorm
Paraphrase Detection (QQP) BERT Focal Loss + Adversarial Training (FGM)
Paraphrase Type Detection (PTD) BART Custom Head + Weighted Sampling
Paraphrase Type Generation (PTG) BART Multi-Objective Loss + Bayesian Optimization

βš™οΈ Setup Instructions

1. Clone the Repository

git clone https://github.com/WeskerPRO/NLP_Project.git
cd NLP_Project

2. Install Dependencies

Option A β€” Automatic (recommended):

bash setup.sh

Option B β€” Manual (conda):

conda create -n dnlp python=3.10 -y
conda activate dnlp
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y tqdm requests scikit-learn numpy pandas matplotlib seaborn scipy -c conda-forge
conda install -y protobuf transformers tokenizers pyarrow=20.0.0 spacy -c conda-forge
conda install -y sentence-transformers=5.0.0 -c conda-forge
pip install explainaboard-client==0.1.4 sacrebleu==2.5.1 optuna==3.6.0 importlib_metadata

3. Train BERT (Multitask)

python multitask_classifier.py --option finetune --task=[sst, sts, qqp, etpc] --epochs=15 --use_gpu

4. Train BART

# Paraphrase Type Generation
python bart_generation.py --use_gpu

# Paraphrase Type Detection
python bart_detection.py --use_gpu

# Optional: T5 model
python bart_generation.py --model=T5 --use_gpu

## 🔬 Methodology

### 1. 📏 Semantic Textual Similarity (STS)

A Siamese architecture encodes both sentences with a single weight-shared model, mapping sentence pairs into a shared embedding space.

Key techniques:

- Triplet Loss with Hard Negative Mining – pulls similar embeddings together and pushes dissimilar ones apart. Hard negatives are mined within each batch for maximum discrimination.
- MSE Regression Loss – directly optimizes prediction of the ground-truth similarity score in the [0, 5] range.
- Hybrid Loss (sketched below): $L = \alpha \cdot L_{\text{triplet}} + (1 - \alpha) \cdot L_{\text{regression}}$, with optimal $\alpha = 0.8$.
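
To make the hybrid objective concrete, here is a minimal PyTorch sketch; the tensor names (`emb_a`, `emb_b`, `gold_scores`), the margin value, and the in-batch mining scheme are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def hybrid_sts_loss(emb_a, emb_b, gold_scores, alpha=0.8, margin=0.5):
    """L = alpha * L_triplet + (1 - alpha) * L_regression."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)

    # Regression branch: cosine similarity rescaled to the [0, 5] label range.
    cos = (a * b).sum(dim=-1)                 # per-pair cosine, in [-1, 1]
    reg_loss = F.mse_loss((cos + 1.0) * 2.5, gold_scores)

    # Triplet branch with in-batch hard negative mining: for each anchor,
    # the hardest negative is the most similar second sentence of *another* pair.
    sim = a @ b.t()                           # (B, B) cosine matrix
    pos = sim.diagonal()                      # similarity of the true pairs
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard_neg = sim.masked_fill(eye, float("-inf")).max(dim=1).values
    triplet_loss = F.relu(margin - pos + hard_neg).mean()

    return alpha * triplet_loss + (1.0 - alpha) * reg_loss
```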

### 2. 💬 Sentiment Analysis (SST)

The vanilla CLS-only baseline suffered from limited expressiveness and overfitting on this small dataset. Three improvements were introduced (see the sketch below):

| Improvement | Description |
| --- | --- |
| Advanced Pooling | Concatenates the [CLS] embedding with the mean of all last-layer hidden states |
| Layer Normalization | Stabilizes the concatenated representation before the classifier |
| Strong Dropout (p=0.5) | Acts as ensemble regularization, critical for SST's limited size |
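
A minimal sketch of such a classifier head; the hidden size of 768 assumes a bert-base encoder, and `SentimentHead` and `num_classes` are placeholder names, not the repository's actual module.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """[CLS] + mean pooling -> LayerNorm -> strong dropout -> classifier."""
    def __init__(self, hidden=768, num_classes=5, p_drop=0.5):
        super().__init__()
        self.norm = nn.LayerNorm(2 * hidden)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, last_hidden_state, attention_mask):
        cls = last_hidden_state[:, 0]                        # [CLS] embedding
        mask = attention_mask.unsqueeze(-1).float()          # ignore padding tokens
        mean = (last_hidden_state * mask).sum(1) / mask.sum(1).clamp_min(1e-9)
        pooled = self.norm(torch.cat([cls, mean], dim=-1))   # concat + LayerNorm
        return self.out(self.drop(pooled))
```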

3. πŸ” Paraphrase Detection (QQP)

Five targeted improvements over the baseline:

# Method Purpose
1 Advanced Feature Engineering Element-wise difference, product, cosine similarity between embeddings
2 Custom Paraphrase Head Learns nonlinear decision boundaries from interaction features
3 Dropout Regularization Prevents over-reliance on any single interaction type
4 Focal Loss $FL(p_t) = -\alpha(1-p_t)^\gamma \log(p_t)$ β€” addresses class imbalance
5 Adversarial Training (FGM) Adds worst-case embedding perturbations to boost robustness
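
Focal loss (Lin et al., 2017; see References) and FGM are standard techniques; the sketch below shows common implementations, assuming paraphrase detection is trained as a binary task on logits. The `word_embeddings` substring matches BERT's usual embedding parameter name, but is an assumption here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), binary case."""
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    p_t = torch.exp(-ce)                      # model's probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

class FGM:
    """Fast Gradient Method: one adversarial step on the embedding weights."""
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.eps, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = p.grad.norm()
                if norm != 0:
                    p.data.add_(self.eps * p.grad / norm)   # move along the gradient

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# Typical training step: loss.backward(); fgm.attack();
# focal_loss(model(batch), y).backward(); fgm.restore(); optimizer.step()
```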

### 4. 🏷️ Paraphrase Type Detection (PTD) – BART

Multi-label classification across 26 paraphrase categories on the ETPC dataset, evaluated with the Matthews Correlation Coefficient (MCC).

Improvements (the sampler, head, and threshold tuning are sketched after this list):

- Class-wise Threshold Tuning – each of the 26 classes gets its own optimal decision threshold, tuned on the dev set
- Weighted Random Sampler – rare paraphrase types get higher sampling weight: $w_i = \frac{1}{\text{count}(i) + \epsilon}$
- L2 Regularization – weight decay in AdamW: $\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \|\theta\|_2^2$
- Custom Classifier Head – Linear → SiLU → BatchNorm → Dropout → Output Linear
- BCEWithLogitsLoss – handles multi-label prediction with per-class binary cross-entropy
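
A sketch of the three pieces above. The multi-label weighting rule (weighting each example by its rarest positive label), the hidden size of 1024 (bart-large), and all function names are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import matthews_corrcoef
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels, eps=1e-6):
    """labels: (N, 26) binary matrix. w_i = 1 / (count(i) + eps) per class;
    each example is weighted by its rarest positive label (an assumption)."""
    class_w = 1.0 / (labels.sum(axis=0) + eps)
    example_w = (labels * class_w).max(axis=1)
    return WeightedRandomSampler(torch.as_tensor(example_w, dtype=torch.double),
                                 num_samples=len(example_w))

# Custom classifier head: Linear -> SiLU -> BatchNorm -> Dropout -> Output Linear.
head = nn.Sequential(
    nn.Linear(1024, 256),   # 1024 = bart-large hidden size (assumed)
    nn.SiLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.1),
    nn.Linear(256, 26),     # one logit per paraphrase type
)

def tune_thresholds(dev_probs, dev_labels, grid=np.linspace(0.1, 0.9, 17)):
    """Pick the MCC-maximizing cutoff per class on the dev set."""
    return np.array([
        max(grid, key=lambda t: matthews_corrcoef(
            dev_labels[:, c], (dev_probs[:, c] >= t).astype(int)))
        for c in range(dev_probs.shape[1])
    ])
```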

### 5. ✍️ Paraphrase Type Generation (PTG) – BART

A controlled generation framework trained with a multi-objective loss:

$$L_{\text{total}} = L_{\text{CE}} + \alpha_{\text{sem}} \cdot L_{\text{sem}} + \alpha_{\text{lex}} \cdot L_{\text{lex}} + \alpha_{\text{syn}} \cdot L_{\text{syn}}$$

Key contributions (see the generation sketch after this list):

- Stochastic Sampling – `top_p` and `temperature` encourage diverse, non-repetitive generation
- Multi-Objective Loss – simultaneously optimizes semantic similarity, lexical variation, and syntactic variation
- Bayesian Hyperparameter Optimization (Optuna) – intelligently searches learning rate, weight decay, loss weights, and generation parameters
- BART vs. T5 Comparison – T5 starts strong thanks to its instruction-following pre-training, but BART learns to de-copy and generate novel paraphrases over the epochs
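
The sampling knobs map directly onto Hugging Face's `generate` API. A minimal sketch follows; the checkpoint name, example sentence, and parameter values are placeholders drawn from the search space listed under Hyperparameters, not the project's exact configuration.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

batch = tok("The committee approved the proposal.", return_tensors="pt")
out = model.generate(
    **batch,
    do_sample=True,            # stochastic sampling instead of greedy/beam search
    top_p=0.9,                 # nucleus sampling (searched over 0.80 -> 0.95)
    temperature=0.8,           # searched over 0.5 -> 1.0
    repetition_penalty=2.0,    # discourages copying; searched over 1.5 -> 3.0
    max_new_tokens=64,
)
print(tok.decode(out[0], skip_special_tokens=True))
```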

## 📊 Results

### BERT Tasks

| Task | Model | Metric | Score |
| --- | --- | --- | --- |
| SST | Baseline | Accuracy | 52.20% |
| SST | Advanced Pooling + Regularization | Accuracy | 55.20% |
| QQP | Baseline | Accuracy | 75.00% |
| QQP | Feature Eng. + Focal Loss + FGM | Accuracy | 85.70% |
| STS | Baseline (MSE) | Pearson r | 0.352 |
| STS | Simple Regression + Regularization | Pearson r | 0.375 |
| STS | Siamese BERT + Contrastive Learning | Pearson r | 0.672 |

### BART Tasks

#### Paraphrase Type Detection (PTD)

| Configuration | Accuracy | MCC |
| --- | --- | --- |
| Baseline | 87.50% | 0.180 |
| Threshold Tuning | 87.50% | 0.206 |
| WeightedSampler + BCE + L2 | 78.00% | 0.318 |
| Custom Classifier Head | 80.00% | 0.400 |
| Optuna + Layer Freezing | 56.50% | 0.350 |

#### Paraphrase Type Generation (PTG)

| Configuration | BLEU | Negative BLEU | Penalized BLEU |
| --- | --- | --- | --- |
| Baseline | 48.44 | 2.84 | 2.64 |
| Stochastic Sampling | 45.22 | 20.93 | 18.20 |
| Multi-Objective Loss | 44.08 | 22.61 | 19.16 |
| Optimized Controlled Generation | 42.02 | 29.79 | 24.08 |

#### Extended Typology Paraphrase Corpus (ETPC) – Bonus

| Configuration | Accuracy | Micro F1 | Macro F1 | Macro MCC |
| --- | --- | --- | --- | --- |
| Baseline | 0.00% | 0.000 | 0.000 | 0.000 |
| Multi-label + Concatenated Inputs | 85.60% | 0.674 | 0.147 | 0.162 |

## 📈 Visualizations

### PTD – Training & Validation Curve

*Figure: PTD training and validation curves.*

Training loss decreases consistently, while dev loss flattens around epoch 8, where Macro MCC peaks, confirming the value of early stopping for model selection.
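
The model selection implied here is ordinary patience-based early stopping on dev Macro MCC; a generic sketch follows, where the training and evaluation helpers are placeholders rather than the repository's API.

```python
import torch

def fit_with_early_stopping(model, train_loader, dev_loader,
                            train_one_epoch, evaluate_macro_mcc,
                            num_epochs=12, patience=3, ckpt="best_ptd.pt"):
    """Stop when dev Macro MCC has not improved for `patience` epochs."""
    best_mcc, bad_epochs = float("-inf"), 0
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader)           # placeholder training step
        mcc = evaluate_macro_mcc(model, dev_loader)    # placeholder dev evaluation
        if mcc > best_mcc:
            best_mcc, bad_epochs = mcc, 0
            torch.save(model.state_dict(), ckpt)       # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                                  # dev MCC stopped improving
    return best_mcc
```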

### PTG – Stochastic Sampling vs. Controlled Generation

*Figure 1: Stochastic Sampling approach.*

*Figure 2: Optimized Controlled Generation.*

The controlled generation model doesn't just generate diverse text: it purposefully balances semantic, lexical, and syntactic objectives through the multi-objective loss, making it a superior solution to pure stochastic sampling.


## 🔧 Hyperparameters

### Multitask BERT (QQP, STS, SST, ETPC)

| Parameter | Value |
| --- | --- |
| Mode | finetune |
| Epochs | 15 |
| Learning Rate | 1e-05 |
| Weight Decay (L2) | 1e-2 |
| Dropout | 0.3 |
| Batch Size | 16 |
| Optimizer | AdamW |

### PTD (BART Detection)

| Parameter | Value |
| --- | --- |
| Epochs | 12 |
| Learning Rate | 5e-5 |
| Weight Decay | 1e-2 |
| Dropout | 0.1 |
| Patience | 3 |
| Batch Size | 16 |
### PTG (BART Generation – Bayesian Search Space)

| Parameter | Range |
| --- | --- |
| Learning Rate | 1e-5 → 1e-4 |
| Weight Decay | 1e-5 → 1e-2 |
| Epochs | 10 → 15 |
| Alpha (sem/lex/syn) | 0.1 → 1.0 each |
| Temperature | 0.5 → 1.0 |
| Top-p | 0.80 → 0.95 |
| Repetition Penalty | 1.5 → 3.0 |
| Patience | 4 → 7 |
| Batch Size | 8 (fixed) |
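
A sketch of how this search space could translate into an Optuna objective; `train_and_eval` is a hypothetical stand-in for the project's training routine, and maximizing Penalized BLEU is an assumption based on the Results tables.

```python
import optuna

def objective(trial):
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-4, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-5, 1e-2, log=True),
        "epochs": trial.suggest_int("epochs", 10, 15),
        "alpha_sem": trial.suggest_float("alpha_sem", 0.1, 1.0),
        "alpha_lex": trial.suggest_float("alpha_lex", 0.1, 1.0),
        "alpha_syn": trial.suggest_float("alpha_syn", 0.1, 1.0),
        "temperature": trial.suggest_float("temperature", 0.5, 1.0),
        "top_p": trial.suggest_float("top_p", 0.80, 0.95),
        "repetition_penalty": trial.suggest_float("repetition_penalty", 1.5, 3.0),
        "patience": trial.suggest_int("patience", 4, 7),
    }
    return train_and_eval(**params)   # hypothetical: returns Penalized BLEU

study = optuna.create_study(direction="maximize")   # TPE sampler by default
study.optimize(objective, n_trials=30)
print(study.best_params)
```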

## 👥 Members Contribution

### ⭐ Lead Contributor

Shakhzod Bakhodirov · @WeskerPRO · Matriculation: 18749742

Responsible for the three most technically demanding tasks in the project: he designed and implemented the full controlled-generation pipeline for PTG (multi-objective loss and Bayesian optimization), built the Siamese BERT architecture with contrastive learning for STS, and resolved the critical preprocessing failure in ETPC to deliver a working multi-label classifier.

| Task | Contribution |
| --- | --- |
| 🥇 Paraphrase Type Generation (PTG) | Multi-objective loss · Stochastic sampling · Bayesian optimization · BART vs. T5 comparison |
| 🥇 Semantic Textual Similarity (STS) | Siamese BERT · Triplet loss · Hard negative mining · Hybrid loss ($\alpha$ analysis) |
| 🥇 ETPC Bonus Task | Concatenated input design · Multi-label classification · Full pipeline fix |

## 🤖 AI Usage

AI support is documented in our AI Usage Card.


## 📚 References

- Devlin et al. (2018) – BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Liu et al. (2019) – RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Reimers & Gurevych (2019) – Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Kumar et al. (2021) – Controlled Text Generation as Continuous Optimization with Multiple Constraints
- Lin et al. (2017) – Focal Loss for Dense Object Detection (RetinaNet)
- Mushava & Murray (2022) – Flexible Loss Functions for Binary Classification
- Loshchilov & Hutter (2017) – Decoupled Weight Decay Regularization
- Ioffe & Szegedy (2015) – Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Srivastava et al. (2014) – Dropout: A Simple Way to Prevent Neural Networks from Overfitting
- Müller, Kornblith & Hinton (2019) – When Does Label Smoothing Help?
- Ramachandran, Zoph & Le (2017) – Searching for Activation Functions (SiLU)
