lumenworksco/NLP

Multilingual Sentiment Analysis: Cross-Lingual Transfer with BERT

Binary sentiment classification (positive/negative) across English, French, and Dutch, comparing monolingual BERT specialists against multilingual mBERT for cross-lingual transfer learning.

Research Questions

  1. How well does mBERT transfer sentiment knowledge across languages in a zero-shot setting?
  2. Does training on multiple languages improve over single-language fine-tuning?
  3. How do language-specific models compare to mBERT on in-language evaluation?

Datasets

| Language | Dataset | Source | Train | Val | Test |
|----------|---------|--------|------:|----:|-----:|
| English | IMDB Movie Reviews | `stanfordnlp/imdb` | 22,500 | 2,500 | 25,000 |
| French | Allocine Movie Reviews | `tblard/allocine` | 25,000 | 20,000 | 20,000 |
| Dutch | DBRD Book Reviews | `benjaminvdb/dbrd` | 18,028 | 2,000 | 2,224 |

All datasets are balanced (~50% positive / 50% negative).
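As a sanity check after loading, the class balance of a split can be verified with a small helper. This is an illustrative sketch, not code from the repo: the `label_balance` function and the toy label list are assumptions, using the common Hugging Face convention of `0` = negative, `1` = positive.

```python
from collections import Counter

def label_balance(labels):
    """Return the fraction of each label in a list of 0/1 sentiment labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy stand-in for a loaded split (0 = negative, 1 = positive)
sample = [0, 1, 1, 0, 1, 0, 0, 1]
print(label_balance(sample))  # {0: 0.5, 1: 0.5}
```

On the real splits above, both fractions should come out near 0.5.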

Models

Monolingual baselines:

  • bert-base-uncased (English)
  • GroNLP/bert-base-dutch-cased / BERTje (Dutch)
  • almanach/camembert-base / CamemBERT (French)

Multilingual experiments (mBERT = bert-base-multilingual-cased):

  • mBERT trained on English only (zero-shot transfer test)
  • mBERT trained on French only
  • mBERT trained on Dutch only
  • mBERT trained on all three languages combined
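One way to organize these seven runs is a registry mapping each experiment to its base checkpoint and fine-tuning languages. The checkpoint IDs come from the lists above; the experiment names and dict layout are hypothetical, not taken from `src/config.py`.

```python
# Hypothetical experiment registry (checkpoint names from the model lists above).
EXPERIMENTS = {
    "bert-en":      {"checkpoint": "bert-base-uncased",            "train_langs": ["en"]},
    "camembert-fr": {"checkpoint": "almanach/camembert-base",      "train_langs": ["fr"]},
    "bertje-nl":    {"checkpoint": "GroNLP/bert-base-dutch-cased", "train_langs": ["nl"]},
    "mbert-en":     {"checkpoint": "bert-base-multilingual-cased", "train_langs": ["en"]},
    "mbert-fr":     {"checkpoint": "bert-base-multilingual-cased", "train_langs": ["fr"]},
    "mbert-nl":     {"checkpoint": "bert-base-multilingual-cased", "train_langs": ["nl"]},
    "mbert-all":    {"checkpoint": "bert-base-multilingual-cased", "train_langs": ["en", "fr", "nl"]},
}

# The zero-shot transfer candidates are the mBERT runs, each evaluated
# on test sets outside its training languages (e.g. mbert-en on French/Dutch).
zero_shot = [name for name, cfg in EXPERIMENTS.items()
             if cfg["checkpoint"] == "bert-base-multilingual-cased"]
print(zero_shot)  # ['mbert-en', 'mbert-fr', 'mbert-nl', 'mbert-all']
```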

Project Structure

```
.
├── 01_data_exploration.ipynb        # Load, explore, preprocess datasets
├── 02_monolingual_finetuning.ipynb  # Train monolingual baselines
├── 03_multilingual_finetuning.ipynb # Train mBERT variants
├── 04_evaluation_comparison.ipynb   # Evaluate all models, generate analysis
├── src/
│   ├── __init__.py
│   ├── config.py                    # Centralized hyperparameters and paths
│   ├── utils.py                     # Shared training/evaluation utilities
│   └── predict.py                   # Inference on new text
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_utils.py
│   └── test_predict.py
├── data/                            # Preprocessed datasets (generated by notebook 01)
│   └── en/ fr/ nl/ combined/
├── models/                          # Saved model checkpoints (generated by notebooks 02-03)
├── results/                         # Evaluation outputs and figures
│   └── figures/
├── requirements.txt
├── pyproject.toml
├── Makefile
└── .gitignore
```

Setup

```bash
# Clone and install
git clone <repo-url>
cd NLP
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

Usage

Notebook workflow (recommended for exploration)

Run the notebooks in order:

```bash
jupyter notebook
```

  1. 01_data_exploration.ipynb — downloads and preprocesses data into data/
  2. 02_monolingual_finetuning.ipynb — trains monolingual baselines into models/
  3. 03_multilingual_finetuning.ipynb — trains mBERT variants into models/
  4. 04_evaluation_comparison.ipynb — evaluates all models, generates figures

CLI inference

After training, classify new text:

```python
from src.predict import SentimentPredictor

predictor = SentimentPredictor("models/mbert-multilingual")
result = predictor.predict("This movie was absolutely fantastic!")
print(result)  # {'label': 'Positive', 'score': 0.98}
```
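Internally, a predictor like this typically softmaxes the model's raw logits and picks the top label. A minimal sketch of that postprocessing step (the label order, rounding, and function name are assumptions for illustration, not taken from `src/predict.py`):

```python
import math

LABELS = ["Negative", "Positive"]  # assumed id-to-label order

def postprocess(logits):
    """Softmax over two raw logits, then return the top label and its score."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": round(probs[best], 2)}

print(postprocess([-1.9, 2.0]))  # {'label': 'Positive', 'score': 0.98}
```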

Makefile targets

```bash
make data        # Run notebook 01 (data preprocessing)
make train       # Run notebooks 02 and 03 (training)
make evaluate    # Run notebook 04 (evaluation)
make all         # Run full pipeline
make test        # Run unit tests
make clean       # Remove model checkpoints and results
```

Training Configuration

| Parameter | Value |
|-----------|-------|
| Max sequence length | 256 tokens |
| Batch size | 8 (×4 gradient accumulation) |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Warmup | 10% of steps |
| Scheduler | Linear decay |
| Metric for best model | F1 |
| FP16 | Enabled when CUDA available |
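The schedule quantities implied by this table can be derived with a little arithmetic. The numbers below use the English training split as an illustrative dataset size; step counts for the other splits follow the same formula.

```python
# Derived quantities from the configuration above (dataset size illustrative).
batch_size = 8
grad_accum = 4
epochs = 3
train_examples = 22_500        # e.g. the English training split
warmup_fraction = 0.10

effective_batch = batch_size * grad_accum            # 8 x 4 = 32
steps_per_epoch = train_examples // effective_batch  # 22,500 // 32 = 703
total_steps = steps_per_epoch * epochs               # 703 * 3 = 2109
warmup_steps = int(total_steps * warmup_fraction)    # 10% of steps = 210

print(effective_batch, total_steps, warmup_steps)  # 32 2109 210
```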

License

This project is for research and educational purposes. The datasets are used under their respective licenses:

  • IMDB: for non-commercial research
  • Allocine: MIT License
  • DBRD: CC BY-NC-SA 4.0
