Binary sentiment classification (positive/negative) across English, French, and Dutch, comparing monolingual BERT specialists against multilingual mBERT for cross-lingual transfer learning.
- How well does mBERT transfer sentiment knowledge across languages in a zero-shot setting?
- Does training on multiple languages improve over single-language fine-tuning?
- How do language-specific models compare to mBERT on in-language evaluation?
| Language | Dataset | Source | Train | Val | Test |
|---|---|---|---|---|---|
| English | IMDB Movie Reviews | stanfordnlp/imdb | 22,500 | 2,500 | 25,000 |
| French | Allocine Movie Reviews | tblard/allocine | 25,000 | 20,000 | 20,000 |
| Dutch | DBRD Book Reviews | benjaminvdb/dbrd | 18,028 | 2,000 | 2,224 |
All datasets are balanced (~50% positive / 50% negative).
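The ~50/50 split can be sanity-checked from the label column after preprocessing. A minimal sketch, assuming each split is a list of dicts with a binary `label` field (the toy data below is illustrative, not from the actual datasets):

```python
from collections import Counter

def label_balance(examples):
    """Return the fraction of positive (label == 1) examples."""
    counts = Counter(ex["label"] for ex in examples)
    return counts[1] / sum(counts.values())

# Toy stand-in for a preprocessed split
split = [{"label": 1}, {"label": 0}, {"label": 1}, {"label": 0}]
print(label_balance(split))  # 0.5
```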
Monolingual baselines:
- `bert-base-uncased` (English)
- `almanach/camembert-base` / CamemBERT (French)
- `GroNLP/bert-base-dutch-cased` / BERTje (Dutch)
Multilingual experiments (mBERT = bert-base-multilingual-cased):
- mBERT trained on English only (zero-shot transfer test)
- mBERT trained on French only
- mBERT trained on Dutch only
- mBERT trained on all three languages combined
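Crossing the four mBERT runs above with the three test sets gives a train-language × eval-language grid; any cell where the eval language was unseen during training is a zero-shot transfer measurement. A minimal sketch of that grid (the run names are illustrative, not the project's actual config keys):

```python
# Languages seen during training for each mBERT run listed above
runs = {
    "mbert-en": {"en"},
    "mbert-fr": {"fr"},
    "mbert-nl": {"nl"},
    "mbert-all": {"en", "fr", "nl"},
}
eval_langs = ["en", "fr", "nl"]

# A (run, eval_lang) cell is zero-shot if the eval language was not trained on
grid = [(run, lang, lang not in seen)
        for run, seen in runs.items()
        for lang in eval_langs]

zero_shot = [(run, lang) for run, lang, zs in grid if zs]
print(zero_shot)  # 6 zero-shot cells, e.g. mbert-en evaluated on fr and nl
```

The combined run (`mbert-all`) contributes no zero-shot cells, since all three eval languages appear in its training data.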
```
.
├── 01_data_exploration.ipynb        # Load, explore, preprocess datasets
├── 02_monolingual_finetuning.ipynb  # Train monolingual baselines
├── 03_multilingual_finetuning.ipynb # Train mBERT variants
├── 04_evaluation_comparison.ipynb   # Evaluate all models, generate analysis
├── src/
│   ├── __init__.py
│   ├── config.py        # Centralized hyperparameters and paths
│   ├── utils.py         # Shared training/evaluation utilities
│   └── predict.py       # Inference on new text
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_utils.py
│   └── test_predict.py
├── data/                # Preprocessed datasets (generated by notebook 01)
│   ├── en/  fr/  nl/  combined/
├── models/              # Saved model checkpoints (generated by notebooks 02-03)
├── results/             # Evaluation outputs and figures
│   └── figures/
├── requirements.txt
├── pyproject.toml
├── Makefile
└── .gitignore
```
```bash
# Clone and install
git clone <repo-url>
cd NLP
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

Run the notebooks in order:

```bash
jupyter notebook
```

1. `01_data_exploration.ipynb` – downloads and preprocesses data into `data/`
2. `02_monolingual_finetuning.ipynb` – trains monolingual baselines into `models/`
3. `03_multilingual_finetuning.ipynb` – trains mBERT variants into `models/`
4. `04_evaluation_comparison.ipynb` – evaluates all models and generates figures
After training, classify new text:

```python
from src.predict import SentimentPredictor

predictor = SentimentPredictor("models/mbert-multilingual")
result = predictor.predict("This movie was absolutely fantastic!")
print(result)  # {'label': 'Positive', 'score': 0.98}
```

Make targets:

```bash
make data      # Run notebook 01 (data preprocessing)
make train     # Run notebooks 02 and 03 (training)
make evaluate  # Run notebook 04 (evaluation)
make all       # Run full pipeline
make test      # Run unit tests
make clean     # Remove model checkpoints and results
```

| Parameter | Value |
|---|---|
| Max sequence length | 256 tokens |
| Batch size | 8 (×4 gradient accumulation) |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Warmup | 10% of steps |
| Scheduler | Linear decay |
| Metric for best model | F1 |
| FP16 | Enabled when CUDA available |
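The table above maps fairly directly onto Hugging Face `TrainingArguments`; a hedged sketch of that mapping (the output directory and the per-epoch evaluation/save strategies are assumptions not stated in the table, and the argument name `eval_strategy` varies by transformers version, with older releases using `evaluation_strategy`):

```python
import torch
from transformers import TrainingArguments

# Max sequence length (256) is applied at tokenization time, not here
training_args = TrainingArguments(
    output_dir="models/run",           # assumed path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # effective batch size 32
    learning_rate=2e-5,
    num_train_epochs=3,
    warmup_ratio=0.1,                  # 10% of steps
    lr_scheduler_type="linear",
    metric_for_best_model="f1",
    load_best_model_at_end=True,       # assumed; needed for best-model selection
    eval_strategy="epoch",             # assumed
    save_strategy="epoch",             # assumed
    fp16=torch.cuda.is_available(),
)
```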
This project is for research and educational purposes. The datasets are used under their respective licenses:
- IMDB: for non-commercial research
- Allocine: MIT License
- DBRD: CC BY-NC-SA 4.0