A Persian sentiment classification pipeline using a BERT backbone with optional LoRA fine-tuning.
This repository provides an end-to-end workflow:
- Data preprocessing from raw Excel
- Optional text augmentation
- Model training (with class imbalance handling + early stopping)
- Evaluation with JSON report and confusion matrix
- Simple prediction demo
```
SentimentAnalysis/
├── data/
│   ├── raw/
│   ├── processed/
│   └── augmented/
├── outputs/
│   ├── checkpoints/
│   ├── best_model/
│   └── reports/
├── src/
│   ├── config.py
│   ├── preprocess.py
│   ├── augment_data.py
│   ├── model_loader.py
│   ├── train_lora.py
│   ├── eval.py
│   ├── predict.py
│   └── temp.py
└── requirements-win.txt
```
- Python 3.10+ (recommended)
- pip
Dependencies are listed in `requirements-win.txt`:
- transformers, peft
- pandas, openpyxl
- scikit-learn
- hazm
- matplotlib
From the project root:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-win.txt
```

For Windows PowerShell activation:

```powershell
.venv\Scripts\Activate.ps1
```

Place your raw dataset in:

```
data/raw/Sentiment_Analysis_data.xlsx
```
Expected input assumptions in the current pipeline:
- The Excel file has at least two columns.
- The first selected column is treated as the text (`comment`).
- The second selected column is treated as the label (`label`).
- Labels are numeric class IDs and are used as-is.
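As an illustration only (the function name below is hypothetical, not the actual `preprocess.py` API), those assumptions translate into pandas loading code roughly like this:

```python
import pandas as pd

def select_text_label(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first two columns and name them comment/label.

    Illustrative helper, not the project's real API: the pipeline
    takes the first two Excel columns as text and numeric label.
    """
    out = df.iloc[:, :2].copy()
    out.columns = ["comment", "label"]
    out["label"] = out["label"].astype(int)  # labels are numeric class IDs, used as-is
    return out

# Reading the raw file requires openpyxl:
# df = select_text_label(pd.read_excel("data/raw/Sentiment_Analysis_data.xlsx"))
```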
Default class mapping used in model config:
```
0 -> Negative
1 -> Positive
2 -> Neutral
```
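This mapping can be written down directly, e.g. for populating a model's `id2label`/`label2id` config; a minimal sketch:

```python
# Default class mapping from the model config.
ID2LABEL = {0: "Negative", 1: "Positive", 2: "Neutral"}
# Inverse mapping, handy for converting label names back to class IDs.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```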
Central configuration lives in `src/config.py`:

- `Paths`: all input/output directories and artifacts
- `TrainConfig`: model name, batch settings, LoRA, loss settings, and training hyperparameters
Edit these dataclasses to customize training behavior.
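For orientation only, the two dataclasses might look roughly like the sketch below; the field names and default values here are assumptions, not the exact contents of `src/config.py` (in particular, the model name is a placeholder, not necessarily the project's backbone):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Paths:
    # Illustrative fields; check src/config.py for the real ones.
    raw_excel: Path = Path("data/raw/Sentiment_Analysis_data.xlsx")
    processed_dir: Path = Path("data/processed")
    best_model_dir: Path = Path("outputs/best_model")
    reports_dir: Path = Path("outputs/reports")

@dataclass
class TrainConfig:
    model_name: str = "bert-base-multilingual-cased"  # placeholder backbone name
    batch_size: int = 16
    learning_rate: float = 2e-5
    num_epochs: int = 5
    use_lora: bool = True   # toggle LoRA fine-tuning
    lora_r: int = 8         # LoRA rank
```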
Run commands from the repository root.
```bash
python src/preprocess.py
```

Output files:

```
data/processed/train_orig.csv
data/processed/val_orig.csv
data/processed/test_orig.csv
```
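For illustration, a seeded three-way split can be sketched with the standard library (the actual `preprocess.py` may instead use scikit-learn's `train_test_split`, possibly with stratification; fractions here are assumptions):

```python
import random

def three_way_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then carve off test and val slices.

    Illustrative only: split fractions and seed are assumed values.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    # Remaining rows after test and val become the training set.
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]
```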
```bash
python src/augment_data.py
```

Output file:

```
data/augmented/train_aug.csv
```
If `data/augmented/train_aug.csv` exists, it is automatically appended to the training data in `train_lora.py`.
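That append-if-present logic can be sketched as follows (the helper name is hypothetical, not the actual `train_lora.py` API):

```python
from pathlib import Path

import pandas as pd

def load_training_frame(processed_dir: Path, augmented_dir: Path) -> pd.DataFrame:
    """Load train_orig.csv and append train_aug.csv when it exists.

    Illustrative sketch of the behavior described in the README.
    """
    train = pd.read_csv(processed_dir / "train_orig.csv")
    aug_path = augmented_dir / "train_aug.csv"
    if aug_path.exists():
        # Augmented rows are simply concatenated onto the original training set.
        train = pd.concat([train, pd.read_csv(aug_path)], ignore_index=True)
    return train
```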
```bash
python src/train_lora.py
```

Training artifacts:

- Epoch checkpoints in `outputs/checkpoints/`
- Exported merged best model in `outputs/best_model/`
- Validation summary in `outputs/reports/val_best.json`
```bash
python src/eval.py
```

Evaluation outputs:

```
outputs/reports/test_report.json
outputs/reports/confusion_matrix.png
```
```bash
python src/predict.py
```

- `src/preprocess.py`: Cleans text and creates the train/val/test split.
- `src/augment_data.py`: Creates augmentation candidates with masked language modeling.
- `src/model_loader.py`: Loads the tokenizer/model and applies the LoRA adapter configuration.
- `src/train_lora.py`: Trains the classifier, validates each epoch, and saves the best model.
- `src/eval.py`: Generates classification metrics and a confusion matrix.
- `src/predict.py`: Simple prediction demo on sample Persian texts.
- `src/temp.py`: Local utility script for quick data inspection.
- Device logic in training prefers Intel XPU when available, then a DirectML fallback, then CPU.
- Mixed precision (`bf16`) is enabled only for the supported XPU execution path.
- The project expects Hugging Face model downloads on first run (internet required).
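The device-preference order can be sketched as below; this is a simplified stand-in for the training script's logic, not its actual code (`torch_directml` is the optional DirectML backend package):

```python
def pick_device() -> str:
    """Return a device string following the preference order XPU > DirectML > CPU.

    Illustrative sketch only; the training script's real selection logic
    lives in src/train_lora.py.
    """
    try:
        import torch
        # Intel XPU support ships with recent PyTorch builds.
        if getattr(torch, "xpu", None) is not None and torch.xpu.is_available():
            return "xpu"
    except Exception:
        pass
    try:
        import torch_directml  # noqa: F401 -- optional Windows DirectML backend
        return "dml"
    except ImportError:
        pass
    return "cpu"  # final fallback
```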
- If the model download fails, verify internet access and Hugging Face availability.
- If `openpyxl` errors occur, reinstall the requirements and verify the Excel file path.
- If the `hazm` POS model is missing, augmentation still runs with a fallback masking strategy.
- If `outputs/best_model/` does not exist, complete training before running `eval.py` or `predict.py`.
No license file is currently included in this repository.