
SentimentAnalysis

Persian sentiment classification pipeline using a BERT backbone with optional LoRA fine-tuning.

This repository provides an end-to-end workflow:

  • Data preprocessing from raw Excel
  • Optional text augmentation
  • Model training (with class imbalance handling + early stopping)
  • Evaluation with JSON report and confusion matrix
  • Simple prediction demo

Project Structure

SentimentAnalysis/
├── data/
│   ├── raw/
│   ├── processed/
│   └── augmented/
├── outputs/
│   ├── checkpoints/
│   ├── best_model/
│   └── reports/
├── src/
│   ├── config.py
│   ├── preprocess.py
│   ├── augment_data.py
│   ├── model_loader.py
│   ├── train_lora.py
│   ├── eval.py
│   ├── predict.py
│   └── temp.py
└── requirements-win.txt

Requirements

  • Python 3.10+ (recommended)
  • pip

Dependencies are listed in requirements-win.txt:

  • transformers, peft
  • pandas, openpyxl
  • scikit-learn
  • hazm
  • matplotlib

Installation

From project root:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements-win.txt

For Windows PowerShell activation:

.venv\Scripts\Activate.ps1

Data Format

Place your raw dataset in:

data/raw/Sentiment_Analysis_data.xlsx

The current pipeline makes the following assumptions about the input:

  • The Excel file has at least two columns.
  • The first selected column is treated as the text (comment).
  • The second selected column is treated as the label (label).
  • Labels are numeric class IDs and are used as-is.
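The column-selection assumptions above can be sketched with pandas. This is an illustrative helper, not the actual code in src/preprocess.py; the column names "comment" and "label" follow the convention stated above:

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the pipeline's input assumptions: keep only the first two
    columns, treating the first as text and the second as the label."""
    out = df.iloc[:, :2].copy()
    out.columns = ["comment", "label"]
    return out

# Reading the raw file would then look like (requires openpyxl):
# df = normalize_columns(pd.read_excel("data/raw/Sentiment_Analysis_data.xlsx"))
```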

Default class mapping used in model config:

  • 0 -> Negative
  • 1 -> Positive
  • 2 -> Neutral
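In Hugging Face terms, this mapping typically becomes the model config's id2label and label2id dictionaries. A minimal sketch of the mapping as stated above:

```python
# Class-ID mapping as described in the default model config.
ID2LABEL = {0: "Negative", 1: "Positive", 2: "Neutral"}

# Inverse mapping, useful when encoding string labels back to IDs.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```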

Configuration

Central configuration lives in src/config.py:

  • Paths: all input/output directories and artifacts
  • TrainConfig: model name, batch settings, LoRA, loss settings, and training hyperparameters

Edit these dataclasses to customize training behavior.
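As an illustration of the dataclass-based layout, a sketch is shown below. Field names and default values here are assumptions for demonstration only; the authoritative definitions live in src/config.py:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Paths:
    # Illustrative subset of the path settings.
    raw_xlsx: Path = Path("data/raw/Sentiment_Analysis_data.xlsx")
    processed_dir: Path = Path("data/processed")
    best_model_dir: Path = Path("outputs/best_model")

@dataclass
class TrainConfig:
    # Hypothetical hyperparameter fields, not the real ones.
    model_name: str = "HooshvareLab/bert-fa-base-uncased"  # assumed backbone
    batch_size: int = 16
    lr: float = 2e-5
    use_lora: bool = True
```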

Run Pipeline

Run commands from the repository root.

1) Preprocess raw data

python src/preprocess.py

Output files:

  • data/processed/train_orig.csv
  • data/processed/val_orig.csv
  • data/processed/test_orig.csv
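The three-way split produced above can be done with scikit-learn's stratified splitting. The ratios in this sketch (80/10/10) and the function name are assumptions; the real values are set in src/config.py:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df: pd.DataFrame, seed: int = 42):
    """Stratified 80/10/10 train/val/test split (ratios illustrative).
    Stratifying on the label keeps class proportions in each split."""
    train, rest = train_test_split(
        df, test_size=0.2, stratify=df["label"], random_state=seed)
    val, test = train_test_split(
        rest, test_size=0.5, stratify=rest["label"], random_state=seed)
    return train, val, test
```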

2) (Optional) Generate augmented training data

python src/augment_data.py

Output file:

  • data/augmented/train_aug.csv

If train_aug.csv exists, it is automatically appended to training data in train_lora.py.

3) Train model

python src/train_lora.py

Training artifacts:

  • Epoch checkpoints in outputs/checkpoints/
  • Exported merged best model in outputs/best_model/
  • Validation summary in outputs/reports/val_best.json
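The class-imbalance handling mentioned earlier is commonly implemented by weighting the loss inversely to class frequency. This sketch shows one such scheme; the actual loss settings live in TrainConfig and src/train_lora.py:

```python
from collections import Counter

def class_weights(labels, num_classes: int = 3):
    """Inverse-frequency class weights, normalized so a perfectly
    balanced dataset yields a weight of 1.0 for every class."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]
```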

4) Evaluate on test split

python src/eval.py

Evaluation outputs:

  • outputs/reports/test_report.json
  • outputs/reports/confusion_matrix.png

5) Run sample predictions

python src/predict.py

Script Summary

  • src/preprocess.py: Cleans text and creates train/val/test split.
  • src/augment_data.py: Creates augmentation candidates with masked language modeling.
  • src/model_loader.py: Loads tokenizer/model and applies LoRA adapter configuration.
  • src/train_lora.py: Trains classifier, validates each epoch, saves best model.
  • src/eval.py: Generates classification metrics + confusion matrix.
  • src/predict.py: Simple prediction demo on sample Persian texts.
  • src/temp.py: Local utility script for quick data inspection.

Notes

  • Device logic in training prefers Intel XPU when available, then DirectML fallback, then CPU.
  • Mixed precision (bf16) is enabled only on the supported XPU execution path.
  • The project expects Hugging Face model downloads on first run (internet required).
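The device-preference order described above (XPU, then DirectML, then CPU) can be sketched with import guards so the code degrades gracefully when a backend is absent. Function and return values here are illustrative, not the exact logic in src/train_lora.py:

```python
def pick_device():
    """Prefer Intel XPU, then DirectML, then CPU. Import guards keep
    this runnable even when neither GPU backend is installed."""
    try:
        import torch
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return "xpu"
    except ImportError:
        pass
    try:
        import torch_directml  # DirectML plugin for Windows GPUs
        return torch_directml.device()
    except ImportError:
        pass
    return "cpu"
```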

Troubleshooting

  • If model download fails, verify internet access and Hugging Face availability.
  • If openpyxl errors occur, reinstall requirements and verify Excel file path.
  • If hazm POS model is missing, augmentation still runs with fallback masking strategy.
  • If outputs/best_model/ does not exist, complete training before running eval.py or predict.py.

License

No license file is currently included in this repository.
