A Persian sentiment classification pipeline using a BERT backbone with optional LoRA fine-tuning.
This repository provides an end-to-end workflow:
- Data preprocessing from raw Excel
- Optional text augmentation
- Model training (with class imbalance handling + early stopping)
- Evaluation with JSON report and confusion matrix
- Simple prediction demo
```
SentimentAnalysis/
├── data/
│   ├── raw/
│   ├── processed/
│   └── augmented/
├── outputs/
│   ├── checkpoints/
│   ├── best_model/
│   └── reports/
├── src/
│   ├── config.py
│   ├── preprocess.py
│   ├── augment_data.py
│   ├── model_loader.py
│   ├── train_lora.py
│   ├── eval.py
│   ├── predict.py
│   └── temp.py
└── requirements-win.txt
```
- Python 3.10+ (recommended)
- pip
Dependencies are listed in `requirements-win.txt`:
- transformers, peft
- pandas, openpyxl
- scikit-learn
- hazm
- matplotlib
From the project root:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-win.txt
```

For Windows PowerShell activation:

```powershell
.venv\Scripts\Activate.ps1
```

Place your raw dataset in:

```
data/raw/Sentiment_Analysis_data.xlsx
```
Expected input assumptions in the current pipeline:
- The Excel file has at least two columns.
- The first selected column is treated as the text (`comment`).
- The second selected column is treated as the label (`label`).
- Labels are numeric class IDs and are used as-is.
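As an illustration only (the function name below is hypothetical, not the actual `preprocess.py` API), those assumptions translate into pandas loading code roughly like this:

```python
import pandas as pd

def select_text_label(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first two columns and name them comment/label.

    Illustrative helper, not the project's real API: the pipeline
    takes the first two Excel columns as text and numeric label.
    """
    out = df.iloc[:, :2].copy()
    out.columns = ["comment", "label"]
    out["label"] = out["label"].astype(int)  # labels are numeric class IDs, used as-is
    return out

# Reading the raw file requires openpyxl:
# df = select_text_label(pd.read_excel("data/raw/Sentiment_Analysis_data.xlsx"))
```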
Default class mapping used in model config:
```
0 -> Negative
1 -> Positive
2 -> Neutral
```
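This mapping can be written down directly, e.g. for populating a model's `id2label`/`label2id` config; a minimal sketch:

```python
# Default class mapping from the model config.
ID2LABEL = {0: "Negative", 1: "Positive", 2: "Neutral"}
# Inverse mapping, handy for converting label names back to class IDs.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```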
Central configuration lives in `src/config.py`:

- `Paths`: all input/output directories and artifacts
- `TrainConfig`: model name, batch settings, LoRA, loss settings, and training hyperparameters
Edit these dataclasses to customize training behavior.
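For orientation only, the two dataclasses might look roughly like the sketch below; the field names and default values here are assumptions, not the exact contents of `src/config.py` (in particular, the model name is a placeholder, not necessarily the project's backbone):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Paths:
    # Illustrative fields; check src/config.py for the real ones.
    raw_excel: Path = Path("data/raw/Sentiment_Analysis_data.xlsx")
    processed_dir: Path = Path("data/processed")
    best_model_dir: Path = Path("outputs/best_model")
    reports_dir: Path = Path("outputs/reports")

@dataclass
class TrainConfig:
    model_name: str = "bert-base-multilingual-cased"  # placeholder backbone name
    batch_size: int = 16
    learning_rate: float = 2e-5
    num_epochs: int = 5
    use_lora: bool = True   # toggle LoRA fine-tuning
    lora_r: int = 8         # LoRA rank
```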
Run commands from the repository root.
```bash
python src/preprocess.py
```

Output files:

```
data/processed/train_orig.csv
data/processed/val_orig.csv
data/processed/test_orig.csv
```
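For illustration, a seeded three-way split can be sketched with the standard library (the actual `preprocess.py` may instead use scikit-learn's `train_test_split`, possibly with stratification; fractions here are assumptions):

```python
import random

def three_way_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then carve off test and val slices.

    Illustrative only: split fractions and seed are assumed values.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    # Remaining rows after test and val become the training set.
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]
```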
```bash
python src/augment_data.py
```

Output file:

```
data/augmented/train_aug.csv
```
If `data/augmented/train_aug.csv` exists, it is automatically appended to the training data in `train_lora.py`.
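That append-if-present logic can be sketched as follows (the helper name is hypothetical, not the actual `train_lora.py` API):

```python
from pathlib import Path

import pandas as pd

def load_training_frame(processed_dir: Path, augmented_dir: Path) -> pd.DataFrame:
    """Load train_orig.csv and append train_aug.csv when it exists.

    Illustrative sketch of the behavior described in the README.
    """
    train = pd.read_csv(processed_dir / "train_orig.csv")
    aug_path = augmented_dir / "train_aug.csv"
    if aug_path.exists():
        # Augmented rows are simply concatenated onto the original training set.
        train = pd.concat([train, pd.read_csv(aug_path)], ignore_index=True)
    return train
```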
```bash
python src/train_lora.py
```

Training artifacts:

- Epoch checkpoints in `outputs/checkpoints/`
- Exported merged best model in `outputs/best_model/`
- Validation summary in `outputs/reports/val_best.json`
```bash
python src/eval.py
```

Evaluation outputs:

```
outputs/reports/test_report.json
outputs/reports/confusion_matrix.png
```
```bash
python src/predict.py
```

- `src/preprocess.py`: Cleans text and creates the train/val/test split.
- `src/augment_data.py`: Creates augmentation candidates with masked language modeling.
- `src/model_loader.py`: Loads the tokenizer/model and applies the LoRA adapter configuration.
- `src/train_lora.py`: Trains the classifier, validates each epoch, and saves the best model.
- `src/eval.py`: Generates classification metrics and a confusion matrix.
- `src/predict.py`: Simple prediction demo on sample Persian texts.
- `src/temp.py`: Local utility script for quick data inspection.
- Device logic in training prefers Intel XPU when available, then a DirectML fallback, then CPU.
- Mixed precision (`bf16`) is enabled only for the supported XPU execution path.
- The project expects Hugging Face model downloads on first run (internet required).
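The device-preference order can be sketched as below; this is a simplified stand-in for the training script's logic, not its actual code (`torch_directml` is the optional DirectML backend package):

```python
def pick_device() -> str:
    """Return a device string following the preference order XPU > DirectML > CPU.

    Illustrative sketch only; the training script's real selection logic
    lives in src/train_lora.py.
    """
    try:
        import torch
        # Intel XPU support ships with recent PyTorch builds.
        if getattr(torch, "xpu", None) is not None and torch.xpu.is_available():
            return "xpu"
    except Exception:
        pass
    try:
        import torch_directml  # noqa: F401 -- optional Windows DirectML backend
        return "dml"
    except ImportError:
        pass
    return "cpu"  # final fallback
```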
- If the model download fails, verify internet access and Hugging Face availability.
- If `openpyxl` errors occur, reinstall the requirements and verify the Excel file path.
- If the `hazm` POS model is missing, augmentation still runs with a fallback masking strategy.
- If `outputs/best_model/` does not exist, complete training before running `eval.py` or `predict.py`.
No license file is currently included in this repository.