- Bag-of-Words (BoW) feature representation
- Multinomial Naive Bayes
- Laplace smoothing (tested across different smoothing factors)
- Processed Text + Naive Bayes (smoothing factor comparison)
- Raw Text + Naive Bayes (smoothing factor comparison)
- Removing Rare Words (frequency threshold < 5 removed)
- Impact of Sentiment on Review Lengths (distribution analysis)
Preprocessing applied: tokenization + stopword removal.
Smoothing factors evaluated: 0.1, 0.3, 0.5, 1.0, 1.5, 1.8, 1.9, 2.0.
Best result: smoothing factor 1.8
- Precision 0.85592
- Recall 0.84009
- F1 0.84793
π Figure: Impact of smoothing factors on precision/recall/F1
No preprocessing (raw reviews).
Smoothing factors evaluated: 0.1 β 3.0.
Best result: smoothing factor 0.5
- Precision 0.88033
- Recall 0.84914
- F1 0.86445
π Figure: Smoothing factors vs precision/recall/F1 using raw text
Rare words appearing < 5 times are removed to reduce vocabulary size and noise.
- Rare words identified: 70,178
- Metrics after retraining:
- Precision 0.8534
- Recall 0.8425
- F1 0.8479
Comparison (Cleaned Rare Words vs Processed Text):
| Metric | Cleaned Rare Words | Processed Text |
|---|---|---|
| Precision | 0.8534 | 0.8553 |
| Recall | 0.8425 | 0.8405 |
| F1 Score | 0.8479 | 0.8478 |
π Figure: Cleaned rare words vs processed text
This experiment analyzes the relationship between review length and sentiment label.
Observation:
- Negative reviews are more concentrated around shorter lengths.
- Positive reviews show a slightly broader spread.
- There is no clear length threshold that cleanly separates positive vs negative, but length may still help as a feature in ambiguous cases.
π Figure: Histogram of review lengths by sentiment
- Raw text Naive Bayes performed better than processed text in terms of F1 score, suggesting raw text retains helpful context.
- Removing rare words provides similar overall performance while reducing complexity and potentially improving efficiency.
- Review length alone is not a reliable sentiment indicator, but trends suggest it could be a useful auxiliary feature.



