This repository contains experiments for binary sentiment analysis on a review dataset using a Bag-of-Words (BoW) representation with a Multinomial Naive Bayes classifier. The goal is to study how text preprocessing choices and Laplace smoothing affect performance, and to analyze the relationship between review length and sentiment.

naveen777-github/Sentimental_Data_Analysis_Using_Multinomial_Naive_Bayes

Sentiment_Data_Analysis_Using_Multinomial_Naive_Bayes

Overview

Techniques Used

  • Bag-of-Words (BoW) feature representation
  • Multinomial Naive Bayes
  • Laplace smoothing (tested across different smoothing factors)
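To make the three techniques concrete, here is a minimal standard-library sketch of a Bag-of-Words Multinomial Naive Bayes classifier with Laplace (additive) smoothing. It is not the repository's code: the class name `MultinomialNB`, the helper `bag_of_words`, the whitespace tokenization, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Map a text to token counts. Whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


class MultinomialNB:
    """Illustrative Multinomial Naive Bayes; alpha is the smoothing factor."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {
            label: sum(counts.values()) for label, counts in self.word_counts.items()
        }
        return self

    def log_posterior(self, text, label):
        """Unnormalized log P(label | text) under the smoothed model."""
        n_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_docs)
        # Laplace smoothing: add alpha to every word count in this class.
        denom = self.total_words[label] + self.alpha * len(self.vocab)
        for word, count in bag_of_words(text).items():
            if word not in self.vocab:
                continue  # words never seen in training are skipped
            numer = self.word_counts[label][word] + self.alpha
            score += count * math.log(numer / denom)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_posterior(text, c))


# Toy data for illustration only (not taken from the review dataset).
train_texts = [
    "great movie loved it",
    "wonderful great acting",
    "terrible boring movie",
    "awful boring plot",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = MultinomialNB(alpha=1.0).fit(train_texts, train_labels)
print(model.predict("great acting"))
print(model.predict("boring plot"))
```

Without smoothing, a word that never occurs in one class would contribute a zero probability and eliminate that class outright; adding `alpha` to every count keeps each log-probability finite.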

Experiments Included

  1. Processed Text + Naive Bayes (smoothing factor comparison)
  2. Raw Text + Naive Bayes (smoothing factor comparison)
  3. Removing Rare Words (words occurring fewer than 5 times are removed)
  4. Impact of Sentiment on Review Lengths (distribution analysis)

Experiments & Key Results

Experiment 1: Processed Text (BoW + Naive Bayes)

Preprocessing applied: tokenization + stopword removal.
Smoothing factors evaluated: 0.1, 0.3, 0.5, 1.0, 1.5, 1.8, 1.9, 2.0.
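As a rough illustration of the preprocessing step, the sketch below tokenizes a review and drops stopwords. The regular expression, the function name `preprocess`, and the tiny `STOPWORDS` set are assumptions; the tokenizer and stopword list actually used in the experiment are not shown here.

```python
import re

# Tiny illustrative stopword list (an assumption, not the repository's list).
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


tokens = preprocess("This movie was NOT good, and the plot is thin.")
print(tokens)  # ['movie', 'not', 'good', 'plot', 'thin']
```

Many standard stopword lists also contain negations such as "not"; this toy list keeps them, but removing them would discard information that can matter for sentiment.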

Best result: smoothing factor 1.8

  • Precision 0.85592
  • Recall 0.84009
  • F1 0.84793
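The three metrics are related by F1 = 2PR / (P + R). The following sketch computes them from label lists and checks that the reported F1 follows from the reported precision and recall. The function names and the toy labels are hypothetical; which class the experiments treat as positive is not stated, so `positive=1` is an assumption.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy labels for illustration only.
toy_metrics = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])

# Consistency check against the figures reported for smoothing factor 1.8.
reported_f1 = f1_from(0.85592, 0.84009)
print(round(reported_f1, 5))  # 0.84793
```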

📌 Figure: Impact of smoothing factors on precision/recall/F1

Figure 1


Experiment 2: Raw Text (BoW + Naive Bayes)

No preprocessing (raw reviews).
Smoothing factors evaluated: 0.1 to 3.0.

Best result: smoothing factor 0.5

  • Precision 0.88033
  • Recall 0.84914
  • F1 0.86445

📌 Figure: Smoothing factors vs precision/recall/F1 using raw text

Figure 2


Experiment 3: Removing Rare Words

Words that appear fewer than 5 times are treated as rare and removed, which reduces vocabulary size and noise.

  • Rare words identified: 70,178
  • Metrics after retraining:
    • Precision 0.8534
    • Recall 0.8425
    • F1 0.8479
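The rare-word filter can be sketched as follows. This version counts total occurrences across the corpus; whether the experiment counted total occurrences or the number of documents containing a word is not stated, so that choice, the function name `remove_rare_words`, and the toy documents are assumptions.

```python
from collections import Counter


def remove_rare_words(tokenized_docs, min_count=5):
    """Drop tokens whose corpus-wide frequency is below min_count."""
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    rare = {word for word, count in freq.items() if count < min_count}
    cleaned = [[tok for tok in doc if tok not in rare] for doc in tokenized_docs]
    return cleaned, rare


# Toy tokenized documents for illustration only.
docs = [
    ["good", "good", "good", "film"],
    ["good", "good", "dull", "film"],
    ["film", "film", "film", "odd"],
]
cleaned_docs, rare_words = remove_rare_words(docs, min_count=5)
print(sorted(rare_words))  # ['dull', 'odd']
```

After filtering, the classifier is retrained on the cleaned documents so that the vocabulary and the smoothing denominator reflect the reduced word set.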

Comparison (Cleaned Rare Words vs Processed Text):

| Metric | Cleaned Rare Words | Processed Text |
|---|---|---|
| Precision | 0.8534 | 0.8553 |
| Recall | 0.8425 | 0.8405 |
| F1 Score | 0.8479 | 0.8478 |

📌 Figure: Cleaned rare words vs processed text

Figure 3


Experiment 4: Impact of Sentiment on Review Lengths

This experiment analyzes the relationship between review length and sentiment label.

Observation:

  • Negative reviews are more concentrated around shorter lengths.
  • Positive reviews show a slightly broader spread.
  • There is no clear length threshold that cleanly separates positive vs negative, but length may still help as a feature in ambiguous cases.
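One way to examine such distributions is to group review lengths by label, then compare summary statistics and binned counts. The sketch below measures length in whitespace-separated tokens, which is an assumption (the experiment may measure characters or use a different tokenizer); the function names and the toy reviews are hypothetical and do not reproduce the dataset.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def lengths_by_label(reviews, labels):
    """Group review lengths (in whitespace tokens) by sentiment label."""
    grouped = defaultdict(list)
    for review, label in zip(reviews, labels):
        grouped[label].append(len(review.split()))
    return dict(grouped)


def length_histogram(lengths, bin_width=2):
    """Count reviews per length bin, keyed by the bin's lower edge."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Toy reviews for illustration only.
reviews = ["bad", "not good at all", "a truly wonderful and moving film", "loved it"]
labels = ["neg", "neg", "pos", "pos"]

grouped = lengths_by_label(reviews, labels)
for label, lengths in grouped.items():
    print(label, mean(lengths), median(lengths), dict(length_histogram(lengths)))
```

Plotting the per-label histograms on shared axes makes overlap visible, which is what shows whether any length threshold could separate the classes.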

📌 Figure: Histogram of review lengths by sentiment

Figure 4


Summary of Findings (Experiments 1–4)

  • Naive Bayes on raw text achieved a higher best F1 score than on processed text (0.86445 vs 0.84793), suggesting that raw text retains helpful context.
  • Removing rare words yields nearly identical performance (F1 0.8479 vs 0.8478 for processed text) while shrinking the vocabulary, which reduces model complexity and may improve efficiency.
  • Review length alone is not a reliable sentiment indicator, but trends suggest it could be a useful auxiliary feature.
