This project focuses on performing sentiment analysis on customer reviews in the FinTech domain using Natural Language Processing (NLP). The goal is to classify reviews as positive or negative based on the sentiment conveyed. This project leverages several machine learning models and NLP techniques to preprocess text data and build a predictive sentiment analysis model.
Sentiment analysis helps businesses understand customer opinions and feedback, which can be crucial in improving products and services. In this project, I analyzed customer reviews to classify the sentiment (positive/negative) using multiple machine learning algorithms. The project includes data cleaning, feature extraction, model building, and evaluation.
Technologies Used Data Preprocessing Model Building Model Evaluation Visualizations Results
Python: Programming language Pandas: Data manipulation and analysis NumPy: Numerical operations Scikit-learn: Machine learning algorithms (Random Forest, XGBoost, Decision Tree) NLTK / SpaCy: Text preprocessing (Tokenization, Lemmatization, Stemming) Matplotlib / Seaborn: Data visualization WordCloud: Visualization of frequently used words CountVectorizer / TfidfVectorizer: Feature extraction from text data
The raw text data was preprocessed in the following steps:
Text Cleaning: Removal of special characters, numbers, and punctuation. Tokenization: Splitting the text into individual words (tokens). Stopword Removal: Eliminating common words that don’t add value to sentiment analysis. Stemming: Reducing words to their root form (e.g., "running" to "run"). Feature Extraction: Used CountVectorizer for bag-of-words representation. Employed TF-IDF to capture important words based on frequency.
The following machine learning models were used to perform sentiment classification:
Random Forest Classifier: An ensemble method combining multiple decision trees to improve accuracy. XGBoost Classifier: A gradient boosting algorithm known for speed and performance. Decision Tree Classifier: A simple, interpretable model that splits data based on feature values.
Accuracy: Proportion of correctly classified samples. Confusion Matrix: To visualize the true positive, true negative, false positive, and false negative values.
Various visualizations were created to explore and understand the data:
WordCloud: Displayed the most frequent words in the dataset. Sentiment Distribution: Visualized the number of positive and negative reviews. Model Performance Metrics: Plotted Confusion Matrix and other evaluation metrics.
Achieved competitive model accuracy using Random Forest and XGBoost classifiers.