Project 3 - Web APIs and NLP

Problem Statement

We are a data science team at GG, a PC building company similar to Aftershock PC, where graphics card problems are the number one issue users report. Most of these problems were found to occur after office hours, so management asked the IT team to build a bot that directs customers with graphics card problems to relevant subreddits when they need help after office hours. The software engineers have asked us to help classify keywords so that the bot can direct customers to the right subreddit accurately.

Our aim is therefore to build a classifier that identifies keywords for the bot so that it can determine the relevant subreddit with a target accuracy of 90%.

Background

PC building companies like Aftershock PC and GG (our fictional company) build powerful computers. The company lets users customize their computers and also recommends different computer builds. One of the critical components of a computer is the graphics card, which unfortunately draws the most customer requests and inquiries.

A graphics card, also known as a graphics processing unit (GPU), is a piece of hardware that accelerates graphics rendering in powerful computers.

There are two types of graphics cards used in computers: integrated and discrete.

Discrete graphics cards are preferred for cryptocurrency mining, media development and intensive gaming.

The graphics card industry is dominated by two main manufacturing companies, namely:

AMD
NVIDIA

As we are a PC building company, we provide discrete graphics cards from both companies.

Executive Summary

The title and selftext of posts from two subreddits, AMD and NVIDIA, were scraped using the Pushshift API at https://api.pushshift.io.
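As a rough illustration of the scraping step, the sketch below builds a query against the Pushshift submission-search endpoint and keeps only the title and selftext fields. The endpoint path, the `size` and `before` parameters, and all function names here are assumptions for illustration, not the project's actual notebook code, and the API's availability may have changed since the data was collected.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Pushshift submission-search endpoint; availability is not guaranteed.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query_url(subreddit, size=100, before=None):
    """Build a search URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # epoch seconds; used to page backwards in time
    return PUSHSHIFT_URL + "?" + urlencode(params)


def extract_posts(payload):
    """Keep only the fields of interest from a decoded JSON response."""
    return [
        {
            "subreddit": post.get("subreddit", ""),
            "title": post.get("title", ""),
            "selftext": post.get("selftext", ""),
        }
        for post in payload.get("data", [])
    ]


def fetch_posts(subreddit, size=100, before=None):
    """Download and parse one page of posts (performs a network request)."""
    with urlopen(build_query_url(subreddit, size, before), timeout=30) as resp:
        return extract_posts(json.load(resp))
```

Calling `fetch_posts("AMD")` and `fetch_posts("NVIDIA")` repeatedly, passing the oldest timestamp seen as `before`, would be one way to page through both subreddits.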

This is a classification problem, approached using Count Vectorization and TF-IDF Vectorization with two classifier models.

Grid Search CV was used to find the optimal hyperparameters for each classifier model.
The models chosen are:
1. Logistic Regression
2. Multinomial Naive Bayes
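The combination of two vectorizers, two classifiers and a grid search could be set up roughly as follows with scikit-learn. This is a minimal sketch: the toy corpus, the hyperparameter values and the `cv=2` setting are placeholders chosen so the example runs quickly, not the grids used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical); the real input would be the scraped posts.
texts = [
    "radeon driver crash after update", "ryzen and radeon build help",
    "radeon rx card fan noise", "adrenalin software radeon issue",
    "geforce driver crash after update", "rtx card and dlss settings",
    "geforce experience login issue", "rtx geforce fan noise",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = AMD, 1 = NVIDIA

pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])

# Each dict swaps in one classifier with its own hyperparameter grid;
# the vectorizer step is itself treated as a hyperparameter.
param_grid = [
    {
        "vec": [CountVectorizer(), TfidfVectorizer()],
        "clf": [MultinomialNB()],
        "clf__alpha": [0.5, 1.0],
    },
    {
        "vec": [CountVectorizer(), TfidfVectorizer()],
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0],
    },
]

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
best_model = search.best_estimator_  # refit on all data with the best settings
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated scores are not inflated by vocabulary learned from the validation fold.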

The train and test scores and the area under the curve (AUC) of the models were used to gauge classifier performance.

The criteria for selecting the models are the following:

High Accuracy
Minimize False Negatives (High Sensitivity)
Good Generalization

High sensitivity is required as we want to direct customers to the correct subreddit and not the wrong ones.
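For reference, sensitivity (recall) is the share of actual positives that the model labels correctly, TP / (TP + FN), while specificity, TN / (TN + FP), is the measure that penalizes false positives. The sketch below computes both from label lists; the function names and the choice of which subreddit counts as the positive class are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fn, fp, tn) for a binary problem with the given positive label."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1  # a positive post sent to the wrong subreddit
        else:
            if pred == positive:
                fp += 1  # a negative post wrongly labelled positive
            else:
                tn += 1
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """Recall of the positive class: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Recall of the negative class: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0
```

With only two subreddits, the sensitivity of one class equals the specificity of the other, so checking both values covers misdirection in either direction.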

All models achieved a similar accuracy of about 90% and a similar sensitivity of about 0.9.

In summary, Multinomial Naive Bayes with Count Vectorization was chosen as the final classifier. With the other considerations being equal, it generalized best, showing the smallest gap between train (0.90) and test (0.89) scores.

Data Dictionary

Dataframes used: unvectorized_gpu.csv

Variable Name	Data Type	Description
subreddit	str	name of subreddit
author	str	name of poster of subreddit post
num_comments	int	number of comments per subreddit post
score	int	subreddit upvotes - subreddit downvotes
word_count	int	count of words after lemmatization
content_	str	subreddit post title + subreddit post content lemmatized

Recommendation and Conclusion

The completion of the classifier is only stage 1 of the entire project.
The IT team can already start testing their bot with this classifier.

As the top 5 words include the graphics card brands, we recommend that the IT team asks users for the brand and model of their graphics card, together with a problem keyword, instead of letting them type their problem as free text. This would let the classifier assist more accurately.

The wordcloud above shows the top 150 words with the highest coefficients in helping the model classify.

Nevertheless, the model has worked well, including on data it was not trained on, in identifying the relevant subreddit from specific keywords.

We recommend that the software engineering team asks for the customer ID and then matches it to the product ID when a customer wants to ask a question, so that brands and models are filled in automatically. Customers may not know what is inside their computer and may just want to ask a general question.

We also recommend that the bot requires the customer to search by type of problem first, before allowing them to type their question.

Going forward, in stage 2, the team will work together to identify subreddits and manufacturer forums with keywords and list the links in a dictionary to assist the IT team with direct problem solutions.
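The stage 2 dictionary could take a shape like the sketch below, which maps a predicted label to a link the bot can return. The dictionary name, the helper function and the exact URLs are illustrative assumptions; the real list of subreddits and manufacturer forums is still to be compiled.

```python
# Hypothetical lookup table from predicted label to a help link.
SUBREDDIT_LINKS = {
    "AMD": "https://www.reddit.com/r/AMD/",
    "NVIDIA": "https://www.reddit.com/r/NVIDIA/",
}


def link_for(predicted_label, default=None):
    """Return the link for a predicted label, or `default` if none is listed."""
    return SUBREDDIT_LINKS.get(predicted_label.upper(), default)
```

Returning a default instead of raising an error lets the bot fall back to a general support page when the classifier produces a label that has no entry yet.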

This method can also be applied to troubleshooting other computer components, such as the CPU and motherboard, in the future. We hope that the IT team will work with us to build a full AI solution for all customer component and troubleshooting inquiries.

We also hope to increase accuracy by trying other models such as SVM and by continuing to tune our hyperparameters.
