sergiosonline/bigbigdata

bigbigdata

This was a self-directed project based on Kaggle's Microsoft Malware Challenge.

I implemented the following four classification models using PySpark's ml and mllib libraries, running on the GCP and Azure cloud environments:

  • Logistic Regression

  • Support Vector Machine

  • Random Forest Classification

  • Gradient Boosted Classification Tree
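
As a rough illustration of how these four models might be set up with PySpark's DataFrame-based ml API, here is a minimal sketch. It is not the project's actual code: the column names `features` and `label`, the hyperparameter values, the helper names `MODEL_SPECS`, `build_classifiers` and `evaluate_all`, and the choice of `LinearSVC` for the support vector machine are all assumptions made for the example. It also assumes a recent Spark release in which every one of these models emits a `rawPrediction` column.

```python
# Display name -> (class name in pyspark.ml.classification, illustrative hyperparameters).
# The hyperparameter values are placeholders, not the settings used in the project.
MODEL_SPECS = {
    "Logistic Regression": ("LogisticRegression", {"maxIter": 20}),
    "Support Vector Machine": ("LinearSVC", {"maxIter": 20}),
    "Random Forest Classification": ("RandomForestClassifier", {"numTrees": 50}),
    "Gradient Boosted Classification Tree": ("GBTClassifier", {"maxIter": 20}),
}


def build_classifiers(features_col="features", label_col="label"):
    """Instantiate the four estimators (needs pyspark and an active SparkSession)."""
    # Imported inside the function so the specs above can be inspected without Spark.
    import pyspark.ml.classification as clf

    return {
        name: getattr(clf, cls_name)(
            featuresCol=features_col, labelCol=label_col, **params
        )
        for name, (cls_name, params) in MODEL_SPECS.items()
    }


def evaluate_all(train_df, test_df, label_col="label"):
    """Fit each model on train_df and return a dict of AUROC scores on test_df."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC"
    )
    return {
        name: evaluator.evaluate(estimator.fit(train_df).transform(test_df))
        for name, estimator in build_classifiers(label_col=label_col).items()
    }
```

`train_df` and `test_df` are expected to be Spark DataFrames that already contain an assembled feature vector column (for example from `VectorAssembler`) and a 0/1 label column.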

Below is the abstract. You may find the final report on my website here:

Abstract:

This work centers on a binary classification problem over millions of observations, each corresponding to a distinct Windows device. By correctly classifying which devices are most likely to acquire malware in the coming time period, we can identify the factors that most influence infection. The work also explores Google Cloud Platform (GCP) and Microsoft Azure as on-demand distributed computing ecosystems, and includes some discussion of the class imbalance problem in classification. My results are impressive for such rudimentary implementations (for reference, the 1st-place Kaggle entry scored 0.71 AUROC).
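
To make the two technical ideas in the abstract concrete, the sketch below shows one common way to compute class-balancing weights and a direct, pairwise definition of AUROC in plain Python. Both helpers (`balanced_class_weights` and `auroc`) are hypothetical names introduced for illustration and are not taken from the project. The pairwise AUROC is quadratic in the number of samples, so it is only suitable for toy inputs, not for millions of rows.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rarer classes get larger weights, which is one common remedy for imbalance.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data: three negatives and one positive.
weights = balanced_class_weights([0, 0, 0, 1])   # class 1 gets weight 2.0
score = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

In PySpark, per-row weights of this kind could be stored in a column and passed to estimators that accept a `weightCol` parameter, such as `LogisticRegression`.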

About

Big Data Projects
