PySpark-HDFS-MachineLearning

Random Forest classification using PySpark and HDFS, predicting term deposit subscriptions for a Portuguese bank.

Instructions:

Download the IPython notebook and run it on your machine.

Prerequisites for running the notebook:

  1. HDFS needs to be set up on your machine to read and write the data. We used Windows 10 for this purpose, but any OS can be used. You can follow the tutorial for setting up a single-node HDFS cluster here

  2. Spark needs to be installed from the official Apache Spark link

  3. The PySpark shell needs to be configured to run inside Jupyter Notebook. To set it up on your machine, follow these instructions

  4. Apart from these, the usual Python libraries need to be installed using pip:

pip install jupyter notebook

pip install pandas

pip install numpy

(math is part of the Python standard library and does not need to be installed with pip.)

pip install pyspark

pip install matplotlib

pip install seaborn
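
To confirm that the libraries above are available before opening the notebook, a quick check like the following can help. This is only a minimal sketch and not part of the notebook itself: the `REQUIRED` list and the `find_missing` helper are illustrative names, and `notebook` is assumed to be the import name of the Jupyter Notebook package. `math` is left out because it ships with Python.

```python
import importlib.util

# Import names for the packages listed above (assumption: "notebook" is the
# import name of the Jupyter Notebook package).
REQUIRED = ["notebook", "pandas", "numpy", "pyspark", "matplotlib", "seaborn"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name reported as missing can be installed with the matching pip command above. Note that this only checks the Python packages; it does not verify that HDFS or Spark itself is set up.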

Let us know if you need any help with the Spark or HDFS cluster setup, or if you run into any issues running the notebook.