Random Forest Classification using PySpark and HDFS. Predicting term deposit subscription for a Portuguese bank.
Download the IPython notebook to run it on your machine.
- HDFS must be set up on your machine to read and write the data. We used Windows 10 for this, but any OS will work. You can follow the tutorial for setting up a single-node HDFS cluster here.
- Spark must be installed from the official Apache Spark site.
- The PySpark shell must be configured to run inside Jupyter Notebook. To set it up on your machine, follow these instructions.
- Apart from these, the usual Python libraries need to be installed using pip.
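
Before opening the notebook, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch using only the Python standard library. The package names (`pyspark`, `pandas`, `numpy`, `matplotlib`) and command names (`hdfs`, `spark-submit`, `jupyter`) are assumptions made for illustration, since the exact list is not given here; adjust them to match your setup.

```python
import importlib.util
import shutil

# Assumed names for illustration only: edit these to match your environment.
REQUIRED_PACKAGES = ["pyspark", "pandas", "numpy", "matplotlib"]
REQUIRED_COMMANDS = ["hdfs", "spark-submit", "jupyter"]


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_commands(names):
    """Return the commands from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def report():
    """Print a short summary of what still needs to be installed or configured."""
    packages = missing_packages(REQUIRED_PACKAGES)
    commands = missing_commands(REQUIRED_COMMANDS)
    if packages:
        print("Missing Python packages (install with pip):", ", ".join(packages))
    if commands:
        print("Commands not found on PATH:", ", ".join(commands))
    if not packages and not commands:
        print("All assumed prerequisites were found.")


if __name__ == "__main__":
    report()
```

A command reported as missing may still be installed but simply not on the PATH, which is common on Windows when the Hadoop or Spark `bin` directories have not been added to the environment variables.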