Distributed-Linear-Regression-Analysis

The housing dataset is used to develop a pipeline for running a simple linear regression in a pseudo-distributed manner (single-node setup) using Spark ML. The following was implemented:
• Created a Spark session, then loaded, parsed and displayed the data using the Apache Spark ecosystem (PySpark).
• Created a feature vector, separating the features from the label, i.e. the column to be predicted.
• Split the dataset into a 70% training set and a 30% test set.
• Trained a linear regression model.
• Evaluated its performance on the training and test sets, reporting the mean absolute error (MAE), the root mean squared error (RMSE) and the mean squared error (MSE).
• Used the test set to generate a table of predicted vs. actual values, as well as a predicted vs. actual plot.
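The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. The function names `run_pipeline` and `regression_metrics`, the `label_col` and `seed` parameters, the app name, and the assumptions that `housing.csv` has a header row and only numeric, non-null columns are all hypothetical. `regression_metrics` restates the three reported metrics in plain Python, while the Spark part relies on `RegressionEvaluator` to compute them. The plotting step is left out.

```python
import math


def regression_metrics(actual, predicted):
    """Plain-Python MAE, MSE and RMSE for two equal-length sequences."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


def run_pipeline(csv_path, label_col, seed=42):
    """Sketch of the README steps; label_col and seed are assumptions."""
    # Imported here so regression_metrics can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    # local[*] runs Spark on a single node using all local cores.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("housing-linear-regression")  # hypothetical app name
        .getOrCreate()
    )

    # Load and display the data (assumes a header row and numeric columns).
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.show(5)

    # Assemble every non-label column into a single feature vector.
    feature_cols = [c for c in df.columns if c != label_col]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    data = assembler.transform(df).select("features", label_col)

    # 70% train / 30% test split; the seed only makes the split repeatable.
    train, test = data.randomSplit([0.7, 0.3], seed=seed)

    model = LinearRegression(featuresCol="features", labelCol=label_col).fit(train)

    # Report MAE, RMSE and MSE on both splits.
    for name, split in (("train", train), ("test", test)):
        predictions = model.transform(split)
        for metric in ("mae", "rmse", "mse"):
            evaluator = RegressionEvaluator(
                labelCol=label_col, predictionCol="prediction", metricName=metric
            )
            print(name, metric, evaluator.evaluate(predictions))

    # Predicted vs. actual table for the test set.
    model.transform(test).select(label_col, "prediction").show(10)
    spark.stop()
```

A call such as `run_pipeline("housing.csv", label_col=...)` would need the real name of the target column in the dataset, which is not stated here. Note that RMSE is the square root of MSE, so the two metrics always rank models identically; MAE is less sensitive to large individual errors.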

