Course 8 in the Data Science Specialization by Johns Hopkins University
- The Data Scientist's Toolbox
- R Programming
- Getting and Cleaning Data
- Exploratory Data Analysis
- Reproducible Research
- Statistical Inference
- Regression Models
- Practical Machine Learning
- Developing Data Products
- Data Science Capstone
How I feel about building "black box" machine learning models:
- Useful for learning about concepts & models
- The course project at the end provides a good opportunity for hands-on practice.
- Slides were insufficiently prepared; some of the models used in the quiz questions are not taught in the slides (e.g. SVM, bats() forecast).
- No hyperlinks in the PDF slides! Students cannot access the additional information the lecturer provides.
- In and out of sample errors
- Cross Validation
- Caret package
- Principal Component Analysis (PCA)
- Regression with multiple features
- Plotting
- Decision trees and random forest
- Bagging
- Boosting
- Regular Expression
- Unsupervised Prediction
- Combining predictors
- Forecasting
- Models used for the coursework project:
- Gradient Boosted Machine (gbm)
- Support Vector Machine (svmRadial)
- Random Forest
Among these three models, random forest had the best performance (accuracy > 99%), so we used rf for the final predictions.
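For reference, here is a rough sketch of how three such candidate models can be fit and compared with caret. This is not the exact code from my project; it assumes a training data frame whose outcome column is named `classe` (as in the course project data):

```{r}
library(caret)
set.seed(123)                                   # hypothetical seed
ctrl <- trainControl(method = "cv", number = 5)

# Fit the three candidate models with the same resampling scheme
fitGBM <- train(classe ~ ., data = training, method = "gbm", trControl = ctrl, verbose = FALSE)
fitSVM <- train(classe ~ ., data = training, method = "svmRadial", trControl = ctrl)
fitRF  <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)

# Compare cross-validated accuracy across the three models
summary(resamples(list(GBM = fitGBM, SVM = fitSVM, RF = fitRF)))
```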
- Performance issue with random forest model
One of the biggest challenges I struggled with during the coursework was the amount of processing time the random forest model takes. At first, each run of the rf model took 1-2 hours, which was not helpful since I just wanted to see whether using different features would result in different accuracies!
Afterwards I looked at the course's discussion forum and found this helpful post: Improve random forest performance. It suggests a way to speed up the rf model through a combination of cross-validation and parallel processing. Applying this resulted in a ~70% decrease in computing time (it took my Mac about 25 minutes to knit the HTML file from the Rmd file).
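A minimal sketch of that approach (k-fold cross-validation plus a parallel backend); the exact settings in the forum post may differ, and the outcome column `classe` is assumed from the project data:

```{r}
library(caret)
library(parallel)
library(doParallel)

# Set up a parallel backend using all but one core
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

# 5-fold cross-validation instead of the default bootstrap, with parallel resampling
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modRF <- train(classe ~ ., data = training, method = "rf", trControl = fitControl)

# Shut down the cluster and return to sequential processing
stopCluster(cluster)
registerDoSEQ()
```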
library(caret)
library(kernlab)
data(spam)
inTrain<-createDataPartition(y=spam$type,p=0.75,list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
hist(training$capitalAve,main="",xlab="avg. capital run length")
The histogram shows that the data are heavily skewed (most values are small, with a long right tail), so standardizing is a good idea.
trainCapAve<-training$capitalAve
trainCapAveS<-(trainCapAve-mean(trainCapAve))/sd(trainCapAve)
mean(trainCapAveS)
sd(trainCapAveS)
Standardize the test set using the mean and sd of the training set. This means the standardized test values will not have mean exactly 0 and sd exactly 1, but they should be close.
testCapAve<-testing$capitalAve
testCapAveS<-(testCapAve-mean(trainCapAve))/sd(trainCapAve)
mean(testCapAveS)
Use the preProcess() function to do the standardization on the training set. The result is the same as the manual standardization above.
preObj<-preProcess(training[,-58],method=c("center","scale"))
trainCapAveS<-predict(preObj,training[,-58])$capitalAve
mean(trainCapAveS)
sd(trainCapAveS)
Use preProcess() to do the same on the testing dataset. Note that preObj (which was created from the training set) is also used to transform the testing set.
As a result, on the testing set the mean is not exactly 0 and the sd is not exactly 1.
testCapAveS<-predict(preObj,testing[,-58])$capitalAve
mean(testCapAveS)
sd(testCapAveS)
set.seed(1)
model<-train(type ~.,data=training,preProcess=c("center","scale"),method="glm")
model
The Box-Cox transformation reshapes the data towards a normal (bell-shaped) distribution.
preObj<-preProcess(training[,-58],method=c("BoxCox"))
trainCapAveS<-predict(preObj,training[,-58])$capitalAve
par(mfrow=c(1,2))
hist(trainCapAveS)
qqnorm(trainCapAveS)
knnImpute uses the average of the k-nearest neighbours to impute the data where it's not available.
set.seed(1)
# Make some value NAs
training$capAve<-training$capitalAve
selectNA<-rbinom(dim(training)[1],size=1,prob=0.05)==1
training$capAve[selectNA]<-NA
# Impute data when it's NA, and standardize
preObj<-preProcess(training[,-58],method="knnImpute")
capAve<-predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth<-training$capitalAve
capAveTruth<-(capAveTruth-mean(capAveTruth))/sd(capAveTruth)
Look at the difference between the imputed values (capAve) and the true values (capAveTruth), using the quantile() function.
If the values are all relatively small, then it shows that imputing data works (i.e. doesn't change the dataset too much).
quantile(capAve-capAveTruth)
- Training and testing sets must be processed in the same way (i.e. use the same `preObj` in the `predict()` function).
- Step 1: raw data -> features (e.g. free text -> data frame). Google "feature extraction for [data type]". Examples:
- Text files: frequency of words, frequency of phrases, frequency of capital letters
- Images: Edges, corners, ridges
- Webpages: # and type of images, position of elements, colors, videos (e.g. A/B testing)
- People: Height, weight, hair color, gender etc.
- Step 2: features -> new, useful features
- more useful for some models (e.g. regression, SVM) than others (e.g. decision trees)
- should be done only on the training set
- new features should be added to data frames
- An example of feature creation
```{r}
library(ISLR)
library(caret)
data(Wage)
inTrain<-createDataPartition(y=Wage$wage,p=0.7,list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
```
- Convert factor variables to dummy variables
The `jobclass` column is a factor variable, so we can convert it to dummy variables with the `dummyVars` function:
```{r}
dummies<-dummyVars(wage ~ jobclass,data=training)
head(predict(dummies,newdata=training))
```
- Remove features that are (almost) constant throughout the data frame, using `nearZeroVar`
If `nsv` (the `nearZeroVar` metrics) shows TRUE for a feature, that feature has near-zero variance, carries little information, and can be removed.
```{r}
nsv<-nearZeroVar(training,saveMetrics = TRUE)
nsv
```
* Spline basis
`df=3` says that we want a 3rd-degree polynomial on this variable `training$age`.
- First column: `age`
- Second column: `age^2`
- Third column: `age^3`
```{r}
library(splines)
bsBasis<-bs(training$age,df=3)
bsBasis
```
#### Fitting curves with splines
```{r}
lm1<-lm(wage~bsBasis,data=training)
plot(training$age,training$wage,pch=19,cex=0.5)
points(training$age,predict(lm1,newdata=training),col="red",pch=19,cex=0.5)
```
#### Splines on the test set
Note that we use the same `bsBasis` that was created on the training dataset.
```{r}
predict(bsBasis,age=testing$age)
```
- Find features that are correlated
which() returns the pairs of features with absolute correlation > 0.8
library(caret)
library(kernlab)
data(spam)
set.seed(1)
inTrain<-createDataPartition(y=spam$type,p=0.75,list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
M<-abs(cor(training[,-58]))
diag(M)<-0
which(M>0.8,arr.ind=T)
Take a look at the correlated features:
names(spam)[c(34,32,40)]
plot(spam[,34],spam[,32])
Apply PCA in R: prcomp()
smallSpam<-spam[,c(34,32)]
prComp<-prcomp(smallSpam)
plot(prComp$x[,1],prComp$x[,2])
prComp$rotation
typeColor<-((spam$type=="spam")*1+1)
prComp<-prcomp(log10(spam[,-58]+1))
plot(prComp$x[,1],prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")
preProc<-preProcess(log10(spam[,-58]+1),method="pca",pcaComp = 2)
spamPC<-predict(preProc,log10(spam[,-58]+1))
plot(spamPC[,1],spamPC[,2],col=typeColor)
preProc<-preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
trainPC<-predict(preProc,log10(training[,-58]+1))
modelFit <- train(x = trainPC, y = training$type,method="glm")
Note that we should use the same PCA procedure (preProc) when using predict()
on the testing set
testPC<-predict(preProc,log10(testing[,-58]+1))
confusionMatrix(testing$type,predict(modelFit,testPC))
Accuracy is > 0.9!
Alternative: preProcess with PCA during the training process (instead of doing PCA first, then do the training)
modelFit<-train(type~.,data=training,method="glm",preProcess="pca")
confusionMatrix(testing$type,predict(modelFit,testing))
Use the faithful (Old Faithful geyser) eruption data
library(caret)
data(faithful)
set.seed(333)
inTrain<-createDataPartition(y=faithful$waiting,p=0.5,list=FALSE)
trainFaith<-faithful[inTrain,]
testFaith<-faithful[-inTrain,]
head(trainFaith)
You can see that there's a roughly linear relationship between the two variables.
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="waiting",ylab="eruption duration")
lm1<-lm(eruptions~waiting,data=trainFaith)
summary(lm1)
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="waiting",ylab="eruption duration")
lines(trainFaith$waiting,lm1$fitted,lwd=3)
Predict the eruption duration when waiting time = 80
newdata<-data.frame(waiting=80)
predict(lm1,newdata)
par(mfrow=c(1,2))
# training
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",main="training",xlab="waiting",ylab="eruption duration")
lines(trainFaith$waiting,predict(lm1),lwd=3)
# testing
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",main="testing",xlab="waiting",ylab="eruption duration")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)
# RMSE on training
sqrt(sum((lm1$fitted-trainFaith$eruptions)^2))
# RMSE on testing
sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2))
pred1<-predict(lm1,newdata=testFaith,interval="prediction")
ord<-order(testFaith$waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
matlines(testFaith$waiting[ord],pred1[ord,],type="l",col=c(1,2,2),lty=c(1,1,1),lwd=3)
modFit<-train(eruptions~waiting,data=trainFaith,method="lm")
summary(modFit$finalModel)
Use the Wage dataset in the ISLR package
library(ISLR)
library(ggplot2)
library(caret)
data(Wage)
Wage<-subset(Wage,select=-c(logwage))
summary(Wage)
inTrain<-createDataPartition(y=Wage$wage,p=0.7,list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
dim(training)
dim(testing)
featurePlot(x=training[,c("age","education","jobclass")],y=training$wage,plot="pairs")
qplot(age,wage,data=training)
We can see that the outliers are mostly people in the Information jobclass
qplot(age,wage,color=jobclass,data=training)
You can see that the outliers mostly have advanced-degree education
qplot(age,wage,color=education,data=training)
modFit<-train(wage~age+jobclass+education,method="lm",data=training)
finMod<-modFit$finalModel
print(modFit)
plot(finMod,1,pch=19,cex=0.5,col="#00000010")
qplot(finMod$fitted,finMod$residuals,color=race,data=training)
plot(finMod$residuals,pch=19)
pred<-predict(modFit,testing)
qplot(wage,pred,color=year,data=testing)
modFitAll<-train(wage~.,data=training,method="lm")
pred<-predict(modFitAll,newdata=testing)
qplot(wage,pred,data=testing)
Pros: better interpretability and better performance in non-linear settings
Stop splitting when the leaves are pure
- Misclassification Error:
- 0 = perfect purity
- 0.5 = no purity
- Gini index:
- 0 = perfect purity
- 0.5 = no purity
- Deviance/information gain:
- 0 = perfect purity
- 1 = no purity
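As a quick worked example (a hypothetical node with 15 observations of one class and 1 of another), these measures can be computed directly:

```{r}
# Hypothetical node: 15 observations of class A, 1 of class B
p <- c(15, 1) / 16

1 - max(p)           # misclassification error: 1/16 = 0.0625
1 - sum(p^2)         # Gini index: ~0.12
-sum(p * log2(p))    # information (deviance with log base 2): ~0.34
```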
Example: Iris Data
data(iris)
library(ggplot2)
library(caret)
names(iris)
table(iris$Species)
inTrain<-createDataPartition(y=iris$Species,p=0.7,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
# plot the Iris petal widths/species
qplot(Petal.Width,Sepal.Width,col=Species, data=training)
Train the model
# rpart is R's package for recursive partitioning (classification and regression trees)
modFit<-train(Species~.,method="rpart",data=training)
print(modFit$finalModel)
# plot tree
plot(modFit$finalModel,uniform=TRUE,main="Classification Tree")
text(modFit$finalModel,use.n=TRUE,all=TRUE,cex=0.8)
Use the rattle package to make the trees look better
library(rattle)
fancyRpartPlot(modFit$finalModel)
Predict new values
predict(modFit,newdata=testing)
Notes: Classification trees are non-linear models
- They use interactions between variables
- Trees can also be used for regression problems (i.e. continuous outcomes)
What is bagging?
- Resample cases and recalculate predictions
- Average or majority vote
- It produces similar bias, but reduces variance.
- Bagging is more useful for non-linear functions
Example with the Ozone data from ElemStatLearn package
library(ElemStatLearn)
data(ozone,package="ElemStatLearn")
ozone<-ozone[order(ozone$ozone),]
We'll predict temperature based on ozone
ll<-matrix(NA,nrow=10,ncol=155)
#we'll resample the data 10 times (loop 10 times)
for(i in 1:10){
# each time we'll resample with replacement
ss<-sample(1:dim(ozone)[1],replace=T)
# ozone0 is the resampled subset. We'll also reorder the resampled subset with ozone
ozone0<-ozone[ss,];ozone0<-ozone0[order(ozone0$ozone),]
# we'll fit a loess curve through the resampled subset; span determines how smooth the curve is
loess0<-loess(temperature~ozone,data=ozone0,span=0.2)
# for each of the loess curve, we'll predict the outcome for the 155 rows in the original dataset
ll[i,]<-predict(loess0,newdata=data.frame(ozone=1:155))
}
The red line is the bagged (average) line across the 10 resamples
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
lines(1:155,apply(ll,2,mean),col="red",lwd=2)
Notes:
- Bagging is most useful for non-linear models
- Often used with trees & random forests
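As a minimal sketch (assuming the same ozone data frame is loaded), caret can bag trees directly via the "treebag" method, which wraps ipred::bagging:

```{r}
library(caret)
# Bagged regression trees on the ozone data ("treebag" requires the ipred package)
treebagFit <- train(temperature ~ ozone, data = ozone, method = "treebag")

# Overlay the bagged predictions on the raw data
plot(ozone$ozone, ozone$temperature, pch = 19, cex = 0.5)
points(1:155, predict(treebagFit, newdata = data.frame(ozone = 1:155)),
       col = "blue", pch = 19, cex = 0.5)
```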
What are random forests?
- Bootstrap samples
- At each split, bootstrap variables
- Grow multiple trees and vote
Pros:
- Accuracy
Cons:
- Speed
- Interpretability
- Overfitting
Random Forest on Iris data
data(iris)
library(ggplot2)
library(caret)
inTrain<-createDataPartition(y=iris$Species,p=0.7,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
# build random forest model using caret
modFit<-train(Species~.,method="rf",prox=TRUE,data=training)
library(randomForest)
getTree(modFit$finalModel,k=2)
irisP<-classCenter(training[,c(3,4)],training$Species,modFit$finalModel$prox)
irisP<-as.data.frame(irisP)
irisP$Species<-rownames(irisP)
p<-qplot(Petal.Width,Petal.Length,col=Species,data=training)
# This line plots the three centers
p+geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)
pred<-predict(modFit,testing)
testing$preRight<-pred==testing$Species
table(pred,testing$Species)
qplot(Petal.Width,Petal.Length,col=preRight,data=testing,main="newdata Predictions")
Boosting and random forests are two of the most accurate out-of-the-box classifiers for prediction.
- Take lots of (possibly) weak predictors
- Weight them and add them up
- Get a strong predictor
library(ISLR)
data(Wage)
library(ggplot2)
library(caret)
Wage<-subset(Wage,select=-c(logwage))
set.seed(1)
inTrain<-createDataPartition(y=Wage$wage,p=0.7,list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
gbm is boosting for tree models.
modFit<-train(wage~.,data=training,method="gbm",verbose=FALSE)
qplot(predict(modFit,testing),wage,data=testing)
Approach:
- Assume the data follow a probabilistic model
- Use Bayes' theorem to identify optimal classifiers
Pros:
- Take advantage of the structure of the data
- Computationally convenient
- Reasonably accurate
Cons:
- Make additional assumptions about the data
- When the model is incorrect, accuracy may suffer
Naive Bayes assumes that all features are independent of each other - useful for binary or categorical data, e.g. text classification
Model based prediction with Iris data
data(iris)
library(ggplot2)
library(caret)
set.seed(2)
inTrain<-createDataPartition(y=iris$Species,p=0.7,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
`lda` = linear discriminant analysis; `nb` = Naive Bayes
modlda<-train(Species~.,data=training,method="lda")
modnb<-train(Species~.,data=training,method="nb")
plda<-predict(modlda,testing)
pnb<-predict(modnb,testing)
table(plda,pnb)
equalPredictions =(plda==pnb)
qplot(Petal.Width,Sepal.Width,col=equalPredictions,data=testing)
- Fit a regression model
- Penalize (or shrink) large coefficients
Pros
- Help with bias/variance tradeoff
- Help with model selection
Cons
- Computationally demanding
- Lower performance than random forests and boosting
Prediction error = irreducible error + bias^2 + variance
Tuning parameter lambda:
- lambda controls the size of the coefficients and the amount of regularization
- As lambda approaches 0, we obtain the least squares solution (i.e. what we get from the standard linear model)
- As lambda approaches infinity, the coefficients shrink towards 0
In caret, methods for penalized regression models include:
- ridge
- lasso
- relaxo
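Beyond those, caret's "glmnet" method covers both ridge (alpha = 0) and lasso (alpha = 1) and tunes lambda by cross-validation. A minimal sketch, assuming the Wage training set from earlier:

```{r}
library(caret)
# Elastic net via glmnet; caret tunes alpha and lambda over a small grid
penFit <- train(wage ~ ., data = training, method = "glmnet",
                trControl = trainControl(method = "cv", number = 5),
                tuneLength = 5)
penFit$bestTune    # chosen alpha and lambda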
- Combining classifiers generally improves accuracy but reduces interpretability.
- How? For example, majority vote.
- similar classifiers: bagging, boosting, random forest
- different classifiers: model stacking, model ensembling
- example with wage data
library(ISLR)
data(Wage)
library(ggplot2)
library(caret)
Wage<-subset(Wage,select=-c(logwage))
# Create a building data set (which is split into training and testing data) and validation set
inBuild<-createDataPartition(y=Wage$wage,p=0.7,list=FALSE)
validation<-Wage[-inBuild,]
buildData<-Wage[inBuild,]
inTrain<-createDataPartition(y=buildData$wage,p=0.7,list=FALSE)
training<-buildData[inTrain,]
testing<-buildData[-inTrain,]
dim(training)
Build two different models: linear regression + random forest
mod1<-train(wage~.,method="glm",data=training)
mod2<-train(wage~.,method="rf",data=training,trControl=trainControl(method="cv",number=3))
Plot the two different models on the same chart
pred1<-predict(mod1,testing)
pred2<-predict(mod2,testing)
qplot(pred1,pred2,colour=wage,data= testing)
First, create a new data frame containing the predictions from the two original models. Then train a combined model on that data frame.
predDF<-data.frame(pred1,pred2,wage=testing$wage)
combModFit<-train(wage~.,method="gam",data=predDF)
combPred<-predict(combModFit,predDF)
# linear regression
sqrt(sum((pred1-testing$wage)^2))
# random forest
sqrt(sum((pred2-testing$wage)^2))
# combined (linear regression + random forest)
sqrt(sum((combPred-testing$wage)^2))
We can see that the combined predictor has the lowest error on the test set.
pred1V<-predict(mod1,validation)
pred2V<-predict(mod2,validation)
predVDF<-data.frame(pred1=pred1V,pred2=pred2V)
combPredV<-predict(combModFit,predVDF)
# linear regression
sqrt(sum((pred1V-validation$wage)^2))
# random forest
sqrt(sum((pred2V-validation$wage)^2))
# combined
sqrt(sum((combPredV-validation$wage)^2))
Typical model for binary/multiclass data
- Build an odd number of models
- Predict with each model
- Predict the class by majority vote
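A small hypothetical sketch of the majority-vote step, assuming p1, p2, p3 are factor vectors of class predictions from three different models:

```{r}
# Majority vote across an odd number of class predictions (hypothetical helper)
majorityVote <- function(p1, p2, p3) {
  votes <- data.frame(p1 = as.character(p1),
                      p2 = as.character(p2),
                      p3 = as.character(p3),
                      stringsAsFactors = FALSE)
  # For each observation, pick the class that receives the most votes
  factor(apply(votes, 1, function(row) names(which.max(table(row)))))
}
```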
- example: predict the price of Google stock
library(quantmod)
from.dat<-as.Date("01/01/08",format="%m/%d/%y")
to.dat<-as.Date("12/31/13",format="%m/%d/%y")
getSymbols("GOOG",src="yahoo",from=from.dat,to=to.dat)
mGoog<-to.monthly(GOOG)
googOpen<-Op(mGoog)
ts1<-ts(googOpen,frequency = 12)
plot(ts1,xlab="Year+1",ylab="GOOG")
Decompose the time series into trend, seasonal, and random components:
plot(decompose(ts1),xlab="Years+1")
ts1Train<-window(ts1,start=1,end=5)
ts1Test<-window(ts1,start=5,end=(7-0.01))
ts1Train
plot(ts1Train)
lines(ma(ts1Train,order=3),col="red")
We can get a range of possible predicted values (prediction intervals) from the forecast:
library(forecast)
ets1<-ets(ts1Train,model="MMM")
fcast<-forecast(ets1)
plot(fcast)
lines(ts1Test,col="red")
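To quantify how well the forecast does on the held-out window, the forecast package's accuracy() function can be used:

```{r}
# Compare the forecast against the held-out test window
accuracy(fcast, ts1Test)
```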
