Pml Assignment
23 Jun 2014Practical Machine Learning - Assignment
The goal of the assignment was to be able to predict reliably, the correctness with which the Weight-Lifting activity was being executed by the participant. The execution could belong to any of 5 classes - A (correct) or B,C,D,E (incorrect).
Preliminaries
A quick look at the data in MS Excel revealed 19622 observations of 159 variables. It also revealed the presence of a number of superfluous columns that were either empty or filled with NA values. These were easily removed and so were some other columns with unimportant information such as record number, name of participant etc. In this way, the dataset was whittled down to 52 predictor variables and the "classe" variable.
Creating Training and Testing sets
The data was stored into the dataframe dat1.
library(caret);
library(kernlab);
library(scatterplot3d);
dat=read.csv("pml-training.csv",header=TRUE)
dattemp<-dat[,!sapply(dat,function(x) any(is.na(x)))]
dat1<-dattemp[,!sapply(dattemp, function(x) any(x == ""))]
The createDataPartition was used to divide the dataset into the training and testing sets.
inTrain=createDataPartition(y=dat1$classe,p=0.70,list=FALSE)
training<-dat1[inTrain,]
testing<-dat1[-inTrain,]
A quick check using the princomp function reveals that factoring the variables may probably not be very useful here.
dat2_pca<-princomp(formula=~.-classe,data=training,cor=TRUE,na.action=na.exclude)
plot(dat2_pca)
No variable sufficiently captures the variance, rendering PCA not very effective in this case.
Set the control parameters and run the train function with the method set to Random Forest. The cross-validation parameter was set to 2 as the accuracy wasn't hurt much and the code ran much faster.
Control = trainControl(method = "cv", number = 2)
modFit<-train(classe~.,method="rf",data=training,trControl=Control)
Accuracy of ~98.5% with mtry=27
Test this against the testing dataset.
newclass<-predict(modFit,newdata=testing)
confusionMatrix(newclass,testing$classe)
Accuracy of 99.61% using the testing dataset
Run the model on the 20 provided test cases.
testdat=read.csv("pml-testing.csv",header=TRUE)
newclass<-predict(modFit,newdata=testdat)
newclass
And finally, varImpPlot shows the order of importance of the variables in the form of a dot-chart.
Variables in order of importance
Please note: The data for this assignment came from: http://groupware.les.inf.puc-rio.br/har