---
title: "Machine Learning Course Project"
author: "Xing Su"
date: "February 21, 2015"
output:
  html_document:
    toc: yes
---

## Processing Data

First, we download the training and test datasets and load them in through the `read.csv` function. During my exploratory data analysis, I saw that blank values, "NA", and "#DIV/0!" often show up in data columns so I have decided to treat all of these values as `NA`.

```{r cache = TRUE, message=FALSE}
# load the caret package
library(caret)
# download data
if(!file.exists("pml-training.csv")){
	download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
		destfile = "pml-training.csv", method = "curl")
}
if(!file.exists("pml-testing.csv")){
	download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
		destfile = "pml-testing.csv", method = "curl")
}
# load data
train <- read.csv("pml-training.csv", header = TRUE, na.strings=c("","NA", "#DIV/0!"))
test <- read.csv("pml-testing.csv", header = TRUE, na.strings=c("","NA", "#DIV/0!"))
```
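
To illustrate what the `na.strings` argument does, here is a minimal sketch on a made-up three-row CSV (the column names and values are hypothetical, not taken from the real data set). All three markers should be read in as `NA`, and the affected columns stay numeric instead of being read as text.

```r
# hypothetical toy CSV containing the three kinds of missing-value markers
csv_text <- "id,roll,kurtosis\n1,1.5,#DIV/0!\n2,NA,0.7\n3,,0.2"
toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("", "NA", "#DIV/0!"))
# count the missing values in each column
colSums(is.na(toy))
# the columns remain numeric because the markers were not read as text
sapply(toy, class)
```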

The machine learning algorithms used here cannot handle features that contain `NA` values. To decide which variables/features to use, I calculated the fraction of `NA` values in each column.

```{r cache = TRUE}
# fraction of NA values in each column (rounded to 2 decimals)
NAPercent <- round(colMeans(is.na(train)), 2)
table(NAPercent)
```
From the table above, we can see that only 60 variables have complete data, so those are the variables we will use to build the prediction algorithm. I also remove the first of these variables because it is the row index from the csv file and not a true variable.

```{r cache = TRUE}
# find index of the complete columns minus the first
index <- which(NAPercent==0)[-1]
# subset the data
train <- train[, index]
test <- test[, index]
# looking at the structure of the data for the first 10 columns
str(train[, 1:10])
```
From the structure of the data, we can see that the first 6 variables `user_name`, `raw_timestamp_part_1`, `raw_timestamp_part_2`, `cvtd_timestamp`, `new_window`, `num_window` are simply administrative parameters and are ***unlikely*** to help us predict the activity the subjects are performing. Therefore, we are going to leave those 6 columns out before we build the algorithm. In addition, to make the columns easier to deal with, we will go ahead and convert all features to `numeric` class.

```{r cache = TRUE}
# subset the data
train <- train[, -(1:6)]
test <- test[, -(1:6)]
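# convert all features (every column except the last) to numeric class
for(i in 1:(length(train)-1)){
    train[,i] <- as.numeric(train[,i])
    test[,i] <- as.numeric(test[,i])
}
# make sure the outcome is a factor (R >= 4.0 reads strings as character by default)
train$classe <- as.factor(train$classe)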
```
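
The column-filtering step in this section can be sketched on a small made-up data frame (the names `toy`, `na_frac`, and `keep` are hypothetical). The sketch computes the fraction of `NA` values per column and keeps only the complete columns.

```r
# hypothetical toy data: 'b' is mostly missing, 'a' and 'classe' are complete
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(NA, NA, NA, 7),
                  classe = c("A", "B", "A", "C"))
# fraction of NA values in each column
na_frac <- colMeans(is.na(toy))
na_frac
# names of the columns with no missing values
keep <- names(na_frac)[na_frac == 0]
toy_complete <- toy[, keep, drop = FALSE]
names(toy_complete)
```

Comparing the unrounded fractions to zero avoids keeping a column whose share of `NA` values is small enough to round to 0, and selecting by name keeps the subsetting readable if the column order ever changes.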

## Cross Validation

For this project, we will focus on two widely used, highly accurate prediction algorithms: random forest and generalized boosted regression models (GBM).

We set the `test` set aside and split the `train` data into two sections for cross validation. We will allocate 80% of the data to train the models and 20% to validate them.

We expect the out-of-bag (OOB) error rate reported by the random forest model to be a good estimate of the out-of-sample error rate. We will then estimate the actual error rates from the **accuracies** the models achieve on the validation set.

```{r cache = TRUE}
# split train data set
inTrain <- createDataPartition(y=train$classe,p=0.8, list=FALSE)
trainData <- train[inTrain,]
validation <- train[-inTrain,]
# print out the dimensions of the 3 data sets
rbind(trainData = dim(trainData), validation = dim(validation), test = dim(test))
```
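
When the outcome is a factor, `createDataPartition` samples within each of its levels, so the class proportions are roughly preserved in both parts. A minimal base R sketch of that idea on made-up labels follows. The helper `stratified_split` and the seed are hypothetical, and this is not caret's actual implementation.

```r
# hypothetical helper: sample a fraction p of the row indices within each class
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
in_train <- stratified_split(labels, p = 0.8)
# class counts in the training and validation parts
table(labels[in_train])
table(labels[-in_train])
```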

## Comparing Models and Results

First, we will use **random forest** to build the first model. Because the algorithm is computationally intensive, we will leverage parallel processing using multiple cores through the `doMC` package.

```{r cache = TRUE, message=FALSE}
# load doMC package
library(doMC)
# set my cores
registerDoMC(cores = 8)
# load randomForest package
library(randomForest)
# run the random forest algorithm on the training data set
# (the caret-style `method` argument is dropped because randomForest() does not use it)
rfFit <- randomForest(classe~., data = trainData, proximity = TRUE)
rfFit
# use model to predict on validation data set
rfPred <- predict(rfFit, validation)
# predicted result
confusionMatrix(rfPred, validation$classe)
```
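
The chunk above hard-codes 8 cores. As a sketch, the core count could instead be detected at run time with the base `parallel` package. The variable `n_cores` is hypothetical, and leaving one core free is only a common convention, not a requirement.

```r
# detectCores() can return NA on some platforms, so fall back to 1
detected <- parallel::detectCores()
n_cores <- if (is.na(detected)) 1L else max(1L, detected - 1L)
n_cores
# the result could then be passed on, e.g. registerDoMC(cores = n_cores)
```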

Next, we will try a **generalized boosted regression model** (GBM).

```{r cache = TRUE, message=FALSE}
# run the generalized boosted regression model
gbmFit <- train(classe~., data = trainData, method ="gbm", verbose = FALSE)
gbmFit
# use model to predict on validation data set
gbmPred <- predict(gbmFit, validation)
# predicted result
confusionMatrix(gbmPred, validation$classe)
```
From the above, we can see that **randomForest** is the better-performing algorithm, with an out-of-bag (OOB) error rate of **0.43%**, which is ***what we expect the out of sample error rate to be***. When applied to the validation set for cross validation, the model achieved an accuracy of **99.7%**, which indicates an actual error rate of **0.3%**, whereas GBM achieved an accuracy of **96.0%**, an error rate of **4.0%**.
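
The error rates quoted above follow from the accuracies through `error = 1 - accuracy`. A minimal sketch on made-up predictions shows the calculation (the vectors `truth` and `pred` are hypothetical and unrelated to the real results):

```r
# hypothetical true and predicted labels for 10 cases
truth <- factor(c("A", "A", "B", "B", "C", "C", "A", "B", "C", "A"))
pred  <- factor(c("A", "A", "B", "C", "C", "C", "A", "B", "C", "B"),
                levels = levels(truth))
# confusion table: rows are predictions, columns are true classes
conf <- table(Predicted = pred, Actual = truth)
conf
# accuracy is the share of cases on the diagonal; error is the rest
accuracy <- sum(diag(conf)) / sum(conf)
error_rate <- 1 - accuracy
c(accuracy = accuracy, error_rate = error_rate)
```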


## Result

We can now apply the randomForest model to the 20 cases in the given test set. The resulting predictions were all correct.

```{r cache = TRUE}
# apply random forest model to test set
predict(rfFit, test)
```