-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathproject.Rmd
More file actions
211 lines (133 loc) · 6.38 KB
/
project.Rmd
File metadata and controls
211 lines (133 loc) · 6.38 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
# Practical Machine Learning Course Project
## Introduction
The aim of this project is to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants to predict the manner in which they did the exercise. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available [here](http://groupware.les.inf.puc-rio.br/har).
## Data
The training and test data sets are available through the links below : <br/><br/>
[Training Data](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv) <br/>
[Test Data](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv)
## Libraries
```{r eval=FALSE}
library(caret)
library(gbm)
library(randomForest)
````
## Loading and Cleaning Data
We start by loading the training and test data. On loading, some unknown values are transformed to NA values using the `na.strings` function.
```{r eval=FALSE}
#Load Data and convert some values to NA
training <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
testing <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
````
Afterwards, we eliminate all columns containing NA values as they constitute a majority of the values in those columns as we can see from the `summary(training)` function. Columns containing irrelevant information are also stripped off. Same applies for columns containing similar values using the `nearZeroVar` function. In our case, there was no column containing values exhibiting this property.
```{r eval=FALSE}
set.seed(12479)
#Remove columns containing NA values since the majority of values in these columns are NA values
training <- training[ , sapply(training, function(x) !any(is.na(x)))]
testing <- testing[ , sapply(testing, function(x) !any(is.na(x)))]
#Remove columns containing information irrelevant to the prediction model
training <- training[ , -c(1 : 7)]
testing <- testing[ , -c(1 : 7)]
nzv <- nearZeroVar(training[, -ncol(training)])
if(length(nzv) > 0) training <- training[, -nzv]
nzv <- nearZeroVar(testing)
if(length(nzv) > 0) testing <- testing[, -nzv]
````
## Building and Training Models
* **Grading Boosting Machine**
5-fold Cross validation resampling was applied in this case:
```{r eval=FALSE}
modelGBM <- train(classe ~ ., method = "gbm", data = training, trControl = trainControl(method = "cv", number = 5), verbose = FALSE)
modelGBM
```
Stochastic Gradient Boosting
19622 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 15698, 15696, 15697, 15699, 15698
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
1 50 0.752 0.685 0.01339 0.01737
1 100 0.822 0.774 0.00996 0.01275
1 150 0.854 0.815 0.00610 0.00777
2 50 0.858 0.820 0.00888 0.01127
2 100 0.907 0.882 0.00672 0.00849
2 150 0.932 0.914 0.00545 0.00690
3 50 0.895 0.868 0.00612 0.00776
3 100 0.942 0.927 0.00573 0.00724
3 150 0.962 0.952 0.00263 0.00332
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 3 and shrinkage = 0.1.
```{r eval=FALSE}
predictions <- predict(modelGBM, newdata=training)
confusionMatrix(predictions,training$classe)
```
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 5528 87 0 2 4
B 40 3636 84 8 20
C 6 73 3297 90 27
D 5 1 35 3101 35
E 1 0 6 15 3521
Overall Statistics
Accuracy : 0.9725
95% CI : (0.9701, 0.9748)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9652
We recorded a **97.25%** prediction accuracy.
* **Random Forest**
```{r eval=FALSE}
modelRF <- train(classe ~ ., method = "rf", data = training, trControl = trainControl(method = "cv", number = 5), verbose = FALSE)
modelRF
```
Random Forest
19622 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 15698, 15698, 15696, 15697, 15699
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.994 0.993 0.00201 0.00255
27 0.995 0.993 0.00147 0.00186
52 0.990 0.987 0.00326 0.00413
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 27.
```{r eval=FALSE}
predictions <- predict(modelRF, newdata=training)
confusionMatrix(predictions,training$classe)
```
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 5580 0 0 0 0
B 0 3797 0 0 0
C 0 0 3422 0 0
D 0 0 0 3216 0
E 0 0 0 0 3607
Overall Statistics
Accuracy : 1
95% CI : (0.9998, 1)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
We recorded a **100%** prediction accuracy.
## Prediction and Submission
After prediction, the code provided on the course website was used to prepare files for submission.
```{r eval=FALSE}
predictions <- predict(modelRF, testing)
predictions
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(predictions)
```