DataScienceInR/8-MachineLearning.qmd at main · TheoreticalEcology/DataScienceInR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
output: html_document
editor_options:
  chunk_output_type: console
---

```{r, include=FALSE}
set.seed(42)
```

# Machine Learning

The idea of machine learning is to maximize the predictive performance, which means to make predictions of y for new observations:

Let's assume that part of the data is unknown (which will be our test data):

```{r}
airquality = airquality[complete.cases(airquality),]
indices = sample.int(nrow(airquality), 50)
train = airquality[-indices,]
test = airquality[indices,]
```

We first train a model on the train data and then use the model make predictions for new data (test data):

```{r}
model = lm(Ozone~., data = train)
predictions = predict(model, newdata = test)
plot(predictions, test$Ozone, xlab = "Predicted Ozone", ylab = "Observed Ozone")
```

We can calculate the predictive performance (or prediction) error, for example, by using the correlation factor:

```{r}
cor(predictions, test$Ozone)
#R^2:
cor(predictions, test$Ozone)**2
```

So the linear regression model is a great model for inferential statistics because it is interpretable (we exactly know how the variables are connected to the response). Many ML algorithms are not interpretable at all - they trade interpretability for predictive performance.

Random Forest is a famous algorithm which was first described by [Leo Breiman, 2001](https://link.springer.com/article/10.1023/a:1010933404324). The idea of the random forest is to fit hundreds of 'weak' decision trees on bootstrap samples from the original data. Weak means that the decision trees are intentionally 'constrained' in their complexity (by random subsampling the set of features in each split):

```{r}
library(randomForest)
model = randomForest(Ozone~., data = train)
predictions = predict(model, newdata = test)
plot(predictions, test$Ozone, xlab = "Predicted Ozone", ylab = "Observed Ozone")
cor(predictions, test$Ozone)
#R^2:
cor(predictions, test$Ozone)**2
```

Our Random Forest achieved a higher predictive performance! BUT:

```{r}
summary(model)
```

...doesn't tell us anything about effects or p-values. However, compared to other ML algorithms (e.g. artificial neural networks), at least the RF has a variable importance that reports which features were the most important features (similar to an ANOVA):

```{r}
importance(model)
```

For the RF, the Temp was the most important feature, followed by Wind.

::: callout-tip
In ML a slightly different wording is used. Explanatory variables are called features. Datasets with responses with numerical values (e.g. Normal distribution or even Poisson distribution) are called regression tasks. Datasets with categorical responses (e.g. Binomial) are called classification tasks.
:::

## Regression

We call task with a numerical response variable a regression task:

```{r}
indices = sample.int(nrow(airquality), 50)
train = airquality[-indices,]
test = airquality[indices,]

# 1. Fit model on train data:
model = randomForest(Ozone~., data = train)

# 2. Make Predictions
predictions = predict(model, newdata = test)

# 3. Compare predictions with observed values:
## the root mean squared error is commonly used as an error statistic:
sqrt(mean((predictions-test$Ozone)**2))
# Or use a correlationf actor
cor(predictions, test$Ozone)
# Or Rsquared
cor(predictions, test$Ozone)**2

```

## Classification

We call a task with a categorical response variable a classification task (see also multi-class and multi-label classification):

```{r}
indices = sample.int(nrow(iris), 50)
train = iris[-indices,]
test = iris[indices,]

# 1. Fit model on train data:
model = randomForest(Species~., data = train)

# 2. Make Predictions
predictions = predict(model, newdata = test)

# 3. Compare predictions with observed values:
mean(predictions == test$Species) # accuracy

```

96% accuracy, which means only 4% of the observations were wrongly classified by our random forest!

Variable importance:

```{r}
varImpPlot(model)
```

Petal.Width and Petal.Length were the most important predictors!