tidymodels-tutorial/live_session.Rmd at master · emlab-ucsb/tidymodels-tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
---
title: "Tidymodels live session - EcoDataScience"
author: "Gavin McDonald - Environmental Markets Solutions Lab (emLab)"
date: "11/19/2019"
output:
  html_document:
    df_print: paged
---

# Introduction

This live session has been adapted from from Edgar Ruiz's [A Gentle Introduction to tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/). We will run through two end-to-end modeling examples. The first will be a regression example using the `Sacramento` housing prices dataset, and the second will be a classification example using the `GermanCredit` credit score dataset.

I will briefly touch on important steps of the predictive modeling process, but this is *not* meant to be a comprehensive instruction on predictive modeling (*a.k.a.* machine learning). This tutorial is rather just meant to give a flavor of what's possible using the new `tidymodels` universe of packages. For a much more detailed overview of predictive modeling, see Max Kuhn's fantastic [Applied Predictive Modeling](http://appliedpredictivemodeling.com). I'd also recommend [An introduction to statistical learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) by Gareth, James, Hastie and Tibshirani.

# Install necessary pacakges

```{r}
# pacman will help us install any necessary packages
if (!require("pacman")) install.packages("pacman")
# pacman::p_load checks to see if this packages are installed, and installs them if not
pacman::p_load(tidymodels, ranger, randomForest, caret)
```


# Load packages

Load the tidymodels library. This loads a collection of both tidymodels packages, and select tidyverse packages like dplyr, purrr, ggplot2. We will also load caret since it has some very [nice datasets](https://topepo.github.io/caret/data-sets.html) for regression and classification exercises.

```{r}
library(tidymodels)
library(caret)
# Set random number seed to get consistent results
set.seed(101)
```

# Regression example using `Sacramento`

## Load dataset

```{r}
data(Sacramento)

Sacramento
```

## Data sampling

The first step in the modeling process is to split your data into separate training and testing datasets. The model will be trained using the training dataset, and the testing dataset will not be used until you are ready to assess model performance. The `rsample::initial_split` function helps with this initial splitting of the data. The `rsample` includes many other helpful functions for splitting the data for cross-validation, bootstrapping, etc.

```{r}
# Split the dataset, using 75% of the data for training and 25% for testing
housing_split <- Sacramento %>%
  as_tibble() %>%
  dplyr::select(price,type,sqft,beds,baths,latitude,longitude) %>%
  initial_split(prop = 0.75)

# This rsplit object tells you how many observations are used for training, how many for testing, and how many total
housing_split

# The training function can be used to extract the training data from the rsplit object
housing_training <- housing_split %>%
  training()

housing_training

# The testing function can be used to extract the testing data from the rsplit object
housing_testing <- housing_split %>%
  testing()

housing_testing
```

## Data pre-processing

After splitting the data, we will do some data processing. To do this, we will use the `recipe` package. A recipe is a blueprint for how data will be processed. By creating a blueprint, rather than processing data directly, we can apply the same blueprint to training and testing datasets. Importantly, the recipe is defined using only data from the training dataset, which will allow us to see how well the model performs using the testing dataset. Recipe steps can be defined using pipes with a number of sequential steps - there are many many options for recipe steps.

```{r}

housing_recipe <- housing_training %>%
  # Specify regression model formula
  recipe(price ~.)  %>%
  # step_corr removes highly correlated variables
  step_corr(all_numeric(), -all_outcomes()) %>%
  # step_center normalizes data to have a mean of 0
  step_center(all_numeric(), -all_outcomes()) %>%
  # step_scale normalizes data to have a standard deviation of 1
  step_scale(all_numeric(), -all_outcomes()) %>%
  # Create dummy variable columns for all factor columns
  step_dummy(all_predictors(),-all_numeric())

housing_recipe

housing_recipe_prepped <- housing_recipe %>%
  # prep trains the recipe using the training dataset
  prep()

housing_recipe_prepped

# Use use the juice function to apply the prepped recipe to the training dataset
housing_training_juiced <- juice(housing_recipe_prepped)

housing_training_juiced

# We use the bake function to apply the prepped recipe to the testing dataset
housing_testing_baked <- housing_recipe_prepped %>%
  bake(housing_testing)

housing_testing_baked
```

## Model training

Next, we will use the `parsnip` package to define a number of models. Generally, we use parsnip to define 3 things about our model:

1. The type of model (e.g., linear regression or random forest)
2. the mode of the model (e.g., regression or classification)
3. The engine for the model (e.g., `ranger` or `randomForest`)

After we've defined the model in this way, we can use the `fit` function to fit the model.

```{r}
# Define and fit a linear regression model
housing_model_lm <- linear_reg() %>%
  set_engine("lm")

housing_model_lm

housing_fit_lm <- housing_model_lm %>%
  fit(price ~ ., data = housing_training_juiced)

housing_fit_lm

# Define and fit a random forest regression model using the randomForest engine/package
housing_model_randomForest <-  rand_forest(trees = 100, mode = "regression") %>%
  set_engine("randomForest")

housing_model_randomForest

housing_fit_randomForest <- housing_model_randomForest %>%
  fit(price ~ ., data = housing_training_juiced)

housing_fit_randomForest

# Define and fit a random forest regression model using the ranger engine/package
housing_model_ranger <- rand_forest(trees = 100, mode = "regression") %>%
  set_engine("ranger")

housing_model_ranger

housing_fit_ranger <- housing_model_ranger %>%
  fit(price ~ ., data = housing_training_juiced)

housing_fit_ranger
```

Once we have the model fits, we can use the `predict` function to generate our predictions for our testing dataset. The predict function always produces a dataframe with the same number of rows as observations. Because of this, `bind_cols` can be used to bind the predictions to the original dataframe

```{r}
# Generate predictions for our testing using the ranger model
predict(housing_fit_ranger, housing_testing_baked)

# Add these ranger predictions to the testing dataset
# parsnip::predict always gives you same number of rows as data, so predictions can be added using bind_cols
housing_fit_ranger %>%
  predict(housing_testing_baked) %>%
  bind_cols(housing_testing_baked)

# Save this combined dataframe for later
housing_ranger_predict <- housing_fit_ranger %>%
  predict(housing_testing_baked) %>%
  bind_cols(housing_testing_baked) %>%
  # Add a column for model name
  mutate(model_name = "ranger")

# Let's do the same thing for the linerar regression model
housing_lm_predict <- housing_fit_lm %>%
  predict(housing_testing_baked) %>%
  bind_cols(housing_testing_baked) %>%
  mutate(model_name = "lm")

# Let's do the same thing for the randomForest model
housing_randomForest_predict <- housing_fit_randomForest %>%
  predict(housing_testing_baked) %>%
  bind_cols(housing_testing_baked) %>%
  mutate(model_name = "randomForest")

# Let's combine all of these datasets so we can look at them side-by-side
housing_all_predict <- bind_rows(housing_lm_predict,
                              housing_ranger_predict,
                              housing_randomForest_predict)
```

Let's just look and see how our predictions line up with the observed values in our testing dataset.

```{r}
housing_all_predict %>%
  ggplot(aes(x = price,y=.pred,color=model_name)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Observed price",
       y = "Predicted price",
       title = "Predictions vs observed values for 3 model types\nA simple linear regression is overlaid") +
  coord_equal()
```


## Model performance assessment

Using our predictions from the `parsnip` package, we can use the `yardstick` package to generate model performance metrics. This can be done using the `metrics` function, which generates a default metric set (for a regression model, these are root mean squared error or `rsme`, r-squared or `rsq`, and mean absolute error or `mae`; for a classification model, these are `accuracy` and Kappa or `kap`). You can also define a custom set of metrics using `metric_set`, and there are also individual functions for all metric types.

```{r}
housing_ranger_predict %>%
  # Here we use the metric functions and must define the truth value and the estimated prediction
  metrics(truth = price, estimate = .pred)
```

When the predictions are in a dataframe, we can group by model type and calculate metrics by group

```{r}
housing_all_predict%>%
  group_by(model_name) %>%
  metrics(truth = price, estimate = .pred)

housing_all_predict%>%
  group_by(model_name) %>%
  metrics(truth = price, estimate = .pred)%>%
  ggplot(aes(x = model_name, y = .estimate)) +
  geom_bar(stat="identity") +
  facet_wrap(.~.metric,scales="free") +
  labs(x = "Model name",
       y = "Model performance metric estimate",
       title = "Model performance metrics for 3 model types")
```

We can also do what we just did in a much more tidy fashion, while also keeping the model specifications, model fits, model predictions, and model metrics all in a single dataframe. This ensures that things stay together, and makes it very easy to extract summary statistics or plots. `purrr::map` and list columns makes this all possible. We could apply this same approach to build and test many models for cross-validation, for hyperparameter tuning, etc.

```{r}
# Define a tibble using model names and their associated specifications
all_models <-
  tibble(model_name = "lm",
         model = list(housing_model_lm)) %>%
  add_row(model_name = "ranger",
          model = list(housing_model_ranger)) %>%
  add_row(model_name = "randomForest",
          model = list(housing_model_randomForest))

all_models

all_model_results <- all_models %>%
  # Add a column for model fits
  mutate(model_fit = purrr::map(model,
                                ~fit(.x, price ~ ., data = housing_training_juiced)),
         # Add a column for predictions
         model_predictions = purrr::map(model_fit,
                                        ~.x %>%
                                          predict(housing_testing_baked) %>%
                                          bind_cols(housing_testing_baked)),
         # Add a column for model metrics
         model_metrics = purrr::map(model_predictions,
                                    ~metrics(.x, truth = price, estimate = .pred)))

all_model_results

# This plot is the same as the one we made above
all_model_results %>%
  unnest(model_metrics) %>%
  ggplot(aes(x = model_name, y = .estimate)) +
  geom_bar(stat="identity") +
  facet_wrap(.~.metric,scales="free") +
  labs(x = "Model name",
       y = "Model performance metric estimate",
       title = "Model performance metrics for 3 model types")
```

# Classification example using `GermanCredit`

Let's also go through a classification example using the `GermanCredit` dataset from `caret`. Now we will try to predict credit rating (good or bad) using a number of predictors.

## Load packages
```{r}
data(GermanCredit)

GermanCredit
```

## Data pre-processing
``` {r}
# Split the credit dataset, using 75% of the data for training and 25% for testing, stratified by credit class
# This maintains the ratio of Good and Bad credit classes in both the training and testing datasets
credit_split <- GermanCredit %>%
  as_tibble() %>%
  # Convert most columns to factors since they are binaries
  mutate_at(vars(-Duration,-Amount,-InstallmentRatePercentage,-ResidenceDuration,-Age,-NumberExistingCredits,-NumberPeopleMaintenance),
            as.factor) %>%
  initial_split(prop = 0.75, strata = "Class")

# The training function can be used to extract the training data from the rsplit object
credit_training <- credit_split %>%
  training()

# The testing function can be used to extract the testing data from the rsplit object
credit_testing <- credit_split %>%
  testing()

credit_recipe <- credit_training %>%
  recipe(Class ~.) %>%
  # Remove all near-zero variance predictors, such as factors with only one level
  step_nzv(all_predictors()) %>%
  # step_corr removes highly correlated variables
  step_corr(all_numeric()) %>%
  # step_center normalizes data to have a mean of 0
  step_center(all_numeric()) %>%
  # step_scale normalizes data to have a standard deviation of 0
  step_scale(all_numeric())%>%
  # Make all factors dummy columns
  step_dummy(all_nominal(),-all_outcomes())

credit_recipe_prepped <- credit_recipe %>%
  # prep trains the recipe using the training dataset
  prep()

# Use use the juice function to apply the prepped recipe to the training dataset
credit_training_juiced <- juice(credit_recipe_prepped)

# We use the bake function to apply the prepped recipe to the testing dataset
credit_testing_baked <- credit_recipe_prepped %>%
  bake(credit_testing)
```

## Model training

```{r}
# Define and fit a random forest regression model using the randomForest engine/package
credit_model_randomForest <-  rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

# Define and fit a random forest regression model using the ranger engine/package
credit_model_ranger <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("ranger")


# Define a tibble using model names and their associated specifications
all_models_credit <-
  tibble(model_name = "ranger",
          model = list(credit_model_ranger)) %>%
  add_row(model_name = "randomForest",
          model = list(credit_model_randomForest))

all_model_results_credit <- all_models_credit %>%
  # Add a column for model fits
  mutate(model_fit = purrr::map(model,
                                ~fit(.x, Class ~ ., data = credit_training_juiced)),
         # Add a column for class predictions
         model_predictions_class = purrr::map(model_fit,
                                        ~.x %>%
                                          predict(credit_testing_baked, type = "class") %>%
                                          bind_cols(credit_testing_baked)),
         # Add a column for model metrics from class predictions
         model_metrics = purrr::map(model_predictions_class,
                                    ~metrics(.x, truth = Class, estimate = .pred_class)),
         # Add a column for probability predictions
         model_predictions_prob = purrr::map(model_fit,
                                              ~.x %>%
                                                predict(credit_testing_baked, type="prob") %>%
                                                bind_cols(credit_testing_baked)),
         # Add ROC curves from probability predictions
         roc_curves = purrr::map(model_predictions_prob,
                                ~roc_curve(.x, Class, .pred_Good)))
```

## Model performance assessment

```{r}
# Let's plot the performance metrics by model type
all_model_results_credit %>%
  unnest(model_metrics) %>%
  ggplot(aes(x = model_name, y = .estimate)) +
  geom_bar(stat="identity") +
  facet_wrap(.~.metric,scales="free") +
  labs(x = "Model name",
       y = "Model performance metric estimate",
       title = "Model performance metrics for 2 model types")

# Let's also plot ROC curves by model type
all_model_results_credit %>%
  unnest(roc_curves) %>%
  ggplot(aes(x = 1-specificity,y=sensitivity,color=model_name)) +
  geom_line() +
  geom_abline(slope=1) +
  labs(title = "Receiver Operating Characteristic (ROC) curves for 2 model types",
    x = "False positive rate\n[1 - specificity = 1 - TN/(TN + FP)]",
    y = "True positive rate\n[recall = sensitivity = TP / (TP + FN)]")

```