process-models-workshop/ml_inference_regression.qmd at main · role-model/process-models-workshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
# ML parameter estimation with MESS simulations

## Key questions

1. How do I estimate parameter values for continuous MESS model parameters?
1. How do I evaluate the uncertainty of ML parameter estimates?

## Lesson objectives

After this lesson, learners should be able to...

1. Use simulations and randomForest ML to perform parameter estimation for
key MESS model parameters.
1. Evaluate uncertainty in parameter estimation accuracy using cross-validation
simulations.
1. Apply ML parameter estimation to empirical data and interpret results in
terms of story
1. Brainstorm applications to real data

## Planned exercises

* Motivating MESS process-model parameter estimation
* Implement ML parameter estimation
* Hands-on Exercise: Predicting MESS parameters of mystery simulations
* Posterior predictive simulations (if time)

### Motivating MESS process-model parameter estimation

Now that we have identified the neutral model as the most probable, we can
estimate parameters of the emipirical data given this model. Essentially,
we are asking the question "What are the parameters of the model that
generate summary statistics most similar to those of the empirical data?"

### Implement ML parameter estimation

Let's start with a new clean Rscript, so create a new file and save it as
"MESS-Regression.R", then load the necessary libraries and simulated data.

#### Load libraries, setwd, and load the simulated data
```R
library(randomForest)
library(caret)
library(reticulate)

setwd("/home/rstudio/MESS-inference")
simdata = MESS$load_local_sims("MESS-SIMOUT.csv")[[1]]
```

#### Extract sumulations generated under the 'filtering' assembly model
Let's pretend that the most probable model from the classification procedure
was 'environmental filtering'. If our goal is to estimate parameters under this
model then we want to _exclude_ simulations from the ml inference that do not
fit our most probable model. This can be achieved by selecting rows in the
`simdata` data.frame that have "filtering" as the `assembly_model`.
```R
simdata = simdata[simdata$assembly_model == "filtering", ]
simdata
```

#### Split train and test data as normal
Again, we'll split the data into training and testing sets.

```R
tmp <- sample(2, nrow(simdata), replace = TRUE, prob = c(0.7, 0.3))
train <- simdata[tmp==1,]
test <- simdata[tmp==2,]
```

#### Train the ML regressor as normal
Train the ml model to perform regression, as our focal dependent variable takes
continuous values. The `randomForest` package auto-detects if the dependent
variable is continuous or categorical, so there's nothing more to do there. We
will use the same formula as before, specifying `local_S` and the first Hill
number on each axis of biodiversity as the predictor variables.

```R
rf <- randomForest(colrate ~ local_S + pi_h1 + abund_h1 + trait_h1, data=train, proximity=TRUE)
rf
```

#### Plot results of predictions for colrate of held-out training set
For a regression analysis we can't use a confusion matrix because the response
is continuous, so instead we evaluate prediction accuracy by making a scatter
plot showing predictions as a function of known simulated values. With perfect
prediction accuracy the points in the scatter plot would fall along the identity
line.

```R
preds <- predict(rf, test)
plot(preds, test$colrate)
```

![Simulated and predicted colonization rates](images/MESS-Regression-PredictionScatter.png)

::: {.callout-tip}
## What can we say about the accuracy of ml prediction of colonization rate?

:::

::: {.callout-warning collapse="true"}
## Prediction accuracy should be quantified with a formal method.

What we are doing here is a very qualitative, ad hoc evaluation of prediction
accuracy. This is another case where the full details of performing ml inference
are slightly out of scope for this workshop, given the amount of time we have.
In a real analysis one should more formally quantify this with mean squared
error, or mean absolute error, or pseudo-R^2, some of which are provided in
the `randomForest` package, e.g.:

```R
rf <- randomForest(colrate ~ local_S + pi_h1 + abund_h1 + trait_h1, data=train, proximity=TRUE)
mean(rf$rsq)
```
```default
0.6219401
```

:::

#### Predict `colrate` of test simulation and plot distribution of predictions
Now we can practice making predictions for a _single_ simulation and looking
at uncertainty in the prediction. Like we did before we can select one simulation
by using the `test [#, ]` row selection strategy on our `test` data.frame.

When we ask for `predict.all=TRUE` this will return the prediction value for each
tree in the rf, and we can plot the aggregation of these predictions using `hist`
to show a histogram.

::: {.callout-tip collapse="true"}
## Return values of `predict()` for regression

When `predict.all=TRUE` the `predict()` function returns a list with 2 elements:

* The predicted value (point estimate)
* A vector of predicted values for each tree in the forest
:::

```R
preds <- predict(rf, test[2, ], predict.all=TRUE)
# The predicted value is element [[1]]
print(preds[[1]])
# A vector of predictions for each tree in the forest [[2]]
hist(preds[[2]])
```
```default
0.001191852
```

![Distribution of colonization rate predictions for a single simulation](images/MESS-Regression-EmpiricalUncertainty.png)

**What can we say about the uncertainty on parameter estimation for this one
simulation?**

### Hands-on Exercise: Predicting MESS parameters of mystery simulations
Now see if you can load the mystery simulations and do the regression inference
on one or a couple of them. Try to do this on your own, but if you get stuck
you can check the hint here:

::: {.callout-tip collapse="true"}
## Hints: Loading the mystery sims and making predictions

```R
## Importing the mystery data
mystery_sims = read.csv("/home/rstudio/MESS-inference/Mystery-data.csv")

## Predict for mystery sim 1
myst_sim = 1
mystery_pred = predict(rf, mystery_sims[myst_sim, ])
print(mystery_pred)
```
:::

A link to the key containing the true `colrate` values is hidden below.
Don’t peek until you have a guess for your simulated data! How close did you get?

::: {.callout-warning collapse="true"}
## The Key is here

[Assembly models per mystery simulation are in this file](data/Mystery-key.csv)
:::

::: {.callout-tip collapse="true"}
## More to explore if you are ahead of the group

If you are ahead of the group you will see that the "key" also contains the
true values of `J` and `ecological_strength` so you might try performing
the regression inference on these as well. Are either of these parameters
harder or easier to estimate than `colrate`? Why do you think that is?

You might also try manipulating the predictor variables before training the
ml model and evaluating the impact on performance. For example, try only using
abundance Hill numbers, or only genetics Hill numbers. Try using *all* the Hill
numbers for all the data axes. How does the composition of the predictor
variables change the ml parameter estimation accuracy?

### Posterior predictive simulations (if time)

Posterior predictive simulations are a good practice to assess the goodness
of fit of the model to the data. Essentially this involves running new simulations
setting MESS model parameters to the most probable estimates from the ML
inference, and then evaluating the fit of the new simulations to the empirical
data, typically by projecting summary statistics of the simulated and observed
data and checking that they all fall within a point cloud.

We probably won't have time to get to this, but it's a good practice in fitting
ml models to real data.
:::


## Key points
* Machine learning models can be used to "estimate parameters" of continuous
response variables given multi-dimensional input data.
* Major components of machine learning inference include gathering and
transforming data, splitting data into training and testing sets, training the
ml model, and evaluating model performance.
* ML inference is **inherently** uncertain, so it is important to evaluate
uncertainty in ml prediction accuracy and propagate this into any downstream
interpretation when applying ml to empirical data.