Conversation
864474b to
c7575a7
Compare
dajmcdon
left a comment
There was a problem hiding this comment.
I made it partway. It's looking good so far!
| ..., | ||
| role = NA, | ||
| trained = FALSE, | ||
| lambdas = NULL, |
There was a problem hiding this comment.
If NULL is the only possible value here, then it should be removed as an arg to this function. (You can set it to NULL in the output and keep it in the constructor function below.)
| limits = c(-5, 5), | ||
| num_unique = 5, | ||
| na_rm = TRUE, | ||
| epi_keys_checked = NULL, |
There was a problem hiding this comment.
What are these used for? Probably similar to lambdas that if the user can't set them, they shouldn't be exposed.
Is this intended to allow for other epikeys beside geo_value? Grouping across them? If so, the name and default should match other step_epi_*() functions.
There was a problem hiding this comment.
They're exposed in the recipes step this is based on. Not sure why that exposes λ if you're not supposed to set it https://recipes.tidymodels.org/reference/step_YeoJohnson.html
There was a problem hiding this comment.
(I think David's comment was meant for the lambdas = NULL comment above?
But yea,
lambdas = NULL,
limits = [c](https://rdrr.io/r/base/c.html)(-5, 5),
num_unique = 5,
na_rm = TRUE,
are copied from the original step. I don't seriously expect users to want to overwrite with their own lambdas though, so happy to remove that, if we don't care to keep the interface the same.)
epi_keys_checked ... I just lazily copied from step_adjust_ahead for an easy way to get access to recipes$template in prep and bake. I don't really anticipate grouping by anything other than geo_value + other_keys, so I'll remove it and find the epikeys another way.
| filtered_data %>% | ||
| mutate(cases = log(cases)) %>% | ||
| ggplot(aes(time_value, cases)) + | ||
| geom_line(color = "blue") + | ||
| geom_line(data = out1 %>% mutate(cases = log(cases)), | ||
| aes(time_value, cases), color = "green") + | ||
| geom_line(data = out2 %>% mutate(cases = log(cases)), | ||
| aes(time_value, cases), color = "red") + | ||
| facet_wrap(~geo_value, scales = "free_y") + | ||
| theme_minimal() + | ||
| labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases") |
There was a problem hiding this comment.
| filtered_data %>% | |
| mutate(cases = log(cases)) %>% | |
| ggplot(aes(time_value, cases)) + | |
| geom_line(color = "blue") + | |
| geom_line(data = out1 %>% mutate(cases = log(cases)), | |
| aes(time_value, cases), color = "green") + | |
| geom_line(data = out2 %>% mutate(cases = log(cases)), | |
| aes(time_value, cases), color = "red") + | |
| facet_wrap(~geo_value, scales = "free_y") + | |
| theme_minimal() + | |
| labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases") | |
| all_together <- rbind( | |
| filtered_data %>% | |
| mutate(name = "raw"), | |
| out1 %>% mutate(name = "yeo-johnson"), | |
| out2 %>% mutate(name = "quarter-root") | |
| ) | |
| all_together %>% | |
| ggplot(aes(time_value, cases, color = name)) + | |
| geom_line() + | |
| facet_grid(~geo_value, scales = "free_y") + | |
| theme_minimal() + | |
| labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases") + | |
| scale_y_log10() |
This will generate an actual legend and makes toggling log-scale easier.
As for the result of the transform, the difference in scale between NY and CA is a bit confusing tbh. Not sure why NY is scaled so much more aggressively; is it because of the literal actual zero? For practical purposes we'd probably want to smooth this dataset anyways
There was a problem hiding this comment.
also the blip where it goes literally negative in CA in early July is a bit concerning
There was a problem hiding this comment.
looked into it, it's an actual negative value in the raw signal
filtered_data %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type = state
* time_type = day
* as_of = 2024-03-20
# A tibble: 1 × 3
geo_value time_value cases
<chr> <date> <dbl>
1 ca 2021-06-29 -3940
> out1 %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type = state
* time_type = day
* as_of = 2024-03-20
# A tibble: 1 × 3
geo_value time_value cases
<chr> <date> <dbl>
1 ca 2021-06-29 -11320.
> out2 %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type = state
* time_type = day
* as_of = 2024-03-20
# A tibble: 1 × 3
geo_value time_value cases
<chr> <date> <dbl>
1 ca 2021-06-29 NaN
There was a problem hiding this comment.
Yea it's a huge data anomaly. I wonder if we should fix this test dataset. On the one hand, it's educational of data reality, on the other hand, what the hell.
There was a problem hiding this comment.
Yeah I take as a demo that it will do ~ the right thing in the presence of both positive and negative values, just wanted to check it was legit and not an artifact of implementation problems.
The difference in scaling for NY & CA is a bit confusing. I forget exactly what the optimization routine is minimizing in its choice of lambda; iirc it's literally the variance? Probably worth noting in the description of the step
* step and layer work with a single outcome and layer_yj(.pred) * need to work on multiple outcomes case
Co-authored-by: Daniel McDonald <dajmcdon@gmail.com>
Co-authored-by: Daniel McDonald <dajmcdon@gmail.com>
Co-authored-by: Daniel McDonald <dajmcdon@gmail.com>
Co-authored-by: David Weber <david.weber2@pm.me>
Co-authored-by: David Weber <david.weber2@pm.me>
Co-authored-by: David Weber <david.weber2@pm.me>
Co-authored-by: David Weber <david.weber2@pm.me>
Co-authored-by: David Weber <david.weber2@pm.me>
Co-authored-by: David Weber <david.weber2@pm.me>
Co-authored-by: David Weber <david.weber2@pm.me>
|
Replaced with cmu-delphi/epipredict#451 |
close cmu-delphi/epipredict#457
TODO
termsinterface in a layer)