Discuss deviance calculation with replicates; clarify saturation vs. full parametrization #428
…ametrization Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/5ec46eee-86e6-497e-a280-030b6ea86579 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
… deviance Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/5ec46eee-86e6-497e-a280-030b6ea86579 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/0096c260-f494-46d4-8567-a0443503044f Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
…rametrized model for ungrouped data Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/14b8d079-55f6-4d4d-a497-07ca47d8afe2 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
…hy chi^2 GOF test fails for ungrouped data Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/ec68eba3-8c76-4f8c-a9fd-422947591440 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
… size; add note about grouped data workaround Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/c5a6c684-cb62-4fb7-9c4c-a0b832f21613 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
@copilot let's refer readers to https://stats.stackexchange.com/questions/626597/is-there-a-justification-for-the-bernoulli-deviance-in-the-r-stats-package and https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_9#Sec10 for more information
…ed deviance discussion Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/bb2d12d1-c32c-46b1-aeba-b1aecfa444ab Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
Added in commit 86de9cb. The callout now ends with:
The Springer link ( |
Pull request overview
Clarifies how deviance is computed/interpreted when covariate patterns repeat (replicates), with emphasis on the distinction between saturated vs. fully parametrized reference models and how R’s deviance() behaves for grouped vs. ungrouped Binomial/Bernoulli data.
Changes:
- Refines the Gaussian deviance discussion to be in terms of distinct covariate patterns and adds a new subsection contrasting saturated vs. fully parametrized models under replicates.
- Adds narrative + R examples demonstrating `deviance(fit) == -2*logLik(fit)` for ungrouped Bernoulli data with repeated patterns, and explains implications for GOF testing.
- Extends `inst/WORDLIST` with additional technical terms/acronyms.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `_sec_compare_lm_glm_deviance.qmd` | Adds the replicates-focused explanation and examples; updates the summary table to reflect the clarified deviance reference. |
| `inst/WORDLIST` | Adds new allowlisted technical tokens used by the updated prose/math. |

    fp
    frac
    github
    GOF
    infty
    leq

inst/WORDLIST already contains GitHub, but this change adds github (lowercase) and also introduces duplicate entries for frac/github later in the file. Duplicates/inconsistent casing can make spellcheck behavior harder to reason about and may allow incorrect capitalization in prose; consider removing the duplicates and keeping only the preferred casing (e.g., GitHub).

    By convention, terms with $y_k = 0$ or $y_k = n_k$ contribute zero to the sum,
    since $0\log(0) = 0$ in the limit.

The note says patterns with $y_k = 0$ or $y_k = n_k$ “contribute zero to the sum,” but in the Binomial deviance expression only the $0 \cdot \log(0)$ component is taken as 0; the other term generally remains nonzero (e.g., if $y_k = 0$, the $(n_k - y_k)\log\left(\frac{1 - y_k/n_k}{1 - \hat\pi_k}\right)$ term is still present). Reword to avoid implying the entire bracketed contribution is 0.

Suggested change:

    - By convention, terms with $y_k = 0$ or $y_k = n_k$ contribute zero to the sum,
    - since $0\log(0) = 0$ in the limit.
    + By convention,
    + when $y_k = 0$ or $y_k = n_k$,
    + the corresponding component of the form $0\log(0)$ is taken to equal 0
    + in the limiting sense.
    + The other component in the bracket generally remains nonzero
    + and still contributes to the sum.

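
A worked boundary case may make the reviewer's point concrete. This is only a sketch: it assumes the per-pattern bracket has the usual Binomial form, where the first term is an assumption (it is not quoted in this thread) and the second term is the one quoted in the comment above.

```latex
% Assumed per-pattern contribution to the deviance
% (first term is an assumption; second term is as quoted in the review comment):
d_k = 2\left[ y_k \log\frac{y_k/n_k}{\hat\pi_k}
            + (n_k - y_k)\log\frac{1 - y_k/n_k}{1 - \hat\pi_k} \right]

% Case y_k = 0: the first term is 0 \cdot \log(0) := 0, the second term survives.
d_k = 2\left[ 0 + n_k \log\frac{1}{1 - \hat\pi_k} \right]
    = -2\, n_k \log(1 - \hat\pi_k) > 0
    \quad \text{whenever } \hat\pi_k > 0

% Case y_k = n_k: the second term is 0 \cdot \log(0) := 0, the first term survives.
d_k = 2\left[ n_k \log\frac{1}{\hat\pi_k} + 0 \right]
    = -2\, n_k \log \hat\pi_k > 0
    \quad \text{whenever } \hat\pi_k < 1
```

Under a logistic fit, $0 < \hat\pi_k < 1$, so both boundary cases contribute a strictly positive amount to the sum.
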

    When patterns do repeat ($q < n$), the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern,
    giving $\ell_{\text{sat}} < 0$.

    `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
    against the $\chi^2$ distribution when $q < n$.
    The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
    but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
    These two quantities differ by $-2\ell_{\text{sat}} > 0$ whenever $q < n$.

“When patterns do repeat (

Suggested change:

    - When patterns do repeat ($q < n$), the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern,
    - giving $\ell_{\text{sat}} < 0$.
    - `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
    - against the $\chi^2$ distribution when $q < n$.
    - The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
    - but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
    - These two quantities differ by $-2\ell_{\text{sat}} > 0$ whenever $q < n$.
    + When patterns do repeat ($q < n$), the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern.
    + If at least one pattern has both outcomes
    + ($0 < y_k < n_k$),
    + then $\ell_{\text{sat}} < 0$.
    + `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
    + against the $\chi^2$ distribution when $q < n$.
    + The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
    + but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
    + These two quantities differ by $-2\ell_{\text{sat}} > 0$
    + whenever at least one repeated pattern has both outcomes
    + ($0 < y_k < n_k$).


    `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
    against the $\chi^2$ distribution when $q < n$.
    The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
    but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).

The claim that ungrouped deviance() “cannot be used as a goodness-of-fit test against the

Suggested change:

    - `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
    - against the $\chi^2$ distribution when $q < n$.
    - The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
    - but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
    + When $q < n$, `deviance()` for ungrouped data is **not** the correct statistic
    + for the covariate-pattern goodness-of-fit test based on $\chi^2(q - p)$.
    + For that lack-of-fit test, the correct statistic is
    + $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
    + but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$
    + (using $\ell_{\text{fp}} = 0$).

`deviance() == -2*logLik()` for ungrouped data with repeated covariate patterns, demonstrating that R uses … `deviance()` returns …
`ungrouped`, `leq`, `vx`, `fp`, `infty`, `GOF` to `inst/WORDLIST`