Discuss deviance calculation with replicates; clarify saturation vs. full parametrization #428

Merged
d-morrison merged 11 commits into main from copilot/discuss-devience-calculation-replicates
Apr 11, 2026

Conversation

Contributor

Copilot AI commented Apr 10, 2026

  • Fix the Gaussian deviance section to say "one free parameter per distinct covariate pattern"
  • Add new subsection "Saturated vs. fully parametrized models with replicates"
  • Add note about $0\log(0) = 0$ convention for boundary cases in Binomial deviance formula
  • Clarify why $\ell_{\text{fp}} = 0$ for ungrouped Bernoulli data: R uses the fully parametrized model as reference (one param per observation), not the saturated model (one per pattern); they coincide only when $q = n$
  • Add callout noting that $\ell_{\text{sat}} < 0$ when covariate patterns repeat ($q < n$) and at least one pattern has mixed outcomes ($0 < y_k < n_k$): the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern, whereas R's ungrouped deviance uses the fully parametrized reference in all cases
  • Add R code chunk verifying deviance() == -2*logLik() for ungrouped data with repeated covariate patterns, demonstrating that R uses $\ell_{\text{fp}} = 0$ as its reference
  • Correct GOF test explanation: the fundamental issue is not that $n_k$ never grows large (it can when patterns repeat), but that R's ungrouped deviance() returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$) rather than $2(\ell_{\text{sat}} - \ell(\hat\beta))$. The two differ by $-2\ell_{\text{sat}}$, which is positive whenever some repeated pattern has mixed outcomes, so R's value is the wrong statistic to compare against $\chi^2(q - p)$ regardless of $n_k$. Add note to use grouped data for a valid GOF test
  • Add references to Dunn & Smyth (2018) "Generalized Linear Models with Examples in R" (Chapter 9) and the Stats Stack Exchange thread on Bernoulli deviance in R for further reading
  • Update Summary table with corrected deviance formula
  • Add ungrouped, leq, vx, fp, infty, GOF to inst/WORDLIST
  • Lint: no issues; Spellcheck: no errors
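
The distinction drawn in the list above between the fully parametrized and saturated references can be checked numerically. The following is a minimal sketch in Python rather than R, on a small made-up dataset: the pattern labels `x`, the outcomes `y`, and the helper names are hypothetical and do not come from the PR. It assumes the ungrouped Bernoulli log-likelihood (no binomial coefficients) with the $0\log(0) = 0$ convention.

```python
import math
from collections import defaultdict


def xlogy(x, y):
    """Return x * log(y), taking 0 * log(0) to be 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_loglik(y, p):
    """Ungrouped Bernoulli log-likelihood of outcomes y at probabilities p."""
    return sum(xlogy(yi, pi) + xlogy(1 - yi, 1 - pi) for yi, pi in zip(y, p))


# Hypothetical data: covariate-pattern label and 0/1 outcome per observation.
x = ["a", "a", "a", "b", "b", "c"]
y = [1, 0, 1, 0, 0, 1]

# Fully parametrized reference: one parameter per observation, pi_i = y_i.
ell_fp = bernoulli_loglik(y, y)

# Saturated reference: one parameter per pattern, pi_k = y_k / n_k.
totals = defaultdict(lambda: [0, 0])  # pattern -> [successes, trials]
for xi, yi in zip(x, y):
    totals[xi][0] += yi
    totals[xi][1] += 1
pi_sat = [totals[xi][0] / totals[xi][1] for xi in x]
ell_sat = bernoulli_loglik(y, pi_sat)

print(ell_fp, ell_sat)
```

Here `ell_fp` is exactly 0, while `ell_sat` is negative because pattern "a" has mixed outcomes; $-2\ell_{\text{sat}}$ is then the gap between the two deviance definitions.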

Copilot AI and others added 2 commits April 10, 2026 08:42
…ametrization

Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/5ec46eee-86e6-497e-a280-030b6ea86579

Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
… deviance

Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/5ec46eee-86e6-497e-a280-030b6ea86579

Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Discuss how deviance is calculated for LMs and GLMs with replicates" to "Discuss deviance calculation with replicates; clarify saturation vs. full parametrization" Apr 10, 2026
Copilot AI requested a review from d-morrison April 10, 2026 08:45
Comment thread _sec_compare_lm_glm_deviance.qmd Outdated
Copilot AI requested a review from d-morrison April 10, 2026 09:16
…rametrized model for ungrouped data

Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/14b8d079-55f6-4d4d-a497-07ca47d8afe2

Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
Comment thread _sec_compare_lm_glm_deviance.qmd
Comment thread _sec_compare_lm_glm_deviance.qmd Outdated
…hy chi^2 GOF test fails for ungrouped data

Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/ec68eba3-8c76-4f8c-a9fd-422947591440

Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
Copilot AI requested a review from d-morrison April 10, 2026 09:43
Comment thread _sec_compare_lm_glm_deviance.qmd Outdated
Copilot AI requested a review from d-morrison April 10, 2026 10:00
@d-morrison
Owner

…ed deviance discussion

Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/bb2d12d1-c32c-46b1-aeba-b1aecfa444ab

Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 10, 2026

@copilot let's refer readers to https://stats.stackexchange.com/questions/626597/is-there-a-justification-for-the-bernoulli-deviance-in-the-r-stats-package and https://link.springer.com/chapter/10.100...

Added in commit 86de9cb. The callout now ends with:

For further discussion, see @dunn2018generalized [Chapter 9] and this Stats Stack Exchange thread.

The Springer link (978-1-4419-0118-7) corresponds to Dunn & Smyth (2018) "Generalized Linear Models with Examples in R", which was already in references.bib as dunn2018generalized.

@github-actions
Contributor

PR Preview Action v1.8.1
🚀 Preview available at: https://d-morrison.github.io/rme/pr-preview/pr-428/

Built to branch gh-pages at 2026-04-10 19:21 UTC.

@d-morrison d-morrison marked this pull request as ready for review April 11, 2026 00:17
Copilot AI review requested due to automatic review settings April 11, 2026 00:17
@d-morrison d-morrison merged commit 81d682a into main Apr 11, 2026
4 checks passed
@d-morrison d-morrison deleted the copilot/discuss-devience-calculation-replicates branch April 11, 2026 00:17
Contributor

Copilot AI left a comment

Pull request overview

Clarifies how deviance is computed/interpreted when covariate patterns repeat (replicates), with emphasis on the distinction between saturated vs. fully parametrized reference models and how R’s deviance() behaves for grouped vs. ungrouped Binomial/Bernoulli data.

Changes:

  • Refines the Gaussian deviance discussion to be in terms of distinct covariate patterns and adds a new subsection contrasting saturated vs. fully parametrized models under replicates.
  • Adds narrative + R examples demonstrating deviance(fit) == -2*logLik(fit) for ungrouped Bernoulli data with repeated patterns, and explains implications for GOF testing.
  • Extends inst/WORDLIST with additional technical terms/acronyms.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| _sec_compare_lm_glm_deviance.qmd | Adds the replicates-focused explanation and examples; updates the summary table to reflect the clarified deviance reference. |
| inst/WORDLIST | Adds new allowlisted technical tokens used by the updated prose/math. |

Comment thread inst/WORDLIST
Comment on lines +26 to +31
fp
frac
github
GOF
infty
leq
Copilot AI Apr 11, 2026

inst/WORDLIST already contains GitHub, but this change adds github (lowercase) and also introduces duplicate entries for frac/github later in the file. Duplicates/inconsistent casing can make spellcheck behavior harder to reason about and may allow incorrect capitalization in prose; consider removing the duplicates and keeping only the preferred casing (e.g., GitHub).

Comment on lines +198 to +199
By convention, terms with $y_k = 0$ or $y_k = n_k$ contribute zero to the sum,
since $0\log(0) = 0$ in the limit.
Copilot AI Apr 11, 2026

The note says patterns with y_k = 0 or y_k = n_k “contribute zero to the sum,” but in the Binomial deviance expression only the 0·log(0) component is taken as 0; the other term generally remains nonzero (e.g., if y_k=0, the (n_k-y_k) log((1-y_k/n_k)/(1-\hat\pi_k)) term is still present). Reword to avoid implying the entire bracketed contribution is 0.

Suggested change
- By convention, terms with $y_k = 0$ or $y_k = n_k$ contribute zero to the sum,
- since $0\log(0) = 0$ in the limit.
+ By convention,
+ when $y_k = 0$ or $y_k = n_k$,
+ the corresponding component of the form $0\log(0)$ is taken to equal 0
+ in the limiting sense.
+ The other component in the bracket generally remains nonzero
+ and still contributes to the sum.
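
The reviewer's point can be illustrated with a single boundary pattern. This is a hedged sketch, assuming the standard Binomial deviance term $2\left[y_k\log\frac{y_k/n_k}{\hat\pi_k} + (n_k - y_k)\log\frac{1 - y_k/n_k}{1 - \hat\pi_k}\right]$; the function name and the values $y_k = 0$, $n_k = 5$, $\hat\pi_k = 0.2$ are made up for illustration.

```python
import math


def xlogy(x, y):
    """Return x * log(y), taking 0 * log(0) to be 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binomial_deviance_term(y_k, n_k, pi_hat):
    """Contribution of one covariate pattern to the Binomial deviance."""
    success_part = xlogy(y_k, (y_k / n_k) / pi_hat)
    failure_part = xlogy(n_k - y_k, (1 - y_k / n_k) / (1 - pi_hat))
    return 2 * (success_part + failure_part)


# Boundary pattern: no successes in 5 trials, fitted probability 0.2.
term = binomial_deviance_term(0, 5, 0.2)
print(term)  # positive: only the success part vanishes
```

Only the $0\log(0)$ component drops out; the failure component equals $5\log(1/0.8)$, so the pattern still contributes roughly 2.23 to the deviance.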

Comment on lines +233 to +240
When patterns do repeat ($q < n$), the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern,
giving $\ell_{\text{sat}} < 0$.

`deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
against the $\chi^2$ distribution when $q < n$.
The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
These two quantities differ by $-2\ell_{\text{sat}} > 0$ whenever $q < n$.
Copilot AI Apr 11, 2026

“When patterns do repeat ($q < n$) … giving $\ell_{\text{sat}} < 0$” isn’t always true: with replicates you can still have $\ell_{\text{sat}} = 0$ if every repeated pattern is all-0 or all-1 (i.e., each pattern has $y_k \in \{0, n_k\}$ so the Binomial pmf at $\hat\pi_k = y_k/n_k$ is 1). Consider qualifying this as “$\ell_{\text{sat}} < 0$ whenever at least one pattern has both outcomes ($0 < y_k < n_k$).”

Suggested change
- When patterns do repeat ($q < n$), the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern,
- giving $\ell_{\text{sat}} < 0$.
- `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
- against the $\chi^2$ distribution when $q < n$.
- The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
- but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
- These two quantities differ by $-2\ell_{\text{sat}} > 0$ whenever $q < n$.
+ When patterns do repeat ($q < n$), the saturated model sets $\hat\pi_k = y_k/n_k$ per pattern.
+ If at least one pattern has both outcomes
+ ($0 < y_k < n_k$),
+ then $\ell_{\text{sat}} < 0$.
+ `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
+ against the $\chi^2$ distribution when $q < n$.
+ The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
+ but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
+ These two quantities differ by $-2\ell_{\text{sat}} > 0$
+ whenever at least one repeated pattern has both outcomes
+ ($0 < y_k < n_k$).
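
A quick numeric check of this qualification, as a sketch only: the helper `saturated_loglik` and both toy datasets are hypothetical, and the log-likelihood is the ungrouped Bernoulli one evaluated at $\hat\pi_k = y_k/n_k$.

```python
import math
from collections import Counter


def saturated_loglik(x, y):
    """Bernoulli log-likelihood at the per-pattern MLE pi_k = y_k / n_k."""
    trials = Counter(x)
    successes = Counter()
    for xi, yi in zip(x, y):
        successes[xi] += yi
    ell = 0.0
    for k, n_k in trials.items():
        y_k = successes[k]
        # Patterns with y_k in {0, n_k} contribute 0 (the 0*log(0) convention).
        if 0 < y_k < n_k:
            pi_k = y_k / n_k
            ell += y_k * math.log(pi_k) + (n_k - y_k) * math.log(1 - pi_k)
    return ell


x = ["a", "a", "b", "b"]  # two patterns, each replicated twice (q < n)
ell_pure = saturated_loglik(x, [1, 1, 0, 0])   # every pattern all-1 or all-0
ell_mixed = saturated_loglik(x, [1, 0, 0, 0])  # pattern "a" has both outcomes
print(ell_pure, ell_mixed)
```

With replicates but no mixed pattern, `ell_pure` is 0; once pattern "a" contains both outcomes, `ell_mixed` becomes $2\log(0.5) < 0$.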

Comment on lines +236 to +239
`deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
against the $\chi^2$ distribution when $q < n$.
The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
Copilot AI Apr 11, 2026

The claim that ungrouped deviance() “cannot be used as a goodness-of-fit test against the $\chi^2$ distribution when $q < n$” is too broad. The issue you describe is that it’s the wrong statistic to compare to $\chi^2(q-p)$ (lack-of-fit by covariate pattern); ungrouped deviance is typically compared to $\chi^2(n-p)$ instead. Suggest rephrasing to explicitly mention the df/statistic mismatch rather than implying no $\chi^2$ GOF use at all.

Suggested change
- `deviance()` for ungrouped data **cannot** be used as a goodness-of-fit test
- against the $\chi^2$ distribution when $q < n$.
- The correct GOF statistic is $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
- but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$ (using $\ell_{\text{fp}} = 0$).
+ When $q < n$, `deviance()` for ungrouped data is **not** the correct statistic
+ for the covariate-pattern goodness-of-fit test based on $\chi^2(q - p)$.
+ For that lack-of-fit test, the correct statistic is
+ $2(\ell_{\text{sat}} - \ell(\hat\beta))$,
+ but R's `deviance()` for ungrouped data returns $-2\ell(\hat\beta)$
+ (using $\ell_{\text{fp}} = 0$).
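
The size of the mismatch follows in one line from the two definitions. As a short derivation, write $D_{\text{fp}}$ and $D_{\text{sat}}$ for the deviances relative to the fully parametrized and saturated references; this shorthand is introduced here for illustration and is not notation from the PR. It assumes the ungrouped Bernoulli likelihood, for which $\ell_{\text{fp}} = 0$ and $\ell_{\text{sat}} \leq 0$:

$$
D_{\text{fp}} - D_{\text{sat}}
= -2\ell(\hat\beta) - 2\bigl(\ell_{\text{sat}} - \ell(\hat\beta)\bigr)
= -2\ell_{\text{sat}} \geq 0,
$$

with equality exactly when no covariate pattern has mixed outcomes.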


Development

Successfully merging this pull request may close these issues.

discuss how deviance is calculated (for both LMs and GLMs) when there are replicates of some covariate patterns

3 participants