Add Reporting Information to print/summary.epi_df #691
…ct implicit/explicit gaps.
🚀 Deployed on https://69a28cd1fbb453711951fddb--epiprocess.netlify.app
brookslogan
left a comment
Thanks for all the test cases! This is looking generally good. I just want to try to make this a bit quicker to understand.
max_even <- length(unique(smry$max_t[!is.na(smry$max_t)])) <= 1

min_desc <- if (min_even) "even across epikeys" else "uneven across epikeys"
max_desc <- if (max_even) "even across epikeys" else "uneven across epikeys"
issue: "even" didn't first parse as an adjective for me
issue: I don't think we have user-facing definitions of "epikey", and it isn't standard in the community
issue: this could be misleading if we have differing time ranges for different signals.
suggestion: either
(a) change this to be about epikey x signal combos, and the messages to "same for every time series", "but some of the time series start later", "but some of the time series end earlier"
(b) re-use some of the by-signal latency information added to print.epi_df
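To make option (a) concrete, here is a minimal sketch of how the descriptors could be derived from a per-time-series summary. It assumes a summary table with one row per epikey x signal combination; the `min_t`/`max_t` column names follow the snippet under review, while `all_same()`, `describe_range()`, and the toy data are hypothetical and not the PR's actual implementation.

```r
# Hypothetical per-series summary: one row per (key, signal) time series.
smry <- data.frame(
  geo_value = c("ca", "hi"),
  signal = c("value", "value"),
  min_t = as.Date(c("2024-01-01", "2024-01-02")),
  max_t = as.Date(c("2024-01-06", "2024-01-06"))
)

# TRUE when every non-missing entry of `x` is the same value.
all_same <- function(x) length(unique(x[!is.na(x)])) <= 1

# Build the two "Time range" lines with the wording suggested in the review.
describe_range <- function(smry) {
  min_desc <- if (all_same(smry$min_t)) {
    "same for every time series"
  } else {
    "but some of the time series start later"
  }
  max_desc <- if (all_same(smry$max_t)) {
    "same for every time series"
  } else {
    "but some of the time series end earlier"
  }
  c(
    min = sprintf("min time value = %s (%s)",
                  format(min(smry$min_t, na.rm = TRUE)), min_desc),
    max = sprintf("max time value = %s (%s)",
                  format(max(smry$max_t, na.rm = TRUE)), max_desc)
  )
}

range_lines <- describe_range(smry)
print(range_lines)
```

Because the comparison runs over time series rather than over epikeys alone, a signal that starts later than the others within the same key would also be reported.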
(a) Done! It is worth noting that I used epikey in accordance with revision_analysis, which prints this term. Should we remove it from there as well?
(b) Done! Also, the by-signal latency information was moved here.
(a) Yes, we probably should. (File an Issue?)
Co-authored-by: brookslogan <lcbrooks+github@andrew.cmu.edu>
Changes include:
* displaying only the latency range
* printing a message about empty time series
…e, gap, and latency reporting based on feedback
…nfo for better code organization
JavierMtzRdz
left a comment
I have addressed the previous issues. Since we provide latency summaries in both summary and print, I moved all the key combination x signal calculations into epi_ts_range. To address the potential problem with empty time series and missing signal columns, it returns calculations for all the rows, with an added column indicating whether the time series is empty.
The remaining print latency is handled in print_latency_info as a result.
Since the by-signal latency was moved to the summary, summary_time_latency reuses the time range, latency, and gap information from the epi_ts_range output. The message for each section is built within its own function for clarity.
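As a rough illustration of the idea of computing everything once per key combination x signal and flagging empty series, here is a minimal base R sketch. `ts_range_sketch()` and its output columns are hypothetical stand-ins and do not reproduce the actual epi_ts_range code or its return format.

```r
# Toy data: two keys, two signals; `value2` is entirely NA.
df <- data.frame(
  geo_value = rep(c("ca", "hi"), each = 3),
  time_value = rep(as.Date("2024-01-01") + 0:2, 2),
  value = c(1, 2, 3, 4, 5, NA),
  value2 = NA_real_
)
as_of <- as.Date("2024-01-08")

# Hypothetical helper (NOT epi_ts_range itself): one output row per
# key x signal, keeping empty series and marking them with `is_empty`.
ts_range_sketch <- function(df, signals, as_of) {
  rows <- list()
  for (key in unique(df$geo_value)) {
    for (sig in signals) {
      observed <- df$geo_value == key & !is.na(df[[sig]])
      is_empty <- !any(observed)
      times <- df$time_value[observed]
      rows[[length(rows) + 1]] <- data.frame(
        geo_value = key,
        signal = sig,
        min_t = if (is_empty) as.Date(NA) else min(times),
        max_t = if (is_empty) as.Date(NA) else max(times),
        is_empty = is_empty
      )
    }
  }
  out <- do.call(rbind, rows)
  # Latency in days from as_of to the latest non-NA observation.
  out$latency_days <- as.numeric(difftime(as_of, out$max_t, units = "days"))
  out
}

rng <- ts_range_sketch(df, c("value", "value2"), as_of)
print(rng)
```

Keeping the empty rows means downstream printing code can report "all NA" for a signal instead of silently dropping it.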
Finally, I reused time_delta_to_n_steps to calculate latency. However, when using time_type = "week", as_of does not necessarily fall on the same day of the week as the time values, so the result can be a non-integer number of steps. To prevent such issues, I added a require_integer parameter to time_delta_to_n_steps.
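The weekly case can be illustrated with a small sketch. `n_steps_sketch()` below is a hypothetical stand-in for the idea behind require_integer; it does not reproduce the signature or behavior of the package's time_delta_to_n_steps, and the dates are made up.

```r
# Hypothetical helper: number of steps of `step_days` between two dates.
# With require_integer = TRUE, a fractional result is treated as an error.
n_steps_sketch <- function(from, to, step_days, require_integer = TRUE) {
  n <- as.numeric(difftime(to, from, units = "days")) / step_days
  if (require_integer && n != round(n)) {
    stop("time difference is not a whole number of steps")
  }
  n
}

latest_week <- as.Date("2024-01-07") # latest weekly time_value
as_of <- as.Date("2024-01-17")       # 10 days later, mid-week

# Relaxed mode: a fractional number of weeks is allowed.
weekly_lag <- n_steps_sketch(latest_week, as_of, step_days = 7,
                             require_integer = FALSE)

# Strict mode: the same inputs raise an error, captured here as a string.
strict_result <- tryCatch(
  n_steps_sketch(latest_week, as_of, step_days = 7),
  error = function(e) "error"
)
```

Here `weekly_lag` is 10/7 weeks, which is why a strict integer check would reject an as_of that falls mid-week.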
Below is a detailed list of examples.
# Demonstrating Improved print.epi_df and summary.epi_df reporting
pkgload::load_all(".")
#> ℹ Loading epiprocess
#> Loading required package: epidatasets
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.5.2
#>
#> Attaching package: 'dplyr'
#>
#> The following object is masked from 'package:epiprocess':
#>
#> between
#>
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#>
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
#> Warning: package 'tibble' was built under R version 4.5.2
# Setup basic parameters
start_date <- as.Date("2024-01-01")
as_of_date <- as.Date("2024-01-15")
# Standard clean data ---
## No signal
(case0 <- tibble(
geo_value = rep(c("ca", "hi"), each = 6),
time_value = rep(start_date + 0:5, 2),
) %>% as_epi_df(as_of = as_of_date))
#> An `epi_df` object, 12 x 2 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-01-15
#> Latency (lag from as_of to latest observation by time series):
#> * No time series detected
#> # A tibble: 12 × 2
#> geo_value time_value
#> * <chr> <date>
#> 1 ca 2024-01-01
#> 2 ca 2024-01-02
#> 3 ca 2024-01-03
#> 4 ca 2024-01-04
#> 5 ca 2024-01-05
#> 6 ca 2024-01-06
#> 7 hi 2024-01-01
#> 8 hi 2024-01-02
#> 9 hi 2024-01-03
#> 10 hi 2024-01-04
#> 11 hi 2024-01-05
#> 12 hi 2024-01-06
summary(case0)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 2024-01-15
#> ----------
#> Time range:
#> * min time value = 2024-01-01
#> * max time value = 2024-01-06
#> Gaps:
#> * time gaps = none detected
#> * average rows per time value = 2.00
#> Latency (lag from as_of to latest observation by time series):
#> * No time series detected
## Standard
(case1 <- tibble(
geo_value = rep(c("ca", "hi"), each = 6),
time_value = rep(start_date + 0:5, 2),
value = 1:12
) %>% as_epi_df(as_of = as_of_date))
#> An `epi_df` object, 12 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-01-15
#> Latency (lag from as_of to latest observation by time series):
#> * lag = 9 days
#>
#> # A tibble: 12 × 3
#> geo_value time_value value
#> * <chr> <date> <int>
#> 1 ca 2024-01-01 1
#> 2 ca 2024-01-02 2
#> 3 ca 2024-01-03 3
#> 4 ca 2024-01-04 4
#> 5 ca 2024-01-05 5
#> 6 ca 2024-01-06 6
#> 7 hi 2024-01-01 7
#> 8 hi 2024-01-02 8
#> 9 hi 2024-01-03 9
#> 10 hi 2024-01-04 10
#> 11 hi 2024-01-05 11
#> 12 hi 2024-01-06 12
summary(case1)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 2024-01-15
#> ----------
#> Time range:
#> * min time value = 2024-01-01 (same for every time series)
#> * max time value = 2024-01-06 (same for every time series)
#> Gaps:
#> * time gaps = none detected
#> * average rows per time value = 2.00
#> Latency (lag from as_of to latest observation by time series):
#> * value: lag 9 days (max time 2024-01-06) (!)
#> (!): notable latency (lag > 7 days)
## all NA signal
(case1.5 <- tibble(
geo_value = rep(c("ca", "hi"), each = 6),
time_value = rep(start_date + 0:5, 2),
value = 1:12,
value2 = NA
) %>% as_epi_df(as_of = as_of_date))
#> An `epi_df` object, 12 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-01-15
#> Latency (lag from as_of to latest observation by time series):
#> * lag = 9 days
#> * Empty time series detected
#> # A tibble: 12 × 4
#> geo_value time_value value value2
#> * <chr> <date> <int> <lgl>
#> 1 ca 2024-01-01 1 NA
#> 2 ca 2024-01-02 2 NA
#> 3 ca 2024-01-03 3 NA
#> 4 ca 2024-01-04 4 NA
#> 5 ca 2024-01-05 5 NA
#> 6 ca 2024-01-06 6 NA
#> 7 hi 2024-01-01 7 NA
#> 8 hi 2024-01-02 8 NA
#> 9 hi 2024-01-03 9 NA
#> 10 hi 2024-01-04 10 NA
#> 11 hi 2024-01-05 11 NA
#> 12 hi 2024-01-06 12 NA
summary(case1.5)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 2024-01-15
#> ----------
#> Time range:
#> * min time value = 2024-01-01 (same for every time series)
#> * max time value = 2024-01-06 (same for every time series)
#> Gaps:
#> * time gaps = none detected
#> * average rows per time value = 2.00
#> Latency (lag from as_of to latest observation by time series):
#> * value: lag 9 days (max time 2024-01-06) (!)
#> * value2: all NA
#> (!): notable latency (lag > 7 days)
# Integer time indices ---
(case2 <- tibble(
geo_value = rep(c("ca", "hi"), each = 6),
time_value = rep(100:105, 2),
value = 1:12
) %>% as_epi_df(as_of = 110))
#> An `epi_df` object, 12 x 3 with metadata:
#> * geo_type = state
#> * time_type = integer
#> * as_of = 110
#> Latency (lag from as_of to latest observation by time series):
#> * lag = 5
#>
#> # A tibble: 12 × 3
#> geo_value time_value value
#> * <chr> <int> <int>
#> 1 ca 100 1
#> 2 ca 101 2
#> 3 ca 102 3
#> 4 ca 103 4
#> 5 ca 104 5
#> 6 ca 105 6
#> 7 hi 100 7
#> 8 hi 101 8
#> 9 hi 102 9
#> 10 hi 103 10
#> 11 hi 104 11
#> 12 hi 105 12
summary(case2)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 110
#> ----------
#> Time range:
#> * min time value = 100 (same for every time series)
#> * max time value = 105 (same for every time series)
#> Gaps:
#> * time gaps = none detected
#> * average rows per time value = 2.00
#> Latency (lag from as_of to latest observation by time series):
#> * value: lag 5 (max time 105) (!)
#> (!): notable latency (lag > 2 )
# Other Keys ---
(case3 <- tibble(
geo_value = rep(c("ca", "hi"), each = 4),
age_group = rep(rep(c("0-17", "18+"), each = 2), 2),
time_value = rep(c(start_date, start_date + 1), 4),
value = 1:8
) %>% as_epi_df(as_of = as_of_date, other_keys = "age_group"))
#> An `epi_df` object, 8 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * other_keys = age_group
#> * as_of = 2024-01-15
#> Latency (lag from as_of to latest observation by time series):
#> * lag = 13 days
#>
#> # A tibble: 8 × 4
#> geo_value age_group time_value value
#> * <chr> <chr> <date> <int>
#> 1 ca 0-17 2024-01-01 1
#> 2 ca 0-17 2024-01-02 2
#> 3 ca 18+ 2024-01-01 3
#> 4 ca 18+ 2024-01-02 4
#> 5 hi 0-17 2024-01-01 5
#> 6 hi 0-17 2024-01-02 6
#> 7 hi 18+ 2024-01-01 7
#> 8 hi 18+ 2024-01-02 8
summary(case3)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * other_keys = age_group
#> * as_of = 2024-01-15
#> ----------
#> Time range:
#> * min time value = 2024-01-01 (same for every time series)
#> * max time value = 2024-01-02 (same for every time series)
#> Gaps:
#> * time gaps = none detected
#> * average rows per time value = 4.00
#> Latency (lag from as_of to latest observation by time series):
#> * value: lag 13 days (max time 2024-01-02) (!)
#> (!): notable latency (lag > 7 days)
# Late start and implicit gaps ---
edf_base <- tibble(
geo_value = rep(c("ca", "hi"), each = 6),
time_value = rep(start_date + 0:5, 2),
value = 1:12
)
edf_uneven <- edf_base[-c(3, 7), ]
# 'hi' starts late
# 'ca' missing day 2
(case4 <- as_epi_df(edf_uneven, as_of = as_of_date))
#> An `epi_df` object, 10 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-01-15
#> Latency (lag from as_of to latest observation by time series):
#> * lag = 9 days
#>
#> # A tibble: 10 × 3
#> geo_value time_value value
#> * <chr> <date> <int>
#> 1 ca 2024-01-01 1
#> 2 ca 2024-01-02 2
#> 3 ca 2024-01-04 4
#> 4 ca 2024-01-05 5
#> 5 ca 2024-01-06 6
#> 6 hi 2024-01-02 8
#> 7 hi 2024-01-03 9
#> 8 hi 2024-01-04 10
#> 9 hi 2024-01-05 11
#> 10 hi 2024-01-06 12
summary(case4)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 2024-01-15
#> ----------
#> Time range:
#> * min time value = 2024-01-01 (but some time series start later)
#> * max time value = 2024-01-06 (same for every time series)
#> Gaps:
#> * implicit (missing rows in 1/2 key combinations, affecting 1 signal)
#> * average rows per time value = 1.67
#> Latency (lag from as_of to latest observation by time series):
#> * value: lag 9 days (max time 2024-01-06) (!)
#> (!): notable latency (lag > 7 days)
# Explicit NAs ---
edf_lags <- tibble(
geo_value = rep(c("ca", "hi"), each = 6),
time_value = rep(start_date + 0:5, 2),
value = 1:12
)
edf_lags$value[2] <- NA # 'ca' gap at Jan 2
edf_lags$value[12] <- NA # 'hi' missing Jan 6 (lag)
(case5 <- as_epi_df(edf_lags, as_of = start_date + 7))
#> An `epi_df` object, 12 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-01-08
#> Latency (lag from as_of to latest observation by time series):
#> * lag = 2–3 days
#>
#> # A tibble: 12 × 3
#> geo_value time_value value
#> * <chr> <date> <int>
#> 1 ca 2024-01-01 1
#> 2 ca 2024-01-02 NA
#> 3 ca 2024-01-03 3
#> 4 ca 2024-01-04 4
#> 5 ca 2024-01-05 5
#> 6 ca 2024-01-06 6
#> 7 hi 2024-01-01 7
#> 8 hi 2024-01-02 8
#> 9 hi 2024-01-03 9
#> 10 hi 2024-01-04 10
#> 11 hi 2024-01-05 11
#> 12 hi 2024-01-06 NA
summary(case5)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 2024-01-08
#> ----------
#> Time range:
#> * min time value = 2024-01-01 (same for every time series)
#> * max time value = 2024-01-06 (but some time series end earlier)
#> Gaps:
#> * explicit (non-lag NAs in 1/2 key combinations, affecting 1 signal)
#> * average rows per time value = 2.00
#> Latency (lag from as_of to latest observation by time series):
#> * value: lag 2–3 days (max time 2024-01-06); lagging keys: hi (!)
#> (!): notable latency (lagging keys)
# Many signals and multivariate lags ---
df_many <- tibble(geo_value = "ca", time_value = start_date + 0:9)
for (i in 1:10) {
df_many[[paste0("sig", i)]] <- 1:10
}
# sig5 lags by 2 days, sig7 has an internal hole
df_many$sig5[9:10] <- NA
df_many$sig7[4:6] <- NA
(case6 <- as_epi_df(df_many, as_of = start_date + 12))
#> An `epi_df` object, 10 x 12 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-01-13
#> Latency (lag from as_of to latest observation by time series):
#> * lag across all time series = 3–5 days
#>
#> # A tibble: 10 × 12
#> geo_value time_value sig1 sig2 sig3 sig4 sig5 sig6 sig7 sig8 sig9
#> * <chr> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 ca 2024-01-01 1 1 1 1 1 1 1 1 1
#> 2 ca 2024-01-02 2 2 2 2 2 2 2 2 2
#> 3 ca 2024-01-03 3 3 3 3 3 3 3 3 3
#> 4 ca 2024-01-04 4 4 4 4 4 4 NA 4 4
#> 5 ca 2024-01-05 5 5 5 5 5 5 NA 5 5
#> 6 ca 2024-01-06 6 6 6 6 6 6 NA 6 6
#> 7 ca 2024-01-07 7 7 7 7 7 7 7 7 7
#> 8 ca 2024-01-08 8 8 8 8 8 8 8 8 8
#> 9 ca 2024-01-09 9 9 9 9 NA 9 9 9 9
#> 10 ca 2024-01-10 10 10 10 10 NA 10 10 10 10
#> # ℹ 1 more variable: sig10 <int>
summary(case6)
#> An `epi_df` x, with metadata:
#> * geo_type = state
#> * as_of = 2024-01-13
#> ----------
#> Time range:
#> * min time value = 2024-01-01 (same for every time series)
#> * max time value = 2024-01-10 (but some time series end earlier)
#> Gaps:
#> * explicit (non-lag NAs in 1/1 key combinations, affecting 1 signal)
#> * average rows per time value = 1.00
#> Latency (lag from as_of to latest observation by time series):
#> * sig1: lag 3 days (max time 2024-01-10)
#> * sig2: lag 3 days (max time 2024-01-10)
#> * sig3: lag 3 days (max time 2024-01-10)
#> * sig4: lag 3 days (max time 2024-01-10)
#> * sig5: lag 5 days (max time 2024-01-08)
#> * sig6: lag 3 days (max time 2024-01-10)
#> * sig7: lag 3 days (max time 2024-01-10)
#> * sig8: lag 3 days (max time 2024-01-10)
#> * ... and 2 other signals
Created on 2026-04-02 with reprex v2.1.1
Checklist
Please:
PR).
brookslogan, nmdefries.
DESCRIPTION. Always increment the patch version number (the third number), unless you are making a
release PR from dev to main, in which case increment the minor version
number (the second number).
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
1.7.2, then write your changes under the 1.8 heading).
/style to check the style and fix any issues.
/document to check the package documentation and fix any issues.
/preview-docs to preview the docs.
process.
Add Reporting Information to print/summary.epi_df

The print.epi_df method now provides a signal-level latency report that calculates reporting lags relative to the as_of metadata. Signals with notable latencies are marked with an alert flag, and the output specifically identifies any lagging keys. For objects with many signals, the output is now truncated with a summary of the remaining variables to preserve readability.

The summary.epi_df method has been expanded to include a regularity analysis, as mentioned in #688. This feature identifies whether the minimum and maximum time values are even or uneven across all epikeys. The summary also diagnoses implicit and explicit gaps.

I included the time analysis in summary.epi_df because print.epi_df was becoming too long. They are provided as helper functions in case we want to move them around. Though it may be more efficient to combine them if they are put together.

cli to print those summaries could improve their appearance.

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

print.epi_df: add notes about even/uneven min and max time_value by epikey, whether there are gaps, implicit or explicit #688

Examples:

Created on 2026-02-27 with reprex v2.1.1
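For readers unfamiliar with the implicit/explicit gap distinction in the description above, here is a minimal sketch for a single daily time series. It assumes that an implicit gap is a date missing between the first and last row of the series, and that an explicit gap is an NA value that occurs before the latest non-NA observation (a trailing NA counts as latency instead). `classify_gaps()` is a hypothetical helper and the PR's actual rules may differ.

```r
# Hypothetical helper for one daily time series.
classify_gaps <- function(time_value, value) {
  # Implicit gaps: dates between the first and last row that have no row.
  expected <- seq(min(time_value), max(time_value), by = "day")
  implicit <- expected[!expected %in% time_value]

  # Explicit gaps: rows present with NA values, excluding trailing NAs.
  observed <- time_value[!is.na(value)]
  explicit <- if (length(observed) == 0) {
    time_value[0]
  } else {
    time_value[is.na(value) & time_value < max(observed)]
  }
  list(implicit = implicit, explicit = explicit)
}

days <- as.Date("2024-01-01") + 0:5

# The row for Jan 3 is absent: an implicit gap.
implicit_case <- classify_gaps(days[-3], c(1, 2, 4, 5, 6))

# Jan 2 is NA inside the series (explicit gap); Jan 6 is NA at the tail
# and is treated as latency rather than as a gap.
explicit_case <- classify_gaps(days, c(1, NA, 3, 4, 5, NA))

print(implicit_case$implicit)
print(explicit_case$explicit)
```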