perf: replace dplyr/tidyr with vctrs in ard_summary hot paths #2

Open
Melkiades wants to merge 1 commit into main from perf/optimize-ard-summary

Summary

Replace dplyr::tibble(), dplyr::mutate(), dplyr::rename(), map()/map2()/dplyr::bind_rows() with vctrs::new_data_frame(), for-loops, and vctrs::vec_rbind() in two internal functions called on every ARD computation.

Motivation

tbl_summary(by=) spends ~16% of its total time in .lst_results_as_df and .calculate_stats_as_ard. These functions are called once per variable × statistic × by-group (~72 times for a typical 4-variable, 3-arm table). Each dplyr::tibble() call carries ~0.5 ms of non-standard evaluation (NSE) overhead, which adds up across those calls.

Changes

R/ard_summary.R — two functions:

  • .lst_results_as_df: Single vctrs::new_data_frame() call replaces dplyr::tibble() + dplyr::mutate(variable=) + dplyr::rename(stat="result"). The variable and stat columns are now set directly in the constructor.

  • .calculate_stats_as_ard: for-loops + vctrs::vec_rbind() replace map()/map2()/dplyr::bind_rows(). Note: fun and fun_name variables must remain in scope because getOption("cards.calculate_stats_as_ard.eval_fun") (used by ard_mvsummary) references them by name.
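For reviewers who want to see the constructor swap in isolation, here is a minimal sketch of the pattern behind the first bullet. The helper names `results_before()` and `results_after()`, their arguments, and the three-column shape are hypothetical stand-ins, not the actual body of .lst_results_as_df; the `tbl_df` class is added on the assumption that callers expect a tibble.

```r
library(vctrs)

# Hypothetical stand-in for the old pattern: build a tibble, then add and
# rename columns in separate dplyr steps.
results_before <- function(result, variable) {
  df <- dplyr::tibble(stat_name = names(result), result = unname(result))
  df <- dplyr::mutate(df, variable = variable)
  dplyr::rename(df, stat = "result")
}

# Hypothetical stand-in for the new pattern: every column is set directly in
# one low-level constructor call. new_data_frame() does not recycle, so the
# scalar `variable` is repeated explicitly to the row count.
results_after <- function(result, variable) {
  n <- length(result)
  vctrs::new_data_frame(
    list(
      stat_name = names(result),
      stat = unname(result),
      variable = rep(variable, n)
    ),
    n = n,
    class = c("tbl_df", "tbl") # assumption: keep the tibble class for callers
  )
}

res <- list(mean = 2, sd = 1)
before <- results_before(res, "age")
after <- results_after(res, "age")
```

Both helpers return the same columns in the same order; the second simply skips the tidy-evaluation machinery.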

DESCRIPTION: Added vctrs (>= 0.6.5) to Imports. vctrs is already a transitive dependency via dplyr and tidyr.
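The loop-and-bind pattern described for .calculate_stats_as_ard can be sketched roughly as follows. This is an illustration under stated assumptions, not the function's real body: `one_stat_as_df()` and `calculate_stats_sketch()` are hypothetical helpers, and the result columns are simplified. The loop variables are deliberately named `fun` and `fun_name`, since the PR notes that the cards.calculate_stats_as_ard.eval_fun option refers to them by name.

```r
library(vctrs)

# Hypothetical helper: wrap one computed statistic in a one-row data frame.
one_stat_as_df <- function(x, variable, fun_name) {
  vctrs::new_data_frame(
    list(variable = variable, stat_name = fun_name, stat = list(x))
  )
}

# Hypothetical sketch: nested for-loops fill a preallocated list, and a single
# vec_rbind() call at the end replaces map()/map2() + dplyr::bind_rows().
calculate_stats_sketch <- function(data, variables, fns) {
  pieces <- vector("list", length(variables) * length(fns))
  k <- 0L
  for (variable in variables) {
    for (fun_name in names(fns)) {
      fun <- fns[[fun_name]] # `fun` and `fun_name` stay in scope by name
      k <- k + 1L
      pieces[[k]] <- one_stat_as_df(fun(data[[variable]]), variable, fun_name)
    }
  }
  vctrs::vec_rbind(!!!pieces) # splice the list of data frames into one bind
}

toy <- data.frame(age = c(1, 2, 3), wt = c(4, 5, 6))
out <- calculate_stats_sketch(toy, c("age", "wt"), list(mean = mean, max = max))
```

Binding once at the end, rather than once per variable, is what avoids the intermediate allocations mentioned above.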

Benchmark

tbl_summary(data, by = trt, include = c(age, sex, grade)) with 500 rows, 3 treatment arms:

           ms/call
Before     ~670 ms
After      ~500 ms
Speedup    1.34×
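A rough way to reproduce a timing like this locally is sketched below. The PR does not say how `data` was built or how the timing was taken, so everything here is an assumption: the synthetic data frame, the arm labels, the number of repetitions, and the use of system.time(). It requires gtsummary to be installed and should be run once on main and once on this branch.

```r
library(gtsummary)

set.seed(1)
n <- 500

# Assumed stand-in for the benchmark data: 500 rows, 3 treatment arms.
data <- data.frame(
  trt   = sample(c("Arm A", "Arm B", "Arm C"), n, replace = TRUE),
  age   = round(rnorm(n, mean = 60, sd = 10)),
  sex   = sample(c("F", "M"), n, replace = TRUE),
  grade = sample(c("I", "II", "III"), n, replace = TRUE)
)

# Time several repetitions of the call and keep the elapsed seconds of each.
timings <- vapply(
  seq_len(5),
  function(i) {
    system.time(
      tbl_summary(data, by = trt, include = c(age, sex, grade))
    )[["elapsed"]]
  },
  numeric(1)
)

# Median elapsed time per call, in milliseconds.
median_ms <- median(timings) * 1000
```

Absolute numbers will vary by machine; the ratio between the two branches is the figure to compare against the table.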

Testing

  • cards: 725 PASS, 22 snapshot-only formatting diffs (column alignment in print output)
  • cardx: 744 PASS, 42 snapshot-only (same count as main)
  • crane: 564 PASS, 33 snapshot-only (identical to main)
  • gtsummary equivalence: 17/17 table outputs byte-identical to main

Two functions optimized:

.lst_results_as_df: Replace dplyr::tibble() + dplyr::mutate() +
dplyr::rename() with a single vctrs::new_data_frame() call. This
function is invoked once per variable x statistic x by-group (~72
times in a typical tbl_summary(by=) call). Each dplyr::tibble() has
~0.5ms NSE overhead that is eliminated.

.calculate_stats_as_ard: Replace map()/map2()/dplyr::bind_rows()
with for-loops and vctrs::vec_rbind(). Avoids intermediate tibble
allocations and per-variable bind_rows overhead.

Benchmark: tbl_summary(by=, 3 groups, 4 vars) drops from ~670ms
to ~500ms (1.34x).

Co-authored-by: Ona <no-reply@ona.com>