Merge refactored branch into main by pjhop · Pull Request #9 · KennaLab/rvat

pjhop · 2026-03-30T19:25:38Z

This release is primarily an internal refactoring focused on improved robustness:

large source files are split into focused modules.
almost all methods and functions have been refactored significantly for clarity, maintainability and robustness.
test coverage is expanded to >90% for most code.
there are also several small bug fixes and performance improvements.

…90% coverage)

…li; minor fixes documentation

Copilot

Pull request overview

This PR merges a large internal refactor into main, reorganizing the gdb/association tooling into new focused modules and replacing the legacy aggregateFile workflow with the new SQLite-backed aggdb format.

Changes:

Introduces aggdb/aggdbList classes and associated merge/collapse + aggdb-based assocTest implementation.
Refactors and modularizes gdb utilities (build/concat/subset/getGT/mapVariants/summariseGeno) with shared helpers and stronger input validation.
Updates package metadata/exports (DESCRIPTION/NAMESPACE) and adds an R-CMD-check GitHub Actions workflow.

Reviewed changes

Copilot reviewed 31 out of 251 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
R/gdb-utils.R	Adds refactored gdb utilities (concat/subset/writeVcf) and internal helpers.
R/gdb-summariseGeno.R	New `summariseGeno,gdb` implementation using refactored helpers and filters.
R/gdb-shared-helpers.R	Shared validation + metadata/index helpers used across gdb modules.
R/gdb-mapVariants.R	Refactors `mapVariants,gdb` and its range-handling/validation.
R/gdb-getGT.R	Refactors `getGT,gdb` into smaller helpers (vars/cohort/ploidy/anno).
R/gdb-buildGdb.R	Refactors gdb build/population from VCF and var_ranges generation.
R/gdb-anno-cohort.R	Refactors get/upload anno/cohort and table dropping behavior.
R/gdb-aggregate.R	Updates gdb aggregate implementation to write `aggdb` outputs.
R/dump.R	Removes legacy `.dump`-based implementations.
R/assocTest_helper.R	Removes legacy covariate helper (reintroduced elsewhere in refactor).
R/assocTest_aggregateFile.R	Removes legacy assocTest on `aggregateFile`.
R/aggregateFile.R	Removes legacy `aggregateFile` classes/APIs.
R/asserts-types-check.R	Adds rlang standalone type checking utilities.
R/asserts-obj-type.R	Adds rlang standalone object-type/error utilities.
R/allInternalData.R	Refactors internal constants + adds aggdb test lists and contig lengths.
R/aggDb.R	Adds `aggdb` class constructor, getters, and internal initializer.
R/aggDb-list.R	Adds `aggdbList` builder + list methods.
R/aggDb-merge.R	Adds `mergeAggDbs,aggdbList` method.
R/aggDb-collapse.R	Adds `collapseAggDbs,aggdbList` method.
R/assocTest-gdb.R	Adds refactored `assocTest,gdb` entrypoint.
R/assocTest-aggdb.R	Adds refactored `assocTest,aggdb` implementation including IO + analysis loop.
NAMESPACE	Switches exports from `aggregateFile` to `aggdb` and adds `listParams`.
DESCRIPTION	Bumps version to 0.4.0, updates dependencies/collate, adds rlang/data.table.
.github/workflows/R-CMD-check.yaml	Adds CI workflow for R CMD check on push/PR.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

R/aggDb.R

R/gdb-utils.R

Copilot · 2026-03-30T19:28:26Z

R/gdb-utils.R

+      IID <- DBI::dbGetQuery(gdb_i, "select * from SM")$IID
+    } else {
+      IID_i <- DBI::dbGetQuery(gdb_i, "select * from SM")$IID
+      if (!identical(IID, IID_i)) {


identical() requires the same ordering; select * from SM has no guaranteed row order, so two gdbs with the same sample set but different row ordering will falsely fail. Compare in a stable order (e.g., select IID from SM order by IID or preserve SM rowid ordering explicitly) or use setequal() if order is not semantically important.

Suggested change

if (!identical(IID, IID_i)) {

if (!setequal(IID, IID_i)) {

R/gdb-mapVariants.R

R/gdb-utils.R

Copilot · 2026-03-30T19:28:27Z

DESCRIPTION

+    readr,
+    ggh4x,
+    data.table,
+    rlang


The bundled rlang standalone files explicitly state an rlang (>= 1.1.0) requirement, and they call versioned rlang::ffi_* symbols. Add a minimum version constraint in DESCRIPTION (e.g., rlang (>= 1.1.0)) to prevent installations with older rlang versions failing at runtime.

Suggested change

rlang

rlang (>= 1.1.0)

R/gdb-utils.R

R/assocTest-aggdb.R

Copilot

Pull request overview

Copilot reviewed 31 out of 251 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-30T20:30:05Z

R/gdb-utils.R

+  # variant retrieval queries with / without genotype data
+  if (includeGeno) {
+    query <- sprintf(
+      "select CHROM, POS, %s, REF, ALT, QUAL, FILTER, '.', FORMAT, GT from var inner join dosage using (VAR_id)",
+      id_field
+    )
+  } else {
+    query <- sprintf(
+      "select CHROM, POS, %s, REF, ALT, QUAL, FILTER, '.' from var",
+      id_field
+    )
+  }


writeVcf() exports only the GT field (from dosage.GT), but the query selects var.FORMAT from the originally imported VCF which may contain multiple subfields (e.g., GT:DP:GQ). This produces invalid VCF rows where FORMAT declares more fields than the genotype columns contain. Consider selecting a constant FORMAT of 'GT' (or reconstructing full genotype fields) when includeGeno = TRUE.

Copilot · 2026-03-30T20:30:06Z

R/gdb-getGT.R

+  ## unpack genotypes and store as matrix
+  GT <- lapply(GT[, 2], memDecompress, type = "gzip", asChar = FALSE)
+  ascii_offset <- as.integer(charToRaw("0"))
+  GT <- as.integer(unlist(GT)) - ascii_offset
+  GT[GT > 2L] <- NA_real_
+  GT <- matrix(GT, nrow = nvar, ncol = length(GT) / nvar, byrow = TRUE)
+  rownames(GT) <- VAR_id
+  if (verbose) {
+    message(sprintf("Retrieved genotypes for %s variants", nvar))
+  }
+
+  # return
+  GT


For non-empty genotype matrices, colnames(GT) is never set to IID (it’s only set in the nvar == 0L branch). This will break downstream expectations that samples are labeled and aligned (e.g., checks comparing SM$IID to colnames(GT)). Set colnames(GT) <- IID before returning.

Copilot · 2026-03-30T20:30:06Z

R/gdb-shared-helpers.R

+    }
+    if (anyDuplicated(VAR_id) != 0L) {
+      stop("`VAR_id` contains duplicated values.", call. = FALSE)
+    }


.gdb_check_varid() allows arbitrary character vectors (e.g., c('abc')) as long as they are non-NA and non-duplicated. Multiple call sites later coerce VAR_id with as.integer(VAR_id), which would introduce NAs and corrupt queries. Consider tightening validation: if VAR_id is character, require it to be integer-like (e.g., all grepl('^[0-9]+$')) or explicitly validate that as.integer(VAR_id) does not introduce NAs.

Suggested change

}

}

if (is.character(VAR_id)) {

VAR_id_int <- suppressWarnings(as.integer(VAR_id))

if (anyNA(VAR_id_int)) {

stop(

"`VAR_id` is a character vector but contains values that cannot be safely coerced to integer.",

call. = FALSE

)

}

}

Copilot · 2026-03-30T20:30:06Z

R/gdb-aggregate.R

+    # check output (aggdb) and overWrite
+    # establish connection to output
+    .check_output(output, overWrite = overWrite, verbose = verbose)
+    inputs <- .aggregate_prepare_aggdb_inputs(as.list(environment()), gdb = x)
+    aggdb <- .init_aggdb(
+      output = output,
+      metadata = inputs[["metadata"]],
+      params = inputs[["params"]],
+      samples = inputs[["samples"]],
+      overWrite = overWrite,
+      verbose = verbose
+    )


aggregate(gdb) documents output = NULL as a valid default, but the implementation unconditionally calls .check_output(output, ...) and then initializes an aggdb at output. This makes output = NULL effectively invalid and will error at runtime. Either (mandatory) require output to be a non-NULL single path in input validation, or (optional) implement an in-memory return path when output is NULL.

Copilot · 2026-03-30T20:30:06Z

R/gdb-utils.R

+    # export new var table to output gdb
+    DBI::dbExecute(object, sprintf("ATTACH '%s' as %s", output, tmp))
+    ## detach on exit
+    on.exit(
+      try(
+        DBI::dbExecute(
+          object,
+          sprintf("DETACH DATABASE %s", DBI::dbQuoteIdentifier(object, tmp))
+        ),


There is an SQL injection / quoting risk in the ATTACH statement because output and the attached schema name tmp are interpolated via sprintf() without quoting/escaping. Use DBI::dbQuoteString() for the file path and DBI::dbQuoteIdentifier() for the attach name in the ATTACH statement as well (not just in DETACH).

Copilot · 2026-03-30T20:30:07Z

R/gdb-utils.R

+    versions[gdb_i] <- DBI::dbGetQuery(
+      gdb,
+      "select * from src.meta where name = 'rvatVersion'"
+    )$value
+    builds[gdb_i] <- DBI::dbGetQuery(
+      gdb,
+      "select * from src.meta where name = 'genomeBuild'"
+    )$value


If src.meta is missing either row, dbGetQuery(... )$value will be character(0), and assigning that into versions[gdb_i]/builds[gdb_i] will throw a 'replacement has length zero' error. Handle the 0-row case explicitly (e.g., default to NA_character_ or stop with a clearer message).

Suggested change

versions[gdb_i] <- DBI::dbGetQuery(

gdb,

"select * from src.meta where name = 'rvatVersion'"

)$value

builds[gdb_i] <- DBI::dbGetQuery(

gdb,

"select * from src.meta where name = 'genomeBuild'"

)$value

version_row <- DBI::dbGetQuery(

gdb,

"select * from src.meta where name = 'rvatVersion'"

)

if (NROW(version_row) != 1L) {

warning(

sprintf(

"Expected exactly one 'rvatVersion' entry in meta for '%s', found %d; using NA.",

gdb_i,

NROW(version_row)

),

call. = FALSE

)

versions[gdb_i] <- NA_character_

} else {

versions[gdb_i] <- as.character(version_row$value[[1L]])

}

build_row <- DBI::dbGetQuery(

gdb,

"select * from src.meta where name = 'genomeBuild'"

)

if (NROW(build_row) != 1L) {

warning(

sprintf(

"Expected exactly one 'genomeBuild' entry in meta for '%s', found %d; using NA.",

gdb_i,

NROW(build_row)

),

call. = FALSE

)

builds[gdb_i] <- NA_character_

} else {

builds[gdb_i] <- as.character(build_row$value[[1L]])

}

Copilot · 2026-03-30T20:30:07Z

R/aggDb-list.R

+        call. = FALSE
+      )
+    } else {
+      units_all <- paste0("X", seq_along(units_all))


When checkDups = FALSE, units_all is renamed locally, but the underlying units stored in each aggdb are not renamed. This makes downstream behaviors inconsistent (e.g., mergeAggDbs() still inserts the original unit names and can violate the aggregates(unit PRIMARY KEY) constraint, and the final validation compares against renamed values). Either disallow checkDups = FALSE for aggdbList, or carry a mapping and apply it during merge/collapse (including rewriting unit names on insert).

Suggested change

units_all <- paste0("X", seq_along(units_all))

stop(

paste(

"checkDups = FALSE is not supported for aggdbList when duplicate",

"units are present across databases."

),

call. = FALSE

)

R/aggDb-merge.R

DESCRIPTION

R/gdb-buildGdb.R

pjhop and others added 30 commits September 30, 2025 20:37

refactoring core gdb methods/functions + >90% coverage

553366c

varSet refactoring

4bdb033

refactored varSet unit tests (>90% coverage)

a50297d

genoMatrix refactoring

a1fda07

genoMatrix unit test refactoring (>90%)

9d74812

refactoring summariseGeno

90ed689

add asserts

526e65c

replace aggregateFile and associated methods with aggDb

94b9d59

refactored gdb-aggregate method (using aggDb)

ce5482c

assocTest refactoring

d025cdd

refactor assocTest_gdb

67cf889

replace assoctest_aggregateFile with assoctest_aggdb + refactor

eb0f2d4

refactor rvatResult

2a1b1e3

update nullmodelGSA to aggdb

24ee3a1

update internal data

6d6bb79

update cli

3700348

minor fixes in geneSetAssoc tests

51aed93

replace aggregateFile tests with aggdb tests

4343ffa

minor updates cli tests

5eaa201

add test helper functions

0435360

minor updates unit tests

79d3d00

updates DESCRIPTION/NAMESPACE and documentation

761d2aa

small fix unit tests

b17f17e

add test data

cdf19f6

small fix in unit tests

39dea48

minor fixes

d88c383

refactor genoMatrix unit tests (>90% coverage)

3508f50

refactor summariseGeno unit tests (>90% coverage)

6994358

minor updates aggdb + updated unit tests aggdb and related methods (>…

977e3c0

…90% coverage)

minor updates assocTest + refactoring assocTest unit tests

7f7730f

pjhop and others added 23 commits January 12, 2026 10:49

fix aggdb documentation/examples

7d34802

update documentation + move examples to separate files

b0442ad

remove skat_fwe tests; add aggregate documentation; add writeVcf to c…

61b7238

…li; minor fixes documentation

reformat allClasses/allGenerics

3959bba

fix writing to output in assocTest

8c635d1

consistent file naming

d8a9889

minor fix gdb and related methods

338eb3b

minor fixes + file renamings

d21b34d

minor fixes documentation

81efb14

fix data paths in unit tests

39ee4f6

version bump + format DESCRIPTION

98ceee4

minor fixes unit tests

27a0829

minor fixes

a8d8dee

update air formatting

27b595b

update man

653e516

remaining formatting

672a75b

minor fixes

26f034f

add snapshots + R CMD check workflow

fd013ee

avoid opening all aggdb connections simultaneously in aggdbList

3b6fa0d

minor update

a62427d

add rvatData to remotes in DESCRIPTION

d9ca5db

minor unit test fix

a8df225

add v.0.4.0 to changelog

5c8eb74

pjhop requested a review from Copilot March 30, 2026 19:25

pjhop self-assigned this Mar 30, 2026

Copilot AI reviewed Mar 30, 2026

View reviewed changes

couple of minor fixes

eee1e68

pjhop requested a review from Copilot March 30, 2026 20:25

Copilot AI reviewed Mar 30, 2026

View reviewed changes

pjhop merged commit 7cf9257 into main Mar 31, 2026
2 checks passed

-    }
+    }
+    if (is.character(VAR_id)) {
+      VAR_id_int <- suppressWarnings(as.integer(VAR_id))
+      if (anyNA(VAR_id_int)) {
+        stop(
+          "`VAR_id` is a character vector but contains values that cannot be safely coerced to integer.",
+          call. = FALSE
+        )
+      }
+    }

-    versions[gdb_i] <- DBI::dbGetQuery(
-      gdb,
-      "select * from src.meta where name = 'rvatVersion'"
-    )$value
-    builds[gdb_i] <- DBI::dbGetQuery(
-      gdb,
-      "select * from src.meta where name = 'genomeBuild'"
-    )$value
+    version_row <- DBI::dbGetQuery(
+      gdb,
+      "select * from src.meta where name = 'rvatVersion'"
+    )
+    if (NROW(version_row) != 1L) {
+      warning(
+        sprintf(
+          "Expected exactly one 'rvatVersion' entry in meta for '%s', found %d; using NA.",
+          gdb_i,
+          NROW(version_row)
+        ),
+        call. = FALSE
+      )
+      versions[gdb_i] <- NA_character_
+    } else {
+      versions[gdb_i] <- as.character(version_row$value[[1L]])
+    }
+    build_row <- DBI::dbGetQuery(
+      gdb,
+      "select * from src.meta where name = 'genomeBuild'"
+    )
+    if (NROW(build_row) != 1L) {
+      warning(
+        sprintf(
+          "Expected exactly one 'genomeBuild' entry in meta for '%s', found %d; using NA.",
+          gdb_i,
+          NROW(build_row)
+        ),
+        call. = FALSE
+      )
+      builds[gdb_i] <- NA_character_
+    } else {
+      builds[gdb_i] <- as.character(build_row$value[[1L]])
+    }

Conversation

pjhop commented Mar 30, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants