Skip to content

Merge refactored branch into main#9

Merged
pjhop merged 78 commits intomainfrom
dev
Mar 31, 2026
Merged

Merge refactored branch into main#9
pjhop merged 78 commits intomainfrom
dev

Conversation

@pjhop
Copy link
Copy Markdown
Contributor

@pjhop pjhop commented Mar 30, 2026

This release is primarily an internal refactoring focused on improved robustness:

  • large source files are split into focused modules.
  • almost all methods and functions have been refactored significantly for clarity, maintainability and robustness.
  • test coverage is expanded to >90% for most code.
  • there are also several small bug fixes and performance improvements.

@pjhop pjhop requested a review from Copilot March 30, 2026 19:25
@pjhop pjhop self-assigned this Mar 30, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR merges a large internal refactor into main, reorganizing the gdb/association tooling into new focused modules and replacing the legacy aggregateFile workflow with the new SQLite-backed aggdb format.

Changes:

  • Introduces aggdb/aggdbList classes and associated merge/collapse + aggdb-based assocTest implementation.
  • Refactors and modularizes gdb utilities (build/concat/subset/getGT/mapVariants/summariseGeno) with shared helpers and stronger input validation.
  • Updates package metadata/exports (DESCRIPTION/NAMESPACE) and adds an R-CMD-check GitHub Actions workflow.

Reviewed changes

Copilot reviewed 31 out of 251 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
R/gdb-utils.R Adds refactored gdb utilities (concat/subset/writeVcf) and internal helpers.
R/gdb-summariseGeno.R New summariseGeno,gdb implementation using refactored helpers and filters.
R/gdb-shared-helpers.R Shared validation + metadata/index helpers used across gdb modules.
R/gdb-mapVariants.R Refactors mapVariants,gdb and its range-handling/validation.
R/gdb-getGT.R Refactors getGT,gdb into smaller helpers (vars/cohort/ploidy/anno).
R/gdb-buildGdb.R Refactors gdb build/population from VCF and var_ranges generation.
R/gdb-anno-cohort.R Refactors get/upload anno/cohort and table dropping behavior.
R/gdb-aggregate.R Updates gdb aggregate implementation to write aggdb outputs.
R/dump.R Removes legacy .dump-based implementations.
R/assocTest_helper.R Removes legacy covariate helper (reintroduced elsewhere in refactor).
R/assocTest_aggregateFile.R Removes legacy assocTest on aggregateFile.
R/aggregateFile.R Removes legacy aggregateFile classes/APIs.
R/asserts-types-check.R Adds rlang standalone type checking utilities.
R/asserts-obj-type.R Adds rlang standalone object-type/error utilities.
R/allInternalData.R Refactors internal constants + adds aggdb test lists and contig lengths.
R/aggDb.R Adds aggdb class constructor, getters, and internal initializer.
R/aggDb-list.R Adds aggdbList builder + list methods.
R/aggDb-merge.R Adds mergeAggDbs,aggdbList method.
R/aggDb-collapse.R Adds collapseAggDbs,aggdbList method.
R/assocTest-gdb.R Adds refactored assocTest,gdb entrypoint.
R/assocTest-aggdb.R Adds refactored assocTest,aggdb implementation including IO + analysis loop.
NAMESPACE Switches exports from aggregateFile* to aggdb* and adds listParams.
DESCRIPTION Bumps version to 0.4.0, updates dependencies/collate, adds rlang/data.table.
.github/workflows/R-CMD-check.yaml Adds CI workflow for R CMD check on push/PR.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

IID <- DBI::dbGetQuery(gdb_i, "select * from SM")$IID
} else {
IID_i <- DBI::dbGetQuery(gdb_i, "select * from SM")$IID
if (!identical(IID, IID_i)) {
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identical() requires the same ordering; select * from SM has no guaranteed row order, so two gdbs with the same sample set but different row ordering will falsely fail. Compare in a stable order (e.g., select IID from SM order by IID or preserve SM rowid ordering explicitly) or use setequal() if order is not semantically important.

Suggested change
if (!identical(IID, IID_i)) {
if (!setequal(IID, IID_i)) {

Copilot uses AI. Check for mistakes.
DESCRIPTION Outdated
readr,
ggh4x,
data.table,
rlang
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bundled rlang standalone files explicitly state an rlang (>= 1.1.0) requirement, and they call versioned rlang::ffi_* symbols. Add a minimum version constraint in DESCRIPTION (e.g., rlang (>= 1.1.0)) to prevent installations with older rlang versions failing at runtime.

Suggested change
rlang
rlang (>= 1.1.0)

Copilot uses AI. Check for mistakes.
@pjhop pjhop requested a review from Copilot March 30, 2026 20:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 31 out of 251 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +756 to +767
# variant retrieval queries with / without genotype data
if (includeGeno) {
query <- sprintf(
"select CHROM, POS, %s, REF, ALT, QUAL, FILTER, '.', FORMAT, GT from var inner join dosage using (VAR_id)",
id_field
)
} else {
query <- sprintf(
"select CHROM, POS, %s, REF, ALT, QUAL, FILTER, '.' from var",
id_field
)
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeVcf() exports only the GT field (from dosage.GT), but the query selects var.FORMAT from the originally imported VCF which may contain multiple subfields (e.g., GT:DP:GQ). This produces invalid VCF rows where FORMAT declares more fields than the genotype columns contain. Consider selecting a constant FORMAT of 'GT' (or reconstructing full genotype fields) when includeGeno = TRUE.

Copilot uses AI. Check for mistakes.
Comment on lines +429 to +441
## unpack genotypes and store as matrix
GT <- lapply(GT[, 2], memDecompress, type = "gzip", asChar = FALSE)
ascii_offset <- as.integer(charToRaw("0"))
GT <- as.integer(unlist(GT)) - ascii_offset
GT[GT > 2L] <- NA_real_
GT <- matrix(GT, nrow = nvar, ncol = length(GT) / nvar, byrow = TRUE)
rownames(GT) <- VAR_id
if (verbose) {
message(sprintf("Retrieved genotypes for %s variants", nvar))
}

# return
GT
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-empty genotype matrices, colnames(GT) is never set to IID (it’s only set in the nvar == 0L branch). This will break downstream expectations that samples are labeled and aligned (e.g., checks comparing SM$IID to colnames(GT)). Set colnames(GT) <- IID before returning.

Copilot uses AI. Check for mistakes.
}
if (anyDuplicated(VAR_id) != 0L) {
stop("`VAR_id` contains duplicated values.", call. = FALSE)
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.gdb_check_varid() allows arbitrary character vectors (e.g., c('abc')) as long as they are non-NA and non-duplicated. Multiple call sites later coerce VAR_id with as.integer(VAR_id), which would introduce NAs and corrupt queries. Consider tightening validation: if VAR_id is character, require it to be integer-like (e.g., all grepl('^[0-9]+$')) or explicitly validate that as.integer(VAR_id) does not introduce NAs.

Suggested change
}
}
if (is.character(VAR_id)) {
VAR_id_int <- suppressWarnings(as.integer(VAR_id))
if (anyNA(VAR_id_int)) {
stop(
"`VAR_id` is a character vector but contains values that cannot be safely coerced to integer.",
call. = FALSE
)
}
}

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +118
# check output (aggdb) and overWrite
# establish connection to output
.check_output(output, overWrite = overWrite, verbose = verbose)
inputs <- .aggregate_prepare_aggdb_inputs(as.list(environment()), gdb = x)
aggdb <- .init_aggdb(
output = output,
metadata = inputs[["metadata"]],
params = inputs[["params"]],
samples = inputs[["samples"]],
overWrite = overWrite,
verbose = verbose
)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aggregate(gdb) documents output = NULL as a valid default, but the implementation unconditionally calls .check_output(output, ...) and then initializes an aggdb at output. This makes output = NULL effectively invalid and will error at runtime. Either (mandatory) require output to be a non-NULL single path in input validation, or (optional) implement an in-memory return path when output is NULL.

Copilot uses AI. Check for mistakes.
Comment on lines +319 to +327
# export new var table to output gdb
DBI::dbExecute(object, sprintf("ATTACH '%s' as %s", output, tmp))
## detach on exit
on.exit(
try(
DBI::dbExecute(
object,
sprintf("DETACH DATABASE %s", DBI::dbQuoteIdentifier(object, tmp))
),
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an SQL injection / quoting risk in the ATTACH statement because output and the attached schema name tmp are interpolated via sprintf() without quoting/escaping. Use DBI::dbQuoteString() for the file path and DBI::dbQuoteIdentifier() for the attach name in the ATTACH statement as well (not just in DETACH).

Copilot uses AI. Check for mistakes.
Comment on lines +77 to +84
versions[gdb_i] <- DBI::dbGetQuery(
gdb,
"select * from src.meta where name = 'rvatVersion'"
)$value
builds[gdb_i] <- DBI::dbGetQuery(
gdb,
"select * from src.meta where name = 'genomeBuild'"
)$value
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If src.meta is missing either row, dbGetQuery(... )$value will be character(0), and assigning that into versions[gdb_i]/builds[gdb_i] will throw a 'replacement has length zero' error. Handle the 0-row case explicitly (e.g., default to NA_character_ or stop with a clearer message).

Suggested change
versions[gdb_i] <- DBI::dbGetQuery(
gdb,
"select * from src.meta where name = 'rvatVersion'"
)$value
builds[gdb_i] <- DBI::dbGetQuery(
gdb,
"select * from src.meta where name = 'genomeBuild'"
)$value
version_row <- DBI::dbGetQuery(
gdb,
"select * from src.meta where name = 'rvatVersion'"
)
if (NROW(version_row) != 1L) {
warning(
sprintf(
"Expected exactly one 'rvatVersion' entry in meta for '%s', found %d; using NA.",
gdb_i,
NROW(version_row)
),
call. = FALSE
)
versions[gdb_i] <- NA_character_
} else {
versions[gdb_i] <- as.character(version_row$value[[1L]])
}
build_row <- DBI::dbGetQuery(
gdb,
"select * from src.meta where name = 'genomeBuild'"
)
if (NROW(build_row) != 1L) {
warning(
sprintf(
"Expected exactly one 'genomeBuild' entry in meta for '%s', found %d; using NA.",
gdb_i,
NROW(build_row)
),
call. = FALSE
)
builds[gdb_i] <- NA_character_
} else {
builds[gdb_i] <- as.character(build_row$value[[1L]])
}

Copilot uses AI. Check for mistakes.
call. = FALSE
)
} else {
units_all <- paste0("X", seq_along(units_all))
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When checkDups = FALSE, units_all is renamed locally, but the underlying units stored in each aggdb are not renamed. This makes downstream behaviors inconsistent (e.g., mergeAggDbs() still inserts the original unit names and can violate the aggregates(unit PRIMARY KEY) constraint, and the final validation compares against renamed values). Either disallow checkDups = FALSE for aggdbList, or carry a mapping and apply it during merge/collapse (including rewriting unit names on insert).

Suggested change
units_all <- paste0("X", seq_along(units_all))
stop(
paste(
"checkDups = FALSE is not supported for aggdbList when duplicate",
"units are present across databases."
),
call. = FALSE
)

Copilot uses AI. Check for mistakes.
@pjhop pjhop merged commit 7cf9257 into main Mar 31, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants