Conversation
…li; minor fixes documentation
There was a problem hiding this comment.
Pull request overview
This PR merges a large internal refactor into main, reorganizing the gdb/association tooling into new focused modules and replacing the legacy aggregateFile workflow with the new SQLite-backed aggdb format.
Changes:
- Introduces
aggdb/aggdbListclasses and associated merge/collapse + aggdb-basedassocTestimplementation. - Refactors and modularizes gdb utilities (build/concat/subset/getGT/mapVariants/summariseGeno) with shared helpers and stronger input validation.
- Updates package metadata/exports (DESCRIPTION/NAMESPACE) and adds an R-CMD-check GitHub Actions workflow.
Reviewed changes
Copilot reviewed 31 out of 251 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| R/gdb-utils.R | Adds refactored gdb utilities (concat/subset/writeVcf) and internal helpers. |
| R/gdb-summariseGeno.R | New summariseGeno,gdb implementation using refactored helpers and filters. |
| R/gdb-shared-helpers.R | Shared validation + metadata/index helpers used across gdb modules. |
| R/gdb-mapVariants.R | Refactors mapVariants,gdb and its range-handling/validation. |
| R/gdb-getGT.R | Refactors getGT,gdb into smaller helpers (vars/cohort/ploidy/anno). |
| R/gdb-buildGdb.R | Refactors gdb build/population from VCF and var_ranges generation. |
| R/gdb-anno-cohort.R | Refactors get/upload anno/cohort and table dropping behavior. |
| R/gdb-aggregate.R | Updates gdb aggregate implementation to write aggdb outputs. |
| R/dump.R | Removes legacy .dump-based implementations. |
| R/assocTest_helper.R | Removes legacy covariate helper (reintroduced elsewhere in refactor). |
| R/assocTest_aggregateFile.R | Removes legacy assocTest on aggregateFile. |
| R/aggregateFile.R | Removes legacy aggregateFile classes/APIs. |
| R/asserts-types-check.R | Adds rlang standalone type checking utilities. |
| R/asserts-obj-type.R | Adds rlang standalone object-type/error utilities. |
| R/allInternalData.R | Refactors internal constants + adds aggdb test lists and contig lengths. |
| R/aggDb.R | Adds aggdb class constructor, getters, and internal initializer. |
| R/aggDb-list.R | Adds aggdbList builder + list methods. |
| R/aggDb-merge.R | Adds mergeAggDbs,aggdbList method. |
| R/aggDb-collapse.R | Adds collapseAggDbs,aggdbList method. |
| R/assocTest-gdb.R | Adds refactored assocTest,gdb entrypoint. |
| R/assocTest-aggdb.R | Adds refactored assocTest,aggdb implementation including IO + analysis loop. |
| NAMESPACE | Switches exports from aggregateFile* to aggdb* and adds listParams. |
| DESCRIPTION | Bumps version to 0.4.0, updates dependencies/collate, adds rlang/data.table. |
| .github/workflows/R-CMD-check.yaml | Adds CI workflow for R CMD check on push/PR. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| IID <- DBI::dbGetQuery(gdb_i, "select * from SM")$IID | ||
| } else { | ||
| IID_i <- DBI::dbGetQuery(gdb_i, "select * from SM")$IID | ||
| if (!identical(IID, IID_i)) { |
There was a problem hiding this comment.
identical() requires the same ordering; select * from SM has no guaranteed row order, so two gdbs with the same sample set but different row ordering will falsely fail. Compare in a stable order (e.g., select IID from SM order by IID or preserve SM rowid ordering explicitly) or use setequal() if order is not semantically important.
| if (!identical(IID, IID_i)) { | |
| if (!setequal(IID, IID_i)) { |
DESCRIPTION
Outdated
| readr, | ||
| ggh4x, | ||
| data.table, | ||
| rlang |
There was a problem hiding this comment.
The bundled rlang standalone files explicitly state an rlang (>= 1.1.0) requirement, and they call versioned rlang::ffi_* symbols. Add a minimum version constraint in DESCRIPTION (e.g., rlang (>= 1.1.0)) to prevent installations with older rlang versions failing at runtime.
| rlang | |
| rlang (>= 1.1.0) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 31 out of 251 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # variant retrieval queries with / without genotype data | ||
| if (includeGeno) { | ||
| query <- sprintf( | ||
| "select CHROM, POS, %s, REF, ALT, QUAL, FILTER, '.', FORMAT, GT from var inner join dosage using (VAR_id)", | ||
| id_field | ||
| ) | ||
| } else { | ||
| query <- sprintf( | ||
| "select CHROM, POS, %s, REF, ALT, QUAL, FILTER, '.' from var", | ||
| id_field | ||
| ) | ||
| } |
There was a problem hiding this comment.
writeVcf() exports only the GT field (from dosage.GT), but the query selects var.FORMAT from the originally imported VCF which may contain multiple subfields (e.g., GT:DP:GQ). This produces invalid VCF rows where FORMAT declares more fields than the genotype columns contain. Consider selecting a constant FORMAT of 'GT' (or reconstructing full genotype fields) when includeGeno = TRUE.
| ## unpack genotypes and store as matrix | ||
| GT <- lapply(GT[, 2], memDecompress, type = "gzip", asChar = FALSE) | ||
| ascii_offset <- as.integer(charToRaw("0")) | ||
| GT <- as.integer(unlist(GT)) - ascii_offset | ||
| GT[GT > 2L] <- NA_real_ | ||
| GT <- matrix(GT, nrow = nvar, ncol = length(GT) / nvar, byrow = TRUE) | ||
| rownames(GT) <- VAR_id | ||
| if (verbose) { | ||
| message(sprintf("Retrieved genotypes for %s variants", nvar)) | ||
| } | ||
|
|
||
| # return | ||
| GT |
There was a problem hiding this comment.
For non-empty genotype matrices, colnames(GT) is never set to IID (it’s only set in the nvar == 0L branch). This will break downstream expectations that samples are labeled and aligned (e.g., checks comparing SM$IID to colnames(GT)). Set colnames(GT) <- IID before returning.
| } | ||
| if (anyDuplicated(VAR_id) != 0L) { | ||
| stop("`VAR_id` contains duplicated values.", call. = FALSE) | ||
| } |
There was a problem hiding this comment.
.gdb_check_varid() allows arbitrary character vectors (e.g., c('abc')) as long as they are non-NA and non-duplicated. Multiple call sites later coerce VAR_id with as.integer(VAR_id), which would introduce NAs and corrupt queries. Consider tightening validation: if VAR_id is character, require it to be integer-like (e.g., all grepl('^[0-9]+$')) or explicitly validate that as.integer(VAR_id) does not introduce NAs.
| } | |
| } | |
| if (is.character(VAR_id)) { | |
| VAR_id_int <- suppressWarnings(as.integer(VAR_id)) | |
| if (anyNA(VAR_id_int)) { | |
| stop( | |
| "`VAR_id` is a character vector but contains values that cannot be safely coerced to integer.", | |
| call. = FALSE | |
| ) | |
| } | |
| } |
| # check output (aggdb) and overWrite | ||
| # establish connection to output | ||
| .check_output(output, overWrite = overWrite, verbose = verbose) | ||
| inputs <- .aggregate_prepare_aggdb_inputs(as.list(environment()), gdb = x) | ||
| aggdb <- .init_aggdb( | ||
| output = output, | ||
| metadata = inputs[["metadata"]], | ||
| params = inputs[["params"]], | ||
| samples = inputs[["samples"]], | ||
| overWrite = overWrite, | ||
| verbose = verbose | ||
| ) |
There was a problem hiding this comment.
aggregate(gdb) documents output = NULL as a valid default, but the implementation unconditionally calls .check_output(output, ...) and then initializes an aggdb at output. This makes output = NULL effectively invalid and will error at runtime. Either (mandatory) require output to be a non-NULL single path in input validation, or (optional) implement an in-memory return path when output is NULL.
| # export new var table to output gdb | ||
| DBI::dbExecute(object, sprintf("ATTACH '%s' as %s", output, tmp)) | ||
| ## detach on exit | ||
| on.exit( | ||
| try( | ||
| DBI::dbExecute( | ||
| object, | ||
| sprintf("DETACH DATABASE %s", DBI::dbQuoteIdentifier(object, tmp)) | ||
| ), |
There was a problem hiding this comment.
There is an SQL injection / quoting risk in the ATTACH statement because output and the attached schema name tmp are interpolated via sprintf() without quoting/escaping. Use DBI::dbQuoteString() for the file path and DBI::dbQuoteIdentifier() for the attach name in the ATTACH statement as well (not just in DETACH).
| versions[gdb_i] <- DBI::dbGetQuery( | ||
| gdb, | ||
| "select * from src.meta where name = 'rvatVersion'" | ||
| )$value | ||
| builds[gdb_i] <- DBI::dbGetQuery( | ||
| gdb, | ||
| "select * from src.meta where name = 'genomeBuild'" | ||
| )$value |
There was a problem hiding this comment.
If src.meta is missing either row, dbGetQuery(... )$value will be character(0), and assigning that into versions[gdb_i]/builds[gdb_i] will throw a 'replacement has length zero' error. Handle the 0-row case explicitly (e.g., default to NA_character_ or stop with a clearer message).
| versions[gdb_i] <- DBI::dbGetQuery( | |
| gdb, | |
| "select * from src.meta where name = 'rvatVersion'" | |
| )$value | |
| builds[gdb_i] <- DBI::dbGetQuery( | |
| gdb, | |
| "select * from src.meta where name = 'genomeBuild'" | |
| )$value | |
| version_row <- DBI::dbGetQuery( | |
| gdb, | |
| "select * from src.meta where name = 'rvatVersion'" | |
| ) | |
| if (NROW(version_row) != 1L) { | |
| warning( | |
| sprintf( | |
| "Expected exactly one 'rvatVersion' entry in meta for '%s', found %d; using NA.", | |
| gdb_i, | |
| NROW(version_row) | |
| ), | |
| call. = FALSE | |
| ) | |
| versions[gdb_i] <- NA_character_ | |
| } else { | |
| versions[gdb_i] <- as.character(version_row$value[[1L]]) | |
| } | |
| build_row <- DBI::dbGetQuery( | |
| gdb, | |
| "select * from src.meta where name = 'genomeBuild'" | |
| ) | |
| if (NROW(build_row) != 1L) { | |
| warning( | |
| sprintf( | |
| "Expected exactly one 'genomeBuild' entry in meta for '%s', found %d; using NA.", | |
| gdb_i, | |
| NROW(build_row) | |
| ), | |
| call. = FALSE | |
| ) | |
| builds[gdb_i] <- NA_character_ | |
| } else { | |
| builds[gdb_i] <- as.character(build_row$value[[1L]]) | |
| } |
| call. = FALSE | ||
| ) | ||
| } else { | ||
| units_all <- paste0("X", seq_along(units_all)) |
There was a problem hiding this comment.
When checkDups = FALSE, units_all is renamed locally, but the underlying units stored in each aggdb are not renamed. This makes downstream behaviors inconsistent (e.g., mergeAggDbs() still inserts the original unit names and can violate the aggregates(unit PRIMARY KEY) constraint, and the final validation compares against renamed values). Either disallow checkDups = FALSE for aggdbList, or carry a mapping and apply it during merge/collapse (including rewriting unit names on insert).
| units_all <- paste0("X", seq_along(units_all)) | |
| stop( | |
| paste( | |
| "checkDups = FALSE is not supported for aggdbList when duplicate", | |
| "units are present across databases." | |
| ), | |
| call. = FALSE | |
| ) |
This release is primarily an internal refactoring focused on improved robustness: