An R package with tools for running PRIMED PGS (Polygenic Score) benchmarking workflows in an AnVIL/Terra workspace. The package wraps and orchestrates WDL workflows, and provides helper utilities for workspace inputs/outputs, local ancestry pipelines (HAUDI/GAUDI), and PGS Catalog metadata ingestion.
primed.benchmarking supports two main use cases:

- PRIMED PGS Catalog pipeline
  - Fetch a scoring file from the PGS Catalog using the primed_fetch_pgs_catalog workflow.
  - Compute individual-level PGS on cohort genotypes (PLINK2 pgen/psam/pvar) using the primed_calc_pgs workflow.
  - Optionally perform ancestry-based score adjustment using PCs.
  - Both workflows validate outputs against the PRIMED PGS data model and import results into the workspace data tables.
- HAUDI/GAUDI local-ancestry-aware pipeline
  - Prepare local ancestry inputs (VCF → PLINK2 + FLARE → .lanc).
  - Build a Filebacked Big Matrix (FBM) for HAUDI/GAUDI.
  - Fit HAUDI or GAUDI models and generate ancestry-aware PGS.

Many functions assume you are authenticated to AnVIL/Terra and working inside (or targeting) a specific workspace.
Once the package is published, install it from GitHub:

# install.packages("remotes")
remotes::install_github("UW-GAC/primed.benchmarking")

While the package is under active development, you can install directly from a specific branch:

# install.packages("remotes")
remotes::install_github("UW-GAC/primed.benchmarking", ref = "main")

If you have cloned the repository and want to use the latest uncommitted changes (for example, while developing or testing), use devtools:

# install.packages("devtools")
devtools::load_all("/path/to/primed.benchmarking")

load_all() sources all R files in R/ into your session without fully building the package, which makes the edit-test cycle faster. To do a full local install instead, use:

devtools::install("/path/to/primed.benchmarking")

The package depends on the AnVILGCP Bioconductor package. Install it with BiocManager before installing primed.benchmarking:

# install.packages("BiocManager")
BiocManager::install("AnVILGCP")

Tests use the testthat framework. To run them from a local clone:

# install.packages("devtools")
devtools::test("/path/to/primed.benchmarking")

Before using these functions, the following must be configured in your AnVIL workspace:
- Workflow method configurations: both primed_fetch_pgs_catalog and primed_calc_pgs must be imported into the workspace from Dockstore or the Broad Methods Repository.
- Cohort genotype files: the paths to the cohort PLINK2 files must be stored as workspace-level data attributes:
  - workspace.pgen: path to the .pgen file
  - workspace.psam: path to the .psam file
  - workspace.pvar: path to the .pvar file

  Note: The .pvar file must have variant IDs in the form chr:pos:ref:alt without the chr prefix.

  These can be set in the Terra UI under Data → Workspace Data, or programmatically with AnVILGCP::avdata_import().

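As a rough illustration of the variant ID note above, the following base-R sketch flags IDs that do not follow chr:pos:ref:alt with an unprefixed chromosome. The helper name check_variant_ids, the accepted chromosome labels (1-22, X, Y, MT), and the restriction to A/C/G/T alleles are assumptions made for this example; they are not part of the package.

```r
# Hypothetical helper (not part of primed.benchmarking): TRUE for IDs that look
# like "chrom:pos:ref:alt" with no "chr" prefix on the chromosome.
check_variant_ids <- function(ids) {
  # Assumed pattern: chromosome 1-22, X, Y, or MT; integer position; A/C/G/T alleles.
  pattern <- "^([1-9]|1[0-9]|2[0-2]|X|Y|MT):[0-9]+:[ACGT]+:[ACGT]+$"
  grepl(pattern, ids)
}

ids <- c("1:12345:A:G", "chr1:12345:A:G", "22:500:AT:A", "X:100:C:T")
ok <- check_variant_ids(ids)

# IDs that would need to be rewritten before running the workflows
ids[!ok]
```

A check like this can be run on a sample of the .pvar ID column before setting the workspace attributes.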
library(primed.benchmarking)
result <- run_pgs_pipeline(
pgs_id = "PGS000001",
genome_build = "GRCh38",
dest_bucket = "gs://my-bucket/pgs_results",
sampleset_name = "my_cohort"
)
# Returns submission IDs for both workflows
result$fetch_submission
result$calc_submission

# Same pipeline with ancestry-based score adjustment enabled
result <- run_pgs_pipeline(
pgs_id = "PGS000001",
genome_build = "GRCh38",
dest_bucket = "gs://my-bucket/pgs_results",
sampleset_name = "my_cohort",
ancestry_adjust = TRUE,
pcs = "gs://my-bucket/cohort/cohort.pcs"
)

# Step-by-step alternative: submit each workflow yourself
# 1. Read cohort file paths from workspace attributes
cohort <- get_cohort_files()
# 2. Submit the fetch workflow
fetch_id <- submit_fetch_pgs_workflow(
pgs_id = "PGS000001",
genome_build = "GRCh38",
dest_bucket = "gs://my-bucket/pgs_catalog",
model_url = paste0(
"https://raw.githubusercontent.com/UW-GAC/primed_data_models/",
"refs/heads/main/PRIMED_PGS_data_model.json"
)
)
# 3. Wait for the fetch workflow to complete
wait_for_workflow(fetch_id)
# 4. Submit the calc workflow
calc_id <- submit_calc_pgs_workflow(
pgs_model_id = "PGS000001",
scorefile = "gs://my-bucket/pgs_catalog/PGS000001_hmPOS_GRCh38.txt.gz",
genome_build = "GRCh38",
pgen = cohort$pgen,
psam = cohort$psam,
pvar = cohort$pvar,
min_overlap = 0.75,
sampleset_name = "my_cohort",
dest_bucket = "gs://my-bucket/pgs_results",
model_url = paste0(
"https://raw.githubusercontent.com/UW-GAC/primed_data_models/",
"refs/heads/main/PRIMED_PGS_data_model.json"
)
)

All exported (public) functions are listed below, grouped by category.
run_pgs_pipeline() orchestrates the full end-to-end PRIMED PGS workflow for a given PGS Catalog score:
- Reads cohort genotype paths from workspace attributes.
- Submits primed_fetch_pgs_catalog to fetch and import the scoring file.
- Waits for that workflow to complete.
- Looks up the fetched scoring file path from the pgs_scoring_file table.
- Submits primed_calc_pgs to calculate individual-level scores.

| Argument | Type | Default | Description |
|---|---|---|---|
| pgs_id | character | (none) | PGS Catalog ID, e.g. "PGS000001". |
| genome_build | character | (none) | "GRCh38" or "GRCh37". |
| dest_bucket | character | (none) | gs:// path where output files are written. |
| sampleset_name | character | (none) | Cohort name used in output file naming. |
| model_url | character | PRIMED data model URL | URL to the PRIMED PGS data model JSON. |
| min_overlap | numeric | 0.75 | Minimum fraction of score variants present in the genotype data. |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| workflow_namespace | character | workspace_namespace | Namespace of the workflow method configurations. |
| overwrite | logical | FALSE | Overwrite existing rows in the data tables. |
| ancestry_adjust | logical | FALSE | Enable ancestry-based score adjustment. |
| pcs | character or NULL | NULL | gs:// path to PC file (required when ancestry_adjust = TRUE). |
| primed_dataset_id | character or NULL | NULL | Optional PRIMED dataset identifier. |
| poll_interval | numeric | 60 | Seconds between workflow status polls. |
| timeout | numeric | 3600 | Maximum seconds to wait for the fetch workflow. |
| use_call_cache | logical | TRUE | Enable Cromwell call caching. |
| skip_if_complete | logical | FALSE | Reuse prior successful submissions when available. |

Returns a named list with:
- fetch_submission: submission ID of the primed_fetch_pgs_catalog run.
- calc_submission: submission ID of the primed_calc_pgs run.

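Because a submission can run for a long time before failing, it may help to check arguments locally first. The sketch below is a hypothetical pre-flight check written against the argument table above; check_pipeline_args is not a package function, the "PGS" plus six digits ID pattern is an assumption, and the real run_pgs_pipeline() may validate its inputs differently.

```r
# Hypothetical pre-flight check (not part of primed.benchmarking).
# Returns a character vector of problems; length 0 means nothing was flagged.
check_pipeline_args <- function(pgs_id, genome_build, dest_bucket,
                                min_overlap = 0.75,
                                ancestry_adjust = FALSE, pcs = NULL) {
  problems <- character(0)
  # Assumed ID format: "PGS" followed by six digits.
  if (!grepl("^PGS[0-9]{6}$", pgs_id)) {
    problems <- c(problems, "pgs_id should look like 'PGS000001'")
  }
  if (!genome_build %in% c("GRCh38", "GRCh37")) {
    problems <- c(problems, "genome_build must be 'GRCh38' or 'GRCh37'")
  }
  if (!startsWith(dest_bucket, "gs://")) {
    problems <- c(problems, "dest_bucket must be a gs:// path")
  }
  if (min_overlap < 0 || min_overlap > 1) {
    problems <- c(problems, "min_overlap must be a fraction between 0 and 1")
  }
  if (ancestry_adjust && is.null(pcs)) {
    problems <- c(problems, "pcs is required when ancestry_adjust = TRUE")
  }
  problems
}

good <- check_pipeline_args("PGS000001", "GRCh38", "gs://my-bucket/pgs_results")
bad  <- check_pipeline_args("PGS1", "hg19", "my-bucket", ancestry_adjust = TRUE)
```

Here good is empty, while bad lists one message per argument that failed a check.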
get_cohort_files() reads cohort genotype file paths stored as workspace-level data attributes
(workspace.pgen, workspace.psam, workspace.pvar) and returns them as a
named list.
| Argument | Type | Default | Description |
|---|---|---|---|
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |

Returns a named list with character elements pgen, psam, and pvar
(Google Cloud Storage paths).
submit_fetch_pgs_workflow() configures and submits the primed_fetch_pgs_catalog workflow in the current
AnVIL workspace. This workflow fetches a scoring file from the PGS Catalog,
copies it to the specified GCS bucket, and imports metadata into the workspace
pgs_model and pgs_scoring_file data tables.
The workflow method configuration named primed_fetch_pgs_catalog must already
be imported into the workspace before calling this function.
| Argument | Type | Default | Description |
|---|---|---|---|
| pgs_id | character | (none) | PGS Catalog ID, e.g. "PGS000001". |
| genome_build | character | (none) | "GRCh38" or "GRCh37". |
| dest_bucket | character | (none) | gs:// path where scoring files are written. |
| model_url | character | (none) | URL to the PRIMED PGS data model JSON. |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| workflow_namespace | character | workspace_namespace | Namespace of the workflow method configuration. |
| overwrite | logical | FALSE | Overwrite existing rows in the data tables. |
| use_call_cache | logical | TRUE | Enable Cromwell call caching. |
| skip_if_complete | logical | FALSE | Skip submission if a prior successful run exists; return its ID. |

Returns a character string: the workflow submission ID.
submit_calc_pgs_workflow(pgs_model_id, scorefile, genome_build, pgen, psam, pvar, min_overlap, sampleset_name, dest_bucket, model_url, ...)
Configures and submits the primed_calc_pgs workflow. This workflow matches
the scoring file to cohort genotype data, calculates individual-level polygenic
scores with PLINK2, optionally adjusts for ancestry using PCs, and imports
results into the workspace pgs_individual_file data table.
The workflow method configuration named primed_calc_pgs must already be
imported into the workspace before calling this function.
| Argument | Type | Default | Description |
|---|---|---|---|
| pgs_model_id | character | (none) | PGS model identifier, e.g. "PGS000001". |
| scorefile | character | (none) | gs:// path to the scoring file fetched from the PGS Catalog. |
| genome_build | character | (none) | "GRCh38" or "GRCh37". |
| pgen | character | (none) | gs:// path to the cohort .pgen file. |
| psam | character | (none) | gs:// path to the cohort .psam file. |
| pvar | character | (none) | gs:// path to the cohort .pvar file. |
| min_overlap | numeric | (none) | Minimum fraction of score variants present in the genotype data. |
| sampleset_name | character | (none) | Name used to construct output file names. |
| dest_bucket | character | (none) | gs:// path where score output files are written. |
| model_url | character | (none) | URL to the PRIMED PGS data model JSON. |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| workflow_namespace | character | workspace_namespace | Namespace of the workflow method configuration. |
| overwrite | logical | FALSE | Overwrite existing rows in the data tables. |
| ancestry_adjust | logical | FALSE | Enable ancestry-based score adjustment. |
| pcs | character or NULL | NULL | gs:// path to a PC file for ancestry adjustment. |
| primed_dataset_id | character or NULL | NULL | Optional PRIMED dataset identifier. |
| use_call_cache | logical | TRUE | Enable Cromwell call caching. |
| skip_if_complete | logical | FALSE | Skip submission if a prior successful run exists; return its ID. |

Returns a character string: the workflow submission ID.
wait_for_workflow() polls an AnVIL workflow submission at regular intervals until all workflows in
the submission reach a terminal state (Succeeded, Failed, or Aborted).
Raises an error if any workflows fail or are aborted.
| Argument | Type | Default | Description |
|---|---|---|---|
| submission_id | character | (none) | Submission ID returned by submit_fetch_pgs_workflow() or submit_calc_pgs_workflow(). |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| poll_interval | numeric | 60 | Seconds between status checks. |
| timeout | numeric | 3600 | Maximum seconds to wait before timing out. |

Returns invisibly: the final job status tibble.
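The polling behaviour described above can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: poll_until_done and its get_status argument are hypothetical, and get_status stands in for whatever call returns one status string per workflow in the submission.

```r
# Hypothetical polling loop (not the package's wait_for_workflow()).
# get_status: function returning a character vector of workflow statuses.
poll_until_done <- function(get_status, poll_interval = 60, timeout = 3600) {
  terminal <- c("Succeeded", "Failed", "Aborted")
  start <- Sys.time()
  repeat {
    status <- get_status()
    if (all(status %in% terminal)) break
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed > timeout) stop("Timed out waiting for submission")
    Sys.sleep(poll_interval)
  }
  # Fail loudly if any workflow did not succeed.
  if (any(status %in% c("Failed", "Aborted"))) {
    stop("At least one workflow failed or was aborted")
  }
  invisible(status)
}

# Usage with a fake status source that succeeds on the third poll.
calls <- 0
fake_status <- function() {
  calls <<- calls + 1
  if (calls < 3) "Running" else "Succeeded"
}
final <- poll_until_done(fake_status, poll_interval = 0)
```

Passing the status source as a function keeps the loop testable without access to a workspace.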
This package wraps three WDL workflows for running the HAUDI and GAUDI ancestry-aware PGS methods.
submit_gaudi_prep_workflow(vcf_files, ref_file_list, out_prefix_list, genetic_map_file, reference_map_file, ...)
Configures and submits the gaudi_prep workflow
(github.com/UW-GAC/gaudi_prep_wdl,
branch gaudi_prep_wdl). This workflow converts per-chromosome VCF files to
PLINK2 format, runs FLARE local ancestry inference, and converts FLARE output
to the .lanc format required by submit_make_fbm_workflow().
The workflow method configuration named gaudi_prep must already be imported
into the workspace from Dockstore
(github.com/UW-GAC/gaudi_prep_wdl/gaudi_prep:gaudi_prep_wdl) before calling
this function.
| Argument | Type | Default | Description |
|---|---|---|---|
| vcf_files | character vector | (none) | gs:// paths to per-chromosome VCF files. |
| ref_file_list | character vector | (none) | gs:// paths to per-chromosome reference VCF files for FLARE. |
| out_prefix_list | character vector | (none) | Output prefixes for FLARE, one per chromosome (e.g. c("chr1", ..., "chr22")). |
| genetic_map_file | character | (none) | gs:// path to the genetic map file for FLARE. |
| reference_map_file | character | (none) | gs:// path to the FLARE reference population map file. |
| samples_keep | character or NULL | NULL | Optional gs:// path to a file of sample IDs to retain. |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| workflow_namespace | character | workspace_namespace | Namespace of the workflow method configuration. |
| use_call_cache | logical | TRUE | Enable Cromwell call caching. |
| skip_if_complete | logical | FALSE | Skip submission if a prior successful run exists; return its ID. |

Returns a character string: the workflow submission ID.
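Since the three per-chromosome vectors are presumably matched by position, a quick local consistency check before submitting may save a failed run. The sketch below assumes that all three vectors must have the same length and chromosome order, and that VCF files are named after their output prefix; both are assumptions for this example, and the bucket paths are placeholders.

```r
# Build the per-chromosome inputs (placeholder bucket paths).
chroms <- 1:22
prep_inputs <- list(
  vcf_files       = paste0("gs://my-bucket/vcf/chr", chroms, ".vcf.gz"),
  ref_file_list   = paste0("gs://my-bucket/ref/chr", chroms, "REF.vcf.gz"),
  out_prefix_list = paste0("chr", chroms)
)

# Assumed requirement: one entry per chromosome in every vector.
n_per_input <- lengths(prep_inputs)
same_length <- length(unique(n_per_input)) == 1

# Assumed naming convention: "<prefix>.vcf.gz", so position i in each vector
# refers to the same chromosome.
aligned <- basename(prep_inputs$vcf_files) ==
  paste0(prep_inputs$out_prefix_list, ".vcf.gz")

if (!same_length || !all(aligned)) {
  stop("Per-chromosome inputs are not the same length or not in the same order")
}
```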
submit_make_fbm_workflow(lanc_files, pgen_files, pvar_files, psam_files, fbm_prefix, anc_names, ...)
Configures and submits the make_fbm workflow
(github.com/frankp-0/HAUDI_workflow,
branch main). This workflow converts per-chromosome .lanc local ancestry
files and matching PLINK2 files into a Filebacked Big Matrix (FBM) compatible
with HAUDI and GAUDI.
The workflow method configuration named make_fbm must already be imported
into the workspace from Dockstore
(github.com/frankp-0/HAUDI_workflow/make_fbm:main) before calling this
function.
| Argument | Type | Default | Description |
|---|---|---|---|
| lanc_files | character vector | (none) | gs:// paths to per-chromosome .lanc files. |
| pgen_files | character vector | (none) | gs:// paths to per-chromosome .pgen files. |
| pvar_files | character vector | (none) | gs:// paths to per-chromosome .pvar files. |
| psam_files | character vector | (none) | gs:// paths to per-chromosome .psam files. |
| fbm_prefix | character | (none) | Output prefix for the FBM files (the backing file will be <fbm_prefix>.bk). |
| anc_names | character vector | (none) | Ancestry names in the same order as the integer codes used in the .lanc files (e.g. c("AFR", "EUR")). |
| variants_file | character or NULL | NULL | Optional gs:// path to a file of variant IDs used to subset the FBM. |
| min_ac | integer or NULL | NULL | Optional minimum allele count to retain a column in the FBM. |
| samples_file | character or NULL | NULL | Optional gs:// path to a file of sample IDs used to subset the FBM. |
| chunk_size | integer | 400 | Maximum number of variants to read from the .pgen file at a time. |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| workflow_namespace | character | workspace_namespace | Namespace of the workflow method configuration. |
| use_call_cache | logical | TRUE | Enable Cromwell call caching. |
| skip_if_complete | logical | FALSE | Skip submission if a prior successful run exists; return its ID. |

Returns a character string: the workflow submission ID.
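To give a feel for what chunk_size controls, the sketch below splits a range of variant indices into blocks of at most chunk_size variants. It only illustrates the arithmetic: chunk_ranges is a hypothetical helper, 1-based indexing and n_variants >= 1 are assumed, and the make_fbm workflow may organise its reads differently.

```r
# Hypothetical helper: start/end variant indices for each read of at most
# chunk_size variants (1-based, inclusive; assumes n_variants >= 1).
chunk_ranges <- function(n_variants, chunk_size = 400L) {
  starts <- seq(1L, n_variants, by = chunk_size)
  ends <- pmin(starts + chunk_size - 1L, n_variants)
  data.frame(start = starts, end = ends)
}

# 1000 variants with the default chunk size give two full chunks and one
# partial chunk.
chunks <- chunk_ranges(1000L)
chunks
```

A smaller chunk_size lowers peak memory per read at the cost of more reads.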
submit_fit_haudi_workflow(method, bk_file, info_file, dims_file, fbm_samples_file, phenotype_file, phenotype, output_prefix, ...)
Configures and submits the fit_haudi workflow
(github.com/frankp-0/HAUDI_workflow,
branch main). This workflow fits a HAUDI or GAUDI polygenic score model using
the FBM produced by submit_make_fbm_workflow() and a phenotype file, and
outputs ancestry-specific effect estimates and individual-level PGS.
The workflow method configuration named fit_haudi must already be imported
into the workspace from Dockstore
(github.com/frankp-0/HAUDI_workflow/fit_haudi:main) before calling this
function.
| Argument | Type | Default | Description |
|---|---|---|---|
| method | character | (none) | "HAUDI" or "GAUDI". |
| bk_file | character | (none) | gs:// path to the FBM backing file (.bk) from submit_make_fbm_workflow(). |
| info_file | character | (none) | gs:// path to the FBM column info file. |
| dims_file | character | (none) | gs:// path to the FBM dimensions file. |
| fbm_samples_file | character | (none) | gs:// path to the FBM samples file. |
| phenotype_file | character | (none) | gs:// path to a phenotype file. Must contain a "#IID" column and at least one phenotype column. |
| phenotype | character | (none) | Name of the phenotype column to use as the response variable. |
| output_prefix | character | (none) | Prefix for output files (model, effects, PGS results). |
| family | character or NULL | NULL (→ "gaussian") | Model family: "gaussian" or "binomial" (HAUDI only). |
| training_samples_file | character or NULL | NULL | Optional gs:// path to a file with training sample IDs. |
| gamma_min | numeric | 0.01 | Minimum value of the gamma tuning parameter. |
| gamma_max | numeric | 5 | Maximum value of the gamma tuning parameter. |
| n_gamma | numeric | 5 | Number of gamma values to evaluate. |
| variants_file | character or NULL | NULL | Optional gs:// path to a file of variant IDs to use for model fitting. |
| n_folds | integer | 5 | Number of cross-validation folds. |
| workspace_namespace | character | current workspace | AnVIL workspace namespace. |
| workspace_name | character | current workspace | AnVIL workspace name. |
| workflow_namespace | character | workspace_namespace | Namespace of the workflow method configuration. |
| use_call_cache | logical | TRUE | Enable Cromwell call caching. |
| skip_if_complete | logical | FALSE | Skip submission if a prior successful run exists; return its ID. |

Returns a character string: the workflow submission ID.
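The phenotype file requirement ("#IID" plus at least one phenotype column) is easy to get wrong because R treats "#" specially in some readers and rewrites non-syntactic column names by default. The sketch below shows one way to check a local copy before uploading it. The tab delimiter and the helper name check_phenotype_file are assumptions for this example, and the toy file is made up.

```r
# Write a tiny example phenotype file (toy values, tab-delimited by assumption).
pheno_path <- tempfile(fileext = ".pheno")
writeLines(c("#IID\tBMI\tage", "S1\t24.1\t50", "S2\t31.7\t43"), pheno_path)

# Hypothetical helper: read the file and confirm the required columns exist.
check_phenotype_file <- function(path, phenotype) {
  # comment.char = "" stops "#IID" being read as a comment line;
  # check.names = FALSE keeps the column name exactly as "#IID".
  pheno <- read.delim(path, comment.char = "", check.names = FALSE)
  stopifnot("#IID" %in% names(pheno), phenotype %in% names(pheno))
  pheno
}

pheno <- check_phenotype_file(pheno_path, phenotype = "BMI")
names(pheno)
```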
Checks whether the local_ancestry_summary workspace data table exists in the
specified AnVIL workspace. If absent, creates a minimal table with a single
local_ancestry_summary_id column populated by cohort. If it already exists,
emits a message and makes no changes.
| Argument | Type | Description |
|---|---|---|
| cohort | character | Cohort/entity ID for the initial local_ancestry_summary_id value. |
| cohort.namespace | character | AnVIL workspace namespace. |
| cohort.name | character | AnVIL workspace name. |

Returns invisibly: FALSE if the table was created, TRUE if it already
existed.
set_up_step1c_summarize_local_ancestry_proportions(cohort, cohort.namespace, cohort.name, merged.6.ancestry_frac_path)
Retrieves the step1c_summarize_local_ancestry_proportions workflow
configuration from the workspace, updates its inputs and outputs for the given
cohort and data file, then performs a dry-run validation and dry-run
submission. Use
run_step1c_summarize_local_ancestry_proportions() to actually submit the job.
| Argument | Type | Description |
|---|---|---|
| cohort | character | Entity name to run on. |
| cohort.namespace | character | AnVIL workspace namespace. |
| cohort.name | character | AnVIL workspace name. |
| merged.6.ancestry_frac_path | character | gs:// path to the merged 6-ancestry fraction file from an upstream step. |

Returns the updated workflow configuration object.
run_step1c_summarize_local_ancestry_proportions(cohort, cohort.namespace, cohort.name, merged.6.ancestry_frac_path, run_now, new_config)
Applies the workflow configuration and submits the
step1c_summarize_local_ancestry_proportions workflow for real when
run_now = TRUE. When run_now = FALSE (default) returns invisibly without
doing anything, making it safe to call in scripts still being prepared.
| Argument | Type | Default | Description |
|---|---|---|---|
| cohort | character | (none) | Entity name to run on. |
| cohort.namespace | character | (none) | AnVIL workspace namespace. |
| cohort.name | character | (none) | AnVIL workspace name. |
| merged.6.ancestry_frac_path | character | (none) | GCS path to the merged 6-ancestry fraction file. Accepted for API symmetry with set_up_step1c_summarize_local_ancestry_proportions(); the configuration is already embedded in new_config so this argument is not read again here. |
| run_now | logical | FALSE | If TRUE, apply the configuration and submit the workflow. |
| new_config | workflow config | (none) | Configuration object returned by set_up_step1c_summarize_local_ancestry_proportions(). |

Returns invisibly NULL.
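The run_now guard is a common pattern for scripts that are prepared ahead of time. A minimal sketch of the idea, with hypothetical names throughout (submit_when_ready and submit_fn are not package functions; submit_fn stands in for whatever performs the real submission):

```r
# Hypothetical guard: do nothing unless run_now is TRUE.
submit_when_ready <- function(new_config, run_now = FALSE, submit_fn) {
  if (!run_now) {
    return(invisible(NULL))
  }
  submit_fn(new_config)
  invisible(NULL)
}

# Stand-in submit function that only records what it was asked to submit.
submitted <- character(0)
record <- function(cfg) submitted <<- c(submitted, cfg$name)

cfg <- list(name = "step1c_summarize_local_ancestry_proportions")
submit_when_ready(cfg, submit_fn = record)                  # no-op
submit_when_ready(cfg, run_now = TRUE, submit_fn = record)  # calls record()
```

Defaulting run_now to FALSE means a script can be sourced end to end without launching anything by accident.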
Computes two-way ancestry counts from individual-level admixture proportion
data. For every pair of reference populations (columns whose names begin with
"K", e.g. KAFR, KEUR), the function counts individuals meeting two-way
criteria and related exclusion categories.
| Argument | Type | Default | Description |
|---|---|---|---|
| admixture_anc_prop_list | data.frame / tibble | (none) | Data frame with at least two numeric columns whose names begin with "K". Each row is one individual. |
| cohort_name | character | (none) | Cohort name; populates the Cohort column in the result. |
| threshold | numeric | 0.9 | Minimum combined ancestry proportion (x1 + x2) for an individual to be considered for any category. |
| min_prop | numeric | 0.10 | Minimum individual ancestry proportion for classification as admixed. |

Returns a tibble with one row per pair of reference populations and the following columns:
| Column | Description |
|---|---|
| Cohort | Cohort name. |
| Ref_Pop1 | Name of the first reference population column. |
| Ref_Pop2 | Name of the second reference population column. |
| Count_two_way | Individuals with x1 >= min_prop, x2 >= min_prop, and x1 + x2 >= threshold. |
| Excluded_Ref1_lt10_and_Ref2_lt90 | Individuals with x1 < min_prop, x2 < threshold, and x1 + x2 >= threshold. |
| Excluded_Ref2_lt10_and_Ref1_lt90 | Individuals with x2 < min_prop, x1 < threshold, and x1 + x2 >= threshold. |
| Excluded_Ref1_lt10_and_Ref2_gt90 | Individuals with x1 < min_prop and x2 >= threshold. |
| Excluded_Ref2_lt10_and_Ref1_gt90 | Individuals with x2 < min_prop and x1 >= threshold. |
| n | Total individuals with x1 + x2 >= threshold. |

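The column definitions above can be applied to a single pair of reference populations with a few lines of base R. This is a sketch written from the table, not the package's code: count_two_way is a hypothetical helper and the proportions are toy values.

```r
# Hypothetical helper: counts for one pair of ancestry proportion vectors,
# following the column definitions in the table above.
count_two_way <- function(x1, x2, threshold = 0.9, min_prop = 0.10) {
  in_scope <- (x1 + x2) >= threshold
  c(
    Count_two_way = sum(x1 >= min_prop & x2 >= min_prop & in_scope),
    Excluded_Ref1_lt10_and_Ref2_lt90 = sum(x1 < min_prop & x2 < threshold & in_scope),
    Excluded_Ref2_lt10_and_Ref1_lt90 = sum(x2 < min_prop & x1 < threshold & in_scope),
    Excluded_Ref1_lt10_and_Ref2_gt90 = sum(x1 < min_prop & x2 >= threshold),
    Excluded_Ref2_lt10_and_Ref1_gt90 = sum(x2 < min_prop & x1 >= threshold),
    n = sum(in_scope)
  )
}

# Toy admixture proportions for five individuals.
props <- data.frame(
  KAFR = c(0.50, 0.05, 0.95, 0.30, 0.08),
  KEUR = c(0.45, 0.93, 0.02, 0.30, 0.85)
)
counts <- count_two_way(props$KAFR, props$KEUR)
counts
```

In this toy data the fourth individual has KAFR + KEUR = 0.60, so it falls below threshold and is not counted in any category.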
Downloads and cleans the PGS Catalog "all metadata scores" CSV from the EBI
FTP server. Column names are standardized to snake_case, whitespace is
trimmed, obviously numeric or date-like columns are converted to their native
types, and optionally columns with delimited list values are split into
list-columns.
| Argument | Type | Default | Description |
|---|---|---|---|
| url | character | EBI FTP URL | URL of the pgs_all_metadata_scores.csv file. |
| split_list_columns | logical | TRUE | Split columns with pipe/semicolon/comma-space delimiters into list-columns. |

Returns a tibble with cleaned columns ready for analysis and joins. When
split_list_columns = TRUE, some columns will be list-columns (each element a
character vector).
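Two of the cleaning steps described above, snake_case column names and list-columns, can be sketched in base R. The helper to_snake_case, the column names, and the values are all made up for illustration; the package's actual cleaning rules may differ.

```r
# Hypothetical helper: trim, replace runs of non-alphanumerics with "_",
# strip leading/trailing "_", and lower-case.
to_snake_case <- function(x) {
  x <- trimws(x)
  x <- gsub("[^A-Za-z0-9]+", "_", x)
  x <- gsub("^_+|_+$", "", x)
  tolower(x)
}

# Toy metadata with messy headers and a pipe-delimited column.
raw <- data.frame(
  "Score ID " = c("PGS000001", "PGS000002"),
  "Trait Labels (Mapped)" = c("a|b", "c"),
  check.names = FALSE, stringsAsFactors = FALSE
)
names(raw) <- to_snake_case(names(raw))

# Split the delimited column into a list-column: each element is a
# character vector.
raw$trait_labels_mapped <- strsplit(raw$trait_labels_mapped, "|", fixed = TRUE)
str(raw)
```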
Downloads a file from a GCS bucket to a local path using gsutil cp. If the
destination file already exists locally, the copy is skipped.
Requires gsutil (Google Cloud SDK) to be installed and available on PATH.
| Argument | Type | Description |
|---|---|---|
| gspath | character | GCS path of the file to download (e.g. "gs://my-bucket/path/file.txt"). |
| newfilename | character | Local destination path where the file should be saved. |

Returns invisibly NULL.
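The skip-if-present behaviour can be sketched with system2(). This is an illustration under assumptions, not the package's function: download_if_missing is a hypothetical name, and the error on a non-zero exit status is a choice made for this example.

```r
# Hypothetical helper: copy from GCS only when the local file is absent.
download_if_missing <- function(gspath, newfilename) {
  if (file.exists(newfilename)) {
    message("Skipping copy; file already exists: ", newfilename)
    return(invisible(NULL))
  }
  # Requires gsutil (Google Cloud SDK) on PATH.
  status <- system2("gsutil", c("cp", shQuote(gspath), shQuote(newfilename)))
  if (status != 0) stop("gsutil cp failed with exit status ", status)
  invisible(NULL)
}

# Demonstrate the skip path only, so no network access or gsutil is needed.
existing <- tempfile()
file.create(existing)
res <- download_if_missing("gs://my-bucket/path/file.txt", existing)
```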
Uploads a local file to a GCS bucket using gsutil cp, then confirms the
transfer with gsutil ls -l.
Requires gsutil (Google Cloud SDK) to be installed and available on PATH.
| Argument | Type | Description |
|---|---|---|
| filename | character | Local path of the file to upload. |
| gspath | character | GCS bucket or prefix path to copy the file into (e.g. "gs://my-bucket/output"). |
| newfilename | character | Name to give the file in the bucket. The file is written to <gspath>/<newfilename>. |

Returns invisibly NULL.
This package wraps two WDL workflows from the primed-pgs-catalog repository:
- primed_fetch_pgs_catalog: fetches a scoring file from the PGS Catalog and imports metadata into the workspace pgs_model and pgs_scoring_file data tables.
- primed_calc_pgs: applies the scoring file to cohort genotype data (pgen/psam/pvar format) and imports individual-level scores into the workspace pgs_individual_file data table.

This package also wraps three WDL workflows for running the HAUDI and GAUDI ancestry-aware PGS methods:
- gaudi_prep (github.com/UW-GAC/gaudi_prep_wdl, branch gaudi_prep_wdl): converts per-chromosome VCF files to PLINK2 format, runs FLARE local ancestry inference, and converts FLARE output to the .lanc format required by make_fbm.
- make_fbm (github.com/frankp-0/HAUDI_workflow, branch main): converts .lanc local ancestry files and the matching PLINK2 files into a Filebacked Big Matrix (FBM) compatible with HAUDI and GAUDI.
- fit_haudi (github.com/frankp-0/HAUDI_workflow, branch main): fits a HAUDI or GAUDI PGS model using an FBM and a phenotype file; outputs ancestry-specific effect estimates and individual-level scores.

library(primed.benchmarking)
# Step 1 — prepare PLINK2 + .lanc files from VCF inputs
prep_id <- submit_gaudi_prep_workflow(
vcf_files = paste0("gs://my-bucket/vcf/chr", 1:22, ".vcf.gz"),
ref_file_list = paste0("gs://my-bucket/ref/chr", 1:22, "REF.vcf.gz"),
out_prefix_list = paste0("chr", 1:22),
genetic_map_file = "gs://my-bucket/ref/genetic_map.map",
reference_map_file = "gs://my-bucket/ref/reference.pop"
)
wait_for_workflow(prep_id)
# Step 2 — build the Filebacked Big Matrix
# (supply the .lanc and PLINK2 outputs from Step 1)
fbm_id <- submit_make_fbm_workflow(
lanc_files = paste0("gs://my-bucket/lanc/chr", 1:22, ".lanc"),
pgen_files = paste0("gs://my-bucket/plink/chr", 1:22, ".pgen"),
pvar_files = paste0("gs://my-bucket/plink/chr", 1:22, ".pvar"),
psam_files = paste0("gs://my-bucket/plink/chr", 1:22, ".psam"),
fbm_prefix = "cohort",
anc_names = c("AFR", "EUR")
)
wait_for_workflow(fbm_id)
# Step 3 — fit the HAUDI/GAUDI model
# (supply the FBM outputs from Step 2)
fit_id <- submit_fit_haudi_workflow(
method = "HAUDI",
bk_file = "gs://my-bucket/fbm/cohort.bk",
info_file = "gs://my-bucket/fbm/cohort_info.txt",
dims_file = "gs://my-bucket/fbm/cohort_dims.txt",
fbm_samples_file = "gs://my-bucket/fbm/cohort_samples.txt",
phenotype_file = "gs://my-bucket/pheno/cohort.pheno",
phenotype = "BMI",
output_prefix = "cohort_BMI"
)
wait_for_workflow(fit_id)