
Avoid generating large BED/BIM/FAM datasets when we process BIM inputs #78

@edg1983

Ideally, I'd like to avoid generating complete BED/BIM/FAM datasets wherever possible, since this produces large files and takes a long time for large datasets.
We can instead generate only a BIM file in which the ID of every SNP we don't want to use is set to a meaningful flag such as DUPLICATED, FAILED_LIFTOVER, etc. We then always pass around the original BED/FAM and use this processed BIM in place of the original one.

During fine-mapping, we only consider SNPs that match between the BIM file and the sumstat, so the SNPs we flagged will always be excluded.
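The flagging idea above can be sketched roughly as follows. This is a minimal illustration in Python, not the pipeline's R code: the function name `flag_bim_lines`, the row-index keyed `flags` mapping, and the toy data are all assumptions made for the example. It assumes the standard six-column BIM layout with the variant ID in the second column.

```python
# Sketch: overwrite the ID of unwanted variants instead of dropping rows.
# BIM columns: chromosome, variant ID, cM position, bp position, allele 1, allele 2.

def flag_bim_lines(bim_lines, flags):
    """Return BIM rows with the ID column replaced for flagged rows.

    bim_lines: BIM rows as strings, in their original order.
    flags: hypothetical mapping of 0-based row index -> flag string.
    """
    processed = []
    for index, line in enumerate(bim_lines):
        fields = line.split()
        if index in flags:
            fields[1] = flags[index]  # only the ID changes; the row stays in place
        processed.append("\t".join(fields))
    return processed


bim = [
    "1\trs1\t0\t1000\tA\tG",
    "1\trs2\t0\t2000\tC\tT",
    "1\trs2\t0\t2000\tC\tT",  # second copy of rs2
    "1\trs3\t0\t3000\tG\tA",  # pretend this one failed liftover
]
processed = flag_bim_lines(bim, {2: "DUPLICATED", 3: "FAILED_LIFTOVER"})

# Downstream matching against sumstat IDs never sees the flagged rows.
sumstat_ids = {"rs1", "rs2", "rs3"}
processed_ids = [row.split("\t")[1] for row in processed]
matched = [variant_id for variant_id in processed_ids if variant_id in sumstat_ids]
```

Because the row count and order are unchanged, the processed BIM still lines up with the original BED, while `matched` contains only `rs1` and the first `rs2`.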

Possible implementation:
During the BED processing step, we create symlinks with new names for the BED and FAM files and write a new processed BIM that follows the same naming scheme. No variants are removed during the process; instead, we set variant IDs to error flags as described above, so that they will not match the sumstat SNP IDs and will be dropped downstream.
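As a rough illustration of the staging step described above, the following Python sketch links BED and FAM under a new prefix and writes the processed BIM next to them. The helper name `stage_processed_dataset`, the `cohort.processed` prefix, and the dummy file contents are assumptions for the example; the pipeline itself may do this inside a Nextflow process instead.

```python
import tempfile
from pathlib import Path


def stage_processed_dataset(orig_prefix, new_prefix, processed_bim_lines):
    """Symlink <orig>.bed/.fam to <new>.bed/.fam and write <new>.bim."""
    for ext in (".bed", ".fam"):
        link = Path(f"{new_prefix}{ext}")
        target = Path(f"{orig_prefix}{ext}").resolve()
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link from a previous run
        link.symlink_to(target)
    bim_path = Path(f"{new_prefix}.bim")
    bim_path.write_text("\n".join(processed_bim_lines) + "\n")
    return bim_path


# Toy run in a temporary directory with placeholder BED/FAM content.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "cohort.bed").write_bytes(b"\x6c\x1b\x01")
    (tmp / "cohort.fam").write_text("F1 I1 0 0 1 -9\n")
    bim_path = stage_processed_dataset(
        tmp / "cohort",
        tmp / "cohort.processed",
        ["1\trs1\t0\t1000\tA\tG", "1\tDUPLICATED\t0\t1000\tA\tG"],
    )
    staged = sorted(p.name for p in tmp.glob("cohort.processed.*"))
    bed_is_link = (tmp / "cohort.processed.bed").is_symlink()
    bim_is_link = bim_path.is_symlink()
```

Only the BIM is a real new file here; BED and FAM cost nothing beyond the links, which is the point of the proposal.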

We need to change the R code to re-integrate variants lost after liftover and ensure the order of variants is maintained between the original input BIM and the processed BIM.
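One way to picture the re-integration, sketched in Python rather than the pipeline's R: walk the original BIM in order, take the lifted coordinates where they exist, and flag everything else. The function name `reintegrate_liftover`, the row-index keyed `lifted_positions` mapping, and the choice to keep the old coordinates on failed rows are assumptions for illustration only.

```python
def reintegrate_liftover(original_rows, lifted_positions, flag="FAILED_LIFTOVER"):
    """Rebuild a full-length BIM table after liftover.

    original_rows: (chrom, variant_id, cm, bp, a1, a2) tuples in original order.
    lifted_positions: hypothetical mapping of 0-based row index -> (chrom, bp)
        for the variants that were lifted over successfully.
    """
    processed = []
    for index, (chrom, variant_id, cm, bp, a1, a2) in enumerate(original_rows):
        if index in lifted_positions:
            new_chrom, new_bp = lifted_positions[index]
            processed.append((new_chrom, variant_id, cm, new_bp, a1, a2))
        else:
            # Variant lost in liftover: keep the row in place, flag its ID.
            processed.append((chrom, flag, cm, bp, a1, a2))
    return processed


original = [
    ("1", "rs1", "0", 1000, "A", "G"),
    ("1", "rs2", "0", 2000, "C", "T"),
    ("1", "rs3", "0", 3000, "G", "A"),
]
lifted = {0: ("1", 1100), 2: ("1", 3100)}  # rs2 did not lift over
rebuilt = reintegrate_liftover(original, lifted)
```

The output has one row per input row in the same order, so it remains aligned with the untouched BED file.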

Only the processed BIM is published.

User input:

  • Option 1: The user has to create symlinks for the BED and FAM files that share the processed BIM's prefix.
  • Option 2: We add an optional column to the input file where the user can provide an already processed BIM, which then replaces the original one.

In this latter scenario, we create a new input channel for all processes that manage only the BIM file, using stageAs to align the BIM's name with the prefix BED/FAM.

We have to modify this block.

Flanders/main.nf

Lines 37 to 46 in 032b340

.map{ row ->
    def bfile_dataset = params.is_test_profile ? file("${projectDir}/${row.bfile}.{bed,bim,fam}") : file("${row.bfile}.{bed,bim,fam}")
    tuple(
        row.process_bfile,
        row.bfile,
        "${row.grch_bfile ? row.grch_bfile : row.grch}",
        "${params.run_liftover ? "T" : "F"}",
        bfile_dataset
    )
}

Here we want to change bfile_dataset so that it only collects the BED and FAM files:
def bfile_dataset = params.is_test_profile ? file("${projectDir}/${row.bfile}.{bed,fam}") : file("${row.bfile}.{bed,fam}")

And have a new input channel only for the BIM, which is added to the tuple. Something like
def bim_file = params.is_test_profile ? file("${projectDir}/${row.bim_processed}.bim") : file("${row.bim_processed}.bim")

In each individual process, we add an input item like the following, using stageAs to align the BIM file name with the bfile prefix:
tuple path(bfile_dataset), path(bim_file, stageAs: { "${bfile_dataset[0].baseName}.bim" })

Labels: enhancement (New feature or request)
