Avoid generating large BED/BIM/FAM datasets when we process BIM inputs

Ideally, I'd like to avoid generating complete BED/BIM/FAM datasets as much as possible. Writing full datasets produces large files and takes a long time for large inputs.
Instead, we can generate only a BIM file in which the ID of each SNP we don't want to use is set to a meaningful flag such as DUPLICATED, FAILED_LIFTOVER, etc. We then always pass around the original BED/FAM and use this processed BIM in place of the original one.

During fine-mapping, we only consider SNPs that match between the BIM file and the summary statistics, so the SNPs we flagged will always be excluded.
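To make the flagging idea concrete, here is a minimal sketch in Python (the pipeline itself does this in R, so this is an illustration rather than the actual implementation). It assumes the standard six-column PLINK BIM layout with the variant ID in the second column. The function names, the rule that a duplicate is any ID seen more than once, and the way failed-liftover IDs are passed in are all hypothetical.

```python
from collections import Counter


def flag_bim_ids(bim_lines, failed_liftover=frozenset()):
    """Return BIM lines with unusable variant IDs replaced by a flag.

    No row is dropped or reordered, so the result stays aligned with
    the original BED file. Duplicate detection by ID is an assumption.
    """
    rows = [line.split() for line in bim_lines]
    id_counts = Counter(row[1] for row in rows)
    flagged = []
    for row in rows:
        snp_id = row[1]
        if snp_id in failed_liftover:
            row[1] = "FAILED_LIFTOVER"
        elif id_counts[snp_id] > 1:
            row[1] = "DUPLICATED"
        flagged.append("\t".join(row))
    return flagged


def usable_ids(flagged_bim_lines, sumstat_ids):
    """IDs present in both the processed BIM and the summary statistics."""
    bim_ids = {line.split("\t")[1] for line in flagged_bim_lines}
    return bim_ids & set(sumstat_ids)


# Toy example: rs2 appears twice, rs3 is assumed to have failed liftover.
bim = [
    "1\trs1\t0\t100\tA\tG",
    "1\trs2\t0\t200\tC\tT",
    "1\trs2\t0\t200\tC\tA",
    "1\trs3\t0\t300\tG\tA",
]
processed = flag_bim_ids(bim, failed_liftover={"rs3"})
print(usable_ids(processed, ["rs1", "rs2", "rs3"]))  # {'rs1'}
```

Because the flags never appear as IDs in the summary statistics, the intersection drops the flagged rows without touching the BED file.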



Possible implementation:
During the BED processing step, we create symlinks with new names for the BED and FAM files and write a new processed BIM that follows the same naming scheme. No variants are removed during this step. Instead, we set variant IDs to error flags as described above, so that they will not match the SNP IDs in the summary statistics and will be dropped downstream.
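The symlink step could look roughly like the following Python sketch. The helper name `link_bed_fam` and the prefix arguments are hypothetical. In the pipeline this would more likely happen inside the process script, and the processed BIM would be written next to the links by the existing R code.

```python
from pathlib import Path


def link_bed_fam(orig_prefix, new_prefix):
    """Create <new_prefix>.bed/.fam as symlinks to the original files.

    Only links are created, so no genotype data is copied. The processed
    BIM is expected to be written separately as <new_prefix>.bim.
    """
    links = []
    for ext in (".bed", ".fam"):
        target = Path(f"{orig_prefix}{ext}").resolve()
        link = Path(f"{new_prefix}{ext}")
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link from a previous run
        link.symlink_to(target)
        links.append(link)
    return links
```

Building the paths with string formatting rather than `Path.with_suffix` keeps prefixes that already contain dots (for example `cohort.chr1`) intact.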

We need to change the R code to re-integrate variants lost after liftover and ensure the order of variants is maintained between the original input BIM and the processed BIM.
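As a rough illustration of the re-integration logic (in Python for brevity; the real change would be in the R code), the sketch below walks the original BIM rows in order and either applies the lifted coordinates or keeps the row with a FAILED_LIFTOVER ID. Tracking lifted variants by their original row index is an assumption, as are the function and argument names.

```python
def reintegrate_liftover(original_rows, lifted):
    """Merge liftover results back into the original BIM row order.

    original_rows: list of 6-field BIM rows (lists of strings).
    lifted: dict mapping original row index -> (chrom, bp) in the target
            build; indices missing from the dict failed liftover.
    """
    out = []
    for idx, (chrom, snp_id, cm, bp, a1, a2) in enumerate(original_rows):
        if idx in lifted:
            new_chrom, new_bp = lifted[idx]
            out.append([str(new_chrom), snp_id, cm, str(new_bp), a1, a2])
        else:
            # Keep the row so the BIM still lines up with the BED file,
            # but make sure the ID can never match the summary statistics.
            out.append([chrom, "FAILED_LIFTOVER", cm, bp, a1, a2])
    assert len(out) == len(original_rows)  # row count must not change
    return out
```

Iterating over the original rows, rather than over the liftover output, is what guarantees that the processed BIM has the same length and order as the input.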

Only the processed BIM is published.

User input:

- Option 1: The user creates BED and FAM symlinks whose prefix matches the processed BIM.
- Option 2: We add an optional column to the input file where the user can provide an already processed BIM, which then replaces the original one.

In the latter scenario (Option 2), we create a new input channel that carries only the BIM file for all processes, using `stageAs` to align the BIM's name with the BED/FAM prefix.

We have to modify this block.
https://github.com/Biostatistics-Unit-HT/Flanders/blob/032b340eb481fae5ce8ce8d95d30a035376d8151/main.nf#L37-L46

Here we want to change the assignment so that it picks up only the BED and FAM files:
`bfile_dataset = params.is_test_profile ? file("${projectDir}/${row.bfile}.{bed,fam}")`

And add a new input channel for the BIM alone, which is then included in the tuple. Something like
`bim_file = params.is_test_profile ? file("${projectDir}/${row.bim_processed}.bim")`

In the individual processes, we add an input item that uses `stageAs` to align the BIM file name with the bfile prefix:
`tuple path(bfile_dataset), path(bim_file, stageAs: { "${bfile_dataset[0].baseName}.bim" })`

	.map { row ->
		def bfile_dataset = params.is_test_profile ? file("${projectDir}/${row.bfile}.{bed,bim,fam}") : file("${row.bfile}.{bed,bim,fam}")
		tuple(
			row.process_bfile,
			row.bfile,
			"${row.grch_bfile ? row.grch_bfile : row.grch}",
			"${params.run_liftover ? "T" : "F"}",
			bfile_dataset
		)
	}

Avoid generating large BED/BIM/FAM datasets when we process BIM inputs #78
