Best practices for chunking/phasing rare variants with SHAPEIT5 without predefined chunks? #129

@biozzq

Dear all,
I am working with WGS data from 3,200 samples of a non-model organism (genome size ~3 Gb). Because SHAPEIT5 is reported to be more accurate than Beagle, I would like to use it for phasing in our project. However, my study organism lacks the predefined chunks that are available for human genomes, so I am considering a sliding-window approach instead. Following the guidelines from the tutorial (https://odelaneau.github.io/shapeit5/docs/tutorials/ukb_wgs/#quality-control), I performed preliminary quality control on the SNPs, removing those with missingness >10% and those in low-mappability regions.

Common Variants Phasing:

phase_common_static --progress -T 32 -I ${chr}.flt.vcf.gz --filter-maf 0.01 -R ${chr} -O ${chr}.common_imp.bcf

Rare Variants Phasing: For rare variants (MAF < 0.01), I attempted to phase them in relatively small chunks (5 Mb) using:

phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region 18:1-5000000 --scaffold 18.common_imp.bcf --output 18.rare_imp.bcf --scaffold-region 18:1-55982971

However, when running phase_rare_static on the smallest chromosome, the memory usage exceeded 1 TB. Given our computational constraints, this is infeasible for larger chromosomes.

To mitigate the memory issue, I stopped passing the whole chromosome as the --scaffold-region and instead limited it to a window around the --input-region, extended by 2.5 Mb on both sides. With this change the job completed successfully, though peak memory usage remained high (~524.66 GB). For example:

For the first and second 5 Mb chunks:

phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region 18:1-5000000 --scaffold 18.common_imp.bcf --output 18.rare_imp_01.bcf --scaffold-region 18:1-10000000
phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region 18:5000001-10000000 --scaffold 18.common_imp.bcf --output 18.rare_imp_02.bcf --scaffold-region 18:2500000-12500000
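The windowing shown in the two commands above could be generated for a whole chromosome with a small shell loop. This is only a sketch under stated assumptions: `make_chunks` is a hypothetical helper (not part of SHAPEIT5), the flank is applied symmetrically and clipped to the chromosome ends, and the loop only prints the commands rather than running them. With a symmetric 2.5 Mb flank, the first chunk's scaffold window ends at 7.5 Mb rather than the 10 Mb used in the first command above.

```bash
#!/bin/sh
# Hypothetical helper: print "<input-region> <scaffold-region>" pairs for
# fixed-size chunks, with the scaffold window padded by a flank on each side.
# Arguments: chromosome name, chromosome length, chunk size, flank size (bp).
make_chunks() {
    local chr=$1 chr_len=$2 chunk=$3 flank=$4
    local start=1 end sstart send
    while [ "$start" -le "$chr_len" ]; do
        # Input window: [start, end], clipped to the chromosome length.
        end=$(( start + chunk - 1 ))
        if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
        # Scaffold window: input window plus flank, clipped to [1, chr_len].
        sstart=$(( start - flank ))
        if [ "$sstart" -lt 1 ]; then sstart=1; fi
        send=$(( end + flank ))
        if [ "$send" -gt "$chr_len" ]; then send=$chr_len; fi
        printf '%s:%d-%d %s:%d-%d\n' "$chr" "$start" "$end" "$chr" "$sstart" "$send"
        start=$(( end + 1 ))
    done
}

# Dry run for chromosome 18 (length taken from the command above):
# print one phase_rare_static command per 5 Mb chunk without executing it.
i=0
make_chunks 18 55982971 5000000 2500000 | while read -r in_reg sc_reg; do
    i=$(( i + 1 ))
    echo "phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz" \
         "--input-region $in_reg --scaffold 18.common_imp.bcf" \
         "--scaffold-region $sc_reg --output 18.rare_imp_$(printf '%02d' "$i").bcf"
done
```

Printing the commands first makes it easy to inspect the chunk boundaries before committing compute time; the chunk and flank sizes are plain arguments, so they can be changed without editing the loop.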

Would this be an appropriate strategy for phasing rare variants across the genome?

Best regards,

Zheng zhuqing
