Dear all,
I am working with WGS data from 3,200 samples of a non-model organism (genome size ~3 Gb). Because my study organism lacks the predefined chunks that are available for the human genome, I am considering a sliding-window approach to phase the data. Given SHAPEIT5’s reported higher accuracy compared to Beagle, I would like to use it for phasing in our project. Following the quality-control guidelines in the tutorial (https://odelaneau.github.io/shapeit5/docs/tutorials/ukb_wgs/#quality-control), I performed preliminary QC on the SNPs, removing those with missingness >10% and those in low-mappability regions.
Common Variants Phasing:
phase_common_static --progress -T 32 -I ${chr}.flt.vcf.gz --filter-maf 0.01 -R ${chr} -O ${chr}.common_imp.bcf
Rare Variants Phasing: For rare variants (MAF < 0.01), I attempted to phase them in relatively small chunks (5 Mb) using:
phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region 18:1-5000000 --scaffold 18.common_imp.bcf --output 18.rare_imp.bcf --scaffold-region 18:1-55982971
However, when I ran phase_rare_static this way on the smallest chromosome, memory usage exceeded 1 TB. Given our computational constraints, this is infeasible for larger chromosomes.
To mitigate the memory issue, I restricted --scaffold-region to the --input-region extended by 2.5 Mb on both sides, instead of passing the whole chromosome. With this change the job completed successfully, though peak memory usage remained high (~524.66 GB). For example, for the first and second 5 Mb chunks:
phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region 18:1-5000000 --scaffold 18.common_imp.bcf --output 18.rare_imp_01.bcf --scaffold-region 18:1-10000000
phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region 18:5000001-10000000 --scaffold 18.common_imp.bcf --output 18.rare_imp_02.bcf --scaffold-region 18:2500000-12500000
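To make the question concrete, the same windowing could be generated for a whole chromosome along the following lines. This is a minimal sketch, not a tested pipeline: `make_chunks` is a hypothetical helper, the chromosome length (55982971), chunk size (5 Mb) and flank (2.5 Mb) are the values from this post, and the loop only prints the commands (a dry run) using the options already shown above.

```shell
#!/bin/sh
# make_chunks CHR LENGTH CHUNK FLANK  (hypothetical helper, not part of SHAPEIT5)
# Prints one line per chunk: "<input-region> <scaffold-region>".
# The scaffold region is the input region plus FLANK bp on each side,
# clipped to the chromosome bounds [1, LENGTH].
make_chunks() {
  chr=$1; len=$2; chunk=$3; flank=$4
  start=1
  while [ "$start" -le "$len" ]; do
    end=$(( start + chunk - 1 ))
    if [ "$end" -gt "$len" ]; then end=$len; fi
    s_start=$(( start - flank ))
    if [ "$s_start" -lt 1 ]; then s_start=1; fi
    s_end=$(( end + flank ))
    if [ "$s_end" -gt "$len" ]; then s_end=$len; fi
    printf '%s:%d-%d %s:%d-%d\n' "$chr" "$start" "$end" "$chr" "$s_start" "$s_end"
    start=$(( start + chunk ))
  done
}

# Dry run: print (do not execute) one phase_rare_static command per chunk.
make_chunks 18 55982971 5000000 2500000 | {
  i=0
  while read -r input_region scaffold_region; do
    i=$(( i + 1 ))
    printf 'phase_rare_static --thread 88 --progress --input 18.fill.vcf.gz --input-region %s --scaffold 18.common_imp.bcf --output 18.rare_imp_%02d.bcf --scaffold-region %s\n' \
      "$input_region" "$scaffold_region" "$i"
  done
}
```

With a fixed 2.5 Mb flank, the first chunk's scaffold region comes out as 18:1-7500000 rather than the 18:1-10000000 used in the example above, and the second as 18:2500001-12500000. Whether a 2.5 Mb flank is sufficient for rare-variant phasing is exactly the open question.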
Would this be an appropriate strategy for phasing rare variants across the genome?
Best regards,
Zheng zhuqing