DyHealthNet/gnext_reference_data

Processing of GNExT Reference Data

This repository provides scripts to generate the reference resources required for deploying an instance of the GNExT platform. Specifically, we describe the construction of a PLINK-formatted reference panel from the 1000 Genomes Project Phase 3 dataset, which MAGMA uses for linkage disequilibrium estimation. In addition, GNExT requires gene annotation resources for SNP-to-gene mapping; accordingly, the repository includes a script to derive the corresponding annotation file from any Ensembl GTF release.

The reference resources produced by these scripts are deposited on Zenodo (https://doi.org/10.5281/zenodo.17940903). The deposit includes PLINK files for GRCh37 and GRCh38 across five superpopulations, as well as Ensembl protein-coding gene annotation files for GRCh37 and GRCh38 from Ensembl releases 114 and 115.

GWAS Network Exploration Tool

PLINK stack generation for 1KG reference variants

This repository contains scripts for generating filtered PLINK stacks from the 1000 Genomes Phase 3 reference variant data. As input, our workflow uses per-chromosome VCF files for both GRCh37 and GRCh38. To stratify samples into superpopulations, we used one panel file for GRCh37 and a separate one for GRCh38. Below, we describe the main steps for GRCh37 and GRCh38 individually. All mentioned scripts are contained in the scripts/ directory, while any required auxiliary files, as well as the final summary statistics, are contained in data/.
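The stratification idea can be illustrated with a minimal Python sketch that groups sample IDs by superpopulation. This is not the repository's script. It assumes a tab-delimited panel file with a header row, and the column names `sample` and `super_pop` are assumptions that may not match both panel files. The sample IDs in the demo are toy values.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def samples_by_superpopulation(
    panel_lines: Iterable[str],
    sample_col: str = "sample",        # assumed column name
    superpop_col: str = "super_pop",   # assumed column name
) -> Dict[str, List[str]]:
    """Group sample IDs by superpopulation label, preserving file order."""
    reader = csv.DictReader(panel_lines, delimiter="\t")
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in reader:
        groups[row[superpop_col]].append(row[sample_col])
    return dict(groups)


if __name__ == "__main__":
    # Toy panel content, not real 1000 Genomes samples.
    toy_panel = [
        "sample\tpop\tsuper_pop\tgender",
        "S1\tP1\tEUR\tmale",
        "S2\tP2\tAFR\tfemale",
        "S3\tP3\tEUR\tfemale",
    ]
    for superpop, samples in samples_by_superpopulation(toy_panel).items():
        print(superpop, samples)
```

Each resulting list could then be written to a per-superpopulation sample list and passed to the tool that performs the actual split.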

Workflow for GRCh37

  1. filter_h37.sh: Transforming multiallelic, per-chromosome input VCFs into biallelic ones, while additionally filtering out copy number variations.
  2. concat_vcfs.sh: Aggregating all filtered, per-chromosome VCFs into one joint VCF file for all chromosomes.
  3. vcf_to_plink.sh: Transforming the aggregated VCF file into PLINK's bfile format. Variants in pseudo-autosomal regions of the X chromosome are mapped to chromosome code X. Also, variants are assigned unique IDs based on the format CHROM:POS:REF:ALT.
  4. split_into_populations.sh: The aggregated PLINK files are stratified into superpopulations.
  5. filter_MAC_duplicated.sh: The PLINK files of each superpopulation are filtered to keep only variants that are present in at least one sample (minor allele count >= 1), and variants with duplicated unique IDs are reduced to a single entry.
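The ID assignment and duplicate removal described in the steps above can be sketched in a few lines of Python. This is only an illustration of the CHROM:POS:REF:ALT convention, not the repository's implementation, which operates on PLINK files. The `Variant` type and both function names are hypothetical, and keeping the first occurrence of a duplicate is an assumption.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record for illustration."""
    chrom: str
    pos: int
    ref: str
    alt: str


def variant_id(v: Variant) -> str:
    """Build an ID in the CHROM:POS:REF:ALT format."""
    return f"{v.chrom}:{v.pos}:{v.ref}:{v.alt}"


def drop_duplicate_ids(variants: Iterable[Variant]) -> List[Variant]:
    """Keep one variant per ID (here: the first one seen, an assumption)."""
    seen = set()
    kept = []
    for v in variants:
        vid = variant_id(v)
        if vid in seen:
            continue
        seen.add(vid)
        kept.append(v)
    return kept


if __name__ == "__main__":
    # Toy records: the second one duplicates the first.
    toy = [
        Variant("1", 100, "A", "G"),
        Variant("1", 100, "A", "G"),
        Variant("1", 100, "A", "T"),
    ]
    for v in drop_duplicate_ids(toy):
        print(variant_id(v))
```

Because REF and ALT are part of the ID, two biallelic records at the same position with different alternate alleles are treated as distinct variants.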

The resulting reference file metrics per superpopulation are as follows:

Population Num_Variants Num_Samples
EUR 25059705 503
AFR 43671646 661
AMR 29497483 347
SAS 27686667 489
EAS 24503205 504

Workflow for GRCh38

  1. prune_all_h38.sh: Because the GRCh38 per-chromosome VCF files are very large, we first prune information that is not needed for our purposes. In the FORMAT field, only the genotypes (GT) are kept, and the INFO field is removed completely.
  2. filter_h38.sh: Transforming multiallelic, per-chromosome input VCFs into biallelic ones.
  3. concat_vcf.sh: Aggregating all filtered, per-chromosome VCFs into one joint VCF file for all chromosomes.
  4. vcf_to_plink.sh: Transforming the aggregated VCF file into PLINK's bfile format. Variants in pseudo-autosomal regions of the X chromosome are mapped to chromosome code X. Also, variants are assigned unique IDs based on the format CHROM:POS:REF:ALT.
  5. split_into_populations.sh: The aggregated PLINK files are stratified into superpopulations.
  6. exclude_missing_alleles.sh: Variants with the symbol '*' in either the REF or ALT column (i.e. without proper allele encoding) are removed.
  7. filter_MAC_duplicated.sh: The PLINK files of each superpopulation are filtered to keep only variants that are present in at least one sample (minor allele count >= 1), and variants with duplicated unique IDs are reduced to a single entry.
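The two variant-level filters in the steps above ('*' alleles and minor allele count) can be illustrated with a small Python sketch. It is not the repository's implementation. It assumes diploid, biallelic genotypes encoded as alternate-allele dosages (0, 1, 2, or None for a missing call), and all function names are hypothetical.

```python
from typing import Optional, Sequence


def has_star_allele(ref: str, alt: str) -> bool:
    """True if REF or ALT contains the '*' symbol."""
    return "*" in ref or "*" in alt


def minor_allele_count(dosages: Sequence[Optional[int]]) -> int:
    """Count the rarer allele among non-missing diploid genotype calls."""
    called = [d for d in dosages if d is not None]
    alt_count = sum(called)
    total_alleles = 2 * len(called)  # diploid assumption
    return min(alt_count, total_alleles - alt_count)


def keep_variant(
    ref: str,
    alt: str,
    dosages: Sequence[Optional[int]],
    min_mac: int = 1,
) -> bool:
    """Apply both filters: no '*' allele and MAC >= min_mac."""
    if has_star_allele(ref, alt):
        return False
    return minor_allele_count(dosages) >= min_mac


if __name__ == "__main__":
    # Toy genotypes for three samples of one superpopulation.
    print(keep_variant("A", "G", [0, 1, 0]))
    print(keep_variant("A", "*", [0, 1, 0]))
    print(keep_variant("A", "G", [0, 0, 0]))
```

Under this definition, a variant that is monomorphic within a superpopulation (all reference or all alternate) has a minor allele count of 0 and is dropped from that superpopulation's files, which is consistent with the differing variant counts per superpopulation.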

The resulting reference file metrics per superpopulation are as follows:

Population Num_Variants Num_Samples
EUR 41941265 633
AFR 68637558 893
AMR 45570833 490
SAS 46380876 601
EAS 41976171 585

Ensembl gene annotation file generation

To generate the protein-coding gene annotation file for the GNExT platform from a specific Ensembl release, first download the GTF file for the respective release and genome assembly from the Ensembl FTP site (https://ftp.ensembl.org/pub/). Then run extract_genes_from_gft.py, specifying the --gtf and --output flags.
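For orientation, the kind of extraction involved can be sketched as follows. This is an independent, minimal illustration and not the code of the repository's script. It assumes the standard nine-column, tab-separated GTF layout with a `gene_biotype` attribute on `gene` records, and the order of the output fields is an assumption rather than the format GNExT expects. The demo lines are toy records.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Matches attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

GeneRecord = Tuple[str, str, int, int, str, str]


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the ninth GTF column into a key/value dictionary."""
    return dict(ATTR_RE.findall(field))


def protein_coding_genes(gtf_lines: Iterable[str]) -> Iterator[GeneRecord]:
    """Yield (gene_id, chrom, start, end, strand, gene_name) per gene."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # skip malformed lines and non-gene features
        attrs = parse_attributes(cols[8])
        if attrs.get("gene_biotype") != "protein_coding":
            continue
        yield (
            attrs["gene_id"],
            cols[0],
            int(cols[3]),
            int(cols[4]),
            cols[6],
            attrs.get("gene_name", ""),
        )


if __name__ == "__main__":
    # Toy GTF lines, not real Ensembl records.
    toy_gtf = [
        "#!toy-header\n",
        '1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "A"; gene_biotype "protein_coding";\n',
        '1\ttoy\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_name "B"; gene_biotype "lncRNA";\n',
    ]
    for record in protein_coding_genes(toy_gtf):
        print("\t".join(str(x) for x in record))
```

Restricting to `gene` features avoids emitting one row per transcript or exon of the same gene.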
