Processing of GNExT Reference Data

This repository provides scripts to generate the reference resources required for deploying an instance of the GNExT platform. Specifically, we describe the construction of a PLINK-formatted reference panel from the 1000 Genomes Project Phase 3 dataset, which is used by MAGMA for linkage disequilibrium estimation. In addition, GNExT requires gene annotation resources for SNP-to-gene mapping; accordingly, the repository includes a script to derive the corresponding annotation file from any Ensembl GTF release.

The reference resources produced by these scripts are deposited on Zenodo (https://doi.org/10.5281/zenodo.17940903). They include PLINK files for GRCh37 and GRCh38 across five superpopulations, as well as Ensembl protein-coding gene annotation files for GRCh37 and GRCh38 from Ensembl releases 114 and 115.

GWAS Network Exploration Tool

PLINK stack generation for 1KG reference variants

This repository contains scripts for generating filtered PLINK stacks for the 1000 Genomes Phase 3 reference variant data. As input, our workflow uses per-chromosome VCF files from both GRCh37 and GRCh38. To stratify samples into superpopulations, we have used this panel file for GRCh37 and this one for GRCh38. Below, we describe our main steps for GRCh37 and GRCh38 individually. All mentioned scripts are contained in the scripts/ directory, while any required auxiliary files, as well as the final summary statistics, are contained in data/.
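As a rough illustration of the stratification input, the sketch below groups sample IDs by superpopulation from a panel file. It assumes a tab-delimited panel with a header row containing `sample` and `super_pop` columns; the function name and the toy sample IDs are hypothetical and are not taken from the repository scripts.

```python
import csv
import io
from collections import defaultdict


def samples_by_superpopulation(panel_handle):
    """Group sample IDs by superpopulation code (e.g. EUR, AFR).

    Assumes a tab-delimited panel with 'sample' and 'super_pop' columns.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(panel_handle, delimiter="\t")
    for row in reader:
        groups[row["super_pop"]].append(row["sample"])
    return dict(groups)


# Toy panel content with made-up sample IDs; a real panel has one row per sample.
example_panel = io.StringIO(
    "sample\tpop\tsuper_pop\n"
    "S1\tGBR\tEUR\n"
    "S2\tCHS\tEAS\n"
    "S3\tGBR\tEUR\n"
)
groups = samples_by_superpopulation(example_panel)
for super_pop, samples in sorted(groups.items()):
    print(super_pop, len(samples))
```

Per-superpopulation sample lists of this kind are what a sample-level split of the aggregated PLINK files would need as input.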

Workflow for GRCh37

filter_h37.sh: Transforming multiallelic, per-chromosome input VCFs into biallelic ones, while additionally filtering out copy number variations.
concat_vcfs.sh: Aggregating all filtered, per-chromosome VCFs into one joint VCF file for all chromosomes.
vcf_to_plink.sh: Transforming the aggregated VCF file into PLINK's bfile format. Variants in pseudo-autosomal regions of the X chromosome are mapped to chromosome code X. Also, variants are assigned unique IDs based on the format CHROM:POS:REF:ALT.
split_into_populations.sh: The aggregated PLINK files are stratified into superpopulations.
filter_MAC_duplicated.sh: PLINK representations for each superpopulation are filtered for variants that are present in at least one sample (minor allele count >= 1), and duplicated variants based on unique IDs are cleaned (only one version is kept).
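The ID assignment and the final filtering step can be illustrated on toy data. The following is a minimal sketch, not the repository's implementation: it assumes diploid genotypes given as alternate-allele counts per sample (0, 1 or 2), applies the minor allele count filter first, and keeps the first occurrence of each duplicated ID. Which duplicate the actual scripts keep is not stated here, so that choice is an assumption, as are the function names.

```python
def variant_id(chrom, pos, ref, alt):
    """Build a variant ID in the CHROM:POS:REF:ALT format."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def filter_variants(records):
    """Return IDs of variants with minor allele count >= 1, without duplicates.

    records: iterable of (chrom, pos, ref, alt, alt_counts), where alt_counts
    holds the number of ALT alleles (0, 1 or 2) carried by each sample.
    """
    seen = set()
    kept = []
    for chrom, pos, ref, alt, alt_counts in records:
        alt_total = sum(alt_counts)
        allele_total = 2 * len(alt_counts)  # diploid assumption
        minor_allele_count = min(alt_total, allele_total - alt_total)
        if minor_allele_count < 1:
            continue  # monomorphic in this set of samples
        vid = variant_id(chrom, pos, ref, alt)
        if vid in seen:
            continue  # duplicated ID: keep only the first occurrence
        seen.add(vid)
        kept.append(vid)
    return kept


toy_records = [
    ("1", 100, "A", "G", [0, 1, 0]),   # kept
    ("1", 100, "A", "G", [1, 1, 0]),   # duplicate ID, dropped
    ("1", 200, "C", "T", [0, 0, 0]),   # ALT never observed, dropped
    ("2", 300, "G", "GA", [2, 2, 2]),  # REF never observed, dropped
    ("X", 400, "T", "C", [0, 2, 0]),   # kept
]
kept_ids = filter_variants(toy_records)
print(kept_ids)
```

Because the filter is applied per superpopulation, a variant can be kept in one superpopulation and dropped in another, which is consistent with the differing variant counts per population.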

The resulting reference file metrics per superpopulation are as follows:

Population	Num_Variants	Num_Samples
EUR	25059705	503
AFR	43671646	661
AMR	29497483	347
SAS	27686667	489
EAS	24503205	504

Workflow for GRCh38

prune_all_h38.sh: Due to the enormous size of the GRCh38 per-chromosome VCF files, we first prune information that is not needed for our purposes. In the FORMAT field, only the actual genotypes (GT) are kept, and the INFO field is removed completely.
filter_h38.sh: Transforming multiallelic, per-chromosome input VCFs into biallelic ones.
concat_vcf.sh: Aggregating all filtered, per-chromosome VCFs into one joint VCF file for all chromosomes.
vcf_to_plink.sh: Transforming the aggregated VCF file into PLINK's bfile format. Variants in pseudo-autosomal regions of the X chromosome are mapped to chromosome code X. Also, variants are assigned unique IDs based on the format CHROM:POS:REF:ALT.
split_into_populations.sh: The aggregated PLINK files are stratified into superpopulations.
exclude_missing_alleles.sh: Variants with the symbol '*' in either the REF or ALT column (i.e. without proper allele encoding) are removed.
filter_MAC_duplicated.sh: PLINK representations for each superpopulation are filtered for variants that are present in at least one sample (minor allele count >= 1), and duplicated variants based on unique IDs are cleaned (only one version is kept).
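The '*' allele exclusion in this workflow can be pictured as a scan over variant records. The sketch below assumes the check is run on PLINK .bim lines (six whitespace-separated columns: chromosome, variant ID, genetic distance, position, allele 1, allele 2); whether the repository script works on .bim files or on another representation is not stated here, and the function names and toy lines are hypothetical.

```python
def has_missing_allele(bim_line):
    """Return True if either allele column of a .bim line is the '*' symbol."""
    fields = bim_line.split()
    allele_1, allele_2 = fields[4], fields[5]
    return "*" in (allele_1, allele_2)


def ids_to_exclude(bim_lines):
    """Collect the variant IDs (second .bim column) that carry a '*' allele."""
    return [line.split()[1] for line in bim_lines if has_missing_allele(line)]


# Toy .bim lines with IDs in the CHROM:POS:REF:ALT format.
toy_bim = [
    "1\t1:100:A:G\t0\t100\tG\tA",
    "1\t1:150:T:*\t0\t150\t*\tT",
    "2\t2:300:*:C\t0\t300\tC\t*",
]
excluded = ids_to_exclude(toy_bim)
print(excluded)
```

IDs collected this way could be written to a plain text file, one per line, and used as a variant exclusion list.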

The resulting reference file metrics per superpopulation are as follows:

Population	Num_Variants	Num_Samples
EUR	41941265	633
AFR	68637558	893
AMR	45570833	490
SAS	46380876	601
EAS	41976171	585

Ensembl gene annotation file generation

To generate the protein-coding gene annotation file for the GNExT platform from a specific Ensembl release, first download the GTF file for the respective release and genome assembly from the Ensembl FTP site (https://ftp.ensembl.org/pub/), then run extract_genes_from_gft.py, specifying the --gtf and --output flags.
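For orientation, the kind of extraction involved can be sketched as follows. This is not the code of extract_genes_from_gft.py: it is a minimal illustration that assumes the standard nine-column GTF layout and the `gene_id`, `gene_name` and `gene_biotype` attribute keys used in Ensembl GTF files. The returned tuple layout, the function names and the toy records are hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTRIBUTE_PATTERN = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attribute_field):
    """Turn the ninth GTF column into a dict of attribute key/value pairs."""
    return dict(ATTRIBUTE_PATTERN.findall(attribute_field))


def protein_coding_genes(gtf_lines):
    """Yield (gene_id, chrom, start, end, strand, gene_name) per protein-coding gene."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # only gene-level records are of interest
        attributes = parse_attributes(fields[8])
        if attributes.get("gene_biotype") != "protein_coding":
            continue
        yield (
            attributes["gene_id"],
            fields[0],
            int(fields[3]),
            int(fields[4]),
            fields[6],
            attributes.get("gene_name", ""),
        )


# Toy GTF content with made-up gene IDs.
toy_gtf = [
    "#!genome-build toy\n",
    '1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "ALPHA"; gene_biotype "protein_coding";\n',
    '1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";\n',
    '2\ttoy\tgene\t900\t1200\t.\t-\t.\tgene_id "G2"; gene_name "BETA"; gene_biotype "lncRNA";\n',
]
genes = list(protein_coding_genes(toy_gtf))
print(genes)
```

Restricting to records whose feature type is `gene` avoids emitting one row per transcript or exon of the same gene.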
