This repository provides scripts to generate the reference resources required for deploying an instance of the GNExT platform. Specifically, we describe the construction of a PLINK-formatted reference panel from the 1000 Genomes Project Phase 3 dataset, which is used by MAGMA for linkage disequilibrium estimation. In addition, GNExT requires gene annotation resources for SNP-to-gene mapping; accordingly, the repository includes a script to derive the corresponding annotation file from any Ensembl GTF release.
The reference resources produced by these scripts are deposited on Zenodo (https://doi.org/10.5281/zenodo.17940903). They include PLINK files for GRCh37 and GRCh38 across five superpopulations, as well as Ensembl protein-coding gene annotation files for GRCh37 and GRCh38 from Ensembl releases 114 and 115.
This repository contains scripts for generating filtered PLINK file sets from the 1000 Genomes Project Phase 3 reference variant data. As input, our workflow uses per-chromosome VCF files for both GRCh37 and GRCh38. To stratify samples into superpopulations, we used this panel file for GRCh37 and this one for GRCh38. Below, we describe the main steps for GRCh37 and GRCh38 separately. All scripts mentioned are contained in the scripts/ directory, while any required auxiliary files, as well as the final summary statistics, are contained in data/.
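As a rough illustration of the stratification idea, the sketch below groups sample IDs by superpopulation from a tab-separated panel file. It assumes the panel has a header row with `sample` and `super_pop` columns, which is an assumption about the file layout; the function name and the toy sample IDs are hypothetical and not taken from the repository.

```python
import csv
import io
from collections import defaultdict


def group_samples_by_superpop(panel_lines):
    """Map each superpopulation code to the list of its sample IDs.

    Assumes a tab-separated panel with 'sample' and 'super_pop' header
    columns (an assumption about the panel layout).
    """
    reader = csv.DictReader(panel_lines, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["super_pop"]].append(row["sample"])
    return dict(groups)


# Made-up panel excerpt for illustration only (not real sample IDs).
example_panel = io.StringIO(
    "sample\tpop\tsuper_pop\tgender\n"
    "S1\tGBR\tEUR\tmale\n"
    "S2\tYRI\tAFR\tfemale\n"
    "S3\tFIN\tEUR\tfemale\n"
)

groups = group_samples_by_superpop(example_panel)
for superpop, samples in sorted(groups.items()):
    print(superpop, len(samples))
```

Per-superpopulation sample lists like these could then be passed to a tool that subsets samples, for example via PLINK's `--keep` option; whether split_into_populations.sh works this way is not stated here.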
For GRCh37, the steps are:

1. filter_h37.sh: Transforms multiallelic, per-chromosome input VCFs into biallelic ones, while additionally filtering out copy number variations.
2. concat_vcfs.sh: Aggregates all filtered, per-chromosome VCFs into one joint VCF file for all chromosomes.
3. vcf_to_plink.sh: Transforms the aggregated VCF file into PLINK's bfile format. Variants in pseudo-autosomal regions of the X chromosome are mapped to chromosome code X. In addition, variants are assigned unique IDs in the format CHROM:POS:REF:ALT.
4. split_into_populations.sh: Stratifies the aggregated PLINK files into superpopulations.
5. filter_MAC_duplicated.sh: Filters the PLINK files of each superpopulation to retain variants that are present in at least one sample (minor allele count >= 1) and removes variants with duplicated unique IDs (only one copy is kept).

The resulting GRCh37 reference file metrics per superpopulation are as follows:

| Population | Num_Variants | Num_Samples |
|---|---|---|
| EUR | 25059705 | 503 |
| AFR | 43671646 | 661 |
| AMR | 29497483 | 347 |
| SAS | 27686667 | 489 |
| EAS | 24503205 | 504 |
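
To make the ID assignment and the final filtering step more concrete, here is a minimal pure-Python sketch of the three ideas involved: building CHROM:POS:REF:ALT identifiers, computing a minor allele count, and dropping duplicated IDs. All function names are hypothetical, the records are toy data, and keeping the first occurrence of a duplicate is an assumption; the repository scripts may implement these steps differently.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Build a unique ID in the CHROM:POS:REF:ALT format."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def minor_allele_count(alt_dosages):
    """Minor allele count from ALT dosages (0, 1 or 2) of diploid,
    non-missing samples."""
    alt_count = sum(alt_dosages)
    total_alleles = 2 * len(alt_dosages)
    return min(alt_count, total_alleles - alt_count)


def drop_duplicate_ids(records):
    """Keep one record per variant ID.

    Assumption: the first occurrence is the one that is kept.
    """
    seen = set()
    kept = []
    for chrom, pos, ref, alt in records:
        variant_id = make_variant_id(chrom, pos, ref, alt)
        if variant_id in seen:
            continue
        seen.add(variant_id)
        kept.append((variant_id, chrom, pos, ref, alt))
    return kept


# Toy records (CHROM, POS, REF, ALT); the second one duplicates the first.
records = [
    ("1", 100, "A", "G"),
    ("1", 100, "A", "G"),
    ("X", 200, "C", "T"),
]
kept = drop_duplicate_ids(records)
print([row[0] for row in kept])

# A variant carried by no sample has a minor allele count of 0
# and would not pass a "count >= 1" filter.
print(minor_allele_count([0, 0, 0]), minor_allele_count([0, 1, 0]))
```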
For GRCh38, the steps are:

1. prune_all_h38.sh: Because the GRCh38 per-chromosome VCF files are very large, we first prune information that is not needed for our purposes. In the FORMAT field, only the actual genotypes (GT) are kept, and the INFO field is removed completely.
2. filter_h38.sh: Transforms multiallelic, per-chromosome input VCFs into biallelic ones.
3. concat_vcf.sh: Aggregates all filtered, per-chromosome VCFs into one joint VCF file for all chromosomes.
4. vcf_to_plink.sh: Transforms the aggregated VCF file into PLINK's bfile format. Variants in pseudo-autosomal regions of the X chromosome are mapped to chromosome code X. In addition, variants are assigned unique IDs in the format CHROM:POS:REF:ALT.
5. split_into_populations.sh: Stratifies the aggregated PLINK files into superpopulations.
6. exclude_missing_alleles.sh: Removes variants with the symbol '*' in either the REF or ALT column (i.e. without proper allele encoding).
7. filter_MAC_duplicated.sh: Filters the PLINK files of each superpopulation to retain variants that are present in at least one sample (minor allele count >= 1) and removes variants with duplicated unique IDs (only one copy is kept).

The resulting GRCh38 reference file metrics per superpopulation are as follows:

| Population | Num_Variants | Num_Samples |
|---|---|---|
| EUR | 41941265 | 633 |
| AFR | 68637558 | 893 |
| AMR | 45570833 | 490 |
| SAS | 46380876 | 601 |
| EAS | 41976171 | 585 |
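
The two GRCh38-specific steps above (pruning and the '*' allele exclusion) can be sketched in pure Python as follows. This is an illustration only: the function names are hypothetical, the VCF line is made up, and the repository scripts may well rely on dedicated tools rather than hand-written parsing.

```python
def prune_vcf_line(line):
    """Reduce a VCF data line to genotypes only.

    The INFO column is replaced by '.', FORMAT becomes 'GT', and each
    sample column keeps only its GT subfield.
    """
    fields = line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")
    fields[7] = "."   # INFO column
    fields[8] = "GT"  # FORMAT column
    fields[9:] = [sample.split(":")[gt_index] for sample in fields[9:]]
    return "\t".join(fields)


def has_missing_allele(ref, alt):
    """True if REF or ALT is the '*' symbol."""
    return ref == "*" or alt == "*"


# Made-up VCF data line with two samples.
example_line = "chr1\t100\t.\tA\tG\t50\tPASS\tAC=1;AN=4\tGT:DP:GQ\t0|1:12:99\t0|0:8:60"
pruned = prune_vcf_line(example_line)
print(pruned)

# Toy (REF, ALT) pairs; the second would be excluded.
alleles = [("A", "G"), ("C", "*"), ("T", "TA")]
retained = [pair for pair in alleles if not has_missing_allele(*pair)]
print(retained)
```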
To generate the protein-coding gene annotation file for the GNExT platform from a specific Ensembl release, first download the GTF file for the respective release and genome assembly from the Ensembl FTP site (https://ftp.ensembl.org/pub/). Then run extract_genes_from_gft.py, specifying the input and output paths with the --gtf and --output flags.
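
For orientation, the following sketch shows one way protein-coding genes could be pulled out of an Ensembl-style GTF. It is not the repository's extract_genes_from_gft.py: the function names, the returned tuple layout, and the toy GTF lines (including the gene IDs) are all hypothetical, and the sketch assumes the GTF carries `gene_id`, `gene_name` and `gene_biotype` attributes on its `gene` features.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict of attribute names to values."""
    return dict(ATTR_PATTERN.findall(attr_field))


def extract_protein_coding_genes(gtf_lines):
    """Collect (gene_id, chrom, start, end, strand, gene_name) tuples for
    'gene' features whose gene_biotype is 'protein_coding'.

    The tuple layout is a hypothetical choice for this sketch.
    """
    genes = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = parse_attributes(fields[8])
        if attrs.get("gene_biotype") != "protein_coding":
            continue
        genes.append((
            attrs["gene_id"], fields[0], int(fields[3]), int(fields[4]),
            fields[6], attrs.get("gene_name", ""),
        ))
    return genes


# Made-up GTF excerpt (IDs and coordinates are illustrative only).
example_gtf = [
    "#!genome-build EXAMPLE\n",
    '1\tsrc\tgene\t1000\t5000\t.\t+\t.\tgene_id "GENE_A"; gene_name "AAA"; gene_biotype "protein_coding";\n',
    '1\tsrc\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A"; gene_biotype "protein_coding";\n',
    '2\tsrc\tgene\t200\t900\t.\t-\t.\tgene_id "GENE_B"; gene_name "BBB"; gene_biotype "lncRNA";\n',
]

genes = extract_protein_coding_genes(example_gtf)
for gene in genes:
    print("\t".join(str(value) for value in gene))
```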
