GitHub - rivas-lab/waveform_disease: processing waveform data + predicting disease phenotypes

Overview

Current ECG signal analysis in large-scale health datasets primarily relies on summary statistics, such as wavelet energy measures, to assess the relationship between heart signals and disease. This project examines the individual detailed wavelet coefficients in an effort to uncover new predictive biomarkers and potentially improve disease risk prediction performance.

We also explore the reconstruction of ECG waveforms from reduced-dimensional representations, allowing interpretable recovery of signal morphology from compressed data. In parallel, we aim to estimate the heritability and genetic correlation of the wavelet-derived energy features using genome-wide association studies (GWAS), which may reveal genetic influences on different ECG features.

Data Pipeline

We use two primary sources of data:

- UK Biobank: ECG signal files for 47,052 individuals (White British only)
- Demographic data: genetic principal components, biomarkers, and disease phenotypes

Data Processing Steps:

Extract Energy Coefficients
- Extract energy coefficients from raw waveform coefficient data.
Wavelet Decomposition
- Use the script ecg_energy.py (utilizing the PyWavelets library) to decompose ECG signals per lead using the Daubechies 6 (db6) wavelet at level 6.
- Calculate energy features by summing the squares of coefficients per lead, per individual.
- The resulting dataset has 72,716 rows × 85 columns. After mapping IDs to match the master.phe UK Biobank file, removing duplicates, and keeping only White British individuals, we obtain the phenotype file wavelet_dedup_new.phe with 47,052 rows × 86 columns.
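The decomposition and energy steps can be sketched with PyWavelets as below. This is a minimal illustration, not the contents of ecg_energy.py: the function names, the lead labels, the column naming scheme, and the synthetic 10-second, 500 Hz test signal are all assumptions made for the example.

```python
import numpy as np
import pywt


def band_energies(signal, wavelet="db6", level=6):
    """Return the energy (sum of squared coefficients) of each wavelet band."""
    # wavedec returns [cA_level, cD_level, ..., cD_1]: one approximation band
    # followed by the detail bands from coarsest to finest.
    coeffs = pywt.wavedec(np.asarray(signal, dtype=float), wavelet, level=level)
    names = [f"cA{level}"] + [f"cD{j}" for j in range(level, 0, -1)]
    return {name: float(np.sum(c ** 2)) for name, c in zip(names, coeffs)}


def lead_energies(leads):
    """Build one feature row from a dict mapping lead name -> 1-D signal."""
    row = {}
    for lead, signal in leads.items():
        for band, energy in band_energies(signal).items():
            # Hypothetical column naming: "<lead>_<band>"
            row[f"{lead}_{band}"] = energy
    return row


# Synthetic stand-in for one lead (assumed 10 s at 500 Hz; not real ECG data).
rng = np.random.default_rng(0)
t = np.arange(5000) / 500.0
demo_signal = np.sin(2 * np.pi * 1.2 * t) + 0.05 * rng.standard_normal(t.size)

demo_row = lead_energies({"I": demo_signal, "II": 2 * demo_signal})
print(len(demo_row), "energy features")
```

A level-6 decomposition yields 7 bands per lead (one approximation band and six detail bands). Assuming a standard 12-lead recording, that gives 84 energy features per individual, which would be consistent with the 85-column table if the remaining column holds the individual ID; that reading is an inference, not something the pipeline description states.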

GWAS Analysis

For each energy feature phenotype, run a GWAS using PLINK2.
Adjust for covariates (age, sex, and genetic principal components PC1-PC10), quantile-normalize the phenotype, and output results for chromosomes 1-22.
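To make the quantile normalization step concrete, the sketch below applies a rank-based inverse normal transform, which maps a phenotype onto a standard normal distribution while preserving rank order. It illustrates the idea only: the (rank - 0.5) / n offset and the tie handling are assumptions, and PLINK2's internal convention may differ.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Map values to standard-normal quantiles, preserving their rank order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    std_normal = NormalDist()  # mean 0, standard deviation 1
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        # Assumed offset convention: (rank - 0.5) / n keeps p strictly in (0, 1).
        z = std_normal.inv_cdf((avg_rank - 0.5) / n)
        for k in range(i, j + 1):
            out[order[k]] = z
        i = j + 1
    return out


example = inverse_normal_transform([3.0, 1.0, 2.0])
print(example)  # the middle value (2.0) maps to the normal median
```

Because only ranks are used, heavy-tailed energy features and outliers no longer dominate the regression, at the cost of discarding the original scale.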

Example PLINK2 command (replace placeholders as needed):

./plink2 --chr 1-22 \
  --covar /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.phe \
  --covar-name age,sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
  --covar-variance-standardize \
  --glm qt-residualize hide-covar omit-ref \
  --keep /oak/stanford/groups/mrivas/ukbb24983/sqc/population_stratification_w24983_20211020/ukb24983_white_british.phe \
  --out [INSERT OUTPUT DIRECTORY HERE] \
  --pfile /oak/stanford/groups/mrivas/ukbb24983/array-combined/pgen/ukb24983_cal_hla_cnv.p \
  --pheno [INCLUDE PHENOTYPE FILE] \
  --pheno-name [INCLUDE PHENOTYPE NAME] \
  --pheno-quantile-normalize \
  --threads 20 \
  --vif 100000

For our analyses, we use the phenotype file wavelet_dedup_new.phe.
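Running one GWAS per energy feature means repeating the command above with a different phenotype name each time. The sketch below assembles that command in Python using only the flags shown above. The function name, the example phenotype names, and the output prefixes are hypothetical, and the commands are only printed, not executed.

```python
import shlex

# Paths copied from the example command above.
MASTER_PHE = "/oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.phe"
KEEP_FILE = (
    "/oak/stanford/groups/mrivas/ukbb24983/sqc/"
    "population_stratification_w24983_20211020/ukb24983_white_british.phe"
)
PFILE = "/oak/stanford/groups/mrivas/ukbb24983/array-combined/pgen/ukb24983_cal_hla_cnv.p"
COVARIATES = ["age", "sex"] + [f"PC{i}" for i in range(1, 11)]


def plink2_gwas_command(pheno_file, pheno_name, out_prefix, threads=20):
    """Return the PLINK2 argument list for a single phenotype."""
    return [
        "./plink2", "--chr", "1-22",
        "--covar", MASTER_PHE,
        "--covar-name", ",".join(COVARIATES),
        "--covar-variance-standardize",
        "--glm", "qt-residualize", "hide-covar", "omit-ref",
        "--keep", KEEP_FILE,
        "--out", out_prefix,
        "--pfile", PFILE,
        "--pheno", pheno_file,
        "--pheno-name", pheno_name,
        "--pheno-quantile-normalize",
        "--threads", str(threads),
        "--vif", "100000",
    ]


# Hypothetical phenotype names; the real column names live in the phenotype file.
for name in ["feature_a", "feature_b"]:
    cmd = plink2_gwas_command("wavelet_dedup_new.phe", name, f"gwas_out/{name}")
    print(shlex.join(cmd))
```

Building the command as a list makes it straightforward to pass to a job scheduler or to subprocess.run once the paths have been checked.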

LDSC Regression (Heritability & Genetic Correlation)
- Use the LDSC GitHub repository to run SNP-based heritability and genetic correlation analyses via LDSC regression.
- The code and environment have been updated for compatibility with Python 3.8 and modern dependencies.
Munge GWAS Files
- Prepare GWAS summary statistics for LDSC using munge_all.sh, modifying input paths as needed.
Run Heritability Analysis
- Use ldsc_all_h2.sh to compute SNP-based heritability for each munged GWAS file.
Run Genetic Correlation Analysis
- Use ldsc_all_rg.sh to compute genetic correlation between the munged GWAS files and external reference files (e.g., munged FinnGen I9 phenotype files).
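The three LDSC steps above map onto the standard LDSC command-line tools roughly as sketched below. This is not the content of munge_all.sh, ldsc_all_h2.sh, or ldsc_all_rg.sh: the helper names, file names, and LD score directory are hypothetical, and munging PLINK2 output may need additional column or sample-size options that are omitted here.

```python
def munge_command(gwas_file, out_prefix):
    """Convert GWAS summary statistics to LDSC format (<out_prefix>.sumstats.gz)."""
    return ["munge_sumstats.py", "--sumstats", gwas_file, "--out", out_prefix]


def h2_command(sumstats, ld_dir, out_prefix):
    """SNP-based heritability for a single munged trait."""
    return [
        "ldsc.py", "--h2", sumstats,
        "--ref-ld-chr", ld_dir, "--w-ld-chr", ld_dir,
        "--out", out_prefix,
    ]


def rg_command(sumstats_a, sumstats_b, ld_dir, out_prefix):
    """Genetic correlation between two munged traits."""
    return [
        "ldsc.py", "--rg", f"{sumstats_a},{sumstats_b}",
        "--ref-ld-chr", ld_dir, "--w-ld-chr", ld_dir,
        "--out", out_prefix,
    ]


# Hypothetical inputs, for illustration only.
LD_DIR = "eur_w_ld_chr/"
energy_traits = ["munged/feature_a.sumstats.gz", "munged/feature_b.sumstats.gz"]
external_traits = ["munged/finngen_example.sumstats.gz"]

commands = [h2_command(t, LD_DIR, t.replace(".sumstats.gz", "_h2")) for t in energy_traits]
for i, a in enumerate(energy_traits):
    for j, b in enumerate(external_traits):
        commands.append(rg_command(a, b, LD_DIR, f"rg_out/pair_{i}_{j}"))

for cmd in commands:
    print(" ".join(cmd))
```

Heritability is estimated once per trait, while genetic correlation is estimated once per pair of an energy-feature trait and an external trait, so the number of runs grows with the product of the two lists.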
Fine-mapping
- Perform fine-mapping using SuSiE-inf.

References

Bulik-Sullivan, B.K., Loh, P.R., Finucane, H.K., Ripke, S., Yang, J., Patterson, N., Daly, M.J., Price, A.L., Neale, B.M., and the Schizophrenia Working Group of the Psychiatric Genomics Consortium. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics. 2015; 47(3): 291–295.
UK Biobank: https://www.ukbiobank.ac.uk
PyWavelets Documentation: https://pywavelets.readthedocs.io
