This repository demonstrates DeepVariant, a deep learning-based variant caller, and showcases fine-tuning on a small genomic dataset. The goal is to show how a pretrained model can be adapted to custom datasets and compare results before and after fine-tuning.
DeepVariant uses a convolutional neural network (CNN) to call genetic variants from aligned sequencing data.
- Inputs: BAM/CRAM reads aligned to a reference genome
- Outputs: VCF/gVCF files containing variant calls
- Fine-tuning: Adapts the pretrained model to new datasets for improved accuracy
This repo provides scripts, notebooks, and example data for a hands-on demo.
DeepVariant-Finetuning/
├── data/ # Example BAM and reference genome
├── notebooks/ # Jupyter/Colab demo notebook
├── scripts/
│ ├── run_inference.sh # Run default DeepVariant
│ ├── make_examples.sh # Prepare TFRecords for fine-tuning
│ ├── fine_tune.sh # Fine-tune the model
│ └── run_finetuned.sh # Run inference with fine-tuned model
├── models/
│ └── custom_model/ # Output of fine-tuning
├── README.md
└── requirements.txt # Optional dependencies
- Open the notebook notebooks/demo_deepvariant.ipynb in Colab.
- Upload example BAM and reference genome (small chromosome preferred for demo).
- Run the cells step-by-step:
- Run inference with the pretrained model.
- Generate examples for fine-tuning.
- Fine-tune on your dataset.
- Run inference with fine-tuned model.
- Compare results (variant counts, VCF differences).
docker pull google/deepvariant:1.5.0
docker run --platform linux/amd64 -v "$PWD":/input -t google/deepvariant:1.5.0 \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS \
--ref=/input/data/REFERENCE.fa \
--reads=/input/data/EXAMPLE.bam \
--output_vcf=/input/data/output.vcf \
--num_shards=4

⚠ On Apple M1/M4/M5 Macs, emulation may be slow and fine-tuning is not recommended locally.
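The Docker command above expects the reference and reads to be indexed alongside the files themselves. A minimal pre-flight sketch follows, assuming the usual samtools naming (`REFERENCE.fa.fai`, and `EXAMPLE.bam.bai` or `EXAMPLE.bai`). The `check_inputs` helper is hypothetical and not part of this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repo): check that the
# reference and reads exist and have index files before starting Docker.
check_inputs() {
  local ref="$1" bam="$2" status=0

  # Both inputs must exist and be non-empty.
  [ -s "$ref" ] || { echo "missing reference: $ref" >&2; status=1; }
  [ -s "$bam" ] || { echo "missing reads: $bam" >&2; status=1; }

  # FASTA index, as written by `samtools faidx`.
  [ -e "${ref}.fai" ] || { echo "missing FASTA index: ${ref}.fai" >&2; status=1; }

  # BAM index, as written by `samtools index` (either naming convention).
  if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
    echo "missing BAM index for: $bam" >&2
    status=1
  fi

  return "$status"
}

# Example (paths taken from the command above):
#   check_inputs data/REFERENCE.fa data/EXAMPLE.bam && echo "inputs look OK"
```

The function only reports problems. It does not create indexes, so a failed check is a prompt to run `samtools faidx` or `samtools index` first.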
- Generate training examples:
sh scripts/make_examples.sh
- Fine-tune the pretrained model:
sh scripts/fine_tune.sh
- Run inference with the fine-tuned model:
sh scripts/run_finetuned.sh
- Compare the outputs:
grep -vc "^#" data/output.vcf
grep -vc "^#" data/output_finetuned.vcf

- Default model vs fine-tuned model
- Variant counts, example VCF snippets, and pileup visualizations
- Demonstrates improvements from domain adaptation
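Raw line counts can hide which sites actually changed. The sketch below goes one step further and reports the sites shared by the two VCFs and those unique to each, keyed on CHROM, POS, REF and ALT. It assumes uncompressed, tab-delimited VCFs. The function names are illustrative, not part of this repo. Because a DeepVariant VCF may contain records whose FILTER is not `PASS`, a PASS-only count is included as well.

```shell
#!/usr/bin/env bash
# Illustrative helpers for comparing two uncompressed VCF files.

# Count all variant records (every non-header line).
count_variants() {
  grep -vc '^#' "$1" || true   # grep exits 1 when the count is 0
}

# Count only records whose FILTER column (7th field) is PASS.
count_pass_variants() {
  awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }' "$1"
}

# Print one sorted, de-duplicated CHROM:POS:REF:ALT key per record.
variant_keys() {
  awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# Report how many sites are shared or unique between two VCFs.
compare_vcfs() {
  local a="$1" b="$2" keys_a keys_b
  keys_a="$(mktemp)"
  keys_b="$(mktemp)"
  variant_keys "$a" > "$keys_a"
  variant_keys "$b" > "$keys_b"
  echo "shared: $(LC_ALL=C comm -12 "$keys_a" "$keys_b" | wc -l | tr -d ' ')"
  echo "only_first: $(LC_ALL=C comm -23 "$keys_a" "$keys_b" | wc -l | tr -d ' ')"
  echo "only_second: $(LC_ALL=C comm -13 "$keys_a" "$keys_b" | wc -l | tr -d ' ')"
  rm -f "$keys_a" "$keys_b"
}

# Example (file names taken from the comparison step above):
#   compare_vcfs data/output.vcf data/output_finetuned.vcf
```

Matching on the site key ignores genotype and quality differences, so this is a coarse view. Without a truth set it shows that the two models disagree, not which one is right.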
- For full fine-tuning on human genomes, GPU + x86_64 is recommended.
- Small datasets are used here for demo purposes only.
- Colab demo allows running end-to-end without local Docker setup.