Exomiser data files are currently distributed as large versioned zip files. Of their contents, usually only the transcript and ClinVar data change between releases, so an update really only needs a few hundred MB of new data. We keep the files versioned for traceability and ease of initial installation.
/data/exomiser-data $ tree -L 1 -h 2602_hg38/
[4.0K] 2602_hg38/
├── [156M] 2602_hg38_clinvar.mv.db
├── [430M] 2602_hg38_genome.mv.db
├── [ 23M] 2602_hg38_transcripts_ensembl.ser
├── [ 61M] 2602_hg38_transcripts_refseq.ser
├── [106M] 2602_hg38_transcripts_ucsc.ser
└── [ 30G] 2602_hg38_variants.mv.db
1 directory, 7 files
The OCI image spec (https://specs.opencontainers.org/image-spec/image-layout/) handles de-duplication through layering: each layer is identified by the hash of its content, so files which are unchanged do not need to be stored or transferred again. The spec can be used for arbitrary data as well as container images.
ORAS (https://oras.land/) is a client for pushing and pulling such artifacts.
Using OCI layering would improve server-side storage usage and data egress whilst maintaining data provenance/versioning.
Hosting this on Google Cloud Artifact Registry could be an option, according to Claude:
Directory structure on disk:
exomiser_hg38/
    clinvar.mv.db
    genome.mv.db
    jannovar_transcripts_ensembl.ser
    transcripts_ensembl.ser
    transcripts_refseq.ser
    transcripts_ucsc.ser
    variants.mv.db
Files are no longer version-prefixed; the version lives entirely in the OCI tag.
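To get a feel for how much of a release would actually need transferring, the content hashes of two unpacked releases can be compared locally. The following is a minimal sketch, not part of Exomiser or oras: `changed_files` is a hypothetical helper, it assumes both directories use the unprefixed file names listed above, and it assumes GNU `sha256sum` is available.

```shell
#!/usr/bin/env bash
# Sketch: report which data files differ between two unpacked releases,
# comparing SHA-256 digests in the same spirit as OCI content addressing.
set -euo pipefail

# changed_files OLD_DIR NEW_DIR  (hypothetical helper)
changed_files() {
  local old_dir=$1 new_dir=$2 f name
  for f in "$new_dir"/*; do
    [ -f "$f" ] || continue
    name=$(basename "$f")
    if [ ! -f "$old_dir/$name" ]; then
      echo "added    $name"
    elif [ "$(sha256sum < "$f")" != "$(sha256sum < "$old_dir/$name")" ]; then
      echo "changed  $name"
    else
      echo "same     $name"
    fi
  done
}
```

Files reported as `same` are the ones a content-addressed registry would not need to store twice.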
- Initialise gcloud
gcloud init
gcloud auth login
- Enable Artifact Registry and create repository
gcloud services enable artifactregistry.googleapis.com
gcloud artifacts repositories create exomiser \
    --repository-format=docker \
    --location=us-central1 \
    --description="Exomiser data releases"
- Authenticate oras
oras login us-central1-docker.pkg.dev \
    --username oauth2accesstoken \
    --password "$(gcloud auth print-access-token)"
- Push a release (maintainer side)
# Start with small files to validate
oras push us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
    exomiser_hg38/clinvar.mv.db \
    exomiser_hg38/transcripts_ensembl.ser
Once validated, the full release push would be:
oras push us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
    exomiser_hg38/clinvar.mv.db \
    exomiser_hg38/genome.mv.db \
    exomiser_hg38/jannovar_transcripts_ensembl.ser \
    exomiser_hg38/transcripts_ensembl.ser \
    exomiser_hg38/transcripts_refseq.ser \
    exomiser_hg38/transcripts_ucsc.ser \
    exomiser_hg38/variants.mv.db
- Pull (user side)
# New install
oras pull us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
    --output exomiser_hg38/
# Update to a new release. The aim is to download only layers whose digests have changed.
# This needs verifying: oras may need a local blob cache (ORAS_CACHE) to skip unchanged layers.
oras pull us-central1-docker.pkg.dev/<project>/exomiser/hg38:2603 \
    --output exomiser_hg38/
Separate artifacts per genome build and phenotype:
exomiser/hg38:2602
exomiser/hg19:2602
exomiser/phenotype:2602
The user's Exomiser config remains permanently pointed at the fixed directories regardless of which release tag they pulled:
exomiser.data-directory=exomiser_hg38/
We could then add a nice update command to the CLI:
exomiser update --genome hg38 # update to latest
exomiser update --genome hg38 --tag 2603 # update to specific release
exomiser update --list # show available releases/tags
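As a rough idea of what such a command might do under the hood, here is a shell sketch wrapping `oras repo tags` and `oras pull`. Everything named here is an assumption for illustration: the `EXOMISER_REGISTRY` variable, the `artifact_ref`, `latest_tag` and `update` functions, and the idea that release tags are four-digit numbers that sort in release order (as 2602 and 2603 do). `<project>` is left as a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the logic behind `exomiser update`; not an existing command.
set -euo pipefail

# Assumption: same registry path as in the oras commands above.
REGISTRY="${EXOMISER_REGISTRY:-us-central1-docker.pkg.dev/<project>/exomiser}"

# Build the artifact reference for a genome build and release tag.
artifact_ref() {
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
}

# Pick the newest release from a list of tags on stdin.
# Assumes numeric four-digit tags; anything else (e.g. "latest") is ignored.
latest_tag() {
  grep -E '^[0-9]{4}$' | sort -n | tail -n 1
}

# update GENOME [TAG]: pull the given (or newest) release into the fixed data directory.
update() {
  local genome=$1 tag=${2:-}
  if [ -z "$tag" ]; then
    tag=$(oras repo tags "$REGISTRY/$genome" | latest_tag)
  fi
  oras pull "$(artifact_ref "$genome" "$tag")" --output "exomiser_${genome}/"
}
```

With this in place, `update hg38` would correspond to `exomiser update --genome hg38`, and `update hg38 2603` to the `--tag 2603` form.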
However, this would need a bit of thought about how to continue providing files for FTP download, as this is how the data is moved into restricted environments. This could be resolved by manually zipping the local files, uploading the archive through an airlock, and manually unzipping it where Exomiser expects the files to be, as defined in application.properties.
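The manual airlock route could be sketched roughly as below. This is only an illustration under stated assumptions: `bundle` and `unbundle` are hypothetical helpers, the `SHA256SUMS` manifest name is made up for this example, and `tar` is used here in place of zip purely to keep the sketch dependent on common tools (any archive format would do). The checksum manifest is an addition to the manual steps described above, included so the files can be verified after the transfer.

```shell
#!/usr/bin/env bash
# Sketch: bundle a pulled data directory for transfer through an airlock,
# then unpack and verify it on the restricted side.
set -euo pipefail

# bundle DATA_DIR ARCHIVE  (ARCHIVE should live outside DATA_DIR)
bundle() {
  local data_dir=$1 archive=$2
  # Record a checksum for every data file, excluding the manifest itself.
  ( cd "$data_dir" &&
    find . -maxdepth 1 -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
  tar -cf "$archive" -C "$data_dir" .
}

# unbundle ARCHIVE TARGET_DIR
unbundle() {
  local archive=$1 target_dir=$2
  mkdir -p "$target_dir"
  tar -xf "$archive" -C "$target_dir"
  # Fails with a non-zero exit status if any file was corrupted in transit.
  ( cd "$target_dir" && sha256sum -c SHA256SUMS )
}
```

The target directory passed to `unbundle` would be whatever `exomiser.data-directory` points at in application.properties.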