Investigate OCI images + oras for more efficient data distribution, storage and updates #622

@julesjacobsen

Exomiser data files are currently distributed as large versioned zip files. Of these, only the transcripts and ClinVar data usually change between releases, which means a release really only requires a few hundred MB of new data. We keep them versioned for the purpose of traceability and ease of initial installation.

/data/exomiser-data $ tree -L 1 -h 2602_hg38/
[4.0K]  2602_hg38/
├── [156M]  2602_hg38_clinvar.mv.db
├── [430M]  2602_hg38_genome.mv.db
├── [ 23M]  2602_hg38_transcripts_ensembl.ser
├── [ 61M]  2602_hg38_transcripts_refseq.ser
├── [106M]  2602_hg38_transcripts_ucsc.ser
└── [ 30G]  2602_hg38_variants.mv.db

1 directory, 7 files

The OCI image spec (https://specs.opencontainers.org/image-spec/image-layout/) deals with the issue of de-duplication through layering: each layer is identified by the hash (digest) of its content, so files which are unchanged between releases need not be stored or transferred again. The spec can be used both for container images and for data.

ORAS (https://oras.land/) is a CLI for pushing and pulling arbitrary files as OCI artifacts.

Using OCI layering would reduce server-side storage usage and data egress whilst maintaining data provenance/versioning.
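
As a rough illustration of the digest-based de-duplication, the sketch below compares two local release directories by sha256 and lists the files whose content differs, i.e. the only ones a new release would need to add as new layers. The function name `changed_files` and its directory arguments are assumptions for illustration, and it assumes GNU `sha256sum` is available; it is not part of Exomiser or oras.

```shell
# Hypothetical helper: print the names of files in NEW_DIR whose sha256
# differs from the same-named file in OLD_DIR, or which are missing there.
changed_files() {
  local old_dir=$1 new_dir=$2 f name old_sum new_sum
  for f in "$new_dir"/*; do
    [ -f "$f" ] || continue
    name=$(basename "$f")
    new_sum=$(sha256sum "$f" | cut -d' ' -f1)
    if [ -f "$old_dir/$name" ]; then
      old_sum=$(sha256sum "$old_dir/$name" | cut -d' ' -f1)
    else
      old_sum=""   # file is new in this release
    fi
    [ "$new_sum" = "$old_sum" ] || echo "$name"
  done
}
```

Called as `changed_files old_release/ new_release/` on two directories with unprefixed file names, this would be expected to list only the ClinVar and transcript files for a typical release, if the description above holds.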

Hosting this on Google Cloud Artifact Registry could be an option, according to Claude:

Directory structure on disk

exomiser_hg38/
  clinvar.mv.db
  genome.mv.db
  jannovar_transcripts_ensembl.ser
  transcripts_ensembl.ser
  transcripts_refseq.ser
  transcripts_ucsc.ser
  variants.mv.db

Files are no longer version-prefixed — the version lives entirely in the OCI tag.

  1. Initialise gcloud
gcloud init
gcloud auth login
  2. Enable Artifact Registry and create repository
gcloud services enable artifactregistry.googleapis.com

gcloud artifacts repositories create exomiser \
  --repository-format=docker \
  --location=us-central1 \
  --description="Exomiser data releases"
  3. Authenticate oras
# Pipe the token via stdin so it does not appear in the process list
gcloud auth print-access-token | oras login us-central1-docker.pkg.dev \
  --username oauth2accesstoken \
  --password-stdin
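  4. Push a release (your side)
# Start with small files to validate. Push from inside exomiser_hg38/ so the
# layer titles are bare file names; oras recreates each pushed path under --output on pull.
(cd exomiser_hg38/ && oras push us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
  clinvar.mv.db \
  transcripts_ensembl.ser)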

Once validated, the full release push would be:

# Push from inside exomiser_hg38/ so files are stored under their bare names
(cd exomiser_hg38/ && oras push us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
  clinvar.mv.db \
  genome.mv.db \
  jannovar_transcripts_ensembl.ser \
  transcripts_ensembl.ser \
  transcripts_refseq.ser \
  transcripts_ucsc.ser \
  variants.mv.db)
  5. Pull (user side)
# New install
oras pull us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
  --output exomiser_hg38/

# Update to new release. Layers with unchanged digests need not be fetched again
# (to verify: oras may need a local blob cache, e.g. ORAS_CACHE, to skip them)
oras pull us-central1-docker.pkg.dev/<project>/exomiser/hg38:2603 \
  --output exomiser_hg38/

Separate artifacts per genome build and phenotype:

exomiser/hg38:2602
exomiser/hg19:2602
exomiser/phenotype:2602

The user's Exomiser config remains permanently pointed at the fixed directories regardless of which release tag they pulled:

exomiser.data-directory=exomiser_hg38/

We could then add a nice update command to the CLI:

exomiser update --genome hg38           # update to latest
exomiser update --genome hg38 --tag 2603  # update to specific release
exomiser update --list                  # show available releases/tags
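
A minimal sketch of what such an update command might do, written as a shell wrapper around the oras commands above. Everything here is hypothetical: `exomiser_update`, `release_ref`, `latest_tag` and the `EXOMISER_REPO_PREFIX` variable are names made up for illustration, and "latest" is assumed to mean the numerically highest four-digit release tag.

```shell
# Hypothetical wrapper; not an existing Exomiser command.
# EXOMISER_REPO_PREFIX is an assumed variable, e.g.
#   us-central1-docker.pkg.dev/<project>/exomiser

# Build the OCI reference for a genome build and release tag.
release_ref() {
  printf '%s/%s:%s\n' "$EXOMISER_REPO_PREFIX" "$1" "$2"
}

# Pick the newest release from tags on stdin, assuming numeric YYMM tags.
latest_tag() {
  grep -E '^[0-9]{4}$' | sort -n | tail -n 1
}

# Pull a release into the fixed data directory; tag defaults to the newest.
exomiser_update() {
  local genome=$1 tag=${2:-}
  if [ -z "$tag" ]; then
    tag=$(oras repo tags "$EXOMISER_REPO_PREFIX/$genome" | latest_tag)
  fi
  oras pull "$(release_ref "$genome" "$tag")" --output "exomiser_${genome}/"
}
```

With this sketch, `exomiser_update hg38` would correspond to `exomiser update --genome hg38`, and `exomiser_update hg38 2603` to the `--tag 2603` form; `oras repo tags` on its own would cover `--list`.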

However, this would need a bit of thought about how to continue to provide files for FTP download, as this is how the data is moved into restricted environments. This could be resolved by manually zipping the local files to upload through an airlock and manually unzipping them where Exomiser expects them to be, as defined in the application.properties.
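
One way the manual airlock route could keep some of the provenance that the registry digests provide is to ship a checksum manifest alongside the zipped files. The sketch below is only an assumption about how that might look: `write_manifest`, `verify_manifest` and the `SHA256SUMS` file name are hypothetical, and GNU `sha256sum` is assumed on both sides.

```shell
# Hypothetical helpers; SHA256SUMS is an assumed manifest file name.

# Before zipping: record a sha256 for every data file in the directory.
write_manifest() {
  (
    cd "$1" || exit 1
    for f in *; do
      if [ -f "$f" ] && [ "$f" != SHA256SUMS ]; then
        sha256sum "$f"
      fi
    done > SHA256SUMS
  )
}

# After unzipping inside the restricted environment: check every file
# against the manifest; returns non-zero if any file differs.
verify_manifest() {
  ( cd "$1" && sha256sum -c SHA256SUMS > /dev/null )
}
```

Running `write_manifest exomiser_hg38/` before zipping and `verify_manifest exomiser_hg38/` after unzipping would confirm the files arrived unchanged.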
