Investigate OCI images + oras for more efficient data distribution, storage and updates #622

Exomiser data files are currently distributed as large versioned zip files. Of these, only the transcripts and ClinVar data usually change between releases, which means a release really only requires a few hundred MB of new data. We keep the files versioned for traceability and ease of initial installation.
```
/data/exomiser-data $ tree -L 1 -h 2602_hg38/
[4.0K]  2602_hg38/
├── [156M]  2602_hg38_clinvar.mv.db
├── [430M]  2602_hg38_genome.mv.db
├── [ 23M]  2602_hg38_transcripts_ensembl.ser
├── [ 61M]  2602_hg38_transcripts_refseq.ser
├── [106M]  2602_hg38_transcripts_ucsc.ser
└── [ 30G]  2602_hg38_variants.mv.db

1 directory, 7 files
```
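
As a rough check on the "few hundred MB" figure, the sizes in the listing above can be added up. This is only back-of-envelope arithmetic on the rounded sizes that `tree -h` printed (treating 30G as 30 × 1024 MB), not a measurement:

```bash
#!/usr/bin/env bash
# Sizes in MB, copied from the rounded `tree -h` listing for 2602_hg38.
clinvar=156
genome=430
transcripts=$((23 + 61 + 106))   # ensembl + refseq + ucsc
variants=$((30 * 1024))          # 30G, assuming 1G = 1024M

# Files that usually change between releases vs. the whole listed release.
changed_mb=$((clinvar + transcripts))
total_mb=$((clinvar + genome + transcripts + variants))

echo "typically changed: ${changed_mb} MB of ${total_mb} MB"
```

On these numbers an incremental update would move roughly 346 MB, about 1% of the listed release.
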
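The OCI image spec (https://specs.opencontainers.org/image-spec/image-layout/) addresses de-duplication through layering: each layer is stored as a blob identified by the hash (digest) of its content, so a file that is unchanged between releases is stored only once and does not need to be uploaded again. The spec can be used for arbitrary data as well as for container images.

ORAS (https://oras.land/) is a command-line tool for pushing and pulling arbitrary files as OCI artifacts.

Using OCI layering would reduce server-side storage usage and data egress whilst maintaining data provenance/versioning.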
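
The de-duplication relies on content digests: a file whose bytes are unchanged keeps the same digest, so a registry can recognise it as already stored. The sketch below is not how a registry is implemented; it only illustrates the idea by comparing SHA-256 digests of same-named files in two local release directories. The function name and the directory names in the example are hypothetical, and GNU `sha256sum` is assumed to be available.

```bash
#!/usr/bin/env bash
# Print the names of files in NEW whose SHA-256 digest differs from the
# same-named file in OLD, or which do not exist in OLD at all.
# Assumes GNU coreutils `sha256sum` is on the PATH.
changed_files() {
  local old="$1" new="$2" f name
  for f in "$new"/*; do
    [ -f "$f" ] || continue
    name=$(basename "$f")
    if [ ! -f "$old/$name" ] ||
       [ "$(sha256sum < "$f")" != "$(sha256sum < "$old/$name")" ]; then
      echo "$name"
    fi
  done
}

# Example with hypothetical directories:
#   changed_files exomiser_hg38_2602 exomiser_hg38_2603
```

Only the files this prints would correspond to new layers in the next release; everything else would already be present under its existing digest.
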

Hosting this on Google Cloud Artifact Registry could be an option, according to Claude:

Directory structure on disk

```
exomiser_hg38/
  clinvar.mv.db
  genome.mv.db
  jannovar_transcripts_ensembl.ser
  transcripts_ensembl.ser
  transcripts_refseq.ser
  transcripts_ucsc.ser
  variants.mv.db
```
Files are no longer version-prefixed; the version lives entirely in the OCI tag.

1. Initialise gcloud
```bash
gcloud init
gcloud auth login
```

2. Enable Artifact Registry and create repository
```bash
gcloud services enable artifactregistry.googleapis.com

gcloud artifacts repositories create exomiser \
  --repository-format=docker \
  --location=us-central1 \
  --description="Exomiser data releases"
```

3. Authenticate oras
```bash
# Pipe the token via stdin so it does not appear in the process list
gcloud auth print-access-token | oras login us-central1-docker.pkg.dev \
  --username oauth2accesstoken \
  --password-stdin
```

4. Push a release (maintainer side)
```bash
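# Start with small files to validate. Run from inside the data directory
# (assumption: oras stores each path as given, so an exomiser_hg38/ prefix
# would be recreated under the --output directory on pull).
(cd exomiser_hg38 && oras push us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
  clinvar.mv.db \
  transcripts_ensembl.ser)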
```
Once validated, the full release push would be:
```bash
# Run from inside the data directory so files are stored without a path prefix
(cd exomiser_hg38 && oras push us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
  clinvar.mv.db \
  genome.mv.db \
  jannovar_transcripts_ensembl.ser \
  transcripts_ensembl.ser \
  transcripts_refseq.ser \
  transcripts_ucsc.ser \
  variants.mv.db)
```
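
Listing every file by hand is easy to get wrong as the set of data files changes. A minimal sketch of a helper that builds the push command from whatever is in the release directory is shown below. `build_push_cmd` is a hypothetical name, and the helper only prints the command for review rather than running it. It uses bare file names and a `cd` into the directory, on the assumption that oras records each path exactly as it is given on the command line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) an `oras push` command covering
# every regular file in a release directory.
#   build_push_cmd <registry-ref> <release-dir>
build_push_cmd() {
  local ref="$1" dir="$2" f
  printf '(cd %s && oras push %s' "$dir" "$ref"
  for f in "$dir"/*; do
    # Skip sub-directories; only plain files become layers.
    if [ -f "$f" ]; then printf ' %s' "$(basename "$f")"; fi
  done
  printf ')\n'
}

# Example:
#   build_push_cmd 'us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602' exomiser_hg38
```

File names containing spaces would need quoting before the printed command could be pasted into a shell; the data files listed above do not have any.
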

5. Pull (user side)
```bash
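# Assumption to verify: a plain `oras pull` into an existing directory may
# fetch every layer again. Setting ORAS_CACHE keeps pulled blobs in a local
# cache (a second on-disk copy) so that unchanged layers can be reused.
export ORAS_CACHE=~/.oras/cache

# New install
oras pull us-central1-docker.pkg.dev/<project>/exomiser/hg38:2602 \
  --output exomiser_hg38/

# Update to a new release: layers whose digests are already cached are not downloaded again
oras pull us-central1-docker.pkg.dev/<project>/exomiser/hg38:2603 \
  --output exomiser_hg38/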
```

**Separate artifacts per genome build and phenotype:**
```
exomiser/hg38:2602
exomiser/hg19:2602
exomiser/phenotype:2602
```

The user's Exomiser config stays pointed at the same fixed directory, regardless of which release tag was pulled:
```properties
exomiser.data-directory=exomiser_hg38/
```

We could then add a nice `update` command to the CLI:

```bash
exomiser update --genome hg38           # update to latest
exomiser update --genome hg38 --tag 2603  # update to specific release
exomiser update --list                  # show available releases/tags
```
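
For the "update to latest" case, the command would need to decide which tag is newest. A minimal sketch, assuming release tags are always four digits like `2602` and that a numerically larger tag means a newer release (both assumptions; `latest_tag` is a hypothetical name):

```bash
#!/usr/bin/env bash
# Read tags on stdin, one per line, and print the highest four-digit tag.
# Anything that is not exactly four digits (e.g. "latest") is ignored.
latest_tag() {
  grep -E '^[0-9]{4}$' | sort -n | tail -n 1
}

# Example, feeding it the registry's tag list:
#   oras repo tags 'us-central1-docker.pkg.dev/<project>/exomiser/hg38' | latest_tag
```

Resolving the tag on the client side like this would avoid maintaining a mutable `latest` tag in the registry, which keeps every pulled release traceable to an explicit version.
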

**However**, this would need some thought about how to continue providing files for FTP download, as this is how the data is moved into restricted environments. This could be resolved by manually zipping the local files, uploading the archive through an airlock, and manually unzipping it where Exomiser expects the files to be, as defined in `application.properties`.
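
The manual airlock route could be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure: the function names are hypothetical, `tar` with gzip is used in place of zip, GNU `sha256sum` is assumed on both sides, and a checksum manifest is added so the files can be verified after transfer.

```bash
#!/usr/bin/env bash
# Bundle a data directory plus a checksum manifest into one archive.
#   bundle_release <data-dir> <archive.tar.gz>
bundle_release() {
  local dir="$1" archive="$2"
  # Record a SHA-256 for every file except the manifest itself.
  ( cd "$dir" &&
    find . -maxdepth 1 -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
  tar -czf "$archive" -C "$dir" .
}

# Unpack inside the restricted environment and verify against the manifest.
# Returns non-zero if any file fails its checksum.
#   unpack_release <archive.tar.gz> <target-dir>
unpack_release() {
  local archive="$1" target="$2"
  mkdir -p "$target"
  tar -xzf "$archive" -C "$target"
  ( cd "$target" && sha256sum --check --quiet SHA256SUMS )
}

# Example with hypothetical paths:
#   bundle_release exomiser_hg38 /tmp/exomiser_hg38_2602.tar.gz
#   unpack_release exomiser_hg38_2602.tar.gz /data/exomiser_hg38
```

The target directory would be whatever `exomiser.data-directory` points at, so the unpacked layout matches what a registry pull would have produced.
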
