
Are RIAs necessary? #357

@asmacdo


Problem

Goal of this issue: collect the RIA discussion in one place, freeing up other related issues to be more focused.

babs currently creates input_ria/ and output_ria/ as part of every project. These are local RIA stores used to shuttle data between the project and job scratch directories. I do not understand the benefit, and we end up with more clones of the same dataset.

  • input_ria is a clone of the input dataset that jobs clone from, but the actual data is never pulled into it. Jobs run datalad get through it, and the data comes from the upstream source. It is an empty intermediary that adds an extra hop to the clone chain.
  • output_ria collects job results before the octopus merge. Could we use a regular bare git repo for the same benefits?

The RIA pattern was motivated by the FAIRly big paper, where condor jobs needed to work across systems without a shared filesystem. When a shared filesystem exists, RIAs add indirection (extra clones, extra remotes) without solving a real problem.

Proposal

Replace RIA stores with direct git operations on the shared filesystem:

  • Input: jobs clone directly from the analysis dataset (or a bare repo alongside it)
  • Output: jobs push result branches directly to the analysis dataset (or a bare repo), with a lock file to serialize pushes as Yarik suggested in #327

This simplifies the project layout, removes a conceptual layer users have to understand, and eliminates the post-merge clone-from-output-RIA step.
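
To make the output side of the proposal concrete, here is a minimal sketch of a lock-serialized push. It is not babs code: the names exclusive_lock, push_cmd and push_result_branch, the lock file location, and the one-branch-per-job naming are all hypothetical. It assumes a POSIX system where fcntl.flock works on the shared filesystem (advisory locks can behave differently on some network filesystems, so that would need checking per cluster), and it only covers the git branch, not annexed file content.

```python
import fcntl
import subprocess
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    Path(lock_path).parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as handle:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def push_cmd(remote, branch):
    """Build the git command that publishes one job's result branch."""
    return ["git", "push", remote, f"{branch}:{branch}"]


def push_result_branch(job_clone, remote, branch, lock_path, run=subprocess.run):
    """Push `branch` from `job_clone` to `remote`, one job at a time.

    `run` is injectable so the locking logic can be exercised without git.
    """
    with exclusive_lock(lock_path):
        return run(push_cmd(remote, branch), cwd=job_clone, check=True)
```

A job would call something like push_result_branch(scratch_clone, "origin", "job-sub-01", "/path/to/project/.locks/push.lock") as its final step (the paths and branch name here are placeholders). Because each job pushes its own branch, the lock only guards against concurrent writes to the same repository, not against conflicting history.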

Related discussion

Questions

  • Are there babs users running without a shared filesystem (e.g., HTCondor across sites)? If so, RIA removal would break their workflow.
  • Is the octopus merge step tightly coupled to RIA, or does it just need branches from any git remote?
