Problem
Goal of this issue: collect RIA discussion in one place, so that other related issues can stay focused.
babs currently creates input_ria/ and output_ria/ as part of every project. These are local RIA stores used to shuttle data between the project and job scratch directories. I do not understand the benefit, and we end up with more clones of the same dataset.
input_ria is a clone of the input dataset that jobs clone from, but the actual data is never pulled into it. Jobs run datalad get through it, and the data comes from the upstream source. It's an empty intermediary that adds an extra hop in the clone chain.
output_ria collects job results before the octopus merge. Could we use a regular bare git repo for the same benefits?
The RIA pattern was motivated by the FAIRly big paper, where condor jobs needed to work across systems without a shared filesystem. When a shared filesystem exists, RIAs add indirection (extra clones, extra remotes) without solving a real problem.
Proposal
Replace RIA stores with direct git operations on the shared filesystem:
- Input: jobs clone directly from the analysis dataset (or a bare repo alongside it)
- Output: jobs push result branches directly to the analysis dataset (or a bare repo), with a lock file to serialize pushes as Yarik suggested in #327
This simplifies the project layout, removes a conceptual layer users have to understand, and eliminates the post-merge clone-from-output-RIA step.
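A minimal sketch of the lock-file idea from the Output bullet, assuming a POSIX shared filesystem where flock works (worth verifying on NFS or Lustre). The function names, lock path, remote name, and branch name are hypothetical and not part of babs. It only covers the git branch push; transferring annexed content is not shown.

```python
import fcntl
import subprocess
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path):
    """Block until an exclusive advisory lock on lock_path is held."""
    Path(lock_path).parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # waits for other jobs to finish
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def push_result_branch(job_repo, remote, branch, lock_path, run=subprocess.run):
    """Push one job's result branch while holding the shared lock.

    `run` is injectable so the sketch can be exercised without git.
    """
    cmd = ["git", "-C", str(job_repo), "push", remote, branch]
    with exclusive_lock(lock_path):
        return run(cmd, check=True)
```

A job would call something like push_result_branch(scratch_clone, "output", "job-0001", "/path/to/project/.push.lock") at the end of its run, where every job points at the same lock file on the shared filesystem.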
Related discussion
Questions
- Are there babs users running without a shared filesystem (e.g., HTCondor across sites)? If so, RIA removal would break their workflow.
- Is the octopus merge step tightly coupled to RIA, or does it just need branches from any git remote?