We originally included all tools and parameter files in a single Snakefile to simplify the execution of the workflows. In a second step we thought about having a single RO-Crate (and a single workflow) executed per tool, e.g. one workflow for FEniCSx, another one for Kratos, etc. However, our workflow currently has no inputs (that @M-Jafarkhani could describe semantically); we extract them using a specific function that is now user-dependent. One option might be to interpret a single workflow as a function that takes a couple of metadata/inputs. These are all defined in the benchmark definition (which we provide in the KG or in the benchmark RO-Crate, depending on where the benchmark is defined) or by the user (the tool and the version).
In the Python package, the user says: "Run benchmark XYZ (URI or id of an RO-Crate) with tool FEniCSx (SoftwareApplication URI including version etc.) in a compute environment (Docker image, conda, ...) with a script that takes a parameter.json file as input and outputs a metrics.json and a vtk.zip." In this setup, the loop over all parameter files is done by the user when creating the RO-Crate, and then a joint RO-Crate is uploaded. Here are the details of these steps for users who want to run an existing benchmark with their tool:
- **Define what the user supplies (benchmark + tool + environment)**
  In the example, the inputs are a `benchmark_rocrate_ref` (ROHub URL/UUID or local ZIP/path) and a `tool_uri` (SoftwareApplication, including metadata such as the version number). For the latter we would have to provide a link explaining how to add an entry in case it does not exist yet (either a priori or in the package itself, let's see). The user would also provide a compute environment (or can we provide that with the SoftwareApplication? @doigl)
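The user-supplied inputs of this first step could be captured in a small specification object. A minimal sketch, assuming a hypothetical `BenchmarkRun` dataclass; none of these names exist in any package yet, and all field values are illustrative placeholders:

```python
from dataclasses import dataclass

@dataclass
class ComputeEnvironment:
    kind: str       # e.g. "docker" or "conda" -- illustrative values
    reference: str  # image tag, environment file, ...

@dataclass
class BenchmarkRun:
    """Everything the user supplies in step 1 (all names are hypothetical)."""
    benchmark_rocrate_ref: str        # ROHub URL/UUID or local ZIP/path
    tool_uri: str                     # SoftwareApplication URI, incl. version metadata
    environment: ComputeEnvironment   # how to execute the tool
    script: str                       # reads parameter.json, writes metrics.json + vtk.zip

# Illustrative instantiation with placeholder values:
run = BenchmarkRun(
    benchmark_rocrate_ref="https://example.org/rocrate/placeholder",
    tool_uri="https://example.org/tools/fenicsx#0.8.0",
    environment=ComputeEnvironment("docker", "dolfinx/dolfinx:v0.8.0"),
    script="run_case.py",
)
```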
- **Obtain the benchmark contents and discover parameter files**
  If `benchmark_rocrate_ref` points to ROHub, use the ROHub client to download the benchmark crate ZIP and unzip it to a working directory; alternatively take a local ZIP file, or extract the information from the NFDI4ING KG.
  Discover the parameter files (either from the unzipped crate or from the KG)
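The ZIP path of this discovery step might look like the following sketch. It assumes the crate ZIP has already been fetched (the ROHub client call is omitted) and that parameter files sit in a `parameters/` folder inside the crate; both the function name and the layout are assumptions:

```python
import zipfile
from pathlib import Path

def discover_parameter_files(crate_zip: str, workdir: str) -> list[Path]:
    """Unzip a benchmark RO-Crate and return all candidate parameter files.

    Assumption: parameter files are JSON files under a 'parameters/' subfolder
    somewhere in the crate. The real layout comes from the benchmark definition.
    """
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(crate_zip) as zf:
        zf.extractall(work)
    # Sorted for deterministic run order across invocations.
    return sorted(work.glob("**/parameters/*.json"))
```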
- **Loop over all parameter files**
  - For each parameter file, generate a tool Snakefile per run from the benchmark-provided template.
  - Execute Snakemake once per parameter file with explicit input/output, so each run creates a provenance file with clear inputs and outputs. The parameter file is passed via `--config` as a first-class workflow input to Snakemake. Ensure each run writes into a distinct output subfolder (tool + configuration) so subsequent runs don't overwrite artifacts.
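The loop above could be driven from Python roughly like this. The `--config` keys (`parameter_file`, `output_dir`), the `results/<tool>/<configuration>` layout, and the Snakefile name are all assumptions, not a fixed interface; only the Snakemake CLI flags themselves are real:

```python
import subprocess
from pathlib import Path

def run_all(snakefile: str, parameter_files: list[Path], tool: str) -> None:
    """Invoke Snakemake once per parameter file (sketch; requires snakemake on PATH)."""
    for param_file in parameter_files:
        # Distinct output subfolder per tool + configuration, so subsequent
        # runs don't overwrite each other's artifacts.
        outdir = Path("results") / tool / param_file.stem
        outdir.mkdir(parents=True, exist_ok=True)
        # The parameter file enters the workflow as a first-class --config value.
        subprocess.run(
            [
                "snakemake",
                "--snakefile", snakefile,
                "--cores", "1",
                "--config",
                f"parameter_file={param_file}",
                f"output_dir={outdir}",
            ],
            check=True,
        )
```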
- **Create a provenance RO-Crate for the complete loop and upload it to ROHub**
  This is maybe a tricky point, because I do not know whether it is possible (including multiple runs in a single RO-Crate). And maybe this is strongly related to how we link the benchmark definition to the provenance of a single execution. @doigl @M-Jafarkhani We currently create one RO-Crate per tool/benchmark. With the approach outlined above, we get multiple provenance graphs (one per parameter file). They would have to be combined (similar to the question of @div-tyg, which we discussed today, of how to combine multiple RO-Crates in a query). IMO this is something we have to handle here (and in a similar way in the Jupyter notebook for analyzing multiple RO-Crates). In the end, we would just have a joint KG that comprises all information. @doigl @M-Jafarkhani, do you think there is an option to do so?
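On the combination question: one naive option would be to merge the `@graph` arrays of the per-run `ro-crate-metadata.json` files into a single JSON-LD document, deduplicating entities by `@id`. A sketch only; the first-wins deduplication policy is an assumption, and conflicting descriptions of the same entity (e.g. two differing root dataset entries) would need a real merge strategy:

```python
import json
from pathlib import Path

def merge_crate_metadata(metadata_files: list[Path]) -> dict:
    """Naive union of several ro-crate-metadata.json files into one document.

    Entities are keyed by @id; the first occurrence wins and later duplicates
    are skipped. Assumes all crates share a compatible @context.
    """
    merged: dict[str, dict] = {}
    context = None
    for mf in metadata_files:
        doc = json.loads(Path(mf).read_text())
        context = context or doc.get("@context")
        for entity in doc.get("@graph", []):
            merged.setdefault(entity["@id"], entity)
    return {"@context": context, "@graph": list(merged.values())}
```

This would at least yield the joint KG mentioned above; whether the result is still a valid provenance crate (one root dataset, multiple CreateAction runs) is exactly the open question.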