This project is based on the Snakemake reporter plugin. It provides a custom reporter plugin for the metadata4ing ontology, which can be used to extract and report metadata from Snakemake pipelines.
Install the plugin using pip:
python -m pip install git+https://github.com/izus-fokus/snakemake-report-plugin-metadata4ing
or from the source code:
poetry build
pip install --force-reinstall dist/snakemake_report_plugin_metadata4ing-1.0.0-py3-none-any.whl
Then, use it as the reporter in your Snakemake workflow:
snakemake --reporter metadata4ing ...
The reporter creates a ZIP file containing an RO-Crate ZIP file, which in turn holds the important files from the simulation, such as the input and output files of each rule. It also creates three files:
-- provenance.jsonld: Knowledge graph based on the Metadata4ing ontology.
-- provenance.ttl: The same graph as provenance.jsonld, in Turtle format.
-- ro-crate-metadata.json: Research Object Crate file describing the dataset.
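To inspect the result, the archive can be opened with Python's standard zipfile module. The following is a minimal sketch, not part of the plugin: the helper names are hypothetical, and because the exact folder layout inside the archive is not specified here, members are matched by file-name suffix. It does not look inside a nested ZIP file.

```python
import json
import zipfile


def list_crate_files(zip_path: str) -> list[str]:
    """Return the names of all members stored in the report archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def load_json_member(zip_path: str, suffix: str) -> dict:
    """Load the first archive member whose name ends with `suffix` as JSON.

    Matching by suffix is an assumption made because the internal
    layout of the archive is not documented in this text.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                with archive.open(name) as handle:
                    return json.load(handle)
    raise FileNotFoundError(f"no member ending with {suffix!r} in {zip_path}")
```

For instance, load_json_member(path, "ro-crate-metadata.json") would return the crate description as a dictionary, provided such a member exists at the top level of the archive passed in.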
It is possible to pass a script as a parameter extractor. You can write your own extractor in a separate Python script and pass it to the reporter using the paramscript argument:
snakemake --reporter metadata4ing --report-metadata4ing-paramscript /Path_to_Extractor/my_extractor.py ...
Note that your extractor should implement the ParameterExtractorInterface:
from abc import ABC, abstractmethod

class ParameterExtractorInterface(ABC):
    @abstractmethod
    def extract_params(self, rule_name: str, file_path: str) -> dict:
        ...
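As a rough illustration of implementing this interface, the sketch below defines a hypothetical JsonInputExtractor that reports every numeric top-level key of a JSON file as an input parameter. The interface is redefined locally only so the example runs on its own; how the plugin exposes the interface for import, and which files it passes as file_path, are not stated here and are assumptions.

```python
import json
from abc import ABC, abstractmethod


class ParameterExtractorInterface(ABC):
    # Local copy of the interface shown above so the sketch runs standalone.
    @abstractmethod
    def extract_params(self, rule_name: str, file_path: str) -> dict:
        ...


class JsonInputExtractor(ParameterExtractorInterface):
    """Hypothetical extractor: numeric top-level JSON keys become inputs."""

    def extract_params(self, rule_name: str, file_path: str) -> dict:
        result = {rule_name: {"has parameter": {}, "investigates": {}}}
        if not file_path.endswith(".json"):
            return result  # nothing to extract from non-JSON files
        with open(file_path, encoding="utf-8") as handle:
            data = json.load(handle)
        if not isinstance(data, dict):
            return result
        for name, value in data.items():
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            result[rule_name]["has parameter"][name] = {
                "value": value,
                # Assumption: "unit" is left out when no unit is known.
                "json-path": file_path,
                "data-type": type(value).__name__,
            }
        return result
```

The script passed via the paramscript argument would contain a class of this kind; the outputs ("investigates") are left empty here because which values count as results depends on the workflow.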
The extract_params method should return a dictionary where:
- Keys are the names of the corresponding processing steps (or the rule_name).
- Values are dictionaries with two keys, has parameter and investigates. These two keys represent the input and output of that processing step, respectively. Each of them is again a dictionary in which the variable name is the key and the value is another dictionary with fixed key names:
  - value: parameter value
  - unit: unit of the value (if applicable). It will be mapped to the nearest QUDT unit.
  - json-path: the path to this value in the output JSON
  - data-type: the data type of the value
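Because the expected dictionary is nested three levels deep, it can help to check its shape before running the reporter. The following is a minimal sketch of such a check; find_structure_problems is a hypothetical helper, not part of the plugin, and it assumes that both sections must be present, that value is mandatory, and that the other three keys are optional.

```python
REQUIRED_SECTIONS = ("has parameter", "investigates")
ENTRY_KEYS = {"value", "unit", "json-path", "data-type"}


def find_structure_problems(params: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape is fine."""
    problems = []
    for step, sections in params.items():
        if not isinstance(sections, dict):
            problems.append(f"{step}: expected a dictionary")
            continue
        for section in REQUIRED_SECTIONS:
            variables = sections.get(section)
            if not isinstance(variables, dict):
                problems.append(f"{step}: missing or invalid section {section!r}")
                continue
            for variable, entry in variables.items():
                where = f"{step}/{section}/{variable}"
                if not isinstance(entry, dict):
                    problems.append(f"{where}: expected a dictionary")
                    continue
                unknown = set(entry) - ENTRY_KEYS
                if unknown:
                    problems.append(f"{where}: unknown keys {sorted(unknown)}")
                if "value" not in entry:
                    # Assumption: every entry needs at least a value.
                    problems.append(f"{where}: missing 'value'")
    return problems
```

Calling the helper on the return value of extract_params during development gives a quick indication of misspelled key names such as "json_path" instead of "json-path".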
For example, a simple dictionary could look like this:
{
    "run_simulation": {
        "has parameter": {
            "length": {
                "value": 15,
                "unit": "m",
                "json-path": "/parameters.json/inputs",
                "data-type": "float"
            }
        },
        "investigates": {
            "stress": {
                "value": 1.0,
                "unit": "MPa",
                "json-path": "summary.json",
                "data-type": "float"
            }
        }
    }
}
Please note that if you provide another name (or even multiple entries) in the output, new nodes (processing steps) are added to the given rule. These new nodes are attached to the original processing step as m4i:part of. This is helpful if you have a single summary file that collects all the simulation results (input and output parameters).
For example, if the method is called with a rule_name like run_simulation and the returned dictionary looks like:
{
    "run_simulation_1": {
        "has parameter": {
            "length": {
                "value": 15,
                "unit": "m",
                "json-path": "/parameters.json/inputs",
                "data-type": "float"
            }
        },
        "investigates": {
            "stress": {
                "value": 1.0,
                "unit": "MPa",
                "json-path": "summary.json",
                "data-type": "float"
            }
        }
    },
    "run_simulation_2": {
        "has parameter": {
            "length": {
                "value": 10,
                "unit": "m",
                "json-path": "/parameters.json/inputs",
                "data-type": "float"
            }
        },
        "investigates": {
            "stress": {
                "value": 2.0,
                "unit": "MPa",
                "json-path": "summary.json",
                "data-type": "float"
            }
        }
    }
}
or like:
{
    "first_run": {
        "has parameter": {
            "length": {
                "value": 15,
                "unit": "m",
                "json-path": "/parameters.json/inputs",
                "data-type": "float"
            }
        },
        "investigates": {
            "stress": {
                "value": 1.0,
                "unit": "MPa",
                "json-path": "summary.json",
                "data-type": "float"
            }
        }
    },
    "second_run": {
        "has parameter": {
            "length": {
                "value": 10,
                "unit": "m",
                "json-path": "/parameters.json/inputs",
                "data-type": "float"
            }
        },
        "investigates": {
            "stress": {
                "value": 2.0,
                "unit": "MPa",
                "json-path": "summary.json",
                "data-type": "float"
            }
        }
    }
}
Then, with the first_run and second_run entries, the final graph contains:
local:processing_step_* a schema:Action ;
    rdfs:label "run_simulation" ;
    .....
local:processing_step_** a schema:Action ;
    rdfs:label "first_run" ;
    schema:isPartOf local:processing_step_* ;
    .....
local:processing_step_*** a schema:Action ;
    rdfs:label "second_run" ;
    schema:isPartOf local:processing_step_* ;
    .....
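The sub-step behaviour shown above could be produced from a single summary file along the following lines. This is only a sketch under stated assumptions: split_runs is a hypothetical helper, the summary is assumed to be a list of runs with a length and a stress value each, and the units and names are taken from the examples in this document.

```python
def split_runs(rule_name: str, runs: list[dict], source: str = "summary.json") -> dict:
    """Build one processing-step entry per run, keyed '<rule_name>_<index>'.

    `runs` is a hypothetical summary layout, e.g.
    [{"length": 15, "stress": 1.0}, {"length": 10, "stress": 2.0}].
    """
    result = {}
    for index, run in enumerate(runs, start=1):
        result[f"{rule_name}_{index}"] = {
            # Input of this sub-step
            "has parameter": {
                "length": {
                    "value": run["length"],
                    "unit": "m",
                    "json-path": source,
                    "data-type": "float",
                }
            },
            # Output of this sub-step
            "investigates": {
                "stress": {
                    "value": run["stress"],
                    "unit": "MPa",
                    "json-path": source,
                    "data-type": "float",
                }
            },
        }
    return result
```

An extract_params implementation could return split_runs(rule_name, runs) directly; since none of the resulting keys equals the rule name, each entry would then appear as a separate node that is part of the original processing step.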
A sample extractor is provided here.
The filename argument sets the name of the final ZIP file. If not provided, it defaults to ro-crate-metadata-{simulation_hash}.zip, where simulation_hash is a 16-character hash computed from the content of the graph.
snakemake --reporter metadata4ing --report-metadata4ing-filename MyFile ...
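To make the default naming scheme concrete, the sketch below derives a 16-character hash from serialized graph content. The hash algorithm used by the plugin is not stated here; SHA-256 truncated to 16 hexadecimal characters is purely an assumption for illustration, and default_zip_name is a hypothetical helper.

```python
import hashlib


def default_zip_name(graph_content: bytes) -> str:
    """Return a file name of the form ro-crate-metadata-{simulation_hash}.zip.

    Assumption: the hash is the first 16 hex characters of a SHA-256
    digest; the plugin may compute it differently.
    """
    simulation_hash = hashlib.sha256(graph_content).hexdigest()[:16]
    return f"ro-crate-metadata-{simulation_hash}.zip"
```

A content-derived name of this kind means that two runs producing identical graphs map to the same file name, while any change in the graph yields a different one.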