This proposition is made for the sake of better manageability and future development.
Metadata contract
- meta.json is namespaced:
  - benchmark: identity, description, contributors (scientific, immutable)
  - evaluation: primary metric, ranking semantics (affects comparability)
  - presentation: display and visualization hints (frontend only)
- Only changes to benchmark or evaluation require a benchmark version bump
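The namespacing rule above can be sketched in Python as follows. This is only an illustration: the three namespace names come from the contract, while every field name and value inside them, and the helper `requires_version_bump`, are hypothetical.

```python
import copy

# Hypothetical meta.json content. Only the three namespace names come from the
# contract above; every field name and value inside them is an assumption.
META = {
    "benchmark": {
        "name": "example-benchmark",
        "description": "Illustrative placeholder benchmark.",
        "contributors": ["A. Contributor"],
    },
    "evaluation": {"primary_metric": "f1", "higher_is_better": True},
    "presentation": {"chart": "bar", "color": "#336699"},
}

# Namespaces whose changes affect benchmark validity or comparability.
VERSIONED_NAMESPACES = ("benchmark", "evaluation")


def requires_version_bump(old_meta, new_meta):
    """Return True if any versioned namespace differs between the two metas."""
    return any(old_meta.get(ns) != new_meta.get(ns) for ns in VERSIONED_NAMESPACES)


# A frontend-only edit: should not force a new benchmark version.
presentation_edit = copy.deepcopy(META)
presentation_edit["presentation"]["chart"] = "line"

# An edit to the primary metric: affects comparability, so it should.
evaluation_edit = copy.deepcopy(META)
evaluation_edit["evaluation"]["primary_metric"] = "accuracy"

print(requires_version_bump(META, presentation_edit))  # False
print(requires_version_bump(META, evaluation_edit))    # True
```

A check like this could run in CI to flag edits to meta.json that touch a versioned namespace without a matching version bump.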
Reproducibility guarantees
- Any change to benchmark code, data, prompts, or schemas ⇒ new benchmark version
- Results reference:
  - benchmark DOI(s)
  - system version
  - model configuration
- Execution logic and visualization can evolve independently of benchmark validity
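One possible shape for the reference carried by each result is sketched below. The class name `ResultReference`, its field names, and the sample values (including the placeholder DOI) are assumptions made for illustration and are not taken from an existing schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ResultReference:
    """Hypothetical record tying a result to what produced it."""

    benchmark_dois: tuple[str, ...]  # one or more benchmark DOIs
    system_version: str              # version of the benchmark system
    model_configuration: dict = field(default_factory=dict)

    def __post_init__(self):
        # A result without these references cannot be reproduced.
        if not self.benchmark_dois:
            raise ValueError("at least one benchmark DOI is required")
        if not self.system_version:
            raise ValueError("system_version is required")

    def to_json(self):
        # Sorted keys give a stable serialization for storage and diffing.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values only; the DOI below is not a real identifier.
ref = ResultReference(
    benchmark_dois=("10.0000/example-benchmark",),
    system_version="1.2.0",
    model_configuration={"model": "example-model", "temperature": 0},
)
print(ref.to_json())
```

Keeping the record frozen reflects the idea that a published result should not be re-pointed at a different benchmark version after the fact.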
TODO:
- [ ] Discuss this proposition (@MHindermann)
- [ ] Plan implementation
- [ ] Plan publication
REPO 1: (this one) Benchmarks/Datasets
- benchmark.py
- dataclass.py
REPO 2: Benchmark system
REPO 3: Results, collected results