Update config and data in preparation of bench V2 by Thibaut-Fatus · Pull Request #13 · korabench/benchmark

Thibaut-Fatus · 2026-05-06T07:51:24Z

No description provided.

Restores the single-judge default that matches the production grading pipeline (kora-infra worker uses gpt-5.2-medium). The 3-judge ensemble remains opt-in via --judges a,b,c; aggregation logic is unchanged and handles N=1 transparently.

- Adds gpt-5.2:medium:limited slug to models.json (medium reasoning).
- run/reassess/continue commands all default to the single judge.
- README updated: option tables, prose, example output, cost calc.

generate-seeds and expand-scenarios now accept a comma-separated model list. Each task tries the chain in order; for expand-scenarios the rotation also fires on ScenarioValidationError, fixing the case where a model returns valid JSON but the validator rejects it (e.g. truncated content).
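The fallback behaviour described above can be illustrated with a minimal sketch. This is not the repository's implementation: `parseModelList`, `runWithFallback`, and the local `ScenarioValidationError` class are hypothetical names introduced here, and the sketch is synchronous for brevity, whereas real model calls would presumably be asynchronous. It assumes any thrown error, including a validation failure, moves the task on to the next model in the list.

```typescript
// Hypothetical stand-in for the validator's rejection error.
class ScenarioValidationError extends Error {}

// Split a comma-separated model list, dropping blanks and stray whitespace.
function parseModelList(raw: string): string[] {
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Try each model in order; on any thrown error, rotate to the next one.
// Throws only when every model in the chain has failed.
function runWithFallback<T>(
  models: string[],
  attempt: (model: string) => T,
): { model: string; result: T } {
  const failures: string[] = [];
  for (const model of models) {
    try {
      return { model, result: attempt(model) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${model}: ${message}`);
    }
  }
  throw new Error(`all models failed: ${failures.join("; ")}`);
}

// Usage: the first model returns content that the validator rejects,
// so the task falls through to the second model in the chain.
const chain = parseModelList(
  "gpt-5.2:high, gpt-5.5:medium, claude-sonnet-4.6:limited",
);
const outcome = runWithFallback(chain, (model) => {
  if (model === "gpt-5.2:high") {
    throw new ScenarioValidationError("truncated content");
  }
  return `scenario from ${model}`;
});
```

Treating a validation failure the same way as a transport or parsing error is what lets a structurally valid but truncated response trigger the rotation instead of failing the task outright.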

781 seeds × 26 risks (us-census-2020, total-seeds=30, seed=42), all validated. Generated using the new fallback chain across gpt-5.2:high → gpt-5.5:medium → claude-sonnet-4.6:limited.

Thibaut-Fatus added 4 commits May 6, 2026 09:50

[chore] rollback 8 turns to 3 except for 2.3, 3.3, 3.4

1785683

[chore] regenerate seeds + scenarios with new taxonomy

c285c5c


Thibaut-Fatus changed the title ~~[chore] rollback 8 turns to 3 except for 2.3, 3.3, 3.4~~ Update config and data in preparation of bench V2 May 6, 2026

[chore] fix prettier formatting

c6a8749

Thibaut-Fatus merged commit 8f87ff1 into main May 6, 2026
4 checks passed

Thibaut-Fatus deleted the feat/config-v2 branch May 6, 2026 13:05

Thibaut-Fatus merged 5 commits into main from feat/config-v2
