
Update config and data in preparation of bench V2 #13

Merged
Thibaut-Fatus merged 5 commits into main from feat/config-v2 on May 6, 2026

Conversation

@Thibaut-Fatus
Collaborator

No description provided.

Restores the single-judge default that matches the production grading
pipeline (kora-infra worker uses gpt-5.2-medium). The 3-judge ensemble
remains opt-in via --judges a,b,c; aggregation logic is unchanged and
handles N=1 transparently.
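
To illustrate why one aggregation path can serve both the single-judge default and the opt-in ensemble, here is a minimal sketch. The function names, the default slug, and the majority-vote rule are assumptions for illustration; the commit message does not say how verdicts are actually aggregated.

```python
from collections import Counter

# Assumption: the default judge is the slug added to models.json in this commit.
DEFAULT_JUDGES = ["gpt-5.2:medium:limited"]


def parse_judges(value=None):
    """Split a comma-separated --judges value; fall back to the single default."""
    if not value:
        return list(DEFAULT_JUDGES)
    return [slug.strip() for slug in value.split(",") if slug.strip()]


def aggregate(verdicts):
    """Hypothetical majority vote over per-judge verdicts.

    With a single verdict the vote trivially returns it, so N=1 needs no
    special case. Ties go to the verdict that appeared first.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    winner, _count = Counter(verdicts).most_common(1)[0]
    return winner
```

With one judge, `aggregate(["pass"])` simply returns that judge's verdict, which is the sense in which N=1 is handled transparently in this sketch.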

- Adds gpt-5.2:medium:limited slug to models.json (medium reasoning).
- run/reassess/continue commands all default to the single judge.
- README updated: option tables, prose, example output, cost calc.

generate-seeds and expand-scenarios now accept a comma-separated
model list. Each task tries the chain in order; for expand-scenarios
the rotation also fires on ScenarioValidationError, fixing the case
where a model returns valid JSON but the validator rejects it (e.g.
truncated content).

781 seeds × 26 risks (us-census-2020, total-seeds=30, seed=42),
all validated. Generated using the new fallback chain across
gpt-5.2:high → gpt-5.5:medium → claude-sonnet-4.6:limited.
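
The fallback chain described in the commit messages above could look roughly like the sketch below. Only the name ScenarioValidationError and the model slugs come from the source; `parse_model_chain`, `run_with_fallback`, and the fake call and validator are hypothetical stand-ins, not the repository's actual code.

```python
class ScenarioValidationError(Exception):
    """Stand-in for the validator error named in the commit message."""


def parse_model_chain(value):
    """Split a comma-separated model list into an ordered chain."""
    return [slug.strip() for slug in value.split(",") if slug.strip()]


def run_with_fallback(models, call_model, validate):
    """Try each model in order and return (model, result) for the first success.

    Rotation fires both when the call itself raises and when the validator
    rejects an otherwise well-formed result (ScenarioValidationError).
    """
    errors = []
    for model in models:
        try:
            result = call_model(model)
            validate(result)  # may raise ScenarioValidationError
            return model, result
        except Exception as exc:  # includes ScenarioValidationError
            errors.append((model, exc))
    raise RuntimeError(f"all models in the chain failed: {errors}")


# Illustrative fakes: the first model returns valid but truncated content.
def fake_call(model):
    return {"turns": []} if model == "gpt-5.2:high" else {"turns": ["hello"]}


def fake_validate(scenario):
    if not scenario["turns"]:
        raise ScenarioValidationError("scenario has no turns (truncated?)")


chain = parse_model_chain("gpt-5.2:high,gpt-5.5:medium,claude-sonnet-4.6:limited")
used_model, scenario = run_with_fallback(chain, fake_call, fake_validate)
```

In this run the first model's output parses but fails validation, so the chain rotates and `used_model` ends up as the second entry.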
@Thibaut-Fatus changed the title from "[chore] rollback 8 turns to 3 except for 2.3, 3.3, 3.4" to "Update config and data in preparation of bench V2" on May 6, 2026
@Thibaut-Fatus merged commit 8f87ff1 into main on May 6, 2026
4 checks passed
@Thibaut-Fatus deleted the feat/config-v2 branch on May 6, 2026 13:05