Full jurisdiction system: configurable judge LLM, three policies, trajectory-aware verdicts.
After #2092 (vocabulary, v0.8.45) and #2093 (verifier-preview wiring, v0.8.46), this issue lands the full Codex-style LLM-as-judge with codewhale-native jurisdictions. Absorbs and supersedes the original "port Codex goal system" framing.
Concept
A jurisdiction is the policy a judge applies to decide whether a quarry is statutorily hunted. Three built-in jurisdictions:
| Jurisdiction | What counts as hunted |
| --- | --- |
| strict | Diff exists, tests added or updated, CI green (or local verifier equivalent). |
| evidentiary | Diff cites files; agent shows changes; no contradiction with quarry. |
| permissive | Agent declares done; judge sanity-checks. |
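As a rough sketch of how the three names in the table might be modelled, the enum below parses the same spellings that appear in config and in the /hunt jurisdiction command. The type name `Jurisdiction`, the `as_str` helper, and the error type are assumptions for illustration, not the codewhale implementation.

```rust
use std::str::FromStr;

/// The three built-in jurisdictions. The enum itself is hypothetical;
/// only the three names come from the issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Jurisdiction {
    Strict,
    Evidentiary,
    Permissive,
}

impl Jurisdiction {
    /// Spelling used in config and on the command line.
    pub fn as_str(self) -> &'static str {
        match self {
            Jurisdiction::Strict => "strict",
            Jurisdiction::Evidentiary => "evidentiary",
            Jurisdiction::Permissive => "permissive",
        }
    }
}

impl FromStr for Jurisdiction {
    type Err = String;

    /// Accepts exactly the three built-in names; anything else is an error.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            "strict" => Ok(Jurisdiction::Strict),
            "evidentiary" => Ok(Jurisdiction::Evidentiary),
            "permissive" => Ok(Jurisdiction::Permissive),
            other => Err(format!("unknown jurisdiction: {}", other)),
        }
    }
}
```

Rejecting unknown names at parse time keeps a typo in config from silently falling back to a laxer policy.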
Each turn ends with a judge call. Judge sees: quarry, current trajectory summary, this turn's evidence, full diff against origin/main (or session start). Judge returns:
    struct JudgeRuling {
        verdict: HuntVerdict,
        reasoning: String,
        next_step: NextStep, // continue | handoff | abandon | declare_hunted
    }
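The struct references two types it does not define. A minimal sketch of what they could look like follows, assuming `HuntVerdict` carries the hunted / wounded / escaped vocabulary from #2092 and `NextStep` has one variant per action named in the comment. The variant names and the `parse` helper are hypothetical.

```rust
/// Verdict vocabulary (hunted / wounded / escaped). Variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HuntVerdict {
    Hunted,
    Wounded,
    Escaped,
}

/// One variant per action listed in the struct comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Continue,
    Handoff,
    Abandon,
    DeclareHunted,
}

impl NextStep {
    /// Hypothetical helper: map the judge's textual answer to a variant.
    /// Returns None for anything outside the four known actions.
    pub fn parse(s: &str) -> Option<NextStep> {
        match s.trim() {
            "continue" => Some(NextStep::Continue),
            "handoff" => Some(NextStep::Handoff),
            "abandon" => Some(NextStep::Abandon),
            "declare_hunted" => Some(NextStep::DeclareHunted),
            _ => None,
        }
    }
}

/// Same fields as the struct above, made public for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct JudgeRuling {
    pub verdict: HuntVerdict,
    pub reasoning: String,
    pub next_step: NextStep,
}
```

Returning `None` for an unrecognized action leaves the caller to decide how to treat a malformed judge reply instead of guessing a default.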
In scope (v0.9.0)
- config.toml [hunt] section: jurisdiction = "evidentiary" (default), judge_model = "auto" (defaults to the session model with a judge-only system prompt), judge_max_tokens = 4096, judge_temperature = 0.0.
- Judge runs at each turn boundary in hunt mode. Verdict logged inline in transcript between turns.
- next_step actions:
  - continue: model gets a system-message hint of judge reasoning, continues.
  - handoff: runtime suggests a sub-agent and lists the species + brief.
  - abandon: session marked escaped, no trophy.
  - declare_hunted: trophy written, session may auto-close per config.
- /hunt jurisdiction <strict|evidentiary|permissive> switches mid-hunt.
- /why hunted / /why wounded / /why escaped shows the judge's last reasoning.
- Judge prompt template under crates/tui/src/prompts/judge.txt: small, focused, source-controlled.
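The [hunt] defaults listed above could be captured in a plain struct with a `Default` impl. This is only a sketch: the struct name `HuntConfig`, the field types, and the `resolved_judge_model` helper are assumptions; the four keys and their default values are taken from the list.

```rust
/// Hypothetical in-memory form of the [hunt] section of config.toml.
#[derive(Debug, Clone, PartialEq)]
pub struct HuntConfig {
    pub jurisdiction: String,
    pub judge_model: String,
    pub judge_max_tokens: u32,
    pub judge_temperature: f32,
}

impl Default for HuntConfig {
    /// Defaults as stated in the issue.
    fn default() -> Self {
        HuntConfig {
            jurisdiction: "evidentiary".to_string(),
            judge_model: "auto".to_string(),
            judge_max_tokens: 4096,
            judge_temperature: 0.0,
        }
    }
}

impl HuntConfig {
    /// "auto" means: reuse the session model (with a judge-only system prompt).
    /// Any other value is treated as an explicit model name.
    pub fn resolved_judge_model<'a>(&'a self, session_model: &'a str) -> &'a str {
        if self.judge_model == "auto" {
            session_model
        } else {
            &self.judge_model
        }
    }
}
```

With the defaults, `HuntConfig::default().resolved_judge_model("session-model")` yields the session model; setting `judge_model` to any other string selects that model instead.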
Out of scope
- Trained verifier models (out of scope forever for v0.9.0; remains LLM-as-judge).
- Per-statute custom jurisdictions in config (RFC; lands later if asked for).
- Judge-as-sub-agent species (the judge isn't a whale — it's the court).
Acceptance
- Three jurisdictions selectable and observably different in behavior on the same test quarry.
- Judge prompt is auditable and source-controlled.
- Verdict + reasoning render inline in transcript.
- /why <verdict> returns the most recent judge reasoning.
- Trophy card includes the jurisdiction the hunt was decided under.
- Eval harness has at least one regression test per jurisdiction asserting the verdict shape.
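What "verdict shape" means is not spelled out here. One possible reading, sketched below, is a structural check the eval harness could run on every ruling regardless of jurisdiction. The three rules (non-empty reasoning, declare_hunted only with a hunted verdict, abandon only with an escaped verdict) are assumptions for illustration, as are all type and function names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HuntVerdict {
    Hunted,
    Wounded,
    Escaped,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Continue,
    Handoff,
    Abandon,
    DeclareHunted,
}

#[derive(Debug, Clone)]
pub struct JudgeRuling {
    pub verdict: HuntVerdict,
    pub reasoning: String,
    pub next_step: NextStep,
}

/// Hypothetical shape check: returns Err with a description of the first
/// rule the ruling violates. The rules themselves are assumed, not specified.
pub fn check_ruling_shape(r: &JudgeRuling) -> Result<(), String> {
    if r.reasoning.trim().is_empty() {
        return Err("reasoning must not be empty".to_string());
    }
    if r.next_step == NextStep::DeclareHunted && r.verdict != HuntVerdict::Hunted {
        return Err("declare_hunted requires a hunted verdict".to_string());
    }
    if r.next_step == NextStep::Abandon && r.verdict != HuntVerdict::Escaped {
        return Err("abandon requires an escaped verdict".to_string());
    }
    Ok(())
}
```

A per-jurisdiction regression test would then feed a fixed quarry through the judge and assert `check_ruling_shape(&ruling).is_ok()` alongside the expected verdict.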
Closes / partially closes
- #2092 (vocabulary) and #2093 (verifier wiring).
Notes
- The judge isn't a sub-agent species. Whales hunt; the judge is the court. Treat as a distinct primitive.
- LLM-as-judge has known failure modes (sycophancy toward the agent, brittleness on edge cases). The three-jurisdiction split is partial mitigation: strict is mechanical (CI gates aren't LLM-decided), permissive is honest about its laxness, evidentiary is where the judging actually happens.
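To illustrate why strict resists judge sycophancy, its three conditions can be evaluated from plain booleans with no model call involved. The sketch below assumes the runtime already knows whether a diff exists, whether tests were touched, and whether CI (or the local verifier) passed; `StrictEvidence` and `unmet_strict_conditions` are hypothetical names.

```rust
/// Hypothetical mechanical inputs for the strict jurisdiction.
pub struct StrictEvidence {
    pub diff_exists: bool,
    pub tests_touched: bool,
    pub ci_green: bool,
}

/// Returns the strict conditions that are not yet satisfied.
/// An empty list means the mechanical gate passes.
pub fn unmet_strict_conditions(e: &StrictEvidence) -> Vec<&'static str> {
    let mut unmet = Vec::new();
    if !e.diff_exists {
        unmet.push("diff exists");
    }
    if !e.tests_touched {
        unmet.push("tests added or updated");
    }
    if !e.ci_green {
        unmet.push("CI green (or local verifier equivalent)");
    }
    unmet
}
```

Returning the unmet conditions rather than a bare boolean gives the transcript something concrete to show when a strict hunt is not yet hunted.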
Replaces #2058. Final piece of the hunt trilogy: #2092 (vocabulary, v0.8.45) → #2093 (verifier wiring, v0.8.46) → this (v0.9.0).