feat(eval): extended thinking on FaithfulnessJudge (v0.1.15) #15
Opt-in extended thinking on the faithfulness judge. When
use_thinking=True, the judge sends Anthropic's `thinking={"type":
"enabled", "budget_tokens": 32768}` block and swaps tool_choice
from forced to "auto" — required by Anthropic when combining
thinking with tools on Claude 4 models.
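As a sketch, the two request shapes differ only in the `thinking` and
`tool_choice` keys. The tool name, model id, and schema below are
illustrative placeholders, not taken from this repo:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "name": "report_faithfulness",  # placeholder tool name
    "description": "Report the faithfulness verdict as structured JSON.",
    "input_schema": {"type": "object", "properties": {"score": {"type": "number"}}},
}]
prompt = "Judge the answer against the retrieved context ..."

# Default path: the scoring tool is forced and no thinking block is sent.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=2048,
    tools=TOOLS,
    tool_choice={"type": "tool", "name": "report_faithfulness"},
    messages=[{"role": "user", "content": prompt}],
)

# use_thinking=True path: thinking block added, tool_choice relaxed to "auto"
# because Claude 4 rejects a forced tool_choice while thinking is enabled.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048 + 32768,  # must exceed budget_tokens; see the follow-up commit
    tools=TOOLS,
    tool_choice={"type": "auto"},
    thinking={"type": "enabled", "budget_tokens": 32768},
    messages=[{"role": "user", "content": prompt}],
)
```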
Parser now handles both response shapes (sketched below):
- tool_use block → schema-guaranteed happy path (unchanged from
pre-thinking behavior).
- text block JSON → fallback for the rare case where the model
declines the tool. Honors ```json fences via _strip_code_fences.
- thinking blocks → always skipped.
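Condensed sketch of that dispatch. Block access follows the Anthropic
SDK's content-block shape; `_strip_code_fences` is reimplemented here
for self-containment and may differ from the real helper:

```python
import json
import re


def _strip_code_fences(text: str) -> str:
    """Drop a surrounding ```json ... ``` (or bare ```) fence, if any."""
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()


def _extract_judge_payload(response) -> dict:
    """Return the judge's verdict dict from either response shape."""
    for block in response.content:
        if block.type == "thinking":  # thinking blocks are always skipped
            continue
        if block.type == "tool_use":  # happy path: schema-guaranteed input
            return block.input
    for block in response.content:
        if block.type == "text":      # fallback: the model declined the tool
            return json.loads(_strip_code_fences(block.text))
    raise ValueError("no tool_use or text block in judge response")
```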
API:
- `FaithfulnessJudge.score(use_thinking=False,
thinking_budget_tokens=32768)` — keyword-only.
- `FaithfulnessResult.thinking_used` — new boolean field.
- `attune-rag-benchmark --thinking` plus env-var defaults
(`ATTUNE_RAG_FAITHFULNESS_THINKING`,
`ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET`).
Default behavior (no kwarg / no flag) is byte-identical to the
pre-change call shape — covered by
test_score_without_thinking_omits_thinking_block.
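Call-site sketch; the constructor and the positional part of the
signature are assumed for illustration:

```python
judge = FaithfulnessJudge()  # constructor details assumed

query, answer, contexts = "q", "generated answer", ["retrieved chunk"]

# Default call: request shape unchanged from before this PR.
result = judge.score(query, answer, contexts)

# Opt in; both new kwargs are keyword-only, budget defaults to 32768.
result = judge.score(query, answer, contexts, use_thinking=True)
assert result.thinking_used is True
```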
`anthropic` SDK floor bumped to >=0.95 for stable thinking+tools
support on Claude 4.
Tests:
- 13 new tests on the parser fallback paths and thinking-mode
request/result wiring.
- 15 new tests on the benchmark CLI env/flag parsing and
_score_faithfulness kwarg forwarding.
- All 378 existing tests still pass.
Spec: specs/faithfulness-judge-thinking/ (workspace). Calibration
benchmark run (task 9) deferred — it runs in a follow-up against
~/.attune/anthropic.env; output will be attached to the PR body.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live calibration surfaced an Anthropic API constraint not in
the public docs prose: `max_tokens` must be strictly greater
than `thinking.budget_tokens`, because in thinking mode
max_tokens caps the COMBINED thinking + response output.
Default `max_tokens=2048` + `thinking_budget_tokens=32768`
crashed with:
```
BadRequestError: 400 - `max_tokens` must be greater than `thinking.budget_tokens`
```
Fix: when use_thinking=True, send
`max_tokens + thinking_budget_tokens` to the API while the
caller's `max_tokens` keeps its original semantic meaning
(budget for the judge's REPLY, not counting thinking).
Caller code requires no changes.
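Roughly, with a hypothetical helper name:

```python
def _effective_max_tokens(max_tokens: int, use_thinking: bool,
                          thinking_budget_tokens: int) -> int:
    """The API-side max_tokens caps thinking + reply combined, so reserve
    the caller's max_tokens for the reply on top of the thinking budget."""
    if use_thinking:
        return max_tokens + thinking_budget_tokens  # 2048 + 32768 = 34816
    return max_tokens
```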
Two new tests:
- test_score_with_thinking_bumps_max_tokens_above_budget
- test_score_without_thinking_passes_max_tokens_unchanged
This is exactly the kind of failure that unit tests with stub clients
cannot catch — the constraint lives in the Anthropic API, not in the
SDK. Task 9 (live calibration) caught it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Task 9 — Live calibration benchmark (resolved)

Ran the live calibration. It hit the constraint described above, which
stub tests could not reproduce: the default config crashed because in
thinking mode `max_tokens` caps the combined thinking + response
output. The fix landed in the second commit. This is exactly the kind
of failure stub-client unit tests can't catch; the Phase-9 live gate
caught it.

Calibration results (post-fix): [results table omitted]
(Pipeline latency = generator + judge; judge alone would be the cleaner
comparison but requires per-block timing. Mean latency is comparable,
suggesting thinking adds only modest judge latency in absolute terms.)

Interpretation: the thinking-on judge is stricter — it flags 10 more
percentage points of queries as having at least one unsupported claim,
and mean faithfulness drops 1.1 points as a consequence. Whether that
is "better" depends on the nature of the extra flagged claims.
This data does NOT automatically support flipping the default.

Suggested follow-up (not blocking this PR): [follow-up items omitted]
What this PR delivers: unchanged from the original summary (see the
Summary section below).
Ready for review and merge.
Summary
Opt-in extended thinking on the faithfulness judge. Sharper judge →
better signal for the eventual `use_native_citations` default-flip
decision in the rag-native-citations track.

Spec: specs/faithfulness-judge-thinking/ (workspace umbrella).

Key design choices
- Thinking + forced `tool_choice` is incompatible on Claude 4 — when thinking is enabled, the call swaps to `tool_choice: "auto"`. The parser was refactored to handle both response shapes:
  - `tool_use` block → schema-guaranteed happy path (unchanged from pre-thinking behavior).
  - `text` block JSON → fallback for the rare case where the model declines the tool. Honors ```json fences.
  - `thinking` blocks → always skipped.
- `budget_tokens = 32768` (default). It's a ceiling, not a target — Anthropic bills only for thinking tokens actually emitted, so a generous default is free insurance.
- Defaults (`use_thinking=False`) produce a request that is byte-identical to today, asserted by `test_score_without_thinking_omits_thinking_block`.

Public surface
- `FaithfulnessJudge.score(..., use_thinking=False, thinking_budget_tokens=32768)` — both new kwargs keyword-only.
- `FaithfulnessResult.thinking_used: bool = False` is the new field — surfaced in `to_dict()` and the benchmark's per-query output.
- CLI: `attune-rag-benchmark --thinking`, plus env-var defaults `ATTUNE_RAG_FAITHFULNESS_THINKING` and `ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET` (resolution sketched below).
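A plausible sketch of the flag/env resolution; the argparse wiring and
helper name are assumptions, not taken from the source:

```python
import argparse
import os


def _env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


parser = argparse.ArgumentParser(prog="attune-rag-benchmark")
parser.add_argument("--thinking", action="store_true")
args = parser.parse_args()

# The explicit flag wins; otherwise fall back to the env-var default.
use_thinking = args.thinking or _env_flag("ATTUNE_RAG_FAITHFULNESS_THINKING")
budget = int(os.environ.get("ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET", "32768"))
```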
SDK pin
`anthropic` floor bumped: `>=0.40.0,<1.0` → `>=0.95,<1.0` across the `[claude]`, `[all]`, and `[dev]` extras. Required for stable thinking + tool-use support on Claude 4.

Test plan
- 13 new tests on the parser fallback paths and thinking-mode request/result wiring (one sketched below).
- 15 new tests on the benchmark CLI env/flag parsing and `_score_faithfulness` kwarg forwarding.
- All 378 existing tests still pass (`pytest`). `ruff check .` clean.
- Calibration: `attune-rag-benchmark --with-faithfulness --thinking` against the golden query set, attaching side-by-side thinking-on / thinking-off output. Will run separately and add the output to this PR before merge.
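The byte-identical-default assertion, sketched with a stub client. The
injectable `client` kwarg, the canned response shape, and the payload
fields are assumptions for illustration:

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def test_score_without_thinking_omits_thinking_block():
    stub = MagicMock()
    stub.messages.create.return_value = SimpleNamespace(
        content=[SimpleNamespace(type="tool_use", input={"score": 1.0})]
    )
    judge = FaithfulnessJudge(client=stub)  # assumes an injectable client

    judge.score("query", "answer", ["context"])  # defaults: use_thinking=False

    kwargs = stub.messages.create.call_args.kwargs
    assert "thinking" not in kwargs                  # no thinking block by default
    assert kwargs["tool_choice"]["type"] == "tool"   # tool still forced
```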
Notes for the reviewer

- Parser helper renamed: `_extract_tool_input` → `_extract_judge_payload`. One existing test (`test_missing_tool_use_block_raises`) had its regex updated to match the new error message ("no tool_use or text block").
- Flag/env precedence is covered by `test_thinking_flag_present_with_env_off`.

🤖 Generated with Claude Code