feat(eval): extended thinking on FaithfulnessJudge (v0.1.15) #15

Open
silversurfer562 wants to merge 2 commits into main from feat/faithfulness-thinking

Conversation

@silversurfer562

Summary

Opt-in extended thinking on the faithfulness judge. Sharper judge → better signal for the eventual use_native_citations default-flip decision in the rag-native-citations track.

Spec: specs/faithfulness-judge-thinking/ (workspace umbrella).

Key design choices

  • Thinking is incompatible with forced tool_choice on Claude 4 — when thinking is enabled, the call swaps to tool_choice: "auto". The parser was refactored to handle both response shapes (a sketch follows this list):
    • tool_use block → schema-guaranteed happy path (unchanged from pre-thinking behavior).
    • text block JSON → fallback for the rare case where the model declines the tool. Honors ```json fences.
    • thinking blocks → always skipped.
  • budget_tokens = 32768 (default). It's a ceiling, not a target — Anthropic bills only for tokens emitted, so a generous default is free insurance.
  • Back-compat is non-negotiable. Default kwargs (use_thinking=False) produce a request that is byte-identical to today, asserted by test_score_without_thinking_omits_thinking_block.
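
The refactored parser reduces to roughly the following shape — `_extract_judge_payload` and `_strip_code_fences` are the names used in this PR, but the bodies below are a sketch, not the actual implementation:

```python
import json
import re


def _strip_code_fences(text: str) -> str:
    # Drop a surrounding ``` or ```json fence, if one is present.
    match = re.match(r"^```(?:json)?\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text


def _extract_judge_payload(response) -> dict:
    text_parts: list[str] = []
    for block in response.content:
        if block.type == "thinking":
            continue  # thinking blocks never carry the verdict
        if block.type == "tool_use":
            return block.input  # happy path: schema-validated tool payload
        if block.type == "text":
            text_parts.append(block.text)
    if text_parts:
        # Fallback: the model declined the tool and answered in prose/JSON.
        return json.loads(_strip_code_fences("".join(text_parts)))
    raise ValueError("no tool_use or text block in response")
```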

Public surface

async def score(
    self,
    query: str,
    answer: str,
    passages: str | list[str],
    max_tokens: int = 2048,
    *,
    use_thinking: bool = False,
    thinking_budget_tokens: int = 32768,
) -> FaithfulnessResult: ...
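
For orientation, the request assembly inside score plausibly looks like the sketch below. The thinking block and the tool_choice swap are from this PR; the helper name `_request_kwargs` and the tool name `record_faithfulness` are hypothetical:

```python
def _request_kwargs(*, max_tokens: int, use_thinking: bool,
                    thinking_budget_tokens: int) -> dict:
    if not use_thinking:
        # Default path: byte-identical to the pre-change request shape.
        return {
            "max_tokens": max_tokens,
            "tool_choice": {"type": "tool", "name": "record_faithfulness"},
        }
    return {
        # In thinking mode, max_tokens caps thinking + reply combined.
        "max_tokens": max_tokens + thinking_budget_tokens,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget_tokens},
        "tool_choice": {"type": "auto"},  # forced tool_choice is unsupported with thinking
    }
```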

FaithfulnessResult.thinking_used: bool = False is the new field — surfaced in to_dict() and the benchmark's per-query output.
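
A hypothetical call site (the judge instance and inputs are invented; only the keyword surface comes from the signature above):

```python
# Inside an async context, with `judge` a constructed FaithfulnessJudge:
result = await judge.score(
    query="What year was the treaty signed?",
    answer="It was signed in 1848.",
    passages=["Article 1: The treaty was signed in 1848 ..."],
    use_thinking=True,             # opt-in; default False keeps the old request shape
    thinking_budget_tokens=32768,  # a ceiling, not a target
)
assert result.thinking_used        # new field, also surfaced in result.to_dict()
```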

CLI:

attune-rag-benchmark --with-faithfulness --thinking
attune-rag-benchmark --with-faithfulness --thinking --thinking-budget 65536

# env-var defaults:
ATTUNE_RAG_FAITHFULNESS_THINKING=1
ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET=65536

SDK pin

anthropic floor bumped: >=0.40.0,<1.0 → >=0.95,<1.0 across [claude], [all], and [dev]. Required for stable thinking + tool-use support on Claude 4.

Test plan

  • Unit tests: 13 new on parser fallback paths and thinking-mode wiring; 15 new on benchmark CLI env/flag parsing and _score_faithfulness kwarg forwarding.
  • All 378 existing tests still pass (pytest).
  • ruff check . clean.
  • Calibration benchmark (deferred): live attune-rag-benchmark --with-faithfulness --thinking against the golden query set, attaching side-by-side thinking-on / thinking-off output. Will run separately and add the output to this PR before merge.
  • Tag + GitHub release + PyPI trusted-publish post-merge.

Notes for the reviewer

  • The parser refactor renames _extract_tool_input → _extract_judge_payload. One existing test (test_missing_tool_use_block_raises) had its regex updated to match the new error message ("no tool_use or text block").
  • The benchmark CLI flag takes precedence over an env-var-off default (store_true semantics) — covered by test_thinking_flag_present_with_env_off; a sketch of one plausible wiring follows.
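
The flag, env-var names, and store_true behavior below are from this PR; the helper and the exact parsing are assumptions:

```python
import argparse
import os


def _env_truthy(name: str) -> bool:
    return os.environ.get(name, "").strip().lower() in {"1", "true", "yes"}


parser = argparse.ArgumentParser()
# store_true: passing --thinking always yields True; when the flag is absent,
# the env var supplies the default — so flag-present beats env-off.
parser.add_argument(
    "--thinking",
    action="store_true",
    default=_env_truthy("ATTUNE_RAG_FAITHFULNESS_THINKING"),
)
parser.add_argument(
    "--thinking-budget",
    type=int,
    default=int(os.environ.get("ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET", "32768")),
)
args = parser.parse_args()
```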

🤖 Generated with Claude Code

silversurfer562 and others added 2 commits May 13, 2026 12:31
Opt-in extended thinking on the faithfulness judge. When
use_thinking=True, the judge sends Anthropic's `thinking={"type":
"enabled", "budget_tokens": 32768}` block and swaps tool_choice
from forced to "auto" — required by Anthropic when combining
thinking with tools on Claude 4 models.

Parser now handles both response shapes:
- tool_use block → schema-guaranteed happy path (unchanged from
  pre-thinking behavior).
- text block JSON → fallback for the rare case where the model
  declines the tool. Honors ```json fences via _strip_code_fences.
- thinking blocks → always skipped.

API:
- `FaithfulnessJudge.score(use_thinking=False,
   thinking_budget_tokens=32768)` — keyword-only.
- `FaithfulnessResult.thinking_used` — new boolean field.
- `attune-rag-benchmark --thinking` plus env-var defaults
  (`ATTUNE_RAG_FAITHFULNESS_THINKING`,
  `ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET`).

Default behavior (no kwarg / no flag) is byte-identical to the
pre-change call shape — covered by test_score_without_thinking_
omits_thinking_block.

`anthropic` SDK floor bumped to >=0.95 for stable thinking+tools
support on Claude 4.

Tests:
- 13 new tests on the parser fallback paths and thinking-mode
  request/result wiring.
- 15 new tests on the benchmark CLI env/flag parsing and
  _score_faithfulness kwarg forwarding.
- All 378 existing tests still pass.

Spec: workspace specs/faithfulness-judge-thinking/. Calibration
benchmark run (task 9) deferred — runs in a follow-up against
~/.attune/anthropic.env; output will be attached to the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live calibration surfaced an Anthropic API constraint not in
the public docs prose: `max_tokens` must be strictly greater
than `thinking.budget_tokens`, because in thinking mode
max_tokens caps the COMBINED thinking + response output.

Default `max_tokens=2048` + `thinking_budget_tokens=32768`
crashed with:
    BadRequestError: 400 - `max_tokens` must be greater
    than `thinking.budget_tokens`

Fix: when use_thinking=True, send
`max_tokens + thinking_budget_tokens` to the API while the
caller's `max_tokens` keeps its original semantic meaning
(budget for the judge's REPLY, not counting thinking).
Caller code requires no changes.

Two new tests:
- test_score_with_thinking_bumps_max_tokens_above_budget
- test_score_without_thinking_passes_max_tokens_unchanged

This is exactly the failure unit tests with stub clients
cannot catch — the constraint lives in the Anthropic API,
not in the SDK. Task 9 (live calibration) caught it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@silversurfer562

Task 9 — Live calibration benchmark (resolved)

Ran attune-rag-benchmark --with-faithfulness on the full 40-query golden set in both modes (thinking on and off) against Sonnet 4.6.

Constraint discovered live (not in stub tests)

Default config crashed:

anthropic.BadRequestError: 400 — `max_tokens` must be greater than `thinking.budget_tokens`

In thinking mode, max_tokens caps the combined thinking + response output. With max_tokens=2048 and thinking_budget_tokens=32768, the API rejects the request.

Fix (commit ad96f7c on this branch): when use_thinking=True, send max_tokens + thinking_budget_tokens to the API while the caller's max_tokens keeps its original semantic meaning (budget for the judge's reply, not counting thinking). Two new tests cover both the bump and the back-compat path.
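
In code terms the fix reduces to something like this sketch (the real change is in ad96f7c):

```python
def _effective_max_tokens(max_tokens: int, use_thinking: bool,
                          thinking_budget_tokens: int) -> int:
    # In thinking mode the API's max_tokens caps thinking + reply combined,
    # so the request gets max_tokens + budget; the caller's max_tokens keeps
    # meaning "reply budget only".
    return max_tokens + thinking_budget_tokens if use_thinking else max_tokens


assert _effective_max_tokens(2048, True, 32768) == 34816   # > budget, API accepts
assert _effective_max_tokens(2048, False, 32768) == 2048   # back-compat unchanged
```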

This is exactly the kind of failure stub-client unit tests can't catch. The Task-9 live gate caught it.

Calibration results (post-fix)

| Metric | Thinking OFF | Thinking ON | Δ |
| --- | --- | --- | --- |
| Mean faithfulness | 0.983 | 0.972 | −1.1% |
| Hallucination rate | 17.5% | 27.5% | +10 pts |
| Refusal rate | 0.0% | 0.0% | 0 |
| Mean latency (pipeline) | 7787 ms | 7585 ms | −202 ms |
| p95 latency (pipeline) | 12177 ms | 11420 ms | −757 ms |

(Pipeline latency = generator + judge; judge-only timing would be the cleaner comparison but requires per-block instrumentation. The pipeline means are comparable — thinking-on is in fact slightly lower — suggesting whatever latency thinking adds to the judge is small relative to run-to-run noise.)

Interpretation

The thinking-on judge is stricter — it flags 10 percentage points more of the queries as having at least one unsupported claim. Mean faithfulness drops 1.1 points as a consequence.

Whether that's "better" depends on whether the extra flagged claims are:

  • (a) real hallucinations the OFF judge missed → thinking-on is a quality win
  • (b) supportable claims the ON judge is over-rejecting → thinking-on is overzealous

This data does NOT automatically support flipping use_native_citations to default-on in the rag-native-citations spec (that's the downstream decision this calibration was meant to inform). Per-query inspection is needed before that decision.

Suggested follow-up (not blocking this PR)

  • Spec a per-query diff script: compare the thinking-on vs thinking-off unsupported_claims lists per query and classify each extra flag as a real hallucination or an overzealous rejection — roughly the shape sketched after this list.
  • That's a separate spec — not in scope for this one.
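
To make the suggestion concrete, a hypothetical shape for that script — field names are guesses from the per-query benchmark output, and the follow-up spec would pin them down:

```python
import json


def _load(path: str) -> dict[str, set[str]]:
    # Map query -> set of flagged claims; record fields are assumptions.
    with open(path) as f:
        return {r["query"]: set(r.get("unsupported_claims", []))
                for r in json.load(f)}


def extra_flags(off_path: str, on_path: str) -> list[dict]:
    # Per query: which claims did only the thinking-on judge flag?
    off, on = _load(off_path), _load(on_path)
    return [{"query": q, "only_thinking_on": sorted(on[q] - off.get(q, set()))}
            for q in on if on[q] - off.get(q, set())]
```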

What this PR delivers (unchanged from the original summary)

  • Opt-in use_thinking=True on FaithfulnessJudge.score
  • New FaithfulnessResult.thinking_used field
  • attune-rag-benchmark --thinking flag + env defaults
  • 29 unit tests (17 new) + 15 benchmark-CLI tests; all 380 attune-rag tests passing post-fix.
  • anthropic SDK pin bumped to >=0.95,<1.0.

Ready for review and merge.
