feat(eval): extended thinking on FaithfulnessJudge (v0.1.15) #15

Open
silversurfer562 wants to merge 2 commits into main from feat/faithfulness-thinking

Conversation

@silversurfer562

Summary

Opt-in extended thinking on the faithfulness judge. Sharper judge → better signal for the eventual use_native_citations default-flip decision in the rag-native-citations track.

Spec: specs/faithfulness-judge-thinking/ (workspace umbrella).

Key design choices

  • Thinking is incompatible with forced tool_choice on Claude 4 — when thinking is enabled, the call swaps to tool_choice: "auto". The parser was refactored to handle both response shapes (a sketch follows this list):
    • tool_use block → schema-guaranteed happy path (unchanged from pre-thinking behavior).
    • text block JSON → fallback for the rare case where the model declines the tool. Honors ```json fences.
    • thinking blocks → always skipped.
  • budget_tokens = 32768 (default). It's a ceiling, not a target — Anthropic bills only for tokens emitted, so a generous default is free insurance.
  • Back-compat is non-negotiable. Default kwargs (use_thinking=False) produce a request that is byte-identical to today, asserted by test_score_without_thinking_omits_thinking_block.
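
The refactored parser reduces to roughly the following shape — `_extract_judge_payload` and `_strip_code_fences` are the names used in this PR, but the bodies below are a sketch, not the actual implementation:

```python
import json
import re


def _strip_code_fences(text: str) -> str:
    # Drop a surrounding ``` or ```json fence, if one is present.
    match = re.match(r"^```(?:json)?\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text


def _extract_judge_payload(response) -> dict:
    text_parts: list[str] = []
    for block in response.content:
        if block.type == "thinking":
            continue  # thinking blocks never carry the verdict
        if block.type == "tool_use":
            return block.input  # happy path: schema-validated tool payload
        if block.type == "text":
            text_parts.append(block.text)
    if text_parts:
        # Fallback: the model declined the tool and answered in prose/JSON.
        return json.loads(_strip_code_fences("".join(text_parts)))
    raise ValueError("no tool_use or text block in response")
```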

Public surface

async def score(
    self,
    query: str,
    answer: str,
    passages: str | list[str],
    max_tokens: int = 2048,
    *,
    use_thinking: bool = False,
    thinking_budget_tokens: int = 32768,
) -> FaithfulnessResult: ...
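
For orientation, the request assembly inside score plausibly looks like the sketch below. The thinking block and the tool_choice swap are from this PR; the helper name `_request_kwargs` and the tool name `record_faithfulness` are hypothetical:

```python
def _request_kwargs(*, max_tokens: int, use_thinking: bool,
                    thinking_budget_tokens: int) -> dict:
    if not use_thinking:
        # Default path: byte-identical to the pre-change request shape.
        return {
            "max_tokens": max_tokens,
            "tool_choice": {"type": "tool", "name": "record_faithfulness"},
        }
    return {
        # In thinking mode, max_tokens caps thinking + reply combined.
        "max_tokens": max_tokens + thinking_budget_tokens,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget_tokens},
        "tool_choice": {"type": "auto"},  # forced tool_choice is unsupported with thinking
    }
```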

FaithfulnessResult.thinking_used: bool = False is the new field — surfaced in to_dict() and the benchmark's per-query output.
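
A hypothetical call site (the judge instance and inputs are invented; only the keyword surface comes from the signature above):

```python
# Inside an async context, with `judge` a constructed FaithfulnessJudge:
result = await judge.score(
    query="What year was the treaty signed?",
    answer="It was signed in 1848.",
    passages=["Article 1: The treaty was signed in 1848 ..."],
    use_thinking=True,             # opt-in; default False keeps the old request shape
    thinking_budget_tokens=32768,  # a ceiling, not a target
)
assert result.thinking_used        # new field, also surfaced in result.to_dict()
```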

CLI:

attune-rag-benchmark --with-faithfulness --thinking
attune-rag-benchmark --with-faithfulness --thinking --thinking-budget 65536

# env-var defaults:
ATTUNE_RAG_FAITHFULNESS_THINKING=1
ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET=65536

SDK pin

anthropic floor bumped: >=0.40.0,<1.0 → >=0.95,<1.0 across [claude], [all], and [dev]. Required for stable thinking + tool-use support on Claude 4.

Test plan

  • Unit tests: 13 new on parser fallback paths and thinking-mode wiring; 15 new on benchmark CLI env/flag parsing and _score_faithfulness kwarg forwarding.
  • All 378 existing tests still pass (pytest).
  • ruff check . clean.
  • Calibration benchmark (deferred): live attune-rag-benchmark --with-faithfulness --thinking against the golden query set, attaching side-by-side thinking-on / thinking-off output. Will run separately and add the output to this PR before merge.
  • Tag + GitHub release + PyPI trusted-publish post-merge.

Notes for the reviewer

  • The parser refactor renames _extract_tool_input → _extract_judge_payload. One existing test (test_missing_tool_use_block_raises) had its regex updated to match the new error message ("no tool_use or text block").
  • The benchmark CLI flag takes precedence over an env-var-off default (store_true semantics) — covered by test_thinking_flag_present_with_env_off; a sketch of one plausible wiring follows.
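
The flag, env-var names, and store_true behavior below are from this PR; the helper and the exact parsing are assumptions:

```python
import argparse
import os


def _env_truthy(name: str) -> bool:
    return os.environ.get(name, "").strip().lower() in {"1", "true", "yes"}


parser = argparse.ArgumentParser()
# store_true: passing --thinking always yields True; when the flag is absent,
# the env var supplies the default — so flag-present beats env-off.
parser.add_argument(
    "--thinking",
    action="store_true",
    default=_env_truthy("ATTUNE_RAG_FAITHFULNESS_THINKING"),
)
parser.add_argument(
    "--thinking-budget",
    type=int,
    default=int(os.environ.get("ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET", "32768")),
)
args = parser.parse_args()
```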

🤖 Generated with Claude Code

silversurfer562 and others added 2 commits May 13, 2026 12:31
Opt-in extended thinking on the faithfulness judge. When
use_thinking=True, the judge sends Anthropic's `thinking={"type":
"enabled", "budget_tokens": 32768}` block and swaps tool_choice
from forced to "auto" — required by Anthropic when combining
thinking with tools on Claude 4 models.

Parser now handles both response shapes:
- tool_use block → schema-guaranteed happy path (unchanged from
  pre-thinking behavior).
- text block JSON → fallback for the rare case where the model
  declines the tool. Honors ```json fences via _strip_code_fences.
- thinking blocks → always skipped.

API:
- `FaithfulnessJudge.score(use_thinking=False,
   thinking_budget_tokens=32768)` — keyword-only.
- `FaithfulnessResult.thinking_used` — new boolean field.
- `attune-rag-benchmark --thinking` plus env-var defaults
  (`ATTUNE_RAG_FAITHFULNESS_THINKING`,
  `ATTUNE_RAG_FAITHFULNESS_THINKING_BUDGET`).

Default behavior (no kwarg / no flag) is byte-identical to the
pre-change call shape — covered by test_score_without_thinking_
omits_thinking_block.

`anthropic` SDK floor bumped to >=0.95 for stable thinking+tools
support on Claude 4.

Tests:
- 13 new tests on the parser fallback paths and thinking-mode
  request/result wiring.
- 15 new tests on the benchmark CLI env/flag parsing and
  _score_faithfulness kwarg forwarding.
- All 378 existing tests still pass.

Spec: workspace specs/faithfulness-judge-thinking/. Calibration
benchmark run (task 9) deferred — runs in a follow-up against
~/.attune/anthropic.env; output will be attached to the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live calibration surfaced an Anthropic API constraint not in
the public docs prose: `max_tokens` must be strictly greater
than `thinking.budget_tokens`, because in thinking mode
max_tokens caps the COMBINED thinking + response output.

Default `max_tokens=2048` + `thinking_budget_tokens=32768`
crashed with:
    BadRequestError: 400 - `max_tokens` must be greater
    than `thinking.budget_tokens`

Fix: when use_thinking=True, send
`max_tokens + thinking_budget_tokens` to the API while the
caller's `max_tokens` keeps its original semantic meaning
(budget for the judge's REPLY, not counting thinking).
Caller code requires no changes.

Two new tests:
- test_score_with_thinking_bumps_max_tokens_above_budget
- test_score_without_thinking_passes_max_tokens_unchanged

This is exactly the failure unit tests with stub clients
cannot catch — the constraint lives in the Anthropic API,
not in the SDK. Task 9 (live calibration) caught it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@silversurfer562

Task 9 — Live calibration benchmark (resolved)

Ran attune-rag-benchmark --with-faithfulness on the full 40-query golden set in both modes (thinking on and off) against Sonnet 4.6.

Constraint discovered live (not in stub tests)

Default config crashed:

anthropic.BadRequestError: 400 — `max_tokens` must be greater than `thinking.budget_tokens`

In thinking mode, max_tokens caps the combined thinking + response output. With max_tokens=2048 and thinking_budget_tokens=32768, the API rejects the request.

Fix (commit ad96f7c on this branch): when use_thinking=True, send max_tokens + thinking_budget_tokens to the API while the caller's max_tokens keeps its original semantic meaning (budget for the judge's reply, not counting thinking). Two new tests cover both the bump and the back-compat path.
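
In code terms the fix reduces to something like this sketch (the real change is in ad96f7c):

```python
def _effective_max_tokens(max_tokens: int, use_thinking: bool,
                          thinking_budget_tokens: int) -> int:
    # In thinking mode the API's max_tokens caps thinking + reply combined,
    # so the request gets max_tokens + budget; the caller's max_tokens keeps
    # meaning "reply budget only".
    return max_tokens + thinking_budget_tokens if use_thinking else max_tokens


assert _effective_max_tokens(2048, True, 32768) == 34816   # > budget, API accepts
assert _effective_max_tokens(2048, False, 32768) == 2048   # back-compat unchanged
```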

This is exactly the kind of failure stub-client unit tests can't catch. The Task-9 live gate caught it.

Calibration results (post-fix)

| Metric | Thinking OFF | Thinking ON | Δ |
| --- | --- | --- | --- |
| Mean faithfulness | 0.983 | 0.972 | −1.1% |
| Hallucination rate | 17.5% | 27.5% | +10 pts |
| Refusal rate | 0.0% | 0.0% | 0 |
| Mean latency (pipeline) | 7787 ms | 7585 ms | −202 ms |
| p95 latency (pipeline) | 12177 ms | 11420 ms | −757 ms |

(Pipeline latency = generator + judge; judge-only timing would be the cleaner comparison but requires per-block instrumentation. The pipeline means are comparable — thinking-on is in fact slightly lower — suggesting whatever latency thinking adds to the judge is small relative to run-to-run noise.)

Interpretation

The thinking-on judge is stricter — it flags 10 percentage points more of the queries as having at least one unsupported claim. Mean faithfulness drops 1.1 points as a consequence.

Whether that's "better" depends on whether the extra flagged claims are:

  • (a) real hallucinations the OFF judge missed → thinking-on is a quality win
  • (b) supportable claims the ON judge is over-rejecting → thinking-on is overzealous

This data does NOT automatically support flipping use_native_citations to default-on in the rag-native-citations spec (that's the downstream decision this calibration was meant to inform). Per-query inspection is needed before that decision.

Suggested follow-up (not blocking this PR)

  • Spec a per-query diff script: compare the thinking-on vs thinking-off unsupported_claims lists per query and classify each extra flag as a real hallucination or an overzealous rejection — roughly the shape sketched after this list.
  • That's a separate spec — not in scope for this one.
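
To make the suggestion concrete, a hypothetical shape for that script — field names are guesses from the per-query benchmark output, and the follow-up spec would pin them down:

```python
import json


def _load(path: str) -> dict[str, set[str]]:
    # Map query -> set of flagged claims; record fields are assumptions.
    with open(path) as f:
        return {r["query"]: set(r.get("unsupported_claims", []))
                for r in json.load(f)}


def extra_flags(off_path: str, on_path: str) -> list[dict]:
    # Per query: which claims did only the thinking-on judge flag?
    off, on = _load(off_path), _load(on_path)
    return [{"query": q, "only_thinking_on": sorted(on[q] - off.get(q, set()))}
            for q in on if on[q] - off.get(q, set())]
```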

What this PR delivers (unchanged from the original summary)

  • Opt-in use_thinking=True on FaithfulnessJudge.score
  • New FaithfulnessResult.thinking_used field
  • attune-rag-benchmark --thinking flag + env defaults
  • 29 unit tests (17 new) + 15 benchmark-CLI tests; all 380 attune-rag tests passing post-fix.
  • anthropic SDK pin bumped to >=0.95,<1.0.

Ready for review and merge.
