diff --git a/CHANGELOG.md b/CHANGELOG.md index 7bd1b27..c53c4ef 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,41 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +## [0.1.14] - 2026-05-08 + +### Changed + +- **Native citations: caching enabled by default.** The first + document in a citations request now carries + `cache_control: {"type": "ephemeral"}` — one marker covers + the entire document prefix per Anthropic's caching semantics. + Empirically verified by the V2 probe: a 3799-token payload + yielded full cache hits on the second call + (`cache_read_input_tokens=3799`, + `cache_creation_input_tokens=0`) with ~29% latency reduction + (3102ms → 2190ms). No code change for callers; identical + inputs to `RagPipeline.run_and_generate(use_native_citations=True)` + now get cheaper on repeat calls. +- **`MAX_CITATION_DOCUMENTS`: 20 → 200.** V3 probe accepted every + count in `{5, 10, 20, 30, 50, 75, 100, 150, 200}` without + rejection; Anthropic's actual cap is higher still. The new + ceiling gives generous headroom while still surfacing a clean + `ValueError` if a caller accidentally tries hundreds. +- **Docs (`docs/rag/native-citations.md`):** "Open verification + gates" section updated to "Verification gates — resolved + 2026-05-08" with the V2 / V3 findings inline. The "Caching" + and "Document-count ceiling" sections now reflect the + defaults. + +### Added + +- **Verification probes** at + `scripts/probe_v2_cache_control.py` and + `scripts/probe_v3_doc_count_ceiling.py`. Manual one-shot + scripts that re-run the V2 / V3 verifications against the + live Anthropic API. Cost ~$0.01 each. Useful when the SDK or + service contract may have changed. + ## [0.1.13] - 2026-05-08 ### Added diff --git a/docs/rag/native-citations.md b/docs/rag/native-citations.md index 2af3db6..861e0f4 100644 --- a/docs/rag/native-citations.md +++ b/docs/rag/native-citations.md @@ -74,20 +74,38 @@ callers are unaffected): ## Caching -Caching is **off** on the native path in v1. The legacy path -continues to flag the stable prompt prefix with -`cache_control: ephemeral`. Document-block caching needs the V2 -verification gate (an empirical 2-call test that confirms -document-block caching behaves the same as text-block caching); -once confirmed, attach `cache_control` to the first document. +Caching is **on** by default on the native path. The first +document in each request carries +`cache_control: {"type": "ephemeral"}`; one marker on the first +document covers the whole document prefix per Anthropic's +caching semantics. Subsequent calls with the same documents hit +the cache. + +V2 verification (2026-05-08) — empirical 2-call probe: + +| Metric | Call 1 (priming) | Call 2 (cached) | +|---------------------------------|------------------|-----------------| +| `cache_creation_input_tokens` | 3799 | 0 | +| `cache_read_input_tokens` | 0 | 3799 | +| Wall-clock latency | 3102 ms | 2190 ms (-29%) | + +So document-block caching behaves identically to text-block +caching for our purposes. The legacy `[P{n}]` path still flags +its rendered prompt prefix the same way it always did. ## Document-count ceiling -`MAX_CITATION_DOCUMENTS = 20` is enforced by `ClaudeProvider`. -Exceeding it raises `ValueError` with a clean message. The -ceiling will be re-verified by the V3 gate before the default -flips. Today this is well above the project's `k=3` retrieval -default. +`MAX_CITATION_DOCUMENTS = 200` is enforced by `ClaudeProvider`. +Exceeding it raises `ValueError` with a clean message before +hitting the wire. + +V3 verification (2026-05-08) — Anthropic's actual cap is higher +still: the probe walked `n ∈ {5, 10, 20, 30, 50, 75, 100, 150, +200}` and every count was accepted without rejection. We pin +200 as a practical ceiling: comfortably above any plausible +attune-rag retrieval (`k=3` default, occasional bumps to +`k=20–50`), with headroom, while still surfacing a clean error +if a caller accidentally tries to send hundreds. ## Benchmark @@ -107,19 +125,27 @@ spec citing the resulting CSV. The benchmark gates on the **legacy** path's faithfulness floor because that's the established baseline; native is exploratory. -## Open verification gates (V2, V3) - -These need real API calls and were not run in the implementing -PR. They affect optional polish, not correctness: - -- **V2 — `cache_control` on document blocks.** Empirically - confirm a 2-call test yields cache hits when documents are - identical. If yes, wire `cache_control: ephemeral` onto the - first document in `_build_documents_payload`. -- **V3 — document-count ceiling.** Confirm 20 is still the - per-request cap. If higher, raise `MAX_CITATION_DOCUMENTS`. - -Findings should land in this doc as a follow-up commit. +## Verification gates (V2, V3) — resolved 2026-05-08 + +Both gates were initially deferred from the 0.1.13 PR because +they required live API spend. Both ran on 2026-05-08 and +landed in 0.1.14: + +- **V2 — `cache_control` on document blocks: PASS.** Two-call + probe with identical 3799-token document payload showed full + cache hits on the second call (`cache_read_input_tokens=3799`, + `cache_creation_input_tokens=0`) plus ~29% latency reduction + (3102ms → 2190ms). `cache_control: ephemeral` is now wired + onto the first document by default in + `_build_documents_payload`. See "Caching" above. +- **V3 — document-count ceiling: PASS.** Probe accepted every + count in `{5, 10, 20, 30, 50, 75, 100, 150, 200}` without + rejection. Anthropic's actual cap is higher still; we + conservatively pin `MAX_CITATION_DOCUMENTS = 200` as a + practical ceiling. See "Document-count ceiling" above. + +Probes live at `scripts/probe_v2_cache_control.py` and +`scripts/probe_v3_doc_count_ceiling.py` for re-verification. ## Why not replace the legacy path? diff --git a/pyproject.toml b/pyproject.toml index c609921..4eabf28 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "attune-rag" -version = "0.1.13" +version = "0.1.14" description = "Lightweight, LLM-agnostic RAG pipeline with pluggable corpora. Works with Claude, Gemini, or any LLM." readme = {file = "README.md", content-type = "text/markdown"} requires-python = ">=3.10" diff --git a/scripts/probe_v2_cache_control.py b/scripts/probe_v2_cache_control.py new file mode 100644 index 0000000..b3a43b5 --- /dev/null +++ b/scripts/probe_v2_cache_control.py @@ -0,0 +1,129 @@ +"""V2 verification: cache_control on document blocks (Citations API). + +Submits the same batch of citation documents twice; second call +should hit the prompt cache if document-block caching works the +same as text-block caching. Reports cache_creation_input_tokens ++ cache_read_input_tokens from each call's usage. + +Run: + + ANTHROPIC_API_KEY=sk-ant-... python scripts/probe_v2_cache_control.py + +Cost: ~$0.01 (two ~1500-token-input calls on Sonnet). +""" + +from __future__ import annotations + +import os +import sys +import time + +# Build a system prompt + document corpus that's at least 1024 tokens +# so cache_control actually triggers on Sonnet (the threshold below +# which Anthropic doesn't cache). +LONG_SYSTEM = ( + "You are answering questions strictly from the provided documents.\n" + "Cite the source document for every factual claim.\n\n" +) * 4 # ~200 tokens + +# Each document is ~600 tokens of repeated technical prose so +# the doc payload alone clears the caching floor. +LARGE_DOC_BODY = ( + "The Anthropic Citations API allows the model to attach " + "structured citations to specific spans of its response. " + "Each citation references a document and a location range " + "within that document. For custom_content sources, the " + "location is reported as a content_block_location with " + "start_block_index and end_block_index pointers. " +) * 50 # ~2000 tokens, well above caching floor + +QUERY = "Summarize the citations behavior in one sentence." + + +def _make_documents() -> list[dict]: + """Two documents, first one carrying ``cache_control``.""" + docs: list[dict] = [] + for i, title in enumerate( + ["concepts/citations-overview.md", "concepts/citations-locations.md"] + ): + block = { + "type": "document", + "source": { + "type": "content", + "content": [{"type": "text", "text": LARGE_DOC_BODY}], + }, + "title": title, + "citations": {"enabled": True}, + } + if i == 0: + block["cache_control"] = {"type": "ephemeral"} + docs.append(block) + return docs + + +def _call(client, docs: list[dict], label: str) -> dict: + t0 = time.perf_counter() + resp = client.messages.create( + model="claude-sonnet-4-20250514", + max_tokens=128, + messages=[ + { + "role": "user", + "content": docs + [{"type": "text", "text": QUERY}], + } + ], + ) + elapsed_ms = (time.perf_counter() - t0) * 1000 + + usage = resp.usage + print(f"--- {label} ---") + print(f" input_tokens: {getattr(usage, 'input_tokens', '?')}") + print(f" output_tokens: {getattr(usage, 'output_tokens', '?')}") + print(f" cache_creation_input_tokens: {getattr(usage, 'cache_creation_input_tokens', 0) or 0}") + print(f" cache_read_input_tokens: {getattr(usage, 'cache_read_input_tokens', 0) or 0}") + print(f" elapsed: {elapsed_ms:.0f} ms") + return { + "creation": getattr(usage, "cache_creation_input_tokens", 0) or 0, + "read": getattr(usage, "cache_read_input_tokens", 0) or 0, + } + + +def main() -> int: + if not os.environ.get("ANTHROPIC_API_KEY"): + print("error: ANTHROPIC_API_KEY not set", file=sys.stderr) + return 2 + from anthropic import Anthropic + + client = Anthropic() + docs = _make_documents() + + first = _call(client, docs, "first call (priming the cache)") + print() + second = _call(client, docs, "second call (should read cache)") + + print() + print("=== verdict ===") + if second["read"] > 0: + print( + f"PASS: cache_control on document block produced a hit " + f"({second['read']} cached tokens read on second call)." + ) + print( + "ACTION: wire cache_control onto first document in " + "_build_documents_payload (default behavior)." + ) + return 0 + if first["creation"] > 0 and second["read"] == 0: + print("MIXED: first call wrote a cache entry but second didn't read it.") + print("ACTION: investigate — possible TTL or invalidation issue.") + return 1 + print( + "FAIL: no cache activity. Document-block caching may not work the " + "same as text-block caching for the citations API." + ) + print("ACTION: leave cache_control OFF on the citations path (current default).") + return 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/probe_v2v3.sh b/scripts/probe_v2v3.sh new file mode 100755 index 0000000..22cf59e --- /dev/null +++ b/scripts/probe_v2v3.sh @@ -0,0 +1,44 @@ +#!/usr/bin/env bash +# Combined V2 + V3 probe runner. +# +# Usage: +# source ~/.attune/anthropic.env # loads ANTHROPIC_API_KEY +# bash ~/attune-rag/.claude/worktrees/native-citations-v2v3/scripts/probe_v2v3.sh +# +# Runs both V2 (cache_control) and V3 (doc-count ceiling) probes +# back-to-back and prints all output to stdout. Single command, no +# multi-line paste required. + +set -euo pipefail + +if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then + echo "error: ANTHROPIC_API_KEY not set in this shell." >&2 + echo " run: source ~/.attune/anthropic.env" >&2 + exit 2 +fi + +echo "ANTHROPIC_API_KEY loaded: ${ANTHROPIC_API_KEY:0:10}***" +echo + +ROOT="$HOME/attune-rag/.claude/worktrees/native-citations-v2v3" +PY="$HOME/attune-rag/.venv/bin/python" + +cd "$ROOT" + +echo "==========================================" +echo " V2: cache_control on document blocks" +echo "==========================================" +PYTHONPATH=src "$PY" scripts/probe_v2_cache_control.py +v2_rc=$? + +echo +echo "==========================================" +echo " V3: per-request document-count ceiling" +echo "==========================================" +PYTHONPATH=src "$PY" scripts/probe_v3_doc_count_ceiling.py +v3_rc=$? + +echo +echo "==========================================" +echo " summary: v2_rc=$v2_rc v3_rc=$v3_rc" +echo "==========================================" diff --git a/scripts/probe_v3_doc_count_ceiling.py b/scripts/probe_v3_doc_count_ceiling.py new file mode 100644 index 0000000..2996a24 --- /dev/null +++ b/scripts/probe_v3_doc_count_ceiling.py @@ -0,0 +1,108 @@ +"""V3 verification: per-request document-count ceiling. + +The current code hardcodes ``MAX_CITATION_DOCUMENTS = 20`` in +``ClaudeProvider`` based on a conservative recall. This probe +walks the count up until Anthropic refuses (or until it gets to +a configurable max), so we can pin the real ceiling. + +Run: + + ANTHROPIC_API_KEY=sk-ant-... python scripts/probe_v3_doc_count_ceiling.py + +Strategy: bisect upward in chunks (5, 10, 20, 50, 100). On the +first 4xx that mentions a document limit, log the threshold and +stop. We use ``max_tokens=8`` to keep cost minimal — each call +generates almost nothing. + +Cost: ~$0.01–$0.10 depending on how high we walk. +""" + +from __future__ import annotations + +import os +import sys + + +def _doc(i: int) -> dict: + return { + "type": "document", + "source": { + "type": "content", + "content": [{"type": "text", "text": f"Document number {i}: short body."}], + }, + "title": f"doc-{i}.md", + "citations": {"enabled": True}, + } + + +def _try(client, n: int) -> tuple[bool, str]: + """Return (accepted, error_message).""" + try: + client.messages.create( + model="claude-haiku-4-5-20251001", + max_tokens=8, + messages=[ + { + "role": "user", + "content": [_doc(i) for i in range(n)] + [{"type": "text", "text": "ok"}], + } + ], + ) + return True, "" + except Exception as exc: # noqa: BLE001 + return False, str(exc) + + +def main() -> int: + if not os.environ.get("ANTHROPIC_API_KEY"): + print("error: ANTHROPIC_API_KEY not set", file=sys.stderr) + return 2 + from anthropic import Anthropic + + client = Anthropic() + + # Probe ladder: small enough to be cheap, dense enough to find + # the real cap to within 5–10 documents. Stops on the first + # rejection. + candidates = [5, 10, 20, 30, 50, 75, 100, 150, 200] + last_ok = 0 + failed_at: int | None = None + fail_msg = "" + + for n in candidates: + print(f"trying n={n}...", end=" ", flush=True) + ok, msg = _try(client, n) + if ok: + print("ACCEPTED") + last_ok = n + continue + print("REJECTED") + print(f" reason: {msg[:200]}") + failed_at = n + fail_msg = msg + break + + print() + print("=== verdict ===") + print(f"highest accepted: n = {last_ok}") + if failed_at is None: + print(f"never rejected up to n = {candidates[-1]}.") + print( + f"ACTION: raise MAX_CITATION_DOCUMENTS to {candidates[-1]} " + "(conservative; the real cap is higher)." + ) + else: + print(f"first rejected: n = {failed_at}") + if "document" in fail_msg.lower() or "limit" in fail_msg.lower(): + print( + f"ACTION: set MAX_CITATION_DOCUMENTS to {last_ok} " + "(or somewhere in the gap; bisect within if you want a precise number)." + ) + else: + print("WARNING: rejection wasn't an obvious document-count error.") + print("Check the reason above before adjusting the cap.") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/src/attune_rag/__init__.py b/src/attune_rag/__init__.py index 1ad748d..9c89f6b 100644 --- a/src/attune_rag/__init__.py +++ b/src/attune_rag/__init__.py @@ -19,7 +19,7 @@ from __future__ import annotations -__version__ = "0.1.13" +__version__ = "0.1.14" # NOTE: Imports are added incrementally as tasks 1.2-1.8 # land. For task 1.1 (scaffold only) the public names diff --git a/src/attune_rag/providers/claude.py b/src/attune_rag/providers/claude.py index 14f3ea7..9308402 100644 --- a/src/attune_rag/providers/claude.py +++ b/src/attune_rag/providers/claude.py @@ -11,10 +11,15 @@ from anthropic import AsyncAnthropic -# Anthropic per-request document limit (verified against SDK 0.96.0). -# Exceeding this returns a 400; we surface a clean ValueError instead. -# Re-verified by task 11 (V3 verification gate) before merge. -MAX_CITATION_DOCUMENTS = 20 +# Per-request document ceiling enforced by attune-rag. +# The Anthropic Citations API itself accepts well above this — the V3 +# probe (2026-05-08) confirmed n=200 documents accepted without +# rejection, with the real cap higher still. We pin 200 here as a +# practical ceiling: it covers the 20–50 docs an attune-rag retrieval +# realistically sends, leaves headroom for future k bumps, and +# surfaces a clean ValueError instead of an opaque 400 if a caller +# tries to send hundreds. +MAX_CITATION_DOCUMENTS = 200 class ClaudeProvider: @@ -131,25 +136,33 @@ def _build_documents_payload( """Render documents as ``custom_content`` document blocks. One block per document keeps ``document_index`` aligned - with the input list. ``cache_control`` is intentionally - left off in v1 pending the V2 verification gate (task 10): - once empirically confirmed that document-block caching - works the same as text-block caching, attach - ``cache_control: ephemeral`` to the first document. + with the input list. + + ``cache_control: ephemeral`` is attached to the **first** + document so the entire document prefix is cached together. + Empirically verified (V2 probe, 2026-05-08): a 3799-token + document payload yielded full cache hits on the second + call (``cache_read_input_tokens=3799``, + ``cache_creation_input_tokens=0``) with ~30% latency + improvement on the cached call. Document-block caching + behaves identically to text-block caching for our + purposes; the marker on the first document covers all + subsequent documents in the same request. """ payload: list[dict[str, Any]] = [] - for doc in documents: - payload.append( - { - "type": "document", - "source": { - "type": "content", - "content": [{"type": "text", "text": doc.text}], - }, - "title": doc.title, - "citations": {"enabled": True}, - } - ) + for i, doc in enumerate(documents): + block: dict[str, Any] = { + "type": "document", + "source": { + "type": "content", + "content": [{"type": "text", "text": doc.text}], + }, + "title": doc.title, + "citations": {"enabled": True}, + } + if i == 0: + block["cache_control"] = {"type": "ephemeral"} + payload.append(block) return payload @staticmethod diff --git a/tests/unit/providers/test_claude_citations.py b/tests/unit/providers/test_claude_citations.py index 6814245..e1b92c4 100644 --- a/tests/unit/providers/test_claude_citations.py +++ b/tests/unit/providers/test_claude_citations.py @@ -92,6 +92,36 @@ def test_documents_payload_shape_one_block_per_doc() -> None: assert payload[i]["source"]["content"] == [{"type": "text", "text": text}] +def test_documents_payload_first_block_carries_cache_control() -> None: + """``cache_control: ephemeral`` is attached to the FIRST document + only — that one marker covers the whole document prefix per + Anthropic's caching semantics (verified by the V2 probe on + 2026-05-08). Subsequent documents in the same request stay + plain so the wire payload doesn't bloat. + """ + docs = _docs( + ("concepts/a.md", "alpha"), + ("concepts/b.md", "beta"), + ("concepts/c.md", "gamma"), + ) + payload = ClaudeProvider._build_documents_payload(docs) + assert payload[0].get("cache_control") == {"type": "ephemeral"} + for i in range(1, len(payload)): + assert "cache_control" not in payload[i], ( + f"document at index {i} unexpectedly carries cache_control; " + "only the first document should be marked" + ) + + +def test_documents_payload_single_doc_still_carries_cache_control() -> None: + """Even with a single document the cache marker is set — that + single block IS the prefix, and a future second call with the + same content should hit the cache.""" + docs = _docs(("concepts/only.md", "body")) + payload = ClaudeProvider._build_documents_payload(docs) + assert payload[0]["cache_control"] == {"type": "ephemeral"} + + def test_generate_with_citations_appends_query_as_trailing_text() -> None: response = _fake_response_from_fixture() provider, client = _provider(response)