Smart-AI-Memory · silversurfer562 · May 8, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,41 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.1.14] - 2026-05-08
+
+### Changed
+
+- **Native citations: caching enabled by default.** The first
+  document in a citations request now carries
+  `cache_control: {"type": "ephemeral"}`; per Anthropic's caching
+  semantics the marker caches the prefix up to and including it.
+  Empirically verified by the V2 probe: a 3799-token payload
+  yielded full cache hits on the second call
+  (`cache_read_input_tokens=3799`,
+  `cache_creation_input_tokens=0`) with ~29% latency reduction
+  (3102ms → 2190ms). No code change for callers; identical
+  inputs to `RagPipeline.run_and_generate(use_native_citations=True)`
+  now get cheaper on repeat calls.
+- **`MAX_CITATION_DOCUMENTS`: 20 → 200.** V3 probe accepted every
+  count in `{5, 10, 20, 30, 50, 75, 100, 150, 200}` without
+  rejection, so Anthropic's actual cap is at least 200. The new
+  ceiling gives generous headroom while still raising a clean
+  `ValueError` if a caller accidentally passes more than 200.
+- **Docs (`docs/rag/native-citations.md`):** "Open verification
+  gates" section updated to "Verification gates — resolved
+  2026-05-08" with the V2 / V3 findings inline. The "Caching"
+  and "Document-count ceiling" sections now reflect the
+  defaults.
+
+### Added
+
+- **Verification probes** at
+  `scripts/probe_v2_cache_control.py` and
+  `scripts/probe_v3_doc_count_ceiling.py`. Manual one-shot
+  scripts that re-run the V2 / V3 verifications against the
+  live Anthropic API. Cost ~$0.01 each. Useful when the SDK or
+  service contract may have changed.
+
 ## [0.1.13] - 2026-05-08
 
 ### Added

diff --git a/docs/rag/native-citations.md b/docs/rag/native-citations.md
@@ -74,20 +74,38 @@ callers are unaffected):
 
 ## Caching
 
-Caching is **off** on the native path in v1. The legacy path
-continues to flag the stable prompt prefix with
-`cache_control: ephemeral`. Document-block caching needs the V2
-verification gate (an empirical 2-call test that confirms
-document-block caching behaves the same as text-block caching);
-once confirmed, attach `cache_control` to the first document.
+Caching is **on** by default on the native path. The first
+document in each request carries
+`cache_control: {"type": "ephemeral"}`; per Anthropic's caching
+semantics, that marker caches the request prefix up to and
+including the first document. Subsequent calls with an identical
+prefix hit the cache.
+
+V2 verification (2026-05-08) — empirical 2-call probe:
+
+| Metric                          | Call 1 (priming) | Call 2 (cached) |
+|---------------------------------|------------------|-----------------|
+| `cache_creation_input_tokens`   | 3799             | 0               |
+| `cache_read_input_tokens`       | 0                | 3799            |
+| Wall-clock latency              | 3102 ms          | 2190 ms (-29%)  |
+
+So document-block caching behaves identically to text-block
+caching for our purposes. The legacy `[P{n}]` path still flags
+its rendered prompt prefix the same way it always did.
 
 ## Document-count ceiling
 
-`MAX_CITATION_DOCUMENTS = 20` is enforced by `ClaudeProvider`.
-Exceeding it raises `ValueError` with a clean message. The
-ceiling will be re-verified by the V3 gate before the default
-flips. Today this is well above the project's `k=3` retrieval
-default.
+`MAX_CITATION_DOCUMENTS = 200` is enforced by `ClaudeProvider`.
+Exceeding it raises `ValueError` with a clean message before
+hitting the wire.
+
+V3 verification (2026-05-08): the probe walked
+`n ∈ {5, 10, 20, 30, 50, 75, 100, 150, 200}` and every count was
+accepted, so Anthropic's actual cap is at least 200 (the probe
+did not go higher). We pin 200 as a practical ceiling:
+comfortably above any plausible attune-rag retrieval
+(`k=3` default, occasional bumps to `k=20–50`), while still
+raising a clean error if a caller passes more than 200 documents.
 
 ## Benchmark
 
@@ -107,19 +125,27 @@ spec citing the resulting CSV.
 The benchmark gates on the **legacy** path's faithfulness floor
 because that's the established baseline; native is exploratory.
 
-## Open verification gates (V2, V3)
-
-These need real API calls and were not run in the implementing
-PR. They affect optional polish, not correctness:
-
-- **V2 — `cache_control` on document blocks.** Empirically
-  confirm a 2-call test yields cache hits when documents are
-  identical. If yes, wire `cache_control: ephemeral` onto the
-  first document in `_build_documents_payload`.
-- **V3 — document-count ceiling.** Confirm 20 is still the
-  per-request cap. If higher, raise `MAX_CITATION_DOCUMENTS`.
-
-Findings should land in this doc as a follow-up commit.
+## Verification gates (V2, V3) — resolved 2026-05-08
+
+Both gates were initially deferred from the 0.1.13 PR because
+they required live API spend. Both ran on 2026-05-08 and
+landed in 0.1.14:
+
+- **V2 — `cache_control` on document blocks: PASS.** Two-call
+  probe with identical 3799-token document payload showed full
+  cache hits on the second call (`cache_read_input_tokens=3799`,
+  `cache_creation_input_tokens=0`) plus ~29% latency reduction
+  (3102ms → 2190ms). `cache_control: ephemeral` is now wired
+  onto the first document by default in
+  `_build_documents_payload`. See "Caching" above.
+- **V3 — document-count ceiling: PASS.** Probe accepted every
+  count in `{5, 10, 20, 30, 50, 75, 100, 150, 200}` without
+  rejection, so Anthropic's actual cap is at least 200; we
+  conservatively pin `MAX_CITATION_DOCUMENTS = 200` as a
+  practical ceiling. See "Document-count ceiling" above.
+
+Probes live at `scripts/probe_v2_cache_control.py` and
+`scripts/probe_v3_doc_count_ceiling.py` for re-verification.
 
 ## Why not replace the legacy path?
 

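The two defaults described in the docs diff above (a `cache_control` marker on the first document, and a `ValueError` above `MAX_CITATION_DOCUMENTS`) can be sketched as follows. This is a minimal illustration only: `build_documents_payload` and its `(title, text)` input shape are hypothetical stand-ins, since the real `_build_documents_payload` signature is not shown in this diff. The block layout mirrors the one used in `scripts/probe_v2_cache_control.py`.

```python
from typing import Any

MAX_CITATION_DOCUMENTS = 200  # ceiling named in docs/rag/native-citations.md


def build_documents_payload(passages: list[tuple[str, str]]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the project's ``_build_documents_payload``.

    ``passages`` is an assumed input shape: a list of (title, text) pairs.
    """
    # Fail before any network call if the caller passes too many documents.
    if len(passages) > MAX_CITATION_DOCUMENTS:
        raise ValueError(
            f"got {len(passages)} citation documents; "
            f"the limit is {MAX_CITATION_DOCUMENTS}"
        )

    docs: list[dict[str, Any]] = []
    for i, (title, text) in enumerate(passages):
        block: dict[str, Any] = {
            "type": "document",
            "source": {
                "type": "content",
                "content": [{"type": "text", "text": text}],
            },
            "title": title,
            "citations": {"enabled": True},
        }
        # Only the first document carries the cache marker.
        if i == 0:
            block["cache_control"] = {"type": "ephemeral"}
        docs.append(block)
    return docs
```

Keeping the length check at the top means an oversized request is rejected locally, which matches the "before hitting the wire" wording in the docs.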
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "attune-rag"
-version = "0.1.13"
+version = "0.1.14"
 description = "Lightweight, LLM-agnostic RAG pipeline with pluggable corpora. Works with Claude, Gemini, or any LLM."
 readme = {file = "README.md", content-type = "text/markdown"}
 requires-python = ">=3.10"

diff --git a/scripts/probe_v2_cache_control.py b/scripts/probe_v2_cache_control.py
@@ -0,0 +1,129 @@
+"""V2 verification: cache_control on document blocks (Citations API).
+
+Submits the same batch of citation documents twice; second call
+should hit the prompt cache if document-block caching works the
+same as text-block caching. Reports cache_creation_input_tokens
++ cache_read_input_tokens from each call's usage.
+
+Run:
+
+    ANTHROPIC_API_KEY=sk-ant-... python scripts/probe_v2_cache_control.py
+
+Cost: ~$0.01 (two calls on Sonnet, a few thousand input tokens each).
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import time
+
+# Sonnet only caches prompts of at least 1024 tokens, so the probe
+# payload must clear that floor. NOTE: LONG_SYSTEM is defined here
+# but is not passed to messages.create() in _call().
+LONG_SYSTEM = (
+    "You are answering questions strictly from the provided documents.\n"
+    "Cite the source document for every factual claim.\n\n"
+) * 4  # ~200 tokens
+
+# Each document repeats one short technical paragraph 50 times so
+# the doc payload alone clears the caching floor.
+LARGE_DOC_BODY = (
+    "The Anthropic Citations API allows the model to attach "
+    "structured citations to specific spans of its response. "
+    "Each citation references a document and a location range "
+    "within that document. For custom_content sources, the "
+    "location is reported as a content_block_location with "
+    "start_block_index and end_block_index pointers. "
+) * 50  # well above the 1024-token caching floor
+
+QUERY = "Summarize the citations behavior in one sentence."
+
+
+def _make_documents() -> list[dict]:
+    """Two documents, first one carrying ``cache_control``."""
+    docs: list[dict] = []
+    for i, title in enumerate(
+        ["concepts/citations-overview.md", "concepts/citations-locations.md"]
+    ):
+        block = {
+            "type": "document",
+            "source": {
+                "type": "content",
+                "content": [{"type": "text", "text": LARGE_DOC_BODY}],
+            },
+            "title": title,
+            "citations": {"enabled": True},
+        }
+        if i == 0:
+            block["cache_control"] = {"type": "ephemeral"}
+        docs.append(block)
+    return docs
+
+
+def _call(client, docs: list[dict], label: str) -> dict:
+    t0 = time.perf_counter()
+    resp = client.messages.create(
+        model="claude-sonnet-4-20250514",
+        max_tokens=128,
+        messages=[
+            {
+                "role": "user",
+                "content": docs + [{"type": "text", "text": QUERY}],
+            }
+        ],
+    )
+    elapsed_ms = (time.perf_counter() - t0) * 1000
+
+    usage = resp.usage
+    print(f"--- {label} ---")
+    print(f"  input_tokens:                {getattr(usage, 'input_tokens', '?')}")
+    print(f"  output_tokens:               {getattr(usage, 'output_tokens', '?')}")
+    print(f"  cache_creation_input_tokens: {getattr(usage, 'cache_creation_input_tokens', 0) or 0}")
+    print(f"  cache_read_input_tokens:     {getattr(usage, 'cache_read_input_tokens', 0) or 0}")
+    print(f"  elapsed:                     {elapsed_ms:.0f} ms")
+    return {
+        "creation": getattr(usage, "cache_creation_input_tokens", 0) or 0,
+        "read": getattr(usage, "cache_read_input_tokens", 0) or 0,
+    }
+
+
+def main() -> int:
+    if not os.environ.get("ANTHROPIC_API_KEY"):
+        print("error: ANTHROPIC_API_KEY not set", file=sys.stderr)
+        return 2
+    from anthropic import Anthropic
+
+    client = Anthropic()
+    docs = _make_documents()
+
+    first = _call(client, docs, "first call (priming the cache)")
+    print()
+    second = _call(client, docs, "second call (should read cache)")
+
+    print()
+    print("=== verdict ===")
+    if second["read"] > 0:
+        print(
+            f"PASS: cache_control on document block produced a hit "
+            f"({second['read']} cached tokens read on second call)."
+        )
+        print(
+            "ACTION: none; cache_control on the first document in "
+            "_build_documents_payload (the default) is confirmed."
+        )
+        return 0
+    if first["creation"] > 0 and second["read"] == 0:
+        print("MIXED: first call wrote a cache entry but second didn't read it.")
+        print("ACTION: investigate — possible TTL or invalidation issue.")
+        return 1
+    print(
+        "FAIL: no cache activity. Document-block caching may not work the "
+        "same as text-block caching for the citations API."
+    )
+    print("ACTION: turn cache_control OFF on the citations path (it is now on by default).")
+    return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
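
The verdict logic in `main()` above depends only on the two usage summaries, so it can be pulled into a pure function and checked without any API spend. The sketch below assumes the same `{"creation": ..., "read": ...}` dictionaries that `_call` returns; `classify_cache_probe` is a hypothetical name and is not part of the probe.

```python
def classify_cache_probe(
    first: dict[str, int], second: dict[str, int]
) -> tuple[str, int]:
    """Map two usage summaries to (verdict, exit_code).

    Mirrors the branch order in the probe's main():
    a cache read on the second call wins, then a write without
    a read, then no cache activity at all.
    """
    if second["read"] > 0:
        return "PASS", 0
    if first["creation"] > 0:
        return "MIXED", 1
    return "FAIL", 1


# Figures from the V2 table in docs/rag/native-citations.md.
verdict, exit_code = classify_cache_probe(
    {"creation": 3799, "read": 0},
    {"creation": 0, "read": 3799},
)
print(verdict, exit_code)
```

Separating classification from printing would also let a unit test cover the MIXED and FAIL branches, which a live run rarely exercises.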
diff --git a/scripts/probe_v2v3.sh b/scripts/probe_v2v3.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Combined V2 + V3 probe runner.
+#
+# Usage:
+#   source ~/.attune/anthropic.env   # loads ANTHROPIC_API_KEY
+#   bash ~/attune-rag/.claude/worktrees/native-citations-v2v3/scripts/probe_v2v3.sh
+#
+# Runs both V2 (cache_control) and V3 (doc-count ceiling) probes
+# back-to-back and prints all output to stdout. Single command, no
+# multi-line paste required.
+
+set -euo pipefail
+
+if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
+    echo "error: ANTHROPIC_API_KEY not set in this shell." >&2
+    echo "       run:  source ~/.attune/anthropic.env"     >&2
+    exit 2
+fi
+
+echo "ANTHROPIC_API_KEY loaded: ${ANTHROPIC_API_KEY:0:10}***"
+echo
+
+ROOT="$HOME/attune-rag/.claude/worktrees/native-citations-v2v3"
+PY="$HOME/attune-rag/.venv/bin/python"
+
+cd "$ROOT"
+
+echo "=========================================="
+echo " V2: cache_control on document blocks"
+echo "=========================================="
+v2_rc=0
+PYTHONPATH=src "$PY" scripts/probe_v2_cache_control.py || v2_rc=$?
+
+echo
+echo "=========================================="
+echo " V3: per-request document-count ceiling"
+echo "=========================================="
+v3_rc=0
+PYTHONPATH=src "$PY" scripts/probe_v3_doc_count_ceiling.py || v3_rc=$?
+
+echo
+echo "=========================================="
+echo " summary: v2_rc=$v2_rc  v3_rc=$v3_rc"
+echo "=========================================="
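
A runner that reports every probe's exit code has to keep going when one probe fails; under Bash's `set -e`, a bare failing command ends the script immediately. One way to sketch the same "run all, then summarize" behavior in Python is shown below. `run_all` is a hypothetical helper, not part of the repository.

```python
import subprocess
import sys


def run_all(commands: dict[str, list[str]]) -> dict[str, int]:
    """Run every command in order and return each one's exit code.

    check=False keeps a non-zero exit from raising, so later
    commands still run after an earlier failure.
    """
    codes: dict[str, int] = {}
    for name, argv in commands.items():
        codes[name] = subprocess.run(argv, check=False).returncode
    return codes
```

A caller could pass `{"v2": [sys.executable, "scripts/probe_v2_cache_control.py"], "v3": [sys.executable, "scripts/probe_v3_doc_count_ceiling.py"]}` and exit with the largest returned code, so CI would see a failure from either probe.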