35 changes: 35 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,41 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.1.14] - 2026-05-08

### Changed

- **Native citations: caching enabled by default.** The first
document in a citations request now carries
`cache_control: {"type": "ephemeral"}`. Per Anthropic's caching semantics,
the marker caches the request prefix up to and including that block.
Empirically verified by the V2 probe: a 3799-token payload
yielded full cache hits on the second call
(`cache_read_input_tokens=3799`,
`cache_creation_input_tokens=0`) with ~29% latency reduction
(3102ms → 2190ms). No code change for callers; identical
inputs to `RagPipeline.run_and_generate(use_native_citations=True)`
now get cheaper on repeat calls.
- **`MAX_CITATION_DOCUMENTS`: 20 → 200.** V3 probe accepted every
count in `{5, 10, 20, 30, 50, 75, 100, 150, 200}` without
rejection, so Anthropic's actual cap is at least 200. The new
ceiling gives generous headroom while still surfacing a clean
`ValueError` if a caller accidentally sends more than 200.
- **Docs (`docs/rag/native-citations.md`):** "Open verification
gates" section updated to "Verification gates — resolved
2026-05-08" with the V2 / V3 findings inline. The "Caching"
and "Document-count ceiling" sections now reflect the
defaults.

### Added

- **Verification probes** at
`scripts/probe_v2_cache_control.py` and
`scripts/probe_v3_doc_count_ceiling.py`. Manual one-shot
scripts that re-run the V2 / V3 verifications against the
live Anthropic API. Cost ~$0.01 each. Useful when the SDK or
service contract may have changed.

## [0.1.13] - 2026-05-08

### Added
74 changes: 50 additions & 24 deletions docs/rag/native-citations.md
@@ -74,20 +74,38 @@ callers are unaffected):

## Caching

-Caching is **off** on the native path in v1. The legacy path
-continues to flag the stable prompt prefix with
-`cache_control: ephemeral`. Document-block caching needs the V2
-verification gate (an empirical 2-call test that confirms
-document-block caching behaves the same as text-block caching);
-once confirmed, attach `cache_control` to the first document.
+Caching is **on** by default on the native path. The first
+document in each request carries
+`cache_control: {"type": "ephemeral"}`. Per Anthropic's caching
+semantics, the marker caches the request prefix up to and
+including the block that carries it. Subsequent calls with the
+same documents hit the cache.
+
+V2 verification (2026-05-08) — empirical 2-call probe:
+
+| Metric                          | Call 1 (priming) | Call 2 (cached) |
+|---------------------------------|------------------|-----------------|
+| `cache_creation_input_tokens`   | 3799             | 0               |
+| `cache_read_input_tokens`       | 0                | 3799            |
+| Wall-clock latency              | 3102 ms          | 2190 ms (-29%)  |
+
+So document-block caching behaves identically to text-block
+caching for our purposes. The legacy `[P{n}]` path still flags
+its rendered prompt prefix the same way it always did.
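
As a rough illustration of the default described in this hunk, here is a minimal sketch of a payload builder that attaches the marker to the first document. The name `build_documents_payload` and the `(title, text)` input shape are assumptions made for this example; the project's real `_build_documents_payload` is not shown in this diff and may differ. The block shape is copied from `scripts/probe_v2_cache_control.py`.

```python
from __future__ import annotations

from typing import Any


def build_documents_payload(passages: list[tuple[str, str]]) -> list[dict[str, Any]]:
    """Hypothetical helper: turn (title, text) pairs into citation document blocks.

    Only the first block carries the ephemeral cache marker, mirroring
    _make_documents() in the V2 probe.
    """
    documents: list[dict[str, Any]] = []
    for index, (title, text) in enumerate(passages):
        block: dict[str, Any] = {
            "type": "document",
            "source": {
                "type": "content",
                "content": [{"type": "text", "text": text}],
            },
            "title": title,
            "citations": {"enabled": True},
        }
        if index == 0:
            # The marker caches the prefix up to and including this block.
            block["cache_control"] = {"type": "ephemeral"}
        documents.append(block)
    return documents


payload = build_documents_payload([("a.md", "alpha"), ("b.md", "beta")])
print(["cache_control" in block for block in payload])  # [True, False]
```

Because a marker covers the prefix up to its own block, moving it to a later block would extend the cached prefix; whether that is worth doing depends on how stable the document order is between calls.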

## Document-count ceiling

-`MAX_CITATION_DOCUMENTS = 20` is enforced by `ClaudeProvider`.
-Exceeding it raises `ValueError` with a clean message. The
-ceiling will be re-verified by the V3 gate before the default
-flips. Today this is well above the project's `k=3` retrieval
-default.
+`MAX_CITATION_DOCUMENTS = 200` is enforced by `ClaudeProvider`.
+Exceeding it raises `ValueError` with a clean message before
+hitting the wire.
+
+V3 verification (2026-05-08): the probe walked
+`n ∈ {5, 10, 20, 30, 50, 75, 100, 150, 200}` and every count was
+accepted without rejection, so Anthropic's actual cap is at
+least 200. We pin 200 as a practical ceiling: comfortably above
+any plausible attune-rag retrieval (`k=3` default, occasional
+bumps to `k=20–50`), while still surfacing a clean error if a
+caller accidentally tries to send more than that.
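
The guard described here can be sketched in a few lines. This is only an illustration under assumptions: `check_document_count` and the error wording are hypothetical, and the real check inside `ClaudeProvider` is not part of this diff.

```python
from __future__ import annotations

MAX_CITATION_DOCUMENTS = 200  # value pinned by this PR


def check_document_count(documents: list) -> None:
    """Hypothetical guard: fail fast, before any request is sent."""
    count = len(documents)
    if count > MAX_CITATION_DOCUMENTS:
        raise ValueError(
            f"native citations accept at most {MAX_CITATION_DOCUMENTS} "
            f"documents per request, got {count}"
        )


for n in (3, 200, 201):
    try:
        check_document_count([{"type": "document"}] * n)
        print(n, "ok")
    except ValueError as err:
        print(n, "rejected:", err)
```

Raising locally keeps the failure cheap and gives the caller a readable message instead of an API-side rejection.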

## Benchmark

@@ -107,19 +125,27 @@ spec citing the resulting CSV.
The benchmark gates on the **legacy** path's faithfulness floor
because that's the established baseline; native is exploratory.

-## Open verification gates (V2, V3)
-
-These need real API calls and were not run in the implementing
-PR. They affect optional polish, not correctness:
-
-- **V2 — `cache_control` on document blocks.** Empirically
-confirm a 2-call test yields cache hits when documents are
-identical. If yes, wire `cache_control: ephemeral` onto the
-first document in `_build_documents_payload`.
-- **V3 — document-count ceiling.** Confirm 20 is still the
-per-request cap. If higher, raise `MAX_CITATION_DOCUMENTS`.
-
-Findings should land in this doc as a follow-up commit.
+## Verification gates (V2, V3) — resolved 2026-05-08
+
+Both gates were initially deferred from the 0.1.13 PR because
+they required live API spend. Both ran on 2026-05-08 and
+landed in 0.1.14:
+
+- **V2 — `cache_control` on document blocks: PASS.** Two-call
+probe with identical 3799-token document payload showed full
+cache hits on the second call (`cache_read_input_tokens=3799`,
+`cache_creation_input_tokens=0`) plus ~29% latency reduction
+(3102ms → 2190ms). `cache_control: ephemeral` is now wired
+onto the first document by default in
+`_build_documents_payload`. See "Caching" above.
+- **V3 — document-count ceiling: PASS.** Probe accepted every
+count in `{5, 10, 20, 30, 50, 75, 100, 150, 200}` without
+rejection, so Anthropic's actual cap is at least 200; we
+pin `MAX_CITATION_DOCUMENTS = 200` as a
+practical ceiling. See "Document-count ceiling" above.
+
+Probes live at `scripts/probe_v2_cache_control.py` and
+`scripts/probe_v3_doc_count_ceiling.py` for re-verification.
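
When re-running the V2 probe, the two headline figures can be re-derived from the raw counters. The sketch below uses the numbers recorded in this PR; the helper names are hypothetical and exist only for this example.

```python
def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Relative wall-clock saving of the second call, in percent."""
    return (before_ms - after_ms) / before_ms * 100.0


def is_full_cache_hit(first_creation: int, second_creation: int, second_read: int) -> bool:
    """True when call 2 read back exactly what call 1 wrote and wrote nothing new."""
    return second_read > 0 and second_read == first_creation and second_creation == 0


print(round(latency_reduction_pct(3102, 2190), 1))  # 29.4
print(is_full_cache_hit(3799, 0, 3799))  # True
```

A single pair of calls is a small sample, so the latency figure is best read as indicative; the token counters are the more reliable signal.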

## Why not replace the legacy path?

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "attune-rag"
-version = "0.1.13"
+version = "0.1.14"
description = "Lightweight, LLM-agnostic RAG pipeline with pluggable corpora. Works with Claude, Gemini, or any LLM."
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.10"
129 changes: 129 additions & 0 deletions scripts/probe_v2_cache_control.py
@@ -0,0 +1,129 @@
"""V2 verification: cache_control on document blocks (Citations API).

Submits the same batch of citation documents twice; second call
should hit the prompt cache if document-block caching works the
same as text-block caching. Reports cache_creation_input_tokens
+ cache_read_input_tokens from each call's usage.

Run:

    ANTHROPIC_API_KEY=sk-ant-... python scripts/probe_v2_cache_control.py

Cost: ~$0.01 (two calls on Sonnet, each with several thousand input tokens).
"""

from __future__ import annotations

import os
import sys
import time

# Anthropic does not cache prompts below a minimum length (1024
# tokens on Sonnet), so the payload must clear that floor for
# cache_control to have any effect.
# NOTE: LONG_SYSTEM is defined here but _call() below never passes
# it as `system=`; the documents alone clear the floor.
LONG_SYSTEM = (
    "You are answering questions strictly from the provided documents.\n"
    "Cite the source document for every factual claim.\n\n"
) * 4

# Each document body is the same paragraph of technical prose
# repeated 50 times, so the doc payload alone clears the caching floor.
LARGE_DOC_BODY = (
    "The Anthropic Citations API allows the model to attach "
    "structured citations to specific spans of its response. "
    "Each citation references a document and a location range "
    "within that document. For custom_content sources, the "
    "location is reported as a content_block_location with "
    "start_block_index and end_block_index pointers. "
) * 50

QUERY = "Summarize the citations behavior in one sentence."


def _make_documents() -> list[dict]:
    """Two documents, first one carrying ``cache_control``."""
    docs: list[dict] = []
    for i, title in enumerate(
        ["concepts/citations-overview.md", "concepts/citations-locations.md"]
    ):
        block = {
            "type": "document",
            "source": {
                "type": "content",
                "content": [{"type": "text", "text": LARGE_DOC_BODY}],
            },
            "title": title,
            "citations": {"enabled": True},
        }
        if i == 0:
            block["cache_control"] = {"type": "ephemeral"}
        docs.append(block)
    return docs


def _call(client, docs: list[dict], label: str) -> dict:
    t0 = time.perf_counter()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=128,
        messages=[
            {
                "role": "user",
                "content": docs + [{"type": "text", "text": QUERY}],
            }
        ],
    )
    elapsed_ms = (time.perf_counter() - t0) * 1000

    usage = resp.usage
    print(f"--- {label} ---")
    print(f" input_tokens: {getattr(usage, 'input_tokens', '?')}")
    print(f" output_tokens: {getattr(usage, 'output_tokens', '?')}")
    print(f" cache_creation_input_tokens: {getattr(usage, 'cache_creation_input_tokens', 0) or 0}")
    print(f" cache_read_input_tokens: {getattr(usage, 'cache_read_input_tokens', 0) or 0}")
    print(f" elapsed: {elapsed_ms:.0f} ms")
    return {
        "creation": getattr(usage, "cache_creation_input_tokens", 0) or 0,
        "read": getattr(usage, "cache_read_input_tokens", 0) or 0,
    }


def main() -> int:
    if not os.environ.get("ANTHROPIC_API_KEY"):
        print("error: ANTHROPIC_API_KEY not set", file=sys.stderr)
        return 2
    from anthropic import Anthropic

    client = Anthropic()
    docs = _make_documents()

    first = _call(client, docs, "first call (priming the cache)")
    print()
    second = _call(client, docs, "second call (should read cache)")

    print()
    print("=== verdict ===")
    if second["read"] > 0:
        print(
            f"PASS: cache_control on document block produced a hit "
            f"({second['read']} cached tokens read on second call)."
        )
        print(
            "ACTION: wire cache_control onto first document in "
            "_build_documents_payload (default behavior)."
        )
        return 0
    if first["creation"] > 0 and second["read"] == 0:
        print("MIXED: first call wrote a cache entry but second didn't read it.")
        print("ACTION: investigate — possible TTL or invalidation issue.")
        return 1
    print(
        "FAIL: no cache activity. Document-block caching may not work the "
        "same as text-block caching for the citations API."
    )
    print("ACTION: leave cache_control OFF on the citations path (current default).")
    return 1


if __name__ == "__main__":
    sys.exit(main())
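
The verdict logic at the end of `main()` can be isolated as a pure function, which makes the three outcomes easy to check without calling the API. This is a sketch, not part of the PR: `classify` is a hypothetical name, and it takes the same `{"creation", "read"}` dictionaries that `_call()` returns.

```python
from __future__ import annotations


def classify(first: dict[str, int], second: dict[str, int]) -> tuple[str, int]:
    """Return (verdict, exit_code) using the same rules as main()."""
    if second["read"] > 0:
        return ("PASS", 0)  # second call read cached tokens
    if first["creation"] > 0:
        return ("MIXED", 1)  # cache was written but never read back
    return ("FAIL", 1)  # no cache activity at all


# Counters recorded for the 2026-05-08 run in this PR.
print(classify({"creation": 3799, "read": 0}, {"creation": 0, "read": 3799}))  # ('PASS', 0)
```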
44 changes: 44 additions & 0 deletions scripts/probe_v2v3.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Combined V2 + V3 probe runner.
#
# Usage:
# source ~/.attune/anthropic.env # loads ANTHROPIC_API_KEY
# bash ~/attune-rag/.claude/worktrees/native-citations-v2v3/scripts/probe_v2v3.sh
#
# Runs both V2 (cache_control) and V3 (doc-count ceiling) probes
# back-to-back and prints all output to stdout. Single command, no
# multi-line paste required.

set -euo pipefail

if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
  echo "error: ANTHROPIC_API_KEY not set in this shell." >&2
  echo " run: source ~/.attune/anthropic.env" >&2
  exit 2
fi

echo "ANTHROPIC_API_KEY loaded: ${ANTHROPIC_API_KEY:0:10}***"
echo

ROOT="$HOME/attune-rag/.claude/worktrees/native-citations-v2v3"
PY="$HOME/attune-rag/.venv/bin/python"

cd "$ROOT"

echo "=========================================="
echo " V2: cache_control on document blocks"
echo "=========================================="
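# Under `set -e` a failing probe would abort the script before `$?`
# could be read, so capture the exit code with `||` instead.
v2_rc=0
PYTHONPATH=src "$PY" scripts/probe_v2_cache_control.py || v2_rc=$?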

echo
echo "=========================================="
echo " V3: per-request document-count ceiling"
echo "=========================================="
# Same pattern as above: record the exit code without tripping `set -e`.
v3_rc=0
PYTHONPATH=src "$PY" scripts/probe_v3_doc_count_ceiling.py || v3_rc=$?

echo
echo "=========================================="
echo " summary: v2_rc=$v2_rc v3_rc=$v3_rc"
echo "=========================================="
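
The runner's control flow (run both probes back to back and report both exit codes, even if the first one fails) can also be sketched in Python. `run_probes` is a hypothetical helper written for this example, and the inline `-c` commands are stand-ins for the real probe scripts.

```python
from __future__ import annotations

import subprocess
import sys


def run_probes(commands: dict[str, list[str]]) -> dict[str, int]:
    """Run each command in order and record its exit code, never stopping early."""
    return_codes: dict[str, int] = {}
    for label, argv in commands.items():
        completed = subprocess.run(argv, check=False)
        return_codes[label] = completed.returncode
    return return_codes


# Stand-in commands; a real run would point at scripts/probe_v2_cache_control.py
# and scripts/probe_v3_doc_count_ceiling.py instead.
codes = run_probes(
    {
        "v2": [sys.executable, "-c", "raise SystemExit(1)"],
        "v3": [sys.executable, "-c", "pass"],
    }
)
print("summary:", codes)  # summary: {'v2': 1, 'v3': 0}
```

With `check=False`, a non-zero exit is recorded rather than raised, so the second probe still runs and the summary always prints.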