
anthropic: fix prefix caching#21793

Open
kvc0 wants to merge 1 commit into ggml-org:master from kvc0:claude

Conversation


@kvc0 kvc0 commented Apr 12, 2026

Overview

When testing Claude Code against llama.cpp, I noticed that only
n_past = 18577 was reused even when the context was 60k or more. The log
in llama-server says:

```
slot update_slots: id  3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id  3 | task 10342 | new: ... ; cch= | 1c8b4;
```

I observed that the cch value changed on every request. Reading up on it,
the x-anthropic-billing-header system message appears to be specially
handled inside the Anthropic API. I could remove the message entirely,
but a meaningful string is sometimes appended at the end of it. So
instead, I just replace the changing cch checksum with fffff.

I'm treating this as an Anthropic message-body API detail. I think this
is the right way to handle it, but by all means please correct me!

It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
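For illustration, the normalization described above can be sketched roughly like this (a hypothetical sketch, not the PR's actual code; the function name `normalize_cch` is mine). It shows both the defensive hex-run replacement and the cheap early-out on the message prefix:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch: replace the volatile "cch=<hex>" checksum in the
// billing-header system message with a constant, so the prompt prefix
// stays byte-identical across requests and prefix caching can work.
static std::string normalize_cch(std::string msg) {
    const std::string prefix = "x-anthropic-billing-header";
    // O(1) early-out: most system messages don't start with 'x', so this
    // usually fails after comparing a single character.
    if (msg.rfind(prefix, 0) != 0) {
        return msg;
    }
    const std::string key = "cch=";
    size_t pos = msg.find(key);
    if (pos == std::string::npos) {
        return msg;
    }
    pos += key.size();
    // Consume the checksum defensively: any run of hex digits, not just
    // exactly 5, in case the protocol changes.
    size_t end = pos;
    while (end < msg.size() && std::isxdigit((unsigned char) msg[end])) {
        end++;
    }
    msg.replace(pos, end - pos, "fffff");
    return msg;
}
```

Non-matching system messages pass through untouched, which is what keeps the cost negligible for everyone else.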

Additional information

When asking "explain this repo to me" in a different repo, using a freshly started llama-server, the second request:
Before:

```
selected slot by LCP similarity, sim_best = 0.566 (> 0.050 thold), f_keep = 0.704
```

This is the best case, but it gets progressively worse, as the matched length
never exceeds 18577 (up to 18580 theoretically, but I never saw higher than 18578).

After:

```
selected slot by LCP similarity, sim_best = 0.805 (> 0.050 thold), f_keep = 1.000
```

And further along, I see prefixes that only differ in tool call details, as you would expect:

```
selected slot by LCP similarity, sim_best = 0.994 (> 0.050 thold), f_keep = 0.999
[...]
slot update_slots: id  1 | task 449 | old: ... =command>
 | cd /home/kenny/g
slot update_slots: id  1 | task 449 | new: ... =command>
 | git status
</parameter>
```

After this change, similarity looks normal and caching is performing well.

While debugging this, I dumped the /slots API a couple of times on subsequent requests.
The diffs in the prompt field looked like:

```
diff prompt1 prompt2 --unchanged-line-format="" --old-line-format="< :%dn: %L" --new-line-format="> :%dn: %L"
< :62: x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=a5145;You are Claude Code, Anthropic's official CLI for Claude.
> :62: x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=4a1a8;You are Claude Code, Anthropic's official CLI for Claude.
> :5130: </tool_response><|im_end|>
> :5131: <|im_start|>assistant
> :5132: <think>
> :5133: The diagnostics still show an issue with [...]
```

You can see line 62 has a cch diff, followed by over 5000 identical lines before
the next diff. This should have been a near-total cache hit, because everything new
starts at line 5130. But because of the line-62 diff, llama-server had to re-ingest
nearly the whole prompt. Without this change, it does so on every request because of
Anthropic's magic "header."
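To make the caching behavior above concrete, here is a minimal sketch (hypothetical, not llama-server's actual slot-selection code): a prefix cache can only reuse tokens up to the first differing position, so a one-token change near the start of the prompt discards almost the entire cache.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: longest-common-prefix length between two token
// sequences. Prefix caching can only reuse tokens up to this point;
// everything after the first mismatch must be re-ingested.
static size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    const size_t n = std::min(a.size(), b.size());
    size_t i = 0;
    while (i < n && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

With two long prompts that are identical except for one early token (like the changing cch checksum), the reusable prefix collapses to that early position; normalize the token and the full prefix becomes reusable again.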

Performance:
For users who aren't sending Claude Code messages to the Anthropic-style
endpoint, the impact of this change is a single O(1) string-prefix check per
system message. Few system messages start with "x", so in the usual case it
early-outs after comparing one character.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: NO. I read and wrote all of this myself.

@kvc0 kvc0 requested a review from a team as a code owner April 12, 2026 05:55