When testing claude code against llama.cpp, I noticed that only n_past 18577 was used even when context was 60k or more. The log in llama-server says:

```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```

I observed that the cch value changed every time. Reading about that, the x-anthropic-billing-header system message seems to be specially handled inside of the anthropic api. I could remove it, but there is a meaningful string sometimes included at the end. So instead, I just replace the changing cch checksum with fffff.

It's always 5 hexadecimal characters, but I've written the replacement defensively in case they change the protocol.
Overview
When testing Claude Code against llama.cpp, I noticed that only
n_past 18577 was used even when the context was 60k or more. The
llama-server log shows the old and new prompts diverging right after
`cch=`: the value was `defa0` in one request and `1c8b4` in the next.
I observed that the cch value changed every time. From what I've read,
the x-anthropic-billing-header system message seems to be specially
handled inside the Anthropic API. I could remove it, but a meaningful
string is sometimes included at the end. So instead,
I just replace the changing cch checksum with fffff.
I'm treating this as an Anthropic message body API detail. I think this
is the right way to do this, but by all means please correct me!
The cch value is always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
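As an illustration of the replacement described above, here is a minimal Python sketch. It is not the code from this PR: the function name `normalize_billing_header`, the regular expression, and the assumed `cch=<hex>;` layout are hypothetical, inferred only from the log excerpt. The hex length is deliberately left open to mirror the "defensive" intent.

```python
import re

# Prefix named in the PR description; the rest of the header layout is assumed.
BILLING_PREFIX = "x-anthropic-billing-header"

# Assumed shape "cch=<hex digits>;". The digit count is not fixed at 5,
# so a longer or shorter checksum would still be normalized.
_CCH_RE = re.compile(r"cch=[0-9a-fA-F]+;")


def normalize_billing_header(system_text: str) -> str:
    """Replace the changing cch value with a constant so prompts stay comparable."""
    if not system_text.startswith(BILLING_PREFIX):
        # Not the billing header: leave the system message untouched.
        return system_text
    # Only the first cch field is rewritten; any trailing text is preserved.
    return _CCH_RE.sub("cch=fffff;", system_text, count=1)


first = normalize_billing_header("x-anthropic-billing-header: ...; cch=defa0;You are")
second = normalize_billing_header("x-anthropic-billing-header: ...; cch=1c8b4;You are")
print(first == second)
```

With the checksum pinned to a constant, two otherwise identical system messages compare equal, which is what lets the prompt prefix match across requests.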
Additional information
When asking "explain this repo to me" on a different repo, using a freshly started llama-server, the second request:
Before:
This is the best case. It gets progressively worse because the matched length stays stuck
at about 18577 (up to 18580 theoretically, but I never saw higher than 18578).
After:
And further along, I see prefixes that only differ in tool call details, as you would expect:
After this change, similarity looks normal and caching is performing well.
While debugging this, I dumped the /slots api a couple times on subsequent requests.
The diffs in the prompt field were like:
You can see line 62 has a cch diff, and then over 5000 common lines before the diff.
This should have been a total cache hit because it's all new starting at line 5130. But
because of the line 62 diff, it had to re-ingest nearly the whole thing. Without this
change, llama-server does this on every request because of anthropic's magic "header."
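To make the effect concrete, here is a small Python sketch of why one early difference defeats prefix reuse. It is only an illustration under stated assumptions: it compares lines where llama-server compares tokens, the helper `common_prefix_len` is hypothetical, and the prompt contents are synthetic. Only the positions (a cch diff at line 62, new content from line 5130) come from the description above.

```python
from itertools import takewhile


def common_prefix_len(old, new):
    """Count leading items that are identical in both sequences."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(old, new)))


# Synthetic prompts: lines 1..5129 are shared, new content starts at line 5130.
shared = [f"line {i}" for i in range(1, 5130)]
old_prompt = list(shared)
new_prompt = shared + ["new content"]

# Line 62 (index 61) carries the cch value, which differs between requests.
old_raw, new_raw = list(old_prompt), list(new_prompt)
old_raw[61] = "cch=defa0;"
new_raw[61] = "cch=1c8b4;"

# Same prompts with the cch value pinned to a constant.
old_fixed, new_fixed = list(old_prompt), list(new_prompt)
old_fixed[61] = new_fixed[61] = "cch=fffff;"

raw_match = common_prefix_len(old_raw, new_raw)        # stops at the cch line
fixed_match = common_prefix_len(old_fixed, new_fixed)  # covers the whole old prompt
print(raw_match, fixed_match)
```

In this toy setup the reusable prefix grows from 61 lines to all 5129 lines of the previous prompt once the cch line no longer changes.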
Performance:
The impact of this change on users who aren't using Claude to send messages to the
Anthropic API is a single-position O(1) string prefix check per system message. I don't
imagine too many system messages start with
`x`, so in the usual case it will early-out at 1 character's worth of comparison.
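The early-out argument can be illustrated with a short Python sketch that counts character comparisons in a prefix check. This is a hypothetical helper written for illustration, not the check used in the PR. It assumes the prefix is the literal string `x-anthropic-billing-header`.

```python
BILLING_PREFIX = "x-anthropic-billing-header"


def starts_with_counting(text, prefix=BILLING_PREFIX):
    """Return (matches, characters compared) for a left-to-right prefix check."""
    compared = 0
    for a, b in zip(text, prefix):
        compared += 1
        if a != b:
            # First mismatch ends the check; the rest of the message is never read.
            return False, compared
    # All compared characters matched; it is a prefix only if text is long enough.
    return len(text) >= len(prefix), compared


typical = starts_with_counting("You are a helpful assistant.")
billing = starts_with_counting("x-anthropic-billing-header: ...; cch=defa0;You are")
print(typical, billing)
```

The cost is bounded by the prefix length, not the message length, and a typical system message is rejected after a single comparison.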
Requirements