feat: PIC cache reuse with mistralrs (local) backend #904
Draft
Wire spnl Plus operators to mistral.rs for position-independent KV cache reuse (PIC) across requests. Plus blocks in queries are tagged with in-band markers (`\0PIC_PLUS\0`), which the engine uses to identify cacheable segments via text-based content hashing.

Add `spnl bench pic` benchmark measuring TTFT speedup and accuracy from PIC cache reuse with shuffled documents. Output controlled by `-o/--output` (comma-separated): speedup, iqr, hitrate, latency, json, accuracy. Supports `--full` sweep across doc sizes and models, `-l/--length` for doc words (TTFT) or max tokens (accuracy), `--grading-model` for LLM-judge scoring, progress bars, and silent model loading.

Add unit tests for PIC helper functions (token_f1, normalize_tokens, parse_score, resolve_spectrum, compute_hit_rate, percentile) and add_messages_from_query_inner message builder. Include ignored integration test that verifies PIC speedup > 1x with a real model.

Add README_PIC.md (user-facing: query syntax, CLI usage, benchmark results) and split engine internals into mistral.rs/PIC.md.

Point mistralrs/mistralrs-core deps to starpit/mistral.rs fork branch pic-cache-reuse for PR testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
- Update mistralrs dependency to point to starpit/mistral.rs pic-cache-reuse branch
- Update version requirement to >=0.7.1-alpha.1 for pre-release compatibility
- Comment out local patch override
- Rework accuracy benchmark to use fictional factual docs with verifiable answers
- Support PIC-shuffled (pshuf) mode in accuracy benchmark
- Add t-shirt size aliases (sm/lg)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
Enable the prepare_fragment optimization path, which pre-populates the PIC cache via 1-token generates before the main query, allowing all Plus blocks to hit the cache on the first request.

- hlo::optimize enabled in bench/pic.rs timed_request
- plus() keeps Plus wrapping for single elements (preserves PIC tagging)
- Monad returns Seq([]) instead of empty message (avoids phantom tokens)
- Comment out local patch override (changes pushed to starpit branch)
- Fix missing verbose field in test RunCtx

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
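The warm-up idea in the commit above can be illustrated with a toy model. This is only a sketch under stated assumptions: `FragmentCache`, its `prepare_fragment` method, and `query_hit_rate` are hypothetical stand-ins written for this illustration, not the spnl or mistral.rs API, and a plain set of strings replaces the real KV cache.

```rust
use std::collections::HashSet;

/// Toy stand-in for the PIC cache (hypothetical type): it only records which
/// fragment texts have been processed, not any real KV tensors.
struct FragmentCache {
    seen: HashSet<String>,
}

impl FragmentCache {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Stand-in for the 1-token generate: processing a fragment once is
    /// assumed to leave its entry in the cache.
    fn prepare_fragment(&mut self, fragment: &str) {
        self.seen.insert(fragment.to_string());
    }

    /// Run a "main query" over the given fragments and return the fraction
    /// that were already cached. Misses are cached as a side effect.
    fn query_hit_rate(&mut self, fragments: &[&str]) -> f64 {
        if fragments.is_empty() {
            return 0.0;
        }
        let mut hits = 0usize;
        for f in fragments {
            // `insert` returns false when the value was already present.
            if !self.seen.insert((*f).to_string()) {
                hits += 1;
            }
        }
        hits as f64 / fragments.len() as f64
    }
}
```

In this model a cold cache gives a hit rate of 0.0 on the first query, while calling `prepare_fragment` for each document beforehand gives 1.0 even when the documents arrive in a different order.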
Summary
Wire spnl Plus operators to mistral.rs for position-independent KV cache reuse (PIC) across requests. Plus blocks in queries are tagged with in-band markers, which the engine uses to identify cacheable segments via text-based content hashing — enabling automatic KV cache hits when the same document appears across different requests, regardless of position.
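A minimal sketch of the tagging and hashing idea follows. Only the marker string `\0PIC_PLUS\0` comes from this PR; the function names (`tag_plus`, `strip_marker`, `content_key`), the choice to place the marker as a prefix, and the use of the standard library's `DefaultHasher` are assumptions made for illustration, and the engine's actual hashing and boundary resolution may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// In-band marker named in the PR description.
const PIC_MARKER: &str = "\0PIC_PLUS\0";

/// Hypothetical helper: tag a Plus block by prefixing the marker.
/// (Whether the real code prefixes, suffixes, or wraps is an assumption.)
fn tag_plus(text: &str) -> String {
    format!("{}{}", PIC_MARKER, text)
}

/// Hypothetical helper: detect and strip the marker, returning whether the
/// message was tagged and the clean text to pass on to the chat template.
fn strip_marker(message: &str) -> (bool, &str) {
    match message.strip_prefix(PIC_MARKER) {
        Some(rest) => (true, rest),
        None => (false, message),
    }
}

/// Text-based content key: depends only on the text, never on where the
/// block sits in the request, so a moved document maps to the same key.
fn content_key(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}
```

Because the key is derived from the stripped text alone, two requests that contain the same document at different positions produce the same key, which is the property the cache lookup relies on.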
What's included
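- `add_messages_from_query` tags Plus blocks with `\0PIC_PLUS\0` in-band markers; the mistral.rs engine strips them before chat template processing and uses token subsequence matching to resolve message-to-token boundaries for cache lookup
- Benchmark (`spnl bench pic`): measures TTFT speedup and accuracy from cross-request PIC cache reuse with shuffled documents
  - `-o/--output` controls output (comma-separated): `speedup`, `iqr`, `hitrate`, `latency`, `json`, `accuracy`; values are combinable, e.g. `-o speedup,accuracy`
  - `--full` sweeps across doc sizes (xs through xxl) and models
  - `-l/--length` sets doc words (TTFT) or max tokens (accuracy)
  - `--grading-model` enables LLM-judge semantic equivalence scoring alongside token F1
- Unit tests for PIC helper functions (`token_f1`, `normalize_tokens`, `parse_score`, `resolve_spectrum`, `compute_hit_rate`, `percentile`) and the `add_messages_from_query_inner` message builder (Plus tagging, Cross/Plus nesting, unsupported query types). An ignored integration test verifies PIC speedup > 1x with a real model.
- `README_PIC.md` (user-facing: query syntax, CLI, results) and engine internals in `mistral.rs/PIC.md`
- `mistralrs`/`mistralrs-core` pointed to `starpit/mistral.rs` fork branch `pic-cache-reuse`

Benchmark Results
Output of `spnl bench pic --full`:

Test plan
- `cargo test -p spnl-cli --features bench --bin spnl -- bench::pic::tests` (22 tests)
- `cargo test -p spnl --features local --lib -- generate::backend::mistralrs::tests` (5 tests)
- `spnl` and `spnl-cli` pass `clippy -- -D warnings`
- `cargo build --release -F bench,metal` succeeds
- `spnl bench pic` runs end-to-end against a local model
- `cargo test -p spnl-cli --features bench,metal --bin spnl -- bench::pic::tests::pic_benchmark_shows_speedup --ignored`
- Accuracy benchmark (`-o accuracy`) produces Plus vs flat comparison output
- Review `README_PIC.md` for correctness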
cargo test -p spnl-cli --features bench --bin spnl -- bench::pic::tests(22 tests)cargo test -p spnl --features local --lib -- generate::backend::mistralrs::tests(5 tests)spnlandspnl-clipassclippy -- -D warningscargo build --release -F bench,metalsucceedsspnl bench picruns end-to-end against a local modelcargo test -p spnl-cli --features bench,metal --bin spnl -- bench::pic::tests::pic_benchmark_shows_speedup --ignored-o accuracy) produces Plus vs flat comparison outputREADME_PIC.mdfor correctness