feat: PIC cache reuse with mistralrs (local) backend #904

Draft
starpit wants to merge 3 commits into IBM:main from starpit:pic-cache-reuse

Member

starpit commented Feb 24, 2026

Summary

Wire spnl Plus operators to mistral.rs for position-independent KV cache reuse (PIC) across requests. Plus blocks in queries are tagged with in-band markers, which the engine uses to identify cacheable segments via text-based content hashing — enabling automatic KV cache hits when the same document appears across different requests, regardless of position.
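As a rough illustration of the idea in this summary, here is a minimal sketch of marker tagging plus content-keyed lookup. It is not the code in this PR: `tag_plus`, `strip_marker`, `content_key`, and `SegmentCache` are hypothetical names, the marker is assumed to be a prefix on the block's text, and a `Vec<f32>` stands in for real KV tensors. Only the `\0PIC_PLUS\0` marker string comes from the description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-band marker string named in the PR description.
const PIC_MARKER: &str = "\0PIC_PLUS\0";

/// Hypothetical helper: tag a Plus block by prefixing the marker (assumed placement).
fn tag_plus(text: &str) -> String {
    format!("{}{}", PIC_MARKER, text)
}

/// Hypothetical helper: strip the marker, returning the clean text and whether it was tagged.
fn strip_marker(message: &str) -> (&str, bool) {
    match message.strip_prefix(PIC_MARKER) {
        Some(rest) => (rest, true),
        None => (message, false),
    }
}

/// Hash the text only; the segment's position is deliberately not part of the key.
fn content_key(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Toy cache keyed by content hash; `Vec<f32>` is a placeholder for KV state.
#[derive(Default)]
struct SegmentCache {
    entries: HashMap<u64, Vec<f32>>,
}

impl SegmentCache {
    /// For each tagged message, report a hit or insert a placeholder entry on a miss.
    /// Untagged messages are skipped because they are not cacheable segments.
    fn lookup_or_insert(&mut self, messages: &[String]) -> Vec<bool> {
        let mut hits = Vec::new();
        for message in messages {
            let (text, tagged) = strip_marker(message);
            if !tagged {
                continue;
            }
            let key = content_key(text);
            let hit = self.entries.contains_key(&key);
            if !hit {
                self.entries.insert(key, vec![0.0; text.len()]);
            }
            hits.push(hit);
        }
        hits
    }
}
```

In this sketch, a second request that carries the same documents in a different order hits the cache for every tagged block, because the key depends only on the block's text.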

What's included

  • PIC integration: add_messages_from_query tags Plus blocks with \0PIC_PLUS\0 in-band markers; the mistral.rs engine strips them before chat template processing and uses token subsequence matching to resolve message-to-token boundaries for cache lookup
  • Benchmark (spnl bench pic): measures TTFT speedup and accuracy from cross-request PIC cache reuse with shuffled documents
    • -o/--output controls output (comma-separated): speedup, iqr, hitrate, latency, json, accuracy — combinable, e.g. -o speedup,accuracy
    • --full sweeps across doc sizes (xs through xxl) and models
    • -l/--length sets doc words (TTFT) or max tokens (accuracy)
    • --grading-model enables LLM-judge semantic equivalence scoring alongside token F1
    • Progress bars with silent model loading
  • Tests: unit tests for PIC helpers (token_f1, normalize_tokens, parse_score, resolve_spectrum, compute_hit_rate, percentile) and add_messages_from_query_inner message builder (Plus tagging, Cross/Plus nesting, unsupported query types). Ignored integration test verifies PIC speedup > 1x with a real model.
  • Docs: README_PIC.md (user-facing: query syntax, CLI, results) and engine internals in mistral.rs/PIC.md
  • Deps: mistralrs/mistralrs-core pointed to starpit/mistral.rs fork branch pic-cache-reuse
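The PIC integration item above mentions token subsequence matching to resolve message-to-token boundaries. The sketch below shows one plausible way to do that, assuming each message's tokens appear contiguously and in order within the templated prompt. `find_subsequence` and `resolve_boundaries` are hypothetical names, and the engine's actual matching may differ.

```rust
/// Find the first index at or after `from` where `needle` occurs contiguously in `haystack`.
fn find_subsequence(haystack: &[u32], needle: &[u32], from: usize) -> Option<usize> {
    // Guard against empty needles (windows(0) panics) and out-of-range starts.
    if needle.is_empty() || from > haystack.len() || needle.len() > haystack.len() - from {
        return None;
    }
    haystack[from..]
        .windows(needle.len())
        .position(|window| window == needle)
        .map(|index| index + from)
}

/// Resolve each message's half-open token range `(start, end)` in the full prompt.
/// Messages are matched left to right; a message that cannot be found yields `None`
/// and does not advance the search cursor.
fn resolve_boundaries(prompt: &[u32], messages: &[Vec<u32>]) -> Vec<Option<(usize, usize)>> {
    let mut cursor = 0;
    messages
        .iter()
        .map(|message| {
            let start = find_subsequence(prompt, message, cursor)?;
            let end = start + message.len();
            cursor = end;
            Some((start, end))
        })
        .collect()
}
```

Scanning from a moving cursor keeps the ranges ordered and non-overlapping, which matters when chat-template tokens sit between messages.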

Benchmark Results

Output of spnl bench pic --full (TTFT speedup from PIC cache reuse, by model and document size in words):

 Model         xs 10w  sm 50w  m 200w  lg 500w  xl 1000w  xxl 2000w
 ──────────── ─────── ─────── ─────── ──────── ───────── ──────────
 llama3.2:1b    2.02x   4.28x  13.35x   29.85x    52.91x     80.55x
 llama3.2:3b    2.12x   4.56x  15.25x   29.84x    48.71x     83.94x
 llama3.1:8b    2.38x   5.48x  17.71x   36.17x    65.57x     91.84x
 qwen2.5:0.5b   1.74x   3.38x  10.10x   22.51x    46.70x     87.87x
 qwen2.5:14b    3.00x   7.41x  24.74x   48.96x    97.98x    124.84x
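For readers who want to see how a speedup figure like these could be derived, here is a minimal sketch. It assumes the speedup is the ratio of median TTFT without cache reuse to median TTFT with it, and that the `iqr` output is the spread between the 25th and 75th percentiles. The PR names a `percentile` helper, but this implementation (linear interpolation) and the `speedup` and `iqr` functions are assumptions, not the benchmark's code.

```rust
/// Linearly interpolated percentile (`p` in 0..=100) of unsorted samples.
/// Returns `None` for an empty slice.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = (p.clamp(0.0, 100.0) / 100.0) * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}

/// Assumed definition: median baseline TTFT divided by median cached TTFT.
fn speedup(baseline_ttft_ms: &[f64], cached_ttft_ms: &[f64]) -> Option<f64> {
    let baseline = percentile(baseline_ttft_ms, 50.0)?;
    let cached = percentile(cached_ttft_ms, 50.0)?;
    if cached > 0.0 {
        Some(baseline / cached)
    } else {
        None
    }
}

/// Interquartile range: 75th percentile minus 25th percentile.
fn iqr(samples: &[f64]) -> Option<f64> {
    Some(percentile(samples, 75.0)? - percentile(samples, 25.0)?)
}
```

Using medians rather than means keeps a single slow run (for example, a cold model load) from skewing the reported ratio.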

Test plan

  • Unit tests pass: cargo test -p spnl-cli --features bench --bin spnl -- bench::pic::tests (22 tests)
  • Message builder tests pass: cargo test -p spnl --features local --lib -- generate::backend::mistralrs::tests (5 tests)
  • Clippy clean: both spnl and spnl-cli pass clippy -- -D warnings
  • cargo build --release -F bench,metal succeeds
  • spnl bench pic runs end-to-end against a local model
  • Integration test with GPU: cargo test -p spnl-cli --features bench,metal --bin spnl -- bench::pic::tests::pic_benchmark_shows_speedup --ignored
  • Accuracy mode (-o accuracy) produces Plus vs flat comparison output
  • Review README_PIC.md for correctness
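The accuracy comparison relies on token F1 between a model answer and a reference answer. The sketch below shows the standard bag-of-tokens F1 under an assumed normalization (lowercase, punctuation stripped, whitespace split). The function names match the helpers listed in this PR, but the bodies here are illustrative and may not match the tested implementations.

```rust
use std::collections::HashMap;

/// Assumed normalization: split on whitespace, keep alphanumerics, lowercase.
fn normalize_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| {
            word.chars()
                .filter(|c| c.is_alphanumeric())
                .flat_map(|c| c.to_lowercase())
                .collect::<String>()
        })
        .filter(|word| !word.is_empty())
        .collect()
}

/// Token-level F1: harmonic mean of precision and recall over the multiset overlap.
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred = normalize_tokens(prediction);
    let gold = normalize_tokens(reference);
    if pred.is_empty() || gold.is_empty() {
        // Two empty answers agree; one empty answer scores zero.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }
    // Count reference tokens, then consume them as prediction tokens match.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for token in &gold {
        *remaining.entry(token.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for token in &pred {
        if let Some(count) = remaining.get_mut(token.as_str()) {
            if *count > 0 {
                *count -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

A verbose but correct answer is penalized on precision under this metric, which is one reason an LLM-judge score (via --grading-model) is a useful complement.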

starpit force-pushed the pic-cache-reuse branch 2 times, most recently from cf902c2 to 4197e89, February 24, 2026 00:18
starpit marked this pull request as draft February 24, 2026 00:19
starpit changed the title from "feat: PIC cache reuse" to "feat: PIC cache reuse with mistralrs (local) backend" Feb 24, 2026
starpit marked this pull request as ready for review February 24, 2026 16:10
starpit force-pushed the pic-cache-reuse branch 2 times, most recently from 964b567 to 93f22b9, February 24, 2026 19:39

Wire spnl Plus operators to mistral.rs for position-independent KV
cache reuse (PIC) across requests. Plus blocks in queries are tagged
with in-band markers (\0PIC_PLUS\0), which the engine uses to identify
cacheable segments via text-based content hashing.

Add `spnl bench pic` benchmark measuring TTFT speedup and accuracy
from PIC cache reuse with shuffled documents. Output controlled by
`-o/--output` (comma-separated): speedup, iqr, hitrate, latency,
json, accuracy. Supports `--full` sweep across doc sizes and models,
`-l/--length` for doc words (TTFT) or max tokens (accuracy),
`--grading-model` for LLM-judge scoring, progress bars, and silent
model loading.

Add unit tests for PIC helper functions (token_f1, normalize_tokens,
parse_score, resolve_spectrum, compute_hit_rate, percentile) and
add_messages_from_query_inner message builder. Include ignored
integration test that verifies PIC speedup > 1x with a real model.

Add README_PIC.md (user-facing: query syntax, CLI usage, benchmark
results) and split engine internals into mistral.rs/PIC.md.

Point mistralrs/mistralrs-core deps to starpit/mistral.rs fork branch
pic-cache-reuse for PR testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>

starpit marked this pull request as draft February 25, 2026 21:02
starpit and others added 2 commits February 25, 2026 19:15

- Update mistralrs dependency to point to starpit/mistral.rs pic-cache-reuse branch
- Update version requirement to >=0.7.1-alpha.1 for pre-release compatibility
- Comment out local patch override
- Rework accuracy benchmark to use fictional factual docs with verifiable answers
- Support PIC-shuffled (pshuf) mode in accuracy benchmark
- Add t-shirt size aliases (sm/lg)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
Enable the prepare_fragment optimization path which pre-populates the
PIC cache via 1-token generates before the main query, allowing all
Plus blocks to hit the cache on the first request.

- hlo::optimize enabled in bench/pic.rs timed_request
- plus() keeps Plus wrapping for single elements (preserves PIC tagging)
- Monad returns Seq([]) instead of empty message (avoids phantom tokens)
- Comment out local patch override (changes pushed to starpit branch)
- Fix missing verbose field in test RunCtx

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>