
kv : add dynamic KV cache resize (--kv-dynamic) #21757

Open

rockyRunnr wants to merge 1 commit into ggml-org:master from rockyRunnr:feature/dynamic-kv-cache
Conversation

@rockyRunnr

Add --kv-dynamic flag that starts with a small KV cache (256 cells) and grows on demand via try_resize(). Supports both standalone llama_kv_cache and hybrid (llama_memory_hybrid) architectures.

Growth strategy: doubling for small caches, +1GB linear for large.
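The growth strategy above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `next_kv_size`, its parameters, and the 1 GB threshold separating the doubling and linear regimes are assumptions for the sake of the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the growth heuristic: double the cell count while
// the cache is small, then grow by roughly +1 GB worth of cells per step
// once it is large, never exceeding the configured -c upper bound.
static uint32_t next_kv_size(uint32_t n_cells, uint32_t n_needed,
                             size_t bytes_per_cell, uint32_t n_ctx_max) {
    const size_t GiB = 1024ull * 1024 * 1024;
    // linear step: ~1 GB of cells (at least 1, in case cells are huge)
    const uint32_t step = std::max<uint32_t>(1, (uint32_t) (GiB / bytes_per_cell));

    uint32_t n_new = n_cells;
    while (n_new < n_needed) {
        if ((size_t) n_new * bytes_per_cell < GiB) {
            n_new *= 2;      // small cache: exponential growth
        } else {
            n_new += step;   // large cache: +1 GB linear growth
        }
    }
    return std::min(n_new, n_ctx_max); // clamp to the -c upper bound
}
```

The doubling regime keeps the number of resize/copy cycles logarithmic while the cache is cheap to copy; switching to linear growth avoids overshooting by gigabytes once each resize is expensive.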

Overview

When a large -c is set, llama.cpp allocates the full KV cache upfront. On
Apple Silicon / unified memory this can cause GPU OOM even when actual usage
is small.

--kv-dynamic starts the cache at 256 cells and grows as needed:

  • prepare() fail → try_resize() → retry in the same init_batch() call
  • resize: create new cache → copy per-layer/per-stream → swap internals
  • after resize, the scheduler reserve is re-triggered in the same decode call

Grow-only for now; shrink is out of scope for this PR.

Related to Feature Request: resize an existing context #11577.
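The retry flow above can be sketched roughly as follows. This is a simplified illustration under assumed names, not the PR's actual implementation: the `kv_cache` struct, `prepare`, `try_resize`, and `init_batch` signatures here are stand-ins, and the growth policy is elided to simple doubling.

```cpp
#include <algorithm>
#include <cstdint>

// Toy stand-in for the KV cache; real state is per-layer/per-stream tensors.
struct kv_cache {
    uint32_t n_cells;   // current capacity
    uint32_t n_ctx_max; // upper bound set by -c

    // prepare() succeeds only if the required cells fit the current cache
    bool prepare(uint32_t n_needed) const { return n_needed <= n_cells; }

    // grow-only resize; in the PR this creates a new cache, copies the old
    // contents per-layer/per-stream, and swaps internals
    bool try_resize(uint32_t n_needed) {
        if (n_needed > n_ctx_max) return false; // cannot grow past -c
        while (n_cells < n_needed) n_cells *= 2; // growth policy elided
        n_cells = std::min(n_cells, n_ctx_max);
        return true;
    }
};

// init_batch: on prepare() failure, resize and retry within the same call
bool init_batch(kv_cache & kv, uint32_t n_past, uint32_t n_tokens) {
    const uint32_t n_needed = n_past + n_tokens;
    if (kv.prepare(n_needed)) return true;
    if (!kv.try_resize(n_needed)) return false; // over -c: genuine failure
    return kv.prepare(n_needed);                // retry after growth
}
```

The key point is that the failure is absorbed inside a single init_batch() call, so callers of llama_decode() never observe the resize.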

Additional information

This is a draft — looking for feedback on direction before iterating further:

  • Is grow-only as a first step reasonable?
  • Is the create/copy/swap pattern acceptable?
  • The current growth heuristics are experimental and based on local testing rather than broad benchmarking.
  • Should growth heuristics be configurable rather than hardcoded?

Earlier experiments on Apple M4 (32 GB, Qwen3.5-27B-Q4_K_M, -c 131072):

  prompt tokens   vanilla KV   dynamic KV
  ~100            8 GB         16 MB
  ~6K             8 GB         512 MB
  ~80K            8 GB         5 GB

In earlier local runs, the vanilla path frequently OOMed while the dynamic path completed without OOM.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES — used for debugging assistance and code review. PR text written by me.

@rockyRunnr rockyRunnr requested review from a team, CISC and ggerganov as code owners April 11, 2026 05:48
@ggml-gh-bot

ggml-gh-bot bot commented Apr 11, 2026

Hi @rockyRunnr, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@jacekpoplawski
Contributor

Am I right that this mostly defers OOM rather than eliminating it? If so, what is the advantage over simply limiting the context size at startup?

@rockyRunnr
Author

@jacekpoplawski, thank you for your comment.

Limiting the context size at startup forces the user to guess their eventual usage in advance. In many real sessions, the context starts small and only later grows longer. --kv-dynamic keeps the large upper bound available without paying the full KV cost upfront.

@mvatafu

mvatafu commented Apr 11, 2026

Wondering how this would complement the --parallel flag, since that one splits -c into n pieces. Would it mean we could run more small parallel sessions? That would be really nice.
