kv : add dynamic KV cache resize (--kv-dynamic) #21757
rockyRunnr wants to merge 1 commit into ggml-org:master
Add --kv-dynamic flag that starts with a small KV cache (256 cells) and grows on demand via try_resize(). Supports both standalone llama_kv_cache and hybrid (llama_memory_hybrid) architectures. Growth strategy: doubling for small caches, +1GB linear for large.
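The growth strategy can be sketched as a small sizing function. This is only an illustration, not the PR's code: the struct, the function name kv_next_size, and the rule for where "small" ends (switch to linear steps once doubling would add more than one step) are assumptions, and "+1GB" is taken as 1 GiB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical parameters: the PR text does not state where "small" ends or
// how many bytes one cell occupies, so both are inputs here.
struct kv_growth_policy {
    uint64_t bytes_per_cell;     // bytes of K+V data per cell, across all layers
    uint64_t linear_step_bytes;  // size of the linear step, e.g. 1 GiB
};

// Returns the next cache size in cells, clamped to n_max (the -c upper bound).
inline uint32_t kv_next_size(const kv_growth_policy & p, uint32_t n_cur, uint32_t n_max) {
    assert(p.bytes_per_cell > 0 && n_cur > 0);
    // Number of cells that fit in one linear step (at least one).
    const uint64_t step_cells = std::max<uint64_t>(1, p.linear_step_bytes / p.bytes_per_cell);
    // Assumed rule: double while doubling adds less than one step,
    // otherwise grow by exactly one step.
    const uint64_t grow = std::min<uint64_t>(n_cur, step_cells);
    return (uint32_t) std::min<uint64_t>((uint64_t) n_cur + grow, n_max);
}
```

With a hypothetical 1 MiB per cell, a 256-cell cache doubles to 512 cells, while a 2048-cell cache grows by 1024 cells (1 GiB) to 3072.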
Hi @rockyRunnr, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Am I right that this mostly defers OOM rather than eliminating it? If so, what is the advantage over simply limiting the context size at startup?
@jacekpoplawski, thank you for your comment. Limiting the context size at startup forces the user to guess their eventual usage in advance. In many real sessions, the context starts small and only grows long later. --kv-dynamic keeps the large upper bound available without paying the full KV cost upfront.
I wonder how this would complement the --parallel flag, since that one splits -c into x pieces. Would it mean we could run more small parallel sessions? That would be really nice.
Overview
When a large -c is set, llama.cpp allocates the full KV cache upfront. On Apple Silicon / unified memory this can cause GPU OOM even when actual usage is small.
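To make the upfront cost concrete, the usual KV sizing arithmetic for standard attention (K and V, per layer, per KV head, per cell) can be written as below. The struct, the function name, and any model shape plugged into it are hypothetical and do not come from this PR; for hybrid models only the layers that actually keep a KV cache would count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model shape; none of these values come from the PR.
struct kv_shape {
    uint32_t n_layer;     // attention layers that keep a KV cache
    uint32_t n_head_kv;   // KV heads per layer
    uint32_t head_dim;    // dimension per head
    uint32_t bytes_elem;  // bytes per element, e.g. 2 for an f16 cache
};

// Bytes needed to hold K and V for n_cells cache cells.
inline uint64_t kv_bytes(const kv_shape & s, uint64_t n_cells) {
    assert(s.bytes_elem > 0);
    // The leading factor 2 accounts for storing both K and V.
    return 2ull * s.n_layer * s.n_head_kv * s.head_dim * s.bytes_elem * n_cells;
}
```

For an illustrative shape of 32 layers, 8 KV heads, head dimension 128 and f16 elements, 131072 cells need 16 GiB, while 256 cells need 32 MiB.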
--kv-dynamic starts the cache at 256 cells and grows as needed:
prepare() fail → try_resize() → retry in the same init_batch() call
Grow-only for now. Shrink is out of scope for this PR.
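The retry flow above can be illustrated with a self-contained mock. The method names mirror the description, but the struct, the signatures, and the bodies are assumptions made for this sketch and are not the actual llama.cpp code; the resize policy is reduced to plain doubling.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the real cache; only the control flow is meant to match.
struct mock_kv_cache {
    uint32_t n_cells;  // current cache size in cells
    uint32_t n_used;   // occupied cells
    uint32_t n_max;    // upper bound (what -c would allow)

    // Succeeds only if the batch fits into the free cells.
    bool prepare(uint32_t n_tokens) const {
        return (uint64_t) n_used + n_tokens <= n_cells;
    }

    // Grow-only resize: doubles the size, clamped at n_max.
    bool try_resize() {
        if (n_cells >= n_max) {
            return false; // already at the upper bound
        }
        const uint64_t doubled = 2ull * n_cells;
        n_cells = doubled < n_max ? (uint32_t) doubled : n_max;
        return true;
    }

    // prepare() fail -> try_resize() -> retry, all within one call.
    bool init_batch(uint32_t n_tokens) {
        assert(n_cells > 0); // doubling from zero would never make progress
        while (!prepare(n_tokens)) {
            if (!try_resize()) {
                return false; // at the upper bound and still no room
            }
        }
        n_used += n_tokens;
        return true;
    }
};
```

Starting from 256 cells with an upper bound of 1024, a first batch of 200 tokens fits as-is, a second batch of 200 triggers one resize to 512 cells, and a batch that would exceed 1024 cells fails without changing the occupied count.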
Related to Feature Request: resize an existing context #11577.
Additional information
This is a draft — looking for feedback on direction before iterating further:
Earlier experiments on Apple M4 (32 GB, Qwen3.5-27B-Q4_K_M, -c 131072):
In earlier local runs, the vanilla path frequently OOMed while the dynamic path completed without OOM.
Requirements