From ec7dc532ddc86f36384a37c8c3f3458144daf331 Mon Sep 17 00:00:00 2001
From: "Syring, Nikolas" <nikolas.syring@rittec.de>
Date: Wed, 13 May 2026 11:20:47 +0200
Subject: [PATCH] docs(agent-runtime): document the prompt-budget ratio
 tradeoff for large-context models
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Both `docs/app/admin-agent-runtime.md` and `docs/agent/onscreen-agent-runtime.md`
already describe `prompt_budget_ratios` and the part-level trimming planner,
but neither explains a tradeoff that surprises operators of large-context
local backends: the default split `system: 30, history: 40, transient: 30`
caps the per-section history sub-budget at 40% of `max_tokens` regardless of
how large `max_tokens` itself is. On a backend with `max_tokens: 256000` the
history sub-budget is 102_400 tokens, so the trimmer fires middle-of-history
once the conversation crosses that line even though the total prompt is
still nowhere near the model's real context ceiling.

This is by design — the per-section caps protect skill catalogs and
transient runtime context from being pushed out by an unbounded history —
but the resulting trim cuts visibly slow long Talk sessions even with the
prefix prompt-cache stability fixes landed in the long-message placeholder
helpers. Operators whose admin or overlay chat rarely loads heavy custom
system prompts or transient sections can raise `history` against
`system` and `transient` (for example `history: 70, system: 15,
transient: 15`) so trimming only starts once the history itself exceeds
70% of `max_tokens`.

Add a short paragraph to both runtime docs next to the existing ratio
descriptions documenting this tradeoff and the recommended override
shape, and reaffirm the existing total-of-`100` constraint. No code
changes; doc-only update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../_core/documentation/docs/agent/onscreen-agent-runtime.md    | 2 ++
 .../mod/_core/documentation/docs/app/admin-agent-runtime.md     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/app/L0/_all/mod/_core/documentation/docs/agent/onscreen-agent-runtime.md b/app/L0/_all/mod/_core/documentation/docs/agent/onscreen-agent-runtime.md
index 30d18b95..79f039e3 100644
--- a/app/L0/_all/mod/_core/documentation/docs/agent/onscreen-agent-runtime.md
+++ b/app/L0/_all/mod/_core/documentation/docs/agent/onscreen-agent-runtime.md
@@ -52,6 +52,8 @@ Important config fields:
 
 `prompt_budget_ratios` drives the shared prompt-budget builder: `system`, `history`, and `transient` split the configured `max_tokens`, while `singleMessage` caps any one live history message as a percentage of the history budget before part-level trimming runs. Prepared prompt entries and prompt items now carry cached token metadata so rebuilds can reuse counts for the same bodies. Part-level trimming then uses one shared planner pass: contributor trims must each be at least `250` tokens, and when that is not possible for `system` or `transient`, the runtime trims one combined section body instead of applying tiny contributor cuts.
 
+The default split (`system: 30, history: 40, transient: 30`) reserves equal headroom for skill catalogs and transient runtime context, but it also caps the history sub-budget at 40% of `max_tokens`. On models with very large context windows (for example a local backend with `max_tokens: 256000`), this can trigger middle-of-history trimming long before the total prompt approaches the model's real ceiling — the history sub-budget hits its 40% slice first while system and transient still have plenty of headroom. If the overlay rarely loads heavy custom system prompts or transient sections and the goal is to keep more raw conversation history before trimming starts, raise `history` against `system` and `transient`, for example `history: 70, system: 15, transient: 15`. The three values must still total `100`.
+
 Important browser UI state fields:
 
 - `agent_x`, `agent_y`
diff --git a/app/L0/_all/mod/_core/documentation/docs/app/admin-agent-runtime.md b/app/L0/_all/mod/_core/documentation/docs/app/admin-agent-runtime.md
index 458c5902..fca1d29f 100644
--- a/app/L0/_all/mod/_core/documentation/docs/app/admin-agent-runtime.md
+++ b/app/L0/_all/mod/_core/documentation/docs/app/admin-agent-runtime.md
@@ -64,6 +64,8 @@ The admin settings modal now starts with a provider switch:
 
 Below those provider-specific sections, the shared settings area also exposes `max_tokens`, prompt-budget ratios for `system`, `history`, and `transient`, plus the separate single-history-message ratio used by the shared trimming path. Those values are persisted in `prompt_budget_ratios` and feed the same prompt-budget builder used by the onscreen agent: prepared entries and prompt items reuse cached token counts, single live history messages are capped first, contributor-level trims must each be at least `250` tokens, and `system` or `transient` falls back to one combined section-body trim when smaller contributor cuts would otherwise be required.
 
+The default split (`system: 30, history: 40, transient: 30`) reserves equal headroom for skill catalogs and transient runtime context, but it also caps the history sub-budget at 40% of `max_tokens`. On models with very large context windows (for example a local backend with `max_tokens: 256000`), this can trigger middle-of-history trimming long before the total prompt approaches the model's real ceiling — the history sub-budget hits its 40% slice first while system and transient still have plenty of headroom. If the admin chat rarely loads heavy custom system prompts or transient sections and the goal is to keep more raw conversation history before trimming starts, raise `history` against `system` and `transient`, for example `history: 70, system: 15, transient: 15`. The three values must still total `100`.
+
 When no local model is selected and saved models exist, the admin local panel preselects the browser-wide last successfully loaded saved model from `_core/huggingface/manager.js`, falling back to the first saved entry if that last-used entry was discarded. When no local model is selected, no local model is loaded, and the shared saved-model list is empty, the admin local panel prefills the Hugging Face model field with the same testing-page default: `onnx-community/gemma-4-E4B-it-ONNX`.
 
 The stored config keeps both API settings and the selected local provider state: