fix(engine): keep auto-compaction working on sub-500K self-hosted windows #2060

Open
h3c-hexin wants to merge 1 commit into Hmbown:main from h3c-hexin:fix/self-hosted-context-budget

@h3c-hexin
Contributor

Problem

context_input_budget computes the internal input-token budget used by the preflight check, emergency context recovery, and capacity trimming. It reserved the full TURN_MAX_OUTPUT_TOKENS (262K) for output regardless of window size:

window.checked_sub(output)            // output = 262_144
      .and_then(|v| v.checked_sub(CONTEXT_HEADROOM_TOKENS))

For a self-hosted model whose window is below that reservation (e.g. vLLM serving Qwen with a 256K window), the math is 256_000 - 262_144 - 1_024: the first checked_sub underflows, so the whole expression evaluates to None.

A None budget makes every caller treat the session as having no budget to enforce, so it silently disables all preflight and emergency context-recovery paths. The session never compacts and runs until the provider hard-rejects the request on context length.
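
The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the engine's code: the constants are taken from the figures quoted in this description, and `old_context_input_budget` and `needs_compaction` are hypothetical stand-ins for the pre-fix budget function and one of its callers.

```rust
// Values quoted in the PR text; the real definitions live in the engine.
const TURN_MAX_OUTPUT_TOKENS: u32 = 262_144;
const CONTEXT_HEADROOM_TOKENS: u32 = 1_024;

/// Pre-fix shape of the budget (hypothetical name): always reserve the full
/// output cap, then the fixed headroom. Any underflow collapses to `None`.
fn old_context_input_budget(window: u32) -> Option<u32> {
    window
        .checked_sub(TURN_MAX_OUTPUT_TOKENS)
        .and_then(|v| v.checked_sub(CONTEXT_HEADROOM_TOKENS))
}

/// Hypothetical caller: with no budget there is nothing to compare against,
/// so the check never asks for compaction.
fn needs_compaction(window: u32, used_tokens: u32) -> bool {
    match old_context_input_budget(window) {
        Some(budget) => used_tokens > budget,
        None => false, // treated as "no budget to enforce"
    }
}
```

Under these assumptions a 1M window yields `Some(736_832)`, while a 256_000 window yields `None`, and `needs_compaction(256_000, 255_000)` stays `false` even though the session is almost full.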

Fix

Two coupled changes:

  1. context_window_for_model: apply the explicit _Nk suffix hint for any vendor, not just DeepSeek. A self-hosted --served-model-name like qwen3-32b-256k is the only window signal available for non-DeepSeek/Claude models; without this its window resolves to None (and budgeting is disabled for a different reason). Renamed the helper deepseek_context_window_hint -> explicit_context_window_hint since it is now vendor-agnostic.

  2. context_input_budget — tier the reserved-output term by window size:

    • >= 500K (V4-class): keep the full 262K headroom — preserves the existing "leave room for interleaved thinking" contract.
    • < 500K (smaller / self-hosted): reserve effective_max_output_tokens (what the API actually caps output at), which yields a usable positive budget.

    Also dropped the now-vestigial requested_output_tokens parameter — every caller passed the same constant.
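
The two changes above can be sketched together as follows. This is an illustration under stated assumptions rather than the code in this PR: the signatures are simplified, the suffix parser accepts a trailing `<digits>k` segment after `-` or `_` and treats `k` as 1_000 (matching the 256_000 figure above), and the tier boundary is assumed to be exactly 500_000 tokens.

```rust
// Values quoted in the PR text.
const TURN_MAX_OUTPUT_TOKENS: u32 = 262_144;
const CONTEXT_HEADROOM_TOKENS: u32 = 1_024;
// Assumption: the ">= 500K" tier boundary is 500_000 tokens.
const LARGE_WINDOW_THRESHOLD: u32 = 500_000;

/// Hypothetical vendor-agnostic hint: read a trailing "<digits>k" segment
/// from the model name, e.g. "qwen3-32b-256k" -> 256_000.
fn explicit_context_window_hint(model: &str) -> Option<u32> {
    let lower = model.to_ascii_lowercase();
    // Last segment after '-' or '_' (the separator set is an assumption).
    let last = lower.rsplit(|c: char| c == '-' || c == '_').next()?;
    let digits = last.strip_suffix('k')?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse::<u32>().ok()?.checked_mul(1_000)
}

/// Tiered budget: large windows keep the full 262K output reservation,
/// smaller windows reserve only what the API actually caps output at.
fn context_input_budget(window: u32, effective_max_output_tokens: u32) -> Option<u32> {
    let reserved_output = if window >= LARGE_WINDOW_THRESHOLD {
        TURN_MAX_OUTPUT_TOKENS
    } else {
        effective_max_output_tokens
    };
    window
        .checked_sub(reserved_output)
        .and_then(|v| v.checked_sub(CONTEXT_HEADROOM_TOKENS))
}
```

With an illustrative `effective_max_output_tokens` of 8_192 (not a value from this PR), a 256_000 window gives `Some(246_784)` instead of `None`, while a 1M window still gives `Some(736_832)`.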

Scope

  • Behavior change is confined to the budget tiering + the window-hint vendor scope.
  • Signature cleanup ripples to the 3 call sites (engine.rs, capacity_flow.rs, turn_loop.rs).
  • No trust-boundary surface (no auth / sandbox / publishing / prompts).

Tests

internal_context_budget_tiers_reserved_output_by_window is updated to pin both branches: V4 (>=500K) still reserves 262K, and a 256K self-hosted window now yields a positive budget instead of None.

cargo test -p codewhale-tui --bins context_budget internal_context_budget

🤖 Generated with Claude Code

…dows

`context_input_budget` reserved the full TURN_MAX_OUTPUT_TOKENS (262K) for
output regardless of window size. For a self-hosted model with a window
below that reservation (e.g. a 256K vLLM Qwen deployment) the math was
`256K - 262K - 1K`, which underflows `checked_sub` to `None`. A `None`
budget silently disables every preflight check and emergency context
recovery path, so the session never compacts and runs until the provider
hard-rejects on context length.

Two coupled fixes:

1. `context_window_for_model`: apply the explicit `_Nk` suffix hint for any
   vendor, not just DeepSeek. A self-hosted served-model-name like
   `qwen3-32b-256k` is the only window signal we have for non-DeepSeek/Claude
   models; without this its window resolves to `None`. Renamed the helper
   `deepseek_context_window_hint` -> `explicit_context_window_hint` since it
   is now vendor-agnostic.

2. `context_input_budget`: tier the reserved-output term by window —
   `>= 500K` keeps the full 262K headroom (preserves the V4 interleaved-
   thinking contract), `< 500K` falls back to `effective_max_output_tokens`
   (what the API actually caps output at), yielding a usable positive budget.
   Dropped the vestigial `requested_output_tokens` parameter (every caller
   passed the same constant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the context budget calculation to prevent underflow on models with smaller context windows, such as self-hosted 256K deployments. It introduces a 500K token threshold to tier the reserved output headroom and makes the context window hint logic vendor-agnostic by allowing any model name with an '_Nk' suffix to be recognized. Additionally, function signatures for context recovery and budget calculation were simplified by removing redundant parameters. I have no feedback to provide as there were no review comments to assess.
