fix(engine): keep auto-compaction working on sub-500K self-hosted windows #2060
Open
h3c-hexin wants to merge 1 commit into
…dows

`context_input_budget` reserved the full TURN_MAX_OUTPUT_TOKENS (262K) for output regardless of window size. For a self-hosted model with a window below that reservation (e.g. a 256K vLLM Qwen deployment) the math was `256K - 262K - 1K`, which underflows `checked_sub` to `None`. A `None` budget silently disables every preflight check and emergency context recovery path, so the session never compacts and runs until the provider hard-rejects on context length.

Two coupled fixes:

1. `context_window_for_model`: apply the explicit `_Nk` suffix hint for any vendor, not just DeepSeek. A self-hosted served-model-name like `qwen3-32b-256k` is the only window signal we have for non-DeepSeek/Claude models; without this its window resolves to `None`. Renamed the helper `deepseek_context_window_hint` -> `explicit_context_window_hint` since it is now vendor-agnostic.

2. `context_input_budget`: tier the reserved-output term by window. `>= 500K` keeps the full 262K headroom (preserves the V4 interleaved-thinking contract); `< 500K` falls back to `effective_max_output_tokens` (what the API actually caps output at), yielding a usable positive budget. Dropped the vestigial `requested_output_tokens` parameter (every caller passed the same constant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
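To make the underflow concrete, here is a minimal sketch of the pre-fix arithmetic. It is not the code from `engine.rs`: the function name `old_input_budget` and the constant name `SAFETY_MARGIN_TOKENS` are hypothetical, and the values 262_144 and 1_024 are taken from the arithmetic quoted in the PR description.

```rust
// Value quoted in the PR description for the full output reservation.
const TURN_MAX_OUTPUT_TOKENS: u32 = 262_144;
// Hypothetical name for the extra 1_024-token margin in the quoted math.
const SAFETY_MARGIN_TOKENS: u32 = 1_024;

/// Hypothetical stand-in for the pre-fix budget: always reserve the full
/// output ceiling, whatever the window size is.
fn old_input_budget(window: u32) -> Option<u32> {
    window
        // Yields None as soon as the window is smaller than the reservation.
        .checked_sub(TURN_MAX_OUTPUT_TOKENS)?
        .checked_sub(SAFETY_MARGIN_TOKENS)
}
```

With these assumed values, `old_input_budget(256_000)` returns `None`, while `old_input_budget(1_000_000)` returns `Some(736_832)`. Callers that read `None` as "no budget to enforce" would skip compaction in the first case.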
Contributor
Code Review
This pull request refactors the context budget calculation to prevent underflow on models with smaller context windows, such as self-hosted 256K deployments. It introduces a 500K token threshold to tier the reserved output headroom and makes the context window hint logic vendor-agnostic by allowing any model name with an '_Nk' suffix to be recognized. Additionally, function signatures for context recovery and budget calculation were simplified by removing redundant parameters. I have no feedback to provide as there were no review comments to assess.
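A vendor-agnostic suffix hint of the kind summarized in this review might look roughly like the sketch below. The body is an assumption, not the PR's implementation: it accepts either `-` or `_` before the `Nk` segment and interprets `N` as thousands of tokens, which matches the `256_000` figure used in the PR description for a 256K window.

```rust
/// Hypothetical sketch: read a trailing `<N>k` segment of a model name as a
/// context-window hint, e.g. "qwen3-32b-256k" -> Some(256_000).
fn explicit_context_window_hint(model: &str) -> Option<u32> {
    // Last segment after a '-' or '_' separator (assumed separators).
    let last = model.rsplit(|c: char| c == '-' || c == '_').next()?;
    // The segment must end in 'k' or 'K'.
    let digits = last.strip_suffix('k').or_else(|| last.strip_suffix('K'))?;
    // Everything before the 'k' must be plain ASCII digits.
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    // Interpret N as thousands of tokens; overflow yields None.
    digits.parse::<u32>().ok()?.checked_mul(1_000)
}
```

Under these assumptions a name without the suffix, such as `qwen3-32b`, yields `None`, so the caller still has no window signal for it.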
Problem

`context_input_budget` computes the internal input-token budget used by the preflight check, emergency context recovery, and capacity trimming. It reserved the full `TURN_MAX_OUTPUT_TOKENS` (262K) for output regardless of window size.

For a self-hosted model whose window is below that reservation (e.g. vLLM serving Qwen with a 256K window), the math is `256_000 - 262_144 - 1_024`, which underflows `checked_sub` to `None`.

A `None` budget makes every caller treat the session as having no budget to enforce, so it silently disables all preflight and emergency context-recovery paths. The session never compacts and runs until the provider hard-rejects the request on context length.

Fix

Two coupled changes:

1. `context_window_for_model`: apply the explicit `_Nk` suffix hint for any vendor, not just DeepSeek. A self-hosted `--served-model-name` like `qwen3-32b-256k` is the only window signal available for non-DeepSeek/Claude models; without this its window resolves to `None` (and budgeting is disabled for a different reason). Renamed the helper `deepseek_context_window_hint` → `explicit_context_window_hint` since it is now vendor-agnostic.
2. `context_input_budget`: tier the reserved-output term by window size:
   - `>= 500K` (V4-class): keep the full 262K headroom, which preserves the existing "leave room for interleaved thinking" contract.
   - `< 500K` (smaller / self-hosted): reserve `effective_max_output_tokens` (what the API actually caps output at), which yields a usable positive budget.

Also dropped the now-vestigial `requested_output_tokens` parameter; every caller passed the same constant.

Scope

…`engine.rs`, `capacity_flow.rs`, `turn_loop.rs`).

Tests

`internal_context_budget_tiers_reserved_output_by_window` is updated to pin both branches: V4 (`>= 500K`) still reserves 262K, and a 256K self-hosted window now yields a positive budget instead of `None`.

🤖 Generated with Claude Code
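The two branches that the test pins can be sketched as follows. This is an illustration under stated assumptions, not the function from `engine.rs`: the real signature is not shown in this PR text, so the parameters here are hypothetical, as are the constant names `LARGE_WINDOW_THRESHOLD` and `SAFETY_MARGIN_TOKENS`.

```rust
// Value quoted in the PR description for the full output reservation.
const TURN_MAX_OUTPUT_TOKENS: u32 = 262_144;
// Hypothetical names for the 500K tier boundary and the 1_024-token margin.
const LARGE_WINDOW_THRESHOLD: u32 = 500_000;
const SAFETY_MARGIN_TOKENS: u32 = 1_024;

/// Hypothetical sketch of the tiered budget: large windows keep the full
/// output headroom, smaller windows reserve only the effective output cap.
fn context_input_budget(window: u32, effective_max_output_tokens: u32) -> Option<u32> {
    let reserved_output = if window >= LARGE_WINDOW_THRESHOLD {
        TURN_MAX_OUTPUT_TOKENS
    } else {
        effective_max_output_tokens
    };
    // Still checked: a cap larger than the window yields None, not a wrap.
    window
        .checked_sub(reserved_output)?
        .checked_sub(SAFETY_MARGIN_TOKENS)
}
```

Assuming an effective output cap of 8_192 tokens (an illustrative value), a 1_000_000-token window still yields `Some(736_832)` and a 256_000-token window yields `Some(246_784)` rather than `None`. Keeping `checked_sub` means a misconfigured cap above the window size still surfaces as `None` instead of wrapping.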