Description
I’m using ART with a local Ollama server as the inference backend (for both the agent model and the judge models). I’ve configured my Ollama model with a context window well above 8192 tokens (e.g. `num_ctx: 16384`) and adjusted `num_predict` accordingly.
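For reference, the context window was raised in the model's Modelfile (parameter names per Ollama's Modelfile format; the exact values here are illustrative):

```
FROM qwen2.5
# Raise the context window beyond the 8192 default
PARAMETER num_ctx 16384
# Allow longer completions to match
PARAMETER num_predict 2048
```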
However, in many runs I still get errors like:

```
token count exceeded 8192
```
This happens even though the model's context window is explicitly configured above 8192 and `num_predict` is adjusted to match. This suggests there is a hardcoded or implicit maximum of 8192 tokens somewhere in ART/RULER, or in how token counts are computed, independent of the model's actual context window.
What I expect
- ART should respect the context window of the underlying model, or the configured `num_ctx`, when running through Ollama.
- If a hard limit exists (e.g. 8192), it should be either:
  - documented and configurable; or
  - derived from the model's metadata, not hard-coded.
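As a sketch of the "derived from metadata" option: Ollama's `/api/show` response includes a `model_info` map with an architecture-prefixed `context_length` key (e.g. `qwen2.context_length`), so the limit could be looked up there instead of assumed. A minimal illustration (the function name and fallback behavior are my own, not ART's API):

```python
def effective_context_limit(model_info: dict, fallback: int = 8192) -> int:
    """Pick the token limit from backend metadata instead of hard-coding it.

    `model_info` is assumed to look like the "model_info" map returned by
    Ollama's /api/show, which carries an architecture-prefixed key such as
    "qwen2.context_length".
    """
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return int(value)
    # Documented default, used only when the backend reports nothing.
    return fallback


# Metadata present vs. absent:
print(effective_context_limit({"qwen2.context_length": 32768}))  # 32768
print(effective_context_limit({}))  # 8192
```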
What actually happens
Runs fail with `token count exceeded 8192`, even though the model is configured with a larger context window.
Environment
- Backend: Local Ollama server
- Model: Qwen / other Ollama-hosted model (with `num_ctx` > 8192)
- ART: latest version (as of date of issue)
- Using ART’s LangGraph integration (`init_chat_model`) and RULER scoring
Questions / Requests
- Is there an internal default limit of 8192 tokens that’s applied regardless of the model’s context?
- Can you expose this limit via configuration, or derive it from the model / backend rather than hardcoding?
- Any guidance on how to set up ART/RULER so that it fully respects Ollama’s larger `num_ctx`?