Dynamically adjusting reasoning-budget per chat prediction in llama.cpp server
#21445
-
|
I have a feeling the answer may be a no at the moment, but I wanted to ask in case I missed something. I have recently encountered a use case where I want the reasoning budget to be smaller or bigger depending on the prediction that is taking place. For example, for a one-off classification task I want to allow a much bigger reasoning budget, while for a typical user-chat response I want a much smaller budget, to prevent the user from waiting a very long time because Qwen decided to spiral into a reasoning frenzy. The chat isn't an accuracy-critical reasoning task, so interrupted reasoning is acceptable in this case (Qwen usually performs its most valuable reasoning in the first X amount of tokens; after that it tends to be useless overthinking, so early termination is actually very effective). So my question is: is there a way to apply a reasoning budget per request?
Replies: 2 comments 5 replies
-
|
Yes, this would be nice to have. Also, in the Qwen3.5 chat template there is an annoying thing: the <
-
|
You're looking for thinking_budget_tokens. You can include it in the request body ({"thinking_budget_tokens": N}) as long as you haven't specified a budget on the command line.

Relevant code: llama.cpp/tools/server/server-common.cpp, lines 1107 to 1119 in 0fcb376
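A minimal client-side sketch of the per-request pattern the question asks for: pick a budget based on the kind of task, then send it in the request body. Only the thinking_budget_tokens field comes from the reply above; the endpoint URL, budget values, and task names are illustrative assumptions, and the field is only honored when no budget was fixed on the server command line.

```python
# Sketch: choose a per-request reasoning budget and include it in the body
# sent to a llama.cpp server's OpenAI-compatible chat endpoint.
# ASSUMPTIONS: the budget numbers, task names, and URL below are illustrative;
# thinking_budget_tokens is the field named in the reply above.
import json
import urllib.request

# Generous budget for a one-off classification task; tight budget for an
# interactive chat turn, where long reasoning keeps the user waiting.
BUDGETS = {"classification": 4096, "chat": 256}

def build_request(task, messages):
    """Return a chat-completion request body with a task-specific budget."""
    return {
        "messages": messages,
        # Honored only if no budget was specified on the server command line.
        "thinking_budget_tokens": BUDGETS[task],
    }

def send(body, url="http://localhost:8080/v1/chat/completions"):
    """POST the body to a running llama-server instance and return the JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Build a chat-turn request with the small budget (send() requires a server).
body = build_request("chat", [{"role": "user", "content": "Hi!"}])
```

The same client can then swap in "classification" for the occasional heavy task without restarting or reconfiguring the server.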