Dynamically adjusting reasoning-budget per chat prediction in llama.cpp server
#21445
-
|
I have a feeling the answer may be a no at the moment, but I wanted to ask in case I missed something. I have recently encountered a use case where I want the reasoning budget to be smaller or bigger depending on the prediction that is taking place. For example, for a one-off classification task I want to allow a much bigger reasoning budget, while for a typical user-chat response I want a much smaller budget, to prevent the user from waiting a very long time because Qwen decided to spiral into a reasoning frenzy. The chat isn't an accuracy-critical reasoning task, so interrupted reasoning is acceptable in this case (Qwen usually performs its most valuable reasoning in the first X amount of tokens; after that it tends to be useless overthinking, so early termination is actually very effective). So my question is: is there a way to apply a reasoning budget per request?
Replies: 2 comments 5 replies
-
|
Yes, this would be nice to have. Also, in the Qwen3.5 chat template there is an annoying thing: the <
-
|
You're looking for thinking_budget_tokens. You can include it in the request body ({"thinking_budget_tokens": N}) as long as you haven't specified a budget on the command line.

Relevant code: llama.cpp/tools/server/server-common.cpp, lines 1107 to 1119 in 0fcb376
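A minimal client-side sketch of the per-request pattern the question asks for: pick a budget based on the kind of task, then send it in the request body. Only the thinking_budget_tokens field comes from the reply above; the endpoint URL, budget values, and task names are illustrative assumptions, and the field is only honored when no budget was fixed on the server command line.

```python
# Sketch: choose a per-request reasoning budget and include it in the body
# sent to a llama.cpp server's OpenAI-compatible chat endpoint.
# ASSUMPTIONS: the budget numbers, task names, and URL below are illustrative;
# thinking_budget_tokens is the field named in the reply above.
import json
import urllib.request

# Generous budget for a one-off classification task; tight budget for an
# interactive chat turn, where long reasoning keeps the user waiting.
BUDGETS = {"classification": 4096, "chat": 256}

def build_request(task, messages):
    """Return a chat-completion request body with a task-specific budget."""
    return {
        "messages": messages,
        # Honored only if no budget was specified on the server command line.
        "thinking_budget_tokens": BUDGETS[task],
    }

def send(body, url="http://localhost:8080/v1/chat/completions"):
    """POST the body to a running llama-server instance and return the JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Build a chat-turn request with the small budget (send() requires a server).
body = build_request("chat", [{"role": "user", "content": "Hi!"}])
```

The same client can then swap in "classification" for the occasional heavy task without restarting or reconfiguring the server.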