UPSTREAM PR #21216: common : simplify autoparser tagged parser rules#1335
Conversation
Overview

Analysis of 125,160 functions across 15 binaries identified 35 modified, 0 new, 2 removed, and 125,123 unchanged functions. Changes focus on chat auto-parser refactoring with minimal overall performance impact.

Power Consumption Changes:
Function Analysis

build_tool_parser_tag_tagged (chat-auto-parser-generator.cpp, llama-cvector-generator & llama-tts):
Lambda operator (chat-auto-parser-generator.cpp, both binaries):
HTTP utilities (cpp-httplib vendor library):
Other analyzed functions (std::function templates, std::pair constructor, HTTP Put) show negligible changes from compiler optimizations.

Additional Findings

Core inference unchanged: Matrix multiplication, attention mechanisms, quantization, and GPU backends show 0.000% power consumption change. All modifications are isolated to non-critical utility code (parser construction, HTTP server operations). The refactoring prioritizes correctness (fixes issue #20650) and maintainability over micro-optimizations in initialization code, with no impact on token generation or model inference performance.

💬 Questions? Tag @loci-dev
385b1fc to 06d9e10
7638ab4 to f1b46d5
Note
Source pull request: ggml-org/llama.cpp#21216
Overview
- Remove `common_schema_info` implementation. (Reverted.)
- Revert `.p.end()` usage now that we have a lenient parsing mode.
- Stop emitting `{0, N}`, which causes the explosion of repetition rules seen in #20879 and #20867.

Fixes #20879
Fixes #20867
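The `{0, N}` explosion the PR fixes can be illustrated with a small sketch. This is a hypothetical helper, not llama.cpp's actual grammar compiler: it lowers a bounded repetition `a{0,N}` into nested optional rules, producing one generated rule per repetition level, so rule count grows with N and multiplies when several tagged rules are composed.

```python
def expand_bounded_repetition(symbol: str, max_n: int) -> list[str]:
    # Illustrative lowering of symbol{0,max_n} into nested optional rules:
    #   sym-rep-1 ::= (sym sym-rep-2)?
    #   ...
    #   sym-rep-N ::= (sym)?
    # One rule is emitted per repetition level, linear in max_n.
    rules = []
    for i in range(max_n, 0, -1):
        inner = f" {symbol}-rep-{i + 1}" if i < max_n else ""
        rules.append(f"{symbol}-rep-{i} ::= ({symbol}{inner})?")
    return rules

for rule in expand_bounded_repetition("tag", 3):
    print(rule)
```

With a large N (or many such repetitions nested inside each other), the generated rule set balloons, which is the kind of repetition-rule explosion reported in #20879 and #20867.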
Additional Information
I believe I found the culprit for the various parsing errors, although this is yet to be confirmed.
When `MAX_REPETITION_THRESHOLD` is hit, the grammar fails to parse, but the request continues anyway. The server should return an error if the grammar is problematic: without a functioning grammar, there's nothing to constrain the output, so the result can't be reliably parsed.

My best guess is that trailing whitespace is what breaks the parsing. The Qwen models output `<tool_call>\n` when executing parallel tool calls. This would also explain why parsing fails when `parallel_tool_calls = false`: if the grammar isn't functional, the model may emit more than one tool call, and the parser fails because it was generated to support only a single call.
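The failure mode described above can be sketched as follows. All names here are illustrative, not the actual llama.cpp server code: the point is the behavior being argued for, namely rejecting the request when grammar construction fails instead of continuing to sample unconstrained.

```python
# Assumed limit on expanded repetition rules (value is illustrative).
MAX_REPETITION_THRESHOLD = 1024

def build_grammar(num_repetition_rules: int):
    # Stand-in for grammar construction: fail when the rule count
    # exceeds the threshold, as described in the comment above.
    if num_repetition_rules > MAX_REPETITION_THRESHOLD:
        return None  # grammar construction failed
    return {"rules": num_repetition_rules}

def handle_request(num_repetition_rules: int) -> dict:
    grammar = build_grammar(num_repetition_rules)
    if grammar is None:
        # Return an error instead of continuing without constraints;
        # an unconstrained completion can't be reliably parsed later.
        return {"status": 400, "error": "grammar exceeds repetition threshold"}
    return {"status": 200, "grammar": grammar}

print(handle_request(2000)["status"])  # oversized grammar -> request rejected
print(handle_request(10)["status"])    # valid grammar -> request proceeds
```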