
UPSTREAM PR #21216: common : simplify autoparser tagged parser rules#1335

Open
loci-dev wants to merge 9 commits into main from loci/pr-21216-autoparser-tagged-simplify-rules

Conversation


@loci-dev loci-dev commented Apr 6, 2026

Note

Source pull request: ggml-org/llama.cpp#21216

Overview

  • Move the string arg parser to a dedicated rule to simplify the generated grammar.
  • Swap out the type resolution for the common_schema_info implementation.
  • Remove p.end() usage now that we have a lenient parsing mode. (Reverted.)
  • Make optional arguments unbounded. Previously they used a bounded repetition {0, N}, which caused the explosion of repetition rules seen in #20879 and #20867.
  • Fix uninitialized required arguments set.

Fixes #20879
Fixes #20867
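
The repetition-rule explosion mentioned above can be illustrated with a toy sketch (Python, not llama.cpp code; the helper names are made up): lowering a bounded repetition {0, N} GBNF-style emits roughly one helper rule per count, while an unbounded repetition needs only a single self-recursive rule.

```python
# Toy illustration of the rule explosion (hypothetical helpers, not the
# llama.cpp grammar generator): item{0, n} lowers into n chained optional
# rules, whereas item* lowers into one self-recursive optional rule.

def expand_bounded(item: str, n: int) -> list[str]:
    """Lower item{0, n} into n chained optional helper rules."""
    rules = []
    for i in range(1, n + 1):
        prev = f" {item}-rep-{i - 1}" if i > 1 else ""
        rules.append(f"{item}-rep-{i} ::= ({item}{prev})?")
    return rules

def expand_unbounded(item: str) -> list[str]:
    """Lower item* into a single self-recursive optional rule."""
    return [f"{item}-star ::= ({item} {item}-star)?"]

print(len(expand_bounded("arg", 32)))  # 32 helper rules
print(len(expand_unbounded("arg")))    # 1 rule
```

The rule count for the bounded form grows linearly with N (and multiplies across nested optionals), which is how a threshold like MAX_REPETITION_THRESHOLD gets hit.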

Additional Information

I believe I found the culprit for the various parsing errors, although this has yet to be confirmed.

When MAX_REPETITION_THRESHOLD is hit, the grammar fails to parse but the request continues anyway. The server should return an error if the grammar is problematic. Without a functioning grammar, there's nothing to constrain the output, so the result can't be reliably parsed.
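
A minimal sketch of the behavior argued for here (hypothetical handler and callback names, not the llama.cpp server API): fail the request when the grammar cannot be built, instead of silently generating unconstrained output.

```python
# Hypothetical sketch (not the llama.cpp server API): reject the request
# when grammar construction fails rather than sampling unconstrained.

def handle_request(grammar_src, compile_grammar, generate):
    grammar = compile_grammar(grammar_src)  # assumed to return None on failure
    if grammar is None:
        # Surface the problem instead of continuing without constraints.
        return {"error": "grammar failed to compile; refusing unconstrained sampling"}
    return {"result": generate(grammar)}

# A failing compile now yields an error instead of unparseable output.
resp = handle_request("bad ::= (", lambda src: None, lambda g: "...")
print(resp)
```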

My best guess is that trailing whitespace is what breaks the parsing. The Qwen models output <tool_call>\n when executing parallel tool calls. This would also explain why parsing fails when parallel_tool_calls = false: if the grammar isn't functional, the model may emit more than one tool call, and the parser fails because it was generated to support only a single call.
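
As an illustration of that hypothesis (a toy regex stand-in, not the actual generated grammar): a parser built to accept exactly one tool call rejects output carrying a second call and a trailing newline.

```python
import re

# Toy stand-in (a regex, not the real generated grammar) for a parser
# that was generated to support only a single tool call.
single_call = re.compile(r"<tool_call>\{.*?\}</tool_call>", re.S)

one = '<tool_call>{"name": "f"}</tool_call>'
# Parallel-call output as described above: a second block plus a
# trailing newline after the tool-call tag.
parallel = one + '\n<tool_call>{"name": "g"}</tool_call>\n'

print(bool(single_call.fullmatch(one)))       # True: one call parses
print(bool(single_call.fullmatch(parallel)))  # False: extra call + newline
```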

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - I had it write the additional PEG parser tests out of laziness. Everything else was done by hand.

@loci-review

loci-review Bot commented Apr 6, 2026

Overview

Analysis of 125,160 functions across 15 binaries identified 35 modified, 0 new, 2 removed, and 125,123 unchanged functions. Changes focus on chat auto-parser refactoring with minimal overall performance impact.

Power Consumption Changes:

  • build.bin.llama-tts: -0.09%
  • build.bin.llama-cvector-generator: +0.03%
  • build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so: ±0.00%
  • build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli: 0.00%

Function Analysis

build_tool_parser_tag_tagged (chat-auto-parser-generator.cpp, llama-cvector-generator & llama-tts):

  • Response time: +116.9 μs (+19.8-19.9%)
  • Throughput time: +143-154 ns (+24.2-25.4%)
  • Refactoring adds schema reference resolution, fixes flexible optional argument ordering (issue #20650), and centralizes type detection. New early function calls (until/rule parsers) add ~116 μs overhead but enable proper handling of complex JSON schemas. Changes occur during parser construction (initialization), not inference hot path.

Lambda operator (chat-auto-parser-generator.cpp, both binaries):

  • Response time: -50.9 μs (-3.7%) ✓
  • Throughput time: -611-613 ns (-24.4-24.6%) ✓
  • Centralized schema resolution through common_schema_info enables caching, reducing redundant type checking. Compiler optimizations (96-byte stack reduction, better register allocation) deliver performance gains despite added validation complexity.

HTTP utilities (cpp-httplib vendor library):

  • get_param_value_count: +60 ns (+71.4% throughput, +4.0% response)
  • operator-: +62 ns (+68.5% throughput, +55.1% response)
  • set_socket_opt_time: -29 ns (-26.8% throughput, -19.8% response) ✓
  • Compiler-generated code reorganization in vendored library. Absolute changes minimal (29-62 ns), not in inference path.

Other analyzed functions (std::function templates, std::pair constructor, HTTP Put) show negligible changes from compiler optimizations.

Additional Findings

Core inference unchanged: Matrix multiplication, attention mechanisms, quantization, and GPU backends show 0.000% power consumption change. All modifications isolated to non-critical utility code (parser construction, HTTP server operations). The refactoring prioritizes correctness (fixes issue #20650) and maintainability over micro-optimizations in initialization code, with no impact on token generation or model inference performance.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 385b1fc to 06d9e10 on April 13, 2026 at 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19