
UPSTREAM PR #21216: common : simplify autoparser tagged parser rules#1335

Open
loci-dev wants to merge 9 commits into main from loci/pr-21216-autoparser-tagged-simplify-rules

Conversation


@loci-dev loci-dev commented Apr 6, 2026

Note

Source pull request: ggml-org/llama.cpp#21216

Overview

  • Move the string arg parser to a dedicated rule to simplify the generated grammar.
  • Swap out the type resolution for the common_schema_info implementation.
  • Remove p.end() usage now that we have a lenient parsing mode. (Reverted.)
  • Make optional arguments unbounded. Previously they used a bounded repetition {0, N}, which caused the explosion of repetition rules seen in #20879 and #20867.
  • Fix uninitialized required arguments set.

Fixes #20879
Fixes #20867
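
The repetition-rule explosion mentioned above can be illustrated with a toy sketch (Python, not llama.cpp code; the helper names are made up): lowering a bounded repetition {0, N} GBNF-style emits roughly one helper rule per count, while an unbounded repetition needs only a single self-recursive rule.

```python
# Toy illustration of the rule explosion (hypothetical helpers, not the
# llama.cpp grammar generator): item{0, n} lowers into n chained optional
# rules, whereas item* lowers into one self-recursive optional rule.

def expand_bounded(item: str, n: int) -> list[str]:
    """Lower item{0, n} into n chained optional helper rules."""
    rules = []
    for i in range(1, n + 1):
        prev = f" {item}-rep-{i - 1}" if i > 1 else ""
        rules.append(f"{item}-rep-{i} ::= ({item}{prev})?")
    return rules

def expand_unbounded(item: str) -> list[str]:
    """Lower item* into a single self-recursive optional rule."""
    return [f"{item}-star ::= ({item} {item}-star)?"]

print(len(expand_bounded("arg", 32)))  # 32 helper rules
print(len(expand_unbounded("arg")))    # 1 rule
```

The rule count for the bounded form grows linearly with N (and multiplies across nested optionals), which is how a threshold like MAX_REPETITION_THRESHOLD gets hit.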

Additional Information

I believe I found the culprit for the various parsing errors, although this has yet to be confirmed.

When MAX_REPETITION_THRESHOLD is hit, the grammar fails to parse but the request continues anyway. The server should return an error if the grammar is problematic. Without a functioning grammar, there's nothing to constrain the output, so the result can't be reliably parsed.
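
A minimal sketch of the behavior argued for here (hypothetical handler and callback names, not the llama.cpp server API): fail the request when the grammar cannot be built, instead of silently generating unconstrained output.

```python
# Hypothetical sketch (not the llama.cpp server API): reject the request
# when grammar construction fails rather than sampling unconstrained.

def handle_request(grammar_src, compile_grammar, generate):
    grammar = compile_grammar(grammar_src)  # assumed to return None on failure
    if grammar is None:
        # Surface the problem instead of continuing without constraints.
        return {"error": "grammar failed to compile; refusing unconstrained sampling"}
    return {"result": generate(grammar)}

# A failing compile now yields an error instead of unparseable output.
resp = handle_request("bad ::= (", lambda src: None, lambda g: "...")
print(resp)
```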

My best guess is that trailing whitespace is what breaks the parsing. The Qwen models output <tool_call>\n when executing parallel tool calls. This would also explain why parsing fails when parallel_tool_calls = false: if the grammar isn't functional, the model may emit more than one tool call, and the parser fails because it was generated to support only a single call.
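
As an illustration of that hypothesis (a toy regex stand-in, not the actual generated grammar): a parser built to accept exactly one tool call rejects output carrying a second call and a trailing newline.

```python
import re

# Toy stand-in (a regex, not the real generated grammar) for a parser
# that was generated to support only a single tool call.
single_call = re.compile(r"<tool_call>\{.*?\}</tool_call>", re.S)

one = '<tool_call>{"name": "f"}</tool_call>'
# Parallel-call output as described above: a second block plus a
# trailing newline after the tool-call tag.
parallel = one + '\n<tool_call>{"name": "g"}</tool_call>\n'

print(bool(single_call.fullmatch(one)))       # True: one call parses
print(bool(single_call.fullmatch(parallel)))  # False: extra call + newline
```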

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - I had it write the additional PEG parser tests out of laziness. Everything else was done by hand.

@loci-review

loci-review Bot commented Apr 6, 2026

Overview

Analysis of 125,160 functions across 15 binaries identified 35 modified, 0 new, 2 removed, and 125,123 unchanged functions. Changes focus on chat auto-parser refactoring with minimal overall performance impact.

Power Consumption Changes:

  • build.bin.llama-tts: -0.09%
  • build.bin.llama-cvector-generator: +0.03%
  • build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so: ±0.00%
  • build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli: 0.00%

Function Analysis

build_tool_parser_tag_tagged (chat-auto-parser-generator.cpp, llama-cvector-generator & llama-tts):

  • Response time: +116.9 μs (+19.8-19.9%)
  • Throughput time: +143-154 ns (+24.2-25.4%)
  • Refactoring adds schema reference resolution, fixes flexible optional argument ordering (issue #20650), and centralizes type detection. New early function calls (until/rule parsers) add ~116 μs overhead but enable proper handling of complex JSON schemas. Changes occur during parser construction (initialization), not inference hot path.

Lambda operator (chat-auto-parser-generator.cpp, both binaries):

  • Response time: -50.9 μs (-3.7%) ✓
  • Throughput time: -611-613 ns (-24.4-24.6%) ✓
  • Centralized schema resolution through common_schema_info enables caching, reducing redundant type checking. Compiler optimizations (96-byte stack reduction, better register allocation) deliver performance gains despite added validation complexity.

HTTP utilities (cpp-httplib vendor library):

  • get_param_value_count: +60 ns (+71.4% throughput, +4.0% response)
  • operator-: +62 ns (+68.5% throughput, +55.1% response)
  • set_socket_opt_time: -29 ns (-26.8% throughput, -19.8% response) ✓
  • Compiler-generated code reorganization in vendored library. Absolute changes minimal (29-62 ns), not in inference path.

Other analyzed functions (std::function templates, std::pair constructor, HTTP Put) show negligible changes from compiler optimizations.

Additional Findings

Core inference unchanged: Matrix multiplication, attention mechanisms, quantization, and GPU backends show 0.000% power consumption change. All modifications isolated to non-critical utility code (parser construction, HTTP server operations). The refactoring prioritizes correctness (fixes issue #20650) and maintainability over micro-optimizations in initialization code, with no impact on token generation or model inference performance.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 385b1fc to 06d9e10 on April 13, 2026 at 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19