UPSTREAM PR #21242: fix: tool call parsing for LFM2 and LFM2.5 models #1325

Open

loci-dev wants to merge 2 commits into main from loci/pr-21242-fix-lfm2-lfm2-5-tool-calling
Conversation


@loci-dev loci-dev commented Apr 1, 2026

Note

Source pull request: ggml-org/llama.cpp#21242

Overview

Currently, LFM2 and LFM2.5 tool calling is broken in llama.cpp (issue ggml-org/llama.cpp#20245). Commit ggml-org/llama.cpp#20251 introduced a dedicated parser for LFM2; however, LFM2 and LFM2.5 use different tool-calling Jinja templates.

This PR fixes the tool call parser to handle the expected format for each case:

  • LFM2: tool list as List of tools: <|tool_list_start|>[...]<|tool_list_end|>, tool calls as
    <|tool_call_start|>[name(arg="val")]<|tool_call_end|>
  • LFM2.5: tool list as List of tools: [...], tool calls as bare [name(arg="val")] with no wrapper tokens

Added common_chat_params_init_lfm2_5 for the LFM2.5 template.

Testing

  • Added a unit test for LFM2.5 in test-chat.cpp
  • Tested tool calling use case locally with both LFM2.5-1.2B-Instruct-BF16.gguf and LFM2-8B-A1B-Q4_0.gguf

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: used Claude Code for assistance in tracing code, locating where to make the fix, and generating local test scripts


loci-review Bot commented Apr 1, 2026

Overview

Analysis of 124,195 functions across 15 binaries reveals negligible performance impact from the LFM2/LFM2.5 parsing refactoring: 115 functions modified (0.09%), 192 added, 0 removed. All changes are compiler-generated STL code artifacts, with no modifications to performance-critical inference paths.

Power Consumption Changes:

  • build.bin.llama-cvector-generator: +0.26% (+949 nJ)
  • build.bin.llama-tts: +0.14% (+520 nJ)
  • build.bin.libllama.so: 0.00%
  • build.bin.libmtmd.so: 0.00%
  • build.bin.llama-bench: 0.00%
  • build.bin.libggml.so: 0.00%
  • build.bin.libggml-cpu.so: 0.00%
  • build.bin.libggml-base.so: 0.00%
  • build.bin.llama-tokenize: 0.00%
  • build.bin.llama-gemma3-cli: 0.00%
  • build.bin.llama-gguf-split: 0.00%
  • build.bin.llama-llava-cli: 0.00%
  • build.bin.llama-minicpmv-cli: 0.00%
  • build.bin.llama-quantize: 0.00%
  • build.bin.llama-qwen2vl-cli: 0.00%

Function Analysis

All modified functions are STL template instantiations (std::vector, std::map, std::_Rb_tree) with no source code changes. Performance variations result from compiler code generation differences between builds.

Most Significant Changes:

  • std::_Rb_tree::_S_key() (llama-tts): Response time +165% (+186ns), throughput time +311% (+186ns). Red-black tree key extraction for JSON maps. Used during initialization only.

  • std::_Rb_tree::begin() (llama-cvector-generator): Response time +220% (+182ns), throughput time +289% (+182ns). Map iterator initialization with extra unconditional branch in target version.

  • std::vector::end() (llama-tts): Response time -69% (-183ns), throughput time -75% (-183ns). Compiler eliminated indirect jump, improved code layout.

  • std::vector::back() (llama-tts): Response time -42% (-190ns), throughput time -73% (-190ns). Entry block optimization removed unnecessary jumps.

  • jinja::parser::parse_any() (llama-tts): Response time +0.9% (+133ns), throughput time +68% (+137ns). Template parser dispatcher with extra entry block.

Other analyzed functions show similar compiler-induced variations in non-critical initialization code. No changes detected in inference hot paths: llama_decode(), matrix operations, attention mechanisms, KV cache, sampling, or GPU backends remain unaffected.

Additional Findings

Source code changes limited to common/chat.cpp for LFM2/LFM2.5 tool call parsing refactoring—completely isolated from inference pipeline. GPU/ML operations unaffected: all CUDA, Metal, Vulkan, and other backend implementations unchanged. Critical inference components (GEMM operations, Flash Attention, quantization kernels) show zero modifications. The refactoring successfully improves code organization without impacting inference performance.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 9 times, most recently from 126cd1f to a8215be on April 8, 2026 02:18
loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c on April 15, 2026 02:19
loci-dev force-pushed the main branch 6 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 02:19