UPSTREAM PR #21421: mtmd: add Gemma 4 audio conformer encoder support #1336

Open
loci-dev wants to merge 13 commits into main from loci/pr-21421-gemma4-audio-pr
Conversation


@loci-dev loci-dev commented Apr 6, 2026

Note

Source pull request: ggml-org/llama.cpp#21421

Overview

Add audio processing support for Gemma 4 models via a USM-style Conformer encoder.

Architecture:

  • 12-layer Conformer: FFN -> Self-Attention -> Causal Conv1D -> FFN -> Norm
  • Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
  • Chunked local attention with sinusoidal RPE (chunk_size=12, context_size=24)
  • Logit softcapping at 50.0, ClippableLinear with per-tensor clamping
  • Output projection -> RMSNorm -> multimodal embedder

Chunked local attention (matching PyTorch reference):

  • Q split into non-overlapping blocks of 12
  • K/V extracted as overlapping context windows of 24 via ggml_view_4d with stride 12
  • Per-block causal mask: query at position q only attends to keys at positions <= q
  • Blocked relative position shift (Transformer-XL appendix B)
  • RPE: 13 sinusoidal position embeddings [12, 11, ..., 0]
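The chunked visibility rule above can be sketched in plain C++. This is an illustrative model of the mask only (names like `attends` are ours, not from clip.cpp); it shows how a query in block b sees its own block plus one block of left context, with per-block causality:

```cpp
#include <cassert>

// Hypothetical sketch of the chunked local attention mask described above,
// with chunk_size = 12 and an overlapping K/V window of 24 (one left-context
// block plus the query's own block). Causality is enforced per position.
bool attends(int q_pos, int k_pos, int chunk_size = 12) {
    int q_block   = q_pos / chunk_size;            // non-overlapping Q block
    int win_start = q_block * chunk_size - chunk_size; // start of K/V window
    return k_pos >= win_start && k_pos <= q_pos;   // in window AND causal
}
```

Under this sketch, a query at the start of a block still sees the full previous block, which is why the K/V windows of 24 are extracted with stride 12.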

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):

  • HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  • Standard periodic Hann window (320 samples), zero-padded to FFT size
  • Semicausal left-padding (frame_length/2 samples)
  • Frame count matched to PyTorch unfold formula
  • No pre-emphasis, no Whisper-style normalization
  • 30-second chunking (splits long audio into 30s segments)
  • Mel cosine similarity vs PyTorch: 0.9998
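Two of the ingredients above have closed forms worth pinning down: the HTK mel scale and the PyTorch `unfold` frame count. A minimal sketch (the hop length of 160 is an assumption for illustration; the PR only states frame_length = 320):

```cpp
#include <cassert>
#include <cmath>

// HTK mel scale: mel = 2595 * log10(1 + f / 700).
double hz_to_mel_htk(double hz) {
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// torch.Tensor.unfold(dim, size, step) yields floor((n - size) / step) + 1
// windows; the PR matches this frame count after semicausal left-padding.
int frame_count(int n_samples, int frame_length, int hop_length) {
    if (n_samples < frame_length) return 0;
    return (n_samples - frame_length) / hop_length + 1;
}
```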

Key fixes:

  • Tensor loading dedup: get_tensor() now throws on duplicate tensor names (via std::unordered_set guard), and GEMMA4A-specific tensor loading no longer re-loads tensors already handled by the generic per-layer loop.
  • ClippableLinear clamp_info loading moved after per-layer tensor loading to ensure all conformer weights have clamp data.
  • Skip Whisper-style (x+4)/4 normalization for Gemma4 raw log-mel output.
  • CUDA ggml_ssm_conv: added kernel_size=5 support (Gemma4 depthwise Conv1D uses kernel_size=5, previously only 3,4,9 were supported).
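The dedup guard in the first fix amounts to an `std::unordered_set` membership check before registering a tensor name. A minimal sketch, assuming a simplified loader shape (the real `get_tensor()` in clip.cpp takes gguf/ggml context arguments and returns the tensor):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Sketch of the duplicate-name guard: insert() returns {iterator, false}
// when the name was already present, which we turn into a hard error
// instead of silently creating a duplicate entry.
struct tensor_loader {
    std::unordered_set<std::string> seen;

    void get_tensor(const std::string & name) {
        if (!seen.insert(name).second) {
            throw std::runtime_error("duplicate tensor name: " + name);
        }
        // ... actual tensor lookup/registration would happen here ...
    }
};
```

Throwing (rather than skipping) surfaces the GEMMA4A-specific loads that overlapped the generic per-layer loop, which is how the second half of the fix was found.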

Test results (E4B Q4_K_M):

LibriSpeech test samples:

Ground truth: "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL"
E4B output:   "Mr. Coulter is the apostle of the middle classes, and we are glad to welcome his gospel."

Ground truth: "NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER"
E4B output:   "Norris Mr. Coulter's manner less interesting than his manner."

Ground truth: "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF..."
E4B output:   "He tells us that this festive season of the year, with Christmas and New Year looming before us..."

Generation parameters (from model's generation_config.json):
--temp 1.0 --top-k 64 --top-p 0.95

Additional information

Test plan:

  • test-mtmd-c-api passes
  • test-llama-archs passes (gemma4 fixture added, skipped pending ISWA KV cache fix)
  • E4B Q4_K_M transcription (Vulkan Intel Iris Xe, CUDA RTX 3060, CUDA Tesla T4)
  • E2B Q4_K_M transcription (CUDA RTX 3060, CUDA Tesla T4)
  • LibriSpeech samples with known ground truth
  • Mel values verified against PyTorch (cosine 0.9998)
  • Encoder output cosine vs PyTorch: 0.68 (expected for F16 through 12 conformer layers)
  • CI ctest: 49/49 debug passed
  • CUDA ssm_conv kernel_size=5 tested on RTX 3060 and Tesla T4

Ref: #21325

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude Code was used in an assistive capacity for iterative debugging (tensor tracing, mel spectrogram comparison) and code review. All architecture decisions, algorithm implementations, and code were manually reviewed and verified against the PyTorch reference.


loci-review Bot commented Apr 6, 2026

Overview

Analysis of 125,410 functions across 15 binaries shows 34 modified functions with performance changes isolated to the multimodal library. Changes add Gemma 4 audio conformer encoder support with duplicate tensor detection and safe map lookups.

Function counts: 34 modified, 290 new, 3 removed, 125,083 unchanged

Power consumption by binary:

Binary Base (nJ) Target (nJ) Change
build.bin.libmtmd.so 196,832.42 201,026.65 +2.13%
build.bin.libllama.so 264,739.36 264,739.10 -0.00%
build.bin.llama-cvector-generator 367,100.44 367,098.51 -0.00%
build.bin.llama-tts 372,435.36 372,435.17 -0.00%
build.bin.llama-bench 160,327.90 160,328.23 +0.00%
build.bin.llama-quantize 44,468.23 44,468.23 0.00%
build.bin.llama-qwen2vl-cli 277.87 277.87 0.00%
build.bin.llama-tokenize 38,388.75 38,388.75 0.00%
build.bin.llama-gemma3-cli 277.87 277.87 0.00%
build.bin.llama-gguf-split 2,864.08 2,864.08 0.00%
build.bin.llama-llava-cli 277.87 277.87 0.00%
build.bin.llama-minicpmv-cli 277.87 277.87 0.00%
build.bin.libggml-cpu.so 177,792.05 177,792.05 0.00%
build.bin.libggml.so 5,136.91 5,136.91 0.00%
build.bin.libggml-base.so 74,169.64 74,169.64 0.00%

Function Analysis

clip_model_loader::load_tensors (build.bin.libmtmd.so)

  • Response time: 446,935 ns → 1,035,599 ns (+588,664 ns, +132%)
  • Throughput time: 5,468 ns → 5,851 ns (+384 ns, +7%)
  • Adds ~150 lines for GEMMA4A audio conformer with 40+ specialized tensors per layer, duplicate detection via std::unordered_set, and safe iterator-based map lookups. Regression is feature-driven, not inefficiency.

clip_init (build.bin.libmtmd.so)

  • Response time: 1,180,428 ns → 2,356,773 ns (+1,176,345 ns, +100%)
  • Throughput time: 275 ns → 258 ns (-17 ns, -6%)
  • Regression entirely driven by load_tensors dependency. Function's own code improved 6%, demonstrating efficient implementation.

Lambda operator in load_tensors (build.bin.libmtmd.so)

  • Response time: 3,873 ns → 11,555 ns (+7,682 ns, +198%)
  • Throughput time: 242 ns → 356 ns (+114 ns, +47%)
  • Switches from vector storage to hash-based duplicate detection. Adds unordered_set::count() (+1,271 ns) and insert() (+4,616 ns) to prevent silent model corruption with 50+ new tensor types.

filter_params constructor (build.bin.libmtmd.so)

  • Response time: 27 ns → 50 ns (+23 ns, +88%)
  • Adds 3 fields for Gemma4 audio preprocessing: no_padding, use_magnitude, mel_floor. Proportional overhead for expanded functionality.

std::vector::end() (build.bin.libmtmd.so)

  • Response time: 265 ns → 82 ns (-183 ns, -69%)
  • Compiler optimization reduced CFG from 9 to 7 blocks with 89% faster entry block execution.

Other analyzed functions show regressions of 9-40% in memory allocation and vector operations due to clip_layer struct expansion (+40 bytes for 5 new tensor pointers), with negligible absolute overhead.

Flame Graph Comparison

Comparing flame graphs for clip_model_loader::load_tensors to illustrate the structural changes from vector-based to hash-based tensor tracking:

Base version:

Base version flame graph

Target version:

Target version flame graph

The target version shows 2.3x increase in execution time with new hash table operations (_M_insert, _M_bucket_index, _M_rehash) replacing simple vector operations, enabling robust duplicate detection for complex audio conformer models.

Additional Findings

All performance regressions occur in non-critical initialization code (model loading, one-time setup). Core inference libraries (libllama.so, libggml.so, libggml-cpu.so) show zero power consumption change, confirming inference hot paths are completely unaffected. The changes prioritize correctness over initialization speed—duplicate detection prevents silent model corruption, and safe lookups prevent crashes. Absolute overhead (~1.8 ms) is negligible in multi-second model loading operations. Chunked local attention reduces inference complexity from O(n²) to O(n·chunk_size), providing 50% computation reduction that will offset initialization costs during actual inference.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 126cd1f to a8215be on April 8, 2026 02:18
stephencox and others added 13 commits April 9, 2026 10:04
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Audio encoder fixes:
- Fix swapped conv norm weight mapping in tensor_mapping.py
  (A_ENC_CONV_NORM and A_ENC_NORM_CONV had their gemma4 entries inverted,
  causing the conv pre-norm and internal norm weights to be swapped in GGUF.
  This produced 0.67 encoder cosine vs PyTorch; now 0.9999)
- Fix causal mask off-by-one: add (gq - gk) < max_past to match PyTorch's
  dist < left_window_size (was attending to 13 past tokens instead of 12)
- Use -1e9 instead of -INFINITY for masked positions to match PyTorch's
  attention_invalid_logits_value and avoid NaN in padded attention weights
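The two mask fixes in this commit can be sketched together. This is an illustrative model, not the clip.cpp code: `mask_value` combines the corrected causal condition (`(gq - gk) < max_past`, mirroring PyTorch's `dist < left_window_size`) with the finite `-1e9` fill that keeps fully-padded rows from producing NaN after softmax:

```cpp
#include <cassert>

// Sketch of the corrected per-position mask: a query at global position gq
// may attend key gk only if gk <= gq AND (gq - gk) < max_past. Masked
// positions get -1e9 (PyTorch's attention_invalid_logits_value) rather than
// -INFINITY, so an all-masked row softmaxes to finite values instead of NaN.
float mask_value(int gq, int gk, int max_past = 12) {
    bool visible = (gk <= gq) && (gq - gk) < max_past;
    return visible ? 0.0f : -1e9f;
}
```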

LM fixes:
- Disable attention logit softcapping for Gemma4 (unlike Gemma2, Gemma4's
  text model does not use attn softcapping; was incorrectly hardcoded)
- Use BF16-rounded embedding scale constants to match PyTorch's native
  BF16 training precision (ref: PR #21451). Fixes long-context coherence
  on CPU/Vulkan backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use double-precision trig (sin/cos) instead of float (sinf/cosf) for
precomputed FFT twiddle factors, Hann window, and sinusoidal RPE to
match PyTorch's precision in the audio encoder preprocessing.
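The precision gap this commit closes can be seen by computing one twiddle factor both ways and comparing in double (the angle here is illustrative, not one of the PR's exact FFT sizes):

```cpp
#include <cassert>
#include <cmath>

// Error of a float-trig twiddle factor cos(2*pi*k/n) relative to the
// double-trig value. float carries ~7 decimal digits, so the argument
// reduction and cosine each lose precision that double retains.
double twiddle_error(int k, int n) {
    const double pi = std::acos(-1.0);
    double exact  = std::cos(2.0 * pi * k / n);
    float  approx = std::cos(2.0f * (float)pi * (float)k / (float)n);
    return std::fabs(exact - (double)approx);
}
```

Per-factor errors are tiny, but they accumulate across the FFT, the Hann window, and the RPE table, which is why the precomputed tables now use double-precision trig.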

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ants"

This reverts commit 65a4b12e066501e34f2aac251a50bcca74fd0da5.
… derive softcap

- Revert conv_norm/pre_layer_norm swap in tensor_mapping.py to preserve
  backward compatibility with existing GGUFs; fix mapping in C++ clip.cpp
  by cross-loading the swapped tensor names at load time instead
- Fix missing comma in V_ENC_ATTN_QKV mapping (silent string concatenation bug)
- Remove duplicated comment line in gemma4-iswa.cpp
- Keep per-layer embedding scale for multimodal path (matches PyTorch
  ScaledWordEmbedding which replaces multimodal IDs with pad_token_id
  before lookup; scaling is a text model property, not projector)
- Derive attn_soft_cap from ml.get_key() return value instead of
  hardcoding true (Gemma4 has no attn softcapping key in GGUF)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove conv_norm cross-load in clip.cpp (the upstream tensor mapping
  is correct for existing GGUFs; cross-loading caused double-swap)
- Keep per-layer embedding scale for multimodal path — this is the
  text model's ScaledWordEmbedding behavior, cannot be moved to
  projector since tok_embd_per_layer is a text model tensor
- Derive attn_soft_cap from ml.get_key() return value
- Remove duplicated comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add auto-detection of swapped conv_norm/norm_conv tensor data in
  Gemma 4 audio mmproj GGUFs. Publicly released GGUFs have these
  tensors swapped. Detection compares weight energy (sum-of-squares)
  and swaps tensor pointers if needed.
- Remove duplicated comment line in gemma4-iswa.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Simplify conv norm fix: unconditionally swap tensor pointers after
  loading (all existing Gemma 4 mmproj GGUFs have this issue)
- Remove per-layer embedding scaling for multimodal path (moved to
  dedicated PR #21625)
- Remove duplicated comment in gemma4-iswa.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guous

The GLU gate in the Gemma 4 conformer creates a non-contiguous view
(ggml_view_2d with offset) and passes it to ggml_sigmoid. CUDA and
Vulkan backends require contiguous inputs for unary ops, so sigmoid
fell back to CPU causing 25 graph splits per encoder forward pass.
The repeated GPU<->CPU transfers introduced numerical divergence that
caused repetition on longer audio.

Fix: wrap the view in ggml_cont() before ggml_sigmoid(). This keeps
the entire conformer graph on a single backend with no splits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The conv norm mapping fix is handled in C++ (clip.cpp) by swapping
tensor pointers after loading. No changes to tensor_mapping.py needed.

The BF16-rounded scale, per-layer embedding scaling, and attn_soft_cap
changes are moved to dedicated PRs (#21613, #21625).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore BF16-rounded scale wrappers for embedding and MoE logits to
match PyTorch's native BF16 training precision. The small difference
between sqrtf(1536)=39.19 and BF16-rounded 39.25 compounds through
35 layers, causing audio repetition especially on CUDA.

Also add per-layer embedding scale for the multimodal path — PyTorch's
ScaledWordEmbedding replaces multimodal IDs with pad_token_id and
scales by sqrt(n_embd_per_layer). Without this, the token path is
scaled but the multimodal path is not, degrading audio quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multimodal per-layer embedding scaling is handled by PR #21625.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

loci-review Bot commented Apr 9, 2026

The analysis encountered an error. Please review the Processing Details for more information.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 1254f75 to 245e873 on April 16, 2026 09:24
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 02:19