UPSTREAM PR #21421: mtmd: add Gemma 4 audio conformer encoder support #1336

Open
loci-dev wants to merge 13 commits into main from loci/pr-21421-gemma4-audio-pr
Conversation


@loci-dev loci-dev commented Apr 6, 2026

Note

Source pull request: ggml-org/llama.cpp#21421

Overview

Add audio processing support for Gemma 4 models via a USM-style Conformer encoder.

Architecture:

  • 12-layer Conformer: FFN -> Self-Attention -> Causal Conv1D -> FFN -> Norm
  • Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
  • Chunked local attention with sinusoidal RPE (chunk_size=12, context_size=24)
  • Logit softcapping at 50.0, ClippableLinear with per-tensor clamping
  • Output projection -> RMSNorm -> multimodal embedder

Chunked local attention (matching PyTorch reference):

  • Q split into non-overlapping blocks of 12
  • K/V extracted as overlapping context windows of 24 via ggml_view_4d with stride 12
  • Per-block causal mask: query at position q only attends to keys at positions <= q
  • Blocked relative position shift (Transformer-XL appendix B)
  • RPE: 13 sinusoidal position embeddings [12, 11, ..., 0]
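The chunked visibility rule above can be sketched in plain C++. This is an illustrative model of the mask only (names like `attends` are ours, not from clip.cpp); it shows how a query in block b sees its own block plus one block of left context, with per-block causality:

```cpp
#include <cassert>

// Hypothetical sketch of the chunked local attention mask described above,
// with chunk_size = 12 and an overlapping K/V window of 24 (one left-context
// block plus the query's own block). Causality is enforced per position.
bool attends(int q_pos, int k_pos, int chunk_size = 12) {
    int q_block   = q_pos / chunk_size;            // non-overlapping Q block
    int win_start = q_block * chunk_size - chunk_size; // start of K/V window
    return k_pos >= win_start && k_pos <= q_pos;   // in window AND causal
}
```

Under this sketch, a query at the start of a block still sees the full previous block, which is why the K/V windows of 24 are extracted with stride 12.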

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):

  • HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  • Standard periodic Hann window (320 samples), zero-padded to FFT size
  • Semicausal left-padding (frame_length/2 samples)
  • Frame count matched to PyTorch unfold formula
  • No pre-emphasis, no Whisper-style normalization
  • 30-second chunking (splits long audio into 30s segments)
  • Mel cosine similarity vs PyTorch: 0.9998
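Two of the ingredients above have closed forms worth pinning down: the HTK mel scale and the PyTorch `unfold` frame count. A minimal sketch (the hop length of 160 is an assumption for illustration; the PR only states frame_length = 320):

```cpp
#include <cassert>
#include <cmath>

// HTK mel scale: mel = 2595 * log10(1 + f / 700).
double hz_to_mel_htk(double hz) {
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// torch.Tensor.unfold(dim, size, step) yields floor((n - size) / step) + 1
// windows; the PR matches this frame count after semicausal left-padding.
int frame_count(int n_samples, int frame_length, int hop_length) {
    if (n_samples < frame_length) return 0;
    return (n_samples - frame_length) / hop_length + 1;
}
```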

Key fixes:

  • Tensor loading dedup: get_tensor() now throws on duplicate tensor names (via std::unordered_set guard), and GEMMA4A-specific tensor loading no longer re-loads tensors already handled by the generic per-layer loop.
  • ClippableLinear clamp_info loading moved after per-layer tensor loading to ensure all conformer weights have clamp data.
  • Skip Whisper-style (x+4)/4 normalization for Gemma4 raw log-mel output.
  • CUDA ggml_ssm_conv: added kernel_size=5 support (Gemma4 depthwise Conv1D uses kernel_size=5, previously only 3,4,9 were supported).
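The dedup guard in the first fix amounts to an `std::unordered_set` membership check before registering a tensor name. A minimal sketch, assuming a simplified loader shape (the real `get_tensor()` in clip.cpp takes gguf/ggml context arguments and returns the tensor):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Sketch of the duplicate-name guard: insert() returns {iterator, false}
// when the name was already present, which we turn into a hard error
// instead of silently creating a duplicate entry.
struct tensor_loader {
    std::unordered_set<std::string> seen;

    void get_tensor(const std::string & name) {
        if (!seen.insert(name).second) {
            throw std::runtime_error("duplicate tensor name: " + name);
        }
        // ... actual tensor lookup/registration would happen here ...
    }
};
```

Throwing (rather than skipping) surfaces the GEMMA4A-specific loads that overlapped the generic per-layer loop, which is how the second half of the fix was found.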

Test results (E4B Q4_K_M):

LibriSpeech test samples:

Ground truth: "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL"
E4B output:   "Mr. Coulter is the apostle of the middle classes, and we are glad to welcome his gospel."

Ground truth: "NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER"
E4B output:   "Norris Mr. Coulter's manner less interesting than his manner."

Ground truth: "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF..."
E4B output:   "He tells us that this festive season of the year, with Christmas and New Year looming before us..."

Generation parameters (from model's generation_config.json):
--temp 1.0 --top-k 64 --top-p 0.95

Additional information

Test plan:

  • test-mtmd-c-api passes
  • test-llama-archs passes (gemma4 fixture added, skipped pending ISWA KV cache fix)
  • E4B Q4_K_M transcription (Vulkan Intel Iris Xe, CUDA RTX 3060, CUDA Tesla T4)
  • E2B Q4_K_M transcription (CUDA RTX 3060, CUDA Tesla T4)
  • LibriSpeech samples with known ground truth
  • Mel values verified against PyTorch (cosine 0.9998)
  • Encoder output cosine vs PyTorch: 0.68 (expected for F16 through 12 conformer layers)
  • CI ctest: 49/49 debug passed
  • CUDA ssm_conv kernel_size=5 tested on RTX 3060 and Tesla T4

Ref: #21325

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude Code was used in an assistive capacity for iterative debugging (tensor tracing, mel spectrogram comparison) and code review. All architecture decisions, algorithm implementations, and code were manually reviewed and verified against the PyTorch reference.


loci-review Bot commented Apr 6, 2026

Overview

Analysis of 125,410 functions across 15 binaries shows 34 modified functions with performance changes isolated to the multimodal library. Changes add Gemma 4 audio conformer encoder support with duplicate tensor detection and safe map lookups.

Function counts: 34 modified, 290 new, 3 removed, 125,083 unchanged

Power consumption by binary:

Binary Base (nJ) Target (nJ) Change
build.bin.libmtmd.so 196,832.42 201,026.65 +2.13%
build.bin.libllama.so 264,739.36 264,739.10 -0.00%
build.bin.llama-cvector-generator 367,100.44 367,098.51 -0.00%
build.bin.llama-tts 372,435.36 372,435.17 -0.00%
build.bin.llama-bench 160,327.90 160,328.23 +0.00%
build.bin.llama-quantize 44,468.23 44,468.23 0.00%
build.bin.llama-qwen2vl-cli 277.87 277.87 0.00%
build.bin.llama-tokenize 38,388.75 38,388.75 0.00%
build.bin.llama-gemma3-cli 277.87 277.87 0.00%
build.bin.llama-gguf-split 2,864.08 2,864.08 0.00%
build.bin.llama-llava-cli 277.87 277.87 0.00%
build.bin.llama-minicpmv-cli 277.87 277.87 0.00%
build.bin.libggml-cpu.so 177,792.05 177,792.05 0.00%
build.bin.libggml.so 5,136.91 5,136.91 0.00%
build.bin.libggml-base.so 74,169.64 74,169.64 0.00%

Function Analysis

clip_model_loader::load_tensors (build.bin.libmtmd.so)

  • Response time: 446,935 ns → 1,035,599 ns (+588,664 ns, +132%)
  • Throughput time: 5,468 ns → 5,851 ns (+384 ns, +7%)
  • Adds ~150 lines for GEMMA4A audio conformer with 40+ specialized tensors per layer, duplicate detection via std::unordered_set, and safe iterator-based map lookups. Regression is feature-driven, not inefficiency.

clip_init (build.bin.libmtmd.so)

  • Response time: 1,180,428 ns → 2,356,773 ns (+1,176,345 ns, +100%)
  • Throughput time: 275 ns → 258 ns (-17 ns, -6%)
  • Regression entirely driven by load_tensors dependency. Function's own code improved 6%, demonstrating efficient implementation.

Lambda operator in load_tensors (build.bin.libmtmd.so)

  • Response time: 3,873 ns → 11,555 ns (+7,682 ns, +198%)
  • Throughput time: 242 ns → 356 ns (+114 ns, +47%)
  • Switches from vector storage to hash-based duplicate detection. Adds unordered_set::count() (+1,271 ns) and insert() (+4,616 ns) to prevent silent model corruption with 50+ new tensor types.

filter_params constructor (build.bin.libmtmd.so)

  • Response time: 27 ns → 50 ns (+23 ns, +88%)
  • Adds 3 fields for Gemma4 audio preprocessing: no_padding, use_magnitude, mel_floor. Proportional overhead for expanded functionality.

std::vector::end() (build.bin.libmtmd.so)

  • Response time: 265 ns → 82 ns (-183 ns, -69%)
  • Compiler optimization reduced CFG from 9 to 7 blocks with 89% faster entry block execution.

Other analyzed functions show regressions of 9-40% in memory allocation and vector operations due to clip_layer struct expansion (+40 bytes for 5 new tensor pointers), with negligible absolute overhead.

Flame Graph Comparison

Comparing flame graphs for clip_model_loader::load_tensors to illustrate the structural changes from vector-based to hash-based tensor tracking:

Base version:

Base version flame graph

Target version:

Target version flame graph

The target version shows 2.3x increase in execution time with new hash table operations (_M_insert, _M_bucket_index, _M_rehash) replacing simple vector operations, enabling robust duplicate detection for complex audio conformer models.

Additional Findings

All performance regressions occur in non-critical initialization code (model loading, one-time setup). Core inference libraries (libllama.so, libggml.so, libggml-cpu.so) show zero power consumption change, confirming inference hot paths are completely unaffected. The changes prioritize correctness over initialization speed—duplicate detection prevents silent model corruption, and safe lookups prevent crashes. Absolute overhead (~1.8 ms) is negligible in multi-second model loading operations. Chunked local attention reduces inference complexity from O(n²) to O(n·chunk_size), providing 50% computation reduction that will offset initialization costs during actual inference.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 126cd1f to a8215be on April 8, 2026 02:18
stephencox and others added 13 commits April 9, 2026 10:04
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Audio encoder fixes:
- Fix swapped conv norm weight mapping in tensor_mapping.py
  (A_ENC_CONV_NORM and A_ENC_NORM_CONV had their gemma4 entries inverted,
  causing the conv pre-norm and internal norm weights to be swapped in GGUF.
  This produced 0.67 encoder cosine vs PyTorch; now 0.9999)
- Fix causal mask off-by-one: add (gq - gk) < max_past to match PyTorch's
  dist < left_window_size (was attending to 13 past tokens instead of 12)
- Use -1e9 instead of -INFINITY for masked positions to match PyTorch's
  attention_invalid_logits_value and avoid NaN in padded attention weights
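The two mask fixes in this commit can be sketched together. This is an illustrative model, not the clip.cpp code: `mask_value` combines the corrected causal condition (`(gq - gk) < max_past`, mirroring PyTorch's `dist < left_window_size`) with the finite `-1e9` fill that keeps fully-padded rows from producing NaN after softmax:

```cpp
#include <cassert>

// Sketch of the corrected per-position mask: a query at global position gq
// may attend key gk only if gk <= gq AND (gq - gk) < max_past. Masked
// positions get -1e9 (PyTorch's attention_invalid_logits_value) rather than
// -INFINITY, so an all-masked row softmaxes to finite values instead of NaN.
float mask_value(int gq, int gk, int max_past = 12) {
    bool visible = (gk <= gq) && (gq - gk) < max_past;
    return visible ? 0.0f : -1e9f;
}
```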

LM fixes:
- Disable attention logit softcapping for Gemma4 (unlike Gemma2, Gemma4's
  text model does not use attn softcapping; was incorrectly hardcoded)
- Use BF16-rounded embedding scale constants to match PyTorch's native
  BF16 training precision (ref: PR #21451). Fixes long-context coherence
  on CPU/Vulkan backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use double-precision trig (sin/cos) instead of float (sinf/cosf) for
precomputed FFT twiddle factors, Hann window, and sinusoidal RPE to
match PyTorch's precision in the audio encoder preprocessing.
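The precision gap this commit closes can be seen by computing one twiddle factor both ways and comparing in double (the angle here is illustrative, not one of the PR's exact FFT sizes):

```cpp
#include <cassert>
#include <cmath>

// Error of a float-trig twiddle factor cos(2*pi*k/n) relative to the
// double-trig value. float carries ~7 decimal digits, so the argument
// reduction and cosine each lose precision that double retains.
double twiddle_error(int k, int n) {
    const double pi = std::acos(-1.0);
    double exact  = std::cos(2.0 * pi * k / n);
    float  approx = std::cos(2.0f * (float)pi * (float)k / (float)n);
    return std::fabs(exact - (double)approx);
}
```

Per-factor errors are tiny, but they accumulate across the FFT, the Hann window, and the RPE table, which is why the precomputed tables now use double-precision trig.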

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ants"

This reverts commit 65a4b12e066501e34f2aac251a50bcca74fd0da5.
… derive softcap

- Revert conv_norm/pre_layer_norm swap in tensor_mapping.py to preserve
  backward compatibility with existing GGUFs; fix mapping in C++ clip.cpp
  by cross-loading the swapped tensor names at load time instead
- Fix missing comma in V_ENC_ATTN_QKV mapping (silent string concatenation bug)
- Remove duplicated comment line in gemma4-iswa.cpp
- Keep per-layer embedding scale for multimodal path (matches PyTorch
  ScaledWordEmbedding which replaces multimodal IDs with pad_token_id
  before lookup; scaling is a text model property, not projector)
- Derive attn_soft_cap from ml.get_key() return value instead of
  hardcoding true (Gemma4 has no attn softcapping key in GGUF)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove conv_norm cross-load in clip.cpp (the upstream tensor mapping
  is correct for existing GGUFs; cross-loading caused double-swap)
- Keep per-layer embedding scale for multimodal path — this is the
  text model's ScaledWordEmbedding behavior, cannot be moved to
  projector since tok_embd_per_layer is a text model tensor
- Derive attn_soft_cap from ml.get_key() return value
- Remove duplicated comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add auto-detection of swapped conv_norm/norm_conv tensor data in
  Gemma 4 audio mmproj GGUFs. Publicly released GGUFs have these
  tensors swapped. Detection compares weight energy (sum-of-squares)
  and swaps tensor pointers if needed.
- Remove duplicated comment line in gemma4-iswa.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Simplify conv norm fix: unconditionally swap tensor pointers after
  loading (all existing Gemma 4 mmproj GGUFs have this issue)
- Remove per-layer embedding scaling for multimodal path (moved to
  dedicated PR #21625)
- Remove duplicated comment in gemma4-iswa.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guous

The GLU gate in the Gemma 4 conformer creates a non-contiguous view
(ggml_view_2d with offset) and passes it to ggml_sigmoid. CUDA and
Vulkan backends require contiguous inputs for unary ops, so sigmoid
fell back to CPU causing 25 graph splits per encoder forward pass.
The repeated GPU<->CPU transfers introduced numerical divergence that
caused repetition on longer audio.

Fix: wrap the view in ggml_cont() before ggml_sigmoid(). This keeps
the entire conformer graph on a single backend with no splits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The conv norm mapping fix is handled in C++ (clip.cpp) by swapping
tensor pointers after loading. No changes to tensor_mapping.py needed.

The BF16-rounded scale, per-layer embedding scaling, and attn_soft_cap
changes are moved to dedicated PRs (#21613, #21625).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore BF16-rounded scale wrappers for embedding and MoE logits to
match PyTorch's native BF16 training precision. The small difference
between sqrtf(1536)=39.19 and BF16-rounded 39.25 compounds through
35 layers, causing audio repetition especially on CUDA.

Also add per-layer embedding scale for the multimodal path — PyTorch's
ScaledWordEmbedding replaces multimodal IDs with pad_token_id and
scales by sqrt(n_embd_per_layer). Without this, the token path is
scaled but the multimodal path is not, degrading audio quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multimodal per-layer embedding scaling is handled by PR #21625.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

loci-review Bot commented Apr 9, 2026

The analysis encountered an error. Please review the Processing Details for more information.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 1254f75 to 245e873 on April 16, 2026 09:24
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 02:19