UPSTREAM PR #21421: mtmd: add Gemma 4 audio conformer encoder support#1336
Conversation
Overview

Analysis of 125,410 functions across 15 binaries shows 34 modified functions, with performance changes isolated to the multimodal library. The changes add Gemma 4 audio conformer encoder support with duplicate tensor detection and safe map lookups.

Function counts: 34 modified, 290 new, 3 removed, 125,083 unchanged.

Power consumption by binary:
Function Analysis

- clip_model_loader::load_tensors (build.bin.libmtmd.so)
- clip_init (build.bin.libmtmd.so)
- Lambda operator in load_tensors (build.bin.libmtmd.so)
- filter_params constructor (build.bin.libmtmd.so)
- std::vector::end() (build.bin.libmtmd.so)
Other analyzed functions show regressions of 9-40% in memory allocation and vector operations due to the clip_layer struct expansion (+40 bytes for 5 new tensor pointers), with negligible absolute overhead.

Flame Graph Comparison

Comparing flame graphs for clip_model_loader::load_tensors to illustrate the structural change from vector-based to hash-based tensor tracking (base and target flame graph images not preserved). The target version shows a 2.3x increase in execution time from the new hash table operations.

Additional Findings

All performance regressions occur in non-critical initialization code (model loading, one-time setup). Core inference libraries (libllama.so, libggml.so, libggml-cpu.so) show zero power consumption change, confirming that inference hot paths are completely unaffected. The changes prioritize correctness over initialization speed: duplicate detection prevents silent model corruption, and safe lookups prevent crashes. The absolute overhead (~1.8 ms) is negligible within multi-second model loading. Chunked local attention reduces inference complexity from O(n²) to O(n·chunk_size), providing a 50% computation reduction that will offset initialization costs during actual inference.

💬 Questions? Tag @loci-dev
Force-pushed 126cd1f to a8215be
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends. Transcribes: "Glad to see things are going well and business is starting to pick up" (matching ground truth).

Ref: #21325

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Audio encoder fixes:
- Fix swapped conv norm weight mapping in tensor_mapping.py (A_ENC_CONV_NORM and A_ENC_NORM_CONV had their gemma4 entries inverted, causing the conv pre-norm and internal norm weights to be swapped in GGUF. This produced 0.67 encoder cosine vs PyTorch; now 0.9999)
- Fix causal mask off-by-one: add (gq - gk) < max_past to match PyTorch's dist < left_window_size (was attending to 13 past tokens instead of 12)
- Use -1e9 instead of -INFINITY for masked positions to match PyTorch's attention_invalid_logits_value and avoid NaN in padded attention weights

LM fixes:
- Disable attention logit softcapping for Gemma4 (unlike Gemma2, Gemma4's text model does not use attn softcapping; it was incorrectly hardcoded)
- Use BF16-rounded embedding scale constants to match PyTorch's native BF16 training precision (ref: PR #21451). Fixes long-context coherence on CPU/Vulkan backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use double-precision trig (sin/cos) instead of float (sinf/cosf) for precomputed FFT twiddle factors, Hann window, and sinusoidal RPE to match PyTorch's precision in the audio encoder preprocessing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ants"

This reverts commit 65a4b12e066501e34f2aac251a50bcca74fd0da5.
… derive softcap

- Revert conv_norm/pre_layer_norm swap in tensor_mapping.py to preserve backward compatibility with existing GGUFs; fix the mapping in C++ clip.cpp by cross-loading the swapped tensor names at load time instead
- Fix missing comma in V_ENC_ATTN_QKV mapping (silent string concatenation bug)
- Remove duplicated comment line in gemma4-iswa.cpp
- Keep per-layer embedding scale for multimodal path (matches PyTorch ScaledWordEmbedding, which replaces multimodal IDs with pad_token_id before lookup; scaling is a text model property, not a projector property)
- Derive attn_soft_cap from ml.get_key() return value instead of hardcoding true (Gemma4 has no attn softcapping key in GGUF)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove conv_norm cross-load in clip.cpp (the upstream tensor mapping is correct for existing GGUFs; cross-loading caused a double-swap)
- Keep per-layer embedding scale for multimodal path: this is the text model's ScaledWordEmbedding behavior and cannot be moved to the projector, since tok_embd_per_layer is a text model tensor
- Derive attn_soft_cap from ml.get_key() return value
- Remove duplicated comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add auto-detection of swapped conv_norm/norm_conv tensor data in Gemma 4 audio mmproj GGUFs. Publicly released GGUFs have these tensors swapped. Detection compares weight energy (sum-of-squares) and swaps tensor pointers if needed.
- Remove duplicated comment line in gemma4-iswa.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Simplify conv norm fix: unconditionally swap tensor pointers after loading (all existing Gemma 4 mmproj GGUFs have this issue)
- Remove per-layer embedding scaling for multimodal path (moved to dedicated PR #21625)
- Remove duplicated comment in gemma4-iswa.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guous

The GLU gate in the Gemma 4 conformer creates a non-contiguous view (ggml_view_2d with offset) and passes it to ggml_sigmoid. CUDA and Vulkan backends require contiguous inputs for unary ops, so sigmoid fell back to CPU, causing 25 graph splits per encoder forward pass. The repeated GPU<->CPU transfers introduced numerical divergence that caused repetition on longer audio.

Fix: wrap the view in ggml_cont() before ggml_sigmoid(). This keeps the entire conformer graph on a single backend with no splits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The conv norm mapping fix is handled in C++ (clip.cpp) by swapping tensor pointers after loading; no changes to tensor_mapping.py are needed. The BF16-rounded scale, per-layer embedding scaling, and attn_soft_cap changes are moved to dedicated PRs (#21613, #21625).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore BF16-rounded scale wrappers for embedding and MoE logits to match PyTorch's native BF16 training precision. The small difference between sqrtf(1536)=39.19 and the BF16-rounded 39.25 compounds through 35 layers, causing audio repetition especially on CUDA.

Also add per-layer embedding scale for the multimodal path: PyTorch's ScaledWordEmbedding replaces multimodal IDs with pad_token_id and scales by sqrt(n_embd_per_layer). Without this, the token path is scaled but the multimodal path is not, degrading audio quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multimodal per-layer embedding scaling is handled by PR #21625.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 8a1494c to 9b5efb8
The analysis encountered an error. Please review the Processing Details for more information.
Force-pushed 1254f75 to 245e873
Force-pushed 7638ab4 to f1b46d5
Note
Source pull request: ggml-org/llama.cpp#21421
Overview
Add audio processing support for Gemma 4 models via a USM-style Conformer encoder.
Architecture:
- Chunked local attention (matching the PyTorch reference): ggml_view_4d with stride 12

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):

Key fixes:
- get_tensor() now throws on duplicate tensor names (via std::unordered_set guard), and GEMMA4A-specific tensor loading no longer re-loads tensors already handled by the generic per-layer loop.
- clamp_info loading moved after per-layer tensor loading to ensure all conformer weights have clamp data.
- Skip the Whisper (x+4)/4 normalization for Gemma4 raw log-mel output.
- ggml_ssm_conv: added kernel_size=5 support (Gemma4 depthwise Conv1D uses kernel_size=5; previously only 3, 4, 9 were supported).

Test results (E4B Q4_K_M):
LibriSpeech test samples:
Generation parameters (from the model's generation_config.json): --temp 1.0 --top-k 64 --top-p 0.95

Additional information
Test plan:
- test-mtmd-c-api passes
- test-llama-archs passes (gemma4 fixture added, skipped pending ISWA KV cache fix)

Ref: #21325
Requirements