[Mirror] mtmd: Add DeepSeekOCR Support#66
Conversation
init commit
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
…ut in deepseek2 model
…e image decoding fails
# Conflicts: # tools/mtmd/clip.cpp
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
gguf-py/gguf/constants.py (1)
303-309: ⚠️ Potential issue | 🔴 Critical

Duplicate `WINDOW_SIZE` attribute in `ClipVision` class.

`WINDOW_SIZE` is defined twice: at line 303 (newly added) and line 309 (pre-existing). The second definition shadows the first. Remove the duplicate.

🐛 Proposed fix

```diff
     SPATIAL_MERGE_SIZE  = "clip.vision.spatial_merge_size"
-    WINDOW_SIZE         = "clip.vision.window_size"
     USE_GELU            = "clip.use_gelu"
     USE_SILU            = "clip.use_silu"
     N_WA_PATTERN        = "clip.vision.n_wa_pattern" # used by qwen2.5vl
     WA_LAYER_INDEXES    = "clip.vision.wa_layer_indexes" # used by youtuvl
     IS_DEEPSTACK_LAYERS = "clip.vision.is_deepstack_layers"
     WINDOW_SIZE         = "clip.vision.window_size"
```

convert_hf_to_gguf.py (1)
7738-7774: ⚠️ Potential issue | 🟠 Major

Potential metadata mismatch when `kv_lora_rank` is absent.

You compute key/value lengths using a fallback `kv_lora_rank = 512`, but you only emit `add_kv_lora_rank` when the hparam is present. That can leave GGUF metadata inconsistent if `kv_lora_rank` is missing or `None`. Consider emitting the fallback (or deriving it) consistently.

Suggested fix

```diff
-        kv_lora_rank = hparams["kv_lora_rank"] if hparams.get("kv_lora_rank") is not None else 512
+        kv_lora_rank = hparams.get("kv_lora_rank", 512)
@@
-        if "kv_lora_rank" in hparams and hparams["kv_lora_rank"] is not None:
-            self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
+        if not is_ocr:
+            self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
```
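As an aside on the duplicate `WINDOW_SIZE` finding above: the shadowing comes straight from Python's class-body semantics, where the last assignment to a name silently wins. A minimal sketch with illustrative names (not from the repo):

```python
class Demo:
    KEY = "first"   # first definition
    KEY = "second"  # silently shadows the first: the last assignment wins

# The duplicate leaves no trace at runtime -- only one attribute survives.
print(Demo.KEY)  # → second
```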
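The inconsistency described above fits in a few lines. `plan_metadata` below is a hypothetical stand-in for the converter's logic (not code from this PR): the rank used to compute lengths falls back to 512, but the metadata key is only emitted when the hparam is present, so the two can disagree.

```python
def plan_metadata(hparams: dict) -> dict:
    # Fallback rank is always used for the length computation ...
    kv_lora_rank = hparams["kv_lora_rank"] if hparams.get("kv_lora_rank") is not None else 512
    meta = {"kv_len_basis": kv_lora_rank}
    # ... but the rank itself is only written when the hparam is present,
    # which is the mismatch the review points out.
    if "kv_lora_rank" in hparams and hparams["kv_lora_rank"] is not None:
        meta["kv_lora_rank"] = kv_lora_rank
    return meta

print(plan_metadata({}))                     # rank 512 used, but never emitted
print(plan_metadata({"kv_lora_rank": 256}))  # rank both used and emitted
```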
🤖 Fix all issues with AI agents
In `@src/llama-arch.cpp`:
- Around line 1523-1554: The tensor name set for LLM_ARCH_DEEPSEEK2OCR is
inconsistent with the converter output causing load failures; update the tensor
list returned in the LLM_ARCH_DEEPSEEK2OCR case to exactly match the
converter-exported names (replace plural exps names LLM_TENSOR_FFN_GATE_EXPS,
LLM_TENSOR_FFN_DOWN_EXPS, LLM_TENSOR_FFN_UP_EXPS with the converter's singular
enums LLM_TENSOR_FFN_GATE_EXP, LLM_TENSOR_FFN_DOWN_EXP, LLM_TENSOR_FFN_UP_EXP,
remove LLM_TENSOR_FFN_GATE_INP_SHEXP if the converter doesn't export it, and add
LLM_TENSOR_ATTN_ROT_EMBD which the converter does export); alternatively, if you
prefer changing the converter, modify its export names to produce the pluralized
enums used here so that the set returned by the LLM_ARCH_DEEPSEEK2OCR branch
matches the converter exactly.
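The converter/arch mismatch the agent prompt describes is essentially a set-difference problem: names the loader expects but the converter never writes, and vice versa. A toy cross-check, using hypothetical lowercase stand-ins for the real enum names in `src/llama-arch.cpp`:

```python
# Hypothetical stand-ins for the two name sets; the real names are the
# LLM_TENSOR_* enums and the converter's exported tensor names.
converter_exports = {"ffn_gate_exp", "ffn_down_exp", "ffn_up_exp", "attn_rot_embd"}
arch_expects = {"ffn_gate_exps", "ffn_down_exps", "ffn_up_exps", "ffn_gate_inp_shexp"}

missing = arch_expects - converter_exports  # expected by the loader, never written
extra = converter_exports - arch_expects    # written by the converter, never loaded

# Either non-empty set means the model fails to load.
print(sorted(missing))
print(sorted(extra))
```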
🧹 Nitpick comments (2)
src/llama-model.cpp (1)
5014-5038: Add MoE sanity checks in the OCR branch for parity.
The OCR MoE path builds expert tensors without the `n_expert`/`n_expert_used` guards that exist in the non-OCR path, which makes inconsistent metadata harder to diagnose. Consider mirroring the same fail-fast checks here.

♻️ Suggested guard (parity with non-OCR path)

```diff
         } else {
             layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, 0);
             layer.ffn_exp_probs_b = create_tensor(tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, TENSOR_NOT_REQUIRED);

+            if (n_expert == 0) {
+                throw std::runtime_error("n_expert must be > 0");
+            }
+            if (n_expert_used == 0) {
+                throw std::runtime_error("n_expert_used must be > 0");
+            }

             // MoE branch
             layer.ffn_gate_exps = create_tensor(tn(LLM_TENSOR_FFN_GATE_EXPS, "weight", i), { n_embd, n_ff_exp, n_expert}, 0);
```

Upstream PR notes reference DeepSeek-OCR LM support with standard attention, so the square Q/K/V weights here align with that intent. (github.com)
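The fail-fast idea above, transposed to Python for illustration (the real guards would be C++ in `src/llama-model.cpp`; the function name here is hypothetical, and the error messages mirror the review's suggestion):

```python
def check_moe_hparams(n_expert: int, n_expert_used: int) -> None:
    # Reject inconsistent MoE metadata before any expert tensors are built,
    # so a bad GGUF fails with a clear message instead of a later shape error.
    if n_expert == 0:
        raise ValueError("n_expert must be > 0")
    if n_expert_used == 0:
        raise ValueError("n_expert_used must be > 0")

check_moe_hparams(64, 6)  # plausible expert counts: passes silently
try:
    check_moe_hparams(0, 6)
except ValueError as err:
    print(err)  # → n_expert must be > 0
```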
convert_hf_to_gguf.py (1)
1818-1818: Annotate mutable class attribute with `ClassVar` (RUF012).

Helps typing clarity and resolves Ruff warning.

Proposed update

```diff
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
-    n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
+    n_block_keys: ClassVar[list[str]] = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
```
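Why `ClassVar` matters here, in a small sketch (`ConverterBase` is a hypothetical stand-in for the converter's base class, with a shortened key list):

```python
from typing import ClassVar

class ConverterBase:
    # ClassVar marks this mutable list as class-level state (what RUF012
    # asks for); without it, type checkers read the annotation as the type
    # of a per-instance attribute.
    n_block_keys: ClassVar[list[str]] = ["n_layers", "num_hidden_layers"]

# The list is shared: every instance sees the same object, so mutating it
# through one instance affects all of them -- worth making explicit.
a, b = ConverterBase(), ConverterBase()
print(a.n_block_keys is b.n_block_keys)  # → True
```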
# Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/tensor_mapping.py
# Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/tensor_mapping.py # src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts: # tools/mtmd/clip.cpp
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
# Conflicts: # src/llama-model.cpp
# Conflicts: # src/llama-model.cpp
- ignore llama-arch test for deepseek-ocr
# Conflicts: # tools/mtmd/clip.cpp
# Conflicts: # convert_hf_to_gguf.py # src/llama-model.cpp
# Conflicts: # tools/mtmd/models/glm4v.cpp # tools/mtmd/models/siglip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Mirror from upstream PR: ggml-org#17400
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.
Summary by CodeRabbit
New Features
Bug Fixes
Tests