
[Mirror] mtmd: Add DeepSeekOCR Support #66

Open
ngxson wants to merge 143 commits into ngxson:master from sfallah:sf/deepseek-ocr

Conversation

Owner

@ngxson ngxson commented Dec 23, 2025

Mirror from upstream PR: ggml-org#17400

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

  • New Features

    • Added DeepSeek‑OCR support: OCR-focused vision/SAM pipeline, new architecture/template option, updated preprocessing/resizing flow, and new GGUF vision metadata keys.
  • Bug Fixes

    • Stricter tensor type checks for interpolation and masked attention.
    • Fail-fast on unknown GPU upscale mode to avoid silent fall-throughs.
  • Tests

    • Added DeepSeek‑OCR test suite, sample test data, and new CI test cases.

sfallah and others added 30 commits November 14, 2025 12:40
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
gguf-py/gguf/constants.py (1)

303-309: ⚠️ Potential issue | 🔴 Critical

Duplicate WINDOW_SIZE attribute in ClipVision class.

WINDOW_SIZE is defined twice: at line 303 (newly added) and line 309 (pre-existing). The second definition shadows the first. Remove the duplicate.

🐛 Proposed fix
         SPATIAL_MERGE_SIZE  = "clip.vision.spatial_merge_size"
-        WINDOW_SIZE         = "clip.vision.window_size"
         USE_GELU            = "clip.use_gelu"
         USE_SILU            = "clip.use_silu"
         N_WA_PATTERN        = "clip.vision.n_wa_pattern" # used by qwen2.5vl
         WA_LAYER_INDEXES    = "clip.vision.wa_layer_indexes" # used by youtuvl
         IS_DEEPSTACK_LAYERS = "clip.vision.is_deepstack_layers"
         WINDOW_SIZE         = "clip.vision.window_size"
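For context on why the duplicate slipped through (a generic Python illustration, not project code): a repeated attribute assignment in a class body raises no error at runtime — the last binding silently shadows the earlier one.

```python
# Generic illustration: Python raises no error for a duplicated class
# attribute; the last assignment silently wins, so a duplicate like the
# WINDOW_SIZE one above only surfaces via review or a linter.
class Demo:
    KEY = "first"
    KEY = "second"  # duplicate binding; this one shadows the first

print(Demo.KEY)  # prints "second"
```

A linter (e.g. one that flags redefinitions) would catch this, which is why removing the duplicate rather than relying on shadowing is the safer fix.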
convert_hf_to_gguf.py (1)

7738-7774: ⚠️ Potential issue | 🟠 Major

Potential metadata mismatch when kv_lora_rank is absent.
You compute key/value lengths using a fallback kv_lora_rank = 512, but you only emit add_kv_lora_rank when the hparam is present. That can leave GGUF metadata inconsistent if kv_lora_rank is missing or None. Consider emitting the fallback (or deriving it) consistently.

Suggested fix
-        kv_lora_rank = hparams["kv_lora_rank"] if hparams.get("kv_lora_rank") is not None else 512
+        kv_lora_rank = hparams.get("kv_lora_rank", 512)
@@
-        if "kv_lora_rank" in hparams and hparams["kv_lora_rank"] is not None:
-            self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
+        if not is_ocr:
+            self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
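One nuance worth keeping in mind when applying the suggestion (a generic Python sketch, not project code): `dict.get(key, default)` falls back only when the key is absent, not when its value is an explicit `None`, so the two forms in the diff differ in that edge case.

```python
# Generic illustration: dict.get(key, default) does NOT treat an explicit
# None value as missing, unlike the original conditional expression.
hparams = {"kv_lora_rank": None}  # hypothetical hparams with an explicit None

via_get = hparams.get("kv_lora_rank", 512)
via_cond = hparams["kv_lora_rank"] if hparams.get("kv_lora_rank") is not None else 512

print(via_get)   # None -- .get() only falls back when the key is absent
print(via_cond)  # 512  -- the conditional form falls back on missing or None
```

If the converter must tolerate an explicit `None` in the config, the conditional form (or an equivalent post-check) is the one that matches the stated intent.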
🤖 Fix all issues with AI agents
In `@src/llama-arch.cpp`:
- Around lines 1523-1554: The tensor name set for LLM_ARCH_DEEPSEEK2OCR is inconsistent with the converter output, causing load failures. Update the tensor list returned in the LLM_ARCH_DEEPSEEK2OCR case to match the converter-exported names exactly:
  - replace the plural enums LLM_TENSOR_FFN_GATE_EXPS, LLM_TENSOR_FFN_DOWN_EXPS, and LLM_TENSOR_FFN_UP_EXPS with the converter's singular enums LLM_TENSOR_FFN_GATE_EXP, LLM_TENSOR_FFN_DOWN_EXP, and LLM_TENSOR_FFN_UP_EXP;
  - remove LLM_TENSOR_FFN_GATE_INP_SHEXP if the converter does not export it;
  - add LLM_TENSOR_ATTN_ROT_EMBD, which the converter does export.
  Alternatively, if you prefer changing the converter, modify its export names to produce the pluralized enums used here so that the set returned by the LLM_ARCH_DEEPSEEK2OCR branch matches the converter exactly.
🧹 Nitpick comments (2)
src/llama-model.cpp (1)

5014-5038: Add MoE sanity checks in the OCR branch for parity.
The OCR MoE path builds expert tensors without the n_expert / n_expert_used guards that exist in the non‑OCR path, which makes inconsistent metadata harder to diagnose. Consider mirroring the same fail-fast checks here.

♻️ Suggested guard (parity with non‑OCR path)
                             else {
                                 layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, 0);
                                 layer.ffn_exp_probs_b = create_tensor(tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, TENSOR_NOT_REQUIRED);
+                                if (n_expert == 0) {
+                                    throw std::runtime_error("n_expert must be > 0");
+                                }
+                                if (n_expert_used == 0) {
+                                    throw std::runtime_error("n_expert_used must be > 0");
+                                }
                                 // MoE branch
                                 layer.ffn_gate_exps = create_tensor(tn(LLM_TENSOR_FFN_GATE_EXPS, "weight", i), {  n_embd, n_ff_exp, n_expert}, 0);

Upstream PR notes reference DeepSeek‑OCR LM support with standard attention, so the square Q/K/V weights here align with that intent. (github.com)

convert_hf_to_gguf.py (1)

1818-1818: Annotate mutable class attribute with ClassVar (RUF012).
Improves typing clarity and resolves the Ruff warning.

Proposed update
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
-    n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
+    n_block_keys: ClassVar[list[str]] = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
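For context on RUF012 (a minimal generic sketch, not project code): the `ClassVar` annotation documents that the list is a single piece of class-level state shared by all instances, which is exactly what the Ruff rule asks you to make explicit for mutable class attributes.

```python
from typing import ClassVar

# Generic illustration of what the annotation above expresses: the list is
# class-level state, shared by every instance rather than copied per object.
class Model:
    n_block_keys: ClassVar[list[str]] = ["n_layers", "num_hidden_layers"]

a, b = Model(), Model()
assert a.n_block_keys is b.n_block_keys  # one shared list, not per-instance
```

Without `ClassVar`, type checkers may read the annotation as an instance attribute, and mutating the shared list through one instance would silently affect all others.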

sfallah and others added 12 commits February 11, 2026 14:10
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/tensor_mapping.py
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
- ignore llama-arch test for deepseek-ocr
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 25, 2026


3 participants