[Mirror] mtmd: Add DeepSeekOCR Support#66
Conversation
init commit
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
…ut in deepseek2 model
…e image decoding fails
# Conflicts: # tools/mtmd/clip.cpp
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
gguf-py/gguf/constants.py (1)
303-309: ⚠️ Potential issue | 🔴 Critical

Duplicate `WINDOW_SIZE` attribute in `ClipVision` class.

`WINDOW_SIZE` is defined twice: at line 303 (newly added) and line 309 (pre-existing). The second definition shadows the first. Remove the duplicate.

🐛 Proposed fix

```diff
     SPATIAL_MERGE_SIZE  = "clip.vision.spatial_merge_size"
-    WINDOW_SIZE         = "clip.vision.window_size"
     USE_GELU            = "clip.use_gelu"
     USE_SILU            = "clip.use_silu"
     N_WA_PATTERN        = "clip.vision.n_wa_pattern" # used by qwen2.5vl
     WA_LAYER_INDEXES    = "clip.vision.wa_layer_indexes" # used by youtuvl
     IS_DEEPSTACK_LAYERS = "clip.vision.is_deepstack_layers"
     WINDOW_SIZE         = "clip.vision.window_size"
```

convert_hf_to_gguf.py (1)
7738-7774: ⚠️ Potential issue | 🟠 Major

Potential metadata mismatch when `kv_lora_rank` is absent.

You compute key/value lengths using a fallback `kv_lora_rank = 512`, but you only emit `add_kv_lora_rank` when the hparam is present. That can leave GGUF metadata inconsistent if `kv_lora_rank` is missing or `None`. Consider emitting the fallback (or deriving it) consistently.

Suggested fix

```diff
-        kv_lora_rank = hparams["kv_lora_rank"] if hparams.get("kv_lora_rank") is not None else 512
+        kv_lora_rank = hparams.get("kv_lora_rank", 512)
@@
-        if "kv_lora_rank" in hparams and hparams["kv_lora_rank"] is not None:
-            self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
+        if not is_ocr:
+            self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
```
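As an aside on the duplicate `WINDOW_SIZE` finding above: the shadowing comes straight from Python's class-body semantics, where the last assignment to a name silently wins. A minimal sketch with illustrative names (not from the repo):

```python
class Demo:
    KEY = "first"   # first definition
    KEY = "second"  # silently shadows the first: the last assignment wins

# The duplicate leaves no trace at runtime -- only one attribute survives.
print(Demo.KEY)  # → second
```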
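The inconsistency described above fits in a few lines. `plan_metadata` below is a hypothetical stand-in for the converter's logic (not code from this PR): the rank used to compute lengths falls back to 512, but the metadata key is only emitted when the hparam is present, so the two can disagree.

```python
def plan_metadata(hparams: dict) -> dict:
    # Fallback rank is always used for the length computation ...
    kv_lora_rank = hparams["kv_lora_rank"] if hparams.get("kv_lora_rank") is not None else 512
    meta = {"kv_len_basis": kv_lora_rank}
    # ... but the rank itself is only written when the hparam is present,
    # which is the mismatch the review points out.
    if "kv_lora_rank" in hparams and hparams["kv_lora_rank"] is not None:
        meta["kv_lora_rank"] = kv_lora_rank
    return meta

print(plan_metadata({}))                     # rank 512 used, but never emitted
print(plan_metadata({"kv_lora_rank": 256}))  # rank both used and emitted
```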
🤖 Fix all issues with AI agents
In `@src/llama-arch.cpp`:
- Around line 1523-1554: The tensor name set for LLM_ARCH_DEEPSEEK2OCR is
inconsistent with the converter output causing load failures; update the tensor
list returned in the LLM_ARCH_DEEPSEEK2OCR case to exactly match the
converter-exported names (replace plural exps names LLM_TENSOR_FFN_GATE_EXPS,
LLM_TENSOR_FFN_DOWN_EXPS, LLM_TENSOR_FFN_UP_EXPS with the converter's singular
enums LLM_TENSOR_FFN_GATE_EXP, LLM_TENSOR_FFN_DOWN_EXP, LLM_TENSOR_FFN_UP_EXP,
remove LLM_TENSOR_FFN_GATE_INP_SHEXP if the converter doesn't export it, and add
LLM_TENSOR_ATTN_ROT_EMBD which the converter does export); alternatively, if you
prefer changing the converter, modify its export names to produce the pluralized
enums used here so that the set returned by the LLM_ARCH_DEEPSEEK2OCR branch
matches the converter exactly.
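The converter/arch mismatch the agent prompt describes is essentially a set-difference problem: names the loader expects but the converter never writes, and vice versa. A toy cross-check, using hypothetical lowercase stand-ins for the real enum names in `src/llama-arch.cpp`:

```python
# Hypothetical stand-ins for the two name sets; the real names are the
# LLM_TENSOR_* enums and the converter's exported tensor names.
converter_exports = {"ffn_gate_exp", "ffn_down_exp", "ffn_up_exp", "attn_rot_embd"}
arch_expects = {"ffn_gate_exps", "ffn_down_exps", "ffn_up_exps", "ffn_gate_inp_shexp"}

missing = arch_expects - converter_exports  # expected by the loader, never written
extra = converter_exports - arch_expects    # written by the converter, never loaded

# Either non-empty set means the model fails to load.
print(sorted(missing))
print(sorted(extra))
```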
🧹 Nitpick comments (2)
src/llama-model.cpp (1)
5014-5038: Add MoE sanity checks in the OCR branch for parity.
The OCR MoE path builds expert tensors without the `n_expert`/`n_expert_used` guards that exist in the non-OCR path, which makes inconsistent metadata harder to diagnose. Consider mirroring the same fail-fast checks here.

♻️ Suggested guard (parity with non-OCR path)

```diff
         } else {
             layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, 0);
             layer.ffn_exp_probs_b = create_tensor(tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, TENSOR_NOT_REQUIRED);

+            if (n_expert == 0) {
+                throw std::runtime_error("n_expert must be > 0");
+            }
+            if (n_expert_used == 0) {
+                throw std::runtime_error("n_expert_used must be > 0");
+            }

             // MoE branch
             layer.ffn_gate_exps = create_tensor(tn(LLM_TENSOR_FFN_GATE_EXPS, "weight", i), { n_embd, n_ff_exp, n_expert}, 0);
```

Upstream PR notes reference DeepSeek-OCR LM support with standard attention, so the square Q/K/V weights here align with that intent. (github.com)
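The fail-fast idea above, transposed to Python for illustration (the real guards would be C++ in `src/llama-model.cpp`; the function name here is hypothetical, and the error messages mirror the review's suggestion):

```python
def check_moe_hparams(n_expert: int, n_expert_used: int) -> None:
    # Reject inconsistent MoE metadata before any expert tensors are built,
    # so a bad GGUF fails with a clear message instead of a later shape error.
    if n_expert == 0:
        raise ValueError("n_expert must be > 0")
    if n_expert_used == 0:
        raise ValueError("n_expert_used must be > 0")

check_moe_hparams(64, 6)  # plausible expert counts: passes silently
try:
    check_moe_hparams(0, 6)
except ValueError as err:
    print(err)  # → n_expert must be > 0
```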
convert_hf_to_gguf.py (1)
1818-1818: Annotate mutable class attribute with `ClassVar` (RUF012).

Helps typing clarity and resolves Ruff warning.

Proposed update

```diff
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
-    n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
+    n_block_keys: ClassVar[list[str]] = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
```
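Why `ClassVar` matters here, in a small sketch (`ConverterBase` is a hypothetical stand-in for the converter's base class, with a shortened key list):

```python
from typing import ClassVar

class ConverterBase:
    # ClassVar marks this mutable list as class-level state (what RUF012
    # asks for); without it, type checkers read the annotation as the type
    # of a per-instance attribute.
    n_block_keys: ClassVar[list[str]] = ["n_layers", "num_hidden_layers"]

# The list is shared: every instance sees the same object, so mutating it
# through one instance affects all of them -- worth making explicit.
a, b = ConverterBase(), ConverterBase()
print(a.n_block_keys is b.n_block_keys)  # → True
```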
# Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/tensor_mapping.py
# Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/tensor_mapping.py # src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts: # tools/mtmd/clip.cpp
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
# Conflicts: # src/llama-model.cpp
# Conflicts: # src/llama-model.cpp
- ignore llama-arch test for deepseek-ocr
# Conflicts: # tools/mtmd/clip.cpp
# Conflicts: # convert_hf_to_gguf.py # src/llama-model.cpp
# Conflicts: # tools/mtmd/models/glm4v.cpp # tools/mtmd/models/siglip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Mirror from upstream PR: ggml-org#17400
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.
Summary by CodeRabbit
New Features
Bug Fixes
Tests