
Add DeepseekV3HybridMoeModuleArchitecture #661

Open
zhoutong-hai wants to merge 2 commits into arcee-ai:main from zhoutong-hai:ds-v3

Conversation


@zhoutong-hai zhoutong-hai commented Jan 21, 2026

DeepSeek V3 uses a hybrid dense + MoE MLP:
- the first 3 layers use dense MLPs with weights:
    mlp.{gate_proj,up_proj,down_proj}.weight
- the remaining layers are typically MoE (controlled by moe_layer_freq):
    mlp.gate.weight (+ optional e_score_correction_bias),
    mlp.shared_experts.{gate_proj,up_proj,down_proj}.weight,
    mlp.experts.{i}.{gate_proj,up_proj,down_proj}.weight
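
The layer layout above can be sketched as a small weight-name generator. This is a minimal illustration, not the PR's actual code: `first_dense`, `moe_layer_freq`, and `n_experts` are hypothetical parameter names standing in for the corresponding config fields, and the values are illustrative.

```python
# Sketch of the dense-vs-MoE weight-name mapping described above.
def mlp_weight_names(layer_idx: int, first_dense: int = 3,
                     moe_layer_freq: int = 1, n_experts: int = 2) -> list:
    prefix = f"model.layers.{layer_idx}.mlp"
    projs = ("gate_proj", "up_proj", "down_proj")
    is_moe = layer_idx >= first_dense and layer_idx % moe_layer_freq == 0
    if not is_moe:
        # Dense layer: plain MLP projections
        return [f"{prefix}.{p}.weight" for p in projs]
    # MoE layer: router gate (+ optional bias), shared experts, routed experts
    names = [f"{prefix}.gate.weight",
             f"{prefix}.gate.e_score_correction_bias"]  # optional in checkpoints
    names += [f"{prefix}.shared_experts.{p}.weight" for p in projs]
    for i in range(n_experts):
        names += [f"{prefix}.experts.{i}.{p}.weight" for p in projs]
    return names
```

For example, layer 0 yields only the three dense projection names, while layer 3 onward yields the router gate, shared-expert, and per-expert names.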

Note

Introduces support for DeepSeek V3’s hybrid dense+MoE layout and wires it into architecture selection.

  • Adds DeepseekV3HybridMoeModuleArchitecture with config-driven layer mapping: dense mlp.{gate_proj,up_proj,down_proj} for initial layers, then MoE (mlp.gate[.e_score_correction_bias?], mlp.shared_experts.*, mlp.experts.{i}.*) by moe_layer_freq; includes attention, pre/post weights
  • Registers the architecture in architecture/__init__.py (model_type=deepseek_v3)
  • Graph runtime: sets arbitrary_types_allowed=True on Task to accept non-pydantic types

Written by Cursor Bugbot for commit efce27e. This will update automatically on new commits.


github-actions bot commented Jan 21, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@zhoutong-hai (Author) commented

I have read the CLA Document and I hereby sign the CLA


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

return [
    WeightInfo(name="model.norm.weight"),
    WeightInfo(name="lm_head.weight", is_embed=True),
]

Missing optional and tied_names for lm_head.weight

High Severity

The lm_head.weight in post_weights is missing optional=True and tied_names=("model.embed_tokens.weight",). Many transformer models (including DeepSeek V3 variants) tie the input embeddings with the output LM head, storing only one copy as model.embed_tokens.weight. Without optional and tied_names, the weight loading will fail when lm_head.weight doesn't exist separately in the checkpoint, as the code has no fallback to look for the weight under its tied name. Other architecture definitions (LLaMA, Mistral) correctly handle this pattern.
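
A minimal sketch of the suggested fix. Note that `WeightInfo` here is a hypothetical stand-in dataclass mirroring only the fields used in the snippet above; the real class lives in mergekit's architecture module and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in for mergekit's WeightInfo, covering only the
# fields relevant to this review comment.
@dataclass(frozen=True)
class WeightInfo:
    name: str
    is_embed: bool = False
    optional: bool = False
    tied_names: Optional[Tuple[str, ...]] = None

def post_weights():
    # lm_head.weight may be absent when embeddings are tied; marking it
    # optional with a tied_names fallback lets the loader fall back to
    # model.embed_tokens.weight instead of failing.
    return [
        WeightInfo(name="model.norm.weight"),
        WeightInfo(
            name="lm_head.weight",
            is_embed=True,
            optional=True,
            tied_names=("model.embed_tokens.weight",),
        ),
    ]
```

This mirrors how the LLaMA and Mistral architecture definitions cited in the review handle tied embeddings.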


