
Add DeepseekV3HybridMoeModuleArchitecture #661

Open
zhoutong-hai wants to merge 2 commits into arcee-ai:main from zhoutong-hai:ds-v3

Conversation


@zhoutong-hai zhoutong-hai commented Jan 21, 2026

DeepSeek V3 uses a hybrid dense + MoE MLP:
- the first 3 layers use dense MLPs with weights:
    mlp.{gate_proj,up_proj,down_proj}.weight
- the remaining layers are typically MoE (controlled by moe_layer_freq):
    mlp.gate.weight (+ optional e_score_correction_bias),
    mlp.shared_experts.{gate_proj,up_proj,down_proj}.weight,
    mlp.experts.{i}.{gate_proj,up_proj,down_proj}.weight
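
The layer layout above can be sketched as a small weight-name generator. This is a minimal illustration, not the PR's actual code: `first_dense`, `moe_layer_freq`, and `n_experts` are hypothetical parameter names standing in for the corresponding config fields, and the values are illustrative.

```python
# Sketch of the dense-vs-MoE weight-name mapping described above.
def mlp_weight_names(layer_idx: int, first_dense: int = 3,
                     moe_layer_freq: int = 1, n_experts: int = 2) -> list:
    prefix = f"model.layers.{layer_idx}.mlp"
    projs = ("gate_proj", "up_proj", "down_proj")
    is_moe = layer_idx >= first_dense and layer_idx % moe_layer_freq == 0
    if not is_moe:
        # Dense layer: plain MLP projections
        return [f"{prefix}.{p}.weight" for p in projs]
    # MoE layer: router gate (+ optional bias), shared experts, routed experts
    names = [f"{prefix}.gate.weight",
             f"{prefix}.gate.e_score_correction_bias"]  # optional in checkpoints
    names += [f"{prefix}.shared_experts.{p}.weight" for p in projs]
    for i in range(n_experts):
        names += [f"{prefix}.experts.{i}.{p}.weight" for p in projs]
    return names
```

For example, layer 0 yields only the three dense projection names, while layer 3 onward yields the router gate, shared-expert, and per-expert names.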

Note

Introduces support for DeepSeek V3’s hybrid dense+MoE layout and wires it into architecture selection.

  • Adds DeepseekV3HybridMoeModuleArchitecture with config-driven layer mapping: dense mlp.{gate_proj,up_proj,down_proj} for initial layers, then MoE (mlp.gate[.e_score_correction_bias?], mlp.shared_experts.*, mlp.experts.{i}.*) by moe_layer_freq; includes attention, pre/post weights
  • Registers the architecture in architecture/__init__.py (model_type=deepseek_v3)
  • Graph runtime: sets arbitrary_types_allowed=True on Task to accept non-pydantic types

Written by Cursor Bugbot for commit efce27e. This will update automatically on new commits.


github-actions bot commented Jan 21, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@zhoutong-hai (Author) commented

I have read the CLA Document and I hereby sign the CLA


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

return [
    WeightInfo(name="model.norm.weight"),
    WeightInfo(name="lm_head.weight", is_embed=True),
]

Missing optional and tied_names for lm_head.weight

High Severity

The lm_head.weight in post_weights is missing optional=True and tied_names=("model.embed_tokens.weight",). Many transformer models (including DeepSeek V3 variants) tie the input embeddings with the output LM head, storing only one copy as model.embed_tokens.weight. Without optional and tied_names, the weight loading will fail when lm_head.weight doesn't exist separately in the checkpoint, as the code has no fallback to look for the weight under its tied name. Other architecture definitions (LLaMA, Mistral) correctly handle this pattern.
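
A minimal sketch of the suggested fix. Note that `WeightInfo` here is a hypothetical stand-in dataclass mirroring only the fields used in the snippet above; the real class lives in mergekit's architecture module and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in for mergekit's WeightInfo, covering only the
# fields relevant to this review comment.
@dataclass(frozen=True)
class WeightInfo:
    name: str
    is_embed: bool = False
    optional: bool = False
    tied_names: Optional[Tuple[str, ...]] = None

def post_weights():
    # lm_head.weight may be absent when embeddings are tied; marking it
    # optional with a tied_names fallback lets the loader fall back to
    # model.embed_tokens.weight instead of failing.
    return [
        WeightInfo(name="model.norm.weight"),
        WeightInfo(
            name="lm_head.weight",
            is_embed=True,
            optional=True,
            tied_names=("model.embed_tokens.weight",),
        ),
    ]
```

This mirrors how the LLaMA and Mistral architecture definitions cited in the review handle tied embeddings.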


