
feat(moe): add orthogonal initialization for gate parameters#664

Open
joenaess wants to merge 6 commits into arcee-ai:main from joenaess:feature/moe-orthogonal-init

Conversation

@joenaess

@joenaess joenaess commented Feb 4, 2026

Overview

This PR introduces Orthogonal Initialization for Mixture-of-Experts (MoE) gate parameters. This is a critical feature for "Sparse Upcycling" workflows where a dense monolingual model is transformed into a sparse MoE architecture.

Technical Justification

Standard Gaussian initialization (random) can lead to high correlation between gate vectors in the early stages of training. In a language technology context, this causes Expert Collapse, where multiple experts are updated with gradients for the same token clusters, hindering specialization.

By implementing torch.nn.init.orthogonal_, we ensure:

  • The gate matrix has a condition number of 1.
  • Experts start by covering maximally distinct regions of the hidden state manifold.
  • Faster convergence during the initial multilingual fine-tuning phase.
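The first two claims can be checked directly with a short, standalone sketch in plain PyTorch (independent of mergekit internals; the shapes are illustrative). Because the gate matrix has far fewer rows (experts) than columns (hidden size), torch.nn.init.orthogonal_ gives it orthonormal rows:

```python
import torch

num_experts, hidden_size = 8, 4096

# Gate matrix with num_experts << hidden_size; orthogonal_ fills it with a
# semi-orthogonal matrix, i.e. orthonormal rows.
gate = torch.empty(num_experts, hidden_size)
torch.nn.init.orthogonal_(gate)

# Rows are mutually orthogonal unit vectors: Q @ Q.T == I.
identity = torch.eye(num_experts)
assert torch.allclose(gate @ gate.T, identity, atol=1e-5)

# All singular values are 1, so the condition number is 1.
svals = torch.linalg.svdvals(gate)
assert torch.allclose(svals, torch.ones(num_experts), atol=1e-5)
```

Each row is a unit vector orthogonal to every other row, so every expert's gate direction starts maximally separated from its neighbours.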

Changes

  • mergekit/moe/config.py: Added orthogonal to GateMode Literal for configuration validation.
  • mergekit/moe/router.py: Implemented the initialization logic in get_gate_params, ensuring float32 precision during computation for mathematical stability.
  • tests/test_moe_orthogonal.py: Added unit tests to verify the mathematical orthogonality ($Q Q^T = I$) across all layers.

Verification

Ran unit tests using uv:
uv run python -m unittest tests/test_moe_orthogonal.py
Status: PASSED


Note

Medium Risk
Introduces new initialization behavior for MoE routing weights and widens accepted config values, which can affect downstream training/merge outputs. Also includes a minor but potentially risky change in tokensurgeon/rope_helpers.py that leaves a stray no-op statement that could indicate an accidental edit.

Overview
Adds a new MoE gate_mode option, orthogonal, and implements it in get_gate_params by generating per-layer, orthogonally-initialized gate matrices (initialized in float32, then cast/moved to the requested dtype/device).

Updates MoE config validation to strictly constrain gate_mode via Literal, and adds unit tests asserting Q @ Q.T ≈ I for the new initialization path.
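The described path (per-layer orthogonal init in float32, then cast/move) might look roughly like the following sketch. This is a hypothetical helper for illustration only; the real logic lives in get_gate_params in mergekit/moe/router.py and its signature differs:

```python
import torch

def orthogonal_gate_params(
    num_layers: int,
    num_experts: int,
    hidden_size: int,
    target_dtype: torch.dtype = torch.bfloat16,
    device: str = "cpu",
) -> torch.Tensor:
    """Hypothetical sketch of per-layer orthogonal gate initialization."""
    layers = []
    for _ in range(num_layers):
        # Initialize in float32 so the underlying QR decomposition is
        # numerically stable, regardless of the target dtype.
        layer_gate = torch.empty(num_experts, hidden_size, dtype=torch.float32)
        torch.nn.init.orthogonal_(layer_gate)
        # Cast to the target dtype and move to the requested device.
        layers.append(layer_gate.to(dtype=target_dtype, device=device))
    # Stack into the (num_layers, num_experts, hidden_size) shape that
    # callers of the other gate modes expect.
    return torch.stack(layers, dim=0)
```

Note that casting to a low-precision dtype such as bfloat16 after the float32 init slightly perturbs the exact orthogonality, which is why the verification tests allow a small tolerance.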

Separately, this PR includes broad lint/formatting cleanups (e.g., is vs ==, is not None, assert formatting), adds ruff to dev dependencies, and contains a small functional change in tokensurgeon/rope_helpers.py where an unused n_heads local is removed (leaving a no-op expression).

Written by Cursor Bugbot for commit 378bb89. This will update automatically on new commits.

@github-actions

github-actions bot commented Feb 4, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@joenaess
Author

joenaess commented Feb 4, 2026

I have read the CLA Document and I hereby sign the CLA

Comment thread mergekit/moe/router.py
# 3. Cast to the target dtype and move to the requested device
gate_vecs.append(layer_gate.to(dtype=target_dtype, device=device if device != "auto" else "cpu"))

return gate_vecs

Orthogonal mode returns list instead of tensor

High Severity

The new orthogonal mode returns gate_vecs as a list of tensors, while all other modes (random, uniform_random, hidden, cheap_embed) return a single tensor with shape (num_layers, num_experts, hidden_size). Callers in moe.py use tensor indexing like gate_vecs[:, :len(...), :] and warn_degenerate_gates expects gate_vecs.shape to exist. This will cause a runtime crash when using orthogonal mode.
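A sketch of the kind of fix this implies (illustrative shapes, not the actual patch): build the per-layer list as before, then stack it into the single tensor shape the other modes return.

```python
import torch

num_layers, num_experts, hidden_size = 4, 8, 64

# Per-layer orthogonal gates, as the new mode builds them.
gate_vecs = []
for _ in range(num_layers):
    layer_gate = torch.empty(num_experts, hidden_size)
    torch.nn.init.orthogonal_(layer_gate)
    gate_vecs.append(layer_gate)

# Stack into one (num_layers, num_experts, hidden_size) tensor so callers
# can keep using tensor indexing like gate_vecs[:, :k, :] and .shape.
gate_tensor = torch.stack(gate_vecs, dim=0)
assert gate_tensor.shape == (num_layers, num_experts, hidden_size)
```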


Comment thread mergekit/moe/config.py
# "random" is standard normal distribution (torch.randn)
# "uniform_random" matches default initialization for torch.nn.Linear
# "orthogonal" ensures gate vectors are orthogonal for better expert specialization


Missing validation bypass for orthogonal mode prompts

Medium Severity

The is_bad_config function has an early return for "random" mode to skip prompt validation, but the new "orthogonal" mode (which also doesn't use prompts) wasn't added to this check. Users attempting to use orthogonal initialization without prompts will get the error "Expert X has no positive prompts" even though orthogonal mode generates gate vectors mathematically without using prompts at all.
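The shape of the suggested guard, as a hedged sketch (the signature and field names here are assumptions for illustration; the actual is_bad_config lives in mergekit/moe/config.py):

```python
def is_bad_config(config, allow_all_same: bool = False) -> bool:
    """Sketch: reject configs with missing prompts, except for gate modes
    that synthesize gate vectors mathematically without any prompts."""
    # Both "random" and the new "orthogonal" mode never consult prompts,
    # so neither should trigger prompt validation.
    if config.gate_mode in ("random", "orthogonal"):
        return False
    for expert in config.experts:
        if not expert.positive_prompts:
            print(f"Expert {expert.source_model} has no positive prompts.")
            return True
    return False
```

With only "random" in the early return, an orthogonal config without prompts would fall through to the loop and be rejected spuriously, which is exactly the failure mode described above.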


Comment thread tests/test_moe_init.py

gates = get_gate_params(
model_cfg=mock_cfg, experts=mock_experts, mode="orthogonal"
)

Test uses incorrect parameter name for function

Low Severity

The test calls get_gate_params(model_cfg=mock_cfg, ...) but the function signature expects model_ref as the first parameter, not model_cfg. It's also missing the required tokenizer parameter. This test will fail with a TypeError about unexpected keyword arguments.


Comment thread pyproject.toml
dev = [
"black~=25.1.0",
"isort~=6.0.1",
"pre-commit~=4.2.0",

Inconsistent indentation in pyproject.toml dev dependencies

Low Severity

Lines 98-99 in the dev dependency list have inconsistent indentation (only 1 space) compared to the surrounding lines, which use 4 spaces. TOML ignores whitespace inside arrays, so this won't break parsing, but it makes the formatting confusing.



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

stats.to_approximate += 1

- donor_tokenizer = transformers.AutoTokenizer.from_pretrained(
+ transformers.AutoTokenizer.from_pretrained(

Wasteful tokenizer loading with discarded result

Low Severity

The transformers.AutoTokenizer.from_pretrained call loads a donor tokenizer, but the result is discarded. This performs network/disk I/O and memory allocation for no purpose. The change removed the donor_tokenizer variable assignment but kept the now-useless function call; since the loaded tokenizer is never used, the entire call can be removed.

