[contrib] Add sarvam-30b NeuronX port #83

Open
dhwanw wants to merge 3 commits into main from contrib/sarvam-30b

Conversation


@dhwanw dhwanw commented Mar 17, 2026

Description

NeuronX Distributed Inference port of sarvamai/sarvam-30b, a 30B-parameter Mixture of Experts model (2.4B active per token). Sarvam uses a hybrid dense+MoE architecture (layer 0 dense, layers 1-18 MoE), 128 routed experts with top-6 routing, sigmoid routing with learned expert bias, a 2.5x routed scaling factor, shared experts, and Q/K normalization. Key porting challenges included the custom routing logic, shared expert handling separate from the NXDI MoE module, and ParallelEmbedding fixes for correct XLA tracing.

Model Information

Model Name: sarvam-30b
Model Architecture: Decoder-only hybrid dense+MoE transformer -- 19 layers (1 dense + 18 MoE), 128 routed experts with top-6 sigmoid routing + expert bias + 2.5x scaling, 1 shared expert per MoE layer, 64 Q heads / 4 KV heads (GQA), Q/K RMSNorm, RoPE (theta=8M)
Purpose: Multilingual text generation (Indian languages focus)
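The Q/K RMSNorm mentioned above is applied per head over head_dim=64 after the Q/K projections are split into heads. A minimal sketch of that behavior, assuming standard RMSNorm semantics (function names and shapes here are illustrative, not the actual NXDI modules):

```python
import torch

HEAD_DIM = 64  # per-head dimension from the model config

def rms_norm(x, weight, eps=1e-6):
    # Standard RMSNorm over the last axis: x / rms(x) * learned scale.
    var = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(var + eps) * weight

def qk_norm(q, k, q_weight, k_weight):
    """q: [batch, seq, n_q_heads, HEAD_DIM]; k: [batch, seq, n_kv_heads, HEAD_DIM].

    The norm acts on the last (head_dim) axis, so every head shares the
    same learned scale of shape [HEAD_DIM].
    """
    return rms_norm(q, q_weight), rms_norm(k, k_weight)
```

With 64 Q heads and 4 KV heads (GQA), the same [HEAD_DIM]-shaped scale is broadcast across all heads of each projection.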

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Validates model generation and coherence
    • Performance benchmarks (TTFT, throughput)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/sarvam-30b/
  README.md
  /src
    modeling_sarvam.py
  /test
    /integration
      test_model.py

Testing

Model was compiled and tested with TP=8, batch_size=1, seq_len=128, bfloat16 on trn1.32xlarge.

Test Results:

Test                   Status   Result
Smoke Test             ✅ PASS  Model loads successfully
Greedy Token Matching  ✅ PASS  61.1% average (4/10 prompts at 100%)
Teacher-Forced Match   ✅ PASS  98.4% average
Throughput             ✅ PASS  3.6 tok/s

Greedy divergence is expected for MoE models: in BF16, the interplay of sigmoid routing, the learned expert bias, and the 2.5x scaling factor can flip near-tied expert selections, changing individual greedy tokens. The 98.4% teacher-forced match confirms the model is functionally correct.

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22
  • Instance Type(s): trn1.32xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.10
  • Configuration: TP=8, batch_size=1, seq_len=128, bfloat16

Additional Information

  • Hybrid dense+MoE: Layer 0 uses dense MLP (first_k_dense_replace=1), layers 1-18 use MoE.
  • Sigmoid routing with expert bias: Custom SarvamRouterTopK applies sigmoid activation then adds learned expert_bias (post-sigmoid, pre-topk). Affinities use unbiased sigmoid scores.
  • Routed scaling factor: Normalized routing weights are multiplied by 2.5 before combining with shared expert output.
  • Shared expert: Handled separately from the NXDI MoE module to support the scaling factor.
  • Q/K normalization: RMSNorm applied per-head on head_dim=64 after Q/K projection split.
  • ParallelEmbedding fix: Required shard_across_embedding, pad, tensor_model_parallel_group, and use_spmd_rank parameters to avoid rank-0 baked constants in XLA tracing.
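The routing behavior described above (sigmoid scores, post-sigmoid expert bias used only for selection, unbiased affinities, 2.5x scaling, shared-expert add) can be sketched as follows. This is a simplified illustration of the mechanism, not the actual SarvamRouterTopK code; function names and the toy hidden size are assumptions:

```python
import torch

NUM_EXPERTS = 128
TOP_K = 6
ROUTED_SCALING_FACTOR = 2.5

def route_tokens(router_logits, expert_bias):
    """router_logits: [tokens, NUM_EXPERTS]; expert_bias: [NUM_EXPERTS]."""
    # Sigmoid activation on the raw router logits.
    scores = torch.sigmoid(router_logits)
    # Learned bias is added post-sigmoid, pre-top-k: it steers which
    # experts are *selected* but not how they are *weighted*.
    biased = scores + expert_bias
    _, expert_idx = torch.topk(biased, TOP_K, dim=-1)
    # Affinities come from the unbiased sigmoid scores, normalized over
    # the selected experts, then scaled by the routed scaling factor.
    weights = torch.gather(scores, -1, expert_idx)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return expert_idx, weights * ROUTED_SCALING_FACTOR

def moe_combine(expert_out, weights, shared_out):
    """expert_out: [tokens, TOP_K, hidden]; shared_out: [tokens, hidden].

    Scaled routed outputs are summed, then the shared expert's output is
    added on top (which is why the shared expert lives outside the
    routed-scaling path).
    """
    routed = (expert_out * weights.unsqueeze(-1)).sum(dim=1)
    return routed + shared_out
```

Because the weights are normalized before scaling, they always sum to 2.5 per token, so the routed branch contributes 2.5x relative to an unscaled top-k average; layer 0 bypasses all of this and uses a plain dense MLP.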

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

dhwanw and others added 3 commits March 13, 2026 20:24
Sarvam-30B is a 30B MoE model (128 experts + 1 shared, top-6 routing) with
sigmoid scoring, learned expert bias, and 2.5x routed scaling factor. Custom
SarvamRouterTopK implements exact HF routing behavior. Layer 0 is dense,
layers 1-18 are MoE with separate shared expert MLP.

Validation: 61% greedy match, 98.4% teacher-forced match (TP=8, bf16).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use consistent CE/TG column table format across all contrib models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhwanw dhwanw marked this pull request as ready for review March 19, 2026 19:36