
[contrib] Add MPT-7B-Chat NeuronX port#80

Open
dhwanw wants to merge 3 commits into `main` from `contrib/mpt-7b-chat`

Conversation


@dhwanw dhwanw commented Mar 17, 2026

Description

NeuronX Distributed Inference port of mosaicml/mpt-7b-chat, a 6.7B-parameter decoder-only transformer with ALiBi attention. NXDI has no native ALiBi support, so per-head slopes are stored as a weight parameter that gets TP-sharded. Position bias is computed at runtime and added to attention scores. Flash attention is disabled since NKI kernels cannot accept additive bias tensors.
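The runtime bias computation described above follows the standard ALiBi scheme: each head gets a fixed geometric slope, and the additive score bias is the slope times the negative key-to-query distance. A minimal sketch in plain Python (illustrative names only, not the port's actual modeling code; assumes the head count is a power of two, as with MPT's 32 heads):

```python
def alibi_slopes(n_heads):
    # Standard ALiBi geometric slopes: 2^(-8/n), 2^(-16/n), ...
    # In this port these are stored as a weight parameter and TP-sharded.
    return [2 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(n_heads, seq_len):
    # Additive bias per head: slope * -(query_pos - key_pos).
    # Shape [n_heads][seq_len][seq_len]; added to attention scores
    # before softmax (keys after the query would be masked anyway).
    slopes = alibi_slopes(n_heads)
    return [
        [[s * -(q - k) for k in range(seq_len)] for q in range(seq_len)]
        for s in slopes
    ]

bias = alibi_bias(n_heads=4, seq_len=3)
# bias[0][2][0] penalizes the key two positions behind the query
# by slopes[0] * -2; keys at the query position get zero bias.
```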

Model Information

Model Name: MPT-7B-Chat
Model Architecture: Decoder-only transformer with ALiBi attention (no position embeddings), 32 MHA heads, 32 layers, LayerNorm without bias, GELU, fused QKV, tied embeddings
Purpose: Chat/instruction following

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Validates model generation and coherence
    • Performance benchmarks (TTFT, throughput)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/mpt-7b-chat/
  README.md
  /src
    modeling_mpt.py
  /test
    /integration
      test_model.py

Testing

Model was compiled and tested with TP=1, batch_size=1, seq_len=128, bfloat16 on trn1.32xlarge.

Test Results:

| Test | Status | Result |
| --- | --- | --- |
| Smoke Test | ✅ PASS | Model loads successfully |
| Greedy Token Matching | ✅ PASS | 54.84% average (2/10 prompts at 100%) |
| Teacher-Forced Match | ✅ PASS | 97.50% average |
| Throughput | ✅ PASS | 18.0 tok/s |

The lower greedy match rate compared to non-ALiBi models is expected: BF16 precision differences in the additive position bias compound during autoregressive generation. The high teacher-forced rate (97.50%) confirms weights are correctly ported.
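Both match rates above are position-wise token comparisons against reference outputs; they differ only in whether the model consumes its own previous outputs (greedy, where one early divergence derails everything after it) or the reference tokens at every step (teacher-forced, where errors cannot compound). A toy sketch of the comparison itself (hypothetical helper, not the repo's test harness):

```python
def token_match_rate(reference, candidate):
    # Fraction of positions where two equal-length token-id sequences agree.
    # Greedy matching scores free-running generation against the reference;
    # teacher-forced matching scores per-step predictions made from the
    # reference prefix, so a single early mismatch does not compound.
    assert len(reference) == len(candidate)
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)

# A divergence at position 2 costs one position here, but in greedy
# decoding it would also change every token generated after it.
rate = token_match_rate([5, 9, 2, 7], [5, 9, 4, 7])  # 0.75
```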

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22
  • Instance Type(s): trn1.32xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.10
  • Configuration: TP=1, batch_size=1, seq_len=128, bfloat16

Additional Information

  • ALiBi attention: No native NXDI support. Per-head slopes stored as a weight parameter (alibi_slopes) that gets TP-sharded. Position bias computed at runtime from slopes and token positions, added to attention scores before softmax.
  • Flash attention disabled: NKI kernels cannot accept additive bias tensors, so flash attention must be disabled for ALiBi.
  • Fused QKV: HF checkpoint stores a single Wqkv weight, split into separate Q, K, V during weight conversion.
  • No-bias LayerNorm: MPT uses no_bias=True for all LayerNorm layers.
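The fused-QKV conversion noted above amounts to a row-wise split of the checkpoint tensor. A sketch under the assumption (matching the HF MPT layout) that `Wqkv` stacks the Q, K, and V projections along the output dimension; names are illustrative, not the port's actual conversion code:

```python
import numpy as np

def split_wqkv(wqkv, hidden_size):
    # The HF checkpoint stores attention projections as one fused Wqkv of
    # shape [3 * hidden_size, hidden_size]; split it row-wise into the
    # separate Q, K, V weights that the NxD attention module expects.
    assert wqkv.shape == (3 * hidden_size, hidden_size)
    q_proj, k_proj, v_proj = np.split(wqkv, 3, axis=0)
    return q_proj, k_proj, v_proj

hidden = 4
wqkv = np.arange(3 * hidden * hidden, dtype=np.float32).reshape(3 * hidden, hidden)
q, k, v = split_wqkv(wqkv, hidden)
```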

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

dhwanw and others added 3 commits March 5, 2026 22:19
NeuronX port of MosaicML MPT-7B-Chat (6.7B params) with:
- ALiBi slopes stored as TP-sharded weight parameter
- Position bias computed at runtime, flash attention disabled
- Fused QKV from HF checkpoint split during weight conversion
- LayerNorm without bias

Validated: 54.84% greedy, 97.50% teacher-forced (job 7769).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use consistent CE/TG column table format across all contrib models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhwanw dhwanw marked this pull request as ready for review March 19, 2026 19:39
