[contrib] Add BitNet-b1.58-2B-4T NeuronX port#82

Open
dhwanw wants to merge 3 commits into main from contrib/bitnet-b1.58-2B-4T
Conversation


@dhwanw dhwanw commented Mar 17, 2026

Description

NeuronX Distributed Inference port of microsoft/BitNet-b1.58-2B-4T, a 2B-parameter Llama variant with ternary quantized weights (1.58 bits per weight). Key implementation challenges include ternary weight unpacking (weights packed four per uint8 byte, values -1/0/+1), sub-norm fusion (attn_sub_norm and ffn_sub_norm fused into the following linear layers), the ReLU-squared activation, and a TP-aware unit RMSNorm.
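
The unpacking step above can be sketched as follows. This is a minimal numpy illustration, not the actual port's code: the exact bit layout (least-significant 2-bit field first, encoding 0 → -1, 1 → 0, 2 → +1) and the function name `unpack_ternary` are assumptions -- check the checkpoint's real packing before relying on them.

```python
import numpy as np

def unpack_ternary(packed: np.ndarray, weight_scale: float) -> np.ndarray:
    """Unpack ternary weights stored 4-per-byte in a uint8 array.

    Hypothetical layout: each byte holds four 2-bit fields, least-significant
    bits first, encoded as 0 -> -1, 1 -> 0, 2 -> +1. The per-tensor
    weight_scale is applied after unpacking.
    """
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    fields = (packed[..., None] >> shifts) & 0b11      # shape (..., 4)
    ternary = fields.astype(np.float32) - 1.0          # {0,1,2} -> {-1,0,+1}
    return ternary.reshape(*packed.shape[:-1], -1) * weight_scale

# One byte, fields read LSB-first: 2, 0, 1, 2 -> +1, -1, 0, +1
packed = np.array([[0b10_01_00_10]], dtype=np.uint8)
out = unpack_ternary(packed, 0.5)
assert out.tolist() == [[0.5, -0.5, 0.0, 0.5]]
```

In the real port this runs once inside convert_hf_to_neuron_state_dict, so runtime kernels only ever see ordinary dense bfloat16 tensors.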

Model Information

Model Name: BitNet-b1.58-2B-4T
Model Architecture: Decoder-only transformer (Llama variant) with ternary quantized weights -- 30 layers, 20 Q heads / 5 KV heads (GQA), RoPE (theta=500k), ReLU squared activation, sub-norm fusion, tied embeddings
Purpose: Efficient text generation with ternary weight quantization (1.58 bits/weight)
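
The GQA layout above (20 Q heads / 5 KV heads) interacts with tensor parallelism: with TP=2, 5 KV heads do not divide evenly across ranks, which is why the port replicates KV heads (see "Additional Information" below). A minimal numpy sketch of that replication -- the helper name and the replication-factor choice (smallest factor making the head count divisible) are assumptions; the real NxD code operates on torch tensors via repeat_interleave:

```python
import numpy as np

def replicate_kv_heads(kv: np.ndarray, num_kv_heads: int, tp_degree: int) -> np.ndarray:
    """Replicate KV heads when tp_degree does not divide num_kv_heads.

    Hypothetical helper: numpy's repeat along axis 0 matches
    torch.repeat_interleave, duplicating each head in place.
    """
    if num_kv_heads % tp_degree == 0:
        return kv
    # Smallest factor that makes the head count divisible by tp_degree.
    factor = tp_degree // int(np.gcd(num_kv_heads, tp_degree))
    return np.repeat(kv, factor, axis=0)

# 5 KV heads of head_dim 3, TP=2: each head is duplicated -> 10 heads.
kv = np.arange(5, dtype=np.float32)[:, None] * np.ones((1, 3), dtype=np.float32)
replicated = replicate_kv_heads(kv, num_kv_heads=5, tp_degree=2)
assert replicated.shape == (10, 3) and (replicated[0] == replicated[1]).all()
```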

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Validates model generation and coherence
    • Performance benchmarks (TTFT, throughput)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/bitnet-b1.58-2B-4T/
  README.md
  /src
    modeling_bitnet.py
  /test
    /integration
      test_model.py

Testing

Model was compiled and tested with TP=2, batch_size=1, seq_len=256, bfloat16 on trn1.32xlarge.

Test Results:

| Test | Status | Result |
| --- | --- | --- |
| Smoke Test | ✅ PASS | Model loads successfully |
| Greedy Token Matching | ✅ PASS | 70.9% average (4/10 prompts at 100%) |
| Teacher-Forced Match | ✅ PASS | 97.2% average |
| Throughput | ✅ PASS | 26.7 tok/s |

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22
  • Instance Type(s): trn1.32xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.10
  • Configuration: TP=2, batch_size=1, seq_len=256, bfloat16

Additional Information

  • Ternary weight unpacking: Weights are stored as packed uint8 (4 values per byte, values: -1/0/+1). Unpacked during convert_hf_to_neuron_state_dict and scaled by per-tensor weight_scale.
  • Sub-norm fusion: Both attn_sub_norm (before o_proj) and ffn_sub_norm (before down_proj) have their gamma fused into the following linear layer's weights. At runtime, _TPAwareUnitRMSNorm applies unit RMSNorm with TP-aware all-reduce.
  • ReLU squared activation: Uses relu2 (ReLU(x)^2) instead of SiLU/SwiGLU.
  • KV replication: When num_kv_heads % tp_degree != 0, KV heads are replicated via repeat_interleave for CONVERT_TO_MHA compatibility.
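
The sub-norm fusion above rests on a simple identity: an RMSNorm with learned gamma followed by a linear layer equals a unit (gamma-free) RMSNorm followed by that linear layer with its input columns pre-scaled by gamma. A minimal numpy check of the identity (eps value and shapes are illustrative; the TP-aware all-reduce of the mean-square statistic is not shown):

```python
import numpy as np

def unit_rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMSNorm without a learned gamma -- the "unit" norm applied at runtime.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=d_in)
gamma = rng.normal(size=d_in)        # sub-norm scale, e.g. attn_sub_norm
W = rng.normal(size=(d_out, d_in))   # following linear layer, e.g. o_proj

# Unfused: RMSNorm with gamma, then the linear layer.
y_ref = W @ (gamma * unit_rmsnorm(x))
# Fused (what the checkpoint conversion does): fold gamma into W's input
# columns once, then apply only a unit RMSNorm at runtime.
y_fused = (W * gamma[None, :]) @ unit_rmsnorm(x)

assert np.allclose(y_ref, y_fused)
```

Folding gamma at conversion time removes one elementwise multiply per sub-norm from the compiled graph, which is why only the unit RMSNorm remains at runtime.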

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

dhwanw and others added 3 commits March 13, 2026 21:24
Ternary-weight Llama variant (microsoft/BitNet-b1.58-2B-4T) with sub-norm
fusion, relu² activation, and TP-aware unit RMSNorm. Validated at 70.9%
greedy / 97.2% teacher-forced match on TP=2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use consistent CE/TG column table format across all contrib models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhwanw dhwanw marked this pull request as ready for review March 19, 2026 19:37