
[contrib] Add Ouro-1.4B NeuronX port #77

Open
dhwanw wants to merge 3 commits into main from contrib/Ouro-1.4B

Conversation


@dhwanw dhwanw commented Mar 17, 2026

Description

NeuronX Distributed Inference port of ContextualAI/Ouro-1.4B, a 1.4B-parameter Universal Transformer. Ouro uses weight sharing across 4 UT steps over 24 layers, resulting in 96 unrolled physical layers for NXDI. The model features dual RMSNorm (pre+post sandwich) for both attention and MLP blocks, and intermediate norms at UT step boundaries.
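The unrolling scheme described above reduces to a simple index calculation: unrolled layer `i` reuses the weights of base layer `i mod 24` at UT step `i div 24`. A minimal sketch (constant and function names are illustrative, not taken from the PR's modeling code):

```python
NUM_BASE_LAYERS = 24  # layers in the shared-weight HF checkpoint
NUM_UT_STEPS = 4      # Universal Transformer iterations

def unrolled_to_base(unrolled_idx: int) -> tuple[int, int]:
    """Map an unrolled physical layer index (0..95) to (ut_step, base_layer).

    Layers 0..23 belong to UT step 0, 24..47 to step 1, and so on;
    each group of 24 reuses the same base-layer weights.
    """
    ut_step, base_layer = divmod(unrolled_idx, NUM_BASE_LAYERS)
    return ut_step, base_layer

# First layer of the last UT step reuses base layer 0's weights.
print(unrolled_to_base(72))  # → (3, 0)
```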

Model Information

Model Name: Ouro-1.4B
Model Architecture: Decoder-only Universal Transformer -- 24 base layers x 4 UT steps = 96 unrolled physical layers, MHA (16 heads), RoPE, dual RMSNorm sandwich, SwiGLU MLP
Purpose: Text generation with Universal Transformer weight sharing

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Validates model generation and coherence
    • Performance benchmarks (TTFT, throughput)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:
    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)
    • Modeling code following NxD Inference patterns

Optional Components

  • Unit Tests (CPU or Neuron-based)

Folder Structure

/contrib/models/Ouro-1.4B/
  README.md
  /src
    modeling_ouro.py
  /test
    /integration
      test_model.py

Testing

Model was compiled and tested with TP=1, batch_size=1, seq_len=128, bfloat16 on trn1.32xlarge.

Test Results:

| Test | Status | Result |
|---|---|---|
| Smoke Test | ✅ PASS | Model loads successfully |
| Greedy Token Matching | ✅ PASS | 87.0% average (7/10 prompts at 100%) |
| Teacher-Forced Match | ✅ PASS | 98.0% average |
| Throughput | ✅ PASS | 23.3 tok/s |

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22
  • Instance Type(s): trn1.32xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.10
  • Configuration: TP=1, batch_size=1, seq_len=128, bfloat16

Additional Information

  • Universal Transformer loop: 4 UT steps over 24 layers with shared weights. Unrolled into 96 physical layers with duplicated weights so NXDI can iterate in a single pass.
  • Dual layer norms: Each decoder layer applies pre-norm + post-norm sandwich for both attention and MLP blocks (4 RMSNorms per layer).
  • Intermediate norm: Additional RMSNorm applied at UT step boundaries (every 24 layers).
  • Separate KV cache per UT step: 96 total cache slots, one per unrolled layer.
  • Weight conversion: HF's 24-layer weights are duplicated 4x. Intermediate norm weights come from the model's final norm.
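The weight-conversion step can be sketched as a state-dict transform. A hedged illustration, assuming HF-style `model.layers.<i>.` key prefixes; the actual converter in `modeling_ouro.py` may differ, and the intermediate-norm handling is omitted for brevity:

```python
def unroll_state_dict(hf_state: dict, num_base_layers: int = 24,
                      num_ut_steps: int = 4) -> dict:
    """Duplicate shared-weight layers so each UT step gets its own copy.

    Hypothetical sketch: assumes HF-style 'model.layers.<i>.' keys.
    The intermediate-norm weights (copied from the model's final norm
    in the PR) are not reproduced here.
    """
    prefix = "model.layers."
    out = {}
    for key, tensor in hf_state.items():
        if not key.startswith(prefix):
            out[key] = tensor  # embeddings, final norm, lm_head, ...
            continue
        idx_str, suffix = key[len(prefix):].split(".", 1)
        base_idx = int(idx_str)
        # Emit one copy of this layer's weights per UT step.
        for step in range(num_ut_steps):
            new_idx = step * num_base_layers + base_idx
            out[f"{prefix}{new_idx}.{suffix}"] = tensor
    return out
```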

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

dhwanw and others added 3 commits March 13, 2026 19:06
Ouro-1.4B is a Universal Transformer that runs 4 UT steps over 24 layers,
unrolled into 96 physical layers for NXDI single-pass iteration. Features
dual pre+post norm sandwich, intermediate RMSNorm at UT boundaries, and
shared weights across UT step copies.

Validation: 87% greedy match, 98% teacher-forced match (TP=1, bf16).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use consistent CE/TG column table format across all contrib models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhwanw dhwanw marked this pull request as ready for review March 19, 2026 19:47
