Merged
32 changes: 31 additions & 1 deletion contrib/models/Trinity/README.md
Original file line number Diff line number Diff line change
@@ -583,6 +583,36 @@ Trinity's mixed attention (sliding window + full attention every 4th layer) requ

3. **`TrinityKVCacheManager`** replaces the standard `KVCacheManager` with per-layer awareness. All layers share uniform `max_length` cache buffers (required for CTE `fill_prefix` safety), but during TKG, scatter indices are modulated per-layer (sliding: `position % sliding_window`, global: raw position) and KV reads are sliced per-layer (sliding: `sliding_window`, global: `max_length`).
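
The per-layer scatter and read rules can be sketched in plain Python (an illustrative model of the behavior described above, not the actual `TrinityKVCacheManager` code; the `is_sliding_layer` convention below is an assumed 0-indexed layout):

```python
# Illustrative sketch of the per-layer KV cache indexing described above.
# NOT the actual TrinityKVCacheManager implementation; is_sliding_layer's
# "full attention every 4th layer" convention is an assumption.

def is_sliding_layer(layer_idx: int) -> bool:
    # Every 4th layer uses full (global) attention; the rest use sliding window.
    return (layer_idx + 1) % 4 != 0

def scatter_index(layer_idx: int, position: int, sliding_window: int) -> int:
    """Where a new KV entry is written in the uniform max_length buffer."""
    if is_sliding_layer(layer_idx):
        return position % sliding_window  # ring buffer of size sliding_window
    return position                       # global layers write at raw position

def read_length(layer_idx: int, sliding_window: int, max_length: int) -> int:
    """How many KV entries the layer reads back during token generation (TKG)."""
    return sliding_window if is_sliding_layer(layer_idx) else max_length
```

For example, with `sliding_window=4096`, a sliding layer at position 5000 writes to slot 904, while a global layer writes to slot 5000 and reads the full `max_length` cache.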

## LNC (Logical NeuronCore) Configuration

On trn2 instances, the LNC setting determines the number of logical NeuronCores and therefore the **valid TP degrees**. The Neuron runtime requires `NEURON_RT_NUM_CORES` to be either 1 or the full device (all logical cores). Intermediate values are rejected at runtime.

| LNC | Logical Cores (trn2.3xlarge) | Valid TP | HBM per Core |
|-----|------------------------------|----------|--------------|
| LNC=2 (default) | 4 | **1 or 4 only** | 24 GB |
| LNC=1 | 8 | **1 or 8 only** | 12 GB shared |

**Common pitfall:** TP=2 is NOT valid with LNC=2 on trn2.3xlarge. The runtime error is: `NEURON_RT_NUM_CORES must request one core, or the whole device (multiple of 8)`. To use TP=2, switch to LNC=1 (which gives 8 logical cores, making TP=2 valid). However, LNC=1 halves the HBM available per logical core (24 GB down to 12 GB shared).

**Practical impact on Trinity:**
- **Nano (TP=1):** Works on both LNC=1 and LNC=2
- **Nano (TP=2):** Requires LNC=1; with the default LNC=2, use TP=1 or TP=4 instead
- **Mini (TP=4):** Works on LNC=2 (default) -- uses all 4 logical cores
- **Large (TP=64):** Works on LNC=2 (default) on trn2.48xlarge -- uses all 64 logical cores
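
The validity rule above ("1 or the whole device") can be expressed as a small pre-flight helper (hypothetical, for illustration only -- not part of the Neuron SDK):

```python
# Hypothetical sanity check for the runtime rule described above:
# the requested core count (here, the TP degree) must be 1 or the whole device.

LOGICAL_CORES_TRN2_3XL = {1: 8, 2: 4}  # LNC setting -> logical core count

def valid_tp_degrees(lnc: int) -> set:
    """TP degrees the runtime accepts on trn2.3xlarge for a given LNC."""
    return {1, LOGICAL_CORES_TRN2_3XL[lnc]}

def check_tp(tp: int, lnc: int) -> None:
    allowed = valid_tp_degrees(lnc)
    if tp not in allowed:
        raise ValueError(
            f"TP={tp} is invalid with LNC={lnc} on trn2.3xlarge: "
            f"the runtime accepts only {sorted(allowed)} (1 or the whole device)."
        )
```

Here `check_tp(4, 2)` passes, while `check_tp(2, 2)` raises, mirroring the runtime rejection described above.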

To check or change LNC:
```bash
# Check current LNC
neuron-ls # Shows logical core count

# Set LNC=1 (persistent, requires reboot)
echo 'NEURON_LOGICAL_NC_CONFIG=1' | sudo tee -a /etc/environment  # -a appends; plain tee would overwrite the file
sudo reboot

# Set LNC=1 (current session only)
export NEURON_LOGICAL_NC_CONFIG=1
```

## Compatibility Matrix

| Model | Instance | TP | LNC | Max seq_len | Status |
@@ -792,4 +822,4 @@ The NxDI framework uses several NKI (Neuron Kernel Interface) kernels during Tri

Jim Burtoft

**Last Updated:** 2026-03-18 (added fused MoE TKG with expert_bias benchmark: Mini +29% throughput, 5/5 correctness match)
**Last Updated:** 2026-03-18 (added LNC configuration note with valid TP degrees per LNC mode; added fused MoE TKG with expert_bias benchmark: Mini +29% throughput, 5/5 correctness match)
186 changes: 116 additions & 70 deletions contrib/models/Trinity/expert_bias_fused_kernel.ipynb
@@ -47,19 +47,23 @@
"\n",
"### Prerequisites\n",
"\n",
"- trn2.3xlarge instance with SDK 2.28 DLAMI\n",
"- Neuron venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`\n",
"- Model weights downloaded to `/mnt/models/`\n",
"- NxDI contrib fork cloned to `/home/ubuntu/nxdi-fork/`\n",
"\n",
"### Running This Notebook\n",
"\n",
"**Important:** NeuronCore device memory is not reliably released within a single kernel session. Run each model section (Nano non-fused, Nano fused, Mini non-fused, Mini fused) in a separate kernel, or **restart the kernel** between model loads.\n",
"\n",
"### Validated Results (2026-03-18, trn2.3xlarge, SDK 2.28)\n",
"\n",
"- **Trinity-Nano (TP=2):** 5/5 first-token matches between fused and non-fused\n",
"- **Trinity-Mini (TP=4, BS=1):** TKG 11.8 ms (non-fused) vs 9.1 ms (fused) = **22.5% reduction, +29% throughput**. 5/5 first-token correctness match."
"- trn2.3xlarge instance with SDK 2.28 DLAMI (LNC=2, default)\n",
"- Neuron venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`\n",
"- Model weights downloaded to `/mnt/models/`\n",
"- NxDI cloned to `/home/ubuntu/nxdi/` (`git clone https://github.com/aws-neuron/neuronx-distributed-inference.git /home/ubuntu/nxdi`)\n",
"\n",
"### LNC Constraint\n",
"\n",
"With LNC=2 (default) on trn2.3xlarge, there are 4 logical NeuronCores. The runtime requires `NEURON_RT_NUM_CORES` to be 1 or the full device (4). **TP=2 is NOT valid with LNC=2.** This notebook uses TP=1 for Nano and TP=4 for Mini.\n",
"\n",
"### Running This Notebook\n",
"\n",
"**Important:** NeuronCore device memory is not reliably released within a single kernel session via `del model; gc.collect()`. You **must restart the kernel** between model loads (e.g., between Nano and Mini sections, and between Mini non-fused and Mini fused). Each section that loads a model is marked with a restart reminder.\n",
"\n",
"### Validated Results (2026-03-18, trn2.3xlarge, SDK 2.28)\n",
"\n",
"- **Trinity-Nano (TP=1):** 5/5 first-token matches between fused and non-fused\n",
"- **Trinity-Mini (TP=4, BS=1):** TKG 11.5 ms (non-fused) vs 9.0 ms (fused) = **21.8% reduction, +28% throughput**. 5/5 first-token correctness match."
]
},
{
@@ -75,9 +79,10 @@
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"NEURON_RT_NUM_CORES\"] = \"4\"\n",
"\n",
"import os\n",
"# NEURON_RT_NUM_CORES is set automatically by NxDI based on tp_degree.\n",
"# For Mini (TP=4), all 4 cores are used. For Nano (TP=1), 1 core is used.\n",
"\n",
"# Verify SDK versions before patching\n",
"import pkg_resources\n",
"for pkg in ['neuronx-cc', 'neuronx-distributed', 'neuronx-distributed-inference', 'torch-neuronx', 'torch']:\n",
@@ -120,25 +125,30 @@
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"set -e\n",
"\n",
"echo \"=== Installing patched nki-library ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/nki-library.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed-inference ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Done. Verifying installations ===\"\n",
"pip show neuronx-distributed 2>/dev/null | grep -E 'Version|Location'\n",
"pip show neuronx-distributed-inference 2>/dev/null | grep -E 'Version|Location'\n",
"pip show nki-library 2>/dev/null | grep -E 'Version|Location' || echo 'nki-library: installed via nkilib_src'"
"%%bash\n",
"set -e\n",
"\n",
"echo \"=== Installing patched nki-library ===\"\n",
"# pip install fails due to setuptools_scm issue -- clone and copy the patched file directly\n",
"if [ ! -d /tmp/nki-library ]; then\n",
" git clone -b feature/expert-bias-support https://github.com/jimburtoft/nki-library.git /tmp/nki-library 2>&1 | tail -2\n",
"fi\n",
"NKILIB_SITE=$(python -c \"import nkilib; print(nkilib.__path__[0])\")\n",
"cp /tmp/nki-library/src/nkilib_src/nkilib/core/router_topk/router_topk.py $NKILIB_SITE/core/router_topk/router_topk.py\n",
"echo \" Copied patched router_topk.py to $NKILIB_SITE/core/router_topk/\"\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed-inference ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Done. Verifying installations ===\"\n",
"pip show neuronx-distributed 2>/dev/null | grep -E 'Version|Location'\n",
"pip show neuronx-distributed-inference 2>/dev/null | grep -E 'Version|Location'"
]
},
{
@@ -195,12 +205,16 @@
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"set -e\n",
"\n",
"HF_TOKEN = os.environ.get('HF_TOKEN', 'YOUR_HF_TOKEN_HERE') # Set via: export HF_TOKEN=hf_...\n",
"\n",
"# Trinity-Nano\n",
"%%bash\n",
"set -e\n",
"\n",
"# Set your HuggingFace token before running: export HF_TOKEN=hf_...\n",
"if [ -z \"$HF_TOKEN\" ]; then\n",
" echo \"ERROR: Set HF_TOKEN environment variable first: export HF_TOKEN=hf_...\"\n",
" exit 1\n",
"fi\n",
"\n",
"# Trinity-Nano\n",
"if [ ! -d \"/mnt/models/Trinity-Nano-HF\" ]; then\n",
" echo \"Downloading Trinity-Nano...\"\n",
" huggingface-cli download arcee-ai/Trinity-Nano-Preview \\\n",
@@ -227,15 +241,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Trinity-Nano Validation (TP=2)\n",
"\n",
"Quick validation on the smaller model. Trinity-Nano has:\n",
"- ~6B total parameters, ~1B active per token\n",
"- 54 MoE layers, 128 experts, top_k=8, intermediate=256\n",
"- Nano's small intermediate size means the fused kernel won't show a throughput improvement,\n",
" but it validates that expert_bias is working correctly.\n",
"\n",
"We compile and run both **non-fused** (default) and **fused** paths, comparing first-token output."
"## 3. Trinity-Nano Validation (TP=1)\n",
"\n",
"Quick validation on the smaller model. Trinity-Nano has:\n",
"- ~6B total parameters, ~1B active per token\n",
"- 54 MoE layers, 128 experts, top_k=8, intermediate=256\n",
"- Nano's small intermediate size means the fused kernel won't show a throughput improvement,\n",
" but it validates that expert_bias is working correctly.\n",
"- **TP=1** because LNC=2 (default) only allows TP=1 or TP=4 on trn2.3xlarge.\n",
"\n",
"We compile and run both **non-fused** (default) and **fused** paths, comparing first-token output."
]
},
{
Expand All @@ -249,7 +264,7 @@
"import gc\n",
"import torch\n",
"\n",
"sys.path.insert(0, \"/home/ubuntu/nxdi-fork/contrib/models/Trinity/src\")\n",
"sys.path.insert(0, \"/home/ubuntu/nxdi/contrib/models/Trinity/src\")\n",
"\n",
"from transformers import AutoTokenizer\n",
"from neuronx_distributed_inference.models.config import MoENeuronConfig\n",
@@ -262,8 +277,8 @@
"metadata": {},
"outputs": [],
"source": [
"NANO_PATH = \"/mnt/models/Trinity-Nano-HF\"\n",
"NANO_TP = 2\n",
"NANO_PATH = \"/mnt/models/Trinity-Nano-HF\"\n",
"NANO_TP = 1\n",
"SEQ_LEN = 2048\n",
"BS = 1\n",
"\n",
@@ -473,11 +488,25 @@
"print(\"Fused model unloaded.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Trinity-Mini Benchmark (TP=4)\n",
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## \u26a0\ufe0f Restart Kernel Before Continuing\n",
"\n",
"**You must restart the Jupyter kernel now** (Kernel > Restart) before running the Mini sections below.\n",
"\n",
"NeuronCore device memory is not released by `del model; gc.collect()`. If you skip this step, the Mini model will fail to load with an out-of-memory error.\n",
"\n",
"After restarting, re-run the imports cell (Cell 10: `import sys, time, gc, torch...`) and then continue from Section 4."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Trinity-Mini Benchmark (TP=4)\n",
"\n",
"Trinity-Mini is where the fused kernel shows real benefit:\n",
"- ~26B total parameters, ~4.5B active per token\n",
@@ -656,11 +685,23 @@
"print(\"\\nNon-fused Mini unloaded.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Compile and benchmark FUSED Mini (with expert_bias)"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## \u26a0\ufe0f Restart Kernel Before Continuing\n",
"\n",
"**Restart the kernel again** (Kernel > Restart) before loading the fused Mini model.\n",
"\n",
"After restarting, re-run: the imports cell, the Mini config cell (MINI_PATH, tokenizer), and the `benchmark_generation` function cell. Then continue below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Compile and benchmark FUSED Mini (with expert_bias)"
]
},
{
@@ -811,13 +852,18 @@
"| [neuronx-distributed](https://github.com/jimburtoft/neuronx-distributed/tree/feature/expert-bias-support) | `feature/expert-bias-support` |\n",
"| [neuronx-distributed-inference](https://github.com/jimburtoft/neuronx-distributed-inference/tree/feature/expert-bias-support) | `feature/expert-bias-support` |\n",
"\n",
"### To use in production\n",
"\n",
"```bash\n",
"pip install --no-deps git+https://github.com/jimburtoft/nki-library.git@feature/expert-bias-support\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support\n",
"```\n",
"### To use in production\n",
"\n",
"```bash\n",
"# nki-library: pip install fails due to setuptools_scm -- clone and copy instead\n",
"git clone -b feature/expert-bias-support https://github.com/jimburtoft/nki-library.git /tmp/nki-library\n",
"NKILIB_SITE=$(python -c \"import nkilib; print(nkilib.__path__[0])\")\n",
"cp /tmp/nki-library/src/nkilib_src/nkilib/core/router_topk/router_topk.py $NKILIB_SITE/core/router_topk/router_topk.py\n",
"\n",
"# neuronx-distributed and neuronx-distributed-inference\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support\n",
"```\n",
"\n",
"Then set `moe_fused_nki_kernel_enabled=True` in `MoENeuronConfig`."
]