Merged
32 changes: 31 additions & 1 deletion contrib/models/Trinity/README.md
Original file line number Diff line number Diff line change
@@ -583,6 +583,36 @@ Trinity's mixed attention (sliding window + full attention every 4th layer) requ

3. **`TrinityKVCacheManager`** replaces the standard `KVCacheManager` with per-layer awareness. All layers share uniform `max_length` cache buffers (required for CTE `fill_prefix` safety), but during TKG, scatter indices are modulated per-layer (sliding: `position % sliding_window`, global: raw position) and KV reads are sliced per-layer (sliding: `sliding_window`, global: `max_length`).
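
The per-layer scatter and read rules can be sketched in plain Python (an illustrative model of the behavior described above, not the actual `TrinityKVCacheManager` code; the `is_sliding_layer` convention below is an assumed 0-indexed layout):

```python
# Illustrative sketch of the per-layer KV cache indexing described above.
# NOT the actual TrinityKVCacheManager implementation; is_sliding_layer's
# "full attention every 4th layer" convention is an assumption.

def is_sliding_layer(layer_idx: int) -> bool:
    # Every 4th layer uses full (global) attention; the rest use sliding window.
    return (layer_idx + 1) % 4 != 0

def scatter_index(layer_idx: int, position: int, sliding_window: int) -> int:
    """Where a new KV entry is written in the uniform max_length buffer."""
    if is_sliding_layer(layer_idx):
        return position % sliding_window  # ring buffer of size sliding_window
    return position                       # global layers write at raw position

def read_length(layer_idx: int, sliding_window: int, max_length: int) -> int:
    """How many KV entries the layer reads back during token generation (TKG)."""
    return sliding_window if is_sliding_layer(layer_idx) else max_length
```

For example, with `sliding_window=4096`, a sliding layer at position 5000 writes to slot 904, while a global layer writes to slot 5000 and reads the full `max_length` cache.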

## LNC (Logical NeuronCore) Configuration

On trn2 instances, the LNC setting determines the number of logical NeuronCores and therefore the **valid TP degrees**. The Neuron runtime requires `NEURON_RT_NUM_CORES` to be either 1 or the full device (all logical cores). Intermediate values are rejected at runtime.

| LNC | Logical Cores (trn2.3xlarge) | Valid TP | HBM per Core |
|-----|------------------------------|----------|--------------|
| LNC=2 (default) | 4 | **1 or 4 only** | 24 GB |
| LNC=1 | 8 | **1 or 8 only** | 12 GB shared |

**Common pitfall:** TP=2 is NOT valid with LNC=2 on trn2.3xlarge. The runtime error is: `NEURON_RT_NUM_CORES must request one core, or the whole device (multiple of 8)`. To use TP=2, switch to LNC=1 (which gives 8 logical cores, making TP=2 valid). However, LNC=1 halves the HBM available per logical core (24 GB down to 12 GB shared).

**Practical impact on Trinity:**
- **Nano (TP=1):** Works on both LNC=1 and LNC=2
- **Nano (TP=2):** Requires LNC=1; with the default LNC=2, use TP=1 or TP=4 instead
- **Mini (TP=4):** Works on LNC=2 (default) -- uses all 4 logical cores
- **Large (TP=64):** Works on LNC=2 (default) on trn2.48xlarge -- uses all 64 logical cores
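
The validity rule above ("1 or the whole device") can be expressed as a small pre-flight helper (hypothetical, for illustration only -- not part of the Neuron SDK):

```python
# Hypothetical sanity check for the runtime rule described above:
# the requested core count (here, the TP degree) must be 1 or the whole device.

LOGICAL_CORES_TRN2_3XL = {1: 8, 2: 4}  # LNC setting -> logical core count

def valid_tp_degrees(lnc: int) -> set:
    """TP degrees the runtime accepts on trn2.3xlarge for a given LNC."""
    return {1, LOGICAL_CORES_TRN2_3XL[lnc]}

def check_tp(tp: int, lnc: int) -> None:
    allowed = valid_tp_degrees(lnc)
    if tp not in allowed:
        raise ValueError(
            f"TP={tp} is invalid with LNC={lnc} on trn2.3xlarge: "
            f"the runtime accepts only {sorted(allowed)} (1 or the whole device)."
        )
```

Here `check_tp(4, 2)` passes, while `check_tp(2, 2)` raises, mirroring the runtime rejection described above.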

To check or change LNC:
```bash
# Check current LNC
neuron-ls # Shows logical core count

# Set LNC=1 (persistent, requires reboot)
echo 'NEURON_LOGICAL_NC_CONFIG=1' | sudo tee -a /etc/environment  # -a appends; plain tee would overwrite the file
sudo reboot

# Set LNC=1 (current session only)
export NEURON_LOGICAL_NC_CONFIG=1
```

## Compatibility Matrix

| Model | Instance | TP | LNC | Max seq_len | Status |
@@ -792,4 +822,4 @@ The NxDI framework uses several NKI (Neuron Kernel Interface) kernels during Tri

Jim Burtoft

**Last Updated:** 2026-03-18 (added fused MoE TKG with expert_bias benchmark: Mini +29% throughput, 5/5 correctness match)
**Last Updated:** 2026-03-18 (added LNC configuration note with valid TP degrees per LNC mode; added fused MoE TKG with expert_bias benchmark: Mini +29% throughput, 5/5 correctness match)
186 changes: 116 additions & 70 deletions contrib/models/Trinity/expert_bias_fused_kernel.ipynb
@@ -47,19 +47,23 @@
"\n",
"### Prerequisites\n",
"\n",
"- trn2.3xlarge instance with SDK 2.28 DLAMI\n",
"- Neuron venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`\n",
"- Model weights downloaded to `/mnt/models/`\n",
"- NxDI contrib fork cloned to `/home/ubuntu/nxdi-fork/`\n",
"\n",
"### Running This Notebook\n",
"\n",
"**Important:** NeuronCore device memory is not reliably released within a single kernel session. Run each model section (Nano non-fused, Nano fused, Mini non-fused, Mini fused) in a separate kernel, or **restart the kernel** between model loads.\n",
"\n",
"### Validated Results (2026-03-18, trn2.3xlarge, SDK 2.28)\n",
"\n",
"- **Trinity-Nano (TP=2):** 5/5 first-token matches between fused and non-fused\n",
"- **Trinity-Mini (TP=4, BS=1):** TKG 11.8 ms (non-fused) vs 9.1 ms (fused) = **22.5% reduction, +29% throughput**. 5/5 first-token correctness match."
"- trn2.3xlarge instance with SDK 2.28 DLAMI (LNC=2, default)\n",
"- Neuron venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`\n",
"- Model weights downloaded to `/mnt/models/`\n",
"- NxDI cloned to `/home/ubuntu/nxdi/` (`git clone https://github.com/aws-neuron/neuronx-distributed-inference.git /home/ubuntu/nxdi`)\n",
"\n",
"### LNC Constraint\n",
"\n",
"With LNC=2 (default) on trn2.3xlarge, there are 4 logical NeuronCores. The runtime requires `NEURON_RT_NUM_CORES` to be 1 or the full device (4). **TP=2 is NOT valid with LNC=2.** This notebook uses TP=1 for Nano and TP=4 for Mini.\n",
"\n",
"### Running This Notebook\n",
"\n",
"**Important:** NeuronCore device memory is not reliably released within a single kernel session via `del model; gc.collect()`. You **must restart the kernel** between model loads (e.g., between Nano and Mini sections, and between Mini non-fused and Mini fused). Each section that loads a model is marked with a restart reminder.\n",
"\n",
"### Validated Results (2026-03-18, trn2.3xlarge, SDK 2.28)\n",
"\n",
"- **Trinity-Nano (TP=1):** 5/5 first-token matches between fused and non-fused\n",
"- **Trinity-Mini (TP=4, BS=1):** TKG 11.5 ms (non-fused) vs 9.0 ms (fused) = **21.8% reduction, +28% throughput**. 5/5 first-token correctness match."
]
},
{
@@ -75,9 +79,10 @@
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"NEURON_RT_NUM_CORES\"] = \"4\"\n",
"\n",
"import os\n",
"# NEURON_RT_NUM_CORES is set automatically by NxDI based on tp_degree.\n",
"# For Mini (TP=4), all 4 cores are used. For Nano (TP=1), 1 core is used.\n",
"\n",
"# Verify SDK versions before patching\n",
"import pkg_resources\n",
"for pkg in ['neuronx-cc', 'neuronx-distributed', 'neuronx-distributed-inference', 'torch-neuronx', 'torch']:\n",
@@ -120,25 +125,30 @@
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"set -e\n",
"\n",
"echo \"=== Installing patched nki-library ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/nki-library.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed-inference ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Done. Verifying installations ===\"\n",
"pip show neuronx-distributed 2>/dev/null | grep -E 'Version|Location'\n",
"pip show neuronx-distributed-inference 2>/dev/null | grep -E 'Version|Location'\n",
"pip show nki-library 2>/dev/null | grep -E 'Version|Location' || echo 'nki-library: installed via nkilib_src'"
"%%bash\n",
"set -e\n",
"\n",
"echo \"=== Installing patched nki-library ===\"\n",
"# pip install fails due to setuptools_scm issue -- clone and copy the patched file directly\n",
"if [ ! -d /tmp/nki-library ]; then\n",
" git clone -b feature/expert-bias-support https://github.com/jimburtoft/nki-library.git /tmp/nki-library 2>&1 | tail -2\n",
"fi\n",
"NKILIB_SITE=$(python -c \"import nkilib; print(nkilib.__path__[0])\")\n",
"cp /tmp/nki-library/src/nkilib_src/nkilib/core/router_topk/router_topk.py $NKILIB_SITE/core/router_topk/router_topk.py\n",
"echo \" Copied patched router_topk.py to $NKILIB_SITE/core/router_topk/\"\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Installing patched neuronx-distributed-inference ===\"\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support 2>&1 | tail -3\n",
"\n",
"echo \"\"\n",
"echo \"=== Done. Verifying installations ===\"\n",
"pip show neuronx-distributed 2>/dev/null | grep -E 'Version|Location'\n",
"pip show neuronx-distributed-inference 2>/dev/null | grep -E 'Version|Location'"
]
},
{
@@ -195,12 +205,16 @@
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"set -e\n",
"\n",
"HF_TOKEN = os.environ.get('HF_TOKEN', 'YOUR_HF_TOKEN_HERE') # Set via: export HF_TOKEN=hf_...\n",
"\n",
"# Trinity-Nano\n",
"%%bash\n",
"set -e\n",
"\n",
"# Set your HuggingFace token before running: export HF_TOKEN=hf_...\n",
"if [ -z \"$HF_TOKEN\" ]; then\n",
" echo \"ERROR: Set HF_TOKEN environment variable first: export HF_TOKEN=hf_...\"\n",
" exit 1\n",
"fi\n",
"\n",
"# Trinity-Nano\n",
"if [ ! -d \"/mnt/models/Trinity-Nano-HF\" ]; then\n",
" echo \"Downloading Trinity-Nano...\"\n",
" huggingface-cli download arcee-ai/Trinity-Nano-Preview \\\n",
@@ -227,15 +241,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Trinity-Nano Validation (TP=2)\n",
"\n",
"Quick validation on the smaller model. Trinity-Nano has:\n",
"- ~6B total parameters, ~1B active per token\n",
"- 54 MoE layers, 128 experts, top_k=8, intermediate=256\n",
"- Nano's small intermediate size means the fused kernel won't show a throughput improvement,\n",
" but it validates that expert_bias is working correctly.\n",
"\n",
"We compile and run both **non-fused** (default) and **fused** paths, comparing first-token output."
"## 3. Trinity-Nano Validation (TP=1)\n",
"\n",
"Quick validation on the smaller model. Trinity-Nano has:\n",
"- ~6B total parameters, ~1B active per token\n",
"- 54 MoE layers, 128 experts, top_k=8, intermediate=256\n",
"- Nano's small intermediate size means the fused kernel won't show a throughput improvement,\n",
" but it validates that expert_bias is working correctly.\n",
"- **TP=1** because LNC=2 (default) only allows TP=1 or TP=4 on trn2.3xlarge.\n",
"\n",
"We compile and run both **non-fused** (default) and **fused** paths, comparing first-token output."
]
},
{
Expand All @@ -249,7 +264,7 @@
"import gc\n",
"import torch\n",
"\n",
"sys.path.insert(0, \"/home/ubuntu/nxdi-fork/contrib/models/Trinity/src\")\n",
"sys.path.insert(0, \"/home/ubuntu/nxdi/contrib/models/Trinity/src\")\n",
"\n",
"from transformers import AutoTokenizer\n",
"from neuronx_distributed_inference.models.config import MoENeuronConfig\n",
@@ -262,8 +277,8 @@
"metadata": {},
"outputs": [],
"source": [
"NANO_PATH = \"/mnt/models/Trinity-Nano-HF\"\n",
"NANO_TP = 2\n",
"NANO_PATH = \"/mnt/models/Trinity-Nano-HF\"\n",
"NANO_TP = 1\n",
"SEQ_LEN = 2048\n",
"BS = 1\n",
"\n",
@@ -473,11 +488,25 @@
"print(\"Fused model unloaded.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Trinity-Mini Benchmark (TP=4)\n",
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## \u26a0\ufe0f Restart Kernel Before Continuing\n",
"\n",
"**You must restart the Jupyter kernel now** (Kernel > Restart) before running the Mini sections below.\n",
"\n",
"NeuronCore device memory is not released by `del model; gc.collect()`. If you skip this step, the Mini model will fail to load with an out-of-memory error.\n",
"\n",
"After restarting, re-run the imports cell (Cell 10: `import sys, time, gc, torch...`) and then continue from Section 4."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Trinity-Mini Benchmark (TP=4)\n",
"\n",
"Trinity-Mini is where the fused kernel shows real benefit:\n",
"- ~26B total parameters, ~4.5B active per token\n",
@@ -656,11 +685,23 @@
"print(\"\\nNon-fused Mini unloaded.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Compile and benchmark FUSED Mini (with expert_bias)"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## \u26a0\ufe0f Restart Kernel Before Continuing\n",
"\n",
"**Restart the kernel again** (Kernel > Restart) before loading the fused Mini model.\n",
"\n",
"After restarting, re-run: the imports cell, the Mini config cell (MINI_PATH, tokenizer), and the `benchmark_generation` function cell. Then continue below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Compile and benchmark FUSED Mini (with expert_bias)"
]
},
{
@@ -811,13 +852,18 @@
"| [neuronx-distributed](https://github.com/jimburtoft/neuronx-distributed/tree/feature/expert-bias-support) | `feature/expert-bias-support` |\n",
"| [neuronx-distributed-inference](https://github.com/jimburtoft/neuronx-distributed-inference/tree/feature/expert-bias-support) | `feature/expert-bias-support` |\n",
"\n",
"### To use in production\n",
"\n",
"```bash\n",
"pip install --no-deps git+https://github.com/jimburtoft/nki-library.git@feature/expert-bias-support\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support\n",
"```\n",
"### To use in production\n",
"\n",
"```bash\n",
"# nki-library: pip install fails due to setuptools_scm -- clone and copy instead\n",
"git clone -b feature/expert-bias-support https://github.com/jimburtoft/nki-library.git /tmp/nki-library\n",
"NKILIB_SITE=$(python -c \"import nkilib; print(nkilib.__path__[0])\")\n",
"cp /tmp/nki-library/src/nkilib_src/nkilib/core/router_topk/router_topk.py $NKILIB_SITE/core/router_topk/router_topk.py\n",
"\n",
"# neuronx-distributed and neuronx-distributed-inference\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed.git@feature/expert-bias-support\n",
"pip install --no-deps git+https://github.com/jimburtoft/neuronx-distributed-inference.git@feature/expert-bias-support\n",
"```\n",
"\n",
"Then set `moe_fused_nki_kernel_enabled=True` in `MoENeuronConfig`."
]