The definitive Strix Halo LLM guide — 65 t/s on a $2,999 mini PC. Live benchmarks, tested optimizations, and everything that doesn't work.
vLLM + Qwen3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
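Servers like this one expose the standard OpenAI chat endpoint, so a smoke test is a few lines of Python. A minimal sketch, assuming vLLM's default port 8000 and a placeholder served-model name; adjust both to match the actual launch flags:

```python
# Minimal smoke test against an OpenAI-compatible vLLM server.
# Assumptions: default port 8000 and the served-model name below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.6-27B-AWQ",  # hypothetical served-model name
    messages=[{"role": "user", "content": "One sentence on UMA memory."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```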
vLLM + Qwen3.6-27B (BF16) OpenAI-compatible inference server on AMD Strix Halo (Ryzen AI Max+ 395, gfx1151). Vision input, 256K context, /v1/responses with separated reasoning, via TheRock ROCm.
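For the /v1/responses endpoint with separated reasoning, the OpenAI SDK's Responses client works against an OpenAI-compatible server. A sketch under assumed host/model values; the exact output item types depend on the server's Responses implementation:

```python
# Hitting /v1/responses; reasoning and final answer arrive as separate
# output items. Host and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.responses.create(
    model="Qwen3.6-27B",  # hypothetical served-model name
    input="Why does 256K context need so much KV-cache memory?",
)
for item in resp.output:
    print(item.type)          # e.g. "reasoning" vs "message"
print(resp.output_text)       # convenience accessor for the final text
```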
Local, ternary-weight LLM inference on AMD Strix Halo. Rust above the kernels, HIP below, zero Python at runtime. https://discord.gg/EhQgmNePg
llama.cpp + Qwen3.6-27B (Q8_0 GGUF) OpenAI-compatible inference server on AMD Strix Halo (Ryzen AI Max+ 395, gfx1151). 256K context, ~7.5 t/s decode via TheRock ROCm Docker.
ComfyUI on AMD Strix Halo (RDNA 3.5 / gfx1151) via Docker. Ubuntu Rolling + UV-managed Python 3.12 + ROCm preview wheels. Solves the silent CPU fallback Debian/Python 3.13 images hit on gfx1151.
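The silent CPU fallback is easy to detect before launching ComfyUI: on a ROCm PyTorch build, torch.version.hip is set and the GPU is visible through the cuda API. A quick check:

```python
# Verify the ROCm PyTorch wheel actually sees gfx1151 instead of
# silently falling back to CPU.
import torch

print("HIP build:", torch.version.hip)            # None on a CPU/CUDA wheel
print("GPU visible:", torch.cuda.is_available())  # ROCm reuses the cuda API
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```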
Claude Code skill for AMD Strix Halo (Ryzen AI MAX+ 395) ML setup. Handles PyTorch installation (official wheels don't work with gfx1151), GTT memory config, and environment setup. Enables 30B parameter models.
Production-oriented Docker Compose stack serving openai/gpt-oss-20b via vLLM on AMD Strix Halo (gfx1151, ROCm 7.2). OpenAI Responses API, host-mounted weights, hard-capped KV cache. Verified, no source build.
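Hard-capping the KV cache starts with knowing how big it gets. A back-of-envelope sizing helper; the model dimensions below are placeholders, not gpt-oss-20b's real config, so substitute values from the model's config.json:

```python
# KV-cache sizing: 2x for K and V, per layer, per token.
# Dimensions below are placeholders for illustration.
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * dtype_bytes

gib = kv_cache_bytes(layers=24, kv_heads=8, head_dim=64, ctx_len=131072) / 2**30
print(f"~{gib:.1f} GiB for one 128K-token sequence")
```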
Native ROCm C++ kernels for Strix Halo (gfx1151): ternary BitNet GEMV, RMSNorm, RoPE, split-KV Flash-Decoding attention. Zero hipBLAS, zero Python.
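For readers unfamiliar with ternary GEMV, a NumPy reference of the semantics the HIP kernel implements: weights constrained to {-1, 0, +1} with a per-row scale, so the matvec reduces to signed adds. Shapes and scaling granularity here are assumptions, not the repo's exact layout:

```python
# NumPy reference for a ternary BitNet-style GEMV.
import numpy as np

def ternary_gemv(w_ternary, scales, x):
    # w_ternary: (out, in) int8 in {-1, 0, +1}; scales: (out,); x: (in,)
    return scales * (w_ternary.astype(np.float32) @ x)

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)
s = rng.random(4).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
print(ternary_gemv(W, s, x))
```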
Docker infrastructure for AMD Strix Halo (RDNA 3.5 / gfx1151): PyTorch + ROCm base container and a separate Ollama LLM service. Two folders, two Compose files, one Strix Halo box.
Drop-in recipe for running faster-whisper on AMD Strix Halo (Ryzen AI Max+ 395, gfx1151) with Ubuntu 26.04 + ROCm 7.2.2 — no source build required
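Usage is the standard faster-whisper API; whether "cuda" maps onto the ROCm device depends on the CTranslate2 build the recipe ships, so treat the device and compute_type values as assumptions to verify against the repo's docs:

```python
# Standard faster-whisper transcription loop.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
print("Detected language:", info.language)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```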
Docker stack: Ollama v0.21.0 built from source against ROCm 7.2.2 with native gfx1151 (Strix Halo) — serves Gemma 4 at up to 256K context on AMD Ryzen AI MAX+ 395 / Radeon 8060S. Includes a 9-layer "make validate" ladder covering the host firmware, ROCm runtime, container, and long-context inference.
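Once the stack is up, Ollama answers on its native HTTP API at the default port 11434. A quick check; the model tag is an assumption, so use whatever "ollama list" reports:

```python
# Smoke test against Ollama's native API. Model tag is a placeholder.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma", "prompt": "Say hi in five words.", "stream": False},
    timeout=120,
)
print(r.json()["response"])
```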
Experimental local LLM API for AMD Strix Halo (gfx1151) on ROCm 7.10 (TheRock). Two-service split: vLLM inference engine + FastAPI gateway with OpenAI protocol normalization, auth, management. Docker Compose.
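The gateway half of that two-service split can be very small: FastAPI terminates auth, then forwards OpenAI-style requests to the vLLM backend. A minimal sketch, where the upstream URL, header check, and token are illustrative assumptions rather than the repo's actual wiring:

```python
# Sketch of an OpenAI-protocol gateway in front of vLLM.
import httpx
from fastapi import FastAPI, Header, HTTPException, Request

VLLM_UPSTREAM = "http://vllm:8000"  # hypothetical compose service name
app = FastAPI()

@app.post("/v1/chat/completions")
async def chat(request: Request, authorization: str = Header(default="")):
    if authorization != "Bearer local-dev-token":  # placeholder auth check
        raise HTTPException(status_code=401, detail="invalid token")
    body = await request.json()
    async with httpx.AsyncClient(timeout=300) as client:
        upstream = await client.post(
            f"{VLLM_UPSTREAM}/v1/chat/completions", json=body
        )
    return upstream.json()
```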
Fast, local LLM inference on AMD-powered mini PCs: 65-87 t/s for large models, with no cloud or subscription costs.
OpenAI-compatible /v1/embeddings server (BAAI/bge-m3, 1024 dims, 100+ langs) on AMD Strix Halo via ROCm. Drop-in replacement for OpenAI text-embedding-3, Docker, no API keys, ~47ms single-text latency.
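Being a drop-in replacement, it accepts the standard OpenAI embeddings call. A quick check, assuming port 8000 and the bge-m3 model id the server registers; the returned vectors should be 1024-dimensional:

```python
# Drop-in check for the /v1/embeddings endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.embeddings.create(model="BAAI/bge-m3", input=["hello", "bonjour"])
print(len(resp.data), "vectors of dim", len(resp.data[0].embedding))
```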