🌍 Read this in other languages: Deutsch | Español | Français | Italiano | 日本語 | 简体中文 | 繁體中文 | Русский | Українська | Português | 한국어 | العربية | Tiếng Việt | Türkçe
Your NVIDIA DGX Spark should not be another side project. Start using it! This is a Docker-based inference stack for serving large language models (LLMs) using NVIDIA vLLM with intelligent resource management. The stack provides on-demand model loading with automatic idle shutdown, a single main-model scheduling lane with an optional utility helper, and a unified API gateway.
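To make the on-demand behavior concrete, here is a minimal sketch of the request flow as seen from the host. It assumes the defaults used in the quick start below (gateway on port 8009, the shipped API key placeholder, and a validated main model name); the idle window is tunable, see the Configuration doc.

```bash
# Only the gateway and waker are up; no model containers are running yet.
docker compose ps

# First request to a main model: the waker starts its container on demand,
# so this call blocks until the model has loaded.
curl -s http://localhost:8009/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY:-63TestTOKEN0REPLACEME}" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hi"}]}'

# The model container is now listed; after the configured idle period the
# waker shuts it down again to free GPU memory.
docker compose ps
```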
The goal of the project is to provide an inference server for your home. After testing this and adding new models for a month, I decided to release it for the community. Please understand that this is a hobby project and that concrete help to improve it is highly appreciated. It is based on information I found on the Internet and on the NVIDIA Forums, and I really hope it helps drive homelabs forward. The focus is the single-DGX-Spark setup, which must keep working by default, but adding support for two Sparks is welcome.
- Architecture & How It Works - Understanding the stack, waker service, and request flow.
- Configuration - Environment variables, network settings, and waker tuning.
- Model Selection Guide - Current model catalog, quick chooser, and validation status.
- Integrations - Guides for Cline (VS Code) and OpenCode (Terminal Agent).
- Security & Remote Access - Hardening SSH and setting up restricted port forwarding.
- Troubleshooting & Monitoring - Debugging, logs, and common error solutions.
- Advanced Usage - Adding new models, custom configurations, and persistent operation.
- Runtime Baseline - Which local image tracks the repo expects and how to rebuild them.
- Tools & Validation Harness - The supported smoke, soak, inspection, and manual probe scripts.
- TODO Notes - Ideas I have for what to do next.
- Clone the repository

  ```bash
  git clone <repository-url>
  cd dgx-spark-inference-stack
  ```
- Create necessary directories

  ```bash
  mkdir -p models vllm_cache_huggingface manual_download/openai_gpt-oss-encodings_fix
  ```
- Download required tokenizers (CRITICAL)

  The stack requires manual download of tiktoken files for GPT-OSS models.

  ```bash
  wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O manual_download/openai_gpt-oss-encodings_fix/cl100k_base.tiktoken
  wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -O manual_download/openai_gpt-oss-encodings_fix/o200k_base.tiktoken
  ```
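  As an optional sanity check (my own suggestion, not part of the documented flow), confirm both files landed and are non-empty before building images:

  ```bash
  # Both tiktoken files should exist and have a non-trivial size.
  ls -lh manual_download/openai_gpt-oss-encodings_fix/*.tiktoken
  ```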
- Build Custom Docker Images (MANDATORY)

  The stack uses custom-optimized vLLM images that should be built locally to ensure maximum performance.

  - Time: Expect ~20 minutes per image.
  - Auth: You must authenticate with NVIDIA NGC to pull base images:
    - Create a developer account at the NVIDIA NGC Catalog (must not be in a sanctioned country).
    - Run `docker login nvcr.io` with your credentials.

  Build Commands:
  ```bash
  # Build Avarok image (General Purpose) - MUST use this tag to use the local version over upstream.
  # Build from the repo root so the manually downloaded tokenizer files are included.
  docker build -t avarok/vllm-dgx-spark:v11 -f custom-docker-containers/avarok/Dockerfile .

  # If you want compose services that default to the pinned upstream Avarok image
  # to use your local rebuild instead, export this override for the current shell
  # or place it in .env before running docker compose.
  export VLLM_TRACK_AVAROK=avarok/vllm-dgx-spark:v11

  # Build the repo MXFP4 track used by GPT-OSS.
  # This bakes the manually downloaded tiktoken files into the image.
  docker build -t vllm-node-mxfp4 -f custom-docker-containers/vllm-node-mxfp4/Dockerfile .

  # Build the refreshed TF5 track used by GLM 4.7.
  docker build -t local/vllm-node-tf5:cu131 -f custom-docker-containers/vllm-node-tf5/Dockerfile .

  # Build the upstream-style TF5 track used by Gemma 4 and newer TF5 recipe imports.
  # The active Gemma compose services expect this exact local image tag.
  git clone https://github.com/eugr/spark-vllm-docker tmp/spark-vllm-docker 2>/dev/null || git -C tmp/spark-vllm-docker pull --ff-only
  (cd tmp/spark-vllm-docker && bash build-and-copy.sh --pre-tf)
  ```
  Note: `vllm-node-tf5` is not built from a repo-local Dockerfile today. If you plan to run Gemma 4 or the newer TF5-track Qwen follow-ons, build it explicitly with the upstream helper flow above. See docs/runtime-baseline.md for the exact reproduction notes and build-time network requirements. Compose defaults remain digest-pinned for external pulls, so local rebuilds of the Avarok lane require `VLLM_TRACK_AVAROK=avarok/vllm-dgx-spark:v11`.
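  Before starting the stack, it can help to confirm the expected local tags exist. This is just a convenience check of mine; the tag names are taken from the build commands above:

  ```bash
  # Each expected track should appear in the output; a missing line means
  # that image still needs to be built.
  docker images --format '{{.Repository}}:{{.Tag}}' \
    | grep -E 'avarok/vllm-dgx-spark:v11|vllm-node-mxfp4|vllm-node-tf5'
  ```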
- Start the stack

  ```bash
  # Start gateway and waker only (models start on-demand)
  docker compose up -d

  # Pre-create all enabled model containers once the required local track images exist
  docker compose --profile models up --no-start
  ```
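  To verify the bring-up, a quick hedged check (service names come from the repo's compose file, so the exact output depends on your checkout):

  ```bash
  # Gateway and waker should be listed as running; model containers stay
  # created-but-stopped until the waker starts them on demand.
  docker compose ps --all

  # Follow the stack logs to watch wake and idle-shutdown events.
  docker compose logs -f
  ```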
- Test the API

  ```bash
  # Request to the shipped utility helper
  curl -X POST http://localhost:8009/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${VLLM_API_KEY:-63TestTOKEN0REPLACEME}" \
    -d '{
      "model": "qwen3.5-0.8b",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
  ```
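  If you prefer to see what the gateway advertises first, the stack speaks the OpenAI-compatible API, so listing models should work the same way (assuming the gateway forwards the standard `/v1/models` endpoint):

  ```bash
  # List the model IDs the gateway currently exposes.
  curl -s http://localhost:8009/v1/models \
    -H "Authorization: Bearer ${VLLM_API_KEY:-63TestTOKEN0REPLACEME}"
  ```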
- Use the supported validation harness

  After the first manual curl succeeds, switch to the repo's maintained bring-up flow instead of ad hoc scripts:

  ```bash
  bash tools/validate-stack.sh
  bash tools/smoke-gateway.sh
  ```

  For model-specific bring-up, smoke, soak, and manual probe commands, see tools/README.md.
- Read docs/architecture.md, then tools/README.md.
- Treat tools/README.md plus models.json as the current operational source of truth.
- Treat this README as the short entry point, not the full model catalog. Use docs/models.md for the broader catalog.
- Docker 20.10+ with Docker Compose
- NVIDIA GPU(s) with CUDA support and NVIDIA Container Toolkit
- Linux host (tested on Ubuntu)
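A quick way to confirm the prerequisites before cloning (the CUDA image tag is only an example; any CUDA base image you already have will do):

```bash
docker --version          # expect 20.10 or newer
docker compose version    # Compose v2 plugin

# Verify the NVIDIA Container Toolkit can expose the GPU to containers.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```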
Pull requests are very welcome. :) However, to ensure stability, I enforce a strict Pull Request Template.
Maintainer note: Docker base-image digest refreshes and GitHub Action pin refreshes are gated through Renovate's Dependency Dashboard and scheduled monthly in UTC. If you want one earlier, approve that update from the GitHub Renovate Dependency Dashboard issue.
The README highlights only the stack's current recommended defaults.
- Validated main models: `gpt-oss-20b`, `gpt-oss-120b`, and `glm-4.7-flash-awq`
- Validated utility helper: `qwen3.5-0.8b` for titles and session metadata
- Everything else: available in the repo, but not a README default until it is re-validated on the current harness
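To see which lanes are actually enabled in your checkout (models.json is the operational source of truth per the notes above), a quick hedged inspection, assuming `jq` is installed; adjust the filter to the file's actual schema:

```bash
# Pretty-print the model catalog to review every configured lane.
jq . models.json
```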
For the broader model catalog, experimental lanes, and manual-only paths, use docs/models.md and models.json.
For client caveats, runtime quirks, and troubleshooting notes, use docs/integrations.md and docs/troubleshooting.md.
Special thanks to the community members whose Docker images and recipe work enable this stack:
- Thomas P. Braun from Avarok: For the general-purpose vLLM image (`avarok/vllm-dgx-spark`) with support for non-gated activations (Nemotron) and hybrid models, and for posts like https://blog.avarok.net/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps-8c5569025eb6.
- Christopher Owen: For the initial work on the MXFP4-optimized vLLM images (`christopherowen/vllm-dgx-spark`) enabling high-performance inference on DGX Spark.
- eugr: For the original community DGX Spark vLLM repo (`eugr/spark-vllm-docker`), its customizations, and the great posts on the NVIDIA Forums.
- Patrick Yi / scitrera.ai: For the SGLang utility-model recipe that informed the local `qwen3.5-0.8b` helper path.
- Raphael Amorim: For the community AutoRound recipe shape that informed the experimental local `qwen3.5-122b-a10b-int4-autoround` lane.
- Bjarke Bolding: For the long-context AutoRound recipe shape that informed the experimental local `qwen3-coder-next-int4-autoround` lane.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.