- CPU: AMD Ryzen 7 5800H (with AVX2 support)
- GPU: AMD Radeon / Integrated Graphics (iGPU)
- Memory: UMA Shared Memory (8GB allocated in BIOS)
- Binary: `tabby_x86_64-manylinux2014-vulkan` (Tabby version 0.10.0)
- Model: `Qwen2.5-Coder-7B-Instruct-q4_k_m.gguf`
- Host OS: Ubuntu 24.04 (minimized)
The Vulkan-supported binary for Tabby (tabby_x86_64-manylinux2014-vulkan) is EXTREMELY LIMITED. This setup is only recommended if you absolutely require GPU acceleration on non-NVIDIA hardware and are willing to sacrifice almost all core features of Tabby.
- No `config.toml` Support
  - The binary does not scan or recognize external configuration files.
  - All settings must be passed strictly through CLI arguments (e.g., `--model`, `--chat-model`, `--device`).
  - If a feature isn't available as a CLI flag, it cannot be configured.
- No Embedding/RAG Functionality
  - The embedding engine and indexing logic (Tantivy/Semantic Search) are completely stripped from this binary.
  - Even if you have embedding models in your storage, the binary will ignore them.
  - You cannot use the "Repository" or "Answer with Context" features.
- Inference-Only Architecture
  - This build is a "bare-bones" runner designed solely for text generation (Chat/Completion).
  - It lacks the internal routing to connect to external embedding servers or sidecar containers.
- Monitoring Confusion (UMA/iGPU)
  - When using AMD APUs with UMA (Unified Memory Architecture), container monitoring tools (like Dozzle) may only report ~300MB of RAM usage.
  - Actual VRAM usage (e.g., 4GB+ for a 7B model) is handled directly by the Vulkan driver and is invisible to standard Docker stats. Use `amdgpu_top` or `radeontop` on the host to verify GPU load.
- UMA Memory Allocation: In AMD APU setups, the model is loaded into the VRAM reserved in BIOS. This memory is managed by the Vulkan driver at the hardware level.
- Dozzle / Docker Stats Limitation: These tools only track System RAM (RSS). Consequently, they may report only ~300MB of usage, failing to account for the GGUF model residing in the GPU's memory space.
- Verification: This is not a bug. To see actual load, use host-level tools like `amdgpu_top` or `radeontop`.
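
If neither tool is installed, the memory counters can also be read straight from sysfs on the host. The sketch below assumes the `amdgpu` kernel driver, which exposes `mem_info_vram_used`, `mem_info_vram_total`, `mem_info_gtt_used`, and `mem_info_gtt_total` (values in bytes) under `/sys/class/drm/cardN/device/`. The function names are made up for illustration, and whether the model shows up under the VRAM or the GTT counters may depend on the driver and BIOS settings.

```python
import re
from pathlib import Path

# Counters exposed by the amdgpu kernel driver; each file holds a byte count.
COUNTERS = (
    "mem_info_vram_used",
    "mem_info_vram_total",
    "mem_info_gtt_used",
    "mem_info_gtt_total",
)


def read_gpu_memory(device_dir: Path) -> dict:
    """Return the counters found in one device directory, converted to MiB."""
    usage = {}
    for name in COUNTERS:
        counter = device_dir / name
        if counter.is_file():
            usage[name] = int(counter.read_text().strip()) // (1024 * 1024)
    return usage


def scan_cards(drm_root: Path = Path("/sys/class/drm")) -> dict:
    """Map each DRM card (card0, card1, ...) to its memory counters.

    Connector entries such as card0-DP-1 and cards without amdgpu
    counters are skipped.
    """
    result = {}
    for card in sorted(drm_root.glob("card*")):
        if not re.fullmatch(r"card\d+", card.name):
            continue
        usage = read_gpu_memory(card / "device")
        if usage:
            result[card.name] = usage
    return result
```

Calling `scan_cards()` on the host while a model is loaded should show the several gigabytes that Docker stats do not account for. Inside a container the result depends on whether `/sys` is visible there.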
To save you time, here is what failed during my research:
- Grep Analysis: Scanned the binary for `embedding`, `config.toml`, `llama_cpp`, and `async_openai`.
  - Found `async_openai..types..embedding`, but it's only a remnant library without any functional entry point (CLI/Env).
  - No strings related to `config.toml` or `embedding_model` flags were found in the Vulkan build.
- Config Ingestion: Tried placing `config.toml` in various paths (`/data`, `/root/.tabby`, etc.), but the binary ignored them all.
- Embedding Sidecar: Attempted to connect a separate embedding container, but this binary has no "client" logic to send requests to an external endpoint.
- Custom Model Directory: The Llama engine inside the Tabby binary insists on reading `tabby/models/${MODEL_NAME}/tabby.json` and `tabby/models/${MODEL_NAME}/ggml/q8_0.v2.gguf`, not just `tabby/models/${MODEL_NAME}/model.gguf`. As a workaround (not recommended), I renamed my `model.gguf` to `ggml/q8_0.v2.gguf` and wrote `tabby.json` manually.
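
The custom model directory workaround can be scripted. This is a minimal sketch of the layout described above, not an official procedure: `install_local_model` and its parameters are hypothetical names, and since the required contents of `tabby.json` are not documented here, the function simply writes whatever dictionary the caller supplies. It copies the GGUF file instead of renaming it so the original stays intact, at the cost of extra disk space.

```python
import json
import shutil
from pathlib import Path


def install_local_model(models_root: Path, model_name: str,
                        gguf_source: Path, metadata: dict) -> Path:
    """Place a GGUF file where this build looks for it.

    Resulting tree (paths as reported in the text):
        <models_root>/<model_name>/tabby.json
        <models_root>/<model_name>/ggml/q8_0.v2.gguf
    """
    model_dir = models_root / model_name
    ggml_dir = model_dir / "ggml"
    ggml_dir.mkdir(parents=True, exist_ok=True)

    # Copy rather than rename, so the original file is left untouched.
    shutil.copy2(gguf_source, ggml_dir / "q8_0.v2.gguf")

    # The schema of tabby.json is an open assumption: the caller decides
    # which keys to write, and this function passes them through verbatim.
    (model_dir / "tabby.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return model_dir
```

For the setup in this note, `models_root` would be the `tabby/models` directory and `gguf_source` the downloaded `Qwen2.5-Coder-7B-Instruct-q4_k_m.gguf`. The file keeps the name `q8_0.v2.gguf` regardless of its real quantization, which is one reason the workaround is not recommended.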
If you require RAG (Embedding) or Config File support on AMD hardware:
- Option A: Use the official CPU-only Docker image. It supports all features, and modern AMD CPUs (AVX2) provide acceptable performance for 7B models.
- Option B: Migrate to vLLM (ROCm-based) or llama.cpp (server), which offer superior AMD support and full feature parity. For an easier setup, consider Aphrodite or Harbor.
- Option C: Switch to Claude (paid).
"Only the strong (and the extremely patient) should venture beyond this point."
PS: Thanks, Gemini, but you were the one who suggested this engine to me. I'll keep your "helpful" suggestion in mind: "Tabby supports Vulkan, and you can use RAG!"
Last Updated: 2026-04-28