v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding-window + full), QK-norm, learned RoPE, GeGLU activation. Generates correct output in both English and Korean.
7x KV Cache Compression
Same hardware, ~7x longer context, no measured quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
→ Try it
Windows (MSVC) Support
Builds with Visual Studio 2019/2022, via a pthread shim and a C11 atomics compatibility layer.
quant.h Synced
The single header now includes Gemma 4, Llama 3, and IQ3_XXS support. `cc app.c -lm -lpthread` — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549 ms → 257 ms per token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0