v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding-window + full), QK-norm, learned RoPE, GeGLU activation. Generates correct output in both English and Korean.
7x KV Cache Compression
Same hardware, ~7x longer context, no measured quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
→ Try it
Windows (MSVC) Support
Builds with Visual Studio 2019/2022, via a pthread shim and a C11 atomics compatibility layer.
quant.h Synced
The single header now includes Gemma 4, Llama 3, and IQ3_XXS support. `cc app.c -lm -lpthread` — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549 ms → 257 ms per token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0