
v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM


@unamedkr released this 05 Apr 06:36
· 551 commits to main since this release

What's New

Gemma 4 26B-A4B MoE Support

Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding-window + full), QK-norm, learned RoPE, and GeGLU activation. Generates correct answers in both English and Korean.
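The GeGLU part of that stack is easy to show in isolation. Below is a minimal C sketch of the GeGLU feed-forward activation (a GELU-gated linear unit); the function names and the tanh GELU approximation are illustrative, not this repo's internal API.

```c
#include <math.h>

/* tanh-approximation GELU, the variant commonly used in Gemma-style FFNs */
static float gelu(float x) {
    return 0.5f * x * (1.0f + tanhf(0.7978845608f /* sqrt(2/pi) */ *
                                    (x + 0.044715f * x * x * x)));
}

/* GeGLU: out[i] = GELU(gate[i]) * up[i], where gate and up are the two
 * FFN projections of the hidden state produced by the selected expert */
static void geglu(const float *gate, const float *up, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = gelu(gate[i]) * up[i];
}
```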

7x KV Cache Compression

Same hardware, 7x longer context, zero quality loss.

Model                      FP16 KV      quant.cpp KV   Gain
Llama 3.2 3B (16GB Mac)    50K tokens   350K tokens    6.9x
Gemma 4 26B (16GB Mac)     4K tokens    30K tokens     6.9x
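For context on where a figure like this comes from: block-wise low-bit quantization of the cached K/V tensors replaces FP16 values with small integer codes plus one scale per block. The sketch below is a generic 2-bit block format under assumed parameters (32-value blocks, a float scale, the made-up name kv_block_q2); the formats actually shipped in quant.cpp may differ.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define KV_BLOCK 32

typedef struct {
    float   scale;                /* per-block scale */
    uint8_t codes[KV_BLOCK / 4];  /* 2-bit code per value, 4 values per byte */
} kv_block_q2;

/* quantize one block to the symmetric levels {-1.5, -0.5, 0.5, 1.5} * scale */
static void kv_quantize_block(const float *x, kv_block_q2 *b) {
    float amax = 0.0f;
    for (int i = 0; i < KV_BLOCK; i++)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    b->scale = amax > 0.0f ? amax / 1.5f : 1.0f;
    memset(b->codes, 0, sizeof b->codes);
    for (int i = 0; i < KV_BLOCK; i++) {
        int q = (int)lroundf(x[i] / b->scale + 1.5f);  /* nearest level index */
        if (q < 0) q = 0;
        if (q > 3) q = 3;
        b->codes[i / 4] |= (uint8_t)(q << ((i % 4) * 2));
    }
}

/* decode a block back to floats when the attention kernel reads the cache */
static void kv_dequantize_block(const kv_block_q2 *b, float *x) {
    for (int i = 0; i < KV_BLOCK; i++) {
        int q = (b->codes[i / 4] >> ((i % 4) * 2)) & 3;
        x[i] = ((float)q - 1.5f) * b->scale;
    }
}
```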

New Models

  • Llama 3.2 3B Instruct — 17 tok/s, correct code generation
  • Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE

WASM Browser Demo

192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
Try it
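For readers curious how a C engine ends up chatting in a browser tab: the usual pattern is to compile the C code with Emscripten and export a couple of entry points that the page's JavaScript calls with the dropped-in GGUF bytes. The sketch below uses that pattern with hypothetical export names (wasm_load_model, wasm_generate); it is not the demo's real interface.

```c
#include <emscripten/emscripten.h>
#include <stddef.h>

/* called from JS with a pointer to the GGUF bytes dropped onto the page */
EMSCRIPTEN_KEEPALIVE
int wasm_load_model(const unsigned char *gguf_bytes, size_t len) {
    /* parse the GGUF header and map tensors; nothing leaves the client */
    (void)gguf_bytes; (void)len;
    return 0;
}

/* called from JS once per user prompt; returns UTF-8 text back to the page */
EMSCRIPTEN_KEEPALIVE
const char *wasm_generate(const char *prompt) {
    /* run the decode loop here */
    (void)prompt;
    return "";
}
```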

Windows (MSVC) Support

Compiles with Visual Studio 2019/2022, using a pthread shim and a C11 atomics compatibility layer.
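A pthread shim for MSVC typically maps pthread_create/pthread_join onto Win32 threads. The sketch below shows that general pattern under assumed minimal needs; it is illustrative, not the shim actually shipped in this release.

```c
#ifdef _MSC_VER
#include <windows.h>
#include <process.h>
#include <stdlib.h>

typedef HANDLE pthread_t;

typedef struct {
    void *(*start)(void *);
    void *arg;
} thread_trampoline_t;

/* Win32 thread entry adapts the pthread-style start routine */
static unsigned __stdcall thread_entry(void *p) {
    thread_trampoline_t t = *(thread_trampoline_t *)p;
    free(p);
    t.start(t.arg);
    return 0;
}

static int pthread_create(pthread_t *th, const void *attr,
                          void *(*start)(void *), void *arg) {
    thread_trampoline_t *t = malloc(sizeof *t);
    (void)attr;
    if (!t) return -1;
    t->start = start;
    t->arg = arg;
    *th = (HANDLE)_beginthreadex(NULL, 0, thread_entry, t, 0, NULL);
    return *th ? 0 : -1;
}

static int pthread_join(pthread_t th, void **retval) {
    (void)retval;
    WaitForSingleObject(th, INFINITE);
    CloseHandle(th);
    return 0;
}
#else
#include <pthread.h>
#endif
```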

quant.h Synced

The single header now includes Gemma 4, Llama 3, and IQ3_XXS support. Build with cc app.c -lm -lpthread and you're done.
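To illustrate what that single-header workflow looks like, here is a skeleton program. Only the build line comes from the release notes; the tq_* calls in the comments are hypothetical placeholders standing in for whatever quant.h actually exports.

```c
/* build: cc app.c -lm -lpthread   (build line from the release notes)
 * The tq_* calls below are HYPOTHETICAL placeholders, not quant.h's real API. */
#include <stdio.h>
#include "quant.h"

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s model.gguf \"prompt\"\n", argv[0]);
        return 1;
    }
    /* hypothetical flow: load a GGUF model, generate from a prompt, free it */
    /* tq_model *m = tq_load(argv[1]);   */
    /* tq_generate(m, argv[2], stdout);  */
    /* tq_free(m);                       */
    return 0;
}
```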

Documentation

Performance

  • Gemma 4 26B: 549ms → 257ms/token (-53%)
  • Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)

Bug Fixes

  • Gemma 4 NaN regression, Llama head_dim misdetection
  • TQ_STATIC_ASSERT in C mode (see the sketch after this list), stack buffer overflow
  • Zero build warnings, 34/34 tests pass, score 99.2%
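On the TQ_STATIC_ASSERT item: a common way to make such a macro work across C11, older C, and C++ is shown below. This is a generic sketch of the technique, not necessarily how the macro is defined in this repo.

```c
#if defined(__cplusplus)
  #define TQ_STATIC_ASSERT(cond, msg) static_assert(cond, msg)
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
  #define TQ_STATIC_ASSERT(cond, msg) _Static_assert(cond, msg)
#else
  /* fallback: an array with negative size forces a compile-time error */
  #define TQ_STATIC_ASSERT(cond, msg) \
      typedef char tq_static_assert_failed[(cond) ? 1 : -1]
#endif

/* example usage: fail the build if int is narrower than expected */
TQ_STATIC_ASSERT(sizeof(int) >= 4, "int must be at least 32 bits");
```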

Full changelog: CHANGELOG.md

Full Changelog: v0.2.0...v0.5.0