Reproducible local benchmarking workspace for MLX-based model experiments on Apple Silicon.
This repository packages:
- pinned upstream sources (dflash, ddtree, triattention, turboquant-mlx)
- versioned local patches
- pinned Python runtime dependencies
- benchmark runners for both Gemma 4 and Qwen 3.5 experiment tracks
The campaign targeted two practical questions:
- Gemma 4 26B A4B track: Can a local 32 GB Apple Silicon machine run high-parameter Gemma with usable context and speed?
- Qwen 3.5 track: Do DFlash/DDTree/TriAttention/TurboQuant/RotorQuant-style optimizations materially improve local agent workload behavior?
To answer these questions, we measured:
- low-context throughput (prompt/decode speed)
- high-context viability (how far context can be pushed before instability/OOM)
- long-generation stability (can it generate long outputs reliably)
- structured tool-call adherence (JSON/tool schema reliability)
- memory behavior and failure modes (timeout vs OOM)
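The high-context viability measurement amounts to finding the largest prompt length that still completes. A minimal sketch of such a probe follows, assuming a hypothetical `run_ok(prompt_tokens)` callback that launches one benchmark run and returns `True` on success. Neither the callback nor the bisection strategy is taken from this repository's scripts, and the search assumes that success is monotonic in prompt length, which real OOM behavior only approximates.

```python
from typing import Callable


def max_viable_context(
    run_ok: Callable[[int], bool],  # hypothetical: runs one benchmark, True on success
    max_tokens: int,
    step: int = 1024,
) -> int:
    """Bisect over multiples of `step` up to `max_tokens`.

    Returns the largest prompt length for which run_ok succeeded,
    or 0 if no tested length succeeded.
    """
    low, high, best = 1, max_tokens // step, 0
    while low <= high:
        mid = (low + high) // 2
        if run_ok(mid * step):
            best = mid * step  # this length works; try longer prompts
            low = mid + 1
        else:
            high = mid - 1     # this length failed; try shorter prompts
    return best
```

Bisection keeps the number of expensive runs logarithmic in the search range, at the cost of trusting the monotonicity assumption; a linear sweep is slower but makes no such assumption.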
scripts/fetch_vendor_sources.sh
scripts/setup_env.sh
scripts/smoke_test.sh --venv-path .venv
Models tested:
- Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2 (Gemma 4 26B A4B)
- majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-4bit
- mlx-community/gemma-4-e2b-it-4bit
Variants tested:
- baseline
- speculative
- kv4
- triattention
- turboquant-v2-lean
- turboquant-v2-rot
- turboquant-v3-3.5 (compatibility-limited)
- rotorquant
- speculative-rotorquant
Main family tested:
- Qwen3.5-27B variants in MLX-compatible forms (baseline, DFlash, and quant/cache variants)
Optimization directions tested:
- DFlash speculative decode path
- DDTree integration/prototype path
- TriAttention merge path
- TurboQuant (v2/v3) cache substitutions
- RotorQuant model variant
Qwen-specific docs:
- docs/qwen-tuning-report.md
- docs/qwen-tuning-benchmarks.md
- docs/qwen-dflash-sweep.md
- docs/qwen-turboquant-eval.md
- docs/qwen-ddtree-eval.md
- docs/qwen-optimization-report.md
- Decode speed winner: baseline, KV 512 → ~41.86 tok/s
- Prefill winner: triattention, KV 512 → ~62.44 tok/s
- Lowest memory (low-context set): turboquant-v2-lean, KV 1536 → ~14.28 GB
- Recommended profile: rotorquant (non-speculative)
- Strong balance around KV 32768–36864 on this class of machine
- Verified run at ~111k prompt tokens + 32 decode tokens: rotorquant, KV 36864 → prompt ~495.75 tok/s, decode ~14.51 tok/s, peak ~25.67 GB
- Stable meaningful decode verified at ~111,360 prompt tokens (32 decode tokens)
- Stress probe reached ~135,936 prompt tokens (1-token decode)
- Above that, failures become frequent due to Metal OOM (kIOGPUCommandBufferCallbackErrorOutOfMemory)
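Because the campaign distinguishes timeouts from OOM, it helps to classify each failed run from its log text. The sketch below is one way to do that and is not the repository's own parser. The OOM marker is the error name quoted above; the timeout marker is an assumption about how Metal reports GPU timeouts and should be checked against real logs.

```python
OOM_MARKER = "kIOGPUCommandBufferCallbackErrorOutOfMemory"
# Assumption: Metal timeout errors carry this name; verify against actual logs.
TIMEOUT_MARKER = "kIOGPUCommandBufferCallbackErrorTimeout"


def classify_failure(log_text: str, exit_code: int) -> str:
    """Return one of "oom", "timeout", "ok", or "other" for a single run."""
    if OOM_MARKER in log_text:
        return "oom"
    if TIMEOUT_MARKER in log_text:
        return "timeout"
    # No known marker: fall back to the process exit code.
    return "ok" if exit_code == 0 else "other"
```

Checking the log markers before the exit code means a run that printed an OOM error is still counted as OOM even if a wrapper script swallowed the non-zero exit status.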
- baseline: 8192 tokens succeeded (~31.67 tok/s)
- rotorquant: 8192 tokens succeeded (~37.28 tok/s)
- speculative-rotorquant: unstable for long decode (timeouts/crashes at higher lengths)
- Structured JSON/tool-call adherence remained inconsistent across tested variants.
- Throughput/context can be made workable, but robust autonomous tool-calling reliability is still below production-grade expectations in this setup.
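Tool-call adherence can be scored mechanically by checking whether each model output parses as JSON and matches the expected call shape. The sketch below assumes a hypothetical call format, `{"name": ..., "arguments": {...}}`, and a hypothetical `allowed_tools` mapping from tool name to required argument keys; the schema actually used in these benchmarks is not documented here and may differ.

```python
import json


def is_valid_tool_call(text: str, allowed_tools: dict[str, set[str]]) -> bool:
    """True if `text` is a JSON object naming a known tool with all required arguments."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in allowed_tools:
        return False
    if not isinstance(args, dict):
        return False
    # Every required key must be present; extra keys are tolerated.
    return allowed_tools[name].issubset(args.keys())


def adherence_rate(outputs: list[str], allowed_tools: dict[str, set[str]]) -> float:
    """Fraction of outputs that are valid tool calls (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    valid = sum(is_valid_tool_call(o, allowed_tools) for o in outputs)
    return valid / len(outputs)
```

This check is deliberately strict about structure and lenient about extra keys; a stricter harness could also validate argument types against a full JSON Schema.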
scripts/run_gemma4_a4b_variant.sh --variant baseline --preflight
scripts/run_gemma4_a4b_variant.sh --variant rotorquant --dry-run
scripts/benchmark_gemma4_a4b_variants.sh \
--execute \
--max-kv-size 512 \
--max-kv-size 1024 \
--max-kv-size 1536 \
--max-tokens 256 \
--repeats 1
Artifacts are written under artifacts/benchmarks/gemma4-a4b/.
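Repeating `--max-kv-size` suggests the runner executes every variant at every KV size. The following sketch shows how such a matrix could be enumerated; it is an illustration only, and the actual iteration order and variant selection inside `benchmark_gemma4_a4b_variants.sh` are not confirmed here. The function name and dictionary keys are hypothetical.

```python
from itertools import product


def expand_matrix(variants: list[str], kv_sizes: list[int], repeats: int) -> list[dict]:
    """Enumerate one run per (variant, KV size, repeat index) combination."""
    return [
        {"variant": variant, "max_kv_size": kv, "repeat": rep}
        for variant, kv, rep in product(variants, kv_sizes, range(1, repeats + 1))
    ]
```

With two variants, the three KV sizes from the command above, and one repeat, this yields six runs, which is a quick way to estimate wall-clock cost before passing `--execute`.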
scripts/run_dflash_mlx_benchmark.sh --dry-run
scripts/run_qwen_mlx_kv_sweep.sh --help
scripts/run_qwen_turboquant_mlx.sh --help
scripts/run_qwen_ddtree_benchmark.sh --help
Each matrix/sweep run stores:
- resolved config
- per-run status/exit code
- raw logs per variant/KV/token point
- parsed metric fields (prompt TPS, generation TPS, peak memory, finish reason)
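The parsed metric fields can be recovered from a raw log with a few regular expressions. The sketch below assumes the log contains summary lines shaped like `Prompt: N tokens, X tokens-per-sec`, `Generation: N tokens, X tokens-per-sec`, and `Peak memory: X GB`; that shape is an assumption about the underlying generation tool's output, and the field names are hypothetical rather than the repository's own.

```python
import re
from typing import Optional

# Assumed log line shapes; adjust the patterns to match the real raw/*.log files.
_PATTERNS = {
    "prompt_tps": re.compile(r"Prompt:\s*\d+\s*tokens,\s*([\d.]+)\s*tokens-per-sec"),
    "generation_tps": re.compile(r"Generation:\s*\d+\s*tokens,\s*([\d.]+)\s*tokens-per-sec"),
    "peak_memory_gb": re.compile(r"Peak memory:\s*([\d.]+)\s*GB"),
}


def parse_metrics(log_text: str) -> dict[str, Optional[float]]:
    """Extract throughput and memory figures; a missing field maps to None."""
    metrics: dict[str, Optional[float]] = {}
    for field, pattern in _PATTERNS.items():
        match = pattern.search(log_text)
        metrics[field] = float(match.group(1)) if match else None
    return metrics
```

Returning `None` for absent fields, rather than raising, lets a crashed run still produce a row, so failures stay visible in the summary instead of disappearing.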
Core files:
- results.csv
- summary.txt
- raw/*.log
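A results file like this can be queried for the best run per metric. The sketch below assumes hypothetical column names (`variant`, `status`, `generation_tps`) and a hypothetical success value of `ok`; check the header row of an actual `results.csv` before relying on them.

```python
import csv
from typing import Iterable, Optional


def best_run(csv_lines: Iterable[str], metric: str = "generation_tps") -> Optional[dict]:
    """Return the successful row with the highest `metric`, or None if there is none.

    `csv_lines` can be an open file handle or any iterable of CSV text lines.
    """
    best = None
    for row in csv.DictReader(csv_lines):
        if row.get("status") != "ok":  # assumed success marker
            continue
        try:
            value = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the metric is missing or not numeric
        if best is None or value > float(best[metric]):
            best = row
    return best
```

Passing an open file handle, for example `best_run(open(path, newline=""))`, works because `csv.DictReader` accepts any iterable of lines.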
- Pinned sources: scripts/pinned_sources.sh
- Deterministic fetch: scripts/fetch_vendor_sources.sh
- Versioned patches: patches/*.patch applied by scripts/apply_vendor_patches.sh
- Pinned runtime deps: requirements.txt
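Reproducibility depends on every runtime dependency being pinned to an exact version. The sketch below is a simple heuristic check, not part of the repository: it treats only `==` specifiers as pinned and skips comments, blank lines, and pip option lines, so direct URL references would need separate handling.

```python
from typing import Iterable


def unpinned_requirements(lines: Iterable[str]) -> list[str]:
    """Return requirement lines that lack an exact `==` version pin."""
    loose = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # blank line or pip option (-r, -e, ...)
            continue
        if "==" not in line:
            loose.append(line)
    return loose
```

Running it over `open("requirements.txt")` and failing when the returned list is non-empty gives a cheap guard against an accidental loose specifier.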
Recreate the same state on another machine by rerunning:
scripts/fetch_vendor_sources.sh
scripts/setup_env.sh
- scripts/: benchmark runners, setup, and orchestration
- patches/: local vendor patch set
- docs/: benchmark and bootstrap documentation
- vendor/: pinned upstream checkouts (ignored in git)
- This repo intentionally avoids global package installation.
- Heavy benchmark artifacts are local-only (artifacts/) and not committed.
- Some advanced combinations remain experimental and may hit Metal timeout/OOM depending on machine limits.
- rotorquant is the strongest high-context candidate in this campaign.
- speculative-rotorquant improved some short-context cases but was not stable in long-generation stress.