Unified edge AI inference engine: run local LLMs on low-resource devices with minimal RAM. Think of it as local AI as lightweight as a crab—optimized for edge, laptops, and machines where you want to run 8B–13B models in about 4GB RAM.
Cheese Crab is built on cheese.cpp and adds mandatory context compression, prompt caching, optional RAG, and vision-token reduction so that inference stays fast and memory use stays low. Let the crab nibble your prompt.
Get from zero to chat in three steps.
-
Build
mkdir build && cd build cmake .. -DCMAKE_BUILD_TYPE=Release cmake --build . -j
Binaries land in
build/bin/. -
Run a demo
No model yet? The crab can fetch one and serve the minimal Web UI:./build/bin/cheese-server --quickstart --webui --port 8080
Or pull a specific model and start the server:
./build/bin/cheese-server --pull hf://user/repo:Q4_K_M --webui --port 8080
-
Chat
Open http://localhost:8080 in your browser. Use the crab-themed UI: type in “Nibble prompt here…”, send, and watch the stream. Or call the API:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \ -d '{"model":"","messages":[{"role":"user","content":"Hello!"}],"stream":true}'
For Crab Mode (aggressive squeeze, 10 min cache): add --extreme. For Ultra Crab Mode (ultra light, Q2_K cache): add --low-ram.
High-level flow: squeeze context, cache prefixes, infer, optionally store.
flowchart LR
A[Prompt / context] --> B[Squeeze]
B --> C[Cache lookup]
C --> D[Infer]
D --> E[Response]
C -.-> F[(Cache storage)]
F -.-> C
Squeezing context is like melting cheese so the crab can snack faster; caching prefixes means the crab reuses work across turns.
-
Edge and low-resource inference
Run quantized LLMs (e.g. Q4_0, Q5_0) on CPU or GPU with small RAM footprint. Defaults (Q4_0 KV cache, context squeeze, vision squeeze) are tuned for ~4GB usage. -
Context and vision squeeze
- Text: Long prompts are compressed before tokenization (context squeezer) so the model sees fewer tokens.
- Vision / video: After the vision encoder, embeddings are reduced (merge or subsample) so the LLM gets fewer image/video tokens.
Both are built in and always enabled in the server/CLI path.
-
Prompt prefix caching
Repeated prompt prefixes are cached (pomaicache) to avoid recomputing the same KV cache across turns or users. -
Optional RAG (PomaiDB)
When configured, the server can augment prompts with retrieved chunks from a local vector DB for retrieval-augmented generation. -
Single stack: CLI + HTTP server
Use the same binary for interactive chat (cheese-cli) or an OpenAI-compatible API (cheese-server).
Requirements: CMake 3.14+, C++17, and (optional) a GPU backend.
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -jBinaries are produced under build/bin/, including:
cheese-cli— interactive chat CLIcheese-server— HTTP API servercheese-quantize— quantize GGUF modelscheese-mtmd-cli— multimodal (image/audio) CLI
Run with a local GGUF model:
./build/bin/cheese-cli -m models/qwen0.5b.ggufOne-off completion with a prompt and token limit:
./build/bin/cheese-cli -m models/qwen0.5b.gguf -p "Hello, how are you?" -n 64Start without a model and load or pull one from the prompt:
./build/bin/cheese-cli
# Then: /model load models/qwen0.5b.gguf
# Or: /model pull user/repo:Q4_K_MUse more context or GPU layers if you have the RAM/VRAM:
./build/bin/cheese-cli -m models/qwen0.5b.gguf -c 2048 -ngl 99CPU-only (no GPU): The default KV cache type is Q4_0, which requires Flash Attention. On builds without GPU support you must use non-quantized cache: add -ctk f16 -ctv f16 (or -ctk f32 -ctv f32). Otherwise you may see "V cache quantization requires flash_attn" and a crash.
Serve a single model:
./build/bin/cheese-server -m models/qwen0.5b.gguf --host 0.0.0.0 --port 8080With optional web UI:
./build/bin/cheese-server -m models/qwen0.5b.gguf --port 8080 --webuiChat completions are then available at http://localhost:8080/v1/chat/completions (and other OpenAI-style routes).
Using multiple GGUF files from models/: To use all .gguf files in a directory (e.g. models/) and switch between them without restarting, run the server in router (multi-model) mode: omit -m and set --models-dir:
./build/bin/cheese-server --models-dir models --webui --port 8080The server discovers models from models/ and lists them in /v1/models. Load or switch models via the API (POST /models/load with {"model": "modelname"}) or the crab UI model selector. For a single-model setup, start with -m path/to/model.gguf; only one model is loaded and /v1/models returns that model.
-
Text: Context squeeze is enabled by default (aggressiveness 4, min length 8192 chars so short chats are not squeezed). Tune with:
CHEESE_SQUEEZE_AGGRESSIVENESS(0–10)- Server/params:
contextsqueeze_aggressiveness,contextsqueeze_min_chars
-
Vision: Vision token squeeze runs after the encoder (default aggressiveness 1). Tune with:
CHEESE_VISION_SQUEEZE_AGGRESSIVENESS(0–9)- Server/params:
vision_squeeze_aggressiveness
The server uses the model’s built-in chat template by default. To override or when the GGUF has no template, use --chat-template-file with a Jinja file from models/templates/:
./build/bin/cheese-server -m models/qwen0.5b.gguf --chat-template-file models/templates/Qwen-Qwen3-0.6B.jinja --webui --port 8080See models/templates/README.md for a model → template mapping (e.g. Qwen 0.5B/0.6B → Qwen-Qwen3-0.6B.jinja) so templates in that directory are used and not wasted.
When building with PomaiDB, set a RAG DB path and embedding dimension so the server can augment prompts with retrieved chunks:
- Server params:
rag_db_path,rag_dim,rag_topk,rag_token_budget,rag_membrane - Only used when
rag_db_pathis non-empty.
Use the multimodal CLI with a vision-capable model and projector:
./build/bin/cheese-mtmd-cli -m /path/to/model.gguf --mmproj /path/to/mmproj.gguf --image image.pngVision token squeeze applies automatically when the server or pipeline uses the mtmd path with a positive vision_squeeze_aggressiveness. With -hf user/repo, the crab can auto-fetch the mmproj from Hugging Face when it’s missing.
To run heavier models (e.g. 8B) on limited RAM:
-
Prefer a smaller quantization. Use Q4_K_M or Q3_K_M instead of Q8_0 so the model uses ~4–5 GB instead of ~8 GB. Download a pre-quantized file from Hugging Face when available.
-
Example: pull and run 8B Q4_K_M (CPU-only, 16GB-friendly):
./build/bin/cheese-cli -hf Qwen/Qwen3-8B-GGUF -hff Qwen3-8B-Q4_K_M.gguf --jinja --color auto -fa off -ngl 0 -ctk f16 -ctv f16 -c 8192 --temp 0.6 --top-p 0.95 -n 2048
With Crab Mode for more memory savings:
./build/bin/cheese-cli -hf Qwen/Qwen3-8B-GGUF -hff Qwen3-8B-Q4_K_M.gguf --jinja --color auto -fa off -ngl 0 -ctk f16 -ctv f16 -c 8192 --extreme -n 2048 --temp 0.6 --top-p 0.95
-
Local quantization: To build your own Q4_K_M or Q3_K_M from an F16 GGUF (or requantize from Q8_0), use tools/quantize/README.md. For requantizing an existing quant (e.g. Q8_0 → Q4_K_M), add
--allow-requantizetocheese-quantize(quality may drop slightly). -
Runtime tips: Use a smaller context (
-c 8192or-c 4096), add--extremefor Crab Mode (aggressive context squeeze), and on CPU-only builds always add-ctk f16 -ctv f16so the KV cache does not require Flash Attention.
A minimal image (target <200 MB) can be built with Alpine:
FROM alpine:3.19 AS builder
RUN apk add --no-cache build-base cmake git
WORKDIR /src
COPY . .
RUN mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . -j
FROM alpine:3.19
RUN apk add --no-cache libstdc++
COPY --from=builder /src/build/bin/cheese-server /usr/local/bin/
COPY --from=builder /src/build/bin/cheese-cli /usr/local/bin/
EXPOSE 8080
ENTRYPOINT ["cheese-server"]Build and run with quickstart or pull:
docker build -t cheesecrab .
docker run -p 8080:8080 cheesecrab --quickstart --webui --port 8080
# or pull a model first:
docker run -p 8080:8080 cheesecrab --pull hf://user/repo:Q4_K_M --webui --port 8080With Nix (and Flakes) you can build and run Cheese Crab without installing CMake or compilers:
nix build .#default
./result/bin/cheese-server --quickstart --webui --port 8080Or run the CLI directly:
nix run .#default -- -m /path/to/model.gguf -p "Let the crab nibble!"| Area | Detail |
|---|---|
| KV cache | Default type is Q4_0 (not F16) to save RAM. Override with -ctk / -ctv or env CHEESE_ARG_CACHE_TYPE_K / CHEESE_ARG_CACHE_TYPE_V. |
| Models | Place GGUF files in models/ (or any path). Use -m path/to/model.gguf or /model load path. |
| Quantization | Use cheese-quantize to produce Q4_0/Q5_0 etc. Quantized models are recommended for edge. |
| Tests / benches | From build/: ctest -j4 runs tests; vendor benches: ./bin/bench-contextsqueezer, ./bin/bench-pomaicache, ./bin/bench-pomaidb-rag. |
src/— Core library (cheese, KV cache, model loading).common/— Shared CLI/server params, parsing, download.tools/cli/—cheese-cli.tools/server/—cheese-serverand server logic.tools/mtmd/— Multimodal (vision/audio) encode/decode.vendor/— Vendored libs: contextsqueezer, pomaicache, pomaidb, etc.
For more on the CLI and server (all flags, env vars), see:
