Cheese Crab

Unified edge AI inference engine: run local LLMs on low-resource devices with minimal RAM. Think of it as local AI as lightweight as a crab—optimized for edge, laptops, and machines where you want to run 8B–13B models in about 4GB RAM.

Cheese Crab is built on cheese.cpp and adds mandatory context compression, prompt caching, optional RAG, and vision-token reduction so that inference stays fast and memory use stays low. Let the crab nibble your prompt.

5-Minute Quickstart

Get from zero to chat in three steps.

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j

Binaries land in build/bin/.

Run a demo
No model yet? The crab can fetch one and serve the minimal Web UI:

./build/bin/cheese-server --quickstart --webui --port 8080

Or pull a specific model and start the server:

./build/bin/cheese-server --pull hf://user/repo:Q4_K_M --webui --port 8080

Chat
Open http://localhost:8080 in your browser. Use the crab-themed UI: type in “Nibble prompt here…”, send, and watch the stream. Or call the API:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"","messages":[{"role":"user","content":"Hello!"}],"stream":true}'

For Crab Mode (aggressive squeeze, 10 min cache): add --extreme. For Ultra Crab Mode (ultra light, Q2_K cache): add --low-ram.

Crab Pipeline

High-level flow: squeeze context, cache prefixes, infer, optionally store.

flowchart LR
  A[Prompt / context] --> B[Squeeze]
  B --> C[Cache lookup]
  C --> D[Infer]
  D --> E[Response]
  C -.-> F[(Cache storage)]
  F -.-> C

Squeezing context is like melting cheese so the crab can snack faster; caching prefixes means the crab reuses work across turns.

Main purposes

Edge and low-resource inference
Run quantized LLMs (e.g. Q4_0, Q5_0) on CPU or GPU with small RAM footprint. Defaults (Q4_0 KV cache, context squeeze, vision squeeze) are tuned for ~4GB usage.
Context and vision squeeze
- Text: Long prompts are compressed before tokenization (context squeezer) so the model sees fewer tokens.
- Vision / video: After the vision encoder, embeddings are reduced (merge or subsample) so the LLM gets fewer image/video tokens.
  Both are built in and always enabled in the server/CLI path.
Prompt prefix caching
Repeated prompt prefixes are cached (pomaicache) to avoid recomputing the same KV cache across turns or users.
Optional RAG (PomaiDB)
When configured, the server can augment prompts with retrieved chunks from a local vector DB for retrieval-augmented generation.
Single stack: CLI + HTTP server
Use the same binary for interactive chat (cheese-cli) or an OpenAI-compatible API (cheese-server).

Build

Requirements: CMake 3.14+, C++17, and (optional) a GPU backend.

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j

Binaries are produced under build/bin/, including:

cheese-cli — interactive chat CLI
cheese-server — HTTP API server
cheese-quantize — quantize GGUF models
cheese-mtmd-cli — multimodal (image/audio) CLI

Usage examples

1. Interactive chat (CLI)

Run with a local GGUF model:

./build/bin/cheese-cli -m models/qwen0.5b.gguf

One-off completion with a prompt and token limit:

./build/bin/cheese-cli -m models/qwen0.5b.gguf -p "Hello, how are you?" -n 64

Start without a model and load or pull one from the prompt:

./build/bin/cheese-cli
# Then: /model load models/qwen0.5b.gguf
# Or:  /model pull user/repo:Q4_K_M

Use more context or GPU layers if you have the RAM/VRAM:

./build/bin/cheese-cli -m models/qwen0.5b.gguf -c 2048 -ngl 99

CPU-only (no GPU): The default KV cache type is Q4_0, which requires Flash Attention. On builds without GPU support you must use non-quantized cache: add -ctk f16 -ctv f16 (or -ctk f32 -ctv f32). Otherwise you may see "V cache quantization requires flash_attn" and a crash.

2. HTTP server (OpenAI-compatible API)

Serve a single model:

./build/bin/cheese-server -m models/qwen0.5b.gguf --host 0.0.0.0 --port 8080

With optional web UI:

./build/bin/cheese-server -m models/qwen0.5b.gguf --port 8080 --webui

Chat completions are then available at http://localhost:8080/v1/chat/completions (and other OpenAI-style routes).

Using multiple GGUF files from models/: To use all .gguf files in a directory (e.g. models/) and switch between them without restarting, run the server in router (multi-model) mode: omit -m and set --models-dir:

./build/bin/cheese-server --models-dir models --webui --port 8080

The server discovers models from models/ and lists them in /v1/models. Load or switch models via the API (POST /models/load with {"model": "modelname"}) or the crab UI model selector. For a single-model setup, start with -m path/to/model.gguf; only one model is loaded and /v1/models returns that model.

3. Context and vision squeeze (defaults)

Text: Context squeeze is enabled by default (aggressiveness 4, min length 8192 chars so short chats are not squeezed). Tune with:
- CHEESE_SQUEEZE_AGGRESSIVENESS (0–10)
- Server/params: contextsqueeze_aggressiveness, contextsqueeze_min_chars
Vision: Vision token squeeze runs after the encoder (default aggressiveness 1). Tune with:
- CHEESE_VISION_SQUEEZE_AGGRESSIVENESS (0–9)
- Server/params: vision_squeeze_aggressiveness

4. Chat templates and `models/templates`

The server uses the model’s built-in chat template by default. To override or when the GGUF has no template, use --chat-template-file with a Jinja file from models/templates/:

./build/bin/cheese-server -m models/qwen0.5b.gguf --chat-template-file models/templates/Qwen-Qwen3-0.6B.jinja --webui --port 8080

See models/templates/README.md for a model → template mapping (e.g. Qwen 0.5B/0.6B → Qwen-Qwen3-0.6B.jinja) so templates in that directory are used and not wasted.

5. RAG (PomaiDB)

When building with PomaiDB, set a RAG DB path and embedding dimension so the server can augment prompts with retrieved chunks:

Server params: rag_db_path, rag_dim, rag_topk, rag_token_budget, rag_membrane
Only used when rag_db_path is non-empty.

6. Multimodal (image / audio)

Use the multimodal CLI with a vision-capable model and projector:

./build/bin/cheese-mtmd-cli -m /path/to/model.gguf --mmproj /path/to/mmproj.gguf --image image.png

Vision token squeeze applies automatically when the server or pipeline uses the mtmd path with a positive vision_squeeze_aggressiveness. With -hf user/repo, the crab can auto-fetch the mmproj from Hugging Face when it’s missing.

7. Running on low-resource devices (e.g. 16GB laptop)

To run heavier models (e.g. 8B) on limited RAM:

Prefer a smaller quantization. Use Q4_K_M or Q3_K_M instead of Q8_0 so the model uses ~4–5 GB instead of ~8 GB. Download a pre-quantized file from Hugging Face when available.

Example: pull and run 8B Q4_K_M (CPU-only, 16GB-friendly):

./build/bin/cheese-cli -hf Qwen/Qwen3-8B-GGUF -hff Qwen3-8B-Q4_K_M.gguf --jinja --color auto -fa off -ngl 0 -ctk f16 -ctv f16 -c 8192 --temp 0.6 --top-p 0.95 -n 2048

With Crab Mode for more memory savings:

./build/bin/cheese-cli -hf Qwen/Qwen3-8B-GGUF -hff Qwen3-8B-Q4_K_M.gguf --jinja --color auto -fa off -ngl 0 -ctk f16 -ctv f16 -c 8192 --extreme -n 2048 --temp 0.6 --top-p 0.95

Local quantization: To build your own Q4_K_M or Q3_K_M from an F16 GGUF (or requantize from Q8_0), use tools/quantize/README.md. For requantizing an existing quant (e.g. Q8_0 → Q4_K_M), add --allow-requantize to cheese-quantize (quality may drop slightly).
Runtime tips: Use a smaller context (-c 8192 or -c 4096), add --extreme for Crab Mode (aggressive context squeeze), and on CPU-only builds always add -ctk f16 -ctv f16 so the KV cache does not require Flash Attention.

Docker

A minimal image (target <200 MB) can be built with Alpine:

FROM alpine:3.19 AS builder
RUN apk add --no-cache build-base cmake git
WORKDIR /src
COPY . .
RUN mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . -j

FROM alpine:3.19
RUN apk add --no-cache libstdc++
COPY --from=builder /src/build/bin/cheese-server /usr/local/bin/
COPY --from=builder /src/build/bin/cheese-cli /usr/local/bin/
EXPOSE 8080
ENTRYPOINT ["cheese-server"]

Build and run with quickstart or pull:

docker build -t cheesecrab .
docker run -p 8080:8080 cheesecrab --quickstart --webui --port 8080
# or pull a model first:
docker run -p 8080:8080 cheesecrab --pull hf://user/repo:Q4_K_M --webui --port 8080

Nix

With Nix (and Flakes) you can build and run Cheese Crab without installing CMake or compilers:

nix build .#default
./result/bin/cheese-server --quickstart --webui --port 8080

Or run the CLI directly:

nix run .#default -- -m /path/to/model.gguf -p "Let the crab nibble!"

Details

Area	Detail
KV cache	Default type is Q4_0 (not F16) to save RAM. Override with `-ctk` / `-ctv` or env `CHEESE_ARG_CACHE_TYPE_K` / `CHEESE_ARG_CACHE_TYPE_V`.
Models	Place GGUF files in `models/` (or any path). Use `-m path/to/model.gguf` or `/model load path`.
Quantization	Use `cheese-quantize` to produce Q4_0/Q5_0 etc. Quantized models are recommended for edge.
Tests / benches	From `build/`: `ctest -j4` runs tests; vendor benches: `./bin/bench-contextsqueezer`, `./bin/bench-pomaicache`, `./bin/bench-pomaidb-rag`.

Project layout (short)

src/ — Core library (cheese, KV cache, model loading).
common/ — Shared CLI/server params, parsing, download.
tools/cli/ — cheese-cli.
tools/server/ — cheese-server and server logic.
tools/mtmd/ — Multimodal (vision/audio) encode/decode.
vendor/ — Vendored libs: contextsqueezer, pomaicache, pomaidb, etc.

For more on the CLI and server (all flags, env vars), see:

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
benches		benches
ci		ci
cmake		cmake
common		common
docs		docs
examples		examples
ggml		ggml
gguf-py		gguf-py
grammars		grammars
include		include
licenses		licenses
media		media
models		models
pocs		pocs
requirements		requirements
scripts		scripts
src		src
tests		tests
tools		tools
vendor		vendor
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
convert_cheese_ggml_to_gguf.py		convert_cheese_ggml_to_gguf.py
convert_hf_to_gguf.py		convert_hf_to_gguf.py
convert_hf_to_gguf_update.py		convert_hf_to_gguf_update.py
convert_lora_to_gguf.py		convert_lora_to_gguf.py
flake.lock		flake.lock
flake.nix		flake.nix
mypy.ini		mypy.ini
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
test-grammar-output.tmp		test-grammar-output.tmp
test-json-schema-input.tmp		test-json-schema-input.tmp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cheese Crab

5-Minute Quickstart

Crab Pipeline

Main purposes

Build

Usage examples

1. Interactive chat (CLI)

2. HTTP server (OpenAI-compatible API)

3. Context and vision squeeze (defaults)

4. Chat templates and `models/templates`

5. RAG (PomaiDB)

6. Multimodal (image / audio)

7. Running on low-resource devices (e.g. 16GB laptop)

Docker

Nix

Details

Project layout (short)

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cheese Crab

5-Minute Quickstart

Crab Pipeline

Main purposes

Build

Usage examples

1. Interactive chat (CLI)

2. HTTP server (OpenAI-compatible API)

3. Context and vision squeeze (defaults)

4. Chat templates and models/templates

5. RAG (PomaiDB)

6. Multimodal (image / audio)

7. Running on low-resource devices (e.g. 16GB laptop)

Docker

Nix

Details

Project layout (short)

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

4. Chat templates and `models/templates`

Packages