GitHub - cklxx/arle: Pure-Rust runtime for serving, local agents, On-Policy Distillation, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.

ARLE
Pure-Rust runtime for serving, local agents, On-Policy Distillation, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.

Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog

English · 简体中文

Quick Start

# Apple Silicon — Homebrew
brew install cklxx/tap/arle

# Apple Silicon or Linux x86_64 — one-line installer
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

# Linux + NVIDIA — Docker, no compile
docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest serve --backend cuda --model-path /model

# From source (any backend)
cargo build --release --features cuda --bin arle     # Linux + NVIDIA
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle  # Apple Silicon

Full install matrix + uninstall: docs/install.md.

Serve:

arle serve --backend cuda  --model-path /path/to/Qwen3.5-4B --port 8000
arle serve --backend metal --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000

Talk to it (OpenAI-compatible):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)
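
Streaming: OpenAI-compatible servers conventionally stream replies as server-sent events, one `data: <json>` line per chunk, terminated by `data: [DONE]`. The sketch below parses that wire format with the standard library only. It assumes ARLE follows the same convention (docs/http-api.md is the authoritative contract); `iter_stream_text` and the sample lines are illustrative, not taken from ARLE.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines.

    Assumption: each event is a `data: <json>` line whose JSON carries
    choices[0].delta.content, and the stream ends with `data: [DONE]`.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hypothetical sample of what a streamed reply might look like on the wire.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " from ARLE"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With the `openai` client shown above, passing `stream=True` to `chat.completions.create` returns the same chunks as parsed objects, so hand-parsing is only needed for clients without an SDK.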

Local agent / self-check:

arle                              # interactive REPL with python/shell tools
arle run --prompt "Summarize this repo" --model-path /path/to/Qwen3.5-4B
arle --doctor --json              # CI-friendly self-check
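
A CI job could wrap the self-check roughly as follows. This is a sketch under two assumptions that are not confirmed here: that `arle --doctor --json` prints a single JSON object on stdout, and that a non-zero exit status means the check failed. The report's fields are not documented in this README, so the helpers (`parse_doctor_output`, `run_doctor`, both hypothetical names) only validate that the output parses.

```python
import json
import subprocess


def parse_doctor_output(stdout: str) -> dict:
    """Parse the self-check report; raise ValueError if it is not a JSON object."""
    report = json.loads(stdout)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object from the self-check")
    return report


def run_doctor(binary: str = "arle") -> dict:
    """Run `<binary> --doctor --json` and return the parsed report."""
    proc = subprocess.run(
        [binary, "--doctor", "--json"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: a non-zero exit status signals a failed check.
    if proc.returncode != 0:
        raise RuntimeError(f"self-check failed: {proc.stderr.strip()}")
    return parse_doctor_output(proc.stdout)
```

In a pipeline, call `run_doctor()` and fail the job on `OSError` (binary missing), `RuntimeError`, or `ValueError`.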

More copy-paste: examples/.

Status at a glance

Backend	Platform	Status	Headline
CUDA	Linux + NVIDIA	Stable	Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, CUDA Graph decode. L4 / Qwen3.5-4B BF16 + FP8 KV: 197 tok/s @ c=16 / 4k-in.
Metal	Apple Silicon	Beta	Scheduler-backed serving, chunked prefill, replay prefix reuse. Qwen3.6 35B-A3B 4-bit MLX: 85.6 tok/s decode / 385 ms TTFT on M4 Pro 48GB.
Metal DFlash	Apple Silicon	Beta — default-on	Speculative decode for Qwen3.5. Qwen3.5-4B-4bit bit-identical, c=1..8.
OPD train (CUDA)	Linux + NVIDIA	Beta	2.49–2.91× faster than HuggingFace TRL `GKDTrainer` at matched Qwen3-0.6B setup. LoRA-only: 0.140 s/step at 3.9 GB peak — fits 4 GB consumer cards. `arle train opd --student-model <dir>` ships end-to-end. See Latest Updates.
CPU	Portable	Dev-only	Smoke tests; not a perf target.

Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B) on CUDA + Metal. Next-model queue: DeepSeek V4 (#1) → Qwen 3.6 (#2) — see ROADMAP.md.

Authoritative tier matrix: docs/support-matrix.md · docs/stability-policy.md.

Why ARLE

In agent and RL workloads every turn pays a prefill tax: the system prompt, history, and tool results are re-processed on each turn. ARLE treats this as the core problem in both serving and training:

Multi-turn KV reuse. Slot-sticky reuse + radix-backed tiered KV (T0 GPU → T1 host → T2 disk → T3 cluster) keep prior-turn KV hot.
Paged KV pool. page_size=16 with direct GPU page attach + tail-page CoW for shared prefixes — predictable accounting, cheap prefix sharing.
Shared runtime authority. infer, arle, and the OPD training loop share one Rust runtime + model code path — the OPD teacher is the production-serving runtime, not a separate stack.
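
To put a number on the prefill tax, the sketch below counts prefilled tokens over a conversation with and without prefix reuse. The workload figures are hypothetical, and the model is idealized: it assumes every prior token's KV stays resident and ignores eviction across tiers.

```python
def prefill_tokens(system, turns, reuse):
    """Total tokens prefilled across a conversation.

    `turns` holds (new_input_tokens, generated_tokens) per turn. Without
    reuse each turn re-processes the whole context; with prefix reuse only
    tokens that are not already cached are prefilled.
    """
    total = 0
    cached = 0        # tokens whose KV is already resident
    context = system  # tokens in the context before this turn's input
    for new_input, generated in turns:
        context += new_input
        total += (context - cached) if reuse else context
        # After the turn, prompt and generated tokens all have KV entries.
        context += generated
        cached = context
    return total


# Hypothetical workload: 2,000-token system prompt, then 10 turns that each
# add 500 input tokens (user + tool results) and generate 200 tokens.
turns = [(500, 200)] * 10
without_reuse = prefill_tokens(2000, turns, reuse=False)
with_reuse = prefill_tokens(2000, turns, reuse=True)
print(without_reuse, with_reuse)  # 56500 7000
```

In this toy case reuse cuts prefill work by about 8x, and the gap widens with every additional turn because the no-reuse cost grows quadratically in conversation length.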

Architecture deep-dive: docs/architecture.md · docs/codebase-map.md.
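
The paged KV pool with tail-page copy-on-write can be illustrated with a toy model. This is not ARLE's implementation: pages hold token ids in place of real K/V tensors, GPU page attach is not modeled, and `PagePool` / `Sequence` are hypothetical names. Only `page_size=16` comes from the text above.

```python
PAGE_SIZE = 16  # tokens per page, matching the page_size quoted above


class PagePool:
    """Toy page pool with reference counts for shared pages."""

    def __init__(self):
        self.pages = {}  # page id -> list of token ids
        self.refs = {}   # page id -> number of sequences using the page
        self._next_id = 0

    def alloc(self, contents=()):
        pid = self._next_id
        self._next_id += 1
        self.pages[pid] = list(contents)
        self.refs[pid] = 1
        return pid

    def retain(self, pid):
        self.refs[pid] += 1

    def release(self, pid):
        self.refs[pid] -= 1
        if self.refs[pid] == 0:
            del self.pages[pid], self.refs[pid]


class Sequence:
    """A sequence is an ordered list of page ids in a shared pool."""

    def __init__(self, pool, page_ids=()):
        self.pool = pool
        self.page_ids = list(page_ids)

    def fork(self):
        # Share every page with the child; nothing is copied yet.
        for pid in self.page_ids:
            self.pool.retain(pid)
        return Sequence(self.pool, self.page_ids)

    def append(self, token):
        pool = self.pool
        if not self.page_ids or len(pool.pages[self.page_ids[-1]]) == PAGE_SIZE:
            # Tail is full (or absent): full pages stay shared, start a new one.
            self.page_ids.append(pool.alloc())
        elif pool.refs[self.page_ids[-1]] > 1:
            # Tail is partially filled and shared: copy it before writing.
            old = self.page_ids[-1]
            self.page_ids[-1] = pool.alloc(pool.pages[old])
            pool.release(old)
        pool.pages[self.page_ids[-1]].append(token)

    def tokens(self):
        return [t for pid in self.page_ids for t in self.pool.pages[pid]]


pool = PagePool()
parent = Sequence(pool)
for tok in range(20):  # 20 tokens -> one full page plus a 4-token tail
    parent.append(tok)
child = parent.fork()
child.append(99)       # triggers copy-on-write of the tail page only
print(len(pool.pages), parent.page_ids, child.page_ids)  # 3 [0, 1] [0, 2]
```

Two sequences with a 20-token shared prefix occupy three pages instead of four: the full page is shared by reference, and only the partially filled tail is duplicated when one sequence diverges.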

Entry surfaces

arle is the single binary:

Command	What it does
`arle` (no args)	Interactive agent REPL with `python` and `shell` tools.
`arle run --prompt "…"`	One-shot agent prompt. `--no-tools` to disable tools.
`arle serve --backend …`	OpenAI-compatible HTTP server.
`arle train opd`	On-Policy Distillation — teacher in `infer`, student in `train`. CUDA path. Usage manual.
`arle --doctor [--json]`	Backend / hardware / model-resolution self-check.

Operators wanting only the serving binary can use infer directly — same HTTP contract, without agent / train surfaces.

Latest Updates

2026-05-26: OPD GKD chunked-KL Route B fits the 512-token corpus shape; the full-logit path OOMs even on a V100 32 GB. Real-corpus Qwen3.5-4B → 0.8B-Base GKD with a corpus-truth SFT anchor at prompt-max-tokens=512 was previously KILLed on consumer 16 GB hardware: chunking only the KL loss still left the full [B, S, V] teacher and student logits resident before the loss consumed them. Route B is the structural fix. It adds a SequenceWindowedForward trait and calls tape.backward(window_loss) per window, slicing the hidden states and applying lm_head one window at a time so that [B, S, V] is never materialized. It is now validated end-to-end on a Tesla V100-SXM2-32GB.

Mode	`--logits-window-size`	Peak GPU	Train step 1	Outcome
full-logit (T5b shape)	off	31 506 MiB	n/a	VRAM OOM — `cuda alloc_zeros failed (slice)`
windowed (Route B)	64	25 440 MiB	897.4 s (rollout 112 / teacher 168 / student 78 / backward 538)	loss 9.72 × 10⁻⁶, RSS 9.22 GB post-train

Both rows use the same corpus, rollout, and GKD config; only --logits-window-size varies. Windowing cuts peak GPU memory by 19 % (31 506 → 25 440 MiB) and lets the train step complete where it previously hit OOM. Two structural fixes were needed end-to-end: the per-window forward (Route B itself, which never materializes [B, S, V]) and evict_host_mirror, which drops the host RAM held by post-upload weight mirrors from 19.8 GB to 2.1 GB. On a V100 32 GB the windowed path now runs the full GKD step that the unwindowed path cannot start.
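
Why windowing is safe can be seen in a forward-only toy: a per-position loss summed over the sequence equals the sum of per-window losses, so logits only ever need to exist for one window at a time. The sketch below uses plain Python lists and forward KL(teacher || student) as a stand-in for the GKD loss; it does not model the per-window `tape.backward` call, and all function names and sizes are illustrative assumptions.

```python
import math
import random


def lm_head(hidden_rows, weight):
    """Project hidden states [n, H] to logits [n, V] with a [V, H] weight."""
    return [[sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
            for row in hidden_rows]


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_rows(teacher_logits, student_logits):
    """Sum over positions of KL(teacher || student)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_lp, s_lp = log_softmax(t_row), log_softmax(s_row)
        total += sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp))
    return total


def full_logit_kl(t_hidden, s_hidden, t_w, s_w):
    # Materializes all S rows of logits for both models at once.
    return kl_rows(lm_head(t_hidden, t_w), lm_head(s_hidden, s_w))


def windowed_kl(t_hidden, s_hidden, t_w, s_w, window):
    # Only `window` rows of logits per model exist at any time.
    total = 0.0
    for start in range(0, len(t_hidden), window):
        end = start + window
        total += kl_rows(lm_head(t_hidden[start:end], t_w),
                         lm_head(s_hidden[start:end], s_w))
    return total


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
S, V, WINDOW = 10, 7, 3   # toy sizes, not the real run's shapes
T_HIDDEN, S_HIDDEN = 5, 4  # teacher and student may differ in hidden size
t_hidden, s_hidden = rand_matrix(S, T_HIDDEN), rand_matrix(S, S_HIDDEN)
t_w, s_w = rand_matrix(V, T_HIDDEN), rand_matrix(V, S_HIDDEN)

full = full_logit_kl(t_hidden, s_hidden, t_w, s_w)
windowed = windowed_kl(t_hidden, s_hidden, t_w, s_w, WINDOW)
print(math.isclose(full, windowed, rel_tol=1e-9))  # True
```

The resident logit rows per model fall from S to the window size, for example 64 instead of 512 if the sequence were 512 positions long. Weights, activations, and optimizer state are untouched by windowing, which is consistent with the overall peak dropping by 19 % rather than by the full logit ratio.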

Evidence: wins entry · design plan (Route B) · prior 16 GB KILL

2026-05-25: V100 (sm_70 Volta) inference target unlocked; capability preserved. arle serve now runs Qwen3.5-4B/9B on Tesla V100-SXM2-32GB end-to-end. This relies on an upstream TileLang patch (PR #2279: fragment-to-fragment dtype-converting copy via shared-memory staging, plus an SM70 T.gemm GemmFMA fallback for unsupported BF16/layout combinations) and an ARLE-side per-kernel allow_sm70 cubin filter that pins T0-legacy emission to the Qwen3.5 dense and GDR chunkwise paths only. T1 (A100/L4/H100) builds and binaries are untouched by construction.

MMLU 5-shot (n=200, harness scored ~164-165 valid)	Qwen3.5-4B	Qwen3.5-9B
T1 reference (A100/L4/H100)	77.33 %	n/a
V100 sm_70 (this work)	79.9 %	83.0 % (+3.1 pp size scaling)

Benefit: a cheap, widely available Volta box becomes a usable ARLE inference target, with the same OpenAI-v1 surface, the same Qwen3.5 family, and no measurable capability cost from the Volta-specific fallback.
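
As a rough check on "no measurable capability cost", the gap between the two 4B rows can be compared with the sampling noise of a ~164-item evaluation. The sketch treats each question as an independent Bernoulli trial and assumes both rows were scored on about 164 valid items; the table only states that count for the harness in general, so the result is an estimate, not a reported statistic.

```python
import math


def binomial_se(accuracy, n):
    """Standard error of an accuracy measured on n independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def z_score(acc_a, n_a, acc_b, n_b):
    """Difference between two accuracies in units of its standard error."""
    se = math.sqrt(binomial_se(acc_a, n_a) ** 2 + binomial_se(acc_b, n_b) ** 2)
    return (acc_b - acc_a) / se


N_VALID = 164  # assumption: ~164 scored items, applied to both rows
se_v100 = binomial_se(0.799, N_VALID)
z = z_score(0.7733, N_VALID, 0.799, N_VALID)
print(f"SE ~ {se_v100:.3f}, z ~ {z:.2f}")
```

Under these assumptions a single accuracy carries a standard error of about 3 percentage points, and the 2.6-point gap between the T1 reference and the V100 run sits well under one standard error of the difference, so the two scores are statistically indistinguishable at this sample size.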

Evidence: P1 build pass · P1.4 smoke · P3.1 4B capability · P3.2 9B capability

Older entries (OPD CLI ship, OPD pipeline close + HF cross-validation, ARLE-vs-TRL 2.49–2.91×): CHANGELOG.md.

Documentation map

docs/http-api.md · HTTP contract & streaming
docs/support-matrix.md · backend / model / quant tiers
docs/architecture.md · package boundaries
docs/codebase-map.md · workspace layout & execution paths
docs/environment.md · env vars & runtime knobs
docs/troubleshooting.md · common build/runtime errors
docs/comparison.md · vs vLLM / SGLang / mistral.rs / llama.cpp
CONTRIBUTING.md · contributor setup & validation
examples/ · copy-paste smoke paths
docs/index.md · maintainer-facing PARA index

License

MIT
