```shell
# 1. Install vllm-mlx
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# 2. Start the batched inference server (patches granite-docling compatibility issues)
/path/to/vllm-mlx/bin/python start_server.py

# 3. Send pages concurrently (in another terminal)
uv run --with openai python test_concurrent.py
```

This runs GraniteDocling-258M via vllm-mlx with continuous batching. On an M5 Max it processes an 8-page PDF in 15s (0.52 pages/sec, 758 tok/s), about 2x faster than sequential docling.
Benchmarking Docling VLM-based PDF conversion on Apple Silicon (M5 Max, 40 GPU cores, 128GB unified memory).
Sequential VLM inference is memory-bandwidth-bound — the 258M model is tiny but autoregressive decoding reads all weights per token. Batching multiple pages amortizes this cost, reading weights once to generate tokens for all pages simultaneously.
| Approach | Wall clock | Pages/sec | Aggregate tok/s |
|---|---|---|---|
| Docling + GraniteDocling MLX (sequential) | 36s | 0.22 | ~345 |
| Docling + SmolDocling MLX (sequential) | 41s | 0.20 | ~345 |
| mlx-vlm direct (sequential, warm model) | 28s | 0.28 | ~430 |
| vllm-mlx batched (8 concurrent) | 15s | 0.52 | 758 |
- The GPU is not compute-bound — a 258M model barely uses the 40 GPU cores
- The bottleneck is memory bandwidth: each token reads ~516MB of weights
- M5 Max has ~546 GB/s bandwidth → theoretical max ~1,058 tok/s single-stream
- Achieved ~430 tok/s = ~40% utilization (typical with vision encoder overhead)
- Multiple processes don't help — they share the same memory bandwidth
- Batch N pages = read weights once, generate N tokens per forward pass
- With 8 concurrent pages: 758 tok/s aggregate (1.76x single-stream)
- More pages in flight = more throughput (up to hardware limits)
**run.py**: standard docling VLM pipeline (GraniteDocling MLX) with per-page timing.

**run_smoldocling.py**: same as above with the SmolDocling-256M model.
**start_server.py**: patches the Idefics3Processor to expose the chat template (required for vllm-mlx compatibility) and raises the prefill token limit. Starts a continuous-batching server on port 8000.

```shell
# Start the server (requires the vllm-mlx tool installed)
/Users/olivier/.local/share/uv/tools/vllm-mlx/bin/python start_server.py
```

**test_concurrent.py**: sends all 8 pages simultaneously to the vllm-mlx server and measures aggregate throughput.
```shell
uv run --with openai python test_concurrent.py
```

An exploratory script tests interleaved KV-cache generation; it is slower than vllm-mlx because it performs no true batched forward pass.

Another script spawns multiple MLX processes, which doesn't help: the processes share the same GPU memory bandwidth.
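The fan-out pattern in test_concurrent.py can be sketched with stdlib asyncio. Here `convert_page` is a hypothetical stand-in for the real OpenAI-client request to the vllm-mlx server, simulated with a sleep so the sketch runs offline:

```python
import asyncio
import time

async def convert_page(page_no: int) -> str:
    """Stand-in for one OpenAI-compatible chat.completions request carrying a
    page image; the continuous-batching server interleaves whatever is in flight."""
    await asyncio.sleep(0.1)          # simulated per-page decode latency
    return f"<doctag>page {page_no}</doctag>"

async def convert_pdf(n_pages: int) -> list[str]:
    # Send every page at once instead of awaiting them one by one.
    return await asyncio.gather(*(convert_page(i) for i in range(n_pages)))

start = time.perf_counter()
pages = asyncio.run(convert_pdf(8))
elapsed = time.perf_counter() - start
# Wall clock is close to one page's latency, not 8x, because requests overlap.
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

With a real server, the same structure applies with `openai.AsyncOpenAI(base_url=...)` calls in place of the sleep.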
Two issues prevent vllm-mlx from serving ibm-granite/granite-docling-258M-mlx out of the box. Both are patched in start_server.py.
vllm-mlx calls `processor.apply_chat_template()` for multimodal models, but the `Idefics3Processor` doesn't expose the chat template even though the underlying tokenizer has one. This causes:

```
ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
```

Fix: Monkey-patch `mlx_vlm.load` to copy `tokenizer.chat_template` onto the processor after loading:

```python
processor.chat_template = processor.tokenizer.chat_template
```

This is a known issue in HuggingFace transformers (#40913), where processor `chat_template` kwargs get overridden by model defaults during `from_pretrained`.
The MLLM scheduler defaults to `prefill_step_size=1024`, but GraniteDocling prompts are ~1142 tokens (image patches + text), causing:

```
Total prompt tokens (1142) exceeds safe limit (1024)
```

Fix: Monkey-patch `MLLMSchedulerConfig.__init__` to set `prefill_step_size=4096`.
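The `__init__` patch can be sketched like this; `SchedulerConfig` is a stand-in for vllm-mlx's `MLLMSchedulerConfig` (with an assumed signature) so the sketch runs without vllm-mlx installed:

```python
# Stand-in for MLLMSchedulerConfig; the real class has more parameters.
class SchedulerConfig:
    def __init__(self, prefill_step_size=1024):
        self.prefill_step_size = prefill_step_size

_original_init = SchedulerConfig.__init__

def _patched_init(self, **kwargs):
    # Force a prefill window large enough for ~1142-token GraniteDocling prompts.
    kwargs["prefill_step_size"] = 4096
    _original_init(self, **kwargs)

SchedulerConfig.__init__ = _patched_init

# Every config constructed afterwards gets the raised limit, even if the
# caller passed the old default explicitly.
cfg = SchedulerConfig()
print(cfg.prefill_step_size)  # 4096
```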
```shell
# Install vllm-mlx
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Run docling benchmarks
uv run run.py
uv run run_smoldocling.py

# Run batched benchmark
/Users/olivier/.local/share/uv/tools/vllm-mlx/bin/python start_server.py  # terminal 1
uv run --with openai python test_concurrent.py                            # terminal 2
```

- ibm-granite/granite-docling-258M-mlx — Idefics3 architecture, 258M params, DocTags output format
- docling-project/SmolDocling-256M-preview-mlx-bf16 — SmolVLM architecture, 256M params, DocTags output format
- Apple M5 Max — 40 GPU cores, 18 CPU cores, 128GB unified memory
- macOS Darwin 25.3.0