Development Roadmap (2026 H1) #651

@chenghuaWang

Description

Here is the development roadmap for H1 2026. We will pin this roadmap in Issues, and most of our subsequent work will be tracked against it there. In MLLM's documentation, we will archive each version of the roadmap and provide some outlook. Contributions and feedback are welcome.

Focus

  • pymllm for embodied robots/agents on Jetson Orin/Thor.
  • Work on mllm's Arm and NPU backends will continue, supporting more models.
  • NPU AOT shape bucketing optimization.

Model coverage

NPU Backend

  • Shape Bucketing: The current NPU AOT compilation approach generates two computation graphs under the same sequence length — one for chunk size = X (where X can be 32, 64, or 128) and another for chunk size = 1. This results in computational waste, particularly in the Attention layer. Shape Bucketing addresses this by generating dedicated computation graph pairs (chunk=X and chunk=1) for each distinct sequence length (e.g., 32, 64, 96, …), and automatically selecting the optimal graph at runtime based on the actual input shape.
  • Sliding Window Attention (SWA) Optimization: The current NPU AOT compilation approach treats Sliding Window Attention as Full Attention, which introduces unnecessary computational overhead. This optimization decomposes SWA into two separate matrix attention operations and employs a circular cache queue to efficiently manage the sliding window, significantly reducing both memory footprint and redundant computation.
  • Graph Split: The current NPU AOT implementation does not perform graph partitioning for larger models (e.g., 7B parameters). It is necessary to implement graph-splitting logic within the MLLM IR passes, decomposing oversized LLM computation graphs into smaller, manageable subgraphs for efficient compilation and execution.
  • Qwen3 VL & Qwen2 VL's ViT part (with a fixed image size, 480p).
  • Benchmark prefill and decode TPS across chunk sizes of 32, 64, and 128, and identify the optimal chunk size for LPBQ with group sizes of 16 and 32.
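The runtime half of Shape Bucketing boils down to picking, for a given input length, the smallest pre-compiled graph that fits. A minimal sketch of that selection logic (the `select_bucket` helper and the bucket list are illustrative assumptions, not mllm's actual API):

```python
# Hypothetical sketch of runtime bucket selection for NPU AOT shape bucketing.
# BUCKET_SIZES mirrors the sequence lengths named in the roadmap (32, 64, 96, ...).
from bisect import bisect_left

BUCKET_SIZES = [32, 64, 96, 128]  # one pre-compiled (chunk=X, chunk=1) graph pair each

def select_bucket(seq_len: int) -> int:
    """Pick the smallest compiled bucket that can hold `seq_len` tokens."""
    i = bisect_left(BUCKET_SIZES, seq_len)
    if i == len(BUCKET_SIZES):
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {BUCKET_SIZES[-1]}")
    return BUCKET_SIZES[i]

# Padding waste vs. a single fixed graph that always pads to the largest bucket:
for n in (20, 64, 100):
    b = select_bucket(n)
    print(f"tokens={n} bucket={b} wasted_slots={b - n}")
```

With a single fixed-shape graph, a 20-token prompt would pad to the full chunk size; with bucketing it pads only to 32, which is where the computational saving in the Attention layer comes from.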
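The circular cache queue mentioned in the SWA item can be sketched as a ring-buffer KV cache: once the window is full, new entries overwrite the oldest slot, so memory stays O(window) rather than O(seq_len). `RingKVCache` below is an illustrative name, not mllm's implementation:

```python
# Hypothetical ring-buffer KV cache for Sliding Window Attention (window size W).
class RingKVCache:
    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # stand-ins for per-position K/V tensors
        self.pos = 0                  # total tokens seen so far

    def append(self, kv):
        # Overwrite the oldest slot once the window is full.
        self.slots[self.pos % self.window] = kv
        self.pos += 1

    def visible(self):
        """Entries inside the current window, oldest first."""
        n = min(self.pos, self.window)
        start = self.pos - n
        return [self.slots[i % self.window] for i in range(start, self.pos)]

cache = RingKVCache(window=4)
for t in range(6):
    cache.append(f"k{t}")
print(cache.visible())  # ['k2', 'k3', 'k4', 'k5']
```

Attention for a new token then only needs to run over `visible()`, which is what lets SWA be decomposed into bounded-size matrix operations instead of being treated as Full Attention.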

Kernels

Pymllm

  • Radix Cache correctness check.
  • Optimize the CPU busy loop.
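For the Radix Cache correctness check, the core invariant is that looking up a request must return exactly the cached KV entries of its longest cached prefix, so that reused KV matches a fresh recompute. A simplified per-token trie sketch of that lookup (real radix trees compress edges; class names here are illustrative):

```python
# Simplified radix-cache sketch: per-token trie mapping token prefixes to cached KV.
class RadixNode:
    def __init__(self):
        self.children = {}  # token -> RadixNode
        self.kv = None      # stand-in for the cached KV entry at this position

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kvs):
        node = self.root
        for tok, kv in zip(tokens, kvs):
            node = node.children.setdefault(tok, RadixNode())
            node.kv = kv

    def match_prefix(self, tokens):
        """Cached KV entries for the longest cached prefix of `tokens`."""
        node, kvs = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            kvs.append(node.kv)
        return kvs

cache = RadixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
print(cache.match_prefix([1, 2, 3, 9]))  # ['kv1', 'kv2', 'kv3']
```

A correctness check would assert this invariant under concurrent inserts and evictions, comparing prefix-matched KV against a from-scratch prefill of the same tokens.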

Server

  • ❗ High Priority ❗ mllm-server: Provides an OpenAI-compatible API with the mllm library as the inference backend. Currently, mllm lacks a production-ready server CLI tool. The existing mllm-cli is built with C++ and Go bindings, but has several limitations — notably, it is not fully compatible with the OpenAI API. We aim to build a stable, long-term maintained mllm-server that works reliably across scenarios such as Claw testing and beyond.
  • ❗ High Priority ❗ Jinja Template: Currently, mllm uses hand-written rules for Jinja template rendering. We should adopt the approach from jinja.cpp and maintain our own lightweight Jinja2 template engine. This will provide a more robust and maintainable solution compared to the current ad-hoc rule-based implementation.
  • ❗ High Priority ❗ Use mllm as the API Server for OpenClaw: Integrate mllm as the inference backend for OpenClaw's API server. This depends on the two High Priority items above and will likely require extensive testing and iteration to ensure stable performance.
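To make the Jinja item concrete, chat templates mostly need a small Jinja2 subset: `{{ var }}` substitution and a `{% for message in messages %}` loop. The toy renderer below illustrates that subset; it is not jinja.cpp or mllm's implementation, and the function names are hypothetical:

```python
# Toy sketch of the Jinja2 subset a lightweight chat-template engine would cover.
import re

VAR = re.compile(r"\{\{\s*(\w+(?:\.\w+)?)\s*\}\}")

def render_vars(text: str, ctx: dict) -> str:
    """Substitute `{{ name }}` and `{{ name.attr }}` from `ctx`."""
    def lookup(m):
        obj = ctx
        for part in m.group(1).split("."):
            obj = obj[part]
        return str(obj)
    return VAR.sub(lookup, text)

def render_chat(template: str, messages: list) -> str:
    """Expand one `{% for message in messages %}...{% endfor %}` block."""
    m = re.search(r"\{%\s*for message in messages\s*%\}(.*?)\{%\s*endfor\s*%\}",
                  template, re.S)
    body = m.group(1)
    out = "".join(render_vars(body, {"message": msg}) for msg in messages)
    return template[:m.start()] + out + template[m.end():]

tmpl = ("{% for message in messages %}"
        "<|{{ message.role }}|>{{ message.content }}\n"
        "{% endfor %}")
print(render_chat(tmpl, [{"role": "user", "content": "hi"},
                         {"role": "assistant", "content": "hello"}]))
# <|user|>hi
# <|assistant|>hello
```

Real chat templates also use `if`/`else`, filters, and nested loops, which is why a maintained engine along the lines of jinja.cpp beats accumulating hand-written per-model rules.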

Document

Agentic

  • Write more skills for Claude Code and other agent harnesses
    • How to debug mllm's C++ code
    • How to install mllm
    • How to convert and quantize models for mllm (GGUF, Q4, etc.)
    • How to integrate mllm into an existing C++ project as a library
    • etc.
  • Create/Update CLAUDE.md for mllm project
