Development Roadmap (2026 H1) #651

@chenghuaWang

Description

Here is the development roadmap for H1 2026. We will pin this roadmap in Issues, and most of our subsequent work will be tracked against it there. In MLLM's documentation, we will archive each version of the roadmap and provide some outlook. Contributions and feedback are welcome.

Focus

  • pymllm for embodied robots/agents on Jetson Orin/Thor.
  • Work on mllm's Arm and NPU backends will continue, supporting more models.
  • NPU AOT shape bucketing optimization.

Model coverage

NPU Backend

  • Shape Bucketing: The current NPU AOT compilation approach generates two computation graphs under the same sequence length — one for chunk size = X (where X can be 32, 64, or 128) and another for chunk size = 1. This results in computational waste, particularly in the Attention layer. Shape Bucketing addresses this by generating dedicated computation graph pairs (chunk=X and chunk=1) for each distinct sequence length (e.g., 32, 64, 96, …), and automatically selecting the optimal graph at runtime based on the actual input shape.
  • Sliding Window Attention (SWA) Optimization: The current NPU AOT compilation approach treats Sliding Window Attention as Full Attention, which introduces unnecessary computational overhead. This optimization decomposes SWA into two separate matrix attention operations and employs a circular cache queue to efficiently manage the sliding window, significantly reducing both memory footprint and redundant computation.
  • Graph Split: The current NPU AOT implementation does not perform graph partitioning for larger models (e.g., 7B parameters). It is necessary to implement graph-splitting logic within the MLLM IR passes, decomposing oversized LLM computation graphs into smaller, manageable subgraphs for efficient compilation and execution.
  • Qwen3 VL & Qwen2 VL's ViT part (with a fixed image size, 480p).
  • Benchmark prefill and decode TPS across chunk sizes of 32, 64, and 128, and identify the optimal chunk size for LPBQ with group sizes of 16 and 32.
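The runtime half of Shape Bucketing boils down to picking, for a given input length, the smallest pre-compiled graph that fits. A minimal sketch of that selection logic (the `select_bucket` helper and the bucket list are illustrative assumptions, not mllm's actual API):

```python
# Hypothetical sketch of runtime bucket selection for NPU AOT shape bucketing.
# BUCKET_SIZES mirrors the sequence lengths named in the roadmap (32, 64, 96, ...).
from bisect import bisect_left

BUCKET_SIZES = [32, 64, 96, 128]  # one pre-compiled (chunk=X, chunk=1) graph pair each

def select_bucket(seq_len: int) -> int:
    """Pick the smallest compiled bucket that can hold `seq_len` tokens."""
    i = bisect_left(BUCKET_SIZES, seq_len)
    if i == len(BUCKET_SIZES):
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {BUCKET_SIZES[-1]}")
    return BUCKET_SIZES[i]

# Padding waste vs. a single fixed graph that always pads to the largest bucket:
for n in (20, 64, 100):
    b = select_bucket(n)
    print(f"tokens={n} bucket={b} wasted_slots={b - n}")
```

With a single fixed-shape graph, a 20-token prompt would pad to the full chunk size; with bucketing it pads only to 32, which is where the computational saving in the Attention layer comes from.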
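The circular cache queue mentioned in the SWA item can be sketched as a ring-buffer KV cache: once the window is full, new entries overwrite the oldest slot, so memory stays O(window) rather than O(seq_len). `RingKVCache` below is an illustrative name, not mllm's implementation:

```python
# Hypothetical ring-buffer KV cache for Sliding Window Attention (window size W).
class RingKVCache:
    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # stand-ins for per-position K/V tensors
        self.pos = 0                  # total tokens seen so far

    def append(self, kv):
        # Overwrite the oldest slot once the window is full.
        self.slots[self.pos % self.window] = kv
        self.pos += 1

    def visible(self):
        """Entries inside the current window, oldest first."""
        n = min(self.pos, self.window)
        start = self.pos - n
        return [self.slots[i % self.window] for i in range(start, self.pos)]

cache = RingKVCache(window=4)
for t in range(6):
    cache.append(f"k{t}")
print(cache.visible())  # ['k2', 'k3', 'k4', 'k5']
```

Attention for a new token then only needs to run over `visible()`, which is what lets SWA be decomposed into bounded-size matrix operations instead of being treated as Full Attention.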

Kernels

Pymllm

  • Radix Cache correctness check.
  • Optimize the CPU busy loop.
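For the Radix Cache correctness check, the core invariant is that looking up a request must return exactly the cached KV entries of its longest cached prefix, so that reused KV matches a fresh recompute. A simplified per-token trie sketch of that lookup (real radix trees compress edges; class names here are illustrative):

```python
# Simplified radix-cache sketch: per-token trie mapping token prefixes to cached KV.
class RadixNode:
    def __init__(self):
        self.children = {}  # token -> RadixNode
        self.kv = None      # stand-in for the cached KV entry at this position

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kvs):
        node = self.root
        for tok, kv in zip(tokens, kvs):
            node = node.children.setdefault(tok, RadixNode())
            node.kv = kv

    def match_prefix(self, tokens):
        """Cached KV entries for the longest cached prefix of `tokens`."""
        node, kvs = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            kvs.append(node.kv)
        return kvs

cache = RadixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
print(cache.match_prefix([1, 2, 3, 9]))  # ['kv1', 'kv2', 'kv3']
```

A correctness check would assert this invariant under concurrent inserts and evictions, comparing prefix-matched KV against a from-scratch prefill of the same tokens.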

Server

  • ❗ High Priority ❗ mllm-server: Provides an OpenAI-compatible API with the mllm library as the inference backend. Currently, mllm lacks a production-ready server CLI tool. The existing mllm-cli is built with C++ and Go bindings, but has several limitations — notably, it is not fully compatible with the OpenAI API. We aim to build a stable, long-term maintained mllm-server that works reliably across scenarios such as Claw testing and beyond.
  • ❗ High Priority ❗ Jinja Template: Currently, mllm uses hand-written rules for Jinja template rendering. We should adopt the approach from jinja.cpp and maintain our own lightweight Jinja2 template engine. This will provide a more robust and maintainable solution compared to the current ad-hoc rule-based implementation.
  • ❗ High Priority ❗ Use mllm as the API Server for OpenClaw: Integrate mllm as the inference backend for OpenClaw's API server. This depends on the two High Priority items above and will likely require extensive testing and iteration to ensure stable performance.
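To make the Jinja item concrete, chat templates mostly need a small Jinja2 subset: `{{ var }}` substitution and a `{% for message in messages %}` loop. The toy renderer below illustrates that subset; it is not jinja.cpp or mllm's implementation, and the function names are hypothetical:

```python
# Toy sketch of the Jinja2 subset a lightweight chat-template engine would cover.
import re

VAR = re.compile(r"\{\{\s*(\w+(?:\.\w+)?)\s*\}\}")

def render_vars(text: str, ctx: dict) -> str:
    """Substitute `{{ name }}` and `{{ name.attr }}` from `ctx`."""
    def lookup(m):
        obj = ctx
        for part in m.group(1).split("."):
            obj = obj[part]
        return str(obj)
    return VAR.sub(lookup, text)

def render_chat(template: str, messages: list) -> str:
    """Expand one `{% for message in messages %}...{% endfor %}` block."""
    m = re.search(r"\{%\s*for message in messages\s*%\}(.*?)\{%\s*endfor\s*%\}",
                  template, re.S)
    body = m.group(1)
    out = "".join(render_vars(body, {"message": msg}) for msg in messages)
    return template[:m.start()] + out + template[m.end():]

tmpl = ("{% for message in messages %}"
        "<|{{ message.role }}|>{{ message.content }}\n"
        "{% endfor %}")
print(render_chat(tmpl, [{"role": "user", "content": "hi"},
                         {"role": "assistant", "content": "hello"}]))
# <|user|>hi
# <|assistant|>hello
```

Real chat templates also use `if`/`else`, filters, and nested loops, which is why a maintained engine along the lines of jinja.cpp beats accumulating hand-written per-model rules.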

Document

Agentic

  • Write more skills for Claude Code and other agent harnesses
    • How to debug mllm's C++ code
    • How to install mllm
    • How to convert and quantize models for mllm (GGUF, Q4, etc.)
    • How to integrate mllm into an existing C++ project as a library
    • etc.
  • Create/Update CLAUDE.md for mllm project
