
Add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm #663

Open
jialilve wants to merge 8 commits into UbiquitousLearning:main from jialilve:jetson-awq-qwen3vl

Conversation

@jialilve commented Apr 5, 2026 (Contributor)

Summary

  • add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm
  • add the missing gptq_marlin_repack Python JIT wrapper in mllm-kernel
  • add kernel and runtime tests for the new quantized path
  • add bilingual pymllm README documents for the validated Jetson Orin workflow

Why

Qwen3-VL-2B-Instruct-AWQ-4bit uses compressed-tensors, but pymllm + mllm-kernel previously
did not provide a complete Marlin-backed path for loading and serving this model.

This PR adds the missing kernel wrapper, quantization runtime integration, config parsing fix, and
usage documentation needed for the validated Jetson Orin workflow.
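
For illustration, the nesting that the config-parsing fix handles looks like this (a minimal sketch, not the actual pymllm helper; the function name `load_quant_config` is hypothetical):

```python
import json
from pathlib import Path

def load_quant_config(model_path: Path) -> dict:
    """Sketch: config.json keeps model metadata at the top level and
    nests quantization details under "quantization_config"."""
    with open(Path(model_path) / "config.json") as fp:
        cfg = json.load(fp)
    return cfg.get("quantization_config", cfg)
```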

Key Changes

  • register a new compressed-tensors quantization method in pymllm
  • implement CompressedTensorsConfig, CompressedTensorsLinearMethod, and the Marlin-backed
    weight scheme
  • add gptq_marlin_repack to the Python JIT wrapper layer in mllm-kernel
  • add tests for Marlin kernel wrappers and compressed-tensors runtime/config behavior
  • fix quantization config loading from config.json -> quantization_config
  • add pymllm/README.md and pymllm/README-ZH.md for the current Jetson Orin usage flow

Testing

pytest -q \
  mllm-kernel/tests/test_gptq_marlin.py \
  mllm-kernel/tests/test_gptq_marlin_repack.py \
  pymllm/tests/test_compressed_tensors_config.py \
  pymllm/tests/test_compressed_tensors_runtime.py

Result:

- 27 passed

Validated Configuration

Validated with:

- Qwen3-VL-2B-Instruct-AWQ-4bit
- compressed-tensors
- safetensors
- float16

Notes

- the current implementation targets the compressed-tensors signature used by the validated
  AWQ checkpoint
- the Jetson Orin setup is documented in pymllm/README.md and pymllm/README-ZH.md

Summary by CodeRabbit

* **New Features**
  * Compressed‑tensors 4‑bit GPTQ support with Marlin kernels and new repack kernel; public Python entrypoint exported
  * Timing metrics (vit_prefill_ms, llm_prefill_ms, llm_decode_ms) propagated and included in API responses
  * Multimodal message content preserved as structured parts for tokenization

* **Bug Fixes**
  * RMSNorm: added robust PyTorch fallback when FlashInfer fails
  * Quant config loading: now unwraps nested quantization_config from config.json

* **Documentation**
  * Added Jetson Orin–targeted docs (README and README‑ZH)

* **Tests**
  * New CUDA test suites for Marlin GEMM and repack, plus compressed‑tensors runtime/config tests

@coderabbitai bot commented Apr 5, 2026 (Contributor)

📝 Walkthrough

Adds compressed-tensors GPTQ Marlin quantization (CUDA kernels + Python integration), a new gptq_marlin_repack kernel and tests, timing instrumentation propagated through inference pipeline, multimodal/tokenizer/server adjustments, CMake CPM vendoring change, RMSNorm PyTorch fallback, and Jetson-oriented documentation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CMake Bootstrap**<br>`mllm-kernel/cmake/CPM.cmake` | Prefer vendored CPM.cmake relative to script; fall back to downloading only; removed copy/rename fallback; fatal-check that CPMAddPackage exists. |
| **Namespace & Kernel Includes**<br>`mllm-kernel/include/mllm_kernel/scalar_type.hpp`, `mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh` | Wrap host in mllm_kernel::host and add alias; include updated scalar header in marlin C++ header and use host symbols. |
| **CUDA JIT: Marlin GEMM**<br>`mllm-kernel/mllm_kernel/cuda/jit/__init__.py`, `.../gptq_marlin.py` | Export JIT wrapper update: compute cpp_args = make_cpp_args(dtype), pass args=[dtype] to @jit, and instantiate gptq_marlin_gemm with cpp_args. |
| **CUDA JIT: Repack Entrypoint**<br>`mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py` | New module: permutation validation, cached kernel compilation for gptq_marlin_repack, public gptq_marlin_repack(...) API that allocates output and invokes kernel. |
| **Quantization: Compressed-Tensors**<br>`pymllm/quantization/methods/compressed_tensors.py`, `pymllm/quantization/methods/__init__.py` | New CompressedTensorsConfig (registered "compressed-tensors") and CompressedTensorsLinearMethod with Marlin repack/scale permutation, workspace/index management, weight registration, and runtime GEMM invocation; re-exported in package init. |
| **Tests — Kernel & Quantization**<br>`mllm-kernel/tests/test_gptq_marlin.py`, `mllm-kernel/tests/test_gptq_marlin_repack.py`, `pymllm/tests/test_compressed_tensors_config.py`, `pymllm/tests/test_compressed_tensors_runtime.py` | Added CUDA-only and CPU tests covering gptq_marlin correctness, repack validation (shape/dtype/perm/errors), config parsing, and runtime integration/interaction checks with monkeypatches. |
| **Timing Instrumentation**<br>`pymllm/models/qwen3_vl.py`, `pymllm/executor/model_runner.py`, `pymllm/orchestrator/model_runner_process.py`, `pymllm/orchestrator/scheduler_process.py`, `pymllm/orchestrator/detokenizer_process.py` | Record vision/LLM prefill and decode timings on ForwardBatch, propagate timing fields through scheduler → detokenizer → responses; capture outer forward time and fill missing decode timings; added cache guard for implementations lacking page_size. |
| **Multimodal & Tokenization**<br>`pymllm/orchestrator/tokenizer_process.py`, `pymllm/server/launch.py` | Preserve structured multimodal message parts (text/image) for the tokenizer template; server responses (streaming & non-streaming) include a timing payload with prefill/decode ms, tokens, and throughput computations. |
| **RMSNorm Fallback**<br>`pymllm/layers/rms_norm.py` | Added _torch_rmsnorm fallback and wrapped FlashInfer paths in try/except to fall back to pure-PyTorch behavior on error. |
| **Docs & Misc**<br>`pymllm/README.md`, `pymllm/README-ZH.md` | Added Jetson Orin–focused documentation (validated env, install, server commands, examples) in English and Chinese. |
| **Small API/Helpers**<br>`pymllm/executor/model_runner.py`, `mllm-kernel/include/.../scalar_type.hpp`, `pymllm/quantization/methods/__init__.py` | ModelRunner._load_quant_config_dict unwraps quantization_config from config.json when present; namespace alias added for backward compatibility; compressed-tensors re-exports added. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant Server
  participant Scheduler
  participant ModelRunner
  participant CUDAKernel as GPU(gptq_marlin_repack/gemm)
  participant Detokenizer

  Client->>Server: request (chat/completion)
  Server->>Scheduler: enqueue/process batch
  Scheduler->>ModelRunner: forward_batch with multimodal inputs
  ModelRunner->>CUDAKernel: gptq_marlin_repack (weights) -> gptq_marlin_gemm (compute)
  CUDAKernel-->>ModelRunner: tensor outputs
  ModelRunner-->>Scheduler: batch results (+timings)
  Scheduler-->>Detokenizer: per-rid results (include timing fields)
  Detokenizer-->>Client: final response (includes timing payload)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰
I hop through kernels, bits in tow,
Repacking weights so matrices grow,
Timings tick from vision to text,
Marlin shuffles scales and specs,
Jetson hums — my code says go!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.38%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding compressed-tensors Marlin support for Qwen3-VL AWQ quantization in pymllm, which aligns with the substantial additions of quantization runtime integration and kernel wrappers throughout the changeset. |
| Description check | ✅ Passed | The PR description comprehensively covers the changes with clear sections (Summary, Why, Key Changes, Testing, Validated Configuration, Notes), specific technical details, and test results, exceeding the minimal template requirements. |



@coderabbitai bot left a comment (Contributor)

Actionable comments posted: 9

🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)

490-495: Make config.json a true fallback.

Handling it inside the main candidate loop makes precedence depend on the ordering of the `unique` candidate list. Since every HF checkpoint has a config.json, encountering it early makes the rest of the probe unreachable, so the more specific quantization metadata can never win.

♻️ Suggested simplification
         for fname in unique:
+            if fname == "config.json":
+                continue
             fpath = model_path / fname
             if fpath.exists():
                 with open(fpath) as fp:
                     cfg = json.load(fp)
-                # config.json stores model metadata at the top level and
-                # nests quantization details under quantization_config.
-                if fname == "config.json" and "quantization_config" in cfg:
-                    return cfg["quantization_config"]
                 return cfg

Keep the existing explicit config.json -> quantization_config block below as the only fallback path.

pymllm/quantization/methods/compressed_tensors.py (1)

89-97: Use buffers for Marlin runtime tensors instead of mixing plain attributes with Parameters.

Line 230 stores workspace as a plain tensor attribute, which won't be moved by module.to()/cuda(). Lines 231-233 create empty Parameters for runtime metadata, causing them to appear in parameters()/state_dict() despite not being checkpoint-backed weights. All four tensors should be registered as non-persistent buffers, which ensures they move with the module while remaining excluded from the parameter list and checkpoint.
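
A minimal sketch of the suggested registration (tensor names are illustrative; `persistent=False` keeps the buffers out of `state_dict()` while `module.to()`/`cuda()` still moves them):

```python
import torch
from torch import nn

class MarlinRuntimeState(nn.Module):
    def __init__(self, workspace: torch.Tensor, g_idx: torch.Tensor) -> None:
        super().__init__()
        # Non-persistent buffers follow module.to()/cuda() like parameters,
        # but stay out of parameters() and the checkpoint.
        self.register_buffer("workspace", workspace, persistent=False)
        self.register_buffer("g_idx", g_idx, persistent=False)
```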

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py`:
- Around line 56-74: The function gptq_marlin_repack currently computes output
shape using floor division and can silently truncate/leave uninitialized data;
before allocating out or calling kernel, validate inputs: ensure num_bits
divides 32 evenly (so pack_factor = 32 // num_bits is integer and >0), ensure
size_k is a multiple of tile_size (16), ensure (size_n * tile_size) is divisible
by pack_factor (so out shape is exact), ensure b_q_weight.shape matches (size_k,
size_n) and its dtype/device are used for out, and ensure perm (if provided) has
length size_k; raise a clear ValueError on any violation and only then call
_normalize_perm, _make_gptq_marlin_repack_kernel(), allocate out and invoke
kernel.

In `@pymllm/layers/rms_norm.py`:
- Around line 52-54: The except branch in the RMSNorm fallback currently
computes x = x + residual but returns the old residual, causing downstream
blocks to receive stale residual state; update the except block in
pymllm/layers/rms_norm.py so that after computing x = x + residual you set a
new_residual (or reuse x) and return (_torch_rmsnorm(new_residual, self.weight,
self.eps), new_residual) instead of returning the old residual so subsequent
calls (e.g., in pymllm/models/qwen3_5.py flows) get the updated accumulated
residual.
- Around line 18-21: The code currently casts x_norm back to x.dtype before
multiplying by weight, which loses precision and can produce wrong dtype when
weight.dtype != x.dtype; change the order so you compute x_fp32, var, and x_norm
in FP32, cast weight to float32 (weight_fp32 = weight.to(torch.float32) or
weight.float()), multiply x_norm * weight_fp32 while still in FP32, then cast
the final product once to x.dtype before returning; locate symbols x_fp32, var,
x_norm, weight and eps in the RMSNorm implementation and apply this single-cast
fix consistent with rms_norm_gated.py.

In `@pymllm/models/qwen3_vl.py`:
- Around line 980-994: The helper _cuda_timed_run currently calls
torch.cuda.synchronize() on every invocation which forces a GPU sync in hot
paths (vision prefill and LLM forward); change it so synchronization is opt-in
or deferred: add a parameter like sync=False to _cuda_timed_run (or
alternatively return the start/end torch.cuda.Event objects and the raw output)
and only call torch.cuda.synchronize() and compute elapsed time when sync=True
(or when the caller performs synchronization at an existing barrier), ensuring
the default path avoids unconditional torch.cuda.synchronize() and preserves
existing return shapes by documenting the new return form if you choose the
event-return approach.
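
For reference, a minimal sketch of the event-based variant this prompt describes (the helper name and `sync` flag follow the prompt; the body is illustrative, not the PR's actual code):

```python
import torch

def _cuda_timed_run(fn, *args, sync: bool = False):
    # Record CUDA events around the call; synchronize only when asked,
    # so hot paths can defer the sync to an existing barrier.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn(*args)
    end.record()
    if sync:
        torch.cuda.synchronize()
        return out, start.elapsed_time(end)  # elapsed milliseconds
    return out, (start, end)  # caller syncs later, then reads elapsed_time
```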

In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 792-793: The code currently overwrites req.llm_decode_ms with
out["llm_decode_ms"] inside the decode-step loop; change this to accumulate
per-step timings by adding the step value to the existing request-level value
(e.g., req.llm_decode_ms = (req.llm_decode_ms or 0) + out["llm_decode_ms"]) so
each decode batch increments req.llm_decode_ms instead of replacing it; update
the setter site where "llm_decode_ms" is read from out (the block referencing
req.llm_decode_ms and out) to handle missing/None initial values safely.

In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 374-387: The multimodal branch replaces text tokenizer output with
processor IDs (mm_inputs -> image_inputs -> proc_input_ids) but skips the
earlier context-length guard; validate proc_input_ids against
self._context_length and either truncate or raise consistently with the text
path before assigning input_ids. Locate the block handling
mm_inputs/image_inputs/proc_input_ids in tokenizer_process.py and after
converting proc_input_ids to a Python list apply the same length-check logic
used earlier (use self._context_length to cap the list or raise a clear
exception), ensuring behavior matches the existing text-tokenizer
truncation/rejection policy.

In `@pymllm/quantization/methods/compressed_tensors.py`:
- Around line 61-70: Change verify_marlin_supported to accept a device argument
and use that device when checking CUDA capability (instead of calling
torch.cuda.get_device_capability() with no device), and update any callers
accordingly; specifically, update verify_marlin_supported(group_size: int,
device: torch.device) to first check MARLIN_SUPPORTED_GROUP_SIZES then if
device.type == "cuda" call torch.cuda.get_device_capability(device) and enforce
SM80+, and add an extra call to the new signature at the start of
process_weights_after_loading() using layer.weight_packed.device so validation
runs for the runtime device.

In `@pymllm/README-ZH.md`:
- Around line 5-16: The "## 环境配置 ToDo" section lacks validated Jetson
environment details; replace the checklist with a concrete validated matrix or a
link to a follow-up issue and populate the table with exact JetPack/L4T,
CUDA/cuDNN/TensorRT, Python/pip/venv or conda,
PyTorch/torchvision/transformers/safetensors, flashinfer version/installation,
apt dependencies, memory/GPU suggestions and required ENV vars; specifically
update the "## 环境配置 ToDo" header and the checklist items (e.g., JetPack / L4T,
CUDA / cuDNN / TensorRT, Python / pip / venv 或 conda, PyTorch / torchvision /
transformers / safetensors, flashinfer 版本与安装方式) with validated values or convert
them into a compact validated-version table and add a note linking to a
follow-up issue if full validation will occur later.

In `@pymllm/server/launch.py`:
- Around line 473-487: The code builds multimodal content lists in the messages
loop (mm_parts -> content) but may leave content as a list which later gets
stringified by the ChatML fallback and silently drops images; update the path
around tokenizer.apply_chat_template and the ChatML fallback to fail fast: after
constructing mm_parts (and before falling back to ChatML) check that
tokenizer.apply_chat_template exists and can accept list-form content, attempt
to call it in a try/except, and if it is missing or raises, immediately raise an
explicit exception (e.g., ValueError/RuntimeError) indicating structured chat
content could not be templated for the request instead of allowing silent
stringification; reference the variables/functions messages, mm_parts/content,
tokenizer.apply_chat_template, and the ChatML fallback so the change is made
where list content is prepared and before any stringification occurs.

---

Nitpick comments:
In `@pymllm/executor/model_runner.py`:
- Around line 490-495: The config.json handling inside the main candidate loop
(the block that checks fname == "config.json" and returns
cfg["quantization_config"]) makes config.json a premature match and prevents
more specific probes from being considered; remove that in-loop special-case and
instead handle "config.json" -> "quantization_config" only as the final fallback
path after the candidate/probe loop completes (keep the existing fallback block
below). Update references to cfg/fname accordingly so the loop always continues
for other candidates and only uses config.json's quantization_config when no
other probe matched.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5cd8b503-101e-4acb-822f-2d4c14c9ed71

📥 Commits

Reviewing files that changed from the base of the PR and between 690b859 and 058dee2.

📒 Files selected for processing (22)
  • mllm-kernel/cmake/CPM.cmake
  • mllm-kernel/include/mllm_kernel/scalar_type.hpp
  • mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh
  • mllm-kernel/mllm_kernel/cuda/jit/__init__.py
  • mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin.py
  • mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py
  • mllm-kernel/tests/test_gptq_marlin.py
  • mllm-kernel/tests/test_gptq_marlin_repack.py
  • pymllm/README-ZH.md
  • pymllm/README.md
  • pymllm/executor/model_runner.py
  • pymllm/layers/rms_norm.py
  • pymllm/models/qwen3_vl.py
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/orchestrator/scheduler_process.py
  • pymllm/orchestrator/tokenizer_process.py
  • pymllm/quantization/methods/__init__.py
  • pymllm/quantization/methods/compressed_tensors.py
  • pymllm/server/launch.py
  • pymllm/tests/test_compressed_tensors_config.py
  • pymllm/tests/test_compressed_tensors_runtime.py

Comment on lines +56 to +74
def gptq_marlin_repack(
    b_q_weight: torch.Tensor,
    perm: Optional[torch.Tensor],
    size_k: int,
    size_n: int,
    num_bits: int,
) -> torch.Tensor:
    """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""

    pack_factor = 32 // num_bits
    tile_size = 16
    out = torch.empty(
        (size_k // tile_size, size_n * tile_size // pack_factor),
        dtype=b_q_weight.dtype,
        device=b_q_weight.device,
    )
    kernel = _make_gptq_marlin_repack_kernel()
    perm_t = _normalize_perm(perm, size_k, b_q_weight.device)
    kernel(b_q_weight, perm_t, out, size_k, size_n, num_bits)

⚠️ Potential issue | 🟠 Major

Reject unsupported size_k/size_n/num_bits inputs before allocating out.

Line 68 derives the output shape with floor division, so unsupported inputs are silently truncated instead of failing fast: a size_k tail is dropped (for example, size_k=100 yields 100 // 16 = 6 tile rows and silently discards the last 4 rows), a non-64-multiple size_n can leave part of out uninitialized, and a non-divisor num_bits produces the wrong packing factor. Since this wrapper is now public, please validate num_bits, alignment, and b_q_weight.shape/dtype/device before the allocation and kernel launch.

Suggested guardrail
 def gptq_marlin_repack(
     b_q_weight: torch.Tensor,
     perm: Optional[torch.Tensor],
     size_k: int,
     size_n: int,
     num_bits: int,
 ) -> torch.Tensor:
     """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""
 
+    if b_q_weight.dtype != torch.int32:
+        raise ValueError("b_q_weight must be int32")
+    if b_q_weight.device.type != "cuda":
+        raise ValueError("b_q_weight must live on CUDA")
+    if num_bits <= 0 or 32 % num_bits != 0:
+        raise ValueError("num_bits must be a positive divisor of 32")
+    if size_k % 16 != 0:
+        raise ValueError("size_k must be divisible by 16")
+    if size_n % 64 != 0:
+        raise ValueError("size_n must be divisible by 64")
+
     pack_factor = 32 // num_bits
+    expected_shape = (size_k // pack_factor, size_n)
+    if tuple(b_q_weight.shape) != expected_shape:
+        raise ValueError(f"b_q_weight must have shape {expected_shape}")
+
     tile_size = 16
     out = torch.empty(
         (size_k // tile_size, size_n * tile_size // pack_factor),
         dtype=b_q_weight.dtype,
         device=b_q_weight.device,

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."


Comment thread pymllm/layers/rms_norm.py
Comment on lines +18 to +21
x_fp32 = x.float()
var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
x_norm = x_fp32 * torch.rsqrt(var + eps)
return x_norm.to(dtype=x.dtype) * weight

⚠️ Potential issue | 🟠 Major

Apply the scale in FP32 before the final cast.

This helper downcasts x_norm before multiplying by weight, so the fallback loses precision and can even return the wrong dtype when weight.dtype != x.dtype. pymllm/layers/rms_norm_gated.py:40-61 already keeps both operands in FP32 and casts once at the end.

Suggested fix
 def _torch_rmsnorm(
     x: torch.Tensor,
     weight: torch.Tensor,
     eps: float,
 ) -> torch.Tensor:
     x_fp32 = x.float()
+    weight_fp32 = weight.float()
     var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
-    x_norm = x_fp32 * torch.rsqrt(var + eps)
-    return x_norm.to(dtype=x.dtype) * weight
+    out = x_fp32 * torch.rsqrt(var + eps) * weight_fp32
+    return out.to(dtype=x.dtype)

Comment thread pymllm/layers/rms_norm.py
Comment on lines +52 to +54
except Exception:
    x = x + residual
    return _torch_rmsnorm(x, self.weight, self.eps), residual

⚠️ Potential issue | 🔴 Critical

Return the accumulated residual from the fallback path.

This branch computes x + residual but returns the old residual. pymllm/models/qwen3_5.py:206-211 threads that second tuple element straight into the next norm call, so taking this fallback leaves all subsequent blocks operating on stale residual state.

Suggested fix
             except Exception:
-                x = x + residual
-                return _torch_rmsnorm(x, self.weight, self.eps), residual
+                accumulated = x + residual
+                return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 52-52: Do not catch blind exception: Exception

(BLE001)


Comment thread pymllm/models/qwen3_vl.py Outdated
Comment thread pymllm/orchestrator/scheduler_process.py Outdated
Comment on lines +374 to +387
# If AutoProcessor produced multimodal input_ids, they must override
# the plain tokenizer result. Otherwise the prompt contains only a
# single image placeholder token and won't match the visual features.
if mm_inputs is not None:
    image_inputs = mm_inputs.get("image_inputs")
    if image_inputs is not None and "input_ids" in image_inputs:
        proc_input_ids = image_inputs["input_ids"]
        if hasattr(proc_input_ids, "ndim") and proc_input_ids.ndim > 1:
            proc_input_ids = proc_input_ids[0]
        if hasattr(proc_input_ids, "tolist"):
            input_ids = proc_input_ids.tolist()
        else:
            input_ids = list(proc_input_ids)


⚠️ Potential issue | 🟠 Major

Keep the context-length guard on the multimodal path.

Once these processor IDs replace the text-tokenizer output, multimodal requests no longer inherit the truncation/length check applied earlier. A long image prompt can now exceed self._context_length and fail later in the runner instead of being capped or rejected here. At minimum, validate the replacement IDs before packing the request.

🧭 Minimal safeguard
                 if hasattr(proc_input_ids, "tolist"):
                     input_ids = proc_input_ids.tolist()
                 else:
                     input_ids = list(proc_input_ids)
+                if (
+                    self._context_length is not None
+                    and len(input_ids) > self._context_length
+                ):
+                    raise ValueError(
+                        "Multimodal prompt exceeds context_length after processor tokenization"
+                    )

Comment on lines +61 to +70
def verify_marlin_supported(group_size: int) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if not torch.cuda.is_available():
        return
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch, if torch.cuda.get_device_capability() is called without a device argument, which CUDA device does it inspect?

💡 Result:

It inspects the current CUDA device, as returned by torch.cuda.current_device.



🏁 Script executed:

# First, locate the file and check its structure
find . -path "*pymllm/quantization/methods/compressed_tensors.py" -type f

Repository: UbiquitousLearning/mllm

Length of output: 118


🏁 Script executed:

# Read the relevant sections of the file
if [ -f "pymllm/quantization/methods/compressed_tensors.py" ]; then
  echo "=== Lines 61-75 (verify_marlin_supported) ===" 
  sed -n '61,75p' pymllm/quantization/methods/compressed_tensors.py
  echo ""
  echo "=== Lines 85-105 (process_weights_after_loading area) ===" 
  sed -n '85,105p' pymllm/quantization/methods/compressed_tensors.py
  echo ""
  echo "=== Lines 215-235 (second mentioned range) ===" 
  sed -n '215,235p' pymllm/quantization/methods/compressed_tensors.py
fi

Repository: UbiquitousLearning/mllm

Length of output: 2264


🏁 Script executed:

# Check the total file length to understand context
wc -l pymllm/quantization/methods/compressed_tensors.py

Repository: UbiquitousLearning/mllm

Length of output: 120


Re-validate Marlin device compatibility in the actual runtime device context.

The verify_marlin_supported() function calls torch.cuda.get_device_capability() without a device argument, which inspects only the current CUDA device, not the device where the layer actually resides. The process_weights_after_loading() method extracts the correct device from layer.weight_packed.device but never re-validates Marlin support for that device before executing CUDA operations. In workflows where initialization occurs on a different device than the runtime device (e.g., CPU initialization before GPU placement, or multi-GPU scenarios), this allows invalid configurations to pass validation and fail later at runtime.

Modify verify_marlin_supported() to accept and use a device parameter, and call it again at the start of process_weights_after_loading() with the actual layer.weight_packed.device.
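
A minimal sketch of that signature change (the group-size list is illustrative):

```python
import torch

MARLIN_SUPPORTED_GROUP_SIZES = (-1, 32, 64, 128)  # illustrative values

def verify_marlin_supported(group_size: int, device: torch.device) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if device.type == "cuda":
        major, minor = torch.cuda.get_device_capability(device)
        if major * 10 + minor < 80:
            raise ValueError("compressed-tensors Marlin requires SM80+")
```

process_weights_after_loading() would then re-run the check as `verify_marlin_supported(group_size, layer.weight_packed.device)` before any CUDA work.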


Comment thread pymllm/README-ZH.md Outdated
Comment thread pymllm/server/launch.py
Comment on lines +473 to +487
# Preserve multimodal message structure for tokenizer.apply_chat_template.
msg_dicts: List[Dict[str, Any]] = []
for msg in messages:
    content = msg.content
    if isinstance(content, list):
-       # Multimodal: extract only text parts for the prompt string.
-       text_parts = [p.text for p in content if p.type == "text" and p.text]
-       content = "\n".join(text_parts) if text_parts else ""
        mm_parts: List[Dict[str, Any]] = []
        for part in content:
            if part.type == "text" and part.text is not None:
                mm_parts.append({"type": "text", "text": part.text})
            elif part.type == "image_url" and part.image_url is not None:
                # Keep image content so chat template can emit vision tokens.
                mm_parts.append(
                    {"type": "image", "image": part.image_url.url}
                )
        content = mm_parts

⚠️ Potential issue | 🟠 Major

Fail fast when structured chat content cannot be templated.

After this change, content can remain a list of parts. If apply_chat_template() is unavailable or throws, the ChatML fallback below will stringify that list and silently drop image placeholders, so the request becomes malformed instead of being rejected.

💡 Suggested guard
@@
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
                 tokenize=False,
                 add_generation_prompt=True,
                 **extra,
             )
         except Exception as e:
             logger.warning("apply_chat_template failed, using fallback: %s", e)
+            has_multimodal_content = any(
+                isinstance(m["content"], list)
+                and any(p.get("type") != "text" for p in m["content"])
+                for m in msg_dicts
+            )
+            if has_multimodal_content:
+                raise ValueError(
+                    "Multimodal chat content requires apply_chat_template() support"
+                ) from e
 
     # Fallback: ChatML format (Qwen-style)
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p["text"] for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."


feat: update Jetson Qwen3-VL server timing and multimodal flow
@coderabbitai bot left a comment (Contributor)

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pymllm/server/launch.py (1)

711-717: ⚠️ Potential issue | 🟠 Major

Guard empty results before using r in timing payload.

r is read at lines 744-766 but is only defined inside the `for i, r in enumerate(results)` loop. If results is empty, this path raises UnboundLocalError.

Proposed fix
     try:
         results = []
         async for item in _iter_with_disconnect_check(gen, request):
             results.append(item)
+        if not results:
+            raise HTTPException(status_code=500, detail="No output from engine")
         choices = []
+        r: Dict[str, Any] = results[-1]
         prompt_tokens = 0
         completion_tokens = 0
         for i, r in enumerate(results):
             choices.append(
                 {

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

Also applies to: 743-767

♻️ Duplicate comments (1)
pymllm/server/launch.py (1)

473-487: ⚠️ Potential issue | 🟠 Major

Fail fast for multimodal content when chat templating is unavailable (still unresolved).

Line 511 can still stringify list-structured multimodal content if apply_chat_template is missing/fails, which silently corrupts request semantics instead of rejecting the request.

Proposed fix
 def _messages_to_prompt(
@@
     tokenizer = _tokenizer
+    has_structured_content = any(isinstance(m.get("content"), list) for m in msg_dicts)
+
+    if has_structured_content and not (
+        tokenizer is not None and hasattr(tokenizer, "apply_chat_template")
+    ):
+        raise ValueError(
+            "Multimodal chat content requires tokenizer.apply_chat_template support"
+        )
+
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
@@
             )
         except Exception as e:
-            logger.warning("apply_chat_template failed, using fallback: %s", e)
+            if has_structured_content:
+                raise ValueError(
+                    "Multimodal chat content could not be templated by tokenizer"
+                ) from e
+            logger.warning("apply_chat_template failed, using fallback: %s", e)
@@
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p.get("text", "") for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

Also applies to: 495-513

🧹 Nitpick comments (1)
pymllm/server/launch.py (1)

743-767: Extract duplicated timing payload construction into a helper.

Both non-stream endpoints build nearly identical timing objects. A shared helper (e.g., _build_timing(r, prompt_tokens, completion_tokens)) will prevent drift.

As per coding guidelines, "Encourage consistent coding style and patterns with the existing codebase."

Also applies to: 1014-1038
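
A minimal sketch of such a helper (key names come from the comment; the guard logic mirrors the described None/<= 0 checks):

```python
from typing import Any, Dict, Optional

def _tps(tokens: int, ms: Optional[float]) -> Optional[float]:
    # Tokens per second, guarding against missing or non-positive timings.
    if ms is None or ms <= 0:
        return None
    return tokens / (ms / 1000.0)

def _build_timing(
    r: Dict[str, Any], prompt_tokens: int, completion_tokens: int
) -> Dict[str, Any]:
    return {
        "vit_prefill_ms": r.get("vit_prefill_ms"),
        "llm_prefill_ms": r.get("llm_prefill_ms"),
        "llm_decode_ms": r.get("llm_decode_ms"),
        "prefill_tokens": prompt_tokens,
        "vit_prefill_tps": _tps(prompt_tokens, r.get("vit_prefill_ms")),
        "llm_prefill_tps": _tps(prompt_tokens, r.get("llm_prefill_ms")),
        "llm_decode_tps": _tps(completion_tokens, r.get("llm_decode_ms")),
    }
```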


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0da92abb-fa6f-491f-b861-34c211875ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 25cdb3d and 4d83b3d.

📒 Files selected for processing (5)
  • pymllm/models/qwen3_vl.py
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/orchestrator/scheduler_process.py
  • pymllm/server/launch.py
✅ Files skipped from review due to trivial changes (1)
  • pymllm/orchestrator/scheduler_process.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/models/qwen3_vl.py

