
Add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm #663

Open
jialilve wants to merge 8 commits into UbiquitousLearning:main from jialilve:jetson-awq-qwen3vl

Conversation

@jialilve commented Apr 5, 2026 (Contributor)

Summary

  • add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm
  • add the missing gptq_marlin_repack Python JIT wrapper in mllm-kernel
  • add kernel and runtime tests for the new quantized path
  • add bilingual pymllm README documents for the validated Jetson Orin workflow

Why

Qwen3-VL-2B-Instruct-AWQ-4bit uses compressed-tensors, but pymllm + mllm-kernel previously
did not provide a complete Marlin-backed path for loading and serving this model.

This PR adds the missing kernel wrapper, quantization runtime integration, config parsing fix, and
usage documentation needed for the validated Jetson Orin workflow.
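
For illustration, the nesting that the config-parsing fix handles looks like this (a minimal sketch, not the actual pymllm helper; the function name `load_quant_config` is hypothetical):

```python
import json
from pathlib import Path

def load_quant_config(model_path: Path) -> dict:
    """Sketch: config.json keeps model metadata at the top level and
    nests quantization details under "quantization_config"."""
    with open(Path(model_path) / "config.json") as fp:
        cfg = json.load(fp)
    return cfg.get("quantization_config", cfg)
```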

Key Changes

  • register a new compressed-tensors quantization method in pymllm
  • implement CompressedTensorsConfig, CompressedTensorsLinearMethod, and the Marlin-backed
    weight scheme
  • add gptq_marlin_repack to the Python JIT wrapper layer in mllm-kernel
  • add tests for Marlin kernel wrappers and compressed-tensors runtime/config behavior
  • fix quantization config loading from config.json -> quantization_config
  • add pymllm/README.md and pymllm/README-ZH.md for the current Jetson Orin usage flow

Testing

pytest -q \
  mllm-kernel/tests/test_gptq_marlin.py \
  mllm-kernel/tests/test_gptq_marlin_repack.py \
  pymllm/tests/test_compressed_tensors_config.py \
  pymllm/tests/test_compressed_tensors_runtime.py

Result:

- 27 passed

Validated Configuration

Validated with:

- Qwen3-VL-2B-Instruct-AWQ-4bit
- compressed-tensors
- safetensors
- float16

Notes

- the current implementation targets the compressed-tensors signature used by the validated
  AWQ checkpoint
- the Jetson Orin setup is documented in pymllm/README.md and pymllm/README-ZH.md

Summary by CodeRabbit

* **New Features**
  * Compressed‑tensors 4‑bit GPTQ support with Marlin kernels and new repack kernel; public Python entrypoint exported
  * Timing metrics (vit_prefill_ms, llm_prefill_ms, llm_decode_ms) propagated and included in API responses
  * Multimodal message content preserved as structured parts for tokenization

* **Bug Fixes**
  * RMSNorm: added robust PyTorch fallback when FlashInfer fails
  * Quant config loading: now unwraps nested quantization_config from config.json

* **Documentation**
  * Added Jetson Orin–targeted docs (README and README‑ZH)

* **Tests**
  * New CUDA test suites for Marlin GEMM and repack, plus compressed‑tensors runtime/config tests

@coderabbitai bot commented Apr 5, 2026 (Contributor)

📝 Walkthrough

Adds compressed-tensors GPTQ Marlin quantization (CUDA kernels + Python integration), a new gptq_marlin_repack kernel and tests, timing instrumentation propagated through inference pipeline, multimodal/tokenizer/server adjustments, CMake CPM vendoring change, RMSNorm PyTorch fallback, and Jetson-oriented documentation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CMake Bootstrap**<br>`mllm-kernel/cmake/CPM.cmake` | Prefer vendored CPM.cmake relative to script; fall back to downloading only; removed copy/rename fallback; fatal-check that CPMAddPackage exists. |
| **Namespace & Kernel Includes**<br>`mllm-kernel/include/mllm_kernel/scalar_type.hpp`, `mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh` | Wrap host in mllm_kernel::host and add alias; include updated scalar header in marlin C++ header and use host symbols. |
| **CUDA JIT: Marlin GEMM**<br>`mllm-kernel/mllm_kernel/cuda/jit/__init__.py`, `.../gptq_marlin.py` | Export JIT wrapper update: compute cpp_args = make_cpp_args(dtype), pass args=[dtype] to @jit, and instantiate gptq_marlin_gemm with cpp_args. |
| **CUDA JIT: Repack Entrypoint**<br>`mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py` | New module: permutation validation, cached kernel compilation for gptq_marlin_repack, public gptq_marlin_repack(...) API that allocates output and invokes kernel. |
| **Quantization: Compressed-Tensors**<br>`pymllm/quantization/methods/compressed_tensors.py`, `pymllm/quantization/methods/__init__.py` | New CompressedTensorsConfig (registered "compressed-tensors") and CompressedTensorsLinearMethod with Marlin repack/scale permutation, workspace/index management, weight registration, and runtime GEMM invocation; re-exported in package init. |
| **Tests — Kernel & Quantization**<br>`mllm-kernel/tests/test_gptq_marlin.py`, `mllm-kernel/tests/test_gptq_marlin_repack.py`, `pymllm/tests/test_compressed_tensors_config.py`, `pymllm/tests/test_compressed_tensors_runtime.py` | Added CUDA-only and CPU tests covering gptq_marlin correctness, repack validation (shape/dtype/perm/errors), config parsing, and runtime integration/interaction checks with monkeypatches. |
| **Timing Instrumentation**<br>`pymllm/models/qwen3_vl.py`, `pymllm/executor/model_runner.py`, `pymllm/orchestrator/model_runner_process.py`, `pymllm/orchestrator/scheduler_process.py`, `pymllm/orchestrator/detokenizer_process.py` | Record vision/LLM prefill and decode timings on ForwardBatch, propagate timing fields through scheduler → detokenizer → responses; capture outer forward time and fill missing decode timings; added cache guard for implementations lacking page_size. |
| **Multimodal & Tokenization**<br>`pymllm/orchestrator/tokenizer_process.py`, `pymllm/server/launch.py` | Preserve structured multimodal message parts (text/image) for the tokenizer template; server responses (streaming & non-streaming) include a timing payload with prefill/decode ms, tokens, and throughput computations. |
| **RMSNorm Fallback**<br>`pymllm/layers/rms_norm.py` | Added _torch_rmsnorm fallback and wrapped FlashInfer paths in try/except to fall back to pure-PyTorch behavior on error. |
| **Docs & Misc**<br>`pymllm/README.md`, `pymllm/README-ZH.md` | Added Jetson Orin–focused documentation (validated env, install, server commands, examples) in English and Chinese. |
| **Small API/Helpers**<br>`pymllm/executor/model_runner.py`, `mllm-kernel/include/.../scalar_type.hpp`, `pymllm/quantization/methods/__init__.py` | ModelRunner._load_quant_config_dict unwraps quantization_config from config.json when present; namespace alias added for backward compatibility; compressed-tensors re-exports added. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant Server
  participant Scheduler
  participant ModelRunner
  participant CUDAKernel as GPU(gptq_marlin_repack/gemm)
  participant Detokenizer

  Client->>Server: request (chat/completion)
  Server->>Scheduler: enqueue/process batch
  Scheduler->>ModelRunner: forward_batch with multimodal inputs
  ModelRunner->>CUDAKernel: gptq_marlin_repack (weights) -> gptq_marlin_gemm (compute)
  CUDAKernel-->>ModelRunner: tensor outputs
  ModelRunner-->>Scheduler: batch results (+timings)
  Scheduler-->>Detokenizer: per-rid results (include timing fields)
  Detokenizer-->>Client: final response (includes timing payload)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰
I hop through kernels, bits in tow,
Repacking weights so matrices grow,
Timings tick from vision to text,
Marlin shuffles scales and specs,
Jetson hums — my code says go!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.38%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding compressed-tensors Marlin support for Qwen3-VL AWQ quantization in pymllm, which aligns with the substantial additions of quantization runtime integration and kernel wrappers throughout the changeset. |
| Description check | ✅ Passed | The PR description comprehensively covers the changes with clear sections (Summary, Why, Key Changes, Testing, Validated Configuration, Notes), specific technical details, and test results, exceeding the minimal template requirements. |



@coderabbitai bot left a comment (Contributor)

Actionable comments posted: 9

🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)

490-495: Make config.json a true fallback.

Handling it inside the main candidate loop makes precedence depend on the ordering of the `unique` candidate list. Since every HF checkpoint has a config.json, encountering it early makes the rest of the probe unreachable, so the more specific quantization metadata can never win.

♻️ Suggested simplification
         for fname in unique:
+            if fname == "config.json":
+                continue
             fpath = model_path / fname
             if fpath.exists():
                 with open(fpath) as fp:
                     cfg = json.load(fp)
-                # config.json stores model metadata at the top level and
-                # nests quantization details under quantization_config.
-                if fname == "config.json" and "quantization_config" in cfg:
-                    return cfg["quantization_config"]
                 return cfg

Keep the existing explicit config.json -> quantization_config block below as the only fallback path.

pymllm/quantization/methods/compressed_tensors.py (1)

89-97: Use buffers for Marlin runtime tensors instead of mixing plain attributes with Parameters.

Line 230 stores workspace as a plain tensor attribute, which won't be moved by module.to()/cuda(). Lines 231-233 create empty Parameters for runtime metadata, causing them to appear in parameters()/state_dict() despite not being checkpoint-backed weights. All four tensors should be registered as non-persistent buffers, which ensures they move with the module while remaining excluded from the parameter list and checkpoint.
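
A minimal sketch of the suggested registration (tensor names are illustrative; `persistent=False` keeps the buffers out of `state_dict()` while `module.to()`/`cuda()` still moves them):

```python
import torch
from torch import nn

class MarlinRuntimeState(nn.Module):
    def __init__(self, workspace: torch.Tensor, g_idx: torch.Tensor) -> None:
        super().__init__()
        # Non-persistent buffers follow module.to()/cuda() like parameters,
        # but stay out of parameters() and the checkpoint.
        self.register_buffer("workspace", workspace, persistent=False)
        self.register_buffer("g_idx", g_idx, persistent=False)
```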

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py`:
- Around line 56-74: The function gptq_marlin_repack currently computes output
shape using floor division and can silently truncate/leave uninitialized data;
before allocating out or calling kernel, validate inputs: ensure num_bits
divides 32 evenly (so pack_factor = 32 // num_bits is integer and >0), ensure
size_k is a multiple of tile_size (16), ensure (size_n * tile_size) is divisible
by pack_factor (so out shape is exact), ensure b_q_weight.shape matches (size_k,
size_n) and its dtype/device are used for out, and ensure perm (if provided) has
length size_k; raise a clear ValueError on any violation and only then call
_normalize_perm, _make_gptq_marlin_repack_kernel(), allocate out and invoke
kernel.

In `@pymllm/layers/rms_norm.py`:
- Around line 52-54: The except branch in the RMSNorm fallback currently
computes x = x + residual but returns the old residual, causing downstream
blocks to receive stale residual state; update the except block in
pymllm/layers/rms_norm.py so that after computing x = x + residual you set a
new_residual (or reuse x) and return (_torch_rmsnorm(new_residual, self.weight,
self.eps), new_residual) instead of returning the old residual so subsequent
calls (e.g., in pymllm/models/qwen3_5.py flows) get the updated accumulated
residual.
- Around line 18-21: The code currently casts x_norm back to x.dtype before
multiplying by weight, which loses precision and can produce wrong dtype when
weight.dtype != x.dtype; change the order so you compute x_fp32, var, and x_norm
in FP32, cast weight to float32 (weight_fp32 = weight.to(torch.float32) or
weight.float()), multiply x_norm * weight_fp32 while still in FP32, then cast
the final product once to x.dtype before returning; locate symbols x_fp32, var,
x_norm, weight and eps in the RMSNorm implementation and apply this single-cast
fix consistent with rms_norm_gated.py.

In `@pymllm/models/qwen3_vl.py`:
- Around line 980-994: The helper _cuda_timed_run currently calls
torch.cuda.synchronize() on every invocation which forces a GPU sync in hot
paths (vision prefill and LLM forward); change it so synchronization is opt-in
or deferred: add a parameter like sync=False to _cuda_timed_run (or
alternatively return the start/end torch.cuda.Event objects and the raw output)
and only call torch.cuda.synchronize() and compute elapsed time when sync=True
(or when the caller performs synchronization at an existing barrier), ensuring
the default path avoids unconditional torch.cuda.synchronize() and preserves
existing return shapes by documenting the new return form if you choose the
event-return approach.
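
For reference, a minimal sketch of the event-based variant this prompt describes (the helper name and `sync` flag follow the prompt; the body is illustrative, not the PR's actual code):

```python
import torch

def _cuda_timed_run(fn, *args, sync: bool = False):
    # Record CUDA events around the call; synchronize only when asked,
    # so hot paths can defer the sync to an existing barrier.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn(*args)
    end.record()
    if sync:
        torch.cuda.synchronize()
        return out, start.elapsed_time(end)  # elapsed milliseconds
    return out, (start, end)  # caller syncs later, then reads elapsed_time
```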

In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 792-793: The code currently overwrites req.llm_decode_ms with
out["llm_decode_ms"] inside the decode-step loop; change this to accumulate
per-step timings by adding the step value to the existing request-level value
(e.g., req.llm_decode_ms = (req.llm_decode_ms or 0) + out["llm_decode_ms"]) so
each decode batch increments req.llm_decode_ms instead of replacing it; update
the setter site where "llm_decode_ms" is read from out (the block referencing
req.llm_decode_ms and out) to handle missing/None initial values safely.

In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 374-387: The multimodal branch replaces text tokenizer output with
processor IDs (mm_inputs -> image_inputs -> proc_input_ids) but skips the
earlier context-length guard; validate proc_input_ids against
self._context_length and either truncate or raise consistently with the text
path before assigning input_ids. Locate the block handling
mm_inputs/image_inputs/proc_input_ids in tokenizer_process.py and after
converting proc_input_ids to a Python list apply the same length-check logic
used earlier (use self._context_length to cap the list or raise a clear
exception), ensuring behavior matches the existing text-tokenizer
truncation/rejection policy.

In `@pymllm/quantization/methods/compressed_tensors.py`:
- Around line 61-70: Change verify_marlin_supported to accept a device argument
and use that device when checking CUDA capability (instead of calling
torch.cuda.get_device_capability() with no device), and update any callers
accordingly; specifically, update verify_marlin_supported(group_size: int,
device: torch.device) to first check MARLIN_SUPPORTED_GROUP_SIZES then if
device.type == "cuda" call torch.cuda.get_device_capability(device) and enforce
SM80+, and add an extra call to the new signature at the start of
process_weights_after_loading() using layer.weight_packed.device so validation
runs for the runtime device.

In `@pymllm/README-ZH.md`:
- Around line 5-16: The "## 环境配置 ToDo" section lacks validated Jetson
environment details; replace the checklist with a concrete validated matrix or a
link to a follow-up issue and populate the table with exact JetPack/L4T,
CUDA/cuDNN/TensorRT, Python/pip/venv or conda,
PyTorch/torchvision/transformers/safetensors, flashinfer version/installation,
apt dependencies, memory/GPU suggestions and required ENV vars; specifically
update the "## 环境配置 ToDo" header and the checklist items (e.g., JetPack / L4T,
CUDA / cuDNN / TensorRT, Python / pip / venv 或 conda, PyTorch / torchvision /
transformers / safetensors, flashinfer 版本与安装方式) with validated values or convert
them into a compact validated-version table and add a note linking to a
follow-up issue if full validation will occur later.

In `@pymllm/server/launch.py`:
- Around line 473-487: The code builds multimodal content lists in the messages
loop (mm_parts -> content) but may leave content as a list which later gets
stringified by the ChatML fallback and silently drops images; update the path
around tokenizer.apply_chat_template and the ChatML fallback to fail fast: after
constructing mm_parts (and before falling back to ChatML) check that
tokenizer.apply_chat_template exists and can accept list-form content, attempt
to call it in a try/except, and if it is missing or raises, immediately raise an
explicit exception (e.g., ValueError/RuntimeError) indicating structured chat
content could not be templated for the request instead of allowing silent
stringification; reference the variables/functions messages, mm_parts/content,
tokenizer.apply_chat_template, and the ChatML fallback so the change is made
where list content is prepared and before any stringification occurs.

---

Nitpick comments:
In `@pymllm/executor/model_runner.py`:
- Around line 490-495: The config.json handling inside the main candidate loop
(the block that checks fname == "config.json" and returns
cfg["quantization_config"]) makes config.json a premature match and prevents
more specific probes from being considered; remove that in-loop special-case and
instead handle "config.json" -> "quantization_config" only as the final fallback
path after the candidate/probe loop completes (keep the existing fallback block
below). Update references to cfg/fname accordingly so the loop always continues
for other candidates and only uses config.json's quantization_config when no
other probe matched.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5cd8b503-101e-4acb-822f-2d4c14c9ed71

📥 Commits

Reviewing files that changed from the base of the PR and between 690b859 and 058dee2.

📒 Files selected for processing (22)
  • mllm-kernel/cmake/CPM.cmake
  • mllm-kernel/include/mllm_kernel/scalar_type.hpp
  • mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh
  • mllm-kernel/mllm_kernel/cuda/jit/__init__.py
  • mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin.py
  • mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py
  • mllm-kernel/tests/test_gptq_marlin.py
  • mllm-kernel/tests/test_gptq_marlin_repack.py
  • pymllm/README-ZH.md
  • pymllm/README.md
  • pymllm/executor/model_runner.py
  • pymllm/layers/rms_norm.py
  • pymllm/models/qwen3_vl.py
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/orchestrator/scheduler_process.py
  • pymllm/orchestrator/tokenizer_process.py
  • pymllm/quantization/methods/__init__.py
  • pymllm/quantization/methods/compressed_tensors.py
  • pymllm/server/launch.py
  • pymllm/tests/test_compressed_tensors_config.py
  • pymllm/tests/test_compressed_tensors_runtime.py

Comment on lines +56 to +74
def gptq_marlin_repack(
    b_q_weight: torch.Tensor,
    perm: Optional[torch.Tensor],
    size_k: int,
    size_n: int,
    num_bits: int,
) -> torch.Tensor:
    """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""

    pack_factor = 32 // num_bits
    tile_size = 16
    out = torch.empty(
        (size_k // tile_size, size_n * tile_size // pack_factor),
        dtype=b_q_weight.dtype,
        device=b_q_weight.device,
    )
    kernel = _make_gptq_marlin_repack_kernel()
    perm_t = _normalize_perm(perm, size_k, b_q_weight.device)
    kernel(b_q_weight, perm_t, out, size_k, size_n, num_bits)

⚠️ Potential issue | 🟠 Major

Reject unsupported size_k/size_n/num_bits inputs before allocating out.

Line 68 derives the output shape with floor division, so unsupported inputs are silently truncated instead of failing fast: a size_k tail is dropped (for example, size_k=100 yields 100 // 16 = 6 tile rows and silently discards the last 4 rows), a non-64-multiple size_n can leave part of out uninitialized, and a non-divisor num_bits produces the wrong packing factor. Since this wrapper is now public, please validate num_bits, alignment, and b_q_weight.shape/dtype/device before the allocation and kernel launch.

Suggested guardrail
 def gptq_marlin_repack(
     b_q_weight: torch.Tensor,
     perm: Optional[torch.Tensor],
     size_k: int,
     size_n: int,
     num_bits: int,
 ) -> torch.Tensor:
     """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""
 
+    if b_q_weight.dtype != torch.int32:
+        raise ValueError("b_q_weight must be int32")
+    if b_q_weight.device.type != "cuda":
+        raise ValueError("b_q_weight must live on CUDA")
+    if num_bits <= 0 or 32 % num_bits != 0:
+        raise ValueError("num_bits must be a positive divisor of 32")
+    if size_k % 16 != 0:
+        raise ValueError("size_k must be divisible by 16")
+    if size_n % 64 != 0:
+        raise ValueError("size_n must be divisible by 64")
+
     pack_factor = 32 // num_bits
+    expected_shape = (size_k // pack_factor, size_n)
+    if tuple(b_q_weight.shape) != expected_shape:
+        raise ValueError(f"b_q_weight must have shape {expected_shape}")
+
     tile_size = 16
     out = torch.empty(
         (size_k // tile_size, size_n * tile_size // pack_factor),
         dtype=b_q_weight.dtype,
         device=b_q_weight.device,

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."


Comment thread pymllm/layers/rms_norm.py
Comment on lines +18 to +21
x_fp32 = x.float()
var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
x_norm = x_fp32 * torch.rsqrt(var + eps)
return x_norm.to(dtype=x.dtype) * weight

⚠️ Potential issue | 🟠 Major

Apply the scale in FP32 before the final cast.

This helper downcasts x_norm before multiplying by weight, so the fallback loses precision and can even return the wrong dtype when weight.dtype != x.dtype. pymllm/layers/rms_norm_gated.py:40-61 already keeps both operands in FP32 and casts once at the end.

Suggested fix
 def _torch_rmsnorm(
     x: torch.Tensor,
     weight: torch.Tensor,
     eps: float,
 ) -> torch.Tensor:
     x_fp32 = x.float()
+    weight_fp32 = weight.float()
     var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
-    x_norm = x_fp32 * torch.rsqrt(var + eps)
-    return x_norm.to(dtype=x.dtype) * weight
+    out = x_fp32 * torch.rsqrt(var + eps) * weight_fp32
+    return out.to(dtype=x.dtype)

Comment thread pymllm/layers/rms_norm.py
Comment on lines +52 to +54
except Exception:
    x = x + residual
    return _torch_rmsnorm(x, self.weight, self.eps), residual

⚠️ Potential issue | 🔴 Critical

Return the accumulated residual from the fallback path.

This branch computes x + residual but returns the old residual. pymllm/models/qwen3_5.py:206-211 threads that second tuple element straight into the next norm call, so taking this fallback leaves all subsequent blocks operating on stale residual state.

Suggested fix
             except Exception:
-                x = x + residual
-                return _torch_rmsnorm(x, self.weight, self.eps), residual
+                accumulated = x + residual
+                return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 52-52: Do not catch blind exception: Exception

(BLE001)


Comment thread pymllm/models/qwen3_vl.py Outdated
Comment thread pymllm/orchestrator/scheduler_process.py Outdated
Comment on lines +374 to +387
# If AutoProcessor produced multimodal input_ids, they must override
# the plain tokenizer result. Otherwise the prompt contains only a
# single image placeholder token and won't match the visual features.
if mm_inputs is not None:
    image_inputs = mm_inputs.get("image_inputs")
    if image_inputs is not None and "input_ids" in image_inputs:
        proc_input_ids = image_inputs["input_ids"]
        if hasattr(proc_input_ids, "ndim") and proc_input_ids.ndim > 1:
            proc_input_ids = proc_input_ids[0]
        if hasattr(proc_input_ids, "tolist"):
            input_ids = proc_input_ids.tolist()
        else:
            input_ids = list(proc_input_ids)


⚠️ Potential issue | 🟠 Major

Keep the context-length guard on the multimodal path.

Once these processor IDs replace the text-tokenizer output, multimodal requests no longer inherit the truncation/length check applied earlier. A long image prompt can now exceed self._context_length and fail later in the runner instead of being capped or rejected here. At minimum, validate the replacement IDs before packing the request.

🧭 Minimal safeguard
                 if hasattr(proc_input_ids, "tolist"):
                     input_ids = proc_input_ids.tolist()
                 else:
                     input_ids = list(proc_input_ids)
+                if (
+                    self._context_length is not None
+                    and len(input_ids) > self._context_length
+                ):
+                    raise ValueError(
+                        "Multimodal prompt exceeds context_length after processor tokenization"
+                    )

Comment on lines +61 to +70
def verify_marlin_supported(group_size: int) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if not torch.cuda.is_available():
        return
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch, if torch.cuda.get_device_capability() is called without a device argument, which CUDA device does it inspect?

💡 Result:

It inspects the current CUDA device, as returned by torch.cuda.current_device.



🏁 Script executed:

# First, locate the file and check its structure
find . -path "*pymllm/quantization/methods/compressed_tensors.py" -type f

Repository: UbiquitousLearning/mllm

Length of output: 118


🏁 Script executed:

# Read the relevant sections of the file
if [ -f "pymllm/quantization/methods/compressed_tensors.py" ]; then
  echo "=== Lines 61-75 (verify_marlin_supported) ===" 
  sed -n '61,75p' pymllm/quantization/methods/compressed_tensors.py
  echo ""
  echo "=== Lines 85-105 (process_weights_after_loading area) ===" 
  sed -n '85,105p' pymllm/quantization/methods/compressed_tensors.py
  echo ""
  echo "=== Lines 215-235 (second mentioned range) ===" 
  sed -n '215,235p' pymllm/quantization/methods/compressed_tensors.py
fi

Repository: UbiquitousLearning/mllm

Length of output: 2264


🏁 Script executed:

# Check the total file length to understand context
wc -l pymllm/quantization/methods/compressed_tensors.py

Repository: UbiquitousLearning/mllm

Length of output: 120


Re-validate Marlin device compatibility in the actual runtime device context.

The verify_marlin_supported() function calls torch.cuda.get_device_capability() without a device argument, which inspects only the current CUDA device, not the device where the layer actually resides. The process_weights_after_loading() method extracts the correct device from layer.weight_packed.device but never re-validates Marlin support for that device before executing CUDA operations. In workflows where initialization occurs on a different device than the runtime device (e.g., CPU initialization before GPU placement, or multi-GPU scenarios), this allows invalid configurations to pass validation and fail later at runtime.

Modify verify_marlin_supported() to accept and use a device parameter, and call it again at the start of process_weights_after_loading() with the actual layer.weight_packed.device.
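
A minimal sketch of that signature change (the group-size list is illustrative):

```python
import torch

MARLIN_SUPPORTED_GROUP_SIZES = (-1, 32, 64, 128)  # illustrative values

def verify_marlin_supported(group_size: int, device: torch.device) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if device.type == "cuda":
        major, minor = torch.cuda.get_device_capability(device)
        if major * 10 + minor < 80:
            raise ValueError("compressed-tensors Marlin requires SM80+")
```

process_weights_after_loading() would then re-run the check as `verify_marlin_supported(group_size, layer.weight_packed.device)` before any CUDA work.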


Comment thread pymllm/README-ZH.md Outdated
Comment thread pymllm/server/launch.py
Comment on lines +473 to +487
# Preserve multimodal message structure for tokenizer.apply_chat_template.
msg_dicts: List[Dict[str, Any]] = []
for msg in messages:
    content = msg.content
    if isinstance(content, list):
-       # Multimodal: extract only text parts for the prompt string.
-       text_parts = [p.text for p in content if p.type == "text" and p.text]
-       content = "\n".join(text_parts) if text_parts else ""
        mm_parts: List[Dict[str, Any]] = []
        for part in content:
            if part.type == "text" and part.text is not None:
                mm_parts.append({"type": "text", "text": part.text})
            elif part.type == "image_url" and part.image_url is not None:
                # Keep image content so chat template can emit vision tokens.
                mm_parts.append(
                    {"type": "image", "image": part.image_url.url}
                )
        content = mm_parts

⚠️ Potential issue | 🟠 Major

Fail fast when structured chat content cannot be templated.

After this change, content can remain a list of parts. If apply_chat_template() is unavailable or throws, the ChatML fallback below will stringify that list and silently drop image placeholders, so the request becomes malformed instead of being rejected.

💡 Suggested guard
@@
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
                 tokenize=False,
                 add_generation_prompt=True,
                 **extra,
             )
         except Exception as e:
             logger.warning("apply_chat_template failed, using fallback: %s", e)
+            has_multimodal_content = any(
+                isinstance(m["content"], list)
+                and any(p.get("type") != "text" for p in m["content"])
+                for m in msg_dicts
+            )
+            if has_multimodal_content:
+                raise ValueError(
+                    "Multimodal chat content requires apply_chat_template() support"
+                ) from e
 
     # Fallback: ChatML format (Qwen-style)
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p["text"] for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."


feat: update Jetson Qwen3-VL server timing and multimodal flow
@coderabbitai bot left a comment (Contributor)

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pymllm/server/launch.py (1)

711-717: ⚠️ Potential issue | 🟠 Major

Guard empty results before using r in timing payload.

r is read at lines 744-766 but is only defined inside the `for i, r in enumerate(results)` loop. If results is empty, this path raises UnboundLocalError.

Proposed fix
     try:
         results = []
         async for item in _iter_with_disconnect_check(gen, request):
             results.append(item)
+        if not results:
+            raise HTTPException(status_code=500, detail="No output from engine")
         choices = []
+        r: Dict[str, Any] = results[-1]
         prompt_tokens = 0
         completion_tokens = 0
         for i, r in enumerate(results):
             choices.append(
                 {

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

Also applies to: 743-767

♻️ Duplicate comments (1)
pymllm/server/launch.py (1)

473-487: ⚠️ Potential issue | 🟠 Major

Fail fast for multimodal content when chat templating is unavailable (still unresolved).

Line 511 can still stringify list-structured multimodal content if apply_chat_template is missing/fails, which silently corrupts request semantics instead of rejecting the request.

Proposed fix
 def _messages_to_prompt(
@@
     tokenizer = _tokenizer
+    has_structured_content = any(isinstance(m.get("content"), list) for m in msg_dicts)
+
+    if has_structured_content and not (
+        tokenizer is not None and hasattr(tokenizer, "apply_chat_template")
+    ):
+        raise ValueError(
+            "Multimodal chat content requires tokenizer.apply_chat_template support"
+        )
+
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
@@
             )
         except Exception as e:
-            logger.warning("apply_chat_template failed, using fallback: %s", e)
+            if has_structured_content:
+                raise ValueError(
+                    "Multimodal chat content could not be templated by tokenizer"
+                ) from e
+            logger.warning("apply_chat_template failed, using fallback: %s", e)
@@
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p.get("text", "") for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

Also applies to: 495-513

🧹 Nitpick comments (1)
pymllm/server/launch.py (1)

743-767: Extract duplicated timing payload construction into a helper.

Both non-stream endpoints build nearly identical timing objects. A shared helper (e.g., _build_timing(r, prompt_tokens, completion_tokens)) will prevent drift.

As per coding guidelines, "Encourage consistent coding style and patterns with the existing codebase."

Also applies to: 1014-1038
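
A minimal sketch of such a helper (key names come from the comment; the guard logic mirrors the described None/<= 0 checks):

```python
from typing import Any, Dict, Optional

def _tps(tokens: int, ms: Optional[float]) -> Optional[float]:
    # Tokens per second, guarding against missing or non-positive timings.
    if ms is None or ms <= 0:
        return None
    return tokens / (ms / 1000.0)

def _build_timing(
    r: Dict[str, Any], prompt_tokens: int, completion_tokens: int
) -> Dict[str, Any]:
    return {
        "vit_prefill_ms": r.get("vit_prefill_ms"),
        "llm_prefill_ms": r.get("llm_prefill_ms"),
        "llm_decode_ms": r.get("llm_decode_ms"),
        "prefill_tokens": prompt_tokens,
        "vit_prefill_tps": _tps(prompt_tokens, r.get("vit_prefill_ms")),
        "llm_prefill_tps": _tps(prompt_tokens, r.get("llm_prefill_ms")),
        "llm_decode_tps": _tps(completion_tokens, r.get("llm_decode_ms")),
    }
```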


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0da92abb-fa6f-491f-b861-34c211875ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 25cdb3d and 4d83b3d.

📒 Files selected for processing (5)
  • pymllm/models/qwen3_vl.py
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/orchestrator/scheduler_process.py
  • pymllm/server/launch.py
✅ Files skipped from review due to trivial changes (1)
  • pymllm/orchestrator/scheduler_process.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/models/qwen3_vl.py

