Pull the Llama4 inference benchmark from lightning-thunder by tbqh · Pull Request #5578 · NVIDIA/Fuser

tbqh · 2025-11-21T20:40:01Z

Pull the latest version of this script after substantial changes in the lightning-thunder repo. The script is being "moved" into the fuser repo: it will be deleted from lightning-thunder, and subsequent changes will be merged into the fuser repo.

The script does not contain any nvFuser-specific changes at this moment. No configs or model code have been pulled in yet.

greptile-apps · 2025-11-21T20:42:58Z

Greptile Overview

Greptile Summary

Pulls the latest Llama4 inference benchmark script from lightning-thunder repository into the fuser repo. This is part of migrating the script from lightning-thunder (where it will be deleted) to fuser for ongoing maintenance.

Key Changes:

Added comprehensive inference benchmark with support for Thunder/nvFuser compilation modes
Implemented custom MoE layers (Llama4MoE, GroupedSwiGLU, GroupedLinear) to use grouped matrix multiplication
Added nvFP4 quantization support for inference optimization
Includes tensor parallelism support via PyTorch distributed
Contains extensive imports from thunder package that must be available as an external dependency

Dependency Concern:
The script heavily depends on the thunder package (lines 47-60 in benchmark_inference.py and line 550 in layers_for_inference_benchmark.py), importing from modules like thunder.dynamo, thunder.benchmarks, thunder.tests.distributed, and thunder.transforms. Since this is being "moved" into the fuser repo, verify that:

The thunder package will be available as an installed dependency when running these benchmarks
The import paths match the actual structure when thunder is installed (e.g., thunder.benchmarks.layers_for_inference_benchmark and thunder.tests.distributed.test_moe)
Dependencies are documented in requirements or installation instructions
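
The import-path check described above can be sketched as a small pre-flight script. This is only an illustration: the helper name `find_missing_modules` is hypothetical, and the module list is copied from the review comment rather than from the benchmark's actual import block.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the subset of module_names that cannot be located."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # find_spec imports parent packages of a dotted name; a missing or
            # broken parent surfaces as ImportError / ModuleNotFoundError.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Module paths named in the review comment (not verified against thunder itself).
THUNDER_MODULES = [
    "thunder.dynamo",
    "thunder.benchmarks",
    "thunder.tests.distributed",
    "thunder.transforms",
]

if __name__ == "__main__":
    missing = find_missing_modules(THUNDER_MODULES)
    print("missing:", missing or "none")
```

Running this in the benchmark environment before launching the script would show which of the listed paths fail to resolve, without executing the benchmark itself.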

Confidence Score: 3/5

Safe to merge with dependency verification - code is well-structured but relies on external thunder package that needs to be available at runtime
Score reflects that while the code itself appears well-written and properly pulled from the source repository, there's a critical dependency on the thunder package that must be resolved. The script cannot run without thunder installed, and the import paths need verification. The PR description mentions this is a "move" from lightning-thunder but doesn't address how the dependency will be handled. Additionally, there's an import from thunder.tests.llama4_moe which references test code that may not be part of the public API.
Pay close attention to benchmarks/python/benchmark_inference.py - verify all thunder imports resolve correctly when thunder is installed as a dependency

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| benchmarks/python/benchmark_inference.py | 3/5 | Added comprehensive Llama4 inference benchmark with Thunder/nvFuser integration. Contains numerous imports from `thunder` package which must be available as external dependency. Import paths reference thunder package structure that may not match fuser repo layout. |
| benchmarks/python/layers_for_inference_benchmark.py | 4/5 | Added custom layer implementations for inference benchmarking including GroupedLinear, GroupedSwiGLU, Llama4MoE, and nvFP4 quantization support. Includes reference to `thunder.tests.llama4_moe.Config` import that needs verification. |

Sequence Diagram

sequenceDiagram
    participant User
    participant main
    participant InferenceBenchmark
    participant Model
    participant Thunder
    participant nvFuser

    User->>main: Run benchmark script
    main->>main: parse_args()
    main->>main: _register_nvfp4_ops()
    Note over main: Register nvFP4 custom ops<br/>with Thunder/nvFuser
    main->>InferenceBenchmark: __init__(config)
    InferenceBenchmark->>Model: _load_model()
    Note over Model: Load on meta device
    InferenceBenchmark->>Model: _replace_llama4_moe()
    Note over Model: Replace HF MoE with custom<br/>Llama4MoE using GroupedSwiGLU
    InferenceBenchmark->>Model: parallelize_module()
    Note over Model: Apply tensor parallelism
    InferenceBenchmark->>Model: to_empty(device)
    Note over Model: Materialize on GPU
    InferenceBenchmark->>Model: _quantize_llama4()
    Note over Model: Replace GroupedSwiGLU with<br/>NVFP4InferenceGroupedSwiGLU
    InferenceBenchmark->>Thunder: _compile_model()
    Thunder->>nvFuser: Apply transforms
    Note over Thunder,nvFuser: thunderfx/thunder.jit compilation
    InferenceBenchmark->>InferenceBenchmark: run_benchmark()
    loop warmup_iterations
        InferenceBenchmark->>Model: generate()
        Model->>Thunder: forward()
        Thunder->>nvFuser: Execute fused kernels
    end
    loop num_iterations
        InferenceBenchmark->>Model: measure_inference_step()
        Model->>Model: prefill()
        Model->>Model: decode_one_token() x N
        InferenceBenchmark->>InferenceBenchmark: Track metrics
    end
    InferenceBenchmark->>User: print_results()
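
The warmup and measurement loops in the diagram can be sketched in plain Python. The names `measure_inference_step`, `prefill`, `decode_one_token`, `warmup_iterations` and `num_iterations` come from the diagram; the signatures, the returned keys, and the assumption that prefill yields the first token are guesses made for illustration. A real GPU benchmark would also need to synchronize the device before reading the clock, which this sketch omits.

```python
import statistics
import time


def measure_inference_step(prefill, decode_one_token, num_new_tokens):
    """Time one generation: one prefill call, then num_new_tokens - 1 decode calls."""
    start = time.perf_counter()
    prefill()  # assumed to produce the first output token
    ttft = time.perf_counter() - start

    decode_times = []
    for _ in range(num_new_tokens - 1):
        t0 = time.perf_counter()
        decode_one_token()
        decode_times.append(time.perf_counter() - t0)

    total = ttft + sum(decode_times)
    return {
        "ttft_ms": ttft * 1e3,  # Time to First Token
        "tbot_ms": statistics.mean(decode_times) * 1e3 if decode_times else 0.0,
        "tokens_per_sec": num_new_tokens / total if total > 0 else float("inf"),
    }


def run_benchmark(prefill, decode_one_token, num_new_tokens,
                  warmup_iterations, num_iterations):
    """Run untimed warmup iterations, then collect metrics for each timed one."""
    for _ in range(warmup_iterations):
        measure_inference_step(prefill, decode_one_token, num_new_tokens)
    return [
        measure_inference_step(prefill, decode_one_token, num_new_tokens)
        for _ in range(num_iterations)
    ]
```

Discarding the warmup results keeps one-time costs such as compilation out of the reported TTFT and TBOT figures.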

github-actions · 2025-11-21T20:43:11Z

Review updated until commit f5d05ae

Auto-merge Status

✅ PR is approved
✅ Internal CI is finished
✅ No failed checks
✅ PR is mergeable

Description

Pull latest inference benchmark from lightning-thunder with substantial updates
Add NVFP4 quantization support for GroupedSwiGLU layers in MoE architectures
Implement tensor parallel support for both custom and HuggingFace MoE implementations
Add Thunder-specific optimizations including CUDA graphs and cache support
Update weight layouts and quantization functions for improved performance

Changes walkthrough

Relevant files

Enhancement

benchmark_inference.py (benchmarks/python/benchmark_inference.py, +292/-250): `Main benchmark script with NVFP4 and tensor parallel updates`
- Added lightning-thunder repo reference in docstring
- Implemented _register_nvfp4_ops() for nvfp4 custom operation registration
- Updated model loading to use AutoConfig.from_pretrained() instead of hardcoded config
- Added support for StaticCache vs HybridChunkedCache based on transformers version
- Added new config options: attn_implementation, thunder_cache, enable_thunder_cudagraph
- Updated tensor parallel plan for both custom and HF MoE implementations
- Added CUDA graph support and Thunder-specific optimizations
- Removed --profile option as per thunder PR Lightning-AI/lightning-thunder#2715
- Added torch._grouped_mm support in eager mode as per thunder PR Lightning-AI/lightning-thunder#2721

layers_for_inference_benchmark.py (benchmarks/python/layers_for_inference_benchmark.py, +207/-262): `Supporting layers with GroupedSwiGLU and updated weight layouts`
- Added lightning-thunder repo reference in docstring
- Added GroupedSwiGLU and NVFP4InferenceGroupedSwiGLU classes
- Removed NVFP4InferenceLinear class (replaced by GroupedSwiGLU approach)
- Updated GroupedLinear weight layout from [g,n,k] to [g,out_features,in_features]
- Updated quantization functions for new weight layout
- Added compute_auxiliary_tensors method for performance optimization
- Updated Llama4MoE to handle new weight layouts with proper transposes
- Added proper offset handling with prepended zero for grouped operations
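
The "prepended zero" offset handling mentioned for layers_for_inference_benchmark.py can be illustrated generically. The walkthrough does not show how the script builds its offsets, so the helper names below are hypothetical and the sketch only demonstrates the general idea: a leading zero turns cumulative group sizes into start/end boundaries for every group, including the first.

```python
from itertools import accumulate


def offsets_with_leading_zero(group_sizes):
    """Turn per-group row counts into boundaries [0, s0, s0 + s1, ...]."""
    return [0, *accumulate(group_sizes)]


def group_slices(group_sizes):
    """Return one (start, end) row range per group."""
    offsets = offsets_with_leading_zero(group_sizes)
    return list(zip(offsets[:-1], offsets[1:]))


print(group_slices([3, 0, 2]))
```

Without the leading zero, the first group's start index would have to be special-cased; with it, group i always spans offsets[i] to offsets[i + 1], and empty groups fall out naturally as zero-length ranges.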

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review
Missing Tests

This PR adds substantial new functionality including NVFP4 quantization, distributed tensor parallel support, CUDAGraph integration, and enhanced benchmarking capabilities. However, no new tests were added to validate these features work correctly. The complexity of the changes (especially around distributed setup, custom op registration, and quantization) warrants comprehensive test coverage to prevent regressions.

    # SPDX-FileCopyrightText: Copyright (c) 2025-present NVIDIA CORPORATION & AFFILIATES.
    # All rights reserved.
    # SPDX-License-Identifier: BSD-3-Clause
    """Inference benchmark focusing on throughput and latency metrics of prefill and decode phases.

    AutoModelForCausalLM from Hugging Face transformers is used for model implementation.

    Key metrics:
    - Throughput (tokens/second)
    - Latency (ms/token)
    - Time to First Token (TTFT)
    - Time Between Output Tokens (TBOT)

    Pulled from the lightning-thunder repo. Reference:
    https://github.com/Lightning-AI/lightning-thunder/blob/4d3a3c3a7481efdc6a23cdeea99c3ffd31af5e78/thunder/benchmarks/benchmark_inference.py
    """

    # fmt: off
    from __future__ import annotations
    from contextlib import contextmanager
    from dataclasses import dataclass, field
    import argparse
    import json
    import os
    import statistics
    import time
    import warnings
    from typing import Any
    from collections.abc import Callable
    from looseversion import LooseVersion

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel, ColwiseParallel
    from tqdm import tqdm
    import transformers
    from transformers import AutoConfig, AutoModelForCausalLM
    from transformers.cache_utils import HybridChunkedCache, StaticCache
    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe
    from torch.distributed.tensor.placement_types import Shard
    from torch.distributed.tensor import DTensor

    import thunder
    from thunder.dynamo.compiler import thunderfx
    from thunder.benchmarks.layers_for_inference_benchmark import (
        GroupedSwiGLU,

Error Handling Robustness

The new NVFP4 custom op registration in `_register_nvfp4_ops()` catches exceptions and only warns, which could hide critical failures during benchmarking. Additionally, the distributed setup with torchelastic detection and device mesh initialization lacks comprehensive error handling for scenarios like failed process group creation or device mesh initialization failures.

    def _register_nvfp4_ops():
        """Register nvfp4 custom operations with Thunder."""
        # Register f16a_nvfp4weight_scaled_grouped_mm with nvfuser translator
        _nvfp4_grouped_mm_symbol = _register_custom_op(nvfuser_f16a_nvfp4weight_scaled_grouped_mm)

        def nvfp4_grouped_mm_translator(
            activation,
            fp4_weight,
            weight_scaling_factor,
            global_scale,
            offsets,
            blockscale_offsets,
            problem_sizes,
            *,
            fd,
            lc_to_nv_map,
        ):
            from nvfuser_direct import DataType
            from thunder.executors.nvfuserex_impl import getnv

            nv_act = getnv(activation, fd, lc_to_nv_map)
            nv_fp4_w = getnv(fp4_weight, fd, lc_to_nv_map)
            nv_sf_w = getnv(weight_scaling_factor, fd, lc_to_nv_map)
            nv_alpha = getnv(global_scale, fd, lc_to_nv_map)
            nv_offsets = getnv(offsets, fd, lc_to_nv_map)
            nv_blocksf_offsets = getnv(blockscale_offsets, fd, lc_to_nv_map)
            nv_problem_sizes = getnv(problem_sizes, fd, lc_to_nv_map)
            # dynamic shape support has some concretization issue
            m_size = activation.shape[0]
            k_size = activation.shape[1]
            k_tile_size = k_size // 16

            reshaped_mat1 = fd.ops.reshape(nv_act, [m_size, k_tile_size, 16])
            scale1 = fd.ops.abs(reshaped_mat1)
            scale1 = fd.ops.max(scale1, 2)
            scale1 = fd.ops.div(scale1, FLOAT4_E2M1_MAX)
            scale1 = fd.ops.clamp(scale1, FLOAT8_E4M3_EPS, FLOAT8_E4M3_MAX)

            broadcast_scale1 = fd.ops.broadcast(scale1, [False, False, True])
            reshaped_scaled_mat1 = fd.ops.div(reshaped_mat1, broadcast_scale1)
            reshaped_scaled_mat1 = fd.ops.clamp(reshaped_scaled_mat1, -FLOAT8_E4M3_MAX, FLOAT8_E4M3_MAX)

            scaled_mat1 = fd.ops.reshape(reshaped_scaled_mat1, [m_size, k_size])
            fp4_mat1 = fd.ops.cast(scaled_mat1, DataType.Float4_e2m1fn)
            fp8_scale1 = fd.ops.cast(scale1, DataType.Float8_e4m3fn)
            layout_fp8_scale1 = fd.ops.preprocess_grouped_matmul_input_sf(fp8_scale1, nv_offsets, nv_blocksf_offsets)
            out = fd.ops.cutlass_nvfp4_grouped_mm(
                fp4_mat1,
                nv_fp4_w,
                layout_fp8_scale1,
                nv_sf_w,
                nv_alpha,
                # NOTE: we might need to call contiguous on problem_sizes
                nv_problem_sizes,
                nv_offsets,
                nv_blocksf_offsets,
                DataType.BFloat16,
            )
            return out

        _register_nvfuser_translator(_nvfp4_grouped_mm_symbol, nvfp4_grouped_mm_translator)

Potential Memory Issues

The new `NVFP4InferenceGroupedSwiGLU` class computes auxiliary tensors (blockscale_offsets, problem_sizes) multiple times during forward passes. While there's an optimization to compute them once, the memory allocation patterns and tensor creation could lead to memory fragmentation or excessive memory usage during large-scale inference workloads.

    class NVFP4InferenceGroupedSwiGLU(nn.Module):
        """NVFP4 GroupedSwiGLU that efficiently reuses auxiliary tensor computations."""

        def __init__(
            self,
            gate_proj: NVFP4InferenceGroupedLinear,
            up_proj: NVFP4InferenceGroupedLinear,
            down_proj: NVFP4InferenceGroupedLinear,
        ):
            super().__init__()
            self.gate_proj = gate_proj
            self.up_proj = up_proj
            self.down_proj = down_proj

        def forward(self, hidden_states: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
            # Compute auxiliary tensors once for all three operations
            intermediate_features = self.gate_proj.out_features
            blockscale_offsets_gate, problem_sizes_gate = NVFP4InferenceGroupedLinear.compute_auxiliary_tensors(
                hidden_states, offsets, intermediate_features
            )

            gate_out = self.gate_proj(hidden_states, offsets, blockscale_offsets_gate, problem_sizes_gate)
            up_out = self.up_proj(hidden_states, offsets, blockscale_offsets_gate, problem_sizes_gate)

            intermediate = torch.nn.functional.silu(gate_out) * up_out

            # For down_proj, we need different problem_sizes (different output features)
            hidden_features = self.down_proj.out_features
            blockscale_offsets_down, problem_sizes_down = NVFP4InferenceGroupedLinear.compute_auxiliary_tensors(
                intermediate, offsets, hidden_features
            )

            return self.down_proj(intermediate, offsets, blockscale_offsets_down, problem_sizes_down)

        @staticmethod
        def from_grouped_swiglu(grouped_swiglu: GroupedSwiGLU, fqn: str | None = None) -> NVFP4InferenceGroupedSwiGLU:
            """Create an NVFP4InferenceGroupedSwiGLU from a GroupedSwiGLU.

            Args:
                grouped_swiglu (GroupedSwiGLU): The source GroupedSwiGLU.
                fqn (str or None): Fully qualified name. Currently unused; reserved for future use or compatibility.
            """
            gate_proj = NVFP4InferenceGroupedLinear.from_grouped_linear(grouped_swiglu.gate_proj)
            up_proj = NVFP4InferenceGroupedLinear.from_grouped_linear(grouped_swiglu.up_proj)
            down_proj = NVFP4InferenceGroupedLinear.from_grouped_linear(grouped_swiglu.down_proj)
            return NVFP4InferenceGroupedSwiGLU(gate_proj, up_proj, down_proj)
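
The activation scaling performed by the translator quoted under Error Handling Robustness (reshape into blocks of 16, per-block absolute maximum, divide by the FP4 maximum, clamp, then divide the block by its scale) can be traced on plain Python lists. This is a sketch of that arithmetic only, not of the nvFuser ops. The values 6.0 and 448.0 are the standard largest finite magnitudes of FP4 E2M1 and FP8 E4M3; the value of `FLOAT8_E4M3_EPS` is not shown in the excerpt, so `eps` is left as a caller-supplied parameter, and the function names are hypothetical.

```python
FLOAT4_E2M1_MAX = 6.0    # largest finite magnitude in FP4 E2M1
FLOAT8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (fn variant)
BLOCK = 16               # block width used by the reshape in the translator


def block_scales(row, eps):
    """Per-block scale: max(|x|) / FP4 max, clamped to [eps, FP8 max].

    `eps` stands in for FLOAT8_E4M3_EPS, whose value is not shown in the excerpt.
    """
    assert len(row) % BLOCK == 0
    scales = []
    for i in range(0, len(row), BLOCK):
        amax = max(abs(x) for x in row[i:i + BLOCK])
        scales.append(min(max(amax / FLOAT4_E2M1_MAX, eps), FLOAT8_E4M3_MAX))
    return scales


def scale_row(row, eps):
    """Divide each element by its block's scale and clamp, as done before the FP4 cast."""
    scales = block_scales(row, eps)
    scaled = []
    for i, x in enumerate(row):
        v = x / scales[i // BLOCK]
        scaled.append(min(max(v, -FLOAT8_E4M3_MAX), FLOAT8_E4M3_MAX))
    return scaled, scales
```

Clamping the scale from below keeps an all-zero block from dividing by zero, which is presumably why the translator clamps before the division.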

greptile-apps

2 files reviewed, no comments

wujingyue

I didn't review the nvfp4 stuff. Other changes LGTM!

wujingyue · 2025-11-21T20:48:25Z

    group_outs = []
-    for group_a, group_b in zip(a.split(group_sizes), b.unbind()):
-        group_outs.append(group_a @ group_b)
+    for idx, group_a in enumerate(a.split(group_sizes)):


wujingyue · 2025-11-21

I don't think this fallback implementation is necessary anymore. Lightning-AI/lightning-thunder#2721

But this can come as a different PR.

wujingyue · 2025-11-22

Never mind -- this is still necessary for torch <2.8. OOC, what's the minimum torch version nvFuser supports? cc @xwang233 and @nWEIdia

xwang233 · 2025-11-22

We only build against the latest stable and nightly.
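
For readers unfamiliar with the fallback under discussion, the removed lines in the diff (split the rows of `a` by `group_sizes`, multiply each slice by its own matrix from `b`, collect the results) can be mirrored without torch. The sketch below uses nested lists purely so that it runs anywhere; it is not the script's implementation, and it follows the removed `group_a @ group_b` form rather than the replacement loop, whose body the diff does not show.

```python
def matmul(x, y):
    """Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(xi * yj for xi, yj in zip(row, col)) for col in zip(*y)] for row in x]


def grouped_mm_fallback(a, b, group_sizes):
    """Split the rows of `a` by group_sizes and multiply group i by b[i]."""
    assert sum(group_sizes) == len(a) and len(group_sizes) == len(b)
    out, start = [], 0
    for size, group_b in zip(group_sizes, b):
        group_a = a[start:start + size]
        out.extend(matmul(group_a, group_b))  # concatenate per-group results
        start += size
    return out


a = [[1, 2], [3, 4], [5, 6]]
b = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
print(grouped_mm_fallback(a, b, [1, 2]))
```

A loop of this shape issues one matrix product per group, which is what a fused grouped matmul avoids; per the thread, the loop stays only as a fallback for torch <2.8.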


greptile-apps

2 files reviewed, no comments

tbqh · 2025-11-21T23:11:13Z

!test

greptile-apps

Additional Comments (1)

benchmarks/python/layers_for_inference_benchmark.py, line 550

style: imports from the test module thunder.tests.llama4_moe. Verify that this is intended to be part of thunder's public API when installed as a dependency, or whether a different public module should be used instead.

2 files reviewed, 1 comment

xwang233 · 2025-11-22T05:51:53Z

The latest internal pipeline run actually had two failures: jit_binary_distributed_tests_20_GB200 and jit_python_distributed_tests_20_GB200. The auto-merge workflow did not detect them because I missed the pagination of the CI status check and only checked the latest 30 statuses, which were all successful. I'm working on a fix for that.

~~I'm not sure if this PR directly caused the two failures. If so, please help revert the PR. Sorry about the inconvenience.~~ Failures seem unrelated to this PR.

This fixes a severe bug where the auto-merge workflow only checked the first 30 commit statuses, causing it to miss failures and incorrectly merge PRs with failing checks.

Root cause analysis:
- PR #5578 had 2 failed GB200 tests at 23:23-23:25 UTC
- By 03:29 UTC, 27+ new successful statuses pushed failures past position 30
- Workflow only fetched first page (30 items), saw 0 failures, and merged

Fixed 4 critical pagination issues:
1. listCommitStatusesForRef (line 140) - CRITICAL: Only saw 30 of 57 statuses
2. checks.listForRef (line 173) - Could miss failed checks if >30 exist
3. issues.listComments (line 349) - Wouldn't find status comment if >30 comments
4. pulls.list (line 64) - Could miss PR if >30 open PRs on branch

All API calls now use github.paginate() to retrieve complete results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

## Summary

Fixes a critical bug where the auto-merge workflow only fetched the first 30 results from GitHub API list operations, causing it to miss failed checks and incorrectly merge PRs.

## Root Cause

PR #5578 had 2 failed GB200 tests that occurred early in the CI run. By the time the auto-merge action ran 4+ hours later, 27 newer successful statuses had been created. Since the workflow used unpaginated API calls (default limit: 30 items), the failed statuses were pushed beyond the first page and never detected.

## Changes

Fixed 4 GitHub API calls to use `github.paginate()`:
1. `listCommitStatusesForRef` - Was only checking 30 of 57+ statuses
2. `checks.listForRef` - Could miss failed checks if >30 exist
3. `issues.listComments` - Could miss status comment if >30 comments
4. `pulls.list` - Could miss PR if >30 open PRs on branch

Also simplified the `pr_approved` check logic which was deriving approval status from `mergeable_state` in a confusing way. The workflow now shows the actual `mergeable_state` value in status comments for transparency.

## Impact

The auto-merge workflow will now correctly detect ALL failures regardless of how many statuses exist, preventing incorrect merges like #5578.

---------

Co-authored-by: Claude <noreply@anthropic.com>
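
The failure mode described above can be reproduced with a small simulation. This is not the workflow's code (the fix itself uses `github.paginate()` inside the workflow); `fetch_page` and `fetch_all` are hypothetical stand-ins for a list API, and the exact ordering of the 57 statuses is invented for the example. Only the page size of 30 and the counts of 57 statuses and 2 failures come from the description.

```python
PER_PAGE = 30  # default page size quoted in the description


def fetch_page(items, page, per_page=PER_PAGE):
    """Stand-in for one list-API call: return the 1-indexed page of items."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


def fetch_all(items, per_page=PER_PAGE):
    """Keep requesting pages until one comes back short or empty."""
    results, page = [], 1
    while True:
        batch = fetch_page(items, page, per_page)
        results.extend(batch)
        if len(batch) < per_page:
            return results
        page += 1


def count_failures(statuses):
    return sum(1 for s in statuses if s == "failure")


# Newest first; an invented arrangement in which both failures sit past position 30.
statuses = ["success"] * 30 + ["failure"] * 2 + ["success"] * 25

print(count_failures(fetch_page(statuses, 1)))  # what the unpaginated check saw
print(count_failures(fetch_all(statuses)))      # what a paginated check sees
```

The first call inspects only the newest 30 entries and reports no failures, while the paginated version walks all 57 and finds both.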

tbqh requested review from crcrpar and wujingyue November 21, 2025 20:40

greptile-apps Bot reviewed Nov 21, 2025

wujingyue approved these changes Nov 21, 2025

tbqh added 6 commits November 21, 2025 15:09

Pull latest benchmark_inference from lightning-thunder repo

b5e6175

Add references to lightning-thunder

4c6c47e

Pull thunder PR "Use torch._grouped_mm in eager mode"

e181595

Lightning-AI/lightning-thunder#2721

Pull thunder PR "Remove the --profile option"

b21af25

Lightning-AI/lightning-thunder#2715

Add SPDX header back to file

be28360

Simplify if statement

f5d05ae

tbqh force-pushed the inference_benchmark_Nov21 branch from 790d7d7 to f5d05ae Compare November 21, 2025 23:09

greptile-apps Bot reviewed Nov 21, 2025

tbqh added the enable-auto-merge label ("Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures") on Nov 21, 2025

greptile-apps Bot reviewed Nov 21, 2025

github-actions Bot merged commit abbbf4e into main Nov 22, 2025
60 of 63 checks passed

github-actions Bot removed the enable-auto-merge label ("Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures") on Nov 22, 2025

github-actions Bot deleted the inference_benchmark_Nov21 branch November 22, 2025 03:29

xwang233 mentioned this pull request Nov 22, 2025

Fix critical GitHub API pagination bugs in auto-merge workflow #5580

Merged
