NaN propagation option for bf16 and half16 (#1958)
Conversation
📝 Walkthrough

This PR introduces NaN-propagating variants of the max/min reduction operations for fp16/bf16 datatypes in TileLang. The feature adds a new optional `nan_propagate` setting.
Sequence diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant PythonAPI as TileLang Python API
    participant Compiler as Reduction Compiler
    participant Lowering as Lowering Pass
    participant Codegen as CUDA Codegen
    participant Runtime as CUDA Runtime
    User->>PythonAPI: reduce_max(buffer, out, nan_propagate=True)
    PythonAPI->>Compiler: Pass annotation {"nan_propagate": True}
    Compiler->>Compiler: Parse nan_propagate flag into ReduceOpNode
    alt nan_propagate=True && (fp16 or bf16)
        Lowering->>Lowering: Select tl::max_nan intrinsic
    else
        Lowering->>Lowering: Select tl::max function
    end
    Lowering->>Codegen: Emit reduction with chosen intrinsic
    alt tl::max_nan for fp16/bf16
        Codegen->>Codegen: Emit __hmax_nan with type conversions
    else
        Codegen->>Codegen: Emit fast_max fallback
    end
    Codegen->>Runtime: Compiled CUDA kernel
    Runtime->>Runtime: Execute NaN-aware max operation
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
📒 Files selected for processing (7)

- src/op/builtin.cc
- src/op/builtin.h
- src/op/reduce.cc
- src/target/codegen_cuda.cc
- src/tl_templates/cuda/reduce.h
- testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
- tilelang/transform/pass_config.py
```cpp
bool nan_propagate = true;
if (tvm::transform::PassContext::Current().defined()) {
  nan_propagate = tvm::transform::PassContext::Current()
                      ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
                      .value();
}
const bool is_fp16_or_bf16 = acc.dtype().is_float16() || acc.dtype().is_bfloat16();
if (type->isSum()) {
  return acc + rhs;
} else if (type->isAbsSum()) {
  return acc + Max(rhs, -rhs);
} else if (type->isMax()) {
  if (!nan_propagate && is_fp16_or_bf16) {
    return Call(acc.dtype(), tl::max_nan(), {acc, rhs});
  }
  return Max(acc, rhs);
} else if (type->isMin()) {
  if (!nan_propagate && is_fp16_or_bf16) {
    return Call(acc.dtype(), tl::min_nan(), {acc, rhs});
  }
  return Min(acc, rhs);
} else if (type->isAbsMax()) {
  if (!nan_propagate && is_fp16_or_bf16) {
    return Call(acc.dtype(), tl::max_nan(), {acc, tvm::abs(rhs)});
  }
  return Max(acc, tvm::abs(rhs));
```
🧩 Analysis chain
🌐 Web query: NVIDIA CUDA `__hmax` `__hmin` `__hmax_nan` `__hmin_nan` semantics NaN handling

💡 Result:

The NVIDIA CUDA half-precision (FP16) intrinsics `__hmax`, `__hmin`, `__hmax_nan`, and `__hmin_nan` compute the maximum and minimum of two `half` values with specific NaN-handling behavior, as defined in the official CUDA Math API documentation. All four compute `max(a, b)` as `(a > b) ? a : b` (or `min(a, b)` as `(a < b) ? a : b`) and treat `+0.0 > -0.0` when both inputs are zero.

- `__hmax(const half a, const half b)` / `__hmin(const half a, const half b)`: if either input is NaN, returns the other input; if both inputs are NaN, returns the canonical NaN.
- `__hmax_nan(const half a, const half b)` / `__hmin_nan(const half a, const half b)`: if either input is NaN, returns the canonical NaN (NaNs pass through).

These are `__host__ __device__` functions available by including `cuda_fp16.h`; they belong to the Half Comparison Functions group.

Citations:
- 1: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH____HALF__COMPARISON.html
- 2: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INTRINSIC__HALF.html
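The documented difference between the two intrinsic families can be modeled in plain Python (an illustrative sketch of the semantics above, not the actual CUDA intrinsics; the helper names are made up):

```python
import math

def hmax(a: float, b: float) -> float:
    """Model of __hmax: if either input is NaN, return the other input."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b

def hmax_nan(a: float, b: float) -> float:
    """Model of __hmax_nan: any NaN input yields NaN (NaN passes through)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b
```

For example, `hmax(float('nan'), 2.0)` returns `2.0`, while `hmax_nan(float('nan'), 2.0)` returns NaN.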
Invert the `nan_propagate` semantics throughout the NaN-handling feature.

The flag `nan_propagate = true` currently selects `__hmax`/`__hmin` (the NaN-ignoring functions), while `nan_propagate = false` selects `__hmax_nan`/`__hmin_nan` (the NaN-propagating functions). This inverts the flag's intended semantics: per the official CUDA documentation, `__hmax_nan` propagates NaNs and `__hmax` ignores them.

Fix in src/op/reduce.cc: invert the conditions at lines 124, 129, and 134 from `!nan_propagate` to `nan_propagate`, and swap the return values at lines 161, 163, and 165. Matching inversions are required in src/target/codegen_cuda.cc (lines 1451 and 1489) and in all test assertions. Correct the docstrings in src/op/builtin.h (lines 193–196) and tilelang/transform/pass_config.py (lines 167–169).
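The corrected selection logic can be sketched as standalone Python (a hypothetical helper, not TileLang code; it only illustrates the flag-to-intrinsic mapping the review asks for):

```python
def pick_max_intrinsic(dtype: str, nan_propagate: bool) -> str:
    """Map a reduce-max over `dtype` to an intrinsic name.

    Corrected semantics: nan_propagate=True selects the NaN-propagating
    __hmax_nan; False selects the NaN-ignoring __hmax. Non-half dtypes
    keep the generic fast_max path regardless of the flag.
    """
    if dtype not in ("float16", "bfloat16"):
        return "fast_max"
    return "__hmax_nan" if nan_propagate else "__hmax"
```

With this mapping, the flag reads the way its name suggests: true means NaNs propagate.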
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/op/reduce.cc (1)

112-117: Consider extracting the `nan_propagate` retrieval into a helper method.

The PassContext config lookup for `kReduceMaxMinNanPropagate` is duplicated in both `MakeReduce()` (lines 112-117) and `MakeCodegenReducer()` (lines 150-155). This creates a maintenance burden and a risk of divergence.

♻️ Suggested refactor: add a private helper method to `ReduceOpNode` or a free function:

```diff
+static bool GetNanPropagateSetting() {
+  if (tvm::transform::PassContext::Current().defined()) {
+    return tvm::transform::PassContext::Current()
+        ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
+        .value();
+  }
+  return true;
+}
+
 PrimExpr ReduceOpNode::MakeReduce(const PrimExpr &acc, const PrimExpr &b) const {
   PrimExpr rhs = b;
   if (acc->dtype != rhs->dtype) {
     rhs = Cast(acc->dtype, rhs);
   }
-  bool nan_propagate = true;
-  if (tvm::transform::PassContext::Current().defined()) {
-    nan_propagate = tvm::transform::PassContext::Current()
-                        ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
-                        .value();
-  }
+  bool nan_propagate = GetNanPropagateSetting();
```

Apply the same change to `MakeCodegenReducer()`. Also applies to: 150-155.

testing/python/language/test_tilelang_language_reduce_maxmin_nan.py (2)

60-67: Inconsistent negative assertions in bf16 tests.

`test_reduce_max_fp16_nan_propagate_false` (line 49) asserts `"MaxOpNan" not in src`, but `test_reduce_max_bf16_nan_propagate_false` omits this check. The same pattern applies to the `reduce_min` tests. Adding consistent negative assertions would catch cases where both variants accidentally appear.

💡 Suggested fix:

```diff
 def test_reduce_max_bf16_nan_propagate_false():
     k = _compile_cuda(
         _kernel_reduce_max(64, T.bfloat16),
         pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False},
     )
     src = k.get_kernel_source()
     assert "tl::MaxOp" in src
+    assert "MaxOpNan" not in src
     assert "__hmax(" in src
```

Apply similar changes to `test_reduce_min_bf16_nan_propagate_false`.

131-138: Missing bf16 tests for `reduce_absmax`.

The `reduce_absmax` tests only cover fp16. Consider adding bf16 tests for completeness, since bf16 uses the same NaN-handling code path.

💡 Suggested additions:

```python
def test_reduce_absmax_bf16_nan_propagate_default():
    k = _compile_cuda(_kernel_reduce_absmax(64, T.bfloat16))
    src = k.get_kernel_source()
    assert "tl::MaxOpNan" in src
    assert "__hmax_nan" in src


def test_reduce_absmax_bf16_nan_propagate_false():
    k = _compile_cuda(
        _kernel_reduce_absmax(64, T.bfloat16),
        pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False},
    )
    src = k.get_kernel_source()
    assert "tl::MaxOp" in src
    assert "__hmax(" in src
```
Inline comment in testing/python/language/test_tilelang_language_reduce_maxmin_nan.py (around lines 101-108): add a `test_reduce_min_bf16_nan_propagate_default` test that mirrors `test_reduce_min_bf16_nan_propagate_false` but uses the default pass config (i.e., call `_compile_cuda(_kernel_reduce_min(64, T.bfloat16))` without `pass_configs`), then call `get_kernel_source()` on the returned kernel and assert that the source contains `"tl::MinOpNan"` and `"__hmin_nan"`, verifying default bf16 NaN propagation for `reduce_min`.
📒 Files selected for processing (5)

- src/op/builtin.h
- src/op/reduce.cc
- src/target/codegen_cuda.cc
- testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
- tilelang/transform/pass_config.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tilelang/transform/pass_config.py
- src/op/builtin.h
- src/target/codegen_cuda.cc
Thanks @haoran35-jpg, would you mind explaining more about the motivation for introducing this?
Yes, thank you for the comment! It allows `max(...)`/`min(...)` in the frontend to be lowered to the NVIDIA intrinsics `__hmax_nan(...)`/`__hmin_nan(...)` when the config flag is set.
@haoran35-jpg, thanks, and I think one better solution is to add an annotation on the op itself rather than a global config.
Actionable comments posted: 1
Inline comment in src/op/reduce.cc (around lines 118-119): make the NaN-specific gating consistent between the thread-local reducer and the codegen reducer. In `MakeCodegenReducer()`, only return `tl::MaxOpNan`/`tl::MinOpNan` for the max/min/absmax reducers when `nan_propagate` is true and the accumulator dtype is half/bfloat16 (mirroring the `is_fp16_or_bf16` check in `MakeReduce()`, i.e. `acc.dtype().is_float16() || acc.dtype().is_bfloat16()`); otherwise return the regular `tl::MaxOp`/`tl::MinOp`, so the local (`MakeReduce`) and inter-thread (`MakeCodegenReducer`) phases behave the same for non-half dtypes.
📒 Files selected for processing (2)

- src/op/reduce.cc
- src/target/codegen_cuda.cc
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.cc
Replace the global `tl.reduce_maxmin_nan_propagate` pass-config with a per-op `nan_propagate` flag carried on the ReduceOp annotation map. The frontend exposes it as a kwarg on `T.reduce_max`/`min`/`absmax` (default False, preserving prior behavior). The CUDA codegen MinNode/MaxNode visitors are reverted to plain `__hmin`/`__hmax`; only the new `tl::max_nan`/`tl::min_nan` CallNode handler emits `__hmax_nan`/`__hmin_nan`, so non-reduce Min/Max ops are no longer silently rewritten. `ReduceOp::Lower` now errors early with a clear message if `nan_propagate=True` is requested for fp16/bf16 on a non-CUDA target, instead of emitting an undefined symbol on HIP/CPU. Tests are rewritten to parametrize on the kwarg and assert runtime NaN-propagation behavior, not just generated-source substrings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
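The per-op flag described above can be sketched as a lookup on the op's annotation map (a plain-Python stand-in for the ReduceOp annotations; the helper name is illustrative):

```python
def nan_propagate_of(annotations: dict) -> bool:
    """Read the per-op nan_propagate flag from an annotation map.

    Default False preserves the pre-existing NaN-ignoring behavior
    when the kwarg is not supplied at the call site.
    """
    return bool(annotations.get("nan_propagate", False))
```

Unlike a global pass-config, each reduce call carries its own setting, so two reductions in one kernel can differ.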
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)
120-164: ⚠️ Potential issue | 🟠 Major: use the NaN-aware combiner when merging back into `dst` with `clear=False`.

The new logic makes the local reduction and `AllReduce` honor `nan_propagate`, but the later `need_update` merge still uses plain `Max`/`Min`. In the public `clear=False` path, a NaN from either the existing output or the newly reduced value is dropped there, so `reduce_max/min/absmax(..., clear=False, nan_propagate=True)` can still return a non-NaN result.

Suggested fix:

```diff
-  } else if (this->type->isMax() || this->type->isAbsMax()) {
-    update = Max(dst_val, src_val);
-  } else if (this->type->isMin()) {
-    update = Min(dst_val, src_val);
+  } else if (this->type->isMax() || this->type->isMin() ||
+             this->type->isAbsMax()) {
+    update = this->MakeReduce(dst_val, src_val);
```

The new runtime tests only exercise `clear=True`, so this regression is currently unguarded. Also applies to: 470-485.
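The failure mode can be reproduced with a plain-Python model of the two combiners (illustrative only; `nan_max` and `ignore_max` stand in for the NaN-propagating and NaN-ignoring ops):

```python
import math

def nan_max(a, b):
    """NaN-propagating combiner (models tl::max_nan)."""
    return math.nan if math.isnan(a) or math.isnan(b) else (a if a > b else b)

def ignore_max(a, b):
    """NaN-ignoring combiner (models plain Max / __hmax)."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b

# The local reduction correctly propagates the NaN...
reduced = nan_max(0.5, math.nan)
# ...but merging into the existing output with the NaN-ignoring op
# (the clear=False path) silently drops it again.
merged_buggy = ignore_max(1.0, reduced)
merged_fixed = nan_max(1.0, reduced)
```

Here `merged_buggy` comes out as `1.0`, while `merged_fixed` stays NaN.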
📒 Files selected for processing (7)
- src/op/builtin.cc
- src/op/builtin.h
- src/op/reduce.cc
- src/op/reduce.h
- src/target/codegen_cuda.cc
- testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
- tilelang/language/reduce_op.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.cc
```cpp
if (nan_propagate && (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
    !TargetIsCuda(T.target)) {
  LOG(FATAL) << "ReduceOp: nan_propagate=True for fp16/bf16 max/min/absmax "
                "is only supported on CUDA targets (requires "
                "__hmax_nan/__hmin_nan intrinsics). Target was: "
             << T.target->str();
}
```
Scope the CUDA-only guard to the supported reduce kinds.

This currently aborts any fp16/bf16 reduction with `nan_propagate=True`, even though only max, min, and absmax ever take the NaN-aware path. A direct `T.reduce(..., "sum", nan_propagate=True)` becomes a target-dependent hard failure instead of a no-op flag.
Suggested fix:

```diff
-  if (nan_propagate && (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
+  if (nan_propagate &&
+      (this->type->isMax() || this->type->isMin() || this->type->isAbsMax()) &&
+      (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
       !TargetIsCuda(T.target)) {
```
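The narrowed guard amounts to the following check, sketched in Python (hypothetical names; only the gating logic is modeled, not the actual `ReduceOp::Lower` code):

```python
NAN_AWARE_KINDS = {"max", "min", "absmax"}

def check_nan_propagate_target(kind: str, dtype: str,
                               nan_propagate: bool, target: str) -> None:
    """Raise only when the NaN-aware path is actually requested off-CUDA."""
    if (nan_propagate
            and kind in NAN_AWARE_KINDS
            and dtype in ("float16", "bfloat16")
            and target != "cuda"):
        raise ValueError(
            f"nan_propagate=True for {dtype} {kind} requires a CUDA target, "
            f"got {target}")
```

With this scoping, a `"sum"` reduction with the flag set is a no-op on any target, while `"max"` on fp16 still fails fast off-CUDA.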
* register op, and add to build file
* update semantics of NaN for Max and Min ops in Reduce
* lower Max op to cutlass fast_max in the general case, __hmax for bf16 and __hmax_nan for fp16
* update semantics of NaN for Max and Min ops in Reduce in reduce.h
* develop code transforming ReduceOpNode in TIR and handling CallNode in the CUDA code generator's runtime template
* test file for reduce max/min NaN
* refine test case messages and explanations
* invert the semantics of the global config TL_REDUCE_MAXMIN_NAN_PROPAGATE
* lint fix
* refactor reduce max/min NaN propagation into a per-call annotation

---------
Co-authored-by: Haoran <chaoran@umich.edu>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This PR introduces an option to propagate NaN specifically for the data types bf16 and half, modifying the following files:
- src/op/builtin.h: declare `kReduceMaxMinNanPropagate` and the `tl::max_nan`/`tl::min_nan` ops.
- src/op/builtin.cc: register `tl.max_nan`/`tl.min_nan`.
- src/op/reduce.cc: for `AllReduce`, `MakeCodegenReducer()` returns `tl::MaxOpNan`, concatenated into `tl::AllReduce<tl::MaxOpNan, …>::run(...)`, with `tl::MaxOpNan` defined in the runtime template reduce.h. For `MakeReduce`, turn the reduction into a CallNode and rely on `tl::max_nan`/`tl::min_nan`.
- src/target/codegen_cuda.cc: emit `__hmax`/`__hmin` or `__hmax_nan`/`__hmin_nan` according to the PassContext. `VisitExpr_(CallNode)` handles `tl::max_nan`/`tl::min_nan` as TVM ops defined by the builtin.

Before this PR, MaxOp/MinOp with parameter type bfloat16 or half were forcefully lowered to `__hmax()`/`__hmin()`. Now the user can optionally lower them to `__hmax_nan()`/`__hmin_nan()` by setting `kReduceMaxMinNanPropagate` to false (the default is true). I feel this global config is more elegant and convenient to implement; not sure if this is okay.
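The two-phase structure described above (a per-thread `MakeReduce`, then an `AllReduce`-style combine of the partials) can be emulated in Python to show why both phases must use the NaN-propagating combiner (an illustrative model only, not TileLang code):

```python
import math
from functools import reduce

def max_nan(a, b):
    """Model of the NaN-propagating combiner used in both phases."""
    return math.nan if math.isnan(a) or math.isnan(b) else (a if a > b else b)

data = [0.1, 2.5, math.nan, 1.7]
# Phase 1: each "thread" reduces its strided slice locally.
partials = [reduce(max_nan, data[t::2], -math.inf) for t in range(2)]
# Phase 2: AllReduce-style combine of the per-thread partials.
result = reduce(max_nan, partials)
```

Because thread 0's slice contains the NaN, both its partial and the final result are NaN; with a NaN-ignoring combiner in either phase, the NaN would be lost.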
Disclaimer: AI was used in code refinement and test generation.
Summary by CodeRabbit
New Features
- Added a `nan_propagate` parameter to the `reduce_max()`, `reduce_min()`, and `reduce_absmax()` functions, enabling NaN-aware reduction behavior for fp16/bf16 tensors on CUDA targets.

Tests