
NaN propagation option for bf16 and half16 #1958

Merged
LeiWang1999 merged 10 commits into tile-ai:main from haoran35-jpg:main
Apr 13, 2026

Conversation

@haoran35-jpg
Contributor

@haoran35-jpg commented Mar 21, 2026

This PR introduces an option to propagate NaN specifically for the data types bf16 and half, which modifies the following files:

| File changed | Implementation |
|---|---|
| src/op/builtin.h | Declares kReduceMaxMinNanPropagate and the tl::max_nan / tl::min_nan ops. |
| src/op/builtin.cc | Registers the TIR intrinsics tl.max_nan / tl.min_nan. |
| src/op/reduce.cc | TIR layer: for AllReduce, MakeCodegenReducer() returns tl::MaxOpNan, which is concatenated into tl::AllReduce<tl::MaxOpNan, …>::run(...), with tl::MaxOpNan defined in the runtime template reduce.h. For MakeReduce, the reduction is turned into a CallNode that relies on tl::max_nan / tl::min_nan. |
| src/target/codegen_cuda.cc | VisitExpr_(MaxNode/MinNode) chooses between __hmax/__hmin and __hmax_nan/__hmin_nan according to the PassContext. VisitExpr_(CallNode) handles tl::max_nan / tl::min_nan as TVM ops defined by the builtin. |
| src/op/reduce.h | Defines the runtime templates for MaxOpNan and MinOpNan. |
| tilelang/transform/pass_config.py | Python layer: a PassConfigKey entry that toggles the NaN propagation functionality. |
| testing/python/language/test_tilelang_language_reduce_maxmin_nan.py | A test file for the NaN functionality. |

Before this PR, MaxOp/MinOp with parameter types bfloat16 and half were always lowered to __hmax()/__hmin(). Now the user can optionally lower them to __hmax_nan()/__hmin_nan() by setting kReduceMaxMinNanPropagate to false (the default is true). I feel this global config is more elegant and convenient to implement, but I'm not sure if this is okay.
Disclaimer: AI was used in code refinement and test generation.
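For readers unfamiliar with the two intrinsic families, the semantic difference can be sketched in plain Python. This is an illustration of the documented CUDA behavior, not TileLang code; `hmax`/`hmax_nan` here are stand-ins for `__hmax`/`__hmax_nan`:

```python
import math

def hmax(a: float, b: float) -> float:
    # Emulates CUDA __hmax: a NaN input is ignored and the other
    # operand is returned; NaN results only when both inputs are NaN.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)

def hmax_nan(a: float, b: float) -> float:
    # Emulates CUDA __hmax_nan: any NaN input propagates to the result.
    if math.isnan(a) or math.isnan(b):
        return float("nan")
    return max(a, b)

print(hmax(float("nan"), 2.0))      # 2.0  (NaN ignored)
print(hmax_nan(float("nan"), 2.0))  # nan  (NaN propagated)
```

The `__hmin`/`__hmin_nan` pair behaves the same way, with `min` in place of `max`.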

Summary by CodeRabbit

New Features

  • Added nan_propagate parameter to reduce_max(), reduce_min(), and reduce_absmax() functions, enabling NaN-aware reduction behavior for fp16/bf16 tensors on CUDA targets.

Tests

  • Added comprehensive test suite validating NaN-propagation semantics and code generation for reduction primitives.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

📝 Walkthrough

Walkthrough

This PR introduces NaN-propagating variants of max/min reduction operations for fp16/bf16 datatypes in TileLang. The feature adds a new optional nan_propagate parameter that enables CUDA-specific __hmax_nan/__hmin_nan intrinsics, with corresponding lowering, code generation, and Python API changes spanning the compiler pipeline.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Builtin Intrinsic Declarations | src/op/builtin.h, src/op/builtin.cc | Added two new TIR intrinsics tl.max_nan() and tl.min_nan() with 2 inputs each, marked as pure effects for CUDA-style NaN-aware max/min operations. |
| Reduction Operator Updates | src/op/reduce.h, src/op/reduce.cc | Added a nan_propagate boolean field to ReduceOpNode, parsing logic in the constructor, conditional lowering to the new NaN intrinsics for fp16/bf16 when enabled, and target-specific validation requiring CUDA support. |
| CUDA Code Generation | src/target/codegen_cuda.cc, src/tl_templates/cuda/reduce.h | Implemented specialized CUDA codegen for tl.max_nan/tl.min_nan with type-specific intrinsic emission (__hmax_nan/__hmin_nan for f16/bf16), and added MaxOpNan/MinOpNan template functors in the reduce templates. |
| Python API | tilelang/language/reduce_op.py | Extended reduce_max(), reduce_min(), and reduce_absmax() with an optional nan_propagate: bool = False parameter, threaded through the internal reduce() function via the annotation mechanism. |
| Tests | testing/python/language/test_tilelang_language_reduce_maxmin_nan.py | Added a new CUDA-only test module validating codegen correctness (checking for NaN-specific intrinsic/reducer strings) and runtime behavior (NaN propagation vs. non-propagation) for float16 and bfloat16 reductions. |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant PythonAPI as TileLang Python API
    participant Compiler as Reduction Compiler
    participant Lowering as Lowering Pass
    participant Codegen as CUDA Codegen
    participant Runtime as CUDA Runtime

    User->>PythonAPI: reduce_max(buffer, out, nan_propagate=True)
    PythonAPI->>Compiler: Pass annotation {"nan_propagate": True}
    Compiler->>Compiler: Parse nan_propagate flag into ReduceOpNode
    alt nan_propagate=True && (fp16 or bf16)
        Lowering->>Lowering: Select tl::max_nan intrinsic
    else
        Lowering->>Lowering: Select tl::max function
    end
    Lowering->>Codegen: Emit reduction with chosen intrinsic
    alt tl::max_nan for fp16/bf16
        Codegen->>Codegen: Emit __hmax_nan with type conversions
    else
        Codegen->>Codegen: Emit fast_max fallback
    end
    Codegen->>Runtime: Compiled CUDA kernel
    Runtime->>Runtime: Execute NaN-aware max operation

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • bucket-xv

Poem

🐰 Hops through the max and min so fine,
With NaN-aware logic that's divine,
CUDA's __h intrinsics now aligned,
Float16 and bfloat16 perfectly designed!
Tests hopping in to verify the way,
A fuzzy feature coded today!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a NaN propagation option for bf16 and float16 types. |


@haoran35-jpg
Contributor Author

@#1844

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/op/reduce.cc`:
- Around line 112-137: The nan_propagate boolean is used with inverted
semantics: change condition checks that currently use !nan_propagate to use
nan_propagate and swap the branches that return NaN-propagating ops vs
NaN-ignoring ops. Concretely, in the reducer logic around the acc/rhs handling
(references: variable nan_propagate, kReduceMaxMinNanPropagate, and the branches
returning Call(acc.dtype(), tl::max_nan()/tl::min_nan(), ...) vs
Max/Min/tvm::abs calls) flip the conditions so that nan_propagate selects the
Call(... tl::*_nan()) (NaN-propagating) paths and the else selects Max/Min
(NaN-ignoring). Apply the same inversion in the CUDA codegen spots that choose
__hmax/__hmax_nan/__hmin/__hmin_nan (matching codegen_cuda symbols) and update
related tests to expect the corrected semantics; also update the docstrings that
describe nan_propagate in builtin.h and pass_config.py to reflect that true
means NaN-propagate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c7e56a0a-abf7-4691-969e-7ea8999118da

📥 Commits

Reviewing files that changed from the base of the PR and between 9eba9fe and dee042b.

📒 Files selected for processing (7)
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/op/reduce.cc
  • src/target/codegen_cuda.cc
  • src/tl_templates/cuda/reduce.h
  • testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
  • tilelang/transform/pass_config.py

Comment thread src/op/reduce.cc Outdated
Comment on lines 112 to 137
bool nan_propagate = true;
if (tvm::transform::PassContext::Current().defined()) {
nan_propagate = tvm::transform::PassContext::Current()
->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
.value();
}
const bool is_fp16_or_bf16 = acc.dtype().is_float16() || acc.dtype().is_bfloat16();
if (type->isSum()) {
return acc + rhs;
} else if (type->isAbsSum()) {
return acc + Max(rhs, -rhs);
} else if (type->isMax()) {
if (!nan_propagate && is_fp16_or_bf16) {
return Call(acc.dtype(), tl::max_nan(), {acc, rhs});
}
return Max(acc, rhs);
} else if (type->isMin()) {
if (!nan_propagate && is_fp16_or_bf16) {
return Call(acc.dtype(), tl::min_nan(), {acc, rhs});
}
return Min(acc, rhs);
} else if (type->isAbsMax()) {
if (!nan_propagate && is_fp16_or_bf16) {
return Call(acc.dtype(), tl::max_nan(), {acc, tvm::abs(rhs)});
}
return Max(acc, tvm::abs(rhs));
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the actual code in reduce.cc at the specified lines
fd -t f reduce.cc | head -5

Repository: tile-ai/tilelang

Length of output: 140


🏁 Script executed:

# Read the reduce.cc file at the specified lines to understand the current state
cat -n src/op/reduce.cc | sed -n '100,170p'

Repository: tile-ai/tilelang

Length of output: 2930


🏁 Script executed:

# Find the reduce.cc file in the repository
find . -name "reduce.cc" -type f 2>/dev/null

Repository: tile-ai/tilelang

Length of output: 78


🏁 Script executed:

# List files in src/op directory to understand structure
ls -la src/op/ 2>/dev/null | head -20 || echo "Directory not found, searching for op directories"
find . -type d -name "op" 2>/dev/null | head -5

Repository: tile-ai/tilelang

Length of output: 1230


🏁 Script executed:

# Search for reduce.cc using git
git ls-files | grep reduce.cc

Repository: tile-ai/tilelang

Length of output: 140


🏁 Script executed:

# Get repository structure
git ls-files | grep -E "(op|reduce)" | head -20

Repository: tile-ai/tilelang

Length of output: 1081


🏁 Script executed:

# Check if the file exists by listing all C++ files
git ls-files "*.cc" | grep -i reduce

Repository: tile-ai/tilelang

Length of output: 199


🏁 Script executed:

# List all files to understand repo structure
git ls-files | head -30

Repository: tile-ai/tilelang

Length of output: 733


🏁 Script executed:

# Check if this is a TVM repository and find reduce-related files
rg "tl::max_nan|tl::min_nan" --type cc -l

Repository: tile-ai/tilelang

Length of output: 86


🏁 Script executed:

# Search for MaxOpNan/MinOpNan patterns
rg "MaxOpNan|MinOpNan" -l

Repository: tile-ai/tilelang

Length of output: 175


🏁 Script executed:

# Search for nan_propagate variable
rg "nan_propagate" -l

Repository: tile-ai/tilelang

Length of output: 222


🏁 Script executed:

# Try to find any reduce-related operations
rg "type->isSum|type->isMax" -l | head -10

Repository: tile-ai/tilelang

Length of output: 76


🏁 Script executed:

# Search across all files for the specific pattern
rg "is_fp16_or_bf16" -A 5 -B 5

Repository: tile-ai/tilelang

Length of output: 1527


🏁 Script executed:

# Search for the exact snippet
rg "kReduceMaxMinNanPropagate" -l

Repository: tile-ai/tilelang

Length of output: 138


🏁 Script executed:

# Search for reduce operations with nan
rg "reduce.*nan|Reduce.*Nan" -i -l | head -10

Repository: tile-ai/tilelang

Length of output: 240


🏁 Script executed:

# List repository structure more broadly
ls -la

Repository: tile-ai/tilelang

Length of output: 2819


🏁 Script executed:

# Check git log to understand this is a code review repository
git log --oneline -5 2>/dev/null || echo "Not a git repo"
git status 2>/dev/null || echo "Unable to get status"

Repository: tile-ai/tilelang

Length of output: 314


🌐 Web query:

NVIDIA CUDA __hmax __hmin __hmax_nan __hmin_nan semantics NaN handling

💡 Result:

The NVIDIA CUDA half-precision (FP16) intrinsic functions hmax, hmin, hmax_nan, and hmin_nan compute the maximum and minimum of two half values with specific NaN handling behaviors, as defined in the official CUDA Math API documentation. All functions calculate max(a, b) as (a > b) ? a : b for hmax/hmax_nan, and min(a, b) as (a < b) ? a : b for hmin/hmin_nan. They treat +0.0 > -0.0 when both inputs are zero. - hmax(const half a, const half b): If either input is NaN, returns the other input. If both are NaNs, returns canonical NaN. - hmin(const half a, const half b): If either input is NaN, returns the other input. If both are NaNs, returns canonical NaN. - hmax_nan(const half a, const half b): If either input is NaN, returns canonical NaN (NaNs pass through). - hmin_nan(const half a, const half b): If either input is NaN, returns canonical NaN (NaNs pass through). These are host device functions available by including cuda_fp16.h. They are part of the Half Comparison Functions group.



🌐 Web query:

CUDA __hmax_nan __hmin_nan half precision NaN propagation official documentation

💡 Result:

hmax_nan and hmin_nan are CUDA half-precision (half) intrinsic functions for computing maximum and minimum, respectively, with NaN propagation (NaNs pass through). These are defined in the official CUDA Math API documentation (section 4.3 Half Comparison Functions): - host device half hmax_nan(const half a, const half b): Calculates half max(a, b) as (a > b) ? a : b. If either input is NaN, returns canonical NaN. If both inputs are 0.0, +0.0 > -0.0. - host device half hmin_nan(const half a, const half b): Calculates half min(a, b) as (a < b) ? a : b. If either input is NaN, returns canonical NaN. If both inputs are 0.0, +0.0 > -0.0. To use them, include <cuda_fp16.h>. They differ from non-_nan variants (hmax, hmin), which return the other input if one is NaN (and canonical NaN if both are NaNs). Canonical NaN is defined as CUDART_NAN_FP16 = ushort_as_half((unsigned short)0x7FFFU). These functions are available in recent CUDA versions (e.g., 11.5+ documented, up to 13.2 as of 2025 docs). NaN handling in CUDA half follows IEEE 754 with specifics: operations propagate quiet NaNs, no signaling NaNs.



🌐 Web query:

site:docs.nvidia.com CUDA half precision comparison max min __hmax __hmin

💡 Result:

In CUDA, for half precision (half) floating-point numbers, the max and min comparison intrinsics are hmax and hmin, defined in cuda_fp16.h. host device half hmax(const half a, const half b) Calculates half maximum: returns a if a > b, else b. - If either input is NaN, returns the other input. - If both NaN, returns canonical NaN. - For zeros: +0.0 > -0.0. host device half hmin(const half a, const half b) Calculates half minimum: returns a if a < b, else b. - Same NaN and zero rules as hmax. Variants: - hmax_nan and hmin_nan: NaNs pass through (return canonical NaN if either is NaN). These are available in device code; host availability since recent CUDA versions (e.g., 12.x docs show host device). Usage example: #include <cuda_fp16.h> device half result = hmax(half1, half2);



🌐 Web query:

CUDA math API __half comparison NaN behavior "max_nan" "min_nan"

💡 Result:

In the CUDA Math API for half (half-precision floating-point), the min/max functions have specific NaN propagation behaviors defined in the official documentation (cuda_fp16.h functions in Half Comparison Functions section). Standard max/min: - hmax(const half a, const half b): Returns max(a, b) as (a > b) ? a : b. - If either input is NaN, returns the other input. - If both are NaNs, returns canonical NaN. - hmin(const half a, const half b): Returns min(a, b) as (a < b) ? a : b. - If either input is NaN, returns the other input. - If both are NaNs, returns canonical NaN. "NaN pass-through" variants (named *_nan): - hmax_nan(const half a, const half b): Same max logic. - If either input is NaN, returns canonical NaN (NaNs pass through). - hmin_nan(const half a, const half b): Same min logic. - If either input is NaN, returns canonical NaN (NaNs pass through). For comparison operators and functions like heq, hlt, etc.: - NaN inputs generate false results (standard IEEE-754 ordered comparison behavior: any comparison involving NaN is false). These are available as host device functions. Include <cuda_fp16.h>. Note that +0.0 > -0.0 in all cases. This matches patterns in bfloat16 and half2 variants.



🏁 Script executed:

# Check CUDA codegen handling of max_nan/min_nan
cat -n src/target/codegen_cuda.cc | grep -A 15 -B 5 "max_nan\|min_nan" | head -60

Repository: tile-ai/tilelang

Length of output: 2800


🏁 Script executed:

# Check for MaxOp and MaxOpNan in CUDA codegen
rg "MaxOp|MinOp" src/target/codegen_cuda.cc -A 5 -B 2 | head -100

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Look at builtin.h/builtin.cc to understand max_nan/min_nan definitions
cat -n src/op/builtin.h | grep -A 10 -B 2 "max_nan\|min_nan" | head -50

Repository: tile-ai/tilelang

Length of output: 1683


🏁 Script executed:

# Check the test file to understand expected behavior
cat -n testing/python/language/test_tilelang_language_reduce_maxmin_nan.py | head -100

Repository: tile-ai/tilelang

Length of output: 3505


🏁 Script executed:

# Check pass_config to understand the flag definition
cat -n tilelang/transform/pass_config.py | grep -A 10 -B 2 "kReduceMaxMinNanPropagate\|nan_propagate"

Repository: tile-ai/tilelang

Length of output: 801


🏁 Script executed:

# Get more context on the test to see expected NaN behavior
wc -l testing/python/language/test_tilelang_language_reduce_maxmin_nan.py

Repository: tile-ai/tilelang

Length of output: 131


🏁 Script executed:

# Read the entire test file
cat testing/python/language/test_tilelang_language_reduce_maxmin_nan.py

Repository: tile-ai/tilelang

Length of output: 4193


🏁 Script executed:

# Search for tl::max_nan and tl::min_nan definitions/usage
rg "tl::max_nan|tl::min_nan" -B 3 -A 3

Repository: tile-ai/tilelang

Length of output: 1586


🏁 Script executed:

# Check reduce.h template for max_nan/min_nan usage
cat src/tl_templates/cuda/reduce.h | grep -A 10 -B 10 "max_nan\|min_nan"

Repository: tile-ai/tilelang

Length of output: 1484


🏁 Script executed:

# Check builtin.cc for max_nan/min_nan implementations
rg "max_nan|min_nan" src/op/builtin.cc -A 5 -B 2

Repository: tile-ai/tilelang

Length of output: 555


Invert the nan_propagate semantics throughout the NaN-handling feature.

The flag nan_propagate = true currently uses __hmax / __hmin (NaN-ignoring functions), while nan_propagate = false uses __hmax_nan / __hmin_nan (NaN-propagating functions). This inverts the flag's intended semantics: per official CUDA documentation, __hmax_nan propagates NaNs and __hmax ignores them.

Fix in src/op/reduce.cc: Invert conditions at lines 124, 129, 134 from !nan_propagate to nan_propagate, and swap return values at lines 161, 163, 165. Matching inversions are required in src/target/codegen_cuda.cc (lines 1451, 1489) and all test assertions. Correct the docstrings in src/op/builtin.h (lines 193–196) and tilelang/transform/pass_config.py (lines 167–169).
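The inversion the review describes can be modeled with a minimal Python sketch. The helper names are hypothetical; the actual logic lives in the C++ sources listed above:

```python
def select_max_intrinsic(nan_propagate: bool, is_fp16_or_bf16: bool) -> str:
    # As submitted in the PR: `not nan_propagate` selects the
    # NaN-propagating intrinsic, which inverts the flag's meaning.
    if not nan_propagate and is_fp16_or_bf16:
        return "__hmax_nan"
    return "__hmax"

def select_max_intrinsic_fixed(nan_propagate: bool, is_fp16_or_bf16: bool) -> str:
    # Corrected semantics: nan_propagate=True selects __hmax_nan.
    if nan_propagate and is_fp16_or_bf16:
        return "__hmax_nan"
    return "__hmax"

print(select_max_intrinsic(True, True))        # __hmax      (flag inverted)
print(select_max_intrinsic_fixed(True, True))  # __hmax_nan  (as intended)
```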

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/op/reduce.cc` around lines 112 - 137, The nan_propagate boolean is used
with inverted semantics: change condition checks that currently use
!nan_propagate to use nan_propagate and swap the branches that return
NaN-propagating ops vs NaN-ignoring ops. Concretely, in the reducer logic around
the acc/rhs handling (references: variable nan_propagate,
kReduceMaxMinNanPropagate, and the branches returning Call(acc.dtype(),
tl::max_nan()/tl::min_nan(), ...) vs Max/Min/tvm::abs calls) flip the conditions
so that nan_propagate selects the Call(... tl::*_nan()) (NaN-propagating) paths
and the else selects Max/Min (NaN-ignoring). Apply the same inversion in the
CUDA codegen spots that choose __hmax/__hmax_nan/__hmin/__hmin_nan (matching
codegen_cuda symbols) and update related tests to expect the corrected
semantics; also update the docstrings that describe nan_propagate in builtin.h
and pass_config.py to reflect that true means NaN-propagate.

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
src/op/reduce.cc (1)

112-117: Consider extracting nan_propagate retrieval to a helper method.

The PassContext config lookup for kReduceMaxMinNanPropagate is duplicated in both MakeReduce() (lines 112-117) and MakeCodegenReducer() (lines 150-155). This creates a maintenance burden and risk of divergence.

♻️ Suggested refactor

Add a private helper method to ReduceOpNode or a free function:

+static bool GetNanPropagateSetting() {
+  if (tvm::transform::PassContext::Current().defined()) {
+    return tvm::transform::PassContext::Current()
+        ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
+        .value();
+  }
+  return true;
+}
+
 PrimExpr ReduceOpNode::MakeReduce(const PrimExpr &acc,
                                   const PrimExpr &b) const {
   PrimExpr rhs = b;
   if (acc->dtype != rhs->dtype) {
     rhs = Cast(acc->dtype, rhs);
   }
-  bool nan_propagate = true;
-  if (tvm::transform::PassContext::Current().defined()) {
-    nan_propagate = tvm::transform::PassContext::Current()
-                        ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
-                        .value();
-  }
+  bool nan_propagate = GetNanPropagateSetting();

Apply same change to MakeCodegenReducer().

Also applies to: 150-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/op/reduce.cc` around lines 112 - 117, Extract the PassContext lookup for
kReduceMaxMinNanPropagate into a single helper (either a private method on
ReduceOpNode like ReduceOpNode::GetNanPropagateFromPassContext() or a
static/free function e.g. GetReduceNanPropagate()) and replace the duplicated
blocks inside MakeReduce() and MakeCodegenReducer() with calls to that helper;
the helper should return a bool defaulting to true by reading
tvm::transform::PassContext::Current()->GetConfig<Bool>(kReduceMaxMinNanPropagate,
Bool(true)).value() and be used in both locations to avoid divergence.
testing/python/language/test_tilelang_language_reduce_maxmin_nan.py (2)

60-67: Inconsistent negative assertions in bf16 tests.

test_reduce_max_fp16_nan_propagate_false (line 49) asserts "MaxOpNan" not in src, but test_reduce_max_bf16_nan_propagate_false omits this check. Same pattern for reduce_min tests. Adding consistent negative assertions would catch cases where both variants accidentally appear.

💡 Suggested fix
 def test_reduce_max_bf16_nan_propagate_false():
     k = _compile_cuda(
         _kernel_reduce_max(64, T.bfloat16),
         pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False},
     )
     src = k.get_kernel_source()
     assert "tl::MaxOp" in src
+    assert "MaxOpNan" not in src
     assert "__hmax(" in src

Apply similar changes to test_reduce_min_bf16_nan_propagate_false.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/language/test_tilelang_language_reduce_maxmin_nan.py` around
lines 60 - 67, Add the missing negative assertions to the bfloat16 tests to
mirror the fp16 tests: in test_reduce_max_bf16_nan_propagate_false ensure the
compiled kernel source (src) does not contain "MaxOpNan" (similar to
test_reduce_max_fp16_nan_propagate_false), and in
test_reduce_min_bf16_nan_propagate_false ensure src does not contain "MinOpNan"
(matching the fp16 counterpart); update the assertions in those functions so
each checks both the positive expected op/implementation strings and the
corresponding "...OpNan" absence.

131-138: Missing bf16 tests for reduce_absmax.

The reduce_absmax tests only cover fp16. Consider adding bf16 tests for completeness, since bf16 uses the same NaN-handling code path.

💡 Suggested additions
def test_reduce_absmax_bf16_nan_propagate_default():
    k = _compile_cuda(_kernel_reduce_absmax(64, T.bfloat16))
    src = k.get_kernel_source()
    assert "tl::MaxOpNan" in src
    assert "__hmax_nan" in src


def test_reduce_absmax_bf16_nan_propagate_false():
    k = _compile_cuda(
        _kernel_reduce_absmax(64, T.bfloat16),
        pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False},
    )
    src = k.get_kernel_source()
    assert "tl::MaxOp" in src
    assert "__hmax(" in src
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/language/test_tilelang_language_reduce_maxmin_nan.py` around
lines 131 - 138, Add parallel bf16 tests for reduce_absmax mirroring the fp16
cases: create test_reduce_absmax_bf16_nan_propagate_default() that compiles
_kernel_reduce_absmax(64, T.bfloat16) via _compile_cuda and asserts the source
contains "tl::MaxOpNan" and "__hmax_nan", and create
test_reduce_absmax_bf16_nan_propagate_false() that compiles with
pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False} and
asserts the source contains "tl::MaxOp" and "__hmax("; use the same helper names
(_compile_cuda, _kernel_reduce_absmax, T.bfloat16,
tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE) to mirror the existing
fp16 tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@testing/python/language/test_tilelang_language_reduce_maxmin_nan.py`:
- Around line 101-108: Add a new test named
test_reduce_min_bf16_nan_propagate_default that mirrors
test_reduce_min_bf16_nan_propagate_false but uses the default pass config (i.e.,
call _compile_cuda(_kernel_reduce_min(64, T.bfloat16)) without pass_configs),
then call get_kernel_source() on the returned kernel and assert that the source
contains "tl::MinOpNan" and "__hmin_nan" to verify bf16 default NaN-propagation
for reduce_min.

---

Nitpick comments:
In `@src/op/reduce.cc`:
- Around line 112-117: Extract the PassContext lookup for
kReduceMaxMinNanPropagate into a single helper (either a private method on
ReduceOpNode like ReduceOpNode::GetNanPropagateFromPassContext() or a
static/free function e.g. GetReduceNanPropagate()) and replace the duplicated
blocks inside MakeReduce() and MakeCodegenReducer() with calls to that helper;
the helper should return a bool defaulting to true by reading
tvm::transform::PassContext::Current()->GetConfig<Bool>(kReduceMaxMinNanPropagate,
Bool(true)).value() and be used in both locations to avoid divergence.

In `@testing/python/language/test_tilelang_language_reduce_maxmin_nan.py`:
- Around line 60-67: Add the missing negative assertions to the bfloat16 tests
to mirror the fp16 tests: in test_reduce_max_bf16_nan_propagate_false ensure the
compiled kernel source (src) does not contain "MaxOpNan" (similar to
test_reduce_max_fp16_nan_propagate_false), and in
test_reduce_min_bf16_nan_propagate_false ensure src does not contain "MinOpNan"
(matching the fp16 counterpart); update the assertions in those functions so
each checks both the positive expected op/implementation strings and the
corresponding "...OpNan" absence.
- Around line 131-138: Add parallel bf16 tests for reduce_absmax mirroring the
fp16 cases: create test_reduce_absmax_bf16_nan_propagate_default() that compiles
_kernel_reduce_absmax(64, T.bfloat16) via _compile_cuda and asserts the source
contains "tl::MaxOpNan" and "__hmax_nan", and create
test_reduce_absmax_bf16_nan_propagate_false() that compiles with
pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False} and
asserts the source contains "tl::MaxOp" and "__hmax("; use the same helper names
(_compile_cuda, _kernel_reduce_absmax, T.bfloat16,
tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE) to mirror the existing
fp16 tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 670b319c-70b0-4e6c-9226-1b11611dde1f

📥 Commits

Reviewing files that changed from the base of the PR and between dee042b and 341a197.

📒 Files selected for processing (5)
  • src/op/builtin.h
  • src/op/reduce.cc
  • src/target/codegen_cuda.cc
  • testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
  • tilelang/transform/pass_config.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tilelang/transform/pass_config.py
  • src/op/builtin.h
  • src/target/codegen_cuda.cc

Comment thread testing/python/language/test_tilelang_language_reduce_maxmin_nan.py Outdated
@LeiWang1999
Member

Thanks @haoran35-jpg , would you mind explain more for introducing kReduceMaxMinNanPropagate ? sorry I didn't get the point.

@haoran35-jpg
Contributor Author

Yes, thank you for the comment! It allows max(...)/min(...) in the frontend to be lowered to the NVIDIA intrinsics __hmax_nan(...)/__hmin_nan(...) when kReduceMaxMinNanPropagate is set to true; those two functions return NaN when either of the parameters is NaN. This differs from the __hmax(...)/__hmin(...) intrinsics, where NaN is returned only when both parameters are NaN. This could make it more convenient and precise to study NaN propagation. I'm still a rookie trying to learn this project, so I'm not sure if this is the best way to do it. Thank you for your time!
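As a host-side analogy, NumPy exposes the same pair of semantics: np.maximum propagates NaN through a reduction while np.fmax ignores it. This is only an analogy for intuition, not the CUDA code path:

```python
import numpy as np

data = np.array([0.5, np.nan, 2.0, -1.0], dtype=np.float16)

# np.maximum propagates NaN through the reduction,
# analogous to __hmax_nan:
print(np.maximum.reduce(data))  # nan

# np.fmax ignores NaN operands, analogous to __hmax:
print(np.fmax.reduce(data))     # 2.0
```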

@LeiWang1999
Member

@haoran35-jpg, thanks, and I think a better solution is to add an annotation on the reduce tile op (in the frontend, something like T.reduce_max(nan_propagate=True)) instead of introducing a pass config.

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/op/reduce.cc`:
- Around line 118-119: Make the NaN-specific gating consistent between the
thread-local reducer and the codegen reducer: in MakeCodegenReducer() only
return tl::MaxOpNan / tl::MinOpNan for the max/min/absmax reducers when
nan_propagate is true AND the accumulator dtype is half/bfloat (mirror the
is_fp16_or_bf16 check in MakeReduce() — i.e., use acc.dtype().is_float16() ||
acc.dtype().is_bfloat16()); otherwise return the regular tl::MaxOp / tl::MinOp
so both local (MakeReduce) and inter-thread (MakeCodegenReducer) phases behave
the same for non-half dtypes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ececcd74-7a96-4fd5-a079-70de022f3ae5

📥 Commits

Reviewing files that changed from the base of the PR and between 341a197 and 208ade4.

📒 Files selected for processing (2)
  • src/op/reduce.cc
  • src/target/codegen_cuda.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_cuda.cc

Comment thread src/op/reduce.cc Outdated
@kurisu6912 kurisu6912 linked an issue Mar 30, 2026 that may be closed by this pull request
1 task
Replace the global tl.reduce_maxmin_nan_propagate pass-config with a
per-op nan_propagate flag carried on the ReduceOp annotation map.
Frontend exposes it as a kwarg on T.reduce_max/min/absmax (default
False, preserving prior behavior). The CUDA codegen MinNode/MaxNode
visitors are reverted to plain __hmin/__hmax; only the new tl::max_nan
/tl::min_nan CallNode handler emits __hmax_nan/__hmin_nan, so non-reduce
Min/Max ops are no longer silently rewritten. ReduceOp::Lower now errors
early with a clear message if nan_propagate=True is requested for fp16
/bf16 on a non-CUDA target, instead of emitting an undefined symbol on
HIP/CPU. Tests rewritten to parametrize on the kwarg and assert runtime
NaN-propagation behavior, not just generated source substrings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)

120-164: ⚠️ Potential issue | 🟠 Major

Use the NaN-aware combiner when merging back into dst with clear=False.

The new logic makes the local reduction and AllReduce honor nan_propagate, but the later need_update merge still uses plain Max / Min. In the public clear=False path, a NaN from either the existing output or the newly reduced value is dropped there, so reduce_max/min/absmax(..., clear=False, nan_propagate=True) can still return a non-NaN result.

Suggested fix
-        } else if (this->type->isMax() || this->type->isAbsMax()) {
-          update = Max(dst_val, src_val);
-        } else if (this->type->isMin()) {
-          update = Min(dst_val, src_val);
+        } else if (this->type->isMax() || this->type->isMin() ||
+                   this->type->isAbsMax()) {
+          update = this->MakeReduce(dst_val, src_val);

The new runtime tests only exercise clear=True, so this regression is currently unguarded.

Also applies to: 470-485
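The clear=False hazard can be reproduced in a small Python model of the merge step (hypothetical helpers, not TileLang code): the local reduction is NaN-aware, but a plain merge into dst silently discards the NaN.

```python
import math

def nan_max(a, b):
    # NaN-aware combiner: NaN from either operand propagates.
    return math.nan if math.isnan(a) or math.isnan(b) else max(a, b)

def ieee_max(a, b):
    # Plain combiner in the style of __hmax: NaN inputs are ignored.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)

def reduce_max_into(dst, src, clear, merge):
    """Model of the merge step: the local reduction is NaN-aware,
    but the clear=False write-back uses whatever combiner is passed."""
    local = src[0]
    for v in src[1:]:
        local = nan_max(local, v)
    return local if clear else merge(dst, local)

src = [2.0, float("nan"), 1.0]
# Buggy path: plain merge drops the NaN produced by the local reduction.
print(reduce_max_into(0.0, src, clear=False, merge=ieee_max))  # 0.0
# Fixed path: a NaN-aware merge preserves it.
print(reduce_max_into(0.0, src, clear=False, merge=nan_max))   # nan
```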

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/op/reduce.cc` around lines 120 - 164, The merge step that writes back
into dst when clear=False still uses plain Max/Min (dropping NaNs) even when
nan_propagate is true; update the merge/need_update path to select the NaN-aware
combiners the same way as the local reducer logic and MakeCodegenReducer does:
when use_nan_op is true and type->isMax()/isMin()/isAbsMax(), use the NaN
variants (tl::MaxOpNan / tl::MinOpNan or Call with tl::max_nan()/tl::min_nan())
instead of plain Max/Min so that ReduceOpNode::MakeCodegenReducer, the merge
code that checks need_update, and any code paths invoking Max/Min honor
nan_propagate for clear=False merges.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/op/reduce.cc`:
- Around line 248-254: The current check in reduce.cc unconditionally FATALs
when nan_propagate is true for fp16/bf16 on non-CUDA targets; narrow this guard
to only the reduce kinds that actually use the NaN-aware path (max, min,
absmax). Modify the condition that uses nan_propagate,
dst->dtype.is_float16()/is_bfloat16(), and TargetIsCuda(T.target) so it also
checks the reduce kind (e.g., T.reduce kind or the variable representing the
reduction operator) and only LOG(FATAL) when the kind is "max", "min", or
"absmax"; leave other reduction kinds (e.g., "sum") unaffected so nan_propagate
becomes a no-op on non-CUDA targets.

---

Outside diff comments:
In `@src/op/reduce.cc`:
- Around line 120-164: The merge step that writes back into dst when clear=False
still uses plain Max/Min (dropping NaNs) even when nan_propagate is true; update
the merge/need_update path to select the NaN-aware combiners the same way as the
local reducer logic and MakeCodegenReducer does: when use_nan_op is true and
type->isMax()/isMin()/isAbsMax(), use the NaN variants (tl::MaxOpNan /
tl::MinOpNan or Call with tl::max_nan()/tl::min_nan()) instead of plain Max/Min
so that ReduceOpNode::MakeCodegenReducer, the merge code that checks
need_update, and any code paths invoking Max/Min honor nan_propagate for
clear=False merges.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5fe5e403-594e-415e-9419-73884f40da5a

📥 Commits

Reviewing files that changed from the base of the PR and between 208ade4 and 16f292f.

📒 Files selected for processing (7)
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/op/reduce.cc
  • src/op/reduce.h
  • src/target/codegen_cuda.cc
  • testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
  • tilelang/language/reduce_op.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/target/codegen_cuda.cc

Comment thread src/op/reduce.cc
Comment on lines +248 to +254
if (nan_propagate && (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
!TargetIsCuda(T.target)) {
LOG(FATAL) << "ReduceOp: nan_propagate=True for fp16/bf16 max/min/absmax "
"is only supported on CUDA targets (requires "
"__hmax_nan/__hmin_nan intrinsics). Target was: "
<< T.target->str();
}
Contributor


⚠️ Potential issue | 🟡 Minor

Scope the CUDA-only guard to the supported reduce kinds.

This currently aborts any fp16/bf16 reduction with nan_propagate=True, even though only max, min, and absmax ever take the NaN-aware path. A direct T.reduce(..., "sum", nan_propagate=True) becomes a target-dependent hard failure instead of a no-op flag.

Suggested fix
-  if (nan_propagate && (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
+  if (nan_propagate &&
+      (this->type->isMax() || this->type->isMin() || this->type->isAbsMax()) &&
+      (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
       !TargetIsCuda(T.target)) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/op/reduce.cc` around lines 248 - 254, The current check in reduce.cc
unconditionally FATALs when nan_propagate is true for fp16/bf16 on non-CUDA
targets; narrow this guard to only the reduce kinds that actually use the
NaN-aware path (max, min, absmax). Modify the condition that uses nan_propagate,
dst->dtype.is_float16()/is_bfloat16(), and TargetIsCuda(T.target) so it also
checks the reduce kind (e.g., T.reduce kind or the variable representing the
reduction operator) and only LOG(FATAL) when the kind is "max", "min", or
"absmax"; leave other reduction kinds (e.g., "sum") unaffected so nan_propagate
becomes a no-op on non-CUDA targets.

@LeiWang1999 LeiWang1999 merged commit 8243f7e into tile-ai:main Apr 13, 2026
8 of 10 checks passed
kurisu6912 pushed a commit that referenced this pull request Apr 13, 2026
* register op, and add to build file

* update semantics of NaN for Max and Min ops in Reduce

* lower Max op to cutlass fastmax in the general case, __hmax for bf16 and __hmax_nan for fp16

* update semantics of NaN for Max and Min ops in Reduce in reduce.h

* develop code transforming ReduceOpNode in TIR and handling CallNode in the CUDA code generator's runtime template

* test file for reduce maxmin nan

* refine test case message and explanation

* invert the semantics of the global config TL_REDUCE_MAXMIN_NAN_PROPAGATE

* lint fix

* refactor reduce maxmin NaN propagate into per-call annotation

Replace the global tl.reduce_maxmin_nan_propagate pass-config with a
per-op nan_propagate flag carried on the ReduceOp annotation map.
Frontend exposes it as a kwarg on T.reduce_max/min/absmax (default
False, preserving prior behavior). The CUDA codegen MinNode/MaxNode
visitors are reverted to plain __hmin/__hmax; only the new tl::max_nan
/tl::min_nan CallNode handler emits __hmax_nan/__hmin_nan, so non-reduce
Min/Max ops are no longer silently rewritten. ReduceOp::Lower now errors
early with a clear message if nan_propagate=True is requested for fp16
/bf16 on a non-CUDA target, instead of emitting an undefined symbol on
HIP/CPU. Tests rewritten to parametrize on the kwarg and assert runtime
NaN-propagation behavior, not just generated source substrings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Haoran <chaoran@umich.edu>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

[Feature Request] Detailed control on common functions

3 participants