NaN propagation option for bf16 and half16 (#1958)
Conversation
📝 Walkthrough

This PR introduces NaN-propagating variants of the max/min reduction operations for fp16/bf16 datatypes in TileLang. The feature adds a new optional `nan_propagate` setting.
Sequence diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant PythonAPI as TileLang Python API
    participant Compiler as Reduction Compiler
    participant Lowering as Lowering Pass
    participant Codegen as CUDA Codegen
    participant Runtime as CUDA Runtime
    User->>PythonAPI: reduce_max(buffer, out, nan_propagate=True)
    PythonAPI->>Compiler: Pass annotation {"nan_propagate": True}
    Compiler->>Compiler: Parse nan_propagate flag into ReduceOpNode
    alt nan_propagate=True && (fp16 or bf16)
        Lowering->>Lowering: Select tl::max_nan intrinsic
    else
        Lowering->>Lowering: Select tl::max function
    end
    Lowering->>Codegen: Emit reduction with chosen intrinsic
    alt tl::max_nan for fp16/bf16
        Codegen->>Codegen: Emit __hmax_nan with type conversions
    else
        Codegen->>Codegen: Emit fast_max fallback
    end
    Codegen->>Runtime: Compiled CUDA kernel
    Runtime->>Runtime: Execute NaN-aware max operation
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
📒 Files selected for processing (7)

- src/op/builtin.cc
- src/op/builtin.h
- src/op/reduce.cc
- src/target/codegen_cuda.cc
- src/tl_templates/cuda/reduce.h
- testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
- tilelang/transform/pass_config.py
```cpp
bool nan_propagate = true;
if (tvm::transform::PassContext::Current().defined()) {
  nan_propagate = tvm::transform::PassContext::Current()
                      ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
                      .value();
}
const bool is_fp16_or_bf16 = acc.dtype().is_float16() || acc.dtype().is_bfloat16();
if (type->isSum()) {
  return acc + rhs;
} else if (type->isAbsSum()) {
  return acc + Max(rhs, -rhs);
} else if (type->isMax()) {
  if (!nan_propagate && is_fp16_or_bf16) {
    return Call(acc.dtype(), tl::max_nan(), {acc, rhs});
  }
  return Max(acc, rhs);
} else if (type->isMin()) {
  if (!nan_propagate && is_fp16_or_bf16) {
    return Call(acc.dtype(), tl::min_nan(), {acc, rhs});
  }
  return Min(acc, rhs);
} else if (type->isAbsMax()) {
  if (!nan_propagate && is_fp16_or_bf16) {
    return Call(acc.dtype(), tl::max_nan(), {acc, tvm::abs(rhs)});
  }
  return Max(acc, tvm::abs(rhs));
```
🧩 Analysis chain
🌐 Web query: NVIDIA CUDA `__hmax` `__hmin` `__hmax_nan` `__hmin_nan` semantics NaN handling

💡 Result:

The NVIDIA CUDA half-precision (FP16) intrinsics `__hmax`, `__hmin`, `__hmax_nan`, and `__hmin_nan` compute the maximum and minimum of two `half` values with specific NaN-handling behavior, as defined in the official CUDA Math API documentation. All four compute `max(a, b)` as `(a > b) ? a : b` (or `min(a, b)` as `(a < b) ? a : b`) and treat `+0.0 > -0.0` when both inputs are zero.

- `__hmax(const half a, const half b)` / `__hmin(const half a, const half b)`: if either input is NaN, returns the other input; if both inputs are NaN, returns the canonical NaN.
- `__hmax_nan(const half a, const half b)` / `__hmin_nan(const half a, const half b)`: if either input is NaN, returns the canonical NaN (NaNs pass through).

These are `__host__ __device__` functions available by including `cuda_fp16.h`; they belong to the Half Comparison Functions group.

Citations:
- 1: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH____HALF__COMPARISON.html
- 2: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INTRINSIC__HALF.html
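The documented difference between the two intrinsic families can be modeled in plain Python (an illustrative sketch of the semantics above, not the actual CUDA intrinsics; the helper names are made up):

```python
import math

def hmax(a: float, b: float) -> float:
    """Model of __hmax: if either input is NaN, return the other input."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b

def hmax_nan(a: float, b: float) -> float:
    """Model of __hmax_nan: any NaN input yields NaN (NaN passes through)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b
```

For example, `hmax(float('nan'), 2.0)` returns `2.0`, while `hmax_nan(float('nan'), 2.0)` returns NaN.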
Invert the `nan_propagate` semantics throughout the NaN-handling feature.

The flag `nan_propagate = true` currently selects `__hmax`/`__hmin` (the NaN-ignoring functions), while `nan_propagate = false` selects `__hmax_nan`/`__hmin_nan` (the NaN-propagating functions). This inverts the flag's intended semantics: per the official CUDA documentation, `__hmax_nan` propagates NaNs and `__hmax` ignores them.

Fix in src/op/reduce.cc: invert the conditions at lines 124, 129, and 134 from `!nan_propagate` to `nan_propagate`, and swap the return values at lines 161, 163, and 165. Matching inversions are required in src/target/codegen_cuda.cc (lines 1451 and 1489) and in all test assertions. Correct the docstrings in src/op/builtin.h (lines 193–196) and tilelang/transform/pass_config.py (lines 167–169).
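The corrected selection logic can be sketched as standalone Python (a hypothetical helper, not TileLang code; it only illustrates the flag-to-intrinsic mapping the review asks for):

```python
def pick_max_intrinsic(dtype: str, nan_propagate: bool) -> str:
    """Map a reduce-max over `dtype` to an intrinsic name.

    Corrected semantics: nan_propagate=True selects the NaN-propagating
    __hmax_nan; False selects the NaN-ignoring __hmax. Non-half dtypes
    keep the generic fast_max path regardless of the flag.
    """
    if dtype not in ("float16", "bfloat16"):
        return "fast_max"
    return "__hmax_nan" if nan_propagate else "__hmax"
```

With this mapping, the flag reads the way its name suggests: true means NaNs propagate.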
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/op/reduce.cc (1)

112-117: Consider extracting the `nan_propagate` retrieval into a helper method.

The PassContext config lookup for `kReduceMaxMinNanPropagate` is duplicated in both `MakeReduce()` (lines 112-117) and `MakeCodegenReducer()` (lines 150-155). This creates a maintenance burden and a risk of divergence.

♻️ Suggested refactor: add a private helper method to `ReduceOpNode` or a free function:

```diff
+static bool GetNanPropagateSetting() {
+  if (tvm::transform::PassContext::Current().defined()) {
+    return tvm::transform::PassContext::Current()
+        ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
+        .value();
+  }
+  return true;
+}
+
 PrimExpr ReduceOpNode::MakeReduce(const PrimExpr &acc, const PrimExpr &b) const {
   PrimExpr rhs = b;
   if (acc->dtype != rhs->dtype) {
     rhs = Cast(acc->dtype, rhs);
   }
-  bool nan_propagate = true;
-  if (tvm::transform::PassContext::Current().defined()) {
-    nan_propagate = tvm::transform::PassContext::Current()
-                        ->GetConfig<Bool>(kReduceMaxMinNanPropagate, Bool(true))
-                        .value();
-  }
+  bool nan_propagate = GetNanPropagateSetting();
```

Apply the same change to `MakeCodegenReducer()`. Also applies to: 150-155.

testing/python/language/test_tilelang_language_reduce_maxmin_nan.py (2)

60-67: Inconsistent negative assertions in bf16 tests.

`test_reduce_max_fp16_nan_propagate_false` (line 49) asserts `"MaxOpNan" not in src`, but `test_reduce_max_bf16_nan_propagate_false` omits this check. The same pattern applies to the `reduce_min` tests. Adding consistent negative assertions would catch cases where both variants accidentally appear.

💡 Suggested fix:

```diff
 def test_reduce_max_bf16_nan_propagate_false():
     k = _compile_cuda(
         _kernel_reduce_max(64, T.bfloat16),
         pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False},
     )
     src = k.get_kernel_source()
     assert "tl::MaxOp" in src
+    assert "MaxOpNan" not in src
     assert "__hmax(" in src
```

Apply similar changes to `test_reduce_min_bf16_nan_propagate_false`.

131-138: Missing bf16 tests for `reduce_absmax`.

The `reduce_absmax` tests only cover fp16. Consider adding bf16 tests for completeness, since bf16 uses the same NaN-handling code path.

💡 Suggested additions:

```python
def test_reduce_absmax_bf16_nan_propagate_default():
    k = _compile_cuda(_kernel_reduce_absmax(64, T.bfloat16))
    src = k.get_kernel_source()
    assert "tl::MaxOpNan" in src
    assert "__hmax_nan" in src


def test_reduce_absmax_bf16_nan_propagate_false():
    k = _compile_cuda(
        _kernel_reduce_absmax(64, T.bfloat16),
        pass_configs={tilelang.PassConfigKey.TL_REDUCE_MAXMIN_NAN_PROPAGATE: False},
    )
    src = k.get_kernel_source()
    assert "tl::MaxOp" in src
    assert "__hmax(" in src
```
Inline comment in testing/python/language/test_tilelang_language_reduce_maxmin_nan.py (around lines 101-108): add a `test_reduce_min_bf16_nan_propagate_default` test that mirrors `test_reduce_min_bf16_nan_propagate_false` but uses the default pass config (i.e., call `_compile_cuda(_kernel_reduce_min(64, T.bfloat16))` without `pass_configs`), then call `get_kernel_source()` on the returned kernel and assert that the source contains `"tl::MinOpNan"` and `"__hmin_nan"`, verifying default bf16 NaN propagation for `reduce_min`.
📒 Files selected for processing (5)

- src/op/builtin.h
- src/op/reduce.cc
- src/target/codegen_cuda.cc
- testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
- tilelang/transform/pass_config.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tilelang/transform/pass_config.py
- src/op/builtin.h
- src/target/codegen_cuda.cc
Thanks @haoran35-jpg, would you mind explaining more about the motivation for introducing this?
Yes, thank you for the comment! It allows `max(...)`/`min(...)` in the frontend to be lowered to the NVIDIA intrinsics `__hmax_nan(...)`/`__hmin_nan(...)` when the config flag is set.
@haoran35-jpg, thanks, and I think one better solution is to add an annotation on the op itself rather than a global config.
Actionable comments posted: 1
Inline comment in src/op/reduce.cc (around lines 118-119): make the NaN-specific gating consistent between the thread-local reducer and the codegen reducer. In `MakeCodegenReducer()`, only return `tl::MaxOpNan`/`tl::MinOpNan` for the max/min/absmax reducers when `nan_propagate` is true and the accumulator dtype is half/bfloat16 (mirroring the `is_fp16_or_bf16` check in `MakeReduce()`, i.e. `acc.dtype().is_float16() || acc.dtype().is_bfloat16()`); otherwise return the regular `tl::MaxOp`/`tl::MinOp`, so the local (`MakeReduce`) and inter-thread (`MakeCodegenReducer`) phases behave the same for non-half dtypes.
📒 Files selected for processing (2)

- src/op/reduce.cc
- src/target/codegen_cuda.cc
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.cc
Replace the global `tl.reduce_maxmin_nan_propagate` pass-config with a per-op `nan_propagate` flag carried on the ReduceOp annotation map. The frontend exposes it as a kwarg on `T.reduce_max`/`min`/`absmax` (default False, preserving prior behavior). The CUDA codegen MinNode/MaxNode visitors are reverted to plain `__hmin`/`__hmax`; only the new `tl::max_nan`/`tl::min_nan` CallNode handler emits `__hmax_nan`/`__hmin_nan`, so non-reduce Min/Max ops are no longer silently rewritten. `ReduceOp::Lower` now errors early with a clear message if `nan_propagate=True` is requested for fp16/bf16 on a non-CUDA target, instead of emitting an undefined symbol on HIP/CPU. Tests are rewritten to parametrize on the kwarg and assert runtime NaN-propagation behavior, not just generated-source substrings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
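The per-op flag described above can be sketched as a lookup on the op's annotation map (a plain-Python stand-in for the ReduceOp annotations; the helper name is illustrative):

```python
def nan_propagate_of(annotations: dict) -> bool:
    """Read the per-op nan_propagate flag from an annotation map.

    Default False preserves the pre-existing NaN-ignoring behavior
    when the kwarg is not supplied at the call site.
    """
    return bool(annotations.get("nan_propagate", False))
```

Unlike a global pass-config, each reduce call carries its own setting, so two reductions in one kernel can differ.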
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)
120-164: ⚠️ Potential issue | 🟠 Major: use the NaN-aware combiner when merging back into `dst` with `clear=False`.

The new logic makes the local reduction and `AllReduce` honor `nan_propagate`, but the later `need_update` merge still uses plain `Max`/`Min`. In the public `clear=False` path, a NaN from either the existing output or the newly reduced value is dropped there, so `reduce_max/min/absmax(..., clear=False, nan_propagate=True)` can still return a non-NaN result.

Suggested fix:

```diff
-  } else if (this->type->isMax() || this->type->isAbsMax()) {
-    update = Max(dst_val, src_val);
-  } else if (this->type->isMin()) {
-    update = Min(dst_val, src_val);
+  } else if (this->type->isMax() || this->type->isMin() ||
+             this->type->isAbsMax()) {
+    update = this->MakeReduce(dst_val, src_val);
```

The new runtime tests only exercise `clear=True`, so this regression is currently unguarded. Also applies to: 470-485.
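The failure mode can be reproduced with a plain-Python model of the two combiners (illustrative only; `nan_max` and `ignore_max` stand in for the NaN-propagating and NaN-ignoring ops):

```python
import math

def nan_max(a, b):
    """NaN-propagating combiner (models tl::max_nan)."""
    return math.nan if math.isnan(a) or math.isnan(b) else (a if a > b else b)

def ignore_max(a, b):
    """NaN-ignoring combiner (models plain Max / __hmax)."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b

# The local reduction correctly propagates the NaN...
reduced = nan_max(0.5, math.nan)
# ...but merging into the existing output with the NaN-ignoring op
# (the clear=False path) silently drops it again.
merged_buggy = ignore_max(1.0, reduced)
merged_fixed = nan_max(1.0, reduced)
```

Here `merged_buggy` comes out as `1.0`, while `merged_fixed` stays NaN.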
📒 Files selected for processing (7)
- src/op/builtin.cc
- src/op/builtin.h
- src/op/reduce.cc
- src/op/reduce.h
- src/target/codegen_cuda.cc
- testing/python/language/test_tilelang_language_reduce_maxmin_nan.py
- tilelang/language/reduce_op.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/target/codegen_cuda.cc
```cpp
if (nan_propagate && (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
    !TargetIsCuda(T.target)) {
  LOG(FATAL) << "ReduceOp: nan_propagate=True for fp16/bf16 max/min/absmax "
                "is only supported on CUDA targets (requires "
                "__hmax_nan/__hmin_nan intrinsics). Target was: "
             << T.target->str();
}
```
Scope the CUDA-only guard to the supported reduce kinds.

This currently aborts any fp16/bf16 reduction with `nan_propagate=True`, even though only max, min, and absmax ever take the NaN-aware path. A direct `T.reduce(..., "sum", nan_propagate=True)` becomes a target-dependent hard failure instead of a no-op flag.
Suggested fix:

```diff
-  if (nan_propagate && (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
+  if (nan_propagate &&
+      (this->type->isMax() || this->type->isMin() || this->type->isAbsMax()) &&
+      (dst->dtype.is_float16() || dst->dtype.is_bfloat16()) &&
       !TargetIsCuda(T.target)) {
```
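The narrowed guard amounts to the following check, sketched in Python (hypothetical names; only the gating logic is modeled, not the actual `ReduceOp::Lower` code):

```python
NAN_AWARE_KINDS = {"max", "min", "absmax"}

def check_nan_propagate_target(kind: str, dtype: str,
                               nan_propagate: bool, target: str) -> None:
    """Raise only when the NaN-aware path is actually requested off-CUDA."""
    if (nan_propagate
            and kind in NAN_AWARE_KINDS
            and dtype in ("float16", "bfloat16")
            and target != "cuda"):
        raise ValueError(
            f"nan_propagate=True for {dtype} {kind} requires a CUDA target, "
            f"got {target}")
```

With this scoping, a `"sum"` reduction with the flag set is a no-op on any target, while `"max"` on fp16 still fails fast off-CUDA.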
* register op, and add to build file
* update semantics of NaN for Max and Min ops in Reduce
* lower Max op to cutlass fast_max in the general case, __hmax for bf16 and __hmax_nan for fp16
* update semantics of NaN for Max and Min ops in Reduce in reduce.h
* develop code transforming ReduceOpNode in TIR and handling CallNode in the CUDA code generator's runtime template
* test file for reduce max/min NaN
* refine test case messages and explanations
* invert the semantics of the global config TL_REDUCE_MAXMIN_NAN_PROPAGATE
* lint fix
* refactor reduce max/min NaN propagation into a per-call annotation

---------
Co-authored-by: Haoran <chaoran@umich.edu>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This PR introduces an option to propagate NaN specifically for the data types bf16 and half, modifying the following files:
- src/op/builtin.h: declare `kReduceMaxMinNanPropagate` and the `tl::max_nan`/`tl::min_nan` ops.
- src/op/builtin.cc: register `tl.max_nan`/`tl.min_nan`.
- src/op/reduce.cc: for `AllReduce`, `MakeCodegenReducer()` returns `tl::MaxOpNan`, concatenated into `tl::AllReduce<tl::MaxOpNan, …>::run(...)`, with `tl::MaxOpNan` defined in the runtime template reduce.h. For `MakeReduce`, turn the reduction into a CallNode and rely on `tl::max_nan`/`tl::min_nan`.
- src/target/codegen_cuda.cc: emit `__hmax`/`__hmin` or `__hmax_nan`/`__hmin_nan` according to the PassContext. `VisitExpr_(CallNode)` handles `tl::max_nan`/`tl::min_nan` as TVM ops defined by the builtin.

Before this PR, MaxOp/MinOp with parameter type bfloat16 or half were forcefully lowered to `__hmax()`/`__hmin()`. Now the user can optionally lower them to `__hmax_nan()`/`__hmin_nan()` by setting `kReduceMaxMinNanPropagate` to false (the default is true). I feel this global config is more elegant and convenient to implement; not sure if this is okay.
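The two-phase structure described above (a per-thread `MakeReduce`, then an `AllReduce`-style combine of the partials) can be emulated in Python to show why both phases must use the NaN-propagating combiner (an illustrative model only, not TileLang code):

```python
import math
from functools import reduce

def max_nan(a, b):
    """Model of the NaN-propagating combiner used in both phases."""
    return math.nan if math.isnan(a) or math.isnan(b) else (a if a > b else b)

data = [0.1, 2.5, math.nan, 1.7]
# Phase 1: each "thread" reduces its strided slice locally.
partials = [reduce(max_nan, data[t::2], -math.inf) for t in range(2)]
# Phase 2: AllReduce-style combine of the per-thread partials.
result = reduce(max_nan, partials)
```

Because thread 0's slice contains the NaN, both its partial and the final result are NaN; with a NaN-ignoring combiner in either phase, the NaN would be lost.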
Disclaimer: AI was used in code refinement and test generation.
Summary by CodeRabbit
New Features
- Added a `nan_propagate` parameter to the `reduce_max()`, `reduce_min()`, and `reduce_absmax()` functions, enabling NaN-aware reduction behavior for fp16/bf16 tensors on CUDA targets.

Tests