GH-45193: [C++][Compute] Treat NaNs and nulls as distinct values in rank tie-breaking#49304
GH-45193: [C++][Compute] Treat NaNs and nulls as distinct values in rank tie-breaking#49304pitrou merged 1 commit intoapache:mainfrom
Conversation
|
@pitrou Please review! Thanks. |
pitrou
left a comment
There was a problem hiding this comment.
Thanks @abhishek593 ! This looks good to me, I just posted a suggestion below.
| template <typename ArrowType, typename ValueSelector, typename IsNullSelector> | ||
| void MarkDuplicates(const NullPartitionResult& sorted, ValueSelector&& value_selector, | ||
| IsNullSelector&& is_null_selector) { | ||
| using T = decltype(value_selector(int64_t{})); |
There was a problem hiding this comment.
Idea: instead of specializing on ArrowType, we can just pass a is_null_selector that always returns true for types without NaNs.
There was a problem hiding this comment.
@pitrou I have updated the PR with suggested changes.
816b597 to
aa324b5
Compare
|
Let's wait for @pitrou who might want to take another look. |
|
Well, there are a number of CI failures that need addressing, already :) |
|
You probably want to rebase from git main, actually, and fix the |
|
This LGTM, but there is a C++ lint failure, can you please reformat @abhishek593 ? (this should be as easy as |
| bool prev_is_null = is_null_selector(*it); | ||
| while (++it < sorted.nulls_end) { | ||
| *it |= kDuplicateMask; | ||
| bool curr_is_null = is_null_selector(*it); |
There was a problem hiding this comment.
By the way, we know that, after sorting, nulls are clustered either before or either NaNs, so we could take advantage of that to avoid looking the null bitmap for each element. But I think the current solution is still good enough.
…s in rank tie-breaking
|
Thanks for the update @abhishek593 ! I'm going to merge now. |
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2937301. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
The rank kernel incorrectly treated NaNs and Nulls as ties. This fix ensures they are treated as distinct values according to Arrow's sorting conventions.
What changes are included in this PR?
Updated the internal MarkDuplicates helper in vector_rank.cc to distinguish between NaNs and Nulls.
Are these changes tested?
Yes. Added a regression test TestRank.NaNsAndNulls in vector_sort_test.cc and verified all compute tests pass.
Are there any user-facing changes?
The output of the rank function will now correctly differentiate between NaNs and Nulls instead of ranking them as ties. Fixes incorrect/invalid ranking results for datasets containing both NaNs and Nulls.