Skip to content

GH-45193: [C++][Compute] Treat NaNs and nulls as distinct values in rank tie-breaking#49304

Merged
pitrou merged 1 commit intoapache:mainfrom
abhishek593:GH-45193
Apr 21, 2026
Merged

GH-45193: [C++][Compute] Treat NaNs and nulls as distinct values in rank tie-breaking#49304
pitrou merged 1 commit intoapache:mainfrom
abhishek593:GH-45193

Conversation

@abhishek593
Copy link
Copy Markdown
Contributor

@abhishek593 abhishek593 commented Feb 17, 2026

Rationale for this change

The rank kernel incorrectly treated NaNs and Nulls as ties. This fix ensures they are treated as distinct values according to Arrow's sorting conventions.

What changes are included in this PR?

Updated the internal MarkDuplicates helper in vector_rank.cc to distinguish between NaNs and Nulls.

Are these changes tested?

Yes. Added a regression test TestRank.NaNsAndNulls in vector_sort_test.cc and verified all compute tests pass.

Are there any user-facing changes?

The output of the rank function will now correctly differentiate between NaNs and Nulls instead of ranking them as ties. Fixes incorrect/invalid ranking results for datasets containing both NaNs and Nulls.

@abhishek593
Copy link
Copy Markdown
Contributor Author

@pitrou Please review! Thanks.

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @abhishek593 ! This looks good to me, I just posted a suggestion below.

Comment on lines 41 to 44
template <typename ArrowType, typename ValueSelector, typename IsNullSelector>
void MarkDuplicates(const NullPartitionResult& sorted, ValueSelector&& value_selector,
IsNullSelector&& is_null_selector) {
using T = decltype(value_selector(int64_t{}));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea: instead of specializing on ArrowType, we can just pass a is_null_selector that always returns true for types without NaNs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I have updated the PR with suggested changes.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 10, 2026
@abhishek593 abhishek593 force-pushed the GH-45193 branch 2 times, most recently from 816b597 to aa324b5 Compare March 10, 2026 18:11
@abhishek593 abhishek593 requested a review from pitrou March 10, 2026 19:03
Copy link
Copy Markdown
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generaly lgtm.

Comment thread cpp/src/arrow/compute/kernels/vector_sort_test.cc
Copy link
Copy Markdown
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@zanmato1984
Copy link
Copy Markdown
Contributor

Let's wait for @pitrou who might want to take another look.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 20, 2026

Well, there are a number of CI failures that need addressing, already :)

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 20, 2026

You probably want to rebase from git main, actually, and fix the span references to use std::span.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 20, 2026

This LGTM, but there is a C++ lint failure, can you please reformat @abhishek593 ?

(this should be as easy as pre-commit run -a cpp)

bool prev_is_null = is_null_selector(*it);
while (++it < sorted.nulls_end) {
*it |= kDuplicateMask;
bool curr_is_null = is_null_selector(*it);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, we know that, after sorting, nulls are clustered either before or either NaNs, so we could take advantage of that to avoid looking the null bitmap for each element. But I think the current solution is still good enough.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 21, 2026

Thanks for the update @abhishek593 ! I'm going to merge now.

@pitrou pitrou merged commit 2937301 into apache:main Apr 21, 2026
50 of 52 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Apr 21, 2026
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2937301.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants