GH-45193: [C++][Compute] Treat NaNs and nulls as distinct values in rank tie-breaking by abhishek593 · Pull Request #49304 · apache/arrow

abhishek593 · 2026-02-17T07:42:37Z

Rationale for this change

The rank kernel incorrectly treated NaNs and Nulls as ties. This fix ensures they are treated as distinct values according to Arrow's sorting conventions.

What changes are included in this PR?

Updated the internal MarkDuplicates helper in vector_rank.cc to distinguish between NaNs and Nulls.

Are these changes tested?

Yes. Added a regression test TestRank.NaNsAndNulls in vector_sort_test.cc and verified all compute tests pass.

Are there any user-facing changes?

The output of the rank function will now correctly differentiate between NaNs and Nulls instead of ranking them as ties. Fixes incorrect/invalid ranking results for datasets containing both NaNs and Nulls.
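To illustrate the user-facing change, here is a minimal standalone sketch (not Arrow code) of dense ranking in which NaNs and nulls are distinct values. It assumes ascending order with non-NaN values first, then NaNs, then nulls. The names `Kind`, `Classify`, `Less` and `DenseRank` are hypothetical, and `std::optional<double>` stands in for a nullable float64 slot.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical slot classification; the enumerator order encodes the assumed
// sort order: ordinary values < NaNs < nulls.
enum class Kind { kValue, kNaN, kNull };

Kind Classify(const std::optional<double>& v) {
  if (!v.has_value()) return Kind::kNull;
  return std::isnan(*v) ? Kind::kNaN : Kind::kValue;
}

// Strict weak ordering: compare by kind first, then by value for ordinary
// values. Two NaNs (or two nulls) are equivalent to each other.
bool Less(const std::optional<double>& a, const std::optional<double>& b) {
  const Kind ka = Classify(a);
  const Kind kb = Classify(b);
  if (ka != kb) return ka < kb;
  return ka == Kind::kValue && *a < *b;
}

// 1-based dense ranks: equivalent elements share a rank, and a NaN never
// shares a rank with a null.
std::vector<uint64_t> DenseRank(const std::vector<std::optional<double>>& input) {
  std::vector<std::size_t> order(input.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(), [&](std::size_t i, std::size_t j) {
    return Less(input[i], input[j]);
  });

  std::vector<uint64_t> ranks(input.size());
  uint64_t rank = 0;
  for (std::size_t pos = 0; pos < order.size(); ++pos) {
    // Start a new rank whenever the element differs from its predecessor.
    if (pos == 0 || Less(input[order[pos - 1]], input[order[pos]])) ++rank;
    ranks[order[pos]] = rank;
  }
  return ranks;
}
```

Under these assumptions, the input `[1.0, NaN, null, NaN, null, 0.5]` yields the ranks `[2, 3, 4, 3, 4, 1]`: the two NaNs tie with each other, the two nulls tie with each other, and the NaN group ranks apart from the null group.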

GitHub Issue: [C++][Compute] Rank function considers NaNs and nulls equal #45193

abhishek593 · 2026-03-09T19:49:18Z

@pitrou Please review! Thanks.

pitrou

Thanks @abhishek593 ! This looks good to me, I just posted a suggestion below.

pitrou · 2026-03-10T10:10:28Z

+template <typename ArrowType, typename ValueSelector, typename IsNullSelector>
+void MarkDuplicates(const NullPartitionResult& sorted, ValueSelector&& value_selector,
+                    IsNullSelector&& is_null_selector) {
  using T = decltype(value_selector(int64_t{}));


Idea: instead of specializing on ArrowType, we can just pass an is_null_selector that always returns true for types without NaNs.
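A rough standalone sketch of this idea follows. It is not the code from the PR: `MarkNullLikeDuplicates` is a hypothetical helper working on a plain `std::vector` of sorted indices, and the value given to `kDuplicateMask` here (the top bit) is an assumption made for illustration. The caller supplies the selector, so a type without NaNs can pass a lambda that always returns true and no specialization on the type is needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed flag bit meaning "same rank as the previous entry".
constexpr uint64_t kDuplicateMask = uint64_t{1} << 63;

// Marks duplicates inside the null-like partition [begin, end) of `sorted`,
// which holds element indices. Adjacent entries are duplicates only when both
// are nulls or both are NaNs, as reported by `is_null(index)`.
template <typename IsNullSelector>
void MarkNullLikeDuplicates(std::vector<uint64_t>& sorted, std::size_t begin,
                            std::size_t end, IsNullSelector&& is_null) {
  if (begin >= end) return;
  bool prev_is_null = is_null(sorted[begin]);
  for (std::size_t i = begin + 1; i < end; ++i) {
    // The selector sees the index before the flag bit is OR-ed into it.
    const bool curr_is_null = is_null(sorted[i]);
    if (curr_is_null == prev_is_null) sorted[i] |= kDuplicateMask;
    prev_is_null = curr_is_null;
  }
}
```

For a floating-point column the selector would consult the validity information, for example `[&](uint64_t i) { return !valid[i]; }`. For an integer column every null-like entry is a null, so `[](uint64_t) { return true; }` marks the whole partition after its first entry as duplicates.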

abhishek593 · 2026-03-20

@pitrou I have updated the PR with the suggested changes.

zanmato1984

Generally LGTM.

zanmato1984

+1

zanmato1984 · 2026-04-17T21:36:33Z

Let's wait for @pitrou who might want to take another look.

pitrou · 2026-04-20T08:24:35Z

Well, there are a number of CI failures that need addressing, already :)

pitrou · 2026-04-20T08:26:15Z

You probably want to rebase from git main, actually, and fix the span references to use std::span.

pitrou · 2026-04-20T13:39:02Z

This LGTM, but there is a C++ lint failure, can you please reformat @abhishek593 ?

(this should be as easy as pre-commit run -a cpp)

pitrou · 2026-04-20T13:40:33Z

+    bool prev_is_null = is_null_selector(*it);
    while (++it < sorted.nulls_end) {
-      *it |= kDuplicateMask;
+      bool curr_is_null = is_null_selector(*it);


By the way, we know that, after sorting, nulls are clustered either before or after NaNs, so we could take advantage of that to avoid looking up the null bitmap for each element. But I think the current solution is still good enough.
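One way the optimization mentioned above might look is sketched here. This is an illustration only, not something the PR implements. Because the null-like partition consists of one run of nulls and one run of NaNs, the boundary between the runs can be located with a binary search, after which each run is marked without further lookups. `MarkRunsByBoundary` and `kDuplicateFlag` are hypothetical names, and the flag value is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed flag bit meaning "same rank as the previous entry".
constexpr uint64_t kDuplicateFlag = uint64_t{1} << 63;

// Precondition: within [begin, end) all nulls are adjacent and all NaNs are
// adjacent, in either order. `is_null(index)` is then called O(log n) times.
template <typename IsNullSelector>
void MarkRunsByBoundary(std::vector<uint64_t>& sorted, std::size_t begin,
                        std::size_t end, IsNullSelector&& is_null) {
  if (begin >= end) return;
  uint64_t* first = sorted.data() + begin;
  uint64_t* last = sorted.data() + end;
  const bool head_is_null = is_null(*first);
  // First entry whose kind differs from the head of the partition.
  uint64_t* boundary = std::partition_point(
      first, last, [&](uint64_t idx) { return is_null(idx) == head_is_null; });
  // Everything after the first entry of each run is a duplicate.
  for (uint64_t* it = first + 1; it < boundary; ++it) *it |= kDuplicateFlag;
  if (boundary != last) {
    for (uint64_t* it = boundary + 1; it < last; ++it) *it |= kDuplicateFlag;
  }
}
```

The binary search runs before any flag bit is written, so the selector only ever sees clean indices. Whether the saved bitmap lookups matter in practice would need to be measured.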

pitrou · 2026-04-21T07:48:04Z

Thanks for the update @abhishek593 ! I'm going to merge now.

conbench-apache-arrow · 2026-04-21T15:36:44Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2937301.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions Bot added the Component: C++ and awaiting review labels Feb 17, 2026

pitrou requested changes Mar 10, 2026

github-actions Bot added the awaiting committer review label and removed the awaiting review label Mar 10, 2026

abhishek593 force-pushed the GH-45193 branch 2 times, most recently from 816b597 to aa324b5 March 10, 2026 18:11

abhishek593 requested a review from pitrou March 10, 2026 19:03

zanmato1984 reviewed Mar 23, 2026

Comment thread cpp/src/arrow/compute/kernels/vector_sort_test.cc

abhishek593 force-pushed the GH-45193 branch from aa324b5 to fd1ad2a April 17, 2026 20:34

abhishek593 requested a review from zanmato1984 April 17, 2026 20:36

zanmato1984 approved these changes Apr 17, 2026

abhishek593 force-pushed the GH-45193 branch from fd1ad2a to 8d75b3d April 20, 2026 12:22

pitrou approved these changes Apr 20, 2026

pitrou reviewed Apr 20, 2026

GH-45193: [C++][Compute] Treat NaNs and nulls as distinct values in rank tie-breaking

bdd71bf

abhishek593 force-pushed the GH-45193 branch from 8d75b3d to bdd71bf April 20, 2026 22:20

pitrou merged commit 2937301 into apache:main Apr 21, 2026
50 of 52 checks passed

pitrou removed the awaiting committer review label Apr 21, 2026

pitrou mentioned this pull request Apr 21, 2026

[C++][Compute] Rank function considers NaNs and nulls equal #45193

Closed
