Add has_true() and has_false() to BooleanArray #9511
run benchmark boolean_array |
|
🤖 Hi @adriangb, thanks for the request (#9511 (comment)).
Please choose one or more of these with You can also set environment variables on subsequent lines: Unsupported benchmarks: boolean_array. |
|
cc @Dandandan |
|
Benchmark job started for this request (job |
|
🤖 Arrow criterion benchmark running (GKE) | trigger |
|
Benchmark for this request failed. |
|
run benchmark record_batch |
|
Benchmark job started for this request (job |
|
🤖 Arrow criterion benchmark running (GKE) | trigger |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Details
|
1551756 to 4e8b072
    null_chunks.zip(value_chunks).any(|(n, v)| (n & v) != 0)
}
None => {
    let bit_chunks = UnalignedBitChunk::new(
Shouldn't you be able to use BitChunkIterator here?
|
This is really nice @adriangb I think as a next step let's just apply them on all of the various I think in DataFusion this has some non-trivial impact on Case / filter eval :) . I added one comment about |
Will try tomorrow 😄 |
|
I took a look at this PR and its performance, and it seems to me like it is a good new API. Thank you @adriangb and @Dandandan. Let's plan on adding @Dandandan's suggestions as follow-on PRs. Perhaps we can open issues to track them so they don't get lost and others can help. It is interesting that putting a control flow branch in the loop is actually faster than powering through |
/// as soon as a `true` value is found, without counting all set bits.
///
/// Null values are not counted as `true`. Returns `false` for empty arrays.
pub fn has_true(&self) -> bool {
We may want to add this API to BooleanBuffer as well (as a follow on PR)
The observation is that this is the same as DataFusion w/ RecordBatch: you need enough data to fully saturate SIMD, branch prediction, etc., but it really doesn't hurt to pause every ~ large chunk to make decisions. |
|
BTW codex says it found a bug in this PR -- I am getting it to cough up a reproducer now (or will determine it is hallucinating) |
alamb
left a comment
There is a bug I think -- see comments and reproducer
When the buffer is 8-byte aligned and >16 bytes, UnalignedBitChunk produces a suffix but no prefix. The wildcard match arm set suffix_fill to 0, so trailing padding bits (zeroed by UnalignedBitChunk) appeared as false values. Add explicit (None, Some(_)) arm to fill trailing padding with 1s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
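The padding issue described in the commit above can be illustrated on a single 64-bit word. This is only a minimal sketch: the helper name `word_has_false` and its `len` parameter are hypothetical and not part of arrow-rs, and the actual fix operates on the prefix/suffix words produced by `UnalignedBitChunk`.

```rust
/// Hypothetical helper (not an arrow-rs API): does the `len`-bit prefix of
/// `word` contain a 0 bit? Bits at positions >= `len` are padding.
fn word_has_false(word: u64, len: u32) -> bool {
    assert!((1..=64).contains(&len));
    // 1s in every padding position; no padding when all 64 bits are valid.
    let padding = if len == 64 { 0 } else { !0u64 << len };
    // After filling the padding with 1s, any remaining 0 is a real `false`.
    (word | padding) != u64::MAX
}

fn main() {
    // Five valid bits, all true; the other 59 bits are zero padding.
    let word = 0b1_1111u64;
    // A naive "is any bit unset" check is fooled by the zero padding.
    assert!(word != u64::MAX);
    // Filling the padding with 1s gives the intended answer.
    assert!(!word_has_false(word, 5));
    // A genuine false value (bit 3) is still detected.
    assert!(word_has_false(0b1_0111, 5));
}
```

The same reasoning explains why `has_true` needs no such mask: zeroed padding can never look like a `true` value.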
Extract CHUNK_FOLD_BLOCK_SIZE constant and unaligned_bit_chunks() helper to reduce duplication between has_true() and has_false(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run benchmark boolean_array |
@Dandandan I tried this and the TLDR is the code is simpler and it is faster for small inputs because it saves the overhead of constructing an |
|
🤖 Arrow criterion benchmark running (GKE) | trigger |
Thank you, I added the tests and fixed the bug. |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger New benchmark — branch-only results (no baseline comparison) Details
|
|
I'll wait for another look from @Dandandan since I've changed the implementation before merging. Thank you both for review. |
|
👨🍳 👌 |
bit_chunks.prefix().unwrap_or(0) != 0
    || bit_chunks
        .chunks()
        .chunks(Self::CHUNK_FOLD_BLOCK_SIZE)
With chunks_exact you could probably use a smaller constant (as you can remove the inner branch / loop with an unrolled loop).
(So I think it could terminate even earlier with a smaller constant - as it only has a "termination" branch after the e.g. 8 elements instead of also having a loop end)
|
Probably there is room for a bit of improvement, but performance is already great! |
|
Thank you @adriangb ! |
|
Hi @adriangb, your benchmark configuration could not be parsed (#9511 (comment)). Error: Supported benchmarks:
Usage: Per-side configuration ( env:
SHARED_SETTING: enabled
baseline:
ref: v45.0.0
env:
DATAFUSION_RUNTIME_MEMORY_LIMIT: 1G
changed:
ref: v46.0.0
env:
DATAFUSION_RUNTIME_MEMORY_LIMIT: 2G |
…9570)

## Summary

- Replace `.chunks(64)` with `.chunks_exact(16)` in `has_true()` and `has_false()` as suggested in #9511 (comment)
- With `chunks_exact`, the compiler can fully unroll the inner fold (guaranteed size, no inner branch/loop), allowing a smaller block size for more frequent short-circuit exits without regressing the full-scan path

## Benchmark results (block size 16 vs baseline)

- Full-scan worst case (65536): No regression (~49ns both)
- Early-exit cases (65536): ~27% faster (6.0ns → 4.4ns)
- Small arrays (64, 1024): Unchanged

## Test plan

- [x] All 13 existing `test_has` tests pass

run benchmarks boolean_array

@Dandandan Would appreciate your review!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
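A rough sketch of the `chunks_exact` idea from the commit above, assuming a plain `&[u64]` slice stands in for the 64-bit chunks. The names `any_bit_set` and `BLOCK` are illustrative only; this is not the code that was merged.

```rust
/// Illustrative block size; the commit above reports using 16.
const BLOCK: usize = 16;

/// Returns true if any bit is set in `words`.
fn any_bit_set(words: &[u64]) -> bool {
    let mut blocks = words.chunks_exact(BLOCK);
    // Every block holds exactly BLOCK words, so the inner fold has a fixed
    // trip count; the early-exit check happens once per block.
    let hit = blocks.any(|block| block.iter().fold(0u64, |acc, &w| acc | w) != 0);
    // `remainder()` is the tail of fewer than BLOCK words, which is fixed
    // when the iterator is created.
    hit || blocks.remainder().iter().any(|&w| w != 0)
}

fn main() {
    // 40 words: two full blocks of 16 plus a remainder of 8.
    let mut words = vec![0u64; 40];
    assert!(!any_bit_set(&words));

    words[39] = 1; // set bit lives in the remainder
    assert!(any_bit_set(&words));

    words[39] = 0;
    words[3] = 1 << 63; // set bit lives in the first full block
    assert!(any_bit_set(&words));
}
```

Whether the compiler actually unrolls the inner fold depends on the target and optimization level, so the block size is best chosen by benchmarking, as the commit does.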
Motivation
When working with `BooleanArray`, a common pattern is checking whether any true or false value exists, e.g. `arr.true_count() > 0` or `arr.false_count() == 0`. This currently requires `true_count()` / `false_count()`, which scan the entire bitmap to count every set bit (via `popcount`), even though we only need to know if at least one exists.

This PR adds `has_true()` and `has_false()` methods that short-circuit as soon as they find a matching value. In addition, `arr.has_true()` expresses intent more clearly than `arr.true_count() > 0`.

Callsites in DataFusion
There are several places in DataFusion that would benefit from these methods:
- `datafusion/functions-nested/src/array_has.rs`: `eq_array.true_count() > 0` → `eq_array.has_true()`
- `datafusion/physical-plan/src/topk/mod.rs`: `filter.true_count() == 0` check → `!filter.has_true()`
- `datafusion/datasource-parquet/src/metadata.rs`: `exactness.true_count() == 0` and `combined_mask.true_count() > 0`
- `datafusion/physical-plan/src/joins/nested_loop_join.rs`: `bitmap.true_count() == 0` checks
- `datafusion/physical-expr-common/src/physical_expr.rs`: `selection_count == 0` from `selection.true_count()`
- `datafusion/physical-expr/src/expressions/binary.rs`: short-circuit checks for AND/OR

Benchmark Results
The key wins are on larger arrays (65,536 elements), where `has_true`/`has_false` are up to 16-129x faster than `true_count()` in best-case scenarios (early short-circuit). Even in the worst case (must scan the entire array), performance is comparable to `true_count`.

Implementation
The implementation processes bits in 64-bit chunks using `UnalignedBitChunk`, which handles arbitrary bit offsets and aligns data for SIMD-friendly processing.
- `has_true` (no nulls): OR-folds 64-bit chunks, short-circuits when any bit is set
- `has_false` (no nulls): AND-folds 64-bit chunks, short-circuits when any bit is unset (with padding bits masked to 1)
- With nulls: iterates over paired `(null, value)` chunks, checking `null & value != 0` (`has_true`) or `null & !value != 0` (`has_false`)
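The null-aware checks listed above can be sketched on plain word slices. This sketch assumes both bitmaps are already aligned 64-bit words with zeroed padding, which is a simplification; the PR itself relies on `UnalignedBitChunk` for offsets and padding. The function names are hypothetical.

```rust
/// Hypothetical helper: true if some slot is both valid and `true`.
/// A set validity bit means "not null".
fn has_true_with_nulls(validity: &[u64], values: &[u64]) -> bool {
    validity.iter().zip(values).any(|(&n, &v)| (n & v) != 0)
}

/// Hypothetical helper: true if some slot is both valid and `false`.
/// Zeroed validity padding keeps padding bits from counting as `false`.
fn has_false_with_nulls(validity: &[u64], values: &[u64]) -> bool {
    validity.iter().zip(values).any(|(&n, &v)| (n & !v) != 0)
}

fn main() {
    // Slot 0 holds `true` but is null; slots 1 and 2 are valid and `false`.
    let validity = [0b0110u64];
    let values = [0b0001u64];
    assert!(!has_true_with_nulls(&validity, &values));
    assert!(has_false_with_nulls(&validity, &values));

    // Two valid slots, both `true`.
    assert!(has_true_with_nulls(&[0b0011], &[0b0011]));
    assert!(!has_false_with_nulls(&[0b0011], &[0b0011]));
}
```

Because `Iterator::any` stops at the first matching word, both helpers exit early in the same way the no-nulls paths do.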
Alternatives considered
- `true_count()` but with simpler bitwise ops instead of popcount. Marginally faster than `true_count()` but misses the main optimization opportunity.
- `self.iter().any(|v| v == Some(true))`. Simple but processes one bit at a time, missing SIMD vectorization of the inner loop. Our approach processes 64 bits at a time while still supporting early exit.

The chosen approach balances SIMD-friendly bulk processing (64 bits per iteration) with early termination, giving the best of both worlds.
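To make the trade-off concrete, here is a small sketch contrasting a bit-at-a-time scan with a word-at-a-time scan over a packed bitmap. Both functions are hypothetical stand-ins (neither is the arrow-rs implementation) and assume `words` holds at least `len` bits.

```rust
/// Bit-at-a-time: one test per element, analogous to `iter().any(..)`.
fn has_true_bitwise(words: &[u64], len: usize) -> bool {
    (0..len).any(|i| ((words[i / 64] >> (i % 64)) & 1) == 1)
}

/// Word-at-a-time: one test per 64 elements, still with early exit.
fn has_true_wordwise(words: &[u64], len: usize) -> bool {
    let (full, rem) = (len / 64, len % 64);
    words[..full].iter().any(|&w| w != 0)
        // Only the low `rem` bits of the final partial word are valid.
        || (rem != 0 && (words[full] & ((1u64 << rem) - 1)) != 0)
}

fn main() {
    // The only set bit is at position 74 (bit 10 of the second word).
    let words = [0u64, 1 << 10];
    for len in [0, 64, 70, 74, 75, 80, 128] {
        assert_eq!(
            has_true_bitwise(&words, len),
            has_true_wordwise(&words, len)
        );
    }
    assert!(!has_true_wordwise(&words, 74)); // bit 74 is outside the array
    assert!(has_true_wordwise(&words, 75)); // bit 74 is now in range
}
```

The two functions agree on every length; the difference is only in how many comparisons are needed before the answer is known.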
Test Plan
- lengths (65 elements, 64+1 with trailing false)
- `has_true`/`has_false` vs `true_count` across sizes (64, 1024, 65536) and data distributions

🤖 Generated with [Claude Code](https://claude.com/claude-code)