Optimize `take_fixed_size_binary` For Predefined Value Lengths by tobixdev · Pull Request #9535 · apache/arrow-rs

tobixdev · 2026-03-11T12:03:53Z

Which issue does this PR close?

Related to Performance improvements for take #279

Rationale for this change

The take kernel is very important for many operations (e.g., HashJoin in DataFusion IIRC). Currently, there is a gap between the performance of the take kernel for primitive arrays (e.g., DataType::UInt32) and fixed size binary arrays of the same length (e.g., FixedSizeBinary<4>).

In our case this lead to a performance reduction when moving from an integer-based id column to a fixed-size-binary-based id column. This PR aims to address parts of this gap.

The 16-bytes case would especially benefit operations on UUID columns.

What changes are included in this PR?

Add take_fixed_size that can be called for set of predefined fsb-lengths that we want to support. This is a "flat buffer" version of the take_native kernel.

Are these changes tested?

I've added another test that still exercises the non-optimized code path.

Are there any user-facing changes?

No

tobixdev · 2026-03-11T12:08:46Z

run benchmark take_kernels

alamb-ghbot · 2026-03-11T12:08:56Z

🤖 Hi @tobixdev, thanks for the request (#9535 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Jefffrey, Omega359, adriangb, alamb, comphead, etseidl, gabotechs, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

tobixdev · 2026-03-11T12:10:31Z

I assumed that it would not work but it was worth a shot. Would appreciate someone running the benchmark. I saw approximately -80% on my machine.

alamb · 2026-03-11T18:44:22Z

run benchmark take_kernels

alamb-ghbot · 2026-03-11T18:44:31Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

alamb-ghbot · 2026-03-11T18:55:07Z

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1328.7±5.30ns        ? ?/sec    1.00   1328.3±2.79ns        ? ?/sec
take bool 512                                                             1.00    728.2±8.58ns        ? ?/sec    1.00    727.1±2.58ns        ? ?/sec
take bool null indices 1024                                               1.00  1113.3±13.46ns        ? ?/sec    1.11  1232.8±24.14ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00      2.1±0.07µs        ? ?/sec    1.42      2.9±0.04µs        ? ?/sec
take check bounds i32 1024                                                1.10   930.9±68.94ns        ? ?/sec    1.00    844.0±4.14ns        ? ?/sec
take check bounds i32 512                                                 1.00    515.4±4.77ns        ? ?/sec    1.14    588.5±3.92ns        ? ?/sec
take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec
take i32 512                                                              1.00    441.8±1.44ns        ? ?/sec    1.00    442.8±1.33ns        ? ?/sec
take i32 null indices 1024                                                1.00    995.6±2.39ns        ? ?/sec    1.02  1015.2±126.82ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.02µs        ? ?/sec    1.00      2.0±0.03µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.08µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.9±0.21µs        ? ?/sec    1.00     20.8±0.09µs        ? ?/sec
take str 1024                                                             1.00     11.2±0.06µs        ? ?/sec    1.01     11.3±0.16µs        ? ?/sec
take str 512                                                              1.00      5.4±0.03µs        ? ?/sec    1.02      5.5±0.06µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.04µs        ? ?/sec    1.00      7.9±0.05µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.01µs        ? ?/sec    1.00      3.8±0.03µs        ? ?/sec
take str null values 1024                                                 1.00      8.7±0.03µs        ? ?/sec    1.00      8.6±0.07µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.0±0.03µs        ? ?/sec    1.00      7.0±0.08µs        ? ?/sec
take stringview 1024                                                      1.00    809.3±8.35ns        ? ?/sec    1.10    893.3±7.40ns        ? ?/sec
take stringview 512                                                       1.00    579.9±4.53ns        ? ?/sec    1.01    587.5±8.20ns        ? ?/sec
take stringview null indices 1024                                         1.00  1440.7±21.77ns        ? ?/sec    1.01   1451.7±4.98ns        ? ?/sec
take stringview null indices 512                                          1.00   718.1±19.57ns        ? ?/sec    1.11    800.5±0.89ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.00µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.01µs        ? ?/sec    1.06      2.4±0.04µs        ? ?/sec

tobixdev · 2026-03-11T19:13:38Z

take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec

It's not quite 80% but still significant. Maybe this difference is due to Aarch64 vs. x86 (my PC).

take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec

We're also close to the primitive kernel! Parts of the gap could be explained by the entry size (4 vs. 12 bytes).

alamb · 2026-03-11T20:29:39Z

run benchmark take_kernels

alamb · 2026-03-11T20:29:50Z

Will run once more to ensure we can reproduce the results. They look good

alamb-ghbot · 2026-03-11T20:29:51Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

alamb · 2026-03-11T20:30:48Z

arrow-select/src/take.rs

-        }
-    }
+    let result_buffer = match size_usize {
+        1 => take_fixed_size::<IndexType, 1>(values.values(), indices),


as I read this it results in 7 copies of the code which is probably ok here but we do have to be careful in general to avoid too much bloat

Yes that's definetely a drawback (for compile time and binary size).

alamb-ghbot · 2026-03-11T20:40:09Z

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1327.8±4.95ns        ? ?/sec    1.00  1329.2±12.95ns        ? ?/sec
take bool 512                                                             1.00    726.9±3.22ns        ? ?/sec    1.00   727.1±14.49ns        ? ?/sec
take bool null indices 1024                                               1.00   1097.8±5.99ns        ? ?/sec    1.11  1215.4±31.46ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00  1977.1±21.64ns        ? ?/sec    1.48      2.9±0.05µs        ? ?/sec
take check bounds i32 1024                                                1.08    917.4±7.58ns        ? ?/sec    1.00   846.9±12.66ns        ? ?/sec
take check bounds i32 512                                                 1.00   516.9±21.67ns        ? ?/sec    1.14    587.1±1.14ns        ? ?/sec
take i32 1024                                                             1.00    712.5±1.52ns        ? ?/sec    1.01    717.9±2.41ns        ? ?/sec
take i32 512                                                              1.00    441.9±0.76ns        ? ?/sec    1.01    444.9±9.14ns        ? ?/sec
take i32 null indices 1024                                                1.00    993.5±2.18ns        ? ?/sec    1.00    993.2±2.50ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.00µs        ? ?/sec    1.00      2.0±0.00µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.02µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     21.0±0.24µs        ? ?/sec    1.00     21.0±0.10µs        ? ?/sec
take str 1024                                                             1.01     11.3±0.09µs        ? ?/sec    1.00     11.2±0.11µs        ? ?/sec
take str 512                                                              1.00      5.4±0.02µs        ? ?/sec    1.02      5.5±0.09µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.20µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
take str null values 1024                                                 1.00      8.6±0.13µs        ? ?/sec    1.00      8.6±0.03µs        ? ?/sec
take str null values null indices 1024                                    1.00      6.9±0.06µs        ? ?/sec    1.02      7.0±0.09µs        ? ?/sec
take stringview 1024                                                      1.00    811.9±1.46ns        ? ?/sec    1.09   887.2±12.82ns        ? ?/sec
take stringview 512                                                       1.00    582.0±7.36ns        ? ?/sec    1.01    588.3±2.38ns        ? ?/sec
take stringview null indices 1024                                         1.00  1436.1±22.29ns        ? ?/sec    1.01  1449.0±24.56ns        ? ?/sec
take stringview null indices 512                                          1.00    727.5±2.41ns        ? ?/sec    1.10    801.0±4.78ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.05µs        ? ?/sec    1.00      2.1±0.02µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.02µs        ? ?/sec    1.05      2.4±0.03µs        ? ?/sec

alamb · 2026-03-11T21:59:38Z

take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec

🚀

Dandandan · 2026-03-13T19:35:50Z

arrow-select/src/take.rs

+        2 => take_fixed_size::<IndexType, 2>(values.values(), indices),
+        4 => take_fixed_size::<IndexType, 4>(values.values(), indices),
+        8 => take_fixed_size::<IndexType, 8>(values.values(), indices),
+        12 => take_fixed_size::<IndexType, 12>(values.values(), indices),


Is 12 a common enough size?

That's a good question. All of these are only based on a "gut feeling" that these are used in the wild.

I am happy to remove it and change the benchmark (currently 12 is used). But I suppose that's also just a chosen random value.

What is an example of 12 bytes of data that are stored in a fixed size byte array?

I think 16 is common for UUID.

Dandandan · 2026-03-17T06:31:53Z

arrow-select/src/take.rs

+                    &values_buffer[offset..offset + size_usize]
+                })
+            });
+            for slice in array_iter {


This might be specialized for non-null as well (

Do you have a pointer on how we could do that? From what I can tell, array_iter will always have a None value due to indices.null_count() > 0. The null buffer of the values array is only considered at the caller.

Maybe I am also thinking into a different direction.

Oh wait - the None / Some here is from the indices, not the values.

I think we can build the values / nulls separately like we do in other take kernels.

The nulls we can just cheaply (ref) clone from array_iter, this avoids rebuilding the entire null buffer from scratch.

I think I had it implemented like this before. Unfortunately, ignoring nulls in the loop that looks up the indices from the buffer could trigger a panic for an out-of-range index that is null. For example, when using the indices 0 (valid), 1 (valid), 10 (null) on an array with length 5, the last access would panic even though the index is null.

If you have a suggestion on how I could separate the null checking from the code I am happy to try it out and report the performance result.

Dandandan · 2026-03-17T08:26:47Z

arrow-select/src/take.rs

+        size_usize: usize,
+    ) -> Buffer {
+        let values_buffer = values.values().as_slice();
+        let mut values_buffer_builder = BufferBuilder::new(indices.len() * size_usize);


This overallocates for the null case (we could use the indices.null_count() if we want to reduce memory for this.

I think FSB is treated as a primitive array which also allocates for null values so that we can index into the buffer. So I think we need to allocate for null values.

Here is the code for FixedSizeBinaryArray::new_null:

pub fn new_null(size: i32, len: usize) -> Self { const BITS_IN_A_BYTE: usize = 8; let capacity_in_bytes = size.to_usize().unwrap().checked_mul(len).unwrap(); Self { data_type: DataType::FixedSizeBinary(size), value_data: MutableBuffer::new_null(capacity_in_bytes * BITS_IN_A_BYTE).into(), nulls: Some(NullBuffer::new_null(len)), value_length: size, len, } }

alamb

I think this looks find to me -- thanks @Dandandan and @tobixdev

My only question is about the 12 byte case. Let us know what you want to do about that one @tobixdev

alamb · 2026-03-17T18:28:42Z

arrow-select/src/take.rs

+        2 => take_fixed_size::<IndexType, 2>(values.values(), indices),
+        4 => take_fixed_size::<IndexType, 4>(values.values(), indices),
+        8 => take_fixed_size::<IndexType, 8>(values.values(), indices),
+        12 => take_fixed_size::<IndexType, 12>(values.values(), indices),


What is an example of 12 bytes of data that are stored in a fixed size byte array?

I think 16 is common for UUID.

tobixdev · 2026-03-18T08:04:56Z

I think this looks find to me -- thanks @Dandandan and @tobixdev

My only question is about the 12 byte case. Let us know what you want to do about that one @tobixdev

Thank you for the reviews @Dandandan and @alamb !
I've removed the 12-byte case for now and added another benchmark that exercieses the fixed-length code path (previously 12 was used). I've also changed the name to avoid "primitive" such that we can avoid confusion with the PrimitiveArray.

Regarding the 12-byte case, I guess here it's just best if we wait for users to come to us with use cases.

@Dandandan LMK if you think there is another straightforwards opportunity that I am missing.

My Results:

take fsb value len: 12, indices: 1024
                        time:   [2.2519 µs 2.2805 µs 2.3148 µs]
Found 21 outliers among 100 measurements (21.00%)
  2 (2.00%) high mild
  19 (19.00%) high severe

take fsb value len: 12, null values, indices: 1024
                        time:   [2.7533 µs 2.7818 µs 2.8160 µs]
Found 22 outliers among 100 measurements (22.00%)
  2 (2.00%) high mild
  20 (20.00%) high severe

take fsb value optimized len: 16, indices: 1024
                        time:   [370.33 ns 371.12 ns 371.96 ns]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

take fsb value optimized len: 16, null values, indices: 1024
                        time:   [892.83 ns 893.59 ns 894.43 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

tobixdev added 2 commits March 11, 2026 12:49

Optimize take_fixed_size_binary for a set of predefined value lengths

601c63a

Cleanup

7aee3d9

github-actions bot added the arrow Changes to the arrow crate label Mar 11, 2026

alamb reviewed Mar 11, 2026

View reviewed changes

Dandandan reviewed Mar 13, 2026

View reviewed changes

Dandandan approved these changes Mar 17, 2026

View reviewed changes

Dandandan reviewed Mar 17, 2026

View reviewed changes

alamb approved these changes Mar 17, 2026

View reviewed changes

tobixdev added 3 commits March 18, 2026 08:55

Remove optimization for take 12-byte fixed size binary

37b040b

Fix take kernels

24b5cce

Formatting

240f132

Conversation

tobixdev commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

tobixdev commented Mar 11, 2026

Uh oh!

alamb-ghbot commented Mar 11, 2026

Uh oh!

tobixdev commented Mar 11, 2026

Uh oh!

alamb commented Mar 11, 2026

Uh oh!

alamb-ghbot commented Mar 11, 2026

Uh oh!

alamb-ghbot commented Mar 11, 2026

Uh oh!

tobixdev commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Mar 11, 2026

Uh oh!

alamb commented Mar 11, 2026

Uh oh!

alamb-ghbot commented Mar 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb-ghbot commented Mar 11, 2026

Uh oh!

alamb commented Mar 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tobixdev Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tobixdev Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tobixdev commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tobixdev commented Mar 11, 2026 •

edited

Loading

tobixdev commented Mar 11, 2026 •

edited

Loading

tobixdev Mar 16, 2026 •

edited

Loading

tobixdev Mar 17, 2026 •

edited

Loading

Dandandan Mar 17, 2026 •

edited

Loading

tobixdev commented Mar 18, 2026 •

edited

Loading