Skip to content

Optimize take_fixed_size_binary For Predefined Value Lengths#9535

Open
tobixdev wants to merge 5 commits intoapache:mainfrom
tobixdev:improve-take-fsb
Open

Optimize take_fixed_size_binary For Predefined Value Lengths#9535
tobixdev wants to merge 5 commits intoapache:mainfrom
tobixdev:improve-take-fsb

Conversation

@tobixdev
Copy link
Contributor

@tobixdev tobixdev commented Mar 11, 2026

Which issue does this PR close?

Rationale for this change

The take kernel is very important for many operations (e.g., HashJoin in DataFusion IIRC). Currently, there is a gap between the performance of the take kernel for primitive arrays (e.g., DataType::UInt32) and fixed size binary arrays of the same length (e.g., FixedSizeBinary<4>).

In our case this lead to a performance reduction when moving from an integer-based id column to a fixed-size-binary-based id column. This PR aims to address parts of this gap.

The 16-bytes case would especially benefit operations on UUID columns.

What changes are included in this PR?

  • Add take_fixed_size that can be called for set of predefined fsb-lengths that we want to support. This is a "flat buffer" version of the take_native kernel.

Are these changes tested?

I've added another test that still exercises the non-optimized code path.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 11, 2026
@tobixdev
Copy link
Contributor Author

run benchmark take_kernels

@alamb-ghbot
Copy link

🤖 Hi @tobixdev, thanks for the request (#9535 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Jefffrey, Omega359, adriangb, alamb, comphead, etseidl, gabotechs, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

@tobixdev
Copy link
Contributor Author

I assumed that it would not work but it was worth a shot. Would appreciate someone running the benchmark. I saw approximately -80% on my machine.

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

run benchmark take_kernels

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1328.7±5.30ns        ? ?/sec    1.00   1328.3±2.79ns        ? ?/sec
take bool 512                                                             1.00    728.2±8.58ns        ? ?/sec    1.00    727.1±2.58ns        ? ?/sec
take bool null indices 1024                                               1.00  1113.3±13.46ns        ? ?/sec    1.11  1232.8±24.14ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00      2.1±0.07µs        ? ?/sec    1.42      2.9±0.04µs        ? ?/sec
take check bounds i32 1024                                                1.10   930.9±68.94ns        ? ?/sec    1.00    844.0±4.14ns        ? ?/sec
take check bounds i32 512                                                 1.00    515.4±4.77ns        ? ?/sec    1.14    588.5±3.92ns        ? ?/sec
take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec
take i32 512                                                              1.00    441.8±1.44ns        ? ?/sec    1.00    442.8±1.33ns        ? ?/sec
take i32 null indices 1024                                                1.00    995.6±2.39ns        ? ?/sec    1.02  1015.2±126.82ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.02µs        ? ?/sec    1.00      2.0±0.03µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.08µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.9±0.21µs        ? ?/sec    1.00     20.8±0.09µs        ? ?/sec
take str 1024                                                             1.00     11.2±0.06µs        ? ?/sec    1.01     11.3±0.16µs        ? ?/sec
take str 512                                                              1.00      5.4±0.03µs        ? ?/sec    1.02      5.5±0.06µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.04µs        ? ?/sec    1.00      7.9±0.05µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.01µs        ? ?/sec    1.00      3.8±0.03µs        ? ?/sec
take str null values 1024                                                 1.00      8.7±0.03µs        ? ?/sec    1.00      8.6±0.07µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.0±0.03µs        ? ?/sec    1.00      7.0±0.08µs        ? ?/sec
take stringview 1024                                                      1.00    809.3±8.35ns        ? ?/sec    1.10    893.3±7.40ns        ? ?/sec
take stringview 512                                                       1.00    579.9±4.53ns        ? ?/sec    1.01    587.5±8.20ns        ? ?/sec
take stringview null indices 1024                                         1.00  1440.7±21.77ns        ? ?/sec    1.01   1451.7±4.98ns        ? ?/sec
take stringview null indices 512                                          1.00   718.1±19.57ns        ? ?/sec    1.11    800.5±0.89ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.00µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.01µs        ? ?/sec    1.06      2.4±0.04µs        ? ?/sec

@tobixdev
Copy link
Contributor Author

tobixdev commented Mar 11, 2026

take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec

It's not quite 80% but still significant. Maybe this difference is due to Aarch64 vs. x86 (my PC).

take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec

We're also close to the primitive kernel! Parts of the gap could be explained by the entry size (4 vs. 12 bytes).

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

run benchmark take_kernels

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

Will run once more to ensure we can reproduce the results. They look good

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

}
}
let result_buffer = match size_usize {
1 => take_fixed_size::<IndexType, 1>(values.values(), indices),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I read this it results in 7 copies of the code which is probably ok here but we do have to be careful in general to avoid too much bloat

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's definetely a drawback (for compile time and binary size).

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1327.8±4.95ns        ? ?/sec    1.00  1329.2±12.95ns        ? ?/sec
take bool 512                                                             1.00    726.9±3.22ns        ? ?/sec    1.00   727.1±14.49ns        ? ?/sec
take bool null indices 1024                                               1.00   1097.8±5.99ns        ? ?/sec    1.11  1215.4±31.46ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00  1977.1±21.64ns        ? ?/sec    1.48      2.9±0.05µs        ? ?/sec
take check bounds i32 1024                                                1.08    917.4±7.58ns        ? ?/sec    1.00   846.9±12.66ns        ? ?/sec
take check bounds i32 512                                                 1.00   516.9±21.67ns        ? ?/sec    1.14    587.1±1.14ns        ? ?/sec
take i32 1024                                                             1.00    712.5±1.52ns        ? ?/sec    1.01    717.9±2.41ns        ? ?/sec
take i32 512                                                              1.00    441.9±0.76ns        ? ?/sec    1.01    444.9±9.14ns        ? ?/sec
take i32 null indices 1024                                                1.00    993.5±2.18ns        ? ?/sec    1.00    993.2±2.50ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.00µs        ? ?/sec    1.00      2.0±0.00µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.02µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     21.0±0.24µs        ? ?/sec    1.00     21.0±0.10µs        ? ?/sec
take str 1024                                                             1.01     11.3±0.09µs        ? ?/sec    1.00     11.2±0.11µs        ? ?/sec
take str 512                                                              1.00      5.4±0.02µs        ? ?/sec    1.02      5.5±0.09µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.20µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
take str null values 1024                                                 1.00      8.6±0.13µs        ? ?/sec    1.00      8.6±0.03µs        ? ?/sec
take str null values null indices 1024                                    1.00      6.9±0.06µs        ? ?/sec    1.02      7.0±0.09µs        ? ?/sec
take stringview 1024                                                      1.00    811.9±1.46ns        ? ?/sec    1.09   887.2±12.82ns        ? ?/sec
take stringview 512                                                       1.00    582.0±7.36ns        ? ?/sec    1.01    588.3±2.38ns        ? ?/sec
take stringview null indices 1024                                         1.00  1436.1±22.29ns        ? ?/sec    1.01  1449.0±24.56ns        ? ?/sec
take stringview null indices 512                                          1.00    727.5±2.41ns        ? ?/sec    1.10    801.0±4.78ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.05µs        ? ?/sec    1.00      2.1±0.02µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.02µs        ? ?/sec    1.05      2.4±0.03µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec

🚀

2 => take_fixed_size::<IndexType, 2>(values.values(), indices),
4 => take_fixed_size::<IndexType, 4>(values.values(), indices),
8 => take_fixed_size::<IndexType, 8>(values.values(), indices),
12 => take_fixed_size::<IndexType, 12>(values.values(), indices),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 12 a common enough size?

Copy link
Contributor Author

@tobixdev tobixdev Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. All of these are only based on a "gut feeling" that these are used in the wild.

I am happy to remove it and change the benchmark (currently 12 is used). But I suppose that's also just a chosen random value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an example of 12 bytes of data that are stored in a fixed size byte array?

I think 16 is common for UUID.

&values_buffer[offset..offset + size_usize]
})
});
for slice in array_iter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be specialized for non-null as well (

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a pointer on how we could do that? From what I can tell, array_iter will always have a None value due to indices.null_count() > 0. The null buffer of the values array is only considered at the caller.

Maybe I am also thinking into a different direction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait - the None / Some here is from the indices, not the values.

I think we can build the values / nulls separately like we do in other take kernels.

The nulls we can just cheaply (ref) clone from array_iter, this avoids rebuilding the entire null buffer from scratch.

Copy link
Contributor Author

@tobixdev tobixdev Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I had it implemented like this before. Unfortunately, ignoring nulls in the loop that looks up the indices from the buffer could trigger a panic for an out-of-range index that is null. For example, when using the indices 0 (valid), 1 (valid), 10 (null) on an array with length 5, the last access would panic even though the index is null.

If you have a suggestion on how I could separate the null checking from the code I am happy to try it out and report the performance result.

size_usize: usize,
) -> Buffer {
let values_buffer = values.values().as_slice();
let mut values_buffer_builder = BufferBuilder::new(indices.len() * size_usize);
Copy link
Contributor

@Dandandan Dandandan Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overallocates for the null case (we could use the indices.null_count() if we want to reduce memory for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FSB is treated as a primitive array which also allocates for null values so that we can index into the buffer. So I think we need to allocate for null values.

Here is the code for FixedSizeBinaryArray::new_null:

pub fn new_null(size: i32, len: usize) -> Self {
        const BITS_IN_A_BYTE: usize = 8;
        let capacity_in_bytes = size.to_usize().unwrap().checked_mul(len).unwrap();
        Self {
            data_type: DataType::FixedSizeBinary(size),
            value_data: MutableBuffer::new_null(capacity_in_bytes * BITS_IN_A_BYTE).into(),
            nulls: Some(NullBuffer::new_null(len)),
            value_length: size,
            len,
        }
    }

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks find to me -- thanks @Dandandan and @tobixdev

My only question is about the 12 byte case. Let us know what you want to do about that one @tobixdev

2 => take_fixed_size::<IndexType, 2>(values.values(), indices),
4 => take_fixed_size::<IndexType, 4>(values.values(), indices),
8 => take_fixed_size::<IndexType, 8>(values.values(), indices),
12 => take_fixed_size::<IndexType, 12>(values.values(), indices),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an example of 12 bytes of data that are stored in a fixed size byte array?

I think 16 is common for UUID.

@tobixdev
Copy link
Contributor Author

tobixdev commented Mar 18, 2026

I think this looks find to me -- thanks @Dandandan and @tobixdev

My only question is about the 12 byte case. Let us know what you want to do about that one @tobixdev

Thank you for the reviews @Dandandan and @alamb !
I've removed the 12-byte case for now and added another benchmark that exercieses the fixed-length code path (previously 12 was used). I've also changed the name to avoid "primitive" such that we can avoid confusion with the PrimitiveArray.

Regarding the 12-byte case, I guess here it's just best if we wait for users to come to us with use cases.

@Dandandan LMK if you think there is another straightforwards opportunity that I am missing.

My Results:

take fsb value len: 12, indices: 1024
                        time:   [2.2519 µs 2.2805 µs 2.3148 µs]
Found 21 outliers among 100 measurements (21.00%)
  2 (2.00%) high mild
  19 (19.00%) high severe

take fsb value len: 12, null values, indices: 1024
                        time:   [2.7533 µs 2.7818 µs 2.8160 µs]
Found 22 outliers among 100 measurements (22.00%)
  2 (2.00%) high mild
  20 (20.00%) high severe

take fsb value optimized len: 16, indices: 1024
                        time:   [370.33 ns 371.12 ns 371.96 ns]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

take fsb value optimized len: 16, null values, indices: 1024
                        time:   [892.83 ns 893.59 ns 894.43 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants