Skip to content

Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder in json-reader#9497

Open
liamzwbao wants to merge 3 commits intoapache:mainfrom
liamzwbao:issue-9298-replace-array-data
Open

Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder in json-reader#9497
liamzwbao wants to merge 3 commits intoapache:mainfrom
liamzwbao:issue-9298-replace-array-data

Conversation

@liamzwbao
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

While implementing ListViewArrayDecoder in arrow-json, I noticed we could potentially retire ArrayDataBuilder inside ListArrayDecoder. Therefore, I'd like to use a small PR here to make sure there's no regression

What changes are included in this PR?

Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No

@liamzwbao liamzwbao marked this pull request as ready for review March 2, 2026 02:28
@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 2, 2026
@liamzwbao
Copy link
Contributor Author

Hi @alamb @Jefffrey , would appreciate a review and a benchmark for this PR. Thanks!

@alamb
Copy link
Contributor

alamb commented Mar 2, 2026

Any particular benchmark you want me to run?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liamzwbao -- this looks good to me -- I have just one question

// Safety
// Validated lengths above
Ok(unsafe { data.build_unchecked() })
let array = GenericListArray::<O>::try_new(field, offsets, values, nulls)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does try_new validate the offsets ? That could be a significant performance hit

Basically as long as this doesn't do additional validation I think it looks good to me

Copy link
Contributor Author

@liamzwbao liamzwbao Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it will do validation, but I think it's cheap and the local benchmark does not cause a regression. BTW, we don't have an unchecked API for ListArray, the new API simply unwrap the try_new.

/// * `offsets.len() - 1 != nulls.len()`
/// * `offsets.last() > values.len()`
/// * `!field.is_nullable() && values.is_nullable()`
/// * `field.data_type() != values.data_type()`
pub fn try_new(
field: FieldRef,
offsets: OffsetBuffer<OffsetSize>,
values: ArrayRef,
nulls: Option<NullBuffer>,
) -> Result<Self, ArrowError> {
let len = offsets.len() - 1; // Offsets guaranteed to not be empty
let end_offset = offsets.last().unwrap().as_usize();
// don't need to check other values of `offsets` because they are checked
// during construction of `OffsetBuffer`
if end_offset > values.len() {
return Err(ArrowError::InvalidArgumentError(format!(
"Max offset of {end_offset} exceeds length of values {}",
values.len()
)));
}
if let Some(n) = nulls.as_ref() {
if n.len() != len {
return Err(ArrowError::InvalidArgumentError(format!(
"Incorrect length of null buffer for {}ListArray, expected {len} got {}",
OffsetSize::PREFIX,
n.len(),
)));
}
}
if !field.is_nullable() && values.is_nullable() {
return Err(ArrowError::InvalidArgumentError(format!(
"Non-nullable field of {}ListArray {:?} cannot contain nulls",
OffsetSize::PREFIX,
field.name()
)));
}
if field.data_type() != values.data_type() {
return Err(ArrowError::InvalidArgumentError(format!(
"{}ListArray expected data type {} got {} for {:?}",
OffsetSize::PREFIX,
field.data_type(),
values.data_type(),
field.name()
)));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks on our runner seem to suggest that this PR is slower for some reason 🤔

// Validated lengths above
Ok(unsafe { data.build_unchecked() })
let array = GenericListArray::<O>::try_new(field, offsets, values, nulls)?;
Ok(array.into_data())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this call simply creates an ArrayData (which is necessary given the API) so I am not sure it actually avoids any ArrayDatas

In order to avoid ArrayData we would probably need to change the signature of decode to return an ArrayRef directly (rather than ArrayData)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I can change the signature of all the decoders if you think it's worth a try

@liamzwbao
Copy link
Contributor Author

Any particular benchmark you want me to run?

This one should be good: https://github.com/apache/arrow-rs/blob/main/arrow-json/benches/json-reader.rs
I tested it locally (MacOS M3 MAX) and there's no regression. That's why I brought it up to see if we can get similar results on a cloud server

@liamzwbao
Copy link
Contributor Author

Hi @alamb, just noticed that we don't have benchmark for List yet. I will create one and get back to you later

@liamzwbao liamzwbao marked this pull request as draft March 4, 2026 02:29
alamb pushed a commit that referenced this pull request Mar 11, 2026
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Relates to #9497.

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Add benchmark for `ListArray` in `json_reader` to support the
performance evaluation of #9497

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Benchmarks for decoding and serialize json list to `ListArray`.
- Benchmarks for `ListArray` and `FixedSizeListArray` json writer

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Benchmarks only

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

No
@liamzwbao liamzwbao force-pushed the issue-9298-replace-array-data branch from 6155c94 to 7daefe6 Compare March 12, 2026 00:55
@liamzwbao liamzwbao marked this pull request as ready for review March 12, 2026 00:57
@liamzwbao
Copy link
Contributor Author

Hi @alamb, this PR is ready for the benchmark, could you please run json_reader bench in arrow-json?

@alamb
Copy link
Contributor

alamb commented Mar 12, 2026

run benchmark json_reader

@alamb-ghbot

This comment was marked as outdated.

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-9298-replace-array-data (7daefe6) to 6931d88 diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench json_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-9298-replace-array-data
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                        issue-9298-replace-array-data          main
-----                                        -----------------------------          ----
decode_binary_hex_json                       1.03     27.2±0.62ms        ? ?/sec    1.00     26.4±0.14ms        ? ?/sec
decode_binary_view_hex_json                  1.07     26.4±0.18ms        ? ?/sec    1.00     24.7±0.19ms        ? ?/sec
decode_fixed_binary_hex_json                 1.06     25.7±0.35ms        ? ?/sec    1.00     24.1±0.11ms        ? ?/sec
decode_list_long_i64_json/131072             1.01    509.1±3.11ms   153.8 MB/sec    1.00    502.5±3.59ms   155.8 MB/sec
decode_list_long_i64_serialize               1.02    465.6±2.79ms        ? ?/sec    1.00    457.9±3.14ms        ? ?/sec
decode_list_short_i64_json/131072            1.01     33.2±0.22ms   157.1 MB/sec    1.00     33.0±0.11ms   158.4 MB/sec
decode_list_short_i64_serialize              1.01     28.2±0.30ms        ? ?/sec    1.00     27.9±0.48ms        ? ?/sec
decode_wide_object_i64_json                  1.00  1029.8±14.18ms        ? ?/sec    1.00  1026.3±18.52ms        ? ?/sec
decode_wide_object_i64_serialize             1.00    864.7±7.48ms        ? ?/sec    1.01    876.1±9.59ms        ? ?/sec
decode_wide_projection_full_json/131072      1.01   1863.6±9.72ms    93.4 MB/sec    1.00  1848.8±10.87ms    94.1 MB/sec
decode_wide_projection_narrow_json/131072    1.00    793.3±4.84ms   219.3 MB/sec    1.01    799.6±5.73ms   217.6 MB/sec
large_bench_primitive                        1.02      2.4±0.01ms        ? ?/sec    1.00      2.3±0.01ms        ? ?/sec
small_bench_list                             1.11     12.9±0.18µs        ? ?/sec    1.00     11.7±0.19µs        ? ?/sec
small_bench_primitive                        1.00      6.1±0.19µs        ? ?/sec    1.01      6.2±0.18µs        ? ?/sec
small_bench_primitive_with_utf8view          1.04      6.4±0.08µs        ? ?/sec    1.00      6.2±0.13µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Mar 12, 2026

run benchmark json_reader

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-9298-replace-array-data (7daefe6) to 6931d88 diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench json_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-9298-replace-array-data
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                        issue-9298-replace-array-data          main
-----                                        -----------------------------          ----
decode_binary_hex_json                       1.03     27.2±0.16ms        ? ?/sec    1.00     26.4±0.13ms        ? ?/sec
decode_binary_view_hex_json                  1.05     26.3±0.46ms        ? ?/sec    1.00     25.0±0.09ms        ? ?/sec
decode_fixed_binary_hex_json                 1.07     25.6±0.24ms        ? ?/sec    1.00     23.8±0.20ms        ? ?/sec
decode_list_long_i64_json/131072             1.01    510.4±2.17ms   153.4 MB/sec    1.00    503.7±3.04ms   155.5 MB/sec
decode_list_long_i64_serialize               1.00    465.4±3.72ms        ? ?/sec    1.00    465.2±4.51ms        ? ?/sec
decode_list_short_i64_json/131072            1.01     33.2±0.27ms   157.1 MB/sec    1.00     32.9±0.09ms   158.7 MB/sec
decode_list_short_i64_serialize              1.03     28.7±0.70ms        ? ?/sec    1.00     27.7±0.53ms        ? ?/sec
decode_wide_object_i64_json                  1.02  1033.6±12.94ms        ? ?/sec    1.00  1010.8±12.48ms        ? ?/sec
decode_wide_object_i64_serialize             1.00    875.4±7.13ms        ? ?/sec    1.01    885.5±6.89ms        ? ?/sec
decode_wide_projection_full_json/131072      1.00  1888.0±13.86ms    92.2 MB/sec    1.01  1904.1±13.15ms    91.4 MB/sec
decode_wide_projection_narrow_json/131072    1.00    795.5±9.31ms   218.7 MB/sec    1.01    803.2±3.76ms   216.6 MB/sec
large_bench_primitive                        1.02      2.4±0.01ms        ? ?/sec    1.00      2.3±0.01ms        ? ?/sec
small_bench_list                             1.11     12.8±0.14µs        ? ?/sec    1.00     11.6±0.05µs        ? ?/sec
small_bench_primitive                        1.00      6.1±0.06µs        ? ?/sec    1.00      6.1±0.06µs        ? ?/sec
small_bench_primitive_with_utf8view          1.02      6.4±0.06µs        ? ?/sec    1.00      6.2±0.11µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Mar 17, 2026

run benchmark json_reader

@adriangbot
Copy link

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4077338921-384-gl4sd 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing issue-9298-replace-array-data (7daefe6) to 6931d88 (merge-base) diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench json_reader
BENCH_FILTER=
Results will be posted here when complete

@alamb alamb changed the title Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder in json-reader Mar 17, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @liamzwbao

the benchmarks seem to show that this PR is slower for some reason. Could you look into that and figure out why?

// Safety
// Validated lengths above
Ok(unsafe { data.build_unchecked() })
let array = GenericListArray::<O>::try_new(field, offsets, values, nulls)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks on our runner seem to suggest that this PR is slower for some reason 🤔

@adriangbot
Copy link

🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                        issue-9298-replace-array-data          main
-----                                        -----------------------------          ----
decode_binary_hex_json                       1.00     13.7±0.04ms        ? ?/sec    1.01     13.9±0.04ms        ? ?/sec
decode_binary_view_hex_json                  1.00     14.3±0.09ms        ? ?/sec    1.01     14.5±0.04ms        ? ?/sec
decode_fixed_binary_hex_json                 1.01     13.9±0.15ms        ? ?/sec    1.00     13.8±0.07ms        ? ?/sec
decode_list_long_i64_json/131072             1.01    312.4±2.36ms   250.6 MB/sec    1.00    310.4±2.34ms   252.3 MB/sec
decode_list_long_i64_serialize               1.02    188.7±5.46ms        ? ?/sec    1.00    184.9±5.23ms        ? ?/sec
decode_list_short_i64_json/131072            1.00     20.1±0.04ms   259.5 MB/sec    1.00     20.1±0.03ms   259.2 MB/sec
decode_list_short_i64_serialize              1.03     11.9±0.67ms        ? ?/sec    1.00     11.5±0.52ms        ? ?/sec
decode_wide_object_i64_json                  1.02    471.6±9.06ms        ? ?/sec    1.00    462.6±8.49ms        ? ?/sec
decode_wide_object_i64_serialize             1.00   438.0±11.28ms        ? ?/sec    1.00   435.9±15.59ms        ? ?/sec
decode_wide_projection_full_json/131072      1.01   779.6±14.92ms   223.2 MB/sec    1.00   774.1±13.72ms   224.8 MB/sec
decode_wide_projection_narrow_json/131072    1.00    439.1±3.39ms   396.3 MB/sec    1.02    448.2±4.45ms   388.2 MB/sec
large_bench_primitive                        1.01   1533.7±2.90µs        ? ?/sec    1.00   1514.9±3.32µs        ? ?/sec
small_bench_list                             1.10      8.7±0.05µs        ? ?/sec    1.00      7.9±0.03µs        ? ?/sec
small_bench_primitive                        1.01      4.5±0.03µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
small_bench_primitive_with_utf8view          1.02      4.5±0.02µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 294.5s
Peak memory 3.1 GiB
Avg memory 2.5 GiB
CPU user 277.5s
CPU sys 16.8s
Disk read 0 B
Disk write 519.0 MiB

branch

Metric Value
Wall time 301.7s
Peak memory 3.1 GiB
Avg memory 2.5 GiB
CPU user 283.6s
CPU sys 18.0s
Disk read 0 B
Disk write 1.7 MiB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants