Fix MutableArrayData::extend for ListView#9562
Conversation
MutableArrayData's build_extend for ListView/LargeListView copies offsets and sizes from the source array but never extends the child data. This produces invalid arrays where offsets reference nonexistent child elements, causing panics on access.
| // Copy the child range | ||
| let child_base = if child_max > child_min { | ||
| let base = mutable.child_data[0].len(); | ||
| mutable.child_data[0].extend(index, child_min, child_max); |
There was a problem hiding this comment.
Are you referring to extend_with_slice?
There was a problem hiding this comment.
No, probably the slice operation!
| } | ||
|
|
||
| #[test] | ||
| fn test_list_view_mutable_array_data_extend() { |
There was a problem hiding this comment.
Can. you please also add tests with a ListView with overlapping and non contigous data?
| for &offset in &offsets[start..start + len] { | ||
| offset_buffer.push(offset); | ||
| } | ||
| // MutableArrayData builds a new independent array, so we must copy the child values |
There was a problem hiding this comment.
My reading of this code is that it will copy some contiguous range of values
One tradeoff with this approach is that it will reduce the number of memcpys (which is good) at the expense of potentially copying a bunch of garbage
So for (an extreme) example, if the list view is coping two elements:
[0](with offset 0, length 1)[1](offset 1000000, length 1)
I think this code is going to copy 1M entries when it could have just copied 2.
On the other hand, copying the entire chunk of underlying list values will preserve any aliasing that may exist (e.g. a ListView whose elements point at overlapping subsets) 🤔
What do you think about using a strategy to copy the lists one by one (that is to say, copy the lists one by one). The output ListView would have no "garbage" but would also be uncollected
I think this approach would be more aligned with what other types to. For example, StringView copies the elements one by one as well (though it is more complicated given it has multiple buffers):
arrow-rs/arrow-data/src/transform/variable_size.rs
Lines 48 to 57 in 04cd6cf
There was a problem hiding this comment.
This is a very good point, that's not good. I think your strategy sounds better.
|
I see now there is an alternate proposal from @asubiotto here |
Which issue does this PR close?
The existing implementation in list_view.rs copies offsets and sizes but skips copying child data. The original code copies offsets and sizes into a new array that has an empty child values array, so the offsets may point into nothing if the child data doesn't exist anymore.