Decompress / decoding in parquet reader improvements#9577
Dandandan wants to merge 9 commits into apache:main
Conversation
Avoid unnecessary buffer zero-fill in Snappy decompression by writing directly into spare capacity, and reduce per-byte overhead in VLQ integer decoding by reading directly from the buffer slice instead of calling get_aligned for each byte. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
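The VLQ change can be sketched as follows: decode the varint straight from the byte slice rather than calling a per-byte accessor. This is a minimal illustration, not the actual parquet-rs code; `decode_vlq` is a hypothetical name.

```rust
// Hypothetical sketch of VLQ (ULEB128) decoding directly from a byte slice,
// instead of a per-byte `get_aligned`-style accessor call.
fn decode_vlq(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    let mut shift = 0;
    for (i, &byte) in buf.iter().enumerate() {
        // Low 7 bits carry payload; the high bit marks continuation.
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1)); // decoded value + bytes consumed
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed input: would overflow u64
        }
    }
    None // ran out of bytes mid-value
}

fn main() {
    // 150 encodes as [0x96, 0x01] in ULEB128.
    assert_eq!(decode_vlq(&[0x96, 0x01]), Some((150, 2)));
    assert_eq!(decode_vlq(&[0x7F]), Some((127, 1)));
}
```

Iterating the slice lets the compiler keep the cursor in a register and elide the per-byte bounds/validity bookkeeping that an accessor call would repeat.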
run benchmark arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

Benchmark for this request failed.

run benchmark arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark completed (GKE) | Details: resource usage for base (merge-base) vs branch
I guess 1-3% sounds about right
run benchmark arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark completed (GKE) | Details: resource usage for base (merge-base) vs branch
Nice, seems like a low single-digit improvement across the board (I think mostly from the non-zeroing)
When bit_width guarantees all possible indices fit within the dictionary, use unchecked indexing to allow LLVM to unroll the dict gather loop 4x with paired loads/stores instead of scalar with per-element bounds checks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
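The guarantee exploited here is that a `bit_width`-bit index can never exceed `2^bit_width - 1`, so if `2^bit_width <= dict.len()` every decodable index is in bounds. A hedged sketch of that shape (illustrative names, not the actual parquet-rs code):

```rust
// Sketch: when `bit_width` bounds every possible index below `dict.len()`,
// the per-element bounds check in the gather loop can be skipped, letting
// LLVM unroll the loop with paired loads/stores instead of scalar code.
fn gather_dict(dict: &[i32], indices: &[u32], bit_width: u8, out: &mut Vec<i32>) {
    // 2^bit_width is the number of distinct index values that can occur.
    let all_indices_valid = (1u64 << bit_width) <= dict.len() as u64;
    if all_indices_valid {
        out.extend(indices.iter().map(|&i| {
            // SAFETY: every decodable index is < 2^bit_width <= dict.len().
            unsafe { *dict.get_unchecked(i as usize) }
        }));
    } else {
        // Fall back to checked indexing when the guarantee doesn't hold.
        out.extend(indices.iter().map(|&i| dict[i as usize]));
    }
}

fn main() {
    let dict = vec![10, 20, 30, 40];
    let mut out = Vec::new();
    gather_dict(&dict, &[3, 0, 2], 2, &mut out); // 2^2 = 4 <= dict.len()
    assert_eq!(out, vec![40, 10, 30]);
}
```

The safety argument lives entirely in the one-time `all_indices_valid` check, which is why the loop body can be branch-free.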
run benchmark arrow_reader arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark completed (GKE) | Details: resource usage for base (merge-base) vs branch
When bit_width guarantees all possible indices fit within the dictionary, use unchecked access to eliminate per-element bounds checks. Also skip buffer management when all dictionary views are inlined (<=12 bytes). Generates a clean 8-instruction gather loop for the common case (all_indices_valid + base_buffer_idx=0) and a branchless 14-instruction loop for the non-zero buffer offset case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
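The "all views inlined" check relies on Arrow's StringView layout: each view is a u128 whose low 32 bits hold the string length, and strings of 12 bytes or fewer are stored inline in the view itself rather than in a data buffer. A hedged sketch of the check (illustrative names, not the actual parquet-rs code):

```rust
// Sketch of the fast-path predicate: if every dictionary view is inlined
// (length <= 12 bytes), no view references a data buffer, so buffer-index
// bookkeeping can be skipped entirely when copying views into the output.
const MAX_INLINE_LEN: u32 = 12;

fn all_views_inlined(views: &[u128]) -> bool {
    // The low 32 bits of each view hold the string length.
    views.iter().all(|v| (*v as u32) <= MAX_INLINE_LEN)
}

fn main() {
    let short = 5u128; // a 5-byte string, stored inline in the view
    // A 40-byte string: length 40 in the low bits, buffer index at bits 64..96.
    let long = 40u128 | (1u128 << 64);
    assert!(all_views_inlined(&[short, short]));
    assert!(!all_views_inlined(&[short, long]));
}
```

Because inlined views are self-contained 16-byte values, gathering them degenerates to a plain memcpy-style loop, which is what enables the compact 8-instruction codegen described above.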
run benchmark arrow_reader arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark completed (GKE) | Details: resource usage for base (merge-base) vs branch
Reserve the full output capacity upfront before the decode loop, eliminating per-chunk reallocation checks inside extend. This gives a ~25% speedup for dictionary-encoded StringView reads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
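The change above amounts to hoisting one `reserve` out of the decode loop. A minimal illustration (hypothetical `decode_all` name, not the actual parquet-rs code):

```rust
// Reserve the full output capacity once before the loop, so `extend` never
// reallocates per chunk; each call only bumps the length.
fn decode_all(chunks: &[&[u32]]) -> Vec<u32> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::new();
    out.reserve(total); // single allocation up front
    for chunk in chunks {
        out.extend_from_slice(chunk); // never reallocates inside the loop
    }
    out
}

fn main() {
    let decoded = decode_all(&[&[1, 2], &[3], &[4, 5, 6]]);
    assert_eq!(decoded, vec![1, 2, 3, 4, 5, 6]);
}
```

This works because the reader knows the total row count of the batch before decoding starts, so `total` is available up front.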
Add RleDecoder::get_batch_direct which exposes RLE vs bit-packed batches via a callback, allowing callers to handle each case optimally. For RLE runs, the dict view is looked up once and repeated directly with repeat_n, skipping the index buffer entirely. For bit-packed runs, indices are decoded to a stack-local buffer and gathered immediately. This eliminates the intermediate index buffer roundtrip for the common RLE case and reduces StringView dictionary decoding time by ~49% (137µs → 70µs in benchmarks). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
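The callback shape described above can be sketched like this. The `Run` enum and `gather_views` are illustrative stand-ins, not the actual `get_batch_direct` signature; the PR uses `repeat_n`, which `std::iter::repeat(..).take(n)` approximates here.

```rust
// Sketch: the decoder reports each run as either an RLE run (one index,
// repeated) or a bit-packed batch of literal indices, so the caller handles
// each case without an intermediate index buffer.
enum Run<'a> {
    Rle { index: u32, count: usize },
    BitPacked(&'a [u32]),
}

fn gather_views(dict: &[u128], runs: &[Run], out: &mut Vec<u128>) {
    for run in runs {
        match run {
            // RLE run: look up the dict view once and repeat it directly.
            Run::Rle { index, count } => {
                out.extend(std::iter::repeat(dict[*index as usize]).take(*count));
            }
            // Bit-packed batch: gather each decoded index immediately.
            Run::BitPacked(indices) => {
                out.extend(indices.iter().map(|&i| dict[i as usize]));
            }
        }
    }
}

fn main() {
    let dict = vec![100u128, 200, 300];
    let mut out = Vec::new();
    gather_views(
        &dict,
        &[Run::Rle { index: 1, count: 3 }, Run::BitPacked(&[0, 2])],
        &mut out,
    );
    assert_eq!(out, vec![200, 200, 200, 100, 300]);
}
```

The win for the common RLE case is that one dictionary lookup amortizes over the whole run, instead of materializing `count` copies of the index and gathering each one.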
run benchmark arrow_reader arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark completed (GKE) | Details: resource usage for base (merge-base) vs branch
Force-pushed a0815cd to 4b9a13b
Replace the if/else checked/unchecked branching in get_batch_with_dict with a single branchless .min(max_idx) clamp. This: - Prevents UB on corrupt parquet files (indices clamped to valid range) - Removes the if/else branch, simplifying codegen - Improves i32 dict perf by ~13% (60µs → 52µs) due to simpler code - StringView dict remains at 75µs (45% faster than 137µs baseline) Remove unused bit_width field from DictIndexDecoder. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
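The clamp can be sketched as below (illustrative names, not the actual parquet-rs code; assumes a non-empty dictionary). Corrupt indices produce wrong but in-bounds values rather than UB or a panic:

```rust
// Branchless clamp: every index is forced into range with `.min(max_idx)`,
// replacing the checked/unchecked if/else with a single code path.
fn gather_clamped(dict: &[i32], indices: &[u32], out: &mut Vec<i32>) {
    // Assumes dict is non-empty, as a dictionary-encoded page guarantees.
    let max_idx = (dict.len() - 1) as u32;
    out.extend(indices.iter().map(|&i| {
        // SAFETY: i.min(max_idx) is always < dict.len().
        unsafe { *dict.get_unchecked(i.min(max_idx) as usize) }
    }));
}

fn main() {
    let dict = vec![10, 20, 30];
    let mut out = Vec::new();
    gather_clamped(&dict, &[0, 2, 99], &mut out); // 99 is out of range
    assert_eq!(out, vec![10, 30, 30]); // clamped to the last entry
}
```

`min` compiles to a conditional-move rather than a branch, which is why the single clamped path beats the two-path version despite doing nominally more work per element.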
Force-pushed 4b9a13b to 0fcda30
Reserve offsets capacity upfront before the decode loop to avoid per-chunk reallocation. ~3.5% improvement for StringArray dict reads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are only used by the arrow dictionary_index decoder. Without the arrow feature, they appear as dead code to clippy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run benchmark arrow_reader_clickbench

🤖 Arrow criterion benchmark running (GKE)

🤖 Arrow criterion benchmark completed (GKE) | Details: resource usage for base (merge-base) vs branch
Nice, up to 10% with the combined changes
Summary

Optimize dictionary-encoded column reading in the parquet reader, with focus on both primitive (Int32) and StringView types.

Changes

- RLE decoder (rle.rs): branchless `.min(max_idx)` index clamping; new `get_batch_direct` method that exposes RLE vs bit-packed batches via a callback
- StringView dictionary decoding (byte_view_array.rs, dictionary_index.rs): uses `repeat_n` to fill views directly, skipping the index buffer entirely
- ByteArray dictionary decoding (byte_array.rs)
- Snappy / VLQ decoding

Benchmarks

Test plan

- `arrow` feature

🤖 Generated with Claude Code