Support for importing Arrow extension arrays into core vortex arrays and ParquetVariant support.#8125
Support for importing Arrow extension arrays into core vortex arrays and ParquetVariant support.#8125AdamGS wants to merge 5 commits into
Conversation
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Merging this PR will improve performance by 18.38%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_opt_into_canonical[(1000, 10)] |
240 µs | 202.7 µs | +18.38% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing adamg/parquet-arrow-import-export (0dd96bd) with develop (c55584a)
Polar Signals Profiling ResultsLatest Run
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.111x ❌ datafusion / vortex-file-compressed (1.111x ❌, 0↑ 6↓)
|
File Sizes: PolarSignals ProfilingNo file size changes detected. |
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.014x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.006x ➖, 1↑ 0↓)
datafusion / parquet (0.997x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.965x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.976x ➖, 0↑ 0↓)
duckdb / parquet (1.063x ➖, 0↑ 2↓)
Full attributed analysis
|
File Sizes: FineWeb NVMeNo file size changes detected. |
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (1.005x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.016x ➖, 0↑ 0↓)
datafusion / parquet (0.992x ➖, 1↑ 2↓)
datafusion / arrow (1.018x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.999x ➖, 0↑ 0↓)
duckdb / parquet (0.987x ➖, 2↑ 1↓)
duckdb / duckdb (0.997x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: TPC-H SF=1 on NVMENo file size changes detected. |
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (0.938x ➖, 18↑ 0↓)
datafusion / vortex-compact (0.947x ➖, 14↑ 0↓)
datafusion / parquet (0.946x ➖, 13↑ 1↓)
duckdb / vortex-file-compressed (0.934x ➖, 18↑ 0↓)
duckdb / vortex-compact (0.939x ➖, 14↑ 0↓)
duckdb / parquet (0.959x ➖, 2↑ 0↓)
duckdb / duckdb (0.963x ➖, 10↑ 4↓)
Full attributed analysis
|
File Sizes: TPC-DS SF=1 on NVMENo file size changes detected. |
Benchmarks: FineWeb S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.993x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.121x ➖, 0↑ 1↓)
datafusion / parquet (1.031x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.064x ➖, 0↑ 2↓)
duckdb / vortex-compact (0.927x ➖, 1↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) duckdb / vortex-file-compressed (1.052x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.029x ➖, 0↑ 1↓)
duckdb / parquet (1.043x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: Statistical and Population GeneticsNo file size changes detected. |
Benchmarks: Random AccessVortex (geomean): 1.100x ➖ unknown / unknown (1.119x ❌, 3↑ 27↓)
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.076x ➖, 0↑ 8↓)
datafusion / vortex-compact (1.097x ➖, 0↑ 7↓)
datafusion / parquet (1.097x ➖, 0↑ 9↓)
datafusion / arrow (1.087x ➖, 0↑ 8↓)
duckdb / vortex-file-compressed (1.157x ❌, 0↑ 13↓)
duckdb / vortex-compact (1.072x ➖, 0↑ 9↓)
duckdb / parquet (1.020x ➖, 0↑ 0↓)
duckdb / duckdb (1.018x ➖, 0↑ 1↓)
Full attributed analysis
|
File Sizes: TPC-H SF=10 on NVMENo file size changes detected. |
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.015x ➖, 0↑ 3↓)
datafusion / parquet (0.989x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.993x ➖, 4↑ 1↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
duckdb / duckdb (0.993x ➖, 0↑ 1↓)
Full attributed analysis
|
File Sizes: Clickbench on NVMEFile Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.882x ➖, 6↑ 4↓)
datafusion / vortex-compact (1.016x ➖, 0↑ 1↓)
datafusion / parquet (1.039x ➖, 2↑ 5↓)
duckdb / vortex-file-compressed (0.897x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.868x ➖, 1↑ 0↓)
duckdb / parquet (0.887x ➖, 0↑ 0↓)
Full attributed analysis
|
|
do we know all of the possible arrow extension types that are considered not extension by Vortex? Is it just Variant? |
Benchmarks: CompressionVortex (geomean): 0.990x ➖ unknown / unknown (0.989x ➖, 0↑ 0↓)
|
| let int_array: ArrowArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3])); | ||
| let result = <Vector as ArrowImportVTable>::from_arrow_array(&Vector, int_array, &ext)?; | ||
| let result = | ||
| <Vector as ArrowImportVTable>::from_arrow_array(&Vector, int_array, &field, &dtype)?; |
There was a problem hiding this comment.
It does feel a bit strange that we pass in both the Vector vtable as well as the extension dtype here? Maybe we don't care
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.936x ➖, 2↑ 2↓)
datafusion / vortex-compact (0.914x ➖, 1↑ 1↓)
datafusion / parquet (1.023x ➖, 2↑ 5↓)
duckdb / vortex-file-compressed (0.984x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.065x ➖, 0↑ 1↓)
duckdb / parquet (0.946x ➖, 1↑ 0↓)
Full attributed analysis
|
|
@connortsui20 8-bit boolean too. |
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
8332a6f to
0dd96bd
Compare
|
The methods here still look weird, in particular to_arrow_field will never be called unless the source is an extension dtype which is not the case for things like 8-bit bool. However, the export should work just fine as long as the exporter doesn't look at the target dtype. The more confounding factor though is that for variant there's no real implementation you can have because of #8135, i.e you need to know physical information in order to export it. @a10y @palaska have you thought about these cases? Basically the code as it is now (and even after this pr is still subtly broken in a way that I find hard to unwind) |
robert3005
left a comment
There was a problem hiding this comment.
Let's merge this. I think we need to do better on schema inference but this will require some more thought
Summary
Allows importing/exporting vortex and arrow arrays in the case where the arrow side is an extension but Vortex isn't.
This PR also includes the motivating change - support for ParquetVariant import/export.
API Changes
Changes the signature of
ArrowImportVTable::from_arrow_arrayto include both theField(which has the full arrow-side type info) and a full vortexDType.