Describe the enhancement requested
Add public APIs to ORCFileReader for accessing stripe-level and file-level column statistics as Arrow scalars, and for reading only selected stripes by index.
This request is part 1 of implementing ORC predicate pushdown (#48986).
Context
The ORCFileReader exposes stripe metadata (count, size, offsets) but not the column statistics stored within each stripe. Without column statistics, the Arrow dataset layer cannot evaluate filter predicates against stripes and must read the entire file. This PR adds APIs to extract column min/max/null statistics and to read only selected stripes by index, which are the two building blocks the dataset layer needs for predicate pushdown.
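To make the motivation concrete, here is a minimal sketch of how a caller might prune stripes once min/max statistics are available. It uses only the standard library; `StripeStats` and `SelectStripesGreaterThan` are hypothetical names for illustration, not part of Arrow or liborc, and the real dataset layer would evaluate Arrow expressions rather than a single hard-coded comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-stripe summary for one int64 column (not an Arrow or liborc type).
struct StripeStats {
  std::optional<int64_t> min;  // empty when no minimum was recorded
  std::optional<int64_t> max;  // empty when no maximum was recorded
};

// Returns the indices of stripes that may contain a row with value > threshold.
// A stripe is skipped only when its recorded max proves that no row can match;
// stripes without statistics are always kept.
std::vector<int64_t> SelectStripesGreaterThan(const std::vector<StripeStats>& stripes,
                                              int64_t threshold) {
  std::vector<int64_t> selected;
  for (std::size_t i = 0; i < stripes.size(); ++i) {
    const StripeStats& s = stripes[i];
    if (s.max.has_value() && *s.max <= threshold) {
      continue;  // every value in this stripe is <= threshold
    }
    selected.push_back(static_cast<int64_t>(i));
  }
  return selected;
}
```

The resulting index list is the kind of input a selective stripe read would take; without statistics, the only safe answer is every stripe.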
List of enhancements requested
- Column statistics access: Methods to retrieve min/max statistics (as Arrow types) for a given column at both the file level and the individual stripe level. liborc exposes type-erased ColumnStatistics* pointers that require dynamic_cast to extract typed values, so the new API should handle this conversion and return a uniform Arrow type.
- Selective stripe reading: A ReadStripes() method (analogous to Parquet's ReadRowGroups()) that reads only the specified stripes and concatenates them into an arrow::Table. After predicate evaluation eliminates stripes, the dataset layer uses ReadStripes() to read only the surviving ones.
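The conversion described in the first item could look roughly like the sketch below. The classes in namespace `sketch` are simplified stand-ins for a type-erased statistics hierarchy, not liborc's real declarations, and `std::variant` stands in for the uniform Arrow type the proposed API would return. `ExtractMinMax` is a hypothetical name.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <variant>

namespace sketch {

// Simplified stand-ins for a type-erased statistics hierarchy (assumption:
// these only mimic the shape of the problem, they are not liborc classes).
struct ColumnStatistics {
  virtual ~ColumnStatistics() = default;
};
struct IntegerColumnStatistics : ColumnStatistics {
  int64_t min = 0;
  int64_t max = 0;
};
struct StringColumnStatistics : ColumnStatistics {
  std::string min;
  std::string max;
};

// Stand-in for a uniform return type; the real API would use an Arrow type.
using Value = std::variant<int64_t, std::string>;
struct MinMax {
  Value min;
  Value max;
};

// Probes each known concrete type with dynamic_cast and returns the typed
// min/max, or std::nullopt when the statistics type is not supported.
std::optional<MinMax> ExtractMinMax(const ColumnStatistics* stats) {
  if (const auto* ints = dynamic_cast<const IntegerColumnStatistics*>(stats)) {
    return MinMax{ints->min, ints->max};
  }
  if (const auto* strs = dynamic_cast<const StringColumnStatistics*>(stats)) {
    return MinMax{strs->min, strs->max};
  }
  return std::nullopt;
}

}  // namespace sketch
```

Returning an empty result for unsupported types lets a caller fall back to reading the stripe instead of failing, which keeps pruning conservative.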
Component(s)
C++