[C++][ORC] Add stripe statistics API to ORCFileReader #49360

@cbb330

Describe the enhancement requested

Add public APIs to ORCFileReader for accessing stripe-level and file-level column statistics as Arrow scalars, and for reading only selected stripes by index.

This request is part 1 of implementing ORC predicate pushdown (#48986).

Context

The ORCFileReader exposes stripe metadata (count, size, offsets) but not the column statistics stored within each stripe. Without column statistics, the Arrow dataset layer cannot evaluate filter predicates against stripes and must read the entire file. This issue proposes APIs to extract column min/max/null statistics and to read only selected stripes by index, which are the two building blocks the dataset layer needs for predicate pushdown.
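To make the motivation concrete, here is a minimal, library-independent sketch of how min/max statistics let a reader skip stripes for a range predicate. `StripeStats` and `SelectStripes` are hypothetical names used only for illustration; they are not part of Arrow or liborc, and the actual API shape is what this issue is asking for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the statistics of one int64 column in one stripe.
// Not an Arrow or liborc type.
struct StripeStats {
  std::optional<int64_t> min;  // empty when no minimum was recorded
  std::optional<int64_t> max;  // empty when no maximum was recorded
  bool has_null = false;
};

// Returns the indices of stripes that may contain a row with
// lo <= value <= hi. A stripe with missing statistics is kept, because
// it cannot be proven to contain no matching rows.
std::vector<int64_t> SelectStripes(const std::vector<StripeStats>& stats,
                                   int64_t lo, int64_t hi) {
  assert(lo <= hi);
  std::vector<int64_t> selected;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    const StripeStats& s = stats[i];
    const bool unknown = !s.min.has_value() || !s.max.has_value();
    // The stripe range [min, max] overlaps [lo, hi] iff max >= lo and min <= hi.
    if (unknown || (*s.max >= lo && *s.min <= hi)) {
      selected.push_back(static_cast<int64_t>(i));
    }
  }
  return selected;
}
```

The resulting index list is the kind of input a selective stripe read would take; stripes whose range cannot overlap the predicate are never read.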

List of enhancements requested

  • Column statistics access: Methods to retrieve min/max statistics (as Arrow types) for a given column at both the file level and the individual stripe level. liborc exposes type-erased ColumnStatistics* pointers that require dynamic_cast to extract typed values, so the new API should handle this conversion and return a uniform Arrow type.

  • Selective stripe reading: A ReadStripes() method (analogous to Parquet's ReadRowGroups()) that reads only the specified stripes and concatenates them into an arrow::Table. After predicate evaluation eliminates stripes, the dataset layer uses ReadStripes() to read only the surviving ones.
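The conversion described in the first item could look roughly like the sketch below. It is self-contained and uses simplified stand-in classes (`ColumnStats`, `IntStats`, `DoubleStats`, `StringStats`) in place of liborc's statistics hierarchy, and a `std::variant` in place of an Arrow scalar; every name here is an assumption for illustration, not the proposed API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>

// Simplified, hypothetical stand-ins for a type-erased statistics hierarchy.
struct ColumnStats {
  virtual ~ColumnStats() = default;
};
struct IntStats : ColumnStats {
  int64_t min, max;
  IntStats(int64_t lo, int64_t hi) : min(lo), max(hi) { assert(lo <= hi); }
};
struct DoubleStats : ColumnStats {
  double min, max;
  DoubleStats(double lo, double hi) : min(lo), max(hi) {}
};
struct StringStats : ColumnStats {
  std::string min, max;
  StringStats(std::string lo, std::string hi)
      : min(std::move(lo)), max(std::move(hi)) {}
};

// Uniform result type standing in for an Arrow scalar.
// std::monostate means "no usable statistic for this column type".
using Value = std::variant<std::monostate, int64_t, double, std::string>;
struct MinMax {
  Value min;
  Value max;
};

// Probes the concrete type with dynamic_cast and returns typed min/max
// values behind one uniform type, so callers never cast themselves.
MinMax ExtractMinMax(const ColumnStats* stats) {
  if (auto* s = dynamic_cast<const IntStats*>(stats)) return {s->min, s->max};
  if (auto* s = dynamic_cast<const DoubleStats*>(stats)) return {s->min, s->max};
  if (auto* s = dynamic_cast<const StringStats*>(stats)) return {s->min, s->max};
  return {};  // unsupported type or null pointer: both values stay monostate
}
```

Returning an explicit "no statistic" state for unsupported types keeps the caller's pruning logic conservative: a column without usable statistics simply cannot eliminate any stripe.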

Component(s)

C++
