Approach 2 (micro-row-group) reader + writer prototype by alkis · Pull Request #3578 · apache/parquet-java

alkis · 2026-05-22T22:28:31Z

Summary

Prototype for a format extension that lets multiple BlockMetaData entries share a single physical column chunk ("Approach 2", aka micro-row-group). This PR is a draft / RFC, opened for early feedback on the on-wire signaling, reader dispatch, and writer API, not for merge.

This PR has two commits:

Reader (commit 57b155d5): ParquetFileReader.internalReadApproach2RowGroup dispatch via BlockMetaData.isApproach2(), plus the metadata primitives (ColumnChunkMetaData.SENTINEL_OFFSET, isPhysicallyShared(), RowRanges.createBetween).
Writer (commit c6cd039f): public ParquetWriter.Builder.withMicroRowGroupRowCount(long) knob and the parquet.writer.micro-row-group.row-count Hadoop config key, routed through InternalParquetRecordWriter into a new ParquetFileWriter.writeMicroRowGroups(...) low-level primitive. This commit also adds an end-to-end round-trip test that exercises both sides.

On-wire signaling

A data_page_offset of ColumnChunkMetaData.SENTINEL_OFFSET (-1) marks a column chunk as physically shared. ColumnChunkMetaData.isPhysicallyShared() and BlockMetaData.isApproach2() are the probes.
Per-block pages are located via the block's OffsetIndex sidecar, whose first_row_index values are file-absolute (not block-relative).
Per-block rowIndexOffset is not a thrift field — ParquetMetadataConverter.generateRowGroupOffsets already derives it at read time from cumulative num_rows, so the writer just needs to emit blocks in row order.
Dictionary page offset stays real (and is shared across the K micro-row-groups of the same physical chunk).
Boundary pages that straddle two adjacent micro-row-groups are listed by both blocks; the reader's SynchronizingColumnReader trims rows via an absolute RowRanges window.
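
The two footer-level rules above (sentinel detection and deriving rowIndexOffset from cumulative num_rows) can be sketched in isolation. This is a minimal illustration, not the prototype's code: the class MicroRowGroupFooterSketch and its methods are hypothetical stand-ins operating on plain long values rather than on ColumnChunkMetaData or BlockMetaData.

```java
import java.util.Arrays;

// Hypothetical stand-in for footer metadata handling; not the parquet-java classes.
final class MicroRowGroupFooterSketch {
  // Mirrors the ColumnChunkMetaData.SENTINEL_OFFSET value named in this PR.
  static final long SENTINEL_OFFSET = -1L;

  // A column chunk is treated as physically shared when data_page_offset is the sentinel.
  static boolean isPhysicallyShared(long dataPageOffset) {
    return dataPageOffset == SENTINEL_OFFSET;
  }

  // Derives each block's rowIndexOffset as the running sum of the num_rows of all
  // preceding blocks. This only yields correct offsets if blocks are in row order.
  static long[] rowIndexOffsets(long[] numRowsPerBlock) {
    long[] offsets = new long[numRowsPerBlock.length];
    long cumulative = 0;
    for (int i = 0; i < numRowsPerBlock.length; i++) {
      offsets[i] = cumulative;
      cumulative += numRowsPerBlock[i];
    }
    return offsets;
  }

  public static void main(String[] args) {
    long[] numRows = {1000, 1000, 1000, 500};
    System.out.println(Arrays.toString(rowIndexOffsets(numRows)));
    System.out.println(isPhysicallyShared(SENTINEL_OFFSET));
  }
}
```

Because the offsets are a pure function of block order and num_rows, nothing extra has to be stored in the footer for them.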

Public API addition

```java
try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
    .withConf(conf)
    .withRowGroupSize(1L << 24)
    .withMicroRowGroupRowCount(1000)   // opts in to Approach 2
    .build()) {
  // ... write records as usual; each record-batch flush produces K micro-row-groups ...
}
```

InternalParquetRecordWriter.flushRowGroupToStore() checks the knob: when it is > 0 and the flush holds more rows than the target, it bypasses the legacy startBlock / pageStore.flushToFileWriter / endBlock sequence and calls the new ParquetFileWriter.writeMicroRowGroups(...) low-level primitive, which writes one physical column chunk per logical column and emits K BlockMetaData entries with sentinel offsets.
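
The dispatch condition and the resulting block count can be sketched as follows. This is an illustration under assumptions rather than the prototype's code: the class MicroRowGroupSplitSketch is hypothetical, and the choice to give every block the target row count with the last block taking the remainder is an assumption (it is consistent with the round-trip test's 3500 rows / 1000 target producing 4 micro-row-groups, but the PR does not spell out the split).

```java
import java.util.Arrays;

// Hypothetical sketch of how one flush could be split into K micro-row-groups.
final class MicroRowGroupSplitSketch {
  // Returns the row count of each emitted block. A single-element result stands for
  // the legacy single-block path (knob disabled, or flush no larger than the target).
  static long[] splitRowCounts(long flushedRows, long microRowGroupRowCount) {
    if (microRowGroupRowCount <= 0 || flushedRows <= microRowGroupRowCount) {
      return new long[] {flushedRows};
    }
    // K = ceil(flushedRows / target), computed with integer arithmetic.
    int k = (int) ((flushedRows + microRowGroupRowCount - 1) / microRowGroupRowCount);
    long[] counts = new long[k];
    long remaining = flushedRows;
    for (int i = 0; i < k; i++) {
      // Assumption: full-size blocks first, the last block takes the remainder.
      counts[i] = Math.min(microRowGroupRowCount, remaining);
      remaining -= counts[i];
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(splitRowCounts(3500, 1000)));
    System.out.println(Arrays.toString(splitRowCounts(3500, 0)));
  }
}
```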

Known prototype limitations

Documented inline at call sites; calling out here because they shape the review:

Reader

No cross-block dictionary cache: each block re-fetches and re-decodes the shared dict page.
No cross-block page cache: a PhysicalChunkPageSource is built per readNextRowGroup() call.
Boundary pages listed by two adjacent blocks are read from disk twice.
Encrypted column chunks always return isPhysicallyShared() == false — encrypted-metadata integration is out of scope.
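
To make the boundary-page behaviour concrete, here is a small sketch of how a page can end up listed by two adjacent blocks and how an absolute row window trims it. The class BoundaryPageSketch and its methods are hypothetical, and treating the window as inclusive on both ends is an assumption; the PR does not state the exact semantics of RowRanges.createBetween(from, to).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of page selection and row trimming with absolute row indexes.
final class BoundaryPageSketch {
  // Returns the indexes of pages whose absolute row span intersects [from, to].
  // firstRowIndex holds file-absolute first_row_index values, one per page.
  static List<Integer> pagesForBlock(long[] firstRowIndex, long[] rowCount, long from, long to) {
    List<Integer> selected = new ArrayList<>();
    for (int i = 0; i < firstRowIndex.length; i++) {
      long pageFirst = firstRowIndex[i];
      long pageLast = pageFirst + rowCount[i] - 1;
      if (pageFirst <= to && pageLast >= from) {
        selected.add(i);
      }
    }
    return selected;
  }

  // Number of rows of one page that survive trimming to the window [from, to].
  static long rowsKept(long pageFirst, long pageRowCount, long from, long to) {
    long first = Math.max(pageFirst, from);
    long last = Math.min(pageFirst + pageRowCount - 1, to);
    return Math.max(0, last - first + 1);
  }

  public static void main(String[] args) {
    long[] firstRowIndex = {0, 600, 1200, 1800};
    long[] rowCount = {600, 600, 600, 600};
    // Page 1 (rows 600..1199) straddles the boundary between the two blocks.
    System.out.println(pagesForBlock(firstRowIndex, rowCount, 0, 999));
    System.out.println(pagesForBlock(firstRowIndex, rowCount, 1000, 1999));
  }
}
```

In this example the straddling page is selected by both blocks, which is why it is fetched twice without a cross-block page cache; each block keeps only its own share of the page's rows.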

Writer

Per-block Statistics are empty; per-block ColumnIndex is null. Predicate pushdown over micro-row-groups is a follow-up.
Bloom filters are not emitted for shared chunks.
Encryption falls back to the legacy single-block path (the new code path refuses encrypted writes).
Per-block valueCount is exact for non-repeated columns; for repeated columns it is approximated as the slice's per-page rowCount sum.
Per-block totalUncompressedSize is estimated proportionally from the slice's compressed size.
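
The last two approximations can be illustrated with a short sketch. The exact formulas used by the prototype are not given in this description, so everything below is an assumption: the class SliceSizeEstimateSketch is hypothetical, the size estimate scales the chunk's uncompressed size by the slice's share of compressed bytes, and the repeated-column valueCount is taken as the plain sum of per-page row counts.

```java
// Hypothetical sketch of the per-block estimates; formulas are assumptions.
final class SliceSizeEstimateSketch {
  // Scales the physical chunk's uncompressed size by the slice's share of its
  // compressed size. Returns 0 when the chunk has no compressed bytes.
  static long estimateUncompressedSize(long chunkUncompressed, long chunkCompressed, long sliceCompressed) {
    if (chunkCompressed <= 0) {
      return 0;
    }
    return Math.round((double) chunkUncompressed * sliceCompressed / chunkCompressed);
  }

  // Approximates valueCount for a repeated column as the sum of the slice's
  // per-page row counts. The true count can be larger, since a row may hold
  // several values.
  static long approximateValueCount(long[] pageRowCounts) {
    long total = 0;
    for (long rows : pageRowCounts) {
      total += rows;
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(estimateUncompressedSize(4000, 1000, 250));
    System.out.println(approximateValueCount(new long[] {600, 400}));
  }
}
```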

What's in this PR

Reader (existing commit)

RowRanges.createBetween(from, to) — immutable single-range constructor with absolute coordinates.
ColumnChunkMetaData.SENTINEL_OFFSET, isPhysicallyShared(), sentinel-aware getStartingPos().
BlockMetaData.isApproach2(), sentinel-aware getStartingPos() doc.
ParquetFileReader.internalReadApproach2RowGroup() + readDictionaryPageDirect() + drainDataPagesQueue() helpers, plus dispatch in readNextRowGroup().
Reader-side PhysicalChunkPageSource scaffolding.
Unit tests for RowRanges.createBetween, sentinel probes on ColumnChunkMetaData, and BlockMetaData.isApproach2() mode detection.

Writer (new commit)

ParquetProperties.microRowGroupRowCount + withMicroRowGroupRowCount(long) builder method (default 0 = disabled).
ParquetOutputFormat.MICRO_ROW_GROUP_ROW_COUNT Hadoop config constant + plumbing.
ParquetWriter.Builder.withMicroRowGroupRowCount(long) user-facing setter.
ParquetFileWriter.writeMicroRowGroups(...) low-level primitive that writes one physical column chunk per logical column and emits K BlockMetaData entries with SENTINEL_OFFSET and per-block absolute-row-index OffsetIndex sidecars.
MicroRowGroupColumnData package-private value class describing a physical column chunk.
ColumnChunkPageWriteStore.drainForMicroRowGroups() snapshot hook for the new writer path.
OffsetIndexBuilder per-page accessors (getPageCount/getCompressedPageSize(i)/getRowCount(i)) for the drain.
InternalParquetRecordWriter.flushRowGroupToStore() dispatch based on the new knob.
TestApproach2WriterReadWrite end-to-end round-trip: writes 3500 records with microRowGroupRowCount=1000, asserts 4 micro-row-groups in the footer with absolute rowIndexOffsets, sentinel data-page offsets, and absolute OffsetIndex.first_row_index values, then reads all rows back via the high-level ParquetReader<Group> (which transparently exercises the reader's Approach 2 dispatch).

Test plan

`mvn -pl parquet-column -Dtest=TestRowRanges test`
`mvn -pl parquet-hadoop -Dtest=TestColumnChunkMetaData,TestApproach2BlockMetaData,TestApproach2WriterReadWrite test`
`mvn -pl parquet-hadoop -Dtest=TestParquetFileWriter,TestParquetWriter test` (confirm legacy path unchanged when knob is `0`)
Manually inspect a file written with the knob set: confirm that each `RowGroup` in the footer has `data_page_offset == -1` for every column and that the cumulative `num_rows` values reproduce the expected `rowIndexOffset`s.

This pull request and its description were written by Isaac.

Prototype for a format extension that lets multiple BlockMetaData entries share a single physical column chunk. Sharing is signaled by the sentinel data_page_offset == -1 (ColumnChunkMetaData.SENTINEL_OFFSET); per-block page locations come from each block's OffsetIndex with absolute first_row_index values, and SynchronizingColumnReader trims rows that spill across block boundaries via an absolute RowRanges window. ParquetFileReader.readNextRowGroup() dispatches to a new internalReadApproach2RowGroup() when BlockMetaData.isApproach2() is true; legacy contiguous chunks take the existing path unchanged. Dictionary pages are read out-of-band (size not in OffsetIndex). Known prototype limitations (no cross-block dict/page cache, boundary pages re-read) are documented at the call sites. Co-authored-by: Isaac

Public knob `parquet.writer.micro-row-group.row-count` (and `ParquetWriter.Builder.withMicroRowGroupRowCount(long)`) opts in to emitting K logical BlockMetaData entries per physical column chunk on each record-batch flush, sized by this target row count. When unset (default 0), the legacy single-block-per-flush path runs unchanged. `InternalParquetRecordWriter.flushRowGroupToStore()` routes through a new `ParquetFileWriter.writeMicroRowGroups(...)` which writes one physical column chunk per logical column and emits K BlockMetaData entries with `data_page_offset == ColumnChunkMetaData.SENTINEL_OFFSET` plus per-block OffsetIndex sidecars carrying file-absolute `first_row_index` values — exactly the shape consumed by the reader's `internalReadApproach2RowGroup`. A round-trip test exercises the path via `ExampleParquetWriter`. Prototype limitations (documented in PR body): per-block stats, per-block ColumnIndex, and bloom filters are intentionally omitted; encryption falls back to the legacy path; per-block valueCount is exact only for non-repeated columns. Co-authored-by: Isaac

alkis added 2 commits May 23, 2026 00:22

alkis changed the title ~~Approach 2 (micro-row-group) reader prototype~~ Approach 2 (micro-row-group) reader + writer prototype May 27, 2026

alkis wants to merge 2 commits into apache:master from alkis:approach2-micro-row-group-prototype
