Approach 2 (micro-row-group) reader + writer prototype #3578

Draft: alkis wants to merge 2 commits into apache:master from alkis:approach2-micro-row-group-prototype
alkis commented May 22, 2026

Summary

Prototype for a format extension that lets multiple BlockMetaData entries share a single physical column chunk ("Approach 2", aka micro-row-group). This PR is draft / RFC — opened for early feedback on the on-wire signaling, reader dispatch, and writer API, not for merge.

This PR has two commits:

  • Reader (commit 57b155d5) — ParquetFileReader.internalReadApproach2RowGroup dispatch via BlockMetaData.isApproach2(), plus the metadata primitives (ColumnChunkMetaData.SENTINEL_OFFSET, isPhysicallyShared(), RowRanges.createBetween).
  • Writer (commit c6cd039f) — public ParquetWriter.Builder.withMicroRowGroupRowCount(long) knob + the parquet.writer.micro-row-group.row-count Hadoop config key, routed through InternalParquetRecordWriter into a new ParquetFileWriter.writeMicroRowGroups(...) low-level primitive. Plus an end-to-end round-trip test that exercises both sides.

On-wire signaling

  • ColumnChunkMetaData.SENTINEL_OFFSET = -1 reuses data_page_offset to mark a column chunk as physically shared. ColumnChunkMetaData.isPhysicallyShared() and BlockMetaData.isApproach2() are the probes.
  • Per-block pages located via the block's OffsetIndex sidecar with file-absolute first_row_index values (not block-relative).
  • Per-block rowIndexOffset is not a thrift field — ParquetMetadataConverter.generateRowGroupOffsets already derives it at read time from cumulative num_rows, so the writer just needs to emit blocks in row order.
  • Dictionary page offset stays real (and is shared across the K micro-row-groups of the same physical chunk).
  • Boundary pages that straddle two adjacent micro-row-groups are listed by both blocks; the reader's SynchronizingColumnReader trims rows via an absolute RowRanges window.
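To make the absolute-coordinate bookkeeping concrete, here is a minimal sketch of how a block's page list and the per-page trim window could be computed from file-absolute `first_row_index` values. The class and method names (`MicroRowGroupPages`, `pagesForBlock`, `trimToWindow`) are hypothetical and not part of this PR, and inclusive row bounds are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in the PR.
final class MicroRowGroupPages {

  /**
   * Returns the indexes of the pages whose rows overlap the block window
   * [blockFirstRow, blockLastRow] (inclusive, file-absolute rows).
   * firstRowIndex[i] is the absolute first row of page i; page i ends just
   * before firstRowIndex[i + 1], and the last page ends at totalRows - 1.
   */
  static List<Integer> pagesForBlock(
      long[] firstRowIndex, long totalRows, long blockFirstRow, long blockLastRow) {
    List<Integer> pages = new ArrayList<>();
    for (int i = 0; i < firstRowIndex.length; i++) {
      long pageFirst = firstRowIndex[i];
      long pageLast = (i + 1 < firstRowIndex.length ? firstRowIndex[i + 1] : totalRows) - 1;
      if (pageFirst <= blockLastRow && pageLast >= blockFirstRow) {
        pages.add(i); // a boundary page ends up in two adjacent blocks
      }
    }
    return pages;
  }

  /** Rows of one page that a block keeps: the page span intersected with the block window. */
  static long[] trimToWindow(long pageFirst, long pageLast, long blockFirstRow, long blockLastRow) {
    return new long[] {Math.max(pageFirst, blockFirstRow), Math.min(pageLast, blockLastRow)};
  }

  public static void main(String[] args) {
    long[] firstRowIndex = {0, 600, 1200, 1800}; // four pages in one shared chunk
    long totalRows = 2000;
    System.out.println(pagesForBlock(firstRowIndex, totalRows, 0, 999));     // prints [0, 1]
    System.out.println(pagesForBlock(firstRowIndex, totalRows, 1000, 1999)); // prints [1, 2, 3]
  }
}
```

In this example page 1 (rows 600 to 1199) is listed by both blocks; the first block keeps rows 600 to 999 and the second keeps rows 1000 to 1199.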

Public API addition

try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
    .withConf(conf)
    .withRowGroupSize(1L << 24)
    .withMicroRowGroupRowCount(1000)   // opts in to Approach 2
    .build()) {
  // ... write records as usual; each record-batch flush produces K micro-row-groups ...
}

InternalParquetRecordWriter.flushRowGroupToStore() checks the knob: when it is > 0 and the flush holds more rows than the target, the method bypasses the legacy startBlock / pageStore.flushToFileWriter / endBlock sequence and calls the new ParquetFileWriter.writeMicroRowGroups(...) low-level primitive. That primitive writes one physical column chunk per logical column and emits K BlockMetaData entries with sentinel offsets.
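As a rough sketch of the dispatch condition and of how one flush might be sliced into K logical blocks: the names below (`MicroRowGroupSplit`, `useMicroRowGroups`, `blockRowCounts`) are hypothetical, and the rule that every block is full except possibly the last is an assumption rather than something this description states.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the code in InternalParquetRecordWriter.
final class MicroRowGroupSplit {

  /** Mirrors the described check: knob enabled and the flush holds more rows than the target. */
  static boolean useMicroRowGroups(long microRowGroupRowCount, long flushedRows) {
    return microRowGroupRowCount > 0 && flushedRows > microRowGroupRowCount;
  }

  /** Row counts of the K logical blocks (assumed: all full except possibly the last). */
  static long[] blockRowCounts(long flushedRows, long target) {
    int k = (int) ((flushedRows + target - 1) / target); // ceil(flushedRows / target)
    long[] counts = new long[k];
    for (int i = 0; i < k; i++) {
      counts[i] = Math.min(target, flushedRows - (long) i * target);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(useMicroRowGroups(0, 3500));    // prints false (knob disabled)
    System.out.println(useMicroRowGroups(1000, 800));  // prints false (flush fits in one block)
    System.out.println(useMicroRowGroups(1000, 3500)); // prints true
    System.out.println(Arrays.toString(blockRowCounts(3500, 1000))); // prints [1000, 1000, 1000, 500]
  }
}
```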

Known prototype limitations

Documented inline at call sites; calling out here because they shape the review:

Reader

  • No cross-block dictionary cache: each block re-fetches and re-decodes the shared dict page.
  • No cross-block page cache: a PhysicalChunkPageSource is built per readNextRowGroup() call.
  • Boundary pages listed by two adjacent blocks are read from disk twice.
  • Encrypted column chunks always return isPhysicallyShared() == false — encrypted-metadata integration is out of scope.

Writer

  • Per-block Statistics are empty; per-block ColumnIndex is null. Predicate pushdown over micro-row-groups is a follow-up.
  • Bloom filters not emitted for shared chunks.
  • Encryption falls back to the legacy single-block path (the new code path refuses encrypted writes).
  • Per-block valueCount is exact for non-repeated columns; for repeated columns it is approximated as the slice's per-page rowCount sum.
  • Per-block totalUncompressedSize is estimated proportionally from the slice's compressed size.
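For reviewers weighing the last limitation, one plausible reading of "estimated proportionally" is to scale the chunk's uncompressed size by the slice's share of the compressed bytes. The sketch below is an assumption about the formula, not a copy of the PR's code, and `UncompressedSizeEstimate` is a hypothetical name.

```java
// Hypothetical illustration only; the PR's actual estimate may differ.
final class UncompressedSizeEstimate {

  /**
   * Scales the physical chunk's uncompressed size by the fraction of its
   * compressed bytes that belongs to one micro-row-group slice.
   */
  static long estimateSliceUncompressed(
      long chunkUncompressed, long chunkCompressed, long sliceCompressed) {
    if (chunkCompressed <= 0) {
      return 0; // avoid dividing by zero for an empty chunk
    }
    return Math.round((double) chunkUncompressed * sliceCompressed / chunkCompressed);
  }

  public static void main(String[] args) {
    // Chunk: 40000 bytes uncompressed, 10000 compressed; the slice holds 2500 compressed bytes.
    System.out.println(estimateSliceUncompressed(40_000, 10_000, 2_500)); // prints 10000
  }
}
```

Because boundary pages are listed by two adjacent blocks, per-block estimates computed this way need not sum to the chunk's true uncompressed size.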

What's in this PR

Reader (existing commit)

  • RowRanges.createBetween(from, to) — immutable single-range constructor with absolute coordinates.
  • ColumnChunkMetaData.SENTINEL_OFFSET, isPhysicallyShared(), sentinel-aware getStartingPos().
  • BlockMetaData.isApproach2(), sentinel-aware getStartingPos() doc.
  • ParquetFileReader.internalReadApproach2RowGroup() + readDictionaryPageDirect() + drainDataPagesQueue() helpers, plus dispatch in readNextRowGroup().
  • Reader-side PhysicalChunkPageSource scaffolding.
  • Unit tests for RowRanges.createBetween, sentinel probes on ColumnChunkMetaData, and BlockMetaData.isApproach2() mode detection.
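The sentinel probes can be pictured with the following stand-alone sketch. `Chunk` and `Block` are hypothetical stand-ins for ColumnChunkMetaData and BlockMetaData, and treating a block as Approach 2 when any of its columns is shared is an assumption; the PR may define mode detection differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; the real probes live on ColumnChunkMetaData and BlockMetaData.
final class SentinelProbes {
  static final long SENTINEL_OFFSET = -1L; // value stated in the PR description

  static final class Chunk {
    final long dataPageOffset;
    final boolean encrypted;

    Chunk(long dataPageOffset, boolean encrypted) {
      this.dataPageOffset = dataPageOffset;
      this.encrypted = encrypted;
    }

    /** Encrypted chunks always report false, matching the stated reader limitation. */
    boolean isPhysicallyShared() {
      return !encrypted && dataPageOffset == SENTINEL_OFFSET;
    }
  }

  static final class Block {
    final List<Chunk> columns;

    Block(List<Chunk> columns) {
      this.columns = columns;
    }

    /** Assumption: one shared column is enough to select the Approach 2 read path. */
    boolean isApproach2() {
      for (Chunk c : columns) {
        if (c.isPhysicallyShared()) {
          return true;
        }
      }
      return false;
    }
  }

  public static void main(String[] args) {
    Block legacy = new Block(Arrays.asList(new Chunk(4L, false), new Chunk(1024L, false)));
    Block micro = new Block(Arrays.asList(
        new Chunk(SENTINEL_OFFSET, false), new Chunk(SENTINEL_OFFSET, false)));
    System.out.println(legacy.isApproach2()); // prints false
    System.out.println(micro.isApproach2());  // prints true
  }
}
```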

Writer (new commit)

  • ParquetProperties.microRowGroupRowCount + withMicroRowGroupRowCount(long) builder method (default 0 = disabled).
  • ParquetOutputFormat.MICRO_ROW_GROUP_ROW_COUNT Hadoop config constant + plumbing.
  • ParquetWriter.Builder.withMicroRowGroupRowCount(long) user-facing setter.
  • ParquetFileWriter.writeMicroRowGroups(...) low-level primitive that writes one physical column chunk per logical column and emits K BlockMetaData entries with SENTINEL_OFFSET and per-block absolute-row-index OffsetIndex sidecars.
  • MicroRowGroupColumnData package-private value class describing a physical column chunk.
  • ColumnChunkPageWriteStore.drainForMicroRowGroups() snapshot hook for the new writer path.
  • OffsetIndexBuilder per-page accessors (getPageCount/getCompressedPageSize(i)/getRowCount(i)) for the drain.
  • InternalParquetRecordWriter.flushRowGroupToStore() dispatch based on the new knob.
  • TestApproach2WriterReadWrite end-to-end round-trip: writes 3500 records with microRowGroupRowCount=1000, asserts 4 micro-row-groups in the footer with absolute rowIndexOffsets, sentinel data-page offsets, and absolute OffsetIndex.first_row_index values, then reads all rows back via the high-level ParquetReader<Group> (which transparently exercises the reader's Approach 2 dispatch).
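The absolute rowIndexOffsets that the round-trip test checks follow from a running sum over per-block num_rows, which is how the description says the reader derives them. A minimal sketch, assuming the four blocks hold 1000, 1000, 1000, and 500 rows (the description states only the block count); `RowIndexOffsets` is a hypothetical name:

```java
import java.util.Arrays;

// Hypothetical illustration of deriving rowIndexOffset from cumulative num_rows.
final class RowIndexOffsets {

  /** offsets[i] is the number of rows in all blocks before block i. */
  static long[] fromNumRows(long[] numRows) {
    long[] offsets = new long[numRows.length];
    long running = 0;
    for (int i = 0; i < numRows.length; i++) {
      offsets[i] = running;
      running += numRows[i];
    }
    return offsets;
  }

  public static void main(String[] args) {
    long[] numRows = {1000, 1000, 1000, 500}; // assumed split of the 3500 test records
    System.out.println(Arrays.toString(fromNumRows(numRows))); // prints [0, 1000, 2000, 3000]
  }
}
```

This only works if blocks appear in the footer in row order, which is why the writer has to emit them that way.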

Test plan

  • `mvn -pl parquet-column -Dtest=TestRowRanges test`
  • `mvn -pl parquet-hadoop -Dtest=TestColumnChunkMetaData,TestApproach2BlockMetaData,TestApproach2WriterReadWrite test`
  • `mvn -pl parquet-hadoop -Dtest=TestParquetFileWriter,TestParquetWriter test` (confirm legacy path unchanged when knob is `0`)
  • Manually inspect a file written with the knob set: confirm each `RowGroup` in the footer has `data_page_offset == -1` per column and cumulative `num_rows` reproduce expected `rowIndexOffset`s.

This pull request and its description were written by Isaac.

alkis added 2 commits May 23, 2026 00:22
Prototype for a format extension that lets multiple BlockMetaData entries
share a single physical column chunk. Sharing is signaled by the sentinel
data_page_offset == -1 (ColumnChunkMetaData.SENTINEL_OFFSET); per-block
page locations come from each block's OffsetIndex with absolute
first_row_index values, and SynchronizingColumnReader trims rows that
spill across block boundaries via an absolute RowRanges window.

ParquetFileReader.readNextRowGroup() dispatches to a new
internalReadApproach2RowGroup() when BlockMetaData.isApproach2() is true;
legacy contiguous chunks take the existing path unchanged. Dictionary
pages are read out-of-band (size not in OffsetIndex). Known prototype
limitations (no cross-block dict/page cache, boundary pages re-read) are
documented at the call sites.

Co-authored-by: Isaac
Public knob `parquet.writer.micro-row-group.row-count` (and
`ParquetWriter.Builder.withMicroRowGroupRowCount(long)`) opts in to emitting
K logical BlockMetaData entries per physical column chunk on each
record-batch flush, sized by this target row count. When unset (default 0),
the legacy single-block-per-flush path runs unchanged.

`InternalParquetRecordWriter.flushRowGroupToStore()` routes through a new
`ParquetFileWriter.writeMicroRowGroups(...)` which writes one physical
column chunk per logical column and emits K BlockMetaData entries with
`data_page_offset == ColumnChunkMetaData.SENTINEL_OFFSET` plus per-block
OffsetIndex sidecars carrying file-absolute `first_row_index` values —
exactly the shape consumed by the reader's `internalReadApproach2RowGroup`.
A round-trip test exercises the path via `ExampleParquetWriter`.

Prototype limitations (documented in PR body): per-block stats, per-block
ColumnIndex, and bloom filters are intentionally omitted; encryption falls
back to the legacy path; per-block valueCount is exact only for non-repeated
columns.

Co-authored-by: Isaac
alkis changed the title from "Approach 2 (micro-row-group) reader prototype" to "Approach 2 (micro-row-group) reader + writer prototype" on May 27, 2026