Approach 2 (micro-row-group) reader + writer prototype #3578
Draft
alkis wants to merge 2 commits into
Prototype for a format extension that lets multiple BlockMetaData entries share a single physical column chunk.

Sharing is signaled by the sentinel data_page_offset == -1 (ColumnChunkMetaData.SENTINEL_OFFSET); per-block page locations come from each block's OffsetIndex with absolute first_row_index values, and SynchronizingColumnReader trims rows that spill across block boundaries via an absolute RowRanges window.

ParquetFileReader.readNextRowGroup() dispatches to a new internalReadApproach2RowGroup() when BlockMetaData.isApproach2() is true; legacy contiguous chunks take the existing path unchanged. Dictionary pages are read out-of-band (size not in OffsetIndex).

Known prototype limitations (no cross-block dict/page cache, boundary pages re-read) are documented at the call sites.

Co-authored-by: Isaac
Public knob `parquet.writer.micro-row-group.row-count` (and `ParquetWriter.Builder.withMicroRowGroupRowCount(long)`) opts in to emitting K logical BlockMetaData entries per physical column chunk on each record-batch flush, sized by this target row count. When unset (default 0), the legacy single-block-per-flush path runs unchanged.

`InternalParquetRecordWriter.flushRowGroupToStore()` routes through a new `ParquetFileWriter.writeMicroRowGroups(...)`, which writes one physical column chunk per logical column and emits K BlockMetaData entries with `data_page_offset == ColumnChunkMetaData.SENTINEL_OFFSET` plus per-block OffsetIndex sidecars carrying file-absolute `first_row_index` values, exactly the shape consumed by the reader's `internalReadApproach2RowGroup`. A round-trip test exercises the path via `ExampleParquetWriter`.

Prototype limitations (documented in PR body): per-block stats, per-block ColumnIndex, and bloom filters are intentionally omitted; encryption falls back to the legacy path; per-block valueCount is exact only for non-repeated columns.

Co-authored-by: Isaac
Summary
Prototype for a format extension that lets multiple `BlockMetaData` entries share a single physical column chunk ("Approach 2", aka micro-row-group). This PR is draft / RFC: opened for early feedback on the on-wire signaling, reader dispatch, and writer API, not for merge.

This PR has two commits:
- Reader (57b155d5): `ParquetFileReader.internalReadApproach2RowGroup` dispatch via `BlockMetaData.isApproach2()`, plus the metadata primitives (`ColumnChunkMetaData.SENTINEL_OFFSET`, `isPhysicallyShared()`, `RowRanges.createBetween`).
- Writer (c6cd039f): public `ParquetWriter.Builder.withMicroRowGroupRowCount(long)` knob + the `parquet.writer.micro-row-group.row-count` Hadoop config key, routed through `InternalParquetRecordWriter` into a new `ParquetFileWriter.writeMicroRowGroups(...)` low-level primitive. Plus an end-to-end round-trip test that exercises both sides.

On-wire signaling
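A minimal sketch of the two mechanisms this section covers: probing `data_page_offset` for the sentinel, and trimming a page's rows to a block's absolute row window. The class and method names are hypothetical stand-ins rather than the PR's code, and the half-open `[from, to)` convention is an assumption made for this illustration only; `RowRanges.createBetween` may use different bounds.

```java
/** Hypothetical stand-in; not the real parquet-java classes touched by this PR. */
final class SharedChunkSketch {
    // Mirrors the sentinel value named in this PR (ColumnChunkMetaData.SENTINEL_OFFSET).
    static final long SENTINEL_OFFSET = -1L;

    /** A chunk is treated as physically shared when data_page_offset equals the sentinel. */
    static boolean isPhysicallyShared(long dataPageOffset) {
        return dataPageOffset == SENTINEL_OFFSET;
    }

    /**
     * Intersects a page's file-absolute rows with a block's file-absolute row window.
     * Returns {from, to} as a half-open range (an assumption of this sketch),
     * or null when the page contributes no rows to the block.
     */
    static long[] rowsInBlock(long pageFirstRow, long pageRowCount,
                              long blockFirstRow, long blockRowCount) {
        long from = Math.max(pageFirstRow, blockFirstRow);
        long to = Math.min(pageFirstRow + pageRowCount, blockFirstRow + blockRowCount);
        return from < to ? new long[] {from, to} : null;
    }
}
```

For example, a page holding absolute rows 900 to 1199 that straddles a block starting at row 1000 would contribute only rows 1000 to 1199 to that block; the first 100 rows belong to the previous block.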
- `ColumnChunkMetaData.SENTINEL_OFFSET = -1` reuses `data_page_offset` to mark a column chunk as physically shared. `ColumnChunkMetaData.isPhysicallyShared()` and `BlockMetaData.isApproach2()` are the probes.
- Per-block `OffsetIndex` sidecar with file-absolute `first_row_index` values (not block-relative).
- `rowIndexOffset` is not a thrift field: `ParquetMetadataConverter.generateRowGroupOffsets` already derives it at read time from cumulative `num_rows`, so the writer just needs to emit blocks in row order.
- `SynchronizingColumnReader` trims rows via an absolute `RowRanges` window.

Public API addition
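As a rough illustration of what the row-count knob does to a flush, the sketch below decides whether the micro-row-group path applies and how many rows each logical block would receive. `MicroRowGroupSplitSketch` and its methods are hypothetical names; the even split with a short final block is an assumption, since the real writer slices along page boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the flush-time decision; not the PR's implementation. */
final class MicroRowGroupSplitSketch {
    /** The micro path applies only when the knob is set and the flush exceeds the target. */
    static boolean useMicroRowGroups(long targetRowCount, long flushedRows) {
        return targetRowCount > 0 && flushedRows > targetRowCount;
    }

    /** Row counts of the logical blocks emitted for one flush (assumed even split). */
    static List<Long> blockRowCounts(long targetRowCount, long flushedRows) {
        List<Long> counts = new ArrayList<>();
        if (!useMicroRowGroups(targetRowCount, flushedRows)) {
            counts.add(flushedRows); // legacy path: one block per flush
            return counts;
        }
        for (long remaining = flushedRows; remaining > 0; remaining -= targetRowCount) {
            counts.add(Math.min(targetRowCount, remaining));
        }
        return counts;
    }
}
```

With the default of 0 the method returns a single block, which matches the stated behavior that the legacy path runs unchanged when the knob is unset.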
`InternalParquetRecordWriter.flushRowGroupToStore()` checks the knob: when it is `> 0` and the flush has more rows than the target, it bypasses the legacy `startBlock` / `pageStore.flushToFileWriter` / `endBlock` sequence and calls the new `ParquetFileWriter.writeMicroRowGroups(...)` low-level primitive, which writes one physical column chunk per logical column and emits K `BlockMetaData` entries with sentinel offsets.

Known prototype limitations
Documented inline at call sites; calling out here because they shape the review:
Reader
- No cross-block dict/page cache: `PhysicalChunkPageSource` is built per `readNextRowGroup()` call.
- Encryption: `isPhysicallyShared() == false`; encrypted-metadata integration is out of scope.

Writer
- Per-block `Statistics` are empty; per-block `ColumnIndex` is `null`. Predicate pushdown over micro-row-groups is a follow-up.
- Per-block `valueCount` is exact for non-repeated columns; for repeated columns it is approximated as the slice's per-page `rowCount` sum.
- `totalUncompressedSize` is estimated proportionally from the slice's compressed size.

What's in this PR
Reader (existing commit)
- `RowRanges.createBetween(from, to)`: immutable single-range constructor with absolute coordinates.
- `ColumnChunkMetaData.SENTINEL_OFFSET`, `isPhysicallyShared()`, sentinel-aware `getStartingPos()`.
- `BlockMetaData.isApproach2()`, sentinel-aware `getStartingPos()` doc.
- `ParquetFileReader.internalReadApproach2RowGroup()` + `readDictionaryPageDirect()` + `drainDataPagesQueue()` helpers, plus dispatch in `readNextRowGroup()`.
- `PhysicalChunkPageSource` scaffolding.
- `RowRanges.createBetween`, sentinel probes on `ColumnChunkMetaData`, and `BlockMetaData.isApproach2()` mode detection.

Writer (new commit)
- `ParquetProperties.microRowGroupRowCount` + `withMicroRowGroupRowCount(long)` builder method (default `0` = disabled).
- `ParquetOutputFormat.MICRO_ROW_GROUP_ROW_COUNT` Hadoop config constant + plumbing.
- `ParquetWriter.Builder.withMicroRowGroupRowCount(long)` user-facing setter.
- `ParquetFileWriter.writeMicroRowGroups(...)` low-level primitive that writes one physical column chunk per logical column and emits K `BlockMetaData` entries with `SENTINEL_OFFSET` and per-block absolute-row-index `OffsetIndex` sidecars.
- `MicroRowGroupColumnData` package-private value class describing a physical column chunk.
- `ColumnChunkPageWriteStore.drainForMicroRowGroups()` snapshot hook for the new writer path.
- `OffsetIndexBuilder` per-page accessors (`getPageCount` / `getCompressedPageSize(i)` / `getRowCount(i)`) for the drain.
- `InternalParquetRecordWriter.flushRowGroupToStore()` dispatch based on the new knob.
- `TestApproach2WriterReadWrite` end-to-end round-trip: writes 3500 records with `microRowGroupRowCount=1000`, asserts 4 micro-row-groups in the footer with absolute `rowIndexOffset`s, sentinel data-page offsets, and absolute `OffsetIndex.first_row_index` values, then reads all rows back via the high-level `ParquetReader<Group>` (which transparently exercises the reader's Approach 2 dispatch).

Test plan
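As a sanity check on the numbers the round-trip test asserts, the sketch below derives each block's absolute `rowIndexOffset` as the cumulative sum of the preceding blocks' row counts, which is how the description says the reader recovers it from `num_rows`. `RowIndexOffsetSketch` is a hypothetical name, and the per-block row counts 1000/1000/1000/500 are an assumption about how 3500 records split at a target of 1000; the PR text only states that there are 4 micro-row-groups.

```java
import java.util.Arrays;

/** Hypothetical sketch; the real derivation lives in ParquetMetadataConverter. */
final class RowIndexOffsetSketch {
    /** Offset of block i = total rows of blocks 0..i-1, so blocks must be in row order. */
    static long[] rowIndexOffsets(long[] blockRowCounts) {
        long[] offsets = new long[blockRowCounts.length];
        long cumulative = 0;
        for (int i = 0; i < blockRowCounts.length; i++) {
            offsets[i] = cumulative;
            cumulative += blockRowCounts[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Assumed split of the test's 3500 records at a 1000-row target.
        long[] offsets = rowIndexOffsets(new long[] {1000, 1000, 1000, 500});
        System.out.println(Arrays.toString(offsets)); // prints [0, 1000, 2000, 3000]
    }
}
```

Because the offsets depend only on block order and row counts, the writer never has to persist them, which is why emitting blocks in row order is sufficient.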
This pull request and its description were written by Isaac.