GH-3242: Emit and Read min/max statistics for int96 timestamp columns#3590
Open
rahulketch wants to merge 29 commits into
Open
GH-3242: Emit and Read min/max statistics for int96 timestamp columns#3590rahulketch wants to merge 29 commits into
rahulketch wants to merge 29 commits into
Conversation
…veComparator.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
…s.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
# Conflicts: # parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
…ParquetStatistics Address PR feedback (gszadovszky, wgtmac on apache#3243): keep the previously-public static methods as @deprecated shims that delegate to a default ParquetMetadataConverter instance, so 1.x callers still compile. The new createdBy-aware non-static methods remain the recommended path.
String.compareTo is lexicographic: "1.15.1".compareTo("1.16.0") returns -1,
which incorrectly rejected parquet-mr 1.15.1 (caught by testParquetMrValid)
and also accepted hypothetical pre-1.15 versions like 1.5.0 (since '5' > '1'
at position 2). Use SemanticVersion compareTo with a 1.15.0 lower bound;
strict > so 1.15.0 itself stays invalid, matching testParquetMrInvalid.
1.16.0 is the first parquet-mr release that emits correct INT96 statistics. Update tests: 1.15.1 / 1.15.1-SNAPSHOT move to the invalid set, 1.16.0-SNAPSHOT added as invalid (pre-release sorts below 1.16.0 final under semver).
Contributor
Author
|
@wgtmac @gszadovszky @emkornfield: Could you please take a look at this? It's a redo of #3243 which got closed due to inactivity |
Without the guard, a malformed Binary throws IndexOutOfBoundsException from inside the stats-pruning path (for length < 12) or silently mis-compares by reading only the first 12 bytes (for length > 12). Throw IllegalArgumentException with a clear message instead, matching the precedent set by Binary.get2BytesLittleEndian for FLOAT16.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds end-to-end support for INT96 timestamp min/max statistics in Parquet-Java by defining INT96 type ordering/comparison, emitting INT96 min/max into file metadata, and conditionally trusting/reading INT96 stats based on writer identity (created_by) with a new read option toggle.
Changes:
- Introduces an INT96 timestamp comparator and enables TYPE_DEFINED column order for INT96.
- Adds
parquet.read.int96stats.enabled(defaulttrue) to control whether INT96 stats are read/exposed, withValidInt96Statsgating trust bycreated_by. - Updates metadata conversion and multiple tests; adds new unit tests for comparator behavior and statistics round-tripping.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestInt96TimestampStatisticsRoundTrip.java | New round-trip test validating INT96 stats behavior with the enable/disable flag |
| parquet-hadoop/src/test/java/org/apache/parquet/statistics/RandomValues.java | Updates INT96 random generation to produce 12-byte INT96 payloads |
| parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileWriter.java | Updates tests to use instance-based statistics conversion APIs |
| parquet-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java | Refactors tests for new converter APIs and createdBy-aware behavior |
| parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java | Adds readInt96Stats option plumbed through builder and accessors |
| parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java | Plumbs readInt96Stats through Hadoop read options |
| parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java | Adds config key and default for INT96 stats reading |
| parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java | Passes created_by into column index conversion for trust gating |
| parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java | Adds createdBy/readInt96Stats-aware stats + column index conversion logic |
| parquet-column/src/test/java/org/apache/parquet/ValidInt96StatsTest.java | New tests for created_by trust detection |
| parquet-column/src/test/java/org/apache/parquet/schema/TestTypeBuildersWithLogicalTypes.java | Adjusts expectations around INT96 column order support |
| parquet-column/src/test/java/org/apache/parquet/schema/TestTypeBuilders.java | Enables type-defined column order for INT96 in builder tests |
| parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveComparator.java | Adds INT96 comparator tests and invalid-length validation |
| parquet-column/src/test/java/org/apache/parquet/internal/column/columnindex/TestBinaryTruncator.java | Updates truncator tests to include INT96-specific contract checks |
| parquet-column/src/main/java/org/apache/parquet/ValidInt96Stats.java | New utility to determine whether INT96 stats can be trusted based on created_by |
| parquet-column/src/main/java/org/apache/parquet/schema/Types.java | Updates documentation around default column order selection |
| parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java | Enables type-defined order for INT96 and uses INT96 timestamp comparator |
| parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java | Adds INT96 timestamp comparator implementation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
CI's spotless check (palantirJavaFormat) flagged over-indentation in testInt96Comparator's array literals and lambda body. Reformat to Palantir's continuation-indent style (6 spaces for items inside an array literal). Pure whitespace change; all 14 tests still pass.
Copilot review (apache#3590) flagged that toParquetStatistics was gating V2 min_value/max_value emission on isMinMaxStatsReadingSupported, which includes the read-side createdBy allowlist. That coupling silently suppresses INT96 stats emission for any writer whose createdBy isn't in ValidInt96Stats (e.g., pre-1.16.0 prereleases, downstream forks), even though the writer is fully capable of producing them correctly. Drop the new createdBy-aware toParquetStatistics(createdBy, ...) overloads added by this PR and inline the (now createdBy-free) body into the original static signatures, gating V2 emission on the type-only isMinMaxStatsWritingSupported. The read paths (fromParquetStatisticsInternal, fromParquetColumnIndex) already apply the allowlist on the correct side. Update the three tests (testSkippedV2Stats, testColumnOrders, testColumnIndexConversion) that asserted the pre-PR "INT96 has no stats" behavior.
After the last refactor restored toParquetStatistics(stats) and fromParquetStatistics(String, Statistics, PrimitiveTypeName) as static, several test sites that had been switched to instance calls no longer need to be. Revert TestParquetFileWriter entirely to its pre-PR state, drop the CREATED_BY constant from TestParquetMetadataConverter (the existing Version.FULL_VERSION references the actual writer version), and revert the three testSkippedV2Stats / testV2OnlyStats / testV2StatsEqualMinMax helpers and one round-trip call site back to static ParquetMetadataConverter.toParquetStatistics(stats).
Pre-PR, Int96Generator.nextBinaryValue went through FixedBinaryTestUtils.getFixedBinary which returned a constant-backed Binary. The refactor switched to Binary.fromReusedByteArray over the shared scratch buffer, so successive nextValue() calls mutate Binaries returned by earlier calls -- breaking any caller that retains results (e.g. RandomValues.wrapSorted).
CI japicmp:cmp check failed: METHOD_REMOVED on the pre-PR signature ParquetMetadataConverter.fromParquetColumnIndex(PrimitiveType, ColumnIndex). Add it back as a @deprecated shim that delegates to the new createdBy-aware instance method with createdBy=null. For non-INT96 types this falls through to the type-only writing-supported check (same as pre-PR). For INT96 the null createdBy fails the allowlist and returns null, which matches pre-PR behavior (INT96 had UNDEFINED column order pre-PR, so it also returned null).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Redoing #3243
Rationale for this change
Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.
What changes are included in this PR?
parquet.read.int96stats.enabledto control if the stats are read. It is defaulted to trueValidInt96Stats: Reads stats from known good writers. Currently including:a.
parquet-mr 1.16.0+b.
photonAre these changes tested?
Closes #3242