Skip to content

GH-3242: Emit and Read min/max statistics for int96 timestamp columns#3590

Open
rahulketch wants to merge 29 commits into
apache:masterfrom
rahulketch:int96-stats-imp
Open

GH-3242: Emit and Read min/max statistics for int96 timestamp columns#3590
rahulketch wants to merge 29 commits into
apache:masterfrom
rahulketch:int96-stats-imp

Conversation

@rahulketch
Copy link
Copy Markdown
Contributor

@rahulketch rahulketch commented May 27, 2026

Redoing #3243

Rationale for this change

Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.

What changes are included in this PR?

  1. Defining column ordering for int96
  2. Implementation of comparator for int96
  3. A flag parquet.read.int96stats.enabled to control if the stats are read. It is defaulted to true
  4. ValidInt96Stats: Reads stats from known good writers. Currently including:
    a. parquet-mr 1.16.0+
    b. photon
  5. Fixing tests which fail due to this change

Are these changes tested?

  1. Added more unit tests for the new behaviour

Closes #3242

rahulketch and others added 22 commits June 11, 2025 11:11
…veComparator.java

Co-authored-by: Gang Wu <ustcwg@gmail.com>
…s.java

Co-authored-by: Gang Wu <ustcwg@gmail.com>
# Conflicts:
#	parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
…ParquetStatistics

Address PR feedback (gszadovszky, wgtmac on apache#3243): keep the previously-public
static methods as @deprecated shims that delegate to a default
ParquetMetadataConverter instance, so 1.x callers still compile. The new
createdBy-aware non-static methods remain the recommended path.
String.compareTo is lexicographic: "1.15.1".compareTo("1.16.0") returns -1,
which incorrectly rejected parquet-mr 1.15.1 (caught by testParquetMrValid)
and also accepted hypothetical pre-1.15 versions like 1.5.0 (since '5' > '1'
at position 2). Use SemanticVersion compareTo with a 1.15.0 lower bound;
strict > so 1.15.0 itself stays invalid, matching testParquetMrInvalid.
1.16.0 is the first parquet-mr release that emits correct INT96 statistics.
Update tests: 1.15.1 / 1.15.1-SNAPSHOT move to the invalid set, 1.16.0-SNAPSHOT
added as invalid (pre-release sorts below 1.16.0 final under semver).
@rahulketch
Copy link
Copy Markdown
Contributor Author

@wgtmac @gszadovszky @emkornfield: Could you please take a look at this? It's a redo of #3243 which got closed due to inactivity

Without the guard, a malformed Binary throws IndexOutOfBoundsException
from inside the stats-pruning path (for length < 12) or silently
mis-compares by reading only the first 12 bytes (for length > 12).
Throw IllegalArgumentException with a clear message instead, matching
the precedent set by Binary.get2BytesLittleEndian for FLOAT16.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end support for INT96 timestamp min/max statistics in Parquet-Java by defining INT96 type ordering/comparison, emitting INT96 min/max into file metadata, and conditionally trusting/reading INT96 stats based on writer identity (created_by) with a new read option toggle.

Changes:

  • Introduces an INT96 timestamp comparator and enables TYPE_DEFINED column order for INT96.
  • Adds parquet.read.int96stats.enabled (default true) to control whether INT96 stats are read/exposed, with ValidInt96Stats gating trust by created_by.
  • Updates metadata conversion and multiple tests; adds new unit tests for comparator behavior and statistics round-tripping.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestInt96TimestampStatisticsRoundTrip.java New round-trip test validating INT96 stats behavior with the enable/disable flag
parquet-hadoop/src/test/java/org/apache/parquet/statistics/RandomValues.java Updates INT96 random generation to produce 12-byte INT96 payloads
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileWriter.java Updates tests to use instance-based statistics conversion APIs
parquet-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java Refactors tests for new converter APIs and createdBy-aware behavior
parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java Adds readInt96Stats option plumbed through builder and accessors
parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java Plumbs readInt96Stats through Hadoop read options
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java Adds config key and default for INT96 stats reading
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java Passes created_by into column index conversion for trust gating
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java Adds createdBy/readInt96Stats-aware stats + column index conversion logic
parquet-column/src/test/java/org/apache/parquet/ValidInt96StatsTest.java New tests for created_by trust detection
parquet-column/src/test/java/org/apache/parquet/schema/TestTypeBuildersWithLogicalTypes.java Adjusts expectations around INT96 column order support
parquet-column/src/test/java/org/apache/parquet/schema/TestTypeBuilders.java Enables type-defined column order for INT96 in builder tests
parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveComparator.java Adds INT96 comparator tests and invalid-length validation
parquet-column/src/test/java/org/apache/parquet/internal/column/columnindex/TestBinaryTruncator.java Updates truncator tests to include INT96-specific contract checks
parquet-column/src/main/java/org/apache/parquet/ValidInt96Stats.java New utility to determine whether INT96 stats can be trusted based on created_by
parquet-column/src/main/java/org/apache/parquet/schema/Types.java Updates documentation around default column order selection
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java Enables type-defined order for INT96 and uses INT96 timestamp comparator
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java Adds INT96 timestamp comparator implementation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

CI's spotless check (palantirJavaFormat) flagged over-indentation in
testInt96Comparator's array literals and lambda body. Reformat to
Palantir's continuation-indent style (6 spaces for items inside an
array literal). Pure whitespace change; all 14 tests still pass.
Copilot review (apache#3590) flagged that toParquetStatistics was gating V2
min_value/max_value emission on isMinMaxStatsReadingSupported, which
includes the read-side createdBy allowlist. That coupling silently
suppresses INT96 stats emission for any writer whose createdBy isn't
in ValidInt96Stats (e.g., pre-1.16.0 prereleases, downstream forks),
even though the writer is fully capable of producing them correctly.

Drop the new createdBy-aware toParquetStatistics(createdBy, ...)
overloads added by this PR and inline the (now createdBy-free) body
into the original static signatures, gating V2 emission on the
type-only isMinMaxStatsWritingSupported. The read paths
(fromParquetStatisticsInternal, fromParquetColumnIndex) already apply
the allowlist on the correct side. Update the three tests
(testSkippedV2Stats, testColumnOrders, testColumnIndexConversion)
that asserted the pre-PR "INT96 has no stats" behavior.
After the last refactor restored toParquetStatistics(stats) and
fromParquetStatistics(String, Statistics, PrimitiveTypeName) as
static, several test sites that had been switched to instance calls
no longer need to be. Revert TestParquetFileWriter entirely to its
pre-PR state, drop the CREATED_BY constant from
TestParquetMetadataConverter (the existing Version.FULL_VERSION
references the actual writer version), and revert the three
testSkippedV2Stats / testV2OnlyStats / testV2StatsEqualMinMax
helpers and one round-trip call site back to static
ParquetMetadataConverter.toParquetStatistics(stats).
Pre-PR, Int96Generator.nextBinaryValue went through
FixedBinaryTestUtils.getFixedBinary which returned a constant-backed
Binary. The refactor switched to Binary.fromReusedByteArray over the
shared scratch buffer, so successive nextValue() calls mutate Binaries
returned by earlier calls -- breaking any caller that retains results
(e.g. RandomValues.wrapSorted).
CI japicmp:cmp check failed: METHOD_REMOVED on the pre-PR signature
ParquetMetadataConverter.fromParquetColumnIndex(PrimitiveType, ColumnIndex).
Add it back as a @deprecated shim that delegates to the new
createdBy-aware instance method with createdBy=null. For non-INT96
types this falls through to the type-only writing-supported check
(same as pre-PR). For INT96 the null createdBy fails the allowlist
and returns null, which matches pre-PR behavior (INT96 had UNDEFINED
column order pre-PR, so it also returned null).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emit and Read min/max statistics for int96 timestamp columns

2 participants