GH-3585: Fix DataPageHeaderV2.num_nulls=-1 when column statistics are disabled#3586
GH-3585: Fix DataPageHeaderV2.num_nulls=-1 when column statistics are disabled#3586keuin wants to merge 1 commit into
Conversation
| this.nullCount = 0; | ||
| this.statistics = statisticsEnabled | ||
| ? Statistics.createStats(path.getPrimitiveType()) | ||
| : Statistics.noopStats(path.getPrimitiveType()); |
There was a problem hiding this comment.
Is it possible to change Statistics.noopStats to make it possible to ignore min/max stats but still collects null count?
There was a problem hiding this comment.
It's possible, but putting it outside Statistics is a design choice. By reading code I think the Statistics object is a 1:1 mapping to parquet header's statistics object, which is optional and may be disabled by user.
Putting counter of num_nulls outside statistics object makes the code clearer, because the code maps clearly to parquet header structure.
|
|
||
| // track the required `num_nulls` field in DataPageHeaderV2 | ||
| // https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift | ||
| private int nullCount; |
There was a problem hiding this comment.
Could we keep this counter in ColumnWriterBase instead? nullCount is page-level metadata, like valueCount and pageRowCount, while ColumnValueCollector is otherwise about optional statistics and bloom filters. Keeping it with the other page counters would make the ownership clearer.
| abstract void writePage( | ||
| int rowCount, | ||
| int valueCount, | ||
| int nullCount, |
There was a problem hiding this comment.
This adds a V2-only value to the shared writePage signature, so ColumnWriterV1 has to accept an unused parameter. Consider keeping nullCount on the V2 path only, either through a V2-specific helper or a protected getter from the base class.
| .named("test_schema"); | ||
|
|
||
| File file = temp.newFile(); | ||
| temp.delete(); |
There was a problem hiding this comment.
This should delete only the placeholder file, not the TemporaryFolder root. file.delete() is enough to free the output path and avoids changing the test fixture lifecycle.
Rationale for this change
Fix the num_nulls field in DataPageHeaderV2, which will be unexpectedly set to -1 when statistics is disabled.
What changes are included in this PR?
Add a counter tracker in ColumnValueCollector, which is always enabled, to track required field num_nulls.
Are these changes tested?
Yes, tested. The test will fail before this fix, and will succeed with this fix.
Are there any user-facing changes?
No user-facing or API changes.
Closes #3585