Skip to content

GH-3585: Fix DataPageHeaderV2.num_nulls=-1 when column statistics are disabled#3586

Open
keuin wants to merge 1 commit into
apache:masterfrom
keuin:master
Open

GH-3585: Fix DataPageHeaderV2.num_nulls=-1 when column statistics are disabled#3586
keuin wants to merge 1 commit into
apache:masterfrom
keuin:master

Conversation

@keuin
Copy link
Copy Markdown

@keuin keuin commented May 25, 2026

Rationale for this change

Fix the num_nulls field in DataPageHeaderV2, which will be unexpectedly set to -1 when statistics is disabled.

What changes are included in this PR?

Add a counter tracker in ColumnValueCollector, which is always enabled, to track required field num_nulls.

Are these changes tested?

Yes, tested. The test will fail before this fix, and will succeed with this fix.

Are there any user-facing changes?

No user-facing or API changes.

Closes #3585

this.nullCount = 0;
this.statistics = statisticsEnabled
? Statistics.createStats(path.getPrimitiveType())
: Statistics.noopStats(path.getPrimitiveType());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to change Statistics.noopStats to make it possible to ignore min/max stats but still collects null count?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, but putting it outside Statistics is a design choice. By reading code I think the Statistics object is a 1:1 mapping to parquet header's statistics object, which is optional and may be disabled by user.

Putting counter of num_nulls outside statistics object makes the code clearer, because the code maps clearly to parquet header structure.


// track the required `num_nulls` field in DataPageHeaderV2
// https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
private int nullCount;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep this counter in ColumnWriterBase instead? nullCount is page-level metadata, like valueCount and pageRowCount, while ColumnValueCollector is otherwise about optional statistics and bloom filters. Keeping it with the other page counters would make the ownership clearer.

abstract void writePage(
int rowCount,
int valueCount,
int nullCount,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a V2-only value to the shared writePage signature, so ColumnWriterV1 has to accept an unused parameter. Consider keeping nullCount on the V2 path only, either through a V2-specific helper or a protected getter from the base class.

.named("test_schema");

File file = temp.newFile();
temp.delete();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should delete only the placeholder file, not the TemporaryFolder root. file.delete() is enough to free the output path and avoids changing the test fixture lifecycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DataPageHeaderV2.num_nulls=-1 when column statistics are disabled

2 participants