Root schema element written with REPEATED repetition breaks interoperability #722

@hcrosse

Describe the bug, including details regarding any error messages, version, and platform.

arrow-go writes REPEATED as the repetition_type for the root SchemaElement in the Parquet Thrift footer. I think this is non-standard and it's caused some interoperability failures for me.

The default rootRepetition in WriterProperties is Repetitions.Repeated. While WithRootRepetition exists as an opt-in override, the default itself is non-standard, and consumers of arrow-go (like apache/iceberg-go) inherit this default and may not expose ways to modify it.

Per the Parquet format spec:

"The root of the schema does not have a repetition_type. All other nodes must have one."

The repetition_type field on SchemaElement is optional in the Thrift definition specifically because the root should not carry one. Among the Parquet implementations I checked, arrow-go is the only one that writes REPEATED into the Thrift footer for the root element:

| Implementation | In-memory | On disk (Thrift footer) | Source |
| --- | --- | --- | --- |
| Parquet spec | N/A | Not set | parquet.thrift#L516-L518 |
| parquet-java | REPEATED | Not set (stripped during serialization) | MessageType.java#L36, ParquetMetadataConverter.java#L323-L329 |
| Arrow C++ / pyarrow | REQUIRED | REQUIRED | schema.cc#L1228 |
| arrow-rs (Rust) | None | Not set | types.rs#L45-L46, types.rs#L590-L591 |
| arrow-go | REPEATED | REPEATED | writer_properties.go#L519 |

For added context, arrow-rs explicitly tolerates and strips root repetition when reading files from other implementations (types.rs#L1383-L1396).
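
To illustrate what that kind of reader-side tolerance amounts to, here is a minimal Go sketch. It is not arrow-rs or arrow-go code: `SchemaElement`, `Repetition`, and `CheckRepetitions` are hypothetical, simplified stand-ins for the Thrift footer types. It assumes the flattened schema list places the root at index 0, ignores whatever repetition the root carries, and enforces the spec rule that every other node must have one.

```go
package main

import "fmt"

// Repetition mirrors the FieldRepetitionType enum values in parquet.thrift.
type Repetition int32

const (
	Required Repetition = 0
	Optional Repetition = 1
	Repeated Repetition = 2
)

// SchemaElement is a hypothetical, simplified stand-in for the Thrift struct.
// A nil RepetitionType models the optional field being unset in the footer.
type SchemaElement struct {
	Name           string
	RepetitionType *Repetition
}

// CheckRepetitions validates a flattened schema the lenient way: the root
// (index 0) may carry any repetition or none, while every other element
// must have repetition_type set.
func CheckRepetitions(elems []SchemaElement) error {
	if len(elems) == 0 {
		return fmt.Errorf("empty schema: missing root element")
	}
	// Skip the root; its repetition is deliberately not inspected.
	for i, e := range elems[1:] {
		if e.RepetitionType == nil {
			return fmt.Errorf("schema element %d (%q) has no repetition_type", i+1, e.Name)
		}
	}
	return nil
}
```

Under these assumptions, a footer whose root is REPEATED, REQUIRED, or unset passes the check identically, which is the behavior a strict consumer lacks.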

In my specific example, Snowflake rejects externally-managed Parquet files written with REPEATED root repetition when they contain list columns:

"List encoding is not supported. List encoding: '0'"

I was able to reproduce this consistently: any iceberg-go table with list columns fails to load in Snowflake when the root schema element has REPEATED repetition.

A couple of possible fixes:

  1. Don't serialize repetition_type for the root SchemaElement at all, matching parquet-java and arrow-rs behavior and the Parquet spec exactly. The WithRootRepetition option and in-memory representation would be unaffected.
  2. Change the default to Repetitions.Required, matching Arrow C++/pyarrow. Less spec-pure but a smaller change.
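
As a rough sketch of what option 1 could look like, the Go snippet below clears the root's repetition just before the footer is written while leaving the in-memory schema alone. It is an illustration only: `SchemaElement`, `Repetition`, and `StripRootRepetition` are hypothetical simplified types, not arrow-go's actual serialization code, and it assumes the flattened schema list places the root at index 0.

```go
package main

// Repetition mirrors the FieldRepetitionType enum values in parquet.thrift.
type Repetition int32

const (
	Required Repetition = 0
	Optional Repetition = 1
	Repeated Repetition = 2
)

// SchemaElement is a hypothetical, simplified stand-in for the Thrift struct.
// A nil RepetitionType means the optional field is omitted on disk.
type SchemaElement struct {
	Name           string
	RepetitionType *Repetition
	NumChildren    int32
}

// StripRootRepetition returns a copy of the flattened schema in which the
// root element (index 0) has no repetition_type. The input slice is not
// modified, so the in-memory representation keeps whatever was configured.
func StripRootRepetition(elems []SchemaElement) []SchemaElement {
	out := make([]SchemaElement, len(elems))
	copy(out, elems)
	if len(out) > 0 {
		out[0].RepetitionType = nil
	}
	return out
}
```

Doing the stripping on a copy at serialization time is what would keep `WithRootRepetition` and the in-memory schema unaffected, similar in spirit to the parquet-java row in the table above.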

Either way, the existing WithRootRepetition API will remain available for anyone who needs to override the behavior. I'm happy to submit a PR for whichever approach is preferred, if either of these sounds good.

Component(s)

Parquet
