Describe the bug, including details regarding any error messages, version, and platform.
arrow-go writes REPEATED as the repetition_type for the root SchemaElement in the Parquet Thrift footer. I think this is non-standard and it's caused some interoperability failures for me.
The default rootRepetition in WriterProperties is Repetitions.Repeated. While WithRootRepetition exists as an opt-in override, the default itself is non-standard, and consumers of arrow-go (like apache/iceberg-go) inherit this default and may not expose ways to modify it.
Per the Parquet format spec:
"The root of the schema does not have a repetition_type. All other nodes must have one."
The repetition_type field on SchemaElement is optional in the Thrift definition specifically because the root should not carry one. Among the Parquet implementations I checked, arrow-go is the only one that writes REPEATED into the Thrift footer for the root element.
For added context, arrow-rs explicitly tolerates and strips root repetition when reading files from other implementations (types.rs#L1383-L1396).
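To illustrate what tolerant reading of this kind can look like, here is a minimal Go sketch. It is not arrow-rs or arrow-go code: the SchemaElement struct, the rep helper, and NormalizeRoot are hypothetical stand-ins, and the only detail taken from the Parquet format is the enum numbering (REQUIRED=0, OPTIONAL=1, REPEATED=2).

```go
package main

import "fmt"

// Repetition mirrors the numeric values of Parquet's FieldRepetitionType enum.
type Repetition int32

const (
	Required Repetition = 0
	Optional Repetition = 1
	Repeated Repetition = 2
)

// SchemaElement is a hypothetical stand-in for the Thrift struct.
// A nil RepetitionType models the optional field being absent.
type SchemaElement struct {
	Name           string
	RepetitionType *Repetition
}

// rep is a small helper that returns a pointer to a Repetition value.
func rep(r Repetition) *Repetition { return &r }

// NormalizeRoot returns a copy of a flattened schema in which the root
// element (index 0) carries no repetition, whatever the writer put there.
// All other elements are left untouched.
func NormalizeRoot(elems []SchemaElement) []SchemaElement {
	out := make([]SchemaElement, len(elems))
	copy(out, elems)
	if len(out) > 0 {
		out[0].RepetitionType = nil
	}
	return out
}

func main() {
	// A footer as a writer with a REPEATED root might produce it.
	footer := []SchemaElement{
		{Name: "schema", RepetitionType: rep(Repeated)},
		{Name: "id", RepetitionType: rep(Required)},
	}
	for _, el := range NormalizeRoot(footer) {
		if el.RepetitionType == nil {
			fmt.Printf("%s: repetition_type unset\n", el.Name)
		} else {
			fmt.Printf("%s: repetition_type=%d\n", el.Name, *el.RepetitionType)
		}
	}
}
```

A reader that normalizes this way accepts files from any writer, but it only helps consumers that implement it; it does not help strict readers like the one described next.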
In my specific example, Snowflake rejects externally-managed Parquet files written with REPEATED root repetition when they contain list columns:
"List encoding is not supported. List encoding: '0'"
I was able to reproduce this consistently: any iceberg-go table with list columns fails to load in Snowflake when the root schema element has REPEATED repetition.
A couple of possible fixes:
- Don't serialize repetition_type for the root SchemaElement at all, matching parquet-java and arrow-rs behavior and the Parquet spec exactly. The WithRootRepetition option and in-memory representation would be unaffected.
- Change the default to Repetitions.Required, matching Arrow C++/pyarrow. Less spec-pure but a smaller change.
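As a rough illustration of the first option, the sketch below flattens a schema tree into footer elements and leaves repetition_type unset on the root only, while the in-memory node keeps its repetition. The Node and SchemaElement types and the Flatten function are hypothetical and are not arrow-go's actual serialization code; only the enum numbering (REQUIRED=0, OPTIONAL=1, REPEATED=2) comes from the Parquet format.

```go
package main

import "fmt"

// Repetition mirrors the numeric values of Parquet's FieldRepetitionType enum.
type Repetition int32

const (
	Required Repetition = 0
	Optional Repetition = 1
	Repeated Repetition = 2
)

// Node is a hypothetical in-memory schema node; it always has a repetition.
type Node struct {
	Name       string
	Repetition Repetition
	Children   []*Node
}

// SchemaElement is a hypothetical stand-in for the Thrift struct.
// A nil RepetitionType means the optional field is omitted from the footer.
type SchemaElement struct {
	Name           string
	RepetitionType *Repetition
	NumChildren    int
}

// Flatten walks the tree depth-first and produces the element list that
// would be written to the footer. The root's repetition is never emitted;
// every other node's repetition is copied through unchanged.
func Flatten(root *Node) []SchemaElement {
	var out []SchemaElement
	var walk func(n *Node, isRoot bool)
	walk = func(n *Node, isRoot bool) {
		el := SchemaElement{Name: n.Name, NumChildren: len(n.Children)}
		if !isRoot {
			r := n.Repetition // copy so the pointer does not alias the node
			el.RepetitionType = &r
		}
		out = append(out, el)
		for _, c := range n.Children {
			walk(c, false)
		}
	}
	walk(root, true)
	return out
}

func main() {
	// The in-memory root may still be REPEATED; only the footer changes.
	root := &Node{Name: "schema", Repetition: Repeated, Children: []*Node{
		{Name: "id", Repetition: Required},
		{Name: "tags", Repetition: Optional},
	}}
	for _, el := range Flatten(root) {
		if el.RepetitionType == nil {
			fmt.Printf("%s: repetition_type unset\n", el.Name)
		} else {
			fmt.Printf("%s: repetition_type=%d\n", el.Name, *el.RepetitionType)
		}
	}
}
```

The second option would not need a change like this at all: it only swaps the default value that the writer properties hand to the existing serialization path.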
Either way, the existing WithRootRepetition API would remain available for anyone who needs to override the behavior. I'm happy to submit a PR for whichever approach is preferred, if either of these sounds good.
Component(s)
Parquet