fix: normalize nested field names in RecordBatchTransformer #2251

Open
vovacf201 wants to merge 1 commit into apache:main from risingwavelabs:pr/normalize-nested-field-names

Parquet files written by Arrow-based writers can use "item" as the List inner field name (the Arrow default), while Iceberg uses "element" (Iceberg spec). Similarly, Arrow uses "entries" for Map inner fields while Iceberg uses "key_value".

The RecordBatchTransformer previously used equals_datatype(), which ignores nested field names, to decide between PassThrough and Promote. As a result, columns whose nested field names did not match the target schema were passed through unchanged. Downstream consumers that validate schemas strictly (such as DataFusion's concat_batches) then failed with:
"column types must match schema types, expected List(Field { name: element ..."

Fix: use a 3-way comparison in generate_transform_operations:

  1. Strict == match -> PassThrough (no cast needed)
  2. equals_datatype() is true but == is false (only nested field names differ) -> Promote (cast to normalize names)
  3. Neither -> Promote (actual type promotion)
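
The three cases above can be sketched in isolation. This is a minimal, self-contained illustration and not the code from this PR: `DataType` and `Field` here are simplified stand-ins for the Arrow types, and `classify`, `Op`, `Reason`, and `list_of` are hypothetical names introduced only for the example. In the PR itself, cases 2 and 3 both map to the same Promote operation; the `Reason` tag exists only to make the distinction visible.

```rust
// Simplified stand-ins for the Arrow schema types (assumption: illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

impl DataType {
    /// Structural comparison that ignores nested field names,
    /// mirroring the behavior the PR description attributes to equals_datatype().
    fn equals_datatype(&self, other: &DataType) -> bool {
        match (self, other) {
            (DataType::List(a), DataType::List(b)) => {
                a.data_type.equals_datatype(&b.data_type)
            }
            // Non-nested types carry no field names, so strict equality suffices.
            _ => self == other,
        }
    }
}

/// Hypothetical tag recording why a cast is needed.
#[derive(Debug, PartialEq)]
enum Reason {
    NormalizeNames,
    TypePromotion,
}

/// Hypothetical result of the 3-way comparison.
#[derive(Debug, PartialEq)]
enum Op {
    PassThrough,
    Promote(Reason),
}

fn classify(source: &DataType, target: &DataType) -> Op {
    if source == target {
        // Case 1: strict match, including nested field names.
        Op::PassThrough
    } else if source.equals_datatype(target) {
        // Case 2: same shape, only nested field names differ.
        Op::Promote(Reason::NormalizeNames)
    } else {
        // Case 3: the types themselves differ.
        Op::Promote(Reason::TypePromotion)
    }
}

/// Helper that builds a list type with the given inner field name.
fn list_of(name: &str, inner: DataType) -> DataType {
    DataType::List(Box::new(Field {
        name: name.to_string(),
        data_type: inner,
    }))
}
```

With these definitions, a list whose inner field is named "item" compared against one named "element" lands in case 2, whereas a check based on equals_datatype() alone would have treated it as a pass-through.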

Cherry-picked from risingwavelabs/iceberg-rust commit 2e56dde

* fix: normalize nested field names in RecordBatchTransformer

* style: apply nightly rustfmt formatting