Describe the bug
recompute_schema() has a pretty well documented contract that if you change the schema that it should be recomputed.
However, the current check for Union only checks the length of the schema, not the types or names/aliases:
From recompute_schema():
LogicalPlan::Union(Union { inputs, schema }) => {
let first_input_schema = inputs[0].schema();
if schema.fields().len() == first_input_schema.fields().len() {
// If inputs are not pruned do not change schema
Ok(LogicalPlan::Union(Union { inputs, schema }))
} else {
// A note on `Union`s constructed via `try_new_by_name`:
//
// At this point, the schema for each input should have
// the same width. Thus, we do not need to save whether a
// `Union` was created `BY NAME`, and can safely rely on the
// `try_new` initializer to derive the new schema based on
// column positions.
Ok(LogicalPlan::Union(Union::try_new(inputs)?))
}
}
To Reproduce
#[test]
fn test_recompute_schema_union_type_mismatch() -> Result<()> {
let schema_i32 =
Schema::new(vec![Field::new("a", DataType::Int32, false)]);
let schema_i64 =
Schema::new(vec![Field::new("a", DataType::Int64, false)]);
// Build a Union whose schema starts out as Int32 (matching its inputs).
let original = Union::try_new(vec![
Arc::new(table_scan(Some("t1"), &schema_i32, None)?.build()?),
Arc::new(table_scan(Some("t2"), &schema_i32, None)?.build()?),
])?;
assert_eq!(
original.schema.field(0).data_type(),
&DataType::Int32,
"sanity: starting schema is Int32"
);
// Now simulate something like a type-coercion pass replacing the
// inputs with Int64-typed versions while leaving the Union's cached
// schema untouched. Same width, different types.
let stale = LogicalPlan::Union(Union {
inputs: vec![
Arc::new(table_scan(Some("t1"), &schema_i64, None)?.build()?),
Arc::new(table_scan(Some("t2"), &schema_i64, None)?.build()?),
],
schema: Arc::clone(&original.schema),
});
let recomputed = stale.recompute_schema()?;
assert_eq!(
recomputed.schema().field(0).data_type(),
&DataType::Int64,
"Union schema should track the new Int64 input types after \
recompute_schema(), but the width-only check left it stale"
);
Ok(())
}
Expected behavior
The schema is updated
Additional context
I'm trying to rewrite all exprs/aliases using a pattern for some optimizations. Rewriting exprs to use a new alias, even if it's consistent, you need to recompute the schema for all nodes.
Union is the only one that breaks that I've found but there might be others.
Describe the bug
recompute_schema()has a pretty well documented contract that if you change the schema that it should be recomputed.However, the current check for
Uniononly checks the length of the schema, not the types or names/aliases:From
recompute_schema():To Reproduce
Expected behavior
The schema is updated
Additional context
I'm trying to rewrite all exprs/aliases using a pattern for some optimizations. Rewriting exprs to use a new alias, even if it's consistent, you need to recompute the schema for all nodes.
Union is the only one that breaks that I've found but there might be others.