Skip to content

recompute_schema() is broken with LogicalPlan::Union #22447

@cetra3

Description

@cetra3

Describe the bug

recompute_schema() has a pretty well documented contract that if you change the schema that it should be recomputed.

However, the current check for Union only checks the length of the schema, not the types or names/aliases:

From recompute_schema():

LogicalPlan::Union(Union { inputs, schema }) => {
    let first_input_schema = inputs[0].schema();
    if schema.fields().len() == first_input_schema.fields().len() {
        // If inputs are not pruned do not change schema
        Ok(LogicalPlan::Union(Union { inputs, schema }))
    } else {
        // A note on `Union`s constructed via `try_new_by_name`:
        //
        // At this point, the schema for each input should have
        // the same width. Thus, we do not need to save whether a
        // `Union` was created `BY NAME`, and can safely rely on the
        // `try_new` initializer to derive the new schema based on
        // column positions.
        Ok(LogicalPlan::Union(Union::try_new(inputs)?))
    }
}

To Reproduce

#[test]
fn test_recompute_schema_union_type_mismatch() -> Result<()> {
    let schema_i32 =
        Schema::new(vec![Field::new("a", DataType::Int32, false)]);
    let schema_i64 =
        Schema::new(vec![Field::new("a", DataType::Int64, false)]);

    // Build a Union whose schema starts out as Int32 (matching its inputs).
    let original = Union::try_new(vec![
        Arc::new(table_scan(Some("t1"), &schema_i32, None)?.build()?),
        Arc::new(table_scan(Some("t2"), &schema_i32, None)?.build()?),
    ])?;
    assert_eq!(
        original.schema.field(0).data_type(),
        &DataType::Int32,
        "sanity: starting schema is Int32"
    );

    // Now simulate something like a type-coercion pass replacing the
    // inputs with Int64-typed versions while leaving the Union's cached
    // schema untouched. Same width, different types.
    let stale = LogicalPlan::Union(Union {
        inputs: vec![
            Arc::new(table_scan(Some("t1"), &schema_i64, None)?.build()?),
            Arc::new(table_scan(Some("t2"), &schema_i64, None)?.build()?),
        ],
        schema: Arc::clone(&original.schema),
    });

    let recomputed = stale.recompute_schema()?;

    assert_eq!(
        recomputed.schema().field(0).data_type(),
        &DataType::Int64,
        "Union schema should track the new Int64 input types after \
         recompute_schema(), but the width-only check left it stale"
    );

    Ok(())
}

Expected behavior

The schema is updated

Additional context

I'm trying to rewrite all exprs/aliases using a pattern for some optimizations. Rewriting exprs to use a new alias, even if it's consistent, you need to recompute the schema for all nodes.

Union is the only one that breaks that I've found but there might be others.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions