Skip to content

Enable parallel file-level scanning for IcebergTableScan Datafusion Integration #2220

@snithish

Description

@snithish

Is your feature request related to a problem or challenge?

As noted in #1604, Iceberg-DataFusion read performance is currently bottlenecked by single-threaded execution. While size-based planning is the proposed long-term solution, a more immediate improvement would be to parallelize over FileScanTask and leverage ArrowReaderBuilder during plan execution.

Describe the solution you'd like

Pre-calculates the FileScanTask streams and partitions them across the available DataFusion partitions, updating the IcebergTableScan struct and ExecutionPlan trait:

Pre-partitioning Scan Tasks: IcebergTableScan now accepts a grouped tasks: Vec<Vec> rather than computing streams eagerly.
Propagating Partition Counts: The compute_properties method now dynamically returns Partitioning::UnknownPartitioning(tasks.len()) instead of the hardcoded 1.
Parallel Stream Execution: The execute phase uses self.tasks.get(partition) to spawn an ArrowReaderBuilder specific only to the slice of tasks mapped to that discrete DataFusion partition index.

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions