Description
Is your feature request related to a problem or challenge?
As noted in #1604, Iceberg-DataFusion read performance is currently bottlenecked by single-threaded execution. While size-based planning is the proposed long-term solution, a more immediate improvement would be to parallelize over FileScanTask and leverage ArrowReaderBuilder during plan execution.
Describe the solution you'd like
Group the FileScanTasks up front and distribute them across the available DataFusion partitions, updating the IcebergTableScan struct and its ExecutionPlan implementation:
Pre-partitioning Scan Tasks: IcebergTableScan accepts grouped tasks (tasks: Vec<Vec<FileScanTask>>) rather than computing streams eagerly.
Propagating Partition Counts: The compute_properties method returns Partitioning::UnknownPartitioning(tasks.len()) instead of the hardcoded 1, so DataFusion knows how many partitions it can execute in parallel.
Parallel Stream Execution: The execute phase uses self.tasks.get(partition) to look up the tasks assigned to that DataFusion partition index and builds an ArrowReaderBuilder-based reader over only that slice.
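To make the grouping step concrete, here is a minimal sketch of how tasks might be split into per-partition groups and looked up by partition index. It is not the actual iceberg-rust or DataFusion code: the `FileScanTask` struct below is a simplified stand-in, `group_tasks` and `tasks_for_partition` are hypothetical helper names, and round-robin assignment is just one possible strategy (an assumption, not something the proposal specifies).

```rust
/// Simplified stand-in for iceberg-rust's `FileScanTask` (assumption:
/// only the fields needed for this illustration are included).
#[derive(Debug, Clone, PartialEq)]
pub struct FileScanTask {
    pub data_file_path: String,
    pub length: u64,
}

/// Hypothetical helper: distribute tasks round-robin into at most
/// `target_partitions` groups. Always returns at least one group, so
/// the reported partition count is never zero.
pub fn group_tasks(
    tasks: Vec<FileScanTask>,
    target_partitions: usize,
) -> Vec<Vec<FileScanTask>> {
    // Never create more groups than there are tasks, and never fewer than one.
    let n = target_partitions.max(1).min(tasks.len().max(1));
    let mut groups: Vec<Vec<FileScanTask>> = (0..n).map(|_| Vec::new()).collect();
    for (i, task) in tasks.into_iter().enumerate() {
        groups[i % n].push(task);
    }
    groups
}

/// Hypothetical helper mirroring `self.tasks.get(partition)` in the
/// execute phase: returns the slice of tasks for one partition index,
/// or `None` if the index is out of range.
pub fn tasks_for_partition(
    groups: &[Vec<FileScanTask>],
    partition: usize,
) -> Option<&[FileScanTask]> {
    groups.get(partition).map(|g| g.as_slice())
}
```

With groups built this way, `groups.len()` is the value that would be passed to `Partitioning::UnknownPartitioning`, and each call to execute would read only the slice returned for its own partition index. Round-robin ignores file sizes, so skewed inputs could still leave partitions unbalanced; the size-based planning mentioned in #1604 would address that.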
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community