Add Python bindings for accessing ExecutionMetrics#1381
ShreyeshArangath wants to merge 2 commits into apache:main
Conversation
timsaucer
left a comment
At a high level, I think this could bring a lot of value. Thank you for putting in the work!
From an implementation perspective, did you consider, instead of caching the prior execution plan, simply adding collect() and execute_stream() and so forth on PyExecutionPlan? It seems like that would more closely mirror the upstream repo and simplify the code. I haven't spent a lot of time going through the details of why you're caching the prior plan, so it's very possible I missed something.
@timsaucer Thanks for the suggestion! When I first designed the change, I did consider that approach. Today, I think users naturally treat a DataFrame as the primary handle for a query:

```python
df = ctx.sql("SELECT * FROM t WHERE column1 > 1")
batches = df.collect()
```

Requiring metrics to go through `ExecutionPlan` would effectively change the model to look something like:

```python
df = ctx.sql("SELECT * FROM t WHERE column1 > 1")
plan = df.execution_plan()
batches = plan.collect()
metrics = plan.collect_metrics()
```

I thought this would require users to restructure pipelines and thread a plan object through call chains purely to have access to metrics; the effort required to get people to use it seemed high to me. My goal was to add metrics support with minimal changes to how users run queries:

```python
df = ctx.sql("SELECT * FROM t WHERE column1 > 1")
batches = df.collect()
plan = df.execution_plan()
metrics = plan.collect_metrics()
```

I'm happy to switch to the plan-based approach if we prefer stronger alignment with the upstream API, but I leaned toward this design to make observability easier to adopt without disrupting current usage patterns. Let me know what you think.
075e1ec to 0a57da6 (force-pushed)
timsaucer
left a comment
First off, I love this PR!
I've become convinced that your approach is better than what I was suggesting, i.e. making users create a plan and execute it!
One area I am concerned about is that display() bypasses this mechanism entirely. That is both good and bad. The good: the metrics genuinely would be different, because display() does a smaller collection that ends early. The bad: as a user it's probably confusing to see the data but then be told we don't have metrics for the data in front of them. What do you think?
The biggest area that I think is really necessary is around user facing documentation. I'm willing to chip in and help with this if you need. I think we want to tell the users how to use these metrics, both mechanically (like how you have to have executed the dataframe) and what information they provide. Plus there are differences between which stage of the plan you get them from and the fact that some metrics come from the different partitions as opposed to aggregate values.
```python
def metrics(self) -> MetricsSet | None:
    """Return metrics for this plan node after execution, or None if unavailable."""
    raw = self._raw_plan.metrics()
    if raw is None:
        return None
    return MetricsSet(raw)
```
This is leading me to think we should have some high level documentation, probably in the DataFrame page (or a subpage under it). Some of the things it would be good to do are to explain to a user what kinds of information they could find under these metrics and why that data are not available until after the DataFrame has been executed.
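To make that documentation point concrete, here is a minimal pure-Python sketch of the behavior being described — metrics are absent until the plan has actually been executed. The class and metric names here are stand-ins for illustration, not the PR's actual types:

```python
class PlanNode:
    """Toy stand-in for an execution plan node that records metrics when run."""

    def __init__(self) -> None:
        self._metrics = None

    def execute(self) -> None:
        # Running the node is what populates its metrics.
        self._metrics = {"output_rows": 128, "elapsed_compute": 5_000}

    def metrics(self):
        # Mirrors the proposed API shape: None until execution has happened.
        return self._metrics


node = PlanNode()
assert node.metrics() is None  # before execution: no metrics available
node.execute()
assert node.metrics()["output_rows"] == 128
```

Documentation built around this "None before execution" contract would let users understand why they must collect() first.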
```python
"""Walk the plan tree and collect metrics from all operators.

Returns a list of (operator_name, MetricsSet) tuples.
```
"Walk the plan tree and collect metrics" probably does not make a lot of sense to someone other than a developer. I think we can make this more user focused.
I haven't dug in, but is operator_name the name of the execution plan?
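For reference, a plan-tree walk producing (operator_name, metrics) pairs could be sketched like this. Everything here is a hypothetical toy model, not the PR's implementation, though `FilterExec`/`ParquetExec` mirror real DataFusion operator names:

```python
from dataclasses import dataclass, field


@dataclass
class Operator:
    """Toy stand-in for an execution plan node."""
    name: str                     # e.g. "FilterExec", "ParquetExec"
    metrics: dict
    children: list = field(default_factory=list)


def collect_metrics(op):
    """Pre-order walk: the root operator first, then its children."""
    pairs = [(op.name, op.metrics)]
    for child in op.children:
        pairs.extend(collect_metrics(child))
    return pairs


scan = Operator("ParquetExec", {"output_rows": 1000})
filt = Operator("FilterExec", {"output_rows": 10}, [scan])
names = [name for name, _ in collect_metrics(filt)]
# names == ["FilterExec", "ParquetExec"]
```

If the real `operator_name` is indeed the execution plan node's name, an example like this in the docs would answer the question directly.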
```python
Provides both individual metric access and convenience aggregations
across partitions.
```
A bit of an explanation is probably useful here. Again, I don't think we can assume the user understands that there are both individual execution plan metrics as well as aggregate. I think that some operators have metrics that cannot be aggregated. In general I suspect we really do need some high level documentation with examples we can point to that makes all of this more concrete.
On second read I now see this is aggregating across partitions. So does that mean the metrics() fn is returning per partition metrics for one ExecutionPlan? Asking for my understanding mostly.
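The aggregation being discussed could be modeled roughly like this — summing one metric across partitions, where a partition may not have reported a value at all. The helper name is hypothetical:

```python
def sum_across_partitions(per_partition):
    """Sum one metric over partitions; None if no partition reported it."""
    values = [v for v in per_partition if v is not None]
    return sum(values) if values else None


# Three partitions reported output_rows; the aggregate is their sum.
assert sum_across_partitions([100, 250, 50]) == 400
# If no partition reported the metric, the aggregate is None rather than 0.
assert sum_across_partitions([None, None]) is None
```

A worked example like this in the docs would make the per-partition vs. aggregate distinction explicit.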
```python
@property
def elapsed_compute(self) -> int | None:
    """Sum of elapsed_compute across all partitions, in nanoseconds."""
```
We probably want to describe what elapsed_compute is rather than assume user knowledge.
```python
@property
def spill_count(self) -> int | None:
    """Sum of spill_count across all partitions."""
```
Same with spill count. Do you know what units it has?
```rust
let df = self.df.as_ref().clone();
let plan = wait_for_future(py, df.create_physical_plan())?
    .map_err(PyDataFusionError::from)?;
*self.last_plan.lock() = Some(Arc::clone(&plan));
let task_ctx = Arc::new(self.df.as_ref().task_ctx());
let batches = wait_for_future(py, df_collect(plan, task_ctx))?
```
If I run collect() twice on a DF, should we instead just do the lock on the last plan and clone it? I suspect there's not a huge performance difference the vast majority of the time as opposed to how you have it.
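The suggestion could be modeled in Python roughly like this — a toy stand-in for the Rust code, where a second collect() reuses the cached plan instead of recreating it:

```python
class CachedPlanFrame:
    """Toy model: collect() creates a plan once and reuses it afterwards."""

    def __init__(self):
        self._last_plan = None      # analogue of the PR's last_plan slot
        self.plans_created = 0      # instrumentation for the sketch only

    def _create_physical_plan(self):
        self.plans_created += 1
        return object()             # placeholder for a real physical plan

    def collect(self):
        if self._last_plan is None:
            self._last_plan = self._create_physical_plan()
        # Execute using the cached plan (execution itself is elided here).
        return self._last_plan


df = CachedPlanFrame()
df.collect()
df.collect()
assert df.plans_created == 1  # the second collect() reused the cached plan
```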
```rust
if let Some(plan) = self.last_plan.lock().as_ref() {
    return Ok(PyExecutionPlan::new(Arc::clone(plan)));
}
let plan = wait_for_future(py, self.df.as_ref().clone().create_physical_plan())??;
Ok(plan.into())
```
If you go the route of using the existing last_plan for collect() like in my other comment then I think you could set it here just like you do in collect().
```rust
let plan = wait_for_future(py, df.create_physical_plan())?
    .map_err(PyDataFusionError::from)?;
*self.last_plan.lock() = Some(Arc::clone(&plan));
let task_ctx = Arc::new(self.df.as_ref().task_ctx());
```
It feels like we're doing this in a bunch of places, so maybe make a private helper function.
```rust
    self.metrics.output_rows()
}

/// Returns the sum of all `elapsed_compute` metrics in nanoseconds, or None if not present.
```
There are a lot of boilerplate comments like this where the function is self-explanatory and not exposed to the end user.
```rust
/// Returns the numeric value of this metric, or None for non-numeric types.
#[getter]
fn value(&self) -> Option<usize> {
```
It feels like we could return `Option<Py<PyAny>>` and try casting the value appropriately.
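From the Python caller's side, "casting the value appropriately" might look like this sketch — returning the most natural native type per metric. The function name and the type mapping are assumptions for illustration, not the PR's design:

```python
def metric_value(raw):
    """Return a metric value as the most natural Python type.

    Counts and nanosecond durations stay ints, ratios stay floats,
    and anything non-numeric (timestamps, labels) falls back to str.
    """
    if raw is None:
        return None                 # metric not recorded
    if isinstance(raw, bool):
        return raw                  # guard: bool is a subclass of int
    if isinstance(raw, (int, float)):
        return raw                  # output_rows, elapsed_compute, ratios
    return str(raw)                 # start_timestamp and other non-numerics


assert metric_value(42) == 42
assert metric_value(0.5) == 0.5
assert metric_value(None) is None
```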
Which issue does this PR close?
Closes #1379
Rationale for this change
Today, DataFusion Python only exposes execution metrics through formatted console output via `explain(analyze=True)`. This makes it difficult to programmatically inspect execution behavior. There is currently no structured Python API to access per-operator metrics such as `output_rows`, `elapsed_compute`, `spill_count`, and other runtime metrics collected during execution. This PR introduces APIs to surface the execution metrics, mirroring the Rust API in `datafusion::physical_plan::metrics`.
What changes are included in this PR?
- Updated `PyDataFrame` so the physical plan used during execution is retained and available for metrics access.
- Exposed a `metrics()` method and added a `collect_metrics()` helper to walk the execution plan tree and aggregate metrics from all operators.
Are there any user-facing changes?
Users can now programmatically access execution metrics.