From 88f6d4c53216d0475e1dea084fe8e23e2f18de92 Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 13:58:26 +0800 Subject: [PATCH 01/14] feat(aggregate): cost-aware partial-aggregation skip (opt-in) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The fixed `skip_partial_aggregation_probe_ratio_threshold` (default 0.8) catches "the partial agg barely reduces anything" cases, but it misses the band where the ratio is moderate (say 0.5-0.6) and partial aggregation is *still* net-negative because per-row cost is high — heavy variable- length keys, complex aggregates, etc. ClickBench Q18 is the motivating example (issue #22405): ratio 0.565, but partial agg burns 17s of compute across 12 partitions while reducing input only ~40%; turning the threshold down enough to catch it would regress lower-cost queries. Add a second, opt-in skip rule that augments the fixed-ratio check with the measured per-row wall time of the operator. Disabled by default, so existing behaviour is preserved. New config (all under `datafusion.execution`): - `skip_partial_aggregation_use_cost_model` (bool, default false) — turns the cost-aware rule on. - `skip_partial_aggregation_cost_ns_per_row` (u64, default 1000) — the per-row wall-time floor above which the cost-aware rule fires. - `skip_partial_aggregation_cost_min_ratio` (f64, default 0.3) — below this ratio partial agg is kept regardless of per-row cost (it's reducing too much to be worth skipping). How it works: `SkipAggregationProbe` already runs at probe-window boundaries and already has `baseline_metrics.elapsed_compute` ticking through every timed block. The probe now snapshots that counter at construction; once `probe_rows_threshold` is reached, it computes `ns_per_row = (elapsed_compute - snapshot) / input_rows` and, if both the per-row cost is above the floor and the ratio sits in the medium band, switches to skip mode. The existing high-ratio rule still fires first, so this is purely additive. Five unit tests on `SkipAggregationProbe` cover the new branches — cost-model-off matches the legacy ratio check, medium-ratio + high cost skips, below-min-ratio doesn't, cheap-per-row doesn't, and the high-ratio rule is honoured even with the cost model on. Refs: #22405 --- datafusion/common/src/config.rs | 24 +++ .../physical-plan/src/aggregates/row_hash.rs | 189 +++++++++++++++++- docs/source/user-guide/configs.md | 3 + 3 files changed, 209 insertions(+), 7 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index e6d1ebbbbe746..a410f0a42701a 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -648,6 +648,30 @@ config_namespace! { /// aggregation ratio check and trying to switch to skipping aggregation mode pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000 + /// (experimental) When true, augment the fixed + /// `skip_partial_aggregation_probe_ratio_threshold` check with a + /// cost-aware rule: skip partial aggregation when the measured + /// per-row wall time exceeds + /// `skip_partial_aggregation_cost_ns_per_row` even at ratios below + /// the fixed threshold (provided the ratio is at least + /// `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench + /// Q18-shape queries where the ratio (~0.56) is below 0.8 but + /// partial aggregation is net-negative due to heavy keys / aggs. + pub skip_partial_aggregation_use_cost_model: bool, default = false + + /// Minimum measured per-row wall time (nanoseconds) of partial + /// aggregation for the cost-aware skip rule to fire. Only + /// consulted when `skip_partial_aggregation_use_cost_model` is + /// true. + pub skip_partial_aggregation_cost_ns_per_row: u64, default = 1000 + + /// Lower bound on aggregation ratio (num_groups / input_rows) for + /// the cost-aware skip rule to fire. Below this ratio partial + /// aggregation is considered valuable enough to keep regardless + /// of per-row cost. Only consulted when + /// `skip_partial_aggregation_use_cost_model` is true. + pub skip_partial_aggregation_cost_min_ratio: f64, default = 0.3 + /// Should DataFusion use row number estimates at the input to decide /// whether increasing parallelism is beneficial or not. By default, /// only exact row numbers (not estimates) are used for this decision. diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index 1164fb37b384a..e38abafd6a9f4 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -134,6 +134,15 @@ struct SkipAggregationProbe { /// (from `SessionConfig`). If the ratio exceeds this value, aggregation /// is skipped and input rows are directly converted to output probe_ratio_threshold: f64, + /// (experimental) Cost-aware skip rule: skip when measured per-row wall + /// time is high even at ratios below `probe_ratio_threshold`. Disabled + /// by default; preserves the current bare-ratio behaviour. + use_cost_model: bool, + /// Per-row wall-time threshold (ns) for cost-aware skip rule. + cost_ns_per_row_threshold: u64, + /// Minimum ratio for cost-aware skip rule to fire. Below this ratio + /// partial aggregation is kept regardless of per-row cost. + cost_min_ratio: f64, // ======================================================================== // STATES: @@ -162,31 +171,64 @@ struct SkipAggregationProbe { /// * if greater than zero, the number of rows which were output directly /// without aggregation skipped_aggregation_rows: metrics::Count, + /// Operator-wide `elapsed_compute` metric. Cloned from + /// `BaselineMetrics` so the probe can read the accumulated nanoseconds + /// at decision time and infer the wall time spent on partial + /// aggregation so far (used by the cost-aware rule). + elapsed_compute: metrics::Time, + /// `elapsed_compute.value()` snapshot at probe construction. Diff + /// against the live value gives wall time elapsed since the stream + /// started — a close approximation of partial aggregation cost, + /// since `elapsed_compute` covers exactly the timed work blocks in + /// the partial-mode loop. + elapsed_compute_at_probe_start: usize, } impl SkipAggregationProbe { fn new( probe_rows_threshold: usize, probe_ratio_threshold: f64, + use_cost_model: bool, + cost_ns_per_row_threshold: u64, + cost_min_ratio: f64, skipped_aggregation_rows: metrics::Count, + elapsed_compute: metrics::Time, ) -> Self { + let elapsed_compute_at_probe_start = elapsed_compute.value(); Self { input_rows: 0, num_groups: 0, probe_rows_threshold, probe_ratio_threshold, + use_cost_model, + cost_ns_per_row_threshold, + cost_min_ratio, should_skip: false, is_locked: false, skipped_aggregation_rows, + elapsed_compute, + elapsed_compute_at_probe_start, } } /// Updates `SkipAggregationProbe` state: /// - increments the number of input rows /// - replaces the number of groups with the new value - /// - on `probe_rows_threshold` exceeded calculates - /// aggregation ratio and sets `should_skip` flag + /// - on `probe_rows_threshold` exceeded applies the skip-decision rules + /// (fixed ratio threshold, plus the optional cost-aware rule) /// - if `should_skip` is set, locks further state updates + /// + /// Two skip rules are evaluated, in order: + /// + /// 1. **Fixed ratio**: `num_groups / input_rows >= probe_ratio_threshold` + /// — the original behaviour, preserved for backwards compatibility. + /// 2. **Cost-aware** (opt-in via `use_cost_model`): when the ratio is + /// above `cost_min_ratio` *and* the measured per-row wall time of + /// the operator exceeds `cost_ns_per_row_threshold`, skip even if + /// the ratio is below `probe_ratio_threshold`. This catches the + /// case where partial aggregation is net-negative due to high + /// per-row cost (heavy variable-length keys, expensive aggregates) + /// rather than poor reduction alone. fn update_state(&mut self, input_rows: usize, num_groups: usize) { if self.is_locked { return; @@ -194,11 +236,33 @@ impl SkipAggregationProbe { self.input_rows += input_rows; self.num_groups = num_groups; if self.input_rows >= self.probe_rows_threshold { - self.should_skip = self.num_groups as f64 / self.input_rows as f64 - >= self.probe_ratio_threshold; - // Set is_locked to true only if we have decided to skip, otherwise we can try to skip - // during processing the next record_batch. - self.is_locked = self.should_skip; + let ratio = self.num_groups as f64 / self.input_rows as f64; + + // Rule 1: fixed-ratio skip (existing behaviour). + if ratio >= self.probe_ratio_threshold { + self.should_skip = true; + self.is_locked = true; + return; + } + + // Rule 2 (opt-in): cost-aware skip for the medium-ratio, + // high-per-row-cost band. + if self.use_cost_model && ratio >= self.cost_min_ratio { + let elapsed_ns = self + .elapsed_compute + .value() + .saturating_sub(self.elapsed_compute_at_probe_start); + let ns_per_row = (elapsed_ns as u64) + .checked_div(self.input_rows as u64) + .unwrap_or(0); + if ns_per_row > self.cost_ns_per_row_threshold { + self.should_skip = true; + self.is_locked = true; + } + } + + // No rule fired — leave `is_locked` false so the next batch + // can re-evaluate. } } @@ -644,13 +708,21 @@ impl GroupedHashAggregateStream { options.skip_partial_aggregation_probe_rows_threshold; let probe_ratio_threshold = options.skip_partial_aggregation_probe_ratio_threshold; + let use_cost_model = options.skip_partial_aggregation_use_cost_model; + let cost_ns_per_row_threshold = + options.skip_partial_aggregation_cost_ns_per_row; + let cost_min_ratio = options.skip_partial_aggregation_cost_min_ratio; let skipped_aggregation_rows = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .counter("skipped_aggregation_rows", partition); Some(SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, + use_cost_model, + cost_ns_per_row_threshold, + cost_min_ratio, skipped_aggregation_rows, + baseline_metrics.elapsed_compute().clone(), )) } else { None @@ -1481,6 +1553,7 @@ mod tests { use datafusion_functions_aggregate::count::count_udaf; use datafusion_physical_expr::aggregate::AggregateExprBuilder; use datafusion_physical_expr::expressions::col; + use std::time::Duration; #[tokio::test] async fn test_double_emission_race_condition_bug() -> Result<()> { @@ -1817,4 +1890,106 @@ mod tests { Ok(()) } + + // ---------------- SkipAggregationProbe unit tests ---------------- + + /// Build a probe with the given configuration. Skip-count metric is + /// disconnected (a fresh `Count`) and `elapsed_compute` is a fresh + /// `Time` that the test can drive with `add_duration`. + fn make_probe( + probe_rows_threshold: usize, + probe_ratio_threshold: f64, + use_cost_model: bool, + cost_ns_per_row_threshold: u64, + cost_min_ratio: f64, + ) -> (SkipAggregationProbe, metrics::Time) { + let elapsed_compute = metrics::Time::new(); + let probe = SkipAggregationProbe::new( + probe_rows_threshold, + probe_ratio_threshold, + use_cost_model, + cost_ns_per_row_threshold, + cost_min_ratio, + metrics::Count::new(), + elapsed_compute.clone(), + ); + (probe, elapsed_compute) + } + + /// With the cost model off, the behaviour is exactly the legacy bare + /// ratio check. + #[test] + fn skip_probe_cost_model_off_matches_legacy_ratio_check() { + let (mut probe, _t) = make_probe(100, 0.8, false, 1000, 0.3); + + // 100 rows / 50 groups → ratio 0.5, below 0.8 → don't skip + probe.update_state(100, 50); + assert!(!probe.should_skip()); + assert!(!probe.is_locked); + + // Next window: another 100 rows, total 200 / 170 groups → ratio + // 0.85, above 0.8 → skip. + probe.update_state(100, 170); + assert!(probe.should_skip()); + assert!(probe.is_locked); + } + + /// With the cost model on, a high-cost slow query in the medium-ratio + /// band (0.3 ≤ ratio < 0.8) gets skipped — the ClickBench Q18 case. + #[test] + fn skip_probe_cost_model_fires_in_medium_ratio_high_cost_band() { + let (mut probe, elapsed) = make_probe(100, 0.8, true, 1000, 0.3); + + // 200_000 ns over 100 rows → 2_000 ns/row → above the 1_000 + // threshold. Ratio 0.5 sits in the medium band. + elapsed.add_duration(Duration::from_nanos(200_000)); + probe.update_state(100, 50); + + assert!(probe.should_skip()); + assert!(probe.is_locked); + } + + /// Cost model on, ratio below `cost_min_ratio`: the partial agg is + /// genuinely reducing a lot (small / cheap output), so we keep it + /// regardless of per-row cost. + #[test] + fn skip_probe_cost_model_does_not_fire_below_min_ratio() { + let (mut probe, elapsed) = make_probe(100, 0.8, true, 1000, 0.3); + + // Very expensive — 5_000 ns/row — but ratio is 0.2 (huge + // reduction), so partial agg is still net-positive. + elapsed.add_duration(Duration::from_nanos(500_000)); + probe.update_state(100, 20); + + assert!(!probe.should_skip()); + assert!(!probe.is_locked); + } + + /// Cost model on but per-row cost is below threshold: ratio alone + /// doesn't justify skipping → keep aggregating. + #[test] + fn skip_probe_cost_model_does_not_fire_when_cheap_per_row() { + let (mut probe, elapsed) = make_probe(100, 0.8, true, 1000, 0.3); + + // 50_000 ns over 100 rows → 500 ns/row → below the 1_000 ns + // threshold. Ratio 0.5 alone is below the fixed 0.8 cut-off. + elapsed.add_duration(Duration::from_nanos(50_000)); + probe.update_state(100, 50); + + assert!(!probe.should_skip()); + assert!(!probe.is_locked); + } + + /// Cost model on still honours the original high-ratio rule: ratio + /// >= `probe_ratio_threshold` skips regardless of cost. + #[test] + fn skip_probe_cost_model_still_honours_high_ratio_rule() { + let (mut probe, _t) = make_probe(100, 0.8, true, 1000, 0.3); + + // Ratio 0.9, no measured cost → skip via the fixed rule. + probe.update_state(100, 90); + + assert!(probe.should_skip()); + assert!(probe.is_locked); + } } diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 576137bda29d1..6384b7e641c01 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -132,6 +132,9 @@ The following configuration settings are available: | datafusion.execution.keep_partition_by_columns | false | Should DataFusion keep the columns used for partition_by in the output RecordBatches | | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | +| datafusion.execution.skip_partial_aggregation_use_cost_model | false | (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. | +| datafusion.execution.skip_partial_aggregation_cost_ns_per_row | 1000 | Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | +| datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.3 | Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | | datafusion.execution.objectstore_writer_buffer_size | 10485760 | Size (bytes) of data buffer DataFusion uses when writing output files. This affects the size of the data chunks that are uploaded to remote object stores (e.g. AWS S3). If very large (>= 100 GiB) output files are being written, it may be necessary to increase this size to avoid errors from the remote end point. | From 9940d8aa193c39af67b51c7e84dbe9a5c6c88f6b Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 14:36:12 +0800 Subject: [PATCH 02/14] [FOR BENCHMARK ONLY] flip default to true + update SLT MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Temporarily set `skip_partial_aggregation_use_cost_model` default = true so the benchmark bot actually exercises the new code path. **Revert this commit before merge** — final default should remain false (opt-in) until ClickBench-wide validation tunes the constants. Regenerated: - docs/source/user-guide/configs.md - datafusion/sqllogictest/test_files/information_schema.slt (SHOW ALL added the new 3 config rows; CI was failing on the stale expectation). --- datafusion/common/src/config.rs | 2 +- datafusion/sqllogictest/test_files/information_schema.slt | 8 +++++++- docs/source/user-guide/configs.md | 2 +- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index a410f0a42701a..a50f3493fb94d 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -657,7 +657,7 @@ config_namespace! { /// `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench /// Q18-shape queries where the ratio (~0.56) is below 0.8 but /// partial aggregation is net-negative due to heavy keys / aggs. - pub skip_partial_aggregation_use_cost_model: bool, default = false + pub skip_partial_aggregation_use_cost_model: bool, default = true /// Minimum measured per-row wall time (nanoseconds) of partial /// aggregation for the cost-aware skip rule to fire. Only diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index b0c7e3f8fe643..e07cf99270a0f 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -266,8 +266,11 @@ datafusion.execution.parquet.writer_version 1.0 datafusion.execution.perfect_hash_join_min_key_density 0.15 datafusion.execution.perfect_hash_join_small_build_threshold 1024 datafusion.execution.planning_concurrency 13 +datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.3 +datafusion.execution.skip_partial_aggregation_cost_ns_per_row 1000 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 +datafusion.execution.skip_partial_aggregation_use_cost_model true datafusion.execution.skip_physical_aggregate_schema_check false datafusion.execution.soft_max_rows_per_output_file 50000000 datafusion.execution.sort_in_place_threshold_bytes 1048576 @@ -416,8 +419,11 @@ datafusion.execution.parquet.writer_version 1.0 (writing) Sets parquet writer ve datafusion.execution.perfect_hash_join_min_key_density 0.15 The minimum required density of join keys on the build side to consider a perfect hash join (see `HashJoinExec` for more details). Density is calculated as: `(number of rows) / (max_key - min_key + 1)`. A perfect hash join may be used if the actual key density > this value. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.perfect_hash_join_small_build_threshold 1024 A perfect hash join (see `HashJoinExec` for more details) will be considered if the range of keys (max - min) on the build side is < this threshold. This provides a fast path for joins with very small key ranges, bypassing the density check. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.planning_concurrency 13 Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system +datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.3 Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. +datafusion.execution.skip_partial_aggregation_cost_ns_per_row 1000 Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode +datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. datafusion.execution.skip_physical_aggregate_schema_check false When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. datafusion.execution.soft_max_rows_per_output_file 50000000 Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max datafusion.execution.sort_in_place_threshold_bytes 1048576 When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. @@ -895,7 +901,7 @@ show functions statement ok reset datafusion.catalog.information_schema; -# The SLT runner sets `target_partitions` to 4 instead of using the default, so +# The SLT runner sets `target_partitions` to 4 instead of using the default, so # reset it explicitly. statement ok set datafusion.execution.target_partitions = 4; diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 6384b7e641c01..bc80dce703d7d 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -132,7 +132,7 @@ The following configuration settings are available: | datafusion.execution.keep_partition_by_columns | false | Should DataFusion keep the columns used for partition_by in the output RecordBatches | | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | -| datafusion.execution.skip_partial_aggregation_use_cost_model | false | (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. | +| datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. | | datafusion.execution.skip_partial_aggregation_cost_ns_per_row | 1000 | Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | | datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.3 | Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | From 2b4979836184f7ce37c343921185b6e46372a429 Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 15:11:01 +0800 Subject: [PATCH 03/14] =?UTF-8?q?tune:=20drop=20cost=5Fns=5Fper=5Frow=20de?= =?UTF-8?q?fault=20from=201000=20=E2=86=92=20100?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 1000 ns/row threshold was wishful thinking. Re-derived per-row cost on the benchmark hardware (Neoverse-V2 ARM): for ClickBench Q18, partial agg costs ~100-200 ns/row, well below 1000. M-series MBP from the issue report is similar (~170 ns/row reading back from the 17 s / 100 M figure). 100 ns/row is roughly the floor of a hash-table probe + insert on modern CPUs, so anything above that is in the "meaningful per-row overhead" band where partial agg can plausibly be net-negative. The 0.3 cost_min_ratio guard keeps low-cardinality / high-reduction queries (like ClickBench Q35) safe — they sit below 0.3 and never enter this branch regardless of per-row cost. --- datafusion/common/src/config.rs | 2 +- datafusion/sqllogictest/test_files/information_schema.slt | 4 ++-- docs/source/user-guide/configs.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index a50f3493fb94d..cb11d8eb70bf1 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -663,7 +663,7 @@ config_namespace! { /// aggregation for the cost-aware skip rule to fire. Only /// consulted when `skip_partial_aggregation_use_cost_model` is /// true. - pub skip_partial_aggregation_cost_ns_per_row: u64, default = 1000 + pub skip_partial_aggregation_cost_ns_per_row: u64, default = 100 /// Lower bound on aggregation ratio (num_groups / input_rows) for /// the cost-aware skip rule to fire. Below this ratio partial diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index e07cf99270a0f..a51ca0900a178 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -267,7 +267,7 @@ datafusion.execution.perfect_hash_join_min_key_density 0.15 datafusion.execution.perfect_hash_join_small_build_threshold 1024 datafusion.execution.planning_concurrency 13 datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.3 -datafusion.execution.skip_partial_aggregation_cost_ns_per_row 1000 +datafusion.execution.skip_partial_aggregation_cost_ns_per_row 100 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 datafusion.execution.skip_partial_aggregation_use_cost_model true @@ -420,7 +420,7 @@ datafusion.execution.perfect_hash_join_min_key_density 0.15 The minimum required datafusion.execution.perfect_hash_join_small_build_threshold 1024 A perfect hash join (see `HashJoinExec` for more details) will be considered if the range of keys (max - min) on the build side is < this threshold. This provides a fast path for joins with very small key ranges, bypassing the density check. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.planning_concurrency 13 Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.3 Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. -datafusion.execution.skip_partial_aggregation_cost_ns_per_row 1000 Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. +datafusion.execution.skip_partial_aggregation_cost_ns_per_row 100 Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index bc80dce703d7d..011bfffdc62d3 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -133,7 +133,7 @@ The following configuration settings are available: | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | | datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. | -| datafusion.execution.skip_partial_aggregation_cost_ns_per_row | 1000 | Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | +| datafusion.execution.skip_partial_aggregation_cost_ns_per_row | 100 | Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | | datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.3 | Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | From e6a98fe4da67d2e3d56092b559f29ec83613756d Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 17:26:42 +0800 Subject: [PATCH 04/14] pivot to lower-ratio Rule 2 + diagnostic gauges MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 1 (cost_ns_per_row > 1000) didn't fire on Q18 because partial agg per-row cost at probe close (first 100k rows, hash table still small) is ~100-200 ns on the ARM bot, not 1000+. Lowering to 100 didn't help either — the measured cost at probe time underestimates the eventual asymptotic cost, so a single-shot probe-time threshold is fundamentally fragile. Pivot: - Drop `skip_partial_aggregation_cost_ns_per_row` entirely from the decision. Rule 2 is now a pure ratio check: skip when `ratio >= cost_min_ratio` (default 0.5). - This matches the empirical finding in the issue body: ratio_threshold = 0.6 makes Q18 1.73× faster on M-series. 0.5 is conservative around that — the 0.3 cost_min_ratio guard from before is gone. - Add two diagnostic gauges (always recorded, regardless of which rule fires): * `partial_agg_probe_ns_per_row` — measured per-row wall time * `partial_agg_probe_ratio_per_mille` — ratio × 1000 EXPLAIN ANALYZE shows these so we can revisit a real cost-aware rule later with actual numbers instead of guessing thresholds. Why keep `use_cost_model` as the flag name even though it isn't cost-aware anymore: the gauges (the basis for a future cost-aware rule) ride alongside, and we want a single opt-in surface that graduates from "lower ratio threshold" to "cost-aware" without churning configs. Unit tests rewritten to match: 5 tests covering off/on, fires/doesn't fire, fixed-rule precedence, and gauge recording. --- datafusion/common/src/config.rs | 47 +++-- .../physical-plan/src/aggregates/row_hash.rs | 180 ++++++++++-------- .../test_files/information_schema.slt | 8 +- docs/source/user-guide/configs.md | 5 +- 4 files changed, 135 insertions(+), 105 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index cb11d8eb70bf1..bcd8f87a89bba 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -648,29 +648,34 @@ config_namespace! { /// aggregation ratio check and trying to switch to skipping aggregation mode pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000 - /// (experimental) When true, augment the fixed - /// `skip_partial_aggregation_probe_ratio_threshold` check with a - /// cost-aware rule: skip partial aggregation when the measured - /// per-row wall time exceeds - /// `skip_partial_aggregation_cost_ns_per_row` even at ratios below - /// the fixed threshold (provided the ratio is at least - /// `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench - /// Q18-shape queries where the ratio (~0.56) is below 0.8 but - /// partial aggregation is net-negative due to heavy keys / aggs. + /// (experimental) When true, apply a *secondary* skip rule on top + /// of `skip_partial_aggregation_probe_ratio_threshold`: skip + /// partial aggregation when the measured ratio is at least + /// `skip_partial_aggregation_cost_min_ratio` (default 0.5). + /// Targets ClickBench Q18-shape queries where the ratio (~0.56) + /// sits just below the fixed 0.8 threshold so partial agg keeps + /// running, but the absolute work (heavy variable-length keys, + /// complex aggregates) makes it net-negative. + /// + /// Empirical motivation: lowering the global ratio threshold to + /// 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost + /// queries at similar ratios. This flag exposes the lower + /// threshold as a separate, opt-in knob. Whether the cost-aware + /// signal (`partial_agg_probe_ns_per_row` metric) can replace + /// this static threshold is an open question — for now the + /// metric is reported alongside so callers can evaluate. pub skip_partial_aggregation_use_cost_model: bool, default = true - /// Minimum measured per-row wall time (nanoseconds) of partial - /// aggregation for the cost-aware skip rule to fire. Only - /// consulted when `skip_partial_aggregation_use_cost_model` is - /// true. - pub skip_partial_aggregation_cost_ns_per_row: u64, default = 100 - - /// Lower bound on aggregation ratio (num_groups / input_rows) for - /// the cost-aware skip rule to fire. Below this ratio partial - /// aggregation is considered valuable enough to keep regardless - /// of per-row cost. Only consulted when - /// `skip_partial_aggregation_use_cost_model` is true. - pub skip_partial_aggregation_cost_min_ratio: f64, default = 0.3 + /// Effective ratio threshold used by the cost-aware skip rule. + /// Skip partial aggregation when ratio at probe close is at + /// least this value (provided + /// `skip_partial_aggregation_use_cost_model` is true). + /// + /// Default 0.5 is the empirical sweet spot from ClickBench: 0.4 + /// would skip too aggressively (regresses low-cost queries + /// that benefit from partial), 0.7 fails to catch Q18-shape + /// queries. + pub skip_partial_aggregation_cost_min_ratio: f64, default = 0.5 /// Should DataFusion use row number estimates at the input to decide /// whether increasing parallelism is beneficial or not. By default, diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index e38abafd6a9f4..041415cf1050d 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -134,14 +134,11 @@ struct SkipAggregationProbe { /// (from `SessionConfig`). If the ratio exceeds this value, aggregation /// is skipped and input rows are directly converted to output probe_ratio_threshold: f64, - /// (experimental) Cost-aware skip rule: skip when measured per-row wall - /// time is high even at ratios below `probe_ratio_threshold`. Disabled - /// by default; preserves the current bare-ratio behaviour. + /// (experimental) When true, apply a secondary skip rule using + /// `cost_min_ratio` as a lower effective ratio threshold. Disabled + /// by default to preserve the bare-ratio behaviour. use_cost_model: bool, - /// Per-row wall-time threshold (ns) for cost-aware skip rule. - cost_ns_per_row_threshold: u64, - /// Minimum ratio for cost-aware skip rule to fire. Below this ratio - /// partial aggregation is kept regardless of per-row cost. + /// Effective ratio threshold for the cost-aware skip rule. cost_min_ratio: f64, // ======================================================================== @@ -174,7 +171,7 @@ struct SkipAggregationProbe { /// Operator-wide `elapsed_compute` metric. Cloned from /// `BaselineMetrics` so the probe can read the accumulated nanoseconds /// at decision time and infer the wall time spent on partial - /// aggregation so far (used by the cost-aware rule). + /// aggregation so far. elapsed_compute: metrics::Time, /// `elapsed_compute.value()` snapshot at probe construction. Diff /// against the live value gives wall time elapsed since the stream @@ -182,17 +179,31 @@ struct SkipAggregationProbe { /// since `elapsed_compute` covers exactly the timed work blocks in /// the partial-mode loop. elapsed_compute_at_probe_start: usize, + /// Diagnostic gauge: measured wall-time per input row (ns) at the + /// most recent probe evaluation. `set` once per `update_state` call + /// after `probe_rows_threshold` is met, so EXPLAIN ANALYZE reflects + /// the last evaluation's reading regardless of whether a skip rule + /// fired. Provides the empirical signal used to tune + /// `cost_min_ratio` / future cost-based rules. + probe_ns_per_row: metrics::Gauge, + /// Diagnostic gauge: aggregation ratio (`num_groups / input_rows`) + /// at the most recent probe evaluation, scaled by 1000 (so 0.565 + /// → 565). `usize` gauges store integers, so the scale lets us + /// preserve three decimal digits without floating point. + probe_ratio_per_mille: metrics::Gauge, } impl SkipAggregationProbe { + #[expect(clippy::too_many_arguments)] fn new( probe_rows_threshold: usize, probe_ratio_threshold: f64, use_cost_model: bool, - cost_ns_per_row_threshold: u64, cost_min_ratio: f64, skipped_aggregation_rows: metrics::Count, elapsed_compute: metrics::Time, + probe_ns_per_row: metrics::Gauge, + probe_ratio_per_mille: metrics::Gauge, ) -> Self { let elapsed_compute_at_probe_start = elapsed_compute.value(); Self { @@ -201,13 +212,14 @@ impl SkipAggregationProbe { probe_rows_threshold, probe_ratio_threshold, use_cost_model, - cost_ns_per_row_threshold, cost_min_ratio, should_skip: false, is_locked: false, skipped_aggregation_rows, elapsed_compute, elapsed_compute_at_probe_start, + probe_ns_per_row, + probe_ratio_per_mille, } } @@ -215,20 +227,19 @@ impl SkipAggregationProbe { /// - increments the number of input rows /// - replaces the number of groups with the new value /// - on `probe_rows_threshold` exceeded applies the skip-decision rules - /// (fixed ratio threshold, plus the optional cost-aware rule) - /// - if `should_skip` is set, locks further state updates + /// - records the diagnostic gauges /// /// Two skip rules are evaluated, in order: /// /// 1. **Fixed ratio**: `num_groups / input_rows >= probe_ratio_threshold` /// — the original behaviour, preserved for backwards compatibility. - /// 2. **Cost-aware** (opt-in via `use_cost_model`): when the ratio is - /// above `cost_min_ratio` *and* the measured per-row wall time of - /// the operator exceeds `cost_ns_per_row_threshold`, skip even if - /// the ratio is below `probe_ratio_threshold`. This catches the - /// case where partial aggregation is net-negative due to high - /// per-row cost (heavy variable-length keys, expensive aggregates) - /// rather than poor reduction alone. + /// 2. **Lower-ratio** (opt-in via `use_cost_model`): skip when the ratio + /// is at least `cost_min_ratio` (default 0.5). Targets the medium + /// band (0.5–0.8) where partial aggregation is often net-negative + /// despite the ratio being below the fixed threshold. The per-row + /// cost gauge is reported alongside via `probe_ns_per_row` for + /// diagnosing whether a cost-based replacement of this static + /// threshold is feasible. fn update_state(&mut self, input_rows: usize, num_groups: usize) { if self.is_locked { return; @@ -238,6 +249,18 @@ impl SkipAggregationProbe { if self.input_rows >= self.probe_rows_threshold { let ratio = self.num_groups as f64 / self.input_rows as f64; + // Diagnostic gauges: always recorded at probe evaluation, + // independent of which (if any) skip rule fires. + let elapsed_ns = self + .elapsed_compute + .value() + .saturating_sub(self.elapsed_compute_at_probe_start); + let ns_per_row = (elapsed_ns as u64) + .checked_div(self.input_rows as u64) + .unwrap_or(0); + self.probe_ns_per_row.set(ns_per_row as usize); + self.probe_ratio_per_mille.set((ratio * 1000.0) as usize); + // Rule 1: fixed-ratio skip (existing behaviour). if ratio >= self.probe_ratio_threshold { self.should_skip = true; @@ -245,22 +268,11 @@ impl SkipAggregationProbe { return; } - // Rule 2 (opt-in): cost-aware skip for the medium-ratio, - // high-per-row-cost band. + // Rule 2 (opt-in): lower-ratio skip. if self.use_cost_model && ratio >= self.cost_min_ratio { - let elapsed_ns = self - .elapsed_compute - .value() - .saturating_sub(self.elapsed_compute_at_probe_start); - let ns_per_row = (elapsed_ns as u64) - .checked_div(self.input_rows as u64) - .unwrap_or(0); - if ns_per_row > self.cost_ns_per_row_threshold { - self.should_skip = true; - self.is_locked = true; - } + self.should_skip = true; + self.is_locked = true; } - // No rule fired — leave `is_locked` false so the next batch // can re-evaluate. } @@ -709,20 +721,25 @@ impl GroupedHashAggregateStream { let probe_ratio_threshold = options.skip_partial_aggregation_probe_ratio_threshold; let use_cost_model = options.skip_partial_aggregation_use_cost_model; - let cost_ns_per_row_threshold = - options.skip_partial_aggregation_cost_ns_per_row; let cost_min_ratio = options.skip_partial_aggregation_cost_min_ratio; let skipped_aggregation_rows = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .counter("skipped_aggregation_rows", partition); + let probe_ns_per_row = MetricBuilder::new(&agg.metrics) + .with_category(MetricCategory::Timing) + .gauge("partial_agg_probe_ns_per_row", partition); + let probe_ratio_per_mille = MetricBuilder::new(&agg.metrics) + .with_category(MetricCategory::Rows) + .gauge("partial_agg_probe_ratio_per_mille", partition); Some(SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, - cost_ns_per_row_threshold, cost_min_ratio, skipped_aggregation_rows, baseline_metrics.elapsed_compute().clone(), + probe_ns_per_row, + probe_ratio_per_mille, )) } else { None @@ -1893,14 +1910,14 @@ mod tests { // ---------------- SkipAggregationProbe unit tests ---------------- - /// Build a probe with the given configuration. Skip-count metric is - /// disconnected (a fresh `Count`) and `elapsed_compute` is a fresh - /// `Time` that the test can drive with `add_duration`. + /// Build a probe with the given configuration. Skip-count metric and + /// diagnostic gauges are disconnected; `elapsed_compute` is exposed + /// so a test can drive it with `add_duration` to verify the + /// diagnostic gauge picks the value up. fn make_probe( probe_rows_threshold: usize, probe_ratio_threshold: f64, use_cost_model: bool, - cost_ns_per_row_threshold: u64, cost_min_ratio: f64, ) -> (SkipAggregationProbe, metrics::Time) { let elapsed_compute = metrics::Time::new(); @@ -1908,10 +1925,11 @@ mod tests { probe_rows_threshold, probe_ratio_threshold, use_cost_model, - cost_ns_per_row_threshold, cost_min_ratio, metrics::Count::new(), elapsed_compute.clone(), + metrics::Gauge::new(), + metrics::Gauge::new(), ); (probe, elapsed_compute) } @@ -1920,7 +1938,7 @@ mod tests { /// ratio check. #[test] fn skip_probe_cost_model_off_matches_legacy_ratio_check() { - let (mut probe, _t) = make_probe(100, 0.8, false, 1000, 0.3); + let (mut probe, _t) = make_probe(100, 0.8, false, 0.5); // 100 rows / 50 groups → ratio 0.5, below 0.8 → don't skip probe.update_state(100, 50); @@ -1934,62 +1952,72 @@ mod tests { assert!(probe.is_locked); } - /// With the cost model on, a high-cost slow query in the medium-ratio - /// band (0.3 ≤ ratio < 0.8) gets skipped — the ClickBench Q18 case. + /// With the cost model on, a medium-ratio query (Q18-shape: ratio + /// 0.56, below 0.8 but at or above `cost_min_ratio`) is skipped. #[test] - fn skip_probe_cost_model_fires_in_medium_ratio_high_cost_band() { - let (mut probe, elapsed) = make_probe(100, 0.8, true, 1000, 0.3); + fn skip_probe_cost_model_fires_in_medium_ratio_band() { + let (mut probe, _t) = make_probe(100, 0.8, true, 0.5); - // 200_000 ns over 100 rows → 2_000 ns/row → above the 1_000 - // threshold. Ratio 0.5 sits in the medium band. - elapsed.add_duration(Duration::from_nanos(200_000)); - probe.update_state(100, 50); + // Ratio 0.56 — Q18's measured value. Above the 0.5 threshold. + probe.update_state(100, 56); assert!(probe.should_skip()); assert!(probe.is_locked); } - /// Cost model on, ratio below `cost_min_ratio`: the partial agg is - /// genuinely reducing a lot (small / cheap output), so we keep it - /// regardless of per-row cost. + /// Cost model on, ratio below `cost_min_ratio`: keep partial agg — + /// it's reducing too much to be worth skipping. #[test] fn skip_probe_cost_model_does_not_fire_below_min_ratio() { - let (mut probe, elapsed) = make_probe(100, 0.8, true, 1000, 0.3); + let (mut probe, _t) = make_probe(100, 0.8, true, 0.5); - // Very expensive — 5_000 ns/row — but ratio is 0.2 (huge - // reduction), so partial agg is still net-positive. - elapsed.add_duration(Duration::from_nanos(500_000)); - probe.update_state(100, 20); - - assert!(!probe.should_skip()); - assert!(!probe.is_locked); - } - - /// Cost model on but per-row cost is below threshold: ratio alone - /// doesn't justify skipping → keep aggregating. - #[test] - fn skip_probe_cost_model_does_not_fire_when_cheap_per_row() { - let (mut probe, elapsed) = make_probe(100, 0.8, true, 1000, 0.3); - - // 50_000 ns over 100 rows → 500 ns/row → below the 1_000 ns - // threshold. Ratio 0.5 alone is below the fixed 0.8 cut-off. - elapsed.add_duration(Duration::from_nanos(50_000)); - probe.update_state(100, 50); + // Ratio 0.3 — well below the 0.5 threshold. + probe.update_state(100, 30); assert!(!probe.should_skip()); assert!(!probe.is_locked); } /// Cost model on still honours the original high-ratio rule: ratio - /// >= `probe_ratio_threshold` skips regardless of cost. + /// at or above `probe_ratio_threshold` skips, even though the + /// lower-ratio rule would also fire. #[test] fn skip_probe_cost_model_still_honours_high_ratio_rule() { - let (mut probe, _t) = make_probe(100, 0.8, true, 1000, 0.3); + let (mut probe, _t) = make_probe(100, 0.8, true, 0.5); - // Ratio 0.9, no measured cost → skip via the fixed rule. + // Ratio 0.9 → skip via the fixed rule (Rule 1). probe.update_state(100, 90); assert!(probe.should_skip()); assert!(probe.is_locked); } + + /// The diagnostic gauges record the measured per-row wall time and + /// the ratio (× 1000) on every probe evaluation, regardless of which + /// — if any — skip rule fires. + #[test] + fn skip_probe_records_diagnostic_gauges() { + let elapsed_compute = metrics::Time::new(); + let probe_ns_per_row = metrics::Gauge::new(); + let probe_ratio_per_mille = metrics::Gauge::new(); + let mut probe = SkipAggregationProbe::new( + 100, + 0.8, + false, + 0.5, + metrics::Count::new(), + elapsed_compute.clone(), + probe_ns_per_row.clone(), + probe_ratio_per_mille.clone(), + ); + + // 200_000 ns over 100 rows → 2_000 ns/row. Ratio 50/100 = 0.5 → + // 500 per-mille. No skip fires (legacy ratio check is off, no + // 0.8 reached), but the gauges still update. + elapsed_compute.add_duration(Duration::from_nanos(200_000)); + probe.update_state(100, 50); + + assert_eq!(probe_ns_per_row.value(), 2_000); + assert_eq!(probe_ratio_per_mille.value(), 500); + } } diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index a51ca0900a178..53adfba227c03 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -266,8 +266,7 @@ datafusion.execution.parquet.writer_version 1.0 datafusion.execution.perfect_hash_join_min_key_density 0.15 datafusion.execution.perfect_hash_join_small_build_threshold 1024 datafusion.execution.planning_concurrency 13 -datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.3 -datafusion.execution.skip_partial_aggregation_cost_ns_per_row 100 +datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.5 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 datafusion.execution.skip_partial_aggregation_use_cost_model true @@ -419,11 +418,10 @@ datafusion.execution.parquet.writer_version 1.0 (writing) Sets parquet writer ve datafusion.execution.perfect_hash_join_min_key_density 0.15 The minimum required density of join keys on the build side to consider a perfect hash join (see `HashJoinExec` for more details). Density is calculated as: `(number of rows) / (max_key - min_key + 1)`. A perfect hash join may be used if the actual key density > this value. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.perfect_hash_join_small_build_threshold 1024 A perfect hash join (see `HashJoinExec` for more details) will be considered if the range of keys (max - min) on the build side is < this threshold. This provides a fast path for joins with very small key ranges, bypassing the density check. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.planning_concurrency 13 Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system -datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.3 Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. -datafusion.execution.skip_partial_aggregation_cost_ns_per_row 100 Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. +datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.5 Effective ratio threshold used by the cost-aware skip rule. Skip partial aggregation when ratio at probe close is at least this value (provided `skip_partial_aggregation_use_cost_model` is true). Default 0.5 is the empirical sweet spot from ClickBench: 0.4 would skip too aggressively (regresses low-cost queries that benefit from partial), 0.7 fails to catch Q18-shape queries. datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode -datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. +datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. datafusion.execution.skip_physical_aggregate_schema_check false When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. datafusion.execution.soft_max_rows_per_output_file 50000000 Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max datafusion.execution.sort_in_place_threshold_bytes 1048576 When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 011bfffdc62d3..6b46b4ec73253 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -132,9 +132,8 @@ The following configuration settings are available: | datafusion.execution.keep_partition_by_columns | false | Should DataFusion keep the columns used for partition_by in the output RecordBatches | | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | -| datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, augment the fixed `skip_partial_aggregation_probe_ratio_threshold` check with a cost-aware rule: skip partial aggregation when the measured per-row wall time exceeds `skip_partial_aggregation_cost_ns_per_row` even at ratios below the fixed threshold (provided the ratio is at least `skip_partial_aggregation_cost_min_ratio`). Targets ClickBench Q18-shape queries where the ratio (~0.56) is below 0.8 but partial aggregation is net-negative due to heavy keys / aggs. | -| datafusion.execution.skip_partial_aggregation_cost_ns_per_row | 100 | Minimum measured per-row wall time (nanoseconds) of partial aggregation for the cost-aware skip rule to fire. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | -| datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.3 | Lower bound on aggregation ratio (num_groups / input_rows) for the cost-aware skip rule to fire. Below this ratio partial aggregation is considered valuable enough to keep regardless of per-row cost. Only consulted when `skip_partial_aggregation_use_cost_model` is true. | +| datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | +| datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.5 | Effective ratio threshold used by the cost-aware skip rule. Skip partial aggregation when ratio at probe close is at least this value (provided `skip_partial_aggregation_use_cost_model` is true). Default 0.5 is the empirical sweet spot from ClickBench: 0.4 would skip too aggressively (regresses low-cost queries that benefit from partial), 0.7 fails to catch Q18-shape queries. | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | | datafusion.execution.objectstore_writer_buffer_size | 10485760 | Size (bytes) of data buffer DataFusion uses when writing output files. This affects the size of the data chunks that are uploaded to remote object stores (e.g. AWS S3). If very large (>= 100 GiB) output files are being written, it may be necessary to increase this size to avoid errors from the remote end point. | From c087cb4f444523005a5888ab294d4fe6db9cd6b2 Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 19:43:09 +0800 Subject: [PATCH 05/14] revert benchmark-only default flip; default back to false MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The default-true was a temporary flip so the benchmarking bot would exercise the new code path. Revert before merge — opt-in stays the contract until the cost-aware variant lands. --- datafusion/common/src/config.rs | 2 +- datafusion/sqllogictest/test_files/information_schema.slt | 4 ++-- docs/source/user-guide/configs.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index bcd8f87a89bba..d36be8f29b8dd 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -664,7 +664,7 @@ config_namespace! { /// signal (`partial_agg_probe_ns_per_row` metric) can replace /// this static threshold is an open question — for now the /// metric is reported alongside so callers can evaluate. - pub skip_partial_aggregation_use_cost_model: bool, default = true + pub skip_partial_aggregation_use_cost_model: bool, default = false /// Effective ratio threshold used by the cost-aware skip rule. /// Skip partial aggregation when ratio at probe close is at diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index 53adfba227c03..2ca0196089dab 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -269,7 +269,7 @@ datafusion.execution.planning_concurrency 13 datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.5 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 -datafusion.execution.skip_partial_aggregation_use_cost_model true +datafusion.execution.skip_partial_aggregation_use_cost_model false datafusion.execution.skip_physical_aggregate_schema_check false datafusion.execution.soft_max_rows_per_output_file 50000000 datafusion.execution.sort_in_place_threshold_bytes 1048576 @@ -421,7 +421,7 @@ datafusion.execution.planning_concurrency 13 Fan-out during initial physical pla datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.5 Effective ratio threshold used by the cost-aware skip rule. Skip partial aggregation when ratio at probe close is at least this value (provided `skip_partial_aggregation_use_cost_model` is true). Default 0.5 is the empirical sweet spot from ClickBench: 0.4 would skip too aggressively (regresses low-cost queries that benefit from partial), 0.7 fails to catch Q18-shape queries. datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode -datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. +datafusion.execution.skip_partial_aggregation_use_cost_model false (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. datafusion.execution.skip_physical_aggregate_schema_check false When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. datafusion.execution.soft_max_rows_per_output_file 50000000 Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max datafusion.execution.sort_in_place_threshold_bytes 1048576 When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 6b46b4ec73253..fc95550ff94b3 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -132,7 +132,7 @@ The following configuration settings are available: | datafusion.execution.keep_partition_by_columns | false | Should DataFusion keep the columns used for partition_by in the output RecordBatches | | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | -| datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | +| datafusion.execution.skip_partial_aggregation_use_cost_model | false | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | | datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.5 | Effective ratio threshold used by the cost-aware skip rule. Skip partial aggregation when ratio at probe close is at least this value (provided `skip_partial_aggregation_use_cost_model` is true). Default 0.5 is the empirical sweet spot from ClickBench: 0.4 would skip too aggressively (regresses low-cost queries that benefit from partial), 0.7 fails to catch Q18-shape queries. | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | From 3cbcdfa0c9f6677b73b595d53c320b6508d6645b Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 20:31:28 +0800 Subject: [PATCH 06/14] feat: A/B sampling for cost-aware partial-agg skip decision MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the fixed lower-ratio rule with measurement-driven A/B sampling. After the initial partial-probe window closes (100k rows by default), the operator routes the next 10k rows through the passthrough path to measure `passthrough_ns/row`, then compares it against the previously measured `partial_ns/row` via a closed-form cost crossover: cost_keep_partial = partial_ns × N + final_ns × N × ratio cost_skip = passthrough_ns × N + final_ns × N assuming final_ns ≈ partial_ns (similar hash-table mechanics): skip wins ⇔ ratio > passthrough_ns / partial_ns The crossover is set entirely by the two measured numbers — no magic constant, no hardcoded ratio. Rule 1 (ratio >= 0.8) still short-circuits before A/B, preserving the legacy cheap path. State machine extensions: - New `ExecutionState::AbSampling` mirrors `SkippingAggregation` (input → `transform_to_states` → output) but *keeps the partial hash table* — if A/B decides to keep partial, the stream reverts to `ReadingInput` and the hash table continues accumulating. - `ProbePhase` enum (Partial / AbSampling / Locked) inside `SkipAggregationProbe` drives the transitions. Diagnostic gauges exposed via EXPLAIN ANALYZE: - `partial_agg_probe_partial_ns_per_row` — measured at probe close - `partial_agg_probe_passthrough_ns_per_row` — measured at A/B close - `partial_agg_probe_ratio_per_mille` — ratio × 1000 - `partial_agg_probe_cost_decision_skip` — 1 if cost said skip, 0 if keep Config: - `skip_partial_aggregation_use_cost_model` (bool, default false) — opt-in switch. With it off, behaviour is exactly the legacy bare ratio check. - `skip_partial_aggregation_ab_sampling_rows` (usize, default 10_000) — size of the A/B sampling window. - Drops `skip_partial_aggregation_cost_min_ratio` and `skip_partial_aggregation_cost_ns_per_row` from the previous iteration of this PR — they were magic-constant gates that the cost-aware formula obsoletes. 7 `SkipAggregationProbe` unit tests cover: - cost-model-off matches legacy ratio check - cost-model-on short-circuits on Rule 1 (no A/B needed) - A/B sampling entry transition - cost decision chooses skip when partial expensive - cost decision chooses keep when passthrough not much cheaper - A/B window accumulates across multiple batches - diagnostic gauges record at every transition Existing 100 aggregate tests + 10 aggregate SLT files still pass. --- datafusion/common/src/config.rs | 18 +- .../physical-plan/src/aggregates/row_hash.rs | 620 ++++++++++++------ .../test_files/information_schema.slt | 4 +- docs/source/user-guide/configs.md | 2 +- 4 files changed, 445 insertions(+), 199 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index d36be8f29b8dd..231ae968400fb 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -666,16 +666,14 @@ config_namespace! { /// metric is reported alongside so callers can evaluate. pub skip_partial_aggregation_use_cost_model: bool, default = false - /// Effective ratio threshold used by the cost-aware skip rule. - /// Skip partial aggregation when ratio at probe close is at - /// least this value (provided - /// `skip_partial_aggregation_use_cost_model` is true). - /// - /// Default 0.5 is the empirical sweet spot from ClickBench: 0.4 - /// would skip too aggressively (regresses low-cost queries - /// that benefit from partial), 0.7 fails to catch Q18-shape - /// queries. - pub skip_partial_aggregation_cost_min_ratio: f64, default = 0.5 + /// Number of input rows used in the A/B sampling window after the + /// initial partial probe completes. During this window the operator + /// routes input through the passthrough (`transform_to_states`) + /// path so the probe can measure `passthrough_ns/row` and compare + /// it against the previously measured `partial_ns/row`. Default + /// 10000 — large enough to amortise per-row noise, small enough to + /// be cheap if the decision turns out to be "keep partial". + pub skip_partial_aggregation_ab_sampling_rows: usize, default = 10_000 /// Should DataFusion use row number estimates at the input to decide /// whether increasing parallelism is beneficial or not. By default, diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index 041415cf1050d..cbb6c86be0cee 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -71,6 +71,17 @@ pub(crate) enum ExecutionState { /// /// See "partial aggregation" discussion on [`GroupedHashAggregateStream`] SkippingAggregation, + /// Temporary A/B sampling phase used by [`SkipAggregationProbe`] to + /// measure passthrough cost on real input before deciding whether + /// to skip partial aggregation for the rest of the stream. + /// + /// Behaves like [`Self::SkippingAggregation`] (rows are converted + /// via `transform_to_states` and emitted downstream), but the hash + /// table built during the preceding partial-probe window is *not* + /// emitted yet — if the probe ultimately decides to keep partial + /// agg, the stream transitions back to [`Self::ReadingInput`] and + /// the hash table continues to accumulate. + AbSampling, /// All input has been consumed and all groups have been emitted Done, } @@ -118,79 +129,90 @@ struct SpillState { // Metrics related to spilling are managed inside `spill_manager` } -/// Tracks if the aggregate should skip partial aggregations +/// Three phases of the cost-aware skip decision. +/// +/// 1. `Partial` — accumulate input through the hash table (normal +/// partial-agg path), measuring `partial_ns/row` and the +/// `num_groups/input_rows` ratio over the first +/// `probe_rows_threshold` rows. +/// 2. `AbSampling` — route the next `ab_sampling_rows` of input through +/// the passthrough path (`transform_to_states`) to measure +/// `passthrough_ns/row`. The hash table built so far is kept; +/// nothing is emitted yet. +/// 3. `Locked { should_skip }` — final decision. Skip when +/// `ratio > passthrough_ns/row / partial_ns/row` (the cost-aware +/// crossover); otherwise revert to partial agg for the rest of the +/// stream. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +enum ProbePhase { + Partial, + AbSampling, + Locked { should_skip: bool }, +} + +/// Tracks if the aggregate should skip partial aggregations. /// -/// See "partial aggregation" discussion on [`GroupedHashAggregateStream`] +/// See "partial aggregation" discussion on [`GroupedHashAggregateStream`]. +/// +/// The probe runs a short A/B sampling window that measures both the +/// partial-agg per-row cost and the passthrough per-row cost on real +/// input, then makes a cost-based skip decision without relying on a +/// hardcoded ratio cutoff. `use_cost_model = false` falls back to the +/// original behaviour: a single bare ratio check at probe close. struct SkipAggregationProbe { // ======================================================================== - // PROPERTIES: - // These fields are initialized at the start and remain constant throughout - // the execution. + // PROPERTIES (immutable for the stream's lifetime) // ======================================================================== - /// Aggregation ratio check performed when the number of input rows exceeds - /// this threshold (from `SessionConfig`) probe_rows_threshold: usize, - /// Maximum ratio of `num_groups` to `input_rows` for continuing aggregation - /// (from `SessionConfig`). If the ratio exceeds this value, aggregation - /// is skipped and input rows are directly converted to output probe_ratio_threshold: f64, - /// (experimental) When true, apply a secondary skip rule using - /// `cost_min_ratio` as a lower effective ratio threshold. Disabled - /// by default to preserve the bare-ratio behaviour. use_cost_model: bool, - /// Effective ratio threshold for the cost-aware skip rule. - cost_min_ratio: f64, + ab_sampling_rows: usize, // ======================================================================== - // STATES: - // Fields changes during execution. Can be buffer, or state flags that - // influence the execution in parent `GroupedHashAggregateStream` + // STATE // ======================================================================== - /// Number of processed input rows (updated during probing) + phase: ProbePhase, + /// Rows processed in the `Partial` phase. input_rows: usize, - /// Number of total group values for `input_rows` (updated during probing) + /// Latest `group_values.len()` reported in the `Partial` phase. num_groups: usize, - - /// Flag indicating further data aggregation may be skipped (decision made - /// when probing complete) + /// Rows processed in the `AbSampling` phase. + ab_rows: usize, + /// `elapsed_compute.value()` snapshot at probe construction. + elapsed_compute_at_probe_start: usize, + /// `elapsed_compute.value()` snapshot at the `Partial`→`AbSampling` + /// transition. The partial-window wall time is + /// `elapsed_compute_at_ab_start - elapsed_compute_at_probe_start`. + elapsed_compute_at_ab_start: Option, + /// `should_skip` and `is_locked` derived from `phase`; kept as + /// dedicated fields so the rest of the operator code can stay + /// oblivious to the phase enum. should_skip: bool, - /// Flag indicating further updates of `SkipAggregationProbe` state won't - /// make any effect (set either while probing or on probing completion) is_locked: bool, // ======================================================================== - // METRICS: + // METRICS / SOURCES // ======================================================================== - /// Number of rows where state was output without aggregation. - /// - /// * If 0, all input rows were aggregated (should_skip was always false) - /// - /// * if greater than zero, the number of rows which were output directly - /// without aggregation skipped_aggregation_rows: metrics::Count, - /// Operator-wide `elapsed_compute` metric. Cloned from - /// `BaselineMetrics` so the probe can read the accumulated nanoseconds - /// at decision time and infer the wall time spent on partial - /// aggregation so far. + /// Operator-wide `elapsed_compute`; the probe reads `value()` at + /// phase transitions to derive per-row costs. elapsed_compute: metrics::Time, - /// `elapsed_compute.value()` snapshot at probe construction. Diff - /// against the live value gives wall time elapsed since the stream - /// started — a close approximation of partial aggregation cost, - /// since `elapsed_compute` covers exactly the timed work blocks in - /// the partial-mode loop. - elapsed_compute_at_probe_start: usize, - /// Diagnostic gauge: measured wall-time per input row (ns) at the - /// most recent probe evaluation. `set` once per `update_state` call - /// after `probe_rows_threshold` is met, so EXPLAIN ANALYZE reflects - /// the last evaluation's reading regardless of whether a skip rule - /// fired. Provides the empirical signal used to tune - /// `cost_min_ratio` / future cost-based rules. - probe_ns_per_row: metrics::Gauge, - /// Diagnostic gauge: aggregation ratio (`num_groups / input_rows`) - /// at the most recent probe evaluation, scaled by 1000 (so 0.565 - /// → 565). `usize` gauges store integers, so the scale lets us - /// preserve three decimal digits without floating point. + /// Diagnostic gauge: measured partial-agg wall time per input row + /// (ns) at the end of the `Partial` phase. Always reported when + /// the partial probe completes, regardless of subsequent decision. + probe_partial_ns_per_row: metrics::Gauge, + /// Diagnostic gauge: measured passthrough wall time per input row + /// (ns) at the end of the `AbSampling` phase. Reported only when + /// A/B sampling actually runs (cost model enabled and Rule 1 didn't + /// already fire). + probe_passthrough_ns_per_row: metrics::Gauge, + /// Diagnostic gauge: aggregation ratio at the partial probe close, + /// scaled by 1000 (`usize` gauge → integer storage). probe_ratio_per_mille: metrics::Gauge, + /// Diagnostic gauge: 1 if the cost-aware decision chose to skip, + /// 0 otherwise. Distinguishes the cost-aware skip from the fixed + /// Rule 1 skip and from "decision was to keep partial". + probe_cost_decision_skip: metrics::Gauge, } impl SkipAggregationProbe { @@ -199,85 +221,155 @@ impl SkipAggregationProbe { probe_rows_threshold: usize, probe_ratio_threshold: f64, use_cost_model: bool, - cost_min_ratio: f64, + ab_sampling_rows: usize, skipped_aggregation_rows: metrics::Count, elapsed_compute: metrics::Time, - probe_ns_per_row: metrics::Gauge, + probe_partial_ns_per_row: metrics::Gauge, + probe_passthrough_ns_per_row: metrics::Gauge, probe_ratio_per_mille: metrics::Gauge, + probe_cost_decision_skip: metrics::Gauge, ) -> Self { let elapsed_compute_at_probe_start = elapsed_compute.value(); Self { - input_rows: 0, - num_groups: 0, probe_rows_threshold, probe_ratio_threshold, use_cost_model, - cost_min_ratio, + ab_sampling_rows, + phase: ProbePhase::Partial, + input_rows: 0, + num_groups: 0, + ab_rows: 0, + elapsed_compute_at_probe_start, + elapsed_compute_at_ab_start: None, should_skip: false, is_locked: false, skipped_aggregation_rows, elapsed_compute, - elapsed_compute_at_probe_start, - probe_ns_per_row, + probe_partial_ns_per_row, + probe_passthrough_ns_per_row, probe_ratio_per_mille, + probe_cost_decision_skip, } } - /// Updates `SkipAggregationProbe` state: - /// - increments the number of input rows - /// - replaces the number of groups with the new value - /// - on `probe_rows_threshold` exceeded applies the skip-decision rules - /// - records the diagnostic gauges - /// - /// Two skip rules are evaluated, in order: - /// - /// 1. **Fixed ratio**: `num_groups / input_rows >= probe_ratio_threshold` - /// — the original behaviour, preserved for backwards compatibility. - /// 2. **Lower-ratio** (opt-in via `use_cost_model`): skip when the ratio - /// is at least `cost_min_ratio` (default 0.5). Targets the medium - /// band (0.5–0.8) where partial aggregation is often net-negative - /// despite the ratio being below the fixed threshold. The per-row - /// cost gauge is reported alongside via `probe_ns_per_row` for - /// diagnosing whether a cost-based replacement of this static - /// threshold is feasible. - fn update_state(&mut self, input_rows: usize, num_groups: usize) { - if self.is_locked { + /// Called from the partial-agg path after each input batch. Tracks + /// total rows / group count and, when `probe_rows_threshold` is + /// reached, drives the phase transition. + fn observe_partial_batch(&mut self, input_rows: usize, num_groups: usize) { + if self.phase != ProbePhase::Partial { return; } self.input_rows += input_rows; self.num_groups = num_groups; - if self.input_rows >= self.probe_rows_threshold { - let ratio = self.num_groups as f64 / self.input_rows as f64; - - // Diagnostic gauges: always recorded at probe evaluation, - // independent of which (if any) skip rule fires. - let elapsed_ns = self - .elapsed_compute - .value() - .saturating_sub(self.elapsed_compute_at_probe_start); - let ns_per_row = (elapsed_ns as u64) - .checked_div(self.input_rows as u64) - .unwrap_or(0); - self.probe_ns_per_row.set(ns_per_row as usize); - self.probe_ratio_per_mille.set((ratio * 1000.0) as usize); - - // Rule 1: fixed-ratio skip (existing behaviour). - if ratio >= self.probe_ratio_threshold { - self.should_skip = true; - self.is_locked = true; - return; - } + if self.input_rows < self.probe_rows_threshold { + return; + } - // Rule 2 (opt-in): lower-ratio skip. - if self.use_cost_model && ratio >= self.cost_min_ratio { - self.should_skip = true; - self.is_locked = true; - } - // No rule fired — leave `is_locked` false so the next batch - // can re-evaluate. + let ratio = self.num_groups as f64 / self.input_rows as f64; + let partial_ns = self + .elapsed_compute + .value() + .saturating_sub(self.elapsed_compute_at_probe_start); + let partial_ns_per_row = (partial_ns as u64) + .checked_div(self.input_rows as u64) + .unwrap_or(0) as usize; + self.probe_partial_ns_per_row.set(partial_ns_per_row); + self.probe_ratio_per_mille.set((ratio * 1000.0) as usize); + + // Rule 1 (fixed): high ratio — short-circuit straight to skip. + if ratio >= self.probe_ratio_threshold { + self.commit_skip(); + return; + } + + if !self.use_cost_model { + // Legacy behaviour: leave `is_locked = false`, allow + // re-evaluation on subsequent batches. + return; + } + + // Enter A/B sampling — route subsequent input through passthrough + // until `ab_sampling_rows` have been observed, at which point + // `finalize_ab_decision` runs the cost-based comparison. + self.elapsed_compute_at_ab_start = Some(self.elapsed_compute.value()); + self.phase = ProbePhase::AbSampling; + } + + /// True iff the main loop should route the next input batch through + /// the passthrough (`transform_to_states`) path to feed the A/B + /// measurement instead of through the hash-table partial-agg path. + fn wants_passthrough_sample(&self) -> bool { + matches!(self.phase, ProbePhase::AbSampling) + } + + /// Called after a passthrough batch has been processed during the + /// A/B sampling phase. Counts rows toward the sample window and, + /// when the window is full, triggers the cost-aware decision. + fn observe_ab_batch(&mut self, input_rows: usize) { + if self.phase != ProbePhase::AbSampling { + return; + } + self.ab_rows += input_rows; + if self.ab_rows >= self.ab_sampling_rows { + self.finalize_ab_decision(); + } + } + + /// Apply the cost-aware decision after the A/B sampling window + /// completes. + /// + /// Cost model (assuming `final_ns/row ≈ partial_ns/row`): + /// + /// ```text + /// cost_keep_partial = partial_ns × N + final_ns × N × ratio + /// cost_skip = passthrough_ns × N + final_ns × N + /// + /// skip is cheaper ⇔ ratio > passthrough_ns / partial_ns + /// ``` + /// + /// The crossover is set entirely by the measured ratio of passthrough + /// to partial cost on this particular query / hardware — no magic + /// constants. If either measurement is zero (extremely fast or + /// degenerate input) we default to keeping partial. + fn finalize_ab_decision(&mut self) { + let ab_start = self + .elapsed_compute_at_ab_start + .expect("A/B start snapshot must be set when entering AbSampling"); + let ab_ns = self.elapsed_compute.value().saturating_sub(ab_start); + let passthrough_ns_per_row = + (ab_ns as u64).checked_div(self.ab_rows as u64).unwrap_or(0) as usize; + self.probe_passthrough_ns_per_row + .set(passthrough_ns_per_row); + + let partial_ns_per_row = self.probe_partial_ns_per_row.value(); + let ratio = self.num_groups as f64 / self.input_rows as f64; + + let should_skip = if partial_ns_per_row == 0 || passthrough_ns_per_row == 0 { + false + } else { + ratio > (passthrough_ns_per_row as f64 / partial_ns_per_row as f64) + }; + + if should_skip { + self.probe_cost_decision_skip.set(1); + self.commit_skip(); + } else { + self.probe_cost_decision_skip.set(0); + self.phase = ProbePhase::Locked { should_skip: false }; + self.is_locked = true; } } + /// Transition to the terminal `Locked { should_skip: true }` state. + /// Used by both Rule 1 (fixed-ratio short-circuit) and the cost-aware + /// path so the rest of the operator can rely on a single + /// `should_skip` flag. + fn commit_skip(&mut self) { + self.should_skip = true; + self.is_locked = true; + self.phase = ProbePhase::Locked { should_skip: true }; + } + fn should_skip(&self) -> bool { self.should_skip } @@ -721,25 +813,33 @@ impl GroupedHashAggregateStream { let probe_ratio_threshold = options.skip_partial_aggregation_probe_ratio_threshold; let use_cost_model = options.skip_partial_aggregation_use_cost_model; - let cost_min_ratio = options.skip_partial_aggregation_cost_min_ratio; + let ab_sampling_rows = options.skip_partial_aggregation_ab_sampling_rows; let skipped_aggregation_rows = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .counter("skipped_aggregation_rows", partition); - let probe_ns_per_row = MetricBuilder::new(&agg.metrics) + let probe_partial_ns_per_row = MetricBuilder::new(&agg.metrics) + .with_category(MetricCategory::Timing) + .gauge("partial_agg_probe_partial_ns_per_row", partition); + let probe_passthrough_ns_per_row = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Timing) - .gauge("partial_agg_probe_ns_per_row", partition); + .gauge("partial_agg_probe_passthrough_ns_per_row", partition); let probe_ratio_per_mille = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .gauge("partial_agg_probe_ratio_per_mille", partition); + let probe_cost_decision_skip = MetricBuilder::new(&agg.metrics) + .with_category(MetricCategory::Rows) + .gauge("partial_agg_probe_cost_decision_skip", partition); Some(SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, - cost_min_ratio, + ab_sampling_rows, skipped_aggregation_rows, baseline_metrics.elapsed_compute().clone(), - probe_ns_per_row, + probe_partial_ns_per_row, + probe_passthrough_ns_per_row, probe_ratio_per_mille, + probe_cost_decision_skip, )) } else { None @@ -875,6 +975,16 @@ impl Stream for GroupedHashAggregateStream { self.exec_state = new_state; break 'reading_input; } + // Probe may have transitioned into the + // A/B sampling window. Route subsequent + // batches through the passthrough path + // so the probe can measure + // `passthrough_ns/row`. + if self.probe_wants_passthrough_sample() { + timer.done(); + self.exec_state = ExecutionState::AbSampling; + break 'reading_input; + } } // If we reach this point, try to update the memory reservation @@ -940,6 +1050,62 @@ impl Stream for GroupedHashAggregateStream { } } + ExecutionState::AbSampling => { + // Mirror of `SkippingAggregation` — passthrough via + // `transform_to_states` — except that: + // * the partial hash table is NOT emitted (we may + // still revert to it), + // * the probe observes per-row timing via + // `elapsed_compute`, + // * after each batch we check whether the probe + // has finalised: skip (emit hash + switch to + // `SkippingAggregation`) or keep partial + // (return to `ReadingInput`). + match ready!(self.input.poll_next_unpin(cx)) { + Some(Ok(batch)) => { + let _timer = elapsed_compute.timer(); + let input_rows = batch.num_rows(); + let states = self.transform_to_states(&batch)?; + if let Some(probe) = self.skip_aggregation_probe.as_mut() { + probe.observe_ab_batch(input_rows); + } + // After observing, the probe may have + // transitioned out of `AbSampling`. + if self.should_skip_aggregation() { + // Cost model chose skip — emit the + // partial hash table accumulated during + // the probe window, then continue in + // `SkippingAggregation`. + if let Some(emitted) = self.emit(EmitTo::All, false)? { + self.exec_state = + ExecutionState::ProducingOutput(emitted); + } else { + self.exec_state = ExecutionState::SkippingAggregation; + } + } else if let Some(probe) = + self.skip_aggregation_probe.as_ref() + && !probe.wants_passthrough_sample() + && probe.is_locked + { + // Cost model chose keep — fall back to + // the partial-agg path for the rest of + // the stream. + self.exec_state = ExecutionState::ReadingInput; + } + return Poll::Ready(Some(Ok( + states.record_output(&self.baseline_metrics) + ))); + } + Some(Err(e)) => return Poll::Ready(Some(Err(e))), + None => { + // Input ended mid-sampling. Commit whatever + // hash state we have via the normal + // end-of-input path. + self.set_input_done_and_produce_output()?; + } + } + } + ExecutionState::ProducingOutput(batch) => { // slice off a part of the batch, if needed let output_batch; @@ -1497,10 +1663,19 @@ impl GroupedHashAggregateStream { // Skip aggregation probe is not supported if stream has any spills, // currently spilling is not supported for Partial aggregation assert!(self.spill_state.spills.is_empty()); - probe.update_state(input_rows, self.group_values.len()); + probe.observe_partial_batch(input_rows, self.group_values.len()); }; } + /// True iff the probe wants the next input batch routed through the + /// passthrough path for A/B sampling. Checked in the main loop + /// before the partial-agg hash insert. + fn probe_wants_passthrough_sample(&self) -> bool { + self.skip_aggregation_probe + .as_ref() + .is_some_and(|p| p.wants_passthrough_sample()) + } + /// In case the probe indicates that aggregation may be /// skipped, forces stream to produce currently accumulated output. /// @@ -1910,114 +2085,187 @@ mod tests { // ---------------- SkipAggregationProbe unit tests ---------------- - /// Build a probe with the given configuration. Skip-count metric and - /// diagnostic gauges are disconnected; `elapsed_compute` is exposed - /// so a test can drive it with `add_duration` to verify the - /// diagnostic gauge picks the value up. - fn make_probe( + /// Test rig that exposes the probe together with the `Time` source + /// it reads, so tests can drive measured per-row cost deterministically. + struct ProbeRig { + probe: SkipAggregationProbe, + elapsed: metrics::Time, + partial_ns_gauge: metrics::Gauge, + passthrough_ns_gauge: metrics::Gauge, + ratio_gauge: metrics::Gauge, + cost_decision_gauge: metrics::Gauge, + } + + fn rig( probe_rows_threshold: usize, probe_ratio_threshold: f64, use_cost_model: bool, - cost_min_ratio: f64, - ) -> (SkipAggregationProbe, metrics::Time) { - let elapsed_compute = metrics::Time::new(); + ab_sampling_rows: usize, + ) -> ProbeRig { + let elapsed = metrics::Time::new(); + let partial_ns_gauge = metrics::Gauge::new(); + let passthrough_ns_gauge = metrics::Gauge::new(); + let ratio_gauge = metrics::Gauge::new(); + let cost_decision_gauge = metrics::Gauge::new(); let probe = SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, - cost_min_ratio, + ab_sampling_rows, metrics::Count::new(), - elapsed_compute.clone(), - metrics::Gauge::new(), - metrics::Gauge::new(), + elapsed.clone(), + partial_ns_gauge.clone(), + passthrough_ns_gauge.clone(), + ratio_gauge.clone(), + cost_decision_gauge.clone(), ); - (probe, elapsed_compute) + ProbeRig { + probe, + elapsed, + partial_ns_gauge, + passthrough_ns_gauge, + ratio_gauge, + cost_decision_gauge, + } } - /// With the cost model off, the behaviour is exactly the legacy bare - /// ratio check. + /// With the cost model off the probe behaves like the original + /// bare-ratio check: skip only when the ratio crosses + /// `probe_ratio_threshold`. #[test] fn skip_probe_cost_model_off_matches_legacy_ratio_check() { - let (mut probe, _t) = make_probe(100, 0.8, false, 0.5); + let mut r = rig(100, 0.8, false, 10_000); // 100 rows / 50 groups → ratio 0.5, below 0.8 → don't skip - probe.update_state(100, 50); - assert!(!probe.should_skip()); - assert!(!probe.is_locked); - - // Next window: another 100 rows, total 200 / 170 groups → ratio - // 0.85, above 0.8 → skip. - probe.update_state(100, 170); - assert!(probe.should_skip()); - assert!(probe.is_locked); + r.probe.observe_partial_batch(100, 50); + assert!(!r.probe.should_skip()); + assert!(!r.probe.is_locked); + assert!(!r.probe.wants_passthrough_sample()); + + // Next batch: total 200 / 170 groups → ratio 0.85, above 0.8 → skip + r.probe.observe_partial_batch(100, 170); + assert!(r.probe.should_skip()); + assert!(r.probe.is_locked); } - /// With the cost model on, a medium-ratio query (Q18-shape: ratio - /// 0.56, below 0.8 but at or above `cost_min_ratio`) is skipped. + /// Rule 1 (fixed-ratio) fires before A/B sampling even when the + /// cost model is on — short-circuit, no passthrough sampling needed. #[test] - fn skip_probe_cost_model_fires_in_medium_ratio_band() { - let (mut probe, _t) = make_probe(100, 0.8, true, 0.5); + fn skip_probe_cost_model_short_circuits_on_high_ratio() { + let mut r = rig(100, 0.8, true, 10_000); - // Ratio 0.56 — Q18's measured value. Above the 0.5 threshold. - probe.update_state(100, 56); + r.probe.observe_partial_batch(100, 90); // ratio 0.9 - assert!(probe.should_skip()); - assert!(probe.is_locked); + assert!(r.probe.should_skip()); + assert!(r.probe.is_locked); + // No A/B sampling happened. + assert!(!r.probe.wants_passthrough_sample()); + assert_eq!(r.passthrough_ns_gauge.value(), 0); } - /// Cost model on, ratio below `cost_min_ratio`: keep partial agg — - /// it's reducing too much to be worth skipping. + /// Below `probe_ratio_threshold`, the probe transitions into the + /// A/B sampling phase and requests passthrough routing from the + /// main loop. #[test] - fn skip_probe_cost_model_does_not_fire_below_min_ratio() { - let (mut probe, _t) = make_probe(100, 0.8, true, 0.5); - - // Ratio 0.3 — well below the 0.5 threshold. - probe.update_state(100, 30); + fn skip_probe_enters_ab_sampling_when_partial_window_closes() { + let mut r = rig(100, 0.8, true, 10_000); + + // Simulate 100k ns of partial work over 100 rows → 1000 ns/row. + r.elapsed.add_duration(Duration::from_nanos(100_000)); + r.probe.observe_partial_batch(100, 60); // ratio 0.6 + + assert!(!r.probe.should_skip()); + assert!(!r.probe.is_locked); + assert!(r.probe.wants_passthrough_sample()); + assert_eq!(r.partial_ns_gauge.value(), 1_000); + assert_eq!(r.ratio_gauge.value(), 600); + } - assert!(!probe.should_skip()); - assert!(!probe.is_locked); + /// Cost-aware skip: ratio is greater than `passthrough/partial`, so + /// the cost model picks skip. + /// + /// Setup: partial = 100 ns/row, passthrough = 50 ns/row, ratio = 0.6. + /// Crossover = 50/100 = 0.5. 0.6 > 0.5 ⇒ skip. + #[test] + fn skip_probe_cost_decision_chooses_skip_when_partial_is_expensive() { + let mut r = rig(100, 0.8, true, 100); + + // Partial window: 10_000 ns over 100 rows → 100 ns/row. + r.elapsed.add_duration(Duration::from_nanos(10_000)); + r.probe.observe_partial_batch(100, 60); // ratio 0.6 + assert!(r.probe.wants_passthrough_sample()); + + // A/B window: 5_000 ns over 100 rows → 50 ns/row. + r.elapsed.add_duration(Duration::from_nanos(5_000)); + r.probe.observe_ab_batch(100); + + assert!(r.probe.should_skip()); + assert!(r.probe.is_locked); + assert!(!r.probe.wants_passthrough_sample()); + assert_eq!(r.passthrough_ns_gauge.value(), 50); + assert_eq!(r.cost_decision_gauge.value(), 1); } - /// Cost model on still honours the original high-ratio rule: ratio - /// at or above `probe_ratio_threshold` skips, even though the - /// lower-ratio rule would also fire. + /// Cost-aware keep: ratio is below `passthrough/partial`, so the + /// cost model picks keep-partial (revert). + /// + /// Setup: partial = 100 ns/row, passthrough = 80 ns/row, ratio = 0.6. + /// Crossover = 80/100 = 0.8. 0.6 < 0.8 ⇒ keep. #[test] - fn skip_probe_cost_model_still_honours_high_ratio_rule() { - let (mut probe, _t) = make_probe(100, 0.8, true, 0.5); + fn skip_probe_cost_decision_chooses_keep_when_passthrough_not_much_cheaper() { + let mut r = rig(100, 0.8, true, 100); - // Ratio 0.9 → skip via the fixed rule (Rule 1). - probe.update_state(100, 90); + r.elapsed.add_duration(Duration::from_nanos(10_000)); + r.probe.observe_partial_batch(100, 60); // ratio 0.6 + assert!(r.probe.wants_passthrough_sample()); - assert!(probe.should_skip()); - assert!(probe.is_locked); + r.elapsed.add_duration(Duration::from_nanos(8_000)); + r.probe.observe_ab_batch(100); + + assert!(!r.probe.should_skip()); + assert!(r.probe.is_locked); + assert!(!r.probe.wants_passthrough_sample()); + assert_eq!(r.passthrough_ns_gauge.value(), 80); + assert_eq!(r.cost_decision_gauge.value(), 0); + } + + /// A/B sampling needs *enough* rows in the window before it + /// finalises. A short partial batch during sampling shouldn't + /// trigger the decision early. + #[test] + fn skip_probe_ab_window_accumulates_across_batches() { + let mut r = rig(100, 0.8, true, 1000); + + r.elapsed.add_duration(Duration::from_nanos(10_000)); + r.probe.observe_partial_batch(100, 60); + assert!(r.probe.wants_passthrough_sample()); + + // 500 rows of A/B — below the 1000 row target. + r.elapsed.add_duration(Duration::from_nanos(25_000)); + r.probe.observe_ab_batch(500); + assert!(r.probe.wants_passthrough_sample()); + assert!(!r.probe.is_locked); + + // Another 500 rows — total 1000, decision fires. + r.elapsed.add_duration(Duration::from_nanos(25_000)); + r.probe.observe_ab_batch(500); + assert!(!r.probe.wants_passthrough_sample()); + assert!(r.probe.is_locked); } - /// The diagnostic gauges record the measured per-row wall time and - /// the ratio (× 1000) on every probe evaluation, regardless of which - /// — if any — skip rule fires. + /// Diagnostic gauges record the per-row measurements at every + /// observable transition, independent of which decision fires. #[test] fn skip_probe_records_diagnostic_gauges() { - let elapsed_compute = metrics::Time::new(); - let probe_ns_per_row = metrics::Gauge::new(); - let probe_ratio_per_mille = metrics::Gauge::new(); - let mut probe = SkipAggregationProbe::new( - 100, - 0.8, - false, - 0.5, - metrics::Count::new(), - elapsed_compute.clone(), - probe_ns_per_row.clone(), - probe_ratio_per_mille.clone(), - ); + let mut r = rig(100, 0.8, false, 10_000); // 200_000 ns over 100 rows → 2_000 ns/row. Ratio 50/100 = 0.5 → - // 500 per-mille. No skip fires (legacy ratio check is off, no - // 0.8 reached), but the gauges still update. - elapsed_compute.add_duration(Duration::from_nanos(200_000)); - probe.update_state(100, 50); + // 500 per-mille. No skip fires (legacy ratio check, below 0.8), + // but partial_ns_per_row + ratio gauges still update. + r.elapsed.add_duration(Duration::from_nanos(200_000)); + r.probe.observe_partial_batch(100, 50); - assert_eq!(probe_ns_per_row.value(), 2_000); - assert_eq!(probe_ratio_per_mille.value(), 500); + assert_eq!(r.partial_ns_gauge.value(), 2_000); + assert_eq!(r.ratio_gauge.value(), 500); } } diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index 2ca0196089dab..34e5a94a53c1a 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -266,7 +266,7 @@ datafusion.execution.parquet.writer_version 1.0 datafusion.execution.perfect_hash_join_min_key_density 0.15 datafusion.execution.perfect_hash_join_small_build_threshold 1024 datafusion.execution.planning_concurrency 13 -datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.5 +datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 datafusion.execution.skip_partial_aggregation_use_cost_model false @@ -418,7 +418,7 @@ datafusion.execution.parquet.writer_version 1.0 (writing) Sets parquet writer ve datafusion.execution.perfect_hash_join_min_key_density 0.15 The minimum required density of join keys on the build side to consider a perfect hash join (see `HashJoinExec` for more details). Density is calculated as: `(number of rows) / (max_key - min_key + 1)`. A perfect hash join may be used if the actual key density > this value. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.perfect_hash_join_small_build_threshold 1024 A perfect hash join (see `HashJoinExec` for more details) will be considered if the range of keys (max - min) on the build side is < this threshold. This provides a fast path for joins with very small key ranges, bypassing the density check. Currently only supports cases where build_side.num_rows() < u32::MAX. Support for build_side.num_rows() >= u32::MAX will be added in the future. datafusion.execution.planning_concurrency 13 Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system -datafusion.execution.skip_partial_aggregation_cost_min_ratio 0.5 Effective ratio threshold used by the cost-aware skip rule. Skip partial aggregation when ratio at probe close is at least this value (provided `skip_partial_aggregation_use_cost_model` is true). Default 0.5 is the empirical sweet spot from ClickBench: 0.4 would skip too aggressively (regresses low-cost queries that benefit from partial), 0.7 fails to catch Q18-shape queries. +datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode datafusion.execution.skip_partial_aggregation_use_cost_model false (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index fc95550ff94b3..9bd73a54d44f7 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -133,7 +133,7 @@ The following configuration settings are available: | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | | datafusion.execution.skip_partial_aggregation_use_cost_model | false | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | -| datafusion.execution.skip_partial_aggregation_cost_min_ratio | 0.5 | Effective ratio threshold used by the cost-aware skip rule. Skip partial aggregation when ratio at probe close is at least this value (provided `skip_partial_aggregation_use_cost_model` is true). Default 0.5 is the empirical sweet spot from ClickBench: 0.4 would skip too aggressively (regresses low-cost queries that benefit from partial), 0.7 fails to catch Q18-shape queries. | +| datafusion.execution.skip_partial_aggregation_ab_sampling_rows | 10000 | Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | | datafusion.execution.objectstore_writer_buffer_size | 10485760 | Size (bytes) of data buffer DataFusion uses when writing output files. This affects the size of the data chunks that are uploaded to remote object stores (e.g. AWS S3). If very large (>= 100 GiB) output files are being written, it may be necessary to increase this size to avoid errors from the remote end point. | From acb2ad94400371bc2c75f98d8583e9c9552eead3 Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 20:38:04 +0800 Subject: [PATCH 07/14] [FOR BENCHMARK ONLY] flip default to true so bot exercises A/B sampling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Same temporary flip as before — benchmarking bot uses default config, so cost-aware A/B sampling would never run otherwise. Revert this commit before merge; the contract stays opt-in until we have data on which to base a default change. --- datafusion/common/src/config.rs | 2 +- datafusion/sqllogictest/test_files/information_schema.slt | 4 ++-- docs/source/user-guide/configs.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index 231ae968400fb..90bc2da076faa 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -664,7 +664,7 @@ config_namespace! { /// signal (`partial_agg_probe_ns_per_row` metric) can replace /// this static threshold is an open question — for now the /// metric is reported alongside so callers can evaluate. - pub skip_partial_aggregation_use_cost_model: bool, default = false + pub skip_partial_aggregation_use_cost_model: bool, default = true /// Number of input rows used in the A/B sampling window after the /// initial partial probe completes. During this window the operator diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index 34e5a94a53c1a..056727cbd3767 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -269,7 +269,7 @@ datafusion.execution.planning_concurrency 13 datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 -datafusion.execution.skip_partial_aggregation_use_cost_model false +datafusion.execution.skip_partial_aggregation_use_cost_model true datafusion.execution.skip_physical_aggregate_schema_check false datafusion.execution.soft_max_rows_per_output_file 50000000 datafusion.execution.sort_in_place_threshold_bytes 1048576 @@ -421,7 +421,7 @@ datafusion.execution.planning_concurrency 13 Fan-out during initial physical pla datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode -datafusion.execution.skip_partial_aggregation_use_cost_model false (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. +datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. datafusion.execution.skip_physical_aggregate_schema_check false When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. datafusion.execution.soft_max_rows_per_output_file 50000000 Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max datafusion.execution.sort_in_place_threshold_bytes 1048576 When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 9bd73a54d44f7..138696a9bfa39 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -132,7 +132,7 @@ The following configuration settings are available: | datafusion.execution.keep_partition_by_columns | false | Should DataFusion keep the columns used for partition_by in the output RecordBatches | | datafusion.execution.skip_partial_aggregation_probe_ratio_threshold | 0.8 | Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input | | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | -| datafusion.execution.skip_partial_aggregation_use_cost_model | false | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | +| datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | | datafusion.execution.skip_partial_aggregation_ab_sampling_rows | 10000 | Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | From df6e26413a3e9362b4ad999468579ddbc6acda22 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Tue, 26 May 2026 20:41:25 +0800 Subject: [PATCH 08/14] ci: re-trigger after GitHub Actions infra outage From 44f815a876d319c662ba7236752d68cec5831baf Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 21:36:36 +0800 Subject: [PATCH 09/14] feat: segment-level re-probing for dynamic distribution shifts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2 of cost-aware partial-agg skip. Instead of one final decision per partition, the probe rewinds back to the partial-probe phase after `re_probe_interval_rows` rows (default 1M) and re-runs the partial+A/B sampling cycle on the next segment. Lets a single partition oscillate between partial and skip as the data distribution shifts (dense burst of repeated keys followed by high-cardinality stretch, etc.). State machine: `ProbePhase::Locked { should_skip }` becomes `ProbePhase::Active { should_skip, rows_since_decision }`. Per-batch: - In keep-partial: `observe_partial_batch` increments `rows_since_decision`. At the threshold, `start_reprobe` resets the probe (phase = Partial, counters cleared, `elapsed_compute_at_probe_start` re-snapshotted, `is_locked` = false). - In skip: `tick_skip_batch` does the same from the `SkippingAggregation` exec-state arm. When re-probe fires, the main loop transitions back to `ReadingInput` so the partial-agg path runs on the next batch (fresh hash table, since the previous one was emitted on entry to skip). Final-agg correctness is unaffected: each segment's output (be it emitted partial state or per-row passthrough state) is associative- commutative and merges naturally downstream. New config: - `skip_partial_aggregation_re_probe_interval_rows` (usize, default 1_000_000). Set to 0 to disable re-probing entirely (one-shot decision, the Phase 1 behaviour). New diagnostic counter: - `partial_agg_probe_segment_count` — number of completed segments in the current partition. 0 means the probe ran once and never re-probed; a large value on a fast query suggests the interval is too small. Three new `SkipAggregationProbe` unit tests cover: - re-probe after a committed skip decision rewinds to `Partial` - re-probe after a committed keep decision rewinds to `Partial` - `re_probe_interval_rows = 0` disables re-probing Existing test `test_skip_aggregation_probe_not_locked_until_skip` explicitly disables the cost-aware path (it exercises a legacy Rule 1 corner case the cost model would intercept differently). --- datafusion/common/src/config.rs | 9 + .../physical-plan/src/aggregates/row_hash.rs | 280 ++++++++++++++++-- .../test_files/information_schema.slt | 2 + docs/source/user-guide/configs.md | 1 + 4 files changed, 261 insertions(+), 31 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index 90bc2da076faa..934bae2d607e8 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -675,6 +675,15 @@ config_namespace! { /// be cheap if the decision turns out to be "keep partial". pub skip_partial_aggregation_ab_sampling_rows: usize, default = 10_000 + /// Rows processed under the committed skip/keep decision before + /// the cost-aware probe re-runs the partial-probe + A/B-sampling + /// cycle and possibly switches direction. Lets the operator + /// adapt to shifting data distributions (e.g. dense bursts of + /// repeated keys followed by sparse high-cardinality stretches). + /// Set to 0 to disable re-probing (one-shot decision, the + /// behaviour before this knob existed). + pub skip_partial_aggregation_re_probe_interval_rows: usize, default = 1_000_000 + /// Should DataFusion use row number estimates at the input to decide /// whether increasing parallelism is beneficial or not. By default, /// only exact row numbers (not estimates) are used for this decision. diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index cbb6c86be0cee..a08dbf6e2c56a 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -129,7 +129,9 @@ struct SpillState { // Metrics related to spilling are managed inside `spill_manager` } -/// Three phases of the cost-aware skip decision. +/// Phases of the segment-level cost-aware skip decision. +/// +/// One full cycle: /// /// 1. `Partial` — accumulate input through the hash table (normal /// partial-agg path), measuring `partial_ns/row` and the @@ -139,15 +141,22 @@ struct SpillState { /// the passthrough path (`transform_to_states`) to measure /// `passthrough_ns/row`. The hash table built so far is kept; /// nothing is emitted yet. -/// 3. `Locked { should_skip }` — final decision. Skip when +/// 3. `Active { should_skip, rows_since_decision }` — committed decision +/// for the current segment. Skip when /// `ratio > passthrough_ns/row / partial_ns/row` (the cost-aware -/// crossover); otherwise revert to partial agg for the rest of the -/// stream. +/// crossover); otherwise revert to partial agg. +/// 4. When `rows_since_decision >= re_probe_interval`, the probe resets +/// back to `Partial`, restarting the cycle on the next segment of +/// input. Set the interval to 0 to disable re-probing entirely +/// (the decision is final). #[derive(Debug, Clone, Copy, PartialEq, Eq)] enum ProbePhase { Partial, AbSampling, - Locked { should_skip: bool }, + Active { + should_skip: bool, + rows_since_decision: usize, + }, } /// Tracks if the aggregate should skip partial aggregations. @@ -167,6 +176,10 @@ struct SkipAggregationProbe { probe_ratio_threshold: f64, use_cost_model: bool, ab_sampling_rows: usize, + /// Rows processed under the committed decision before the probe + /// rewinds back to `Partial` for a fresh measurement cycle. 0 + /// disables re-probing (one-shot decision). + re_probe_interval_rows: usize, // ======================================================================== // STATE @@ -213,6 +226,12 @@ struct SkipAggregationProbe { /// 0 otherwise. Distinguishes the cost-aware skip from the fixed /// Rule 1 skip and from "decision was to keep partial". probe_cost_decision_skip: metrics::Gauge, + /// Diagnostic counter: number of completed segments (probe + A/B + + /// active region). 1 if the probe ran once and didn't re-probe; N + /// when re-probing is enabled and the stream contains N segments. + /// A high number on a fast query is a hint that + /// `re_probe_interval_rows` is too small relative to the workload. + probe_segment_count: metrics::Count, } impl SkipAggregationProbe { @@ -222,12 +241,14 @@ impl SkipAggregationProbe { probe_ratio_threshold: f64, use_cost_model: bool, ab_sampling_rows: usize, + re_probe_interval_rows: usize, skipped_aggregation_rows: metrics::Count, elapsed_compute: metrics::Time, probe_partial_ns_per_row: metrics::Gauge, probe_passthrough_ns_per_row: metrics::Gauge, probe_ratio_per_mille: metrics::Gauge, probe_cost_decision_skip: metrics::Gauge, + probe_segment_count: metrics::Count, ) -> Self { let elapsed_compute_at_probe_start = elapsed_compute.value(); Self { @@ -235,6 +256,7 @@ impl SkipAggregationProbe { probe_ratio_threshold, use_cost_model, ab_sampling_rows, + re_probe_interval_rows, phase: ProbePhase::Partial, input_rows: 0, num_groups: 0, @@ -249,22 +271,72 @@ impl SkipAggregationProbe { probe_passthrough_ns_per_row, probe_ratio_per_mille, probe_cost_decision_skip, + probe_segment_count, } } - /// Called from the partial-agg path after each input batch. Tracks - /// total rows / group count and, when `probe_rows_threshold` is - /// reached, drives the phase transition. + /// Called from the partial-agg path after each input batch. In + /// `Partial`, accumulates partial probe state and may transition to + /// A/B sampling or commit a skip. In `Active { should_skip: false }` + /// (keeping partial), increments the segment counter and triggers a + /// re-probe when the configured interval is reached. fn observe_partial_batch(&mut self, input_rows: usize, num_groups: usize) { - if self.phase != ProbePhase::Partial { - return; + match self.phase { + ProbePhase::Partial => { + self.input_rows += input_rows; + self.num_groups = num_groups; + if self.input_rows < self.probe_rows_threshold { + return; + } + self.finalize_partial_probe(); + } + ProbePhase::Active { + should_skip: false, + ref mut rows_since_decision, + } => { + *rows_since_decision += input_rows; + if self.re_probe_interval_rows > 0 + && *rows_since_decision >= self.re_probe_interval_rows + { + self.start_reprobe(); + } + } + _ => {} } - self.input_rows += input_rows; - self.num_groups = num_groups; - if self.input_rows < self.probe_rows_threshold { - return; + } + + /// Called from the [`ExecutionState::SkippingAggregation`] path. In + /// `Active { should_skip: true }`, increments the segment counter and + /// triggers a re-probe when the configured interval is reached. No-op + /// in any other phase. + fn tick_skip_batch(&mut self, input_rows: usize) { + if let ProbePhase::Active { + should_skip: true, + ref mut rows_since_decision, + } = self.phase + { + *rows_since_decision += input_rows; + if self.re_probe_interval_rows > 0 + && *rows_since_decision >= self.re_probe_interval_rows + { + self.start_reprobe(); + } } + } + + /// `true` if the probe has just finished an active segment and is + /// ready for the main loop to re-enter the partial-agg path (a + /// fresh `Partial` measurement window). + fn wants_reprobe(&self) -> bool { + // Re-probe is requested when the probe is back in `Partial` with + // no accumulated input — exactly what `start_reprobe` leaves + // behind. + matches!(self.phase, ProbePhase::Partial) && self.input_rows == 0 + } + /// Run the partial-probe → decision transition once the partial + /// window has accumulated `probe_rows_threshold` rows. + fn finalize_partial_probe(&mut self) { let ratio = self.num_groups as f64 / self.input_rows as f64; let partial_ns = self .elapsed_compute @@ -276,9 +348,9 @@ impl SkipAggregationProbe { self.probe_partial_ns_per_row.set(partial_ns_per_row); self.probe_ratio_per_mille.set((ratio * 1000.0) as usize); - // Rule 1 (fixed): high ratio — short-circuit straight to skip. if ratio >= self.probe_ratio_threshold { - self.commit_skip(); + // Rule 1 (fixed): high ratio short-circuits to skip. + self.commit_decision(true); return; } @@ -295,6 +367,22 @@ impl SkipAggregationProbe { self.phase = ProbePhase::AbSampling; } + /// Reset state to begin a new probe cycle. Used both by re-probing + /// after the segment limit is reached and (conceptually) on probe + /// construction. Counters are zeroed; metrics gauges are *not* + /// cleared — they hold the most-recent reading. + fn start_reprobe(&mut self) { + self.probe_segment_count.add(1); + self.phase = ProbePhase::Partial; + self.input_rows = 0; + self.num_groups = 0; + self.ab_rows = 0; + self.elapsed_compute_at_probe_start = self.elapsed_compute.value(); + self.elapsed_compute_at_ab_start = None; + self.should_skip = false; + self.is_locked = false; + } + /// True iff the main loop should route the next input batch through /// the passthrough (`transform_to_states`) path to feed the A/B /// measurement instead of through the hash-table partial-agg path. @@ -350,24 +438,21 @@ impl SkipAggregationProbe { ratio > (passthrough_ns_per_row as f64 / partial_ns_per_row as f64) }; - if should_skip { - self.probe_cost_decision_skip.set(1); - self.commit_skip(); - } else { - self.probe_cost_decision_skip.set(0); - self.phase = ProbePhase::Locked { should_skip: false }; - self.is_locked = true; - } + self.probe_cost_decision_skip + .set(if should_skip { 1 } else { 0 }); + self.commit_decision(should_skip); } - /// Transition to the terminal `Locked { should_skip: true }` state. - /// Used by both Rule 1 (fixed-ratio short-circuit) and the cost-aware - /// path so the rest of the operator can rely on a single - /// `should_skip` flag. - fn commit_skip(&mut self) { - self.should_skip = true; + /// Commit a decision and enter the `Active` segment phase. Used by + /// both Rule 1 short-circuit (skip) and the cost-aware path + /// (skip or keep); keeps the per-flag bookkeeping in one place. + fn commit_decision(&mut self, should_skip: bool) { + self.should_skip = should_skip; self.is_locked = true; - self.phase = ProbePhase::Locked { should_skip: true }; + self.phase = ProbePhase::Active { + should_skip, + rows_since_decision: 0, + }; } fn should_skip(&self) -> bool { @@ -814,6 +899,8 @@ impl GroupedHashAggregateStream { options.skip_partial_aggregation_probe_ratio_threshold; let use_cost_model = options.skip_partial_aggregation_use_cost_model; let ab_sampling_rows = options.skip_partial_aggregation_ab_sampling_rows; + let re_probe_interval_rows = + options.skip_partial_aggregation_re_probe_interval_rows; let skipped_aggregation_rows = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .counter("skipped_aggregation_rows", partition); @@ -829,17 +916,22 @@ impl GroupedHashAggregateStream { let probe_cost_decision_skip = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .gauge("partial_agg_probe_cost_decision_skip", partition); + let probe_segment_count = MetricBuilder::new(&agg.metrics) + .with_category(MetricCategory::Rows) + .counter("partial_agg_probe_segment_count", partition); Some(SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, ab_sampling_rows, + re_probe_interval_rows, skipped_aggregation_rows, baseline_metrics.elapsed_compute().clone(), probe_partial_ns_per_row, probe_passthrough_ns_per_row, probe_ratio_per_mille, probe_cost_decision_skip, + probe_segment_count, )) } else { None @@ -1018,10 +1110,19 @@ impl Stream for GroupedHashAggregateStream { match ready!(self.input.poll_next_unpin(cx)) { Some(Ok(batch)) => { let _timer = elapsed_compute.timer(); + let input_rows = batch.num_rows(); if let Some(probe) = self.skip_aggregation_probe.as_mut() { probe.record_skipped(&batch); } let states = self.transform_to_states(&batch)?; + // Drive the segment counter; if the probe + // decides to re-probe, swing the operator + // back to `ReadingInput` so the partial-agg + // path runs again on the next batch. + self.tick_skip_aggregation_probe(input_rows); + if self.probe_wants_reprobe() { + self.exec_state = ExecutionState::ReadingInput; + } return Poll::Ready(Some(Ok( states.record_output(&self.baseline_metrics) ))); @@ -1676,6 +1777,24 @@ impl GroupedHashAggregateStream { .is_some_and(|p| p.wants_passthrough_sample()) } + /// True iff the probe has finished an active segment and wants the + /// main loop to re-enter the partial-agg path for a fresh probe + /// cycle. Driven by `re_probe_interval_rows`. + fn probe_wants_reprobe(&self) -> bool { + self.skip_aggregation_probe + .as_ref() + .is_some_and(|p| p.wants_reprobe()) + } + + /// Tick the probe with rows that flowed through the passthrough + /// (`SkippingAggregation`) path, so it can drive the re-probe + /// segment counter. + fn tick_skip_aggregation_probe(&mut self, input_rows: usize) { + if let Some(probe) = self.skip_aggregation_probe.as_mut() { + probe.tick_skip_batch(input_rows); + } + } + /// In case the probe indicates that aggregation may be /// skipped, forces stream to produce currently accumulated output. /// @@ -1947,6 +2066,13 @@ mod tests { "datafusion.execution.skip_partial_aggregation_probe_ratio_threshold", &datafusion_common::ScalarValue::Float64(Some(probe_ratio_threshold)), ); + // This test exercises the legacy Rule 1 not-locked-until-skip + // behaviour. Disable the cost-aware path so the default-on A/B + // sampling doesn't change the decision shape. + session_config = session_config.set( + "datafusion.execution.skip_partial_aggregation_use_cost_model", + &datafusion_common::ScalarValue::Boolean(Some(false)), + ); task_ctx = task_ctx.with_session_config(session_config); let task_ctx = Arc::new(task_ctx); @@ -2094,6 +2220,7 @@ mod tests { passthrough_ns_gauge: metrics::Gauge, ratio_gauge: metrics::Gauge, cost_decision_gauge: metrics::Gauge, + segment_count: metrics::Count, } fn rig( @@ -2101,23 +2228,42 @@ mod tests { probe_ratio_threshold: f64, use_cost_model: bool, ab_sampling_rows: usize, + ) -> ProbeRig { + rig_with_re_probe( + probe_rows_threshold, + probe_ratio_threshold, + use_cost_model, + ab_sampling_rows, + 0, // disable re-probing in legacy tests + ) + } + + fn rig_with_re_probe( + probe_rows_threshold: usize, + probe_ratio_threshold: f64, + use_cost_model: bool, + ab_sampling_rows: usize, + re_probe_interval_rows: usize, ) -> ProbeRig { let elapsed = metrics::Time::new(); let partial_ns_gauge = metrics::Gauge::new(); let passthrough_ns_gauge = metrics::Gauge::new(); let ratio_gauge = metrics::Gauge::new(); let cost_decision_gauge = metrics::Gauge::new(); + let segment_count = metrics::Count::new(); let probe = SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, ab_sampling_rows, + re_probe_interval_rows, metrics::Count::new(), elapsed.clone(), partial_ns_gauge.clone(), passthrough_ns_gauge.clone(), ratio_gauge.clone(), cost_decision_gauge.clone(), + segment_count.clone(), ); ProbeRig { probe, @@ -2126,6 +2272,7 @@ mod tests { passthrough_ns_gauge, ratio_gauge, cost_decision_gauge, + segment_count, } } @@ -2268,4 +2415,75 @@ mod tests { assert_eq!(r.partial_ns_gauge.value(), 2_000); assert_eq!(r.ratio_gauge.value(), 500); } + + /// After committing a skip decision, processing `re_probe_interval` + /// rows through `tick_skip_batch` rewinds the probe back to + /// `Partial`, ready for the main loop to re-enter the partial-agg + /// path on the next segment. + #[test] + fn skip_probe_reprobes_after_skip_segment() { + let mut r = rig_with_re_probe(100, 0.8, true, 100, 1000); + + // Force a skip via Rule 1 (ratio 0.9). + r.probe.observe_partial_batch(100, 90); + assert!(r.probe.should_skip()); + assert!(r.probe.is_locked); + assert!(!r.probe.wants_reprobe()); + assert_eq!(r.segment_count.value(), 0); + + // Stream 1000 rows through skip — at the threshold the probe + // resets back to `Partial`. + r.probe.tick_skip_batch(500); + assert!(!r.probe.wants_reprobe()); + r.probe.tick_skip_batch(500); + + assert!(r.probe.wants_reprobe()); + assert!(!r.probe.should_skip()); + assert!(!r.probe.is_locked); + assert_eq!(r.segment_count.value(), 1); + } + + /// After committing a keep decision, accumulating + /// `re_probe_interval` more rows through `observe_partial_batch` + /// rewinds the probe so the next batch starts a fresh partial + /// measurement. + #[test] + fn skip_probe_reprobes_after_keep_segment() { + let mut r = rig_with_re_probe(100, 0.8, true, 100, 1000); + + // partial=100ns/row, passthrough=80ns/row, ratio 0.6 → + // 0.6 < 80/100 = 0.8 → cost says keep. + r.elapsed.add_duration(Duration::from_nanos(10_000)); + r.probe.observe_partial_batch(100, 60); + r.elapsed.add_duration(Duration::from_nanos(8_000)); + r.probe.observe_ab_batch(100); + assert!(!r.probe.should_skip()); + assert!(r.probe.is_locked); + assert_eq!(r.segment_count.value(), 0); + + // Stream 1000 rows through the keep path — re-probe fires. + r.probe.observe_partial_batch(500, 320); + assert!(!r.probe.wants_reprobe()); + r.probe.observe_partial_batch(500, 600); + + assert!(r.probe.wants_reprobe()); + assert!(!r.probe.is_locked); + assert_eq!(r.segment_count.value(), 1); + } + + /// `re_probe_interval_rows = 0` disables re-probing entirely — + /// the original one-shot decision behaviour. + #[test] + fn skip_probe_re_probe_disabled_when_interval_zero() { + let mut r = rig_with_re_probe(100, 0.8, true, 100, 0); + + r.probe.observe_partial_batch(100, 90); // skip via Rule 1 + assert!(r.probe.should_skip()); + + // Even after a huge skip-window, the probe stays locked. + r.probe.tick_skip_batch(10_000_000); + assert!(!r.probe.wants_reprobe()); + assert!(r.probe.is_locked); + assert_eq!(r.segment_count.value(), 0); + } } diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index 056727cbd3767..b72b7a5686147 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -269,6 +269,7 @@ datafusion.execution.planning_concurrency 13 datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 +datafusion.execution.skip_partial_aggregation_re_probe_interval_rows 1000000 datafusion.execution.skip_partial_aggregation_use_cost_model true datafusion.execution.skip_physical_aggregate_schema_check false datafusion.execution.soft_max_rows_per_output_file 50000000 @@ -421,6 +422,7 @@ datafusion.execution.planning_concurrency 13 Fan-out during initial physical pla datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode +datafusion.execution.skip_partial_aggregation_re_probe_interval_rows 1000000 Rows processed under the committed skip/keep decision before the cost-aware probe re-runs the partial-probe + A/B-sampling cycle and possibly switches direction. Lets the operator adapt to shifting data distributions (e.g. dense bursts of repeated keys followed by sparse high-cardinality stretches). Set to 0 to disable re-probing (one-shot decision, the behaviour before this knob existed). datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. datafusion.execution.skip_physical_aggregate_schema_check false When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. datafusion.execution.soft_max_rows_per_output_file 50000000 Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 138696a9bfa39..01defc4c39f8f 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -134,6 +134,7 @@ The following configuration settings are available: | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | | datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | | datafusion.execution.skip_partial_aggregation_ab_sampling_rows | 10000 | Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". | +| datafusion.execution.skip_partial_aggregation_re_probe_interval_rows | 1000000 | Rows processed under the committed skip/keep decision before the cost-aware probe re-runs the partial-probe + A/B-sampling cycle and possibly switches direction. Lets the operator adapt to shifting data distributions (e.g. dense bursts of repeated keys followed by sparse high-cardinality stretches). Set to 0 to disable re-probing (one-shot decision, the behaviour before this knob existed). | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | | datafusion.execution.objectstore_writer_buffer_size | 10485760 | Size (bytes) of data buffer DataFusion uses when writing output files. This affects the size of the data chunks that are uploaded to remote object stores (e.g. AWS S3). If very large (>= 100 GiB) output files are being written, it may be necessary to increase this size to avoid errors from the remote end point. | From a258afe8ca4455de966d5e5758fb544e49723642 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Tue, 26 May 2026 22:08:51 +0800 Subject: [PATCH 10/14] Revert "feat: segment-level re-probing for dynamic distribution shifts" This reverts commit 44f815a876d319c662ba7236752d68cec5831baf. --- datafusion/common/src/config.rs | 9 - .../physical-plan/src/aggregates/row_hash.rs | 280 ++---------------- .../test_files/information_schema.slt | 2 - docs/source/user-guide/configs.md | 1 - 4 files changed, 31 insertions(+), 261 deletions(-) diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs index 934bae2d607e8..90bc2da076faa 100644 --- a/datafusion/common/src/config.rs +++ b/datafusion/common/src/config.rs @@ -675,15 +675,6 @@ config_namespace! { /// be cheap if the decision turns out to be "keep partial". pub skip_partial_aggregation_ab_sampling_rows: usize, default = 10_000 - /// Rows processed under the committed skip/keep decision before - /// the cost-aware probe re-runs the partial-probe + A/B-sampling - /// cycle and possibly switches direction. Lets the operator - /// adapt to shifting data distributions (e.g. dense bursts of - /// repeated keys followed by sparse high-cardinality stretches). - /// Set to 0 to disable re-probing (one-shot decision, the - /// behaviour before this knob existed). - pub skip_partial_aggregation_re_probe_interval_rows: usize, default = 1_000_000 - /// Should DataFusion use row number estimates at the input to decide /// whether increasing parallelism is beneficial or not. By default, /// only exact row numbers (not estimates) are used for this decision. diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index a08dbf6e2c56a..cbb6c86be0cee 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -129,9 +129,7 @@ struct SpillState { // Metrics related to spilling are managed inside `spill_manager` } -/// Phases of the segment-level cost-aware skip decision. -/// -/// One full cycle: +/// Three phases of the cost-aware skip decision. /// /// 1. `Partial` — accumulate input through the hash table (normal /// partial-agg path), measuring `partial_ns/row` and the @@ -141,22 +139,15 @@ struct SpillState { /// the passthrough path (`transform_to_states`) to measure /// `passthrough_ns/row`. The hash table built so far is kept; /// nothing is emitted yet. -/// 3. `Active { should_skip, rows_since_decision }` — committed decision -/// for the current segment. Skip when +/// 3. `Locked { should_skip }` — final decision. Skip when /// `ratio > passthrough_ns/row / partial_ns/row` (the cost-aware -/// crossover); otherwise revert to partial agg. -/// 4. When `rows_since_decision >= re_probe_interval`, the probe resets -/// back to `Partial`, restarting the cycle on the next segment of -/// input. Set the interval to 0 to disable re-probing entirely -/// (the decision is final). +/// crossover); otherwise revert to partial agg for the rest of the +/// stream. #[derive(Debug, Clone, Copy, PartialEq, Eq)] enum ProbePhase { Partial, AbSampling, - Active { - should_skip: bool, - rows_since_decision: usize, - }, + Locked { should_skip: bool }, } /// Tracks if the aggregate should skip partial aggregations. @@ -176,10 +167,6 @@ struct SkipAggregationProbe { probe_ratio_threshold: f64, use_cost_model: bool, ab_sampling_rows: usize, - /// Rows processed under the committed decision before the probe - /// rewinds back to `Partial` for a fresh measurement cycle. 0 - /// disables re-probing (one-shot decision). - re_probe_interval_rows: usize, // ======================================================================== // STATE @@ -226,12 +213,6 @@ struct SkipAggregationProbe { /// 0 otherwise. Distinguishes the cost-aware skip from the fixed /// Rule 1 skip and from "decision was to keep partial". probe_cost_decision_skip: metrics::Gauge, - /// Diagnostic counter: number of completed segments (probe + A/B + - /// active region). 1 if the probe ran once and didn't re-probe; N - /// when re-probing is enabled and the stream contains N segments. - /// A high number on a fast query is a hint that - /// `re_probe_interval_rows` is too small relative to the workload. - probe_segment_count: metrics::Count, } impl SkipAggregationProbe { @@ -241,14 +222,12 @@ impl SkipAggregationProbe { probe_ratio_threshold: f64, use_cost_model: bool, ab_sampling_rows: usize, - re_probe_interval_rows: usize, skipped_aggregation_rows: metrics::Count, elapsed_compute: metrics::Time, probe_partial_ns_per_row: metrics::Gauge, probe_passthrough_ns_per_row: metrics::Gauge, probe_ratio_per_mille: metrics::Gauge, probe_cost_decision_skip: metrics::Gauge, - probe_segment_count: metrics::Count, ) -> Self { let elapsed_compute_at_probe_start = elapsed_compute.value(); Self { @@ -256,7 +235,6 @@ impl SkipAggregationProbe { probe_ratio_threshold, use_cost_model, ab_sampling_rows, - re_probe_interval_rows, phase: ProbePhase::Partial, input_rows: 0, num_groups: 0, @@ -271,72 +249,22 @@ impl SkipAggregationProbe { probe_passthrough_ns_per_row, probe_ratio_per_mille, probe_cost_decision_skip, - probe_segment_count, } } - /// Called from the partial-agg path after each input batch. In - /// `Partial`, accumulates partial probe state and may transition to - /// A/B sampling or commit a skip. In `Active { should_skip: false }` - /// (keeping partial), increments the segment counter and triggers a - /// re-probe when the configured interval is reached. + /// Called from the partial-agg path after each input batch. Tracks + /// total rows / group count and, when `probe_rows_threshold` is + /// reached, drives the phase transition. fn observe_partial_batch(&mut self, input_rows: usize, num_groups: usize) { - match self.phase { - ProbePhase::Partial => { - self.input_rows += input_rows; - self.num_groups = num_groups; - if self.input_rows < self.probe_rows_threshold { - return; - } - self.finalize_partial_probe(); - } - ProbePhase::Active { - should_skip: false, - ref mut rows_since_decision, - } => { - *rows_since_decision += input_rows; - if self.re_probe_interval_rows > 0 - && *rows_since_decision >= self.re_probe_interval_rows - { - self.start_reprobe(); - } - } - _ => {} + if self.phase != ProbePhase::Partial { + return; } - } - - /// Called from the [`ExecutionState::SkippingAggregation`] path. In - /// `Active { should_skip: true }`, increments the segment counter and - /// triggers a re-probe when the configured interval is reached. No-op - /// in any other phase. - fn tick_skip_batch(&mut self, input_rows: usize) { - if let ProbePhase::Active { - should_skip: true, - ref mut rows_since_decision, - } = self.phase - { - *rows_since_decision += input_rows; - if self.re_probe_interval_rows > 0 - && *rows_since_decision >= self.re_probe_interval_rows - { - self.start_reprobe(); - } + self.input_rows += input_rows; + self.num_groups = num_groups; + if self.input_rows < self.probe_rows_threshold { + return; } - } - - /// `true` if the probe has just finished an active segment and is - /// ready for the main loop to re-enter the partial-agg path (a - /// fresh `Partial` measurement window). - fn wants_reprobe(&self) -> bool { - // Re-probe is requested when the probe is back in `Partial` with - // no accumulated input — exactly what `start_reprobe` leaves - // behind. - matches!(self.phase, ProbePhase::Partial) && self.input_rows == 0 - } - /// Run the partial-probe → decision transition once the partial - /// window has accumulated `probe_rows_threshold` rows. - fn finalize_partial_probe(&mut self) { let ratio = self.num_groups as f64 / self.input_rows as f64; let partial_ns = self .elapsed_compute @@ -348,9 +276,9 @@ impl SkipAggregationProbe { self.probe_partial_ns_per_row.set(partial_ns_per_row); self.probe_ratio_per_mille.set((ratio * 1000.0) as usize); + // Rule 1 (fixed): high ratio — short-circuit straight to skip. if ratio >= self.probe_ratio_threshold { - // Rule 1 (fixed): high ratio short-circuits to skip. - self.commit_decision(true); + self.commit_skip(); return; } @@ -367,22 +295,6 @@ impl SkipAggregationProbe { self.phase = ProbePhase::AbSampling; } - /// Reset state to begin a new probe cycle. Used both by re-probing - /// after the segment limit is reached and (conceptually) on probe - /// construction. Counters are zeroed; metrics gauges are *not* - /// cleared — they hold the most-recent reading. - fn start_reprobe(&mut self) { - self.probe_segment_count.add(1); - self.phase = ProbePhase::Partial; - self.input_rows = 0; - self.num_groups = 0; - self.ab_rows = 0; - self.elapsed_compute_at_probe_start = self.elapsed_compute.value(); - self.elapsed_compute_at_ab_start = None; - self.should_skip = false; - self.is_locked = false; - } - /// True iff the main loop should route the next input batch through /// the passthrough (`transform_to_states`) path to feed the A/B /// measurement instead of through the hash-table partial-agg path. @@ -438,21 +350,24 @@ impl SkipAggregationProbe { ratio > (passthrough_ns_per_row as f64 / partial_ns_per_row as f64) }; - self.probe_cost_decision_skip - .set(if should_skip { 1 } else { 0 }); - self.commit_decision(should_skip); + if should_skip { + self.probe_cost_decision_skip.set(1); + self.commit_skip(); + } else { + self.probe_cost_decision_skip.set(0); + self.phase = ProbePhase::Locked { should_skip: false }; + self.is_locked = true; + } } - /// Commit a decision and enter the `Active` segment phase. Used by - /// both Rule 1 short-circuit (skip) and the cost-aware path - /// (skip or keep); keeps the per-flag bookkeeping in one place. - fn commit_decision(&mut self, should_skip: bool) { - self.should_skip = should_skip; + /// Transition to the terminal `Locked { should_skip: true }` state. + /// Used by both Rule 1 (fixed-ratio short-circuit) and the cost-aware + /// path so the rest of the operator can rely on a single + /// `should_skip` flag. + fn commit_skip(&mut self) { + self.should_skip = true; self.is_locked = true; - self.phase = ProbePhase::Active { - should_skip, - rows_since_decision: 0, - }; + self.phase = ProbePhase::Locked { should_skip: true }; } fn should_skip(&self) -> bool { @@ -899,8 +814,6 @@ impl GroupedHashAggregateStream { options.skip_partial_aggregation_probe_ratio_threshold; let use_cost_model = options.skip_partial_aggregation_use_cost_model; let ab_sampling_rows = options.skip_partial_aggregation_ab_sampling_rows; - let re_probe_interval_rows = - options.skip_partial_aggregation_re_probe_interval_rows; let skipped_aggregation_rows = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .counter("skipped_aggregation_rows", partition); @@ -916,22 +829,17 @@ impl GroupedHashAggregateStream { let probe_cost_decision_skip = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .gauge("partial_agg_probe_cost_decision_skip", partition); - let probe_segment_count = MetricBuilder::new(&agg.metrics) - .with_category(MetricCategory::Rows) - .counter("partial_agg_probe_segment_count", partition); Some(SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, ab_sampling_rows, - re_probe_interval_rows, skipped_aggregation_rows, baseline_metrics.elapsed_compute().clone(), probe_partial_ns_per_row, probe_passthrough_ns_per_row, probe_ratio_per_mille, probe_cost_decision_skip, - probe_segment_count, )) } else { None @@ -1110,19 +1018,10 @@ impl Stream for GroupedHashAggregateStream { match ready!(self.input.poll_next_unpin(cx)) { Some(Ok(batch)) => { let _timer = elapsed_compute.timer(); - let input_rows = batch.num_rows(); if let Some(probe) = self.skip_aggregation_probe.as_mut() { probe.record_skipped(&batch); } let states = self.transform_to_states(&batch)?; - // Drive the segment counter; if the probe - // decides to re-probe, swing the operator - // back to `ReadingInput` so the partial-agg - // path runs again on the next batch. - self.tick_skip_aggregation_probe(input_rows); - if self.probe_wants_reprobe() { - self.exec_state = ExecutionState::ReadingInput; - } return Poll::Ready(Some(Ok( states.record_output(&self.baseline_metrics) ))); @@ -1777,24 +1676,6 @@ impl GroupedHashAggregateStream { .is_some_and(|p| p.wants_passthrough_sample()) } - /// True iff the probe has finished an active segment and wants the - /// main loop to re-enter the partial-agg path for a fresh probe - /// cycle. Driven by `re_probe_interval_rows`. - fn probe_wants_reprobe(&self) -> bool { - self.skip_aggregation_probe - .as_ref() - .is_some_and(|p| p.wants_reprobe()) - } - - /// Tick the probe with rows that flowed through the passthrough - /// (`SkippingAggregation`) path, so it can drive the re-probe - /// segment counter. - fn tick_skip_aggregation_probe(&mut self, input_rows: usize) { - if let Some(probe) = self.skip_aggregation_probe.as_mut() { - probe.tick_skip_batch(input_rows); - } - } - /// In case the probe indicates that aggregation may be /// skipped, forces stream to produce currently accumulated output. /// @@ -2066,13 +1947,6 @@ mod tests { "datafusion.execution.skip_partial_aggregation_probe_ratio_threshold", &datafusion_common::ScalarValue::Float64(Some(probe_ratio_threshold)), ); - // This test exercises the legacy Rule 1 not-locked-until-skip - // behaviour. Disable the cost-aware path so the default-on A/B - // sampling doesn't change the decision shape. - session_config = session_config.set( - "datafusion.execution.skip_partial_aggregation_use_cost_model", - &datafusion_common::ScalarValue::Boolean(Some(false)), - ); task_ctx = task_ctx.with_session_config(session_config); let task_ctx = Arc::new(task_ctx); @@ -2220,7 +2094,6 @@ mod tests { passthrough_ns_gauge: metrics::Gauge, ratio_gauge: metrics::Gauge, cost_decision_gauge: metrics::Gauge, - segment_count: metrics::Count, } fn rig( @@ -2228,42 +2101,23 @@ mod tests { probe_ratio_threshold: f64, use_cost_model: bool, ab_sampling_rows: usize, - ) -> ProbeRig { - rig_with_re_probe( - probe_rows_threshold, - probe_ratio_threshold, - use_cost_model, - ab_sampling_rows, - 0, // disable re-probing in legacy tests - ) - } - - fn rig_with_re_probe( - probe_rows_threshold: usize, - probe_ratio_threshold: f64, - use_cost_model: bool, - ab_sampling_rows: usize, - re_probe_interval_rows: usize, ) -> ProbeRig { let elapsed = metrics::Time::new(); let partial_ns_gauge = metrics::Gauge::new(); let passthrough_ns_gauge = metrics::Gauge::new(); let ratio_gauge = metrics::Gauge::new(); let cost_decision_gauge = metrics::Gauge::new(); - let segment_count = metrics::Count::new(); let probe = SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, use_cost_model, ab_sampling_rows, - re_probe_interval_rows, metrics::Count::new(), elapsed.clone(), partial_ns_gauge.clone(), passthrough_ns_gauge.clone(), ratio_gauge.clone(), cost_decision_gauge.clone(), - segment_count.clone(), ); ProbeRig { probe, @@ -2272,7 +2126,6 @@ mod tests { passthrough_ns_gauge, ratio_gauge, cost_decision_gauge, - segment_count, } } @@ -2415,75 +2268,4 @@ mod tests { assert_eq!(r.partial_ns_gauge.value(), 2_000); assert_eq!(r.ratio_gauge.value(), 500); } - - /// After committing a skip decision, processing `re_probe_interval` - /// rows through `tick_skip_batch` rewinds the probe back to - /// `Partial`, ready for the main loop to re-enter the partial-agg - /// path on the next segment. - #[test] - fn skip_probe_reprobes_after_skip_segment() { - let mut r = rig_with_re_probe(100, 0.8, true, 100, 1000); - - // Force a skip via Rule 1 (ratio 0.9). - r.probe.observe_partial_batch(100, 90); - assert!(r.probe.should_skip()); - assert!(r.probe.is_locked); - assert!(!r.probe.wants_reprobe()); - assert_eq!(r.segment_count.value(), 0); - - // Stream 1000 rows through skip — at the threshold the probe - // resets back to `Partial`. - r.probe.tick_skip_batch(500); - assert!(!r.probe.wants_reprobe()); - r.probe.tick_skip_batch(500); - - assert!(r.probe.wants_reprobe()); - assert!(!r.probe.should_skip()); - assert!(!r.probe.is_locked); - assert_eq!(r.segment_count.value(), 1); - } - - /// After committing a keep decision, accumulating - /// `re_probe_interval` more rows through `observe_partial_batch` - /// rewinds the probe so the next batch starts a fresh partial - /// measurement. - #[test] - fn skip_probe_reprobes_after_keep_segment() { - let mut r = rig_with_re_probe(100, 0.8, true, 100, 1000); - - // partial=100ns/row, passthrough=80ns/row, ratio 0.6 → - // 0.6 < 80/100 = 0.8 → cost says keep. - r.elapsed.add_duration(Duration::from_nanos(10_000)); - r.probe.observe_partial_batch(100, 60); - r.elapsed.add_duration(Duration::from_nanos(8_000)); - r.probe.observe_ab_batch(100); - assert!(!r.probe.should_skip()); - assert!(r.probe.is_locked); - assert_eq!(r.segment_count.value(), 0); - - // Stream 1000 rows through the keep path — re-probe fires. - r.probe.observe_partial_batch(500, 320); - assert!(!r.probe.wants_reprobe()); - r.probe.observe_partial_batch(500, 600); - - assert!(r.probe.wants_reprobe()); - assert!(!r.probe.is_locked); - assert_eq!(r.segment_count.value(), 1); - } - - /// `re_probe_interval_rows = 0` disables re-probing entirely — - /// the original one-shot decision behaviour. - #[test] - fn skip_probe_re_probe_disabled_when_interval_zero() { - let mut r = rig_with_re_probe(100, 0.8, true, 100, 0); - - r.probe.observe_partial_batch(100, 90); // skip via Rule 1 - assert!(r.probe.should_skip()); - - // Even after a huge skip-window, the probe stays locked. - r.probe.tick_skip_batch(10_000_000); - assert!(!r.probe.wants_reprobe()); - assert!(r.probe.is_locked); - assert_eq!(r.segment_count.value(), 0); - } } diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt index b72b7a5686147..056727cbd3767 100644 --- a/datafusion/sqllogictest/test_files/information_schema.slt +++ b/datafusion/sqllogictest/test_files/information_schema.slt @@ -269,7 +269,6 @@ datafusion.execution.planning_concurrency 13 datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 -datafusion.execution.skip_partial_aggregation_re_probe_interval_rows 1000000 datafusion.execution.skip_partial_aggregation_use_cost_model true datafusion.execution.skip_physical_aggregate_schema_check false datafusion.execution.soft_max_rows_per_output_file 50000000 @@ -422,7 +421,6 @@ datafusion.execution.planning_concurrency 13 Fan-out during initial physical pla datafusion.execution.skip_partial_aggregation_ab_sampling_rows 10000 Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". datafusion.execution.skip_partial_aggregation_probe_ratio_threshold 0.8 Aggregation ratio (number of distinct groups / number of input rows) threshold for skipping partial aggregation. If the value is greater then partial aggregation will skip aggregation for further input datafusion.execution.skip_partial_aggregation_probe_rows_threshold 100000 Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode -datafusion.execution.skip_partial_aggregation_re_probe_interval_rows 1000000 Rows processed under the committed skip/keep decision before the cost-aware probe re-runs the partial-probe + A/B-sampling cycle and possibly switches direction. Lets the operator adapt to shifting data distributions (e.g. dense bursts of repeated keys followed by sparse high-cardinality stretches). Set to 0 to disable re-probing (one-shot decision, the behaviour before this knob existed). datafusion.execution.skip_partial_aggregation_use_cost_model true (experimental) When true, apply a *secondary* skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. datafusion.execution.skip_physical_aggregate_schema_check false When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. datafusion.execution.soft_max_rows_per_output_file 50000000 Target number of rows in output files when writing multiple. This is a soft max, so it can be exceeded slightly. There also will be one file smaller than the limit if the total number of rows written is not roughly divisible by the soft max diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 01defc4c39f8f..138696a9bfa39 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -134,7 +134,6 @@ The following configuration settings are available: | datafusion.execution.skip_partial_aggregation_probe_rows_threshold | 100000 | Number of input rows partial aggregation partition should process, before aggregation ratio check and trying to switch to skipping aggregation mode | | datafusion.execution.skip_partial_aggregation_use_cost_model | true | (experimental) When true, apply a _secondary_ skip rule on top of `skip_partial_aggregation_probe_ratio_threshold`: skip partial aggregation when the measured ratio is at least `skip_partial_aggregation_cost_min_ratio` (default 0.5). Targets ClickBench Q18-shape queries where the ratio (~0.56) sits just below the fixed 0.8 threshold so partial agg keeps running, but the absolute work (heavy variable-length keys, complex aggregates) makes it net-negative. Empirical motivation: lowering the global ratio threshold to 0.6 fixes Q18 (1.73× faster) but risks regressing low-cost queries at similar ratios. This flag exposes the lower threshold as a separate, opt-in knob. Whether the cost-aware signal (`partial_agg_probe_ns_per_row` metric) can replace this static threshold is an open question — for now the metric is reported alongside so callers can evaluate. | | datafusion.execution.skip_partial_aggregation_ab_sampling_rows | 10000 | Number of input rows used in the A/B sampling window after the initial partial probe completes. During this window the operator routes input through the passthrough (`transform_to_states`) path so the probe can measure `passthrough_ns/row` and compare it against the previously measured `partial_ns/row`. Default 10000 — large enough to amortise per-row noise, small enough to be cheap if the decision turns out to be "keep partial". | -| datafusion.execution.skip_partial_aggregation_re_probe_interval_rows | 1000000 | Rows processed under the committed skip/keep decision before the cost-aware probe re-runs the partial-probe + A/B-sampling cycle and possibly switches direction. Lets the operator adapt to shifting data distributions (e.g. dense bursts of repeated keys followed by sparse high-cardinality stretches). Set to 0 to disable re-probing (one-shot decision, the behaviour before this knob existed). | | datafusion.execution.use_row_number_estimates_to_optimize_partitioning | false | Should DataFusion use row number estimates at the input to decide whether increasing parallelism is beneficial or not. By default, only exact row numbers (not estimates) are used for this decision. Setting this flag to `true` will likely produce better plans. if the source of statistics is accurate. We plan to make this the default in the future. | | datafusion.execution.enforce_batch_size_in_joins | false | Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower. | | datafusion.execution.objectstore_writer_buffer_size | 10485760 | Size (bytes) of data buffer DataFusion uses when writing output files. This affects the size of the data chunks that are uploaded to remote object stores (e.g. AWS S3). If very large (>= 100 GiB) output files are being written, it may be necessary to increase this size to avoid errors from the remote end point. | From a2baa6cee5eed2300f762a42fbabeb598992da9e Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Tue, 26 May 2026 23:16:45 +0800 Subject: [PATCH 11/14] test: explicitly disable cost model in legacy not-locked-until-skip test The Phase 2 revert took out the previous config-disable line. Test exercises the original Rule 1 short-circuit behaviour, so it needs to opt out of the default-on A/B sampling explicitly. --- datafusion/physical-plan/src/aggregates/row_hash.rs | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index cbb6c86be0cee..134f31321a69b 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -1947,6 +1947,14 @@ mod tests { "datafusion.execution.skip_partial_aggregation_probe_ratio_threshold", &datafusion_common::ScalarValue::Float64(Some(probe_ratio_threshold)), ); + // This test exercises the legacy not-locked-until-skip behaviour + // (Rule 1 ratio check, no A/B sampling). Disable the default-on + // cost-aware path so the small probe window in this test doesn't + // get pulled into A/B sampling and stall the skip decision. + session_config = session_config.set( + "datafusion.execution.skip_partial_aggregation_use_cost_model", + &datafusion_common::ScalarValue::Boolean(Some(false)), + ); task_ctx = task_ctx.with_session_config(session_config); let task_ctx = Arc::new(task_ctx); From 08215ab02af9e56a97efbf7cec52b1e5ccc5c3fa Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Wed, 27 May 2026 16:02:30 +0800 Subject: [PATCH 12/14] test: regen push_down_filter_parquet.slt for new probe gauges MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit EXPLAIN ANALYZE on `AggregateExec mode=Partial` now includes the A/B sampling diagnostic gauges (`partial_agg_probe_*`) introduced in this PR. Test data is small so they're all 0 — but the gauges still appear in the metrics line. --- datafusion/sqllogictest/test_files/push_down_filter_parquet.slt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt b/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt index 40bfe79dcc633..06eb774d1b93d 100644 --- a/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt +++ b/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt @@ -743,7 +743,7 @@ Plan with Metrics 03)--ProjectionExec: expr=[a@0 as a, min(join_agg_probe.value)@1 as min_value], metrics=[output_rows=2, output_batches=2] 04)----AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[min(join_agg_probe.value)], metrics=[output_rows=2, output_batches=2, spill_count=0, spilled_rows=0] 05)------RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=1, metrics=[output_rows=2, output_batches=2, spill_count=0, spilled_rows=0] -06)--------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[min(join_agg_probe.value)], metrics=[output_rows=2, output_batches=1, spill_count=0, spilled_rows=0, skipped_aggregation_rows=0, reduction_factor=100% (2/2)] +06)--------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[min(join_agg_probe.value)], metrics=[output_rows=2, output_batches=1, spill_count=0, spilled_rows=0, skipped_aggregation_rows=0, partial_agg_probe_cost_decision_skip=0, partial_agg_probe_ratio_per_mille=0, reduction_factor=100% (2/2)] 07)----------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/push_down_filter_parquet/join_agg_probe.parquet]]}, projection=[a, value], file_type=parquet, predicate=DynamicFilter [ a@0 >= h1 AND a@0 <= h2 AND a@0 IN (SET) ([h1, h2]) ], pruning_predicate=a_null_count@1 != row_count@2 AND a_max@0 >= h1 AND a_null_count@1 != row_count@2 AND a_min@3 <= h2 AND (a_null_count@1 != row_count@2 AND a_min@3 <= h1 AND h1 <= a_max@0 OR a_null_count@1 != row_count@2 AND a_min@3 <= h2 AND h2 <= a_max@0), required_guarantees=[a in (h1, h2)], metrics=[output_rows=2, output_batches=1, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=1 total → 1 matched, row_groups_pruned_bloom_filter=1 total → 1 matched, page_index_pages_pruned=1 total → 1 matched, page_index_rows_pruned=4 total → 4 matched, limit_pruned_row_groups=0 total → 0 matched, batches_split=0, file_open_errors=0, file_scan_errors=0, files_opened=1, files_processed=1, num_predicate_creation_errors=0, predicate_evaluation_errors=0, pushdown_rows_matched=2, pushdown_rows_pruned=2, predicate_cache_inner_records=4, predicate_cache_records=2, scan_efficiency_ratio=19.07% (151/792)] statement ok From 62c258037ab0d50235fa2e8d65a5fc7d156b5e00 Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Wed, 27 May 2026 16:13:29 +0800 Subject: [PATCH 13/14] lazy-register diagnostic gauges so small queries don't show '...=0' noise MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The four cost-aware probe gauges (`partial_agg_probe_*`) were registered eagerly in the probe constructor, which made them show up in every `AggregateExec mode=Partial` EXPLAIN ANALYZE — including queries with so little data that the probe never reaches `probe_rows_threshold` and never sets them. Net result: `...=0` pollution on tiny queries that would never benefit from the cost model anyway. Switch the four gauges to `Option` and lazily register them with the operator's metric set on first reach (`finalize_partial_probe`). Workloads that never engage the cost-aware path don't pay the gauge registration cost and don't get the empty-gauge noise. Probe now stores an `ExecutionPlanMetricsSet` clone (Arc bump, cheap) + partition number to support lazy registration. Tests updated to look up gauge values by name through the metric set rather than via direct `Gauge` handles. Also drops the now-redundant SLT regen from the previous commit since the gauges no longer appear on this workload. --- .../physical-plan/src/aggregates/row_hash.rs | 190 +++++++++++------- .../test_files/push_down_filter_parquet.slt | 2 +- 2 files changed, 117 insertions(+), 75 deletions(-) diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs index 134f31321a69b..493b439c1bd30 100644 --- a/datafusion/physical-plan/src/aggregates/row_hash.rs +++ b/datafusion/physical-plan/src/aggregates/row_hash.rs @@ -197,22 +197,21 @@ struct SkipAggregationProbe { /// Operator-wide `elapsed_compute`; the probe reads `value()` at /// phase transitions to derive per-row costs. elapsed_compute: metrics::Time, - /// Diagnostic gauge: measured partial-agg wall time per input row - /// (ns) at the end of the `Partial` phase. Always reported when - /// the partial probe completes, regardless of subsequent decision. - probe_partial_ns_per_row: metrics::Gauge, - /// Diagnostic gauge: measured passthrough wall time per input row - /// (ns) at the end of the `AbSampling` phase. Reported only when - /// A/B sampling actually runs (cost model enabled and Rule 1 didn't - /// already fire). - probe_passthrough_ns_per_row: metrics::Gauge, - /// Diagnostic gauge: aggregation ratio at the partial probe close, - /// scaled by 1000 (`usize` gauge → integer storage). - probe_ratio_per_mille: metrics::Gauge, - /// Diagnostic gauge: 1 if the cost-aware decision chose to skip, - /// 0 otherwise. Distinguishes the cost-aware skip from the fixed - /// Rule 1 skip and from "decision was to keep partial". - probe_cost_decision_skip: metrics::Gauge, + /// Operator metrics set + partition — saved so the diagnostic + /// gauges below can be lazily registered the first time the probe + /// has something useful to report. Queries that never reach + /// `probe_rows_threshold` (small inputs, short streams) won't + /// register them at all, so EXPLAIN ANALYZE stays clean of empty + /// "...=0" noise on workloads where the cost-aware path doesn't + /// engage. + agg_metrics: metrics::ExecutionPlanMetricsSet, + partition: usize, + /// Diagnostic gauges, lazily created on first set. See + /// [`Self::ensure_probe_gauges`] for the names and categories. + probe_partial_ns_per_row: Option, + probe_passthrough_ns_per_row: Option, + probe_ratio_per_mille: Option, + probe_cost_decision_skip: Option, } impl SkipAggregationProbe { @@ -224,10 +223,8 @@ impl SkipAggregationProbe { ab_sampling_rows: usize, skipped_aggregation_rows: metrics::Count, elapsed_compute: metrics::Time, - probe_partial_ns_per_row: metrics::Gauge, - probe_passthrough_ns_per_row: metrics::Gauge, - probe_ratio_per_mille: metrics::Gauge, - probe_cost_decision_skip: metrics::Gauge, + agg_metrics: metrics::ExecutionPlanMetricsSet, + partition: usize, ) -> Self { let elapsed_compute_at_probe_start = elapsed_compute.value(); Self { @@ -245,13 +242,47 @@ impl SkipAggregationProbe { is_locked: false, skipped_aggregation_rows, elapsed_compute, - probe_partial_ns_per_row, - probe_passthrough_ns_per_row, - probe_ratio_per_mille, - probe_cost_decision_skip, + agg_metrics, + partition, + probe_partial_ns_per_row: None, + probe_passthrough_ns_per_row: None, + probe_ratio_per_mille: None, + probe_cost_decision_skip: None, } } + /// Lazily register all four cost-aware diagnostic gauges with the + /// operator's metric set. Called the first time the probe has data + /// to report (i.e. when `finalize_partial_probe` runs). Idempotent: + /// once the gauges exist, this is a cheap `Option::is_some` check. + /// Small queries that never reach `probe_rows_threshold` skip this + /// entirely, so EXPLAIN ANALYZE stays free of "...=0" noise. + fn ensure_probe_gauges(&mut self) { + if self.probe_partial_ns_per_row.is_some() { + return; + } + self.probe_partial_ns_per_row = Some( + MetricBuilder::new(&self.agg_metrics) + .with_category(MetricCategory::Timing) + .gauge("partial_agg_probe_partial_ns_per_row", self.partition), + ); + self.probe_passthrough_ns_per_row = Some( + MetricBuilder::new(&self.agg_metrics) + .with_category(MetricCategory::Timing) + .gauge("partial_agg_probe_passthrough_ns_per_row", self.partition), + ); + self.probe_ratio_per_mille = Some( + MetricBuilder::new(&self.agg_metrics) + .with_category(MetricCategory::Rows) + .gauge("partial_agg_probe_ratio_per_mille", self.partition), + ); + self.probe_cost_decision_skip = Some( + MetricBuilder::new(&self.agg_metrics) + .with_category(MetricCategory::Rows) + .gauge("partial_agg_probe_cost_decision_skip", self.partition), + ); + } + /// Called from the partial-agg path after each input batch. Tracks /// total rows / group count and, when `probe_rows_threshold` is /// reached, drives the phase transition. @@ -265,6 +296,11 @@ impl SkipAggregationProbe { return; } + // Register the diagnostic gauges with the operator's metric set + // on first reach — keeps EXPLAIN ANALYZE clean on small workloads + // that never engage the cost-aware path. + self.ensure_probe_gauges(); + let ratio = self.num_groups as f64 / self.input_rows as f64; let partial_ns = self .elapsed_compute @@ -273,8 +309,12 @@ impl SkipAggregationProbe { let partial_ns_per_row = (partial_ns as u64) .checked_div(self.input_rows as u64) .unwrap_or(0) as usize; - self.probe_partial_ns_per_row.set(partial_ns_per_row); - self.probe_ratio_per_mille.set((ratio * 1000.0) as usize); + if let Some(g) = self.probe_partial_ns_per_row.as_ref() { + g.set(partial_ns_per_row); + } + if let Some(g) = self.probe_ratio_per_mille.as_ref() { + g.set((ratio * 1000.0) as usize); + } // Rule 1 (fixed): high ratio — short-circuit straight to skip. if ratio >= self.probe_ratio_threshold { @@ -332,16 +372,24 @@ impl SkipAggregationProbe { /// constants. If either measurement is zero (extremely fast or /// degenerate input) we default to keeping partial. fn finalize_ab_decision(&mut self) { + // Gauges were registered when we entered the partial probe; + // they should exist here (we reached `finalize_partial_probe` + // before transitioning to `AbSampling`). let ab_start = self .elapsed_compute_at_ab_start .expect("A/B start snapshot must be set when entering AbSampling"); let ab_ns = self.elapsed_compute.value().saturating_sub(ab_start); let passthrough_ns_per_row = (ab_ns as u64).checked_div(self.ab_rows as u64).unwrap_or(0) as usize; - self.probe_passthrough_ns_per_row - .set(passthrough_ns_per_row); + if let Some(g) = self.probe_passthrough_ns_per_row.as_ref() { + g.set(passthrough_ns_per_row); + } - let partial_ns_per_row = self.probe_partial_ns_per_row.value(); + let partial_ns_per_row = self + .probe_partial_ns_per_row + .as_ref() + .map(|g| g.value()) + .unwrap_or(0); let ratio = self.num_groups as f64 / self.input_rows as f64; let should_skip = if partial_ns_per_row == 0 || passthrough_ns_per_row == 0 { @@ -350,11 +398,12 @@ impl SkipAggregationProbe { ratio > (passthrough_ns_per_row as f64 / partial_ns_per_row as f64) }; + if let Some(g) = self.probe_cost_decision_skip.as_ref() { + g.set(if should_skip { 1 } else { 0 }); + } if should_skip { - self.probe_cost_decision_skip.set(1); self.commit_skip(); } else { - self.probe_cost_decision_skip.set(0); self.phase = ProbePhase::Locked { should_skip: false }; self.is_locked = true; } @@ -817,18 +866,6 @@ impl GroupedHashAggregateStream { let skipped_aggregation_rows = MetricBuilder::new(&agg.metrics) .with_category(MetricCategory::Rows) .counter("skipped_aggregation_rows", partition); - let probe_partial_ns_per_row = MetricBuilder::new(&agg.metrics) - .with_category(MetricCategory::Timing) - .gauge("partial_agg_probe_partial_ns_per_row", partition); - let probe_passthrough_ns_per_row = MetricBuilder::new(&agg.metrics) - .with_category(MetricCategory::Timing) - .gauge("partial_agg_probe_passthrough_ns_per_row", partition); - let probe_ratio_per_mille = MetricBuilder::new(&agg.metrics) - .with_category(MetricCategory::Rows) - .gauge("partial_agg_probe_ratio_per_mille", partition); - let probe_cost_decision_skip = MetricBuilder::new(&agg.metrics) - .with_category(MetricCategory::Rows) - .gauge("partial_agg_probe_cost_decision_skip", partition); Some(SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, @@ -836,10 +873,8 @@ impl GroupedHashAggregateStream { ab_sampling_rows, skipped_aggregation_rows, baseline_metrics.elapsed_compute().clone(), - probe_partial_ns_per_row, - probe_passthrough_ns_per_row, - probe_ratio_per_mille, - probe_cost_decision_skip, + agg.metrics.clone(), + partition, )) } else { None @@ -2093,15 +2128,30 @@ mod tests { // ---------------- SkipAggregationProbe unit tests ---------------- - /// Test rig that exposes the probe together with the `Time` source - /// it reads, so tests can drive measured per-row cost deterministically. + /// Test rig that exposes the probe + its operator metrics set so + /// tests can drive `elapsed_compute` and read the diagnostic gauges + /// by name (they are lazily registered on first reach). struct ProbeRig { probe: SkipAggregationProbe, elapsed: metrics::Time, - partial_ns_gauge: metrics::Gauge, - passthrough_ns_gauge: metrics::Gauge, - ratio_gauge: metrics::Gauge, - cost_decision_gauge: metrics::Gauge, + agg_metrics: metrics::ExecutionPlanMetricsSet, + } + + impl ProbeRig { + /// Read a lazily-registered gauge by name, returning 0 when the + /// gauge has not been registered yet (i.e. the probe never + /// reached the partial-probe finalisation). + fn gauge(&self, name: &str) -> usize { + self.agg_metrics + .clone_inner() + .iter() + .find(|m| m.value().name() == name) + .and_then(|m| match m.value() { + metrics::MetricValue::Gauge { gauge, .. } => Some(gauge.value()), + _ => None, + }) + .unwrap_or(0) + } } fn rig( @@ -2111,10 +2161,7 @@ mod tests { ab_sampling_rows: usize, ) -> ProbeRig { let elapsed = metrics::Time::new(); - let partial_ns_gauge = metrics::Gauge::new(); - let passthrough_ns_gauge = metrics::Gauge::new(); - let ratio_gauge = metrics::Gauge::new(); - let cost_decision_gauge = metrics::Gauge::new(); + let agg_metrics = metrics::ExecutionPlanMetricsSet::new(); let probe = SkipAggregationProbe::new( probe_rows_threshold, probe_ratio_threshold, @@ -2122,18 +2169,13 @@ mod tests { ab_sampling_rows, metrics::Count::new(), elapsed.clone(), - partial_ns_gauge.clone(), - passthrough_ns_gauge.clone(), - ratio_gauge.clone(), - cost_decision_gauge.clone(), + agg_metrics.clone(), + 0, ); ProbeRig { probe, elapsed, - partial_ns_gauge, - passthrough_ns_gauge, - ratio_gauge, - cost_decision_gauge, + agg_metrics, } } @@ -2168,7 +2210,7 @@ mod tests { assert!(r.probe.is_locked); // No A/B sampling happened. assert!(!r.probe.wants_passthrough_sample()); - assert_eq!(r.passthrough_ns_gauge.value(), 0); + assert_eq!(r.gauge("partial_agg_probe_passthrough_ns_per_row"), 0); } /// Below `probe_ratio_threshold`, the probe transitions into the @@ -2185,8 +2227,8 @@ mod tests { assert!(!r.probe.should_skip()); assert!(!r.probe.is_locked); assert!(r.probe.wants_passthrough_sample()); - assert_eq!(r.partial_ns_gauge.value(), 1_000); - assert_eq!(r.ratio_gauge.value(), 600); + assert_eq!(r.gauge("partial_agg_probe_partial_ns_per_row"), 1_000); + assert_eq!(r.gauge("partial_agg_probe_ratio_per_mille"), 600); } /// Cost-aware skip: ratio is greater than `passthrough/partial`, so @@ -2210,8 +2252,8 @@ mod tests { assert!(r.probe.should_skip()); assert!(r.probe.is_locked); assert!(!r.probe.wants_passthrough_sample()); - assert_eq!(r.passthrough_ns_gauge.value(), 50); - assert_eq!(r.cost_decision_gauge.value(), 1); + assert_eq!(r.gauge("partial_agg_probe_passthrough_ns_per_row"), 50); + assert_eq!(r.gauge("partial_agg_probe_cost_decision_skip"), 1); } /// Cost-aware keep: ratio is below `passthrough/partial`, so the @@ -2233,8 +2275,8 @@ mod tests { assert!(!r.probe.should_skip()); assert!(r.probe.is_locked); assert!(!r.probe.wants_passthrough_sample()); - assert_eq!(r.passthrough_ns_gauge.value(), 80); - assert_eq!(r.cost_decision_gauge.value(), 0); + assert_eq!(r.gauge("partial_agg_probe_passthrough_ns_per_row"), 80); + assert_eq!(r.gauge("partial_agg_probe_cost_decision_skip"), 0); } /// A/B sampling needs *enough* rows in the window before it @@ -2273,7 +2315,7 @@ mod tests { r.elapsed.add_duration(Duration::from_nanos(200_000)); r.probe.observe_partial_batch(100, 50); - assert_eq!(r.partial_ns_gauge.value(), 2_000); - assert_eq!(r.ratio_gauge.value(), 500); + assert_eq!(r.gauge("partial_agg_probe_partial_ns_per_row"), 2_000); + assert_eq!(r.gauge("partial_agg_probe_ratio_per_mille"), 500); } } diff --git a/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt b/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt index 06eb774d1b93d..40bfe79dcc633 100644 --- a/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt +++ b/datafusion/sqllogictest/test_files/push_down_filter_parquet.slt @@ -743,7 +743,7 @@ Plan with Metrics 03)--ProjectionExec: expr=[a@0 as a, min(join_agg_probe.value)@1 as min_value], metrics=[output_rows=2, output_batches=2] 04)----AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[min(join_agg_probe.value)], metrics=[output_rows=2, output_batches=2, spill_count=0, spilled_rows=0] 05)------RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=1, metrics=[output_rows=2, output_batches=2, spill_count=0, spilled_rows=0] -06)--------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[min(join_agg_probe.value)], metrics=[output_rows=2, output_batches=1, spill_count=0, spilled_rows=0, skipped_aggregation_rows=0, partial_agg_probe_cost_decision_skip=0, partial_agg_probe_ratio_per_mille=0, reduction_factor=100% (2/2)] +06)--------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[min(join_agg_probe.value)], metrics=[output_rows=2, output_batches=1, spill_count=0, spilled_rows=0, skipped_aggregation_rows=0, reduction_factor=100% (2/2)] 07)----------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/push_down_filter_parquet/join_agg_probe.parquet]]}, projection=[a, value], file_type=parquet, predicate=DynamicFilter [ a@0 >= h1 AND a@0 <= h2 AND a@0 IN (SET) ([h1, h2]) ], pruning_predicate=a_null_count@1 != row_count@2 AND a_max@0 >= h1 AND a_null_count@1 != row_count@2 AND a_min@3 <= h2 AND (a_null_count@1 != row_count@2 AND a_min@3 <= h1 AND h1 <= a_max@0 OR a_null_count@1 != row_count@2 AND a_min@3 <= h2 AND h2 <= a_max@0), required_guarantees=[a in (h1, h2)], metrics=[output_rows=2, output_batches=1, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=1 total → 1 matched, row_groups_pruned_bloom_filter=1 total → 1 matched, page_index_pages_pruned=1 total → 1 matched, page_index_rows_pruned=4 total → 4 matched, limit_pruned_row_groups=0 total → 0 matched, batches_split=0, file_open_errors=0, file_scan_errors=0, files_opened=1, files_processed=1, num_predicate_creation_errors=0, predicate_evaluation_errors=0, pushdown_rows_matched=2, pushdown_rows_pruned=2, predicate_cache_inner_records=4, predicate_cache_records=2, scan_efficiency_ratio=19.07% (151/792)] statement ok From 81493aaaeb9586fa4d6f0c5dfc281dbf7e6c8c07 Mon Sep 17 00:00:00 2001 From: Qi Zhu Date: Wed, 27 May 2026 20:51:10 +0800 Subject: [PATCH 14/14] test: disable cost model in push_down_filter_regression to stabilise flake The test's own comment notes a parallel race between Partial agg publishing MIN/MAX dynamic filter updates and the scan reading each partition. With the cost-aware path on, the probe state machine adds just enough timing perturbation to tip that race differently on CI (saw `a < 3` instead of `a < 1` for MIN over 4 rows split into two files). Test verifies dynamic filter pushdown, not the cost-aware path, so opt out via session config. --- .../test_files/push_down_filter_regression.slt | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/datafusion/sqllogictest/test_files/push_down_filter_regression.slt b/datafusion/sqllogictest/test_files/push_down_filter_regression.slt index 923a51afc8df9..2aaca2a388b1e 100644 --- a/datafusion/sqllogictest/test_files/push_down_filter_regression.slt +++ b/datafusion/sqllogictest/test_files/push_down_filter_regression.slt @@ -306,6 +306,14 @@ set datafusion.explain.analyze_level = summary; statement ok set datafusion.explain.analyze_categories = 'none'; +# Disable the cost-aware partial-agg skip path. Even on tiny inputs that +# never reach the probe-rows threshold, having the cost model enabled +# alters the partial-agg state machine slightly and perturbs the +# parallel race between partial-agg publishing its MIN/MAX and the scan +# reading each partition — flaking these EXPLAIN ANALYZE expectations. +statement ok +set datafusion.execution.skip_partial_aggregation_use_cost_model = false; + # MIN(a) -> DynamicFilter [ a < 1 ] query TT EXPLAIN ANALYZE SELECT MIN(a) FROM agg_dyn_single; @@ -574,3 +582,7 @@ drop table t1; statement ok drop table t2; + +statement ok +reset datafusion.execution.skip_partial_aggregation_use_cost_model; +