Skip to content

feat: accept distinct kwarg on sum and avg#1556

Open
timsaucer wants to merge 4 commits into
apache:mainfrom
timsaucer:feat/df54-followups-wave2
Open

feat: accept distinct kwarg on sum and avg#1556
timsaucer wants to merge 4 commits into
apache:mainfrom
timsaucer:feat/df54-followups-wave2

Conversation

@timsaucer
Copy link
Copy Markdown
Member

@timsaucer timsaucer commented May 22, 2026

Which issue does this PR close?

Related to #1533

Rationale for this change

Upstream DataFusion now supports the distinct variant for these functions. We expose this functionality here.

What changes are included in this PR?

  • Enhance sum and avg to take the distinct keyword.
  • Add unit tests
  • Update check-upstream skill so that these are known, already covered functions.
  • Document the argument-order change in the upgrade guide.

Are there any user-facing changes?

sum and avg gain a distinct keyword. This is a breaking change: distinct is inserted before filter in the argument list to stay consistent with the other aggregate functions, so any caller that passed filter positionally (e.g. f.sum(col, my_filter)) will now bind that value to distinct. Such callers must pass filter as a keyword argument. See the DataFusion 54.0.0 entry in the upgrade guide for details.

timsaucer and others added 3 commits May 22, 2026 07:26
Upstream exposes `sum_distinct` / `avg_distinct` / `count_distinct` as
sibling functions that call the same underlying UDAF with
`distinct: bool = true`. The Rust binding side already routes
`distinct=Some(true)` through the aggregate builder for `sum`, `avg`,
and `count` — but only `count` exposed the kwarg on the Python wrapper.

Add `distinct: bool = False` to `sum()` and `avg()` mirroring the
existing `count()` signature, and update SKILL.md so the check-upstream
audit does not re-flag the three upstream `*_distinct` shortcuts as
gaps. The plan emitted by `sum(col, distinct=True)` matches what
upstream's `sum_distinct(col)` builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the standalone test_sum_distinct_kwarg and test_avg_distinct_kwarg
from test_functions.py into the existing test_aggregation::test_aggregation
parameterization, matching how distinct is already covered for median,
array_agg, count, and bit_xor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the unhelpful "upstream avg_distinct/sum_distinct shortcut"
reference in favor of describing the actual behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer changed the title feat: accept distinct kwarg on sum, avg, and count feat: accept distinct kwarg on sum and avg May 27, 2026
distinct is inserted before filter on sum and avg for consistency with
the other aggregate functions, breaking positional filter callers. Add a
DataFusion 54.0.0 upgrade-guide entry covering the migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer marked this pull request as ready for review May 27, 2026 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant