[fix](join): avoid join amplification in scalar subquery window rewrite by starocean999 · Pull Request #63763 · apache/doris

starocean999 · 2026-05-27T13:23:45Z

Problem Summary:
The rewrite in AggScalarSubQueryToWindowFunction.java partitioned the generated window function only by the correlated slots. That assumption is not sufficient when the outer query contains an additional relation whose rows are duplicated under the same correlated key. In that situation, multiple distinct outer rows are merged into the same window partition, so the rewritten window aggregate is evaluated over join-amplified rows rather than over the logical row set seen by the original correlated scalar subquery. The rule also kept mutable matching state on the rewriter instance, which needed to be reset per rewrite attempt.

Fix:
The fix changes the rewrite to partition the generated window function by all visible slots from the outer-only relation, so different outer rows that share the same correlated key remain isolated in separate window partitions. This preserves the original scalar-subquery semantics even when the outer plan has join-expanded rows.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-05-27T13:23:52Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

starocean999 · 2026-05-27T13:23:58Z

/review

github-actions

I found a remaining correctness hole in the scalar-subquery window rewrite. The new partitioning fixes the case where an outer-only relation has a visible distinguishing column, but the rule can still fire when the distinguishing columns have been pruned from the Apply left output, leaving only the correlated key and still allowing join amplification.

Critical checkpoint conclusions:

Goal/test proof: The goal is to avoid join amplification in the scalar-subquery-to-window rewrite. The added test proves one case with a visible dim.did, but does not prove cases where duplicate outer-only rows are not distinguished by the visible output.
Scope/focus: The change is focused, but the implementation is incomplete for pruned duplicate-distinguishing columns.
Concurrency/lifecycle: No new concurrency, locking, or lifecycle concerns identified.
Configuration/compatibility: No new configs or incompatible serialization/storage changes identified in the actual PR diff.
Parallel paths: The relevant parallel path is the same rewrite when column pruning leaves only correlated slots from the outer-only relation; it still has the original amplification problem.
Tests/results: Existing new regression output is deterministic, but coverage should add the non-visible duplicate-column case or the rule should bail out when it cannot partition by the full outer-only relation identity.
Observability/transactions/persistence: Not applicable.
Performance: Partitioning by extra slots can cost more, but no blocking performance issue beyond the correctness concern.

User focus points: No additional user-provided review focus was present.

github-actions · 2026-05-27T13:36:29Z

                .filter(node -> outerIds.contains(node.getTable().getId()))
                .map(LogicalRelation.class::cast)
                .map(LogicalRelation::getOutputExprIdSet).flatMap(Collection::stream).collect(Collectors.toSet());
+        partitionBySlots.addAll(apply.left().getOutput().stream()


This still does not distinguish duplicate outer-only rows when their distinguishing columns are not present in apply.left().getOutput(). For example, if dim has two rows with the same k and the outer query only outputs/uses d.k (no d.did or other unique column), the original scalar subquery is evaluated once per dim row, but this code partitions the window only by the visible d.k. The joined inner rows for both duplicate dim rows then land in the same window partition and the aggregate is multiplied, so predicates such as f.v * 2 > (select sum(f2.v) ... where f2.k = d.k) can incorrectly filter out rows. The new regression includes d.did in the select list, which makes this code include a distinguishing slot and misses this case. Please either carry/partition by all slots from the outer-only relation needed to preserve row identity, or make the rule return false when apply.left().getOutput() does not contain the full outer-only relation output.

fix(join): avoid join amplification in scalar subquery window rewrite

e69fe3e

github-actions Bot requested changes May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix](join): avoid join amplification in scalar subquery window rewrite#63763

[fix](join): avoid join amplification in scalar subquery window rewrite#63763
starocean999 wants to merge 1 commit into
apache:masterfrom
starocean999:master_0527

starocean999 commented May 27, 2026

Uh oh!

hello-stephen commented May 27, 2026

Uh oh!

starocean999 commented May 27, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

starocean999 commented May 27, 2026

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

hello-stephen commented May 27, 2026

Uh oh!

starocean999 commented May 27, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants