Skip to content

[fix](join): avoid join amplification in scalar subquery window rewrite#63763

Draft
starocean999 wants to merge 1 commit into
apache:masterfrom
starocean999:master_0527
Draft

[fix](join): avoid join amplification in scalar subquery window rewrite#63763
starocean999 wants to merge 1 commit into
apache:masterfrom
starocean999:master_0527

Conversation

@starocean999
Copy link
Copy Markdown
Contributor

Problem Summary:
The rewrite in AggScalarSubQueryToWindowFunction.java partitioned the generated window function only by the correlated slots. That assumption is not sufficient when the outer query contains an additional relation whose rows are duplicated under the same correlated key. In that situation, multiple distinct outer rows are merged into the same window partition, so the rewritten window aggregate is evaluated over join-amplified rows rather than over the logical row set seen by the original correlated scalar subquery. The rule also kept mutable matching state on the rewriter instance, which needed to be reset per rewrite attempt.

Fix:
The fix changes the rewrite to partition the generated window function by all visible slots from the outer-only relation, so different outer rows that share the same correlated key remain isolated in separate window partitions. This preserves the original scalar-subquery semantics even when the outer plan has join-expanded rows.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@starocean999
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a remaining correctness hole in the scalar-subquery window rewrite. The new partitioning fixes the case where an outer-only relation has a visible distinguishing column, but the rule can still fire when the distinguishing columns have been pruned from the Apply left output, leaving only the correlated key and still allowing join amplification.

Critical checkpoint conclusions:

  • Goal/test proof: The goal is to avoid join amplification in the scalar-subquery-to-window rewrite. The added test proves one case with a visible dim.did, but does not prove cases where duplicate outer-only rows are not distinguished by the visible output.
  • Scope/focus: The change is focused, but the implementation is incomplete for pruned duplicate-distinguishing columns.
  • Concurrency/lifecycle: No new concurrency, locking, or lifecycle concerns identified.
  • Configuration/compatibility: No new configs or incompatible serialization/storage changes identified in the actual PR diff.
  • Parallel paths: The relevant parallel path is the same rewrite when column pruning leaves only correlated slots from the outer-only relation; it still has the original amplification problem.
  • Tests/results: Existing new regression output is deterministic, but coverage should add the non-visible duplicate-column case or the rule should bail out when it cannot partition by the full outer-only relation identity.
  • Observability/transactions/persistence: Not applicable.
  • Performance: Partitioning by extra slots can cost more, but no blocking performance issue beyond the correctness concern.

User focus points: No additional user-provided review focus was present.

.filter(node -> outerIds.contains(node.getTable().getId()))
.map(LogicalRelation.class::cast)
.map(LogicalRelation::getOutputExprIdSet).flatMap(Collection::stream).collect(Collectors.toSet());
partitionBySlots.addAll(apply.left().getOutput().stream()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still does not distinguish duplicate outer-only rows when their distinguishing columns are not present in apply.left().getOutput(). For example, if dim has two rows with the same k and the outer query only outputs/uses d.k (no d.did or other unique column), the original scalar subquery is evaluated once per dim row, but this code partitions the window only by the visible d.k. The joined inner rows for both duplicate dim rows then land in the same window partition and the aggregate is multiplied, so predicates such as f.v * 2 > (select sum(f2.v) ... where f2.k = d.k) can incorrectly filter out rows. The new regression includes d.did in the select list, which makes this code include a distinguishing slot and misses this case. Please either carry/partition by all slots from the outer-only relation needed to preserve row identity, or make the rule return false when apply.left().getOutput() does not contain the full outer-only relation output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants