[fix](join): avoid join amplification in scalar subquery window rewrite#63763
[fix](join): avoid join amplification in scalar subquery window rewrite#63763starocean999 wants to merge 1 commit into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found a remaining correctness hole in the scalar-subquery window rewrite. The new partitioning fixes the case where an outer-only relation has a visible distinguishing column, but the rule can still fire when the distinguishing columns have been pruned from the Apply left output, leaving only the correlated key and still allowing join amplification.
Critical checkpoint conclusions:
- Goal/test proof: The goal is to avoid join amplification in the scalar-subquery-to-window rewrite. The added test proves one case with a visible
dim.did, but does not prove cases where duplicate outer-only rows are not distinguished by the visible output. - Scope/focus: The change is focused, but the implementation is incomplete for pruned duplicate-distinguishing columns.
- Concurrency/lifecycle: No new concurrency, locking, or lifecycle concerns identified.
- Configuration/compatibility: No new configs or incompatible serialization/storage changes identified in the actual PR diff.
- Parallel paths: The relevant parallel path is the same rewrite when column pruning leaves only correlated slots from the outer-only relation; it still has the original amplification problem.
- Tests/results: Existing new regression output is deterministic, but coverage should add the non-visible duplicate-column case or the rule should bail out when it cannot partition by the full outer-only relation identity.
- Observability/transactions/persistence: Not applicable.
- Performance: Partitioning by extra slots can cost more, but no blocking performance issue beyond the correctness concern.
User focus points: No additional user-provided review focus was present.
| .filter(node -> outerIds.contains(node.getTable().getId())) | ||
| .map(LogicalRelation.class::cast) | ||
| .map(LogicalRelation::getOutputExprIdSet).flatMap(Collection::stream).collect(Collectors.toSet()); | ||
| partitionBySlots.addAll(apply.left().getOutput().stream() |
There was a problem hiding this comment.
This still does not distinguish duplicate outer-only rows when their distinguishing columns are not present in apply.left().getOutput(). For example, if dim has two rows with the same k and the outer query only outputs/uses d.k (no d.did or other unique column), the original scalar subquery is evaluated once per dim row, but this code partitions the window only by the visible d.k. The joined inner rows for both duplicate dim rows then land in the same window partition and the aggregate is multiplied, so predicates such as f.v * 2 > (select sum(f2.v) ... where f2.k = d.k) can incorrectly filter out rows. The new regression includes d.did in the select list, which makes this code include a distinguishing slot and misses this case. Please either carry/partition by all slots from the outer-only relation needed to preserve row identity, or make the rule return false when apply.left().getOutput() does not contain the full outer-only relation output.
Problem Summary:
The rewrite in AggScalarSubQueryToWindowFunction.java partitioned the generated window function only by the correlated slots. That assumption is not sufficient when the outer query contains an additional relation whose rows are duplicated under the same correlated key. In that situation, multiple distinct outer rows are merged into the same window partition, so the rewritten window aggregate is evaluated over join-amplified rows rather than over the logical row set seen by the original correlated scalar subquery. The rule also kept mutable matching state on the rewriter instance, which needed to be reset per rewrite attempt.
Fix:
The fix changes the rewrite to partition the generated window function by all visible slots from the outer-only relation, so different outer rows that share the same correlated key remain isolated in separate window partitions. This preserves the original scalar-subquery semantics even when the outer plan has join-expanded rows.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)