[refine](pipeline) stop splitting distinct streaming agg output by batch size#63750
Open
Mryange wants to merge 1 commit into
Open
[refine](pipeline) stop splitting distinct streaming agg output by batch size#63750Mryange wants to merge 1 commit into
Mryange wants to merge 1 commit into
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
/review |
Contributor
There was a problem hiding this comment.
Review result: no blocking PR-introduced issues found in the distinct streaming aggregation change.
Critical checkpoint conclusions:
- Goal/test: The patch removes the cache-block split path so distinct streaming aggregation can return the full accumulated block after crossing batch_size. The updated BE unit test covers the new oversized output behavior and passthrough transition.
- Scope: The code change is small and focused on _cache_block removal and related test expectations.
- Concurrency: No new shared state, locks, dependencies, or thread interactions are introduced; operator local state remains task-local.
- Lifecycle: Removing _cache_block simplifies close/pull lifecycle. _aggregated_block is still cleared/reused through the existing Block swap/clear path.
- Configuration/compatibility: No new configs, formats, thrift fields, or mixed-version protocol concerns.
- Parallel paths: The related pull/push/need_more_input_data paths were checked; behavior is intentionally aligned with allowing a full accumulated block rather than splitting by batch_size.
- Conditions/error handling: Existing Status handling is preserved; no ignored new Status values.
- Tests: Unit coverage was updated for the changed behavior. I did not run tests in this review runner.
- Observability/performance: No new logging or metrics needed; output can now exceed batch_size by the current accumulated block as intended by the PR.
- Data correctness/transactions/persistence: No storage, transaction, or visibility-version paths are modified.
User focus: No additional user-provided review focus was present.
Contributor
Author
|
run buildall |
Contributor
TPC-H: Total hot run time: 32097 ms |
Contributor
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
Contributor
TPC-DS: Total hot run time: 172415 ms |
Contributor
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: N/A
Problem Summary:
DistinctStreamingAggOperatorX kept each output block at or below batch_size by splitting rows between _aggregated_block and an extra _cache_block, then replaying the cached tail on the next pull. Root cause: the operator coupled its accumulation threshold with a hard output-size cap. This change removes the cache path and appends all distinct rows directly into _aggregated_block. The operator still uses batch_size in need_more_input_data() to decide when to stop reading more input, but once one append crosses the threshold it now returns the full block instead of splitting it.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)