[feat][cp] Add merge source start processing mechanism for pyspark-velox#405

Merged
guhaiyan0221 merged 1 commit into bytedance:main from guhaiyan0221:fix_cp_mergesource_start
Mar 18, 2026

@guhaiyan0221
Collaborator

What problem does this PR solve?

Issue Number: close #191

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Summary:
Extend MergeSource with a start control mechanism that implements lazy source start, which caps the local merge source memory usage. The merge operator needs to call start on each MergeSource to signal the producer to begin source processing. Each source producer, a CallbackSink operator, checks in its isBlocked method whether the corresponding merge source has been started. The source start signal is expected to be exchanged only once. A unit test is added to verify this behavior. This PR also moves the producer/consumer signaling out of locks to prevent a potential deadlock, plus some code cleanup in the relevant code path.

The follow-up is to add recursive spill on top of the lazy start mechanism to cap the merge source memory usage when there are a large number of sources, such as in the pyspark-velox use case.
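
For reviewers who want a mental model of the handshake described above, here is a minimal sketch. It is not the code in this PR: `LazyMergeSource`, `DemoResult`, and `runLazyStartDemo` are hypothetical names, and `std::function` callbacks stand in for whatever future/promise types the engine actually uses. It assumes a second `start` call is a harmless no-op, which is only one way to enforce the "once" expectation.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a merge source with lazy start (not the real API).
class LazyMergeSource {
 public:
  using Callback = std::function<void()>;

  // Consumer side (merge operator): tell the producer it may begin.
  // Assumption: repeated calls are ignored so the signal fires only once.
  void start() {
    std::vector<Callback> waiters;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      if (started_) {
        return;
      }
      started_ = true;
      waiters.swap(waiters_);
    }
    // Wake producers only after the lock is released, so a callback that
    // re-enters this object cannot deadlock on mutex_.
    for (auto& wake : waiters) {
      wake();
    }
  }

  // Producer side (sink operator): returns true while the source has not been
  // started and registers a callback to be invoked when start() is called.
  bool isBlocked(Callback onStart) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (started_) {
      return false;
    }
    waiters_.push_back(std::move(onStart));
    return true;
  }

 private:
  std::mutex mutex_;
  bool started_{false};
  std::vector<Callback> waiters_;
};

struct DemoResult {
  bool blockedBeforeStart;
  bool blockedAfterStart;
  int wakeups;
};

// Walks through the handshake once on a single thread.
DemoResult runLazyStartDemo() {
  LazyMergeSource source;
  int wakeups = 0;
  DemoResult result{};
  result.blockedBeforeStart = source.isBlocked([&wakeups] { ++wakeups; });
  source.start();
  source.start();  // Second call is a no-op under the assumption above.
  result.blockedAfterStart = source.isBlocked([&wakeups] { ++wakeups; });
  result.wakeups = wakeups;
  return result;
}
```

In this sketch the producer is blocked until the merge operator calls `start`, is woken exactly once, and is never blocked by the start gate afterwards. Collecting the waiters under the lock and invoking them after releasing it is the same idea as moving the producer/consumer signaling out of locks.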

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:
- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@guhaiyan0221 guhaiyan0221 requested a review from Weixin-Xu March 17, 2026 15:26
@guhaiyan0221 guhaiyan0221 force-pushed the fix_cp_mergesource_start branch 2 times, most recently from b74cede to 294f711 Compare March 18, 2026 07:26
Summary:
Corresponding PR: facebookincubator/velox#13139
@guhaiyan0221 guhaiyan0221 force-pushed the fix_cp_mergesource_start branch from 294f711 to 0807ab1 Compare March 18, 2026 10:34
@guhaiyan0221 guhaiyan0221 changed the title [feat][cp] dd merge source start processing mechanism for pyspark-velox [feat][cp] Add merge source start processing mechanism for pyspark-velox Mar 18, 2026
Collaborator

@Weixin-Xu Weixin-Xu left a comment

LGTM

@guhaiyan0221 guhaiyan0221 added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026
@guhaiyan0221 guhaiyan0221 added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026
@guhaiyan0221 guhaiyan0221 added this pull request to the merge queue Mar 18, 2026
Merged via the queue into bytedance:main with commit c1640e4 Mar 18, 2026
7 checks passed
2 participants