feat(velox): Support RESPECT NULLS for collect_list/collect_set#11837
feat(velox): Support RESPECT NULLS for collect_list/collect_set#11837yaooqinn wants to merge 2 commits intoapache:mainfrom
Conversation
Add ignoreNulls parameter to VeloxCollectList/VeloxCollectSet to support Spark's RESPECT NULLS syntax (SPARK-55256). When ignoreNulls=false, null elements are included in the collected array. - VeloxCollect: conditionally skip nulls based on ignoreNulls parameter - CollectRewriteRule: propagate ignoreNulls from Spark's CollectList/CollectSet via reflection (backward-compatible with Spark versions without ignoreNulls) - ArrayType containsNull reflects the ignoreNulls setting Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@yaooqinn thanks, could you please also verify on a new veox branch with patches for collect_list/set - in the current branch the collect_set commit is actually reverted https://github.com/IBM/velox/commits/dft-2026_03_24/ |
|
Attempted local verification with Velox patches (collect_set from upstream main + collect_list from PR #16933). The Velox native lib (libvelox.a) builds successfully with both patches. However, full Gluten build cannot complete locally due to pre-existing Scala version mismatch in cached Maven artifacts (unrelated to this PR). The VeloxCollect changes are purely Scala/JVM-side (DeclarativeAggregate), so they're independent of the Velox native collect_set/collect_list functions. CI will provide the definitive verification. @zhouyuan The native Velox patches for collect_set/collect_list add |
|
@yaooqinn not really I just made a test with your patch + velox changes, the spark unit tests are failing https://github.com/apache/gluten/actions/runs/23652189825/job/68903134168?pr=11712 |
Root Cause AnalysisThanks @zhouyuan for catching this! I traced the issue: What changedBefore PR #16416: After PR #16416: The mismatchGluten's
Fix options
I'll fix this in the Velox PR #16416 (or a follow-up) since the root cause is the signature change creating a suffix-based companion registration. |
|
@yaooqinn I assume we should adapt to the signature change in Gluten (see |
…t_set/list When aggregate functions have multiple signatures with the same intermediate type (e.g., collect_set with 1-arg and 2-arg signatures), Velox registers companion functions with suffix using generic type variables (e.g., collect_set_merge_extract_array_T). The Substrait layer was constructing concrete type suffixes (e.g., array_row_VARCHAR_BIGINT_BIGINT_endrow) that don't match. Fix: After failing exact concrete suffix lookup, fall back to discovering companion function names via getCompanionFunctionSignatures() API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed fix: Updated Change: After failing the concrete type suffix lookup (e.g., The file compiles cleanly against the existing Velox EP. @zhouyuan could you verify with your test setup? |
Verified ✅Tested locally against The C++ Substrait fix (
|
| auto entrySigs = | ||
| exec::getAggregateFunctionSignatures(entry.functionName); | ||
| if (entrySigs.has_value() && entrySigs.value().size() > 0) { | ||
| return entry.functionName; |
There was a problem hiding this comment.
When two merge-extract functions share the same intermediate type, this approach cannot guarantee that the correct function will be returned. We need to fix the companionFunctionSuffix, which needs to be fully compatible with Velox’s std::string toSuffixString(const TypeSignature& type) function. Currently, we are encountering fallback issue because Gluten appears to generate a function name that differs from the one Velox uses. It looks like Gluten is generating collect_set_merge_extract_array_row_VARCHAR_BIGINT_BIGINT_endrow while Velox uses collect_set_merge_extract_array_T.
rui-mo
left a comment
There was a problem hiding this comment.
Also consider adding test in https://github.com/apache/gluten/blob/main/backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenFunctionValidateSuite.scala to ensure the two functions can be offloaded to native.
|
@yaooqinn thanks for the efforts, looks like dont not work: The IBM/velox dft-2026_03_24 actually reverted the collect_set patch, could you please have a try with this branch: |
What changes were proposed in this pull request?
Add
ignoreNullsparameter toVeloxCollectList/VeloxCollectSetto support Spark's RESPECT NULLS syntax (SPARK-55256). WhenignoreNulls=false, null elements are included in the collected array.Changes:
VeloxCollectSet/VeloxCollectList: AcceptignoreNullsparameter (defaulttruefor backward compatibility).CollectRewriteRule: PropagatesignoreNullsfrom Spark'sCollectList/CollectSetusing reflection, making it backward-compatible with Spark versions that don't haveignoreNulls.Related:
How was this patch tested?
CI validation. The default behavior (
ignoreNulls=true) is unchanged, ensuring backward compatibility.