[SPARK-55502][PYTHON] Unify UDF and UDTF Arrow conversion error handling #54398

Closed
Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-55502/refactor/eliminate-is-udtf-flag

@Yicong-Huang
Contributor

@Yicong-Huang commented Feb 20, 2026

What changes were proposed in this pull request?

Remove the is_udtf parameter from PandasToArrowConversion.convert() and add a new is_legacy parameter to unify error handling for both UDF and UDTF Arrow conversions.

Key changes:

  • Removed is_udtf parameter and added is_legacy for clarity — it controls exception catch breadth and error message style, not UDTF-specific behavior
  • Removed the UDTF-specific error condition UDTF_ARROW_TYPE_CAST_ERROR and replaced it with the unified PySparkTypeError/PySparkValueError
  • Legacy path (broad ArrowException catch): keeps original "Exception thrown when converting pandas.Series..." error format
  • Non-legacy path (narrow ArrowInvalid catch): uses new user-friendly error messages, with separate messages for TypeError and ValueError
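
The difference between the two paths can be sketched roughly as follows. This is an illustration only, not the code in this PR: `run_conversion` is a hypothetical helper, the Arrow exception classes are local stand-ins that mirror pyarrow's hierarchy so the sketch runs without pyarrow, the built-in `TypeError`/`ValueError` stand in for `PySparkTypeError`/`PySparkValueError`, and catching `ArrowTypeError` on the non-legacy path is an assumption made so that both message styles can be shown.

```python
# Local stand-ins for pyarrow's exception hierarchy (assumption: used here
# only so that the sketch runs without pyarrow installed).
class ArrowException(Exception):
    """Stand-in for pyarrow.lib.ArrowException."""


class ArrowInvalid(ValueError, ArrowException):
    """Stand-in for pyarrow.lib.ArrowInvalid."""


class ArrowTypeError(TypeError, ArrowException):
    """Stand-in for pyarrow.lib.ArrowTypeError."""


class ArrowNotImplementedError(NotImplementedError, ArrowException):
    """Stand-in for pyarrow.lib.ArrowNotImplementedError."""


# Message templates copied from the before/after examples in this description.
LEGACY_MSG = (
    "Exception thrown when converting pandas.Series ({dtype}) "
    "with name '{name}' to Arrow Array ({arrow_type})."
)
TYPE_MSG = (
    "Cannot convert the output value of the column '{name}' with type "
    "'{dtype}' to the specified return type of the column: '{arrow_type}'. "
    "Please check if the data types match and try again."
)
VALUE_MSG = (
    "Failed to convert the value of the column '{name}' with type "
    "'{dtype}' to Arrow type '{arrow_type}'."
)


def run_conversion(convert, name, dtype, arrow_type, is_legacy):
    """Hypothetical helper: call convert() and rewrap Arrow failures."""
    ctx = {"name": name, "dtype": dtype, "arrow_type": arrow_type}
    # Legacy path: broad catch of every Arrow failure.
    # Non-legacy path: narrow catch; anything else propagates unchanged.
    caught = ArrowException if is_legacy else (ArrowInvalid, ArrowTypeError)
    try:
        return convert()
    except caught as exc:
        is_type_error = isinstance(exc, TypeError)
        if is_legacy:
            msg = LEGACY_MSG.format(**ctx)
        else:
            msg = (TYPE_MSG if is_type_error else VALUE_MSG).format(**ctx)
        # Built-in TypeError/ValueError stand in for the PySpark error classes.
        raise (TypeError if is_type_error else ValueError)(msg) from exc
```

In this sketch, a failure such as `ArrowNotImplementedError` is rewrapped with the legacy message when `is_legacy=True` but propagates unchanged when `is_legacy=False`, which is what "catch breadth" refers to.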

Why are the changes needed?

The UDTF-specific UDTF_ARROW_TYPE_CAST_ERROR error condition was unnecessary — the same conversion errors occur in both UDF and UDTF contexts. Unifying error handling provides:

  • Clearer parameter semantics
  • Simpler, more maintainable code
  • Consistent, user-friendly error messages across UDF/UDTF

Does this PR introduce any user-facing change?

Yes, error messages change for the non-legacy path (UDF/UDTF with spark.sql.legacy.execution.pythonUDTF.pandas.conversion.enabled=false):

TypeError (e.g. int → struct type mismatch):

Before:

PySparkTypeError: Exception thrown when converting pandas.Series (int64) with name 'x' to Arrow Array (struct<a: int32>).

After:

PySparkTypeError: Cannot convert the output value of the column 'x' with type 'int64' to the specified return type of the column: 'struct<a: int32>'. Please check if the data types match and try again.

ValueError (e.g. string → double value error):

Before:

PySparkValueError: Exception thrown when converting pandas.Series (object) with name 'val' to Arrow Array (double).

After:

PySparkValueError: Failed to convert the value of the column 'val' with type 'object' to Arrow type 'double'.

Legacy UDTF path error messages remain unchanged.

How was this patch tested?

Updated existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang force-pushed the SPARK-55502/refactor/eliminate-is-udtf-flag branch from 4b2718e to e250ebf on February 20, 2026 17:35
Contributor

@allisonwang-db left a comment

Can we add before and after error messages to the PR description?

@Yicong-Huang
Contributor Author

> Can we add before and after error messages to the PR description?

Thanks for the suggestion. I've added them in the PR description.

@Yicong-Huang force-pushed the SPARK-55502/refactor/eliminate-is-udtf-flag branch from aee750c to 0937f16 on March 3, 2026 23:26
Rename the `is_udtf` parameter to `is_legacy` to better reflect its
purpose: it controls the legacy UDTF pandas conversion path with
broader Arrow exception handling, not whether the caller is a UDTF.
@Yicong-Huang force-pushed the SPARK-55502/refactor/eliminate-is-udtf-flag branch 5 times, most recently from ceb1e2e to 31ae3fa on March 4, 2026 23:48
- Legacy path: keeps original "Exception thrown when converting
  pandas.Series..." format with PySparkTypeError/PySparkValueError
  (replaces UDTF_ARROW_TYPE_CAST_ERROR)
- Non-legacy path: new user-friendly "Cannot convert column '{name}'
  from {dtype} to {arrow_type}." format
- Remove UDTF_ARROW_TYPE_CAST_ERROR from error-conditions.json
@Yicong-Huang force-pushed the SPARK-55502/refactor/eliminate-is-udtf-flag branch from 31ae3fa to 0215fb8 on March 4, 2026 23:49
@Yicong-Huang
Contributor Author

@allisonwang-db Could you please take another look at this PR? Thanks. Also cc @zhengruifeng

@zhengruifeng
Contributor

merged to master
