[GH-2927] Add geometry ↔ Box2D Catalyst cast#2952
Merged
Merged
Conversation
Spark's `Cast.canCast` rejects arbitrary UDT-to-UDT casts, so today users have to call `ST_Box2D(geom)` / `ST_GeomFromBox2D(box)` explicitly. This PR teaches Catalyst to handle the conversions as ordinary SQL casts. Implementation: - New `Box2DCastResolutionRule` analyzer rule rewrites `Cast(geom, Box2DUDT)` → `ST_Box2D(geom)` and `Cast(box, GeometryUDT)` → `ST_GeomFromBox2D(box)`. It runs before `CheckAnalysis`, so the downstream optimizer/codegen see the rewritten expression and never observe the rejected Cast. - Registered via `SparkSessionExtensions.injectResolutionRule` from `SedonaSqlExtensions`. - `SedonaSqlAstBuilder.visitPrimitiveDataType` now recognizes `BOX2D` as a type keyword across all supported Spark versions, so SQL `CAST(... AS box2d)` parses to `Box2DUDT`. Tests: - Unit test of the rule (rewrite for both directions, no-op for unrelated casts). - SQL end-to-end suite under spark-3.5 covering SQL CAST in both directions, DataFrame `.cast(Box2DUDT)` / `.cast(GeometryUDT())`, round-trip Geometry → Box2D → Geometry, and NULL propagation. Closes apache#2927.
Two follow-ups on the cast suite: - The SQL `CAST(... AS box2d)` / `CAST(... AS geometry)` syntax requires Sedona's `SedonaSqlAstBuilder` to be the active parser. The test base randomizes `spark.sedona.enableParserExtension` across CI runs, so the SQL-level tests now `assume(parserExtensionEnabled)` and are skipped (rather than failed) when the stock Spark parser is in effect. The DataFrame `.cast(...)` tests run unconditionally because the resolution rule is always injected. - The polygon vertex order produced by `ST_GeomFromBox2D` walks the envelope as (xmin,ymin) → (xmin,ymax) → (xmax,ymax) → (xmax,ymin); the expected WKTs are corrected to match. - The suite is now replicated across spark-3.4 / 3.5 / 4.0 / 4.1 so each supported Spark version exercises the SQL + DataFrame surfaces of the geometry↔Box2D cast.
The previous gate read `spark.sedona.enableParserExtension` from `sparkSession.conf`, but `SparkContext` is JVM-singleton, so the session-level config in this suite can differ from the value `SedonaSqlExtensions` actually saw when it decided whether to inject `SedonaSqlAstBuilder`. CI repeatedly hit a mismatch where the assume returned true while the active parser was Spark's stock one, causing the SQL CAST tests to fail with ParseException. Switch to a behavioral probe: try to parse a tiny `CAST(... AS box2d)` SELECT once and cache the outcome. This matches what the SQL tests actually depend on regardless of which config layer holds the truth. DataFrame `.cast(...)` tests remain unconditional. Verified locally for spark-3.4 / 3.5 across both parser states (succeeded 8, canceled 0) and (succeeded 4, canceled 4).
Contributor
There was a problem hiding this comment.
Pull request overview
Teaches Catalyst to handle CAST between GeometryUDT and Box2DUDT (in both directions) so users can write CAST(geom AS box2d) / CAST(box AS geometry) (SQL) or col.cast(Box2DUDT) / col.cast(GeometryUDT()) (DataFrame) instead of explicit ST_Box2D / ST_GeomFromBox2D calls.
Changes:
- New analyzer rule
Box2DCastResolutionRulerewrites the twoCastshapes into the existing Sedona expressions, registered viaSedonaSqlExtensions.injectResolutionRule. SedonaSqlAstBuilder.visitPrimitiveDataTypenow also acceptsBOX2Das a primitive type keyword, across spark-3.4/3.5/4.0/4.1.- Adds a unit test for the rule and an end-to-end
Box2DCastSuite(replicated across all four Spark version modules) covering SQL cast, DataFrame.cast, round-trip, and NULL propagation.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| spark/common/.../optimization/Box2DCastResolutionRule.scala | New analyzer rule rewriting Cast↔UDT pairs into ST_Box2D / ST_GeomFromBox2D. |
| spark/common/.../SedonaSqlExtensions.scala | Injects the new resolution rule. |
| spark/common/.../expressions/Constructors.scala | Updates ST_GeomFromBox2D doc to mention the new cast form. |
| spark/common/.../Box2DCastResolutionRuleSuite.scala | Unit tests for the rewrite (both directions + no-op). |
| spark/spark-{3.4,3.5,4.0,4.1}/.../parser/SedonaSqlAstBuilder.scala | Recognize BOX2D keyword in CAST type position. |
| spark/spark-{3.4,3.5,4.0,4.1}/.../Box2DCastSuite.scala | End-to-end cast tests (DataFrame + SQL, NULL handling, round-trip). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes Implicit Catalyst cast: geometry -> Box2D #2927.What changes were proposed in this PR?
Spark's
Cast.canCastrejects arbitrary UDT-to-UDT casts, so today users have to callST_Box2D(geom)/ST_GeomFromBox2D(box)explicitly. This PR teaches Catalyst to handle the conversions as ordinary SQL casts:CAST(geom AS box2d)andCAST(box AS geometry)parse and execute.col.cast(Box2DUDT)andcol.cast(GeometryUDT())work the same way.How it works
A new analyzer rule
Box2DCastResolutionRulerewrites Cast nodes during resolution, beforeCheckAnalysisruns:Cast(geom, Box2DUDT)ST_Box2D(geom)Cast(box, GeometryUDT)ST_GeomFromBox2D(box)Because the rewrite happens before
CheckAnalysis, Catalyst never observes the rejected Cast — the downstream optimizer and codegen path see the same expression tree that the explicitST_Box2D(geom)/ST_GeomFromBox2D(box)constructs already produce. No new code-gen path, nocanCastchange, no extra null/eval handling.The rule is registered via
SparkSessionExtensions.injectResolutionRulefromSedonaSqlExtensions, so it activates whenever users wire Sedona throughspark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions(the standard install).SQL type keyword
SedonaSqlAstBuilder.visitPrimitiveDataTypealready recognizedGEOMETRYas a type keyword. This PR addsBOX2Dalongside it, across all supported Spark versions (3.4 / 3.5 / 4.0 / 4.1). Without it, the SQL parser would rejectCAST(... AS box2d)before any analyzer rule could fire.Scope notes
Implicit type coercion in function dispatch (e.g. passing a Geometry directly into a Box2D-typed function argument without an explicit cast) is intentionally out of scope here — it requires hooking into Catalyst's type coercion rules and is tracked as a follow-up.
How was this patch tested?
Box2DCastResolutionRuleSuite(spark/common): unit test of the rewrite for both directions, plus a no-op assertion for unrelated casts.Box2DCastSuite(spark-3.5): end-to-end tests covering SQLCAST(both directions), DataFrame.cast(Box2DUDT)/.cast(GeometryUDT()), the round-tripGeometry → Box2D → Geometry, and NULL propagation.Did this PR include necessary documentation updates?