[GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics#2946
Conversation
…et bbox metadata Adds Box2DLeafFilter, a new GeoParquetSpatialFilter variant that prunes files using the geometry column whose covering metadata references the Box2D column in the predicate. Both ST_BoxIntersects and ST_BoxContains push down to a file-level INTERSECTS check: per-row containment implies per-row intersection, which is only possible if the file's union envelope intersects the query box. Pushdown is sound (never produces false negatives) and well-targeted when the Box2D column is the registered GeoParquet 1.1 covering for a geometry column (auto-detected <geom>_bbox columns and explicit geoparquet.covering options both fit). Tests in GeoParquetSpatialFilterPushDownSuite cover: - ST_BoxIntersects with a Box2D query against a quadrant-partitioned dataset; correct region pruning for Q1-only, left-half, and fully-disjoint windows. - ST_BoxContains with a tiny query box inside Q1; pushes down as INTERSECTS at the file level. Pairs naturally with the deferred GeoParquet reader auto- materialization of bbox covering columns as Box2D (apache#2877). Closes apache#2938.
There was a problem hiding this comment.
Pull request overview
This PR extends Sedona’s GeoParquet spatial filter pushdown so Catalyst predicates on Box2D covering columns (ST_BoxIntersects / ST_BoxContains with a literal RHS) can be translated into a GeoParquet file-pruning filter evaluated against GeoParquet 1.1 metadata.
Changes:
- Add Box2D predicate recognition in
SpatialFilterPushDownForGeoParquetand translate to a newBox2DLeafFilter. - Introduce
GeoParquetSpatialFilter.Box2DLeafFilterto resolve Box2D covering → geometry column via GeoParquet metadata and prune files. - Add a GeoParquet pushdown test fixture and test cases for
ST_BoxIntersectsandST_BoxContainson ageom_bboxcovering column.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala | Recognizes ST_BoxIntersects/ST_BoxContains and emits a Box2D-targeted leaf filter. |
| spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetSpatialFilter.scala | Adds Box2DLeafFilter evaluation logic for pruning based on GeoParquet metadata. |
| spark/common/src/test/scala/org/apache/sedona/sql/GeoParquetSpatialFilterPushDownSuite.scala | Adds integration-style tests and fixture generation for Box2D covering-column pushdown. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Use the geometry column's recorded bbox to prune. The union of per-row Box2D values | ||
| // is a superset of the geometry column's bbox (covering boxes are at least as wide as | ||
| // their geometries), so if the geom-column bbox does not intersect the query box, no | ||
| // row's Box2D can intersect either. May leave some files unpruned when Box2D values | ||
| // are conservatively wider than geometries, but never produces false negatives. |
There was a problem hiding this comment.
Valid concern — I went through it again and you are right: when per-row Box2D > per-row geom envelope (which the spec permits), the union of Box2D values is a superset of the file geom bbox, and using the geom bbox to prune can produce false negatives. Addressed in 23829ab by:
- Documenting the assumption explicitly in the Box2DLeafFilter scaladoc — pushdown is sound when per-row Box2D = per-row geom envelope (true for ST_Box2D(geom)-derived columns, including the auto-generated
<geom>_bboxpath). - Adding an opt-out conf
spark.sedona.geoparquet.box2dFilterPushDown(default on). Users with non-minimal covering columns can disable it. - Filed Sound Box2D filter pushdown via Parquet column statistics #2949 to track the proper Parquet-column-statistics-based fix that uses the Box2D columns own xmin/ymin/xmax/ymax stats. Once that lands the opt-out can default off or be removed.
Sedonas own writer (#2886) produces exact envelopes via ST_Box2D(geom), so the common path is sound; the conservative-wider case mostly arises with files written by sedona-dbs Float32 writer where users would want this disabled until #2949.
| // Find the geometry column whose covering metadata points at this Box2D column. | ||
| val matchingGeomEntry = columns.find { case (_, field) => | ||
| field.covering.exists(_.bbox.xmin.headOption.contains(box2dColumnName)) | ||
| } | ||
|
|
||
| matchingGeomEntry match { | ||
| case Some((_, field)) => | ||
| // Use the geometry column's recorded bbox to prune. The union of per-row Box2D values | ||
| // is a superset of the geometry column's bbox (covering boxes are at least as wide as | ||
| // their geometries), so if the geom-column bbox does not intersect the query box, no | ||
| // row's Box2D can intersect either. May leave some files unpruned when Box2D values | ||
| // are conservatively wider than geometries, but never produces false negatives. | ||
| val bbox = field.bbox.getOrElse(return true) | ||
| if (bbox.isEmpty) return true | ||
| val fileXMin = bbox(0) | ||
| val fileYMin = bbox(1) | ||
| val fileXMax = bbox(2) | ||
| val fileYMax = bbox(3) | ||
| !(fileXMax < queryBox.getXMin || fileXMin > queryBox.getXMax | ||
| || fileYMax < queryBox.getYMin || fileYMin > queryBox.getYMax) | ||
| case None => | ||
| // No geometry column references this Box2D column as covering — cannot prune safely. | ||
| true |
There was a problem hiding this comment.
Fixed in 23829ab — find replaced with a collect+Seq(field) pattern that requires exactly one match. Multiple matches now fall back to keep-file rather than picking an arbitrary one.
| import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT | ||
| import org.apache.spark.sql.sedona_sql.expressions.{ST_AsEWKT, ST_Buffer, ST_Contains, ST_CoveredBy, ST_Covers, ST_Crosses, ST_DWithin, ST_Distance, ST_DistanceSphere, ST_DistanceSpheroid, ST_Equals, ST_Intersects, ST_OrderingEquals, ST_Overlaps, ST_Touches, ST_Within} | ||
| import org.apache.spark.sql.catalyst.InternalRow | ||
| import org.apache.spark.sql.sedona_sql.UDT.{Box2DUDT, GeometryUDT} |
1. Soundness caveat. The pushdown uses the file's geom-column bbox to prune, which is sound only when per-row Box2D = per-row geometry envelope. The GeoParquet spec permits conservatively wider covering bboxes (sedona-db does this via next_after Float32 rounding), in which case pruning can drop matching files. Documented the limitation, added an opt-out conf (spark.sedona.geoparquet.box2dFilterPushDown, default on), and flagged Parquet-column-statistics-based proper pruning as the follow-up. Sedona's own writer produces exact envelopes via ST_Box2D(geom), so the common path is sound. 2. Deterministic matching. columns.find returned an arbitrary match when multiple geometry columns referenced the same covering. Now collects all matches and requires exactly one; falls back to keep- file otherwise. 3. Removed unused Box2DUDT import in the pushdown rule. 4. Test correctness. verifyBox2DFilter now compares the pushed-down result against the same query with pushdown disabled, catching cases where pruning is over-aggressive.
| } | ||
| } | ||
|
|
||
| it("Push down ST_BoxContains against a Box2D covering column") { |
There was a problem hiding this comment.
Covered by the new reverse-arg-order test cases in 3c23666.
Round-2 review pointed out that default-on can silently produce wrong results for spec-compliant GeoParquet files with conservatively-wider covering columns. Flipping the default keeps correctness as the non-surprising baseline; users with Sedona-written files (where covering = exact envelope) opt in via spark.sedona.geoparquet.box2dFilterPushDown=true. Default flips back to true once the proper Parquet-column-statistics-based pruning lands in apache#2949. Tests: - Existing ST_BoxIntersects / ST_BoxContains tests now wrap their pushdown assertions in withConf(box2dFilterPushDown=true). - Added reverse-arg-order coverage: ST_BoxIntersects(lit_box, col) and ST_BoxContains(lit_box, col) should produce identical pruning to the col-first orientation. - Added a 'disabled by default' test asserting no pushdown is emitted without the opt-in conf.
Replace the file-metadata-based Box2DLeafFilter (which pruned using the geometry column's bbox — unsound when GeoParquet 1.1 coverings are conservatively wider than per-row envelopes) with a translation to a Parquet FilterPredicate over the Box2D column's own (xmin, ymin, xmax, ymax) leaf statistics. - GeoParquetSpatialFilter gains a toParquetFilter hook. AndFilter/OrFilter combine children's predicates; LeafFilter returns None. - Box2DLeafFilter carries a Box2DPredicateKind (Intersects / ColumnContainsLiteral / LiteralContainsColumn) and emits the matching conjunction of four double-column inequalities. evaluate() returns true because pruning now happens at the Parquet layer. - GeoParquetFileFormat ANDs the spatial Parquet predicate into the existing pushed FilterPredicate so Parquet's stats-based skipping prunes row groups whose bounds disprove the predicate. - Pushdown is sound for any writer, so the opt-in spark.sedona.geoparquet.box2dFilterPushDown conf is removed.
| // comparisons on a Box2D-typed column) are AND'd into the pushed-down filter so Parquet | ||
| // can skip row groups whose column statistics disprove them. | ||
| val combinedPushed = spatialFilter.flatMap(_.toParquetFilter) match { | ||
| case Some(spatialPredicate) => | ||
| Some(pushed.fold(spatialPredicate)(p => FilterApi.and(p, spatialPredicate))) | ||
| case None => pushed | ||
| } |
There was a problem hiding this comment.
Fixed in 0d69d50 — gated the spatial Parquet predicate on enableParquetFilterPushDown, so disabling spark.sql.parquet.filterPushdown also disables Sedona-injected row-group predicates.
| */ | ||
| private def extractBox2DLiteral(value: Any): Option[Box2D] = value match { | ||
| case row: InternalRow if row.numFields == 4 => | ||
| Some(new Box2D(row.getDouble(0), row.getDouble(1), row.getDouble(2), row.getDouble(3))) |
There was a problem hiding this comment.
Fixed in 0d69d50 — extractBox2DLiteral now returns None for inverted bounds, so the predicate falls back to per-row evaluation and surfaces the IllegalArgumentException from Predicates.requireOrderedPlanarBox. Added a test that asserts the predicate is not pushed and that collect() raises IllegalArgumentException in the cause chain.
| // Box2D predicates push down as Parquet row-group filters on the Box2D column's underlying | ||
| // (xmin, ymin, xmax, ymax) double leaves. Pruning is done by Parquet's stats-based skipping | ||
| // against the column's recorded min/max, which is sound regardless of how the writer chose | ||
| // the per-row Box2D values. | ||
| case ST_BoxIntersects(_) => | ||
| // Intersects is symmetric — both argument orders produce the same predicate. |
There was a problem hiding this comment.
The PR title and body were updated after this review fired — the description no longer references the removed opt-in conf or geometry-bbox metadata pruning, and now documents the Parquet row-group statistics path that is actually implemented.
- Gate spatial Parquet predicate injection on `enableParquetFilterPushDown` so disabling `spark.sql.parquet.filterPushdown` also disables Sedona- injected row-group predicates (otherwise the spatial path bypassed the user's intent to turn parquet pushdown off). - Reject inverted-bound Box2D literals in extractBox2DLiteral so the predicate falls back to runtime evaluation and surfaces the expected IllegalArgumentException; otherwise Parquet's row-group filter could prune all matches and hide the throw. - Add a test locking in the inverted-bound fallback: no pushdown attached, and collect() surfaces an IllegalArgumentException in the cause chain.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes Filter pushdown for ST_BoxIntersects / ST_BoxContains to GeoParquet bbox covering columns #2938. Also closes Sound Box2D filter pushdown via Parquet column statistics #2949 (the sound-pushdown follow-up — this PR implements it directly).What changes were proposed in this PR?
Teaches
SpatialFilterPushDownForGeoParquetto recognize the Box2D predicates from #2926 and translate them into a ParquetFilterPredicateover the Box2D column's underlying(xmin, ymin, xmax, ymax)leaf columns. Parquet's row-group statistics machinery then skips row groups whose per-column min/max disprove the predicate.How the pushdown works
The Box2D UDT serializes to a Spark struct of four non-nullable
Doublefields (xmin,ymin,xmax,ymax), so a Box2D-typed column in Parquet becomes a nested group with four leaf columns. Parquet records standard min/max statistics per row group on each of those leaves, which is exactly what's needed to prune.For each recognized predicate, the rule emits four inequalities AND'd together:
ST_BoxIntersects(box_col, lit)box.xmax >= lit.xmin AND box.xmin <= lit.xmax AND box.ymax >= lit.ymin AND box.ymin <= lit.ymax(symmetric — reverse arg order is the same)ST_BoxContains(box_col, lit)box.xmin <= lit.xmin AND box.xmax >= lit.xmax AND box.ymin <= lit.ymin AND box.ymax >= lit.ymaxST_BoxContains(lit, box_col)box.xmin >= lit.xmin AND box.xmax <= lit.xmax AND box.ymin >= lit.ymin AND box.ymax <= lit.ymaxGeoParquetFileFormat.buildReaderWithPartitionValuesANDs this with any existing pushed-down Parquet predicate from Spark's data-source filters before callingParquetInputFormat.setFilterPredicate. Parquet then handles row-group skipping (and, ifparquet.filter.record-level.enabledis on, per-record filtering) via its own well-tested machinery.Why this is sound
Pruning operates on the Box2D column's actual stored values' min/max, not on a separate geometry column's bbox. This holds regardless of how the writer chose those values — Sedona's
ST_Box2D(geom)produces exact per-row envelopes, but a writer that emits conservatively wider coverings (e.g., sedona-db's Float32 path that rounds outward vianext_after) is still handled correctly because the Box2D column's own statistics describe whatever it stored.This is why the previous file-metadata pruning approach is replaced wholesale: it was unsound for any writer that doesn't satisfy
Box2D_per_row == geom_envelope_per_row. The Parquet-stats path removes that requirement.API changes
GeoParquetSpatialFiltergains atoParquetFilter: Option[FilterPredicate]hook. The default isNone.AndFilterandOrFiltercombine children's predicates (OR requires both sides to translate).Box2DLeafFilternow carries aBox2DPredicateKind(Intersects/ColumnContainsLiteral/LiteralContainsColumn) and implementstoParquetFilterdirectly. Itsevaluatereturnstrue— file-metadata-level pruning is no longer used for Box2D predicates.GeoParquetFileFormatANDs the spatial Parquet predicate intopushedbefore handing it to Parquet.spark.sedona.geoparquet.box2dFilterPushDownfrom the earlier commits is removed; the master togglespark.sedona.geoparquet.spatialFilterPushDown(defaulttrue) continues to gate the rule.Pairs naturally with
The deferred GeoParquet reader auto-materialization of bbox covering columns as
Box2D(#2877 follow-up). When that lands,WHERE ST_BoxIntersects(box_col, lit(b))becomes the canonical bbox-pruned read path — the typed column comes from disk, the predicate prunes the disk read.How was this patch tested?
GeoParquetSpatialFilterPushDownSuite:ST_BoxContains(col, lit)with a tiny query inside Q1, plus reverse-arg-orderST_BoxContains(lit, col).The shared
verifyBox2DFilterhelper asserts the spatial filter is attached, thattoParquetFilteris non-empty, and that the pushdown-on result matches a baseline run with the rule disabled (spark.sedona.geoparquet.spatialFilterPushDown=false) — guarding against dropping matching rows.What's not in scope
ST_BoxIntersects(box_a, box_b)between two columns) — that's the spatial join planner work in Spatial join planner: recognize ST_BoxIntersects / ST_BoxContains as join predicates #2939.ST_Intersects(geom, lit)translating to Parquet predicates via the geometry column's covering bbox — distinct from Box2D pushdown; not addressed here.Did this PR include necessary documentation updates?