[GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics by jiayuasu · Pull Request #2946 · apache/sedona

jiayuasu · 2026-05-11T03:29:20Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes Filter pushdown for ST_BoxIntersects / ST_BoxContains to GeoParquet bbox covering columns #2938. Also closes Sound Box2D filter pushdown via Parquet column statistics #2949 (the sound-pushdown follow-up — this PR implements it directly).

What changes were proposed in this PR?

Teaches SpatialFilterPushDownForGeoParquet to recognize the Box2D predicates from #2926 and translate them into a Parquet FilterPredicate over the Box2D column's underlying (xmin, ymin, xmax, ymax) leaf columns. Parquet's row-group statistics machinery then skips row groups whose per-column min/max disprove the predicate.

How the pushdown works

The Box2D UDT serializes to a Spark struct of four non-nullable Double fields (xmin, ymin, xmax, ymax), so a Box2D-typed column in Parquet becomes a nested group with four leaf columns. Parquet records standard min/max statistics per row group on each of those leaves, which is exactly what's needed to prune.

For each recognized predicate, the rule emits four inequalities AND'd together:

Predicate	Pushed-down conjunction (per row)
`ST_BoxIntersects(box_col, lit)`	`box.xmax >= lit.xmin AND box.xmin <= lit.xmax AND box.ymax >= lit.ymin AND box.ymin <= lit.ymax` (symmetric — reverse arg order is the same)
`ST_BoxContains(box_col, lit)`	`box.xmin <= lit.xmin AND box.xmax >= lit.xmax AND box.ymin <= lit.ymin AND box.ymax >= lit.ymax`
`ST_BoxContains(lit, box_col)`	`box.xmin >= lit.xmin AND box.xmax <= lit.xmax AND box.ymin >= lit.ymin AND box.ymax <= lit.ymax`

GeoParquetFileFormat.buildReaderWithPartitionValues ANDs this with any existing pushed-down Parquet predicate from Spark's data-source filters before calling ParquetInputFormat.setFilterPredicate. Parquet then handles row-group skipping (and, if parquet.filter.record-level.enabled is on, per-record filtering) via its own well-tested machinery.

Why this is sound

Pruning operates on the Box2D column's actual stored values' min/max, not on a separate geometry column's bbox. This holds regardless of how the writer chose those values — Sedona's ST_Box2D(geom) produces exact per-row envelopes, but a writer that emits conservatively wider coverings (e.g., sedona-db's Float32 path that rounds outward via next_after) is still handled correctly because the Box2D column's own statistics describe whatever it stored.

This is why the previous file-metadata pruning approach is replaced wholesale: it was unsound for any writer that doesn't satisfy Box2D_per_row == geom_envelope_per_row. The Parquet-stats path removes that requirement.

API changes

GeoParquetSpatialFilter gains a toParquetFilter: Option[FilterPredicate] hook. The default is None. AndFilter and OrFilter combine children's predicates (OR requires both sides to translate).
Box2DLeafFilter now carries a Box2DPredicateKind (Intersects / ColumnContainsLiteral / LiteralContainsColumn) and implements toParquetFilter directly. Its evaluate returns true — file-metadata-level pruning is no longer used for Box2D predicates.
GeoParquetFileFormat ANDs the spatial Parquet predicate into pushed before handing it to Parquet.
The opt-in conf spark.sedona.geoparquet.box2dFilterPushDown from the earlier commits is removed; the master toggle spark.sedona.geoparquet.spatialFilterPushDown (default true) continues to gate the rule.

Pairs naturally with

The deferred GeoParquet reader auto-materialization of bbox covering columns as Box2D (#2877 follow-up). When that lands, WHERE ST_BoxIntersects(box_col, lit(b)) becomes the canonical bbox-pruned read path — the typed column comes from disk, the predicate prunes the disk read.

How was this patch tested?

GeoParquetSpatialFilterPushDownSuite:

"Push down ST_BoxIntersects against a Box2D column" — Q1-only, left-half, fully-disjoint, and reverse-arg-order query windows.
"Push down ST_BoxContains against a Box2D column" — ST_BoxContains(col, lit) with a tiny query inside Q1, plus reverse-arg-order ST_BoxContains(lit, col).

The shared verifyBox2DFilter helper asserts the spatial filter is attached, that toParquetFilter is non-empty, and that the pushdown-on result matches a baseline run with the rule disabled (spark.sedona.geoparquet.spatialFilterPushDown=false) — guarding against dropping matching rows.

What's not in scope

Two-sided pushdown (ST_BoxIntersects(box_a, box_b) between two columns) — that's the spatial join planner work in Spatial join planner: recognize ST_BoxIntersects / ST_BoxContains as join predicates #2939.
Mixed Box2D / Geometry predicates — waits on the implicit cast in Implicit Catalyst cast: geometry -> Box2D #2927 or explicit mixed overloads.
ST_Intersects(geom, lit) translating to Parquet predicates via the geometry column's covering bbox — distinct from Box2D pushdown; not addressed here.

Did this PR include necessary documentation updates?

No, this PR does not affect any public SQL API documentation surface in isolation. Documentation lands with the consolidated Phase 1+2+3 Box2D docs update.

…et bbox metadata Adds Box2DLeafFilter, a new GeoParquetSpatialFilter variant that prunes files using the geometry column whose covering metadata references the Box2D column in the predicate. Both ST_BoxIntersects and ST_BoxContains push down to a file-level INTERSECTS check: per-row containment implies per-row intersection, which is only possible if the file's union envelope intersects the query box. Pushdown is sound (never produces false negatives) and well-targeted when the Box2D column is the registered GeoParquet 1.1 covering for a geometry column (auto-detected <geom>_bbox columns and explicit geoparquet.covering options both fit). Tests in GeoParquetSpatialFilterPushDownSuite cover: - ST_BoxIntersects with a Box2D query against a quadrant-partitioned dataset; correct region pruning for Q1-only, left-half, and fully-disjoint windows. - ST_BoxContains with a tiny query box inside Q1; pushes down as INTERSECTS at the file level. Pairs naturally with the deferred GeoParquet reader auto- materialization of bbox covering columns as Box2D (apache#2877). Closes apache#2938.

Copilot

Pull request overview

This PR extends Sedona’s GeoParquet spatial filter pushdown so Catalyst predicates on Box2D covering columns (ST_BoxIntersects / ST_BoxContains with a literal RHS) can be translated into a GeoParquet file-pruning filter evaluated against GeoParquet 1.1 metadata.

Changes:

Add Box2D predicate recognition in SpatialFilterPushDownForGeoParquet and translate to a new Box2DLeafFilter.
Introduce GeoParquetSpatialFilter.Box2DLeafFilter to resolve Box2D covering → geometry column via GeoParquet metadata and prune files.
Add a GeoParquet pushdown test fixture and test cases for ST_BoxIntersects and ST_BoxContains on a geom_bbox covering column.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala	Recognizes `ST_BoxIntersects`/`ST_BoxContains` and emits a Box2D-targeted leaf filter.
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetSpatialFilter.scala	Adds `Box2DLeafFilter` evaluation logic for pruning based on GeoParquet metadata.
spark/common/src/test/scala/org/apache/sedona/sql/GeoParquetSpatialFilterPushDownSuite.scala	Adds integration-style tests and fixture generation for Box2D covering-column pushdown.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jiayuasu · 2026-05-11T06:25:19Z

+          // Use the geometry column's recorded bbox to prune. The union of per-row Box2D values
+          // is a superset of the geometry column's bbox (covering boxes are at least as wide as
+          // their geometries), so if the geom-column bbox does not intersect the query box, no
+          // row's Box2D can intersect either. May leave some files unpruned when Box2D values
+          // are conservatively wider than geometries, but never produces false negatives.


Valid concern — I went through it again and you are right: when per-row Box2D > per-row geom envelope (which the spec permits), the union of Box2D values is a superset of the file geom bbox, and using the geom bbox to prune can produce false negatives. Addressed in 23829ab by:

Documenting the assumption explicitly in the Box2DLeafFilter scaladoc — pushdown is sound when per-row Box2D = per-row geom envelope (true for ST_Box2D(geom)-derived columns, including the auto-generated <geom>_bbox path).

Adding an opt-out conf spark.sedona.geoparquet.box2dFilterPushDown (default on). Users with non-minimal covering columns can disable it.

Filed Sound Box2D filter pushdown via Parquet column statistics #2949 to track the proper Parquet-column-statistics-based fix that uses the Box2D columns own xmin/ymin/xmax/ymax stats. Once that lands the opt-out can default off or be removed.

Sedonas own writer (#2886) produces exact envelopes via ST_Box2D(geom), so the common path is sound; the conservative-wider case mostly arises with files written by sedona-dbs Float32 writer where users would want this disabled until #2949.

jiayuasu · 2026-05-11T06:25:20Z

+      // Find the geometry column whose covering metadata points at this Box2D column.
+      val matchingGeomEntry = columns.find { case (_, field) =>
+        field.covering.exists(_.bbox.xmin.headOption.contains(box2dColumnName))
+      }
+
+      matchingGeomEntry match {
+        case Some((_, field)) =>
+          // Use the geometry column's recorded bbox to prune. The union of per-row Box2D values
+          // is a superset of the geometry column's bbox (covering boxes are at least as wide as
+          // their geometries), so if the geom-column bbox does not intersect the query box, no
+          // row's Box2D can intersect either. May leave some files unpruned when Box2D values
+          // are conservatively wider than geometries, but never produces false negatives.
+          val bbox = field.bbox.getOrElse(return true)
+          if (bbox.isEmpty) return true
+          val fileXMin = bbox(0)
+          val fileYMin = bbox(1)
+          val fileXMax = bbox(2)
+          val fileYMax = bbox(3)
+          !(fileXMax < queryBox.getXMin || fileXMin > queryBox.getXMax
+            || fileYMax < queryBox.getYMin || fileYMin > queryBox.getYMax)
+        case None =>
+          // No geometry column references this Box2D column as covering — cannot prune safely.
+          true


Fixed in 23829ab — find replaced with a collect+Seq(field) pattern that requires exactly one match. Multiple matches now fall back to keep-file rather than picking an arbitrary one.

jiayuasu · 2026-05-11T06:25:22Z

-import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT
-import org.apache.spark.sql.sedona_sql.expressions.{ST_AsEWKT, ST_Buffer, ST_Contains, ST_CoveredBy, ST_Covers, ST_Crosses, ST_DWithin, ST_Distance, ST_DistanceSphere, ST_DistanceSpheroid, ST_Equals, ST_Intersects, ST_OrderingEquals, ST_Overlaps, ST_Touches, ST_Within}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.sedona_sql.UDT.{Box2DUDT, GeometryUDT}


Removed in 23829ab.

1. Soundness caveat. The pushdown uses the file's geom-column bbox to prune, which is sound only when per-row Box2D = per-row geometry envelope. The GeoParquet spec permits conservatively wider covering bboxes (sedona-db does this via next_after Float32 rounding), in which case pruning can drop matching files. Documented the limitation, added an opt-out conf (spark.sedona.geoparquet.box2dFilterPushDown, default on), and flagged Parquet-column-statistics-based proper pruning as the follow-up. Sedona's own writer produces exact envelopes via ST_Box2D(geom), so the common path is sound. 2. Deterministic matching. columns.find returned an arbitrary match when multiple geometry columns referenced the same covering. Now collects all matches and requires exactly one; falls back to keep- file otherwise. 3. Removed unused Box2DUDT import in the pushdown rule. 4. Test correctness. verifyBox2DFilter now compares the pushed-down result against the same query with pushdown disabled, catching cases where pruning is over-aggressive.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

jiayuasu · 2026-05-12T04:51:30Z

+      }
+    }
+
+    it("Push down ST_BoxContains against a Box2D covering column") {


Covered by the new reverse-arg-order test cases in 3c23666.

Round-2 review pointed out that default-on can silently produce wrong results for spec-compliant GeoParquet files with conservatively-wider covering columns. Flipping the default keeps correctness as the non-surprising baseline; users with Sedona-written files (where covering = exact envelope) opt in via spark.sedona.geoparquet.box2dFilterPushDown=true. Default flips back to true once the proper Parquet-column-statistics-based pruning lands in apache#2949. Tests: - Existing ST_BoxIntersects / ST_BoxContains tests now wrap their pushdown assertions in withConf(box2dFilterPushDown=true). - Added reverse-arg-order coverage: ST_BoxIntersects(lit_box, col) and ST_BoxContains(lit_box, col) should produce identical pruning to the col-first orientation. - Added a 'disabled by default' test asserting no pushdown is emitted without the opt-in conf.

Replace the file-metadata-based Box2DLeafFilter (which pruned using the geometry column's bbox — unsound when GeoParquet 1.1 coverings are conservatively wider than per-row envelopes) with a translation to a Parquet FilterPredicate over the Box2D column's own (xmin, ymin, xmax, ymax) leaf statistics. - GeoParquetSpatialFilter gains a toParquetFilter hook. AndFilter/OrFilter combine children's predicates; LeafFilter returns None. - Box2DLeafFilter carries a Box2DPredicateKind (Intersects / ColumnContainsLiteral / LiteralContainsColumn) and emits the matching conjunction of four double-column inequalities. evaluate() returns true because pruning now happens at the Parquet layer. - GeoParquetFileFormat ANDs the spatial Parquet predicate into the existing pushed FilterPredicate so Parquet's stats-based skipping prunes row groups whose bounds disprove the predicate. - Pushdown is sound for any writer, so the opt-in spark.sedona.geoparquet.box2dFilterPushDown conf is removed.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

jiayuasu · 2026-05-12T05:47:36Z

+      // comparisons on a Box2D-typed column) are AND'd into the pushed-down filter so Parquet
+      // can skip row groups whose column statistics disprove them.
+      val combinedPushed = spatialFilter.flatMap(_.toParquetFilter) match {
+        case Some(spatialPredicate) =>
+          Some(pushed.fold(spatialPredicate)(p => FilterApi.and(p, spatialPredicate)))
+        case None => pushed
+      }


Fixed in 0d69d50 — gated the spatial Parquet predicate on enableParquetFilterPushDown, so disabling spark.sql.parquet.filterPushdown also disables Sedona-injected row-group predicates.

jiayuasu · 2026-05-12T05:47:37Z

+   */
+  private def extractBox2DLiteral(value: Any): Option[Box2D] = value match {
+    case row: InternalRow if row.numFields == 4 =>
+      Some(new Box2D(row.getDouble(0), row.getDouble(1), row.getDouble(2), row.getDouble(3)))


Fixed in 0d69d50 — extractBox2DLiteral now returns None for inverted bounds, so the predicate falls back to per-row evaluation and surfaces the IllegalArgumentException from Predicates.requireOrderedPlanarBox. Added a test that asserts the predicate is not pushed and that collect() raises IllegalArgumentException in the cause chain.

jiayuasu · 2026-05-12T05:47:39Z

+      // Box2D predicates push down as Parquet row-group filters on the Box2D column's underlying
+      // (xmin, ymin, xmax, ymax) double leaves. Pruning is done by Parquet's stats-based skipping
+      // against the column's recorded min/max, which is sound regardless of how the writer chose
+      // the per-row Box2D values.
+      case ST_BoxIntersects(_) =>
+        // Intersects is symmetric — both argument orders produce the same predicate.


The PR title and body were updated after this review fired — the description no longer references the removed opt-in conf or geometry-bbox metadata pruning, and now documents the Parquet row-group statistics path that is actually implemented.

- Gate spatial Parquet predicate injection on `enableParquetFilterPushDown` so disabling `spark.sql.parquet.filterPushdown` also disables Sedona- injected row-group predicates (otherwise the spatial path bypassed the user's intent to turn parquet pushdown off). - Reject inverted-bound Box2D literals in extractBox2DLiteral so the predicate falls back to runtime evaluation and surfaces the expected IllegalArgumentException; otherwise Parquet's row-group filter could prune all matches and hide the throw. - Add a test locking in the inverted-bound fallback: no pushdown attached, and collect() surfaces an IllegalArgumentException in the cause chain.

jiayuasu requested a review from Copilot May 11, 2026 04:11

Copilot started reviewing on behalf of jiayuasu May 11, 2026 04:12 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

jiayuasu requested a review from Copilot May 11, 2026 16:47

Copilot AI reviewed May 11, 2026

View reviewed changes

Copilot started reviewing on behalf of jiayuasu May 11, 2026 18:06 View session

jiayuasu added 2 commits May 11, 2026 21:50

jiayuasu requested a review from Copilot May 12, 2026 05:38

jiayuasu changed the title ~~[GH-2938] Push down ST_BoxIntersects / ST_BoxContains to GeoParquet bbox metadata~~ [GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics May 12, 2026

Copilot started reviewing on behalf of jiayuasu May 12, 2026 05:39 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

jiayuasu added this to the sedona-1.9.1 milestone May 13, 2026

jiayuasu merged commit 1ae9c01 into apache:master May 13, 2026
42 checks passed

jiayuasu deleted the feature/box2d-filter-pushdown branch May 13, 2026 06:13

Conversation

jiayuasu commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

How the pushdown works

Why this is sound

API changes

Pairs naturally with

How was this patch tested?

What's not in scope

Did this PR include necessary documentation updates?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jiayuasu May 11, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu May 11, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu May 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jiayuasu May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

jiayuasu May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jiayuasu commented May 11, 2026 •

edited

Loading