Skip to content

[GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics#2946

Merged
jiayuasu merged 5 commits into
apache:masterfrom
jiayuasu:feature/box2d-filter-pushdown
May 13, 2026
Merged

[GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics#2946
jiayuasu merged 5 commits into
apache:masterfrom
jiayuasu:feature/box2d-filter-pushdown

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu commented May 11, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Teaches SpatialFilterPushDownForGeoParquet to recognize the Box2D predicates from #2926 and translate them into a Parquet FilterPredicate over the Box2D column's underlying (xmin, ymin, xmax, ymax) leaf columns. Parquet's row-group statistics machinery then skips row groups whose per-column min/max disprove the predicate.

How the pushdown works

The Box2D UDT serializes to a Spark struct of four non-nullable Double fields (xmin, ymin, xmax, ymax), so a Box2D-typed column in Parquet becomes a nested group with four leaf columns. Parquet records standard min/max statistics per row group on each of those leaves, which is exactly what's needed to prune.

For each recognized predicate, the rule emits four inequalities AND'd together:

Predicate Pushed-down conjunction (per row)
ST_BoxIntersects(box_col, lit) box.xmax >= lit.xmin AND box.xmin <= lit.xmax AND box.ymax >= lit.ymin AND box.ymin <= lit.ymax (symmetric — reverse arg order is the same)
ST_BoxContains(box_col, lit) box.xmin <= lit.xmin AND box.xmax >= lit.xmax AND box.ymin <= lit.ymin AND box.ymax >= lit.ymax
ST_BoxContains(lit, box_col) box.xmin >= lit.xmin AND box.xmax <= lit.xmax AND box.ymin >= lit.ymin AND box.ymax <= lit.ymax

GeoParquetFileFormat.buildReaderWithPartitionValues ANDs this with any existing pushed-down Parquet predicate from Spark's data-source filters before calling ParquetInputFormat.setFilterPredicate. Parquet then handles row-group skipping (and, if parquet.filter.record-level.enabled is on, per-record filtering) via its own well-tested machinery.

Why this is sound

Pruning operates on the Box2D column's actual stored values' min/max, not on a separate geometry column's bbox. This holds regardless of how the writer chose those values — Sedona's ST_Box2D(geom) produces exact per-row envelopes, but a writer that emits conservatively wider coverings (e.g., sedona-db's Float32 path that rounds outward via next_after) is still handled correctly because the Box2D column's own statistics describe whatever it stored.

This is why the previous file-metadata pruning approach is replaced wholesale: it was unsound for any writer that doesn't satisfy Box2D_per_row == geom_envelope_per_row. The Parquet-stats path removes that requirement.

API changes

  • GeoParquetSpatialFilter gains a toParquetFilter: Option[FilterPredicate] hook. The default is None. AndFilter and OrFilter combine children's predicates (OR requires both sides to translate).
  • Box2DLeafFilter now carries a Box2DPredicateKind (Intersects / ColumnContainsLiteral / LiteralContainsColumn) and implements toParquetFilter directly. Its evaluate returns true — file-metadata-level pruning is no longer used for Box2D predicates.
  • GeoParquetFileFormat ANDs the spatial Parquet predicate into pushed before handing it to Parquet.
  • The opt-in conf spark.sedona.geoparquet.box2dFilterPushDown from the earlier commits is removed; the master toggle spark.sedona.geoparquet.spatialFilterPushDown (default true) continues to gate the rule.

Pairs naturally with

The deferred GeoParquet reader auto-materialization of bbox covering columns as Box2D (#2877 follow-up). When that lands, WHERE ST_BoxIntersects(box_col, lit(b)) becomes the canonical bbox-pruned read path — the typed column comes from disk, the predicate prunes the disk read.

How was this patch tested?

GeoParquetSpatialFilterPushDownSuite:

  • "Push down ST_BoxIntersects against a Box2D column" — Q1-only, left-half, fully-disjoint, and reverse-arg-order query windows.
  • "Push down ST_BoxContains against a Box2D column" — ST_BoxContains(col, lit) with a tiny query inside Q1, plus reverse-arg-order ST_BoxContains(lit, col).

The shared verifyBox2DFilter helper asserts the spatial filter is attached, that toParquetFilter is non-empty, and that the pushdown-on result matches a baseline run with the rule disabled (spark.sedona.geoparquet.spatialFilterPushDown=false) — guarding against dropping matching rows.

What's not in scope

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public SQL API documentation surface in isolation. Documentation lands with the consolidated Phase 1+2+3 Box2D docs update.

…et bbox metadata

Adds Box2DLeafFilter, a new GeoParquetSpatialFilter variant that
prunes files using the geometry column whose covering metadata
references the Box2D column in the predicate. Both ST_BoxIntersects
and ST_BoxContains push down to a file-level INTERSECTS check: per-row
containment implies per-row intersection, which is only possible if
the file's union envelope intersects the query box. Pushdown is sound
(never produces false negatives) and well-targeted when the Box2D
column is the registered GeoParquet 1.1 covering for a geometry
column (auto-detected <geom>_bbox columns and explicit
geoparquet.covering options both fit).

Tests in GeoParquetSpatialFilterPushDownSuite cover:
- ST_BoxIntersects with a Box2D query against a quadrant-partitioned
  dataset; correct region pruning for Q1-only, left-half, and
  fully-disjoint windows.
- ST_BoxContains with a tiny query box inside Q1; pushes down as
  INTERSECTS at the file level.

Pairs naturally with the deferred GeoParquet reader auto-
materialization of bbox covering columns as Box2D (apache#2877).
Closes apache#2938.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends Sedona’s GeoParquet spatial filter pushdown so Catalyst predicates on Box2D covering columns (ST_BoxIntersects / ST_BoxContains with a literal RHS) can be translated into a GeoParquet file-pruning filter evaluated against GeoParquet 1.1 metadata.

Changes:

  • Add Box2D predicate recognition in SpatialFilterPushDownForGeoParquet and translate to a new Box2DLeafFilter.
  • Introduce GeoParquetSpatialFilter.Box2DLeafFilter to resolve Box2D covering → geometry column via GeoParquet metadata and prune files.
  • Add a GeoParquet pushdown test fixture and test cases for ST_BoxIntersects and ST_BoxContains on a geom_bbox covering column.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala Recognizes ST_BoxIntersects/ST_BoxContains and emits a Box2D-targeted leaf filter.
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetSpatialFilter.scala Adds Box2DLeafFilter evaluation logic for pruning based on GeoParquet metadata.
spark/common/src/test/scala/org/apache/sedona/sql/GeoParquetSpatialFilterPushDownSuite.scala Adds integration-style tests and fixture generation for Box2D covering-column pushdown.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +122 to +126
// Use the geometry column's recorded bbox to prune. The union of per-row Box2D values
// is a superset of the geometry column's bbox (covering boxes are at least as wide as
// their geometries), so if the geom-column bbox does not intersect the query box, no
// row's Box2D can intersect either. May leave some files unpruned when Box2D values
// are conservatively wider than geometries, but never produces false negatives.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern — I went through it again and you are right: when per-row Box2D > per-row geom envelope (which the spec permits), the union of Box2D values is a superset of the file geom bbox, and using the geom bbox to prune can produce false negatives. Addressed in 23829ab by:

  1. Documenting the assumption explicitly in the Box2DLeafFilter scaladoc — pushdown is sound when per-row Box2D = per-row geom envelope (true for ST_Box2D(geom)-derived columns, including the auto-generated <geom>_bbox path).
  2. Adding an opt-out conf spark.sedona.geoparquet.box2dFilterPushDown (default on). Users with non-minimal covering columns can disable it.
  3. Filed Sound Box2D filter pushdown via Parquet column statistics #2949 to track the proper Parquet-column-statistics-based fix that uses the Box2D columns own xmin/ymin/xmax/ymax stats. Once that lands the opt-out can default off or be removed.

Sedonas own writer (#2886) produces exact envelopes via ST_Box2D(geom), so the common path is sound; the conservative-wider case mostly arises with files written by sedona-dbs Float32 writer where users would want this disabled until #2949.

Comment on lines +115 to +137
// Find the geometry column whose covering metadata points at this Box2D column.
val matchingGeomEntry = columns.find { case (_, field) =>
field.covering.exists(_.bbox.xmin.headOption.contains(box2dColumnName))
}

matchingGeomEntry match {
case Some((_, field)) =>
// Use the geometry column's recorded bbox to prune. The union of per-row Box2D values
// is a superset of the geometry column's bbox (covering boxes are at least as wide as
// their geometries), so if the geom-column bbox does not intersect the query box, no
// row's Box2D can intersect either. May leave some files unpruned when Box2D values
// are conservatively wider than geometries, but never produces false negatives.
val bbox = field.bbox.getOrElse(return true)
if (bbox.isEmpty) return true
val fileXMin = bbox(0)
val fileYMin = bbox(1)
val fileXMax = bbox(2)
val fileYMax = bbox(3)
!(fileXMax < queryBox.getXMin || fileXMin > queryBox.getXMax
|| fileYMax < queryBox.getYMin || fileYMin > queryBox.getYMax)
case None =>
// No geometry column references this Box2D column as covering — cannot prune safely.
true
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 23829abfind replaced with a collect+Seq(field) pattern that requires exactly one match. Multiple matches now fall back to keep-file rather than picking an arbitrary one.

import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT
import org.apache.spark.sql.sedona_sql.expressions.{ST_AsEWKT, ST_Buffer, ST_Contains, ST_CoveredBy, ST_Covers, ST_Crosses, ST_DWithin, ST_Distance, ST_DistanceSphere, ST_DistanceSpheroid, ST_Equals, ST_Intersects, ST_OrderingEquals, ST_Overlaps, ST_Touches, ST_Within}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sedona_sql.UDT.{Box2DUDT, GeometryUDT}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 23829ab.

1. Soundness caveat. The pushdown uses the file's geom-column bbox to
   prune, which is sound only when per-row Box2D = per-row geometry
   envelope. The GeoParquet spec permits conservatively wider covering
   bboxes (sedona-db does this via next_after Float32 rounding), in
   which case pruning can drop matching files. Documented the
   limitation, added an opt-out conf
   (spark.sedona.geoparquet.box2dFilterPushDown, default on), and
   flagged Parquet-column-statistics-based proper pruning as the
   follow-up. Sedona's own writer produces exact envelopes via
   ST_Box2D(geom), so the common path is sound.

2. Deterministic matching. columns.find returned an arbitrary match
   when multiple geometry columns referenced the same covering. Now
   collects all matches and requires exactly one; falls back to keep-
   file otherwise.

3. Removed unused Box2DUDT import in the pushdown rule.

4. Test correctness. verifyBox2DFilter now compares the pushed-down
   result against the same query with pushdown disabled, catching
   cases where pruning is over-aggressive.
@jiayuasu jiayuasu requested a review from Copilot May 11, 2026 16:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

}
}

it("Push down ST_BoxContains against a Box2D covering column") {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered by the new reverse-arg-order test cases in 3c23666.

jiayuasu added 2 commits May 11, 2026 21:50
Round-2 review pointed out that default-on can silently produce wrong
results for spec-compliant GeoParquet files with conservatively-wider
covering columns. Flipping the default keeps correctness as the
non-surprising baseline; users with Sedona-written files (where
covering = exact envelope) opt in via
spark.sedona.geoparquet.box2dFilterPushDown=true. Default flips back
to true once the proper Parquet-column-statistics-based pruning lands
in apache#2949.

Tests:
- Existing ST_BoxIntersects / ST_BoxContains tests now wrap their
  pushdown assertions in withConf(box2dFilterPushDown=true).
- Added reverse-arg-order coverage: ST_BoxIntersects(lit_box, col)
  and ST_BoxContains(lit_box, col) should produce identical pruning
  to the col-first orientation.
- Added a 'disabled by default' test asserting no pushdown is emitted
  without the opt-in conf.
Replace the file-metadata-based Box2DLeafFilter (which pruned using the
geometry column's bbox — unsound when GeoParquet 1.1 coverings are
conservatively wider than per-row envelopes) with a translation to a
Parquet FilterPredicate over the Box2D column's own (xmin, ymin, xmax,
ymax) leaf statistics.

- GeoParquetSpatialFilter gains a toParquetFilter hook. AndFilter/OrFilter
  combine children's predicates; LeafFilter returns None.
- Box2DLeafFilter carries a Box2DPredicateKind (Intersects /
  ColumnContainsLiteral / LiteralContainsColumn) and emits the matching
  conjunction of four double-column inequalities. evaluate() returns true
  because pruning now happens at the Parquet layer.
- GeoParquetFileFormat ANDs the spatial Parquet predicate into the
  existing pushed FilterPredicate so Parquet's stats-based skipping prunes
  row groups whose bounds disprove the predicate.
- Pushdown is sound for any writer, so the opt-in
  spark.sedona.geoparquet.box2dFilterPushDown conf is removed.
@jiayuasu jiayuasu requested a review from Copilot May 12, 2026 05:38
@jiayuasu jiayuasu changed the title [GH-2938] Push down ST_BoxIntersects / ST_BoxContains to GeoParquet bbox metadata [GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics May 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines +277 to +283
// comparisons on a Box2D-typed column) are AND'd into the pushed-down filter so Parquet
// can skip row groups whose column statistics disprove them.
val combinedPushed = spatialFilter.flatMap(_.toParquetFilter) match {
case Some(spatialPredicate) =>
Some(pushed.fold(spatialPredicate)(p => FilterApi.and(p, spatialPredicate)))
case None => pushed
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0d69d50 — gated the spatial Parquet predicate on enableParquetFilterPushDown, so disabling spark.sql.parquet.filterPushdown also disables Sedona-injected row-group predicates.

Comment on lines +284 to +287
*/
private def extractBox2DLiteral(value: Any): Option[Box2D] = value match {
case row: InternalRow if row.numFields == 4 =>
Some(new Box2D(row.getDouble(0), row.getDouble(1), row.getDouble(2), row.getDouble(3)))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0d69d50 — extractBox2DLiteral now returns None for inverted bounds, so the predicate falls back to per-row evaluation and surfaces the IllegalArgumentException from Predicates.requireOrderedPlanarBox. Added a test that asserts the predicate is not pushed and that collect() raises IllegalArgumentException in the cause chain.

Comment on lines +151 to +156
// Box2D predicates push down as Parquet row-group filters on the Box2D column's underlying
// (xmin, ymin, xmax, ymax) double leaves. Pruning is done by Parquet's stats-based skipping
// against the column's recorded min/max, which is sound regardless of how the writer chose
// the per-row Box2D values.
case ST_BoxIntersects(_) =>
// Intersects is symmetric — both argument orders produce the same predicate.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title and body were updated after this review fired — the description no longer references the removed opt-in conf or geometry-bbox metadata pruning, and now documents the Parquet row-group statistics path that is actually implemented.

- Gate spatial Parquet predicate injection on `enableParquetFilterPushDown`
  so disabling `spark.sql.parquet.filterPushdown` also disables Sedona-
  injected row-group predicates (otherwise the spatial path bypassed the
  user's intent to turn parquet pushdown off).
- Reject inverted-bound Box2D literals in extractBox2DLiteral so the
  predicate falls back to runtime evaluation and surfaces the expected
  IllegalArgumentException; otherwise Parquet's row-group filter could
  prune all matches and hide the throw.
- Add a test locking in the inverted-bound fallback: no pushdown attached,
  and collect() surfaces an IllegalArgumentException in the cause chain.
@jiayuasu jiayuasu added this to the sedona-1.9.1 milestone May 13, 2026
@jiayuasu jiayuasu merged commit 1ae9c01 into apache:master May 13, 2026
42 checks passed
@jiayuasu jiayuasu deleted the feature/box2d-filter-pushdown branch May 13, 2026 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sound Box2D filter pushdown via Parquet column statistics Filter pushdown for ST_BoxIntersects / ST_BoxContains to GeoParquet bbox covering columns

2 participants