Add nested-type access and more SQL operators to data-generation inference #608

Open
wmoustafa wants to merge 1 commit into linkedin:master from wmoustafa:wmoustaf/coral-data-generation-2
Extends `coral-data-generation` so the symbolic-constraint solver from PR
linkedin#564 covers a wider class of WHERE predicates: more SQL operators, struct
and map/array element access, and a predicate-based inference entry point
that resolves per-path domains from a DNF query. Also tightens two
inference paths whose existing rewrites silently produced wrong results
for the new cases.

## New operator coverage

Eight new `DomainTransformer` implementations are wired into
`DomainInferenceProgram.withDefaultTransformers()`:

| Transformer | SQL operator |
| --- | --- |
| `AbsIntegerTransformer` | `ABS(x)` |
| `MinusIntegerTransformer` | binary `x - k` and `k - x` |
| `NegateIntegerTransformer` | unary `-x` |
| `UpperRegexTransformer` | `UPPER(x)` |
| `ConcatRegexTransformer` | `CONCAT(x, lit)` / `CONCAT(lit, x)` |
| `TrimRegexTransformer` | `TRIM(x)` — supports both Calcite's 3-operand standard form and Hive's 1-operand form |
| `FieldAccessTransformer` | struct field access (`s.name`) on nested expressions |
| `ItemTransformer` | `ITEM(coll, idx-or-key)` for array indexing and map lookup on nested expressions |

`ConcatRegexTransformer` matches both `SqlStdOperatorTable.CONCAT` (the SQL
`||` operator) and the `OTHER_FUNCTION` named `concat` that Hive emits.
Existing transformers (`LowerRegexTransformer`, `PlusIntegerTransformer`,
`TimesIntegerTransformer`, `SubstringRegexTransformer`) now accept
`RexFieldAccess` as a valid variable operand, so expressions like
`LOWER(s.name)`, `s.age + 5`, and `UPPER(sarr[0].name)` flow through.
`SubstringRegexTransformer.canHandle` also gained an operand-arity check.

The transformer registration is grouped into string ops → integer ops →
cross-domain → structural pass-throughs for readability.
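
As an illustration of what the integer transformers in the table have to compute, here is a minimal sketch that inverts `x - k`, `k - x`, and `ABS(x)` over closed `long` intervals. `Range` and `InverseIntegerOps` are hypothetical names for this example; they are not the module's `IntegerDomain` or `DomainTransformer` API, whose signatures are not shown in this description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical closed interval [lo, hi] over long; not the module's IntegerDomain. */
final class Range {
  final long lo;
  final long hi;

  Range(long lo, long hi) {
    this.lo = lo;
    this.hi = hi;
  }

  @Override
  public String toString() {
    return "[" + lo + ", " + hi + "]";
  }
}

final class InverseIntegerOps {
  /** Inputs x such that x - k lies in out: shift the interval up by k. */
  static Range invertMinusRight(Range out, long k) {
    return new Range(Math.addExact(out.lo, k), Math.addExact(out.hi, k));
  }

  /** Inputs x such that k - x lies in out: reflect around k, which swaps the bounds. */
  static Range invertMinusLeft(Range out, long k) {
    return new Range(Math.subtractExact(k, out.hi), Math.subtractExact(k, out.lo));
  }

  /** Inputs x such that ABS(x) lies in out: the non-negative part of out plus its mirror. */
  static List<Range> invertAbs(Range out) {
    List<Range> result = new ArrayList<>();
    if (out.hi < 0) {
      return result; // ABS never yields a negative value, so no input qualifies
    }
    long lo = Math.max(out.lo, 0);
    if (lo == 0) {
      result.add(new Range(-out.hi, out.hi)); // the two halves meet at 0
    } else {
      result.add(new Range(-out.hi, -lo));
      result.add(new Range(lo, out.hi));
    }
    return result;
  }
}
```

The empty list returned for an all-negative output interval corresponds to the empty-domain case that the `ABS` tests in this PR pin down.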

## Nested-type access

New `AccessPath` value type identifies any value reachable from a root
column index through a chain of struct fields (`FIELD`), map lookups
(`MAP_KEY`), and array indices (`ARRAY_INDEX`). It's the key type of the
new multi-path resolution API (below) and is also used in tests to
assert which nested values were resolved.

`DomainInferenceProgram.deriveInputDomain` gained two base cases so
inference terminates correctly at nested column references — struct field
access on a `RexInputRef` (e.g., `$3.name`) and ITEM access on a
`RexInputRef` (e.g., `ITEM($2, 1)` for arrays, `ITEM($4, 'env')` for
maps).
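
To make the idea of a path value type concrete, the sketch below models a root column index followed by a chain of `FIELD`, `MAP_KEY`, and `ARRAY_INDEX` steps, with value equality so it can serve as a map key. `PathSketch` and its builder methods are assumptions made for illustration; the real `AccessPath` class may expose a different API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of a path value type; the PR's AccessPath API may differ. */
final class PathSketch {
  enum Kind { FIELD, MAP_KEY, ARRAY_INDEX }

  private final int rootColumn;
  private final List<Kind> kinds; // kind of each step, in order
  private final List<Object> keys; // field name, map key, or array index per step

  private PathSketch(int rootColumn, List<Kind> kinds, List<Object> keys) {
    this.rootColumn = rootColumn;
    this.kinds = kinds;
    this.keys = keys;
  }

  static PathSketch root(int column) {
    return new PathSketch(column, Collections.<Kind>emptyList(), Collections.<Object>emptyList());
  }

  /** Returns a new path with one more step; existing paths are never mutated. */
  private PathSketch extend(Kind kind, Object key) {
    List<Kind> k = new ArrayList<>(kinds);
    List<Object> v = new ArrayList<>(keys);
    k.add(kind);
    v.add(key);
    return new PathSketch(rootColumn, k, v);
  }

  PathSketch field(String name) { return extend(Kind.FIELD, name); }

  PathSketch mapKey(String key) { return extend(Kind.MAP_KEY, key); }

  PathSketch index(int i) { return extend(Kind.ARRAY_INDEX, i); }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof PathSketch)) {
      return false;
    }
    PathSketch p = (PathSketch) o;
    return rootColumn == p.rootColumn && kinds.equals(p.kinds) && keys.equals(p.keys);
  }

  @Override
  public int hashCode() {
    return Objects.hash(rootColumn, kinds, keys);
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder("$").append(rootColumn);
    for (int i = 0; i < kinds.size(); i++) {
      switch (kinds.get(i)) {
        case FIELD: sb.append('.').append(keys.get(i)); break;
        case MAP_KEY: sb.append("['").append(keys.get(i)).append("']"); break;
        default: sb.append('[').append(keys.get(i)).append(']'); break;
      }
    }
    return sb.toString();
  }
}
```

For example, `PathSketch.root(4).mapKey("env")` renders as `$4['env']`, and two paths built from the same steps compare equal.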

## Predicate-based inference: two reductions up the SQL evaluation hierarchy

Master exposed one primitive — `deriveInputDomain(expr, outputDomain) →
inputDomain` — which answers the leaf question: given an expression and
a constraint on its output, derive the constraint on the input variable.
Real callers, though, start higher up the SQL evaluation stack. The PR
adds the two reductions that bridge a full WHERE clause down to the
primitive:

```
WHERE clause (tree of AND / OR over comparisons)
        │
        │  DnfRewriter (already exists)
        ▼
list of DNF disjuncts                              ── resolveAllPaths (new)
        │
        │  for each disjunct, for each conjunct
        ▼
single comparison predicate (expr OP literal)      ── deriveInputDomainFromPredicate (new)
        │
        │  compute output domain from OP + literal
        ▼
(expression, output domain) pair                   ── deriveInputDomain (primitive)
        │
        │  walk expr, refine via transformers
        ▼
domain on the input variable
```

- **`deriveInputDomainFromPredicate(RexCall predicate)`** is one
  reduction above the primitive. It takes a comparison `expr OP literal`
  (`=`, `<`, `>`, `<=`, `>=`), computes the output domain from the
  operator and literal — `> 5` ⇒ `IntegerDomain([6, ∞))`,
  `= 'abc'` ⇒ `RegexDomain.literal("abc")` — and reduces to
  `deriveInputDomain(expr, that)`. It also unwraps the
  `RexCall(UNARY_MINUS, RexLiteral)` shape Calcite uses for negative
  literals, so `age = -5` is handled the same way as a positive literal
  such as `age = 5`.

- **`resolveAllPaths(List<RexNode> disjuncts)`** is one reduction above
  that. Given the DNF disjuncts produced by `DnfRewriter`, it walks every
  disjunct, every conjunct, calls
  `deriveInputDomainFromPredicate` on each comparison, and combines the
  per-`AccessPath` results with AND semantics within a disjunct
  (intersection) and OR semantics across disjuncts (union). Predicates
  outside the comparison-with-literal shape are silently skipped —
  notably column-to-column join predicates, which still require
  per-column literals.

  For `WHERE (age > 10 AND name = 'foo') OR (age = 0)` the result is
  roughly
  `{ $age → IntegerDomain([11,∞) ∪ {0}), $name → RegexDomain("foo") }`.

Nothing else is added: anything more specific belongs in a transformer,
and anything less specific (such as converting a WHERE tree to DNF in
the first place) was already the caller's job via `DnfRewriter`.
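
The two reductions can be sketched for a single integer column as follows: a comparison becomes a closed interval, conjuncts intersect, and disjuncts are collected as a union. This is a simplified illustration over `long[] {lo, hi}` pairs with hypothetical names (`PredicateSketch`, `rangeFor`, `intersect`, `union`); it does not use Calcite's `RexNode` types or the module's domain classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-column sketch; intervals are closed long[] {lo, hi} pairs. */
final class PredicateSketch {
  static final long MIN = Long.MIN_VALUE;
  static final long MAX = Long.MAX_VALUE;

  /** Output domain for "expr OP literal". */
  static long[] rangeFor(String op, long literal) {
    switch (op) {
      case "=": return new long[] {literal, literal};
      case ">": return new long[] {Math.addExact(literal, 1), MAX};
      case ">=": return new long[] {literal, MAX};
      case "<": return new long[] {MIN, Math.subtractExact(literal, 1)};
      case "<=": return new long[] {MIN, literal};
      default: throw new IllegalArgumentException("unsupported operator: " + op);
    }
  }

  /** AND within a disjunct: intersect two intervals; null stands for the empty domain. */
  static long[] intersect(long[] a, long[] b) {
    long lo = Math.max(a[0], b[0]);
    long hi = Math.min(a[1], b[1]);
    return lo <= hi ? new long[] {lo, hi} : null;
  }

  /** OR across disjuncts: keep every non-empty interval (no merging in this sketch). */
  static List<long[]> union(List<long[]> perDisjunct) {
    List<long[]> result = new ArrayList<>();
    for (long[] r : perDisjunct) {
      if (r != null) {
        result.add(r);
      }
    }
    return result;
  }
}
```

For the `age` column of the example above, the first disjunct yields `rangeFor(">", 10)` and the second `rangeFor("=", 0)`, so the union holds the two intervals `[11, MAX]` and `[0, 0]`.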

## Tighten `RegexToIntegerDomainConverter`: accept only canonical decimal regexes

- **Input:** `R = ^[0-9]{3}$`.
- **Master returns:** `IntegerDomain{0..999}`.
- **Should return:** `IntegerDomain{100..999}` — SQL
  `CAST(integer AS VARCHAR)` produces canonical decimal (`0 → "0"`,
  never `"000"`), so `0` does not belong.
- **Fix:** narrow the converter's contract to canonical-decimal regexes
  only. The accept rule changes from "finite + digit-only" to "finite
  + subset of `^(0|[1-9][0-9]*)$`". Non-canonical inputs (`^[0-9]{3}$`,
  `^009$`, empty regex, …) are now rejected with
  `NonConvertibleDomainException`. `CastRegexTransformer`'s
  `CAST(int AS VARCHAR)` branch keeps calling `convert(outputRegex)`
  directly and relies on this strict contract.
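
The difference between the two accept rules can be seen with a brute-force check over three-digit strings. The sketch below uses `java.util.regex` only to enumerate counterexamples; it is not how the converter decides acceptance, and `CanonicalDecimalCheck` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical brute-force check; not the converter's actual acceptance logic. */
final class CanonicalDecimalCheck {
  // Canonical decimal as stated above: "0", or a digit string with no leading zero.
  static final Pattern CANONICAL = Pattern.compile("^(0|[1-9][0-9]*)$");

  /** Three-digit strings ("000" to "999") matched by regex that are not canonical decimal. */
  static List<String> nonCanonicalMatches(String regex) {
    Pattern p = Pattern.compile(regex);
    List<String> bad = new ArrayList<>();
    for (int i = 0; i <= 999; i++) {
      String s = String.format(Locale.ROOT, "%03d", i);
      if (p.matcher(s).matches() && !CANONICAL.matcher(s).matches()) {
        bad.add(s);
      }
    }
    return bad;
  }
}
```

`^[0-9]{3}$` matches the 100 strings `"000"` through `"099"`, none of which `CAST(integer AS VARCHAR)` can produce, while `^[1-9][0-9]{2}$` matches no non-canonical string.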

## ProjectPullUpRewriter: remap the join condition when a left Project changes field count

Concrete scenario: tables `T1(a, b, c)` (3 cols) and `T2(x, y)` (2 cols).
Plan before pull-up:

```
Join(condition: b = x)
 ├── Project(a, b)         keeps 2 of T1's 3 columns
 │    └── Scan(T1)
 └── Scan(T2)
```

The join's row type is `[Project-output | T2] = [a, b, x, y]`, so
inside the condition `b` resolves to `$1` and `x` to `$2`. The condition
is `$1 = $2`.

After pull-up, the `Project` moves above the `Join`, and the new join's
left input is the raw `Scan(T1)`:

```
Project(...)
 └── Join(condition: ???)
      ├── Scan(T1)
      └── Scan(T2)
```

The new join's row type is `[T1 | T2] = [a, b, c, x, y]`. `b` is still
`$1`, but `x` is now `$3` because the left input grew from 2 columns
back to 3. The rewritten condition must be `$1 = $3`.

Master inlined left-side `InputRef`s through the removed `Project` but
left right-side `InputRef`s at their old positions. The rewritten
condition came out as `$1 = $2`, and in the new frame `$2` points at
`T1.c` (`VARCHAR`), not `T2.x` (`INTEGER`): the wrong column, and a type
mismatch that breaks join evaluation.

The fix replaces the two side-specific helpers (`inlineLeftSide`,
`inlineRightSide`) with a single `remapJoinCondition` pass. For every
`InputRef` in the old condition it computes the position in the new
frame using `oldLeftCount` (Project-output width) and `newLeftCount`
(unprojected-left width): right-side references shift by
`newLeftCount - oldLeftCount`; left-side references are remapped through
the lifted projection expressions.
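
The index arithmetic can be sketched as below. This is a simplified model, assuming the pulled-up `Project` only selects columns, so each output is described by the left-input index it reads; the real rewriter substitutes full projection expressions. `JoinRemapSketch` and `leftProjection` are hypothetical names.

```java
/** Hypothetical index-only model of the remapping; not the rewriter's actual code. */
final class JoinRemapSketch {
  /**
   * Maps an InputRef index from the old join frame [Project-output | right]
   * to the new frame [unprojected-left | right].
   *
   * @param oldIndex index in the old join row type
   * @param leftProjection leftProjection[i] is the left-input column that Project output i reads
   * @param newLeftCount width of the unprojected left input
   */
  static int remap(int oldIndex, int[] leftProjection, int newLeftCount) {
    int oldLeftCount = leftProjection.length;
    if (oldIndex < oldLeftCount) {
      return leftProjection[oldIndex]; // left side: follow the lifted projection
    }
    return oldIndex + (newLeftCount - oldLeftCount); // right side: shift by the width change
  }
}
```

With `Project(a, b)` over `T1(a, b, c)`, `leftProjection` is `{0, 1}` and `newLeftCount` is 3, so `$1` stays `$1` and `$2` becomes `$3`, matching the corrected condition `$1 = $3`.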

## IntegerDomain

- New `negate()` method (returns `multiply(-1)`), used by the new
  `NegateIntegerTransformer`.
- `Interval.isAdjacent` refactored to make the overflow guard explicit
  in two named booleans, matching the original behavior.
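
A minimal sketch of both points, assuming interval bounds are `long` values (the actual bound type and method signatures are not shown in this description, and `IntervalSketch` is a hypothetical name):

```java
/** Hypothetical sketch over long bounds; not the module's Interval class. */
final class IntervalSketch {
  /** True when [aLo, aHi] and [bLo, bHi] touch with no gap and no overlap. */
  static boolean isAdjacent(long aLo, long aHi, long bLo, long bHi) {
    // Guard each +1 so that Long.MAX_VALUE + 1 never wraps around to Long.MIN_VALUE.
    boolean aEndsRightBeforeB = aHi != Long.MAX_VALUE && aHi + 1 == bLo;
    boolean bEndsRightBeforeA = bHi != Long.MAX_VALUE && bHi + 1 == aLo;
    return aEndsRightBeforeB || bEndsRightBeforeA;
  }

  /** Negating [lo, hi] gives [-hi, -lo]; negateExact throws on Long.MIN_VALUE. */
  static long[] negate(long lo, long hi) {
    return new long[] {Math.negateExact(hi), Math.negateExact(lo)};
  }
}
```

Without the guard, an interval ending at `Long.MAX_VALUE` would appear adjacent to one starting at `Long.MIN_VALUE`, because the addition wraps.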

## Build

`coral-data-generation/build.gradle` now applies the `java-library`
plugin so the module exposes proper `api`/`implementation` configurations.

## Tests

`RegexDomainInferenceProgramTest` is the main integration suite and grows
substantially: it exercises every new operator individually, every new
nested-type access pattern, and combined SQL queries with AND/OR over
struct/map/array paths against four test tables (`test.T`,
`test.complex`, `test.deep`, `test.interleaved`). Notable coverage areas:

- single-operator tests for `SUBSTRING`, `LOWER`, `UPPER`, `CAST(int→str)`,
  `CAST(str→int)`, `CAST(str→date)`, arithmetic, `MINUS`, `ABS`, unary
  minus, `CONCAT`, `TRIM`, comparison operators with and without
  arithmetic
- multi-column AND/OR with same-column intersection, disjoint ranges,
  range-with-equality, contradictory ranges, mixed regex/integer domains
- struct field equality and arithmetic, map-element equality, array of
  structs, nested struct (`nested_struct.sub.value`), map of structs
  (`map_of_structs['key'].score`), and interleaved combinations
- CAST cross-domain on struct fields, OR disjunction on struct fields,
  per-column union semantics

`RegexTransformerTest` is a new dedicated unit-test class for `Concat`:
prefix/suffix stripping, prefix/suffix mismatch (empty domain), empty
suffix as identity, non-literal output passthrough.
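
For the literal-output case, the stripping behavior these tests describe can be sketched with plain strings. The real transformer works on regex domains rather than single strings; `ConcatInverseSketch` is a hypothetical name and `null` stands in for the empty domain.

```java
/** Hypothetical string-only sketch of inverting CONCAT against a literal output. */
final class ConcatInverseSketch {
  /** For CONCAT(x, suffix) = output: returns x, or null when output cannot end with suffix. */
  static String stripSuffix(String output, String suffix) {
    if (!output.endsWith(suffix)) {
      return null; // mismatch: no input x can produce this output
    }
    return output.substring(0, output.length() - suffix.length());
  }

  /** For CONCAT(prefix, x) = output: returns x, or null when output cannot start with prefix. */
  static String stripPrefix(String output, String prefix) {
    if (!output.startsWith(prefix)) {
      return null; // mismatch: no input x can produce this output
    }
    return output.substring(prefix.length());
  }
}
```

An empty suffix or prefix leaves the output unchanged, which is the identity case listed above.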

`IntegerTransformerTest` adds rigorous-style cases for `Minus`, `Negate`,
and `Abs`: each test constructs the `RexCall` via `RexBuilder` and calls
`transformer.refineInputDomain` directly, then asserts containment and
boundaries — including the empty case for `ABS` over an all-negative
output interval.

`RegexToIntegerDomainConverterTest` is updated to match the new contract:
tests that previously passed non-canonical regexes (e.g.,
`^[0-9]{3}$`, `^009$`, `^[0-9]?$`) now assert the converter rejects them
with `NonConvertibleDomainException`. Parallel positive tests use
canonical-form inputs (`^[1-9][0-9]{2}$` instead of `^[0-9]{3}$`).

`CastRegexTransformerTest` adds concrete accept/reject probes for the
returned regex (e.g., `getAutomaton().run("100")`), pins the canonical
behavior of `CAST(int AS VARCHAR)` with a canonical 3-digit output, and
documents the non-canonical fallback path.

`ProjectPullUpRewriterTest` asserts row-type field-name and type
preservation across pull-ups, and pins the rewritten join condition to
`=($1, $3)` for the case described above.

## Verification

Full module pipeline (`build`, `javadoc`, `spotlessJavaCheck`) passes;
all tests in the module pass.