Skip to content

[Coral-Schema] Add CoralCatalog base-table and table-scan paths for Avro schema generation#604

Merged
aastha25 merged 1 commit into
linkedin:masterfrom
yyy1000:iceberg-first-avro/pr5-pr6-coral-catalog-paths
May 7, 2026
Merged

[Coral-Schema] Add CoralCatalog base-table and table-scan paths for Avro schema generation#604
aastha25 merged 1 commit into
linkedin:masterfrom
yyy1000:iceberg-first-avro/pr5-pr6-coral-catalog-paths

Conversation

@yyy1000
Copy link
Copy Markdown
Contributor

@yyy1000 yyy1000 commented Apr 27, 2026

Motivation

PR #600 added the MergeCoralSchemaWithAvro engine but did not wire it into any call site. As a result, base tables and table scans inside views still resolve through the Hive-only path, even when the table is Iceberg-backed.

This PR adds the Coral-side CoralCatalog plumbing so base tables and table scans can resolve through the new merge engine. Hive- and Iceberg-backed base tables share a single dispatch path; Hive behavior is preserved unchanged for callers that don't opt in.

Summary

  • SchemaUtilities.getAvroSchemaForTable(CoralTable, strictMode) — new overload. HiveTable delegates to the existing getAvroSchemaForTable(Table, strictMode). Other CoralTables (Iceberg-backed) parse partner Avro from CoralTable.properties() via getCasePreservedSchemaFromPropertyMaps (PR Refactor partner-Avro parsing into reusable property-map helper #589) and merge with the Coral schema using MergeCoralSchemaWithAvro.merge. Strict mode throws SchemaNotFoundException when partner Avro is absent; non-strict mode emits a pure Coral-derived schema.
  • ViewToAvroSchemaConverter.create(CoralCatalog, HiveMetastoreClient) — new factory + matching private constructor + optional CoralCatalog field. The existing create(HiveMetastoreClient) factory and constructor are unchanged. inferAvroSchema() dispatches base-table resolution through the CoralCatalog when one is provided; views fall through to the existing path because view conversion still requires HiveToRelConverter on a Hive metastore. coralCatalog is also threaded into the inner RelToAvroSchemaConverter.
  • RelToAvroSchemaConverter — new optional CoralCatalog field + (CoralCatalog, HiveMetastoreClient) constructor. The existing (HiveMetastoreClient) constructor delegates to it with null. SchemaRelShuttle.getTableScanSchema() resolves through coralCatalog.getTable() first when set, falling back to hiveMetastoreClient.getTable() if the CoralCatalog does not surface the table.

Behavior preservation

Every existing constructor/factory still exists. The new constructors delegate to the legacy ones with a null coralCatalog, so the legacy code path is byte-for-byte identical. Callers that don't opt in see no behavior change.

Test plan

  • New SchemaUtilitiesTests cases for the Iceberg path:
    • testGetAvroSchemaForCoralTableIcebergWithPartner — partner-provided record name preserved, fields populated.
    • testGetAvroSchemaForCoralTableIcebergNoPartnerNonStrict — pure Coral-derived schema with name/namespace from the CoralTable name.
    • testGetAvroSchemaForCoralTableIcebergNoPartnerStrictThrowsSchemaNotFoundException raised.
    • testGetAvroSchemaForCoralTableUnqualifiedNameYieldsEmptyNamespace — unqualified name handling.
  • The Hive-delegation branch and the ViewToAvroSchemaConverter / RelToAvroSchemaConverter dispatches are covered transitively by existing tests on the HiveMetastoreClient-only constructors — the new constructors and factories are additive, and null coralCatalog reproduces prior behavior.
  • Full coral-schema:test suite passes; spotlessJavaCheck clean.

Follow-up

Complex multi-branch unions in partner Avro are not yet handled by MergeCoralSchemaWithAvro (only [null, T] shapes are). A follow-up PR will extend the merge engine for union/fixed/UUID/Avro-only edge cases. Pure Iceberg base tables rarely carry multi-branch unions in their partner Avro, so this PR is safe to ship before that work; production regression testing on this path will inform the exact scope of the follow-up.

@yyy1000 yyy1000 force-pushed the iceberg-first-avro/pr5-pr6-coral-catalog-paths branch from ec53682 to 84af052 Compare April 28, 2026 21:47
* @param hiveMetastoreClient client used for view conversion and as a fallback; must not be null
* @return a new converter wired to both
*/
public static ViewToAvroSchemaConverter create(CoralCatalog coralCatalog, HiveMetastoreClient hiveMetastoreClient) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not aligned with this user facing method. from a UX perspective, if a user is looking to generate a view's avro schema , they will either pass CoralCatalog or HiveMetastoreClient depending on whether they are on new or old variant.
we should -
(1) expose a public static - public static ViewToAvroSchemaConverter create(CoralCatalog coralCatalog) {..} method.
(2) mark public static ViewToAvroSchemaConverter create(HiveMetastoreClient hiveMetastoreClient) {..) deprecated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dfff1b6create(CoralCatalog, HiveMetastoreClient) removed; new create(CoralCatalog) exposed; create(HiveMetastoreClient) marked @Deprecated. The new factory wires the inner HiveToRelConverter and RelToAvroSchemaConverter against the same CoralCatalog end-to-end, so the converter no longer mixes abstractions internally.

}

Table baseTable = hiveMetastoreClient.getTable(dbName, tableName);
if (baseTable == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this codepath makes we wonder that we are potentially allowing users to create a CoralRelNode using the old abstractions - HiveMetastoreClient and then create a view schema by merging coralRelNode generated using HiveMetastoreClient with CoralTable generated using CoralCatalog. that can give some weird bugs and else cases.

we need to decouple the two code paths a bit more.
the user's entry point is the class - ViewToAvroSchemaConverter (is that right?) . if yes, we need to wire it up such that there are 2 independent code paths (pseudo code example) -
(1)

viewToAvroSchemaConverter1 = new ViewToAvroSchemaConverter(HiveMetastoreClient);
viewToAvroSchemaConverter1.getAvroSchema(db,tbl)

-- this uses HiveToRelConverter(HiveMetastoreClient) & RelToAvroConverter(HiveMetastoreClient)

and

(2)

viewToAvroSchemaConverter2 = new ViewToAvroSchemaConverter(CoralCatalog);
viewToAvroSchemaConverter2.getAvroSchema(db,tbl)

-- this uses HiveToRelConverter(CoralCatalog) & RelToAvroConverter(CoralCatalog)

please check if we're enabling a code path which inter mixes the two interactions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dfff1b6 — the two paths are now fully decoupled. SchemaRelShuttle branches on a single non-null field (CoralCatalog xor HiveMetastoreClient) with no fallback. A CoralCatalog-backed instance never resolves through HiveMetastoreClient and vice versa; if the configured catalog can't find the table it throws instead of silently degrading.

ViewToAvroSchemaConverter was also refactored along the same lines: create(CoralCatalog) constructs HiveToRelConverter(coralCatalog) + RelToAvroSchemaConverter(coralCatalog) end-to-end, and inferAvroSchema splits into two branches that share no state (inferAvroSchemaCoralCatalog and inferAvroSchemaHiveMetastore). A user can no longer build a CoralRelNode through HiveMetastoreClient and resolve base tables through CoralCatalog on the same converter instance.

Comment on lines 124 to 131
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public facing API should only take either CoralCatalog or HiveMetastoreClient. please refactor the code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dfff1b6(CoralCatalog, HiveMetastoreClient) constructor removed; new (CoralCatalog) constructor added; (HiveMetastoreClient) marked @Deprecated with a Javadoc pointer to the new constructor.

@aastha25
Copy link
Copy Markdown
Contributor

@yyy1000 thanks for the PR! keep the momentum going :)

feel free to introduce new class variants , if needed, to enable the decoupling of new & old codepaths. And btw, the build is also failing.

@yyy1000 yyy1000 force-pushed the iceberg-first-avro/pr5-pr6-coral-catalog-paths branch from 84af052 to dfff1b6 Compare May 1, 2026 22:02
@yyy1000
Copy link
Copy Markdown
Contributor Author

yyy1000 commented May 1, 2026

Pushed dfff1b6 with the decoupling refactor — replies posted on each of the three threads above.

Also folded the coral-hive:spotlessJavagenerateTestGrammarSource Gradle ordering fix into this push, so CI should now go green modulo the unrelated ParseTreeBuilderTest.testSql11KeywordAsExplicitAlias flakes that I confirmed reproduce on master without any of my changes (likely fallout from #602).


// Hive-backed CoralTable: delegate to the existing Hive-table API to preserve current behavior.
if (coralTable instanceof HiveTable) {
return getAvroSchemaForTable(((HiveTable) coralTable).getHiveTable(), strictMode);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getHiveTable() method has documentation:

/**
   * INTERNAL API
   * @deprecated This method is for internal use only and will be removed in a future release.
   * Do not depend on this API.
   *
   * @return Hive metastore Table object
   */

please do not leverage this API. if the implementation requires a conversion from (HiveTable) CoralTable -> Hive's table object, implement the converter. Feel free to do that in coral-common so that other usages of that API can migrate too

Copy link
Copy Markdown
Contributor Author

@yyy1000 yyy1000 May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on dropping the deprecated getHiveTable(). Worth flagging two things before sketching a fix:

  1. PR [Coral-Schema] Add CoralCatalog base-table and table-scan paths for Avro schema generation #604 isn't the first caller. HiveCalciteTableAdapter.java:122 and ToRelConverter.java:209 already use getHiveTable() today — both inside coral-common itself. Whatever we land here is the natural migration target for those too.
  2. A real converter needs more of HiveTable's state than the current public API exposes — serdeInfo.serializationLib (AvroSerDe escape), serde parameters (partner-Avro extraction), and the cols/partition-keys split (today collapsed into one StructType by getSchema()). Reconstructing a Hive Table from name() + properties() + getSchema() alone would silently break the Hive arm.

Two ways forward:

  1. Expand HiveTable's public API in a separate prep PR, then add a real HiveCoralTableConverter here.
  2. Add HiveCoralTableConverter here as a centralized facade that internally uses getHiveTable() (with @SuppressWarnings), so the deprecated call lives in exactly one place and the existing in-tree usages migrate behind the same converter.

1 is the right end state, 2 is a pragmatic interim. Which would you prefer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you think about more options from a design standpoint? how about we expand hiveTable.properties() method here to also include the serde properties in addition to the table properties - essentially that would truly capture the hive table's properties.
would that resolve the issue? what all aspects of 'hiveTable' are required across coral-common & coral-schema? we can limit scope to just coral-schema for now. what are the implications here? or should we implement a new getAvroSchemaForTable() without knowledge of the base hive table? is that possible?

please spend some time thinking about these options and more and share what you think should be adopted and why.

Copy link
Copy Markdown
Contributor Author

@yyy1000 yyy1000 May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent some time on this. My pick: targeted expansion of HiveTable's public API plus a shared private helper between the two getAvroSchemaForTable overloads.

Four new methods on HiveTable, scoped to what coral-schema's Hive path actually consumes:

public Map<String, String> serdeProperties()      // sd.serdeInfo.parameters
public String serializationLib()                  // sd.serdeInfo.serializationLib
public List<FieldSchema> regularColumns()         // sd.cols
public List<FieldSchema> partitionColumns()       // partitionKeys

Then in SchemaUtilities, extract the existing getAvroSchemaForTable(Table) body into a private helper that takes those fields plus tableProperties, fullyQualifiedName, strictMode. Both overloads (Table and CoralTable) become thin wrappers that route through it.

Result: bytecode-equivalent Hive behavior (same helper, same inputs), no getHiveTable() in coral-schema. Scope stays at "what coral-schema needs from HiveTable" per your framing — the other two in-tree usages (HiveCalciteTableAdapter, ToRelConverter) can migrate to these accessors in a follow-up without further API additions.

What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i looked at Apache-Iceberg's Table API and serde properties are not retrievable using a top class API, it i shoved under properties. We should also look to not expand the user facing APIs extensively.

So maybe create a constant key for SERDE_SERIALIZATION and put the serde in it in the same properties key value map. But agree that this is going beyond the scope of coral-schema.
So for now lets log an issue for this follow up item, link this thread in the issue as well and link the issue in code as a TODO.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed offline — deferring this. Filed #606 to track the migration; linked the discussion threads (and your SERDE_SERIALIZATION constant-key suggestion) there. Added a TODO at the call site in this PR pointing at the issue (b8e6c1e).

Comment on lines +168 to +169
String recordNamespace = lastDot >= 0 ? fullName.substring(0, lastDot) : "";
String recordName = lastDot >= 0 ? fullName.substring(lastDot + 1) : fullName;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this is how name & namespace is extracted for hive tables? shouldnt we prefer to extract is from the Avro schema partner, if that is present? lets keep parity with old behaviour

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in b8e6c1e. When partner Avro is present, MergeCoralSchemaWithAvro.mergeTopLevelStruct already takes name/namespace from the partner record (via copyRecord), so legacy parity is already there for that case. For the no-partner fallback, I now match convertHiveSchemaToAvro: recordName = the bare table name, recordNamespace = the fully qualified name (e.g. default.my_table). Updated the affected SchemaUtilitiesTests assertions too.

Preconditions.checkNotNull(tableOrViewName);

if (coralCatalog != null) {
return inferAvroSchemaCoralCatalog(dbName, tableOrViewName, strictMode, forceLowercase);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in your regression testing, please make sure that you test for permutations of strictMode + forceLowercase as well.

Copy link
Copy Markdown
Contributor Author

@yyy1000 yyy1000 May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking on internal testing side where the regression harness lives. Will extend CoralSparkBaselineAnalysisForTrino to emit one avro_schema column per (strictMode, forceLowercase) permutation so the diff query covers all four.

* {@link CoralCatalog#getTable}; views use the CoralCatalog-aware
* {@link HiveToRelConverter} and {@link RelToAvroSchemaConverter}.
*/
private Schema inferAvroSchemaCoralCatalog(String dbName, String tableOrViewName, boolean strictMode,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inferAvroSchemaUsingCoralCatalog

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in b8e6c1e. Also renamed the sibling inferAvroSchemaHiveMetastoreinferAvroSchemaUsingHiveMetastore for symmetry.

…ration

Wires the Coral side of the Iceberg-first Avro flow end-to-end as two
fully independent code paths -- one driven by CoralCatalog, one by the
legacy HiveMetastoreClient. Instances built from one abstraction never
mix with the other.

SchemaUtilities:
  - New getAvroSchemaForTable(CoralTable, strictMode) overload.
  - For HiveTable, delegates to the existing getAvroSchemaForTable(Table,
    strictMode) so current Hive behavior is preserved unchanged.
  - For other CoralTables (e.g. Iceberg-backed), reads partner Avro
    from CoralTable.properties() via the property-map helper from
    PR linkedin#589 and merges with the Coral schema using Iceberg-first
    semantics via MergeCoralSchemaWithAvro.merge.
  - Generates a pure Coral-derived schema when no partner Avro is
    available; in strict mode this case throws SchemaNotFoundException.

ViewToAvroSchemaConverter:
  - Two mutually exclusive public factories:
      create(CoralCatalog)        -- new path
      create(HiveMetastoreClient) -- legacy, marked @deprecated
  - Each factory wires its inner HiveToRelConverter and
    RelToAvroSchemaConverter against the same abstraction, end to end.
  - inferAvroSchema delegates to one of two private branches that share
    no state with each other; views and tables are both surfaced
    through the configured catalog/metastore. The CoralCatalog branch
    uses CoralTable.tableType() to distinguish base tables from views.
  - The previous (CoralCatalog, HiveMetastoreClient) factory is removed
    so callers can no longer build a converter that mixes the two
    abstractions internally.

RelToAvroSchemaConverter:
  - Two mutually exclusive public constructors:
      RelToAvroSchemaConverter(CoralCatalog)        -- new path
      RelToAvroSchemaConverter(HiveMetastoreClient) -- legacy, @deprecated
  - SchemaRelShuttle holds both fields but exactly one is non-null per
    instance; getTableScanSchema branches on that and never falls back
    between the two abstractions. A CoralCatalog-backed shuttle that
    cannot resolve a table now throws instead of silently degrading
    through Hive metastore.
  - The previous (CoralCatalog, HiveMetastoreClient) constructor is
    removed.

coral-hive build:
  - tasks.named('spotlessJava') now mustRunAfter both
    generateGrammarSource and generateTestGrammarSource. Without the
    test-grammar entry, Gradle 8 strict validation fails the
    coral-hive build.

Tests: SchemaUtilitiesTests covers the new Iceberg path (with-partner,
no-partner non-strict, no-partner strict-throws, unqualified-name
fallback, plus a Coral-nullability-wins assertion proving the merge
engine ran). The HiveMetastoreClient-backed paths remain covered by
existing tests through their unchanged constructors and factories.
@yyy1000 yyy1000 force-pushed the iceberg-first-avro/pr5-pr6-coral-catalog-paths branch from dfff1b6 to b8e6c1e Compare May 5, 2026 23:15
@aastha25 aastha25 merged commit 54ac559 into linkedin:master May 7, 2026
1 check passed
@aastha25 aastha25 changed the title Add CoralCatalog base-table and table-scan paths for Avro schema generation [Coral-Schema] Add CoralCatalog base-table and table-scan paths for Avro schema generation May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants