SQL: struct support #586
base: rp-sql
Changes from all commits
49edf69
5623c8a
d35813d
e1f3231
60053ed
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,134 @@ | ||
| = Query topics with nested fields | ||
| :description: Map a topic with nested Protobuf, Avro, or JSON fields to SQL ROW columns, then query those fields directly. | ||
| :page-topic-type: how-to | ||
| :personas: app_developer, data_engineer | ||
| :learning-objective-1: Map a topic with a nested schema as a SQL table using struct_mapping_policy = 'COMPOUND' | ||
| :learning-objective-2: Query nested fields using ROW field-access syntax | ||
| :learning-objective-3: Resolve cyclic-reference errors | ||
|
|
||
| When a glossterm:topic[]'s schema includes nested Protobuf, Avro, or JSON message types, you can map those nested structures as user-defined types (UDTs) with named fields instead of opaque JSON. You can then reference nested fields by name with SQL `ROW` field-access syntax, include them in projections, and use them in `WHERE`, `GROUP BY`, and `ORDER BY` clauses, without parsing JSON at query time. | ||
|
|
||
| After completing these steps, you will be able to: | ||
|
|
||
| * [ ] {learning-objective-1} | ||
| * [ ] {learning-objective-2} | ||
| * [ ] {learning-objective-3} | ||
|
|
||
| == Prerequisites | ||
|
|
||
| Before you query a topic with nested fields: | ||
|
|
||
| * Enable Redpanda SQL on your Redpanda Bring Your Own Cloud (BYOC) cluster. See xref:sql:get-started/deploy-sql-cluster.adoc[Enable Redpanda SQL]. | ||
| * Connect to Redpanda SQL with `psql` or another PostgreSQL client. See xref:sql:connect-to-sql/index.adoc[Connect to Redpanda SQL]. | ||
| * Make sure the topic has a schema registered in glossterm:schema-registry[Schema Registry], and that the schema includes one or more nested message types. | ||
| * Make sure you have a Redpanda catalog connection. See xref:reference:sql/sql-statements/create-redpanda-catalog.adoc[CREATE REDPANDA CATALOG]. | ||
|
A little misleading, since in BYOC this is auto-created for you. @kbatuigas |
||
|
|
||
| == Map the topic as a SQL table | ||
|
|
||
| Create the SQL table with `struct_mapping_policy = 'COMPOUND'` to surface each nested message as a user-defined type column: | ||
|
|
||
| [source,sql] | ||
| ---- | ||
| CREATE TABLE default_redpanda_catalog=>orders WITH ( | ||
| topic = 'orders', | ||
| schema_subject = 'orders-value', | ||
|
Is schema_subject required or optional? If it's required, then maybe the naming convention is not mandatory, @pkonrad1229? |
||
| struct_mapping_policy = 'COMPOUND' | ||
|
It says below that this is optional. The comments should explain the same here (what is optional and what is not). |
||
| ); | ||
| ---- | ||
|
|
||
| Replace `orders` with your topic name and `orders-value` with the Schema Registry subject that holds the topic's value schema. | ||
|
|
||
| For a topic schema with this Protobuf definition: | ||
|
|
||
| [source,proto] | ||
| ---- | ||
| message Order { | ||
| string order_id = 1; | ||
| Customer customer = 2; | ||
| double amount = 3; | ||
| } | ||
|
|
||
| message Customer { | ||
| string customer_id = 1; | ||
| string name = 2; | ||
| string region = 3; | ||
| } | ||
| ---- | ||
|
|
||
| Redpanda SQL maps the table with three columns: `order_id` (text), `customer` (a user-defined type with fields `customer_id`, `name`, and `region`), and `amount` (double precision). | ||
|
|
||
| TIP: `COMPOUND` is the default `struct_mapping_policy`. To map nested structures as opaque JSON instead, use `struct_mapping_policy = 'JSON'`. JSON mapping is the only option that supports recursive (cyclic) types. See <<handle-recursive-cyclic-schemas, Handle recursive (cyclic) schemas>>. | ||
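Because the TIP above states that `COMPOUND` is the default policy, the earlier `CREATE TABLE` statement could presumably be written without the option. The following is a minimal sketch under that assumption, reusing the `orders` topic and `orders-value` subject from the example above. It is not confirmed here whether `schema_subject` can also be omitted, so it is kept.

```sql
-- Assumption: omitting struct_mapping_policy falls back to the default,
-- which this page states is COMPOUND.
CREATE TABLE default_redpanda_catalog=>orders WITH (
  topic = 'orders',
  schema_subject = 'orders-value'
);
```

Setting `struct_mapping_policy = 'COMPOUND'` explicitly, as in the first example, should be equivalent and makes the intent visible to readers of the statement.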
|
|
||
| == Query nested fields | ||
|
|
||
| Access a nested field by its declared name using the `(column).field` form. You must wrap the column in parentheses: | ||
|
|
||
| [source,sql] | ||
| ---- | ||
| SELECT order_id, (customer).name, (customer).region, amount | ||
| FROM default_redpanda_catalog=>orders | ||
| WHERE (customer).region = 'EMEA'; | ||
| ---- | ||
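The introduction states that nested fields are also usable in `GROUP BY` and `ORDER BY`, but only `WHERE` is shown above. The following is a minimal sketch of an aggregate query, assuming the `orders` table and `Customer` schema from this page and that the standard `COUNT` and `SUM` aggregates are available. The aliases `order_count` and `total_amount` are illustrative names, not part of the schema.

```sql
-- Count orders and total the order amount for each customer region.
-- The nested field uses the same parenthesized (column).field form
-- in the projection, GROUP BY, and ORDER BY clauses.
SELECT (customer).region,
       COUNT(*) AS order_count,
       SUM(amount) AS total_amount
FROM default_redpanda_catalog=>orders
GROUP BY (customer).region
ORDER BY (customer).region;
```

Repeating the full `(customer).region` expression in each clause avoids relying on alias resolution rules, which this page does not describe.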
|
|
||
| To project every field of a nested structure as separate result columns, use the wildcard `.*` form: | ||
|
|
||
| [source,sql] | ||
| ---- | ||
| SELECT order_id, (customer).* | ||
| FROM default_redpanda_catalog=>orders | ||
| LIMIT 10; | ||
| ---- | ||
|
|
||
| For schemas with multiple levels of nesting, chain the parenthesized field access. For example, if `Customer` itself contained a nested `address` message with a `zip_code` field, you would query the zip code as: | ||
|
|
||
| [source,sql] | ||
| ---- | ||
| SELECT ((customer).address).zip_code FROM default_redpanda_catalog=>orders; | ||
| ---- | ||
|
|
||
| For the full `ROW` reference, including comparison operators, NULL handling, and `::text` casting, see xref:reference:sql/sql-data-types/row.adoc[ROW]. | ||
|
|
||
| [[handle-recursive-cyclic-schemas]] | ||
| == Handle recursive (cyclic) schemas | ||
|
|
||
| Topic schemas can include recursive structures, such as a `Comment` message that references itself or two messages that reference each other. Mapping such a schema with `COMPOUND` fails at table-creation time with the following error: | ||
|
@kbatuigas this should be explained a bit more explicitly up front. Explain that RP SQL supports working with recursive schema types only by mapping them to JSON, not as compound types. This is the punchline, and it should come before saying where the failure occurs with 'COMPOUND'. |
||
|
|
||
| [source,text] | ||
| ---- | ||
| Cyclic reference at '<parent>.<field>' → '<type>'. Cyclic types are not supported in COMPOUND struct mapping policy; use struct_mapping_policy=JSON for recursive types. | ||
| ---- | ||
|
|
||
| The error message tells you the resolution: re-create the table with `struct_mapping_policy = 'JSON'`. In JSON mode, Redpanda SQL stores each nested structure as a JSON value: | ||
|
|
||
| [source,sql] | ||
| ---- | ||
| CREATE TABLE default_redpanda_catalog=>comments WITH ( | ||
| topic = 'comments', | ||
| schema_subject = 'comments-value', | ||
| struct_mapping_policy = 'JSON' | ||
| ); | ||
| ---- | ||
|
|
||
| Query JSON-mapped fields with standard JSON functions instead of ROW field access. See xref:reference:sql/sql-data-types/json.adoc[JSON]. | ||
|
|
||
| == Choose between COMPOUND and JSON | ||
|
|
||
| [cols="<20%,<40%,<40%",options="header"] | ||
| |=== | ||
| | Policy | Use when | Trade-offs | ||
|
|
||
| | `COMPOUND` (default) | ||
| | The topic schema has nested structures that are not recursive, and you want to query nested fields directly by name. | ||
| | Typed access; usable in `WHERE`, `GROUP BY`, `ORDER BY`. Required if you xref:sql:query-data/query-iceberg-topics.adoc[query an Iceberg-enabled topic via a linked Redpanda catalog], so that nested fields stay typed across both live and Iceberg-translated records. | ||
|
|
||
| | `JSON` | ||
| | The topic schema is recursive, or you prefer flexible access through JSON functions. | ||
| | Recursive types supported; fields are untyped until extracted with JSON functions. Queries that span the Redpanda topic and its linked Iceberg table do not align cleanly, because Iceberg always exposes nested structures as typed columns. | ||
|
@grzebiel this warning would imply something very important to alert the user to (we don't really support querying Iceberg topics with recursive types). However, I don't think this is the correct message here (at least not always), because Iceberg topics have special handling that encodes recursive Protobuf Struct fields as a JSON string in the Iceberg table. So for Protobuf, at least, we do have a story for recursive fields. How should this be adjusted? |
||
| |=== | ||
|
|
||
| == Next steps | ||
|
|
||
| * xref:sql:query-data/query-streaming-topics.adoc[Query streaming topics]: query a topic without Iceberg history. | ||
| * xref:sql:query-data/query-iceberg-topics.adoc[Query Iceberg topics]: query the Iceberg-translated history of a topic. Use `struct_mapping_policy = 'COMPOUND'` so nested fields align between the Redpanda topic and the linked Iceberg table. | ||
|
@kbatuigas wrong wording IMO. 'Query a topic with Iceberg history' is better. What's here is technically incorrect because it makes it sound like you're ONLY querying the Iceberg portion (tail), but in fact this link is to how to do a bridge query that queries both the live streaming data and the Iceberg history. We should ensure we correct this everywhere. |
||
| * xref:reference:sql/sql-data-types/row.adoc[ROW]: full reference for the `ROW` data type, including comparisons, NULL semantics, and conversion to text. | ||
| * xref:reference:sql/sql-statements/create-table.adoc[CREATE TABLE]: complete option list for mapping a Redpanda topic to a SQL table. | ||
Shouldn't we be specifying that the schema is registered for the topic using the TopicNamingStrategy naming convention @pkonrad1229 @kbatuigas ? You have to name it correctly for this to work, right? If people are not already familiar with this in SR we should educate them (point them to this naming convention)