Add support for external table creation #1301

Open
aholyoke wants to merge 1 commit into googleapis:main from aholyoke:feature/external-tables

Conversation

@aholyoke
Contributor

Implements the ability to create externally backed tables, plus Hive partitioning support. #372 🦕

Borrows the convention used for time and range partitioning, where we create an instance of a struct from the core Google Cloud SDK and pass it as a SQLAlchemy BigQuery dialect option.

Example usage:

import sqlalchemy as sa
from sqlalchemy_bigquery.base import BigQueryDialect
from google.cloud.bigquery import ExternalConfig, ExternalSourceFormat, HivePartitioningOptions

hive_partitioning = HivePartitioningOptions()
hive_partitioning.source_uri_prefix = "gs://bucket-name/prefix"
hive_partitioning.require_partition_filter = False

external_config = ExternalConfig(ExternalSourceFormat.PARQUET)
external_config.source_uris = [
    "gs://bucket-name/prefix/string_col=A/*",
    "gs://bucket-name/prefix/string_col=B/*",
]
external_config.hive_partitioning = hive_partitioning

table = sa.Table(
    "my_table",
    sa.MetaData(),
    sa.Column("string_col", sa.String),
    sa.Column("int_col", sa.Integer),
    schema="my_schema",
    prefixes=["EXTERNAL"],
    bigquery_external_data_configuration=external_config,
)

print(sa.schema.CreateTable(table).compile(dialect=BigQueryDialect()))

This will produce the following query:

CREATE EXTERNAL TABLE
  `my_schema`.`my_table`
  WITH PARTITION COLUMNS (
    `string_col` STRING,
    `int_col` INT64)
  OPTIONS (
    format = 'PARQUET',
    uris = ['gs://bucket-name/prefix/string_col=A/*', 'gs://bucket-name/prefix/string_col=B/*'],
    hive_partition_uri_prefix = 'gs://bucket-name/prefix',
    require_hive_partition_filter = false)
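Note how list-valued options such as uris above are rendered as SQL array literals while booleans are lowercased. As an illustration of what the PR's _process_option_value change has to do (the helper name and exact quoting here are assumptions, not the PR's actual code):

```python
def format_option_value(value):
    """Render a Python value as a BigQuery OPTIONS literal.

    Lists become SQL arrays like ['a', 'b'], strings are single-quoted,
    and booleans are lowercased (true/false), matching the DDL above.
    """
    if isinstance(value, list):
        return "[" + ", ".join(format_option_value(v) for v in value) + "]"
    if isinstance(value, str):
        return "'" + value + "'"
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)
```

For example, format_option_value(["gs://a/*"]) yields ['gs://a/*'] and format_option_value(False) yields false.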

@aholyoke aholyoke requested review from a team as code owners February 22, 2026 21:48
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. labels Feb 22, 2026
@gemini-code-assist

Summary of Changes

Hello @aholyoke, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the sqlalchemy-bigquery dialect by introducing the ability to define and create BigQuery external tables directly through SQLAlchemy's DDL. Users can integrate external data sources, including those structured with Hive partitioning, into their BigQuery data models using familiar SQLAlchemy constructs, expanding the dialect's utility for data warehousing and lakehouse architectures.

Highlights

  • External Table Creation: Added comprehensive support for creating BigQuery external tables, allowing users to define tables backed by external data sources like Google Cloud Storage within SQLAlchemy.
  • Hive Partitioning Support: Implemented functionality to specify Hive partitioning options for external tables, including source URI prefixes and partition filter requirements, aligning with BigQuery's capabilities.
  • Integration with Google Cloud BigQuery SDK: Integrated google.cloud.bigquery SDK's ExternalConfig and HivePartitioningOptions classes, enabling direct use of these objects to configure external tables via SQLAlchemy dialect options.
  • Documentation and Testing: Included clear documentation examples in README.rst for both basic external tables and those with Hive partitioning, alongside extensive unit tests to validate the new DDL generation logic.
Changelog
  • README.rst
    • Added documentation and example code snippets for creating external tables.
    • Included documentation and example code for creating external tables with Hive partitioning.
  • sqlalchemy_bigquery/base.py
    • Imported inspect, ExternalConfig, and HivePartitioningOptions from google.cloud.bigquery.
    • Extended option_datatype_mapping in BigQueryDDLCompiler to include new options relevant to external tables such as format, hive_partition_uri_prefix, and require_hive_partition_filter.
    • Introduced a new method create_table_suffix to append WITH PARTITION COLUMNS for external tables utilizing Hive partitioning.
    • Modified post_create_table to parse and validate external_data_configuration dialect options, extracting format, URIs, and Hive partitioning details for DDL generation.
    • Updated _process_option_value to correctly handle list types when generating SQL options, specifically for uris in external table definitions.
  • tests/unit/test_table_options.py
    • Imported AvroOptions, CSVOptions, ExternalConfig, ExternalSourceFormat, HivePartitioningOptions, and ParquetOptions for testing external table configurations.
    • Added test_create_external_table to verify basic external table creation.
    • Added test_create_external_table_hive_partitioning to confirm correct DDL generation for external tables with Hive partitioning.
    • Included test_create_external_table_format_csv_options, test_create_external_table_format_parquet_options, and test_create_external_table_format_avro_options to test external tables with specific format options.
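The create_table_suffix behavior described in the changelog above can be sketched roughly as follows; the signature and the way partition columns are passed in are simplified assumptions for illustration, not the PR's exact implementation:

```python
def create_table_suffix(partition_columns, hive_partitioning):
    """Build the WITH PARTITION COLUMNS clause for a hive-partitioned
    external table; partition_columns is a list of (name, type) pairs.

    Returns an empty string when no hive partitioning is configured,
    so plain external tables get no suffix.
    """
    if hive_partitioning is None:
        return ""
    cols = ",\n    ".join(
        "`{}` {}".format(name, type_) for name, type_ in partition_columns
    )
    return "\n  WITH PARTITION COLUMNS (\n    {})".format(cols)
```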
Activity
  • No specific activity (comments, reviews, etc.) was provided in the context for this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Supported commands:

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces valuable support for creating external tables, including those with Hive partitioning. The implementation is solid, leveraging SQLAlchemy's DDL compilation hooks effectively. The changes in sqlalchemy_bigquery/base.py to handle external_data_configuration are well-designed, particularly the dynamic handling of format-specific options. The new documentation in README.rst is clear, and the tests in tests/unit/test_table_options.py are comprehensive. I have one suggestion to make the type validation for source_uris more robust.

Comment on lines +778 to +782
            if not isinstance(external_config.source_uris, (list, str)):
                raise TypeError(
                    "External table source_uris must be a list of strings"
                    " (or a single string for Bigtable)"
                )


Severity: medium

The current type check for source_uris is not strict enough. It allows a list containing non-string elements (e.g., [1, 2, 3]), which would lead to invalid SQL being generated as the uris option in BigQuery expects an array of strings. It would be more robust to also validate the types of the elements when a list is provided.

            if isinstance(external_config.source_uris, list):
                if not all(isinstance(uri, str) for uri in external_config.source_uris):
                    raise TypeError(
                        "When providing a list for source_uris, all elements must be strings."
                    )
            elif not isinstance(external_config.source_uris, str):
                raise TypeError(
                    "External table source_uris must be a list of strings"
                    " (or a single string for Bigtable)"
                )
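The suggested stricter check could also be factored into a small standalone helper with the same semantics; a sketch (the function name is hypothetical, not part of the PR):

```python
def validate_source_uris(source_uris):
    """Validate that source_uris is a list of strings, or a single
    string (as Bigtable-backed configs use); raise TypeError otherwise."""
    if isinstance(source_uris, list):
        if not all(isinstance(uri, str) for uri in source_uris):
            raise TypeError(
                "When providing a list for source_uris, all elements must be strings."
            )
    elif not isinstance(source_uris, str):
        raise TypeError(
            "External table source_uris must be a list of strings"
            " (or a single string for Bigtable)"
        )
```

This rejects a list like [1, 2, 3] up front rather than letting it produce invalid SQL.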

Add bigquery_external_data_configuration dialect option for creating
external tables backed by files in Google Cloud Storage.

Supports:
- ExternalConfig with source format and URIs
- HivePartitioningOptions for hive-partitioned external tables
- Format-specific options (CSV, Parquet, Avro)

Includes README documentation with usage examples.
@aholyoke aholyoke force-pushed the feature/external-tables branch from 2ce3800 to e88f0b8 on February 22, 2026 at 21:51
@aholyoke
Contributor Author

@chalmerlowe @Linchin For review when you have a chance, thanks!
