feat: introduce Result Service using Lakekeeper as REST catalog for Iceberg - catalog migration #4272

Open
mengw15 wants to merge 2 commits into apache:main from mengw15:Lakekeeper-catalog-migration

mengw15 (Contributor) commented Mar 9, 2026

What changes were proposed in this PR?

This is PR 1 of a decomposed series from #4242, focusing on the core Iceberg catalog migration to support Lakekeeper as a
REST catalog.

Scala changes:

  • IcebergUtil.scala: added createRestCatalog() for REST catalog connections with S3FileIO (MinIO), and namespace auto-creation for all catalog types
  • IcebergCatalogInstance.scala: updated singleton to support REST catalog type selection
  • IcebergTableWriter.scala: updated for REST catalog compatibility
  • StorageConfig.scala / EnvironmentalVariable.scala: added REST catalog configuration (URI, warehouse name, region, S3
    bucket) and environment variable support
  • storage.conf: added REST catalog config section (default remains postgres for backward compatibility)
  • build.sbt: added iceberg-aws, AWS SDK dependencies, and Netty version override for Arrow compatibility
  • PythonWorkflowWorker.scala / ComputingUnitManagingResource.scala: propagate REST catalog config to Python workers and
    computing units

Python changes:

  • iceberg_catalog_instance.py / iceberg_utils.py: added REST catalog support via PyIceberg
  • storage_config.py: added REST catalog configuration parsing
  • texera_run_python_worker.py: accept REST catalog config from Scala side
  • requirements.txt: upgraded PyIceberg (0.8.1 → 0.9.0), added s3fs/aiobotocore for S3 access
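
As a rough illustration of the Python-side REST catalog support described above, the sketch below maps REST catalog settings to PyIceberg catalog properties. It is not the code from this PR: `RestCatalogConfig`, `rest_catalog_properties`, `connect`, and the catalog name `texera` are hypothetical names, and the S3 endpoint and credential fields are assumed settings that the description does not list. The property keys (`type`, `uri`, `warehouse`, `s3.*`) are standard PyIceberg configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestCatalogConfig:
    """Hypothetical container for REST catalog settings (not the PR's class)."""
    uri: str          # REST catalog endpoint, e.g. a Lakekeeper URL
    warehouse: str    # warehouse name registered in the catalog
    region: str       # S3 region
    s3_endpoint: str  # assumed: MinIO / S3 endpoint
    access_key: str   # assumed: S3 access key
    secret_key: str   # assumed: S3 secret key


def rest_catalog_properties(cfg: RestCatalogConfig) -> dict:
    """Translate the config into PyIceberg catalog properties."""
    return {
        "type": "rest",
        "uri": cfg.uri,
        "warehouse": cfg.warehouse,
        "s3.endpoint": cfg.s3_endpoint,
        "s3.region": cfg.region,
        "s3.access-key-id": cfg.access_key,
        "s3.secret-access-key": cfg.secret_key,
    }


def connect(cfg: RestCatalogConfig, name: str = "texera"):
    """Open the catalog; requires PyIceberg and a reachable REST catalog."""
    # Imported lazily so the property builder works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **rest_catalog_properties(cfg))
```

Keeping the property mapping separate from the connection call makes the mapping easy to check without a running Lakekeeper or MinIO instance.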

Database:

  • texera_lakekeeper.sql: schema for Lakekeeper's backing database

Note: This PR keeps postgres as the default catalog type in storage.conf. Switching to REST catalog will be enabled
in subsequent deployment PRs.
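
The backward-compatible default can be pictured with a minimal sketch like the one below, which resolves the catalog type from an environment variable and falls back to postgres when nothing is set. The variable name `STORAGE_ICEBERG_CATALOG_TYPE`, the helper `resolve_catalog_type`, and the set of accepted values are assumptions for illustration; the actual keys are defined in storage.conf and EnvironmentalVariable.scala and may differ.

```python
import os

# Assumed set of catalog types for this sketch; the real code may accept more.
SUPPORTED_CATALOG_TYPES = ("postgres", "rest")

# Hypothetical environment variable name, not taken from the PR.
CATALOG_TYPE_ENV = "STORAGE_ICEBERG_CATALOG_TYPE"


def resolve_catalog_type(env=None, default="postgres"):
    """Return the configured catalog type, defaulting to postgres."""
    env = os.environ if env is None else env
    value = env.get(CATALOG_TYPE_ENV, default).strip().lower()
    if value not in SUPPORTED_CATALOG_TYPES:
        raise ValueError(f"unsupported catalog type: {value!r}")
    return value
```

With no override present, existing deployments keep using the postgres catalog; setting the variable to `rest` would opt in to the REST catalog once the deployment pieces are in place.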

Any related issues, documentation, discussions?

Part of #4126. Subsequent PRs will cover:

  • Lakekeeper bootstrap script
  • Single-node deployment
  • Kubernetes deployment
  • CI integration

How was this PR tested?

Manual

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude.

Labels

common, ddl-change, dependencies, engine, python, service
