Public GCP Data Architecture Baseline: Hybrid Warehouse/Lakehouse with Batch + Streaming
This repository contains architecture guidance, standards, diagram templates, and folder placeholders for GCP data platforms:
- Event-driven ingestion patterns with Cloud Functions + Pub/Sub
- Stream and batch processing patterns with Dataflow (Apache Beam)
- Data organization patterns with Bronze/Silver/Gold conventions on GCS + BigQuery
- Architecture documentation and pattern placeholders for pipeline design decisions
- Alignment with a separate Terraform infrastructure repository for GCP provisioning
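The event-driven ingestion pattern above can be sketched as a small envelope builder: each raw record is wrapped with an event id and ingestion timestamp before being published to Pub/Sub. The field names and topic path below are illustrative assumptions, not conventions defined by this repository.

```python
import uuid
from datetime import datetime, timezone


def build_event_envelope(source: str, payload: dict) -> dict:
    """Wrap a raw source record in a minimal ingestion envelope.

    The event_id can double as an idempotency key downstream.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }


# Inside a Cloud Function, the envelope would then be published with the
# google-cloud-pubsub client (not shown; topic name is hypothetical):
#   publisher.publish("projects/my-project/topics/raw-events",
#                     json.dumps(envelope).encode("utf-8"))
envelope = build_event_envelope("crm", {"customer_id": 42})
```

Keeping the envelope free of source-specific fields makes one Pub/Sub topic reusable across sources, with the `payload` carrying the source schema.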
Scope note:
- This repository does not host production runtime implementations.
- Production runtime code is expected in private runtime repositories.
- Infrastructure provisioning is expected in a separate Terraform repository.
- Production folders exist but are still placeholders: `functions/ingestion/`, `functions/trigger/`, `dataflow/pipelines/`.
- Runtime modules are intentionally not implemented in this public baseline.
- CI currently runs quality gates (lint, type checks via pre-commit, tests).
- This repository is maintained as a public baseline (docs, patterns, templates).
- Concrete production pipelines should live in private runtime repositories per project context.
This project is intentionally hybrid, not pure Kappa or pure Lambda:
- Use warehouse-first patterns when BigQuery native tables are fastest to deliver value.
- Use lakehouse patterns (BigLake + open formats) when interoperability and file-based processing matter.
- Run streaming and batch side by side; choose per source SLA, data shape, and cost profile.
- Treat Data Mesh as an organizational model, not a mandatory runtime pattern for this repo.
- Apply cross-cutting controls: data contracts, schema evolution, DLQ/replay, idempotency, quality gates, observability/SLO, and governance baselines.
See the full decision model in docs/architecture.md.
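Of the cross-cutting controls listed above, idempotency is the easiest to sketch: drop any event whose idempotency key has already been processed. This sketch uses an in-memory set purely for illustration; a real pipeline would back `seen` with a durable store (for example a BigQuery MERGE key or a Firestore/Redis set).

```python
def deduplicate(events, seen=None):
    """Drop events whose idempotency key was already processed.

    `seen` stands in for a durable key store; here it is an
    in-memory set so the sketch stays self-contained.
    """
    seen = set() if seen is None else seen
    unique = []
    for event in events:
        key = event["event_id"]
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Because Pub/Sub delivers at-least-once, a step like this (or an equivalent MERGE on the idempotency key) is what makes replay from a DLQ safe.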
| Category | Technologies in scope |
|---|---|
| Processing | Cloud Functions, Pub/Sub, Dataflow |
| Storage & query | GCS, BigQuery, BigLake |
| Table formats | Apache Iceberg (primary lakehouse table format), BigQuery native tables |
| File formats | JSON/NDJSON, Avro, Parquet |
| Optional ecosystem | Databricks/Delta interoperability considered when required by source/domain |
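One way the Bronze/Silver/Gold conventions and file formats in scope can combine on GCS is a date-partitioned path scheme per layer. The bucket name, layer-to-format mapping (NDJSON for raw bronze, Parquet for silver/gold), and path shape below are assumptions for illustration, not a layout mandated by this baseline.

```python
from datetime import date

# Assumed mapping: raw bronze as NDJSON, refined layers as Parquet.
LAYER_FORMATS = {"bronze": "ndjson", "silver": "parquet", "gold": "parquet"}


def object_path(bucket, domain, layer, source, run_date, basename):
    """Build a date-partitioned GCS object path for a medallion layer."""
    ext = LAYER_FORMATS[layer]
    return (f"gs://{bucket}/{layer}/{domain}/{source}/"
            f"dt={run_date.isoformat()}/{basename}.{ext}")


path = object_path("acme-data", "sales", "bronze", "crm",
                   date(2024, 5, 1), "part-0001")
# -> gs://acme-data/bronze/sales/crm/dt=2024-05-01/part-0001.ndjson
```

A `dt=` Hive-style partition key keeps the same objects queryable as BigLake external tables without renaming.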
| Directory | Purpose |
|---|---|
| `functions/` | Cloud Function folder structure placeholders (public baseline) |
| `dataflow/` | Beam pipeline folder structure placeholders (public baseline) |
| `shared/common/` | Shared infrastructure utilities (I/O, logging, secrets) |
| `tests/` | Unit/integration tests mirroring source layout |
| `docs/` | Architecture, CI/CD, and engineering guidance |
For medium/large runtimes (for example, ~50 pipelines) in GCP:
- Dataflow: organize by domain -> layer (bronze/silver/gold) -> pipeline module.
- Functions: keep `ingestion/` and `trigger/`, then group by domain and source/event purpose.
- CI/CD: avoid one mega deploy pipeline; use shared CI plus selective CD by changed module in private runtime repos.
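The selective-CD idea can be sketched as a mapping from changed file paths (as produced by `git diff --name-only`) to deployable modules. The directory depths assumed here follow the domain -> layer -> pipeline convention above; treat them as an illustrative assumption, since this repository does not fix the exact layout.

```python
def changed_modules(changed_paths):
    """Map changed file paths to deployable pipeline modules.

    Assumes dataflow/<domain>/<layer>/<pipeline>/... and
    functions/<kind>/<domain>/<function>/... layouts; anything else
    (docs, shared code) is treated as non-deployable for simplicity.
    """
    modules = set()
    for path in changed_paths:
        parts = path.split("/")
        if parts[0] in ("dataflow", "functions") and len(parts) >= 4:
            modules.add("/".join(parts[:4]))
    return sorted(modules)


# Paths as produced by e.g. `git diff --name-only origin/main...HEAD`:
print(changed_modules([
    "dataflow/sales/silver/orders_enrich/pipeline.py",
    "dataflow/sales/silver/orders_enrich/README.md",
    "functions/ingestion/sales/crm_webhook/main.py",
    "docs/architecture.md",
]))
# -> ['dataflow/sales/silver/orders_enrich', 'functions/ingestion/sales/crm_webhook']
```

A CD job can then fan out one deploy per returned module instead of redeploying all ~50 pipelines on every merge.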
Run the local quality gates with `just`:

```
just sync
just pre-commit-install
just lint
just type
just test
```

See details in:

- Architecture Guide
- CI/CD Guide
- Architecture Decisions (ADRs)
- Dataflow Guide
- Cloud Functions Guide
- GCP Project Baseline Guide
- Diagram Catalog
- First E2E Blueprint
- First Runtime Scope (Step 2)
- Step 3 Private Runtime Checklist
MIT License. See LICENSE.