cloud-landerox-data

Public GCP Data Architecture Baseline: Hybrid Warehouse/Lakehouse with Batch + Streaming

What this repository is

This repository contains architecture guidance, standards, diagram templates, and folder placeholders for GCP data platforms:

  • Event-driven ingestion patterns with Cloud Functions + Pub/Sub
  • Stream and batch processing patterns with Dataflow (Apache Beam)
  • Data organization patterns with Bronze/Silver/Gold conventions on GCS + BigQuery
  • Architecture documentation and pattern placeholders for pipeline design decisions
  • Alignment with a separate Terraform infrastructure repository for GCP provisioning
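As a hedged illustration of the event-driven ingestion pattern above, the sketch below decodes a Pub/Sub-style message payload the way a Cloud Function handler typically would. The function name and payload shape are assumptions for illustration, not code from this repository.

```python
import base64
import json


def decode_pubsub_event(event: dict) -> dict:
    """Decode the base64-encoded JSON payload of a Pub/Sub-triggered event.

    `event` mirrors the dict a Pub/Sub-triggered Cloud Function receives:
    the publisher's bytes arrive base64-encoded under the "data" key.
    """
    payload = base64.b64decode(event["data"]).decode("utf-8")
    record = json.loads(payload)
    # Message attributes (e.g. a schema version used by data contracts)
    # ride alongside the payload; keep them with the record.
    record["_attributes"] = event.get("attributes", {})
    return record


# Example: a publisher sent {"id": 1, "source": "orders"}.
event = {
    "data": base64.b64encode(b'{"id": 1, "source": "orders"}').decode(),
    "attributes": {"schema_version": "v1"},
}
print(decode_pubsub_event(event))
```

In a real handler the decoded record would then be validated against a data contract before landing in a bronze bucket or table.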

Scope note:

  • This repository does not host production runtime implementations.
  • Production runtime code is expected in private runtime repositories.
  • Infrastructure provisioning is expected in a separate Terraform repository.

Current status

  • Production folders exist but are still placeholders:
    • functions/ingestion/
    • functions/trigger/
    • dataflow/pipelines/
  • Runtime modules are intentionally not implemented in this public baseline.
  • CI currently runs quality gates (lint, type checks via pre-commit, tests).
  • This repository is maintained as a public baseline (docs, patterns, templates).
  • Concrete production pipelines should live in private runtime repositories per project context.

Architecture stance

This project is intentionally hybrid, not pure Kappa or pure Lambda:

  • Use warehouse-first patterns when BigQuery native tables are fastest to deliver value.
  • Use lakehouse patterns (BigLake + open formats) when interoperability and file-based processing matter.
  • Run streaming and batch side by side; choose per source based on SLA, data shape, and cost profile.
  • Treat Data Mesh as an organizational model, not a mandatory runtime pattern for this repo.
  • Apply cross-cutting controls: data contracts, schema evolution, DLQ/replay, idempotency, quality gates, observability/SLO, and governance baselines.

See the full decision model in docs/architecture.md.
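Two of the cross-cutting controls listed above, idempotency and DLQ/replay, can be sketched in plain Python. The in-memory `seen` set and `dead_letter_queue` list are stand-ins for a durable dedup store and a DLQ topic; all names here are illustrative, not this repository's API.

```python
import json

seen: set[str] = set()              # stand-in for a durable dedup store
dead_letter_queue: list[dict] = []  # stand-in for a DLQ topic


def process_once(message: dict) -> bool:
    """Process a message idempotently; route poison messages to the DLQ.

    Returns True if the message was processed, False if it was a
    duplicate delivery or was parked in the DLQ for later replay.
    """
    msg_id = message["message_id"]
    if msg_id in seen:
        return False  # duplicate delivery: safe no-op (idempotency)
    try:
        json.loads(message["body"])  # stand-in for real transform/validation
    except (KeyError, json.JSONDecodeError):
        dead_letter_queue.append(message)  # park for inspection and replay
        return False
    seen.add(msg_id)  # record success so redeliveries are ignored
    return True
```

At-least-once delivery from Pub/Sub means duplicates are expected; keying the dedup store on a stable message id is what makes retries safe.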

Reference technology map

| Category | Technologies in scope |
| --- | --- |
| Processing | Cloud Functions, Pub/Sub, Dataflow |
| Storage & query | GCS, BigQuery, BigLake |
| Table formats | Apache Iceberg (primary lakehouse table format), BigQuery native tables |
| File formats | JSON/NDJSON, Avro, Parquet |
| Optional ecosystem | Databricks/Delta interoperability, considered when required by source/domain |

Repository structure

| Directory | Purpose |
| --- | --- |
| `functions/` | Cloud Function folder-structure placeholders (public baseline) |
| `dataflow/` | Beam pipeline folder-structure placeholders (public baseline) |
| `shared/common/` | Shared infrastructure utilities (I/O, logging, secrets) |
| `tests/` | Unit/integration tests mirroring the source layout |
| `docs/` | Architecture, CI/CD, and engineering guidance |

Scaling guidance (recommended)

For medium/large runtimes (for example, ~50 pipelines) in GCP:

  • Dataflow: organize by domain -> layer (bronze/silver/gold) -> pipeline module.
  • Functions: keep ingestion/ and trigger/, then group by domain and source/event purpose.
  • CI/CD: avoid one mega deploy pipeline; use shared CI plus selective CD by changed module in private runtime repos.
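Under the domain -> layer -> pipeline-module convention above, a private runtime repository's Dataflow tree might look like the following; the domain and pipeline names are hypothetical:

```
dataflow/pipelines/
  orders/                  # domain
    bronze/
      ingest_orders.py     # raw landing pipeline
    silver/
      clean_orders.py      # validated/conformed pipeline
    gold/
      orders_daily_agg.py  # serving-layer aggregate
  payments/                # next domain, same layer layout
    bronze/
    silver/
    gold/
```

Grouping by domain first keeps selective CD simple: a change under `orders/` only redeploys that domain's pipelines.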

See docs/ for details.

Getting started

Prerequisites

Setup

```shell
just sync
just pre-commit-install
```

Quality checks

```shell
just lint
just type
just test
```

Documentation

License

MIT License. See LICENSE.
