Merged
5 changes: 5 additions & 0 deletions .gitattributes
@@ -0,0 +1,5 @@
# GDAL platform tarball — ~250 MB, slowly-changing (bumped only when the
# UbuntuGIS-PPA-resolved GDAL/PROJ/GEOS set changes). Stored via Git LFS so
# clones stay light; LFS pulls the bytes on demand. See
# resources/static/README.md for how to rebuild it.
resources/static/geobrix-gdal-platform-noble.tar.gz filter=lfs diff=lfs merge=lfs -text
305 changes: 305 additions & 0 deletions .github/workflows/package-geobrix-artifacts.yml
@@ -0,0 +1,305 @@
name: package geobrix artifacts
# One-stop release packaging: builds the JAR + Python wheel inline,
# repackages the committed GDAL platform tarball with the JAR baked in, and
# attaches every release artifact (JAR, wheel, GDAL tarball, sidecar, init
# script, docs zip) to an existing tag via `gh release upload --clobber`.
#
# Manual-only: Actions -> "package geobrix artifacts" -> Run workflow.
#
# Inputs:
# ref - git ref to build from (branch / tag / SHA). Empty = the
# ref the workflow was launched on.
# attach_to_tag - tag (e.g. v0.3.0) to attach all six files to. Empty =
# produce workflow artifacts only, no tag mutation.
#
# Always uploaded as workflow artifacts (downloadable from the run page):
# 1. geobrix-<version>-jar-with-dependencies.jar (built inline)
# 2. dblabs_geobrix-<version>-py3-none-any.whl (built inline)
# 3. geobrix-gdal-artifacts-v<version>-noble.tar.gz (repackaged)
# 4. geobrix-gdal-artifacts-v<version>-noble.tar.gz.sha256 (computed)
# 5. geobrix-gdal-init.sh (committed)
# 6. geobrix-docs-<version>.zip (built inline from docs/)
#
# The slow PPA/apt dance does NOT run here - the GDAL platform layer
# (~250 MB of .debs + wheels + JNI) is committed under resources/static/
# (Git LFS) and was reviewed at the PR that bumped it. This workflow only
# grafts the per-release JAR into a copy of those bytes and recomputes the
# SHA256SUMS manifest. Total runtime ~2-3 min vs. ~15 if we rebuilt the
# platform layer per release.
#
# Security: every workflow_dispatch input is surfaced as env: before any
# run: block. Direct ${{ ... }} interpolation of user inputs into shell is
# a command-injection risk - we don't do it. See:
# https://github.blog/security/vulnerability-research/how-to-catch-github-actions-workflow-injections-before-attackers-do/
#
# All jobs run on the Databricks-hardened runner group (Labs lockdown policy).
on:
  workflow_dispatch:
    inputs:
      ref:
        description: "Git ref (branch / tag / SHA) to build from. Empty = the ref the workflow was launched on."
        required: false
        type: string
        default: ""
      attach_to_tag:
        description: "Tag to attach the six files to (e.g. v0.3.0). Empty = workflow artifacts only."
        required: false
        type: string
        default: ""

permissions:
  contents: read

jobs:
  package:
    runs-on:
      group: databrickslabs-protected-runner-group
      labels: linux-ubuntu-latest
    environment: runtime
    permissions:
      contents: write
      id-token: write
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      INPUT_REF: ${{ inputs.ref }}
      INPUT_ATTACH_TO_TAG: ${{ inputs.attach_to_tag }}
    strategy:
      matrix:
        python: [ 3.12.3 ]
        numpy: [ 2.1.3 ]
        gdal: [ 3.11.4 ]
        spark: [ 4.0.0 ]
    steps:
      - name: checkout code (with LFS)
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          ref: ${{ inputs.ref || github.ref }}
          token: ${{ secrets.REPO_ACCESS_TOKEN || secrets.GITHUB_TOKEN }}
          lfs: true

      - name: verify platform tarball is LFS-pulled
        shell: bash
        run: |
          PLATFORM=resources/static/geobrix-gdal-platform-noble.tar.gz
          if [ ! -s "$PLATFORM" ]; then
            echo "$PLATFORM is missing or empty" >&2
            exit 1
          fi
          if head -c 50 "$PLATFORM" | grep -q '^version https://git-lfs'; then
            echo "$PLATFORM is an LFS pointer, not the binary - checkout's lfs: true didn't resolve it." >&2
            exit 1
          fi
          ( cd resources/static && sha256sum -c geobrix-gdal-platform-noble.tar.gz.sha256 )

      - name: Configure JDK
        uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5.2.0
        with:
          java-version: "17"
          distribution: "zulu"
          cache: "maven"
          cache-dependency-path: "pom.xml"

      - name: Set Maven opts
        shell: bash
        run: echo "MAVEN_OPTS=-Xmx4g -XX:+UseG1GC" >> "$GITHUB_ENV"

      - name: Create pip cache key file
        shell: bash
        env:
          GH_REF: ${{ github.ref }}
          PY: ${{ matrix.python }}
          NP: ${{ matrix.numpy }}
          SP: ${{ matrix.spark }}
          GD: ${{ matrix.gdal }}
        run: |
          echo "${GH_REF}-${PY}-${NP}-${SP}-${GD}" > .ci-pip-cache-key

      - name: Pre-bootstrap pip for JFrog
        uses: ./.github/actions/jfrog-pip-bootstrap

      - name: Configure Python
        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
        with:
          cache: "pip"
          cache-dependency-path: ".ci-pip-cache-key"
          python-version: ${{ matrix.python }}

      - name: Authenticate for JFrog
        uses: ./.github/actions/jfrog-auth

      - name: Verify Maven dependency PGP signatures
        shell: bash
        run: ./scripts/security/maven-pgp-verify

      - name: build scala JAR (skip tests, no GDAL install)
        shell: bash
        run: |
          mvn -C -q clean package -DskipTests -Dscoverage.skip -Dscalastyle.fail=false
          ls -lh target/geobrix-*-jar-with-dependencies.jar

      # Hash-pinned minimal build set: just build / setuptools / wheel. See
      # python/geobrix/requirements-build.in for the source list; the .txt
      # lockfile must be regenerated via `uv pip compile --generate-hashes`
      # when bumping any of those. Slimmer than requirements-ci.txt (which
      # pulls pytest, black, scientific stack) by ~30s of pip resolve time.
      - name: install Python build deps (hash-pinned)
        shell: bash
        run: |
          pip install --upgrade pip==25.0.1
          pip install --require-hashes -r python/geobrix/requirements-build.txt

      - name: build Python wheel
        shell: bash
        run: |
          cd python/geobrix
          python -m build
          ls -lh dist/*.whl

      # ---- docs ----------------------------------------------------------
      # Docusaurus static-zip build, version-named from the JAR we just
      # produced. This replaces the previously-committed
      # resources/static/geobrix-docs-*.zip - the doc bundle is now a
      # release-time output, since release IS the natural time to cut a
      # static docs snapshot.
      - name: Setup Node
        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
        with:
          node-version: "20"
          cache: "npm"
          cache-dependency-path: docs/package-lock.json

      # JFrog auth ran earlier; that step writes npm credentials too (per
      # deploy-docs.yml's pattern), so `npm ci` below will route through
      # the JFrog mirror.
      - name: Install docs dependencies
        shell: bash
        run: cd docs && npm ci

      - name: Build static docs zip
        shell: bash
        run: |
          set -euo pipefail
          # Parse version up front so we can name the zip - the repackage
          # step below also derives it, but doing it here too lets us name
          # the docs zip in a single shell invocation.
          JAR=$(ls target/geobrix-*-jar-with-dependencies.jar 2>/dev/null | head -1)
          VERSION=$(basename "$JAR" | sed -nE 's/^geobrix-(.+)-jar-with-dependencies\.jar$/\1/p')
          if [ -z "$VERSION" ]; then
            echo "could not parse version from JAR name: $(basename "$JAR")" >&2
            exit 1
          fi

          ( cd docs && npm run build:static-zip )

          mkdir -p dist
          DOCS_ZIP="dist/geobrix-docs-${VERSION}.zip"
          # Strip DS_Store noise; -q to keep run log readable.
          ( cd docs/build-static-zip && zip -qr "../../$DOCS_ZIP" . -x "*.DS_Store" )

          echo "GBX_DOCS_ZIP=$DOCS_ZIP" >> "$GITHUB_ENV"
          ls -lh "$DOCS_ZIP"

      - name: repackage platform tarball with JAR
        shell: bash
        run: |
          set -euo pipefail
          JAR=$(ls target/geobrix-*-jar-with-dependencies.jar 2>/dev/null | head -1)
          if [ -z "$JAR" ]; then
            echo "no geobrix-*-jar-with-dependencies.jar found in target/" >&2
            exit 1
          fi

          VERSION=$(basename "$JAR" | sed -nE 's/^geobrix-(.+)-jar-with-dependencies\.jar$/\1/p')
          if [ -z "$VERSION" ]; then
            echo "could not parse version from JAR name: $(basename "$JAR")" >&2
            exit 1
          fi
          echo "GeoBrix version: $VERSION"

          mkdir -p dist staging
          rm -rf staging/bundle
          mkdir -p staging/bundle

          tar -xzf resources/static/geobrix-gdal-platform-noble.tar.gz \
            -C staging/bundle --strip-components=1

          cp "$JAR" "staging/bundle/$(basename "$JAR")"

          ( cd staging/bundle && \
            rm -f SHA256SUMS && \
            find . -type f ! -name SHA256SUMS -print0 \
              | LC_ALL=C sort -z \
              | xargs -0 sha256sum > SHA256SUMS )

          TARBALL=geobrix-gdal-artifacts-v${VERSION}-noble.tar.gz
          tar --sort=name --mtime='UTC 2020-01-01' \
            --owner=0 --group=0 --numeric-owner \
            -czf "dist/$TARBALL" -C staging bundle/
          ( cd dist && sha256sum "$TARBALL" > "$TARBALL.sha256" )

          echo "GBX_VERSION=$VERSION" >> "$GITHUB_ENV"
          echo "GBX_TARBALL=$TARBALL" >> "$GITHUB_ENV"

      - name: show release manifest
        shell: bash
        run: |
          echo "=== files to publish ==="
          ls -lh \
            "target/geobrix-${GBX_VERSION}-jar-with-dependencies.jar" \
            "python/geobrix/dist/"*.whl \
            "dist/${GBX_TARBALL}" \
            "dist/${GBX_TARBALL}.sha256" \
            scripts/geobrix-gdal-init.sh \
            "${GBX_DOCS_ZIP}"
          echo
          echo "=== outer sidecar ==="
          cat "dist/${GBX_TARBALL}.sha256"

      - name: upload as workflow artifacts
        uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
        with:
          name: geobrix-release-artifacts
          path: |
            target/geobrix-*-jar-with-dependencies.jar
            python/geobrix/dist/*.whl
            dist/geobrix-gdal-artifacts-*.tar.gz
            dist/geobrix-gdal-artifacts-*.tar.gz.sha256
            scripts/geobrix-gdal-init.sh
            dist/geobrix-docs-*.zip
          if-no-files-found: error
          retention-days: 30

      - name: attach to tag
        if: ${{ inputs.attach_to_tag != '' }}
        shell: bash
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          set -euo pipefail
          case "$INPUT_ATTACH_TO_TAG" in
            v[0-9]*|[0-9]*) : ;;
            *)
              echo "attach_to_tag '$INPUT_ATTACH_TO_TAG' doesn't look like a version tag." >&2
              exit 1
              ;;
          esac

          JAR=$(ls target/geobrix-*-jar-with-dependencies.jar | head -1)
          WHL=$(ls python/geobrix/dist/*.whl | head -1)

          gh release upload "$INPUT_ATTACH_TO_TAG" \
            "$JAR" \
            "$WHL" \
            "dist/${GBX_TARBALL}" \
            "dist/${GBX_TARBALL}.sha256" \
            scripts/geobrix-gdal-init.sh \
            "${GBX_DOCS_ZIP}" \
            --clobber

          echo "Attached to tag $INPUT_ATTACH_TO_TAG:"
          echo "  $(basename "$JAR")"
          echo "  $(basename "$WHL")"
          echo "  ${GBX_TARBALL}"
          echo "  ${GBX_TARBALL}.sha256"
          echo "  geobrix-gdal-init.sh"
          echo "  $(basename "${GBX_DOCS_ZIP}")"
4 changes: 4 additions & 0 deletions .gitignore
@@ -9,6 +9,10 @@
/target/
/spark-warehouse/
/artifacts/
# Local output of scripts/build-gdal-artifacts.sh — the platform tarball
# + sidecar are moved into resources/static/ (committed via Git LFS);
# everything else in dist/ is intermediate (extracted bundle, etc.).
/dist/
/python/geobrix/artifacts/
/python/geobrix/test/vectorx/spark-warehouse/
/python/geobrix/test/gridx/artifacts/
1 change: 1 addition & 0 deletions docs/docs/beta-release-notes.mdx
@@ -24,6 +24,7 @@ Released 2026-05-19. Per-version highlights; full migration tables are in the pe
- **Scalar args without `f.lit(...)`.** Python wrappers auto-wrap `bool` / `int` / `float` / `bytes`; Scala adds typed overloads. SQL was already natively-typed. String literals still wrap in `f.lit(...)` per pyspark's column-ref convention. Details and migration examples in [Scalar values vs `lit(...)` wrapping](#scalar-values-vs-lit-wrapping).
- **Example notebooks — EO Series, xView, and enablement diagrams.** New end-to-end walkthroughs under `docs/examples/` covering EO time-series, xView object-detection rasters, and RasterX architecture diagrams.
- **Supply-chain hardening (lockdown).** Jobs pinned to the Databricks-hardened runner group (org-level allowlist, ephemeral VMs, constrained secret access); every Maven dependency, transitive dep, plugin, and plugin dependency is PGP-verified against `.maven-keys.list` before any compile or test execution; pip and Maven routed through JFrog with OIDC; init script + pinned package versions vetted; new [Security](./security.mdx) page in the docs.
- **Pre-built, hash-verified GDAL bundle.** The GDAL native install path is now a CI-built tarball (`geobrix-gdal-artifacts-v<version>-noble.tar.gz` + matching `.sha256` sidecar, attached to each release alongside a versioned `geobrix-gdal-init.sh`). Cluster start drops from ~15 minutes (legacy PPA dance per boot) to ~30–90 seconds (verify sidecar → extract → `dpkg -i`). Trust chain is now four layers: CI-side GPG fingerprint pin → per-file `SHA256SUMS` inside the tarball → outer `.sha256` sidecar in the staging Volume → the Volume's write ACL. The legacy on-cluster path is preserved as [`scripts/geobrix-gdal-init-ppa.sh`](https://github.com/databrickslabs/geobrix/blob/main/scripts/geobrix-gdal-init-ppa.sh) for bundle bootstrapping. Bundle is `amd64` / `x86_64` only (Intel or AMD CPUs); ARM-based instance types — AWS Graviton, Ampere, Apple Silicon — are not supported. See [Installation](./installation) and the rationale on the [Security](./security.mdx#pinned-gdal-native--multi-layer-trust-chain) page.

---

35 changes: 35 additions & 0 deletions docs/docs/developers.mdx
@@ -39,6 +39,41 @@ GeoBrix is a multi-artifact repo: Scala/JVM core, Python bindings, docs, and too

Development and CI use a **Docker** image (`geobrix-dev`) for a consistent environment; many Cursor commands run inside that container.

### Git LFS — required to clone the GDAL platform tarball

The GDAL platform tarball at `resources/static/geobrix-gdal-platform-noble.tar.gz` (~90 MB, ships in every GeoBrix release as the runtime GDAL bundle) is stored via **Git LFS** so the binary lives in LFS storage instead of the git pack. The matching `.sha256` sidecar is small enough to live in git directly and is NOT LFS-tracked. The tracking rule is in [`.gitattributes`](https://github.com/databrickslabs/geobrix/blob/main/.gitattributes) at the repo root.

#### One-time install per machine

```bash
brew install git-lfs # macOS; or apt-get install git-lfs on Debian/Ubuntu
git lfs install # writes LFS filters into ~/.gitconfig
```

#### Cloning the repo

After `git lfs install`, a normal `git clone` of geobrix automatically fetches LFS objects:

```bash
git clone git@github.com:databrickslabs/geobrix.git
```

If you cloned **before** installing git-lfs, run `git lfs pull` from inside the working tree to fetch the binary. Without that step, `resources/static/geobrix-gdal-platform-noble.tar.gz` will be a ~130-byte LFS pointer file rather than the real 90 MB tarball, and it cannot be extracted or repackaged. The [`package-geobrix-artifacts.yml`](https://github.com/databrickslabs/geobrix/blob/main/.github/workflows/package-geobrix-artifacts.yml) workflow guards against the same state in CI: after its `lfs: true` checkout, the `verify platform tarball is LFS-pulled` step fails the run if the file is missing, is still a pointer, or does not match the committed `.sha256` sidecar.
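
To check locally whether the working-tree file is still a pointer, something like the sketch below may help. The helper name `is_lfs_pointer` is hypothetical (it is not part of the repo); the test itself mirrors the workflow's check for the `version https://git-lfs` header at the start of the file.

```bash
# Hypothetical helper (not part of the repo): succeeds when the file
# starts with the Git LFS pointer header instead of binary content.
is_lfs_pointer() {
  local file="$1"
  # A pointer is a tiny text file whose first line begins with
  # "version https://git-lfs"; a real tarball starts with gzip magic bytes.
  head -c 50 "$file" | LC_ALL=C grep -q '^version https://git-lfs'
}

# Example usage from the repo root:
#   if is_lfs_pointer resources/static/geobrix-gdal-platform-noble.tar.gz; then
#     git lfs pull
#   fi
```

The function only reads the first 50 bytes, so it stays cheap on the full-size tarball.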

#### Updating the platform tarball

Rebuild only when `GDAL_PPA_VERSION` changes, when DBR moves to a new Ubuntu LTS, or for a security advisory against one of the bundled libs. See [`resources/static/README.md`](https://github.com/databrickslabs/geobrix/blob/main/resources/static/README.md) for the full Docker-based recipe. The short version:

1. Run [`scripts/build-gdal-artifacts.sh --platform-only`](https://github.com/databrickslabs/geobrix/blob/main/scripts/build-gdal-artifacts.sh) inside a fresh `ubuntu:24.04` container (Docker recipe in the README).
2. Move the resulting `geobrix-gdal-platform-noble.tar.gz` + `.sha256` from `dist/` into `resources/static/`.
3. `git add resources/static/geobrix-gdal-platform-noble.tar.gz` — the LFS filter intercepts via `.gitattributes`. Verify with `git lfs ls-files` (should list the tarball) and `git diff --cached --stat resources/static/geobrix-gdal-platform-noble.tar.gz` (should show ~3 lines added — the pointer — not 90 MB).
4. `git add resources/static/geobrix-gdal-platform-noble.tar.gz.sha256` — committed normally, not LFS.
5. Open a PR. The reviewer **re-runs the build script locally in their own `ubuntu:24.04` container** and confirms the resulting sha256 matches the committed sidecar before approving — that PR review is the trust anchor for every cluster that subsequently installs from this bundle. See [Security](./security#pinned-gdal-native--multi-layer-trust-chain) for the full chain.
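
The reviewer's comparison in step 5 can be sketched as follows. The function name `matches_sidecar` and the example path are hypothetical. The sketch assumes the sidecar uses the standard `sha256sum` output format (`<hash>  <filename>`), which is what the workflow's `sha256sum -c` check implies, and that GNU coreutils is available, as it is in an `ubuntu:24.04` container.

```bash
# Hypothetical helper: compare a locally rebuilt tarball against the
# hash recorded in a committed .sha256 sidecar.
matches_sidecar() {
  local rebuilt="$1" sidecar="$2"
  local expected actual
  # The first whitespace-delimited field of the sidecar is the hex digest.
  expected=$(awk '{print $1; exit}' "$sidecar")
  actual=$(sha256sum "$rebuilt" | awk '{print $1}')
  if [ "$expected" = "$actual" ]; then
    echo "OK: $actual"
  else
    echo "MISMATCH: sidecar=$expected rebuilt=$actual" >&2
    return 1
  fi
}

# Example usage (the dist/ path is illustrative):
#   matches_sidecar dist/geobrix-gdal-platform-noble.tar.gz \
#     resources/static/geobrix-gdal-platform-noble.tar.gz.sha256
```

Comparing digests directly, rather than running `sha256sum -c`, lets the rebuilt file live at any path and not only next to the sidecar.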

#### Storage considerations

LFS bandwidth and storage come from the `databrickslabs` GitHub org quota. Each tarball bump consumes both. Don't rebuild the tarball just to bump GeoBrix versions — the release workflow grafts the per-release JAR onto the committed platform tarball without changing it.
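
For orientation, the graft looks roughly like the sketch below. It follows the repackage step of `package-geobrix-artifacts.yml` (extract a copy, add the JAR, regenerate `SHA256SUMS`, re-tar with fixed metadata). The function name `graft_jar` and its argument order are hypothetical, and the `tar` flags assume GNU tar.

```bash
# Hypothetical wrapper around the workflow's repackage steps.
# usage: graft_jar <platform.tar.gz> <jar> <out.tar.gz>
graft_jar() {
  local platform="$1" jar="$2" out="$3"
  local staging
  staging=$(mktemp -d)
  mkdir -p "$staging/bundle"

  # Work on an extracted copy; the committed tarball is only read.
  tar -xzf "$platform" -C "$staging/bundle" --strip-components=1
  cp "$jar" "$staging/bundle/"

  # Rebuild the per-file manifest in a stable, byte-sorted order.
  ( cd "$staging/bundle" \
      && rm -f SHA256SUMS \
      && find . -type f ! -name SHA256SUMS -print0 \
         | LC_ALL=C sort -z \
         | xargs -0 sha256sum > SHA256SUMS )

  # Sorted names plus fixed mtime and owner keep the tar metadata stable.
  tar --sort=name --mtime='UTC 2020-01-01' \
      --owner=0 --group=0 --numeric-owner \
      -czf "$out" -C "$staging" bundle/
  rm -rf "$staging"
}
```

Because the platform tarball is only an input here, its LFS object and sidecar stay byte-identical from one GeoBrix release to the next.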

### Testing on a Databricks cluster

You can run the **Essential bundle** and **primitive Volume tests** on a live Databricks cluster so that Volume paths are FUSE-mounted and the bundle uses pathlib/shutil only (no Databricks Files API).