
[CI] Trim docker image build context: exclude Maven target/, drop pip cache #2907

Merged
jiayuasu merged 1 commit into apache:master from jiayuasu:docker-trim-build-context
May 6, 2026

@jiayuasu jiayuasu commented May 6, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • No: this is a CI / release-tooling update. The PR name follows the format [CI] my subject.

What changes were proposed in this PR?

Two related fixes that together restore the apache/sedona docker image download size to its pre-1.9.0 footprint.

Background — what regressed

Per the Docker Hub tags API, apache/sedona:1.9.0 ships at 4.03 GB compressed (per-arch, both amd64 and arm64) while apache/sedona:1.8.1 is 2.97 GB. A layer-by-layer comparison of the published manifests (37 layers per arch) shows all but one layer are identical to within a few MB:

| Layer | 1.8.1 | 1.9.0 | Δ (MB) |
|---|---|---|---|
| L08 (install-spark.sh) | 1241 MB | 1239 MB | ~0 |
| L10 (pip install -r requirements.txt) | 715 MB | 728 MB | +13 |
| L12 (COPY ./spark-shaded/) | 0.7 MB | 1104 MB | +1103 |
| L18 (install-zeppelin.sh) | 504 MB | 504 MB | 0 |

Layer 12 is COPY ./spark-shaded/ ${SEDONA_HOME}/spark-shaded/ (line 57 of docker/sedona-docker.dockerfile). The 1.9.0 release was published from a tree that had Maven build outputs in spark-shaded/target/ (likely from a prior mvn package); the existing .dockerignore allow-list !spark-shaded/** re-included everything under that directory, so the COPY swept in ~1.1 GB of JARs and test classes.

The dockerfile's trailing RUN rm -rf ${SEDONA_HOME} does delete the content from the running container's filesystem, so du inside the container looks normal — but it cannot shrink the prior layer that already committed those bytes. The wasted ~1 GB stays in the published manifest and inflates every pull's download size.
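The layer behaviour described above can be reproduced without Docker. The sketch below is a toy model, not Docker's storage code: it treats each layer as a plain tar archive and records the later deletion as an OCI-style whiteout entry (`.wh.<name>`). The file names and the 1 MiB size are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy model of two image layers as tar archives (illustrative only).
set -eu

work="$(mktemp -d)"
mkdir -p "$work/layer1/shaded/target" "$work/layer2/shaded/target"

# Layer 1: the COPY step commits a 1 MiB stand-in for the stale JARs.
head -c 1048576 /dev/urandom > "$work/layer1/shaded/target/big.jar"
tar -cf "$work/layer1.tar" -C "$work/layer1" .

# Layer 2: a later deletion of big.jar is recorded as a whiteout marker.
# It adds a small entry on top and does not rewrite layer 1.
: > "$work/layer2/shaded/target/.wh.big.jar"
tar -cf "$work/layer2.tar" -C "$work/layer2" .

layer1_bytes=$(( $(wc -c < "$work/layer1.tar") ))
layer2_bytes=$(( $(wc -c < "$work/layer2.tar") ))
total_bytes=$(( layer1_bytes + layer2_bytes ))

# What a pull downloads is the sum of all layers, not the final filesystem.
echo "layer1=${layer1_bytes} layer2=${layer2_bytes} total=${total_bytes}"
```

In this model the total never drops below the size of layer 1, which mirrors why the trailing rm cannot reclaim the bytes. On a real image, `docker history <image>` lists per-layer sizes and should show the same pattern.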

Fix 1 — docker/sedona-docker.dockerfile.dockerignore

Re-exclude Maven and Python build outputs after the allow-list (last-match-wins). Even on a release machine that has stale target/ directories, the Docker build context will not include them.

 *
 !docker/**
 !zeppelin/**
 !docs/usecases/**
 !python/**
 !spark-shaded/**
+
+# Re-exclude Maven build outputs and Python build artifacts so a tree
+# that has had `mvn package` or `python -m build` run against it does
+# not balloon the COPY layers ...
+**/target/
+python/build/
+python/dist/
+**/*.egg-info/
+**/__pycache__/

Fix 2 — docker/sedona-docker.dockerfile

Pass --no-cache-dir to both pip3 install invocations. Without it, pip leaves ~439 MB of wheel downloads under /root/.cache/pip in the requirements.txt install layer (measured with du -sh /root/.cache inside the running 1.9.0 image). --no-cache-dir skips that write and brings the pip layer down to roughly its installed-package size, with no runtime impact.
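One way to keep this from regressing is a small lint over the dockerfile. The helper below is a hypothetical sketch and not part of this PR: `find_cached_pip_installs` is an invented name, and it only inspects single-line `pip install` / `pip3 install` instructions, so a flag placed on a continuation line would be missed.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical lint helper: print (with line numbers) every line that
# mentions `pip install` or `pip3 install` but lacks --no-cache-dir.
# Prints nothing when every matching line carries the flag.
find_cached_pip_installs() {
  local dockerfile="$1"
  grep -nE 'pip3? install' "$dockerfile" | grep -v -- '--no-cache-dir' || true
}
```

A possible invocation from the repository root would be `find_cached_pip_installs docker/sedona-docker.dockerfile`; empty output means no offending line was found.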

How was this patch tested?

  1. Probe build proves the deny rule actually fires. Synthesized a 200 MB fake JAR at spark-shaded/target/fake-jar.jar, then built a one-liner FROM alpine; COPY ./spark-shaded/ /shaded/ Dockerfile twice — once with the new .dockerignore, once with the old:

    | Variant | /shaded size in resulting image |
    |---|---|
    | With `**/target/` deny rule | 24 KB (pom.xml + .gitignore only) |
    | Without the deny rule | 200 MB (fake jar leaked in) |
  2. Full image rebuild confirms --no-cache-dir shrinks the pip layer locally. Local tree has no target/ (clean checkout), so the deny rule is a no-op for our local build; the win comes purely from --no-cache-dir:

    | Image | Total size | pip install -r layer |
    |---|---|---|
    | sedona:dev (master @ HEAD) | 5.06 GB | 1.48 GB |
    | sedona:trim (this PR) | 4.81 GB | 1.24 GB |
  3. Existing Docker-build CI matrix exercises the change. The path filter (docker/**), widened in #2889 ([GH-2700] Add 05-geopandas-on-spark notebook), means this PR triggers docker-build.yml, which runs ./docker/build.sh ... local ... and docker/test-notebooks.sh against the resulting image, so the existing 6-notebook test suite verifies the new dockerfile end-to-end.
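The probe build from step 1 could be set up roughly as follows. This is a sketch under assumptions, not the exact commands used for the measurement: the function names, the context layout, and the use of a plain `.dockerignore` at the context root (rather than the Dockerfile-specific ignore file) are illustrative choices.

```shell
#!/usr/bin/env bash
set -eu

# Create a minimal build context with a fake JAR under spark-shaded/target/.
# $1 = context directory, $2 = fake JAR size in MiB, $3 = "deny" or "allow".
make_probe_context() {
  local ctx="$1" size_mb="$2" mode="$3"
  mkdir -p "$ctx/spark-shaded/target"
  printf '<project/>\n' > "$ctx/spark-shaded/pom.xml"
  dd if=/dev/zero of="$ctx/spark-shaded/target/fake-jar.jar" \
     bs=1048576 count="$size_mb" 2>/dev/null
  printf 'FROM alpine\nCOPY ./spark-shaded/ /shaded/\n' > "$ctx/Dockerfile"
  # Allow-list first; the deny rule goes last so that it takes precedence.
  printf '*\n!spark-shaded/**\n' > "$ctx/.dockerignore"
  if [ "$mode" = "deny" ]; then
    printf '**/target/\n' >> "$ctx/.dockerignore"
  fi
}

# Build the probe image and print the size of /shaded inside it.
# Needs a working Docker installation; it is defined here but not called.
run_probe() {
  local ctx="$1" tag="$2"
  docker build -q -t "$tag" "$ctx" >/dev/null &&
    docker run --rm "$tag" du -sh /shaded
}
```

For example, `make_probe_context /tmp/probe-deny 200 deny && run_probe /tmp/probe-deny probe-deny` builds the variant with the deny rule, and the same call with `allow` builds the variant without it. The two reported sizes can then be compared against the table in step 1.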

Expected impact on the next 1.9.0 re-publish

Combining both fixes on a release-machine tree should drop the published image from 4.03 GB → ~3.0 GB compressed (roughly the 1.8.1 baseline). Recipe:

git clean -fdX -- spark-shaded/ python/   # belt-and-suspenders, in case the new .dockerignore misses anything
./docker/build.sh 4.0.1 1.9.0 release 33.5

release mode does --platform linux/amd64,linux/arm64 --output type=registry, so it pushes both arches to Docker Hub directly.
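Because a release build pushes straight to the registry, a pre-flight size check on the COPY source may be worth running first. The helper below is a hypothetical sketch, not part of this PR: the function names and any byte limit passed to it are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Total size in bytes of all regular files under a directory.
dir_bytes() {
  find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Hypothetical guard: return non-zero when a directory exceeds a byte limit.
# $1 = directory that a COPY instruction will pick up, $2 = limit in bytes.
check_copy_source() {
  local dir="$1" limit_bytes="$2" actual
  actual="$(dir_bytes "$dir")"
  if [ "$actual" -gt "$limit_bytes" ]; then
    echo "ERROR: $dir is $actual bytes (limit $limit_bytes)" >&2
    return 1
  fi
  echo "OK: $dir is $actual bytes"
}
```

A possible use is `check_copy_source spark-shaded 52428800 && ./docker/build.sh 4.0.1 1.9.0 release 33.5`. The 50 MiB limit is an arbitrary example; the 1.8.1 layer for this directory was 0.7 MB, so a much tighter limit would also be reasonable.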

Did this PR include necessary documentation updates?

  • No public API changes.
  • No documentation updates needed; the .dockerignore comment block explains the regression and the rationale for any future contributor running git blame.

…, drop pip cache

Two related fixes that together restore the apache/sedona docker image
download size to its pre-1.9.0 footprint.

Background — what regressed
---------------------------
Per the Docker Hub tags API, apache/sedona:1.9.0 ships at 4.03 GB
compressed (per-arch) while apache/sedona:1.8.1 is 2.97 GB. A layer-
by-layer comparison of the published manifests (37 layers per arch)
shows all but one layer are identical to within a few MB:

  Layer                                | 1.8.1   | 1.9.0    | delta
  -------------------------------------|---------|----------|------
  L08 (install-spark.sh)               | 1241 MB | 1239 MB  | ~0
  L10 (pip install -r requirements.txt)|  715 MB |  728 MB  | +13
  L12 (COPY ./spark-shaded/)           |  0.7 MB | 1104 MB  | +1103
  L18 (install-zeppelin.sh)            |  504 MB |  504 MB  | 0

Layer 12 is COPY ./spark-shaded/ (line 57 of the dockerfile). The 1.9.0
release was published from a tree that had stale Maven build outputs in
spark-shaded/target/; the .dockerignore allow-list `!spark-shaded/**`
re-included everything under that directory, so the COPY swept in
~1.1 GB of JARs and test classes.

The dockerfile's trailing `RUN rm -rf ${SEDONA_HOME}` does delete the
content from the running container's filesystem, so `du` inside the
container looks normal — but it cannot shrink the prior layer that
already committed those bytes. The wasted ~1 GB stays in the published
manifest and adds to every pull's download size.

Fix 1 — clean stale Maven artifacts in docker/build.sh
------------------------------------------------------
The .dockerignore can't simply blacklist `**/target/` because the
`latest` matrix entry intentionally runs `mvn clean install` first and
install-sedona.sh inside the image copies the freshly-built shaded JAR
from `${SEDONA_HOME}/spark-shaded/target/`. (This was the failure mode
of the first attempt at this PR — both Docker-build legs went red on
the missing JAR.)

Instead, build.sh's existing `if SEDONA_VERSION = latest; then mvn ...`
gets a matching `else` branch that `rm -rf`s spark-shaded/target,
python/build, and python/dist. For the non-latest path (e.g.
`./docker/build.sh 4.0.1 1.9.0 release 33.5`), install-sedona.sh
downloads the JAR from Maven Central via curl — it never touches
spark-shaded/target/, so removing those artifacts is safe and prevents
the regression at the source.
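The branch logic can be sketched in isolation as follows. This is a simplified stand-in for illustration: `prepare_tree` and its explicit root argument are hypothetical (build.sh inlines the `rm -rf`), and the Maven step is stubbed with an echo so the sketch runs without a build toolchain.

```shell
#!/usr/bin/env bash
set -eu

# $1 = Sedona version ("latest" or a published version), $2 = repo root.
prepare_tree() {
  local sedona_version="$1" root="$2"
  if [ "$sedona_version" = "latest" ]; then
    # latest: the real script runs Maven here, and target/ must be kept
    # because install-sedona.sh copies the shaded JAR out of it.
    echo "would run: mvn clean install -DskipTests"
  else
    # Published version: the JAR is downloaded inside the image, so local
    # build outputs would only add dead weight to the build context.
    rm -rf "$root/spark-shaded/target" "$root/python/build" "$root/python/dist"
  fi
}
```

Keeping the cleanup in the `else` branch is what lets the `latest` leg retain its freshly built JAR while release builds start from a clean tree.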

Fix 2 — pass --no-cache-dir to pip
----------------------------------
Without it, pip leaves ~439 MB of wheel downloads under /root/.cache/pip
in the requirements.txt install layer (measured `du -sh /root/.cache`
inside the running 1.9.0 image). --no-cache-dir skips that write and
brings the pip layer down to roughly its installed-package size. The
local rebuild went from 5.06 GB to 4.81 GB extracted on the same tree.

Fix 3 — defensive .dockerignore deny rules for Python build outputs
-------------------------------------------------------------------
python/build/, python/dist/, *.egg-info/, and __pycache__/ are excluded
even though the rm step in build.sh covers the obvious cases. These are
harmless rules that prevent any future contributor from accidentally
shipping Python build artifacts the same way.

For the next 1.9.0 re-publish:
    ./docker/build.sh 4.0.1 1.9.0 release 33.5
    # build.sh now cleans spark-shaded/target/ et al. before building
@jiayuasu jiayuasu force-pushed the docker-trim-build-context branch from 1eda34f to 4ffad3e Compare May 6, 2026 18:04

jiayuasu commented May 6, 2026

Force-pushed 4ffad3e745 (was 1eda34f9ef). Both Docker-build matrix legs went red on the previous commit because the **/target/ deny rule excluded spark-shaded/target/ from the build context, but the latest matrix entry needs that directory: build.sh runs mvn clean install -DskipTests first, then install-sedona.sh inside the image does cp ${SEDONA_HOME}/spark-shaded/target/sedona-spark-shaded-*.jar ${SPARK_HOME}/jars/. With target/ excluded, the cp failed and so did install-sedona.sh.

Pivoted the fix:

| Old approach (red) | New approach |
|---|---|
| Add `**/target/` to .dockerignore | Removed (would break latest mode) |
| (n/a) | Add else: rm -rf spark-shaded/target python/build python/dist to build.sh's if SEDONA_VERSION = latest block. Non-latest builds (the path the 1.9.0 release publisher uses) never touch target/ since install-sedona.sh downloads from Maven Central instead. |
| Add --no-cache-dir to pip | Kept (independent improvement) |
| Add python/build/, python/dist/, `**/*.egg-info/`, `**/__pycache__/` to .dockerignore | Kept (harmless defensive rules) |

Result:

  • The latest matrix entry still has mvn-built artifacts in spark-shaded/target/ so install-sedona.sh's cp succeeds.
  • A non-latest release build (./docker/build.sh 4.0.1 1.9.0 release 33.5) cleans stale target/ first, so the COPY layer stays small and the published image avoids the 1 GB regression.

CI should now go green on both legs.

@jiayuasu jiayuasu requested a review from Copilot May 6, 2026 19:42
@jiayuasu jiayuasu merged commit 6b4a8e5 into apache:master May 6, 2026
15 checks passed

Copilot AI left a comment

Pull request overview

This PR reduces the published Docker image download size by trimming what gets sent in the Docker build context and by preventing pip from persisting its download cache into image layers.

Changes:

  • Update the Dockerfile-specific ignore file to exclude common Python build artifacts and __pycache__ directories from the build context.
  • Add --no-cache-dir to pip3 install invocations in the Dockerfile to avoid embedding pip’s wheel/download cache in layers.
  • Add a docker/build.sh cleanup step (for non-latest builds) to remove stale local build outputs before building the image.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| docker/sedona-docker.dockerfile.dockerignore | Excludes Python build artifacts/caches from the Docker build context. |
| docker/sedona-docker.dockerfile | Uses pip --no-cache-dir for smaller image layers. |
| docker/build.sh | Removes stale local build outputs in non-latest mode to avoid bloating COPY layers. |


Comment thread docker/sedona-docker.dockerfile.dockerignore
Comment on lines +13 to +22
# Note: `**/target/` is intentionally NOT excluded here because the
# build.sh `latest` mode runs `mvn clean install -DskipTests` and then
# install-sedona.sh inside the image copies the freshly-built shaded
# JAR from `${SEDONA_HOME}/spark-shaded/target/`. Stale Maven `target/`
# directories from a prior local build are cleaned up by build.sh
# (non-latest branch) instead — see docker/build.sh.
python/build/
python/dist/
**/*.egg-info/
**/__pycache__/
Comment thread docker/sedona-docker.dockerfile
Comment on lines +46 to +53
RUN pip3 install --no-cache-dir pipenv --break-system-packages
COPY ./docker/install-spark.sh ${SEDONA_HOME}/docker/
RUN chmod +x ${SEDONA_HOME}/docker/install-spark.sh
RUN ${SEDONA_HOME}/docker/install-spark.sh ${spark_version} ${hadoop_s3_version} ${aws_sdk_version}

# Install Python dependencies
COPY docker/requirements.txt /opt/requirements.txt
-RUN pip3 install -r /opt/requirements.txt --break-system-packages
+RUN pip3 install --no-cache-dir -r /opt/requirements.txt --break-system-packages
Comment thread docker/build.sh
Comment on lines +84 to 95
else
# When building against a published Sedona version, install-sedona.sh
# downloads the shaded JAR from Maven Central inside the container and
# never reads spark-shaded/target/. Any stale Maven artifacts in the
# local tree would still be pulled into the build context by the
# `COPY ./spark-shaded/` step, ship in the published manifest, and add
# to every pull's download size — even though the dockerfile's trailing
# `RUN rm -rf ${SEDONA_HOME}` deletes them from the runtime filesystem.
# apache/sedona:1.9.0 hit this regression and shipped 1.1 GB heavier than
# 1.8.1 (4.03 GB vs 2.97 GB compressed) for exactly this reason.
rm -rf spark-shaded/target python/build python/dist
fi

2 participants