[CI] Trim docker image build context: exclude Maven target/, drop pip cache #2907
Conversation
Two related fixes that together restore the apache/sedona docker image
download size to its pre-1.9.0 footprint.
Background — what regressed
---------------------------
Per the Docker Hub tags API, apache/sedona:1.9.0 ships at 4.03 GB
compressed (per-arch) while apache/sedona:1.8.1 is 2.97 GB. A layer-
by-layer comparison of the published manifests (37 layers per arch)
shows all but one layer are identical to within a few MB:
Layer | 1.8.1 | 1.9.0 | delta
-------------------------------------|---------|----------|------
L08 (install-spark.sh) | 1241 MB | 1239 MB | ~0
L10 (pip install -r requirements.txt)| 715 MB | 728 MB | +13
L12 (COPY ./spark-shaded/) | 0.7 MB | 1104 MB | +1103
L18 (install-zeppelin.sh) | 504 MB | 504 MB | 0
Layer 12 is COPY ./spark-shaded/ (line 57 of the dockerfile). The 1.9.0
release was published from a tree that had stale Maven build outputs in
spark-shaded/target/; the .dockerignore allow-list `!spark-shaded/**`
re-included everything under that directory, so the COPY swept in
~1.1 GB of JARs and test classes.
The dockerfile's trailing `RUN rm -rf ${SEDONA_HOME}` does delete the
content from the running container's filesystem, so `du` inside the
container looks normal — but it cannot shrink the prior layer that
already committed those bytes. The wasted ~1 GB stays in the published
manifest and adds to every pull's download size.
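The mechanics are easy to see in a minimal sketch (illustrative paths, not the real dockerfile):

```
FROM alpine
# This COPY commits everything the build context contains under
# spark-shaded/ (stale target/ included) as an immutable layer.
COPY ./spark-shaded/ /opt/sedona/spark-shaded/
# This RUN commits a *new* layer that records the deletion; the COPY
# layer above still ships at full size in the pushed manifest, and
# `docker history` on the image will show it.
RUN rm -rf /opt/sedona
```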
Fix 1 — clean stale Maven artifacts in docker/build.sh
------------------------------------------------------
The .dockerignore can't simply blacklist `**/target/` because the
`latest` matrix entry intentionally runs `mvn clean install` first and
install-sedona.sh inside the image copies the freshly-built shaded JAR
from `${SEDONA_HOME}/spark-shaded/target/`. (This was the failure mode
of the first attempt at this PR — both Docker-build legs went red on
the missing JAR.)
Instead, build.sh's existing `if SEDONA_VERSION = latest; then mvn ...`
gets a matching `else` branch that `rm -rf`s spark-shaded/target,
python/build, and python/dist. For the non-latest path (e.g.
`./docker/build.sh 4.0.1 1.9.0 release 33.5`), install-sedona.sh
downloads the JAR from Maven Central via curl — it never touches
spark-shaded/target/, so removing those artifacts is safe and prevents
the regression at the source.
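The branch logic amounts to the following self-contained sketch (the temp-dir setup and the fixed SEDONA_VERSION value are for demonstration only; the real build.sh takes the version as an argument and actually runs mvn in the latest case):

```shell
# Demonstrate the build.sh cleanup branch in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

SEDONA_VERSION="1.9.0"   # any non-latest build, e.g. a release build

# Simulate stale outputs left behind by an earlier local Maven build.
mkdir -p spark-shaded/target python/build python/dist
touch spark-shaded/target/stale-shaded.jar

if [ "$SEDONA_VERSION" = "latest" ]; then
    # latest: install-sedona.sh copies the freshly built shaded JAR from
    # spark-shaded/target/, so local Maven outputs must exist.
    echo "would run: mvn clean install -DskipTests"
else
    # non-latest: install-sedona.sh downloads the JAR from Maven Central
    # inside the image, so stale local artifacts only bloat the context.
    rm -rf spark-shaded/target python/build python/dist
fi
```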
Fix 2 — pass --no-cache-dir to pip
----------------------------------
Without it, pip leaves ~439 MB of wheel downloads under /root/.cache/pip
in the requirements.txt install layer (measured `du -sh /root/.cache`
inside the running 1.9.0 image). --no-cache-dir skips that write and
brings the pip layer down to roughly its installed-package size. The
local rebuild went from 5.06 GB to 4.81 GB extracted on the same tree.
Fix 3 — defensive .dockerignore deny rules for Python build outputs
-------------------------------------------------------------------
python/build/, python/dist/, *.egg-info/, and __pycache__/ are excluded
even though the rm step in build.sh covers the obvious cases. These are
harmless rules that prevent any future contributor from accidentally
shipping Python build artifacts the same way.
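Because .dockerignore is last-match-wins, deny rules placed after the existing allow-lists take precedence for any path they match. A sketch of the interplay (the `!python/**` allow rule is hypothetical, shown only to illustrate ordering; see the real file for the exact entries):

```
!spark-shaded/**     # existing allow-list: re-includes the module,
                     # including anything stale under its target/
!python/**           # hypothetical earlier allow rule, for illustration
python/build/        # deny rules from this PR: being later, they win
python/dist/         # over earlier allow rules for matching paths
**/*.egg-info/
**/__pycache__/
```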
For the next 1.9.0 re-publish:
./docker/build.sh 4.0.1 1.9.0 release 33.5
# build.sh now cleans spark-shaded/target/ et al. before building
Force-pushed: pivoted the fix. Result: CI should now go green on both legs.
Pull request overview
This PR reduces the published Docker image download size by trimming what gets sent in the Docker build context and by preventing pip from persisting its download cache into image layers.
Changes:
- Update the Dockerfile-specific ignore file to exclude common Python build artifacts and `__pycache__` directories from the build context.
- Add `--no-cache-dir` to `pip3 install` invocations in the Dockerfile to avoid embedding pip's wheel/download cache in layers.
- Add a `docker/build.sh` cleanup step (for non-`latest` builds) to remove stale local build outputs before building the image.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docker/sedona-docker.dockerfile.dockerignore | Excludes Python build artifacts/caches from the Docker build context. |
| docker/sedona-docker.dockerfile | Uses pip --no-cache-dir for smaller image layers. |
| docker/build.sh | Removes stale local build outputs in non-latest mode to avoid bloating COPY layers. |
From docker/sedona-docker.dockerfile.dockerignore:

```
# Note: `**/target/` is intentionally NOT excluded here because the
# build.sh `latest` mode runs `mvn clean install -DskipTests` and then
# install-sedona.sh inside the image copies the freshly-built shaded
# JAR from `${SEDONA_HOME}/spark-shaded/target/`. Stale Maven `target/`
# directories from a prior local build are cleaned up by build.sh
# (non-latest branch) instead — see docker/build.sh.
python/build/
python/dist/
**/*.egg-info/
**/__pycache__/
```
From docker/sedona-docker.dockerfile:

```diff
 RUN pip3 install --no-cache-dir pipenv --break-system-packages
 COPY ./docker/install-spark.sh ${SEDONA_HOME}/docker/
 RUN chmod +x ${SEDONA_HOME}/docker/install-spark.sh
 RUN ${SEDONA_HOME}/docker/install-spark.sh ${spark_version} ${hadoop_s3_version} ${aws_sdk_version}

 # Install Python dependencies
 COPY docker/requirements.txt /opt/requirements.txt
-RUN pip3 install -r /opt/requirements.txt --break-system-packages
+RUN pip3 install --no-cache-dir -r /opt/requirements.txt --break-system-packages
```
From docker/build.sh:

```shell
else
    # When building against a published Sedona version, install-sedona.sh
    # downloads the shaded JAR from Maven Central inside the container and
    # never reads spark-shaded/target/. Any stale Maven artifacts in the
    # local tree would still be pulled into the build context by the
    # `COPY ./spark-shaded/` step, ship in the published manifest, and add
    # to every pull's download size — even though the dockerfile's trailing
    # `RUN rm -rf ${SEDONA_HOME}` deletes them from the runtime filesystem.
    # apache/sedona:1.9.0 hit this regression and shipped 1.1 GB heavier than
    # 1.8.1 (4.03 GB vs 2.97 GB compressed) for exactly this reason.
    rm -rf spark-shaded/target python/build python/dist
fi
```
Did you read the Contributor Guide?
-----------------------------------

Is this PR related to a ticket?
-------------------------------

What changes were proposed in this PR?
--------------------------------------
Two related fixes that together restore the `apache/sedona` docker image download size to its pre-1.9.0 footprint.

Background — what regressed
---------------------------
Per the Docker Hub tags API, `apache/sedona:1.9.0` ships at 4.03 GB compressed (per-arch, both amd64 and arm64) while `apache/sedona:1.8.1` is 2.97 GB. A layer-by-layer comparison of the published manifests (37 layers per arch) shows all but one layer are identical to within a few MB:

Layer | 1.8.1 | 1.9.0 | delta
-------------------------------------|---------|---------|------
L08 (install-spark.sh) | 1241 MB | 1239 MB | ~0
L10 (pip install -r requirements.txt)| 715 MB | 728 MB | +13
L12 (COPY ./spark-shaded/) | 0.7 MB | 1104 MB | +1103
L18 (install-zeppelin.sh) | 504 MB | 504 MB | 0

Layer 12 is `COPY ./spark-shaded/ ${SEDONA_HOME}/spark-shaded/` (line 57 of docker/sedona-docker.dockerfile). The 1.9.0 release was published from a tree that had Maven build outputs in spark-shaded/target/ (likely from a prior `mvn package`); the existing `.dockerignore` allow-list `!spark-shaded/**` re-included everything under that directory, so the COPY swept in ~1.1 GB of JARs and test classes.

The dockerfile's trailing `RUN rm -rf ${SEDONA_HOME}` does delete the content from the running container's filesystem, so `du` inside the container looks normal — but it cannot shrink the prior layer that already committed those bytes. The wasted ~1 GB stays in the published manifest and inflates every pull's download size.

Fix 1 — docker/sedona-docker.dockerfile.dockerignore
----------------------------------------------------
Re-exclude Maven and Python build outputs after the allow-list (last-match-wins). Even on a release machine that has stale target/ directories, the Docker build context will not include them.

Fix 2 — docker/sedona-docker.dockerfile
---------------------------------------
Pass `--no-cache-dir` to both `pip3 install` invocations. Without it, pip leaves ~439 MB of wheel downloads under /root/.cache/pip in the requirements.txt install layer (measured with `du -sh /root/.cache` inside the running 1.9.0 image). `--no-cache-dir` skips that write and brings the pip layer down to roughly its installed-package size, with no runtime impact.

How was this patch tested?
--------------------------
Probe build proves the deny rule actually fires. Synthesized a 200 MB fake JAR at spark-shaded/target/fake-jar.jar, then built a one-liner `FROM alpine; COPY ./spark-shaded/ /shaded/` Dockerfile twice, once with the new `.dockerignore` and once with the old, and compared the `/shaded` size in the resulting image: with the old file the fake JAR lands in `/shaded`; with the new file the `**/target/` deny rule keeps it out.

Full image rebuild confirms `--no-cache-dir` shrinks the pip layer locally. The local tree has no target/ (clean checkout), so the deny rule is a no-op for our local build; the win comes purely from `--no-cache-dir`, measured by comparing the `pip install -r` layer between `sedona:dev` (master @ HEAD) and `sedona:trim` (this PR).

Existing Docker-build CI matrix exercises the change. The path filter (docker/**) widened in [GH-2700] Add 05-geopandas-on-spark notebook #2889 means this PR triggers docker-build.yml, which runs `./docker/build.sh ... local ...` and docker/test-notebooks.sh against the resulting image, so the existing 6-notebook test suite verifies the new dockerfile end-to-end.

Expected impact on the next 1.9.0 re-publish
--------------------------------------------
Combining both fixes on a release-machine tree should drop the published image from 4.03 GB → ~3.0 GB compressed (roughly the 1.8.1 baseline). Recipe:

```shell
git clean -fdX -- spark-shaded/ python/  # belt-and-suspenders, in case the new .dockerignore misses anything
./docker/build.sh 4.0.1 1.9.0 release 33.5
```

`release` mode does `--platform linux/amd64,linux/arm64 --output type=registry`, so it pushes both arches to Docker Hub directly.

Did this PR include necessary documentation updates?
----------------------------------------------------
The `.dockerignore` comment block explains the regression and the rationale, for any future contributor running `git blame`.