From 4ffad3e7457c1175ab8eb29781b13deb2564255d Mon Sep 17 00:00:00 2001
From: Jia Yu
Date: Wed, 6 May 2026 10:19:35 -0700
Subject: [PATCH] [CI] Trim docker image build context: clean stale target/ in build.sh, drop pip cache
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two related fixes, plus a defensive .dockerignore change, that together
restore the apache/sedona docker image download size to its pre-1.9.0
footprint.

Background: what regressed
--------------------------
Per the Docker Hub tags API, apache/sedona:1.9.0 ships at 4.03 GB
compressed (per-arch) while apache/sedona:1.8.1 is 2.97 GB. A
layer-by-layer comparison of the published manifests (37 layers per
arch) shows all but one layer are identical to within a few MB:

  Layer                                 | 1.8.1   | 1.9.0   | delta (MB)
  --------------------------------------|---------|---------|-----------
  L08 (install-spark.sh)                | 1241 MB | 1239 MB | ~0
  L10 (pip install -r requirements.txt) |  715 MB |  728 MB | +13
  L12 (COPY ./spark-shaded/)            |  0.7 MB | 1104 MB | +1103
  L18 (install-zeppelin.sh)             |  504 MB |  504 MB | 0

Layer 12 is COPY ./spark-shaded/ (line 57 of the dockerfile). The 1.9.0
release was published from a tree that had stale Maven build outputs in
spark-shaded/target/; the .dockerignore allow-list `!spark-shaded/**`
re-included everything under that directory, so the COPY swept in
~1.1 GB of JARs and test classes.

The dockerfile's trailing `RUN rm -rf ${SEDONA_HOME}` does delete the
content from the running container's filesystem, so `du` inside the
container looks normal, but it cannot shrink the prior layer that
already committed those bytes. The wasted ~1 GB stays in the published
manifest and adds to every pull's download size.

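As a rough cross-check of the table above, the per-layer deltas can be
summed with a few lines of shell. This is only a sketch: the layer
labels are shorthand invented here, the sizes are the rounded MB figures
from the table, and the four listed layers are not the full 37-layer
manifest, so the total is not expected to match the tag-level difference
exactly.

```shell
#!/bin/sh
# Per-layer compressed sizes in MB, copied from the table above.
# Columns: label (hypothetical shorthand), 1.8.1 size, 1.9.0 size.
layers='install-spark 1241 1239
pip-requirements 715 728
copy-spark-shaded 0.7 1104
install-zeppelin 504 504'

# Compute each layer's delta and the sum across the listed layers.
report=$(printf '%s\n' "$layers" | awk '
  { d = $3 - $2; total += d; printf "%-18s %+8.1f MB\n", $1, d }
  END { printf "%-18s %+8.1f MB\n", "total", total }')

printf '%s\n' "$report"
```

In the output, the COPY ./spark-shaded/ row accounts for nearly all of
the growth across the listed layers.
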
Fix 1: clean stale Maven artifacts in docker/build.sh
-----------------------------------------------------
The .dockerignore can't simply blacklist `**/target/` because the
`latest` matrix entry intentionally runs `mvn clean install` first and
install-sedona.sh inside the image copies the freshly-built shaded JAR
from `${SEDONA_HOME}/spark-shaded/target/`. (This was the failure mode
of the first attempt at this PR: both Docker-build legs went red on the
missing JAR.)

Instead, build.sh's existing `if SEDONA_VERSION = latest; then mvn ...`
gets a matching `else` branch that `rm -rf`s spark-shaded/target,
python/build, and python/dist. For the non-latest path (e.g.
`./docker/build.sh 4.0.1 1.9.0 release 33.5`), install-sedona.sh
downloads the JAR from Maven Central via curl. It never touches
spark-shaded/target/, so removing those artifacts is safe and prevents
the regression at the source.

Fix 2: pass --no-cache-dir to pip
---------------------------------
Without it, pip leaves ~439 MB of wheel downloads under /root/.cache/pip
in the requirements.txt install layer (measured with
`du -sh /root/.cache` inside the running 1.9.0 image). --no-cache-dir
skips that write and brings the pip layer down to roughly its
installed-package size. The local rebuild went from 5.06 GB to 4.81 GB
extracted on the same tree.

Fix 3: defensive .dockerignore deny rules for Python build outputs
------------------------------------------------------------------
python/build/, python/dist/, *.egg-info/, and __pycache__/ are excluded
even though the rm step in build.sh covers the obvious cases. These are
harmless rules that prevent any future contributor from accidentally
shipping Python build artifacts the same way.

For the next 1.9.0 re-publish:

  ./docker/build.sh 4.0.1 1.9.0 release 33.5
  # build.sh now cleans spark-shaded/target/ et al.
  # before building
---
 docker/build.sh                              | 11 +++++++++++
 docker/sedona-docker.dockerfile              |  4 ++--
 docker/sedona-docker.dockerfile.dockerignore | 13 +++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/docker/build.sh b/docker/build.sh
index 7cf3d95a8ce..77c8a54767a 100755
--- a/docker/build.sh
+++ b/docker/build.sh
@@ -81,6 +81,17 @@ if [ "$SEDONA_VERSION" = "latest" ]; then
     # The compilation must take place outside Docker to avoid unnecessary maven packages
     mvn clean install -DskipTests -Dspark="${SEDONA_SPARK_VERSION}" -Dscala=2.13
+else
+    # When building against a published Sedona version, install-sedona.sh
+    # downloads the shaded JAR from Maven Central inside the container and
+    # never reads spark-shaded/target/. Any stale Maven artifacts in the
+    # local tree would still be pulled into the build context by the
+    # `COPY ./spark-shaded/` step, ship in the published manifest, and add
+    # to every pull's download size, even though the dockerfile's trailing
+    # `RUN rm -rf ${SEDONA_HOME}` deletes them from the runtime filesystem.
+    # apache/sedona:1.9.0 hit this regression and shipped 1.1 GB heavier than
+    # 1.8.1 (4.03 GB vs 2.97 GB compressed) for exactly this reason.
+    rm -rf spark-shaded/target python/build python/dist
 fi
 # -- Building the image
diff --git a/docker/sedona-docker.dockerfile b/docker/sedona-docker.dockerfile
index b12c98d5104..4156fa36c3b 100644
--- a/docker/sedona-docker.dockerfile
+++ b/docker/sedona-docker.dockerfile
@@ -43,14 +43,14 @@ ENV PYTHONPATH=${SPARK_HOME}/python
 # Set up OS libraries and PySpark
 RUN apt-get update
 RUN apt-get install -y openjdk-17-jdk-headless curl python3-pip maven
-RUN pip3 install pipenv --break-system-packages
+RUN pip3 install --no-cache-dir pipenv --break-system-packages
 COPY ./docker/install-spark.sh ${SEDONA_HOME}/docker/
 RUN chmod +x ${SEDONA_HOME}/docker/install-spark.sh
 RUN ${SEDONA_HOME}/docker/install-spark.sh ${spark_version} ${hadoop_s3_version} ${aws_sdk_version}
 # Install Python dependencies
 COPY docker/requirements.txt /opt/requirements.txt
-RUN pip3 install -r /opt/requirements.txt --break-system-packages
+RUN pip3 install --no-cache-dir -r /opt/requirements.txt --break-system-packages
 # Copy local compiled jars and python code to the docker environment
diff --git a/docker/sedona-docker.dockerfile.dockerignore b/docker/sedona-docker.dockerfile.dockerignore
index 3699b015a2b..0b5ea0bd6ac 100644
--- a/docker/sedona-docker.dockerfile.dockerignore
+++ b/docker/sedona-docker.dockerfile.dockerignore
@@ -7,3 +7,16 @@
 !docs/usecases/**
 !python/**
 !spark-shaded/**
+
+# Re-exclude Python build artifacts and pyc caches so a tree that has
+# had `python -m build` run against it does not balloon the COPY layers.
+# Note: `**/target/` is intentionally NOT excluded here because the
+# build.sh `latest` mode runs `mvn clean install -DskipTests` and then
+# install-sedona.sh inside the image copies the freshly-built shaded
+# JAR from `${SEDONA_HOME}/spark-shaded/target/`. Stale Maven `target/`
+# directories from a prior local build are cleaned up by build.sh
+# (non-latest branch) instead; see docker/build.sh.
+python/build/
+python/dist/
+**/*.egg-info/
+**/__pycache__/