[SPARK-51654][BUILD] Add a dev script to compare SBT and Maven builds #54371

Open
fangchenli wants to merge 21 commits into apache:master from fangchenli:compare-builds-script
@fangchenli
Contributor

What changes were proposed in this pull request?

Add a dev script to compare SBT and Maven builds. The script is pure Python with no external dependencies.

Why are the changes needed?

Currently, the JARs produced by Maven and SBT differ, and we need to be able to inspect those differences. This is also a precursor to a native SBT build. With this script, we can answer the question in the original Jira issue:

python dev/compare-builds.py --compare \
  assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm_2.13-4.2.0-SNAPSHOT.jar \
  assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm-assembly-4.2.0-SNAPSHOT.jar

Output:

Comparing JARs
────────────────────────────────────────────────────────────────────────
  JAR 1: assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm_2.13-4.2.0-SNAPSHOT.jar
         26,035,401 bytes, 12182 classes, 95 resources, 5 services
  JAR 2: assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm-assembly-4.2.0-SNAPSHOT.jar
         63,882,318 bytes, 21289 classes, 4392 resources, 20 services
  Size:  JAR 2 is 2x larger
────────────────────────────────────────────────────────────────────────
Summary: 3119 identical, 9052 matched after de-shading, 11 only in JAR 1, 9118 only in JAR 2, 23 service diffs, 1 service contents differ

De-shading Analysis
────────────────────────────────────────────────────────────────────────
  ✓ 9052 classes are the same original class under different shading prefixes

  Classes truly only in JAR 1 (11):
    org/sparkproject/com/google/gson/internal/bind/ (5 classes)
    org/sparkproject/com/google/gson/internal/ (4 classes)
    org/sparkproject/com/google/gson/ (1 classes)
    org/apache/spark/unused/ (1 classes)

  Classes truly only in JAR 2 (9118):
    com/ibm/icu/text/ (544 classes)
    com/ibm/icu/impl/ (388 classes)
    org/json4s/ (278 classes)
    com/ibm/icu/util/ (190 classes)
    com/esotericsoftware/kryo/serializers/ (143 classes)
    com/twitter/chill/ (123 classes)
    org/apache/logging/log4j/core/pattern/ (113 classes)
    org/json4s/scalap/scalasig/ (105 classes)
    org/apache/logging/log4j/layout/template/json/resolver/ (93 classes)
    org/apache/logging/log4j/core/appender/ (91 classes)
    com/fasterxml/jackson/databind/deser/std/ (90 classes)
    com/fasterxml/jackson/module/scala/deser/ (88 classes)
    org/antlr/v4/runtime/atn/ (87 classes)
    org/apache/commons/lang3/ (82 classes)
    scala/xml/ (82 classes)
    org/apache/logging/log4j/core/tools/picocli/ (80 classes)
    com/fasterxml/jackson/databind/ser/std/ (79 classes)
    org/apache/logging/log4j/core/layout/ (79 classes)
    org/apache/ivy/ant/ (77 classes)
    com/fasterxml/jackson/databind/introspect/ (76 classes)
    ... and 399 more packages

  Services only in JAR 1 (4):
    META-INF/services/org.sparkproject.io.grpc.LoadBalancerProvider
    META-INF/services/org.sparkproject.io.grpc.ManagedChannelProvider
    META-INF/services/org.sparkproject.io.grpc.NameResolverProvider
    META-INF/services/org.sparkproject.io.grpc.ServerProvider

  Services only in JAR 2 (19):
    META-INF/services/com.fasterxml.jackson.core.JsonFactory
    META-INF/services/com.fasterxml.jackson.core.ObjectCodec
    META-INF/services/com.fasterxml.jackson.databind.Module
    META-INF/services/exec
    META-INF/services/ffm
    META-INF/services/jansi
    META-INF/services/javax.annotation.processing.Processor
    META-INF/services/jna
    META-INF/services/jni
    META-INF/services/org.apache.commons.logging.LogFactory
    META-INF/services/org.apache.logging.log4j.core.util.ContextDataProvider
    META-INF/services/org.apache.logging.log4j.message.ThreadDumpMessage$ThreadInfoFactory
    META-INF/services/org.apache.logging.log4j.spi.Provider
    META-INF/services/org.apache.logging.log4j.util.PropertySource
    META-INF/services/org.slf4j.spi.SLF4JServiceProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.LoadBalancerProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.ManagedChannelProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.NameResolverProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.ServerProvider

  Services with different content (1):
    META-INF/services/reactor.blockhound.integration.BlockHoundIntegration
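
To make the categories in the report above easier to read, here is a minimal sketch of how such a comparison could be done with the standard library only. It is not the code in dev/compare-builds.py: the names `SHADE_PREFIXES`, `deshade`, `inventory`, and `compare` are hypothetical, and the two shading prefixes are only inferred from the output above. It also matches classes by path alone, so "same_path" here is weaker than a byte-level "identical" check.

```python
import zipfile

# Assumed shading prefixes, longest first so the more specific one wins.
# These are inferred from the report above, not taken from the real script.
SHADE_PREFIXES = (
    "org/sparkproject/connect/client/",
    "org/sparkproject/",
)


def deshade(name):
    """Strip a known shading prefix from a JAR entry name, if present."""
    for prefix in SHADE_PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name


def inventory(jar):
    """Return (classes, services) for a JAR path or file-like object.

    classes:  de-shaded class path -> entry name as stored in the JAR
    services: service file name -> set of provider lines (comments dropped)
    """
    classes, services = {}, {}
    with zipfile.ZipFile(jar) as zf:
        for info in zf.infolist():
            name = info.filename
            if info.is_dir():
                continue
            if name.endswith(".class"):
                classes[deshade(name)] = name
            elif name.startswith("META-INF/services/"):
                lines = zf.read(name).decode("utf-8").splitlines()
                services[name] = {
                    line.strip()
                    for line in lines
                    if line.strip() and not line.strip().startswith("#")
                }
    return classes, services


def compare(jar1, jar2):
    """Bucket classes and service files the way the report above does."""
    c1, s1 = inventory(jar1)
    c2, s2 = inventory(jar2)
    common = c1.keys() & c2.keys()
    return {
        # Same entry name in both JARs (path match only, bytes not compared).
        "same_path": sorted(k for k in common if c1[k] == c2[k]),
        # Same class once the shading prefix is removed, different entry name.
        "deshaded_match": sorted(k for k in common if c1[k] != c2[k]),
        "only_in_1": sorted(c1[k] for k in c1.keys() - c2.keys()),
        "only_in_2": sorted(c2[k] for k in c2.keys() - c1.keys()),
        "services_only_in_1": sorted(s1.keys() - s2.keys()),
        "services_only_in_2": sorted(s2.keys() - s1.keys()),
        "services_differ": sorted(
            n for n in s1.keys() & s2.keys() if s1[n] != s2[n]
        ),
    }
```

Calling `compare(path_to_jar_1, path_to_jar_2)` returns a dict of sorted lists, one per bucket. Service file names are compared as stored, so a service registered under two different shading prefixes shows up as "only in" each JAR, which matches how the gRPC providers appear in the report above.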

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The script includes a partial self-test. To test it further, we need more user feedback and to investigate the differences it finds.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6

@Yicong-Huang
Contributor

That's a great finding!

@holdenk
Contributor

holdenk commented Feb 23, 2026

What is this for? Like, what's the end goal? The JIRA just says we need to investigate, but generally we've just had Maven be the build of record and sbt be the dev build. Is there a problem we're trying to solve?

@zhengruifeng
Contributor

What is this for? Like, what's the end goal? The JIRA just says we need to investigate, but generally we've just had Maven be the build of record and sbt be the dev build. Is there a problem we're trying to solve?

+1. I am also wondering how to use it in the release process or in daily development, and, if there are differences, what the follow-up actions would be.

@LuciferYang
Contributor

What is this for? Like, what's the end goal? The JIRA just says we need to investigate, but generally we've just had Maven be the build of record and sbt be the dev build. Is there a problem we're trying to solve?

+1. This is actually a long-standing 'issue'. From a personal perspective, I would much prefer to convert the project into an sbt-only project to completely eliminate the possibility of such inconsistencies. However, I've somewhat forgotten the reasons for not doing so.

@cloud-fan
Contributor

Knowing the differences is good, but fixing them is more important. It's more valuable to merge changes that make the SBT JAR the same as the Maven JAR.

@fangchenli
Contributor Author

I started looking into this because I want to enable Scala 3 support for Spark. Cross-compiling Scala 3 and 2.13 gets a lot easier if we separate the sbt build from Maven, so I tried that out (haven’t opened a PR yet), and it worked locally.

Right now, I’m just relying on “all tests passing” as the only sign that things work. The goal of this PR is to add more ways to observe build outputs, so we can be more confident that the sbt build refactor didn’t introduce any subtle changes.

Alternatively, I can keep this as a personal dev tool for now and focus on advancing the native sbt build. We could get that merged quickly and use it experimentally to monitor for potential issues. Once we’re confident in its stability, we could migrate the release process to sbt and make Spark an sbt-only project.

xref: SPARK-44173
