Cosmos Live tests pipeline optimizations #48506

Draft
kushagraThapar wants to merge 6 commits into Azure:main from kushagraThapar:kushagra/cosmos-ci-pipeline-optimizations

Summary

Optimize the java - cosmos - tests live test CI pipeline to reduce total job count and wall-clock execution time. This is a follow-up to PR #48259, which optimized individual test code; this PR focuses on pipeline matrix efficiency and the remaining Thread.sleep reductions.

Pipeline Matrix Optimizations

1. Skip shade plugin in live test pipeline (-Dshade.skip=true -Dmaven.antrun.skip=true)

  • Shade/antrun are unnecessary for test-only runs since the shared build already produces the shaded JAR
  • Applied to all 7 test stages (except Kafka which needs shade for uber JARs)

2. Consolidate e2e profile matrix

  • Flattened the ConsistencyConfig dimension (Session + Strong → Session only)
  • Strong consistency is already covered by the separate multi-master matrix entry
  • Restored BoundedStaleness e2e job (tests consistency downgrade path: BoundedStaleness read on Strong account)

3. Consolidate circuit breaker jobs (4 → 2)

  • Previously: 4 separate matrix entries (multi-master CB=true, CB=false, FI CB=true, FI CB=false)
  • Now: 2 entries with RunCBSecondPass=true variable. First pass runs CB=TRUE, then a PostStep re-runs with CB=FALSE on the same agent
  • Halves the number of ARM deployments (the slowest CI step)
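The two-pass arrangement described above can be sketched roughly as follows. This is a minimal model, not the pipeline definition: `MatrixEntry`, `passesFor`, and `totalPasses` are hypothetical names, and the only facts taken from the description are that each entry runs CB=TRUE first and, when `RunCBSecondPass` is set, re-runs with CB=FALSE on the same agent.

```java
import java.util.ArrayList;
import java.util.List;

public class CircuitBreakerPasses {
    /** Hypothetical stand-in for one pipeline matrix entry (one ARM deployment). */
    static final class MatrixEntry {
        final String name;
        final boolean runCbSecondPass;

        MatrixEntry(String name, boolean runCbSecondPass) {
            this.name = name;
            this.runCbSecondPass = runCbSecondPass;
        }
    }

    /** Lists the test passes that run on the single agent provisioned for an entry. */
    static List<String> passesFor(MatrixEntry entry) {
        List<String> passes = new ArrayList<>();
        passes.add(entry.name + " CB=TRUE");          // first pass always runs
        if (entry.runCbSecondPass) {
            passes.add(entry.name + " CB=FALSE");     // post-step re-run, no new deployment
        }
        return passes;
    }

    /** Total passes across all entries; the deployment count is entries.size(). */
    static int totalPasses(List<MatrixEntry> entries) {
        int total = 0;
        for (MatrixEntry entry : entries) {
            total += passesFor(entry).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<MatrixEntry> consolidated = List.of(
            new MatrixEntry("multi-master", true),
            new MatrixEntry("FI", true));
        System.out.println("deployments=" + consolidated.size()
            + " passes=" + totalPasses(consolidated));
    }
}
```

In this model the same four CB configurations still run, but only two entries (and so two deployments) exist.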

4. Add failsafe execution bindings to e2e profile POMs

  • The e2e profile in azure-cosmos-tests and azure-cosmos-benchmark was missing explicit failsafe integration-test and verify goals, causing tests to be silently skipped

5. Remove -DCOSMOS.HTTP2_ENABLED from thin client test args

  • HTTP/2 enablement is now controlled internally by the SDK

6. Reduce Kafka test matrix from 5 → 3 Java versions

  • Dropped Java 8 and 11 (past end-of-life), kept 17, 21, 25

Thread.sleep Optimizations

7. Change Feed Processor tests (~15 min savings)

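  • Reduced CHANGE_FEED_PROCESSOR_TIMEOUT from 5000ms to 2000ms (test-only wait interval); the final commit in this PR reports that this reduction caused CFP test failures (staledLeaseAcquiring, readFeedDocumentsWithThroughputControl) and keeps only the other CFP optimizations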
  • Reduced REPLICA_IN_SATELLITE_REGION_CATCH_UP_TIME from 10000ms to 5000ms
  • Removed 5 redundant post-shutdown sleeps (safeStopChangeFeedProcessor already blocks)
  • Reduced 47 cleanup sleeps from 500ms to 100ms
  • Files: IncrementalChangeFeedProcessorTest (epk + pk), FullFidelityChangeFeedProcessorTest
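To illustrate why a sleep after a blocking stop is redundant, here is a small sketch. It is an assumption-laden stand-in: `FakeProcessor` and `safeStop` are hypothetical, and a `CompletableFuture` replaces the reactive stop call used by the real `safeStopChangeFeedProcessor`, so only the blocking behaviour is modelled.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class BlockingStopSketch {
    /** Hypothetical processor whose stop() finishes asynchronously. */
    static final class FakeProcessor {
        private final AtomicBoolean running = new AtomicBoolean(true);

        CompletableFuture<Void> stop() {
            return CompletableFuture.runAsync(() -> {
                try {
                    Thread.sleep(50); // simulated shutdown work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.set(false);
            });
        }

        boolean isRunning() {
            return running.get();
        }
    }

    /** Blocks until stop() completes or the timeout elapses. */
    static void safeStop(FakeProcessor processor, long timeoutMs) {
        try {
            processor.stop().get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stopping", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("processor did not stop cleanly", e);
        }
    }

    public static void main(String[] args) {
        FakeProcessor processor = new FakeProcessor();
        safeStop(processor, 5_000);
        // No Thread.sleep is needed here: safeStop returned only after shutdown finished.
        System.out.println("running after safeStop: " + processor.isRunning());
    }
}
```

Because the caller only regains control once the stop has completed, a fixed sleep afterwards adds wall-clock time without adding any guarantee.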

8. RntbdServiceEndpointTest (27s savings)

  • Reduced idle endpoint timeout from 20s to 2s, sleep from 30s to 3s (unit test — value doesn't need to be realistic)
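The reasoning behind the shorter timeout can be shown with a toy eviction timer. This is only a sketch under stated assumptions: `Endpoint` and `evictedAfter` are hypothetical, a `ScheduledExecutorService` stands in for the real RNTBD eviction logic, and the millisecond values are scaled down from the 2s/3s used in the test.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class IdleEvictionSketch {
    /** Hypothetical endpoint that is closed once it has been idle long enough. */
    static final class Endpoint {
        private final AtomicBoolean closed = new AtomicBoolean(false);

        void close() {
            closed.set(true);
        }

        boolean isClosed() {
            return closed.get();
        }
    }

    /** Schedules eviction after idleTimeoutMs, waits waitMs, then reports whether it fired. */
    static boolean evictedAfter(long idleTimeoutMs, long waitMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            Endpoint endpoint = new Endpoint();
            timer.schedule(endpoint::close, idleTimeoutMs, TimeUnit.MILLISECONDS);
            Thread.sleep(waitMs);
            return endpoint.isClosed();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Same 2:3 ratio as the test's 2s timeout and 3s sleep, scaled to milliseconds.
        System.out.println("evicted: " + evictedAfter(200, 300));
    }
}
```

What matters for the assertion is that the wait outlasts the timeout with some margin, not the absolute size of either value.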

9. CosmosDiagnosticsTest (2.4s savings)

  • Reduced inter-operation sleeps from 1000ms to 200ms

Estimated Impact

  • ~5 fewer CI jobs (e2e consolidation, CB consolidation, Kafka reduction)
  • ~15 min less sleep time per pipeline run
  • Faster per-job execution from shade plugin skip
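As a rough check on the sleep figures quoted in the description, the arithmetic for two of the items can be written out. `savedMillis` is a hypothetical helper; the sketch counts each sleep exactly once and does not try to reproduce the ~15 min estimate, which depends on how often each sleep actually executes in a run.

```java
public class SleepSavings {
    /** Milliseconds saved when `count` sleeps each shrink from oldMs to newMs. */
    static long savedMillis(int count, long oldMs, long newMs) {
        return count * (oldMs - newMs);
    }

    public static void main(String[] args) {
        long cleanup = savedMillis(47, 500, 100);    // 47 * 400 = 18800 ms
        long rntbd = savedMillis(1, 30_000, 3_000);  // 30 s -> 3 s = 27000 ms
        System.out.println("cleanup sleeps: " + cleanup + " ms");
        System.out.println("RNTBD sleep: " + rntbd + " ms");
    }
}
```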

kushagraThapar and others added 5 commits March 21, 2026 10:52
Pipeline matrix optimizations to reduce total CI job count and
wall-clock time for cosmos live tests.

Changes:

1. Skip shade plugin in live test pipeline (-Dshade.skip=true)
   Shade/antrun are unnecessary for test-only runs since the shared
   build already produces the shaded JAR. Applied to all test stages.

2. Consolidate e2e profile: flatten ConsistencyConfig matrix
   The e2e profile previously created 2 jobs (Session + Strong
   consistency) via a nested ConsistencyConfig dimension. Flattened
   to a single Session-only job since Strong consistency is already
   covered by the separate multi-master matrix entry.

3. Remove BoundedStaleness e2e job
   BoundedStaleness with Strong account consistency is redundant —
   the multi-master Strong consistency job already covers this path.

4. Consolidate circuit breaker jobs (CB=true + CB=false → single job)
   Instead of 4 separate matrix entries (multi-master CB=true,
   multi-master CB=false, FI CB=true, FI CB=false), consolidate into
   2 entries with a RunCBSecondPass variable. The first pass runs with
   CB=TRUE, then a PostStep re-runs with CB=FALSE on the same agent.
   This halves the number of ARM deployments (the slowest CI step).

5. Add failsafe execution bindings to e2e profile POMs
   The e2e profile in azure-cosmos-tests and azure-cosmos-benchmark
   was missing explicit failsafe plugin execution goals, causing
   tests to be skipped. Added integration-test and verify goals.

6. Remove -DCOSMOS.HTTP2_ENABLED from thin client test args
   HTTP/2 enablement is now controlled by the SDK internally.

Net effect: ~4 fewer CI jobs, ~30min less total pipeline time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restore the BoundedStaleness e2e job with Strong account consistency.
This ensures e2e tests are validated under BoundedStaleness read
consistency with a Strong account, which exercises a distinct
consistency downgrade path not covered by other matrix entries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Reduce Kafka test matrix from 5 Java versions to 3 (drop Java 8 and
  11, keep 17, 21, 25). Java 8 and 11 are past end-of-life and the
  Kafka connector's minimum supported version is Java 11. This saves
  2 CI jobs per Kafka test run.

- Reduce RntbdServiceEndpointTest idle timeout from 20s to 2s and sleep
  from 30s to 3s. This is a unit test that validates endpoint eviction
  on idle timeout — the timeout value doesn't need to be realistic,
  just long enough for the eviction timer to fire. Saves ~27s per run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…0ms to 200ms

These sleeps are spacers between sequential upsert operations in
validateRntbdStatistics test - they don't need full 1s waits.
Saves ~2.4s per test invocation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reduce ~21 minutes of pure sleep time across 3 CFP test files:

- Reduce CHANGE_FEED_PROCESSOR_TIMEOUT from 5000ms to 2000ms (test-only
  constant used for wait intervals, not actual processor config)
- Reduce REPLICA_IN_SATELLITE_REGION_CATCH_UP_TIME from 10000ms to
  5000ms in epk/pk IncrementalChangeFeedProcessorTest
- Remove 5 redundant post-shutdown sleeps — safeStopChangeFeedProcessor
  already blocks on .stop().block() with a timeout
- Reduce 47 cleanup sleeps from 500ms to 100ms (collection deletion
  wait in finally blocks)

Estimated savings: ~15 minutes of sleep time per CI run across these
3 test files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The reduction from 5000ms to 2000ms caused CFP test failures:
- staledLeaseAcquiring: waitToReceiveDocuments timed out because the
  polling timeout is N * CHANGE_FEED_PROCESSOR_TIMEOUT
- readFeedDocumentsWithThroughputControl: document validation failed
  due to insufficient processing time

The other CFP optimizations (removed redundant shutdown sleeps,
cleanup 500ms→100ms, REPLICA_IN_SATELLITE_REGION_CATCH_UP_TIME
10s→5s) are kept as they don't affect polling timeouts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
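
The coupling between the constant and the polling timeout can be sketched as below. This is an illustrative model, not the test helper itself: `waitUntil` and `readyAfterChecks` are hypothetical, the multiplier 4 and the 1ms poll interval are made-up values, and readiness is counted in polls rather than wall-clock time so the outcome is deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class PollingWaitSketch {
    /** Polls the condition every pollMs until it holds or budgetMs is used up. */
    static boolean waitUntil(BooleanSupplier condition, long budgetMs, long pollMs) {
        long remaining = budgetMs;
        while (remaining > 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            remaining -= pollMs;
        }
        return condition.getAsBoolean(); // one last check after the budget is spent
    }

    /** Hypothetical condition that first holds on its n-th evaluation. */
    static BooleanSupplier readyAfterChecks(int n) {
        AtomicInteger checks = new AtomicInteger();
        return () -> checks.incrementAndGet() >= n;
    }

    public static void main(String[] args) {
        int multiplier = 4;   // stands in for N in "N * CHANGE_FEED_PROCESSOR_TIMEOUT"
        long oldUnit = 5;     // scaled-down stand-in for 5000ms
        long newUnit = 2;     // scaled-down stand-in for 2000ms
        // The work needs 13 polls: it fits in a budget of 20 but not in a budget of 8.
        System.out.println("old budget: " + waitUntil(readyAfterChecks(13), multiplier * oldUnit, 1));
        System.out.println("new budget: " + waitUntil(readyAfterChecks(13), multiplier * newUnit, 1));
    }
}
```

Shrinking the unit shrinks every budget derived from it, which is how a constant described as a test-only wait interval can still turn passing tests into timeouts.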