Cosmos Live tests pipeline optimizations #48506
Draft
kushagraThapar wants to merge 6 commits into Azure:main from
Pipeline matrix optimizations to reduce total CI job count and wall-clock time for cosmos live tests.

Changes:

1. Skip shade plugin in live test pipeline (-Dshade.skip=true)
   Shade/antrun are unnecessary for test-only runs since the shared build already produces the shaded JAR. Applied to all test stages.

2. Consolidate e2e profile: flatten ConsistencyConfig matrix
   The e2e profile previously created 2 jobs (Session + Strong consistency) via a nested ConsistencyConfig dimension. Flattened to a single Session-only job since Strong consistency is already covered by the separate multi-master matrix entry.

3. Remove BoundedStaleness e2e job
   BoundedStaleness with Strong account consistency is redundant: the multi-master Strong consistency job already covers this path.

4. Consolidate circuit breaker jobs (CB=true + CB=false → single job)
   Instead of 4 separate matrix entries (multi-master CB=true, multi-master CB=false, FI CB=true, FI CB=false), consolidate into 2 entries with a RunCBSecondPass variable. The first pass runs with CB=TRUE, then a PostStep re-runs with CB=FALSE on the same agent. This halves the number of ARM deployments (the slowest CI step).

5. Add failsafe execution bindings to e2e profile POMs
   The e2e profile in azure-cosmos-tests and azure-cosmos-benchmark was missing explicit failsafe plugin execution goals, causing tests to be skipped. Added integration-test and verify goals.

6. Remove -DCOSMOS.HTTP2_ENABLED from thin client test args
   HTTP/2 enablement is now controlled by the SDK internally.

Net effect: ~4 fewer CI jobs, ~30min less total pipeline time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the BoundedStaleness e2e job with Strong account consistency. This ensures e2e tests are validated under BoundedStaleness read consistency with a Strong account, which exercises a distinct consistency downgrade path not covered by other matrix entries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Reduce Kafka test matrix from 5 Java versions to 3 (drop Java 8 and 11, keep 17, 21, 25). Java 8 and 11 are past end-of-life and the Kafka connector's minimum supported version is Java 11. This saves 2 CI jobs per Kafka test run.
- Reduce RntbdServiceEndpointTest idle timeout from 20s to 2s and sleep from 30s to 3s. This is a unit test that validates endpoint eviction on idle timeout; the timeout value doesn't need to be realistic, just long enough for the eviction timer to fire. Saves ~27s per run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
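The reasoning behind the shorter idle timeout can be illustrated with a small sketch. The class and method names below are hypothetical and are not taken from RntbdServiceEndpointTest or the SDK. The sketch only assumes that eviction depends on whether the elapsed idle time exceeds the configured timeout, so shrinking the timeout and the sleep together should leave the test outcome unchanged.

```java
public class IdleEvictionSketch {
    /**
     * Hypothetical check (not the SDK's implementation): an endpoint is
     * evictable once it has been idle for longer than the idle timeout.
     */
    static boolean isEvictable(long lastUsedMs, long nowMs, long idleTimeoutMs) {
        return nowMs - lastUsedMs > idleTimeoutMs;
    }

    /** Sleep time saved per run when the test sleep is shortened. */
    static long savedSleepMs(long oldSleepMs, long newSleepMs) {
        return oldSleepMs - newSleepMs;
    }

    public static void main(String[] args) {
        // Old values from the commit message: 20s idle timeout, 30s sleep.
        boolean before = isEvictable(0L, 30_000L, 20_000L);
        // New values from the commit message: 2s idle timeout, 3s sleep.
        boolean after = isEvictable(0L, 3_000L, 2_000L);
        System.out.println("evictable before=" + before + ", after=" + after);
        System.out.println("saved ms=" + savedSleepMs(30_000L, 3_000L));
    }
}
```

In both configurations the sleep exceeds the timeout by the same ratio, which is why the shorter values can exercise the same eviction path.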
…0ms to 200ms

These sleeps are spacers between sequential upsert operations in the validateRntbdStatistics test; they don't need full 1s waits. Saves ~2.4s per test invocation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reduce ~21 minutes of pure sleep time across 3 CFP test files:

- Reduce CHANGE_FEED_PROCESSOR_TIMEOUT from 5000ms to 2000ms (test-only constant used for wait intervals, not actual processor config)
- Reduce REPLICA_IN_SATELLITE_REGION_CATCH_UP_TIME from 10000ms to 5000ms in epk/pk IncrementalChangeFeedProcessorTest
- Remove 5 redundant post-shutdown sleeps: safeStopChangeFeedProcessor already blocks on .stop().block() with a timeout
- Reduce 47 cleanup sleeps from 500ms to 100ms (collection deletion wait in finally blocks)

Estimated savings: ~15 minutes of sleep time per CI run across these 3 test files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The reduction from 5000ms to 2000ms caused CFP test failures:

- staledLeaseAcquiring: waitToReceiveDocuments timed out because the polling timeout is N * CHANGE_FEED_PROCESSOR_TIMEOUT
- readFeedDocumentsWithThroughputControl: document validation failed due to insufficient processing time

The other CFP optimizations (removed redundant shutdown sleeps, cleanup 500ms→100ms, REPLICA_IN_SATELLITE_REGION_CATCH_UP_TIME 10s→5s) are kept as they don't affect polling timeouts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
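To make the failure mode concrete, here is a minimal sketch of a poll loop whose budget is a multiple of a base timeout. The helper names and the multiplier of 10 are assumptions for illustration only; they are not the actual waitToReceiveDocuments implementation or its real value of N.

```java
import java.util.function.BooleanSupplier;

public class PollingBudgetSketch {
    /** Total wait budget when a poll loop is bounded by multiplier * base timeout. */
    static long budgetMs(int multiplier, long baseTimeoutMs) {
        return multiplier * baseTimeoutMs;
    }

    /**
     * Hypothetical poll loop: returns true if the condition becomes true
     * before the budget runs out, false otherwise.
     */
    static boolean waitFor(BooleanSupplier condition, long budgetMs, long pollIntervalMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + budgetMs * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollIntervalMs);
        }
        // One final check after the deadline has passed.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        int multiplier = 10; // assumed value of N, for illustration only
        System.out.println("budget at 5000ms base: " + budgetMs(multiplier, 5000L) + " ms");
        System.out.println("budget at 2000ms base: " + budgetMs(multiplier, 2000L) + " ms");
    }
}
```

Because the budget scales linearly with the base constant, cutting the constant from 5000ms to 2000ms shrinks every derived polling window to 40% of its former length, which is consistent with tests that previously passed starting to time out.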
Summary
Optimize the `java - cosmos - tests` live test CI pipeline to reduce total job count and wall-clock execution time. This is a follow-up to PR #48259, which optimized individual test code. This PR focuses on pipeline matrix efficiency and remaining Thread.sleep reductions.

Pipeline Matrix Optimizations

1. Skip shade plugin in live test pipeline (`-Dshade.skip=true -Dmaven.antrun.skip=true`)

2. Consolidate e2e profile matrix
- `ConsistencyConfig` dimension (Session + Strong → Session only)

3. Consolidate circuit breaker jobs (4 → 2)
- `RunCBSecondPass=true` variable. First pass runs CB=TRUE, then a PostStep re-runs with CB=FALSE on the same agent

4. Add failsafe execution bindings to e2e profile POMs
- The e2e profile in `azure-cosmos-tests` and `azure-cosmos-benchmark` was missing explicit failsafe `integration-test` and `verify` goals, causing tests to be silently skipped

5. Remove `-DCOSMOS.HTTP2_ENABLED` from thin client test args

6. Reduce Kafka test matrix from 5 → 3 Java versions

Thread.sleep Optimizations

7. Change Feed Processor tests (~15 min savings)
- `CHANGE_FEED_PROCESSOR_TIMEOUT` from 5000ms to 2000ms (test-only wait interval)
- `REPLICA_IN_SATELLITE_REGION_CATCH_UP_TIME` from 10000ms to 5000ms
- `IncrementalChangeFeedProcessorTest` (epk + pk), `FullFidelityChangeFeedProcessorTest`

8. RntbdServiceEndpointTest (27s savings)

9. CosmosDiagnosticsTest (2.4s savings)
Estimated Impact