Skip to content

feat(gax): Implement cert-rotation retries for grpc and http-json.#13246

Draft
vverman wants to merge 1 commit into
googleapis:mainfrom
vverman:agentic-cert-rotation-retry
Draft

feat(gax): Implement cert-rotation retries for grpc and http-json.#13246
vverman wants to merge 1 commit into
googleapis:mainfrom
vverman:agentic-cert-rotation-retry

Conversation

@vverman
Copy link
Copy Markdown
Contributor

@vverman vverman commented May 21, 2026

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism for dynamic mTLS certificate rotation by adding a refresh() capability to transport channels and implementing a RefreshingHttpJsonChannel for hot-swapping connections. It also updates various callables and retry algorithms to trigger these refreshes upon encountering unauthenticated errors. Feedback identifies a critical infinite loop bug in the new channel's entry retention logic and a violation of the ManagedChannel contract regarding termination status. Additionally, the reviewer recommends using System.nanoTime() for monotonic timing and advises against hardcoding environment variable checks in core library logic.

Comment on lines +94 to +101
private ChannelEntry getRetainedEntry() {
while (true) {
ChannelEntry entry = activeEntry.get();
if (entry.retain()) {
return entry;
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical bug in getRetainedEntry() that can lead to an infinite loop and 100% CPU usage. If shutdown() is called on the RefreshingHttpJsonChannel, it marks the current activeEntry for shutdown. Subsequent calls to newCall() will invoke getRetainedEntry(), which will repeatedly call retain() on the same activeEntry. Since shutdownRequested is now true, retain() will return false, causing the while(true) loop to spin indefinitely. You should check if the channel is shutting down and throw an IllegalStateException or return a failed call instead of looping.

Comment on lines +41 to +44
if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))
&& previousThrowable instanceof UnauthenticatedException) {
return true;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding environment variable checks (isMwlidEnvironment) directly into the core retry logic is an anti-pattern for a library. This makes the library's behavior dependent on external state that isn't explicitly configured through its API, making it harder to test and reason about. This logic should be moved to a configurable property in RetrySettings or ApiCallContext.

}
try {
synchronized (lock) {
long now = System.currentTimeMillis();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Prefer using System.nanoTime() over System.currentTimeMillis() for measuring elapsed time and durations. System.nanoTime() is monotonic and unaffected by adjustments to the system wall clock, making it more reliable for cooldown logic.

Suggested change
long now = System.currentTimeMillis();
long now = System.nanoTime();

Comment on lines +138 to +140
public boolean awaitTermination(long duration, TimeUnit unit) throws InterruptedException {
return activeEntry.get().channel.awaitTermination(duration, unit);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation of awaitTermination, isShutdown, and isTerminated only considers the current active entry. During a certificate rotation, there may be multiple older entries still draining. This implementation might return true for isTerminated while older connections are still active, which violates the contract of ManagedChannel. Consider tracking all active entries to provide an aggregate status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant