Skip to content

core: Fix pick_first NPE when accepting resolved addresses and in CONNECTING state#12814

Merged
kannanjgithub merged 1 commit into
grpc:masterfrom
schiemon:fix/pick-first-leaf-npe-12796
May 22, 2026
Merged

core: Fix pick_first NPE when accepting resolved addresses and in CONNECTING state#12814
kannanjgithub merged 1 commit into
grpc:masterfrom
schiemon:fix/pick-first-leaf-npe-12796

Conversation

@schiemon
Copy link
Copy Markdown
Contributor

@schiemon schiemon commented May 19, 2026

This PR resolves #12796.

It makes sure that whenever PickFirstLeafLoadBalancer transitions into CONNECTING the current address in the addressIndex has a corresponding subchannel. This prevents an NPE when hitting

if (rawConnectivityState == READY
|| (rawConnectivityState == CONNECTING
&& (!enableHappyEyeballs || addressIndex.isValid()))) {
// If the previous ready (or connecting) subchannel exists in new address list,
// keep this connection and don't create new subchannels. Happy Eyeballs is excluded when
// connecting, because it allows multiple attempts simultaneously, thus is fine to start at
// the beginning.
SocketAddress previousAddress = addressIndex.getCurrentAddress();
addressIndex.updateGroups(newImmutableAddressGroups);
if (addressIndex.seekTo(previousAddress)) {
SubchannelData subchannelData = subchannels.get(previousAddress);
subchannelData.getSubchannel().updateAddresses(addressIndex.getCurrentEagAsList());

in acceptResolvedAddresses in some situations.

Note that for READY, a state from which this path can also be entered, this is already guaranteed.

@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 19, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: schiemon / name: szymon.habrainski (c76956a)

// addresses in the updated list.
subchannelData.updateState(newState);
if (rawConnectivityState == TRANSIENT_FAILURE || concludedState == TRANSIENT_FAILURE) {
if (newState == CONNECTING) {
Copy link
Copy Markdown
Contributor Author

@schiemon schiemon May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If concludedState == TRANSIENT_FAILURE but rawConnectivityState != TRANSIENT_FAILURE, shouldn't we adapt our rawConnectivityState?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without diving deep into it, I'd say since it is not related to the fix for the NPE, a new issue or discussion can be started with your concern.


if (rawConnectivityState == READY
|| (rawConnectivityState == CONNECTING
&& (!enableHappyEyeballs || addressIndex.isValid()))) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

Suggested change
&& addressIndex.isValid())) {

? If we !enableHappyEyeballs and !addressIndex.isValid() then we get an exception when calling addressIndex.getCurrentAddress().

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the fix, !enableHappyEyeballs && !addressIndex.isValid() can never evaluate to true because processSubchannelState does a seek and updates the address index.

addressIndex.updateGroups(newImmutableAddressGroups);
if (addressIndex.seekTo(previousAddress)) {
SubchannelData subchannelData = subchannels.get(previousAddress);
subchannelData.getSubchannel().updateAddresses(addressIndex.getCurrentEagAsList());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With HE we have potentionally multiple connecting subchannels: why aren't we updating them all?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is part of a short-circuit optimization. It says: "I have found the one address I was
focusing on. I'll update it and stop looking."

If you had multiple subchannels connecting and wanted to update them all, you would effectively be implementing a "Parallel Load Balancer" rather than a "Pick First" balancer. In PickFirst, the moment we can confirm our "focus" is still valid, we stop processing the update and get back to work.

@schiemon schiemon marked this pull request as ready for review May 19, 2026 14:03

if (rawConnectivityState == READY
|| (rawConnectivityState == CONNECTING
&& (!enableHappyEyeballs || addressIndex.isValid()))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the fix, !enableHappyEyeballs && !addressIndex.isValid() can never evaluate to true because processSubchannelState does a seek and updates the address index.

addressIndex.updateGroups(newImmutableAddressGroups);
if (addressIndex.seekTo(previousAddress)) {
SubchannelData subchannelData = subchannels.get(previousAddress);
subchannelData.getSubchannel().updateAddresses(addressIndex.getCurrentEagAsList());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is part of a short-circuit optimization. It says: "I have found the one address I was
focusing on. I'll update it and stop looking."

If you had multiple subchannels connecting and wanted to update them all, you would effectively be implementing a "Parallel Load Balancer" rather than a "Pick First" balancer. In PickFirst, the moment we can confirm our "focus" is still valid, we stop processing the update and get back to work.

// addresses in the updated list.
subchannelData.updateState(newState);
if (rawConnectivityState == TRANSIENT_FAILURE || concludedState == TRANSIENT_FAILURE) {
if (newState == CONNECTING) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without diving deep into it, I'd say since it is not related to the fix for the NPE, a new issue or discussion can be started with your concern.

@kannanjgithub
Copy link
Copy Markdown
Contributor

/gcbrun

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a NullPointerException in PickFirstLeafLoadBalancer.acceptResolvedAddresses() that can occur when the balancer is in CONNECTING and the addressIndex points at an address that doesn’t yet have a corresponding entry in subchannels (reported in #12796). The fix ensures that when transitioning into CONNECTING, the addressIndex is adjusted to point at an address backed by an existing subchannel, preventing subchannels.get(previousAddress) from returning null during address updates.

Changes:

  • Adjust processSubchannelState(CONNECTING) to realign addressIndex when its current address lacks a subchannel (or the index is invalid in non-Happy-Eyeballs mode).
  • Add a regression test that reproduces the state inconsistency leading to the NPE (health-check + transient failure scenario).
  • Expose rawConnectivityState via a @VisibleForTesting accessor to support the new test assertions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
core/src/main/java/io/grpc/internal/PickFirstLeafLoadBalancer.java Ensures addressIndex points to an address with a subchannel when entering CONNECTING, preventing NPEs in acceptResolvedAddresses; adds a testing accessor.
core/src/test/java/io/grpc/internal/PickFirstLeafLoadBalancerTest.java Adds a regression test reproducing #12796 by driving the LB through the problematic CONNECTING/address-update sequence.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +336 to +340
// If we get a newly resolved address list via acceptResolvedAddresses,
// as we are in CONNECTING, we will try to .updateAddresses the currently
// connecting subchannel if it exists in the new list.
// As such, We need to make sure that with transitioning to CONNECTING the subchannel for
// the current address of a valid index exists.
Comment on lines +342 to +343
|| (addressIndex.isValid() && !subchannels.containsKey(
addressIndex.getCurrentAddress()))) {

// reproduces #12796
@Test
public void healthCheckWithTF_AllowsStateInconsistency() {
Comment on lines +685 to +687
// HealthListener.onSubchannelState gets called. It updates the LBs balancing
// state/concludedState.
assertEquals(TRANSIENT_FAILURE, loadBalancer.getConcludedConnectivityState());
// Given the new list, it is now pointing to server 2 which does not have a subchannel.
assertEquals(0, loadBalancer.getIndexLocation());

// Subchannel 1 is still in TRANSIENT_FAILURE state. Is backoff expires,
@kannanjgithub kannanjgithub added the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label May 22, 2026
@grpc-kokoro grpc-kokoro removed the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label May 22, 2026
@kannanjgithub kannanjgithub merged commit a6916c9 into grpc:master May 22, 2026
15 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NPE in PickFirstLeafLoadBalancer

4 participants