Skip to content

fix: close #3247 nacos/directory lifecycle race risks#3270

Open
xxs588 wants to merge 9 commits intoapache:developfrom
xxs588:split/3247-registry-lifecycle
Open

fix: close #3247 nacos/directory lifecycle race risks#3270
xxs588 wants to merge 9 commits intoapache:developfrom
xxs588:split/3247-registry-lifecycle

Conversation

@xxs588
Copy link
Copy Markdown
Contributor

@xxs588 xxs588 commented Mar 19, 2026

Description

Fixes #3247

  • registry/nacos 中将 scheduledLookUptime.Sleep 轮询改为 ticker + done,关闭时可及时退出
  • registry/directory 中调整 Subscribe 时序:先执行带超时的 register,再启动 subscribe goroutine;register 路径使用 url.Clone()
  • 补充回归测试:TestNacosRegistryScheduledLookUpStopsWhenDoneClosedTestRegistryDirectorySubscribeTimeoutSkipsSubscribeStart 及相关并发访问测试

Verification

  • go test ./registry/nacos ./registry/directory
  • go test -race ./registry/nacos -run TestNacosRegistryScheduledLookUpStopsWhenDoneClosed
  • go test -race ./registry/directory -run 'TestRegistryDirectorySubscribeTimeoutSkipsSubscribeStart|TestRegistryDirectoryIsAvailableConcurrentWithCacheUpdates|TestRegistryDirectoryConfiguratorConcurrentAccess|TestRegistryDirectorySubscribedURLConcurrentAccess'

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 77.52809% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.65%. Comparing base (60d1c2a) to head (3314048).
⚠️ Report is 767 commits behind head on develop.

Files with missing lines Patch % Lines
registry/directory/directory.go 75.00% 14 Missing and 5 partials ⚠️
registry/nacos/registry.go 83.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3270      +/-   ##
===========================================
+ Coverage    46.76%   49.65%   +2.88%     
===========================================
  Files          295      469     +174     
  Lines        17172    34657   +17485     
===========================================
+ Hits          8031    17209    +9178     
- Misses        8287    16060    +7773     
- Partials       854     1388     +534     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@xxs588 xxs588 force-pushed the split/3247-registry-lifecycle branch from 20fcf13 to ceb551b Compare March 19, 2026 07:32
@xxs588 xxs588 marked this pull request as draft March 19, 2026 07:52
@xxs588 xxs588 changed the title fix(race): close #3247 nacos and directory lifecycle race risks fix: close #3247 nacos/directory lifecycle race risks Mar 19, 2026
@xxs588 xxs588 marked this pull request as ready for review March 19, 2026 15:24
@Alanxtl Alanxtl added ☢️ Bug 3.3.2 version 3.3.2 labels Mar 21, 2026
Copy link
Copy Markdown
Contributor

@AlexStocks AlexStocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修复了 directory 生命周期和 nacos subscribe 的多处 race,整体方向对。有几个问题需要处理。

  1. Subscribe 注册超时行为变更:原来无限等待直到成功,新版有超时限制。这个行为变更需要明确说明,可能会影响依赖旧行为的上层调用。

  2. TestRegistryDirectorySubscribeTimeoutSkipsSubscribeStart 的 20ms 超时在慢 CI 或高负载机器上可能误判,建议至少 100ms。

  3. overrideUrl 里 snapshotConfigurators() 是只读副本,建议加注释说明 snapshot 是为了只读,避免后续维护者误改。

@xxs588
Copy link
Copy Markdown
Contributor Author

xxs588 commented Mar 23, 2026

感谢建议,我现在开始修改

return dir.cacheInvokers
return dir.snapshotCacheInvokers()
}
return routerChain.Route(dir.consumerURL, invocation)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里虽然把 routerChain == nil 的分支换成了快照,但 routerChain.Route() 仍然会无锁读 RouterChain.invokers,同时 setNewInvokers() 又会在另一个 goroutine 里通过 SetInvokers() 写这份切片。当前 HEAD 直接跑 go test -race ./registry/directory ./registry/nacos 仍会在 Test_List 报 race。建议在 RouterChain 内部先做带锁 snapshot 再路由,只修 cacheInvokers 读取还不够。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感谢建议,已修复,Route() 不再直接读 c.invokers,改为先走 snapshotInvokers() 拿快照再路由,对cluster/router全包进行了测试,通过
这是修复的地方
图片

case <-time.After(timeout):
logger.Errorf("register timed out for service: %s", url.Key())
return fmt.Errorf("register timed out for service: %s", url.Key())
case <-timer.C:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的超时只是让调用方提前返回,后台 Register goroutine 还会继续跑。如果它在超时之后成功,consumer 实际上已经注册了,但 Subscribe 不会启动,后续也没有补偿 UnRegister,状态会变成“返回失败但部分初始化成功”。建议把取消能力下推到 registry.Register,或者在超时后处理 late success 并做回滚。

Copy link
Copy Markdown
Contributor Author

@xxs588 xxs588 Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感谢 ,已修改成 registerConsumerWithTimeout() 超时返回后,如果 Register 迟到成功,则执行 UnRegister 回滚
逻辑详见图片
图片

图片 如果`Subscribe`超时返回错误了但是后面goroutine可能没那么快结束,所以再起一个goroutine等待结果,如果`Register`失败了就不用补偿直接return ,如果成功了说明出现 调用方收到超时失败,但实际上已经注册成功的状态,然后就再补偿 回滚把迟到成功的注册注销掉,如果回滚失败也补上日志方便定位

@xxs588 xxs588 force-pushed the split/3247-registry-lifecycle branch from 450d90c to fd4fcd2 Compare March 25, 2026 14:08
xxs588 and others added 8 commits March 31, 2026 21:47
- replace sleep loop with ticker + done select in scheduledLookUp
- ensure lookup goroutine exits promptly when registry is destroyed
- add regression test for done-triggered exit

Fixes: apache#3247
- register consumer with timeout before starting subscribe goroutine
- use URL clone in register path to avoid cross-goroutine URL mutation races
- add timeout regression test to ensure subscribe is not started on register timeout

Fixes: apache#3247
Co-authored-by: Xuetao Li <xuetaoli@apache.org>
@xxs588 xxs588 force-pushed the split/3247-registry-lifecycle branch from fd4fcd2 to a041ff6 Compare April 2, 2026 13:10
serviceKey := url.Key()
// Registration is bounded by registry timeout (default 5s).
// On timeout, we return immediately and skip starting subscribe goroutine.
if err := dir.registerConsumerWithTimeout(url, timeout, serviceKey); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] 这里把订阅启动放到了 Register 成功之后,改变了原有容错语义。registry/protocol/protocol.godic.Subscribe(...) 返回错误时只记日志,后面仍会继续 cluster.Join(dic) 并返回 invoker;也就是说一旦注册超时或失败,消费者已经被创建出来,但永远不会启动订阅协程,目录会一直拿不到 provider 列表。建议保持订阅与注册解耦,先启动 registry.Subscribe,再单独处理注册超时/回滚,至少不要让注册失败把服务发现直接短路掉。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

收到,我现在去修改

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这次改为先启动 registry.Subscribe,再执行带超时的 Register,保持订阅与注册解耦。
同时在 -race 验证中发现了测试侧存在新竞态(测试直接读 cacheInvokers、以及 normalRegistryDir 异步启动 Subscribe 带来的并发噪音),已同步修成并发安全读取和同步初始化。

Address PR apache#3270 P0 review: keep discovery path alive by starting registry.Subscribe before timeout-bounded Register.

Also fix newly surfaced test-side races under -race by replacing direct cacheInvokers reads with snapshotCacheInvokers and removing async Subscribe startup in normalRegistryDir.
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 3, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.3.2 version 3.3.2 ☢️ Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Systematic data races: 20+ extension global maps unprotected, wrong lock in SetProviderService, RouterChain.Route lockless

4 participants