fix: close #3247 nacos/directory lifecycle race risks#3270
fix: close #3247 nacos/directory lifecycle race risks#3270xxs588 wants to merge 9 commits intoapache:developfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #3270 +/- ##
===========================================
+ Coverage 46.76% 49.65% +2.88%
===========================================
Files 295 469 +174
Lines 17172 34657 +17485
===========================================
+ Hits 8031 17209 +9178
- Misses 8287 16060 +7773
- Partials 854 1388 +534 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
20fcf13 to
ceb551b
Compare
AlexStocks
left a comment
There was a problem hiding this comment.
修复了 directory 生命周期和 nacos subscribe 的多处 race,整体方向对。有几个问题需要处理。
-
Subscribe 注册超时行为变更:原来无限等待直到成功,新版有超时限制。这个行为变更需要明确说明,可能会影响依赖旧行为的上层调用。
-
TestRegistryDirectorySubscribeTimeoutSkipsSubscribeStart 的 20ms 超时在慢 CI 或高负载机器上可能误判,建议至少 100ms。
-
overrideUrl 里 snapshotConfigurators() 是只读副本,建议加注释说明 snapshot 是为了只读,避免后续维护者误改。
|
感谢建议,我现在开始修改 |
| return dir.cacheInvokers | ||
| return dir.snapshotCacheInvokers() | ||
| } | ||
| return routerChain.Route(dir.consumerURL, invocation) |
There was a problem hiding this comment.
这里虽然把 routerChain == nil 的分支换成了快照,但 routerChain.Route() 仍然会无锁读 RouterChain.invokers,同时 setNewInvokers() 又会在另一个 goroutine 里通过 SetInvokers() 写这份切片。当前 HEAD 直接跑 go test -race ./registry/directory ./registry/nacos 仍会在 Test_List 报 race。建议在 RouterChain 内部先做带锁 snapshot 再路由,只修 cacheInvokers 读取还不够。
| case <-time.After(timeout): | ||
| logger.Errorf("register timed out for service: %s", url.Key()) | ||
| return fmt.Errorf("register timed out for service: %s", url.Key()) | ||
| case <-timer.C: |
There was a problem hiding this comment.
这里的超时只是让调用方提前返回,后台 Register goroutine 还会继续跑。如果它在超时之后成功,consumer 实际上已经注册了,但 Subscribe 不会启动,后续也没有补偿 UnRegister,状态会变成“返回失败但部分初始化成功”。建议把取消能力下推到 registry.Register,或者在超时后处理 late success 并做回滚。
450d90c to
fd4fcd2
Compare
- replace sleep loop with ticker + done select in scheduledLookUp - ensure lookup goroutine exits promptly when registry is destroyed - add regression test for done-triggered exit Fixes: apache#3247
- register consumer with timeout before starting subscribe goroutine - use URL clone in register path to avoid cross-goroutine URL mutation races - add timeout regression test to ensure subscribe is not started on register timeout Fixes: apache#3247
Co-authored-by: Xuetao Li <xuetaoli@apache.org>
fd4fcd2 to
a041ff6
Compare
registry/directory/directory.go
Outdated
| serviceKey := url.Key() | ||
| // Registration is bounded by registry timeout (default 5s). | ||
| // On timeout, we return immediately and skip starting subscribe goroutine. | ||
| if err := dir.registerConsumerWithTimeout(url, timeout, serviceKey); err != nil { |
There was a problem hiding this comment.
[P0] 这里把订阅启动放到了 Register 成功之后,改变了原有容错语义。registry/protocol/protocol.go 里 dic.Subscribe(...) 返回错误时只记日志,后面仍会继续 cluster.Join(dic) 并返回 invoker;也就是说一旦注册超时或失败,消费者已经被创建出来,但永远不会启动订阅协程,目录会一直拿不到 provider 列表。建议保持订阅与注册解耦,先启动 registry.Subscribe,再单独处理注册超时/回滚,至少不要让注册失败把服务发现直接短路掉。
There was a problem hiding this comment.
这次改为先启动 registry.Subscribe,再执行带超时的 Register,保持订阅与注册解耦。
同时在 -race 验证中发现了测试侧存在新竞态(测试直接读 cacheInvokers、以及 normalRegistryDir 异步启动 Subscribe 带来的并发噪音),已同步修成并发安全读取和同步初始化。
Address PR apache#3270 P0 review: keep discovery path alive by starting registry.Subscribe before timeout-bounded Register. Also fix newly surfaced test-side races under -race by replacing direct cacheInvokers reads with snapshotCacheInvokers and removing async Subscribe startup in normalRegistryDir.
|






Description
Fixes #3247
registry/nacos中将scheduledLookUp从time.Sleep轮询改为ticker + done,关闭时可及时退出registry/directory中调整Subscribe时序:先执行带超时的 register,再启动 subscribe goroutine;register 路径使用url.Clone()TestNacosRegistryScheduledLookUpStopsWhenDoneClosed、TestRegistryDirectorySubscribeTimeoutSkipsSubscribeStart及相关并发访问测试Verification
Checklist
develop