From 62641474c14a03123b272475d60ec2b253423094 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Mon, 4 May 2026 16:10:46 +0800
Subject: [PATCH] [style-harmonize] OceanBase cases post-merge fixup + SKILL-INDEX OB section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes 6 nits across PR #50 + #51 (per Allen curator review commitment +
Noah ack):

PR #50 (oceanbase-repl-post-failover-stale-ro-case.md) nit (c):
- Fix validation 主路径 (main path) sub-section: add a 1-line abstract
  showing that on the happy path the v2 freshness gate fix correctly
  detects stale-ro (TDD designed FAIL = 21/1/0)
- Fix validation ground_truth_unavailable sub-section: add an abstract
  showing the conservative fail-stop path when the standby has no seed,
  validating the 'cannot read ground truth -> fail-stop, not silent
  skip' design intent

PR #51 (oceanbase-repl-peer-primary-acked-write-divergence-case.md):
- nit (a): Source evidence index gets the cross-repo path prefix
  'kubeblocks-tests/oceanbase/' added to all work/ paths + an explicit
  intro note that sha256 verification doesn't depend on path stability
- nit (b): the 2 cross-addon contracts in 设计教训 (design lessons) each
  get an explicit Anti-pattern statement (sidecar single-pod view as
  sufficient PRIMARY publish criterion / peer probe error propagation to
  outer kbagent deadline)
- nit (c): Guard v1 N=1 / N=5 / N=10 sub-sections each get a 1-line
  abstract showing the strength evidence progression (N=1 baseline /
  N=5 main path stable but timeout-isolation untouched / N=10 cycle04
  hits timeout-isolation, finally validating it)

SKILL-INDEX (structural gap surfaced by fire #14 dashboard):
- Add a new '### OceanBase' section between Oracle and Methodology
- Both case entries get full descriptions, including parent doc
  cross-refs + P0/P1 framing + Guard validation N=1/5/10 progression

Pure additive: methodology body / sub-section organization unchanged
beyond nit-level abstracts.
SKILL-INDEX cross-engine annotation rate delta: 0 -> 2 OB entries
(closes the cases/oceanbase/ -> SKILL-INDEX sync gap that landed as
silent debt with PR #50 + PR #51).
---
 docs/SKILL-INDEX.md                           |  5 ++++
 ...eer-primary-acked-write-divergence-case.md | 24 +++++++++++++------
 ...anbase-repl-post-failover-stale-ro-case.md |  4 ++++
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md
index f3546fb..d184056 100644
--- a/docs/SKILL-INDEX.md
+++ b/docs/SKILL-INDEX.md
@@ -131,6 +131,11 @@
 - [`docs/cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md`](cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md) — reconfigure_deep Run 1→3 闭环:Bug #12 (DBCA 跑期间 liveness 误杀,initialDelay=600 + 90s 重启窗口) + Bug #13 (post-switchover 慢控制面 flap readiness);3 layer fix(cmpd probe 参数 + liveness.sh 软失败 + checkDBStatus.sh best-effort dgmgrl);Run 3 全 PASS + RESTARTS=0 实证;属 [`addon-probe-timeout-and-soft-failure-guide.md`](addon-probe-timeout-and-soft-failure-guide.md) 工程现场补充
 - [`docs/cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md`](cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md) — reconfigure_deep T22d FAIL:`processes: int & >=6` cue 太宽,10 通过 ValidatePhase → ORA-603/1092 → instance terminated → KB OpsRequest 卡 Running 25min+;fix `>=100` Run 3 验证生效(ValidatePhase reject within 10s);属 [`addon-paramdef-cue-range-validation-guide.md`](addon-paramdef-cue-range-validation-guide.md) 工程现场补充
 
+### OceanBase
+
+- [`docs/cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md`](cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md) — OceanBase repl 拓扑(primary + 1 standby)同步 kill 后 standby 仍被 ro service 选中并返回过期数据案例(C03 类,**P1 silent stale-RO**)。修复方向 P2 测试侧已实装(freshness gate v2)+ P1 addon 侧未实装(roleProbe 加 `V$OB_LS_LOG_RESTORE_STATUS` health gate)。属 [`addon-control-plane-election-guide.md`](addon-control-plane-election-guide.md) 工程现场补充(roleProbe 健康口径 / role 标签 / ro service 端点选择),Trap T 段同时是 [`addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md) 反模式实证(早期 `ready_count < 2 ⇒ no_quorum` 假设被 evidence 反证)
+- [`docs/cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md`](cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md) — OceanBase repl 拓扑 transient peer-primary 接受 ack 写入后被 final primary 丢失案例(C09 类,**P0 acked-write divergence**)。Guard v1 验证 N=1 / N=5 / N=10 三段递进,累计 ack=570 全保留 + cycle04 真遇 2 条 peer timeout 验证 timeout-isolation 封装到 `peerPrimaryErrors`。设计教训抽 2 条 cross-addon 契约(peer-aware PRIMARY publishing + peer probe error encapsulation),属 [`addon-control-plane-election-guide.md`](addon-control-plane-election-guide.md) 工程现场补充。trace v2 reproduction 段同时是 [`addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md) evidence-gap 闭环范本(埋点→复现→关闭 raw decision gap,semantics 不动)
+
 ### Methodology
 
 - [`docs/cases/methodology/evidence-inflation-in-csi-durability-debate-case.md`](cases/methodology/evidence-inflation-in-csi-durability-debate-case.md) — 04-28 MariaDB #396 CSI durability 讨论中的 3 次 inflation 与 3 次撤回(动机 narrative / N=1→average / 间接旁证→系统性证伪),属 [`addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md) 实证补充
diff --git a/docs/cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md b/docs/cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md
index 0084299..064ddbd 100644
--- a/docs/cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md
+++ b/docs/cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md
@@ -230,6 +230,8 @@ Guard v1 行为(已实装在 syncer roleProbe,跟 C03 案例里描述的「
 
 ### Guard v1 N=1
 
+**Abstract**: Guard v1 fix 在 N=1 cycle 下命中 transient peer-primary 互斥检查,trace 字段 `primary_peer_conflict_rejected` 双 pod 都打出,rw endpoint 在 unsafe transition 期间变空(拒绝路由),所有 ack 写入落 final primary,`ack_missing=0`。
+
 Root: `work/ob-test-supplement/20260503-2315-c09-peer-primary-guard-v1-n1`
 
 Result:
@@ -253,6 +255,8 @@ Guard evidence:
 
 ### Guard v1 N=5
 
+**Abstract**: N=5 cycle 全 PASS(5/5 cycles passed),累计 client-ack 写入 288 全保留,`ack_missing=0`,没有外层 kbagent `roleProbe timeout` event — 主路径稳定。但 N=5 没遇到 timeout / deadline peer error,timeout-isolation 路径未被这一档 strength 触发;递进到 N=10 才闭环。
+
 Root: `work/ob-test-supplement/20260503-2325-c09-peer-primary-guard-v1-n5`
 
 Archive sha256: `7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd`
@@ -269,6 +273,8 @@ N=5 后剩余:
 
 ### Guard timeout-isolation v1 N=10
 
+**Abstract**: N=10 cycle 全 rc=0,累计 client-ack 写入 570 全保留,`ack_missing=0`,cycle04 真遇 2 条 peer timeout / deadline error — 验证 timeout-isolation 设计将其封装在 `peerPrimaryErrors` 字段不吃掉外层 kbagent `roleProbe deadline`,sidecar 不被 KB 误判 unhealthy。**这是 N=5 不能闭环、必须递进 N=10 strength 才命中的关键 timeout-isolation 验证证据**。
+
 Root: `work/ob-test-supplement/20260504-0005-c09-peer-primary-guard-timeout-v1-n10`
 
 Archive sha256: `9e54e7695b692915a4a3e5f0bb63abe9f4806d0a2fa9af52561c6205c6d1ef9a`
@@ -301,7 +307,9 @@ Harness caveat(值得记录的踩坑):
 把 Guard v1 + N=10 timeout isolation 的两条核心契约抽象出来,对 PG / MySQL / MongoDB 等其他 addon 的 sidecar role probe 同样适用:
 
 1. **Peer-aware 才能给出 PRIMARY 标签**:单 pod 自查 `tenantRole=PRIMARY` 不够;必须查 peer,发现 peer 也是 semantic PRIMARY 时拒绝上报。短暂无 rw endpoint > ack 写入落到错误链。trace 字段建议:`peerPrimaryCheck / peerPrimaryConflict / peerPrimaryErrors / decisionReason=primary_peer_conflict_rejected`。
+   - **Anti-pattern**:把"sidecar 单 pod 自查 `tenantRole=PRIMARY`"当作上报 PRIMARY 标签的充分条件 — transition window 必然出现 acked-write divergence。
 2. **Peer 探测错误必须封装在内层,不能吃掉外层 deadline**:peer probe 的 timeout / deadline 错误必须保留在 `peerPrimaryErrors` 字段里,不能让它们 propagate 成 outer kbagent `roleProbe timeout` event — 否则 sidecar 自身被 KB 当成 unhealthy 反而触发 cascading 状态变更。
+   - **Anti-pattern**:把 peer probe 的 timeout / deadline error 直接 raise / propagate 出去 — outer kbagent 把整个 sidecar 当 unhealthy 触发额外状态变更,故障面被 sidecar 自身放大。
 
 跨 addon 反 anti-pattern:「把 sidecar 单 pod 视角等同于全集群视角」是同一类陷阱(也跟 C03 stale-RO 案例里 roleProbe 没看 `V$OB_LS_LOG_RESTORE_STATUS` 是同源),都属于 [`addon-control-plane-election-guide.md`](../../addon-control-plane-election-guide.md) 的 control-plane election 真值口径领域。
 
@@ -337,22 +345,24 @@ dist 状态:
 
 ## Source evidence index
 
+> **跨仓路径说明**:以下所有 `work/ob-test-supplement/...` 路径都位于 `apecloud/kubeblocks-tests` 仓库的 `oceanbase/` 目录下。完整路径前缀 = `kubeblocks-tests/oceanbase/work/ob-test-supplement/...`。archive sha256 是 evidence pack 完整性校验,跨仓 reader 用 sha 直接 verify,不依赖路径稳定。
+
 原始 C09 故障:
 
-- root summary: `work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-root-evidence-summary.md`(Mia archived bundle)
-- final-freeze archive: `work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-final-freeze-before-cleanup-20260503-222501.tar.gz`(archive sha256 `e3c0b833312ad1e0fd3f81b9e290b9263772b0573d574627227822a030dbc484`)
+- root summary: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-root-evidence-summary.md`(Mia archived bundle)
+- final-freeze archive: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-final-freeze-before-cleanup-20260503-222501.tar.gz`(archive sha256 `e3c0b833312ad1e0fd3f81b9e290b9263772b0573d574627227822a030dbc484`)
 
 Trace v2 reproduction(关闭 raw decision evidence gap):
 
-- root: `work/ob-test-supplement/20260503-2235-c09-trace-v2-repro-n1/evidence/trace-v2-c09-acked-write-divergence-20260503-224553`
+- root: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2235-c09-trace-v2-repro-n1/evidence/trace-v2-c09-acked-write-divergence-20260503-224553`
 - summary: `SUMMARY.md`
 - archive: `trace-v2-c09-acked-write-divergence-20260503-224553.tar.gz`(sha256 `8729ddacb228a3580e894af09b8ecde7f9c91f1d660eba24c799449e4b66549f`,summary sha256 `d1b232b2492195aa262fa4d6246f3c8a74bc085849a002424692524d9120efb2`)
 
-Guard 验证:
+Guard 验证(sha256 验证不依赖路径稳定):
 
-- N=1: `work/ob-test-supplement/20260503-2315-c09-peer-primary-guard-v1-n1/c09-guard-v1-summary.md`
-- N=5: `work/ob-test-supplement/20260503-2325-c09-peer-primary-guard-v1-n5/c09-guard-v1-n5-summary.md`(archive sha256 `7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd`)
-- N=10 timeout isolation: `work/ob-test-supplement/20260504-0005-c09-peer-primary-guard-timeout-v1-n10/c09-timeout-guard-v1-n10-summary.md`(archive sha256 `9e54e7695b692915a4a3e5f0bb63abe9f4806d0a2fa9af52561c6205c6d1ef9a`)
+- N=1: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2315-c09-peer-primary-guard-v1-n1/c09-guard-v1-summary.md`
+- N=5: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2325-c09-peer-primary-guard-v1-n5/c09-guard-v1-n5-summary.md`(archive sha256 `7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd`)
+- N=10 timeout isolation: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260504-0005-c09-peer-primary-guard-timeout-v1-n10/c09-timeout-guard-v1-n10-summary.md`(archive sha256 `9e54e7695b692915a4a3e5f0bb63abe9f4806d0a2fa9af52561c6205c6d1ef9a`)
 
 Stage 3 chaos triad 收尾交叉引用:
 
diff --git a/docs/cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md b/docs/cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md
index 7e1d8b0..edc311a 100644
--- a/docs/cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md
+++ b/docs/cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md
@@ -233,6 +233,8 @@ trace 字段建议:`logRestoreSyncStatus / logRestoreErrCode / decisionReason=
 
 ### freshness gate 主路径正例
 
+**Abstract**: 主路径 N=1 验证:v2 freshness gate fix 在 happy path(standby 正常拉到 seed + 复制后被杀恢复)下正确把 stale ro detect 出来,TDD 设计预期 FAIL(`stale-secondary-detected: 4 standby check(s) failed`)命中 21/1/0。
+
 archive sha256: `da5ef49b8b831e2665cab0be7b8853227fb5a25e54bf3a4ab5ce17751fcec6bc`
 root: `work/ob-test-supplement/20260504-1445-c03-stale-ro-gate-v2-n1`
 chaos script: `tests/chaos-kill-primary-standby-quorum-loss.sh` sha `ce1204ab4ee889277b51d580003bd64915f8badc09b605967094191bb166cd3b`
@@ -277,6 +279,8 @@ PASS 21 / FAIL 1 / SKIP 0,唯一 FAIL 是 stale-secondary-detected,跟 TDD
 
 ### freshness gate ground_truth_unavailable 路径正例
 
+**Abstract**: ground_truth_unavailable 短路保守路径 N=1 验证:当 chaos 来得太早 standby 还没拉到 seed、survivor 接管成 new primary 后没有表,主路径 ground truth 拿不到 → freshness gate 降级走保守 fail-stop 不漏 stale。验证设计意图:"拿不到就保守 fail,不 silent skip"。
+
 archive sha256: `449b7d3491fc1c82d6787ec0178f0e1de884b5426a573ef8e4a12f46e486dd2f`
 root: `work/ob-test-supplement/20260504-1432-c03-stale-ro-gate-v1-n1`