Merged
5 changes: 5 additions & 0 deletions docs/SKILL-INDEX.md
Original file line number Diff line number Diff line change
@@ -131,6 +131,11 @@
- [`docs/cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md`](cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md): reconfigure_deep Run 1→3 closed loop: Bug #12 (the liveness probe wrongly kills the pod while DBCA is running; initialDelay=600 + 90s restart window) + Bug #13 (the slow control plane after switchover flaps readiness); 3-layer fix (cmpd probe parameters + soft failure in liveness.sh + best-effort dgmgrl in checkDBStatus.sh); Run 3 is all PASS with RESTARTS=0 as evidence; an engineering field case supplementing [`addon-probe-timeout-and-soft-failure-guide.md`](addon-probe-timeout-and-soft-failure-guide.md)
- [`docs/cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md`](cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md): reconfigure_deep T22d FAIL: the `processes: int & >=6` cue constraint was too loose, so the value 10 passed ValidatePhase → ORA-603/1092 → instance terminated → KB OpsRequest stuck in Running for 25min+; the `>=100` fix was verified in Run 3 (ValidatePhase rejects within 10s); an engineering field case supplementing [`addon-paramdef-cue-range-validation-guide.md`](addon-paramdef-cue-range-validation-guide.md)

### OceanBase

- [`docs/cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md`](cases/oceanbase/oceanbase-repl-post-failover-stale-ro-case.md): case of an OceanBase repl topology (primary + 1 standby) where, after a simultaneous kill, the standby is still selected by the ro service and returns stale data (class C03, **P1 silent stale-RO**). Fix direction: the P2 test-side fix is implemented (freshness gate v2); the P1 addon-side fix is not yet implemented (add a `V$OB_LS_LOG_RESTORE_STATUS` health gate to roleProbe). An engineering field case supplementing [`addon-control-plane-election-guide.md`](addon-control-plane-election-guide.md) (roleProbe health criteria / role label / ro service endpoint selection); the Trap T section also serves as anti-pattern evidence for [`addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md) (the early assumption `ready_count < 2 ⇒ no_quorum` was refuted by evidence)
- [`docs/cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md`](cases/oceanbase/oceanbase-repl-peer-primary-acked-write-divergence-case.md): case of an OceanBase repl topology where a transient peer-primary accepts acked writes that are then lost by the final primary (class C09, **P0 acked-write divergence**). Guard v1 was verified in three progressive stages, N=1 / N=5 / N=10: all 570 cumulative acks were retained, and cycle04 actually hit 2 peer timeouts, verifying that timeout-isolation encapsulates them in `peerPrimaryErrors`. Two cross-addon contracts are extracted as design lessons (peer-aware PRIMARY publishing + peer probe error encapsulation); an engineering field case supplementing [`addon-control-plane-election-guide.md`](addon-control-plane-election-guide.md). The trace v2 reproduction section also serves as a model evidence-gap closure for [`addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md) (instrument → reproduce → close the raw decision gap, semantics unchanged)

### Methodology

- [`docs/cases/methodology/evidence-inflation-in-csi-durability-debate-case.md`](cases/methodology/evidence-inflation-in-csi-durability-debate-case.md): 3 inflations and 3 retractions in the 04-28 MariaDB #396 CSI durability discussion (motivation narrative / N=1→average / indirect circumstantial evidence→systematic falsification); an empirical supplement to [`addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md)
Original file line number Diff line number Diff line change
@@ -230,6 +230,8 @@ Guard v1 behavior (already implemented in syncer roleProbe; cf. the C03 case's described 「

### Guard v1 N=1

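**Abstract**: With the Guard v1 fix, the N=1 cycle hits the transient peer-primary mutual-exclusion check. Both pods emit the `primary_peer_conflict_rejected` trace entry, the rw endpoint becomes empty during the unsafe transition (routing is refused), and all acked writes land on the final primary, so `ack_missing=0`.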
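
The `ack_missing=0` criterion can be read as a set difference: every write the client saw acknowledged must still exist on the final primary. The following is a minimal sketch of that check, not the harness's actual code; the function name and the use of integer write IDs are assumptions made for illustration.

```python
def ack_missing(acked_ids, final_primary_ids):
    """Return the acked write IDs that are absent from the final primary.

    Hypothetical helper: the real harness may identify writes differently.
    """
    return sorted(set(acked_ids) - set(final_primary_ids))


if __name__ == "__main__":
    acked = [1, 2, 3, 4]
    # Row 5 reached the database but was never acked to the client.
    # Extra unacked rows do not count as divergence in this sketch.
    on_final_primary = [1, 2, 3, 4, 5]
    missing = ack_missing(acked, on_final_primary)
    print(f"ack_missing={len(missing)}")
```

Only acked-but-absent writes count here; a write that was never acknowledged may legitimately be lost during the transition.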

Root: `work/ob-test-supplement/20260503-2315-c09-peer-primary-guard-v1-n1`

Result:
@@ -253,6 +255,8 @@ Guard evidence:

### Guard v1 N=5

**Abstract**: All N=5 cycles PASS (5/5), all 288 cumulative client-acked writes are retained, `ack_missing=0`, and there is no outer kbagent `roleProbe timeout` event, so the main path is stable. However, N=5 hit no timeout / deadline peer error, so the timeout-isolation path was not exercised at this strength; the loop only closes at N=10.

Root: `work/ob-test-supplement/20260503-2325-c09-peer-primary-guard-v1-n5`
Archive sha256: `7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd`

@@ -269,6 +273,8 @@ Remaining after N=5:

### Guard timeout-isolation v1 N=10

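**Abstract**: All N=10 cycles return rc=0, all 570 cumulative client-acked writes are retained, `ack_missing=0`, and cycle04 actually hit 2 peer timeout / deadline errors. This verifies that the timeout-isolation design encapsulates them in the `peerPrimaryErrors` field without consuming the outer kbagent `roleProbe deadline`, so KB does not misjudge the sidecar as unhealthy. **This is the key timeout-isolation evidence that N=5 could not close and that was only hit after escalating to N=10 strength.**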

Root: `work/ob-test-supplement/20260504-0005-c09-peer-primary-guard-timeout-v1-n10`
Archive sha256: `9e54e7695b692915a4a3e5f0bb63abe9f4806d0a2fa9af52561c6205c6d1ef9a`

@@ -301,7 +307,9 @@ Harness caveat (a pitfall worth recording):
The two core contracts of Guard v1 + N=10 timeout isolation are abstracted below; they apply equally to the sidecar role probes of other addons such as PG / MySQL / MongoDB:

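1. **Only a peer-aware probe may publish the PRIMARY label**: a single pod self-checking `tenantRole=PRIMARY` is not enough; the probe must query its peer and refuse to report when the peer is also a semantic PRIMARY. A briefly empty rw endpoint is preferable to acked writes landing on the wrong chain. Suggested trace fields: `peerPrimaryCheck / peerPrimaryConflict / peerPrimaryErrors / decisionReason=primary_peer_conflict_rejected`.
   - **Anti-pattern**: treating "sidecar single-pod self-check `tenantRole=PRIMARY`" as a sufficient condition for publishing the PRIMARY label; acked-write divergence is then inevitable in the transition window.
2. **Peer probe errors must be encapsulated in the inner layer and must not consume the outer deadline**: timeout / deadline errors from the peer probe must stay in the `peerPrimaryErrors` field and must not propagate into an outer kbagent `roleProbe timeout` event; otherwise KB treats the sidecar itself as unhealthy, which triggers cascading state changes.
   - **Anti-pattern**: raising / propagating peer probe timeout / deadline errors directly; the outer kbagent then treats the whole sidecar as unhealthy and triggers extra state changes, so the sidecar itself amplifies the failure surface.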

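Cross-addon anti-pattern: "equating the sidecar's single-pod view with the whole-cluster view" is the same class of trap (and shares a root cause with roleProbe not checking `V$OB_LS_LOG_RESTORE_STATUS` in the C03 stale-RO case). Both belong to the control-plane election source-of-truth domain of [`addon-control-plane-election-guide.md`](../../addon-control-plane-election-guide.md).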
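
The two contracts above can be sketched together in a few lines. This is an illustrative model under stated assumptions, not the syncer roleProbe implementation: `decide_role`, `RoleDecision`, the peer-probe callables, and every decision reason other than `primary_peer_conflict_rejected` are hypothetical. The source also does not say what Guard v1 publishes when a peer probe errors without any conflict being observed; the sketch assumes PRIMARY is still published in that case.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RoleDecision:
    role: str  # label to publish; "" means publish no role (rw endpoint empties)
    decision_reason: str
    peer_primary_conflict: bool = False
    peer_primary_errors: List[str] = field(default_factory=list)


def decide_role(self_role: str, peers: Dict[str, Callable[[], str]]) -> RoleDecision:
    """Peer-aware role decision with inner-layer error encapsulation."""
    if self_role != "PRIMARY":
        return RoleDecision(role=self_role, decision_reason="not_primary")

    errors: List[str] = []
    conflict = False
    for name, probe in peers.items():
        try:
            # Contract 1: the self-check is not sufficient, ask each peer.
            if probe() == "PRIMARY":
                conflict = True
        except Exception as exc:
            # Contract 2: record the peer error instead of re-raising it,
            # so the outer probe never sees it as its own timeout.
            errors.append(f"{name}: {exc}")

    if conflict:
        return RoleDecision("", "primary_peer_conflict_rejected", True, errors)
    # Assumption: peer errors alone do not block publishing PRIMARY.
    return RoleDecision("PRIMARY", "primary_no_peer_conflict", False, errors)


if __name__ == "__main__":
    def peer_timeout() -> str:
        raise TimeoutError("deadline exceeded")

    print(decide_role("PRIMARY", {"pod-1": lambda: "PRIMARY"}))
    print(decide_role("PRIMARY", {"pod-1": peer_timeout}))
```

In a real probe each peer call would also need its own time budget, shorter than the outer probe deadline, so that a slow peer cannot exhaust the outer deadline before the error is recorded.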

@@ -337,22 +345,24 @@ dist status:

## Source evidence index

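> **Cross-repo path note**: all `work/ob-test-supplement/...` paths below live under the `oceanbase/` directory of the `apecloud/kubeblocks-tests` repository. Full path prefix = `kubeblocks-tests/oceanbase/work/ob-test-supplement/...`. The archive sha256 is the integrity check for each evidence pack; cross-repo readers can verify directly against the sha and do not depend on path stability.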
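
A reader holding a copy of an archive can check it against a listed digest without relying on any path. The following is a minimal sketch using only the Python standard library; the function names are illustrative and the archive location is whatever the reader has locally.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha256):
    """Return True when the file's sha256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

For example, `verify_archive(local_copy, "7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd")` would check a local copy of the N=5 archive, where `local_copy` is the reader's own path.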

Original C09 failure:

- root summary: `work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-root-evidence-summary.md` (Mia archived bundle)
- final-freeze archive: `work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-final-freeze-before-cleanup-20260503-222501.tar.gz` (archive sha256 `e3c0b833312ad1e0fd3f81b9e290b9263772b0573d574627227822a030dbc484`)
- root summary: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-root-evidence-summary.md` (Mia archived bundle)
- final-freeze archive: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2123-c09-writes-during-primary-kill-n1/c09-final-freeze-before-cleanup-20260503-222501.tar.gz` (archive sha256 `e3c0b833312ad1e0fd3f81b9e290b9263772b0573d574627227822a030dbc484`)

Trace v2 reproduction (closes the raw decision evidence gap):

- root: `work/ob-test-supplement/20260503-2235-c09-trace-v2-repro-n1/evidence/trace-v2-c09-acked-write-divergence-20260503-224553`
- root: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2235-c09-trace-v2-repro-n1/evidence/trace-v2-c09-acked-write-divergence-20260503-224553`
- summary: `SUMMARY.md`
- archive: `trace-v2-c09-acked-write-divergence-20260503-224553.tar.gz` (sha256 `8729ddacb228a3580e894af09b8ecde7f9c91f1d660eba24c799449e4b66549f`, summary sha256 `d1b232b2492195aa262fa4d6246f3c8a74bc085849a002424692524d9120efb2`)

Guard verification:
Guard verification (sha256 verification does not depend on path stability)

- N=1: `work/ob-test-supplement/20260503-2315-c09-peer-primary-guard-v1-n1/c09-guard-v1-summary.md`
- N=5: `work/ob-test-supplement/20260503-2325-c09-peer-primary-guard-v1-n5/c09-guard-v1-n5-summary.md` (archive sha256 `7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd`)
- N=10 timeout isolation: `work/ob-test-supplement/20260504-0005-c09-peer-primary-guard-timeout-v1-n10/c09-timeout-guard-v1-n10-summary.md` (archive sha256 `9e54e7695b692915a4a3e5f0bb63abe9f4806d0a2fa9af52561c6205c6d1ef9a`)
- N=1: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2315-c09-peer-primary-guard-v1-n1/c09-guard-v1-summary.md`
- N=5: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260503-2325-c09-peer-primary-guard-v1-n5/c09-guard-v1-n5-summary.md` (archive sha256 `7228e094cf893452387ac1e86bff91a5a5f8120fcc6e1617d00f7b937d57b3bd`)
- N=10 timeout isolation: `kubeblocks-tests/oceanbase/work/ob-test-supplement/20260504-0005-c09-peer-primary-guard-timeout-v1-n10/c09-timeout-guard-v1-n10-summary.md` (archive sha256 `9e54e7695b692915a4a3e5f0bb63abe9f4806d0a2fa9af52561c6205c6d1ef9a`)

Stage 3 chaos triad wrap-up cross-references:

Original file line number Diff line number Diff line change
@@ -233,6 +233,8 @@ Suggested trace fields: `logRestoreSyncStatus / logRestoreErrCode / decisionReason=

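### freshness gate main-path positive example

**Abstract**: Main-path N=1 verification: on the happy path (the standby pulls the seed normally, replicates, and is then killed and recovers), the v2 freshness gate fix correctly detects the stale ro. The FAIL expected by the TDD design (`stale-secondary-detected: 4 standby check(s) failed`) is hit, with results 21/1/0.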

archive sha256: `da5ef49b8b831e2665cab0be7b8853227fb5a25e54bf3a4ab5ce17751fcec6bc`
root: `work/ob-test-supplement/20260504-1445-c03-stale-ro-gate-v2-n1`
chaos script: `tests/chaos-kill-primary-standby-quorum-loss.sh` sha `ce1204ab4ee889277b51d580003bd64915f8badc09b605967094191bb166cd3b`
@@ -277,6 +279,8 @@ PASS 21 / FAIL 1 / SKIP 0; the only FAIL is stale-secondary-detected, with TDD

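### freshness gate ground_truth_unavailable path positive example

**Abstract**: N=1 verification of the conservative ground_truth_unavailable short-circuit path: when chaos arrives too early, the standby has not yet pulled the seed and the survivor has no table after taking over as the new primary, so the main-path ground truth cannot be obtained → the freshness gate degrades to a conservative fail-stop and does not miss stale reads. This verifies the design intent: "if the ground truth cannot be obtained, fail conservatively; do not silently skip".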

archive sha256: `449b7d3491fc1c82d6787ec0178f0e1de884b5426a573ef8e4a12f46e486dd2f`
root: `work/ob-test-supplement/20260504-1432-c03-stale-ro-gate-v1-n1`
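
The fail-stop intent of this path can be illustrated with a small decision function. It is a sketch under assumptions, not the chaos script's logic: freshness is modeled as a monotonically increasing integer marker (for instance the highest written row ID), `None` stands for a value that could not be read, and the `PASS` / `FAIL:` result strings are hypothetical apart from the two reason names taken from this case.

```python
from typing import Optional


def freshness_gate(ground_truth: Optional[int], standby_value: Optional[int]) -> str:
    """Compare what the standby serves against the primary's ground truth."""
    if ground_truth is None:
        # Conservative short circuit: without ground truth there is nothing
        # to compare against, so fail instead of silently skipping.
        return "FAIL:ground_truth_unavailable"
    if standby_value is None or standby_value < ground_truth:
        # The standby cannot show the latest marker, so it is serving stale data.
        return "FAIL:stale-secondary-detected"
    return "PASS"


if __name__ == "__main__":
    print(freshness_gate(None, 7))  # ground truth missing
    print(freshness_gate(10, 7))    # standby lags behind
    print(freshness_gate(10, 10))   # standby is fresh
```

The design choice is that an unknown state maps to a failure. A gate that returned a skip when the ground truth is missing would report the same result for "not checked" and "checked and fine".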
