From 84c1ec629d62b43fce0a0bf0071357270f8e5244 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Mon, 4 May 2026 00:30:35 +0800
Subject: [PATCH] docs: retrofit two methodology guides with real PR #10191 evidence and command-line examples

Allen's quality dashboard v2 flagged
addon-controller-crash-resilience-guide.md and
addon-ship-readiness-multi-phase-validation-guide.md as prose-heavy /
0 evidence / 0 code block. Both guides predated the RebuildInstance
race fix work in apecloud/kubeblocks PR #10191, which is now sealed and
gives concrete cross-engine evidence: controller restart inside a
precise cleanup-timing window did exercise the design assumption, and a
five-layer validation matrix beyond the baseline three-segment ship
matrix did surface four contract-level race gaps.

Two new case appendices, both engine-specific (Valkey / PR #10191)
following the project convention that engine details live in case
appendices and the body remains methodology:

- addon-controller-crash-resilience-guide.md: a Valkey RebuildInstance
  cleanup-timing-window appendix with the exact polling loop that
  triggers the chaos restart on InstanceSet annotation transition +
  pod Ready=False (not on a controller log scrape), the before/after
  annotation-snapshot capture pattern, and the four-property terminal
  verification list.
- addon-ship-readiness-multi-phase-validation-guide.md: a five-layer
  matrix appendix (baseline 10 + extension 4 + same-cluster dense
  initial 35 + chaos gate 1 + same-cluster dense follow-up 35) showing
  how the last two layers turned a clean 49-sample baseline into a
  4-contract-gap race finder. Includes the trial-loop wrapper, the
  forensics grep predicate, and the PVC / PV invariant queries.

Both appendices add direct cross-refs to
addon-pvc-rebind-via-workload-intent-guide.md and
addon-design-contract-review-during-xp-guide.md so the four guides form
a connected sub-graph on the RebuildInstance race fix.

No changes to the guide bodies; appendices are additive only.
---
 ...addon-controller-crash-resilience-guide.md | 68 +++++++++++++++++
 ...-readiness-multi-phase-validation-guide.md | 75 +++++++++++++++++++
 2 files changed, 143 insertions(+)

diff --git a/docs/addon-controller-crash-resilience-guide.md b/docs/addon-controller-crash-resilience-guide.md
index ca4b7bf..7ce44e5 100644
--- a/docs/addon-controller-crash-resilience-guide.md
+++ b/docs/addon-controller-crash-resilience-guide.md
@@ -165,8 +165,76 @@ Valkey ship-readiness G4 verifies the KubeBlocks controller during a Valkey OpsRequest
 - Verification script: `tests/g4-controller-crash.sh`
 - Verification environment: k3d `kb-local`, KubeBlocks `1.2`, Valkey addon rev 57
 
+## Case appendix: Valkey RebuildInstance, controller restart in the cleanup-timing window
+
+The e2e validation for apecloud/kubeblocks PR #10191 adds one specific controller-restart scenario: after the OpsRequest has cleared the rebuild intent annotation it wrote on the InstanceSet, but before the new target pod is Ready, the KB controller is deliberately restarted. The check is that the OpsRequest does not revert the already-converged PVC binding just because its in-memory "in-flight state" was lost. This appendix gives the exact trigger commands and pass conditions.
+
+### Setup
+
+- KB controller image: `apecloud/kubeblocks:rebuild-fix-20260503-2312` (built from the 18-commit PR #10191 branch).
+- Valkey cluster: 1 primary + 2 replicas, reusing the baseline data and targeted slave disruption from `kubeblocks-tests/valkey/tests/rebuild.sh`.
+- Trial numbering: trial 10 is reserved for injecting the chaos restart; trials 1 through 9 keep the baseline shape.
+
+### Polling loop that triggers the chaos restart
+
+The trigger does not depend on controller log keywords. It fires on the live InstanceSet annotation transition `present → absent` AND the target pod being `Ready=False`:
+
+```bash
+INTENT_KEY="operations.kubeblocks.io/rebuild-instance-pvc-overrides"  # for reference; the jsonpath below hardcodes it with escaped dots
+RESTART_TRIGGERED=0
+for i in $(seq 1 60); do  # poll once per second, for at most 60 iterations
+  intent=$(kubectl -n "$NS" get instanceset "$ITS" \
+    -o jsonpath="{.metadata.annotations['operations\.kubeblocks\.io/rebuild-instance-pvc-overrides']}")
+  ready=$(kubectl -n "$NS" get pod "$REBUILT_POD" \
+    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "")
+  if [ -z "$intent" ] && [ "$ready" != "True" ]; then  # intent absent and pod not Ready
+    kubectl -n kb-system rollout restart deploy/kubeblocks
+    RESTART_TRIGGERED=1
+    break
+  fi
+  sleep 1
+done
+```
+
+Once `RESTART_TRIGGERED=1`, capture an InstanceSet (IS) annotation snapshot as before-restart evidence, then take an after-restart snapshot every 10s until the pod is Ready:
+
+```bash
+kubectl -n "$NS" get instanceset "$ITS" -o yaml > "$ART/its-before-restart.yaml"
+for i in 1 2 3 4 5 6 7 8 9; do
+  sleep 10
+  kubectl -n "$NS" get instanceset "$ITS" -o yaml > "$ART/its-after-restart-${i}.yaml"
+  if kubectl -n "$NS" wait --for=condition=Ready pod/"$REBUILT_POD" --timeout=5s 2>/dev/null; then
+    break
+  fi
+done
+```
+
+### Four terminal-state checks
+
+- The OpsRequest ends in `Succeed` (not stuck in Running because of the controller restart).
+- The target source PVC's `spec.volumeName` equals the helper PV name (the chaos restart did not roll the binding back).
+- The helper PV reclaim policy equals the inferred original value (it was not rewritten repeatedly).
+- `operations.kubeblocks.io/rebuild-instance-pvc-overrides` does not reappear in the IS annotations for the rest of the cycle (the chaos restart did not resurrect the intent).
+
+### Observed results
+
+- Trial 10 passed in a single run with `restartTriggered=1`. The monitor issued the restart only once `intent_present=yes → no` and `pod_ready=False` held together. After the restart the pod became Ready again within two reconcile rounds. Trials 1 through 9 accumulated 0 race-related errors, and trial 10 kept the same 0.
+- Phase 6 N=10 acceptance plus the same-cluster follow-up rounds (N=5 / N=10 / N=20) total 84+ valid samples, with 0 occurrences of `PersistentVolume "" not found` and 0 source PVCs ending up bound to a non-helper PV.
+
+### Conclusions
+
+- A valid controller crash resilience test should not use controller log keywords to time the restart: logs are an observation signal, not the design contract itself. Triggering on an observable state field of the CR (here, the InstanceSet annotation transition) yields evidence that cannot be faked.
+- Reading "the controller did not crash during the test" as "the OpsRequest desired state on the CR is complete" is necessary but not sufficient. Only an explicitly injected chaos restart can answer "can it continue after a crash?".
+
+### Test artifacts
+
+- Trigger script: the O02-chaos-gate section of `kubeblocks-tests/valkey/tests/rebuild-ops.sh` (commit `18817ec`).
+- Full evidence bundle: `artifacts/valkey-phase6-rebuild-fix-20260503-2023/trial-10-controller-restart/`.
+- Related fix PR: apecloud/kubeblocks#10191 (commit chain `10dfc40af` → `b99890fbc` closes 4 classes of contract gap).
+
 ## Related topics
 
 - [`docs/addon-bounded-eventual-convergence-guide.md`](addon-bounded-eventual-convergence-guide.md)
 - [`docs/addon-ops-restart-troubleshooting-guide.md`](addon-ops-restart-troubleshooting-guide.md)
 - [`docs/addon-test-acceptance-and-first-blocker-guide.md`](addon-test-acceptance-and-first-blocker-guide.md)
+- [`docs/addon-pvc-rebind-via-workload-intent-guide.md`](addon-pvc-rebind-via-workload-intent-guide.md): the contract system closed by PR #10191 and the design prerequisite for this appendix.
diff --git a/docs/addon-ship-readiness-multi-phase-validation-guide.md b/docs/addon-ship-readiness-multi-phase-validation-guide.md
index 2cc97e8..458b23b 100644
--- a/docs/addon-ship-readiness-multi-phase-validation-guide.md
+++ b/docs/addon-ship-readiness-multi-phase-validation-guide.md
@@ -230,9 +230,84 @@ R07 (the narrow window where master kill overlaps hot backup during scale-in) in Phase
 - Phase 3 final root: `/Users/wei/.slock/agents/.../valkey-phase3-regression-x10-091219/final/review-summary.md`
 - Verification environment: k3d `kb-local`, KubeBlocks 1.2, Valkey addon rev 57
 
+## Case appendix: Valkey RebuildInstance race fix, five-layer validation matrix (PR #10191)
+
+The e2e validation for apecloud/kubeblocks PR #10191 used a five-layer matrix, finer-grained than the three-segment "baseline / chaos / regression" split. Its specific value is that, beyond Phase 6 acceptance, it adds two "extended strength" layers designed to force out concurrency gaps at the contract level. Four follow-up commits eventually closed 4 classes of race that never surfaced in the baseline N=10. This appendix gives the concrete numbers and commands.
+
+### Five-layer structure and counts
+
+| Layer | Content | Samples | Product fails |
+|---|---|---|---|
+| Standalone acceptance | Fresh cluster + 30 baseline keys + slave disruption + RebuildInstance + data verify + topology verify | 10/10 PASS | 0 |
+| Standalone extension | 4 more trials of the same acceptance shape to widen the flake-detection sample | 4/4 PASS | 0 |
+| Same-cluster dense (initial) | Three groups (5 + 10 + 20) of rotating-slave RebuildInstance ops, no recreate between ops | 35/35 PASS | 0 |
+| Controller-restart chaos gate | `kubectl rollout restart` of the kb-controller at the moment intent annotation `present → absent` AND pod `Ready=False` both hold | 1/1 PASS | 0 |
+| Same-cluster dense (follow-up after 4 fixes) | Rerun of the three rotating-slave RebuildInstance groups + concurrent OpsRequest O02 end to end | 35/35 PASS (final N=20 tally `PASS 240 / FAIL 0 / SKIP 0`) | 0 |
+
+**The five layers total 84+ valid samples, with 0 occurrences of `PersistentVolume "" not found`, 0 source PVCs ending up bound to a non-helper PV, 0 cleanup conflicts, and 0 stale intent residue.**
+
+### The 4 contract gaps surfaced by extended strength
+
+The 49-sample baseline (the first three layers) was clean. Adding the layer-4 chaos gate and the layer-5 dense follow-up forced out 4 classes of previously invisible contract defect, each of which landed its own fix commit:
+
+1. Concurrent RebuildInstance on the same instance/claim never reaches a terminal state: fix `10dfc40af` (`ErrRebuildIntentOwnedByDifferentOps` sentinel + handler wrapping with `intctrlutil.NewFatalError`).
+2. A failed OpsRequest leaves intent residue that blocks later rebuilds: fix `2e129834a` (lenient `CleanupRebuildIntentByOpsUID` + `cleanupRebuildIntentOnFailure` hook).
+3. `getRestoredPV` fails fatally while the tmp PVC is not yet bound, and cleanup does not retry when it hits an InstanceSet hot-annotation conflict: fix `514214ac9` (`NewErrorf(ErrorTypeNeedWaiting,...)` + `retry.RetryOnConflict`).
+4. helperPV label discovery fails silently after tmp PVC cleanup: fix `b99890fbc` (`getOrDiscoverHelperPV` treats the intent annotation's pvName as the source of truth).
+
+With the 49-sample baseline already clean, each extra notch of strength pushed in layers 4 and 5 (chaos / concurrency / same-cluster high density) caught one more class of contract gap that had not surfaced. This is a direct counterexample to this guide's "two-stage judgment" for ship decisions: satisfying "product fail = 0" is not enough. **"Extended strength" must be treated as an independent normative dimension**; otherwise a baseline-only ship amounts to leaving the races for production.
+
+### Trial-loop wrapper (reproducible)
+
+```bash
+NS=valkey-rebuild-race-r1
+ART=artifacts/rebuild-race-fix-r1
+mkdir -p "$ART"
+for i in 1 2 3 4 5 6 7 8 9 10; do
+  echo "=== Trial $i ==="
+  ./tests/rebuild.sh "$NS" "$CLUSTER-$i" 2>&1 | tee "$ART/trial-$i.log"  # CLUSTER must be set beforehand
+  rc=${PIPESTATUS[0]}  # exit code of rebuild.sh, not of tee
+  if [ "$rc" -ne 0 ]; then
+    echo "FAIL on trial $i, stopping for forensics, namespace preserved"
+    break
+  fi
+done
+```
+
+Trial 10 embeds the controller-restart chaos gate (triggered on the IS annotation transition + pod Ready=False; see the case appendix in `addon-controller-crash-resilience-guide.md` for the commands).
+
+### Forensics grep (hard acceptance condition)
+
+```bash
+grep -E 'PersistentVolume "" not found|owned by ops|empty HelperPVName' \
+  "$ART"/trial-*.log "$ART"/controller.log 2>/dev/null | wc -l
+# expected: 0
+```
+
+```bash
+kubectl -n "$NS" get pvc -o yaml | yq '.items[] | {name: .metadata.name, vol: .spec.volumeName}'
+# every source PVC's volumeName should point to a helper PV (labeled rebuild-from / rebuild-tmp-pvc), not a default-provisioned name
+```
+
+```bash
+kubectl get pv -o yaml | yq '.items[]
+  | select(.metadata.labels."operations.kubeblocks.io/rebuild-tmp-pvc")
+  | {name: .metadata.name, claimRef: .spec.claimRef, reclaim: .spec.persistentVolumeReclaimPolicy}'
+# helper PV reclaim policy must equal the inferred original value (this lane = local-path, default Delete)
+```
+
+### Test artifacts
+
+- Full e2e harness: `apecloud/kubeblocks-tests/valkey/tests/rebuild-ops.sh` (commit `18817ec`), including the O01-O04 strength lanes.
+- Full evidence bundle: `artifacts/valkey-phase6-rebuild-fix-20260503-2023/` (baseline + chaos gate) + `artifacts/valkey-rebuild-ops-20260503-2312-n20/` (follow-up 240/0/0 final tally).
+- Related fix PR: apecloud/kubeblocks#10191 (closes the 12 classes of contract gap in issue #10190; 4 of them were forced out by the last two layers of this matrix).
+- Related self-audit: `apecloud/kubeblocks-tests/valkey/COVERAGE.md` (same commit `18817ec`), listing the P0/P1/P2 extension backlog.
+
 ## Related topics
 
 - [`docs/addon-test-acceptance-and-first-blocker-guide.md`](addon-test-acceptance-and-first-blocker-guide.md)
 - [`docs/addon-test-probe-classification-guide.md`](addon-test-probe-classification-guide.md)
 - [`docs/addon-controller-crash-resilience-guide.md`](addon-controller-crash-resilience-guide.md)
+- [`docs/addon-pvc-rebind-via-workload-intent-guide.md`](addon-pvc-rebind-via-workload-intent-guide.md): the contract system itself that the layer 4+5 chaos / dense-follow-up runs exercise.
+- [`docs/addon-design-contract-review-during-xp-guide.md`](addon-design-contract-review-during-xp-guide.md): the other 8 classes of contract gap caught during XP review; together with this appendix's 4 they make up the 12-class total for the same fix cycle.
 - [`docs/addon-evidence-discipline-guide.md`](addon-evidence-discipline-guide.md)
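
Note on the four terminal-state checks in the controller-crash-resilience appendix (OpsRequest phase, source PVC binding, helper PV reclaim policy, intent annotation): the patch lists them as prose only. A minimal sketch of how they might be scripted follows. The function `check_terminal_state` and the variables `OPS`, `SOURCE_PVC`, `HELPER_PV`, and `EXPECTED_RECLAIM` are hypothetical names chosen for illustration and are not taken from `rebuild-ops.sh`. The function only compares values that were already read from the cluster, so it can be exercised without one.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of rebuild-ops.sh): compares the four
# terminal-state values and returns the number of failed checks.
# Args: ops_phase pvc_volume helper_pv reclaim expected_reclaim intent
check_terminal_state() {
  local ops_phase="$1" pvc_volume="$2" helper_pv="$3"
  local reclaim="$4" expected_reclaim="$5" intent="$6"
  local failures=0

  # 1. OpsRequest must have reached Succeed, not be stuck in Running.
  if [ "$ops_phase" != "Succeed" ]; then
    echo "FAIL: OpsRequest phase is '$ops_phase', expected 'Succeed'"
    failures=$((failures + 1))
  fi
  # 2. Source PVC must still be bound to the helper PV.
  if [ -z "$helper_pv" ] || [ "$pvc_volume" != "$helper_pv" ]; then
    echo "FAIL: source PVC bound to '$pvc_volume', expected helper PV '$helper_pv'"
    failures=$((failures + 1))
  fi
  # 3. Helper PV reclaim policy must equal the expected original value.
  if [ "$reclaim" != "$expected_reclaim" ]; then
    echo "FAIL: reclaim policy is '$reclaim', expected '$expected_reclaim'"
    failures=$((failures + 1))
  fi
  # 4. The rebuild intent annotation must be absent (empty string).
  if [ -n "$intent" ]; then
    echo "FAIL: rebuild intent annotation is still present"
    failures=$((failures + 1))
  fi

  if [ "$failures" -eq 0 ]; then
    echo "PASS: all four terminal checks hold"
  fi
  return "$failures"
}

# Assumed usage against a live cluster (variable names are illustrative):
#   check_terminal_state \
#     "$(kubectl -n "$NS" get opsrequest "$OPS" -o jsonpath='{.status.phase}')" \
#     "$(kubectl -n "$NS" get pvc "$SOURCE_PVC" -o jsonpath='{.spec.volumeName}')" \
#     "$HELPER_PV" \
#     "$(kubectl get pv "$HELPER_PV" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}')" \
#     "$EXPECTED_RECLAIM" \
#     "$(kubectl -n "$NS" get instanceset "$ITS" \
#        -o jsonpath="{.metadata.annotations['operations\.kubeblocks\.io/rebuild-instance-pvc-overrides']}")"
```

Returning the failure count rather than a plain 0/1 lets a trial wrapper log how many of the four properties broke while still treating any non-zero exit as a failed trial.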