From 7ad984b215b652ca94da0a72ed1376657d6b78a8 Mon Sep 17 00:00:00 2001
From: Ava
Date: Tue, 5 May 2026 13:43:17 +0800
Subject: [PATCH 1/5] docs(runtime-contract-preflight): add v1 with 3-layer model and decision tree

Add addon-runtime-contract-preflight-guide.md (652 lines) covering chart
spec vs runtime contract drift across 3 layers:

- Layer 1: chart spec field rejected by KB CRD schema (install-time fail)
- Layer 2: vcluster substrate bootstrap precondition (CoreDNS image)
- Layer 5: runtime env contract drift (ActionSet spec.env declare gap)

Layer 0 / 3 / 4 reserved for future sediment. Non-contiguous numbering
explained at section 3 opening.

Layer 5 main 6-section authored by Oracle line addon owner (W7 ActionSet
env audit, ORA-12154 grounded, idc4 19c standalone evidence with double
verification). Polished and folded by SQL Server line.

Layer 2 main 6-section authored by Oracle line addon owner (CoreDNS
ImagePullBackOff grounded, registry.aliyuncs.com fix verified). idc
bastion view inline placeholder for MariaDB line co-author (mirror family
classification, sideload vs image swap tradeoff, blacklist wisdom,
vcluster substrate bootstrap timing pitfalls).

Layer 1 main 6-section drafted with placeholder for SQL Server line PoC
evidence (KB 1.0.2 isExclusive field-not-declared-in-schema), to be
filled after T08 reproduction completes.

Decision tree (mermaid) at section 8 separates Layer 1 / 2 / 5 by 3
evidence questions: install-time fail / smoke-PASS-but-cross-pod-fail /
ActionSet env declare diff. Same-surface-error different-root-cause
covered as twin-fault scenario in Layer 2 + Layer 5 stack.

Cross-doc family map at section 9 places this doc as the runtime contract
dimension within the preflight family (5 docs, scope strictly
non-overlapping). Cross-refs to chart-vs-kb-schema-skew-diagnosis,
soak-test-result-classification, test-acceptance-and-first-blocker,
probe-classification, and bounded-eventual-convergence guides.
Case appendix A-D covers Oracle W7 grounded, idc4 CoreDNS fix grounded,
SQL Server isExclusive (pending PoC fold-in), MariaDB line mirror family
cross-line evidence (PR #54 reference).
---
 .../addon-runtime-contract-preflight-guide.md | 652 ++++++++++++++++++
 1 file changed, 652 insertions(+)
 create mode 100644 docs/addon-runtime-contract-preflight-guide.md

diff --git a/docs/addon-runtime-contract-preflight-guide.md b/docs/addon-runtime-contract-preflight-guide.md
new file mode 100644
index 0000000..897dc05
--- /dev/null
+++ b/docs/addon-runtime-contract-preflight-guide.md
@@ -0,0 +1,652 @@
+# Addon Runtime Contract Preflight Guide
+
+> **Audience**: addon dev / addon test / cross-line TL
+> **Status**: stable
+> **Applies to**: any KB addon
+> **Applies to KB version**: any (the preflight methodology itself is version-independent; the concrete set of ActionSet env fields evolves with the KB version)
+> **Affected by version skew**: yes: the alignment between the chart spec and the actual runtime contract drifts along three axes (KB version / addon version / vcluster substrate version); the methodology here is stable, but the concrete declared fields and image paths must be checked against the versions you actually tested
+
+This guide is for addon developers and test engineers. It targets a class of failure that is very hard to spot: **chart install passes every schema validation and every smoke test PASSes, yet a certain kind of operation fails at runtime with a cryptic error**. The underlying cause is **drift between the chart spec and the actual runtime contract**: a runtime dependency that the spec never declares either fails silently or is masked by a cryptic error once the workload actually runs.
+
+## This document in plain language
+
+### What problem this document solves
+
+When a test reports "OpsRequest failed" or "Backup Job reports ORA-12154 / DNS resolve fail", the team's first reaction is usually "something is wrong with the DB / network". That attribution is often misplaced: **the same surface error can come from completely independent root-cause layers**:
+
+1. **The chart does not declare a runtime env** (the KB dataprotection ActionSet `spec.env` is missing a field, so the runtime reference resolves to an empty string)
+2. **The substrate did not bootstrap properly** (the default vcluster coredns image cannot be pulled, so DNS inside pods does not work while the control plane reports Running)
+3. **A genuine DB / network layer problem** (the minority of cases)
+
+Calling it a "DB / network bug" across the board sends the team **troubleshooting at the wrong layer**: an hour spent tuning the listener, adjusting TLS, or restarting pods, only to find that the ActionSet spec was missing one env declaration.
+
+→ The actual methodology: **before testing starts, explicitly audit the 3 alignment surfaces between the chart spec and the runtime contract (chart spec / vcluster substrate / runtime env contract), so that cryptic errors surface early as readable preflight fail-fast results**.
+
+### When this methodology applies
+
+| Scenario | Key decision |
+|---|---|
+| Writing a Backup / Restore ActionSet for an addon that integrates with KB dataprotection | Must-read §6.2: audit the set difference between what `spec.env` declares and what the backup script references |
+| Running smoke / chaos / dataprotection on vcluster | Must-read §6.3: preflight the CoreDNS image and cross-pod DNS resolution |
+| chart `install` reports `field not declared in schema` | Must-read §6.1: distinguish the three kinds of drift (chart field / KB version field / KB main) |
+| smoke all PASS but dataprotection / cross-pod operations fail | Walk the §8 decision tree to separate Layer 1 / 2 / 5 |
+| Before moving onto an IDC vcluster | Run the 4-step preflight once (§4) |
+| The same surface error keeps recurring but the root cause keeps hopping | §7 archetype "chart spec doesn't declare a runtime requirement" |
+
+### Decisions you can make after reading
+
+- **When writing a new addon**: list the 3 runtime contract alignment surfaces in 5 seconds and avoid silent drift
+- **When integrating dataprotection**: immediately audit the set difference between ActionSet spec.env and the envs the backup script references
+- **When bootstrapping vcluster**: preflight the CoreDNS image and pick the right mirror (aliyuncs / dockerproxy.net / ACR / sideload)
+- **When you see a cryptic error such as `ORA-12154 TNS:could not resolve`**: use the §8 decision tree to assign it to the correct layer within 60 seconds instead of wrongly blaming the DB
+- **When reviewing a chart PR**: recognize a contract gap where "the chart spec looks OK but runtime will silently fail" and block it
+
+### Why this is a standalone document
+
+Together with [`addon-kb-schema-version-preflight-guide.md`](addon-kb-schema-version-preflight-guide.md) (KB schema three-layer image/chart/CRD version alignment), [`addon-test-script-preflight-guide.md`](addon-test-script-preflight-guide.md) (test runner cross-line shared client state), [`addon-vcluster-kb-install-preflight-guide.md`](addon-vcluster-kb-install-preflight-guide.md) (KB install bootstrap inside vcluster), and [`addon-multi-ns-registry-scan-preflight-guide.md`](addon-multi-ns-registry-scan-preflight-guide.md) (splitting multi-ns test scope), this document forms the **preflight family**, which covers the different alignment surfaces that must be checked before the chaos test lifecycle starts.
+
+Each doc's scope is strictly independent:
+- `kb-schema-version-preflight` = **schema dimension** (artifact versions consistent across the image / chart / CRD layers)
+- This document = **runtime contract dimension** (chart spec aligned with the actual runtime contract)
+- `test-script-preflight` = **shared client state dimension** (cross-line tenant kubeconfig)
+- `vcluster-kb-install-preflight` = **bootstrap harness dimension** (timing of KB install in vcluster)
+- `multi-ns-registry-scan-preflight` = **test scope dimension** (verified vs scan-only split)
+
+This document focuses on the standalone topic of **"the 3 alignment surfaces between the chart spec and the actual runtime contract"** and does not overlap with schema-version.
+
+---
+
+## Applicable scenarios
+
+This document applies when you own or maintain any of the following:
+
+- Writing an ActionSet for a new addon (Backup / Restore / Reconfigure / Switchover) that needs to declare `spec.env`
+- Moving an addon onto IDC shared k8s + per-line vcluster
+- Locating the root cause when chart `install` reports `field not declared in schema`
+- Running dataprotection tests / cross-pod replication tests on vcluster
+- A full preflight audit of the chaos test lifecycle
+
+Out of scope (covered by other docs):
+- Artifact version alignment across the KB image / chart / CRD layers → [`addon-kb-schema-version-preflight-guide.md`](addon-kb-schema-version-preflight-guide.md)
+- Cross-line interference through the runner's shared `~/.kube/config` → [`addon-test-script-preflight-guide.md`](addon-test-script-preflight-guide.md)
+- Timing of helm install KB in vcluster → [`addon-vcluster-kb-install-preflight-guide.md`](addon-vcluster-kb-install-preflight-guide.md)
+- 4-state classification after a test fails → [`addon-soak-test-result-classification-guide.md`](addon-soak-test-result-classification-guide.md)
+- chart `field not declared in schema`: local bug vs generation gap vs tracking KB main → [`addon-chart-vs-kb-schema-skew-diagnosis-guide.md`](addon-chart-vs-kb-schema-skew-diagnosis-guide.md) (§6.1 of this document references that doc rather than repeating it)
+
+## Runtime Contract three-Layer model
+
+The schema dimension in `addon-kb-schema-version-preflight-guide.md` uses the image / chart / CRD layers to define "which version". The runtime contract dimension in this document uses a **three-Layer model** to define "the alignment surfaces between the chart spec and the actual runtime contract".
+
+> **Numbering note**: Layer numbers use **non-contiguous numbering (1 / 2 / 5)**; **Layer 0 / 3 / 4 are reserved for future use**. The image / chart / CRD layers of the schema dimension map to Layer 0 in this document (the image is the underlying artifact and is owned by the schema-version doc), and Layer 3 / 4 are reserved for future sediment topics such as "storage version migration / API removal". The grounded samples from real-world practice so far are concentrated in Layer 1 / 2 / 5.
+
+| Layer | Dimension | Drift symptom | Owning doc |
+|---|---|---|---|
+| Layer 0 | Image artifact (reserved) | inconsistent image builds, digest drift | [`addon-kb-schema-version-preflight-guide.md`](addon-kb-schema-version-preflight-guide.md) |
+| **Layer 1** | **Chart spec fields** | a chart template field is rejected by the KB CRD schema (fails at install time) | This document §6.1 + [`addon-chart-vs-kb-schema-skew-diagnosis-guide.md`](addon-chart-vs-kb-schema-skew-diagnosis-guide.md) |
+| **Layer 2** | **vcluster substrate bootstrap precondition** | images for the default vcluster coredns / metrics-server / other key base components cannot be pulled; the control plane reports Running but cluster-internal DNS does not work | This document §6.3 |
+| Layer 3 | Storage version migration (reserved) | a CRD storage version upgrade is interrupted and objects remain at the old version | (planned, future) |
+| Layer 4 | Removed API surface (reserved) | removed APIs are still referenced by scripts | (planned, future) |
+| **Layer 5** | **Runtime env contract** | the chart spec does not declare the addon's own env, the runtime reference resolves to an empty string, and a cryptic error appears downstream (not an install fail) | This document §6.2 |
+
+**Telling Layer 1 from Layer 5**:
+- Layer 1 = **install-time schema reject** (helm install fails outright, message: `field not declared in schema`)
+- Layer 5 = **runtime silent-empty** (install / smoke all PASS, cryptic error during dataprotection)
+
+**Telling Layer 2 from Layer 5**:
+- Layer 2 = **substrate side** (vcluster's own coredns / base components did not come up; the fix belongs in cluster bootstrap)
+- Layer 5 = **chart spec side** (the chart template is missing a field; the fix belongs in the chart template)
+- Same surface error (e.g. ORA-12154 TNS:could not resolve), different root causes; use the §8 decision tree to tell them apart
+
+## 4-step Preflight procedure
+
+Before integrating dataprotection or moving onto vcluster, go through these 4 steps:
+
+### Step 1 - Layer 0/1: schema three-layer version alignment
+
+Run the three-layer audit (image / chart / live CRD) from [`addon-kb-schema-version-preflight-guide.md`](addon-kb-schema-version-preflight-guide.md). Only after it passes, move on to the Layer 1 chart spec field audit:
+
+```bash
+# Is every chart template field accepted by the current KB CRD schema?
+helm template | kubectl apply --dry-run=server -f - 2>&1 | grep -E '(field not declared|unknown field|forbidden)' && echo "Layer 1 fail" || echo "Layer 1 pass"
+```
+
+### Step 2 - Layer 2: vcluster substrate bootstrap
+
+A Running vcluster control plane does not mean the infrastructure inside the cluster is Ready. Always check:
+
+```bash
+# Is CoreDNS actually running (not merely that the deployment exists)?
+kubectl -n kube-system get pods -l k8s-app=kube-dns | grep -v Running | grep -v NAME && echo "Layer 2 fail" || echo "Layer 2 pass"
+
+# Can an arbitrary workload pod resolve cluster-internal DNS? (a dead CoreDNS often surfaces as a timeout, not NXDOMAIN)
+kubectl exec -- nslookup kubernetes.default.svc.cluster.local 2>&1 | grep -E '(NXDOMAIN|server can.t find|timed out|no servers could be reached)' && echo "Layer 2 fail" || echo "Layer 2 pass"
+```
+
+### Step 3 - Layer 5: ActionSet env contract audit
+
+For each ActionSet (Backup / Restore / etc.), audit the set difference between what `spec.env` declares and what the backup script references:
+
+```bash
+# List the env names the ActionSet declares
+kubectl get actionset -o jsonpath='{.spec.backup.backupData.env[*].name}' | tr ' ' '\n' | sort > /tmp/declared.txt
+
+# Grep the envs the backup script references
+grep -oE '\$\{[A-Z_]+\}' .sh | sed 's/[${}]//g' | sort -u > /tmp/referenced.txt
+
+# The set difference is the Layer 5 risk surface
+comm -23 /tmp/referenced.txt /tmp/declared.txt
+# Non-empty output = referenced by the script but not declared by the chart = Layer 5 silent-empty risk
+```
+
+Note: the DP_* framework variables (`DP_DB_HOST` / `DP_DB_PORT` / `DP_DB_USER` / `DP_DB_PASSWORD`) are injected automatically by the KB dataprotection runner and do not need to be declared in the ActionSet, so they can be ignored when they show up in the difference above. Envs that the addon itself needs, such as `ORACLE_SID` / `MYSQL_PORT` / `PGDATABASE`, must be declared explicitly.
+
+### Step 4 - Run one real dataprotection / cross-pod operation round-trip
+
+Once the first 3 steps all pass, one end-to-end verification is mandatory: start a Backup (minimal dataset), then check the env list seen by the Backup script inside the pod and whether the connection succeeds. An install with no real operation behind it does not count as passing preflight.
+
+```bash
+# Create a minimal Backup
+kubectl create -f - <--backup-policy
+  backupMethod: 
+YAML
+
+# Wait until it completes / fails
+kubectl get backup -w | grep preflight-backup
+```
+
+Only when all 4 steps pass is the addon in a ready-for-test state. If any step fails, do not proceed to chaos / soak / production handoff.
+
+## Cross-engine alignment (4-pillar table)
+
+Whenever an addon integrates dataprotection, fill in the 4 pillars against the table below to make sure the runtime contract is complete:
+
+| Pillar | What to verify | Layer | Command template |
+|---|---|---|---|
+| **Spec declare** | Whether ActionSet `spec.env` declares the envs the engine itself needs | Layer 5 | `kubectl get actionset -o jsonpath='{.spec.backup.backupData.env[*].name}'` |
+| **Script reference** | Whether every env the backup script references is in the Spec declare set | Layer 5 | `grep -oE '\$\{[A-Z_]+\}'