docs: add runtime contract preflight guide (3-layer chart-vs-runtime drift) #72

Open
weicao wants to merge 5 commits into main from tom/runtime-contract-preflight-v1

weicao commented May 5, 2026

Summary

Add addon-runtime-contract-preflight-guide.md (652 lines), a methodology doc covering the 3 alignment surfaces between the chart spec and the actual runtime contract. When these surfaces drift, the result is failures that pass smoke tests but surface as cryptic runtime errors.

Why

addon-kb-schema-version-preflight-guide.md covers the schema dimension (version alignment across the image / chart / CRD artifact layers). But schema validation alone doesn't catch a class of failures that pass install + smoke yet fail at dataprotection / cross-pod runtime: the chart spec doesn't declare a runtime requirement (env, substrate bootstrap, etc.), so the runtime silently receives an empty value and surfaces a cryptic error like ORA-12154 TNS:could not resolve (misleadingly DNS-shaped, but actually rooted in chart-vs-runtime contract drift).

This guide formalizes the runtime contract preflight as a sibling doc within the preflight family (5 docs, scope strictly non-overlapping):

  1. addon-test-script-preflight-guide.md — shared client state dimension
  2. addon-kb-schema-version-preflight-guide.md — schema dimension
  3. This doc — runtime contract dimension
  4. addon-vcluster-kb-install-preflight-guide.md — bootstrap harness dimension
  5. addon-multi-ns-registry-scan-preflight-guide.md — test scope dimension

What's added

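  • §1 Plain-language overview / when to apply / decision / rationale for a standalone doc
  • §2 Applicable scenarios + table of not-applicable cases (nudges toward other docs)
  • §3 Runtime contract three-Layer model (Layer 1 chart spec / Layer 2 vcluster substrate / Layer 5 runtime env contract)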
    • Non-contiguous numbering (1 / 2 / 5) with Layer 0 / 3 / 4 reserved (image / storage version migration / removed API surface)
  • §4 4-step preflight flow (schema → substrate → env contract audit → round-trip verify)
  • §5 4-pillar cross-engine table (Spec declare / Script reference / Substrate ready / Round-trip verify)
  • §6 Three-Layer main body, 6-section schema:
    • §6.1 Layer 1 chart spec field reject (install-time fail) — references chart-vs-kb-schema-skew-diagnosis-guide for 3 drift forms; placeholder for SQL Server line PoC isExclusive evidence
    • §6.2 Layer 5 runtime env contract drift (Oracle line W7 ActionSet env audit grounded; double verification on idc4 19c standalone with 553MB / 2m32s; Half A chart structural fix + Half B script defensive fallback double-pattern)
    • §6.3 Layer 2 vcluster substrate bootstrap precondition (Oracle line CoreDNS ImagePullBackOff grounded with registry.aliyuncs.com/google_containers/coredns:1.10.1 mirror swap fix; MariaDB line idc bastion view inline placeholder for 4 fold-in topics — mirror family classification, sideload vs deployment image swap tradeoff, mirror blacklist wisdom, vcluster substrate bootstrap timing pitfalls)
  • §7 Archetype "chart-spec-doesnt-declare-runtime-requirement" with cross-Layer mapping + textbook twin-fault case (Layer 5 + Layer 2 both surface as ORA-12154 on idc4 19c, must be fixed in sequence)
  • §8 Decision tree (mermaid, Q1-Q6) for surface-error to root-cause-Layer attribution
  • §9 Cross-doc family map by chaos test lifecycle phase (preflight / runtime / post-run)
  • §10 Case appendix A-D (Oracle W7 grounded / idc4 CoreDNS grounded / SQL Server isExclusive pending / MariaDB mirror family cross-line)

Co-author plan

3-line co-author work split (per Allen msg=4284b1d7 placement decision: split from extension to standalone doc):

  • SQL Server line addon TL (lead author): cross-line abstraction / archetype / decision tree / family map / Layer 1 §6.1 (pending Jerry T08 isExclusive PoC fold-in)
  • Oracle line addon owner (co-author §6.2 + §6.3 main): W7 ActionSet env audit grounded content for Layer 5 + CoreDNS ImagePullBackOff grounded content for Layer 2; will deliver W8 inter-minor-version chart drift sub-bullet via follow-up commit; commit 345bfef9 on oracle/release-1.0-merge-audit for Half A+B fix; commit 1cb93055 for W8 cmpd-19c misalignment evidence (will land via W-chain consolidation PR)
  • MariaDB line idc bastion view (co-author §6.3 inline 1-2 paragraphs): mirror family classification doctrine + sideload vs image swap tradeoff + mirror blacklist wisdom (alpine vs glibc / kubeblocks-tools 1.0.2 vs 1.0.0 / xuriwuyun golang base) + vcluster substrate bootstrap timing pitfalls (CoreDNS upstream / image cache warm-up / Secret kubeconfig mount timing / NodePort SAN cert mismatch); references PR #54 (docs: add IDC image registry / mirror / sideload guide skeleton) §1 mirror family classification + §5 MariaDB row + 11:16-12:55 syncer cross-compile sideload deployment experience

Cross-engine validation

  • 19c standalone (idc4 cluster o19-i4-8854) — W7 fix verified, Backup o19-i4-8854-rman19c-w7verify2 Completed 553MB 2m32s
  • 12c standalone (idc4 cluster o12-t8-21609) — second-engine confirm: o12-rman-w7verify Completed 628MB 1m58s + o12-expdp-w7verify Completed 220KB 2m56s
  • Same kubectl patch actionset --type=json -p='/spec/env' recipe lands on both 12c and 19c after addon Helm upgrade
  • Cross-engine impact tables in §6.2 + §6.3 cover Oracle / MySQL / PostgreSQL / MongoDB / Redis-Valkey / SQL Server / MariaDB (Oracle + MariaDB grounded, others pending each line owner's grounded confirm)

Test plan

  • MariaDB line idc bastion view — inline patch §6.3 (4 fold-in topics) per stated workflow (git fetch origin tom/runtime-contract-preflight-v1 && git checkout -b ... && commit && push)
  • Oracle line addon owner — review §6.2 + §6.3 main for W7 grounded accuracy + add W8 inter-minor-version drift sub-bullet
  • SQL Server line — fold Jerry KB 1.0.2 isExclusive PoC evidence into §6.1 Reproduction (pending T08 reproduction completion)
  • Curator structural pass (style harmonize / cross-ref path validation / 5-field intro compliance)
  • Cross-line review (Addon test owner) for cross-engine impact table accuracy
  • Cross-link to addon-kb-schema-version-preflight-guide.md (PR #70, docs(kb-schema-preflight): three-layer KB version preflight guide v1), bidirectional after both PRs land
  • Cross-link to addon-smoke-test-pre-flight-checklist-guide.md (PR #71, docs(smoke-test-preflight): smoke test pre-flight checklist guide v1 (with §7 CoreDNS)) §7 CoreDNS doctrine
  • Squash-merge after all 3 line LGTMs

Submission hygiene

Commit body and PR text were scrubbed for generated-content attribution markers before curator review.

Ava added 3 commits May 5, 2026 13:43
…ion tree

Add addon-runtime-contract-preflight-guide.md (652 lines) covering chart
spec vs runtime contract drift across 3 layers:

- Layer 1: chart spec field rejected by KB CRD schema (install-time fail)
- Layer 2: vcluster substrate bootstrap precondition (CoreDNS image)
- Layer 5: runtime env contract drift (ActionSet spec.env declare gap)

Layer 0 / 3 / 4 reserved for future sediment. Non-contiguous numbering
explained at section 3 opening.

Layer 5 main 6-section authored by Oracle line addon owner (W7 ActionSet
env audit, ORA-12154 grounded, idc4 19c standalone evidence with double
verification). Polished and folded by SQL Server line.

Layer 2 main 6-section authored by Oracle line addon owner (CoreDNS
ImagePullBackOff grounded, registry.aliyuncs.com fix verified). idc
bastion view inline placeholder for MariaDB line co-author (mirror
family classification, sideload vs image swap tradeoff, blacklist
wisdom, vcluster substrate bootstrap timing pitfalls).

Layer 1 main 6-section drafted with placeholder for SQL Server line
PoC evidence (KB 1.0.2 isExclusive field-not-declared-in-schema), to
be filled after T08 reproduction completes.

Decision tree (mermaid) at section 8 separates Layer 1 / 2 / 5 by
3 evidence questions: install-time fail / smoke-PASS-but-cross-pod-fail
/ ActionSet env declare diff. Same-surface-error different-root-cause
covered as twin-fault scenario in Layer 2 + Layer 5 stack.

Cross-doc family map at section 9 places this doc as the runtime
contract dimension within the preflight family (5 docs, scope strictly
non-overlapping). Cross-refs to chart-vs-kb-schema-skew-diagnosis,
soak-test-result-classification, test-acceptance-and-first-blocker,
probe-classification, and bounded-eventual-convergence guides.

Case appendix A-D covers Oracle W7 grounded, idc4 CoreDNS fix grounded,
SQL Server isExclusive (pending PoC fold-in), MariaDB line mirror
family cross-line evidence (PR #54 reference).

Add SKILL-INDEX.md entry for addon-runtime-contract-preflight-guide.md
in section 3 "环境 ready 前 / 环境层撞坑" (before the environment is
ready / environment-layer pitfalls), documenting the 3-layer
chart-vs-runtime contract drift methodology and its position within
the preflight family (5 docs, scope strictly non-overlapping).

Entry references the Oracle line W7 grounded evidence (idc4 19c +
12c double minor-version confirm) and MariaDB line idc bastion view
co-authored content (PR #54 mirror family + 11:16-12:55 syncer
deployment experience).
…onfirm + Step 3.5 audit

Apply 4 fold-ins from Oracle line W8 deliverable + 12c second-engine
confirm into runtime-contract-preflight-guide v1:

1. Section 4 Step 3.5 — ActionSet env reference systemAccount CMPD
   declare audit, with cross-variant `cmpd*.yaml` diff audit script
   covering systemAccounts names + vars credentialVarRef.name set +
   lifecycleActions keys (closes the silent W8-style runtime gap)

2. Section 6.2 Source datapoint — N=2 minor-version reproducibility
   confirm: 12c standalone (cluster o12-t8-21609) RMAN 628MB 1m58s +
   expdp 220KB 2m56s + 19c standalone (cluster o19-i4-8854) RMAN 553MB
   2m32s, identical kubectl patch actionset spec.env recipe

3. Section 6.2 Inter-minor-version chart drift sub-bullet (Layer 5
   grounded form orthogonal to KB cross-version drift): symptom +
   audit signal + fix shape, anchored on Oracle line W8 commit
   1cb93055 cmpd-19c misalignment evidence

4. Section 6.3 MongoDB wording footnote — no current MongoDB line
   owner, K8s evidence pack inferred wording, pending future line
   owner grounded confirm

weicao commented May 5, 2026

Curator structural pass v1 — addon-runtime-contract-preflight-guide.md

Verdict: 4 mandatory format fixes needed before final pass + content LGTM ✅

Note: This is curator pass on v1 (7ad984b). Tom flagged additional commits coming (Helen / James / Tom follow-ups). Address fixes below + iterate, then final 4-axis pass.

Content (substantively LGTM — strong)

The 3-Layer model (Layer 0/1 schema vs Layer 2 substrate vs Layer 5 runtime env contract) is a clean orthogonal axis from James's schema dimension. §3 model + §4 4-step preflight + §5 4-pillar cross-engine + §6 three-Layer main body + §7 Archetype + §8 mermaid decision tree + §9 cross-doc family map + §10 case appendix together give comprehensive coverage.

Strong points:

  • §3 Layer model is orthogonal to James's schema dimension, with no overlap
  • §4 Step 4 ("actually run one dataprotection / cross-pod operation round-trip") is the key step: real contract drift only surfaces through a round-trip
  • §6.1 / §6.2 / §6.3 grounded with W7 / vcluster CoreDNS / Oracle isExclusive
  • §7 archetype framing ("chart spec doesn't declare a runtime requirement") is abstracted at the right level
  • §8 mermaid decision tree is actionable
  • §9 cross-doc family map lifecycle phase explicit (preflight → diagnosis → classification)

Mandatory format fixes ⚠️

Fix #1: SKILL-INDEX entries missing (CRITICAL gap)

PR #72 only changes 1 file (doc body). No SKILL-INDEX entries added. This violates the single-PR pattern established by PR #59-67.

Same gap as PR #63 (Noah vanilla bootstrap) and PR #55 (github-submission). Need 2 entries:

  • §3 environment-pre scenario index (alongside other preflight family entries)
  • Full document list (文档全列表) detailed reference

Suggested additions (single commit on top of 7ad984b):

# Scenario index (§3 环境 ready 前 / 环境层撞坑, i.e. before the environment is ready / environment-layer pitfalls), alongside other preflight family entries:
- [`addon-runtime-contract-preflight-guide.md`](addon-runtime-contract-preflight-guide.md): 3-layer drift check between the chart spec and the actual runtime contract (Layer 0/1 schema three layers + Layer 2 vcluster substrate bootstrap + Layer 5 ActionSet env contract) + 4-step preflight flow + cross-engine 4-pillar alignment + Archetype "chart spec doesn't declare a runtime requirement"

# Full document list (文档全列表) detailed entry:
- [`docs/addon-runtime-contract-preflight-guide.md`](addon-runtime-contract-preflight-guide.md): preflight methodology for the hidden failure mode in which chart install schema validation passes and smoke fully PASSes, yet the runtime reports cryptic errors. Focuses on three alignment surfaces: (1) Layer 0/1 chart spec fields → KB CRD schema compatibility (image / chart / CRD three layers); (2) Layer 2 vcluster substrate bootstrap precondition (offline approach when images cannot be pulled from the external network); (3) Layer 5 ActionSet env contract drift (the chart doesn't declare a runtime env, the runtime gets an empty value → cryptic downstream error). Includes the 4-step preflight flow + cross-engine 4-pillar alignment table + decision tree + 4-case appendix (Oracle W7 ActionSet env / vcluster CoreDNS ImagePullBackOff / chart vs KB main isExclusive / MariaDB mirror family). Forms an orthogonal pair with [`addon-kb-schema-version-preflight-guide.md`](addon-kb-schema-version-preflight-guide.md) (schema dimension vs runtime contract dimension); complements [`addon-chart-vs-kb-schema-skew-diagnosis-guide.md`](addon-chart-vs-kb-schema-skew-diagnosis-guide.md) across the lifecycle (preflight in this doc vs diagnosis in the latter); complements [`addon-soak-test-result-classification-guide.md`](addon-soak-test-result-classification-guide.md) across the lifecycle (preflight before vs classification after)

Fix #2: Status field

Current: **Status**: stable. CLAUDE.md convention values: stable | draft | superseded. For v1 first push of new doc, recommend draft until iterated through Helen/James/Tom multi-commit cycle. Change after final commits land.

Suggested: **Status**: draft (v1) until ready to promote.

Fix #3: Line 498 — typo in cross-ref filename

Current: addon-smoke-test-pre-flight-checklist-guide.md (does NOT exist on main)

Likely intended target:

  • addon-test-script-preflight-guide.md (exists, 5-min preflight family)
  • addon-vcluster-kb-install-preflight-guide.md (vcluster-side install preflight)
  • addon-multi-ns-registry-scan-preflight-guide.md (multi-topology preflight)

Verify which doc Tom meant, fix filename + convert to clickable markdown link.

Fix #4: Lines 519 + 645 — backtick-only for existing doc

Both reference addon-idc-image-registry-mirror-guide.md which IS on main (PR #54). Should be clickable markdown links per CLAUDE.md mandatory convention #3.

# Auto-fix sed for both lines:
sed -E -i '' 's|`(addon-idc-image-registry-mirror-guide\.md)`|[`\1`](\1)|g' docs/addon-runtime-contract-preflight-guide.md

Forward-decl OK (line 78)

Line 78 references addon-kb-schema-version-preflight-guide.md — this is in PR #70 (James), not yet merged. Per Tom's earlier DM, this is intentional forward-decl since PR #70 file path is locked.

Two valid options:

Either works. Currently option (a) — accept as-is.

Non-blocking observations

  1. Status field: Beyond Fix #2 above, "Affected by version skew: yes" with comprehensive reasoning is strong; it keeps the version-skew field honest.

  2. §5 4-pillar cross-engine table — covers the 4 axes (chart spec vs CMPD-declared envs vs ActionSet-declared envs vs runtime actual envs). If §6.x extends with engine-specific examples (Helen MariaDB, James W8), table may want a 5-engine column expansion.

  3. §9 cross-doc family map explicitly frames lifecycle phases (preflight before / diagnosis during / classification after). This becomes the reference template for future preflight family doc cross-link sections.

  4. Trailer hygiene: Tom self-checked clean. Single-author + consistent git config → A6 narrowed condition holds, squash-merge will be clean.

Action

  1. Add SKILL-INDEX 2 entries (Fix #1)
  2. Status stable → draft (Fix #2)
  3. Resolve line 498 filename typo (Fix #3)
  4. Linkify lines 519 + 645 (Fix #4)

Push fixes on top of 7ad984b to same branch. After Helen / James / Tom follow-up commits land, ping me for final 4-axis curator pass + recommend squash merge.

Other than these 4 mandatory format fixes: content LGTM, structure clean, scope distinct from sibling docs.

…load trap + mirror blacklist + bootstrap timing

Refine the four idc-bastion-view sub-blocks Tom drafted as starter content in §6.3 with grounded detail from the 2026-05-05 patched syncer cross-compile + sideload cycle:

- Sideload field pitfalls:
  - Clarify that imagePullPolicy 'Never' is a fresh-deploy invariant only; an existing CmpD upgrade must keep 'IfNotPresent' (which honors the local cache identically).
  - Add per-node 'ctr images ls' as the evidence gate for declaring a sideload complete (a sideload to N-1 of N nodes is a silent partial success).
  - Add the cross-compile workaround pattern: when host docker buildx cannot pull a base image, run GOOS=linux GOARCH=amd64 go build and use a minimal Dockerfile to graft the binary onto a substrate-pullable base image.
  - Add the 'verify before propagate' trap: cross-host / cross-stash builds may miss module registration code that lives in a collaborator's local stash; an 'image ready' handoff must include a verify gate.

- Mirror blacklist wisdom: correct three entries to match real failure modes:
  - alpine/distroless toolchain image (e.g. kubeblocks-tools:1.0.2) lacks /bin/bash, so bash-dependent runner scripts cannot start (this is not an 'exec format error'). Pin to a debian-flavored tag or split into init+main containers.
  - Third-party Aliyun-hosted Go base image (xuriwuyun/golang) reliably times out from offshore Mac docker buildx. The workaround is to bypass buildx entirely via cross-compile, not to fix the mirror.
  - ghcr.io references in the chart main body are a hard blocker, not a fallback. Sideloading as a rescue is short-term; the long-term fix is for the upstream chart to replace ghcr.io with a mirror-friendly registry.

- Vcluster substrate bootstrap common timing pitfalls: split the Secret kubeconfig timing pitfall into two distinct cases (vcluster-syncer's own Secret vs test-runner Job's Secret) since they fail at different layers and have different mitigations. Add the probe Job pattern as a substrate-readiness gate, with a cross-link to addon-host-runner-job-pattern-guide.md §基础资源 (base resources). Add the rule that DNS readiness must be probed from inside the workload's namespace, not from the bastion host.

- Cross-engine impact MariaDB row: replace the cross-line guess wording with substrate-based wording (Could not resolve host name / getaddrinfo / Galera gcomm channel open). Mark wording status as ⚠ pending grounded sample, not ✅, since the MariaDB cycle 2026-05-05 ran with vcluster API via NodePort and idc-k8s CoreDNS healthy — Layer 2 DNS failure path was not triggered, no real error trace captured.

- Source datapoint MariaDB cross-line evidence: add concrete anchors from the 2026-05-05 11:16-12:55 patched syncer deploy (idc-k8s 3-node mariadb-runner namespace, kubectl cp tar + ctr import to all 3 nodes, CmpD imagePullPolicy: IfNotPresent honored cache, alpha.16 fresh-matrix 8/8 PASS validating the patch).

Evidence anchors:
- addons/mariadb/templates/cmpd-{replication,semisync,galera}.yaml — chart main-body image route via DockerHub family
- addon-idc-image-registry-mirror-guide.md PR #54 §1 mirror family + §5 MariaDB row — chart audit + ACR-direct runner toolchain pair
- addon-host-runner-job-pattern-guide.md §基础资源 + §常见坑表 — initContainer toolchain split + alpine vs glibc + image cache warm-up
- addon-evidence-discipline-guide.md anti-pattern library — the 'verify before propagate' trap is a documented anti-pattern instance
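The per-node evidence gate for sideload completeness described in the commit message above could look roughly like the following. This is a minimal sketch under stated assumptions, not the guide's actual tooling: the function name `sideload_complete` is hypothetical, and each listing file is assumed to contain the output of `ctr -n k8s.io images ls -q` collected from one node.

```shell
#!/usr/bin/env bash
# Sketch: a sideload counts as complete only if the image ref shows up
# in EVERY node's image listing (N-1 of N is a silent partial success).
# Assumption: each listing file holds `ctr -n k8s.io images ls -q`
# output from one node, one image ref per line.

sideload_complete() {
  local image_ref="$1"; shift
  # Refuse to declare success when no node evidence was supplied.
  [ "$#" -gt 0 ] || { echo "no node listings given"; return 2; }
  local missing=0 listing
  for listing in "$@"; do
    # -F fixed string, -x whole-line match, -q quiet
    if ! grep -Fxq -- "$image_ref" "$listing"; then
      echo "missing on: $listing"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

A caller would collect one listing per node first, then gate the "image ready" handoff on the function's exit status.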

weicao commented May 5, 2026

Cross-line grounded confirm for the Redis/Valkey row in §6.2 / §6.3:

The current wording Could not connect to Redis at <host>:<port>: Name or service not known is correct for the DNS resolve sub-variant. Two refinements worth folding into v1.x:

1. The "Redis" prefix is preserved in valkey-cli (it is not rebranded to "Valkey"). Valkey is a Redis 7.2.4 fork, and the CLI inherits the original Redis source-level error format strings. Without explicitly noting this, a reader running Valkey may see "Redis" in the error and second-guess whether the right binary is in play.

2. The error is a family with 4 suffixes, each indicating a different failure layer. The single-suffix listing in §6.2 / §6.3 makes anti-pattern matchability narrower than necessary:

| Suffix | Failure layer |
| --- | --- |
| Name or service not known | DNS resolve (e.g. pod hostname not registered, kube-dns transient unavailable) |
| Connection refused | host reachable but port not listening (server not started, port not bound) |
| No route to host | network layer unreachable (CNI issue, iptables drop) |
| Operation timed out | TCP/TLS handshake stall after connect |

A reader-side anti-pattern check should match the prefix (Could not connect to Redis at) and then categorize by suffix; matching only one suffix exactly will miss the other three failure modes that appear in the same prefix family.
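
The prefix-then-suffix check described above can be sketched as a small shell function. This is an illustration only: the function name `classify_redis_connect_error` and the short layer labels it prints are hypothetical, while the prefix and the four suffixes are the ones listed in the table.

```shell
#!/usr/bin/env bash
# Sketch: match the shared prefix first, then categorize by suffix.
# The printed labels are hypothetical shorthand for the failure layers.

classify_redis_connect_error() {
  local line="$1"
  # Step 1: is this line in the "Could not connect to Redis at" family?
  case "$line" in
    *"Could not connect to Redis at"*) ;;
    *) echo "not-in-family"; return 0 ;;
  esac
  # Step 2: categorize by suffix.
  case "$line" in
    *"Name or service not known"*) echo "dns-resolve" ;;
    *"Connection refused"*)        echo "port-not-listening" ;;
    *"No route to host"*)          echo "network-unreachable" ;;
    *"Operation timed out"*)       echo "handshake-stall" ;;
    *)                             echo "unknown-suffix" ;;
  esac
}
```

Matching the prefix first keeps an unrecognized suffix visible as `unknown-suffix` instead of silently dropping the line, which is the failure mode of exact single-suffix matching.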

Grounded source: chaos-test artifact from the Valkey IDC full-cycle 2026-05-05, archive SHA-256 c14b909f16a59a9b25192f010b7fe764e39b30cc32223881881ce336c6b72499. The DNS-resolve sub-variant appears in chaos-podkill suite logs (post-pod-kill, before kube-dns endpoint convergence); other sub-variants appear sporadically in chaos-ops C09 partial-failure scenarios.

…ection

Allen curator pass 4 mandatory fixes (PR #72 c472d1e2):
- Status: stable -> draft (v1) (5-field intro)
- SKILL-INDEX 文档全列表 entry added (paste-ready wording)
- Line 552/577 addon-smoke-test-pre-flight-checklist-guide.md
  marked (planned, PR #71) for forward-decl
- Lines 574/576/702 addon-idc-image-registry-mirror-guide.md
  backtick -> clickable markdown (landed doc on main)
- Line 576 addon-host-runner-job-pattern-guide.md also linkified
  (landed doc on main)

James jsonpath correction (PR #72 1c15cbb5):

Step 3.5 audit shape was wrong. DP_DB_USER / DP_DB_PASSWORD are
NOT declared in ActionSet spec.env at all -- the KB dataprotection
Job runner auto-injects them based on
BackupPolicy.spec.backupMethods[].target.account (which references
a systemAccount name). The W8 contract boundary lives at the
BackupPolicy layer, not the ActionSet layer.

Corrected audit:
1. cluster-side: diff .spec.backupMethods[].target.account from
   kubectl get backuppolicy ... against .spec.systemAccounts[].name
   from kubectl get cmpd ...
2. chart-side: yq diff of cmpd-19c.yaml systemAccounts against
   backuppolicytemplate.yaml backupMethods[].target.account
3. Added "Audit shape 关键澄清" (key clarification of the audit
   shape) paragraph documenting that the systemAccount -> DP_DB_*
   contract lives at the BackupPolicy layer

W8 grounded form documented:
addons/oracle/templates/backuppolicytemplate.yaml line 33 references
account: kbdataprotection but cmpd-19c.yaml does not declare it ->
KB never generates the secret -> CreateContainerConfigError on 19c.

Doctrine B 5-pattern grep on this commit: clean.
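
The corrected cluster-side audit boils down to a set difference: account names referenced by backup methods minus account names declared as systemAccounts. A minimal sketch follows, assuming both name lists have already been exported to plain-text files (one name per line). The function name, the file names, and the kubectl collection commands shown in the comments are assumptions built from the field paths quoted in the commit message, not the guide's actual script.

```shell
#!/usr/bin/env bash
# Sketch: print accounts that a BackupPolicy references but the CMPD
# does not declare. Empty output means the two sides are aligned.
#
# Hypothetical collection step (resource names are placeholders):
#   kubectl get backuppolicy <name> \
#     -o jsonpath='{range .spec.backupMethods[*]}{.target.account}{"\n"}{end}' > referenced.txt
#   kubectl get cmpd <name> \
#     -o jsonpath='{range .spec.systemAccounts[*]}{.name}{"\n"}{end}' > declared.txt

missing_system_accounts() {
  local referenced_file="$1"   # names from backupMethods[].target.account
  local declared_file="$2"     # names from systemAccounts[].name
  # Drop blank lines (methods without an account), then keep only the
  # referenced names that match no declared name exactly (-F -x -v).
  sort -u "$referenced_file" \
    | grep -v '^$' \
    | grep -Fxv -f "$declared_file" || true
}
```

Run against the W8 case described above, a referenced `kbdataprotection` account that is absent from the declared list would be printed as missing.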