
Implement mock_chain_validation build tag (compute-and-swallow consensus + module-genesis failures) #3427

@bdchatham

Problem

Production-grade testing of seid at real-world state size — pacific-1's IAVL depth, contract cardinality, account distribution, in-flight gov state — is structurally hard. Existing options fall short:

  • State-bloat seeder on harbor (synthesized state) leaves a "your shape isn't really pacific-1" gap that defeats the point.
  • seid export + state surgery on x/staking + x/distribution + x/slashing + bank balances was scoped at 1–2 weeks with high probability of "boots and halts at block 1–10."

The team needs a build of seid that boots from any chain export, runs all prod code paths with full validation work performed, and converts halting mechanisms (consensus failures, cryptographic state mismatches, module-genesis invariant panics) into log+counter events. Operators see what diverged (actual hash values, actual invariant mismatch magnitudes), not just that something did. The chain "continues happily" so unique testing scenarios can run against real prod state with diagnostic signal preserved.

Impact

Eliminates the "your synthesized shape isn't really pacific-1" objection that has been blocking real-state load testing. Unlocks reproducible chaos scenarios against actual mainnet state for performance evaluation, behavior probing, and pre-release validation — at the cost of one well-isolated build target.

Relevant experts

  • sei-tendermint owners — Layer 1 refactor of ConsensusPolicy interface and validation paths (M1)
  • sei-cosmos x/staking + x/distribution module owners — Layer 2 panic-site signoff (M3)
  • sei-chain x/evm + x/oracle module owners — Layer 3 module audit (M4)
  • Build system / CI owners — Makefile + Dockerfile + image-tag work (M5)
  • Platform team — first lab smoke test orchestration (M6)

Proposed approach

Six milestones, deliverable as separate reviewable PRs. Pattern: per-package //go:build mock_chain_validation variant files in every layer (matches the existing mock_block_validation precedent in sei-tendermint; compile-time gate, no API signature changes).

M1 — sei-tendermint policy injection refactor (no behavior change).

  • M1.0: Audit halting checks in sei-tendermint validation paths. Output: a markdown enumeration of every site where validation can return an error that halts the chain (file:line, error type, what it gates).
  • M1.0: Settle the Swallow*Failure() bool interface shape. New methods on ConsensusPolicy returning false by default.
  • M1.1: Refactor each audited site from "if policy says skip, return early" to "compute the result; if failure, log detail + counter; conditionally return error based on policy.Swallow*Failure()."
  • M1.1: Update consensus_policy_default.go and consensus_policy_mock_block_validation.go to return false for new Swallow* methods (preserves existing tag semantics).
  • M1.1: Tests covering production behavior (unchanged) and a mock policy variant flipping Swallow*Failure() == true.

Deliverable: one PR against sei-chain landing the refactor without changing any runtime behavior.
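
The M1.1 refactor can be pictured with a minimal Go sketch. It assumes a single hypothetical method, `SwallowAppHashFailure()`, as one instance of the `Swallow*Failure()` family. The `validateAppHash` function, the policy struct names, and the in-process counter are all illustrative and are not taken from sei-tendermint.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// ConsensusPolicy is a stand-in for the interface named in the issue.
// SwallowAppHashFailure is a hypothetical method name.
type ConsensusPolicy interface {
	SwallowAppHashFailure() bool
}

// defaultPolicy mirrors the production default: never swallow.
type defaultPolicy struct{}

func (defaultPolicy) SwallowAppHashFailure() bool { return false }

// mockChainPolicy mirrors the proposed mock_chain_validation variant.
type mockChainPolicy struct{}

func (mockChainPolicy) SwallowAppHashFailure() bool { return true }

// validationFailures is a placeholder for the real telemetry counter.
var validationFailures int64

func failureCount() int64 { return atomic.LoadInt64(&validationFailures) }

var errAppHashMismatch = errors.New("app hash mismatch")

// validateAppHash always performs the comparison. On mismatch it logs both
// sides and bumps the counter, then returns an error only when the policy
// does not swallow the failure.
func validateAppHash(p ConsensusPolicy, expected, actual []byte) error {
	if bytes.Equal(expected, actual) {
		return nil
	}
	log.Printf("validation failure site=app_hash expected=%X actual=%X", expected, actual)
	atomic.AddInt64(&validationFailures, 1)
	if p.SwallowAppHashFailure() {
		return nil
	}
	return fmt.Errorf("%w: expected %X, got %X", errAppHashMismatch, expected, actual)
}

func main() {
	want, got := []byte{0xAA}, []byte{0xBB}
	fmt.Println(validateAppHash(defaultPolicy{}, want, got) != nil)   // default still halts
	fmt.Println(validateAppHash(mockChainPolicy{}, want, got) == nil) // mock continues
	fmt.Println(failureCount())
}
```

The point of the shape is that the comparison and the diagnostics run identically under both policies; only the final return differs, which is what keeps production behavior unchanged.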

M2 — sei-tendermint mock_chain_validation variant.

  • Add sei-tendermint/types/consensus_policy_mock_chain_validation.go with //go:build mock_chain_validation returning true for every Swallow*Failure() method.
  • Verify go build -tags mock_chain_validation produces a working binary.

Deliverable: one small additive PR.

M3 — sei-cosmos module-genesis panic guards.

  • //go:build mock_chain_validation variants for sei-cosmos/x/staking/genesis.go converting lines 113, 126, 139 from panic(...) to log+counter+continue. Log payload includes both sides of the failed comparison.
  • Same for sei-cosmos/x/distribution/keeper/genesis.go (8 sites, lines 28-95).
  • Optional sei-cosmos/x/slashing/genesis.go if audit reveals panic sites.

Deliverable: one PR against sei-chain (sei-cosmos subtree).
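
As a rough illustration of the panic-to-log conversion, the sketch below uses a hypothetical `invariantReporter` type. In the proposed design the choice between halting and continuing is made at compile time by which variant file is built; here a `swallow` field stands in for that gate so the example runs as a single file. The site names and integer balances are placeholders, not the real sei-cosmos types.

```go
package main

import (
	"fmt"
	"log"
)

// invariantReporter is a hypothetical helper. swallow stands in for the
// compile-time build-tag gate described in the issue.
type invariantReporter struct {
	swallow bool
	skipped map[string]int // per-site count of swallowed failures
}

// check compares two values. On mismatch it either panics (production
// behavior) or logs both sides plus the magnitude and continues.
func (r *invariantReporter) check(site string, got, want int64) {
	if got == want {
		return
	}
	msg := fmt.Sprintf("%s: got %d, want %d (diff %d)", site, got, want, got-want)
	if !r.swallow {
		panic(msg)
	}
	log.Println("genesis invariant swallowed:", msg)
	r.skipped[site]++
}

func main() {
	r := &invariantReporter{swallow: true, skipped: map[string]int{}}
	r.check("staking/bonded_pool", 900, 1000)  // mismatch: logged, counted
	r.check("staking/not_bonded_pool", 50, 50) // match: no-op
	fmt.Println(r.skipped["staking/bonded_pool"], len(r.skipped))
}
```

Logging the difference as well as both operands is one way to surface the "invariant mismatch magnitude" the Problem section asks for.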

M4 — sei-chain module audit + variants.

  • 30-min audit of x/evm/genesis.go, x/oracle/genesis.go, and other sei-specific modules (tokenfactory, epoch, dex if present) for cross-state invariant panics.
  • Apply the same variant pattern where needed.

Deliverable: one PR against sei-chain.

M5 — Build system + image.

  • Makefile target: make build-unsafe injecting GO_BUILD_TAGS=mock_chain_validation and producing an unsafe-vX.Y.Z image.
  • Dockerfile / GitHub Actions matrix entry pushing the image to ECR/GHCR with the unsafe- prefix.
  • CI sanity check: confirm the build tag is in the version.BuildTags ldflag.

Deliverable: one PR against sei-chain.
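
The CI sanity check could be as small as the following sketch. It assumes the build-tags string baked in through the ldflag is a comma- or space-separated list; `hasBuildTag` is a hypothetical helper, and the example tag strings are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// hasBuildTag reports whether want appears as a whole entry in a
// comma- or space-separated build-tags string. Matching whole entries
// avoids false positives from tags that merely share a prefix.
func hasBuildTag(buildTags, want string) bool {
	fields := strings.FieldsFunc(buildTags, func(r rune) bool {
		return r == ',' || r == ' '
	})
	for _, tag := range fields {
		if tag == want {
			return true
		}
	}
	return false
}

func main() {
	const tag = "mock_chain_validation"
	fmt.Println(hasBuildTag("netgo,ledger,mock_chain_validation", tag))
	fmt.Println(hasBuildTag("netgo,ledger,mock_block_validation", tag))
}
```

A CI job could run the built binary's version command, feed the reported tags to a check like this, and fail the unsafe image build if the tag is absent (or fail the production build if it is present).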

M6 — First lab smoke test.

  • Use a recent pacific-1 export.
  • Boot the binary with new validators.
  • Verify the chain survives block 1 (success criteria: sei_unsafe_validation_skipped_total is non-zero and the chain keeps producing blocks).
  • Verify the structured log lines fire with divergence detail.
  • Capture results in a follow-up document.

Deliverable: a manifest update in platform-shadow + a brief writeup.

Acceptance criteria

  • M1 PR merged: ConsensusPolicy interface extended; validation paths refactored; production behavior unchanged; tests pass.
  • M2 PR merged: mock_chain_validation variant exists; go build -tags mock_chain_validation succeeds.
  • M3 PR merged: x/staking + x/distribution panic sites have variants; module owners signed off on which sites convert vs stay as halts.
  • M4 PR merged: x/evm + x/oracle + sei-specific module audit complete; any found panic sites have variants.
  • M5 PR merged: image with the unsafe- prefix published from CI.
  • M6 smoke test: a binary built with the tag boots a pacific-1 fork past block 1 with new validators.

Out of scope (with un-defer triggers)

  • Runtime startup guards refusing prod chain-ids. Un-defer trigger: a near-miss where someone tries to deploy the unsafe binary to a prod cluster.
  • Separate ECR repository for unsafe builds. Un-defer trigger: same as above (a near-miss deployment of the unsafe binary to a prod cluster).
  • Cosign provenance / signing separation. Un-defer trigger: any compliance/audit requirement.
  • Out-of-band AppHash-diff sidecar to detect silent divergence in the lab. Un-defer trigger: lab results start producing numbers that seem too good to be true.
  • Combining mock_chain_validation with mock_block_validation into an umbrella tag. Explicitly rejected by the security specialist: separate tags isolate blast radius.
  • Parameter-passing approach instead of per-package build tags. Decided in favor of per-package variant files for compile-time gate parity with mock_block_validation precedent.

Open questions (decisions wait for implementation review)

  1. Layer 1 audit scope — which sei-tendermint validation paths need routing through ConsensusPolicy
  2. Swallow*Failure() interface shape — final naming at implementation discretion
  3. Which sei-cosmos panic sites convert vs which stay as halts — needs x/staking + x/distribution depth signoff at M3 PR-time review
  4. Telemetry granularity (design defaults to one counter with {site, kind} labels)
  5. Image tag prefix specifics — design proposes unsafe-
  6. Scope of the cosmos-sdk-upgrade audit going forward
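
For open question 4, the "one counter with {site, kind} labels" default might look like the sketch below. It uses only the standard library so it stays self-contained; a real implementation would presumably go through whatever metrics library the codebase already uses. The type and method names, and the site and kind values, are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// labels is the {site, kind} pair proposed as the default granularity.
type labels struct {
	Site string
	Kind string
}

// labelledCounter is a hypothetical in-process stand-in for a single
// metric such as sei_unsafe_validation_skipped_total.
type labelledCounter struct {
	mu     sync.Mutex
	counts map[labels]uint64
}

func newLabelledCounter() *labelledCounter {
	return &labelledCounter{counts: make(map[labels]uint64)}
}

// Inc records one swallowed failure for the given site and kind.
func (c *labelledCounter) Inc(site, kind string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labels{Site: site, Kind: kind}]++
}

// Get returns the count for one label pair.
func (c *labelledCounter) Get(site, kind string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labels{Site: site, Kind: kind}]
}

// Total sums across all label pairs, which is what a "non-zero" smoke
// test check would look at.
func (c *labelledCounter) Total() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	var sum uint64
	for _, n := range c.counts {
		sum += n
	}
	return sum
}

func main() {
	c := newLabelledCounter()
	c.Inc("app_hash", "consensus")
	c.Inc("app_hash", "consensus")
	c.Inc("staking/bonded_pool", "genesis_invariant")
	fmt.Println(c.Get("app_hash", "consensus"), c.Total())
}
```

Keeping a single metric with two labels lets the smoke test assert on the total while still letting operators break divergences down by where and what kind they were.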

🤖 Generated with Claude Code
