fix(k8s+caddy): stop namespace-rename cascade-delete + Caddy fallout (PAY-SPACE 2026-05-10, fulldiveVR 2026-05-12 outages) #255
Conversation
When the namespace-naming change from #230 lands on a consumer, Pulumi schedules a Replace on every custom-stack namespace (parentEnv != stackEnv). During the brief create-replacement + delete-replaced window the Service carrying `simple-container.com/caddyfile-entry` exists in *both* the old and new namespaces. The Caddy aggregator script concatenated annotations from `kubectl get services --all-namespaces` without dedup, producing two identical `http://<domain> { ... }` site blocks in `/tmp/Caddyfile`. Caddy aborted with `ambiguous site definition` and crashlooped until the old Service was collected.

PAY-SPACE hit this in production on 2026-05-11 — `support-payhey.pay.space` was the visible victim because it sorts alphabetically before its siblings, but every whitelabel that migrated through the rename traversed the same transient duplicate.

Fix:
- Include `creationTimestamp` in the jsonpath listing and `sort -r` so the most-recently-created Service is processed first.
- Track emitted site-address keys in a tempfile. The dedup key is the first non-blank line of each annotation — for domain entries that's `http://<domain> {` or `https://<domain> {`, for prefix entries it's `handle_path /<prefix>*`. Both transports are guarded.
- An older Service whose key has already been emitted is skipped with a log line, so the picked winner is observable in the init-container output.

Verified offline against a synthetic three-Service set (new-ns/example and old-ns/example both declaring `http://example.com`, plus an unrelated `other.com`): the output Caddyfile has exactly one `http://example.com` block and its `reverse_proxy` resolves to new-ns. Module builds clean, `go test ./pkg/clouds/pulumi/kubernetes/...` passes.

The fix is independent of #230's `RetainOnDelete` migration semantics — even after that path is hardened, any future namespace-shape change or Service Replace will see the same overlap window. This makes the Caddy ingress tolerant of it rather than crashlooping.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
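For orientation, here is a minimal sketch of the dedup approach described above, written the way the init-container script is carried in the codebase (a bash fragment inside a Go raw string — the later build-fix commit confirms that layout). The variable names, tempfile path and exact jsonpath are illustrative assumptions, not the shipped script, and the follow-up commit below additionally hardens this with pipefail.

```go
package kubernetes

// caddyAggregatorDedupSketch is an illustrative bash fragment (not the shipped
// init script) showing the dedup described above: Services are listed
// newest-first by creationTimestamp, the dedup key is the first non-blank line
// of the caddyfile-entry annotation, and an older Service whose key was
// already emitted is skipped with a log line.
const caddyAggregatorDedupSketch = `
seen=/tmp/caddy-emitted-keys
: > "$seen"

kubectl get services --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
  | sort -r \
  | while read -r _ts ns svc; do
      entry=$(kubectl get service "$svc" -n "$ns" \
        -o jsonpath='{.metadata.annotations.simple-container\.com/caddyfile-entry}')
      [ -z "$entry" ] && continue
      # dedup key is e.g. "http://example.com {" or "handle_path /prefix*"
      key=$(printf '%s\n' "$entry" | awk 'NF { print; exit }')
      if grep -Fxq "$key" "$seen"; then
        echo "Skipping duplicate caddyfile-entry for: $key ($ns/$svc)"
        continue
      fi
      echo "Processing service: $svc in namespace: $ns"
      echo "$key" >> "$seen"
      printf '%s\n\n' "$entry" >> /tmp/Caddyfile
    done
`
```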
Codex caught a critical regression I introduced: the new
`kubectl ... | sort -r` pipeline under `set -e` (no pipefail) silently
collapsed to `services=""` whenever kubectl failed, and the script
exited successfully. Caddy would then start with only the default
`http:// { file_server }` block and every domain would serve the
welcome page on the next pod restart — the same masquerading-as-200
failure mode that took prod down on 2026-05-10. Hard miss; would have
made the original outage repeatable on any transient kubectl flake.
Changes:
- `set -xeo pipefail`. A kubectl error now fails the init-container
fast; K8s reschedules and retries instead of cementing a partial
config.
- Split the `kubectl | sort` into two assignments so the failure mode
is unambiguous even if a future reader doesn't notice the pipefail.
- Normalize the dedup key in awk: skip blank lines, skip comment lines,
trim leading/trailing whitespace. For SC-generated annotations this
is functionally a no-op (their first non-blank line is deterministic),
but it makes the dedup robust against indentation differences and
user-authored caddyfile-entry annotations with header comments —
gemini's concern.
- Switched `echo "$services" | while` to `printf '%s\n'` to keep the
pipeline shell-portable when `$services` could contain backslashes.
Offline verification: pipefail now exits 1 on kubectl failure; dedup
key normalization collapses ` http://example.com {` (indented, new)
and `http://example.com {` (flush, old) to the same key; comment-led
annotations still emit with the right key.
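For concreteness, a sketch of the shell preamble after this change (an assumed shape only — the real script lives as a Go raw string in caddy.go, and the `-x` added here is dropped again later in 95730bf):

```go
package kubernetes

// pipefailPreambleSketch illustrates the hardening described above: pipefail
// makes any kubectl failure abort the init container (Kubernetes then retries)
// instead of silently yielding an empty service list, and the former
// "kubectl | sort" pipeline is split into two assignments so the failing step
// is unambiguous. Assumed shape, not the literal shipped script.
const pipefailPreambleSketch = `
set -eo pipefail

# fails the init container right here if kubectl errors out
raw_services=$(kubectl get services --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}')

# newest first, so the replacement Service wins the dedup below
services=$(printf '%s\n' "$raw_services" | sort -r)
`
```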
Followups intentionally NOT in scope here:
1. Retroactive `RetainOnDelete` for namespace resources whose state
predates #230 — the actual prod-killer. Both reviewers explicitly
called out that this PR does not fix it.
2. Caddy default-block hardening — serve a hard 503 instead of
file_server on /etc/caddy/pages when no Service block matches, so
the absence of routes is loud instead of disguised as healthy 200s.
Both will be follow-up PRs.
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
**Review pass — codex + gemini (parallel)**

Ran a parallel review with both tools. Headline: codex caught a critical regression I introduced, gemini caught some defensive concerns. Pushed 1abd3c1 addressing both.

**Codex — the serious one**

This automated the exact welcome-page outage scenario from 2026-05-10. Without pipefail, any transient kubectl flake on a Caddy pod restart silently produces a config with no Service routes, and the catch-all `http:// { file_server }` block serves the welcome page on every domain as a healthy-looking 200. Fix in 1abd3c1: pipefail plus the kubectl/sort split described in the commit message above.

**Gemini — defensive**

Flagged the dedup key as brittle in theory: leading whitespace, comment lines, multi-host comma-separated site addresses. For SC-generated annotations this is moot (the first line is deterministic), but hardened anyway. Fix in 1abd3c1 — the key is now extracted with:

```awk
/^[[:space:]]*$/ { next }
/^[[:space:]]*#/ { next }
{ sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, ""); print; exit }
```

Verified offline: pipefail exits 1 on kubectl failure; indented and flush-left site addresses collapse to the same key; comment-led annotations still emit with the right key. Multi-host comma-order normalization (gemini's remaining edge case) …

**Both reviewers explicitly confirm**

This PR fixes only the secondary crashloop. It does not fix the actual prod-killer (non-retroactive `RetainOnDelete` on Namespace state that predates #230). Both follow-ups are real bugs. This one's safe to merge first as the immediate symptom containment.
When all Services with a `simple-container.com/caddyfile-entry`
annotation for a given Host disappeared — for example, after a cascade-deletion
from a namespace Replace gone wrong — requests fell through to the
catch-all `http:// { file_server /etc/caddy/pages }` block and got back
HTTP 200 + "Default page" from index.html. External monitoring saw
healthy 200s. CDNs and load balancers saw 200s. Pingdom / UptimeRobot /
the dashboard everyone trusts saw 200s. The outage was invisible to
every layer that wasn't deep-inspecting the response body.
PAY-SPACE hit this on 2026-05-10: the migration from SC #230 cascade-
deleted the shared parent namespace, every Service annotation for
production hosts evaporated, and every domain pointing at the cluster
served the Caddy welcome page. The outage was only noticed when a human
opened a browser tab.
Change:
- Default catch-all now uses `respond ... 503 { close }` instead of
`file_server /etc/caddy/pages`.
- Retry-After: 60 so CDNs back off appropriately and clients know to
retry rather than treating 503 as a hard failure.
- Cache-Control: no-store so an aggressive cache doesn't pin the 503
state past route recovery.
- HTML body still rendered for humans visiting in a browser, but it's
now a 503 page that names the problem (missing
`simple-container.com/caddyfile-entry` annotation) and tells operators
what to check. The literal "Default page" string is gone.
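A sketch of the default block this commit describes — a reconstruction for readability, not the literal config emitted by caddy.go; the HTML body is abbreviated, and the review commit that follows reworks header scoping, Content-Type and HSTS handling:

```go
package kubernetes

// defaultCatchAll503Sketch reconstructs the catch-all described in this
// commit: a hard 503 with Retry-After and Cache-Control instead of the old
// 200 file_server welcome page. Illustrative only; see the next commit for
// the hardened version.
const defaultCatchAll503Sketch = `
http:// {
	header Cache-Control "no-store"
	header Retry-After "60"
	respond "<!doctype html><h1>503: no simple-container.com/caddyfile-entry route for this host</h1>" 503 {
		close
	}
}
`
```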
Behavior verified by running the Caddy image with the new default block:
configured host (Host: example.com) → HTTP 200
unmatched host (Host: support-bot.pay.space) → HTTP 503
Retry-After: 60
Cache-Control: no-store
`caddy validate` against the full embedded Caddyfile + new default block
+ a sample matched site passes clean.
The /etc/caddy/pages directory (index.html, 404.html, 502.html, 500.html)
is still embedded and used by the `handle_bucket_error` and
`handle_server_error` snippets for legitimate per-Service error
fallbacks — only the catch-all stopped serving it as a 200.
Pairs with #255 (Caddy aggregator dedup) as the two halves of the
2026-05-10 PAY-SPACE outage: dedup keeps the aggregator from
crashlooping during a Service transition, this PR keeps the absence of
routes loud so it doesn't masquerade as a healthy 200.
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Five blockers/issues from the parallel review:
1. **Content-Type silently became text/plain** (codex, blocking). Caddy's
`respond` defaults the response to text/plain when no explicit
Content-Type is set on the route, so browsers visiting the catch-all
saw the raw HTML as literal text. Fix: `header Content-Type
"text/html; charset=utf-8"` inside the response path. Verified with
curl against caddy v2.11.2: Content-Type now reports text/html.
2. **Headers leaked onto the HSTS 301 path** (codex, important). With
`import hsts` appended, the `header Cache-Control "no-store"` and
`header Retry-After "60"` directives applied to BOTH the 503 AND the
`redir` 301-to-HTTPS that hsts adds. That's wrong for a 301 — clients
shouldn't be told to retry a permanent redirect, and `no-store`
defeats the redirect cache. Fix: wrap the headers + respond in an
explicit `handle { ... }` so they only fire on the 503 path.
3. **HSTS redirect made the 503 unreachable behind a CDN** (codex,
important; gemini noticed but called it acceptable — codex is right).
Caddy directive ordering runs `redir` before `respond`. A request
with `X-Forwarded-Proto: http` (which Cloudflare/GCP LB/most modern
CDNs set) matched hsts's `@httpReq` matcher and got a 301 to HTTPS
for the unknown host — then failed the TLS handshake because Caddy
has no cert for the unknown SNI. The user-visible result was a
browser-level TLS error, invisible to HTTP-layer monitoring — exactly
the failure mode this PR is trying to fix. Fix: omit `import hsts`
from the catch-all entirely. HSTS on a Host-agnostic catch-all is
semantically meaningless anyway (the header tells browsers "always
use HTTPS for THIS host", but the catch-all answers any host).
Per-Service site blocks still get HSTS via their own `import hsts`.
Verified: `Host: support-bot.pay.space` with `X-Forwarded-Proto:
http` now returns 503 directly instead of 301.
4. **Stale comment in the dedup section** (codex). The pipefail
rationale comment still said a kubectl failure would "serve the
welcome page from /etc/caddy/pages". With commit 3 in this PR the
welcome page is gone; the failure mode is "503 on every domain".
That's still a complete loss of routing for the cluster and worth
bailing loud over, but the comment now describes the actual current
behavior.
5. **/etc/caddy/pages/index.html is dead** (codex + gemini). Was only
referenced by the old `file_server` catch-all; the per-Service
`handle_*_error` snippets only reference 404/500/502.html. Deleted.
Validation:
- `caddy validate` clean on the assembled Caddyfile
- `go build ./...` clean
- `go test ./pkg/clouds/pulumi/kubernetes/... -count=1` passes
- Live Caddy v2.11.2 probe matrix:
Host: example.com (known) → 200 "ok" text/plain
Host: support-bot.pay.space (unknown) → 503 text/html Cache-Control:no-store Retry-After:60 Connection:close
Host: support-bot.pay.space + XFP:http → still 503 (no 301 anymore)
Out of scope still: HTTPS catch-all for unknown SNI. Caddy doesn't
synthesize a cert for unknown SNIs without explicit `default_sni` +
matching wildcard cert config, which is per-cluster and not something
this fix should bake in. Direct TLS handshake failure remains the
behavior for unknown SNIs; the HTTP 503 path is what monitoring
actually pings.
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
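Pulling the five fixes together, the catch-all at this point looks roughly like the following — a reconstruction, not the literal raw string in caddy.go; the 503 body is abbreviated and is later moved to pages/503.html:

```go
package kubernetes

// hardenedCatchAllSketch reconstructs the catch-all after the five review
// fixes above: headers and the 503 response are scoped inside an explicit
// handle block, Content-Type is set so browsers render the HTML body, and
// "import hsts" is deliberately absent so the 503 stays reachable when a CDN
// sets X-Forwarded-Proto: http. Illustrative only; a later commit swaps the
// inline body for /etc/caddy/pages/503.html.
const hardenedCatchAllSketch = `
http:// {
	handle {
		header Content-Type "text/html; charset=utf-8"
		header Cache-Control "no-store"
		header Retry-After "60"
		respond "<!doctype html><h1>503: no backend route configured for this host</h1>" 503 {
			close
		}
	}
}
`
```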
…ommit)

Previous commit d7b4d71 captured only the index.html deletion but the caddy.go changes weren't staged (git rm + git add interaction). This adds them: content-type, handle wrapper, hsts removal, stale comment.

See d7b4d71's commit message for the full rationale of all five review findings — repeated here for completeness:
1. respond defaults to text/plain — add `header Content-Type "text/html; charset=utf-8"` so browsers render the HTML body.
2. Cache-Control + Retry-After leaked onto the HSTS 301 path — wrap headers + respond in an explicit `handle { ... }`.
3. HSTS redirect made the catch-all 503 unreachable behind CDNs that set X-Forwarded-Proto — drop `import hsts` from the catch-all.
4. Stale comment about the welcome-page failure mode — updated to reflect the new 503 failure mode.
5. (the index.html deletion, landed in d7b4d71)

Verified live against simplecontainer/caddy:latest:
  Host: example.com → 200 "ok"
  Host: support-bot.pay.space → 503 text/html, Cache-Control: no-store, Retry-After: 60, Connection: close
  Same Host + X-Forwarded-Proto: http → 503 (was 301 before)

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…re split

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Previous commit 2d290fe wrote comments with markdown-style backticks inside the Go raw-string-delimited bash script literal, which closed the raw string mid-comment and turned the rest into invalid Go ("syntax error: unexpected name printf in composite literal"). Replaced with plain text (printf-to-sort, printf-to-while-read). Should have built before pushing. `go build ./...` clean now.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Gemini round-3 review flagged `set -x` as a security regression: tracing every command prints the raw caddyfile-entry annotation body (and the output of `kubectl get service ...`) to stdout, which lands in cluster logging (GCP/Datadog/ELK). SC-generated annotations don't carry secrets, but consumer-side misuse — basicauth credentials in the `Headers` map, or raw Caddy directives in `LbConfig.ExtraHelpers` — could template into the annotation body and leak via -x.

The init container is rarely debugged live (when it is, an operator can override the command), so the debuggability cost is low. The script still emits informative one-line `Processing service: $service in namespace: $ns` and `Skipping duplicate caddyfile-entry ...` messages without -x.

Kept: `cat /tmp/Caddyfile` at the end. That's the assembled config the Caddy server actually loads; printing it is useful for verifying rollouts and is consistent with prior behavior. If a consumer puts secrets into per-Service annotations they leak there too, but it's intentional logging of the deployed config, not an incidental per-command trace.

Codex round-3 verdict was "clean, merge" but acknowledged the same exposure existed via `cat`. I'm siding with gemini on -x because the trace exposure compounds (every kubectl invocation × every Service × every pod restart) while `cat` is a single final dump.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
**Multi-round codex + gemini parallel review — converged**

Four rounds, two reviewers per round. Both reach "mergeable" in round 4.

**Round 1 — codex blocked the PR, gemini was too lenient**

Codex caught the critical regressions I'd introduced — the five findings recorded in d7b4d71's commit message, headlined by the HSTS redirect that turns the catch-all 503 into a TLS failure. Gemini noticed the HSTS issue but called it "still loud enough." Codex was correct — TLS-fail is invisible to HTTP monitoring, which is the whole point of this fix. Pushed: d7b4d71 + 328e796.

**Round 2 — both clean**

Codex: "Findings: none blocking. Mergeable now." Verified all five round-1 fixes with real Caddy testing across h2, prefix-mode, and per-Service-with-hsts. One nit: the pipefail comment said "kubectl piped into sort" but the code now splits them. Pushed: 2d290fe — refreshed comment. I then accidentally put markdown backticks inside the Go raw string and broke the build. Pushed: 9828eb5 — fix.

**Round 3 — codex clean, gemini blocked**

Gemini flagged `set -x` tracing annotation bodies into cluster logs. Sided with gemini. Pushed: 95730bf — dropped `set -x`.

**Round 4 — converged**

Both reviewers: mergeable, no blockers; codex explicitly confirmed.

**Net result**

7 commits. From the original two-line dedup PR to a hardened, security-reviewed change that addresses both halves of the 2026-05-10 outage. Ready to merge.
… cascade

PR #230 changed custom-stack namespace naming from shared <stackName> to per-stackEnv <stackName>-<stackEnv> and added RetainOnDelete(true), expecting that to protect existing consumers through the migration `pulumi up`. It didn't — Pulumi reads delete-time options from the state of the resource being deleted, not from the current program. Existing Namespace resources predate #230 and don't carry RetainOnDelete; when the new code computed a different metadata.Name, Pulumi diffed against state, scheduled a Replace, executed delete-old before create-new, and sent a k8s DELETE on the legacy shared namespace. K8s cascade-deleted every resource inside, including the parent stack's production resources that lived in the same shared namespace.

Confirmed outages:
- PAY-SPACE 2026-05-10/11: support-bot parent + every whitelabel (support-payhey, support-rulex, support-gl-pay) cascade-deleted. Caddy fallout from this is also fixed in earlier commits of this PR.
- fulldiveVR/wizeup-rooms-api 2026-05-12: namespace wize-rooms-api hosted both the likeclaw-us parent stack and the likeclaw-us-dev child. A routine merge to dev triggered the child's deploy. Pulumi plan: kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api" => "wize-rooms-api-likeclaw-us-dev". Namespace deleted, rooms-api.wizeup.app returned 502 until prod was manually re-deployed. (actions/runs/25725750825)

Fix: sdk.IgnoreChanges([]string{"metadata.name"}) on both Namespace registration sites — simple_container.go for client stacks, helpers.go's ensureNamespace for helm operator stacks.

Behavior:
- Fresh deploy of a new custom child stack: no prior state, no diff to ignore. Namespace created with the per-stackEnv name. PR #230's isolation goal preserved for new deploys.
- Existing custom child stack on its next pulumi up: state has metadata.Name=<legacy shared name>, program desires metadata.Name=<stackName>-<stackEnv>. IgnoreChanges suppresses the diff — no Replace scheduled, no delete attempted. State retains the legacy name. Service/Deployment/etc. that reference namespace.Metadata.Name().Elem() now resolve to the legacy name and continue to land in the shared namespace. Migration cost: zero. Consumer is back to the pre-#230 sharing model, but RetainOnDelete protects against the cross-sibling destroy cascade #230 was originally added to solve. Both hazards now defused.
- Existing consumer who actively wants per-stackEnv isolation: opt in by removing the legacy Namespace resource from Pulumi state (state edit; the k8s namespace itself stays put). The next pulumi up sees no prior namespace and registers a fresh one at the per-stackEnv name. The old k8s namespace continues to host the parent stack; the migrated child lives in the new isolated namespace.

This is the established codebase pattern: rds_postgres.go:45 and rds_mysql.go:55 use IgnoreChanges([]string{"storageEncrypted"}) for the same purpose — silence a default flip so it doesn't propose a destructive replacement on existing stacks.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
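A minimal sketch of what the registration-site change amounts to — import paths, the `sdk` alias and the surrounding helper are assumptions for illustration; only the resource options mirror the described fix:

```go
package kubernetes

import (
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// ensureNamespaceSketch shows the shape of the fix: fresh deploys still get
// the per-stackEnv name, while IgnoreChanges("metadata.name") keeps Pulumi
// from scheduling a Replace (and the cascade delete that follows) when the
// resource already exists in state under the legacy shared name.
func ensureNamespaceSketch(ctx *sdk.Context, desiredName string) (*corev1.Namespace, error) {
	return corev1.NewNamespace(ctx, desiredName, &corev1.NamespaceArgs{
		Metadata: &metav1.ObjectMetaArgs{
			Name: sdk.String(desiredName), // "<stackName>-<stackEnv>" after #230
		},
	},
		sdk.RetainOnDelete(true),                     // from #230: never cascade-delete on destroy
		sdk.IgnoreChanges([]string{"metadata.name"}), // this PR: a rename never schedules a Replace
	)
}
```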
…amespace

Codex round-2 review of efd2523 caught that IgnoreChanges("metadata.name") defeats its own purpose at two callsites that still bake the program-computed namespace name into resources downstream of the Namespace itself:

- caddyfile-entry annotation. The Service annotation template at simple_container.go:673 was using sanitizedNamespace as the reverse_proxy upstream namespace. With IgnoreChanges in place, a migrated stack's Service is created in the legacy shared namespace (because namespace.Metadata.Name() resolves to the legacy value), but the annotation pointed Caddy at <svc>.<NEW>.svc.cluster.local — DNS fails to resolve, Caddy 502s for the affected host.
- VPA. createVPA was called with sanitizedNamespace and built the VPA CRD with metadata.namespace = NEW name. The Deployment it targets lives in the legacy namespace, so the VPA sits orphaned and never scales the workload.

Both bugs ship the migration cascade fix (efd2523) without actually preventing 502s or autoscaling regression for migrated stacks.

Fix:
1. Caddyfile-entry template extracted to a local variable (caddyfileEntryTemplate). The same template is rendered twice:
   - synchronously into caddyfileEntry (string) for sc.CaddyfileEntry export — that's used as a change-hash signal by kube_run.go and intentionally tracks the desired-config view, not the migrated live state.
   - asynchronously into caddyfileEntryAnnotation (sdk.StringOutput) via namespace.Metadata.Name().ApplyT — resolves the namespace at apply time. For fresh deploys (liveNS == sanitizedNamespace), the callback returns the statically-rendered template verbatim, so byte output matches the legacy code path. For migrated stacks (liveNS != sanitizedNamespace), it re-applies placeholders with the live namespace and returns the new string.
2. Render failures inside the ApplyT callback are returned as errors wrapped with errors.Wrapf, NOT silently fallen back to the statically-rendered template. Falling back would re-introduce the exact migrated-stack 502 bug this commit is fixing. Codex review flag — the silent fallback was the wrong failure mode.
3. Service/Ingress annotation maps switched from sdk.ToStringMap(map[string]string) to a manually-built sdk.StringMap so the caddyfile-entry value can be an Output while the rest stay static. Equivalent for static-only entries.
4. createVPA signature: namespace string → namespace sdk.StringInput. The metadata.Namespace field directly accepts the Pulumi input. Caller now passes namespace.Metadata.Name().Elem(), which is the live Namespace.metadata.name Output.

Verification:
- go build ./... clean
- go test ./pkg/clouds/pulumi/kubernetes/... -count=1 passes
- Three rounds of parallel codex + gemini review on the namespace work; this commit addresses the round-3 follow-ups (template duplication cleaned up, ApplyT error propagation made fatal).

Pairs with efd2523 (the IgnoreChanges fix) as the complete cascade-prevention story: efd2523 stops the Namespace itself from being Replace-deleted, this commit stops the downstream resources (Service annotation + VPA) from drifting away from where the Namespace actually landed.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
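A compressed sketch of the apply-time rendering in point 1 — function and parameter names are illustrative, and the placeholder re-substitution shown is an assumed simplification of the real template rendering:

```go
package kubernetes

import (
	"strings"

	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// renderCaddyfileEntryAnnotation sketches the fix in point 1: the annotation
// value becomes an Output resolved from the live Namespace name, so a migrated
// stack (live name != program-computed name) gets a reverse_proxy upstream in
// the namespace the Service actually lives in.
func renderCaddyfileEntryAnnotation(
	liveNamespaceName sdk.StringOutput, // namespace.Metadata.Name().Elem()
	sanitizedNamespace string, // program-computed (post-#230) name
	staticEntry string, // template already rendered with sanitizedNamespace
) sdk.StringOutput {
	return liveNamespaceName.ApplyT(func(liveNS string) (string, error) {
		if liveNS == sanitizedNamespace {
			// fresh deploy: byte-identical to the legacy code path
			return staticEntry, nil
		}
		// migrated stack: re-point <svc>.<ns>.svc.cluster.local at the live namespace
		// (assumed simplification of the real placeholder re-rendering)
		return strings.ReplaceAll(staticEntry,
			"."+sanitizedNamespace+".svc.cluster.local",
			"."+liveNS+".svc.cluster.local"), nil
	}).(sdk.StringOutput)
}
```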
**Namespace work — three review rounds (codex + gemini parallel each round)**

The IgnoreChanges fix in efd2523 was incomplete. Multi-round review caught it and the follow-up is in 1d248d9.

**Round 1 — clean, both approved**

Both confirmed the semantics: fresh deploys get the per-stackEnv name, migrated stacks keep the legacy shared namespace, and isolation remains available as an opt-in state edit.

**Round 2 — codex caught a critical bug, gemini missed it**

Codex found that two callsites in simple_container.go still used the program-computed namespace name downstream of the Namespace itself — the caddyfile-entry annotation (reverse_proxy upstream) and createVPA. Gemini approved the diff in round 2 without spotting either of these. Codex's deeper code-path tracing caught it.

**Round 3 — both reviewers, same diff**

Verdict: both approve the round-2 fix in principle but flag two follow-ups: the duplicated template rendering and the silent fallback inside the ApplyT callback. Both addressed in 1d248d9: the template is extracted to a single local variable rendered twice, and ApplyT render failures now propagate as errors instead of silently falling back.

**Final PR state**

11 commits. Both halves of the post-mortem covered: the namespace Replace cascade (IgnoreChanges + the downstream annotation/VPA follow-up) and the Caddy fallout (dedup, pipefail hardening, loud 503 catch-all). Ready for merge.
Dev review pointed out the inconsistency in commit e5a6519 + d7b4d71: the default catch-all 503 was inlined as a string literal in caddy.go, while the other status pages (404, 500, 502) are still served from /etc/caddy/pages/{code}.html via the handle_bucket_error / handle_server_error snippets in embed/caddy/Caddyfile. Two different mechanisms for the same class of response.

Refactor to the same file-based pattern:
- New pages/503.html with the SC-operator instruction body that was previously inlined ("No backend route is configured for this host" + hint to check the simple-container.com/caddyfile-entry annotation).
- caddy.go's default catch-all switches from
      respond "<html>...</html>" 503 { close }
  to
      root * /etc/caddy/pages
      rewrite * /503.html
      file_server { status 503 }
- Drops the explicit header Content-Type — file_server emits it automatically from the .html extension.
- index.html is NOT restored; the 200-OK welcome page was the original failure mode, replaced now by 503.html.

Wins:
- Symmetry: one pattern for every status page in the codebase.
- Operator override: a cluster operator can mount a ConfigMap at /etc/caddy/pages/503.html to customize the body (branded outage page, i18n, etc.) without touching SC api code.
- Smaller raw string in caddy.go; the HTML body is no longer inlined via a one-line `respond "<!doctype...>"` blob.

Verified live with simplecontainer/caddy:latest serving the assembled Caddyfile + the embedded pages dir mounted at /etc/caddy/pages:
  Host: example.com (matched) → HTTP 200
  Host: support-bot.pay.space (unmatched) → HTTP 503
    Content-Type: text/html; charset=utf-8
    Cache-Control: no-store
    Retry-After: 60
    body = pages/503.html
  Same + X-Forwarded-Proto: http → HTTP 503 (no HSTS redirect, since the catch-all still doesn't `import hsts` — see e5a6519's comment).

Build clean, go test ./pkg/clouds/pulumi/kubernetes/... passes.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
**Followup: default 503 refactored from inline HTML to file-based**

Dev review feedback on the original inlined 503 body (see the commit above); 3b6d44b unifies it with the existing /etc/caddy/pages status-page pattern:
http:// {
import gzip
handle {
root * /etc/caddy/pages
rewrite * /503.html
header Cache-Control "no-store"
header Retry-After "60"
file_server {
status 503
}
}
}

Wins: one pattern for every status page, operator override via a ConfigMap mounted at /etc/caddy/pages/503.html, and a smaller raw string in caddy.go (details in the commit message above).

Verified live with simplecontainer/caddy:latest — full probe matrix in the commit message above.

Review: single round, codex + gemini in parallel.

PR is at 12 commits now. Ready for merge.
Two consumer outages, same root cause: SC api #230's namespace rename triggers a Pulumi Replace that cascade-deletes the shared parent namespace on the first `pulumi up` after #230 ships. Plus, the Caddy fallout from that cascade was invisible to monitoring.

**Confirmed outages**

- PAY-SPACE, 2026-05-10/11: every whitelabel with `parentEnv: production` (`support-payhey`, `support-rulex`, `support-gl-pay`, parallel wallets) cascade-deleted the shared `support-bot` / wallet namespaces. Caddy then served the welcome page on every prod host as HTTP 200, hiding the outage from monitoring until a human opened a browser.
- fulldiveVR/wizeup-rooms-api, 2026-05-12: namespace `wize-rooms-api` hosted both the `likeclaw-us` parent and the `likeclaw-us-dev` child. A routine merge to dev triggered the child's deploy. Pulumi plan: `kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api" => "wize-rooms-api-likeclaw-us-dev"`. Namespace deleted, `rooms-api.wizeup.app` returned 502 until prod was manually re-deployed. (actions/runs/25725750825)

Anyone with `parentEnv != stackEnv` whose Pulumi state predates #230 is at risk on their next `pulumi up`. Two confirmed so far; likely more.

**Root-cause fix — `IgnoreChanges("metadata.name")` on Namespace**

#230 added `RetainOnDelete(true)` expecting that to protect existing consumers through the migration. It didn't: Pulumi reads delete-time options from the state of the resource being deleted, not from the current program. The old Namespace resource in state predates #230 and doesn't carry the flag, so the Replace proceeds with the k8s DELETE and cascade-kills the parent.

efd2523 adds `sdk.IgnoreChanges([]string{"metadata.name"})` to both `corev1.NewNamespace` call sites (`simple_container.go` client stacks + `helpers.go` helm operator stacks).

Behavior:

- Fresh deploy of a new custom child stack: no prior state, no diff to ignore — the namespace is created with the per-stackEnv name, preserving #230's isolation goal.
- Existing custom child stack on its next `pulumi up`: state's old `metadata.Name` vs the program's new desired `metadata.Name` would normally schedule a Replace; `IgnoreChanges` suppresses that diff. No Replace, no delete, no cascade. State retains the legacy name. Service / Deployment / etc. follow that name and continue to land in the shared namespace. Migration cost: zero.
- Opt-in isolation: remove the legacy `Namespace` resource from Pulumi state (state-edit; the k8s namespace itself stays). The next `pulumi up` registers a fresh namespace at the new name.

Established codebase pattern: `rds_postgres.go:45` and `rds_mysql.go:55` use the same shape (`IgnoreChanges([]string{"storageEncrypted"})`) for the same purpose — silence a default flip so it doesn't propose a destructive replacement on existing stacks.

**Caddy fallout fixes (also in this PR)**

When the cascade-delete in the root-cause path finished, every Service with `simple-container.com/caddyfile-entry` for the affected hosts disappeared. Two distinct Caddy failure modes followed:

- Aggregator crashloop during the Replace window — for the brief moment the old + new Services coexisted, two `http://<domain> { ... }` site blocks ended up in `/tmp/Caddyfile` and Caddy aborted with `ambiguous site definition: http://<domain>`. Commits 2e0eeae + 1abd3c1: dedup by site-address (first non-blank, non-comment line of the annotation, whitespace-trimmed), most-recent Service wins via `creationTimestamp` + `sort -r`, `set -eo pipefail` so a flaky kubectl can't silently produce an empty config.
- Default catch-all served HTTP 200 + welcome page — after the cascade finished, requests for production hosts fell through to `http:// { file_server /etc/caddy/pages }` and got `200 OK "Default page"`. External monitoring, CDNs, uptime checks all saw healthy 200s. Commits e5a6519 + d7b4d71 + 328e796: the default block now returns 503 with `Retry-After: 60`, `Cache-Control: no-store`, `Content-Type: text/html`, wrapped in an explicit `handle { ... }` so headers + body apply only to the 503 path. Removed `import hsts` from the catch-all so the 503 reaches monitoring directly instead of redirecting into a TLS handshake failure for unknown SNI.

Operational hardening — 95730bf: dropped `set -x` so annotation bodies aren't traced to cluster logs.

Dead code removal: `/etc/caddy/pages/index.html` (the "Default page" template) deleted, no longer referenced. `404/500/502.html` retained — still used by per-Service `handle_*_error` snippets.

**Review provenance**

This PR has been through four rounds of parallel codex + gemini review on the Caddy half. Convergent on "mergeable" in round 4. Each fixup commit captures one round's findings; the commit history is intentionally not squashed so the review trail is auditable. The comments above on the PR record the round-by-round summaries.

The namespace-root-cause commit (efd2523) is fresh — needs its own review pass before merge.

**Test plan**

Unit:
- `go build ./...` clean
- `go test ./pkg/clouds/pulumi/kubernetes/... -count=1` passes

Behavioral (manual, post-merge with branch preview):
- Fresh custom child stack deploys into a namespace named `<stackName>-<stackEnv>`.
- Existing custom child stack deploys with no Namespace Replace (the `metadata.name` diff is suppressed by `IgnoreChanges`). All other resources (Service, Deployment, …) unchanged.
- Caddy catch-all / dedup behavior re-verified against `simplecontainer/caddy:latest`.

**Followup**

Memory recorded for next time: this is the second SC migration in two days where a `metadata.name` change was assumed to be safe under `RetainOnDelete`. Future SC changes to `metadata.name` of any long-lived resource should default to `IgnoreChanges` from the start, not be retrofitted after an outage.