
fix(k8s+caddy): stop namespace-rename cascade-delete + Caddy fallout (PAY-SPACE 2026-05-10, fulldiveVR 2026-05-12 outages) #255

Merged
Cre-eD merged 13 commits into main from fix/caddy-aggregator-dedup-on-rename
May 13, 2026

Conversation

@Cre-eD Cre-eD commented May 11, 2026

Two consumer outages, same root cause: SC api #230's namespace rename triggers a Pulumi Replace that cascade-deletes the shared parent namespace on the first pulumi up after #230 ships. Plus the Caddy fallout from that cascade was invisible to monitoring.

Confirmed outages

  • PAY-SPACE 2026-05-10/11: the deploys of every whitelabel under parentEnv: production (support-payhey, support-rulex, support-gl-pay, parallel wallets) cascade-deleted the shared support-bot / wallet namespaces. Caddy then served the welcome page on every prod host as HTTP 200, hiding the outage from monitoring until a human opened a browser.
  • fulldiveVR/wizeup-rooms-api 2026-05-12: namespace wize-rooms-api hosted both the likeclaw-us parent and the likeclaw-us-dev child. A routine merge to dev triggered the child's deploy. Pulumi plan: kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api" => "wize-rooms-api-likeclaw-us-dev". Namespace deleted, rooms-api.wizeup.app returned 502 until prod was manually re-deployed. (actions/runs/25725750825)

Anyone with parentEnv != stackEnv whose Pulumi state predates #230 is at risk on their next pulumi up. Two confirmed so far; likely more.

Root-cause fix — IgnoreChanges("metadata.name") on Namespace

#230 added RetainOnDelete(true) expecting that to protect existing consumers through the migration. It didn't: Pulumi reads delete-time options from the state of the resource being deleted, not from the current program. The old Namespace resource in state predates #230 and doesn't carry the flag, so Replace proceeds with the k8s DELETE and cascade-kills the parent.

efd2523 adds sdk.IgnoreChanges([]string{"metadata.name"}) to both corev1.NewNamespace call sites (simple_container.go client stacks + helpers.go helm operator stacks).
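For reference, a minimal sketch of the shape of the change (Pulumi Go SDK; resource and variable names here are illustrative, not the exact call sites):

```go
package main

import (
	"fmt"

	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	sdk.Run(func(ctx *sdk.Context) error {
		stackName, stackEnv := "support-payhey", "production" // illustrative values
		ns, err := corev1.NewNamespace(ctx, "namespace", &corev1.NamespaceArgs{
			Metadata: &metav1.ObjectMetaArgs{
				// Desired per-stackEnv name from #230; for pre-existing state
				// the diff against the legacy shared name is suppressed below.
				Name: sdk.String(fmt.Sprintf("%s-%s", stackName, stackEnv)),
			},
		},
			// Protects state written from now on; does NOT protect resources
			// whose state predates #230 (delete-time options come from state).
			sdk.RetainOnDelete(true),
			// The actual fix: a metadata.name change never schedules a Replace.
			sdk.IgnoreChanges([]string{"metadata.name"}),
		)
		if err != nil {
			return err
		}
		// Downstream resources must follow the live name, not the computed one.
		ctx.Export("namespace", ns.Metadata.Name().Elem())
		return nil
	})
}
```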

Behavior:

  • Fresh deploy of a new custom child stack: no prior state → no diff to ignore → namespace created with the per-stackEnv name. #230's isolation goal is preserved.
  • Existing custom child stack on next pulumi up: state's old metadata.Name vs program's new desired metadata.Name would normally schedule a Replace; IgnoreChanges suppresses that diff. No Replace, no delete, no cascade. State retains the legacy name. Service / Deployment / etc. follow that name and continue to land in the shared namespace. Migration cost: zero.
  • Existing consumer who actively wants the new isolated namespace: opt-in by removing the legacy Namespace resource from Pulumi state (state-edit; the k8s namespace itself stays). The next pulumi up registers a fresh namespace at the new name — example below.
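That opt-in is standard pulumi state surgery; a hedged example (the URN is illustrative — list the real one first, and note that jq is assumed to be available):

```bash
# Find the legacy Namespace resource's URN in the stack state:
pulumi stack export | jq -r '.deployment.resources[].urn' | grep Namespace

# Remove only the Pulumi state entry — the k8s namespace itself is untouched:
pulumi state delete 'urn:pulumi:production::myproj::kubernetes:core/v1:Namespace::namespace'
```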

Established codebase pattern: rds_postgres.go:45 and rds_mysql.go:55 use the same shape (IgnoreChanges([]string{"storageEncrypted"})) for the same purpose — silence a default flip so it doesn't propose a destructive replacement on existing stacks.

Caddy fallout fixes (also in this PR)

When the cascade-delete in the root-cause path finished, every Service with simple-container.com/caddyfile-entry for the affected hosts disappeared. Two distinct Caddy failure modes followed, plus one piece of operational hardening:

  1. Aggregator crashloop during the Replace window — for the brief moment the old + new Services coexisted, two http://<domain> { ... } site blocks ended up in /tmp/Caddyfile and Caddy aborted with ambiguous site definition: http://<domain>. Commits 2e0eeae + 1abd3c1: dedup by site-address (first non-blank, non-comment line of the annotation, whitespace-trimmed), most-recent Service wins via creationTimestamp + sort -r, set -eo pipefail so a flaky kubectl can't silently produce an empty config.

  2. Default catch-all served HTTP 200 + welcome page — after the cascade finished, requests for production hosts fell through to http:// { file_server /etc/caddy/pages } and got 200 OK "Default page". External monitoring, CDNs, uptime checks all saw healthy 200s. Commits e5a6519 + d7b4d71 + 328e796: default block now returns 503 with Retry-After: 60, Cache-Control: no-store, Content-Type: text/html, wrapped in an explicit handle { ... } so headers + body apply only to the 503 path. Removed import hsts from the catch-all so the 503 reaches monitoring directly instead of redirecting into a TLS handshake failure for unknown SNI. (The resulting catch-all is sketched after this list.)

  3. Operational hardening — 95730bf: dropped set -x so annotation bodies aren't traced to cluster logs.
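For orientation, the shape of the hardened catch-all at this point in the PR (illustrative — the exact block lives in caddy.go, and a later commit swaps the inline body for a file-based 503 page):

```Caddyfile
http:// {
	handle {
		# Headers scoped to the 503 path only — not leaked onto other routes.
		header Content-Type "text/html; charset=utf-8"
		header Cache-Control "no-store"
		header Retry-After "60"
		respond "<!doctype html><h1>503: no route configured for this host</h1>" 503 {
			close
		}
	}
	# Deliberately no `import hsts`: the redirect would hide the 503 behind a
	# TLS handshake failure for unknown SNI.
}
```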

Dead code removal: /etc/caddy/pages/index.html (the "Default page" template) deleted, no longer referenced. 404/500/502.html retained — still used by per-Service handle_*_error snippets.

Review provenance

This PR has been through four rounds of parallel codex + gemini review on the Caddy half. Convergent on "mergeable" in round 4. Each fixup commit captures one round's findings; commit history is intentionally not squashed so the review trail is auditable. Comments above on the PR record the round-by-round summaries.

The namespace-root-cause commit (efd2523) is fresh — needs its own review pass before merge.

Test plan

Unit:

  • go build ./... clean
  • go test ./pkg/clouds/pulumi/kubernetes/... -count=1 passes

Behavioral (manual, post-merge with branch preview):

  • Fresh deploy of a new custom child stack: Namespace created with <stackName>-<stackEnv>.
  • Existing PAY-SPACE / fulldiveVR consumer deploy: Pulumi diff for Namespace should show NO Replace (the metadata.name diff is suppressed by IgnoreChanges). All other resources (Service, Deployment, …) unchanged.
  • Caddy: validates clean, returns 503 with the right headers for an unknown Host, 200 for matched site blocks. Verified in the commit message of e5a6519 with the real simplecontainer/caddy:latest.

Followup

Memory recorded for next time: this is the second SC migration in two days where a metadata.name change was assumed to be safe under RetainOnDelete. Future SC changes to metadata.name of any long-lived resource should default to IgnoreChanges from the start, not retrofit after an outage.

When the namespace-naming change from #230 lands on a consumer, Pulumi
schedules a Replace on every custom-stack namespace (parentEnv !=
stackEnv). During the brief create-replacement + delete-replaced window
the Service carrying `simple-container.com/caddyfile-entry` exists in
*both* the old and new namespaces. The Caddy aggregator script
concatenated annotations from `kubectl get services --all-namespaces`
without dedup, producing two identical `http://<domain> { ... }` site
blocks in `/tmp/Caddyfile`. Caddy aborted with `ambiguous site
definition` and crashlooped until the old Service was collected.

PAY-SPACE hit this in production on 2026-05-11 — `support-payhey.pay.space`
was the visible victim because it sorts alphabetically before its
siblings, but every whitelabel that migrated through the rename traversed
the same transient duplicate.

Fix (sketch after this list):
- Include `creationTimestamp` in the jsonpath listing and `sort -r` so the
  most-recently-created Service is processed first.
- Track emitted site-address keys in a tempfile. The dedup key is the
  first non-blank line of each annotation — for domain entries that's
  `http://<domain> {` or `https://<domain> {`, for prefix entries it's
  `handle_path /<prefix>*`. Both transports are guarded.
- Older Service for a key already emitted is skipped with a log line, so
  the picked winner is observable in the init-container output.
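A hedged sketch of that pass (bash; the jsonpath layout, file paths, and log messages are illustrative — the embedded init-container script differs in detail):

```bash
#!/bin/bash
set -eo pipefail

# Most-recently-created Service first, so the newest annotation wins.
raw=$(kubectl get services --all-namespaces -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{"\t"}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}')
services=$(printf '%s\n' "$raw" | sort -r)

seen=$(mktemp)   # emitted dedup keys; a file so it survives the subshell below
printf '%s\n' "$services" | while IFS=$'\t' read -r ts ns svc; do
  entry=$(kubectl get service "$svc" -n "$ns" \
    -o jsonpath='{.metadata.annotations.simple-container\.com/caddyfile-entry}')
  [ -n "$entry" ] || continue
  # Dedup key: first non-blank, non-comment line, whitespace-trimmed.
  key=$(printf '%s\n' "$entry" | awk '
    /^[[:space:]]*$/ { next }
    /^[[:space:]]*#/ { next }
    { sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, ""); print; exit }')
  if grep -Fxq "$key" "$seen"; then
    echo "Skipping duplicate caddyfile-entry ($key) from $ns/$svc"
    continue
  fi
  printf '%s\n' "$key" >> "$seen"
  printf '%s\n\n' "$entry" >> /tmp/Caddyfile
done
```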

Verified offline against a synthetic three-Service set (new-ns/example
and old-ns/example both declaring `http://example.com`, plus unrelated
`other.com`): output Caddyfile has exactly one `http://example.com` block
and its `reverse_proxy` resolves to new-ns. Module builds clean,
`go test ./pkg/clouds/pulumi/kubernetes/...` passes.

The fix is independent of #230's `RetainOnDelete` migration semantics —
even after that path is hardened, any future namespace-shape change or
Service-Replace will see the same overlap window. This makes the Caddy
ingress tolerant of it rather than crashlooping.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

github-actions Bot commented May 11, 2026

Semgrep Scan Results

Repository: api | Commit: 7d875a4

| Check   | Status  | Details                              |
|---------|---------|--------------------------------------|
| Semgrep | ✅ Pass | 0 total findings (no error/warning)  |

Scanned at 2026-05-13 13:52 UTC


github-actions Bot commented May 11, 2026

Security Scan Results

Repository: api | Commit: 7d875a4

| Check                 | Status       | Details                      |
|-----------------------|--------------|------------------------------|
| Secret Scan           | ✅ Pass      | No secrets detected          |
| Dependencies (Trivy)  | ⚠️ High      | 1 high, 1 total              |
| Dependencies (Grype)  | ⚠️ High      | 1 high, 1 total              |
| SBOM                  | 📦 Generated | 470 components (CycloneDX)   |

Scanned at 2026-05-13 13:52 UTC

Codex caught a critical regression I introduced: the new
`kubectl ... | sort -r` pipeline under `set -e` (no pipefail) silently
collapsed to `services=""` whenever kubectl failed, and the script
exited successfully. Caddy would then start with only the default
`http:// { file_server }` block and every domain would serve the
welcome page on the next pod restart — the same masquerading-as-200
failure mode that took prod down on 2026-05-10. Hard miss; would have
made the original outage repeatable on any transient kubectl flake.

Changes:

- `set -xeo pipefail`. A kubectl error now fails the init-container
  fast; K8s reschedules and retries instead of cementing a partial
  config.
- Split the `kubectl | sort` into two assignments so the failure mode
  is unambiguous even if a future reader doesn't notice the pipefail.
- Normalize the dedup key in awk: skip blank lines, skip comment lines,
  trim leading/trailing whitespace. For SC-generated annotations this
  is functionally a no-op (their first non-blank line is deterministic),
  but it makes the dedup robust against indentation differences and
  user-authored caddyfile-entry annotations with header comments —
  gemini's concern.
- Switched `echo "$services" | while` to `printf '%s\n'` to keep the
  pipeline shell-portable when `$services` could contain backslashes.

Offline verification: pipefail now exits 1 on kubectl failure; dedup
key normalization collapses `  http://example.com {` (indented, new)
and `http://example.com {` (flush, old) to the same key; comment-led
annotations still emit with the right key.

Followups intentionally NOT in scope here:

1. Retroactive `RetainOnDelete` for namespace resources whose state
   predates #230 — the actual prod-killer. Both reviewers explicitly
   called out that this PR does not fix it.
2. Caddy default-block hardening — serve a hard 503 instead of
   file_server on /etc/caddy/pages when no Service block matches, so
   the absence of routes is loud instead of disguised as healthy 200s.

Both will be follow-up PRs.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 11, 2026

Review pass — codex + gemini (parallel)

Ran a parallel review with both tools. Headline: codex caught a critical regression I introduced; gemini raised some defensive concerns. Pushed 1abd3c1 addressing both.

Codex — the serious one

kubectl get ... | sort -r under set -e without pipefail — a kubectl error/timeout becomes services="" and the init-container exits successfully with only the default Caddyfile.

This automated the exact welcome-page outage scenario from 2026-05-10. Without pipefail, any transient kubectl flake on a Caddy pod restart silently produces a config with no Service routes, the catch-all http:// { file_server } serves the welcome page on every domain, and monitoring sees 200 OK. Hard miss on my part.
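A minimal repro of why pipefail matters here (illustrative, not the aggregator script itself):

```bash
#!/bin/bash
set -e
# Without pipefail the pipeline's status is sort's (0), so under `set -e`
# the script keeps going with an empty result:
services=$(false | sort -r)
echo "survived: services='${services}'"

set -o pipefail
# Now the pipeline reports false's failure and `set -e` aborts here:
services=$(false | sort -r)
echo "never reached"
```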

Fix in 1abd3c1:

  • set -xeo pipefail
  • Split kubectl/sort into separate assignments to make the failure path explicit even for readers who skip the set line

Gemini — defensive

Flagged the dedup key as brittle in theory: leading whitespace, comment lines, multi-host comma-separated site addresses. For SC-generated annotations this is moot (first line is deterministic ${proto}://${domain} { or handle_path /${prefix}*), but indentation/comment normalization is cheap defense.

Fix in 1abd3c1:

/^[[:space:]]*$/ { next }
/^[[:space:]]*#/ { next }
{ sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, ""); print; exit }

Verified offline: `  http://example.com {` (new, indented) and `http://example.com {` (old, flush) now produce the same dedup key. Comment-led annotations correctly emit with the underlying site block as the key.

Multi-host comma-order normalization (gemini's host1, host2 { vs host2, host1 {) is intentionally not addressed — not a case SC generates, and lexically sorting hosts inside a key would require a Caddyfile parser. Defer to a real follow-up if/when the use case appears.

Both reviewers explicitly confirm

This PR fixes only the secondary crashloop. It does not fix the actual prod-killer (non-retroactive RetainOnDelete cascade-deleting the shared parent namespace on first pulumi up after the migration). Two follow-up PRs queued:

  1. Retroactive RetainOnDelete path so namespace resources that entered Pulumi state before #230 don't get cascade-deleted during the Replace
  2. Caddy default-block hardening: serve a hard 503 instead of file_server /etc/caddy/pages when no Service block matches, so route absence is loud instead of disguised as healthy 200s

Both follow-ups are real bugs. This one's safe to merge first as the immediate symptom containment.

When all Services with a `simple-container.com/caddyfile-entry`
annotation for a given Host disappear — for example, a cascade-deletion
from a namespace Replace gone wrong — requests fell through to the
catch-all `http:// { file_server /etc/caddy/pages }` block and got back
HTTP 200 + "Default page" from index.html. External monitoring saw
healthy 200s. CDNs and load balancers saw 200s. Pingdom / UptimeRobot /
the dashboard everyone trusts saw 200s. The outage was invisible to
every layer that wasn't deep-inspecting the response body.

PAY-SPACE hit this on 2026-05-10: the migration from SC #230 cascade-
deleted the shared parent namespace, every Service annotation for
production hosts evaporated, and every domain pointing at the cluster
served the Caddy welcome page. The outage was only noticed when a human
opened a browser tab.

Change:
- Default catch-all now uses `respond ... 503 { close }` instead of
  `file_server /etc/caddy/pages`.
- Retry-After: 60 so CDNs back off appropriately and clients know to
  retry rather than treating 503 as a hard failure.
- Cache-Control: no-store so an aggressive cache doesn't pin the 503
  state past route recovery.
- HTML body still rendered for humans visiting in a browser, but it's
  now a 503 page that names the problem (missing
  `simple-container.com/caddyfile-entry` annotation) and tells operators
  what to check. The literal "Default page" string is gone.

Behavior verified by running the Caddy image with the new default block:

  configured host (Host: example.com)     → HTTP 200
  unmatched host (Host: support-bot.pay.space) → HTTP 503
    Retry-After: 60
    Cache-Control: no-store

`caddy validate` against the full embedded Caddyfile + new default block
+ a sample matched site passes clean.

The /etc/caddy/pages directory (index.html, 404.html, 502.html, 500.html)
is still embedded and used by the `handle_bucket_error` and
`handle_server_error` snippets for legitimate per-Service error
fallbacks — only the catch-all stopped serving it as a 200.

Pairs with #255 (Caddy aggregator dedup) as the two halves of the
2026-05-10 PAY-SPACE outage: dedup keeps the aggregator from
crashlooping during a Service transition, this PR keeps the absence of
routes loud so it doesn't masquerade as a healthy 200.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD changed the title fix(caddy): dedup caddyfile-entry annotations during Service transitions fix(caddy): harden aggregator dedup + default block against route-vanish outage (post-mortem 2026-05-10) May 11, 2026
Cre-eD added 5 commits May 11, 2026 22:30
Five blockers/issues from the parallel review:

1. **Content-Type silently became text/plain** (codex, blocking). Caddy's
   `respond` defaults the response to text/plain when no explicit
   Content-Type is set on the route, so browsers visiting the catch-all
   saw the raw HTML as literal text. Fix: `header Content-Type
   "text/html; charset=utf-8"` inside the response path. Verified with
   curl against caddy v2.11.2: Content-Type now reports text/html.

2. **Headers leaked onto the HSTS 301 path** (codex, important). With
   `import hsts` appended, the `header Cache-Control "no-store"` and
   `header Retry-After "60"` directives applied to BOTH the 503 AND the
   `redir` 301-to-HTTPS that hsts adds. That's wrong for a 301 — clients
   shouldn't be told to retry a permanent redirect, and `no-store`
   defeats the redirect cache. Fix: wrap the headers + respond in an
   explicit `handle { ... }` so they only fire on the 503 path.

3. **HSTS redirect made the 503 unreachable behind a CDN** (codex,
   important; gemini noticed but called it acceptable — codex is right).
   Caddy directive ordering runs `redir` before `respond`. A request
   with `X-Forwarded-Proto: http` (which Cloudflare/GCP LB/most modern
   CDNs set) matched hsts's `@httpReq` matcher and got a 301 to HTTPS
   for the unknown host — then failed the TLS handshake because Caddy
   has no cert for the unknown SNI. The user-visible result was a
   browser-level TLS error, invisible to HTTP-layer monitoring — exactly
   the failure mode this PR is trying to fix. Fix: omit `import hsts`
   from the catch-all entirely. HSTS on a Host-agnostic catch-all is
   semantically meaningless anyway (the header tells browsers "always
   use HTTPS for THIS host", but the catch-all answers any host).
   Per-Service site blocks still get HSTS via their own `import hsts`.
   Verified: `Host: support-bot.pay.space` with `X-Forwarded-Proto:
   http` now returns 503 directly instead of 301.

4. **Stale comment in the dedup section** (codex). The pipefail
   rationale comment still said a kubectl failure would "serve the
   welcome page from /etc/caddy/pages". With commit 3 in this PR the
   welcome page is gone; the failure mode is "503 on every domain".
   That's still a complete loss of routing for the cluster and worth
   bailing loud over, but the comment now describes the actual current
   behavior.

5. **/etc/caddy/pages/index.html is dead** (codex + gemini). Was only
   referenced by the old `file_server` catch-all; the per-Service
   `handle_*_error` snippets only reference 404/500/502.html. Deleted.

Validation:

- `caddy validate` clean on the assembled Caddyfile
- `go build ./...` clean
- `go test ./pkg/clouds/pulumi/kubernetes/... -count=1` passes
- Live Caddy v2.11.2 probe matrix:
    Host: example.com (known)             → 200 "ok" text/plain
    Host: support-bot.pay.space (unknown) → 503 text/html  Cache-Control:no-store  Retry-After:60  Connection:close
    Host: support-bot.pay.space + XFP:http → still 503 (no 301 anymore)

Out of scope still: HTTPS catch-all for unknown SNI. Caddy doesn't
synthesize a cert for unknown SNIs without explicit `default_sni` +
matching wildcard cert config, which is per-cluster and not something
this fix should bake in. Direct TLS handshake failure remains the
behavior for unknown SNIs; the HTTP 503 path is what monitoring
actually pings.
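For concreteness, the per-cluster configuration that would be required looks roughly like this (illustrative only — deliberately not part of this change):

```Caddyfile
# Per-cluster global options; SC does not bake these in:
{
	default_sni fallback.example.com
}

# Requires a site whose cert can actually serve that SNI:
https://fallback.example.com {
	tls /etc/caddy/certs/wildcard.pem /etc/caddy/certs/wildcard.key
	respond "Service unavailable" 503
}
```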

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…ommit)

Previous commit d7b4d71 captured only the index.html deletion but the
caddy.go changes weren't staged (git rm + git add interaction). This
adds them: content-type, handle wrapper, hsts removal, stale comment.

See d7b4d71's commit message for the full rationale of all five
review findings — repeated here for completeness:

1. respond defaults to text/plain — add `header Content-Type "text/html;
   charset=utf-8"` so browsers render the HTML body.
2. Cache-Control + Retry-After leaked onto the HSTS 301 path — wrap
   headers + respond in explicit `handle { ... }`.
3. HSTS redirect made the catch-all 503 unreachable behind CDNs that
   set X-Forwarded-Proto — drop `import hsts` from the catch-all.
4. Stale comment about welcome page failure mode — updated to reflect
   the new 503 failure mode.
5. (the index.html deletion, landed in d7b4d71)

Verified live against simplecontainer/caddy:latest:
  Host: example.com         → 200 "ok"
  Host: support-bot.pay.space → 503 text/html, Cache-Control: no-store,
                                Retry-After: 60, Connection: close
  Same Host + X-Forwarded-Proto: http → 503 (was 301 before)

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…re split

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Previous commit 2d290fe wrote comments with markdown-style backticks
inside the Go raw-string-delimited bash script literal, which closed
the raw string mid-comment and turned the rest into invalid Go
("syntax error: unexpected name printf in composite literal").

Replaced with plain text (printf-to-sort, printf-to-while-read).
Should have built before pushing. `go build ./...` clean now.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Gemini round-3 review flagged `set -x` as a security regression: tracing
every command prints the raw caddyfile-entry annotation body (and the
output of `kubectl get service ...`) to stdout, which lands in cluster
logging (GCP/Datadog/ELK). SC-generated annotations don't carry secrets,
but consumer-side misuse — basicauth credentials in `Headers` map, or
raw Caddy directives in `LbConfig.ExtraHelpers` — could template into
the annotation body and leak via -x.

The init container is rarely debugged live (when it is, an operator can
override the command), so the debuggability cost is low. The script
still emits informative one-line `Processing service: $service in
namespace: $ns` and `Skipping duplicate caddyfile-entry ...` messages
without -x.

Kept: `cat /tmp/Caddyfile` at the end. That's the assembled config the
Caddy server actually loads; printing it is useful for verifying
rollouts and is consistent with prior behavior. If a consumer puts
secrets into per-Service annotations they leak there too, but it's
intentional logging of the deployed config, not an incidental
per-command trace.

Codex round-3 verdict was "clean, merge" but acknowledged the same
exposure existed via `cat`. I'm siding with gemini on -x because the
trace exposure compounds (every kubectl invocation × every Service ×
every pod restart) while `cat` is a single final dump.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 11, 2026

Multi-round codex + gemini parallel review — converged

Four rounds, two reviewers per round. Both reach "mergeable" in round 4.

Round 1 — codex blocked the PR, gemini was too lenient

Codex caught the critical regression I'd introduced: kubectl ... | sort -r under set -e without pipefail would silently produce services="" on any kubectl flake, automating the welcome-page-masquerade outage that the PR is trying to fix. Plus:

  • respond defaults Content-Type to text/plain → browsers saw raw HTML tags
  • header Cache-Control and header Retry-After leaked onto the HSTS 301 path
  • import hsts redirected catch-all HTTP→HTTPS→TLS-handshake-fail, making the 503 unreachable behind any CDN that sets X-Forwarded-Proto
  • Stale comment + dead index.html

Gemini noticed the HSTS issue but called it "still loud enough." Codex was correct — TLS-fail is invisible to HTTP monitoring, which is the whole point of this fix.

Pushed: d7b4d71 + 328e796 — set -eo pipefail, explicit Content-Type, handle { ... } wrapper, removed import hsts from catch-all, deleted index.html.

Round 2 — both clean

Codex: "Findings: none blocking. Mergeable now." Verified all five round-1 fixes with real Caddy testing across h2, prefix-mode, and per-Service-with-hsts.
Gemini: "Mergeable as-is."

One nit: pipefail comment said "kubectl piped into sort" but code now splits them. Pushed: 2d290fe — refreshed comment.

I then accidentally put markdown backticks inside the Go raw string and broke the build. Pushed: 9828eb5 — fix.

Round 3 — codex clean, gemini blocked

Gemini flagged set -x as a security regression: tracing every command dumps annotation bodies into cluster logs, and a consumer-misuse path (basicauth credentials in Headers map, raw directives in LbConfig.ExtraHelpers) could template secrets into the annotation. Codex noted the same exposure existed via cat /tmp/Caddyfile and called it pre-existing.

Sided with gemini. Pushed: 95730bf — dropped -x, kept set -eo pipefail. Kept cat /tmp/Caddyfile since that's intentional final-config logging, useful for rollout verification, and the trace-amplification (every kubectl invocation × every Service × every pod restart) was the bigger leak.

Round 4 — converged

Both reviewers: mergeable, no blockers. Codex confirmed set -eo pipefail preserves errexit + pipeline-failure behavior; only execution tracing was removed. Gemini explicitly endorsed keeping cat /tmp/Caddyfile as "a pragmatic trade-off" — agrees with my reasoning.

Net result

7 commits, from the original two-line dedup PR to a hardened, security-reviewed change that addresses both halves of the 2026-05-10 outage:

  • aggregator dedup + pipefail + key normalization (prevents the Caddy crashloop during Service Replace)
  • default catch-all returns 503 with text/html + Retry-After + Cache-Control: no-store, behind a handle { ... } scope that doesn't leak headers onto unrelated paths
  • removed catch-all import hsts so the 503 reaches monitoring directly instead of redirecting into a TLS failure
  • dropped set -x to reduce annotation-body exposure to cluster logs

Ready to merge.

… cascade

PR #230 changed custom-stack namespace naming from shared <stackName> to
per-stackEnv <stackName>-<stackEnv> and added RetainOnDelete(true),
expecting that to protect existing consumers through the migration
pulumi up. It didn't — Pulumi reads delete-time options from the state
of the resource being deleted, not from the current program. Existing
Namespace resources predate #230 and don't carry RetainOnDelete; when
the new code computed a different metadata.Name, Pulumi diffed against
state, scheduled a Replace, executed delete-old before create-new, and
sent k8s DELETE on the legacy shared namespace. K8s cascade-deleted
every resource inside, including the parent stack's production
resources that lived in the same shared namespace.

Confirmed outages:

- PAY-SPACE 2026-05-10/11: support-bot parent + every whitelabel
  (support-payhey, support-rulex, support-gl-pay) cascade-deleted.
  Caddy fallout from this is also fixed in earlier commits of this PR.

- fulldiveVR/wizeup-rooms-api 2026-05-12: namespace wize-rooms-api
  hosted both the likeclaw-us parent stack and the likeclaw-us-dev
  child. A routine merge to dev triggered the child's deploy. Pulumi
  plan: kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api"
  => "wize-rooms-api-likeclaw-us-dev". Namespace deleted,
  rooms-api.wizeup.app returned 502 until prod was manually
  re-deployed. (actions/runs/25725750825)

Fix: sdk.IgnoreChanges([]string{"metadata.name"}) on both Namespace
registration sites — simple_container.go for client stacks,
helpers.go's ensureNamespace for helm operator stacks.

Behavior:

- Fresh deploy of a new custom child stack: no prior state, no diff
  to ignore. Namespace created with the per-stackEnv name. PR #230's
  isolation goal preserved for new deploys.

- Existing custom child stack on its next pulumi up: state has
  metadata.Name=<legacy shared name>, program desires metadata.Name=
  <stackName>-<stackEnv>. IgnoreChanges suppresses the diff — no
  Replace scheduled, no delete attempted. State retains the legacy
  name. Service/Deployment/etc. that reference
  namespace.Metadata.Name().Elem() now resolve to the legacy name and
  continue to land in the shared namespace. Migration cost: zero.
  Consumer is back to the pre-#230 sharing model, but RetainOnDelete
  protects against the cross-sibling destroy cascade #230 was
  originally added to solve. Both hazards now defused.

- Existing consumer who actively wants per-stackEnv isolation: opt-in
  by removing the legacy Namespace resource from Pulumi state (state
  edit; k8s namespace itself stays put). Next pulumi up sees no prior
  namespace, registers a fresh one at the per-stackEnv name. Old k8s
  namespace continues to host the parent stack; the migrated child
  lives in the new isolated namespace.

This is the established codebase pattern: rds_postgres.go:45 and
rds_mysql.go:55 use IgnoreChanges([]string{"storageEncrypted"}) for
the same purpose — silence a default flip so it doesn't propose a
destructive replacement on existing stacks.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD changed the title fix(caddy): harden aggregator dedup + default block against route-vanish outage (post-mortem 2026-05-10) fix(k8s+caddy): stop namespace-rename cascade-delete + Caddy fallout (PAY-SPACE 2026-05-10, fulldiveVR 2026-05-12 outages) May 12, 2026
…amespace

Codex round-2 review of efd2523 caught that IgnoreChanges("metadata.name")
defeats its own purpose at two callsites that still bake the program-
computed namespace name into resources downstream of the Namespace
itself:

- caddyfile-entry annotation. The Service annotation template at
  simple_container.go:673 was using sanitizedNamespace as the
  reverse_proxy upstream namespace. With IgnoreChanges in place, a
  migrated stack's Service is created in the legacy shared namespace
  (because namespace.Metadata.Name() resolves to the legacy value), but
  the annotation pointed Caddy at <svc>.<NEW>.svc.cluster.local — DNS
  fails to resolve, Caddy 502s for the affected host.

- VPA. createVPA was called with sanitizedNamespace and built the VPA
  CRD with metadata.namespace = NEW name. The Deployment it targets
  lives in the legacy namespace, so the VPA sits orphaned and never
  scales the workload.

Either bug would have shipped the migration cascade fix (efd2523)
without actually preventing the 502s or the autoscaling regression
for migrated stacks.

Fix:

1. Caddyfile-entry template extracted to a local variable
   (caddyfileEntryTemplate). The same template is rendered twice:
   - synchronously into caddyfileEntry (string) for sc.CaddyfileEntry
     export — that's used as a change-hash signal by kube_run.go and
     intentionally tracks the desired-config view, not the migrated
     live state.
   - asynchronously into caddyfileEntryAnnotation (sdk.StringOutput)
     via namespace.Metadata.Name().ApplyT — resolves namespace at
     apply time. For fresh deploys (liveNS == sanitizedNamespace), the
     callback returns the statically-rendered template verbatim, so
     byte output matches the legacy code path. For migrated stacks
     (liveNS != sanitizedNamespace), it re-applies placeholders with
     the live namespace and returns the new string.

2. Render failures inside the ApplyT callback are returned as errors
   wrapped with errors.Wrapf, NOT silently fallen back to the
   statically-rendered template. Falling back would re-introduce the
   exact migrated-stack 502 bug this commit is fixing. Codex review
   flag — the silent fallback was the wrong failure mode. (Sketch
   after this list.)

3. Service/Ingress annotation maps switched from
   sdk.ToStringMap(map[string]string) to a manually-built sdk.StringMap
   so the caddyfile-entry value can be an Output while the rest stay
   static. Equivalent for static-only entries.

4. createVPA signature: namespace string → namespace sdk.StringInput.
   The metadata.Namespace field directly accepts the Pulumi input.
   Caller now passes namespace.Metadata.Name().Elem(), which is the
   live Namespace.metadata.name Output.
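A hedged Go sketch of items 1–2 (the helper, template, and parameter names are illustrative; the real code in simple_container.go differs):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/pkg/errors"
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// caddyfileEntryAnnotation renders the reverse_proxy upstream against the
// namespace the Service actually lands in, not the program-computed name.
func caddyfileEntryAnnotation(ns *corev1.Namespace, sanitizedNamespace, serviceName, port string) sdk.StringOutput {
	const caddyfileEntryTemplate = "reverse_proxy http://${service}.${namespace}.svc.cluster.local:${port}"

	render := func(liveNS string) (string, error) {
		if serviceName == "" || port == "" {
			return "", fmt.Errorf("missing placeholder values for %q", liveNS)
		}
		return strings.NewReplacer(
			"${service}", serviceName,
			"${namespace}", liveNS,
			"${port}", port,
		).Replace(caddyfileEntryTemplate), nil
	}

	// Synchronous render: sc.CaddyfileEntry keeps tracking the desired-config
	// view (program-computed namespace) as its change-hash signal.
	static, _ := render(sanitizedNamespace)

	return ns.Metadata.Name().Elem().ApplyT(func(liveNS string) (string, error) {
		if liveNS == sanitizedNamespace {
			return static, nil // fresh deploy: byte-identical to the legacy path
		}
		out, err := render(liveNS) // migrated stack: follow the legacy namespace
		if err != nil {
			// Propagate. Falling back to `static` here would silently
			// re-introduce the migrated-stack 502.
			return "", errors.Wrapf(err, "re-render caddyfile-entry for namespace %q", liveNS)
		}
		return out, nil
	}).(sdk.StringOutput)
}

func main() {} // placeholder so the sketch compiles standalone
```

The resulting Output then sits in a hand-built sdk.StringMap alongside the static annotation values, per item 3 above.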

Verification:
- go build ./... clean
- go test ./pkg/clouds/pulumi/kubernetes/... -count=1 passes
- Three rounds of parallel codex + gemini review on the namespace work;
  this commit addresses the round-3 follow-ups (template duplication
  cleaned up, ApplyT error propagation made fatal).

Pairs with efd2523 (the IgnoreChanges fix) as the complete cascade-
prevention story: efd2523 stops the Namespace itself from being
Replace-deleted, this commit stops the downstream resources (Service
annotation + VPA) from drifting away from where the Namespace actually
landed.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 12, 2026

Namespace work — three review rounds (codex + gemini parallel each round)

The IgnoreChanges fix in efd2523 was incomplete. Multi-round review caught it and the follow-up is in 1d248d9.

Round 1 — clean, both approved

Both confirmed semantics: IgnoreChanges("metadata.name") suppresses the Replace for existing state, has no effect on fresh deploys, downstream resources follow namespace.Metadata.Name().Elem() correctly. Codex even built a throwaway Pulumi stack and tested it live.

Round 2 — codex caught a critical bug, gemini missed it

Codex found that two callsites in simple_container.go still used the program-computed sanitizedNamespace (string) instead of the live namespace.Metadata.Name() Output:

  • caddyfile-entry annotation (line 673) templated sanitizedNamespace into reverse_proxy http://${service}.${namespace}.svc.cluster.local. On a migrated stack the Service lives in the legacy namespace but the annotation pointed Caddy at the new (non-existent) namespace → DNS doesn't resolve → 502. The exact failure mode efd2523 was supposed to prevent.
  • VPA (line 825 → createVPA → line 964) set metadata.namespace = sanitizedNamespace. VPA in NEW namespace, Deployment in LEGACY namespace → VPA never finds its target → autoscaling silently broken.

Gemini approved the diff in round 2 without spotting either of these. Codex's deeper code-path tracing caught it.

Round 3 — both reviewers, same diff

Verdict: both approve the round-2 fix in principle but flag two follow-ups:

  • Template duplication (both): the Caddyfile template literal was repeated inside the ApplyT callback (so a tweak to the main template wouldn't propagate to the live-render path). Both recommended extracting to a variable.
  • Silent fallback on ApplyT render failure is wrong (codex, blocking): if placeholders.Apply fails inside the callback, my code returned entryTemplate (the statically-rendered version) — which is exactly the bug we're fixing. Codex insisted this should propagate the error so Pulumi fails the update.

Both addressed in 1d248d9:

  • caddyfileEntryTemplate extracted; both initial sync render + ApplyT re-render share it
  • ApplyT callback now returns (string, error) and uses errors.Wrapf to propagate

Final PR state

11 commits. Both halves of the post-mortem covered:

  1. Migration cascade prevention — IgnoreChanges("metadata.name") on both Namespace registration sites + downstream resources (caddyfile-entry annotation, VPA) following the live namespace Output.
  2. Caddy fallout containment — aggregator dedup, default-block 503 (proper Content-Type, scoped headers, no HSTS redirect), removed set -x to limit annotation exposure in logs.

Ready for merge.

Dev review pointed out the inconsistency in commit e5a6519 + d7b4d71:
the default catch-all 503 was inlined as a string literal in caddy.go,
while the other status pages (404, 500, 502) are still served from
/etc/caddy/pages/{code}.html via the handle_bucket_error /
handle_server_error snippets in embed/caddy/Caddyfile. Two different
mechanisms for the same class of response.

Refactor to the same file-based pattern:

- New pages/503.html with the SC-operator instruction body that was
  previously inlined ("No backend route is configured for this host" +
  hint to check the simple-container.com/caddyfile-entry annotation).
- caddy.go's default catch-all switches from
    respond "<html>...</html>" 503 { close }
  to
    root * /etc/caddy/pages
    rewrite * /503.html
    file_server { status 503 }
- Drops the explicit header Content-Type — file_server emits it
  automatically from the .html extension.
- index.html is NOT restored; the 200-OK welcome page was the original
  failure mode, replaced now by 503.html.

Wins:

- Symmetry: one pattern for every status page in the codebase.
- Operator override: a cluster operator can mount a ConfigMap at
  /etc/caddy/pages/503.html to customize the body (branded outage page,
  i18n, etc.) without touching SC api code.
- Smaller raw string in caddy.go; the HTML body is no longer inlined
  via a one-line `respond "<!doctype...>"` blob.

Verified live with simplecontainer/caddy:latest serving the assembled
Caddyfile + the embedded pages dir mounted at /etc/caddy/pages:

  Host: example.com (matched)            → HTTP 200
  Host: support-bot.pay.space (unmatched) → HTTP 503
    Content-Type: text/html; charset=utf-8
    Cache-Control: no-store
    Retry-After: 60
    body = pages/503.html

  Same + X-Forwarded-Proto: http  → HTTP 503 (no HSTS redirect, since
  catch-all still doesn't `import hsts` — see e5a6519's comment).

Build clean, go test ./pkg/clouds/pulumi/kubernetes/... passes.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 12, 2026

Followup: default 503 refactored from inline HTML to file-based

Dev review feedback on the original inlined respond "<html>..." 503 approach: it was inconsistent with how every other error page in this codebase is served (pages/{404,500,502}.html via file_server from the handle_*_error snippets). Two mechanisms for the same class of response.

3b6d44b unifies it:

  • New embed/caddy/pages/503.html with the same SC-operator instruction body that was inlined before.
  • caddy.go default catch-all switches from respond "..." 503 { close } to:
http:// {
  import gzip
  handle {
    root * /etc/caddy/pages
    rewrite * /503.html
    header Cache-Control "no-store"
    header Retry-After "60"
    file_server {
      status 503
    }
  }
}

Wins:

  • Symmetry — one pattern across every status page in the codebase
  • Operator override — cluster operators can mount a different ConfigMap at /etc/caddy/pages/503.html for branded outage pages, i18n, etc.
  • Auto Content-Type — file_server emits it from the file extension; no explicit header Content-Type line
  • Smaller caddy.go — the HTML body is no longer a one-line respond blob

Verified live with simplecontainer/caddy:latest + the embedded pages dir mounted at /etc/caddy/pages:

Host: example.com (matched)             → HTTP 200
Host: support-bot.pay.space (unmatched) → HTTP 503
  Content-Type: text/html; charset=utf-8 (auto from .html ext)
  Cache-Control: no-store
  Retry-After: 60
  body = pages/503.html

Same + X-Forwarded-Proto: http          → HTTP 503 (no redirect — catch-all still doesn't import hsts)

go test ./pkg/clouds/pulumi/kubernetes/... passes.

Review

Single round, codex + gemini in parallel:

  • Codex: "No findings. Mergeable as-is." Verified Caddy v2 syntax, no path-traversal (rewrite discards attacker path before file_server), no SSA/ConfigMap drift. Noted one minor behavior change: respond { close } previously forced connection close; file_server does not. Not a blocker — these are scanner/probe-class requests where keepalive isn't a concern.
  • Gemini: "Mergeable." Confirmed the file_server status subdirective has been valid since Caddy v2.4.0 (deployed Caddy is 2.11.2). rewrite over try_files is the right choice for a forced catch-all.

PR is at 12 commits now. Ready for merge.

@Cre-eD Cre-eD merged commit b4fd96f into main May 13, 2026
18 checks passed
