
fix(k8s+caddy): stop namespace-rename cascade-delete + Caddy fallout (PAY-SPACE 2026-05-10, fulldiveVR 2026-05-12 outages) #255

Merged
Cre-eD merged 13 commits into main from fix/caddy-aggregator-dedup-on-rename
May 13, 2026

Conversation

@Cre-eD Cre-eD commented May 11, 2026

Two consumer outages, same root cause: SC api #230's namespace rename triggers a Pulumi Replace that cascade-deletes the shared parent namespace on the first pulumi up after #230 ships. Plus the Caddy fallout from that cascade was invisible to monitoring.

Confirmed outages

  • PAY-SPACE 2026-05-10/11: the deploys of every whitelabel under parentEnv: production (support-payhey, support-rulex, support-gl-pay, parallel wallets) cascade-deleted the shared support-bot / wallet namespaces. Caddy then served the welcome page on every prod host as HTTP 200, hiding the outage from monitoring until a human opened a browser.
  • fulldiveVR/wizeup-rooms-api 2026-05-12: namespace wize-rooms-api hosted both the likeclaw-us parent and the likeclaw-us-dev child. A routine merge to dev triggered the child's deploy. Pulumi plan: kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api" => "wize-rooms-api-likeclaw-us-dev". Namespace deleted, rooms-api.wizeup.app returned 502 until prod was manually re-deployed. (actions/runs/25725750825)

Anyone with parentEnv != stackEnv whose Pulumi state predates #230 is at risk on their next pulumi up. Two confirmed so far; likely more.

Root-cause fix — IgnoreChanges("metadata.name") on Namespace

#230 added RetainOnDelete(true) expecting that to protect existing consumers through the migration. It didn't: Pulumi reads delete-time options from the state of the resource being deleted, not from the current program. The old Namespace resource in state predates #230 and doesn't carry the flag, so Replace proceeds with the k8s DELETE and cascade-kills the parent.

efd2523 adds sdk.IgnoreChanges([]string{"metadata.name"}) to both corev1.NewNamespace call sites (simple_container.go client stacks + helpers.go helm operator stacks).
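For reference, a minimal sketch of the shape of the change (Pulumi Go SDK; resource and variable names here are illustrative, not the exact call sites):

```go
package main

import (
	"fmt"

	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	sdk.Run(func(ctx *sdk.Context) error {
		stackName, stackEnv := "support-payhey", "production" // illustrative values
		ns, err := corev1.NewNamespace(ctx, "namespace", &corev1.NamespaceArgs{
			Metadata: &metav1.ObjectMetaArgs{
				// Desired per-stackEnv name from #230; for pre-existing state
				// the diff against the legacy shared name is suppressed below.
				Name: sdk.String(fmt.Sprintf("%s-%s", stackName, stackEnv)),
			},
		},
			// Protects state written from now on; does NOT protect resources
			// whose state predates #230 (delete-time options come from state).
			sdk.RetainOnDelete(true),
			// The actual fix: a metadata.name change never schedules a Replace.
			sdk.IgnoreChanges([]string{"metadata.name"}),
		)
		if err != nil {
			return err
		}
		// Downstream resources must follow the live name, not the computed one.
		ctx.Export("namespace", ns.Metadata.Name().Elem())
		return nil
	})
}
```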

Behavior:

  • Fresh deploy of a new custom child stack: no prior state → no diff to ignore → namespace created with the per-stackEnv name. #230's isolation goal is preserved.
  • Existing custom child stack on next pulumi up: state's old metadata.Name vs program's new desired metadata.Name would normally schedule a Replace; IgnoreChanges suppresses that diff. No Replace, no delete, no cascade. State retains the legacy name. Service / Deployment / etc. follow that name and continue to land in the shared namespace. Migration cost: zero.
  • Existing consumer who actively wants the new isolated namespace: opt-in by removing the legacy Namespace resource from Pulumi state (state-edit; the k8s namespace itself stays). The next pulumi up registers a fresh namespace at the new name — example below.
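That opt-in is standard pulumi state surgery; a hedged example (the URN is illustrative — list the real one first, and note that jq is assumed to be available):

```bash
# Find the legacy Namespace resource's URN in the stack state:
pulumi stack export | jq -r '.deployment.resources[].urn' | grep Namespace

# Remove only the Pulumi state entry — the k8s namespace itself is untouched:
pulumi state delete 'urn:pulumi:production::myproj::kubernetes:core/v1:Namespace::namespace'
```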

Established codebase pattern: rds_postgres.go:45 and rds_mysql.go:55 use the same shape (IgnoreChanges([]string{"storageEncrypted"})) for the same purpose — silence a default flip so it doesn't propose a destructive replacement on existing stacks.

Caddy fallout fixes (also in this PR)

When the cascade-delete in the root-cause path finished, every Service with simple-container.com/caddyfile-entry for the affected hosts disappeared. Two distinct Caddy failure modes followed, plus one piece of operational hardening:

  1. Aggregator crashloop during the Replace window — for the brief moment the old + new Services coexisted, two http://<domain> { ... } site blocks ended up in /tmp/Caddyfile and Caddy aborted with ambiguous site definition: http://<domain>. Commits 2e0eeae + 1abd3c1: dedup by site-address (first non-blank, non-comment line of the annotation, whitespace-trimmed), most-recent Service wins via creationTimestamp + sort -r, set -eo pipefail so a flaky kubectl can't silently produce an empty config.

  2. Default catch-all served HTTP 200 + welcome page — after the cascade finished, requests for production hosts fell through to http:// { file_server /etc/caddy/pages } and got 200 OK "Default page". External monitoring, CDNs, uptime checks all saw healthy 200s. Commits e5a6519 + d7b4d71 + 328e796: default block now returns 503 with Retry-After: 60, Cache-Control: no-store, Content-Type: text/html, wrapped in an explicit handle { ... } so headers + body apply only to the 503 path. Removed import hsts from the catch-all so the 503 reaches monitoring directly instead of redirecting into a TLS handshake failure for unknown SNI. (The resulting catch-all is sketched after this list.)

  3. Operational hardening — 95730bf: dropped set -x so annotation bodies aren't traced to cluster logs.
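For orientation, the shape of the hardened catch-all at this point in the PR (illustrative — the exact block lives in caddy.go, and a later commit swaps the inline body for a file-based 503 page):

```Caddyfile
http:// {
	handle {
		# Headers scoped to the 503 path only — not leaked onto other routes.
		header Content-Type "text/html; charset=utf-8"
		header Cache-Control "no-store"
		header Retry-After "60"
		respond "<!doctype html><h1>503: no route configured for this host</h1>" 503 {
			close
		}
	}
	# Deliberately no `import hsts`: the redirect would hide the 503 behind a
	# TLS handshake failure for unknown SNI.
}
```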

Dead code removal: /etc/caddy/pages/index.html (the "Default page" template) deleted, no longer referenced. 404/500/502.html retained — still used by per-Service handle_*_error snippets.

Review provenance

This PR has been through four rounds of parallel codex + gemini review on the Caddy half. Convergent on "mergeable" in round 4. Each fixup commit captures one round's findings; commit history is intentionally not squashed so the review trail is auditable. Comments above on the PR record the round-by-round summaries.

The namespace-root-cause commit (efd2523) is fresh — needs its own review pass before merge.

Test plan

Unit:

  • go build ./... clean
  • go test ./pkg/clouds/pulumi/kubernetes/... -count=1 passes

Behavioral (manual, post-merge with branch preview):

  • Fresh deploy of a new custom child stack: Namespace created with <stackName>-<stackEnv>.
  • Existing PAY-SPACE / fulldiveVR consumer deploy: Pulumi diff for Namespace should show NO Replace (the metadata.name diff is suppressed by IgnoreChanges). All other resources (Service, Deployment, …) unchanged.
  • Caddy: validates clean, returns 503 with the right headers for an unknown Host, 200 for matched site blocks. Verified in the commit message of e5a6519 with the real simplecontainer/caddy:latest.

Followup

Memory recorded for next time: this is the second SC migration in two days where a metadata.name change was assumed to be safe under RetainOnDelete. Future SC changes to metadata.name of any long-lived resource should default to IgnoreChanges from the start, not retrofit after an outage.

When the namespace-naming change from #230 lands on a consumer, Pulumi
schedules a Replace on every custom-stack namespace (parentEnv !=
stackEnv). During the brief create-replacement + delete-replaced window
the Service carrying `simple-container.com/caddyfile-entry` exists in
*both* the old and new namespaces. The Caddy aggregator script
concatenated annotations from `kubectl get services --all-namespaces`
without dedup, producing two identical `http://<domain> { ... }` site
blocks in `/tmp/Caddyfile`. Caddy aborted with `ambiguous site
definition` and crashlooped until the old Service was collected.

PAY-SPACE hit this in production on 2026-05-11 — `support-payhey.pay.space`
was the visible victim because it sorts alphabetically before its
siblings, but every whitelabel that migrated through the rename traversed
the same transient duplicate.

Fix (sketch after this list):
- Include `creationTimestamp` in the jsonpath listing and `sort -r` so the
  most-recently-created Service is processed first.
- Track emitted site-address keys in a tempfile. The dedup key is the
  first non-blank line of each annotation — for domain entries that's
  `http://<domain> {` or `https://<domain> {`, for prefix entries it's
  `handle_path /<prefix>*`. Both transports are guarded.
- Older Service for a key already emitted is skipped with a log line, so
  the picked winner is observable in the init-container output.
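A hedged sketch of that pass (bash; the jsonpath layout, file paths, and log messages are illustrative — the embedded init-container script differs in detail):

```bash
#!/bin/bash
set -eo pipefail

# Most-recently-created Service first, so the newest annotation wins.
raw=$(kubectl get services --all-namespaces -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{"\t"}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}')
services=$(printf '%s\n' "$raw" | sort -r)

seen=$(mktemp)   # emitted dedup keys; a file so it survives the subshell below
printf '%s\n' "$services" | while IFS=$'\t' read -r ts ns svc; do
  entry=$(kubectl get service "$svc" -n "$ns" \
    -o jsonpath='{.metadata.annotations.simple-container\.com/caddyfile-entry}')
  [ -n "$entry" ] || continue
  # Dedup key: first non-blank, non-comment line, whitespace-trimmed.
  key=$(printf '%s\n' "$entry" | awk '
    /^[[:space:]]*$/ { next }
    /^[[:space:]]*#/ { next }
    { sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, ""); print; exit }')
  if grep -Fxq "$key" "$seen"; then
    echo "Skipping duplicate caddyfile-entry ($key) from $ns/$svc"
    continue
  fi
  printf '%s\n' "$key" >> "$seen"
  printf '%s\n\n' "$entry" >> /tmp/Caddyfile
done
```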

Verified offline against a synthetic three-Service set (new-ns/example
and old-ns/example both declaring `http://example.com`, plus unrelated
`other.com`): output Caddyfile has exactly one `http://example.com` block
and its `reverse_proxy` resolves to new-ns. Module builds clean,
`go test ./pkg/clouds/pulumi/kubernetes/...` passes.

The fix is independent of #230's `RetainOnDelete` migration semantics —
even after that path is hardened, any future namespace-shape change or
Service-Replace will see the same overlap window. This makes the Caddy
ingress tolerant of it rather than crashlooping.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

github-actions Bot commented May 11, 2026

Semgrep Scan Results

Repository: api | Commit: 7d875a4

| Check   | Status  | Details                              |
|---------|---------|--------------------------------------|
| Semgrep | ✅ Pass | 0 total findings (no error/warning)  |

Scanned at 2026-05-13 13:52 UTC


github-actions Bot commented May 11, 2026

Security Scan Results

Repository: api | Commit: 7d875a4

| Check                 | Status       | Details                      |
|-----------------------|--------------|------------------------------|
| Secret Scan           | ✅ Pass      | No secrets detected          |
| Dependencies (Trivy)  | ⚠️ High      | 1 high, 1 total              |
| Dependencies (Grype)  | ⚠️ High      | 1 high, 1 total              |
| SBOM                  | 📦 Generated | 470 components (CycloneDX)   |

Scanned at 2026-05-13 13:52 UTC

Codex caught a critical regression I introduced: the new
`kubectl ... | sort -r` pipeline under `set -e` (no pipefail) silently
collapsed to `services=""` whenever kubectl failed, and the script
exited successfully. Caddy would then start with only the default
`http:// { file_server }` block and every domain would serve the
welcome page on the next pod restart — the same masquerading-as-200
failure mode that took prod down on 2026-05-10. Hard miss; would have
made the original outage repeatable on any transient kubectl flake.

Changes:

- `set -xeo pipefail`. A kubectl error now fails the init-container
  fast; K8s reschedules and retries instead of cementing a partial
  config.
- Split the `kubectl | sort` into two assignments so the failure mode
  is unambiguous even if a future reader doesn't notice the pipefail.
- Normalize the dedup key in awk: skip blank lines, skip comment lines,
  trim leading/trailing whitespace. For SC-generated annotations this
  is functionally a no-op (their first non-blank line is deterministic),
  but it makes the dedup robust against indentation differences and
  user-authored caddyfile-entry annotations with header comments —
  gemini's concern.
- Switched `echo "$services" | while` to `printf '%s\n'` to keep the
  pipeline shell-portable when `$services` could contain backslashes.

Offline verification: pipefail now exits 1 on kubectl failure; dedup
key normalization collapses `  http://example.com {` (indented, new)
and `http://example.com {` (flush, old) to the same key; comment-led
annotations still emit with the right key.

Followups intentionally NOT in scope here:

1. Retroactive `RetainOnDelete` for namespace resources whose state
   predates #230 — the actual prod-killer. Both reviewers explicitly
   called out that this PR does not fix it.
2. Caddy default-block hardening — serve a hard 503 instead of
   file_server on /etc/caddy/pages when no Service block matches, so
   the absence of routes is loud instead of disguised as healthy 200s.

Both will be follow-up PRs.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 11, 2026

Review pass — codex + gemini (parallel)

Ran a parallel review with both tools. Headline: codex caught a critical regression I introduced; gemini raised some defensive concerns. Pushed 1abd3c1 addressing both.

Codex — the serious one

kubectl get ... | sort -r under set -e without pipefail — a kubectl error/timeout becomes services="" and the init-container exits successfully with only the default Caddyfile.

This automated the exact welcome-page outage scenario from 2026-05-10. Without pipefail, any transient kubectl flake on a Caddy pod restart silently produces a config with no Service routes, the catch-all http:// { file_server } serves the welcome page on every domain, and monitoring sees 200 OK. Hard miss on my part.
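A minimal repro of why pipefail matters here (illustrative, not the aggregator script itself):

```bash
#!/bin/bash
set -e
# Without pipefail the pipeline's status is sort's (0), so under `set -e`
# the script keeps going with an empty result:
services=$(false | sort -r)
echo "survived: services='${services}'"

set -o pipefail
# Now the pipeline reports false's failure and `set -e` aborts here:
services=$(false | sort -r)
echo "never reached"
```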

Fix in 1abd3c1:

  • set -xeo pipefail
  • Split kubectl/sort into separate assignments to make the failure path explicit even for readers who skip the set line

Gemini — defensive

Flagged the dedup key as brittle in theory: leading whitespace, comment lines, multi-host comma-separated site addresses. For SC-generated annotations this is moot (first line is deterministic ${proto}://${domain} { or handle_path /${prefix}*), but indentation/comment normalization is cheap defense.

Fix in 1abd3c1:

/^[[:space:]]*$/ { next }
/^[[:space:]]*#/ { next }
{ sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, ""); print; exit }

Verified offline: `  http://example.com {` (new, indented) and `http://example.com {` (old, flush) now produce the same dedup key. Comment-led annotations correctly emit with the underlying site block as the key.

Multi-host comma-order normalization (gemini's host1, host2 { vs host2, host1 {) is intentionally not addressed — not a case SC generates, and lexically sorting hosts inside a key would require a Caddyfile parser. Defer to a real follow-up if/when the use case appears.

Both reviewers explicitly confirm

This PR fixes only the secondary crashloop. It does not fix the actual prod-killer (non-retroactive RetainOnDelete cascade-deleting the shared parent namespace on first pulumi up after the migration). Two follow-up PRs queued:

  1. Retroactive RetainOnDelete path so namespace resources that entered Pulumi state before #230 don't get cascade-deleted during the Replace
  2. Caddy default-block hardening: serve a hard 503 instead of file_server /etc/caddy/pages when no Service block matches, so route absence is loud instead of disguised as healthy 200s

Both follow-ups are real bugs. This one's safe to merge first as the immediate symptom containment.

When all Services with a `simple-container.com/caddyfile-entry`
annotation for a given Host disappear — for example, a cascade-deletion
from a namespace Replace gone wrong — requests fell through to the
catch-all `http:// { file_server /etc/caddy/pages }` block and got back
HTTP 200 + "Default page" from index.html. External monitoring saw
healthy 200s. CDNs and load balancers saw 200s. Pingdom / UptimeRobot /
the dashboard everyone trusts saw 200s. The outage was invisible to
every layer that wasn't deep-inspecting the response body.

PAY-SPACE hit this on 2026-05-10: the migration from SC #230 cascade-
deleted the shared parent namespace, every Service annotation for
production hosts evaporated, and every domain pointing at the cluster
served the Caddy welcome page. The outage was only noticed when a human
opened a browser tab.

Change:
- Default catch-all now uses `respond ... 503 { close }` instead of
  `file_server /etc/caddy/pages`.
- Retry-After: 60 so CDNs back off appropriately and clients know to
  retry rather than treating 503 as a hard failure.
- Cache-Control: no-store so an aggressive cache doesn't pin the 503
  state past route recovery.
- HTML body still rendered for humans visiting in a browser, but it's
  now a 503 page that names the problem (missing
  `simple-container.com/caddyfile-entry` annotation) and tells operators
  what to check. The literal "Default page" string is gone.

Behavior verified by running the Caddy image with the new default block:

  configured host (Host: example.com)     → HTTP 200
  unmatched host (Host: support-bot.pay.space) → HTTP 503
    Retry-After: 60
    Cache-Control: no-store

`caddy validate` against the full embedded Caddyfile + new default block
+ a sample matched site passes clean.

The /etc/caddy/pages directory (index.html, 404.html, 502.html, 500.html)
is still embedded and used by the `handle_bucket_error` and
`handle_server_error` snippets for legitimate per-Service error
fallbacks — only the catch-all stopped serving it as a 200.

Pairs with #255 (Caddy aggregator dedup) as the two halves of the
2026-05-10 PAY-SPACE outage: dedup keeps the aggregator from
crashlooping during a Service transition, this PR keeps the absence of
routes loud so it doesn't masquerade as a healthy 200.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD changed the title fix(caddy): dedup caddyfile-entry annotations during Service transitions fix(caddy): harden aggregator dedup + default block against route-vanish outage (post-mortem 2026-05-10) May 11, 2026
Cre-eD added 5 commits May 11, 2026 22:30
Five blockers/issues from the parallel review:

1. **Content-Type silently became text/plain** (codex, blocking). Caddy's
   `respond` defaults the response to text/plain when no explicit
   Content-Type is set on the route, so browsers visiting the catch-all
   saw the raw HTML as literal text. Fix: `header Content-Type
   "text/html; charset=utf-8"` inside the response path. Verified with
   curl against caddy v2.11.2: Content-Type now reports text/html.

2. **Headers leaked onto the HSTS 301 path** (codex, important). With
   `import hsts` appended, the `header Cache-Control "no-store"` and
   `header Retry-After "60"` directives applied to BOTH the 503 AND the
   `redir` 301-to-HTTPS that hsts adds. That's wrong for a 301 — clients
   shouldn't be told to retry a permanent redirect, and `no-store`
   defeats the redirect cache. Fix: wrap the headers + respond in an
   explicit `handle { ... }` so they only fire on the 503 path.

3. **HSTS redirect made the 503 unreachable behind a CDN** (codex,
   important; gemini noticed but called it acceptable — codex is right).
   Caddy directive ordering runs `redir` before `respond`. A request
   with `X-Forwarded-Proto: http` (which Cloudflare/GCP LB/most modern
   CDNs set) matched hsts's `@httpReq` matcher and got a 301 to HTTPS
   for the unknown host — then failed the TLS handshake because Caddy
   has no cert for the unknown SNI. The user-visible result was a
   browser-level TLS error, invisible to HTTP-layer monitoring — exactly
   the failure mode this PR is trying to fix. Fix: omit `import hsts`
   from the catch-all entirely. HSTS on a Host-agnostic catch-all is
   semantically meaningless anyway (the header tells browsers "always
   use HTTPS for THIS host", but the catch-all answers any host).
   Per-Service site blocks still get HSTS via their own `import hsts`.
   Verified: `Host: support-bot.pay.space` with `X-Forwarded-Proto:
   http` now returns 503 directly instead of 301.

4. **Stale comment in the dedup section** (codex). The pipefail
   rationale comment still said a kubectl failure would "serve the
   welcome page from /etc/caddy/pages". With commit 3 in this PR the
   welcome page is gone; the failure mode is "503 on every domain".
   That's still a complete loss of routing for the cluster and worth
   bailing loud over, but the comment now describes the actual current
   behavior.

5. **/etc/caddy/pages/index.html is dead** (codex + gemini). Was only
   referenced by the old `file_server` catch-all; the per-Service
   `handle_*_error` snippets only reference 404/500/502.html. Deleted.

Validation:

- `caddy validate` clean on the assembled Caddyfile
- `go build ./...` clean
- `go test ./pkg/clouds/pulumi/kubernetes/... -count=1` passes
- Live Caddy v2.11.2 probe matrix:
    Host: example.com (known)             → 200 "ok" text/plain
    Host: support-bot.pay.space (unknown) → 503 text/html  Cache-Control:no-store  Retry-After:60  Connection:close
    Host: support-bot.pay.space + XFP:http → still 503 (no 301 anymore)

Out of scope still: HTTPS catch-all for unknown SNI. Caddy doesn't
synthesize a cert for unknown SNIs without explicit `default_sni` +
matching wildcard cert config, which is per-cluster and not something
this fix should bake in. Direct TLS handshake failure remains the
behavior for unknown SNIs; the HTTP 503 path is what monitoring
actually pings.
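For concreteness, the per-cluster configuration that would be required looks roughly like this (illustrative only — deliberately not part of this change):

```Caddyfile
# Per-cluster global options; SC does not bake these in:
{
	default_sni fallback.example.com
}

# Requires a site whose cert can actually serve that SNI:
https://fallback.example.com {
	tls /etc/caddy/certs/wildcard.pem /etc/caddy/certs/wildcard.key
	respond "Service unavailable" 503
}
```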

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…ommit)

Previous commit d7b4d71 captured only the index.html deletion but the
caddy.go changes weren't staged (git rm + git add interaction). This
adds them: content-type, handle wrapper, hsts removal, stale comment.

See d7b4d71's commit message for the full rationale of all five
review findings — repeated here for completeness:

1. respond defaults to text/plain — add `header Content-Type "text/html;
   charset=utf-8"` so browsers render the HTML body.
2. Cache-Control + Retry-After leaked onto the HSTS 301 path — wrap
   headers + respond in explicit `handle { ... }`.
3. HSTS redirect made the catch-all 503 unreachable behind CDNs that
   set X-Forwarded-Proto — drop `import hsts` from the catch-all.
4. Stale comment about welcome page failure mode — updated to reflect
   the new 503 failure mode.
5. (the index.html deletion, landed in d7b4d71)

Verified live against simplecontainer/caddy:latest:
  Host: example.com         → 200 "ok"
  Host: support-bot.pay.space → 503 text/html, Cache-Control: no-store,
                                Retry-After: 60, Connection: close
  Same Host + X-Forwarded-Proto: http → 503 (was 301 before)

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…re split

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Previous commit 2d290fe wrote comments with markdown-style backticks
inside the Go raw-string-delimited bash script literal, which closed
the raw string mid-comment and turned the rest into invalid Go
("syntax error: unexpected name printf in composite literal").

Replaced with plain text (printf-to-sort, printf-to-while-read).
Should have built before pushing. `go build ./...` clean now.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Gemini round-3 review flagged `set -x` as a security regression: tracing
every command prints the raw caddyfile-entry annotation body (and the
output of `kubectl get service ...`) to stdout, which lands in cluster
logging (GCP/Datadog/ELK). SC-generated annotations don't carry secrets,
but consumer-side misuse — basicauth credentials in `Headers` map, or
raw Caddy directives in `LbConfig.ExtraHelpers` — could template into
the annotation body and leak via -x.

The init container is rarely debugged live (when it is, an operator can
override the command), so the debuggability cost is low. The script
still emits informative one-line `Processing service: $service in
namespace: $ns` and `Skipping duplicate caddyfile-entry ...` messages
without -x.

Kept: `cat /tmp/Caddyfile` at the end. That's the assembled config the
Caddy server actually loads; printing it is useful for verifying
rollouts and is consistent with prior behavior. If a consumer puts
secrets into per-Service annotations they leak there too, but it's
intentional logging of the deployed config, not an incidental
per-command trace.

Codex round-3 verdict was "clean, merge" but acknowledged the same
exposure existed via `cat`. I'm siding with gemini on -x because the
trace exposure compounds (every kubectl invocation × every Service ×
every pod restart) while `cat` is a single final dump.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 11, 2026

Multi-round codex + gemini parallel review — converged

Four rounds, two reviewers per round. Both reach "mergeable" in round 4.

Round 1 — codex blocked the PR, gemini was too lenient

Codex caught the critical regression I'd introduced: kubectl ... | sort -r under set -e without pipefail would silently produce services="" on any kubectl flake, automating the welcome-page-masquerade outage that the PR is trying to fix. Plus:

  • respond defaults Content-Type to text/plain → browsers saw raw HTML tags
  • header Cache-Control and header Retry-After leaked onto the HSTS 301 path
  • import hsts redirected catch-all HTTP→HTTPS→TLS-handshake-fail, making the 503 unreachable behind any CDN that sets X-Forwarded-Proto
  • Stale comment + dead index.html

Gemini noticed the HSTS issue but called it "still loud enough." Codex was correct — TLS-fail is invisible to HTTP monitoring, which is the whole point of this fix.

Pushed: d7b4d71 + 328e796 — set -eo pipefail, explicit Content-Type, handle { ... } wrapper, removed import hsts from catch-all, deleted index.html.

Round 2 — both clean

Codex: "Findings: none blocking. Mergeable now." Verified all five round-1 fixes with real Caddy testing across h2, prefix-mode, and per-Service-with-hsts.
Gemini: "Mergeable as-is."

One nit: pipefail comment said "kubectl piped into sort" but code now splits them. Pushed: 2d290fe — refreshed comment.

I then accidentally put markdown backticks inside the Go raw string and broke the build. Pushed: 9828eb5 — fix.

Round 3 — codex clean, gemini blocked

Gemini flagged set -x as a security regression: tracing every command dumps annotation bodies into cluster logs, and a consumer-misuse path (basicauth credentials in Headers map, raw directives in LbConfig.ExtraHelpers) could template secrets into the annotation. Codex noted the same exposure existed via cat /tmp/Caddyfile and called it pre-existing.

Sided with gemini. Pushed: 95730bf — dropped -x, kept set -eo pipefail. Kept cat /tmp/Caddyfile since that's intentional final-config logging, useful for rollout verification, and the trace-amplification (every kubectl invocation × every Service × every pod restart) was the bigger leak.

Round 4 — converged

Both reviewers: mergeable, no blockers. Codex confirmed set -eo pipefail preserves errexit + pipeline-failure behavior; only execution tracing was removed. Gemini explicitly endorsed keeping cat /tmp/Caddyfile as "a pragmatic trade-off" — agrees with my reasoning.

Net result

7 commits, from the original two-line dedup PR to a hardened, security-reviewed change that addresses both halves of the 2026-05-10 outage:

  • aggregator dedup + pipefail + key normalization (prevents the Caddy crashloop during Service Replace)
  • default catch-all returns 503 with text/html + Retry-After + Cache-Control: no-store, behind a handle { ... } scope that doesn't leak headers onto unrelated paths
  • removed catch-all import hsts so the 503 reaches monitoring directly instead of redirecting into a TLS failure
  • dropped set -x to reduce annotation-body exposure to cluster logs

Ready to merge.

… cascade

PR #230 changed custom-stack namespace naming from shared <stackName> to
per-stackEnv <stackName>-<stackEnv> and added RetainOnDelete(true),
expecting that to protect existing consumers through the migration
pulumi up. It didn't — Pulumi reads delete-time options from the state
of the resource being deleted, not from the current program. Existing
Namespace resources predate #230 and don't carry RetainOnDelete; when
the new code computed a different metadata.Name, Pulumi diffed against
state, scheduled a Replace, executed delete-old before create-new, and
sent k8s DELETE on the legacy shared namespace. K8s cascade-deleted
every resource inside, including the parent stack's production
resources that lived in the same shared namespace.

Confirmed outages:

- PAY-SPACE 2026-05-10/11: support-bot parent + every whitelabel
  (support-payhey, support-rulex, support-gl-pay) cascade-deleted.
  Caddy fallout from this is also fixed in earlier commits of this PR.

- fulldiveVR/wizeup-rooms-api 2026-05-12: namespace wize-rooms-api
  hosted both the likeclaw-us parent stack and the likeclaw-us-dev
  child. A routine merge to dev triggered the child's deploy. Pulumi
  plan: kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api"
  => "wize-rooms-api-likeclaw-us-dev". Namespace deleted,
  rooms-api.wizeup.app returned 502 until prod was manually
  re-deployed. (actions/runs/25725750825)

Fix: sdk.IgnoreChanges([]string{"metadata.name"}) on both Namespace
registration sites — simple_container.go for client stacks,
helpers.go's ensureNamespace for helm operator stacks.

Behavior:

- Fresh deploy of a new custom child stack: no prior state, no diff
  to ignore. Namespace created with the per-stackEnv name. PR #230's
  isolation goal preserved for new deploys.

- Existing custom child stack on its next pulumi up: state has
  metadata.Name=<legacy shared name>, program desires metadata.Name=
  <stackName>-<stackEnv>. IgnoreChanges suppresses the diff — no
  Replace scheduled, no delete attempted. State retains the legacy
  name. Service/Deployment/etc. that reference
  namespace.Metadata.Name().Elem() now resolve to the legacy name and
  continue to land in the shared namespace. Migration cost: zero.
  Consumer is back to the pre-#230 sharing model, but RetainOnDelete
  protects against the cross-sibling destroy cascade #230 was
  originally added to solve. Both hazards now defused.

- Existing consumer who actively wants per-stackEnv isolation: opt-in
  by removing the legacy Namespace resource from Pulumi state (state
  edit; k8s namespace itself stays put). Next pulumi up sees no prior
  namespace, registers a fresh one at the per-stackEnv name. Old k8s
  namespace continues to host the parent stack; the migrated child
  lives in the new isolated namespace.

This is the established codebase pattern: rds_postgres.go:45 and
rds_mysql.go:55 use IgnoreChanges([]string{"storageEncrypted"}) for
the same purpose — silence a default flip so it doesn't propose a
destructive replacement on existing stacks.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD changed the title fix(caddy): harden aggregator dedup + default block against route-vanish outage (post-mortem 2026-05-10) fix(k8s+caddy): stop namespace-rename cascade-delete + Caddy fallout (PAY-SPACE 2026-05-10, fulldiveVR 2026-05-12 outages) May 12, 2026
…amespace

Codex round-2 review of efd2523 caught that IgnoreChanges("metadata.name")
defeats its own purpose at two callsites that still bake the program-
computed namespace name into resources downstream of the Namespace
itself:

- caddyfile-entry annotation. The Service annotation template at
  simple_container.go:673 was using sanitizedNamespace as the
  reverse_proxy upstream namespace. With IgnoreChanges in place, a
  migrated stack's Service is created in the legacy shared namespace
  (because namespace.Metadata.Name() resolves to the legacy value), but
  the annotation pointed Caddy at <svc>.<NEW>.svc.cluster.local — DNS
  fails to resolve, Caddy 502s for the affected host.

- VPA. createVPA was called with sanitizedNamespace and built the VPA
  CRD with metadata.namespace = NEW name. The Deployment it targets
  lives in the legacy namespace, so the VPA sits orphaned and never
  scales the workload.

Either bug would have shipped the migration cascade fix (efd2523)
without actually preventing the 502s or the autoscaling regression
for migrated stacks.

Fix:

1. Caddyfile-entry template extracted to a local variable
   (caddyfileEntryTemplate). The same template is rendered twice:
   - synchronously into caddyfileEntry (string) for sc.CaddyfileEntry
     export — that's used as a change-hash signal by kube_run.go and
     intentionally tracks the desired-config view, not the migrated
     live state.
   - asynchronously into caddyfileEntryAnnotation (sdk.StringOutput)
     via namespace.Metadata.Name().ApplyT — resolves namespace at
     apply time. For fresh deploys (liveNS == sanitizedNamespace), the
     callback returns the statically-rendered template verbatim, so
     byte output matches the legacy code path. For migrated stacks
     (liveNS != sanitizedNamespace), it re-applies placeholders with
     the live namespace and returns the new string.

2. Render failures inside the ApplyT callback are returned as errors
   wrapped with errors.Wrapf, NOT silently fallen back to the
   statically-rendered template. Falling back would re-introduce the
   exact migrated-stack 502 bug this commit is fixing. Codex review
   flag — the silent fallback was the wrong failure mode. (Sketch
   after this list.)

3. Service/Ingress annotation maps switched from
   sdk.ToStringMap(map[string]string) to a manually-built sdk.StringMap
   so the caddyfile-entry value can be an Output while the rest stay
   static. Equivalent for static-only entries.

4. createVPA signature: namespace string → namespace sdk.StringInput.
   The metadata.Namespace field directly accepts the Pulumi input.
   Caller now passes namespace.Metadata.Name().Elem(), which is the
   live Namespace.metadata.name Output.
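A hedged Go sketch of items 1–2 (the helper, template, and parameter names are illustrative; the real code in simple_container.go differs):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/pkg/errors"
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// caddyfileEntryAnnotation renders the reverse_proxy upstream against the
// namespace the Service actually lands in, not the program-computed name.
func caddyfileEntryAnnotation(ns *corev1.Namespace, sanitizedNamespace, serviceName, port string) sdk.StringOutput {
	const caddyfileEntryTemplate = "reverse_proxy http://${service}.${namespace}.svc.cluster.local:${port}"

	render := func(liveNS string) (string, error) {
		if serviceName == "" || port == "" {
			return "", fmt.Errorf("missing placeholder values for %q", liveNS)
		}
		return strings.NewReplacer(
			"${service}", serviceName,
			"${namespace}", liveNS,
			"${port}", port,
		).Replace(caddyfileEntryTemplate), nil
	}

	// Synchronous render: sc.CaddyfileEntry keeps tracking the desired-config
	// view (program-computed namespace) as its change-hash signal.
	static, _ := render(sanitizedNamespace)

	return ns.Metadata.Name().Elem().ApplyT(func(liveNS string) (string, error) {
		if liveNS == sanitizedNamespace {
			return static, nil // fresh deploy: byte-identical to the legacy path
		}
		out, err := render(liveNS) // migrated stack: follow the legacy namespace
		if err != nil {
			// Propagate. Falling back to `static` here would silently
			// re-introduce the migrated-stack 502.
			return "", errors.Wrapf(err, "re-render caddyfile-entry for namespace %q", liveNS)
		}
		return out, nil
	}).(sdk.StringOutput)
}

func main() {} // placeholder so the sketch compiles standalone
```

The resulting Output then sits in a hand-built sdk.StringMap alongside the static annotation values, per item 3 above.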

Verification:
- go build ./... clean
- go test ./pkg/clouds/pulumi/kubernetes/... -count=1 passes
- Three rounds of parallel codex + gemini review on the namespace work;
  this commit addresses the round-3 follow-ups (template duplication
  cleaned up, ApplyT error propagation made fatal).

Pairs with efd2523 (the IgnoreChanges fix) as the complete cascade-
prevention story: efd2523 stops the Namespace itself from being
Replace-deleted, this commit stops the downstream resources (Service
annotation + VPA) from drifting away from where the Namespace actually
landed.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 12, 2026

Namespace work — three review rounds (codex + gemini parallel each round)

The IgnoreChanges fix in efd2523 was incomplete. Multi-round review caught it and the follow-up is in 1d248d9.

Round 1 — clean, both approved

Both confirmed semantics: IgnoreChanges("metadata.name") suppresses the Replace for existing state, has no effect on fresh deploys, downstream resources follow namespace.Metadata.Name().Elem() correctly. Codex even built a throwaway Pulumi stack and tested it live.

Round 2 — codex caught a critical bug, gemini missed it

Codex found that two callsites in simple_container.go still used the program-computed sanitizedNamespace (string) instead of the live namespace.Metadata.Name() Output:

  • caddyfile-entry annotation (line 673) templated sanitizedNamespace into reverse_proxy http://${service}.${namespace}.svc.cluster.local. On a migrated stack the Service lives in the legacy namespace but the annotation pointed Caddy at the new (non-existent) namespace → DNS doesn't resolve → 502. The exact failure mode efd2523 was supposed to prevent.
  • VPA (line 825 → createVPA → line 964) set metadata.namespace = sanitizedNamespace. VPA in NEW namespace, Deployment in LEGACY namespace → VPA never finds its target → autoscaling silently broken.

Gemini approved the diff in round 2 without spotting either of these. Codex's deeper code-path tracing caught it.

Round 3 — both reviewers, same diff

Verdict: both approve the round-2 fix in principle but flag two follow-ups:

  • Template duplication (both): the Caddyfile template literal was repeated inside the ApplyT callback (so a tweak to the main template wouldn't propagate to the live-render path). Both recommended extracting to a variable.
  • Silent fallback on ApplyT render failure is wrong (codex, blocking): if placeholders.Apply fails inside the callback, my code returned entryTemplate (the statically-rendered version) — which is exactly the bug we're fixing. Codex insisted this should propagate the error so Pulumi fails the update.

Both addressed in 1d248d9:

  • caddyfileEntryTemplate extracted; both initial sync render + ApplyT re-render share it
  • ApplyT callback now returns (string, error) and uses errors.Wrapf to propagate

Final PR state

11 commits. Both halves of the post-mortem covered:

  1. Migration cascade prevention — IgnoreChanges("metadata.name") on both Namespace registration sites + downstream resources (caddyfile-entry annotation, VPA) following the live namespace Output.
  2. Caddy fallout containment — aggregator dedup, default-block 503 (proper Content-Type, scoped headers, no HSTS redirect), removed set -x to limit annotation exposure in logs.

Ready for merge.

Dev review pointed out the inconsistency in commit e5a6519 + d7b4d71:
the default catch-all 503 was inlined as a string literal in caddy.go,
while the other status pages (404, 500, 502) are still served from
/etc/caddy/pages/{code}.html via the handle_bucket_error /
handle_server_error snippets in embed/caddy/Caddyfile. Two different
mechanisms for the same class of response.

Refactor to the same file-based pattern:

- New pages/503.html with the SC-operator instruction body that was
  previously inlined ("No backend route is configured for this host" +
  hint to check the simple-container.com/caddyfile-entry annotation).
- caddy.go's default catch-all switches from
    respond "<html>...</html>" 503 { close }
  to
    root * /etc/caddy/pages
    rewrite * /503.html
    file_server { status 503 }
- Drops the explicit header Content-Type — file_server emits it
  automatically from the .html extension.
- index.html is NOT restored; the 200-OK welcome page was the original
  failure mode, replaced now by 503.html.

Wins:

- Symmetry: one pattern for every status page in the codebase.
- Operator override: a cluster operator can mount a ConfigMap at
  /etc/caddy/pages/503.html to customize the body (branded outage page,
  i18n, etc.) without touching SC api code.
- Smaller raw string in caddy.go; the HTML body is no longer inlined
  via a one-line `respond "<!doctype...>"` blob.

Verified live with simplecontainer/caddy:latest serving the assembled
Caddyfile + the embedded pages dir mounted at /etc/caddy/pages:

  Host: example.com (matched)            → HTTP 200
  Host: support-bot.pay.space (unmatched) → HTTP 503
    Content-Type: text/html; charset=utf-8
    Cache-Control: no-store
    Retry-After: 60
    body = pages/503.html

  Same + X-Forwarded-Proto: http  → HTTP 503 (no HSTS redirect, since
  catch-all still doesn't `import hsts` — see e5a6519's comment).

Build clean, go test ./pkg/clouds/pulumi/kubernetes/... passes.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented May 12, 2026

Followup: default 503 refactored from inline HTML to file-based

Dev review feedback on the original inlined respond "<html>..." 503 approach: it was inconsistent with how every other error page in this codebase is served (pages/{404,500,502}.html via file_server from the handle_*_error snippets). Two mechanisms for the same class of response.

3b6d44b unifies it:

  • New embed/caddy/pages/503.html with the same SC-operator instruction body that was inlined before.
  • caddy.go default catch-all switches from respond "..." 503 { close } to:
http:// {
  import gzip
  handle {
    root * /etc/caddy/pages
    rewrite * /503.html
    header Cache-Control "no-store"
    header Retry-After "60"
    file_server {
      status 503
    }
  }
}

Wins:

  • Symmetry — one pattern across every status page in the codebase
  • Operator override — cluster operators can mount a different ConfigMap at /etc/caddy/pages/503.html for branded outage pages, i18n, etc.
  • Auto Content-Type — file_server emits it from the file extension; no explicit header Content-Type line
  • Smaller caddy.go — the HTML body is no longer a one-line respond blob

Verified live with simplecontainer/caddy:latest + the embedded pages dir mounted at /etc/caddy/pages:

Host: example.com (matched)             → HTTP 200
Host: support-bot.pay.space (unmatched) → HTTP 503
  Content-Type: text/html; charset=utf-8 (auto from .html ext)
  Cache-Control: no-store
  Retry-After: 60
  body = pages/503.html

Same + X-Forwarded-Proto: http          → HTTP 503 (no redirect — catch-all still doesn't import hsts)

go test ./pkg/clouds/pulumi/kubernetes/... passes.

Review

Single round, codex + gemini in parallel:

  • Codex: "No findings. Mergeable as-is." Verified Caddy v2 syntax, no path-traversal (rewrite discards attacker path before file_server), no SSA/ConfigMap drift. Noted one minor behavior change: respond { close } previously forced connection close; file_server does not. Not a blocker — these are scanner/probe-class requests where keepalive isn't a concern.
  • Gemini: "Mergeable." Confirmed the file_server status subdirective has been valid since Caddy v2.4.0 (deployed Caddy is 2.11.2). rewrite over try_files is the right choice for a forced catch-all.

PR is at 12 commits now. Ready for merge.

@Cre-eD Cre-eD merged commit b4fd96f into main May 13, 2026
18 checks passed
