From 2e0eeaed30ac3b107982c9d46a977731d2710a15 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 21:13:49 +0400 Subject: [PATCH 01/11] fix(caddy): dedup caddyfile-entry annotations during Service transitions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When the namespace-naming change from #230 lands on a consumer, Pulumi schedules a Replace on every custom-stack namespace (parentEnv != stackEnv). During the brief create-replacement + delete-replaced window the Service carrying `simple-container.com/caddyfile-entry` exists in *both* the old and new namespaces. The Caddy aggregator script concatenated annotations from `kubectl get services --all-namespaces` without dedup, producing two identical `http:// { ... }` site blocks in `/tmp/Caddyfile`. Caddy aborted with `ambiguous site definition` and crashloops until the old Service is collected. PAY-SPACE hit this in production on 2026-05-11 — `support-payhey.pay.space` was the visible victim because it sorts alphabetically before its siblings, but every whitelabel that migrated through the rename traversed the same transient duplicate. Fix: - Include `creationTimestamp` in the jsonpath listing and `sort -r` so the most-recently-created Service is processed first. - Track emitted site-address keys in a tempfile. The dedup key is the first non-blank line of each annotation — for domain entries that's `http:// {` or `https:// {`, for prefix entries it's `handle_path /*`. Both transports are guarded. - Older Service for a key already emitted is skipped with a log line, so the picked winner is observable in the init-container output. Verified offline against a synthetic three-Service set (new-ns/example and old-ns/example both declaring `http://example.com`, plus unrelated `other.com`): output Caddyfile has exactly one `http://example.com` block and its `reverse_proxy` resolves to new-ns. Module builds clean, `go test ./pkg/clouds/pulumi/kubernetes/...` passes. The fix is independent of #230's `RetainOnDelete` migration semantics — even after that path is hardened, any future namespace-shape change or Service-Replace will see the same overlap window. This makes the Caddy ingress tolerant of it rather than crashlooping. Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 33 ++++++++++++++++++++++----- 1 file changed, 27 insertions(+), 6 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index 857b6493..ee289ffd 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -178,25 +178,46 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou Command: sdk.ToStringArray([]string{"bash", "-c", ` set -xe; cp -f /etc/caddy/Caddyfile /tmp/Caddyfile; - + # Inject custom Caddyfile prefix at the top (e.g., GCS storage configuration) if [ -n "$CADDYFILE_PREFIX" ]; then echo "$CADDYFILE_PREFIX" >> /tmp/Caddyfile echo "" >> /tmp/Caddyfile fi - - # Get all services with Simple Container annotations across all namespaces - services=$(kubectl get services --all-namespaces -o jsonpath='{range .items[?(@.metadata.annotations.simple-container\.com/caddyfile-entry)]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}') + + # List Services carrying the caddyfile-entry annotation. 
We also pull + # creationTimestamp so we can dedup by site-address with the newest + # Service winning — during a Pulumi Replace of a namespace (or Service), + # the old and new Services transiently coexist and both carry the same + # annotation; without dedup that produced two "http:// { ... }" + # blocks and Caddy aborted with "ambiguous site definition". + services=$(kubectl get services --all-namespaces -o jsonpath='{range .items[?(@.metadata.annotations.simple-container\.com/caddyfile-entry)]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | sort -r) echo "$DEFAULT_ENTRY_START" >> /tmp/Caddyfile if [ "$USE_PREFIXES" == "false" ]; then echo "$DEFAULT_ENTRY" >> /tmp/Caddyfile echo "}" >> /tmp/Caddyfile fi + # Dedup state: first non-blank line of each annotation is the site + # address (e.g. "http://support-payhey.pay.space {") or the + # "handle_path /*" matcher for prefix routing. Already-seen + # keys are skipped — most-recently-created Service wins via sort -r. + seen=$(mktemp) + trap 'rm -f "$seen"' EXIT # Process each service that has Caddyfile entry annotation - echo "$services" | while read ns service; do + echo "$services" | while read ts ns service; do if [ -n "$ns" ] && [ -n "$service" ]; then + entry=$(kubectl get service -n "$ns" "$service" -o jsonpath='{.metadata.annotations.simple-container\.com/caddyfile-entry}' 2>/dev/null || true) + if [ -z "$entry" ]; then + continue + fi + key=$(printf '%s\n' "$entry" | awk 'NF{print; exit}') + if [ -n "$key" ] && grep -qFx -- "$key" "$seen" 2>/dev/null; then + echo "Skipping duplicate caddyfile-entry '$key' from $ns/$service (older Service)" + continue + fi + [ -n "$key" ] && printf '%s\n' "$key" >> "$seen" echo "Processing service: $service in namespace: $ns" - kubectl get service -n $ns $service -o jsonpath='{.metadata.annotations.simple-container\.com/caddyfile-entry}' >> /tmp/Caddyfile || true; + printf '%s\n' "$entry" >> /tmp/Caddyfile echo "" >> /tmp/Caddyfile fi done From 1abd3c17de15644cfed37bb574fc3e1535467199 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 21:39:08 +0400 Subject: [PATCH 02/11] fixup: address codex + gemini review findings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex caught a critical regression I introduced: the new `kubectl ... | sort -r` pipeline under `set -e` (no pipefail) silently collapsed to `services=""` whenever kubectl failed, and the script exited successfully. Caddy would then start with only the default `http:// { file_server }` block and every domain would serve the welcome page on the next pod restart — the same masquerading-as-200 failure mode that took prod down on 2026-05-10. Hard miss; would have made the original outage repeatable on any transient kubectl flake. Changes: - `set -xeo pipefail`. A kubectl error now fails the init-container fast; K8s reschedules and retries instead of cementing a partial config. - Split the `kubectl | sort` into two assignments so the failure mode is unambiguous even if a future reader doesn't notice the pipefail. - Normalize the dedup key in awk: skip blank lines, skip comment lines, trim leading/trailing whitespace. For SC-generated annotations this is functionally a no-op (their first non-blank line is deterministic), but it makes the dedup robust against indentation differences and user-authored caddyfile-entry annotations with header comments — gemini's concern. 
- Switched `echo "$services" | while` to `printf '%s\n'` to keep the pipeline shell-portable when `$services` could contain backslashes. Offline verification: pipefail now exits 1 on kubectl failure; dedup key normalization collapses ` http://example.com {` (indented, new) and `http://example.com {` (flush, old) to the same key; comment-led annotations still emit with the right key. Followups intentionally NOT in scope here: 1. Retroactive `RetainOnDelete` for namespace resources whose state predates #230 — the actual prod-killer. Both reviewers explicitly called out that this PR does not fix it. 2. Caddy default-block hardening — serve a hard 503 instead of file_server on /etc/caddy/pages when no Service block matches, so the absence of routes is loud instead of disguised as healthy 200s. Both will be follow-up PRs. Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 28 +++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index ee289ffd..2905e9a0 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -176,7 +176,7 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou return envVars }(), Command: sdk.ToStringArray([]string{"bash", "-c", ` - set -xe; + set -xeo pipefail; cp -f /etc/caddy/Caddyfile /tmp/Caddyfile; # Inject custom Caddyfile prefix at the top (e.g., GCS storage configuration) @@ -191,26 +191,38 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou # the old and new Services transiently coexist and both carry the same # annotation; without dedup that produced two "http:// { ... }" # blocks and Caddy aborted with "ambiguous site definition". - services=$(kubectl get services --all-namespaces -o jsonpath='{range .items[?(@.metadata.annotations.simple-container\.com/caddyfile-entry)]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | sort -r) + # pipefail is critical here: a flaky kubectl piped into sort would + # otherwise yield services="" and the init-container would silently + # emit a Caddyfile with only the default block — every domain would + # then serve the welcome page from /etc/caddy/pages on the next pod + # restart, masquerading as healthy 200s. + raw_services=$(kubectl get services --all-namespaces -o jsonpath='{range .items[?(@.metadata.annotations.simple-container\.com/caddyfile-entry)]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}') + services=$(printf '%s' "$raw_services" | sort -r) echo "$DEFAULT_ENTRY_START" >> /tmp/Caddyfile if [ "$USE_PREFIXES" == "false" ]; then echo "$DEFAULT_ENTRY" >> /tmp/Caddyfile echo "}" >> /tmp/Caddyfile fi - # Dedup state: first non-blank line of each annotation is the site - # address (e.g. "http://support-payhey.pay.space {") or the - # "handle_path /*" matcher for prefix routing. Already-seen - # keys are skipped — most-recently-created Service wins via sort -r. + # Dedup state: first non-blank, non-comment line of each annotation is + # the site address (e.g. "http://support-payhey.pay.space {") or the + # "handle_path /*" matcher for prefix routing. Whitespace is + # trimmed both sides so an indentation difference can't pass through as + # a distinct key. Already-seen keys are skipped — most-recently-created + # Service wins via sort -r. 
seen=$(mktemp) trap 'rm -f "$seen"' EXIT # Process each service that has Caddyfile entry annotation - echo "$services" | while read ts ns service; do + printf '%s\n' "$services" | while read ts ns service; do if [ -n "$ns" ] && [ -n "$service" ]; then entry=$(kubectl get service -n "$ns" "$service" -o jsonpath='{.metadata.annotations.simple-container\.com/caddyfile-entry}' 2>/dev/null || true) if [ -z "$entry" ]; then continue fi - key=$(printf '%s\n' "$entry" | awk 'NF{print; exit}') + key=$(printf '%s\n' "$entry" | awk ' + /^[[:space:]]*$/ { next } + /^[[:space:]]*#/ { next } + { sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, ""); print; exit } + ') if [ -n "$key" ] && grep -qFx -- "$key" "$seen" 2>/dev/null; then echo "Skipping duplicate caddyfile-entry '$key' from $ns/$service (older Service)" continue From e5a65198e76f55c337b9448fc2ec5d2832f8e7e9 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 21:57:26 +0400 Subject: [PATCH 03/11] fix(caddy): default catch-all returns 503 instead of welcome page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When all Services with a `simple-container.com/caddyfile-entry` annotation for a given Host disappear — for example, a cascade-deletion from a namespace Replace gone wrong — requests fell through to the catch-all `http:// { file_server /etc/caddy/pages }` block and got back HTTP 200 + "Default page" from index.html. External monitoring saw healthy 200s. CDNs and load balancers saw 200s. Pingdom / UptimeRobot / the dashboard everyone trusts saw 200s. The outage was invisible to every layer that wasn't deep-inspecting the response body. PAY-SPACE hit this on 2026-05-10: the migration from SC #230 cascade- deleted the shared parent namespace, every Service annotation for production hosts evaporated, and every domain pointing at the cluster served the Caddy welcome page. The outage was only noticed when a human opened a browser tab. Change: - Default catch-all now uses `respond ... 503 { close }` instead of `file_server /etc/caddy/pages`. - Retry-After: 60 so CDNs back off appropriately and clients know to retry rather than treating 503 as a hard failure. - Cache-Control: no-store so an aggressive cache doesn't pin the 503 state past route recovery. - HTML body still rendered for humans visiting in a browser, but it's now a 503 page that names the problem (missing `simple-container.com/caddyfile-entry` annotation) and tells operators what to check. The literal "Default page" string is gone. Behavior verified by running the Caddy image with the new default block: configured host (Host: example.com) → HTTP 200 unmatched host (Host: support-bot.pay.space) → HTTP 503 Retry-After: 60 Cache-Control: no-store `caddy validate` against the full embedded Caddyfile + new default block + a sample matched site passes clean. The /etc/caddy/pages directory (index.html, 404.html, 502.html, 500.html) is still embedded and used by the `handle_bucket_error` and `handle_server_error` snippets for legitimate per-Service error fallbacks — only the catch-all stopped serving it as a 200. Pairs with #255 (Caddy aggregator dedup) as the two halves of the 2026-05-10 PAY-SPACE outage: dedup keeps the aggregator from crashlooping during a Service transition, this PR keeps the absence of routes loud so it doesn't masquerade as a healthy 200. 
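
For anyone re-running that probe by hand, it reduces to two curl calls against whatever
port the Caddy container is published on — the local port below is an illustrative
assumption, not part of this change; the Host values are the ones from the matrix above:

    # matched host — should hit the per-Service site block
    curl -si -H 'Host: example.com' http://localhost:80/ | head -n 1
    # expect: HTTP/1.1 200

    # unmatched host — should fall through to the new default catch-all
    curl -si -H 'Host: support-bot.pay.space' http://localhost:80/ | head -n 6
    # expect: HTTP/1.1 503, Retry-After: 60, Cache-Control: no-store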
Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index 2905e9a0..72db9dc4 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -90,11 +90,21 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou } defaultCaddyFileEntryStart := `http:// {` + // Default catch-all serves a hard 503 instead of a static "welcome" page. + // Rationale: when all Services with a `simple-container.com/caddyfile-entry` + // annotation for a given Host vanish (e.g. a cascade-deletion from a + // namespace Replace gone wrong), the request used to fall through to a + // `file_server /etc/caddy/pages` block and respond with HTTP 200 + "Default + // page". External monitoring saw healthy 200s while every backend was gone. + // 503 + Retry-After makes the absence of routes loud: CDNs fail over, + // uptime checks alert, oncall sees it. defaultCaddyFileEntry := ` import gzip - import handle_static - root * /etc/caddy/pages - file_server + header Cache-Control "no-store" + header Retry-After "60" + respond "503 Service Unavailable

503 Service Unavailable

No backend route is configured for this host.

If you are an operator, verify the Service has the simple-container.com/caddyfile-entry annotation and that Caddy has been rolled.

" 503 { + close + } ` // if caddy must respect SSL connections only useSSL := caddy.UseSSL == nil || *caddy.UseSSL From d7b4d71f1f823e3686be34a92f8164b4151d67d2 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 22:30:52 +0400 Subject: [PATCH 04/11] fixup: round-1 codex/gemini review on default-block 503 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Five blockers/issues from the parallel review: 1. **Content-Type silently became text/plain** (codex, blocking). Caddy's `respond` defaults the response to text/plain when no explicit Content-Type is set on the route, so browsers visiting the catch-all saw the raw HTML as literal text. Fix: `header Content-Type "text/html; charset=utf-8"` inside the response path. Verified with curl against caddy v2.11.2: Content-Type now reports text/html. 2. **Headers leaked onto the HSTS 301 path** (codex, important). With `import hsts` appended, the `header Cache-Control "no-store"` and `header Retry-After "60"` directives applied to BOTH the 503 AND the `redir` 301-to-HTTPS that hsts adds. That's wrong for a 301 — clients shouldn't be told to retry a permanent redirect, and `no-store` defeats the redirect cache. Fix: wrap the headers + respond in an explicit `handle { ... }` so they only fire on the 503 path. 3. **HSTS redirect made the 503 unreachable behind a CDN** (codex, important; gemini noticed but called it acceptable — codex is right). Caddy directive ordering runs `redir` before `respond`. A request with `X-Forwarded-Proto: http` (which Cloudflare/GCP LB/most modern CDNs set) matched hsts's `@httpReq` matcher and got a 301 to HTTPS for the unknown host — then failed the TLS handshake because Caddy has no cert for the unknown SNI. The user-visible result was a browser-level TLS error, invisible to HTTP-layer monitoring — exactly the failure mode this PR is trying to fix. Fix: omit `import hsts` from the catch-all entirely. HSTS on a Host-agnostic catch-all is semantically meaningless anyway (the header tells browsers "always use HTTPS for THIS host", but the catch-all answers any host). Per-Service site blocks still get HSTS via their own `import hsts`. Verified: `Host: support-bot.pay.space` with `X-Forwarded-Proto: http` now returns 503 directly instead of 301. 4. **Stale comment in the dedup section** (codex). The pipefail rationale comment still said a kubectl failure would "serve the welcome page from /etc/caddy/pages". With commit 3 in this PR the welcome page is gone; the failure mode is "503 on every domain". That's still a complete loss of routing for the cluster and worth bailing loud over, but the comment now describes the actual current behavior. 5. **/etc/caddy/pages/index.html is dead** (codex + gemini). Was only referenced by the old `file_server` catch-all; the per-Service `handle_*_error` snippets only reference 404/500/502.html. Deleted. Validation: - `caddy validate` clean on the assembled Caddyfile - `go build ./...` clean - `go test ./pkg/clouds/pulumi/kubernetes/... -count=1` passes - Live Caddy v2.11.2 probe matrix: Host: example.com (known) → 200 "ok" text/plain Host: support-bot.pay.space (unknown) → 503 text/html Cache-Control:no-store Retry-After:60 Connection:close Host: support-bot.pay.space + XFP:http → still 503 (no 301 anymore) Out of scope still: HTTPS catch-all for unknown SNI. Caddy doesn't synthesize a cert for unknown SNIs without explicit `default_sni` + matching wildcard cert config, which is per-cluster and not something this fix should bake in. 
Direct TLS handshake failure remains the behavior for unknown SNIs; the HTTP 503 path is what monitoring actually pings. Signed-off-by: Dmitrii Creed --- .../pulumi/kubernetes/embed/caddy/pages/index.html | 12 ------------ 1 file changed, 12 deletions(-) delete mode 100644 pkg/clouds/pulumi/kubernetes/embed/caddy/pages/index.html diff --git a/pkg/clouds/pulumi/kubernetes/embed/caddy/pages/index.html b/pkg/clouds/pulumi/kubernetes/embed/caddy/pages/index.html deleted file mode 100644 index cb8f0acc..00000000 --- a/pkg/clouds/pulumi/kubernetes/embed/caddy/pages/index.html +++ /dev/null @@ -1,12 +0,0 @@ - -Default page - - -
-

Default page

-
\ No newline at end of file From 328e7968f46c00d5e95d064fcdba09130a16eb29 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 22:31:24 +0400 Subject: [PATCH 05/11] fixup: round-1 codex/gemini review (the actual code, missed in prev commit) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previous commit d7b4d71 captured only the index.html deletion but the caddy.go changes weren't staged (git rm + git add interaction). This adds them: content-type, handle wrapper, hsts removal, stale comment. See d7b4d71's commit message for the full rationale of all five review findings — repeated here for completeness: 1. respond defaults to text/plain — add `header Content-Type "text/html; charset=utf-8"` so browsers render the HTML body. 2. Cache-Control + Retry-After leaked onto the HSTS 301 path — wrap headers + respond in explicit `handle { ... }`. 3. HSTS redirect made the catch-all 503 unreachable behind CDNs that set X-Forwarded-Proto — drop `import hsts` from the catch-all. 4. Stale comment about welcome page failure mode — updated to reflect the new 503 failure mode. 5. (the index.html deletion, landed in d7b4d71) Verified live against simplecontainer/caddy:latest: Host: example.com → 200 "ok" Host: support-bot.pay.space → 503 text/html, Cache-Control: no-store, Retry-After: 60, Connection: close Same Host + X-Forwarded-Proto: http → 503 (was 301 before) Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 37 +++++++++++++++++++-------- 1 file changed, 26 insertions(+), 11 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index 72db9dc4..323cfc59 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -98,19 +98,33 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou // page". External monitoring saw healthy 200s while every backend was gone. // 503 + Retry-After makes the absence of routes loud: CDNs fail over, // uptime checks alert, oncall sees it. + // + // Headers + respond are wrapped in an explicit `handle { ... }` so they + // only apply to the 503 path. Without `handle`, Caddy directive ordering + // (redir > respond) means a `redir` from `import hsts` would fire first + // and the catch-all 503 would never be reachable behind a CDN that sets + // X-Forwarded-Proto. The header directives would also leak Cache-Control + // and Retry-After onto an unrelated 301. We also intentionally do NOT + // `import hsts` here — sending an HSTS header from a catch-all that + // answers any Host is meaningless, and the HTTP→HTTPS redirect would only + // route the request into a TLS handshake failure (Caddy has no cert for + // an unknown SNI), which is invisible to HTTP-layer monitoring. We want + // the 503 itself to be the loudest possible signal. defaultCaddyFileEntry := ` import gzip - header Cache-Control "no-store" - header Retry-After "60" - respond "503 Service Unavailable

503 Service Unavailable

No backend route is configured for this host.

If you are an operator, verify the Service has the simple-container.com/caddyfile-entry annotation and that Caddy has been rolled.

" 503 { - close + handle { + header Content-Type "text/html; charset=utf-8" + header Cache-Control "no-store" + header Retry-After "60" + respond "503 Service Unavailable

503 Service Unavailable

No backend route is configured for this host.

If you are an operator, verify the Service has the simple-container.com/caddyfile-entry annotation and that Caddy has been rolled.

" 503 { + close + } } ` - // if caddy must respect SSL connections only + // Still computed because it's threaded into per-stack Caddyfile entries + // elsewhere in this function; intentionally NOT applied to the catch-all + // default block above (see comment on `import hsts` omission). useSSL := caddy.UseSSL == nil || *caddy.UseSSL - if useSSL { - defaultCaddyFileEntry += "\nimport hsts" - } serviceAccountName := input.ToResName(fmt.Sprintf("%s-caddy-sa", input.Descriptor.Name)) serviceAccount, err := NewSimpleServiceAccount(ctx, serviceAccountName, &SimpleServiceAccountArgs{ @@ -203,9 +217,10 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou # blocks and Caddy aborted with "ambiguous site definition". # pipefail is critical here: a flaky kubectl piped into sort would # otherwise yield services="" and the init-container would silently - # emit a Caddyfile with only the default block — every domain would - # then serve the welcome page from /etc/caddy/pages on the next pod - # restart, masquerading as healthy 200s. + # emit a Caddyfile with only the default block on the next pod + # restart. That's now a 503 (cf. the default block above), but it's + # still a complete loss of routing for the entire cluster — bail + # loud so K8s reschedules the init-container and retries. raw_services=$(kubectl get services --all-namespaces -o jsonpath='{range .items[?(@.metadata.annotations.simple-container\.com/caddyfile-entry)]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}') services=$(printf '%s' "$raw_services" | sort -r) echo "$DEFAULT_ENTRY_START" >> /tmp/Caddyfile From 2d290fe42ac1cd69c553ea31fdc50bb81ea66d93 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 22:35:29 +0400 Subject: [PATCH 06/11] =?UTF-8?q?fixup:=20round-2=20nit=20=E2=80=94=20refr?= =?UTF-8?q?esh=20pipefail=20comment=20now=20that=20kubectl/sort=20are=20sp?= =?UTF-8?q?lit?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index 323cfc59..aecfbaf7 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -215,12 +215,17 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou # the old and new Services transiently coexist and both carry the same # annotation; without dedup that produced two "http:// { ... }" # blocks and Caddy aborted with "ambiguous site definition". - # pipefail is critical here: a flaky kubectl piped into sort would - # otherwise yield services="" and the init-container would silently - # emit a Caddyfile with only the default block on the next pod - # restart. That's now a 503 (cf. the default block above), but it's - # still a complete loss of routing for the entire cluster — bail - # loud so K8s reschedules the init-container and retries. + # kubectl and sort are split into separate assignments so a kubectl + # failure surfaces unambiguously even without pipefail (originally + # they were piped; pipefail was added in response to a review catch + # and we kept the structural split so future readers don't need to + # know about pipefail to reason about failure modes here). 
pipefail + # is kept on as belt-and-suspenders for the later `printf | sort` + # and the `printf "%s" "$services" | while read` pipeline below. + # If either listing step fails the init-container exits non-zero + # and K8s reschedules — preferable to a Caddyfile with only the + # default 503 block, which would mean a complete loss of routing + # for the entire cluster. raw_services=$(kubectl get services --all-namespaces -o jsonpath='{range .items[?(@.metadata.annotations.simple-container\.com/caddyfile-entry)]}{.metadata.creationTimestamp}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}') services=$(printf '%s' "$raw_services" | sort -r) echo "$DEFAULT_ENTRY_START" >> /tmp/Caddyfile From 9828eb5a9417611dc2d379e2e6361b40e64fe32a Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 22:36:42 +0400 Subject: [PATCH 07/11] fixup: remove backticks from comment that broke Go raw string Previous commit 2d290fe wrote comments with markdown-style backticks inside the Go raw-string-delimited bash script literal, which closed the raw string mid-comment and turned the rest into invalid Go ("syntax error: unexpected name printf in composite literal"). Replaced with plain text (printf-to-sort, printf-to-while-read). Should have built before pushing. `go build ./...` clean now. Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index aecfbaf7..539b0831 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -218,10 +218,10 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou # kubectl and sort are split into separate assignments so a kubectl # failure surfaces unambiguously even without pipefail (originally # they were piped; pipefail was added in response to a review catch - # and we kept the structural split so future readers don't need to - # know about pipefail to reason about failure modes here). pipefail - # is kept on as belt-and-suspenders for the later `printf | sort` - # and the `printf "%s" "$services" | while read` pipeline below. + # and we kept the structural split so future readers do not need + # to know about pipefail to reason about failure modes here). + # pipefail is kept on as belt-and-suspenders for the later + # printf-to-sort pipe and the printf-to-while-read pipeline below. # If either listing step fails the init-container exits non-zero # and K8s reschedules — preferable to a Caddyfile with only the # default 503 block, which would mean a complete loss of routing From 95730bf2ad390b081651b4a581e3ba1df68f9090 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Mon, 11 May 2026 22:55:38 +0400 Subject: [PATCH 08/11] =?UTF-8?q?fixup:=20round-3=20gemini=20=E2=80=94=20d?= =?UTF-8?q?rop=20set=20-x=20to=20avoid=20tracing=20annotations=20to=20logs?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Gemini round-3 review flagged `set -x` as a security regression: tracing every command prints the raw caddyfile-entry annotation body (and the output of `kubectl get service ...`) to stdout, which lands in cluster logging (GCP/Datadog/ELK). SC-generated annotations don't carry secrets, but consumer-side misuse — basicauth credentials in `Headers` map, or raw Caddy directives in `LbConfig.ExtraHelpers` — could template into the annotation body and leak via -x. 
The init container is rarely debugged live (when it is, an operator can override the command), so the debuggability cost is low. The script still emits informative one-line `Processing service: $service in namespace: $ns` and `Skipping duplicate caddyfile-entry ...` messages without -x. Kept: `cat /tmp/Caddyfile` at the end. That's the assembled config the Caddy server actually loads; printing it is useful for verifying rollouts and is consistent with prior behavior. If a consumer puts secrets into per-Service annotations they leak there too, but it's intentional logging of the deployed config, not an incidental per-command trace. Codex round-3 verdict was "clean, merge" but acknowledged the same exposure existed via `cat`. I'm siding with gemini on -x because the trace exposure compounds (every kubectl invocation × every Service × every pod restart) while `cat` is a single final dump. Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index 539b0831..bacbd925 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -200,7 +200,16 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou return envVars }(), Command: sdk.ToStringArray([]string{"bash", "-c", ` - set -xeo pipefail; + # set -e (exit on error) + pipefail (any pipe component fail = fail). + # Notably we do NOT enable -x here: tracing every command would dump + # the raw caddyfile-entry annotation body to stdout for every Service + # on every pod restart, which lands in cluster logs (GCP/Datadog/ELK). + # SC-generated annotations don't contain secrets, but consumer-side + # misuse (eg. basicauth credentials in Headers or LbConfig.ExtraHelpers + # that templated into the annotation) could leak via -x. The trade-off + # is debuggability — for live troubleshooting, re-enable -x by + # overriding the init-container command in the cluster. + set -eo pipefail; cp -f /etc/caddy/Caddyfile /tmp/Caddyfile; # Inject custom Caddyfile prefix at the top (e.g., GCS storage configuration) From efd25235c655cdd17680ad6a242923befb38e1b1 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Tue, 12 May 2026 17:07:41 +0400 Subject: [PATCH 09/11] fix(k8s): IgnoreChanges("metadata.name") on namespace, stop migration cascade MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR #230 changed custom-stack namespace naming from shared to per-stackEnv - and added RetainOnDelete(true), expecting that to protect existing consumers through the migration pulumi up. It didn't — Pulumi reads delete-time options from the state of the resource being deleted, not from the current program. Existing Namespace resources predate #230 and don't carry RetainOnDelete; when the new code computed a different metadata.Name, Pulumi diffed against state, scheduled a Replace, executed delete-old before create-new, and sent k8s DELETE on the legacy shared namespace. K8s cascade-deleted every resource inside, including the parent stack's production resources that lived in the same shared namespace. Confirmed outages: - PAY-SPACE 2026-05-10/11: support-bot parent + every whitelabel (support-payhey, support-rulex, support-gl-pay) cascade-deleted. Caddy fallout from this is also fixed in earlier commits of this PR. 
- fulldiveVR/wizeup-rooms-api 2026-05-12: namespace wize-rooms-api hosted both the likeclaw-us parent stack and the likeclaw-us-dev child. A routine merge to dev triggered the child's deploy. Pulumi plan: kubernetes:core/v1:Namespace: (replace) name "wize-rooms-api" => "wize-rooms-api-likeclaw-us-dev". Namespace deleted, rooms-api.wizeup.app returned 502 until prod was manually re-deployed. (actions/runs/25725750825) Fix: sdk.IgnoreChanges([]string{"metadata.name"}) on both Namespace registration sites — simple_container.go for client stacks, helpers.go's ensureNamespace for helm operator stacks. Behavior: - Fresh deploy of a new custom child stack: no prior state, no diff to ignore. Namespace created with the per-stackEnv name. PR #230's isolation goal preserved for new deploys. - Existing custom child stack on its next pulumi up: state has metadata.Name=, program desires metadata.Name= -. IgnoreChanges suppresses the diff — no Replace scheduled, no delete attempted. State retains the legacy name. Service/Deployment/etc. that reference namespace.Metadata.Name().Elem() now resolve to the legacy name and continue to land in the shared namespace. Migration cost: zero. Consumer is back to the pre-#230 sharing model, but RetainOnDelete protects against the cross-sibling destroy cascade #230 was originally added to solve. Both hazards now defused. - Existing consumer who actively wants per-stackEnv isolation: opt-in by removing the legacy Namespace resource from Pulumi state (state edit; k8s namespace itself stays put). Next pulumi up sees no prior namespace, registers a fresh one at the per-stackEnv name. Old k8s namespace continues to host the parent stack; the migrated child lives in the new isolated namespace. This is the established codebase pattern: rds_postgres.go:45 and rds_mysql.go:55 use IgnoreChanges([]string{"storageEncrypted"}) for the same purpose — silence a default flip so it doesn't propose a destructive replacement on existing stacks. Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/helpers.go | 14 +++-- .../pulumi/kubernetes/simple_container.go | 57 +++++++++++++------ 2 files changed, 50 insertions(+), 21 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/helpers.go b/pkg/clouds/pulumi/kubernetes/helpers.go index 0bef2a7b..a2e62f1c 100644 --- a/pkg/clouds/pulumi/kubernetes/helpers.go +++ b/pkg/clouds/pulumi/kubernetes/helpers.go @@ -45,10 +45,16 @@ func sanitizeK8sName(name string) string { } func ensureNamespace(ctx *sdk.Context, input api.ResourceInput, params pApi.ProvisionParams, namespace string) (*corev1.Namespace, error) { - // RetainOnDelete: see the rationale at simple_container.go's NewNamespace call — - // helm operator stacks share namespaces across sibling stacks the same way client - // stacks do, so the destroy-cascade hazard is identical here. - opts := []sdk.ResourceOption{sdk.Provider(params.Provider), sdk.RetainOnDelete(true)} + // RetainOnDelete + IgnoreChanges("metadata.name"): see the long rationale + // at simple_container.go's NewNamespace call. Helm operator stacks share + // namespaces across sibling stacks the same way client stacks do, so both + // the destroy-cascade hazard and the migration-time Replace cascade + // hazard apply identically here. 
+ opts := []sdk.ResourceOption{ + sdk.Provider(params.Provider), + sdk.RetainOnDelete(true), + sdk.IgnoreChanges([]string{"metadata.name"}), + } sanitizedNamespace := sanitizeK8sName(namespace) return corev1.NewNamespace(ctx, fmt.Sprintf("create-ns-%s-%s", sanitizedNamespace, input.ToResName(input.Descriptor.Name)), &corev1.NamespaceArgs{ Metadata: &metav1.ObjectMetaArgs{ diff --git a/pkg/clouds/pulumi/kubernetes/simple_container.go b/pkg/clouds/pulumi/kubernetes/simple_container.go index 74d35796..5f11462d 100644 --- a/pkg/clouds/pulumi/kubernetes/simple_container.go +++ b/pkg/clouds/pulumi/kubernetes/simple_container.go @@ -218,23 +218,46 @@ func NewSimpleContainer(ctx *sdk.Context, args *SimpleContainerArgs, opts ...sdk // Use deployment name as Pulumi resource name to ensure uniqueness across environments // while keeping the actual K8s namespace name as specified by the user. // - // RetainOnDelete: in legacy deploys, sub-env client stacks (e.g. parentEnv=production - // with stackEnv=tenant-a/tenant-b/...) shared one physical K8s namespace because the - // namespace metadata.Name was derived from stackName, not from stackEnv. Each stack - // tracked its own Pulumi Namespace resource with a unique URN, but they all referenced - // the same physical k8s namespace. Without RetainOnDelete, destroying any single - // sub-env stack would cascade-delete the shared namespace and wipe every sibling - // stack's resources (Deployments, Services, Secrets) — a real production outage when - // a throwaway sub-env destroy took down all live siblings. + // Namespace-handling has two protections against the destroy/Replace cascade + // hazard discovered in pre-PR-230 deploys (see PR #230 and the 2026-05-10 + // PAY-SPACE + 2026-05-12 fulldiveVR outages): // - // GenerateNamespaceName now isolates custom stacks per-stackEnv, but RetainOnDelete - // remains load-bearing for the migration step: when a pre-existing custom stack - // first runs `pulumi up` after this version, Pulumi Replaces the namespace, and the - // old shared namespace must NOT be deleted because the parent stack still lives - // there. Post-migration, RetainOnDelete continues to defend against any case where - // multiple stacks legitimately share a namespace (helm operators, explicit - // `Namespace` overrides). Empty namespaces left after the last referencing stack - // is destroyed must be cleaned up by hand. + // 1. RetainOnDelete(true). In legacy deploys, sub-env client stacks + // (parentEnv= with stackEnv=tenant-a/tenant-b/...) shared one + // physical K8s namespace because metadata.Name was derived from + // stackName, not stackEnv. Each stack tracked its own Pulumi Namespace + // resource at a unique URN, but they all pointed at the same physical + // namespace. Destroying any single sub-env stack would cascade-delete + // the shared namespace and wipe every sibling. RetainOnDelete keeps + // Pulumi from issuing the k8s DELETE on destroy. + // + // 2. IgnoreChanges("metadata.name"). PR #230 changed GenerateNamespaceName + // to isolate custom stacks (stackName-stackEnv) instead of sharing the + // parent's namespace. That works for fresh deploys, but for any consumer + // whose Pulumi state predates #230, the next `pulumi up` saw a diff + // between state's metadata.Name="" and program's + // metadata.Name="-", and scheduled a Replace. 
+ // Replace = create-new + delete-old, and `RetainOnDelete` on the new + // resource is non-retroactive — Pulumi reads delete-time options from + // the OLD resource's state, which predates the flag. The k8s DELETE on + // the shared namespace went through and cascade-killed the parent + // stack's running resources. + // + // IgnoreChanges("metadata.name") suppresses the diff entirely. No + // Replace is scheduled, no delete fires. The resource state retains + // whatever metadata.Name it had (new for fresh deploys, legacy shared + // for migrated consumers). Other resources (Service, Deployment, …) + // that reference namespace.Metadata.Name().Elem() follow whichever + // name is in effect — fresh deploys land in the isolated namespace, + // migrated consumers continue using the shared one. Combined with + // RetainOnDelete this keeps both modes safe. + // + // Consumers who want to migrate an existing custom stack to the + // isolated namespace name opt in by running + // pulumi stack export | jq 'del(... namespace urn ...)' | pulumi stack import + // (forget the namespace resource — k8s namespace itself stays put), + // then the next pulumi up registers a fresh Namespace at the isolated + // name. Documented in the PR description. namespaceResourceName := fmt.Sprintf("%s-ns", sanitizedDeployment) namespace, err := corev1.NewNamespace(ctx, namespaceResourceName, &corev1.NamespaceArgs{ Metadata: &metav1.ObjectMetaArgs{ @@ -242,7 +265,7 @@ func NewSimpleContainer(ctx *sdk.Context, args *SimpleContainerArgs, opts ...sdk Labels: sdk.ToStringMap(appLabels), Annotations: sdk.ToStringMap(appAnnotations), }, - }, append(opts, sdk.RetainOnDelete(true))...) + }, append(opts, sdk.RetainOnDelete(true), sdk.IgnoreChanges([]string{"metadata.name"}))...) if err != nil { return nil, err } From 1d248d945e4647f22b481a40a2e96acc2af6c64b Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Tue, 12 May 2026 18:33:39 +0400 Subject: [PATCH 10/11] fix(k8s): caddyfile-entry + VPA follow live namespace, not sanitizedNamespace MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex round-2 review of efd2523 caught that IgnoreChanges("metadata.name") defeats its own purpose at two callsites that still bake the program- computed namespace name into resources downstream of the Namespace itself: - caddyfile-entry annotation. The Service annotation template at simple_container.go:673 was using sanitizedNamespace as the reverse_proxy upstream namespace. With IgnoreChanges in place, a migrated stack's Service is created in the legacy shared namespace (because namespace.Metadata.Name() resolves to the legacy value), but the annotation pointed Caddy at ..svc.cluster.local — DNS fails to resolve, Caddy 502s for the affected host. - VPA. createVPA was called with sanitizedNamespace and built the VPA CRD with metadata.namespace = NEW name. The Deployment it targets lives in the legacy namespace, so the VPA sits orphaned and never scales the workload. Both bugs ship the migration cascade fix (efd2523) without actually preventing 502s or autoscaling regression for migrated stacks. Fix: 1. Caddyfile-entry template extracted to a local variable (caddyfileEntryTemplate). The same template is rendered twice: - synchronously into caddyfileEntry (string) for sc.CaddyfileEntry export — that's used as a change-hash signal by kube_run.go and intentionally tracks the desired-config view, not the migrated live state. 
- asynchronously into caddyfileEntryAnnotation (sdk.StringOutput) via namespace.Metadata.Name().ApplyT — resolves namespace at apply time. For fresh deploys (liveNS == sanitizedNamespace), the callback returns the statically-rendered template verbatim, so byte output matches the legacy code path. For migrated stacks (liveNS != sanitizedNamespace), it re-applies placeholders with the live namespace and returns the new string. 2. Render failures inside the ApplyT callback are returned as errors wrapped with errors.Wrapf, NOT silently fallen back to the statically-rendered template. Falling back would re-introduce the exact migrated-stack 502 bug this commit is fixing. Codex review flag — the silent fallback was the wrong failure mode. 3. Service/Ingress annotation maps switched from sdk.ToStringMap(map[string]string) to a manually-built sdk.StringMap so the caddyfile-entry value can be an Output while the rest stay static. Equivalent for static-only entries. 4. createVPA signature: namespace string → namespace sdk.StringInput. The metadata.Namespace field directly accepts the Pulumi input. Caller now passes namespace.Metadata.Name().Elem(), which is the live Namespace.metadata.name Output. Verification: - go build ./... clean - go test ./pkg/clouds/pulumi/kubernetes/... -count=1 passes - Three rounds of parallel codex + gemini review on the namespace work; this commit addresses the round-3 follow-ups (template duplication cleaned up, ApplyT error propagation made fatal). Pairs with efd2523 (the IgnoreChanges fix) as the complete cascade- prevention story: efd2523 stops the Namespace itself from being Replace-deleted, this commit stops the downstream resources (Service annotation + VPA) from drifting away from where the Namespace actually landed. Signed-off-by: Dmitrii Creed --- .../pulumi/kubernetes/simple_container.go | 96 ++++++++++++++++--- 1 file changed, 85 insertions(+), 11 deletions(-) diff --git a/pkg/clouds/pulumi/kubernetes/simple_container.go b/pkg/clouds/pulumi/kubernetes/simple_container.go index 5f11462d..e5d503a8 100644 --- a/pkg/clouds/pulumi/kubernetes/simple_container.go +++ b/pkg/clouds/pulumi/kubernetes/simple_container.go @@ -636,20 +636,27 @@ func NewSimpleContainer(ctx *sdk.Context, args *SimpleContainerArgs, opts ...sdk serviceAnnotations := lo.Assign(appAnnotations) var caddyfileEntry string + var caddyfileEntryAnnotation sdk.StringInput if args.GenerateCaddyfileEntry && mainPort != nil { + // The unsubstituted template — used for both the initial sync render + // (sc.CaddyfileEntry static export, change-hash signal) and the + // deferred re-render inside ApplyT below (live-namespace annotation + // on the Service). Single source of truth so any template tweak + // updates both paths. 
+ var caddyfileEntryTemplate string if args.Domain != "" { - caddyfileEntry = ` + caddyfileEntryTemplate = ` ${proto}://${domain} { reverse_proxy http://${service}.${namespace}.svc.cluster.local:${port} { header_down Server nginx ${addHeaders} import handle_server_error ${extraHelpers} } - ${imports} + ${imports} } ` } else if args.Prefix != "" { - caddyfileEntry = ` + caddyfileEntryTemplate = ` handle_path /${prefix}* {${additionalProxyConfig} reverse_proxy http://${service}.${namespace}.svc.cluster.local:${port} { header_down Server nginx ${addHeaders} @@ -683,9 +690,51 @@ ${proto}://${domain} { } else { placeholdersMap["additionalProxyConfig"] = "" } + // Apply placeholders synchronously so the static representation + // (used for sc.CaddyfileEntry change-hash + log lines) is populated. + // `namespace` here is sanitizedNamespace — for fresh deploys that + // matches the live k8s namespace, but for migrated stacks with + // IgnoreChanges("metadata.name") suppressing the rename the live + // namespace stays at the legacy value. The annotation that lands + // on the Service is computed from the live namespace Output below. + caddyfileEntry = caddyfileEntryTemplate if err := placeholders.New().Apply(&caddyfileEntry, placeholders.WithData(placeholdersMap)); err != nil { return nil, errors.Wrapf(err, "failed to apply placeholders on caddyfile entry template") } + + // Build the actual annotation as an Output that resolves namespace + // from the live Namespace resource's metadata.name. On migrated + // stacks this is the legacy shared name (because of IgnoreChanges), + // which is also where the Service is created, so reverse_proxy + // http://${service}.${namespace}.svc.cluster.local resolves to + // real cluster DNS. On fresh deploys it equals sanitizedNamespace + // so the byte output matches the legacy code path. + // + // Render failures inside ApplyT are returned as errors (not silently + // fallen back to the statically-rendered template) — falling back + // would re-introduce the migrated-stack 502 bug this PR is fixing. + staticEntry := caddyfileEntry + caddyfileEntryAnnotation = namespace.Metadata.Name().ApplyT(func(nsPtr *string) (string, error) { + liveNS := sanitizedNamespace + if nsPtr != nil && *nsPtr != "" { + liveNS = *nsPtr + } + if liveNS == sanitizedNamespace { + // Fresh deploy or no migration: static template is correct verbatim. + return staticEntry, nil + } + // Migrated stack: re-render with the live (legacy) namespace. + localMap := make(placeholders.MapData, len(placeholdersMap)) + for k, v := range placeholdersMap { + localMap[k] = v + } + localMap["namespace"] = liveNS + rendered := caddyfileEntryTemplate + if err := placeholders.New().Apply(&rendered, placeholders.WithData(localMap)); err != nil { + return "", errors.Wrapf(err, "failed to re-render caddyfile entry for live namespace %q", liveNS) + } + return rendered, nil + }).(sdk.StringOutput) serviceAnnotations[AnnotationCaddyfileEntry] = caddyfileEntry } @@ -698,6 +747,18 @@ ${proto}://${domain} { }) } } + // Build the Pulumi-input annotation map. The caddyfile-entry value, if + // any, is an Output that resolves the namespace placeholder against the + // live Namespace resource (so IgnoreChanges'd migrated stacks point at + // the legacy shared namespace, fresh deploys point at the per-stackEnv + // namespace). Everything else is a static string. 
+ serviceAnnotationsInput := sdk.StringMap{} + for k, v := range serviceAnnotations { + serviceAnnotationsInput[k] = sdk.String(v) + } + if caddyfileEntryAnnotation != nil { + serviceAnnotationsInput[AnnotationCaddyfileEntry] = caddyfileEntryAnnotation + } var service *corev1.Service if len(lo.FromPtr(args.IngressContainer).Ports) > 0 { service, err = corev1.NewService(ctx, sanitizedService, &corev1.ServiceArgs{ @@ -705,7 +766,7 @@ ${proto}://${domain} { Name: sdk.String(sanitizedService), Namespace: namespace.Metadata.Name().Elem(), Labels: sdk.ToStringMap(appLabels), - Annotations: sdk.ToStringMap(serviceAnnotations), + Annotations: serviceAnnotationsInput, }, Spec: &corev1.ServiceSpecArgs{ Selector: sdk.ToStringMap(appLabels), @@ -724,16 +785,25 @@ ${proto}://${domain} { if mainPort == nil { return nil, errors.Errorf("cannot provision ingress when no main port is specified") } - ingressAnnotations := lo.Assign(serviceAnnotations) + // Mirror the Service-side annotation map (Pulumi-input with the + // live-namespace caddyfile-entry Output) and overlay the + // Ingress-only ssl-redirect tweak. + ingressAnnotationsInput := sdk.StringMap{} + for k, v := range serviceAnnotations { + ingressAnnotationsInput[k] = sdk.String(v) + } + if caddyfileEntryAnnotation != nil { + ingressAnnotationsInput[AnnotationCaddyfileEntry] = caddyfileEntryAnnotation + } if args.UseSSL { - ingressAnnotations["ingress.kubernetes.io/ssl-redirect"] = "false" // do not need ssl redirect from kube + ingressAnnotationsInput["ingress.kubernetes.io/ssl-redirect"] = sdk.String("false") // do not need ssl redirect from kube } _, err = networkv1.NewIngress(ctx, sanitizedService, &networkv1.IngressArgs{ Metadata: &metav1.ObjectMetaArgs{ Name: sdk.String(sanitizedService), Namespace: namespace.Metadata.Name().Elem(), Labels: sdk.ToStringMap(appLabels), - Annotations: sdk.ToStringMap(ingressAnnotations), + Annotations: ingressAnnotationsInput, }, Spec: &networkv1.IngressSpecArgs{ Rules: networkv1.IngressRuleArray{ @@ -820,9 +890,13 @@ ${proto}://${domain} { return nil, err } - // Create VPA if enabled + // Create VPA if enabled. Pass the live namespace name (Pulumi Output) + // rather than the program-computed sanitizedNamespace string, so the + // VPA lands in the same namespace as its target Deployment on migrated + // stacks (where IgnoreChanges("metadata.name") keeps the namespace at + // the legacy shared value). 
if args.VPA != nil && args.VPA.Enabled { - if err := createVPA(ctx, args, baseResourceName, sanitizedNamespace, appLabels, appAnnotations, opts...); err != nil { + if err := createVPA(ctx, args, baseResourceName, namespace.Metadata.Name().Elem(), appLabels, appAnnotations, opts...); err != nil { return nil, errors.Wrapf(err, "failed to create VPA for deployment %s", baseResourceName) } } @@ -866,7 +940,7 @@ ${proto}://${domain} { return sc, nil } -func createVPA(ctx *sdk.Context, args *SimpleContainerArgs, deploymentName, namespace string, labels, annotations map[string]string, opts ...sdk.ResourceOption) error { +func createVPA(ctx *sdk.Context, args *SimpleContainerArgs, deploymentName string, namespace sdk.StringInput, labels, annotations map[string]string, opts ...sdk.ResourceOption) error { vpaName := fmt.Sprintf("%s-vpa", deploymentName) // Build VPA spec content @@ -961,7 +1035,7 @@ func createVPA(ctx *sdk.Context, args *SimpleContainerArgs, deploymentName, name Kind: sdk.String("VerticalPodAutoscaler"), Metadata: &metav1.ObjectMetaArgs{ Name: sdk.String(vpaName), - Namespace: sdk.String(namespace), + Namespace: namespace, Labels: sdk.ToStringMap(vpaLabels), Annotations: sdk.ToStringMap(vpaAnnotations), }, From 3b6d44bd2e01fd552772b25ca6d4b0955666b2f2 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Tue, 12 May 2026 22:19:55 +0400 Subject: [PATCH 11/11] refactor(caddy): default 503 page served from embed/pages/503.html MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Dev review pointed out the inconsistency in commit e5a6519 + d7b4d71: the default catch-all 503 was inlined as a string literal in caddy.go, while the other status pages (404, 500, 502) are still served from /etc/caddy/pages/{code}.html via the handle_bucket_error / handle_server_error snippets in embed/caddy/Caddyfile. Two different mechanisms for the same class of response. Refactor to the same file-based pattern: - New pages/503.html with the SC-operator instruction body that was previously inlined ("No backend route is configured for this host" + hint to check the simple-container.com/caddyfile-entry annotation). - caddy.go's default catch-all switches from respond "..." 503 { close } to root * /etc/caddy/pages rewrite * /503.html file_server { status 503 } - Drops the explicit header Content-Type — file_server emits it automatically from the .html extension. - index.html is NOT restored; the 200-OK welcome page was the original failure mode, replaced now by 503.html. Wins: - Symmetry: one pattern for every status page in the codebase. - Operator override: a cluster operator can mount a ConfigMap at /etc/caddy/pages/503.html to customize the body (branded outage page, i18n, etc.) without touching SC api code. - Smaller raw string in caddy.go; the HTML body is no longer inlined via a one-line `respond ""` blob. Verified live with simplecontainer/caddy:latest serving the assembled Caddyfile + the embedded pages dir mounted at /etc/caddy/pages: Host: example.com (matched) → HTTP 200 Host: support-bot.pay.space (unmatched) → HTTP 503 Content-Type: text/html; charset=utf-8 Cache-Control: no-store Retry-After: 60 body = pages/503.html Same + X-Forwarded-Proto: http → HTTP 503 (no HSTS redirect, since catch-all still doesn't `import hsts` — see e5a6519's comment). Build clean, go test ./pkg/clouds/pulumi/kubernetes/... passes. 
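
A sketch of that local verification setup, for whoever repeats it later: the image tag
comes from this message, while the published port and the working-directory layout (an
assembled Caddyfile plus the embedded pages dir copied next to it) are assumptions for
illustration, not a pinned harness:

    docker run --rm -d --name caddy-503-check -p 8080:80 \
      -v "$PWD/Caddyfile:/etc/caddy/Caddyfile:ro" \
      -v "$PWD/pages:/etc/caddy/pages:ro" \
      simplecontainer/caddy:latest
    curl -si -H 'Host: support-bot.pay.space' http://localhost:8080/ | head -n 6
    # expect: HTTP/1.1 503, Content-Type: text/html; charset=utf-8,
    #         body served from pages/503.html
    docker rm -f caddy-503-check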
Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/kubernetes/caddy.go | 53 +++++++++++-------- .../kubernetes/embed/caddy/pages/503.html | 17 ++++++ 2 files changed, 48 insertions(+), 22 deletions(-) create mode 100644 pkg/clouds/pulumi/kubernetes/embed/caddy/pages/503.html diff --git a/pkg/clouds/pulumi/kubernetes/caddy.go b/pkg/clouds/pulumi/kubernetes/caddy.go index bacbd925..32a5bc5c 100644 --- a/pkg/clouds/pulumi/kubernetes/caddy.go +++ b/pkg/clouds/pulumi/kubernetes/caddy.go @@ -90,34 +90,43 @@ func DeployCaddyService(ctx *sdk.Context, caddy CaddyDeployment, input api.Resou } defaultCaddyFileEntryStart := `http:// {` - // Default catch-all serves a hard 503 instead of a static "welcome" page. - // Rationale: when all Services with a `simple-container.com/caddyfile-entry` - // annotation for a given Host vanish (e.g. a cascade-deletion from a - // namespace Replace gone wrong), the request used to fall through to a - // `file_server /etc/caddy/pages` block and respond with HTTP 200 + "Default - // page". External monitoring saw healthy 200s while every backend was gone. - // 503 + Retry-After makes the absence of routes loud: CDNs fail over, - // uptime checks alert, oncall sees it. + // Default catch-all serves a hard 503 page from /etc/caddy/pages/503.html + // instead of `file_server` over the whole pages dir (which used to serve + // index.html with status 200 for any unknown Host — invisible to monitoring + // when every backend was gone). // - // Headers + respond are wrapped in an explicit `handle { ... }` so they - // only apply to the 503 path. Without `handle`, Caddy directive ordering - // (redir > respond) means a `redir` from `import hsts` would fire first - // and the catch-all 503 would never be reachable behind a CDN that sets - // X-Forwarded-Proto. The header directives would also leak Cache-Control - // and Retry-After onto an unrelated 301. We also intentionally do NOT - // `import hsts` here — sending an HSTS header from a catch-all that - // answers any Host is meaningless, and the HTTP→HTTPS redirect would only - // route the request into a TLS handshake failure (Caddy has no cert for - // an unknown SNI), which is invisible to HTTP-layer monitoring. We want - // the 503 itself to be the loudest possible signal. + // Rationale for 503: + // - When all Services with `simple-container.com/caddyfile-entry` for a + // given Host vanish (e.g. cascade-deletion from a namespace Replace gone + // wrong), the request now gets HTTP 503 + Retry-After. CDNs fail over, + // uptime checks alert, oncall sees it. + // + // Why file_server (not respond with inlined HTML): + // - Symmetric with the existing `handle_bucket_error` / `handle_server_error` + // snippets in embed/caddy/Caddyfile, which serve {404,500,502}.html the + // same way for per-Service error fallbacks. One pattern for every status + // page in this codebase. + // - file_server emits Content-Type automatically from the file extension, + // so no explicit `header Content-Type` needed. + // - Operators can override the 503 body by mounting a different ConfigMap + // at /etc/caddy/pages/503.html without touching SC api code. + // + // Wrapped in `handle { ... }` so the directives below apply only to the + // 503 path and nothing else can short-circuit (e.g. `import hsts` redir + // firing before the response). 
We also intentionally do NOT `import hsts` + // here — sending an HSTS header from a catch-all that answers any Host is + // meaningless, and the HTTP→HTTPS redirect would only route the request + // into a TLS handshake failure (Caddy has no cert for an unknown SNI), + // which is invisible to HTTP-layer monitoring. defaultCaddyFileEntry := ` import gzip handle { - header Content-Type "text/html; charset=utf-8" + root * /etc/caddy/pages + rewrite * /503.html header Cache-Control "no-store" header Retry-After "60" - respond "503 Service Unavailable

503 Service Unavailable

No backend route is configured for this host.

If you are an operator, verify the Service has the simple-container.com/caddyfile-entry annotation and that Caddy has been rolled.

" 503 { - close + file_server { + status 503 } } ` diff --git a/pkg/clouds/pulumi/kubernetes/embed/caddy/pages/503.html b/pkg/clouds/pulumi/kubernetes/embed/caddy/pages/503.html new file mode 100644 index 00000000..85412ecd --- /dev/null +++ b/pkg/clouds/pulumi/kubernetes/embed/caddy/pages/503.html @@ -0,0 +1,17 @@ + +503 Service Unavailable + + +
+

503 Service Unavailable

+

No backend route is configured for this host.

+

If you are an operator, verify the Service has the + simple-container.com/caddyfile-entry annotation and that + Caddy has been rolled.

+