fix(caddy): default catch-all returns 503 instead of welcome page #256
Closed
Cre-eD wants to merge 1 commit into
When all Services with a `simple-container.com/caddyfile-entry`
annotation for a given Host disappeared (for example, through a
cascade-deletion from a namespace Replace gone wrong), requests fell
through to the catch-all `http:// { file_server /etc/caddy/pages }`
block and got back
HTTP 200 + "Default page" from index.html. External monitoring saw
healthy 200s. CDNs and load balancers saw 200s. Pingdom / UptimeRobot /
the dashboard everyone trusts saw 200s. The outage was invisible to
every layer that wasn't deep-inspecting the response body.
PAY-SPACE hit this on 2026-05-10: the migration from SC #230 cascade-
deleted the shared parent namespace, every Service annotation for
production hosts evaporated, and every domain pointing at the cluster
served the Caddy welcome page. The outage was only noticed when a human
opened a browser tab.
Change:
- Default catch-all now uses `respond ... 503 { close }` instead of
`file_server /etc/caddy/pages`.
- Retry-After: 60 so CDNs back off appropriately and clients know to
retry rather than treating 503 as a hard failure.
- Cache-Control: no-store so an aggressive cache doesn't pin the 503
state past route recovery.
- HTML body still rendered for humans visiting in a browser, but it's
now a 503 page that names the problem (missing
`simple-container.com/caddyfile-entry` annotation) and tells operators
what to check. The literal "Default page" string is gone.
Behavior verified by running the Caddy image with the new default block:
  configured host (Host: example.com)           → HTTP 200
  unmatched host  (Host: support-bot.pay.space) → HTTP 503
                                                  Retry-After: 60
                                                  Cache-Control: no-store
`caddy validate` against the full embedded Caddyfile + new default block
+ a sample matched site passes clean.
The /etc/caddy/pages directory (index.html, 404.html, 502.html, 500.html)
is still embedded and used by the `handle_bucket_error` and
`handle_server_error` snippets for legitimate per-Service error
fallbacks — only the catch-all stopped serving it as a 200.
Pairs with #255 (Caddy aggregator dedup) as the two halves of the
2026-05-10 PAY-SPACE outage: dedup keeps the aggregator from
crashlooping during a Service transition, this PR keeps the absence of
routes loud so it doesn't masquerade as a healthy 200.
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Semgrep Scan Results
Repository:
Scanned at 2026-05-11 17:59 UTC
Security Scan Results
Repository:
Scanned at 2026-05-11 17:59 UTC
Consolidated into #255 — same file, same outage, reviewing them together is more useful than reviewing the two halves separately. Branch |
Why
The Caddy default catch-all `http:// { file_server /etc/caddy/pages }` served HTTP 200 + a static "Default page" for any Host that didn't have a matching Service-annotated site block. When all Services with a `simple-container.com/caddyfile-entry` annotation for a given Host disappeared (e.g. cascade-delete from a namespace Replace gone sideways), every request to that domain came back 200 OK. Every external monitoring system trusted that 200. CDNs trusted it. Uptime checks trusted it. The outage was invisible to every layer that wasn't deep-inspecting the response body.
PAY-SPACE hit this on 2026-05-10 when the migration from #230 cascade-deleted the shared `support-bot` parent namespace. Every production domain pointing at the cluster served the Caddy welcome page. The outage was only noticed when a human opened a browser tab.
What this changes
`caddy.go:92-103`: the default catch-all now uses `respond ... 503 { close }`:
- `503` so every monitoring layer alarms when routes vanish
- `Retry-After: 60` so CDNs back off without giving up
- `Cache-Control: no-store` so an aggressive cache doesn't pin the 503 state past route recovery
- HTML body that names the problem (`simple-container.com/caddyfile-entry` annotation missing) and tells operators what to check
- `close` so we don't keep dead-end keepalive connections open
What this does NOT change
`/etc/caddy/pages/*.html` is still embedded and still used by the `handle_bucket_error` + `handle_server_error` snippets for legitimate per-Service error fallbacks (404, 500, 502). The catch-all just stopped serving from there.
Verification
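`caddy validate` against the full embedded Caddyfile + the new default block + a sample matched site passes clean.
End-to-end with `simplecontainer/caddy:latest` serving the test Caddyfile: a configured host (Host: example.com) returns HTTP 200, and an unmatched host (Host: support-bot.pay.space) returns HTTP 503 with `Retry-After: 60` and `Cache-Control: no-store`.
Compatibility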
Anything that relied on `http://<cluster-LB-IP>/` returning 200 (e.g. a misconfigured smoke test) will now see 503. That is a behavior change, but a strict improvement; the previous 200 was wrong.
Related
Test plan