fix(caddy): default catch-all returns 503 instead of welcome page #256

Closed
Cre-eD wants to merge 1 commit into main from fix/caddy-default-block-loud-503

Cre-eD commented May 11, 2026

Why

The Caddy default catch-all http:// { file_server /etc/caddy/pages } served HTTP 200 + a static "Default page" for any Host that didn't have a matching Service-annotated site block. When all Services with a simple-container.com/caddyfile-entry annotation for a given Host disappeared (e.g. cascade-delete from a namespace Replace gone sideways), every request to that domain came back 200 OK. Every external monitoring system trusted that 200. CDNs trusted it. Uptime checks trusted it. The outage was invisible to every layer that wasn't deep-inspecting the response body.

PAY-SPACE hit this on 2026-05-10 when the migration from #230 cascade-deleted the shared support-bot parent namespace. Every production domain pointing at the cluster served the Caddy welcome page. The outage was only noticed when a human opened a browser tab.

What this changes

caddy.go:92-103: the default catch-all now uses respond ... 503 { close }:

http:// {
  import gzip
  header Cache-Control "no-store"
  header Retry-After "60"
  respond "<!doctype html>... 503 Service Unavailable ..." 503 {
    close
  }
  import hsts   # when useSSL
}

  • 503 so every monitoring layer alarms when routes vanish
  • Retry-After: 60 so CDNs back off without giving up
  • Cache-Control: no-store so an aggressive cache doesn't pin the 503 state past route recovery
  • HTML body for humans visiting in a browser — now names the problem (simple-container.com/caddyfile-entry annotation missing) and tells operators what to check
  • close so we don't keep dead-end keepalive connections open

What this does NOT change

  • /etc/caddy/pages/*.html is still embedded and still used by handle_bucket_error + handle_server_error snippets for legitimate per-Service error fallbacks (404, 500, 502). The catch-all just stopped serving from there.
  • Domain-matching Service site blocks behave exactly as before — only the no-match path changed.

Verification

caddy validate against the full embedded Caddyfile + the new default block + a sample matched site:

{"level":"info","msg":"using config from file","file":"/etc/caddy/Caddyfile"}
{"level":"info","logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance"}
Valid configuration

End-to-end with simplecontainer/caddy:latest serving the test Caddyfile:

$ curl -sS -o /dev/null -w "HTTP %{http_code}\n" -H "Host: example.com" http://localhost:18080/
HTTP 200

$ curl -sS -w "HTTP %{http_code}\n" -H "Host: support-bot.pay.space" http://localhost:18080/
HTTP 503

$ curl -sIS -H "Host: support-bot.pay.space" http://localhost:18080/ | grep -iE 'retry-after|cache-control'
Cache-Control: no-store
Retry-After: 60

Compatibility

  • No SC consumer documentation references the welcome-page behavior. The catch-all 200 was incidental, not contractual.
  • Any caller that was depending on http://<cluster-LB-IP>/ returning 200 (e.g. a misconfigured smoke test) will now see 503 — that's a behavior change but it's a strict improvement; the previous 200 was wrong.

Related

Test plan

  • Merge
  • Branch-preview build picked up by next deploy
  • Trigger a Caddy roll and probe a known-good Host (expect 200) and an arbitrary unmatched Host (expect 503)
  • Confirm Cloudflare / upstream LB respects the 503 — surfaces in monitoring instead of silently passing through

When all Services with a `simple-container.com/caddyfile-entry`
annotation for a given Host disappeared (for example, a cascade-deletion
from a namespace Replace gone wrong), requests fell through to the
catch-all `http:// { file_server /etc/caddy/pages }` block and got back
HTTP 200 + "Default page" from index.html. External monitoring saw
healthy 200s. CDNs and load balancers saw 200s. Pingdom / UptimeRobot /
the dashboard everyone trusts saw 200s. The outage was invisible to
every layer that wasn't deep-inspecting the response body.

PAY-SPACE hit this on 2026-05-10: the migration from SC #230 cascade-
deleted the shared parent namespace, every Service annotation for
production hosts evaporated, and every domain pointing at the cluster
served the Caddy welcome page. The outage was only noticed when a human
opened a browser tab.

Change:
- Default catch-all now uses `respond ... 503 { close }` instead of
  `file_server /etc/caddy/pages`.
- Retry-After: 60 so CDNs back off appropriately and clients know to
  retry rather than treating 503 as a hard failure.
- Cache-Control: no-store so an aggressive cache doesn't pin the 503
  state past route recovery.
- HTML body still rendered for humans visiting in a browser, but it's
  now a 503 page that names the problem (missing
  `simple-container.com/caddyfile-entry` annotation) and tells operators
  what to check. The literal "Default page" string is gone.

Behavior verified by running the Caddy image with the new default block:

  configured host (Host: example.com)     → HTTP 200
  unmatched host (Host: support-bot.pay.space) → HTTP 503
    Retry-After: 60
    Cache-Control: no-store

`caddy validate` against the full embedded Caddyfile + new default block
+ a sample matched site passes clean.

The /etc/caddy/pages directory (index.html, 404.html, 502.html, 500.html)
is still embedded and used by the `handle_bucket_error` and
`handle_server_error` snippets for legitimate per-Service error
fallbacks — only the catch-all stopped serving it as a 200.

Pairs with #255 (Caddy aggregator dedup) as the two halves of the
2026-05-10 PAY-SPACE outage: dedup keeps the aggregator from
crashlooping during a Service transition; this PR keeps the absence of
routes loud so it doesn't masquerade as a healthy 200.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@github-actions

Semgrep Scan Results

Repository: api | Commit: c083fac

| Check | Status | Details |
| --- | --- | --- |
| ✅ Semgrep | Pass | 0 total findings (no error/warning) |

Scanned at 2026-05-11 17:59 UTC

@github-actions

Security Scan Results

Repository: api | Commit: c083fac

| Check | Status | Details |
| --- | --- | --- |
| ✅ Secret Scan | Pass | No secrets detected |
| ✅ Dependencies (Trivy) | Pass | 0 total (no critical/high) |
| ✅ Dependencies (Grype) | Pass | 0 total (no critical/high) |
| 📦 SBOM | Generated | 470 components (CycloneDX) |

Scanned at 2026-05-11 17:59 UTC

Cre-eD commented May 11, 2026

Consolidated into #255 — same file, same outage, reviewing them together is more useful than reviewing the two halves separately. Branch fix/caddy-default-block-loud-503 is unused but kept on the remote for archeology.
