OPRUN-4597: Fix lifecycle-server readiness probe to allow empty index#1288
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughHealth readiness now depends on whether lifecycle data loading failed. The start logic logs load success or failure and continues initialization. The health handler API was changed to accept a readiness flag; tests updated to exercise the new ready/health responses. ChangesHealth Check Readiness Refactor
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@perdasilva: This pull request references OPRUN-4597 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/lifecycle-server/start.go (1)
137-157:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEmpty successful loads still make the API behave as unavailable.
With this wiring,
loadErr == nilmakes/readyzreturn 200 for an empty-but-valid catalog, butserver.NewHandler(data, log)still returns 503 for every/api/...request whenlen(data) == 0(pkg/lifecycle-server/server.go, Lines 61-65). That means a catalog with no lifecycle metadata is reported ready while the API still looks broken.Please distinguish “load failed” from “loaded successfully but empty” in the API path too, so empty catalogs fall through to normal 404 behavior instead of 503.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/lifecycle-server/start.go` around lines 137 - 157, The API incorrectly treats an empty-but-successful catalog as a load failure; update the wiring so server.NewHandler can distinguish real load failures from empty-but-valid data by passing the load error (loadErr) into the handler and changing NewHandler to base its 503 vs normal behavior on that error rather than on data length; specifically, modify the call apiHandler := server.NewHandler(data, log) -> server.NewHandler(data, loadErr, log) (and propagate that parameter into authnzFilter creation), then update the NewHandler implementation (and any internal checks that use len(data) to return 503) to check the loadErr value (and only return 503 when loadErr != nil), leaving empty catalogs to behave normally (404s) while keeping NewHealthHandler(loadErr, log) as-is.
🧹 Nitpick comments (2)
pkg/lifecycle-server/server_test.go (1)
244-246: ⚡ Quick winAssert that
/readyznever leaks the raw load error.
require.Containsis too loose for the failure case: the test would still pass if the handler appendedtc.loadErr.Error()after the generic message.Suggested test hardening
body, err := io.ReadAll(resp.Body) require.NoError(t, err, "failed to read response body") - require.Contains(t, string(body), tc.expectedBody, "unexpected response body") + require.Equal(t, tc.expectedBody, string(body), "unexpected response body") + if tc.loadErr != nil { + require.NotContains(t, string(body), tc.loadErr.Error(), "readiness response should not leak internal errors") + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/lifecycle-server/server_test.go` around lines 244 - 246, The test for the /readyz handler currently uses require.Contains which can still pass if the handler appends the raw load error; update the assertion after reading resp.Body (the lines using io.ReadAll, require.NoError, and require.Contains) to ensure the response equals the expected generic message and never includes the raw error: replace require.Contains with a strict equality check (require.Equal(t, tc.expectedBody, string(body))) and/or add an explicit require.NotContains(t, string(body), tc.loadErr.Error()) to guarantee tc.loadErr.Error() is not leaked.pkg/lifecycle-server/server.go (1)
20-20: ⚡ Quick winAvoid logging the same load error on every
/readyzprobe.
run()already logs the startup failure once, so logging again here will emit the same error on every readiness check while the pod is unhealthy. That can flood logs and bury the original failure signal.Proposed cleanup
-import ( - "fmt" - "net/http" +import ( + "net/http" "github.com/go-logr/logr" ) func NewHealthHandler(loadErr error, log logr.Logger) http.Handler { mux := http.NewServeMux() + if loadErr != nil { + log.Error(loadErr, "lifecycle index failed to load; readiness will report unavailable") + } mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) _, _ = w.Write([]byte("ok")) }) mux.HandleFunc("GET /readyz", func(w http.ResponseWriter, r *http.Request) { if loadErr != nil { - log.Error(loadErr, "readiness check failed") w.WriteHeader(http.StatusServiceUnavailable) - _, _ = w.Write([]byte(fmt.Sprintf("error parsing catalog lifecycle metadata, check logs for details"))) + _, _ = w.Write([]byte("error parsing catalog lifecycle metadata, check logs for details")) return }Also applies to: 30-40
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/lifecycle-server/server.go` at line 20, The readiness handler is re-logging the same startup/load error on every /readyz probe; remove the duplicate logging in the readiness endpoint so run() remains the single place that logs the failure. Update the readiness handler (the /readyz handler function referenced alongside run()) to check the same failure condition but only return the appropriate non-200 status and short message/body without calling the logger; keep run()'s existing logging as the one-time error report. Ensure any other readiness-related handlers in the same file (lines ~30-40) follow the same pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cmd/lifecycle-server/start.go`:
- Around line 137-157: The API incorrectly treats an empty-but-successful
catalog as a load failure; update the wiring so server.NewHandler can
distinguish real load failures from empty-but-valid data by passing the load
error (loadErr) into the handler and changing NewHandler to base its 503 vs
normal behavior on that error rather than on data length; specifically, modify
the call apiHandler := server.NewHandler(data, log) -> server.NewHandler(data,
loadErr, log) (and propagate that parameter into authnzFilter creation), then
update the NewHandler implementation (and any internal checks that use len(data)
to return 503) to check the loadErr value (and only return 503 when loadErr !=
nil), leaving empty catalogs to behave normally (404s) while keeping
NewHealthHandler(loadErr, log) as-is.
---
Nitpick comments:
In `@pkg/lifecycle-server/server_test.go`:
- Around line 244-246: The test for the /readyz handler currently uses
require.Contains which can still pass if the handler appends the raw load error;
update the assertion after reading resp.Body (the lines using io.ReadAll,
require.NoError, and require.Contains) to ensure the response equals the
expected generic message and never includes the raw error: replace
require.Contains with a strict equality check (require.Equal(t, tc.expectedBody,
string(body))) and/or add an explicit require.NotContains(t, string(body),
tc.loadErr.Error()) to guarantee tc.loadErr.Error() is not leaked.
In `@pkg/lifecycle-server/server.go`:
- Line 20: The readiness handler is re-logging the same startup/load error on
every /readyz probe; remove the duplicate logging in the readiness endpoint so
run() remains the single place that logs the failure. Update the readiness
handler (the /readyz handler function referenced alongside run()) to check the
same failure condition but only return the appropriate non-200 status and short
message/body without calling the logger; keep run()'s existing logging as the
one-time error report. Ensure any other readiness-related handlers in the same
file (lines ~30-40) follow the same pattern.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4e15b6bd-ccb1-4155-9abd-cb9e955fb74b
📒 Files selected for processing (3)
cmd/lifecycle-server/start.gopkg/lifecycle-server/server.gopkg/lifecycle-server/server_test.go
fgiudici
left a comment
There was a problem hiding this comment.
just found a couple of minor things
Looks good!
/lgtm
The readiness probe previously returned 503 when the lifecycle index was empty, which is a valid state (no lifecycle metadata in catalog). Now it returns 200 for an empty index and only returns 503 when there were errors building the index. Error logging is done once at startup during FBC processing rather than on each readiness probe poll. The health handler takes a simple bool instead of the load error to keep the interface clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8bbec32 to
5a9d6c4
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fgiudici, perdasilva The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified bypass The lifecycle-server will be spawn by the lifecycle-controller, not yet merged, so this code cannot be really tested. |
|
@fgiudici: The DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
d75c36f
into
openshift:main
Summary
/readyz) previously returned 503 when the lifecycle index was empty, treating it as unready. An empty index is a valid state (no lifecycle metadata in the catalog), so it now returns 200.NewHealthHandlerto take aboolinstead of anerror, keeping the handler decoupled from load internals.Test plan
go test ./pkg/lifecycle-server/...)🤖 Generated with Claude Code
Summary by CodeRabbit
Health & Reliability
Bug Fixes