OPRUN-4597: Fix lifecycle-server readiness probe to allow empty index #1288

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from perdasilva:lifecycle-server-readiness
May 6, 2026
Conversation

@perdasilva
Contributor

@perdasilva perdasilva commented May 5, 2026

Summary

  • The readiness probe (/readyz) previously returned 503 when the lifecycle index was empty, treating it as unready. An empty index is a valid state (no lifecycle metadata in the catalog), so it now returns 200.
  • It now returns 503 only when there were actual errors building the index. The server still starts even on load errors, so the readiness probe can report the failure.
  • Error logging was moved from the readiness handler to FBC processing at startup, so errors are logged once instead of on every probe poll.
  • Simplified NewHealthHandler to take a bool instead of an error, keeping the handler decoupled from load internals.

Test plan

  • Unit tests updated and passing (go test ./pkg/lifecycle-server/...)
  • Verify readiness returns 200 when catalog has no lifecycle metadata
  • Verify readiness returns 503 with generic message when FBC loading fails
  • Verify error details appear in server logs at startup (not in HTTP response or on each probe poll)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Health & Reliability

    • Refined health endpoints: liveness stays healthy; readiness now accurately reports degraded state (returns 503) when initialization fails, improving monitoring signals.
  • Bug Fixes

    • Initialization now logs lifecycle load results and continues startup on load failures while surfacing clearer readiness status and error messages for diagnostics.

@coderabbitai

coderabbitai Bot commented May 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e3f3c8a2-4bca-48c0-96f3-5e2e968a8ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 8bbec32 and 5a9d6c4.

📒 Files selected for processing (3)
  • cmd/lifecycle-server/start.go
  • pkg/lifecycle-server/server.go
  • pkg/lifecycle-server/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/lifecycle-server/server.go

Walkthrough

Health readiness now depends on whether lifecycle data loading failed. The start logic logs load success or failure and continues initialization. The health handler API was changed to accept a readiness flag; tests updated to exercise the new ready/health responses.

Changes

Health Check Readiness Refactor

Layer / File(s) Summary
API / Signature
pkg/lifecycle-server/server.go
NewHealthHandler signature changed to accept a boolean readiness flag (isReady bool) instead of lifecycle data.
Core Readiness Logic
pkg/lifecycle-server/server.go
/healthz remains 200; /readyz returns 503 with a fixed error message when isReady is false and 200 when true.
Lifecycle Loading
cmd/lifecycle-server/start.go
Lifecycle loading now captures data, loadErr; on failure the error is logged but initialization continues. The health handler is constructed using the readiness indication derived from the load result.
Wiring / Handler Construction
cmd/lifecycle-server/start.go
Health handler is created via server.NewHealthHandler(loadErr == nil) (passing the readiness boolean); the API handler still receives data even if the load failed.
Tests
pkg/lifecycle-server/server_test.go
Tests updated to pass an isReady boolean into NewHealthHandler and assert /healthz and /readyz responses for ready vs not-ready cases.
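The start-up wiring summarized above can be sketched like this. The helper loadLifecycleData and its failure flag are hypothetical stand-ins; only the "log once, keep starting, derive readiness from the load result" shape comes from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// loadLifecycleData is an illustrative stand-in for the FBC load step in
// cmd/lifecycle-server/start.go. An empty index is a valid, ready state.
func loadLifecycleData(fail bool) (map[string]string, error) {
	if fail {
		return nil, errors.New("invalid FBC blob")
	}
	return map[string]string{}, nil
}

func main() {
	data, loadErr := loadLifecycleData(true)
	if loadErr != nil {
		// Logged once here at startup, not on every /readyz poll.
		fmt.Println("error building lifecycle index:", loadErr)
	}
	// The server keeps starting either way; readiness is derived from
	// the load result so the probe can surface the failure.
	isReady := loadErr == nil
	fmt.Println("isReady:", isReady, "entries:", len(data))
}
```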

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: fixing the lifecycle-server readiness probe to handle empty index as a valid state.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not contain Ginkgo tests. The modified test file uses Go's standard testing framework (testing.T) with table-driven tests. Check is not applicable.
Test Structure And Quality ✅ Passed PR uses standard Go unit tests, not Ginkgo tests. Custom check targets Ginkgo test quality. Tests follow Go best practices: single responsibility, assertion messages, and consistent patterns.
Microshift Test Compatibility ✅ Passed The custom check applies to new Ginkgo e2e tests. This PR contains only standard Go unit tests in server_test.go and code changes in server.go and start.go. No Ginkgo e2e tests were added.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not add any Ginkgo e2e tests. Modified files contain only standard Go unit tests using testing.T framework with Testify assertions. SNO compatibility check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only lifecycle-server health check logic (HTTP handlers). No deployment manifests, operator controllers, pod specs, or scheduling constraints are introduced or modified.
Ote Binary Stdout Contract ✅ Passed Check inapplicable: lifecycle-server is a standalone operator service, not an OTE test binary. Uses klog (defaults stderr) with no stdout writes from main().
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR contains only Go unit tests (testing.T) with standard libraries, not Ginkgo framework. Check not applicable.



Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from joelanford and pedjak May 5, 2026 11:40
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2026
@perdasilva perdasilva changed the title Fix lifecycle-server readiness probe to allow empty index OPRUN-4597: Fix lifecycle-server readiness probe to allow empty index May 5, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci-robot

openshift-ci-robot commented May 5, 2026

@perdasilva: This pull request references OPRUN-4597 which is a valid jira issue.

Details

In response to this:

Summary

  • The readiness probe (/readyz) previously returned 503 when the lifecycle index was empty, treating it as unready. An empty index is a valid state (no lifecycle metadata in the catalog), so it now returns 200.
  • Returns 503 only when there were actual errors building the index. The server continues to start even on load errors so the readiness probe can report the failure.
  • Error details are logged server-side and not exposed through the endpoint to avoid leaking internal information.

Test plan

  • Unit tests updated and passing (go test ./pkg/lifecycle-server/...)
  • Verify readiness returns 200 when catalog has no lifecycle metadata
  • Verify readiness returns 503 with generic message when FBC loading fails
  • Verify error details appear in server logs (not in HTTP response)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/lifecycle-server/start.go (1)

137-157: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Empty successful loads still make the API behave as unavailable.

With this wiring, loadErr == nil makes /readyz return 200 for an empty-but-valid catalog, but server.NewHandler(data, log) still returns 503 for every /api/... request when len(data) == 0 (pkg/lifecycle-server/server.go, Lines 61-65). That means a catalog with no lifecycle metadata is reported ready while the API still looks broken.

Please distinguish “load failed” from “loaded successfully but empty” in the API path too, so empty catalogs fall through to normal 404 behavior instead of 503.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/lifecycle-server/start.go` around lines 137 - 157, The API incorrectly
treats an empty-but-successful catalog as a load failure; update the wiring so
server.NewHandler can distinguish real load failures from empty-but-valid data
by passing the load error (loadErr) into the handler and changing NewHandler to
base its 503 vs normal behavior on that error rather than on data length;
specifically, modify the call apiHandler := server.NewHandler(data, log) ->
server.NewHandler(data, loadErr, log) (and propagate that parameter into
authnzFilter creation), then update the NewHandler implementation (and any
internal checks that use len(data) to return 503) to check the loadErr value
(and only return 503 when loadErr != nil), leaving empty catalogs to behave
normally (404s) while keeping NewHealthHandler(loadErr, log) as-is.
🧹 Nitpick comments (2)
pkg/lifecycle-server/server_test.go (1)

244-246: ⚡ Quick win

Assert that /readyz never leaks the raw load error.

require.Contains is too loose for the failure case: the test would still pass if the handler appended tc.loadErr.Error() after the generic message.

Suggested test hardening
```diff
 			body, err := io.ReadAll(resp.Body)
 			require.NoError(t, err, "failed to read response body")
-			require.Contains(t, string(body), tc.expectedBody, "unexpected response body")
+			require.Equal(t, tc.expectedBody, string(body), "unexpected response body")
+			if tc.loadErr != nil {
+				require.NotContains(t, string(body), tc.loadErr.Error(), "readiness response should not leak internal errors")
+			}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/lifecycle-server/server_test.go` around lines 244 - 246, The test for the
/readyz handler currently uses require.Contains which can still pass if the
handler appends the raw load error; update the assertion after reading resp.Body
(the lines using io.ReadAll, require.NoError, and require.Contains) to ensure
the response equals the expected generic message and never includes the raw
error: replace require.Contains with a strict equality check (require.Equal(t,
tc.expectedBody, string(body))) and/or add an explicit require.NotContains(t,
string(body), tc.loadErr.Error()) to guarantee tc.loadErr.Error() is not leaked.
pkg/lifecycle-server/server.go (1)

20-20: ⚡ Quick win

Avoid logging the same load error on every /readyz probe.

run() already logs the startup failure once, so logging again here will emit the same error on every readiness check while the pod is unhealthy. That can flood logs and bury the original failure signal.

Proposed cleanup
```diff
-import (
-	"fmt"
-	"net/http"
+import (
+	"net/http"

 	"github.com/go-logr/logr"
 )

 func NewHealthHandler(loadErr error, log logr.Logger) http.Handler {
 	mux := http.NewServeMux()
+	if loadErr != nil {
+		log.Error(loadErr, "lifecycle index failed to load; readiness will report unavailable")
+	}
 	mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
 		w.WriteHeader(http.StatusOK)
 		_, _ = w.Write([]byte("ok"))
 	})
 	mux.HandleFunc("GET /readyz", func(w http.ResponseWriter, r *http.Request) {
 		if loadErr != nil {
-			log.Error(loadErr, "readiness check failed")
 			w.WriteHeader(http.StatusServiceUnavailable)
-			_, _ = w.Write([]byte(fmt.Sprintf("error parsing catalog lifecycle metadata, check logs for details")))
+			_, _ = w.Write([]byte("error parsing catalog lifecycle metadata, check logs for details"))
 			return
 		}
```
Also applies to: 30-40

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/lifecycle-server/server.go` at line 20, The readiness handler is
re-logging the same startup/load error on every /readyz probe; remove the
duplicate logging in the readiness endpoint so run() remains the single place
that logs the failure. Update the readiness handler (the /readyz handler
function referenced alongside run()) to check the same failure condition but
only return the appropriate non-200 status and short message/body without
calling the logger; keep run()'s existing logging as the one-time error report.
Ensure any other readiness-related handlers in the same file (lines ~30-40)
follow the same pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cmd/lifecycle-server/start.go`:
- Around line 137-157: The API incorrectly treats an empty-but-successful
catalog as a load failure; update the wiring so server.NewHandler can
distinguish real load failures from empty-but-valid data by passing the load
error (loadErr) into the handler and changing NewHandler to base its 503 vs
normal behavior on that error rather than on data length; specifically, modify
the call apiHandler := server.NewHandler(data, log) -> server.NewHandler(data,
loadErr, log) (and propagate that parameter into authnzFilter creation), then
update the NewHandler implementation (and any internal checks that use len(data)
to return 503) to check the loadErr value (and only return 503 when loadErr !=
nil), leaving empty catalogs to behave normally (404s) while keeping
NewHealthHandler(loadErr, log) as-is.

---

Nitpick comments:
In `@pkg/lifecycle-server/server_test.go`:
- Around line 244-246: The test for the /readyz handler currently uses
require.Contains which can still pass if the handler appends the raw load error;
update the assertion after reading resp.Body (the lines using io.ReadAll,
require.NoError, and require.Contains) to ensure the response equals the
expected generic message and never includes the raw error: replace
require.Contains with a strict equality check (require.Equal(t, tc.expectedBody,
string(body))) and/or add an explicit require.NotContains(t, string(body),
tc.loadErr.Error()) to guarantee tc.loadErr.Error() is not leaked.

In `@pkg/lifecycle-server/server.go`:
- Line 20: The readiness handler is re-logging the same startup/load error on
every /readyz probe; remove the duplicate logging in the readiness endpoint so
run() remains the single place that logs the failure. Update the readiness
handler (the /readyz handler function referenced alongside run()) to check the
same failure condition but only return the appropriate non-200 status and short
message/body without calling the logger; keep run()'s existing logging as the
one-time error report. Ensure any other readiness-related handlers in the same
file (lines ~30-40) follow the same pattern.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4e15b6bd-ccb1-4155-9abd-cb9e955fb74b

📥 Commits

Reviewing files that changed from the base of the PR and between 2fcbb3e and 8bbec32.

📒 Files selected for processing (3)
  • cmd/lifecycle-server/start.go
  • pkg/lifecycle-server/server.go
  • pkg/lifecycle-server/server_test.go

Member

@fgiudici fgiudici left a comment


just found a couple of minor things
Looks good!

/lgtm

Comment thread pkg/lifecycle-server/server.go Outdated
Comment thread pkg/lifecycle-server/server.go Outdated
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2026
The readiness probe previously returned 503 when the lifecycle index
was empty, which is a valid state (no lifecycle metadata in catalog).
Now it returns 200 for an empty index and only returns 503 when there
were errors building the index. Error logging is done once at startup
during FBC processing rather than on each readiness probe poll. The
health handler takes a simple bool instead of the load error to keep
the interface clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@perdasilva perdasilva force-pushed the lifecycle-server-readiness branch from 8bbec32 to 5a9d6c4 Compare May 6, 2026 07:42
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 6, 2026
Member

@fgiudici fgiudici left a comment


/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgiudici, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fgiudici
Member

fgiudici commented May 6, 2026

/verified bypass

The lifecycle-server will be spawned by the lifecycle-controller, which is not yet merged, so this code cannot really be tested yet.
We will test this new feature only once all the pieces are in place.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 6, 2026
@openshift-ci-robot

@fgiudici: The verified label has been added.

Details

In response to this:

/verified bypass

The lifecycle-server will be spawned by the lifecycle-controller, which is not yet merged, so this code cannot really be tested yet.
We will test this new feature only once all the pieces are in place.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@perdasilva
Contributor Author

/retest

@openshift-merge-bot openshift-merge-bot Bot merged commit d75c36f into openshift:main May 6, 2026
16 of 17 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
