OPRUN-4597: Fix lifecycle-server readiness probe to allow empty index #1288

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from perdasilva:lifecycle-server-readiness
May 6, 2026
Conversation

@perdasilva
Contributor

@perdasilva perdasilva commented May 5, 2026

Summary

  • The readiness probe (/readyz) previously returned 503 when the lifecycle index was empty, treating it as unready. An empty index is a valid state (no lifecycle metadata in the catalog), so it now returns 200.
  • It now returns 503 only when there were actual errors building the index. The server still starts even on load errors, so the readiness probe can report the failure.
  • Error logging was moved from the readiness handler to FBC processing at startup, so errors are logged once instead of on every probe poll.
  • Simplified NewHealthHandler to take a bool instead of an error, keeping the handler decoupled from load internals.

Test plan

  • Unit tests updated and passing (go test ./pkg/lifecycle-server/...)
  • Verify readiness returns 200 when catalog has no lifecycle metadata
  • Verify readiness returns 503 with generic message when FBC loading fails
  • Verify error details appear in server logs at startup (not in HTTP response or on each probe poll)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Health & Reliability

    • Refined health endpoints: liveness stays healthy; readiness now accurately reports degraded state (returns 503) when initialization fails, improving monitoring signals.
  • Bug Fixes

    • Initialization now logs lifecycle load results and continues startup on load failures while surfacing clearer readiness status and error messages for diagnostics.

@coderabbitai

coderabbitai Bot commented May 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e3f3c8a2-4bca-48c0-96f3-5e2e968a8ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 8bbec32 and 5a9d6c4.

📒 Files selected for processing (3)
  • cmd/lifecycle-server/start.go
  • pkg/lifecycle-server/server.go
  • pkg/lifecycle-server/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/lifecycle-server/server.go

Walkthrough

Health readiness now depends on whether lifecycle data loading failed. The start logic logs load success or failure and continues initialization. The health handler API was changed to accept a readiness flag; tests updated to exercise the new ready/health responses.

Changes

Health Check Readiness Refactor

Layer / File(s) Summary
API / Signature
pkg/lifecycle-server/server.go
NewHealthHandler signature changed to accept a boolean readiness flag (isReady bool) instead of lifecycle data.
Core Readiness Logic
pkg/lifecycle-server/server.go
/healthz remains 200; /readyz returns 503 with a fixed error message when isReady is false and 200 when true.
Lifecycle Loading
cmd/lifecycle-server/start.go
Lifecycle loading now captures data, loadErr; on failure the error is logged but initialization continues. The health handler is constructed using the readiness indication derived from the load result.
Wiring / Handler Construction
cmd/lifecycle-server/start.go
Health handler is created via server.NewHealthHandler(loadErr == nil) (passing the readiness boolean); the API handler still receives data even if the load failed.
Tests
pkg/lifecycle-server/server_test.go
Tests updated to pass an isReady boolean into NewHealthHandler and assert /healthz and /readyz responses for ready vs not-ready cases.
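The start-up wiring summarized above can be sketched like this. The helper loadLifecycleData and its failure flag are hypothetical stand-ins; only the "log once, keep starting, derive readiness from the load result" shape comes from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// loadLifecycleData is an illustrative stand-in for the FBC load step in
// cmd/lifecycle-server/start.go. An empty index is a valid, ready state.
func loadLifecycleData(fail bool) (map[string]string, error) {
	if fail {
		return nil, errors.New("invalid FBC blob")
	}
	return map[string]string{}, nil
}

func main() {
	data, loadErr := loadLifecycleData(true)
	if loadErr != nil {
		// Logged once here at startup, not on every /readyz poll.
		fmt.Println("error building lifecycle index:", loadErr)
	}
	// The server keeps starting either way; readiness is derived from
	// the load result so the probe can surface the failure.
	isReady := loadErr == nil
	fmt.Println("isReady:", isReady, "entries:", len(data))
}
```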

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: fixing the lifecycle-server readiness probe to handle empty index as a valid state.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not contain Ginkgo tests. The modified test file uses Go's standard testing framework (testing.T) with table-driven tests. Check is not applicable.
Test Structure And Quality ✅ Passed PR uses standard Go unit tests, not Ginkgo tests. Custom check targets Ginkgo test quality. Tests follow Go best practices: single responsibility, assertion messages, and consistent patterns.
Microshift Test Compatibility ✅ Passed The custom check applies to new Ginkgo e2e tests. This PR contains only standard Go unit tests in server_test.go and code changes in server.go and start.go. No Ginkgo e2e tests were added.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not add any Ginkgo e2e tests. Modified files contain only standard Go unit tests using testing.T framework with Testify assertions. SNO compatibility check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only lifecycle-server health check logic (HTTP handlers). No deployment manifests, operator controllers, pod specs, or scheduling constraints are introduced or modified.
Ote Binary Stdout Contract ✅ Passed Check inapplicable: lifecycle-server is a standalone operator service, not an OTE test binary. Uses klog (defaults stderr) with no stdout writes from main().
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR contains only Go unit tests (testing.T) with standard libraries, not Ginkgo framework. Check not applicable.



Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from joelanford and pedjak May 5, 2026 11:40
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2026
@perdasilva perdasilva changed the title Fix lifecycle-server readiness probe to allow empty index OPRUN-4597: Fix lifecycle-server readiness probe to allow empty index May 5, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci-robot

openshift-ci-robot commented May 5, 2026

@perdasilva: This pull request references OPRUN-4597 which is a valid jira issue.

Details

In response to this:

Summary

  • The readiness probe (/readyz) previously returned 503 when the lifecycle index was empty, treating it as unready. An empty index is a valid state (no lifecycle metadata in the catalog), so it now returns 200.
  • Returns 503 only when there were actual errors building the index. The server continues to start even on load errors so the readiness probe can report the failure.
  • Error details are logged server-side and not exposed through the endpoint to avoid leaking internal information.

Test plan

  • Unit tests updated and passing (go test ./pkg/lifecycle-server/...)
  • Verify readiness returns 200 when catalog has no lifecycle metadata
  • Verify readiness returns 503 with generic message when FBC loading fails
  • Verify error details appear in server logs (not in HTTP response)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/lifecycle-server/start.go (1)

137-157: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Empty successful loads still make the API behave as unavailable.

With this wiring, loadErr == nil makes /readyz return 200 for an empty-but-valid catalog, but server.NewHandler(data, log) still returns 503 for every /api/... request when len(data) == 0 (pkg/lifecycle-server/server.go, Lines 61-65). That means a catalog with no lifecycle metadata is reported ready while the API still looks broken.

Please distinguish “load failed” from “loaded successfully but empty” in the API path too, so empty catalogs fall through to normal 404 behavior instead of 503.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/lifecycle-server/start.go` around lines 137 - 157, The API incorrectly
treats an empty-but-successful catalog as a load failure; update the wiring so
server.NewHandler can distinguish real load failures from empty-but-valid data
by passing the load error (loadErr) into the handler and changing NewHandler to
base its 503 vs normal behavior on that error rather than on data length;
specifically, modify the call apiHandler := server.NewHandler(data, log) ->
server.NewHandler(data, loadErr, log) (and propagate that parameter into
authnzFilter creation), then update the NewHandler implementation (and any
internal checks that use len(data) to return 503) to check the loadErr value
(and only return 503 when loadErr != nil), leaving empty catalogs to behave
normally (404s) while keeping NewHealthHandler(loadErr, log) as-is.
🧹 Nitpick comments (2)
pkg/lifecycle-server/server_test.go (1)

244-246: ⚡ Quick win

Assert that /readyz never leaks the raw load error.

require.Contains is too loose for the failure case: the test would still pass if the handler appended tc.loadErr.Error() after the generic message.

Suggested test hardening
```diff
 			body, err := io.ReadAll(resp.Body)
 			require.NoError(t, err, "failed to read response body")
-			require.Contains(t, string(body), tc.expectedBody, "unexpected response body")
+			require.Equal(t, tc.expectedBody, string(body), "unexpected response body")
+			if tc.loadErr != nil {
+				require.NotContains(t, string(body), tc.loadErr.Error(), "readiness response should not leak internal errors")
+			}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/lifecycle-server/server_test.go` around lines 244 - 246, The test for the
/readyz handler currently uses require.Contains which can still pass if the
handler appends the raw load error; update the assertion after reading resp.Body
(the lines using io.ReadAll, require.NoError, and require.Contains) to ensure
the response equals the expected generic message and never includes the raw
error: replace require.Contains with a strict equality check (require.Equal(t,
tc.expectedBody, string(body))) and/or add an explicit require.NotContains(t,
string(body), tc.loadErr.Error()) to guarantee tc.loadErr.Error() is not leaked.
pkg/lifecycle-server/server.go (1)

20-20: ⚡ Quick win

Avoid logging the same load error on every /readyz probe.

run() already logs the startup failure once, so logging again here will emit the same error on every readiness check while the pod is unhealthy. That can flood logs and bury the original failure signal.

Proposed cleanup
```diff
-import (
-	"fmt"
-	"net/http"
+import (
+	"net/http"

 	"github.com/go-logr/logr"
 )

 func NewHealthHandler(loadErr error, log logr.Logger) http.Handler {
 	mux := http.NewServeMux()
+	if loadErr != nil {
+		log.Error(loadErr, "lifecycle index failed to load; readiness will report unavailable")
+	}
 	mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
 		w.WriteHeader(http.StatusOK)
 		_, _ = w.Write([]byte("ok"))
 	})
 	mux.HandleFunc("GET /readyz", func(w http.ResponseWriter, r *http.Request) {
 		if loadErr != nil {
-			log.Error(loadErr, "readiness check failed")
 			w.WriteHeader(http.StatusServiceUnavailable)
-			_, _ = w.Write([]byte(fmt.Sprintf("error parsing catalog lifecycle metadata, check logs for details")))
+			_, _ = w.Write([]byte("error parsing catalog lifecycle metadata, check logs for details"))
 			return
 		}
```
Also applies to: 30-40

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/lifecycle-server/server.go` at line 20, The readiness handler is
re-logging the same startup/load error on every /readyz probe; remove the
duplicate logging in the readiness endpoint so run() remains the single place
that logs the failure. Update the readiness handler (the /readyz handler
function referenced alongside run()) to check the same failure condition but
only return the appropriate non-200 status and short message/body without
calling the logger; keep run()'s existing logging as the one-time error report.
Ensure any other readiness-related handlers in the same file (lines ~30-40)
follow the same pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cmd/lifecycle-server/start.go`:
- Around line 137-157: The API incorrectly treats an empty-but-successful
catalog as a load failure; update the wiring so server.NewHandler can
distinguish real load failures from empty-but-valid data by passing the load
error (loadErr) into the handler and changing NewHandler to base its 503 vs
normal behavior on that error rather than on data length; specifically, modify
the call apiHandler := server.NewHandler(data, log) -> server.NewHandler(data,
loadErr, log) (and propagate that parameter into authnzFilter creation), then
update the NewHandler implementation (and any internal checks that use len(data)
to return 503) to check the loadErr value (and only return 503 when loadErr !=
nil), leaving empty catalogs to behave normally (404s) while keeping
NewHealthHandler(loadErr, log) as-is.

---

Nitpick comments:
In `@pkg/lifecycle-server/server_test.go`:
- Around line 244-246: The test for the /readyz handler currently uses
require.Contains which can still pass if the handler appends the raw load error;
update the assertion after reading resp.Body (the lines using io.ReadAll,
require.NoError, and require.Contains) to ensure the response equals the
expected generic message and never includes the raw error: replace
require.Contains with a strict equality check (require.Equal(t, tc.expectedBody,
string(body))) and/or add an explicit require.NotContains(t, string(body),
tc.loadErr.Error()) to guarantee tc.loadErr.Error() is not leaked.

In `@pkg/lifecycle-server/server.go`:
- Line 20: The readiness handler is re-logging the same startup/load error on
every /readyz probe; remove the duplicate logging in the readiness endpoint so
run() remains the single place that logs the failure. Update the readiness
handler (the /readyz handler function referenced alongside run()) to check the
same failure condition but only return the appropriate non-200 status and short
message/body without calling the logger; keep run()'s existing logging as the
one-time error report. Ensure any other readiness-related handlers in the same
file (lines ~30-40) follow the same pattern.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4e15b6bd-ccb1-4155-9abd-cb9e955fb74b

📥 Commits

Reviewing files that changed from the base of the PR and between 2fcbb3e and 8bbec32.

📒 Files selected for processing (3)
  • cmd/lifecycle-server/start.go
  • pkg/lifecycle-server/server.go
  • pkg/lifecycle-server/server_test.go

Member

@fgiudici fgiudici left a comment


just found a couple of minor things
Looks good!

/lgtm

Comment thread pkg/lifecycle-server/server.go Outdated
Comment thread pkg/lifecycle-server/server.go Outdated
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2026
The readiness probe previously returned 503 when the lifecycle index
was empty, which is a valid state (no lifecycle metadata in catalog).
Now it returns 200 for an empty index and only returns 503 when there
were errors building the index. Error logging is done once at startup
during FBC processing rather than on each readiness probe poll. The
health handler takes a simple bool instead of the load error to keep
the interface clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@perdasilva perdasilva force-pushed the lifecycle-server-readiness branch from 8bbec32 to 5a9d6c4 Compare May 6, 2026 07:42
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 6, 2026
Member

@fgiudici fgiudici left a comment


/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgiudici, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fgiudici
Member

fgiudici commented May 6, 2026

/verified bypass

The lifecycle-server will be spawned by the lifecycle-controller, which is not yet merged, so this code cannot really be tested yet.
We will test this new feature only once all the pieces are in place.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 6, 2026
@openshift-ci-robot

@fgiudici: The verified label has been added.

Details

In response to this:

/verified bypass

The lifecycle-server will be spawned by the lifecycle-controller, which is not yet merged, so this code cannot really be tested yet.
We will test this new feature only once all the pieces are in place.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@perdasilva
Contributor Author

/retest

@openshift-merge-bot openshift-merge-bot Bot merged commit d75c36f into openshift:main May 6, 2026
16 of 17 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
