
OPRUN-4445: add lifecycle-controller and lifecycle-server #1281

Draft
fgiudici wants to merge 7 commits into openshift:main from fgiudici:upgrade-planner

Conversation


@fgiudici fgiudici commented Apr 24, 2026

Summary by CodeRabbit

  • New Features

    • Added lifecycle-controller and lifecycle-server components plus a new HTTP API to serve lifecycle data.
  • Security

    • TLS-enabled servers, hardened container/pod security settings, RBAC, and NetworkPolicies.
  • Manifests

    • New Deployment, Service, NetworkPolicy, and RBAC manifests and health endpoints.
  • Tests

    • Extensive unit and integration tests for lifecycle components and TLS/indexing behavior.
  • Chores

    • Build/image updated to include new binaries and manifest generation simplified.

joelanford and others added 5 commits April 23, 2026 18:13
Add openshift/controller-runtime-common for shared TLS controller
support, and update openshift/library-go crypto package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the lifecycle-server component that serves File-Based Catalog
content over HTTPS. The server watches for catalog updates and
serves them to OLM for operator discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the lifecycle-controller that reconciles CatalogSource resources
to manage lifecycle-server pods. Includes dynamic TLS configuration
via controller-runtime-common, NetworkPolicy management, leader
election, and server-side apply for owned resources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add deployment manifests for lifecycle-controller and lifecycle-server
including RBAC, NetworkPolicy, Service, and IBM Cloud managed variants.
Update Dockerfile with lifecycle binary targets, Makefile with build
rules, CRD manifests with lifecycle annotations, and the manifest
generation script for downstream handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add comprehensive unit tests covering controller reconciliation,
TLS configuration, FBC catalog building, and HTTP server behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci Bot commented Apr 24, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci-robot commented Apr 24, 2026

@fgiudici: This pull request references OPRUN-4445 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 24, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fgiudici
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai Bot commented Apr 24, 2026

Walkthrough

Adds two new CLIs (lifecycle-controller, lifecycle-server), implements a controller and HTTP server for serving lifecycle FBC data with dynamic TLS and auth, introduces Kubernetes manifests (Deployment/Service/NetworkPolicy/RBAC), updates build/image flow, and simplifies CRD manifest generation.

Changes

  • Build & Image — Makefile, operator-lifecycle-manager.Dockerfile: New Make targets/variables and FORCE rules to build bin/lifecycle-controller and bin/lifecycle-server; the Dockerfile now copies those binaries into the runtime image.
  • Controller CLI — cmd/lifecycle-controller/main.go, cmd/lifecycle-controller/start.go, cmd/lifecycle-controller/util.go: New Cobra-based CLI/start command that initializes the scheme, TLS provider, and manager (leader election, metrics, health) and registers the reconciler.
  • Server CLI — cmd/lifecycle-server/main.go, cmd/lifecycle-server/start.go, cmd/lifecycle-server/util.go: New Cobra-based lifecycle-server startup with dynamic TLS, K8s-backed authn/authz middleware, FBC loading, an HTTPS API, and a separate health endpoint with coordinated shutdown.
  • Controller Core — pkg/lifecycle-controller/controller.go, pkg/lifecycle-controller/tls.go, cmd/.../util.go: Adds LifecycleServerReconciler (reconcile/setup logic), TLSConfigProvider for dynamic TLS, helper utilities for scheme and label selectors, resource naming and pod/node/image extraction, and predicates/event mapping.
  • Controller Tests & Docs — pkg/lifecycle-controller/controller_test.go, pkg/lifecycle-controller/tls_test.go, pkg/lifecycle-controller/TODO.md: Comprehensive unit tests for the reconciler and TLS provider, plus an e2e TODO specification.
  • Server Implementation & Tests — pkg/lifecycle-server/fbc.go, pkg/lifecycle-server/server.go, pkg/lifecycle-server/fbc_test.go, pkg/lifecycle-server/server_test.go: FBC filesystem indexing (LifecycleIndex) and an HTTP handler serving GET /api/{version}/lifecycles/{package}; tests cover parsing, indexing, and handler behavior.
  • Kubernetes Manifests — manifests/0000_50_olm_08-*.yaml, manifests/0000_50_olm_09-lifecycle-server.rbac.yaml: Adds the lifecycle-controller Deployment, Service, NetworkPolicy, ServiceAccount/ClusterRole/ClusterRoleBinding, and a lifecycle-server ClusterRole for kube-rbac-proxy.
  • MicroShift Manifests — microshift-manifests/...: Parallel MicroShift variants of the lifecycle-controller manifests (deployments, RBAC, networkpolicy, service) and CRD metadata formatting changes.
  • CRDs & Kustomize — manifests/0000_50_olm_00-*.crd.yaml, microshift-manifests/..., microshift-manifests/kustomization.yaml: Bumped/normalized controller-gen annotation versions, small OpenAPI formatting changes, and an added default "" for ip; the kustomization is extended to include the lifecycle manifests.
  • Manifest Generation Script — scripts/generate_crds_manifests.sh: Replaced controller-gen/yq edits with direct copying from staging/api/crds/* and added generation of the lifecycle-controller/server manifests.
  • Go Modules — go.mod: Promoted several deps from indirect to direct requires and added a replace for controller-runtime-common pointing at a forked module.
  • Makefile Targets (explicit) — Makefile entries: New variables LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD, with corresponding build targets producing bin/ artifacts.

Sequence Diagram(s)

sequenceDiagram
    participant Admin as Admin
    participant K8sAPI as Kubernetes API
    participant Manager as Controller-Runtime Manager
    participant Controller as Lifecycle Controller
    participant CatalogPod as Catalog Pod
    participant TLSProvider as TLSConfig Provider

    Admin->>K8sAPI: Create/Update CatalogSource
    K8sAPI->>Manager: Event/watch
    Manager->>Controller: Reconcile request
    Controller->>K8sAPI: Get CatalogSource
    Controller->>CatalogPod: Query pods for image digest & node info
    CatalogPod-->>Controller: Pod metadata
    Controller->>TLSProvider: Get TLS config
    TLSProvider-->>Controller: TLS config + unsupported ciphers
    Controller->>K8sAPI: Create/Update Deployment, Service, NetworkPolicy, ServiceAccount
    Controller->>K8sAPI: Update ClusterRoleBinding subjects
    K8sAPI-->>Controller: Resources applied
    Controller-->>Manager: Reconciliation complete
sequenceDiagram
    participant Client as Client
    participant Health as Healthz Endpoint
    participant Proxy as Auth/Authz Middleware
    participant API as Lifecycle API Handler
    participant FBC as FBC Index

    Client->>Health: GET /healthz
    Health-->>Client: 200 OK

    Client->>Proxy: GET /api/v1/lifecycles/pkg
    Proxy->>Proxy: Validate token (TokenReview / SAR)
    alt Authz succeeds
        Proxy->>API: Forward request
        API->>FBC: Lookup version/package
        alt Found
            FBC-->>API: json.RawMessage blob
            API->>Client: 200 OK + application/json
        else Not found
            API->>Client: 404 Not Found
        end
    else Authz fails
        Proxy-->>Client: 403 Forbidden
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 3 warnings)

  • Ote Binary Stdout Contract — ❌ Error: Both cmd/lifecycle-controller/start.go and cmd/lifecycle-server/start.go violate the OTE Binary Stdout Contract by using klog without configuring stderr output. Resolution: add klog.LogToStderr(true) immediately after initializing the klog logger in both files to redirect all output to stderr and preserve the stdout JSON contract.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 33.33%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality — ⚠️ Warning: Test code lacks timeout patterns; all cluster-interacting tests use context.Background() without timeouts, risking indefinite hangs. Resolution: wrap context-dependent operations in context.WithTimeout() to prevent indefinite waits.
  • Topology-Aware Scheduling Compatibility — ⚠️ Warning: Static lifecycle-controller deployments use a control-plane nodeSelector with only the master taint toleration, causing failures on HyperShift and on modern clusters with the control-plane:NoSchedule taint; dynamic deployments lack topology awareness, and the rolling update strategy requires two schedulable nodes, breaking on SNO. Resolution: check infrastructure.Status.ControlPlaneTopology in controller code and manifests, add the missing control-plane:NoSchedule toleration, conditionally adjust the nodeSelector for External topologies, and adjust the rolling update strategy for single-node and two-node clusters.
✅ Passed checks (8 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and specifically identifies the main change: adding two new components (lifecycle-controller and lifecycle-server) to the OLM project.
  • Linked Issues Check — ✅ Passed: check skipped; no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped; no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: the pull request does not include any Ginkgo tests; all new test files use the standard Go func Test*(*testing.T) pattern, which is not subject to Ginkgo naming requirements.
  • Microshift Test Compatibility — ✅ Passed: no Ginkgo e2e tests are added; all test files use standard Go testing patterns without Ginkgo imports or e2e test files.
  • Single Node Openshift (SNO) Test Compatibility — ✅ Passed: no Ginkgo e2e tests are added; test files are standard Go unit tests using the testing package.
  • IPv6 And Disconnected Network Test Compatibility — ✅ Passed: no Ginkgo e2e tests are added; all new test files use standard Go *testing.T functions.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/generate_crds_manifests.sh (1)

850-865: ⚠️ Potential issue | 🟠 Major

Microshift kustomization will pull in the TechPreview lifecycle manifests.

files_to_ignore filters out ibm-cloud-managed, psm-operator, collect-profiles, pprof, etc., but does not include lifecycle. As written, 0000_50_olm_08-lifecycle-controller.* and 0000_50_olm_09-lifecycle-server.rbac.yaml will be copied into microshift-manifests/ and appended to kustomization.yaml. Since these resources rely on:

  • release.openshift.io/feature-set: TechPreviewNoUpgrade (no-op in microshift — always active), and
  • config.openshift.io/apiservers (does not exist in microshift — the ClusterRole references it at line 728‑730),

this will either ship an always-on TechPreview component on microshift or fail at runtime when the controller tries to watch apiservers. Please either add "lifecycle" (or the specific filenames) to files_to_ignore, or confirm microshift is an intended delivery target.

📝 Proposed fix
-files_to_ignore=("ibm-cloud-managed.yaml" "kustomization.yaml" "psm-operator" "removed" "collect-profiles" "pprof" "prometheus" "service-monitor" "operatorstatus")
+files_to_ignore=("ibm-cloud-managed.yaml" "kustomization.yaml" "psm-operator" "removed" "collect-profiles" "pprof" "prometheus" "service-monitor" "operatorstatus" "lifecycle-controller" "lifecycle-server")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 850 - 865, The microshift
kustomization generation loop is currently missing "lifecycle" in the
files_to_ignore list, so the lifecycle controller/server manifests (e.g.,
0000_50_olm_08-lifecycle-controller.* and
0000_50_olm_09-lifecycle-server.rbac.yaml) are being included; update the
files_to_ignore array used by the microshift_manifests_files loop to add
"lifecycle" (or the specific lifecycle filenames) so the for-loop that appends
entries into microshift-manifests/kustomization.yaml will skip those lifecycle
manifests, ensuring they are not added to the generated kustomization.yaml.
🧹 Nitpick comments (17)
scripts/generate_crds_manifests.sh (1)

677-791: RBAC/NetworkPolicy consistency nits for the lifecycle-controller manifests.

A couple of smaller things worth a second look, not blockers:

  • NetPol (lines 693‑697): egress to 6443/TCP is unscoped by to:, so it allows egress to any pod/service/external endpoint on port 6443. Consider constraining with a namespaceSelector for the default namespace (kube-apiserver) to follow least-privilege like collect-profiles does above.
  • ClusterRole leases (line 769): verbs are get,list,watch,create,update,delete but the leader-election library used by controller-runtime only needs get,create,update (and optionally list,watch when ReleaseOnCancel is off). delete in particular is not required. Trimming unused verbs matches upstream guidance.
  • ClusterRole for services (line 746) includes delete, but the controller is doing server-side apply for owned resources — verify delete is actually exercised for garbage collection (e.g. on CatalogSource deletion) vs. relying on ownerReferences. If the latter, delete can be dropped.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 677 - 791, The NetworkPolicy
for lifecycle-controller currently allows egress to port 6443/TCP without
scoping; update the manifest created in the heredoc (manifest name
0000_50_olm_08-lifecycle-controller.networkpolicy.yaml) to add a to: with an
appropriate namespaceSelector (e.g., select the kube-apiserver/default
namespace) to restrict egress to the API server; in the ClusterRole (kind:
ClusterRole name: operator-lifecycle-manager-lifecycle-controller in
0000_50_olm_08-lifecycle-controller.rbac.yaml) remove the unnecessary "delete"
verb from the coordination.k8s.io/leases rule and trim verbs to only
["get","create","update"] (optionally keep "list","watch" if ReleaseOnCancel is
disabled); finally, review the ClusterRole rule for services (apiGroups: [""],
resources: ["services"]) and if the controller relies on
ownerReferences/server-side-apply for GC, remove the "delete" verb from that
rule as well (or retain it only if explicit deletions are exercised).
pkg/lifecycle-controller/tls_test.go (1)

75-89: Optional: strengthen GetCertificatePreserved by checking invocation count.

The current test relies on a boolean that flips on first call. To guarantee the injected callback is the one invoked via the cloned *tls.Config, consider asserting a counter increments exactly once (or that the callback pointer survives a Get() clone).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/tls_test.go` around lines 75 - 89, Replace the
boolean flag in TestTLSConfigProvider_GetCertificatePreserved with an integer
counter so you can assert the injected callback is invoked the expected number
of times: change getCert to increment a counter (e.g., calls++), create the
provider via NewTLSConfigProvider and obtain cfg via p.Get(), call
cfg.GetCertificate(nil), then assert the counter equals 1 (or the exact expected
count) instead of checking a bool; reference the getCert closure,
NewTLSConfigProvider, p.Get, and cfg.GetCertificate to locate the test and
update expectations.
manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml (1)

16-23: Consider tightening ingress/egress to namespace or CIDR selectors.

Both rules omit from/to peers, so ingress on 8443 is reachable from any pod in the cluster and egress on 6443 can dial any address (not just the kube-apiserver). The metrics endpoint is TLS-protected, so this is typically acceptable, but you may want to narrow ingress to the monitoring namespace (e.g., openshift-monitoring) and scope egress for 6443 if you want defense-in-depth consistent with the rest of OLM’s hardening. Not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml` around
lines 16 - 23, Ingress on port 8443 and egress on port 6443 currently have no
peer selectors so they allow traffic from/to anywhere; update the NetworkPolicy
stanza that defines ingress: - ports: - port: 8443 and egress: - ports: - port:
6443 to include appropriate peer selectors (e.g., for ingress add a from: with
namespaceSelector matching the monitoring namespace such as matchLabels: {name:
openshift-monitoring} or a podSelector limiting to monitoring pods, and for
egress add a to: with namespaceSelector/podSelector or an ipBlock.cidr that
targets the kube-apiserver IPs) so the rules are scoped rather than global.
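For illustration, a scoped ingress rule along the lines suggested might look like the fragment below; the namespace label value and port are assumptions, not the PR's manifest:

```yaml
# Hypothetical scoped ingress: only pods in openshift-monitoring may
# reach the metrics port.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: openshift-monitoring
    ports:
      - port: 8443
        protocol: TCP
```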
pkg/lifecycle-controller/TODO.md (1)

1-42: Nit: consider tracking these in a Jira/GitHub issue rather than an in-tree TODO.md.

In-tree TODO files tend to drift and be forgotten. Since this PR is already tied to OPRUN-4445, filing a follow-up Jira that links back to this content would make the e2e coverage plan discoverable and actionable outside the repository. The file itself is fine to keep as a checklist in the meantime.

Happy to draft the follow-up Jira description if useful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/TODO.md` around lines 1 - 42, Move the in-tree TODO
items into a tracked external ticket and link it from the file: create a
Jira/GitHub issue (referencing OPRUN-4445) that contains the e2e test checklist
from pkg/lifecycle-controller/TODO.md, include expected test scenarios
(happy-path and targeted tests) and acceptance criteria, then update
pkg/lifecycle-controller/TODO.md to replace the full checklist with a short
pointer to the newly created issue (and the issue ID) and optionally note that
the file is retained as a temporary local checklist; also update the PR
description to reference the new issue so the work is discoverable and
actionable outside the repo.
Makefile (1)

102-106: Optional: consider embedding version/commit and dropping unused build tags.

Two small improvements worth considering:

  1. Unlike $(OLM_CMDS) and $(REGISTRY_CMDS), these binaries lack -ldflags injection of GitCommit/Version. While lifecycle-controller reads a RELEASE_VERSION environment variable, matching the OLM convention of version-flagged binaries would improve production log visibility.

  2. The build tags -tags "json1,containers_image_openpgp" are unused by either binary—neither imports mattn/go-sqlite3 or containers/image. Dropping them avoids pulling in unnecessary CGO for json1.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 102 - 106, Update the Makefile build rules for
LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD to embed version metadata and
remove unnecessary build tags: add -ldflags settings to inject GitCommit/Version
(matching the OLM_CMDS/REGISTRY_CMDS convention) into the go build invocation
for the targets LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD, and drop the
unused -tags "json1,containers_image_openpgp" from those build lines so
CGO/sqlite-related tags are not pulled in for these binaries.
pkg/lifecycle-server/fbc.go (3)

108-115: Sort ListVersions output for deterministic logs.

Map iteration order in Go is intentionally randomized, so the version list in start.go's startup log (log.Info("loaded lifecycle data", ..., "versions", data.ListVersions())) will change between restarts. Sorting produces deterministic diagnostic output.

♻️ Suggested refactor
 // ListVersions returns the list of versions available in the index
 func (index LifecycleIndex) ListVersions() []string {
 	versions := make([]string, 0, len(index))
 	for v := range index {
 		versions = append(versions, v)
 	}
+	sort.Strings(versions)
 	return versions
 }

Add "sort" to the imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-server/fbc.go` around lines 108 - 115, ListVersions currently
returns versions in map iteration order which is non-deterministic; modify
LifecycleIndex.ListVersions to sort the versions before returning them (use
sort.Strings on the slice) so logs (e.g., the startup log in start.go that calls
data.ListVersions()) are deterministic, and add the "sort" import to the file.

42-87: Silent error skip on per-file walk errors loses debuggability.

At line 53-54, any error from WalkMetasFS on an individual file (malformed JSON, I/O error, unreadable entry) is silently dropped. Given this data is consumed by a public API and a single bad file can drop a lifecycle blob for a package, surfacing these would help operators diagnose a missing package. Consider accepting a logr.Logger (or returning a slice of non-fatal warnings) so callers can log per-file problems without aborting the walk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-server/fbc.go` around lines 42 - 87, LoadLifecycleData
currently swallows per-file errors inside the declcfg.WalkMetasFS callback,
losing debuggability; modify LoadLifecycleData to accept a logger (e.g.,
logr.Logger) or a pointer to a slice of warning errors, and in the WalkMetasFS
callback, instead of returning nil on per-file errors (the err parameter) or
other parse/validation issues, record them by logging via the provided logger or
appending to the warnings slice (but continue walking), keeping the existing
behavior of returning a single terminal error only if WalkMetasFS itself fails;
update the function signature (LoadLifecycleData) and callers accordingly and
reference the meta/err handling inside the anonymous callback where
schemaVersionRegex, meta.Package, meta.Blob and the result map are used so
individual file problems are surfaced without aborting the overall walk.

70-77: Consider logging or rejecting duplicate (version, package) entries.

A later-walked FBC file silently overwrites an earlier blob at line 76 with no indication. This is surprising if a catalog accidentally ships two lifecycle blobs for the same package/version, and whichever is walked last wins (order is not guaranteed by WalkMetasFS). A debug log line here (or an explicit error-on-duplicate option) would make the behavior observable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-server/fbc.go` around lines 70 - 77, When inserting into result
under schemaVersion and meta.Package (the block using mu.Lock()/mu.Unlock() that
assigns result[schemaVersion][meta.Package] = meta.Blob), check whether an entry
already exists and either log a debug/warn message with the existing vs new blob
identity or return an error to reject duplicates; specifically, inside that
critical section look up result[schemaVersion][meta.Package] before assignment,
and if non-nil call the package’s logger or lifecycle server logger with
schemaVersion and meta.Package and indicate a duplicate (optionally include a
short blob fingerprint/length) or propagate an error instead of silently
overwriting—this makes duplicate (version, package) entries from WalkMetasFS
observable and avoid silent last-wins behavior.
pkg/lifecycle-controller/controller_test.go (1)

125-129: Edge case: empty name produces invalid DNS label.

resourceName("") returns "-lifecycle-server", which starts with a hyphen and is not a valid DNS-1123 label — any attempt to create a resource with this name will be rejected by the API server. In practice CatalogSource.Name is validated by Kubernetes and can't be empty, so this is defensive only; consider either removing this test case or having resourceName return an error / panic for empty input to make the contract explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/controller_test.go` around lines 125 - 129, The test
shows resourceName("") returns "-lifecycle-server", producing an invalid
DNS-1123 label; update the code/tests so the contract is explicit: either remove
the "empty name" test case from controller_test.go, or change resourceName to
validate input and return an error or panic for empty/invalid names (e.g., check
in resourceName for "" and handle accordingly). Locate the function resourceName
and the test case named "empty name" to implement the chosen fix and ensure
callers (e.g., any CatalogSource name handling) are updated to handle the new
error/behavior.
cmd/lifecycle-controller/start.go (2)

124-135: LIFECYCLE_SERVER_IMAGE required check runs before TLS cert existence check — fine, but consider consolidated validation.

Not a bug, but loadStartConfig mixes validation (env vars, cert file load, selectors) with construction (REST config, scheme, leader election). Extracting a validateFlags() step that runs first would produce a single clear error message per invalid input rather than failing on the first one. Optional cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/lifecycle-controller/start.go` around lines 124 - 135, The
loadStartConfig function currently intermixes environment/flag validation with
setup logic; extract a new validateStartConfig (or validateFlags) helper and
move all input-only checks (NAMESPACE presence when !disableLeaderElection,
LIFECYCLE_SERVER_IMAGE non-empty, TLS cert existence and any selector
validation) into that function so loadStartConfig only constructs and returns
the startConfig and runtime objects (REST config, scheme, leader election
flags); call validateStartConfig early from the startup path before invoking
loadStartConfig, keep the same error messages, and reference the existing
symbols loadStartConfig and disableLeaderElection to locate where to split
validation from construction.

251-256: healthz check is a no-op stub.

The /healthz handler always returns nil, so it will stay green even if the cache hasn't synced, the leader hasn't been elected, or the TLS provider hasn't been initialized. Controller-runtime exposes mgr.GetCache().WaitForCacheSync and similar hooks that could back a meaningful readiness check. Low priority since most OLM deployments have historically used this pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/lifecycle-controller/start.go` around lines 251 - 256, The current
healthz handler added via mgr.AddHealthzCheck("healthz", ...) is a no-op;
replace it with a readiness check that returns an error until required
subsystems are ready: use mgr.GetCache().WaitForCacheSync(ctx) to ensure caches
are synced, verify leader election/leadership if applicable (e.g., check a
stored elected flag or manager/leader utilities), and confirm the TLS provider
has been initialized (check the TLS provider variable/state used elsewhere in
start.go). Implement the closure to perform these checks (using the appropriate
context available in start.go), return nil when all pass and an error describing
the failing condition otherwise, and keep the AddHealthzCheck call signature the
same.
microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml (1)

78-90: Consider adding a startupProbe given initialDelaySeconds: 30.

With both liveness and readiness probes using initialDelaySeconds: 30 and default periodSeconds/failureThreshold, a slow cold start (e.g., leader-election contention, API server warm-up) could trigger premature liveness failures and restart loops. A startupProbe with a generous failureThreshold would decouple cold-start tolerance from steady-state liveness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`
around lines 78 - 90, Add a startupProbe to the container alongside the existing
livenessProbe and readinessProbe to tolerate slow cold starts: define
startupProbe with an httpGet to path /healthz on port "health" and scheme HTTP,
set a modest initialDelaySeconds (e.g., 10), a larger periodSeconds (e.g., 10)
and a generous failureThreshold (e.g., 30) so startup failures don’t trigger
liveness restarts; update the same container spec that contains livenessProbe,
readinessProbe and terminationMessagePolicy to include this startupProbe stanza.
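A startupProbe stanza along the suggested lines might look like the fragment below; the port name, path, and thresholds are illustrative assumptions:

```yaml
# Hypothetical startupProbe: tolerates up to ~5 minutes of cold start
# (30 failures x 10 s) before liveness checks take over.
startupProbe:
  httpGet:
    path: /healthz
    port: health
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 30
```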
manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml (2)

52-55: Consider narrowing clusterrolebindings permissions to the managed CRB name.

The controller manages exactly one CRB (clusterRoleBindingName, per pkg/lifecycle-controller/controller.go), but this rule grants full CRUD on every CRB in the cluster. Since create can't be scoped by resourceNames, split the rule so that write verbs are scoped to the specific CRB to limit blast radius:

🛡️ Proposed RBAC hardening
-  # Manage lifecycle-server clusterrolebindings
-  - apiGroups: ["rbac.authorization.k8s.io"]
-    resources: ["clusterrolebindings"]
-    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
+  # Manage lifecycle-server clusterrolebindings
+  - apiGroups: ["rbac.authorization.k8s.io"]
+    resources: ["clusterrolebindings"]
+    verbs: ["get", "list", "watch", "create"]
+  - apiGroups: ["rbac.authorization.k8s.io"]
+    resources: ["clusterrolebindings"]
+    resourceNames: ["operator-lifecycle-manager-lifecycle-server"]
+    verbs: ["update", "patch", "delete"]

Note: the Trivy KSV-0056 hits on services/networkpolicies appear to be false positives — these are required for the controller's core responsibility of reconciling per-CatalogSource lifecycle-server resources.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml` around lines 52 -
55, The ClusterRole currently grants full CRUD on all clusterrolebindings;
narrow write access by splitting the rule: keep a broad read rule for
"clusterrolebindings" with verbs ["get","list","watch"], and add a separate rule
for write verbs ["create","update","patch","delete"] that includes resourceNames
set to the specific ClusterRoleBinding name used by the controller
(clusterRoleBindingName as referenced in
pkg/lifecycle-controller/controller.go); update the RBAC manifest to reflect
these two rules so only the managed CRB can be modified while reads remain
cluster-wide.

63-69: Consider adding a resourceNames field to narrow the RBAC permissions to the specific lease and configmap resources used by the controller.

Verification confirms the controller uses "operator-lifecycle-manager-lifecycle-server" as both the cluster role and binding name (in pkg/lifecycle-controller/controller.go lines 62-63). To follow least-privilege principles, the lease and configmap rules should be restricted using resourceNames: ["operator-lifecycle-manager-lifecycle-server"].

Example narrowing
  # Leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["operator-lifecycle-manager-lifecycle-server"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["operator-lifecycle-manager-lifecycle-server"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml` around lines 63 -
69, The RBAC rules for leader election currently grant broad access to leases
and configmaps; narrow them by adding resourceNames:
["operator-lifecycle-manager-lifecycle-server"] to both rules so only the
specific lease and configmap used by the lifecycle controller are allowed (match
the controller name referenced in pkg/lifecycle-controller/controller.go as
operator-lifecycle-manager-lifecycle-server).
pkg/lifecycle-controller/controller.go (3)

339-348: Two naming/path hygiene nits.

  • resourceName (lines 339-348): truncation to 63 chars is DNS-safe in length but doesn't guard against collisions between two CatalogSources whose first 63 characters of <csName>-lifecycle-server match. CatalogSource names are usually short, so the risk is low, but to be defensive, append a short hash of the full name before truncating (e.g., truncate to 57 chars, append "-", then a 5-char sha256 prefix).
  • fbcPath (line 396): catalogMountPath + catalogDir relies on catalogDir starting with "/". The default "/configs" is fine, but a user-supplied cs.Spec.GrpcPodConfig.ExtractContent.CatalogDir without a leading slash (e.g., "configs") would produce "/catalogconfigs". Use path.Join(catalogMountPath, catalogDir) to normalize.
🔧 fbcPath fix
-	const catalogMountPath = "/catalog"
-	fbcPath := catalogMountPath + catalogDir
+	const catalogMountPath = "/catalog"
+	fbcPath := path.Join(catalogMountPath, catalogDir)

Also applies to: 390-396

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/controller.go` around lines 339 - 348, resourceName
currently truncates to 63 chars which can collide; change it to compute a short
hash (e.g., sha256) of the full csName, take a 5-character hex prefix, then
truncate the human-readable portion to 57 chars and append "-" + the 5-char
prefix before final trim/lowercase (update function resourceName to produce
"<truncated-57>-<5hex>" and still enforce '-' trimming and lowercase). For
fbcPath construction, stop concatenating strings and use
path.Join(catalogMountPath, catalogDir) (ensure you import "path") so
user-supplied catalogDir with or without leading slash is normalized; update the
code that builds fbcPath (using cs.Spec.GrpcPodConfig.ExtractContent.CatalogDir
and catalogMountPath) accordingly.

174-208: Reconcile-driven cleanup is fragile without OwnerReferences.

ensureResources applies SA/Service/Deployment/NetworkPolicy via SSA without any OwnerReferences to the CatalogSource. Cleanup depends entirely on the controller observing a delete/no-match event via cleanupResources. If the controller is down when a CatalogSource is deleted, the tombstone event can be missed (no relist delivers a DeleteEvent for an object that is already gone), and the per-catalog resources are orphaned indefinitely.

Since the resources are in the same namespace as the CatalogSource, setting an OwnerReference on each applied object lets the Kubernetes garbage collector clean them up for free. This also simplifies cleanupResources (it becomes mostly unnecessary for the common deletion path).

If cross-namespace ownership ever becomes a requirement, a finalizer on CatalogSource is the robust alternative.

Also applies to: 270-336

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/controller.go` around lines 174 - 208,
ensureResources currently applies SA/Service/Deployment/NetworkPolicy without
OwnerReferences, which can orphan resources if the controller misses delete
events; update the resource builders (buildServiceAccount, buildService,
buildDeployment, buildNetworkPolicy) or immediately before calling Apply in
ensureResources to set a controller OwnerReference to the CatalogSource (use
controllerutil.SetControllerReference or populate metav1.OwnerReference with cs
as the owner) so the Kubernetes GC will remove per-catalog resources when the
CatalogSource is deleted (ensure Namespace matches the CatalogSource namespace);
you may then simplify cleanupResources since GC will handle the common delete
path.
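For illustration, the applied Service for a CatalogSource named redhat-operators would then carry an ownerReference along these lines (field values are assumptions; the UID is a placeholder that must match the live CatalogSource):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redhat-operators-lifecycle-server
  namespace: openshift-marketplace
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: redhat-operators
      uid: <catalogsource-uid>   # must match the live object's UID
      controller: true
      blockOwnerDeletion: true
```

With this in place the Kubernetes garbage collector deletes the Service (and the other per-catalog resources) when the CatalogSource is removed, even if the controller is down at the time.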

210-239: reconcileClusterRoleBinding lists every CatalogSource cluster-wide and re-applies the CRB on every reconcile.

Two small but worthwhile improvements:

  1. Line 214 lists all CatalogSources in the cluster without a selector, then re-filters client-side with matchesCatalogSource. If the cache for CatalogSources isn't already filtered, this is O(N) per reconcile and scales poorly with the total number of catalogs. Consider using client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector} as a List option to narrow the server/cache scan.
  2. The CRB is re-applied unconditionally on every reconcile of any CatalogSource (and pod, SA, service, NP event that enqueues one). In clusters with many matching catalogs, this becomes a hot write path. Server-side apply is a no-op when the desired state matches existing managed fields, but the round-trip cost still adds up. An easy mitigation is to compute the subject set and skip r.Apply when it equals the current CRB's subjects.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/controller.go` around lines 210 - 239,
reconcileClusterRoleBinding currently calls r.List without a selector and always
re-applies the CRB; change the List call to use
client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector} so only
matching CatalogSources are returned (refer to reconcileClusterRoleBinding and
matchesCatalogSource/CatalogSourceLabelSelector), and after building the
subjects slice (from resourceName and ServiceAccounts) fetch the existing
ClusterRoleBinding (via r.Get) and compare its current subjects to the computed
subjects; if they are identical, skip the r.Apply call to avoid unnecessary
server-side apply round-trips, otherwise proceed to r.Apply as before.
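A minimal sketch of the skip-if-unchanged comparison, using a simplified stand-in for rbacv1.Subject (the real type and the Get/Apply calls are simplified away here):

```go
package main

import (
	"fmt"
	"sort"
)

// subject is a hypothetical stand-in for rbacv1.Subject.
type subject struct {
	Namespace, Name string
}

// sameSubjects reports whether two subject sets are equal regardless of
// order, so the reconciler can skip the server-side apply round-trip when
// the ClusterRoleBinding already matches the desired state.
func sameSubjects(a, b []subject) bool {
	if len(a) != len(b) {
		return false
	}
	key := func(s subject) string { return s.Namespace + "/" + s.Name }
	as := make([]string, len(a))
	bs := make([]string, len(b))
	for i := range a {
		as[i], bs[i] = key(a[i]), key(b[i])
	}
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

func main() {
	current := []subject{{"ns1", "sa-a"}, {"ns2", "sa-b"}}
	desired := []subject{{"ns2", "sa-b"}, {"ns1", "sa-a"}}
	fmt.Println(sameSubjects(current, desired)) // true: apply can be skipped
}
```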
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/lifecycle-controller/start.go`:
- Around line 189-199: getInitialTLSProfile currently swallows errors from
tlsutil.FetchAPIServerTLSProfile and silently falls back to the default, which
hides startup RBAC/API errors and disables the watcher; modify
getInitialTLSProfile so that it logs the error (using the setupLog passed down
from loadStartConfig) and returns the error to the caller unless the failure is
a clear "resource not found" case, in which case you may continue to return the
default profile and a false flag; thread setupLog through loadStartConfig into
getInitialTLSProfile and update callers to handle the returned error so
transient API/RBAC failures surface instead of being masked.
- Around line 89-93: The deferred close of tlsProfileChan in
setupTLSProfileWatcher can race with blocking sends from the OnProfileChange
callback and cause panics or leaked goroutines; remove the defer
close(tlsProfileChan) and instead modify the OnProfileChange callback (which
receives a context) to gate its send on tlsProfileChan with a select that checks
ctx.Done() (drop the send if ctx is done) so sends never block or attempt to
send on a closed channel; leave the channel unclosed and let it be GC'd when no
longer referenced.

In `@manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 62-67: Replace the incorrect tiny Go memory cap by increasing the
GOMEMLIMIT and the container memory request for the lifecycle-controller
deployment: update the env var GOMEMLIMIT from "5MiB" to a realistic value (e.g.
"50MiB" or match the lifecycle-server setting used in
pkg/lifecycle-controller/controller.go), and raise resources.requests.memory
(and CPU if needed) under the same container spec (resources.requests.memory,
resources.requests.cpu). Apply the same changes to the corresponding manifest
for IBM Cloud Managed
(0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml) so both
deployments use the higher GOMEMLIMIT and appropriate memory request.

In `@microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml`:
- Around line 4715-4718: The schema for the CRD includes a contradictory default
for the required field "ip": remove the line setting default: "" for the "ip"
property so the "ip" field remains required without a default; locate the "ip"
property in the CRD schema (the block containing x-kubernetes-list-map-keys: -
ip and x-kubernetes-list-type: map) and delete the default entry while leaving
the required list and other metadata intact.

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 62-67: The GOMEMLIMIT environment variable is misconfigured as
"5MiB" and the pod resource requests are far too low; update the Deployment for
lifecycle-controller to remove or set GOMEMLIMIT to a realistic Go heap cap (for
example "1Gi" or a value tuned from observed usage) and increase
resources.requests.cpu and resources.requests.memory from 10m/10Mi to sensible
defaults (for example 100m CPU and 200Mi memory) and ensure you make the same
change for the other manifest copy that contains the same GOMEMLIMIT and
resource request entries.

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml`:
- Around line 16-19: The NetworkPolicy's ingress rule currently exposes port
8443 without a from: selector (ingress -> ports -> port: 8443), relying on
kube-rbac-proxy and the lifecycle-server RBAC for access control; if you later
want to restrict callers, update this NetworkPolicy to add a from: block (e.g.,
ingress.from with podSelector or namespaceSelector) that limits traffic to the
monitoring/metrics pods or a specific namespace/service account, ensuring the
rule explicitly selects only allowed sources instead of permitting cluster-wide
ingress to port 8443.

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml`:
- Around line 52-72: Add the missing "bind" verb to the ClusterRole that grants
permissions related to operator-lifecycle-manager-lifecycle-server (i.e., the
rule covering apiGroups: ["authentication.k8s.io"] resources: ["tokenreviews"]
and apiGroups: ["authorization.k8s.io"] resources: ["subjectaccessreviews"]) so
the controller service account can create ClusterRoleBindings without future
privilege-escalation breaks; also remove the unused configmaps rule (apiGroups:
[""], resources: ["configmaps"]) and drop the "delete" verb from the leases rule
(apiGroups: ["coordination.k8s.io"], resources: ["leases"]) and from any
remaining configmaps/events verbs to enforce least privilege and reflect that
leader election uses LeasesResourceLock only.

In `@pkg/lifecycle-controller/controller_test.go`:
- Around line 826-887: The table-driven test declares and checks the expectErr
field but never sets it; either remove the unused expectErr field and the
conditional branch that checks tc.expectErr in the test, or add a dedicated test
case that sets expectErr=true and constructs a failing client to force
getCatalogPodInfo to return an error. Locate the test table and loop in
TestGetCatalogPodInfo (the tt slice and the t.Run loop), the tc.expectErr check,
and the call to r.getCatalogPodInfo(ctx, cs); then either delete the expectErr
field from the tt struct and the if tc.expectErr branch, or add a new entry that
triggers an error (e.g., create a client that returns an error when listing
Pods) and set expectErr:true so the error path is exercised.

In `@pkg/lifecycle-controller/tls_test.go`:
- Around line 91-122: In TestTLSConfigProvider_ConcurrentAccess, avoid calling
require.NotNil from spawned goroutines; change the in-goroutine checks to
assert.NotNil(t, cfg) and keep require.NotNil only on the main test goroutine
after wg.Wait(); also add the import "github.com/stretchr/testify/assert" to the
test file so assert is available. Ensure the references are the require.NotNil
call inside the anonymous reader goroutine in
TestTLSConfigProvider_ConcurrentAccess and not the final require.NotNil after
wg.Wait().

In `@scripts/generate_crds_manifests.sh`:
- Around line 657-675: The Service for lifecycle-controller exposes TLS metrics
on port 8443 but no ServiceMonitor is emitted; add a ServiceMonitor manifest for
the lifecycle-controller that targets selector app: olm-lifecycle-controller in
namespace openshift-operator-lifecycle-manager, with spec.endpoints containing
port: metrics, scheme: https, and tlsConfig that references the serving cert
secret (secretName: lifecycle-controller-serving-cert) so Prometheus uses the CA
bundle (and set insecureSkipVerify: false); ensure labels/namespace match the
Service and follow the pattern used for the PSM ServiceMonitor (use the same
endpoint name, selector, and namespace); alternatively, if monitoring is
intentionally out-of-scope, add a TODO comment in the script near the
lifecycle-controller Service block documenting the omission.
- Around line 589-608: The lifecycle-controller pod spec currently hardcodes a
single CatalogSource selector and an unsafe GOMEMLIMIT; update the deployment to
make the catalog selector configurable (do not hardcode
"metadata.namespace=openshift-marketplace,metadata.name=redhat-operators" in the
args passed to /bin/lifecycle-controller) by wiring a config value/flag in your
cluster-olm-operator reconciler (expose a field or feature gate that populates
the --catalog-source-field-selector arg or leave it unset) and add a clarifying
comment if this is intentionally a tech‑preview guardrail; also remove or raise
the environment GOMEMLIMIT value (currently "5MiB") in the pod env so it is at
least in line with the container memory request (or omit the variable to use Go
defaults) to avoid GC thrashing—update the lifecycle-controller args and the env
entry for GOMEMLIMIT (and any related config fields like RELEASE_VERSION /
LIFECYCLE_SERVER_IMAGE if you surface the selector via config) accordingly.
- Around line 49-50: The current copy step in scripts/generate_crds_manifests.sh
simply copies staged CRDs from "${ROOT_DIR}/staging/api/crds" to "${crddir}/"
but lacks documentation about update triggers and why controller-gen/yq fixups
were removed; update the script header comments to state how staging/api/crds is
updated (e.g., via upstream sync cadence or manual cherry-pick) and reference
the existing commit governance check (verify_commits.sh) that enforces
single-upstream commit traceability, and add a short note explaining that
controller-gen regeneration and yq mutations were removed because CRD manifests
are consumed verbatim from upstream and thus no local schema fixes are required;
mention the variables/commands to look at (cp, ROOT_DIR, crddir,
verify_commits.sh) so reviewers can locate the relevant logic.
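If the ServiceMonitor route is taken, a hypothetical sketch following that description might look like this (the metadata name, label selector, and CA wiring are assumptions and must be checked against the emitted Service and the PSM ServiceMonitor pattern):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lifecycle-controller   # assumed name; match the PSM convention
  namespace: openshift-operator-lifecycle-manager
spec:
  selector:
    matchLabels:
      app: olm-lifecycle-controller
  endpoints:
    - port: metrics
      scheme: https
      tlsConfig:
        insecureSkipVerify: false
        # CA wiring is an assumption; adjust to however Prometheus trusts
        # the serving cert (e.g., the service-ca bundle) in this cluster.
        ca:
          secret:
            name: lifecycle-controller-serving-cert
            key: tls.crt
```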

---

Outside diff comments:
In `@scripts/generate_crds_manifests.sh`:
- Around line 850-865: The microshift kustomization generation loop is currently
missing "lifecycle" in the files_to_ignore list, so lifecycle controller/rl
roles (e.g., 0000_50_olm_08-lifecycle-controller.* and
0000_50_olm_09-lifecycle-server.rbac.yaml) are being included; update the
files_to_ignore array used by the microshift_manifests_files loop to add
"lifecycle" (or the specific lifecycle filenames) so the for-loop that appends
entries into microshift-manifests/kustomization.yaml will skip those lifecycle
manifests, ensuring they are not added to the generated kustomization.yaml.

---

Nitpick comments:
In `@cmd/lifecycle-controller/start.go`:
- Around line 124-135: The loadStartConfig function currently intermixes
environment/flag validation with setup logic; extract a new validateStartConfig
(or validateFlags) helper and move all input-only checks (NAMESPACE presence
when !disableLeaderElection, LIFECYCLE_SERVER_IMAGE non-empty, TLS cert
existence and any selector validation) into that function so loadStartConfig
only constructs and returns the startConfig and runtime objects (REST config,
scheme, leader election flags); call validateStartConfig early from the startup
path before invoking loadStartConfig, keep the same error messages, and
reference the existing symbols loadStartConfig and disableLeaderElection to
locate where to split validation from construction.
- Around line 251-256: The current healthz handler added via
mgr.AddHealthzCheck("healthz", ...) is a no-op; replace it with a readiness
check that returns an error until required subsystems are ready: use
mgr.GetCache().WaitForCacheSync(ctx) to ensure caches are synced, verify leader
election/leadership if applicable (e.g., check a stored elected flag or
manager/leader utilities), and confirm the TLS provider has been initialized
(check the TLS provider variable/state used elsewhere in start.go). Implement
the closure to perform these checks (using the appropriate context available in
start.go), return nil when all pass and an error describing the failing
condition otherwise, and keep the AddHealthzCheck call signature the same.

In `@Makefile`:
- Around line 102-106: Update the Makefile build rules for
LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD to embed version metadata and
remove unnecessary build tags: add -ldflags settings to inject GitCommit/Version
(matching the OLM_CMDS/REGISTRY_CMDS convention) into the go build invocation
for the targets LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD, and drop the
unused -tags "json1,containers_image_openpgp" from those build lines so
CGO/sqlite-related tags are not pulled in for these binaries.

In `@manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml`:
- Around line 16-23: Ingress on port 8443 and egress on port 6443 currently have
no peer selectors so they allow traffic from/to anywhere; update the
NetworkPolicy stanza that defines ingress: - ports: - port: 8443 and egress: -
ports: - port: 6443 to include appropriate peer selectors (e.g., for ingress add
a from: with namespaceSelector matching the monitoring namespace such as
matchLabels: {name: openshift-monitoring} or a podSelector limiting to
monitoring pods, and for egress add a to: with namespaceSelector/podSelector or
an ipBlock.cidr that targets the kube-apiserver IPs) so the rules are scoped
rather than global.
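Illustrative peer scoping for both rules (the selectors and CIDR are assumptions that depend on the target cluster):

```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: openshift-monitoring
    ports:
      - port: 8443
        protocol: TCP
egress:
  - to:
      - ipBlock:
          cidr: 10.0.0.1/32   # kube-apiserver endpoint; cluster-specific
    ports:
      - port: 6443
        protocol: TCP
```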

In `@manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml`:
- Around line 52-55: The ClusterRole currently grants full CRUD on all
clusterrolebindings; narrow write access by splitting the rule: keep a broad
read rule for "clusterrolebindings" with verbs ["get","list","watch"], and add a
separate rule for write verbs ["create","update","patch","delete"] that includes
resourceNames set to the specific ClusterRoleBinding name used by the controller
(clusterRoleBindingName as referenced in
pkg/lifecycle-controller/controller.go); update the RBAC manifest to reflect
these two rules so only the managed CRB can be modified while reads remain
cluster-wide.
- Around line 63-69: The RBAC rules for leader election currently grant broad
access to leases and configmaps; narrow them by adding resourceNames:
["operator-lifecycle-manager-lifecycle-server"] to both rules so only the
specific lease and configmap used by the lifecycle controller are allowed (match
the controller name referenced in pkg/lifecycle-controller/controller.go as
operator-lifecycle-manager-lifecycle-server).

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 78-90: Add a startupProbe to the container alongside the existing
livenessProbe and readinessProbe to tolerate slow cold starts: define
startupProbe with an httpGet to path /healthz on port "health" and scheme HTTP,
set a modest initialDelaySeconds (e.g., 10), a larger periodSeconds (e.g., 10)
and a generous failureThreshold (e.g., 30) so startup failures don’t trigger
liveness restarts; update the same container spec that contains livenessProbe,
readinessProbe and terminationMessagePolicy to include this startupProbe stanza.

In `@pkg/lifecycle-controller/controller_test.go`:
- Around line 125-129: The test shows resourceName("") returns
"-lifecycle-server", producing an invalid DNS-1123 label; update the code/tests
so the contract is explicit: either remove the "empty name" test case from
controller_test.go, or change resourceName to validate input and return an error
or panic for empty/invalid names (e.g., check in resourceName for "" and handle
accordingly). Locate the function resourceName and the test case named "empty
name" to implement the chosen fix and ensure callers (e.g., any CatalogSource
name handling) are updated to handle the new error/behavior.

In `@pkg/lifecycle-controller/controller.go`:
- Around line 339-348: resourceName currently truncates to 63 chars which can
collide; change it to compute a short hash (e.g., sha256) of the full csName,
take a 5-character hex prefix, then truncate the human-readable portion to 57
chars and append "-" + the 5-char prefix before final trim/lowercase (update
function resourceName to produce "<truncated-57>-<5hex>" and still enforce '-'
trimming and lowercase). For fbcPath construction, stop concatenating strings
and use path.Join(catalogMountPath, catalogDir) (ensure you import "path") so
user-supplied catalogDir with or without leading slash is normalized; update the
code that builds fbcPath (using cs.Spec.GrpcPodConfig.ExtractContent.CatalogDir
and catalogMountPath) accordingly.
- Around line 174-208: ensureResources currently applies
SA/Service/Deployment/NetworkPolicy without OwnerReferences, which can orphan
resources if the controller misses delete events; update the resource builders
(buildServiceAccount, buildService, buildDeployment, buildNetworkPolicy) or
immediately before calling Apply in ensureResources to set a controller
OwnerReference to the CatalogSource (use controllerutil.SetControllerReference
or populate metav1.OwnerReference with cs as the owner) so the Kubernetes GC
will remove per-catalog resources when the CatalogSource is deleted (ensure
Namespace matches the CatalogSource namespace); you may then simplify
cleanupResources since GC will handle the common delete path.
- Around line 210-239: reconcileClusterRoleBinding currently calls r.List
without a selector and always re-applies the CRB; change the List call to use
client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector} so only
matching CatalogSources are returned (refer to reconcileClusterRoleBinding and
matchesCatalogSource/CatalogSourceLabelSelector), and after building the
subjects slice (from resourceName and ServiceAccounts) fetch the existing
ClusterRoleBinding (via r.Get) and compare its current subjects to the computed
subjects; if they are identical, skip the r.Apply call to avoid unnecessary
server-side apply round-trips, otherwise proceed to r.Apply as before.

In `@pkg/lifecycle-controller/tls_test.go`:
- Around line 75-89: Replace the boolean flag in
TestTLSConfigProvider_GetCertificatePreserved with an integer counter so you can
assert the injected callback is invoked the expected number of times: change
getCert to increment a counter (e.g., calls++), create the provider via
NewTLSConfigProvider and obtain cfg via p.Get(), call cfg.GetCertificate(nil),
then assert the counter equals 1 (or the exact expected count) instead of
checking a bool; reference the getCert closure, NewTLSConfigProvider, p.Get, and
cfg.GetCertificate to locate the test and update expectations.

In `@pkg/lifecycle-controller/TODO.md`:
- Around line 1-42: Move the in-tree TODO items into a tracked external ticket
and link it from the file: create a Jira/GitHub issue (referencing OPRUN-4445)
that contains the e2e test checklist from pkg/lifecycle-controller/TODO.md,
include expected test scenarios (happy-path and targeted tests) and acceptance
criteria, then update pkg/lifecycle-controller/TODO.md to replace the full
checklist with a short pointer to the newly created issue (and the issue ID) and
optionally note that the file is retained as a temporary local checklist; also
update the PR description to reference the new issue so the work is discoverable
and actionable outside the repo.

In `@pkg/lifecycle-server/fbc.go`:
- Around line 108-115: ListVersions currently returns versions in map iteration
order, which is non-deterministic; modify LifecycleIndex.ListVersions to sort the
versions before returning them (use sort.Strings on the slice) so logs (e.g.,
the startup log in start.go that calls data.ListVersions()) are deterministic,
and add the "sort" import to the file.
- Around line 42-87: LoadLifecycleData currently swallows per-file errors inside
the declcfg.WalkMetasFS callback, losing debuggability; modify LoadLifecycleData
to accept a logger (e.g., logr.Logger) or a pointer to a slice of warning
errors, and in the WalkMetasFS callback, instead of returning nil on per-file
errors (the err parameter) or other parse/validation issues, record them by
logging via the provided logger or appending to the warnings slice (but continue
walking), keeping the existing behavior of returning a single terminal error
only if WalkMetasFS itself fails; update the function signature
(LoadLifecycleData) and callers accordingly and reference the meta/err handling
inside the anonymous callback where schemaVersionRegex, meta.Package, meta.Blob
and the result map are used so individual file problems are surfaced without
aborting the overall walk.
- Around line 70-77: When inserting into result under schemaVersion and
meta.Package (the block using mu.Lock()/mu.Unlock() that assigns
result[schemaVersion][meta.Package] = meta.Blob), check whether an entry already
exists and either log a debug/warn message with the existing vs new blob
identity or return an error to reject duplicates; specifically, inside that
critical section look up result[schemaVersion][meta.Package] before assignment,
and if non-nil call the package’s logger or lifecycle server logger with
schemaVersion and meta.Package and indicate a duplicate (optionally include a
short blob fingerprint/length) or propagate an error instead of silently
overwriting—this makes duplicate (version, package) entries from WalkMetasFS
observable and avoids silent last-wins behavior.
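A stdlib-only sketch combining the sorted ListVersions and the duplicate check described above (string blobs stand in for the real meta.Blob; the actual index lives in pkg/lifecycle-server/fbc.go):

```go
package main

import (
	"fmt"
	"sort"
)

// index maps schema version -> package name -> blob.
type index map[string]map[string]string

// insert records a package blob under a schema version and reports
// duplicates instead of silently overwriting (last-wins).
func (idx index) insert(version, pkg, blob string) error {
	if idx[version] == nil {
		idx[version] = map[string]string{}
	}
	if _, dup := idx[version][pkg]; dup {
		return fmt.Errorf("duplicate lifecycle entry for %s/%s", version, pkg)
	}
	idx[version][pkg] = blob
	return nil
}

// ListVersions returns versions in sorted order so startup logs are
// deterministic across runs.
func (idx index) ListVersions() []string {
	versions := make([]string, 0, len(idx))
	for v := range idx {
		versions = append(versions, v)
	}
	sort.Strings(versions)
	return versions
}

func main() {
	idx := index{}
	_ = idx.insert("v4.19", "etcd", "{...}")
	_ = idx.insert("v4.18", "etcd", "{...}")
	fmt.Println(idx.ListVersions())                          // [v4.18 v4.19]
	fmt.Println(idx.insert("v4.19", "etcd", "{...}") != nil) // true: duplicate is reported
}
```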

In `@scripts/generate_crds_manifests.sh`:
- Around line 677-791: The NetworkPolicy for lifecycle-controller currently
allows egress to port 6443/TCP without scoping; update the manifest created in
the heredoc (manifest name
0000_50_olm_08-lifecycle-controller.networkpolicy.yaml) to add a to: with an
appropriate namespaceSelector (e.g., select the kube-apiserver/default
namespace) to restrict egress to the API server; in the ClusterRole (kind:
ClusterRole name: operator-lifecycle-manager-lifecycle-controller in
0000_50_olm_08-lifecycle-controller.rbac.yaml) remove the unnecessary "delete"
verb from the coordination.k8s.io/leases rule and trim verbs to only
["get","create","update"] (optionally keep "list","watch" if ReleaseOnCancel is
disabled); finally, review the ClusterRole rule for services (apiGroups: [""],
resources: ["services"]) and if the controller relies on
ownerReferences/server-side-apply for GC, remove the "delete" verb from that
rule as well (or retain it only if explicit deletions are exercised).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a7892ad0-ead2-4a1e-a85a-8ff58999794f

📥 Commits

Reviewing files that changed from the base of the PR and between 122728c and abeb3db.

⛔ Files ignored due to path filters (6)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/openshift/controller-runtime-common/LICENSE is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/controller-runtime-common/pkg/tls/controller.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/controller-runtime-common/pkg/tls/tls.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/crypto/crypto.go is excluded by !**/vendor/**, !vendor/**
  • vendor/modules.txt is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (36)
  • Makefile
  • cmd/lifecycle-controller/main.go
  • cmd/lifecycle-controller/start.go
  • cmd/lifecycle-controller/util.go
  • cmd/lifecycle-server/main.go
  • cmd/lifecycle-server/start.go
  • cmd/lifecycle-server/util.go
  • go.mod
  • manifests/0000_50_olm_00-catalogsources.crd.yaml
  • manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.service.yaml
  • manifests/0000_50_olm_09-lifecycle-server.rbac.yaml
  • microshift-manifests/0000_50_olm_00-catalogsources.crd.yaml
  • microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.service.yaml
  • microshift-manifests/0000_50_olm_09-lifecycle-server.rbac.yaml
  • microshift-manifests/kustomization.yaml
  • operator-lifecycle-manager.Dockerfile
  • pkg/lifecycle-controller/TODO.md
  • pkg/lifecycle-controller/controller.go
  • pkg/lifecycle-controller/controller_test.go
  • pkg/lifecycle-controller/tls.go
  • pkg/lifecycle-controller/tls_test.go
  • pkg/lifecycle-server/fbc.go
  • pkg/lifecycle-server/fbc_test.go
  • pkg/lifecycle-server/server.go
  • pkg/lifecycle-server/server_test.go
  • scripts/generate_crds_manifests.sh

Comment on lines +89 to +93
tlsProfileChan, err := setupTLSProfileWatcher(mgr, cfg)
if err != nil {
	return fmt.Errorf("unable to setup TLS profile watcher: %v", err)
}
defer close(tlsProfileChan)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm whether SecurityProfileWatcher.OnProfileChange can be invoked after mgr.Start returns
# and whether the upstream implementation treats ctx cancellation for the callback.
rg -nP --type=go -C5 '\bSecurityProfileWatcher\b' vendor/ 2>/dev/null || \
  rg -nP --type=go -C5 '\bSecurityProfileWatcher\b'
rg -nP --type=go -C3 'OnProfileChange' vendor/ 2>/dev/null || \
  rg -nP --type=go -C3 'OnProfileChange'

Repository: openshift/operator-framework-olm

Length of output: 9224


🏁 Script executed:

cat -n cmd/lifecycle-controller/start.go | sed -n '1,100p'

Repository: openshift/operator-framework-olm

Length of output: 4200


🏁 Script executed:

cat -n cmd/lifecycle-controller/start.go | sed -n '270,295p'

Repository: openshift/operator-framework-olm

Length of output: 1405


🏁 Script executed:

cat -n cmd/lifecycle-controller/start.go | sed -n '180,210p'

Repository: openshift/operator-framework-olm

Length of output: 1719


🏁 Script executed:

# Find all usages of tlsProfileChan to understand the data flow
rg -n 'tlsProfileChan' cmd/lifecycle-controller/start.go -A2 -B2

Repository: openshift/operator-framework-olm

Length of output: 912


🏁 Script executed:

# Find setupTLSProfileWatcher function definition and channel creation
rg -n 'func setupTLSProfileWatcher' cmd/lifecycle-controller/start.go -A30

Repository: openshift/operator-framework-olm

Length of output: 1414


🏁 Script executed:

# Find where tlsChangeChan is created (make call)
rg -n 'make.*tlsChangeChan|make.*event.TypedGenericEvent' cmd/lifecycle-controller/start.go -B5 -A2

Repository: openshift/operator-framework-olm

Length of output: 377


🏁 Script executed:

# Find where tlsChangeChan is consumed/read from
rg -n 'for.*tlsProfileChan|<-.*tlsProfileChan' cmd/ -r

Repository: openshift/operator-framework-olm

Length of output: 490


🏁 Script executed:

# Find where tlsProfileChan is consumed/read from
rg -n 'for.*range.*tlsProfileChan|<-.*tlsProfileChan' cmd/ -B3 -A3

Repository: openshift/operator-framework-olm

Length of output: 58


🏁 Script executed:

# Also check SetupWithManager in the LifecycleServerReconciler
rg -n 'SetupWithManager' pkg/lifecycle-controller/ -A20 | head -80

Repository: openshift/operator-framework-olm

Length of output: 3235


Potential panic / goroutine leak: defer close(tlsProfileChan) races with blocking sends from OnProfileChange.

Two failure modes:

  1. If OnProfileChange fires while the manager is shutting down, the unbuffered send at line 282 blocks (no consumer — controller-runtime has stopped draining). mgr.Start returns, defer close(tlsProfileChan) runs, and the send then panics with "send on closed channel".
  2. Even without a close, a blocked send holds the watcher goroutine indefinitely, leaking it past run() returning.

The OnProfileChange callback already receives a context.Context (line 271) — gate the send on ctx.Done() and drop the defer close (leaving it to GC is safe and avoids the race entirely):

🔒️ Suggested fix
-	defer close(tlsProfileChan)
+	// Intentionally not closed: producer (SecurityProfileWatcher) may still be mid-send
+	// when mgr.Start returns; rely on GC after run() exits.
 		OnProfileChange: func(ctx context.Context, oldTLSProfileSpec, newTLSProfileSpec configv1.TLSProfileSpec) {
 			cfg.TLSConfigProvider.UpdateProfile(newTLSProfileSpec)
 			log.Info("applying new TLS profile spec",
 				"minVersion", newTLSProfileSpec.MinTLSVersion,
 				"cipherSuites", newTLSProfileSpec.Ciphers,
 			)

 			_, unsupportedCiphers := cfg.TLSConfigProvider.Get()
 			if len(unsupportedCiphers) > 0 {
 				log.Info("ignoring unsupported ciphers found in TLS profile", "unsupportedCiphers", unsupportedCiphers)
 			}
-			tlsChangeChan <- event.TypedGenericEvent[configv1.TLSProfileSpec]{Object: newTLSProfileSpec}
+			select {
+			case tlsChangeChan <- event.TypedGenericEvent[configv1.TLSProfileSpec]{Object: newTLSProfileSpec}:
+			case <-ctx.Done():
+				return
+			}
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/lifecycle-controller/start.go` around lines 89 - 93, The deferred close
of tlsProfileChan in setupTLSProfileWatcher can race with blocking sends from
the OnProfileChange callback and cause panics or leaked goroutines; remove the
defer close(tlsProfileChan) and instead modify the OnProfileChange callback
(which receives a context) to gate its send on tlsProfileChan with a select that
checks ctx.Done() (drop the send if ctx is done) so sends never block or attempt
to send on a closed channel; leave the channel unclosed and let it be GC'd when
no longer referenced.

Comment on lines +189 to +199
func getInitialTLSProfile(ctx context.Context, restConfig *rest.Config, sch *runtime.Scheme) (configv1.TLSProfileSpec, bool, error) {
	cl, err := client.New(restConfig, client.Options{Scheme: sch})
	if err != nil {
		return configv1.TLSProfileSpec{}, false, fmt.Errorf("failed to create client: %v", err)
	}
	initialTLSProfileSpec, err := tlsutil.FetchAPIServerTLSProfile(ctx, cl)
	if err != nil {
		return *configv1.TLSProfiles[crypto.DefaultTLSProfileType], false, nil
	}
	return initialTLSProfileSpec, true, nil
}

⚠️ Potential issue | 🟠 Major

Silently swallowing FetchAPIServerTLSProfile error masks startup problems.

When tlsutil.FetchAPIServerTLSProfile fails (e.g., transient API server error, missing RBAC), this returns the default profile, disables the TLS profile watcher for the entire process lifetime, and returns nil — there's no log line and the operator has no way to distinguish "no APIServer config" from "RBAC was missing at startup". At minimum, log the error; ideally, bubble it up unless the failure is specifically "resource not found":

🔒️ Suggested fix
-func getInitialTLSProfile(ctx context.Context, restConfig *rest.Config, sch *runtime.Scheme) (configv1.TLSProfileSpec, bool, error) {
+func getInitialTLSProfile(ctx context.Context, log logr.Logger, restConfig *rest.Config, sch *runtime.Scheme) (configv1.TLSProfileSpec, bool, error) {
 	cl, err := client.New(restConfig, client.Options{Scheme: sch})
 	if err != nil {
 		return configv1.TLSProfileSpec{}, false, fmt.Errorf("failed to create client: %v", err)
 	}
 	initialTLSProfileSpec, err := tlsutil.FetchAPIServerTLSProfile(ctx, cl)
 	if err != nil {
+		log.Error(err, "failed to fetch APIServer TLS profile; falling back to default and disabling TLS profile watcher")
 		return *configv1.TLSProfiles[crypto.DefaultTLSProfileType], false, nil
 	}
 	return initialTLSProfileSpec, true, nil
 }

(and thread setupLog through from loadStartConfig).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/lifecycle-controller/start.go` around lines 189 - 199,
getInitialTLSProfile currently swallows errors from
tlsutil.FetchAPIServerTLSProfile and silently falls back to the default, which
hides startup RBAC/API errors and disables the watcher; modify
getInitialTLSProfile so that it logs the error (using the setupLog passed down
from loadStartConfig) and returns the error to the caller unless the failure is
a clear "resource not found" case, in which case you may continue to return the
default profile and a false flag; thread setupLog through loadStartConfig into
getInitialTLSProfile and update callers to handle the returned error so
transient API/RBAC failures surface instead of being masked.

Comment on lines +62 to +67
            - name: GOMEMLIMIT
              value: "5MiB"
          resources:
            requests:
              cpu: 10m
              memory: 10Mi

⚠️ Potential issue | 🔴 Critical

GOMEMLIMIT: 5MiB will starve the Go runtime — almost certainly a typo.

A 5MiB soft-memory limit is below the baseline heap of a controller-runtime binary (manager + multiple informers for CatalogSources/Pods/Deployments/Services/SAs/NetworkPolicies + library-go crypto + apply configurations). The runtime will force back-to-back GC cycles as it approaches the limit, causing severe CPU thrashing and, in practice, startup failure or effective deadlock before any reconcile runs.

For comparison, the lifecycle-server container you create in pkg/lifecycle-controller/controller.go (line 450-451) uses GOMEMLIMIT=50MiB. That value — or higher — is appropriate here too, alongside a correspondingly raised memory request (the current 10Mi also looks too low for a reconciler of this scope).

🔧 Proposed fix
           resources:
             requests:
               cpu: 10m
-              memory: 10Mi
+              memory: 80Mi
           ...
             - name: GOMEMLIMIT
-              value: "5MiB"
+              value: "80MiB"

The same values should be applied to manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
             - name: GOMEMLIMIT
-              value: "5MiB"
+              value: "80MiB"
           resources:
             requests:
               cpu: 10m
-              memory: 10Mi
+              memory: 80Mi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml` around lines
62 - 67, Replace the incorrect tiny Go memory cap by increasing the GOMEMLIMIT
and the container memory request for the lifecycle-controller deployment: update
the env var GOMEMLIMIT from "5MiB" to a realistic value (e.g. "50MiB" or match
the lifecycle-server setting used in pkg/lifecycle-controller/controller.go),
and raise resources.requests.memory (and CPU if needed) under the same container
spec (resources.requests.memory, resources.requests.cpu). Apply the same changes
to the corresponding manifest for IBM Cloud Managed
(0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml) so both
deployments use the higher GOMEMLIMIT and appropriate memory request.
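
For context on what "5MiB" actually sets: GOMEMLIMIT accepts a byte count with an optional binary suffix (B, KiB, MiB, GiB, TiB). A rough converter — an illustrative sketch, not the runtime's actual parser, which also accepts "off" — makes it concrete how small the current value is:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGoMemLimit converts a GOMEMLIMIT-style value ("5MiB", "80MiB",
// "1GiB") into bytes. Simplified sketch of the runtime's parsing rules.
func parseGoMemLimit(s string) (int64, error) {
	suffixes := []struct {
		unit string
		mult int64
	}{
		{"TiB", 1 << 40}, {"GiB", 1 << 30}, {"MiB", 1 << 20}, {"KiB", 1 << 10}, {"B", 1},
	}
	for _, sf := range suffixes {
		if strings.HasSuffix(s, sf.unit) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, sf.unit), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * sf.mult, nil
		}
	}
	return strconv.ParseInt(s, 10, 64) // bare byte count
}

func main() {
	for _, v := range []string{"5MiB", "80MiB"} {
		b, _ := parseGoMemLimit(v)
		fmt.Printf("%s = %d bytes\n", v, b)
	}
	// 5 MiB (5242880 bytes) is below the idle heap of a typical
	// controller-runtime manager, so the GC would run continuously.
}
```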

Comment on lines +4715 to 4718
default: ""
x-kubernetes-list-map-keys:
- ip
x-kubernetes-list-type: map

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether `ip` is already required in this schema object and find similar patterns.
set -euo pipefail

echo "== Locate list-map keys using ip =="
rg -n -C4 'x-kubernetes-list-map-keys:\n\s*-\s*ip' microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml

echo
echo "== Locate defaults on ip fields in CRDs =="
rg -n -C3 '^\s*ip:\s*$|^\s*default:\s*""\s*$' microshift-manifests/*.yaml

echo
echo "== Locate required blocks mentioning ip in same file =="
rg -n -C3 '^\s*required:\s*$|^\s*-\s*ip\s*$' microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml

Repository: openshift/operator-framework-olm

Length of output: 322


🏁 Script executed:

#!/bin/bash
# Verify the ip field and list-map key configuration in the CRD
set -euo pipefail

# Convert to relative path and check file
FILE="microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml"

echo "=== Checking lines around 4715-4718 ==="
sed -n '4710,4725p' "$FILE" | cat -n

echo
echo "=== Search for ip field definitions with defaults ==="
rg -U -n 'ip:.*?\n.*?default' "$FILE" | head -20

echo
echo "=== Search for x-kubernetes-list-map-keys patterns ==="
rg -n 'x-kubernetes-list-map-keys' "$FILE"

echo
echo "=== Check context around line 4715 (50 lines before and after) ==="
sed -n '4665,4765p' "$FILE" | cat -n

Repository: openshift/operator-framework-olm

Length of output: 1333


🏁 Script executed:

#!/bin/bash
FILE="microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml"

echo "=== Broader context: lines 4680-4730 ==="
sed -n '4680,4730p' "$FILE" | cat -n

echo
echo "=== Search for 'required:' with following lines to identify scope ==="
sed -n '4660,4750p' "$FILE" | grep -n -A5 'required:' | head -30

Repository: openshift/operator-framework-olm

Length of output: 4771


🏁 Script executed:

#!/bin/bash
FILE="microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml"

# Count total occurrences of this contradictory pattern in the file
echo "=== Fields marked as required AND having defaults (potential issues) ==="
python3 << 'PYEOF'
import re

with open(FILE, 'r') as f:
    content = f.read()

# Find all object schemas with both required and default patterns
lines = content.split('\n')

issues = []
for i in range(len(lines) - 50):
    # Look for required blocks
    if re.search(r'^\s*required:\s*$', lines[i]):
        # Check next 40 lines for fields with defaults
        obj_section = '\n'.join(lines[i:i+50])
        if 'default:' in obj_section:
            issues.append({
                'line': i + 1,
                'context': lines[max(0, i-2):min(len(lines), i+15)]
            })

if issues:
    print(f"Found {len(issues)} objects with required AND default fields")
    for issue in issues[:3]:  # Show first 3
        print(f"\nAround line {issue['line']}:")
        for line in issue['context']:
            print(f"  {line}")
else:
    print("No mixed required+default patterns found")
PYEOF

Repository: openshift/operator-framework-olm

Length of output: 278


Remove contradictory default from required ip field

The ip field is already marked as required (line 4724) but also has default: "" (line 4715). Required fields should not have defaults—this contradiction can cause unexpected merge and validation behavior. Remove the default value.

Correct schema fix
                                              required:
                                                - ip
                                              properties:
                                                hostnames:
                                                  description: Hostnames for the above IP address.
                                                  type: array
                                                  items:
                                                    type: string
                                                  x-kubernetes-list-type: atomic
                                                ip:
                                                  description: IP address of the host file entry.
                                                  type: string
-                                                 default: ""
                                            x-kubernetes-list-map-keys:
                                              - ip
                                            x-kubernetes-list-type: map
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-                                                 default: ""
                                              x-kubernetes-list-map-keys:
                                                - ip
                                              x-kubernetes-list-type: map
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml` around
lines 4715 - 4718, The schema for the CRD includes a contradictory default for
the required field "ip": remove the line setting default: "" for the "ip"
property so the "ip" field remains required without a default; locate the "ip"
property in the CRD schema (the block containing x-kubernetes-list-map-keys: -
ip and x-kubernetes-list-type: map) and delete the default entry while leaving
the required list and other metadata intact.

Comment on lines +62 to +67
            - name: GOMEMLIMIT
              value: "5MiB"
          resources:
            requests:
              cpu: 10m
              memory: 10Mi

⚠️ Potential issue | 🔴 Critical

GOMEMLIMIT: "5MiB" is almost certainly a misconfiguration.

Five mebibytes is not enough memory for the Go runtime itself, let alone a controller-runtime manager with informers, caches, and REST clients. The controller will either OOM at startup or spend most of its time in GC. The memory request of 10Mi is likewise far too low for this workload.

Note: the same values appear in microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml (lines 63-64, 67-68); both should be fixed together.

🐛 Suggested correction (tune to observed usage)
             - name: GOMEMLIMIT
-              value: "5MiB"
+              value: "200MiB"
           resources:
             requests:
               cpu: 10m
-              memory: 10Mi
+              memory: 100Mi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
             - name: GOMEMLIMIT
-              value: "5MiB"
+              value: "200MiB"
           resources:
             requests:
               cpu: 10m
-              memory: 10Mi
+              memory: 100Mi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`
around lines 62 - 67, The GOMEMLIMIT environment variable is misconfigured as
"5MiB" and the pod resource requests are far too low; update the Deployment for
lifecycle-controller to remove or set GOMEMLIMIT to a realistic Go heap cap (for
example "1Gi" or a value tuned from observed usage) and increase
resources.requests.cpu and resources.requests.memory from 10m/10Mi to sensible
defaults (for example 100m CPU and 200Mi memory) and ensure you make the same
change for the other manifest copy that contains the same GOMEMLIMIT and
resource request entries.

Comment on lines +826 to +887

tt := []struct {
	name          string
	pods          []*corev1.Pod
	expectedImage string
	expectedNode  string
	expectErr     bool
}{
	{
		name:          "no pods returns empty",
		pods:          nil,
		expectedImage: "",
		expectedNode:  "",
	},
	{
		name: "running pod with digest",
		pods: []*corev1.Pod{
			catalogPod("test-catalog", "test-ns", "worker-1", "sha256:abc123", corev1.PodRunning),
		},
		expectedImage: "sha256:abc123",
		expectedNode:  "worker-1",
	},
	{
		name: "pending pod is skipped",
		pods: []*corev1.Pod{
			catalogPod("test-catalog", "test-ns", "worker-1", "sha256:abc123", corev1.PodPending),
		},
		expectedImage: "",
		expectedNode:  "",
	},
	{
		name: "running pod with empty imageID is skipped",
		pods: []*corev1.Pod{
			catalogPod("test-catalog", "test-ns", "worker-1", "", corev1.PodRunning),
		},
		expectedImage: "",
		expectedNode:  "",
	},
}

for _, tc := range tt {
	t.Run(tc.name, func(t *testing.T) {
		builder := fake.NewClientBuilder().WithScheme(scheme)
		for _, p := range tc.pods {
			builder = builder.WithObjects(p)
		}
		cl := builder.Build()

		r := testReconciler(cl)
		cs := newCatalogSource("test-catalog", "test-ns", nil)

		image, node, err := r.getCatalogPodInfo(ctx, cs)
		if tc.expectErr {
			require.Error(t, err)
			return
		}
		require.NoError(t, err)
		require.Equal(t, tc.expectedImage, image)
		require.Equal(t, tc.expectedNode, node)
	})
}
}

⚠️ Potential issue | 🟡 Minor

Unused expectErr field in TestGetCatalogPodInfo.

The expectErr field is declared (line 832) and branched on (line 878-881), but no case in tt sets it to true, so the error path is never exercised. Either add a case that triggers an error (e.g., make the client fail) or remove the field to keep the table honest.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/controller_test.go` around lines 826 - 887, The
table-driven test declares and checks the expectErr field but never sets it;
either remove the unused expectErr field and the conditional branch that checks
tc.expectErr in the test, or add a dedicated test case that sets expectErr=true
and constructs a failing client to force getCatalogPodInfo to return an error.
Locate the test table and loop in TestGetCatalogPodInfo (the tt slice and the
t.Run loop), the tc.expectErr check, and the call to r.getCatalogPodInfo(ctx,
cs); then either delete the expectErr field from the tt struct and the if
tc.expectErr branch, or add a new entry that triggers an error (e.g., create a
client that returns an error when listing Pods) and set expectErr:true so the
error path is exercised.

Comment on lines +91 to +122
func TestTLSConfigProvider_ConcurrentAccess(t *testing.T) {
	p := NewTLSConfigProvider(dummyGetCertificate, tls12Profile())

	const goroutines = 50
	var wg sync.WaitGroup
	wg.Add(goroutines * 2)

	// Half the goroutines read, half update
	for i := range goroutines {
		go func() {
			defer wg.Done()
			cfg, _ := p.Get()
			require.NotNil(t, cfg)
		}()
		go func(i int) {
			defer wg.Done()
			var profile configv1.TLSProfileSpec
			if i%2 == 0 {
				profile = tls12Profile()
			} else {
				profile = tls13Profile()
			}
			p.UpdateProfile(profile)
		}(i)
	}

	wg.Wait()

	// Provider should still be functional after concurrent access
	cfg, _ := p.Get()
	require.NotNil(t, cfg)
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /dev/null && find . -path "*pkg/lifecycle-controller/tls_test.go" 2>/dev/null | head -1

Repository: openshift/operator-framework-olm

Length of output: 126


🏁 Script executed:

git ls-files | grep -i "tls_test.go"

Repository: openshift/operator-framework-olm

Length of output: 113


🏁 Script executed:

fd tls_test.go

Repository: openshift/operator-framework-olm

Length of output: 113


🏁 Script executed:

git ls-files | head -30

Repository: openshift/operator-framework-olm

Length of output: 736


🏁 Script executed:

cat -n pkg/lifecycle-controller/tls_test.go | head -130

Repository: openshift/operator-framework-olm

Length of output: 3894


🏁 Script executed:

head -130 pkg/lifecycle-controller/tls_test.go | tail -50

Repository: openshift/operator-framework-olm

Length of output: 1223


🏁 Script executed:

sed -n '1,15p' pkg/lifecycle-controller/tls_test.go

Repository: openshift/operator-framework-olm

Length of output: 341


🏁 Script executed:

web_search
Go testing t.FailNow from goroutine anti-pattern

Repository: openshift/operator-framework-olm

Length of output: 168


🌐 Web query:

Go testing FailNow must only be called from test goroutine

💡 Result:

Yes, in Go's testing package, t.FailNow (and equivalents like t.Fatal and t.Fatalf) must only be called from the goroutine running the test or benchmark function, not from goroutines created during the test. Calling it from a non-test goroutine does not properly stop test execution and can lead to data races (detectable by the race detector), hangs, or skipped cleanup. The documentation for (*T).FailNow states: "FailNow marks the function as having failed and stops its execution by calling runtime.Goexit (which then runs all deferred calls in the current goroutine). Execution will continue at the next test or benchmark. FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test. Calling FailNow does not stop those other goroutines." The same restriction applies to (*B).FailNow for benchmarks and (*F).FailNow for fuzzers (added in Go 1.18).

From worker goroutines, use t.Error or t.Errorf to report failures without stopping execution, combined with synchronization primitives like sync.WaitGroup so all goroutines complete before the test ends.

Incorrect usage:

	func TestBad(t *testing.T) {
		go func() {
			t.FailNow() // Wrong: called from non-test goroutine
		}()
	}

Correct approach:

	func TestGood(t *testing.T) {
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Do work...
			if err != nil {
				t.Error("worker error:", err) // reports failure without stopping the goroutine
			}
		}()
		wg.Wait()
	}

Tools like Go's vet analyzer and IDE inspections (e.g., GoLand) detect this statically; the race detector may also flag violations dynamically. This rule has been consistent across Go versions, including the latest (Go 1.25.6 as of pkg.go.dev docs).


🏁 Script executed:

rg "require\." pkg/lifecycle-controller/tls_test.go

Repository: openshift/operator-framework-olm

Length of output: 627


🏁 Script executed:

# Check if assert is imported anywhere in the file
grep -n "assert" pkg/lifecycle-controller/tls_test.go

Repository: openshift/operator-framework-olm

Length of output: 58


🌐 Web query:

stretchr testify assert vs require difference goroutine

💡 Result:

In the Go testing library stretchr/testify, assert and require provide the same assertion functions with different failure behavior:

  • assert functions log failures but allow the test to continue, returning a boolean indicating success (true) or failure (false). This enables conditional follow-up assertions, e.g. if assert.NotNil(t, obj) { assert.Equal(t, "expected", obj.Value) }, and lets multiple failures be reported.
  • require functions perform the same checks but call t.FailNow on failure, immediately terminating the current test (skipping remaining code). They do not return a value.

Key difference with goroutines: require functions must be called from the goroutine running the test function (the main test goroutine), not from goroutines created during the test. Calling require from a non-test goroutine can cause race conditions because t.FailNow is unsafe outside the test goroutine. assert functions only log failures without terminating, so they are safer in goroutines. Use assert for non-fatal checks or when testing concurrent code; use require for fatal preconditions on the main test goroutine, and always wait on goroutines (e.g. with sync.WaitGroup) before final assertions.


🏁 Script executed:

# Verify the exact behavior difference and check if there are any other similar issues in the file
rg "go func" pkg/lifecycle-controller/tls_test.go -A 5

Repository: openshift/operator-framework-olm

Length of output: 297


Avoid calling require.* from non-test goroutines.

require.NotNil calls t.FailNow(), which per Go's testing documentation must only be called from the goroutine running the test function. Calling it from a spawned goroutine can lead to missed or incorrect failure reporting. Use assert.* inside goroutines, then retain require.* only on the test goroutine after wg.Wait().

Update line 103 to use assert.NotNil(t, cfg) and add "github.com/stretchr/testify/assert" to the imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lifecycle-controller/tls_test.go` around lines 91 - 122, In
TestTLSConfigProvider_ConcurrentAccess, avoid calling require.NotNil from
spawned goroutines; change the in-goroutine checks to assert.NotNil(t, cfg) and
keep require.NotNil only on the main test goroutine after wg.Wait(); also add
the import "github.com/stretchr/testify/assert" to the test file so assert is
available. Ensure the references are the require.NotNil call inside the
anonymous reader goroutine in TestTLSConfigProvider_ConcurrentAccess and not the
final require.NotNil after wg.Wait().
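
The assert-in-goroutines / require-after-Wait split can also be sketched with the stdlib alone: workers record failures into a collector, and the test goroutine fails once after wg.Wait(). Illustrative sketch; errCollector and the worker check are hypothetical names, not testify API:

```go
package main

import (
	"fmt"
	"sync"
)

// errCollector gathers failures from worker goroutines so the test
// goroutine can fail once after wg.Wait() — the stdlib analogue of
// "assert in goroutines, require after Wait".
type errCollector struct {
	mu   sync.Mutex
	errs []string
}

func (c *errCollector) Errorf(format string, args ...any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, fmt.Sprintf(format, args...))
}

func main() {
	var (
		c  errCollector
		wg sync.WaitGroup
	)
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if got := i * 2 % 2; got != 0 { // stand-in for the cfg == nil check
				c.Errorf("worker %d: unexpected value %d", i, got)
			}
		}(i)
	}
	wg.Wait()

	// Only here, on the main goroutine, is it safe to stop the test.
	fmt.Println("failures:", len(c.errs)) // failures: 0
}
```

As it happens, testify's assert functions accept any value implementing Errorf, which is why assert.* (unlike require.*) is safe to call from worker goroutines directly.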

Comment on lines +49 to +50
# Copy upstream CRDs directly instead of regenerating with controller-gen
cp "${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the upstream CRD source is wired into a CI update path and not stale.
fd -t f '.yaml$' staging/api/crds | head -20
echo "---"
# Look for any CI/Makefile target that refreshes staging/api/crds
rg -nP -C2 'staging/api/crds|crds/\*\.yaml' Makefile scripts/ 2>/dev/null | head -40

Repository: openshift/operator-framework-olm

Length of output: 929


🏁 Script executed:

# Check if staging/api is a git submodule
cat .gitmodules 2>/dev/null || echo "No .gitmodules file"

Repository: openshift/operator-framework-olm

Length of output: 96


🏁 Script executed:

# Search for CI workflows that might handle CRD validation or updates
fd -t f '\.(yml|yaml)$' .github/workflows 2>/dev/null | xargs grep -l 'staging/api\|crds\|controller-gen' 2>/dev/null

Repository: openshift/operator-framework-olm

Length of output: 60


🏁 Script executed:

# Look for any recent changes to the generate_crds_manifests.sh script showing removed yq mutations
git log --oneline -20 scripts/generate_crds_manifests.sh 2>/dev/null | head -10

Repository: openshift/operator-framework-olm

Length of output: 153


🏁 Script executed:

# Check the actual structure of staging/api to understand its role
ls -la staging/ 2>/dev/null | head -20

Repository: openshift/operator-framework-olm

Length of output: 383


🏁 Script executed:

# Search for any documentation or comments explaining staging/api
rg -i 'staging.*api|upstream.*crd' --type=md README* docs/ 2>/dev/null | head -20

Repository: openshift/operator-framework-olm

Length of output: 862


🏁 Script executed:

# Search the entire Makefile and scripts for all CRD-related targets
rg -i 'crd|controller-gen|staging' Makefile scripts/*.sh 2>/dev/null | grep -E '(target|^[^:]*:)' | head -30

Repository: openshift/operator-framework-olm

Length of output: 3115


🏁 Script executed:

# Check git history of CRD files to see when they were last touched
git log --oneline -- staging/api/crds/ 2>/dev/null | head -10

Repository: openshift/operator-framework-olm

Length of output: 153


🏁 Script executed:

# Look for controller-gen in the codebase
rg 'controller-gen' --type=sh --type=go --type=mod 2>/dev/null | head -20

Repository: openshift/operator-framework-olm

Length of output: 58


🏁 Script executed:

# Check if there's a vendor sync or update script
fd -t f -i 'vendor|sync|update' scripts/ | xargs ls -la 2>/dev/null

Repository: openshift/operator-framework-olm

Length of output: 156


🏁 Script executed:

# Search for documentation about how staging/api is maintained
rg -i 'staging.*api|vendor.*api' docs/ 2>/dev/null | head -20

Repository: openshift/operator-framework-olm

Length of output: 975


🏁 Script executed:

# Read the actual generate_crds_manifests.sh around line 50 more carefully
sed -n '45,55p' scripts/generate_crds_manifests.sh

Repository: openshift/operator-framework-olm

Length of output: 474


🏁 Script executed:

# Check what CRDs are being referenced
sed -n '30,70p' scripts/generate_crds_manifests.sh

Repository: openshift/operator-framework-olm

Length of output: 1596


🏁 Script executed:

# Look at verify_commits.sh to understand how it validates staging
head -50 scripts/verify_commits.sh

Repository: openshift/operator-framework-olm

Length of output: 1685


🏁 Script executed:

# Check check-staging-vendor.sh
cat scripts/check-staging-vendor.sh

Repository: openshift/operator-framework-olm

Length of output: 666


🏁 Script executed:

# Look for any git history showing the removal of controller-gen or yq mutations
git log -p --all -- scripts/generate_crds_manifests.sh 2>/dev/null | grep -A5 -B5 'controller-gen\|yq' | head -80

Repository: openshift/operator-framework-olm

Length of output: 58


Staged CRD copy relies on upstream commit traceability rather than automated freshness checks.

Direct copying from staging/api/crds does couple the manifests to the upstream schema. The codebase does enforce commit-level governance via verify_commits.sh, which validates that all staging changes reference an upstream commit from a single upstream repository. This provides traceability and prevents accidental drift, but it is not automatic synchronization.

However, confirm:

  • When staging/api/crds is updated, what triggers that update? Is it part of a regular upstream sync cadence, or manual cherry-pick?
  • The removed controller-gen regeneration and yq mutations (referenced in the comment) — document why these fixups are no longer needed, since the CRD schema is now taken verbatim from upstream.
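One way to address both bullets is a short header note in the script itself, next to the copy step. The wording below is only a sketch: the cp line is the one quoted in the review, and the sync-cadence phrasing is a placeholder the maintainers would need to confirm.

```shell
# CRDs are consumed verbatim from staging/api/crds, the downstream snapshot
# of the upstream API repository. The snapshot is refreshed when upstream
# commits are synced into staging/ (cadence TBD by maintainers), and
# scripts/verify_commits.sh enforces that every staging change references a
# single upstream commit, which provides traceability. The former
# controller-gen regeneration and yq mutations were dropped because no local
# schema fixups are needed when the upstream manifests are used as-is.
cp "${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/"
```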
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 49 - 50, The current copy
step in scripts/generate_crds_manifests.sh simply copies staged CRDs from
"${ROOT_DIR}/staging/api/crds" to "${crddir}/" but lacks documentation about
update triggers and why controller-gen/yq fixups were removed; update the script
header comments to state how staging/api/crds is updated (e.g., via upstream
sync cadence or manual cherry-pick) and reference the existing commit governance
check (verify_commits.sh) that enforces single-upstream commit traceability, and
add a short note explaining that controller-gen regeneration and yq mutations
were removed because CRD manifests are consumed verbatim from upstream and thus
no local schema fixes are required; mention the variables/commands to look at
(cp, ROOT_DIR, crddir, verify_commits.sh) so reviewers can locate the relevant
logic.

Comment on lines +589 to +608
command:
- /bin/lifecycle-controller
args:
- start
- --catalog-source-field-selector=metadata.namespace=openshift-marketplace,metadata.name=redhat-operators
- --tls-cert=/var/run/secrets/serving-cert/tls.crt
- --tls-key=/var/run/secrets/serving-cert/tls.key
image: quay.io/operator-framework/olm@sha256:de396b540b82219812061d0d753440d5655250c621c753ed1dc67d6154741607
imagePullPolicy: IfNotPresent
env:
- name: RELEASE_VERSION
  value: "0.0.1-snapshot"
- name: NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: LIFECYCLE_SERVER_IMAGE
  value: quay.io/operator-framework/olm@sha256:de396b540b82219812061d0d753440d5655250c621c753ed1dc67d6154741607
- name: GOMEMLIMIT
  value: "5MiB"

⚠️ Potential issue | 🟠 Major

Hardcoded single-catalog field selector and tight GOMEMLIMIT.

Two items on the lifecycle-controller deployment args/env:

  1. Line 593 pins the controller to reconcile exactly one CatalogSource: metadata.namespace=openshift-marketplace,metadata.name=redhat-operators. Any other catalog (certified-operators, community-operators, custom CatalogSources) is ignored. If this scoping is just a Tech Preview guardrail, add a comment; if it's meant to be configurable (per‑cluster operator flag, feature gate, etc.), wire it through cluster-olm-operator / a config field rather than hardcoding here.
  2. Line 608 sets GOMEMLIMIT=5MiB against a 10Mi request. Go runtime + controller-runtime caches for CatalogSources, Pods, Deployments, Services, SAs, NetworkPolicies, CRBs, APIServers cluster-wide typically exceed 5MiB resident, which will push GC into a thrash/soft-OOM loop. Please benchmark and raise this (or drop the env var).
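For reference, a less aggressive pairing keeps GOMEMLIMIT at roughly 80-90% of the memory request so the runtime has headroom for stacks and non-heap allocations. The numbers below are illustrative placeholders, not benchmarked values:

```yaml
env:
- name: GOMEMLIMIT
  value: "170MiB"   # ~85% of the request below; placeholder, to be benchmarked
resources:
  requests:
    cpu: 100m       # placeholder
    memory: 200Mi   # placeholder
```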
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 589 - 608, The
lifecycle-controller pod spec currently hardcodes a single CatalogSource
selector and an unsafe GOMEMLIMIT; update the deployment to make the catalog
selector configurable (do not hardcode
"metadata.namespace=openshift-marketplace,metadata.name=redhat-operators" in the
args passed to /bin/lifecycle-controller) by wiring a config value/flag in your
cluster-olm-operator reconciler (expose a field or feature gate that populates
the --catalog-source-field-selector arg or leave it unset) and add a clarifying
comment if this is intentionally a tech‑preview guardrail; also remove or raise
the environment GOMEMLIMIT value (currently "5MiB") in the pod env so it is at
least in line with the container memory request (or omit the variable to use Go
defaults) to avoid GC thrashing—update the lifecycle-controller args and the env
entry for GOMEMLIMIT (and any related config fields like RELEASE_VERSION /
LIFECYCLE_SERVER_IMAGE if you surface the selector via config) accordingly.

Comment on lines +657 to +675
cat << EOF > manifests/0000_50_olm_08-lifecycle-controller.service.yaml
apiVersion: v1
kind: Service
metadata:
  name: lifecycle-controller
  namespace: openshift-operator-lifecycle-manager
  annotations:
    release.openshift.io/feature-set: "TechPreviewNoUpgrade"
    service.beta.openshift.io/serving-cert-secret-name: lifecycle-controller-serving-cert
spec:
  ports:
  - name: metrics
    port: 8443
    protocol: TCP
    targetPort: metrics
  selector:
    app: olm-lifecycle-controller
  type: ClusterIP
EOF

⚠️ Potential issue | 🟡 Minor

The metrics endpoint is TLS-enabled, but no ServiceMonitor is generated.

The Service annotates service.beta.openshift.io/serving-cert-secret-name and the Deployment passes --tls-cert / --tls-key for port 8443, but unlike the PSM equivalent (lines 287-308), this script does not emit a ServiceMonitor for lifecycle-controller. Nothing will scrape these metrics unless Prometheus is configured out-of-band. If monitoring is in scope for this feature, add a matching servicemonitor.yaml with scheme: https and the serving-cert CA bundle; if it is intentionally deferred, add a TODO.
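For illustration, a ServiceMonitor along the lines requested might look like the sketch below. The port name, app label, and annotation come from the manifests in this PR; the CA-bundle path and the assumption that the Service carries the app label are placeholders that should be aligned with the existing PSM ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lifecycle-controller
  namespace: openshift-operator-lifecycle-manager
  annotations:
    release.openshift.io/feature-set: "TechPreviewNoUpgrade"
spec:
  endpoints:
  - port: metrics            # matches the Service port name above
    scheme: https
    tlsConfig:
      # Assumed path: Prometheus in openshift-monitoring mounts the
      # service-ca bundle this way; copy whatever the PSM ServiceMonitor uses.
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: lifecycle-controller.openshift-operator-lifecycle-manager.svc
  selector:
    matchLabels:
      app: olm-lifecycle-controller   # assumes the Service carries this label
```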

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 657 - 675, The Service for
lifecycle-controller exposes TLS metrics on port 8443 but no ServiceMonitor is
emitted; add a ServiceMonitor manifest for the lifecycle-controller that targets
selector app: olm-lifecycle-controller in namespace
openshift-operator-lifecycle-manager, with spec.endpoints containing port:
metrics, scheme: https, and tlsConfig that references the serving cert secret
(secretName: lifecycle-controller-serving-cert) so Prometheus uses the CA bundle
(and set insecureSkipVerify: false); ensure labels/namespace match the Service
and follow the pattern used for the PSM ServiceMonitor (use the same endpoint
name, selector, and namespace); alternatively, if monitoring is intentionally
out-of-scope, add a TODO comment in the script near the lifecycle-controller
Service block documenting the omission.

Comment thread scripts/generate_crds_manifests.sh Outdated
Co-authored-by: Per Goncalves da Silva <perdasilva@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
scripts/generate_crds_manifests.sh (3)

593-608: ⚠️ Potential issue | 🟠 Major

The controller is still pinned to one namespace and still has an extremely tight memory cap.

--catalog-source-field-selector=metadata.namespace=openshift-marketplace keeps this deployment from reconciling catalogs outside that namespace, and GOMEMLIMIT=5MiB is still well below the container’s 10Mi memory request. If the namespace scoping is intentional, please make that explicit in config or comments; otherwise leave the selector unset. I’d also drop the custom GOMEMLIMIT or raise it to something realistic for controller-runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 593 - 608, Remove or make
the hard-coded namespace selector and the tiny memory cap configurable: remove
the argument
--catalog-source-field-selector=metadata.namespace=openshift-marketplace from
the container args (or expose it as a configuration/ENV so namespace scoping is
explicit), and either remove the GOMEMLIMIT environment entry or raise it to a
realistic value (e.g., >= the container memory request such as 100Mi) so it does
not undercut the pod’s requested memory; update the NAMESPACE / comments to
document intentional namespace scoping if you choose to keep the selector.

657-675: ⚠️ Potential issue | 🟡 Minor

TLS metrics are exposed, but there is still no matching ServiceMonitor.

This Service serves HTTPS metrics with a serving cert, but unlike the package-server-manager block at Lines 287-308, the script still does not emit a ServiceMonitor for lifecycle-controller. If monitoring is intentionally deferred, a TODO would help; otherwise I’d add the monitor here to keep observability consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 657 - 675, The
lifecycle-controller Service exposes HTTPS metrics but no ServiceMonitor is
generated; either add a ServiceMonitor manifest for the lifecycle-controller or
add a clear TODO comment. Locate the lifecycle-controller Service block
(resource name: lifecycle-controller, selector app: olm-lifecycle-controller) in
scripts/generate_crds_manifests.sh and emit a ServiceMonitor resource (match
selector app: olm-lifecycle-controller, endpoints -> port: metrics, scheme:
https, appropriate namespaceSelector and TLS/bearer token settings) analogous to
the package-server-manager ServiceMonitor block to ensure Prometheus can scrape
it, or add a TODO explaining why monitoring is intentionally omitted.

49-50: ⚠️ Potential issue | 🟡 Minor

Please document the new CRD source-of-truth flow.

This still switches downstream CRDs to verbatim copies from staging/api/crds, but the script does not explain how that staging snapshot is refreshed or why the old local controller-gen/yq fixups are no longer needed. A short note pointing maintainers to scripts/verify_commits.sh would make this much easier to reason about next time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 49 - 50, Add a short
explanatory comment or README entry describing the new CRD source-of-truth flow:
state that scripts/generate_crds_manifests.sh now copies verbatim CRDs from
staging/api/crds (the authoritative snapshot), explain how that staging snapshot
is refreshed (refer to the commit/automation verified by
scripts/verify_commits.sh), and note why the previous controller-gen/yq
post-processing is no longer required; include these notes adjacent to the cp
"${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/" line so maintainers can
quickly find the rationale and the refresh verification script.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/generate_crds_manifests.sh`:
- Around line 547-556: The TechPreview manifests (e.g.,
manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml) are being copied
unconditionally into microshift-manifests; update the copy/packaging logic to
skip files annotated with release.openshift.io/feature-set:
"TechPreviewNoUpgrade" or explicitly exclude filename patterns 0000_50_olm_08-*
and 0000_50_olm_09-* when populating microshift-manifests so these lifecycle
components remain preview-only; modify the script section that performs the
manifest copy (the microshift-manifests population loop) to filter out those
patterns or check the annotation before copying.

---

Duplicate comments:
In `@scripts/generate_crds_manifests.sh`:
- Around line 593-608: Remove or make the hard-coded namespace selector and the
tiny memory cap configurable: remove the argument
--catalog-source-field-selector=metadata.namespace=openshift-marketplace from
the container args (or expose it as a configuration/ENV so namespace scoping is
explicit), and either remove the GOMEMLIMIT environment entry or raise it to a
realistic value (e.g., >= the container memory request such as 100Mi) so it does
not undercut the pod’s requested memory; update the NAMESPACE / comments to
document intentional namespace scoping if you choose to keep the selector.
- Around line 657-675: The lifecycle-controller Service exposes HTTPS metrics
but no ServiceMonitor is generated; either add a ServiceMonitor manifest for the
lifecycle-controller or add a clear TODO comment. Locate the
lifecycle-controller Service block (resource name: lifecycle-controller,
selector app: olm-lifecycle-controller) in scripts/generate_crds_manifests.sh
and emit a ServiceMonitor resource (match selector app:
olm-lifecycle-controller, endpoints -> port: metrics, scheme: https, appropriate
namespaceSelector and TLS/bearer token settings) analogous to the
package-server-manager ServiceMonitor block to ensure Prometheus can scrape it,
or add a TODO explaining why monitoring is intentionally omitted.
- Around line 49-50: Add a short explanatory comment or README entry describing
the new CRD source-of-truth flow: state that scripts/generate_crds_manifests.sh
now copies verbatim CRDs from staging/api/crds (the authoritative snapshot),
explain how that staging snapshot is refreshed (refer to the commit/automation
verified by scripts/verify_commits.sh), and note why the previous
controller-gen/yq post-processing is no longer required; include these notes
adjacent to the cp "${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/" line so
maintainers can quickly find the rationale and the refresh verification script.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26720ce2-1274-47f8-9dc4-dbc658a0d3f2

📥 Commits

Reviewing files that changed from the base of the PR and between abeb3db and 5037aa5.

📒 Files selected for processing (1)
  • scripts/generate_crds_manifests.sh

Comment on lines +547 to +556
cat << EOF > manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lifecycle-controller
  namespace: openshift-operator-lifecycle-manager
  labels:
    app: olm-lifecycle-controller
  annotations:
    release.openshift.io/feature-set: "TechPreviewNoUpgrade"

⚠️ Potential issue | 🟠 Major

TechPreview manifests become unconditional in MicroShift.

These new resources are tagged TechPreviewNoUpgrade, but later this script copies every manifest into microshift-manifests and the ignore list does not exclude the new 0000_50_olm_08-* / 0000_50_olm_09-* files. Because MicroShift does not have CVO feature-set gating here, this ships the lifecycle components by default there instead of keeping them preview-only.

Possible follow-up
 files_to_ignore=("ibm-cloud-managed.yaml" "kustomization.yaml" "psm-operator" "removed" "collect-profiles" "pprof" "prometheus" "service-monitor" "operatorstatus")
+files_to_ignore+=(
+  "0000_50_olm_08-lifecycle-controller"
+  "0000_50_olm_09-lifecycle-server"
+)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/generate_crds_manifests.sh` around lines 547 - 556, The TechPreview
manifests (e.g., manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml)
are being copied unconditionally into microshift-manifests; update the
copy/packaging logic to skip files annotated with
release.openshift.io/feature-set: "TechPreviewNoUpgrade" or explicitly exclude
filename patterns 0000_50_olm_08-* and 0000_50_olm_09-* when populating
microshift-manifests so these lifecycle components remain preview-only; modify
the script section that performs the manifest copy (the microshift-manifests
population loop) to filter out those patterns or check the annotation before
copying.

Signed-off-by: Francesco Giudici <fgiudici@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml (1)

62-67: ⚠️ Potential issue | 🔴 Critical

Increase heap cap and memory request; current values are likely to destabilize the pod.

Line 63 sets GOMEMLIMIT to 5MiB, and Line 67 requests only 10Mi memory. For a Go controller-runtime process, this is likely to cause startup failures or heavy GC thrashing.

🐛 Proposed fix
             - name: GOMEMLIMIT
-              value: "5MiB"
+              value: "200MiB"
           resources:
             requests:
-              cpu: 10m
-              memory: 10Mi
+              cpu: 100m
+              memory: 200Mi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`
around lines 62 - 67, The GOMEMLIMIT env var is set far too low and the pod
memory request is insufficient; update the GOMEMLIMIT value (env name
GOMEMLIMIT) to a realistic heap cap (e.g., increase from 5MiB to a larger value
appropriate for the controller, e.g., hundreds of MiB) and raise
resources.requests.memory (and adjust cpu if needed) in the deployment spec so
the Go controller-runtime process can start and run without extreme GC/ooms;
locate the env block containing GOMEMLIMIT and the resources.requests section
and raise those values consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml`:
- Around line 92-99: The pod is pinned to nodeSelector
node-role.kubernetes.io/control-plane but only tolerates the master taint key
node-role.kubernetes.io/master, so it can remain Pending on clusters that taint
control-plane nodes with node-role.kubernetes.io/control-plane: add a new
toleration entry under tolerations that matches key
node-role.kubernetes.io/control-plane with operator: Exists and effect:
NoSchedule (and optionally effect: NoExecute if you want eviction tolerance) so
the Deployment’s tolerations cover both master and control-plane taints.

---

Duplicate comments:
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 62-67: The GOMEMLIMIT env var is set far too low and the pod
memory request is insufficient; update the GOMEMLIMIT value (env name
GOMEMLIMIT) to a realistic heap cap (e.g., increase from 5MiB to a larger value
appropriate for the controller, e.g., hundreds of MiB) and raise
resources.requests.memory (and adjust cpu if needed) in the deployment spec so
the Go controller-runtime process can start and run without extreme GC/ooms;
locate the env block containing GOMEMLIMIT and the resources.requests section
and raise those values consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9e60de6e-01c8-400d-8d42-d0bd42ccb5c0

📥 Commits

Reviewing files that changed from the base of the PR and between 5037aa5 and 214abe5.

📒 Files selected for processing (20)
  • manifests/0000_50_olm_00-catalogsources.crd.yaml
  • manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
  • manifests/0000_50_olm_00-installplans.crd.yaml
  • manifests/0000_50_olm_00-olmconfigs.crd.yaml
  • manifests/0000_50_olm_00-operatorconditions.crd.yaml
  • manifests/0000_50_olm_00-operatorgroups.crd.yaml
  • manifests/0000_50_olm_00-operators.crd.yaml
  • manifests/0000_50_olm_00-subscriptions.crd.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
  • microshift-manifests/0000_50_olm_00-catalogsources.crd.yaml
  • microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
  • microshift-manifests/0000_50_olm_00-installplans.crd.yaml
  • microshift-manifests/0000_50_olm_00-olmconfigs.crd.yaml
  • microshift-manifests/0000_50_olm_00-operatorconditions.crd.yaml
  • microshift-manifests/0000_50_olm_00-operatorgroups.crd.yaml
  • microshift-manifests/0000_50_olm_00-operators.crd.yaml
  • microshift-manifests/0000_50_olm_00-subscriptions.crd.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
  • microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
✅ Files skipped from review due to trivial changes (16)
  • microshift-manifests/0000_50_olm_00-olmconfigs.crd.yaml
  • manifests/0000_50_olm_00-olmconfigs.crd.yaml
  • manifests/0000_50_olm_00-operatorgroups.crd.yaml
  • manifests/0000_50_olm_00-operators.crd.yaml
  • microshift-manifests/0000_50_olm_00-installplans.crd.yaml
  • microshift-manifests/0000_50_olm_00-operatorconditions.crd.yaml
  • microshift-manifests/0000_50_olm_00-operatorgroups.crd.yaml
  • manifests/0000_50_olm_00-subscriptions.crd.yaml
  • microshift-manifests/0000_50_olm_00-operators.crd.yaml
  • manifests/0000_50_olm_00-installplans.crd.yaml
  • microshift-manifests/0000_50_olm_00-catalogsources.crd.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
  • manifests/0000_50_olm_00-operatorconditions.crd.yaml
  • manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
  • microshift-manifests/0000_50_olm_00-subscriptions.crd.yaml
  • manifests/0000_50_olm_00-catalogsources.crd.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
  • microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml

Comment on lines +92 to +99
nodeSelector:
  kubernetes.io/os: linux
  node-role.kubernetes.io/control-plane: ""
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
- effect: NoExecute

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify how control-plane scheduling is handled across manifests.
rg -n --glob '*.yaml' 'node-role\.kubernetes\.io/control-plane' microshift-manifests -C2
rg -n --glob '*.yaml' 'key:\s*node-role\.kubernetes\.io/control-plane' microshift-manifests -C2
rg -n 'nodeSelector:|tolerations:|node-role\.kubernetes\.io/(master|control-plane)' \
  microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml -C3

Repository: openshift/operator-framework-olm

Length of output: 1229


🏁 Script executed:

cat -n microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml | sed -n '92,105p'

Repository: openshift/operator-framework-olm

Length of output: 630


🏁 Script executed:

cat -n microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml | sed -n '91,110p'

Repository: openshift/operator-framework-olm

Length of output: 818


Add control-plane taint toleration to avoid Pending pods on control-plane-tainted clusters.

The nodeSelector pins scheduling to node-role.kubernetes.io/control-plane, but the tolerations only include node-role.kubernetes.io/master. On clusters using the standard node-role.kubernetes.io/control-plane:NoSchedule taint on control-plane nodes, this pod will remain unschedulable.

Suggested patch
       tolerations:
+        - effect: NoSchedule
+          key: node-role.kubernetes.io/control-plane
+          operator: Exists
         - effect: NoSchedule
           key: node-role.kubernetes.io/master
           operator: Exists
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml`
around lines 92 - 99, The pod is pinned to nodeSelector
node-role.kubernetes.io/control-plane but only tolerates the master taint key
node-role.kubernetes.io/master, so it can remain Pending on clusters that taint
control-plane nodes with node-role.kubernetes.io/control-plane: add a new
toleration entry under tolerations that matches key
node-role.kubernetes.io/control-plane with operator: Exists and effect:
NoSchedule (and optionally effect: NoExecute if you want eviction tolerance) so
the Deployment’s tolerations cover both master and control-plane taints.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2026
@openshift-ci

openshift-ci Bot commented Apr 30, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

4 participants