[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.

@nhoriguchi: The following test failed. Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
dgoodwin left a comment
Sorry for the delay here. I'm proposing a pretty dramatic alteration of the plan, but one that will make this much simpler, quicker, and easier to get off the ground. In my mind you can start on implementation of the monitortest described below in origin as soon as you like. We can help you understand how to gather lots of data while the PR is open, without having to merge it. I'd be quite interested to see what the testing turns up.
> Currently, HA implementation is often left to developers’ discretion, leading to inconsistent or insufficient HA configurations. Although general guidelines exist ([CONVENTIONS.md](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability)).
Is it fair to say this is the set of conventions you want to enforce with this framework? Are there additional items you would like added? If so I would suggest a PR to that linked enhancement. It helps to have agreed conventions before we start enforcing.
Sorry for my late response.
Originally we intended to start with a small set of conventions (like anti-affinity or podDisruptionBudget), and we have an additional item not currently included in CONVENTIONS.md: healthcheck probes.
> ### Non-Goals
>
> * Strict enforcement of guidelines that block product releases is out of scope.
This will be easy to do when we're ready with the approaches I will spell out below.
> * As an OpenShift Product Manager, I want a clear overview of HA implementation status across components, so I can identify issues from overall HA quality earlier.
Can you define the list of statuses you envision? Is it just compliant and non-compliant? What other levels/statuses do you envision?
According to the comment below, I think the following 4 statuses can be defined:
- compliant
- non-compliant (whitelisted)
- non-compliant (pending)
- non-compliant (violated)

Each status corresponds to an item in your comment around Line 161.
> * This proposal targets only all core and infrastructure-related components, and the other components are out of scope.
>
> ## Proposal
I would propose a radical simplification of the proposal. I believe we can meet your goals with well-established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.
We've done this sort of thing many times; the process is as follows:
Establish the Tests
Typically these are implemented as monitortests; they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.
Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)
I suggest having the tests only flake when they find a problem for now, so we do not merge the PR and cause mass failures. Once all the problems are identified with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.
In this case I envision:
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should define health checks
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should have sufficient replicas for HA
etc.
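To make the junit mechanics concrete, here is a rough, self-contained Go sketch of the "flake only" mode described above. This is not origin's actual monitortest API; the type and function names are made up. It relies on the convention that a test which both fails and passes in the same suite is counted as a flake rather than a hard failure:

```go
package main

import "fmt"

// JUnitCase is a minimal stand-in for a junit test case result.
// Names and types here are illustrative, not origin's real types.
type JUnitCase struct {
	Name   string
	Failed bool
}

// flakeOnViolation emits results for one HA check in one namespace.
// While the check is in "flake only" mode, a violation produces a
// failure followed by a success with the same name, so CI records a
// flake instead of failing the job.
func flakeOnViolation(ns, check string, violated bool) []JUnitCase {
	name := fmt.Sprintf("[Monitor:ha-compliance] pods in ns/%s %s", ns, check)
	if !violated {
		return []JUnitCase{{Name: name}}
	}
	return []JUnitCase{
		{Name: name, Failed: true}, // records the violation details
		{Name: name},               // paired success downgrades it to a flake
	}
}

func main() {
	for _, r := range flakeOnViolation("openshift-console", "should define health checks", true) {
		fmt.Printf("%s failed=%v\n", r.Name, r.Failed)
	}
}
```

Flipping the test to "allowed to fail" would then just mean dropping the paired success for unapproved violations.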
File Bugs for Violations
Sippy provides the dashboard of current state. Example for the monitortest linked above.
As problems are identified, someone will need to file bugs and add exceptions within the test. Typically we'll tag the jiras with a specific label to help keep track. For any approved exception the test will usually permanently flake.
In the event a jira is closed as not applicable or can't be fixed by engineering or PM, the entry should likely transition from an exception to a permanently approved whitelist entry, with a comment explaining why or a link to the jira that explains.
Once the test is stable in the wild, new violations will immediately start failing jobs, and we have ample provisions for that to make its way to dev teams. This prevents new components from coming in without the capability unless someone explicitly approves it, as well as regressions in existing components.
It can take time and effort for someone to find all the exceptions to be added and allow the test to start failing on regressions/problems, but in the interim the tests are live, gathering data, and not causing mass failures/panic.
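A minimal sketch of how exceptions linked to jiras could be encoded inside the test. The table shape and the jira key are hypothetical placeholders, not real entries:

```go
package main

import "fmt"

// exceptionKey identifies one HA check in one namespace.
type exceptionKey struct {
	Namespace string
	Check     string
}

// exceptions maps known violations to the jira tracking them.
// The jira key below is a placeholder, not a real bug.
var exceptions = map[exceptionKey]string{
	{Namespace: "openshift-console", Check: "should define health checks"}: "OCPBUGS-00000",
}

// lookupException reports whether a violation is already tracked,
// and if so which jira approves the exception.
func lookupException(ns, check string) (string, bool) {
	jira, ok := exceptions[exceptionKey{Namespace: ns, Check: check}]
	return jira, ok
}

func main() {
	if jira, ok := lookupException("openshift-console", "should define health checks"); ok {
		fmt.Printf("known violation: flake and link %s\n", jira)
	} else {
		fmt.Println("new violation: fail the test")
	}
}
```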
> I would propose a radical simplification of the proposal. I believe we can meet your goals with well-established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.
>
> We've done this sort of thing many times; the process is as follows:
The stated approach seems to cover our proposal, so I'd like to implement the HA policy management on top of that process.
A few questions below ...
> Typically these are implemented as monitortests; they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.
>
> Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)
It's hard for me to understand from the above how the behavior you mentioned actually works. Could you share (privately) some guideline documents for testers about monitortests? I'd like to know how to write a minimum test case and how to run it, with examples.
> I suggest having the tests only flake when they find a problem for now, so we do not merge the PR and cause mass failures. Once all the problems are identified with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.
Yes, controlling failures in the preparation phase is important and one of our concerns. Is this kind of "switch" for whether to fail or flake controlled in the testing framework (not in individual test cases)?
> * Create test cases to collect HA policy information from running OpenShift clusters.
> * Define HA configs to define the type of HA feature to be handled (redundancy and health check in the first proposal).
> * Define the data structure of input and output of "HA level check" process.
Covered by junit test results.
> an HA level check for an HA config.
> * Define the criteria that must be met to pass the HA level check for each component and for each HA config.
> * Define the workflow of how to collect the responses from notified component owners.
Jira collects the responses from component owners.
> #### HA level check
>
> HA level check uses these types of input information to judge whether each component properly covers HA configs or not, then the result is output
This storage specifically is a concern, we need this to fit existing processes, and introducing new storage mechanisms and formats is probably beyond what we can undertake and fit into our existing org workflows. The good news however is that with the above we can get you up and running and working towards these goals much more quickly.
If using JIRA as storage for the results is acceptable, we need no additional storage.
> one of the three values: pass, fail, and skip. Each config has its own HA implementation status info and component specific info.
>
> This flowchart is essential for HA policy management, so detailed explanations
This would be replaced by the states in the test:
- pass
- flake (because the component is permanently whitelisted)
- flake (because a pending jira is awaiting a response)
- fail (no exception/whitelist entry exists, and the violation appears unapproved)
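The four states above can be sketched as a small decision function. This is an illustration of the logic, not an existing origin helper; the `Exception` type is a made-up stand-in for however the test encodes its whitelist:

```go
package main

import "fmt"

// Outcome mirrors the junit-level states listed above.
type Outcome string

const (
	Pass         Outcome = "pass"
	FlakeListed  Outcome = "flake (whitelisted)"
	FlakePending Outcome = "flake (pending jira)"
	Fail         Outcome = "fail"
)

// Exception is a hypothetical exception entry for one component.
type Exception struct {
	Jira      string // tracking jira, e.g. a placeholder "OCPBUGS-00000"
	Permanent bool   // moved to the permanently approved whitelist
}

// classify applies the four-state logic: compliant components pass,
// violations with exceptions flake, and unapproved violations fail.
func classify(violated bool, exc *Exception) Outcome {
	switch {
	case !violated:
		return Pass
	case exc == nil:
		return Fail
	case exc.Permanent:
		return FlakeListed
	default:
		return FlakePending
	}
}

func main() {
	fmt.Println(classify(false, nil))
	fmt.Println(classify(true, &Exception{Jira: "OCPBUGS-00000"}))
	fmt.Println(classify(true, nil))
}
```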
I agree that the above test results describe the actual statuses well. So much of the workflow section, including the flowchart, needs to be updated for an implementation based on the existing framework.
> #### How component owners respond?
>
> A component owner whose component failed the HA Level Check will receive a notification containing the following data (details are omitted for brevity):
Hoping to avoid any new notification mechanisms, the above outlines how we notify component owners when they have a problem that needs addressing.
> Risk: Development teams bear the burden of responding to notifications in a timely manner to prioritize and plan the development of HA features.
> Mitigation: The management process will only issue warnings without blocking the actual release process.
Agreed, we can accommodate this while the test is in flake-only mode and cannot fail. In future, once all exceptions look covered, we can make the test official and let it fail for anything new.
While an exception is granted with an open jira, the test will flake and not fail.
Periodic monitoring or automation is required to check the list of exception jiras to see if they were closed, and to take appropriate action (either reopen in disagreement, or move the exception to the permanent whitelist). At this point I would recommend the claude command helper in the origin repo to help maintain this aspect.
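A sketch of what that periodic exception audit could decide when it re-reads the state of a linked jira. The status strings and actions are illustrative only, not an existing tool:

```go
package main

import "fmt"

// nextAction suggests what a periodic exception-audit job could do
// for one exception, given the current state of its linked jira.
func nextAction(jiraStatus string, closedAsWontFix bool) string {
	if jiraStatus != "Closed" {
		// Open/pending jiras keep the exception; the test keeps flaking.
		return "keep exception"
	}
	if closedAsWontFix {
		// Not applicable / won't fix: record permanently with a reason,
		// or reopen the jira in disagreement.
		return "move to permanent whitelist"
	}
	// Fixed and closed: drop the exception so regressions fail again.
	return "drop exception"
}

func main() {
	fmt.Println(nextAction("Open", false))
	fmt.Println(nextAction("Closed", true))
	fmt.Println(nextAction("Closed", false))
}
```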
I agree with using JIRA states to represent policy ("Closed" as a permanent whitelist, and "Open" as a temporary exception or a disagreement).
Following changes over time is also important. The implementation status of HA at the workload level (Pod, DaemonSet, Deployment, StatefulSet) may change while we are still communicating via component-specific JIRA notifications. We need to ensure accurate tracking within JIRA, which will likely be a complex task; a Claude-based AI agent could be helpful.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so. /lifecycle stale
/remove-lifecycle stale
nhoriguchi left a comment
@dgoodwin Sorry for my late response, I have been tied up with my relocation tasks.
Overall, your suggestion seems to satisfy most of our requirements, so I would like to proceed with implementation using that method. I plan to rewrite the documentation in this PR, focusing primarily on the implementation section.
And I would like to move on to discussing the "how". Specifically, I need a "getting started" guide for monitortests to clarify exactly what we will contribute to.
> detect degradations in the HA implementation status.
> * Define how to store the result of HA level check of each OpenShift version to track the record of previous check results.
> * Introduce a mechanism to notify the degradations to component owners whose
This looks interesting; it's easy to track per-component results. One concern is that the HA check is done at a more granular level, such as per-pod, per-deployment, and per-daemonset. Is there a way to get raw test results, like a list of check results at that granular level? At least the JIRAs working as reports for each component need to contain that kind of raw test result, to identify which parts need improvement.
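One way this could work without a new result format is to encode the workload into the junit test name itself, so per-deployment and per-daemonset results are recoverable from the raw junit output. A sketch; the naming scheme is hypothetical:

```go
package main

import "fmt"

// workloadTestName encodes workload-level granularity into the junit
// test name, so a per-component jira can quote the exact workloads
// that need improvement. The scheme is illustrative only.
func workloadTestName(ns, kind, name, check string) string {
	return fmt.Sprintf("[Monitor:ha-compliance] %s/%s in ns/%s %s", kind, name, ns, check)
}

func main() {
	fmt.Println(workloadTestName("openshift-console", "deployment", "console", "should set a podDisruptionBudget"))
	// → [Monitor:ha-compliance] deployment/console in ns/openshift-console should set a podDisruptionBudget
}
```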
This enhancement adds guideline compliance checks to the CI process (the Red Hat-internal pipeline for OpenShift) to improve overall HA. Specifically, it integrates a mechanism to evaluate HA levels based on implementation status and developers' input. By notifying developers of non-compliant components, the management process encourages them to follow the guidelines. All data will be stored in a common repository, allowing both developers and partners to grasp the overall HA status early and easily.