[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.

@nhoriguchi: The following test failed. Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
dgoodwin left a comment
Sorry for the delay here. I'm proposing a pretty dramatic alteration of the plan, but one that will make this much simpler, quicker, and easier to get off the ground. In my mind you can start on implementation of the monitortest described below in origin as soon as you like. We can help you understand how to gather lots of data while the PR is open, without having to merge it. I'd be quite interested to see what the testing turns up.
> Currently, HA implementation is often left to developers’ discretion, leading to inconsistent or insufficient HA configurations. Although general guidelines exist ([CONVENTIONS.md](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability)).
Is it fair to say this is the set of conventions you want to enforce with this framework? Are there additional items you would like added? If so I would suggest a PR to that linked enhancement. It helps to have agreed conventions before we start enforcing.
Sorry for my late response.
Originally we intended to start with a small set of conventions (like anti-affinity or podDisruptionBudget), and we have an additional item not currently included in CONVENTIONS.md: healthcheck probes.
> ### Non-Goals
>
> * Strict enforcement of guidelines that block product releases is out of scope.
This will be easy to do when we're ready with the approaches I will spell out below.
> * As an OpenShift Product Manager, I want a clear overview of HA implementation status across components, so I can identify issues from overall HA quality earlier.
Can you define the list of statuses you envision? Is it just compliant and non-compliant? What other levels/statuses do you envision?
According to the comment below, I think the following 4 statuses can be defined:
- compliant
- non-compliant (whitelisted)
- non-compliant (pending)
- non-compliant (violated)

Each status corresponds to an item in your comment around Line 161.
> * This proposal targets only all core and infrastructure-related components, and the other components are out of scope.
>
> ## Proposal
I would propose a radical simplification of the proposal. I believe we can meet your goals with well-established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.
We've done this sort of thing many times; the process is as follows:
Establish the Tests
Typically these are implemented as monitortests; they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.
Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)
I suggest having the tests only flake when they find a problem for now, so we do not merge the PR and cause mass failures. Once all the problems are identified with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.
In this case I envision:
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should define health checks
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should have sufficient replicas for HA
etc.
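To make the junit mechanics concrete, here is a rough, self-contained Go sketch of the "flake only" mode described above. This is not origin's actual monitortest API; the type and function names are made up. It relies on the convention that a test which both fails and passes in the same suite is counted as a flake rather than a hard failure:

```go
package main

import "fmt"

// JUnitCase is a minimal stand-in for a junit test case result.
// Names and types here are illustrative, not origin's real types.
type JUnitCase struct {
	Name   string
	Failed bool
}

// flakeOnViolation emits results for one HA check in one namespace.
// While the check is in "flake only" mode, a violation produces a
// failure followed by a success with the same name, so CI records a
// flake instead of failing the job.
func flakeOnViolation(ns, check string, violated bool) []JUnitCase {
	name := fmt.Sprintf("[Monitor:ha-compliance] pods in ns/%s %s", ns, check)
	if !violated {
		return []JUnitCase{{Name: name}}
	}
	return []JUnitCase{
		{Name: name, Failed: true}, // records the violation details
		{Name: name},               // paired success downgrades it to a flake
	}
}

func main() {
	for _, r := range flakeOnViolation("openshift-console", "should define health checks", true) {
		fmt.Printf("%s failed=%v\n", r.Name, r.Failed)
	}
}
```

Flipping the test to "allowed to fail" would then just mean dropping the paired success for unapproved violations.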
File Bugs for Violations
Sippy provides the dashboard of current state. Example for the monitortest linked above.
As problems are identified, someone will need to file bugs and add exceptions within the test. Typically we'll tag the jiras with a specific label to help keep track. For any approved exception the test will usually permanently flake.
In the event a jira is closed as not applicable or can't be fixed by engineering or PM, the entry should likely transition from an exception to a permanently approved whitelist entry, with a comment explaining why or a link to the jira that explains.
Once the test is stable in the wild, new violations will immediately start failing jobs, and we have ample provisions for that to make its way to dev teams. This prevents new components from coming in without the capability unless someone explicitly approves it, as well as regressions in existing components.
It can take time and effort for someone to find all the exceptions to be added and allow the test to start failing on regressions/problems, but in the interim the tests are live, gathering data, and not causing mass failures/panic.
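A minimal sketch of how exceptions linked to jiras could be encoded inside the test. The table shape and the jira key are hypothetical placeholders, not real entries:

```go
package main

import "fmt"

// exceptionKey identifies one HA check in one namespace.
type exceptionKey struct {
	Namespace string
	Check     string
}

// exceptions maps known violations to the jira tracking them.
// The jira key below is a placeholder, not a real bug.
var exceptions = map[exceptionKey]string{
	{Namespace: "openshift-console", Check: "should define health checks"}: "OCPBUGS-00000",
}

// lookupException reports whether a violation is already tracked,
// and if so which jira approves the exception.
func lookupException(ns, check string) (string, bool) {
	jira, ok := exceptions[exceptionKey{Namespace: ns, Check: check}]
	return jira, ok
}

func main() {
	if jira, ok := lookupException("openshift-console", "should define health checks"); ok {
		fmt.Printf("known violation: flake and link %s\n", jira)
	} else {
		fmt.Println("new violation: fail the test")
	}
}
```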
> I would propose a radical simplification of the proposal. I believe we can meet your goals with well-established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.
>
> We've done this sort of thing many times; the process is as follows:
The stated approach seems to cover our proposal, so I'd like to implement the HA policy management on top of that process.
A few questions below ...
> Typically these are implemented as monitortests; they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.
>
> Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)
It's hard for me to understand from the above how the behavior you mentioned actually works. Could you share (privately) some guideline documents for testers about monitortests? I'd like to know how to write a minimum test case and how to run it, with examples.
> I suggest having the tests only flake when they find a problem for now, so we do not merge the PR and cause mass failures. Once all the problems are identified with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.
Yes, controlling failures in the preparation phase is important and one of our concerns. Is this kind of "switch" for whether to fail or flake controlled in the testing framework (not in individual test cases)?
> * Create test cases to collect HA policy information from running OpenShift clusters.
> * Define HA configs to define the type of HA feature to be handled (redundancy and health check in the first proposal).
> * Define the data structure of input and output of "HA level check" process.
Covered by junit test results.
> an HA level check for an HA config.
> * Define the criteria that must be met to pass the HA level check for each component and for each HA config.
> * Define the workflow of how to collect the responses from notified component owners.
Jira collects the responses from component owners.
> #### HA level check
>
> HA level check uses these types of input information to judge whether each component properly covers HA configs or not, then the result is output
This storage specifically is a concern, we need this to fit existing processes, and introducing new storage mechanisms and formats is probably beyond what we can undertake and fit into our existing org workflows. The good news however is that with the above we can get you up and running and working towards these goals much more quickly.
If using JIRA as storage for the results is acceptable, we need no additional storage.
> one of the three values: pass, fail, and skip. Each config has its own HA implementation status info and component specific info.
>
> This flowchart is essential for HA policy management, so detailed explanations
This would be replaced by the states in the test:
- pass
- flake (because the component is permanently whitelisted)
- flake (because a pending jira is awaiting a response)
- fail (no exception/whitelist entry exists, and the violation appears unapproved)
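The four states above can be sketched as a small decision function. This is an illustration of the logic, not an existing origin helper; the `Exception` type is a made-up stand-in for however the test encodes its whitelist:

```go
package main

import "fmt"

// Outcome mirrors the junit-level states listed above.
type Outcome string

const (
	Pass         Outcome = "pass"
	FlakeListed  Outcome = "flake (whitelisted)"
	FlakePending Outcome = "flake (pending jira)"
	Fail         Outcome = "fail"
)

// Exception is a hypothetical exception entry for one component.
type Exception struct {
	Jira      string // tracking jira, e.g. a placeholder "OCPBUGS-00000"
	Permanent bool   // moved to the permanently approved whitelist
}

// classify applies the four-state logic: compliant components pass,
// violations with exceptions flake, and unapproved violations fail.
func classify(violated bool, exc *Exception) Outcome {
	switch {
	case !violated:
		return Pass
	case exc == nil:
		return Fail
	case exc.Permanent:
		return FlakeListed
	default:
		return FlakePending
	}
}

func main() {
	fmt.Println(classify(false, nil))
	fmt.Println(classify(true, &Exception{Jira: "OCPBUGS-00000"}))
	fmt.Println(classify(true, nil))
}
```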
I agree that the above test results describe the actual statuses well. So much of the workflow section, including the flowchart, needs to be updated for an implementation based on the existing framework.
> #### How component owners respond?
>
> A component owner whose component failed the HA Level Check will receive a notification containing the following data (details are omitted for brevity):
Hoping to avoid any new notification mechanisms, the above outlines how we notify component owners when they have a problem that needs addressing.
> Risk: Development teams bear the burden of responding to notifications in a timely manner to prioritize and plan the development of HA features.
> Mitigation: The management process will only issue warnings without blocking the actual release process.
Agreed, we can accommodate this while the test is in flake-only mode and cannot fail. In future, once all exceptions look covered, we can make the test official and let it fail for anything new.
While an exception is granted with an open jira, the test will flake and not fail.
Periodic monitoring or automation is required to check the list of exception jiras to see if they were closed, and to take appropriate action (either reopen in disagreement, or move the exception to the permanent whitelist). At this point I would recommend the claude command helper in the origin repo to help maintain this aspect.
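A sketch of what that periodic exception audit could decide when it re-reads the state of a linked jira. The status strings and actions are illustrative only, not an existing tool:

```go
package main

import "fmt"

// nextAction suggests what a periodic exception-audit job could do
// for one exception, given the current state of its linked jira.
func nextAction(jiraStatus string, closedAsWontFix bool) string {
	if jiraStatus != "Closed" {
		// Open/pending jiras keep the exception; the test keeps flaking.
		return "keep exception"
	}
	if closedAsWontFix {
		// Not applicable / won't fix: record permanently with a reason,
		// or reopen the jira in disagreement.
		return "move to permanent whitelist"
	}
	// Fixed and closed: drop the exception so regressions fail again.
	return "drop exception"
}

func main() {
	fmt.Println(nextAction("Open", false))
	fmt.Println(nextAction("Closed", true))
	fmt.Println(nextAction("Closed", false))
}
```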
I agree with using JIRA states to represent policy ("Closed" as a permanent whitelist, and "Open" as a temporary exception or a disagreement).
Following changes over time is also important. The implementation status of HA at the workload level (Pod, DaemonSet, Deployment, StatefulSet) may change while we are still communicating via component-specific JIRA notifications. We need to ensure accurate tracking within JIRA, which will likely be a complex task; a Claude-based AI agent could be helpful.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so. /lifecycle stale
/remove-lifecycle stale
nhoriguchi left a comment
@dgoodwin Sorry for my late response, I have been tied up with my relocation tasks.
Overall, your suggestion seems to satisfy most of our requirements, so I would like to proceed with implementation using that method. I plan to rewrite the documentation in this PR, focusing primarily on the implementation section.
And I would like to move on to discussing the "how". Specifically, I need a "getting started" guide for monitortests to clarify exactly what we will contribute to.
> detect degradations in the HA implementation status.
> * Define how to store the result of HA level check of each OpenShift version to track the record of previous check results.
> * Introduce a mechanism to notify the degradations to component owners whose
This looks interesting; it's easy to track per-component results. One concern is that the HA check is done at a more granular level, such as per-pod, per-deployment, and per-daemonset. Is there a way to get raw test results, like a list of check results at that granular level? At least the JIRAs working as reports for each component need to contain that kind of raw test result, to identify which parts need improvement.
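One way this could work without a new result format is to encode the workload into the junit test name itself, so per-deployment and per-daemonset results are recoverable from the raw junit output. A sketch; the naming scheme is hypothetical:

```go
package main

import "fmt"

// workloadTestName encodes workload-level granularity into the junit
// test name, so a per-component jira can quote the exact workloads
// that need improvement. The scheme is illustrative only.
func workloadTestName(ns, kind, name, check string) string {
	return fmt.Sprintf("[Monitor:ha-compliance] %s/%s in ns/%s %s", kind, name, ns, check)
}

func main() {
	fmt.Println(workloadTestName("openshift-console", "deployment", "console", "should set a podDisruptionBudget"))
	// → [Monitor:ha-compliance] deployment/console in ns/openshift-console should set a podDisruptionBudget
}
```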
This enhancement adds guideline compliance checks to the CI process (the Red Hat-internal pipeline for OpenShift) to improve overall HA. Specifically, it integrates a mechanism to evaluate HA levels based on implementation status and developers' input. By notifying developers of non-compliant components, the management process encourages them to follow the guidelines. All data will be stored in a common repository, allowing both developers and partners to grasp the overall HA status early and easily.