2 changes: 1 addition & 1 deletion alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
Original file line number Diff line number Diff line change
@@ -39,7 +39,7 @@ You can find more performance troubleshooting tips in

In the OpenShift dashboard console under Observe section, select the etcd
dashboard. There are both leader elections as well as Disk Sync Duration
-dashboards which will assit with further issues.
+dashboards which will assist with further issues.

## Mitigation

@@ -63,7 +63,7 @@ credentials.

### TLS Certificate Update

-If the issue stems from incorrect or expired certicates, update the associated
+If the issue stems from incorrect or expired certificates, update the associated
OpenShift `Secret` or `ConfigMap` with the correct and valid certificates.

## Notes
@@ -10,7 +10,7 @@ with `openshift-` or `kube-`.
## Impact

Significant inode usage by a system component is likely to prevent the
-component from functioning normally. Signficant inode usage can also lead to a
+component from functioning normally. Significant inode usage can also lead to a
partial or full cluster outage.

## Diagnosis
67 changes: 0 additions & 67 deletions alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md

This file was deleted.

67 changes: 67 additions & 0 deletions alerts/cluster-monitoring-operator/NodeClockNotSynchronizing.md
@@ -0,0 +1,67 @@
# NodeClockNotSynchronizing

## Meaning

The `NodeClockNotSynchronizing` alert triggers when a node is affected by
issues with the NTP server for that node. For example, this alert might trigger
when certificates are rotated for the API Server on a node, and the
certificates fail validation because of an invalid time.


## Impact

This alert is critical. It indicates an issue that can lead to the API Server
Operator becoming degraded or unavailable. If the API Server Operator becomes
degraded or unavailable, this issue can negatively affect other Operators, such
as the Cluster Monitoring Operator.

## Diagnosis

To diagnose the underlying issue, start a debug pod on the affected node and
check the `chronyd` service:

```shell
oc -n default debug node/<affected_node_name>
chroot /host
systemctl status chronyd
```

## Mitigation

1. If the `chronyd` service is failing or stopped, start it:

```shell
systemctl start chronyd
```

   If the `chronyd` service is already running, restart it:

```shell
systemctl restart chronyd
```

If `chronyd` starts or restarts successfully, the service adjusts the clock
and displays something similar to the following example output:

```console
Oct 18 19:39:36 ip-100-67-47-86 chronyd[2055318]: System clock wrong by 16422.107473 seconds, adjustment started
Oct 19 00:13:18 ip-100-67-47-86 chronyd[2055318]: System clock was stepped by 16422.107473 seconds
```

2. Verify that the `chronyd` service is running:

```shell
systemctl status chronyd
```

3. Verify using PromQL:

```console
min_over_time(node_timex_sync_status[5m])
node_timex_maxerror_seconds
```

   `node_timex_sync_status` returns `1` if NTP is working properly, or `0` if
   NTP is not working properly. `node_timex_maxerror_seconds` indicates the
   estimated maximum clock error in seconds.

The alert triggers when the value for
`min_over_time(node_timex_sync_status[5m])` equals `0` and the value for
`node_timex_maxerror_seconds` is greater than or equal to `16`.
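
The firing condition above can be sketched as a small boolean helper (a
hypothetical illustration only; the actual evaluation happens inside
Prometheus, not in any runbook tooling):

```python
def clock_alert_firing(sync_status_min_5m: float,
                       maxerror_seconds: float) -> bool:
    """Mirror the documented firing condition.

    sync_status_min_5m -- value of min_over_time(node_timex_sync_status[5m])
    maxerror_seconds   -- value of node_timex_maxerror_seconds
    """
    # Fires only when sync was lost for the entire 5-minute window
    # and the estimated clock error is at least 16 seconds.
    return sync_status_min_5m == 0 and maxerror_seconds >= 16
```

For example, a node that was unsynchronized for the whole window with a
17-second error fires the alert, while a node that stayed synchronized at any
point in the window does not.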
2 changes: 1 addition & 1 deletion alerts/cluster-network-operator/NorthboundStaleAlert.md
@@ -26,7 +26,7 @@ hierarchy](./hierarchy/alerts-hierarchy.svg)

Investigate the health of the affected ovnkube-controller or northbound database
processes that run in the `ovnkube-controller` and `nbdb` containers
-repectively.
+respectively.

For OCP clusters at versions 4.13 or earlier, the containers run in
ovnkube-master pods:
@@ -61,13 +61,13 @@ The result should be Status:active

Mitigation will depend on what was found in the diagnosis section.

-As a general fix, you can try exiting the affected ovn-northd procesess with
+As a general fix, you can try exiting the affected ovn-northd processes with
```shell
ovn-appctl -t ovn-northd exit
```
which should cause the container running northd to restart. If this does not
-work you can try restarting the pods where the affected ovn-northd procesess are
+work you can try restarting the pods where the affected ovn-northd processes are
running.

-Contact the incident response team in your organisation if fixing the issue is
+Contact the incident response team in your organization if fixing the issue is
not apparent.
2 changes: 1 addition & 1 deletion alerts/cluster-network-operator/SouthboundStaleAlert.md
@@ -25,7 +25,7 @@ hierarchy](./hierarchy/alerts-hierarchy.svg)
## Diagnosis

Investigate the health of the affected northd or southbound database processes
-that run in the `northd` and `sbdb` containers repectively.
+that run in the `northd` and `sbdb` containers respectively.

For OCP clusters at versions 4.13 or earlier, the containers run in
ovnkube-master pods:
@@ -11,7 +11,7 @@ threshold for 1 hour, the alert will fire.
## Impact
The memory usage per instance within control
plane nodes influences the stability
-and responsiveness of the cluster, most noticably in the etcd and
+and responsiveness of the cluster, most noticeably in the etcd and
Kubernetes API server pods. Moreover, OOM kill can occur
with excessive memory usage, which negatively
influences the pod scheduling. Etcd also relies on a certain number of
@@ -31,12 +31,12 @@ pod logs for the cluster.

For the following command, replace the $DAEMONPOD variable
with the name of your own machine-config-daemon-* pod
-That is scheduled on the node expriencing the error.
+that is scheduled on the node experiencing the error.

```console
oc logs -f -n openshift-machine-config-operator $DAEMONPOD -c machine-config-daemon
```
-When a pivot is occuring the following will be logged.
+When a pivot is occurring, the following will be logged.

```console
I1126 17:15:38.991090 3069 rpm-ostree.go:243] Executing rebase to quay.io/my-registry/custom-image@blah
@@ -67,7 +67,7 @@ stated reason it gives for not being able to pivot. The following are
common reasons a pivot can fail.

- The rpm-ostree service is unable to
-pull the image from quay succesfully.
+pull the image from quay successfully.
- There are issues with the rpm-ostree service itself such as
being unable to start, or unable to build the OsImage folder,
unable to pivot from the current configuration.
@@ -10,7 +10,7 @@ will fire.

## Impact

-If the MCD is unable to succesfully reboot the node,
+If the MCD is unable to successfully reboot the node,
any pending MachineConfig changes that would
require a reboot would not be propagated,
and the MachineConfig cluster operator would degrade.
@@ -71,7 +71,7 @@ update.go:2641] failed to run reboot: exec: "systemd-run": executable file not f

This error indicates that the `systemd-run` file cannot be
found in the /usr/bin/systemd-run $PATH and so the node
-cannot reboot succesfully.
+cannot reboot successfully.

The error message will change depending on what is
preventing the reboot.
@@ -19,7 +19,7 @@ The system daemons needs this memory in order to
run and satisfy system processes. If other workloads
start to use this memory then system daemons
can be impacted. This alert
-firing does not nessarily mean the node is
+firing does not necessarily mean the node is
resource exhausted at the moment.

## Diagnosis
@@ -53,7 +53,7 @@ to get the 95th percentile.
portion of the system's memory occupied by
a process that is held in the main memory)

-If this value is greather then the 95th
+If this value is greater than the 95th
percentile of the allocatable memory for
the node then the alert will go into pending.
After 15 minutes in this state the alert
@@ -120,7 +120,7 @@ useful for troubleshooting:

- You can use the `top` command on
the host to get a dynamic update of
-the largest memory consuming proccesses.
+the largest memory consuming processes.
For instance, to get the top 100 memory
consuming processes on a node.

@@ -137,7 +137,7 @@ statistics of the node.
- Each node also contains a file called
`/proc/meminfo`. This file provides a usage
report about memory on the system. You can
-learn how to interperet the fields [here](https://access.redhat.com/solutions/406773).
+learn how to interpret the fields [here](https://access.redhat.com/solutions/406773).

- For kubelet-level commands you can get
the memory usage of individual pods by
@@ -12,7 +12,7 @@ Storage cluster will become read-only at 85%.

## Diagnosis

-Using the Openshift console, go to Storage-Data Fountation-Storage systems.
+Using the OpenShift console, go to Storage-Data Foundation-Storage systems.
A list of the available storage systems with basic information about raw
capacity and used capacity will be visible.
The command `ceph health` also provides information about cluster storage
@@ -11,7 +11,7 @@ Storage cluster will become read-only at 85%.

## Diagnosis

-Using the Openshift console, go to Storage-Data Fountation-Storage systems.
+Using the OpenShift console, go to Storage-Data Foundation-Storage systems.
A list of the available storage systems with basic information about raw
capacity and used capacity will be visible.
The command `ceph health` also provides information about cluster storage
@@ -13,7 +13,7 @@ Storage cluster will become read-only at 85%.

## Diagnosis

-Using the Openshift console, go to Storage-Data Fountation-Storage systems.
+Using the OpenShift console, go to Storage-Data Foundation-Storage systems.
A list of the available storage systems with basic information about raw
capacity and used capacity will be visible.
The command `ceph health` also provides information about cluster storage
@@ -37,7 +37,7 @@ oc patch -n openshift-storage storagecluster ocs-storagecluster \
```
Above is a sample patch command; users need to review their current CPU
configurations and increase them accordingly.
-PS: It is always adviced to add another MDS pod (that is to scale
+PS: It is always advised to add another MDS pod (that is to scale
Horizontally) once we have reached the max resource limit. Please see
[HorizontalScaling](CephMdsCPUUsageHighNeedsHorizontalScaling.md)
documentation for more details.
@@ -21,7 +21,7 @@ the cache limit set in `mds_cache_memory_limit`.
The MDS tries to stay under a reservation of the `mds_cache_memory_limit` by
trimming unused metadata in its cache and recalling cached items in the client
caches. It is possible for the MDS to exceed this limit due to slow recall from
-clients as result of multiple clients accesing the files.
+clients as a result of multiple clients accessing the files.

Read more about ceph MDS cache configuration [here](https://docs.ceph.com/en/latest/cephfs/cache-configuration/?highlight=mds%20cache%20configuration#mds-cache-configuration)

@@ -14,11 +14,11 @@ be fixed as soon as possible.
## Diagnosis

Make sure we have enough RAM provisioned for MDS Cache. Default is 4GB, but
-recomended is minimum 8GB.
+the recommended minimum is 8GB.

## Mitigation

-It is highly recomended to distribute MDS daemons across at least two nodes in
+It is highly recommended to distribute MDS daemons across at least two nodes in
the cluster. Otherwise, a hardware failure on a single node may result in the
file system becoming unavailable.

@@ -11,7 +11,7 @@ are only 3 monitors.

This is an "info" level alert, and therefore just a suggestion.
The alert is just suggesting to increase the number of ceph monitors, to be
-more resistent to failures.
+more resistant to failures.
It can be silenced without any impact on the cluster functionality or
performance.
If the number of monitors is increased to 5, the cluster will be more robust.
@@ -10,7 +10,7 @@ One threshold that can trigger this warning condition is the
## Impact

Due to the configured quota, the pool will become read-only when the quota is
-exhausted completelly
+exhausted completely.

## Diagnosis

@@ -10,7 +10,7 @@ One threshold that can trigger this warning condition is the
## Impact

Due to the configured quota, the pool will become read-only when the quota is
-exhausted completelly
+exhausted completely.

## Diagnosis

@@ -17,7 +17,7 @@ Connection with external key management service is not working.

## Mitigation

-Review configuration values in the ´ocs-kms-connection-details´ confimap.
+Review configuration values in the `ocs-kms-connection-details` configmap.

Verify the connectivity with the external KMS by checking
[network connectivity](helpers/networkConnectivity.md)