From 204b382206e4d6ca87b6b3c7f0ff7145f0970894 Mon Sep 17 00:00:00 2001 From: Cory Latschkowski Date: Mon, 16 Feb 2026 17:58:45 -0600 Subject: [PATCH] fix: spelling errors w/ symlinks --- .../etcdHighFsyncDurations.md | 2 +- .../ClusterLogForwarderOutputErrorRate.md | 2 +- .../KubePersistentVolumeInodesFillingUp.md | 2 +- .../NodeClockNotSynchronising.md | 68 +------------------ .../NodeClockNotSynchronizing.md | 67 ++++++++++++++++++ .../NorthboundStaleAlert.md | 2 +- .../OVNKubernetesNorthdInactive.md | 6 +- .../SouthboundStaleAlert.md | 2 +- .../HighOverallControlPlaneMemory.md | 2 +- .../MachineConfigDaemonPivotError.md | 6 +- .../MachineConfigDaemonRebootError.md | 4 +- .../SystemMemoryExceedsReservation.md | 8 +-- .../CephClusterCriticallyFull.md | 2 +- .../CephClusterNearFull.md | 2 +- .../CephClusterReadOnly.md | 2 +- ...CephMdsCPUUsageHighNeedsVerticalScaling.md | 2 +- .../CephMdsCacheUsageHigh.md | 2 +- .../CephMdsMissingReplicas.md | 4 +- .../CephMonLowNumber.md | 2 +- .../CephPoolQuotaBytesCriticallyExhausted.md | 2 +- .../CephPoolQuotaBytesNearExhaustion.md | 2 +- .../KMSServerConnectionAlert.md | 2 +- .../ObcQuotaBytesExhausedAlert.md | 39 +---------- .../ObcQuotaBytesExhaustedAlert.md | 38 +++++++++++ .../ObcQuotaObjectsExhausedAlert.md | 38 +---------- .../ObcQuotaObjectsExhaustedAlert.md | 37 ++++++++++ .../StorageClientHeartbeatMissed.md | 2 +- .../helpers/checkOperator.md | 2 +- .../helpers/diagnosis.md | 6 +- .../helpers/troubleshootCeph.md | 2 +- .../DuplicateWaspAgentDSDetected.md | 2 +- 31 files changed, 181 insertions(+), 178 deletions(-) mode change 100644 => 120000 alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md create mode 100644 alerts/cluster-monitoring-operator/NodeClockNotSynchronizing.md mode change 100644 => 120000 alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md create mode 100644 alerts/openshift-container-storage-operator/ObcQuotaBytesExhaustedAlert.md mode change 100644 => 120000 
alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md create mode 100644 alerts/openshift-container-storage-operator/ObcQuotaObjectsExhaustedAlert.md diff --git a/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md b/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md index 469a3764..d669e9b4 100644 --- a/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md +++ b/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md @@ -39,7 +39,7 @@ You can find more performance troubleshooting tips in In the OpenShift dashboard console under Observe section, select the etcd dashboard. There are both leader elections as well as Disk Sync Duration -dashboards which will assit with further issues. +dashboards which will assist with further issues. ## Mitigation diff --git a/alerts/cluster-logging-operator/ClusterLogForwarderOutputErrorRate.md b/alerts/cluster-logging-operator/ClusterLogForwarderOutputErrorRate.md index b4233068..d78ed3bd 100644 --- a/alerts/cluster-logging-operator/ClusterLogForwarderOutputErrorRate.md +++ b/alerts/cluster-logging-operator/ClusterLogForwarderOutputErrorRate.md @@ -63,7 +63,7 @@ credentials. ### TLS Certificate Update -If the issue stems from incorrect or expired certicates, update the associated +If the issue stems from incorrect or expired certificates, update the associated OpenShift `Secret` or `ConfigMap` with the correct and valid certificates. ## Notes diff --git a/alerts/cluster-monitoring-operator/KubePersistentVolumeInodesFillingUp.md b/alerts/cluster-monitoring-operator/KubePersistentVolumeInodesFillingUp.md index f87d5d2d..d5f0bf85 100644 --- a/alerts/cluster-monitoring-operator/KubePersistentVolumeInodesFillingUp.md +++ b/alerts/cluster-monitoring-operator/KubePersistentVolumeInodesFillingUp.md @@ -10,7 +10,7 @@ with `openshift-` or `kube-`. ## Impact Significant inode usage by a system component is likely to prevent the -component from functioning normally. 
Signficant inode usage can also lead to a +component from functioning normally. Significant inode usage can also lead to a partial or full cluster outage. ## Diagnosis diff --git a/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md b/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md deleted file mode 100644 index dc30eaa8..00000000 --- a/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md +++ /dev/null @@ -1,67 +0,0 @@ -# NodeClockNotSynchronising - -## Meaning - -The `NodeClockNotSynchronising` alert triggers when a node is affected by -issues with the NTP server for that node. For example, this alert might trigger -when certificates are rotated for the API Server on a node, and the -certificates fail validation because of an invalid time. - - -## Impact -This alert is critical. It indicates an issue that can lead to the API Server -Operator becoming degraded or unavailable. If the API Server Operator becomes -degraded or unavailable, this issue can negatively affect other Operators, such -as the Cluster Monitoring Operator. - -## Diagnosis - -To diagnose the underlying issue, start a debug pod on the affected node and -check the `chronyd` service: - -```shell -oc -n default debug node/ -chroot /host -systemctl status chronyd -``` - -## Mitigation - -1. If the `chronyd` service is failing or stopped, start it: - - ```shell - systemctl start chronyd - ``` - If the chronyd service is ready, restart it - - ```shell - systemctl restart chronyd - ``` - - If `chronyd` starts or restarts successfuly, the service adjusts the clock - and displays something similar to the following example output: - - ```shell - Oct 18 19:39:36 ip-100-67-47-86 chronyd[2055318]: System clock wrong by 16422.107473 seconds, adjustment started - Oct 19 00:13:18 ip-100-67-47-86 chronyd[2055318]: System clock was stepped by 16422.107473 seconds - ``` - -2. Verify that the `chronyd` service is running: - - ```shell - systemctl status chronyd - ``` - -3. 
Verify using PromQL: - - ```console - min_over_time(node_timex_sync_status[5m]) - node_timex_maxerror_seconds - ``` - `node_timex_sync_status` returns `1` if NTP is working properly,or `0` if - NTP is not working properly. `node_timex_maxerror_seconds` indicates how - many seconds NTP is falling behind. - - The alert triggers when the value for - `min_over_time(node_timex_sync_status[5m])` equals `0` and the value for - `node_timex_maxerror_seconds` is greater than or equal to `16`. diff --git a/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md b/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md new file mode 120000 index 00000000..1984d392 --- /dev/null +++ b/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md @@ -0,0 +1 @@ +NodeClockNotSynchronizing.md \ No newline at end of file diff --git a/alerts/cluster-monitoring-operator/NodeClockNotSynchronizing.md b/alerts/cluster-monitoring-operator/NodeClockNotSynchronizing.md new file mode 100644 index 00000000..283968e5 --- /dev/null +++ b/alerts/cluster-monitoring-operator/NodeClockNotSynchronizing.md @@ -0,0 +1,67 @@ +# NodeClockNotSynchronizing + +## Meaning + +The `NodeClockNotSynchronizing` alert triggers when a node is affected by +issues with the NTP server for that node. For example, this alert might trigger +when certificates are rotated for the API Server on a node, and the +certificates fail validation because of an invalid time. + + +## Impact +This alert is critical. It indicates an issue that can lead to the API Server +Operator becoming degraded or unavailable. If the API Server Operator becomes +degraded or unavailable, this issue can negatively affect other Operators, such +as the Cluster Monitoring Operator. + +## Diagnosis + +To diagnose the underlying issue, start a debug pod on the affected node and +check the `chronyd` service: + +```shell +oc -n default debug node/ +chroot /host +systemctl status chronyd +``` + +## Mitigation + +1. 
If the `chronyd` service is failing or stopped, start it: + + ```shell + systemctl start chronyd + ``` + If the `chronyd` service is already running, restart it: + + ```shell + systemctl restart chronyd + ``` + + If `chronyd` starts or restarts successfully, the service adjusts the clock + and displays something similar to the following example output: + + ```shell + Oct 18 19:39:36 ip-100-67-47-86 chronyd[2055318]: System clock wrong by 16422.107473 seconds, adjustment started + Oct 19 00:13:18 ip-100-67-47-86 chronyd[2055318]: System clock was stepped by 16422.107473 seconds + ``` + +2. Verify that the `chronyd` service is running: + + ```shell + systemctl status chronyd + ``` + +3. Verify using PromQL: + + ```console + min_over_time(node_timex_sync_status[5m]) + node_timex_maxerror_seconds + ``` + `node_timex_sync_status` returns `1` if NTP is working properly, or `0` if + NTP is not working properly. `node_timex_maxerror_seconds` indicates how + many seconds NTP is falling behind. + + The alert triggers when the value for + `min_over_time(node_timex_sync_status[5m])` equals `0` and the value for + `node_timex_maxerror_seconds` is greater than or equal to `16`. diff --git a/alerts/cluster-network-operator/NorthboundStaleAlert.md b/alerts/cluster-network-operator/NorthboundStaleAlert.md index 344d42e0..75cdf6de 100644 --- a/alerts/cluster-network-operator/NorthboundStaleAlert.md +++ b/alerts/cluster-network-operator/NorthboundStaleAlert.md @@ -26,7 +26,7 @@ hierarchy](./hierarchy/alerts-hierarchy.svg) Investigate the health of the affected ovnkube-controller or northbound database processes that run in the `ovnkube-controller` and `nbdb` containers -repectively. +respectively.
For OCP clusters at versions 4.13 or earlier, the containers run in ovnkube-master pods: diff --git a/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md b/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md index 6b105a29..8409c6c2 100644 --- a/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md +++ b/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md @@ -61,13 +61,13 @@ The result should be Status:active Mitigation will depend on what was found in the diagnosis section. -As a general fix, you can try exiting the affected ovn-northd procesess with +As a general fix, you can try exiting the affected ovn-northd processes with ```shell ovn-appctl -t ovn-northd exit ``` which should cause the container running northd to restart. If this does not -work you can try restarting the pods where the affected ovn-northd procesess are +work you can try restarting the pods where the affected ovn-northd processes are running. -Contact the incident response team in your organisation if fixing the issue is +Contact the incident response team in your organization if fixing the issue is not apparent. diff --git a/alerts/cluster-network-operator/SouthboundStaleAlert.md b/alerts/cluster-network-operator/SouthboundStaleAlert.md index bb95c347..c21d553a 100644 --- a/alerts/cluster-network-operator/SouthboundStaleAlert.md +++ b/alerts/cluster-network-operator/SouthboundStaleAlert.md @@ -25,7 +25,7 @@ hierarchy](./hierarchy/alerts-hierarchy.svg) ## Diagnosis Investigate the health of the affected northd or southbound database processes -that run in the `northd` and `sbdb` containers repectively. +that run in the `northd` and `sbdb` containers respectively. 
For OCP clusters at versions 4.13 or earlier, the containers run in ovnkube-master pods: diff --git a/alerts/machine-config-operator/HighOverallControlPlaneMemory.md b/alerts/machine-config-operator/HighOverallControlPlaneMemory.md index 2058b711..45a4085e 100644 --- a/alerts/machine-config-operator/HighOverallControlPlaneMemory.md +++ b/alerts/machine-config-operator/HighOverallControlPlaneMemory.md @@ -11,7 +11,7 @@ threshold for 1 hour, the alert will fire. ## Impact The memory usage per instance within control plane nodes influences the stability -and responsiveness of the cluster, most noticably in the etcd and +and responsiveness of the cluster, most noticeably in the etcd and Kubernetes API server pods. Moreover, OOM kill can occur with excessive memory usage, which negatively influences the pod scheduling. Etcd also relies on a certain number of diff --git a/alerts/machine-config-operator/MachineConfigDaemonPivotError.md b/alerts/machine-config-operator/MachineConfigDaemonPivotError.md index 04040c6e..cbcea2a3 100644 --- a/alerts/machine-config-operator/MachineConfigDaemonPivotError.md +++ b/alerts/machine-config-operator/MachineConfigDaemonPivotError.md @@ -31,12 +31,12 @@ pod logs for the cluster. For the following command, replace the $DAEMONPOD variable with the name of your own machine-config-daemon-* pod name. -That is scheduled on the node expriencing the error. +That is scheduled on the node experiencing the error. ```console oc logs -f -n openshift-machine-config-operator $DAEMONPOD -c machine-config-daemon ``` -When a pivot is occuring the following will be logged. +When a pivot is occurring the following will be logged. ```console I1126 17:15:38.991090 3069 rpm-ostree.go:243] Executing rebase to quay.io/my-registry/custom-image@blah @@ -67,7 +67,7 @@ stated reason it gives for not being able to pivot. The following are common reasons a pivot can fail. - The rpm-ostree service is unable to -pull the image from quay succesfully. 
+pull the image from quay successfully. - There are issues with the rpm-ostree service itself such as being unable to start, or unable to build the OsImage folder, unable to pivot from the current configuration. diff --git a/alerts/machine-config-operator/MachineConfigDaemonRebootError.md b/alerts/machine-config-operator/MachineConfigDaemonRebootError.md index a0a6eb1e..c9c04afb 100644 --- a/alerts/machine-config-operator/MachineConfigDaemonRebootError.md +++ b/alerts/machine-config-operator/MachineConfigDaemonRebootError.md @@ -10,7 +10,7 @@ will fire. ## Impact -If the MCD is unable to succesfully reboot the node, +If the MCD is unable to successfully reboot the node, any pending MachineConfig changes that would require a reboot would not be propagated, and the MachineConfig cluster operator would degrade. @@ -71,7 +71,7 @@ update.go:2641] failed to run reboot: exec: "systemd-run": executable file not f This error indicates that the `systemd-run` file cannot be found in the /usr/bin/systemd-run $PATH and so the node -cannot reboot succesfully. +cannot reboot successfully. The error message will change depending on what is preventing the reboot. diff --git a/alerts/machine-config-operator/SystemMemoryExceedsReservation.md b/alerts/machine-config-operator/SystemMemoryExceedsReservation.md index c9162161..758236af 100644 --- a/alerts/machine-config-operator/SystemMemoryExceedsReservation.md +++ b/alerts/machine-config-operator/SystemMemoryExceedsReservation.md @@ -19,7 +19,7 @@ The system daemons needs this memory in order to run and satisfy system processes. If other workloads start to use this memory then system daemons can be impacted. This alert -firing does not nessarily mean the node is +firing does not necessarily mean the node is resource exhausted at the moment. ## Diagnosis @@ -53,7 +53,7 @@ to get the 95th percentile. 
portion of the system's memory occupied by a process that is held in the main memory) - If this value is greather then the 95th + If this value is greater than the 95th percentile of the allocatable memory for the node then the alert will go into pending. After 15 minutes in this state the alert @@ -120,7 +120,7 @@ useful for troubleshooting: - You can use the `top` command on the host to get a dynamic update of -the largest memory consuming proccesses. +the largest memory consuming processes. For instance, to get the top 100 memory consuming processes on a node. @@ -137,7 +137,7 @@ statistics of the node. - Each node also contains a file called `/proc/meminfo`. This file provides a usage report about memory on the system. You can -learn how to interperet the fields [here](https://access.redhat.com/solutions/406773). +learn how to interpret the fields [here](https://access.redhat.com/solutions/406773). - For kubelet-level commands you can get the memory usage of individual pods by diff --git a/alerts/openshift-container-storage-operator/CephClusterCriticallyFull.md b/alerts/openshift-container-storage-operator/CephClusterCriticallyFull.md index 4da3a2fd..916c240d 100644 --- a/alerts/openshift-container-storage-operator/CephClusterCriticallyFull.md +++ b/alerts/openshift-container-storage-operator/CephClusterCriticallyFull.md @@ -12,7 +12,7 @@ ## Diagnosis -Using the Openshift console, go to Storage-Data Fountation-Storage systems. +Using the Openshift console, go to Storage-Data Foundation-Storage systems. A list of the available storage systems with basic information about raw capacity and used capacity will be visible.
The command "ceph health" provides also information about cluster storage diff --git a/alerts/openshift-container-storage-operator/CephClusterNearFull.md b/alerts/openshift-container-storage-operator/CephClusterNearFull.md index 015893bf..d433e4be 100644 --- a/alerts/openshift-container-storage-operator/CephClusterNearFull.md +++ b/alerts/openshift-container-storage-operator/CephClusterNearFull.md @@ -11,7 +11,7 @@ Storage cluster will become read-only at 85%. ## Diagnosis -Using the Openshift console, go to Storage-Data Fountation-Storage systems. +Using the Openshift console, go to Storage-Data Foundation-Storage systems. A list of the available storage systems with basic information about raw capacity and used capacity will be visible. The command "ceph health" provides also information about cluster storage diff --git a/alerts/openshift-container-storage-operator/CephClusterReadOnly.md b/alerts/openshift-container-storage-operator/CephClusterReadOnly.md index ed4808ea..2c107d96 100644 --- a/alerts/openshift-container-storage-operator/CephClusterReadOnly.md +++ b/alerts/openshift-container-storage-operator/CephClusterReadOnly.md @@ -13,7 +13,7 @@ Storage cluster will become read-only at 85%. ## Diagnosis -Using the Openshift console, go to Storage-Data Fountation-Storage systems. +Using the Openshift console, go to Storage-Data Foundation-Storage systems. A list of the available storage systems with basic information about raw capacity and used capacity will be visible. 
The command "ceph health" provides also information about cluster storage diff --git a/alerts/openshift-container-storage-operator/CephMdsCPUUsageHighNeedsVerticalScaling.md b/alerts/openshift-container-storage-operator/CephMdsCPUUsageHighNeedsVerticalScaling.md index cd45085c..dcedb22a 100644 --- a/alerts/openshift-container-storage-operator/CephMdsCPUUsageHighNeedsVerticalScaling.md +++ b/alerts/openshift-container-storage-operator/CephMdsCPUUsageHighNeedsVerticalScaling.md @@ -37,7 +37,7 @@ oc patch -n openshift-storage storagecluster ocs-storagecluster \ ``` Above is a sample patch command, user need to see their current CPU configurations and increase accordingly -PS: It is always adviced to add another MDS pod (that is to scale +PS: It is always advised to add another MDS pod (that is to scale Horizontally) once we have reached the max resource limit. Please see [HorizontalScaling](CephMdsCPUUsageHighNeedsHorizontalScaling.md) documentation for more details. diff --git a/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md b/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md index 235a430d..4011c337 100644 --- a/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md +++ b/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md @@ -21,7 +21,7 @@ the cache limit set in `mds_cache_memory_limit`. The MDS tries to stay under a reservation of the `mds_cache_memory_limit` by trimming unused metadata in its cache and recalling cached items in the client caches. It is possible for the MDS to exceed this limit due to slow recall from -clients as result of multiple clients accesing the files. +clients as result of multiple clients accessing the files. 
Read more about ceph MDS cache configuration [here](https://docs.ceph.com/en/latest/cephfs/cache-configuration/?highlight=mds%20cache%20configuration#mds-cache-configuration) diff --git a/alerts/openshift-container-storage-operator/CephMdsMissingReplicas.md b/alerts/openshift-container-storage-operator/CephMdsMissingReplicas.md index 06876d62..48de3d7a 100644 --- a/alerts/openshift-container-storage-operator/CephMdsMissingReplicas.md +++ b/alerts/openshift-container-storage-operator/CephMdsMissingReplicas.md @@ -14,11 +14,11 @@ be fixed as soon as possible. ## Diagnosis Make sure we have enough RAM provisioned for MDS Cache. Default is 4GB, but -recomended is minimum 8GB. +recommended is minimum 8GB. ## Mitigation -It is highly recomended to distribute MDS daemons across at least two nodes in +It is highly recommended to distribute MDS daemons across at least two nodes in the cluster. Otherwise, a hardware failure on a single node may result in the file system becoming unavailable. diff --git a/alerts/openshift-container-storage-operator/CephMonLowNumber.md b/alerts/openshift-container-storage-operator/CephMonLowNumber.md index d6837ef4..5308a789 100644 --- a/alerts/openshift-container-storage-operator/CephMonLowNumber.md +++ b/alerts/openshift-container-storage-operator/CephMonLowNumber.md @@ -11,7 +11,7 @@ are only 3 monitors. This a "info" level alert, and therefore just a suggestion. The alert is just suggesting to increase the number of ceph monitors, to be -more resistent to failures. +more resistant to failures. It can be silenced without any impact in the cluster functionality or performance. If the number of monitors is increased to 5, the cluster will be more robust. 
diff --git a/alerts/openshift-container-storage-operator/CephPoolQuotaBytesCriticallyExhausted.md b/alerts/openshift-container-storage-operator/CephPoolQuotaBytesCriticallyExhausted.md index 3b378889..6249557f 100644 --- a/alerts/openshift-container-storage-operator/CephPoolQuotaBytesCriticallyExhausted.md +++ b/alerts/openshift-container-storage-operator/CephPoolQuotaBytesCriticallyExhausted.md @@ -10,7 +10,7 @@ One threshold that can trigger this warning condition is the ## Impact Due the quota configured the pool will become readonly when the quota will be -exhausted completelly +exhausted completely ## Diagnosis diff --git a/alerts/openshift-container-storage-operator/CephPoolQuotaBytesNearExhaustion.md b/alerts/openshift-container-storage-operator/CephPoolQuotaBytesNearExhaustion.md index 4f7d32db..8879a561 100644 --- a/alerts/openshift-container-storage-operator/CephPoolQuotaBytesNearExhaustion.md +++ b/alerts/openshift-container-storage-operator/CephPoolQuotaBytesNearExhaustion.md @@ -10,7 +10,7 @@ One threshold that can trigger this warning condition is the ## Impact Due the quota configured the pool will become readonly when the quota will be -exhausted completelly +exhausted completely ## Diagnosis diff --git a/alerts/openshift-container-storage-operator/KMSServerConnectionAlert.md b/alerts/openshift-container-storage-operator/KMSServerConnectionAlert.md index d89cf181..1fb919f7 100644 --- a/alerts/openshift-container-storage-operator/KMSServerConnectionAlert.md +++ b/alerts/openshift-container-storage-operator/KMSServerConnectionAlert.md @@ -17,7 +17,7 @@ Connection with external key management service is not working. ## Mitigation -Review configuration values in the ´ocs-kms-connection-details´ confimap. +Review configuration values in the `ocs-kms-connection-details` configmap.
Verify the connectivity with the external KMS, verifying [network connectivity](helpers/networkConnectivity.md) diff --git a/alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md b/alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md deleted file mode 100644 index db105031..00000000 --- a/alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md +++ /dev/null @@ -1,38 +0,0 @@ -# ObcQuotaBytesExhausedAlert - -## Meaning - -This is the next stage once we have reached [ObcQuotaObjectsAlert](ObcQuotaObjectsAlert.md). -ObjectBucketClaim has crossed the limit set by the quota(bytes) and will be -read-only now. Increase the quota in the OBC custom resource immediately. - -## Impact - -OBC has exhausted and reached it's limit. - -## Diagnosis - -Alert message will clearly indicate which OBC has reached the quota bytes limit. -Look at the deployments attached to the OBC and see what all apps are -using/filling-up the OBC. - -## Mitigation - -Need to increase the quota limit immediately for the ObjectBucketClaim -custom resource. 
We can set quota option on OBC by using the `maxObjects` -and `maxSize` options in the ObjectBucketClaim CRD - -```yaml -apiVersion: objectbucket.io/v1alpha1 -kind: ObjectBucketClaim -metadata: - name: - namespace: -spec: - bucketName: - storageClassName: - additionalConfig: - maxObjects: "1000" # sets limit on no of objects this obc can hold - maxSize: "2G" # sets max limit for the size of data this obc can hold -``` - diff --git a/alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md b/alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md new file mode 120000 index 00000000..45cf555f --- /dev/null +++ b/alerts/openshift-container-storage-operator/ObcQuotaBytesExhausedAlert.md @@ -0,0 +1 @@ +ObcQuotaBytesExhaustedAlert.md \ No newline at end of file diff --git a/alerts/openshift-container-storage-operator/ObcQuotaBytesExhaustedAlert.md b/alerts/openshift-container-storage-operator/ObcQuotaBytesExhaustedAlert.md new file mode 100644 index 00000000..e856d450 --- /dev/null +++ b/alerts/openshift-container-storage-operator/ObcQuotaBytesExhaustedAlert.md @@ -0,0 +1,38 @@ +# ObcQuotaBytesExhaustedAlert + +## Meaning + +This is the next stage once we have reached [ObcQuotaObjectsAlert](ObcQuotaObjectsAlert.md). +ObjectBucketClaim has crossed the limit set by the quota (bytes) and will be +read-only now. Increase the quota in the OBC custom resource immediately. + +## Impact + +The OBC is exhausted and has reached its limit. + +## Diagnosis + +Alert message will clearly indicate which OBC has reached the quota bytes limit. +Look at the deployments attached to the OBC and see what all apps are +using/filling-up the OBC. + +## Mitigation + +Need to increase the quota limit immediately for the ObjectBucketClaim +custom resource.
We can set quota option on OBC by using the `maxObjects` +and `maxSize` options in the ObjectBucketClaim CRD + +```yaml +apiVersion: objectbucket.io/v1alpha1 +kind: ObjectBucketClaim +metadata: + name: + namespace: +spec: + bucketName: + storageClassName: + additionalConfig: + maxObjects: "1000" # sets limit on no of objects this obc can hold + maxSize: "2G" # sets max limit for the size of data this obc can hold +``` + diff --git a/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md b/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md deleted file mode 100644 index 4f79917d..00000000 --- a/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md +++ /dev/null @@ -1,37 +0,0 @@ -# ObcQuotaObjectsExhausedAlert - -## Meaning - -ObjectBucketClaim has crossed the limit set by the quota(objects) and -will be read-only now. - -## Impact - -Application won't be able to do any transaction through the OBC and will be stalled. - -## Diagnosis - -Alert message will indicate which OBC has reached the object quota limit. -Look at the deployments attached to the OBC and -see what all apps are using/filling-up the OBC. - -## Mitigation - -Immediately increase the quota for the OBC, specified in the alert details. 
-We can increase quota option on OBC by using the `maxObjects` and -`maxSize` options in the ObjectBucketClaim CRD - -```yaml -apiVersion: objectbucket.io/v1alpha1 -kind: ObjectBucketClaim -metadata: - name: - namespace: -spec: - bucketName: - storageClassName: - additionalConfig: - maxObjects: "1000" # sets limit on no of objects this obc can hold - maxSize: "2G" # sets max limit for the size of data this obc can hold -``` - diff --git a/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md b/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md new file mode 120000 index 00000000..a9ca6f07 --- /dev/null +++ b/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhausedAlert.md @@ -0,0 +1 @@ +ObcQuotaObjectsExhaustedAlert.md \ No newline at end of file diff --git a/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhaustedAlert.md b/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhaustedAlert.md new file mode 100644 index 00000000..7cd27f56 --- /dev/null +++ b/alerts/openshift-container-storage-operator/ObcQuotaObjectsExhaustedAlert.md @@ -0,0 +1,37 @@ +# ObcQuotaObjectsExhaustedAlert + +## Meaning + +ObjectBucketClaim has crossed the limit set by the quota (objects) and +will be read-only now. + +## Impact + +Applications won't be able to perform any transactions through the OBC and will be stalled. + +## Diagnosis + +Alert message will indicate which OBC has reached the object quota limit. +Look at the deployments attached to the OBC and +see what all apps are using/filling-up the OBC. + +## Mitigation + +Immediately increase the quota for the OBC, specified in the alert details.
+We can increase quota option on OBC by using the `maxObjects` and +`maxSize` options in the ObjectBucketClaim CRD + +```yaml +apiVersion: objectbucket.io/v1alpha1 +kind: ObjectBucketClaim +metadata: + name: + namespace: +spec: + bucketName: + storageClassName: + additionalConfig: + maxObjects: "1000" # sets limit on no of objects this obc can hold + maxSize: "2G" # sets max limit for the size of data this obc can hold +``` + diff --git a/alerts/openshift-container-storage-operator/StorageClientHeartbeatMissed.md b/alerts/openshift-container-storage-operator/StorageClientHeartbeatMissed.md index 65597797..98a8d84d 100644 --- a/alerts/openshift-container-storage-operator/StorageClientHeartbeatMissed.md +++ b/alerts/openshift-container-storage-operator/StorageClientHeartbeatMissed.md @@ -23,7 +23,7 @@ Verify ODF Provider reachability from ODF Client by following ## Mitigation -### Intermittent nework connectivity +### Intermittent network connectivity 1. From diagnosis if you find endpoint is reachable, wait for at most 5 minutes to have connection reestablished which should stop firing alert. diff --git a/alerts/openshift-container-storage-operator/helpers/checkOperator.md b/alerts/openshift-container-storage-operator/helpers/checkOperator.md index fe1e7f1e..c86ada03 100644 --- a/alerts/openshift-container-storage-operator/helpers/checkOperator.md +++ b/alerts/openshift-container-storage-operator/helpers/checkOperator.md @@ -30,7 +30,7 @@ The status for each type should be False. For example: ] ``` -The output above shows a false status for type CatalogSourcesUnHealthly, +The output above shows a false status for type CatalogSourcesUnHealthy, meaning the catalog sources are healthy. 
## OCS Operator Pod Health diff --git a/alerts/openshift-container-storage-operator/helpers/diagnosis.md b/alerts/openshift-container-storage-operator/helpers/diagnosis.md index cb54e4cc..58b745fb 100644 --- a/alerts/openshift-container-storage-operator/helpers/diagnosis.md +++ b/alerts/openshift-container-storage-operator/helpers/diagnosis.md @@ -82,7 +82,7 @@ Step 1: Check Node Health: ip-10-0-175-99.eu-west-2.compute.internal Ready worker 83m v1.23.5+3afdacb 10.0.175.99 Red Hat Enterprise Linux CoreOS 410.84.202206080346-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8 ``` -If any nodes are not ready/scheduable, then continue to Step 2. +If any nodes are not ready/schedulable, then continue to Step 2. Step 2: Inspect Node Events: @@ -108,7 +108,7 @@ Example: ``` Look for any events similar to the above example which may indicate the node -is undergoing maintainence. +is undergoing maintenance. ## Further info @@ -120,7 +120,7 @@ converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager Related Links -* [ODF Dedicated Converged Add-on Architecure](https://docs.google.com/document/d/1ISEY16OfsvEPmlJEjEwPvDvDs0KyNzgl369A-V6-GRA/edit#heading=h.mznotzn8pklp) +* [ODF Dedicated Converged Add-on Architecture](https://docs.google.com/document/d/1ISEY16OfsvEPmlJEjEwPvDvDs0KyNzgl369A-V6-GRA/edit#heading=h.mznotzn8pklp) * [ODF Product Architecture](https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html/planning_your_deployment/ocs-architecture_rhocs) diff --git a/alerts/openshift-container-storage-operator/helpers/troubleshootCeph.md b/alerts/openshift-container-storage-operator/helpers/troubleshootCeph.md index dc2febe3..202e0169 100644 --- a/alerts/openshift-container-storage-operator/helpers/troubleshootCeph.md +++ b/alerts/openshift-container-storage-operator/helpers/troubleshootCeph.md @@ -6,7 +6,7 @@ Some common commands to troubleshoot a Ceph cluster: * ceph status * ceph osd 
status -* cepd osd df +* ceph osd df * ceph osd utilization * ceph osd pool stats * ceph osd tree diff --git a/alerts/openshift-virtualization-operator/DuplicateWaspAgentDSDetected.md b/alerts/openshift-virtualization-operator/DuplicateWaspAgentDSDetected.md index 656d41ff..fba7e6b5 100644 --- a/alerts/openshift-virtualization-operator/DuplicateWaspAgentDSDetected.md +++ b/alerts/openshift-virtualization-operator/DuplicateWaspAgentDSDetected.md @@ -5,7 +5,7 @@ This alert is deprecated. You can safely ignore or silence it. ## Meaning wasp-agent is a node-local agent that enables swap for burstable QoS pods. It mimics the behavior of kubelet swap feature. -wasp-agent deployment consists of Daemonset, serivce account, +wasp-agent deployment consists of Daemonset, service account, role binding, privileged SCC. Wasp-agent currently deployed automatically by the HCO operator when the memory overcommit percentage is set to a value higher than 100%. In the past wasp-agent was deployed manually