Rhobsp03ue1 by philipgough · Pull Request #2 · spaparaju/configuration

philipgough · 2026-03-02T10:07:14Z

No description provided.

Signed-off-by: Moad Zardab <mzardab@redhat.com>

* Modifying template to use resourceTemplates on hive selector syncsets

  Signed-off-by: Moad Zardab <mzardab@redhat.com>
  Signed-off-by: Moad Zardab <mzardab@redhat.com>

* Modifying template to use resourceTemplates on hive selector syncsets

  Signed-off-by: Moad Zardab <mzardab@redhat.com>
  Signed-off-by: Moad Zardab <mzardab@redhat.com>

---------

Signed-off-by: Moad Zardab <mzardab@redhat.com>

Signed-off-by: Moad Zardab <mzardab@redhat.com>

…lowlist (node/kube metrics) (#1029) Signed-off-by: Moad Zardab <mzardab@redhat.com>

…cs to al…" (#1033) This reverts commit 4522634.

…(#1035)

Adds kube_namespace_labels

…1047)

…rposes (#1053)

…(#1054)

Signed-off-by: Moad Zardab <mzardab@redhat.com>
Co-authored-by: Philip Gough <philip.p.gough@gmail.com>

* Initial plan

* Remove ThanosQuery configuration file from staging cluster

  Co-authored-by: philipgough <5781491+philipgough@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: philipgough <5781491+philipgough@users.noreply.github.com>

…" (#1259) This reverts commit 6847955.

Port alert rules from dynatrace-config to close coverage gaps identified during parity validation (SREP-3426).

Alerts added:
- api-RapidErrorBudgetBurn: fast-burn SLO for API server (>90% failure in 3m)
- KubeAPIServerRestartingFrequently: crash-looping API server detection
- CoreClusterOperatorDown: core operator health (console, network, monitoring, dns)
- DefaultIngressControllerDegraded: ingress controller degradation
- CertManagerCertExpirySoon: proactive cert expiry warning (21 days)
- CertManagerCertNotReady: cert renewal failure
- SAEDeploymentDoesNotHaveExpectedReplicas: partial SAE degradation

Recording rules added:
- sre:kube_apiserver:container_restarts_total
- hcp_worker_nodes:available_count
- core_cluster_operator:down:filtered

All alerts include hypershift_cluster_alerts_disabled suppression. KubeAPIErrorBudgetBurn is deferred (needs apiserver_request:error_rate_* recording rules -- separate follow-up).

Jira: SREP-3427

Port the KubeAPI SLO error budget burn alerting from dynatrace-config. This is the last alert rule gap identified in parity validation.

Adds kube-api-error-budget.yaml with:
- 8 base counter recording rules (read/write totals, errors, latency)
- 14 error rate recording rules (7 windows x 2 verbs)
- 4 alert variants (fast/medium/slow/very-slow burn rates)
- Targets 99.99% API server availability

Also fixes a bug from the original dynatrace-config where the 2h read error rate denominator used 'instance' instead of 'namespace' in the group-by clause.

Jira: SREP-3429
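Why the group-by mismatch matters can be illustrated with a minimal Python sketch of one-to-one vector matching. This is not the rule from kube-api-error-budget.yaml: the sample series, the helper names `sum_by` and `divide`, and the assumption that the numerator is grouped by 'namespace' are all hypothetical, chosen only to show how a ratio can come out empty when the two sides are aggregated by different labels.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

LabelSet = FrozenSet[Tuple[str, str]]
Sample = Tuple[Dict[str, str], float]


def sum_by(samples: Iterable[Sample], by: List[str]) -> Dict[LabelSet, float]:
    """Aggregate samples, keeping only the labels named in `by`."""
    out: Dict[LabelSet, float] = {}
    for labels, value in samples:
        key = frozenset((k, v) for k, v in labels.items() if k in by)
        out[key] = out.get(key, 0.0) + value
    return out


def divide(num: Dict[LabelSet, float], den: Dict[LabelSet, float]) -> Dict[LabelSet, float]:
    """One-to-one division: a series is kept only if both sides share its exact label set."""
    return {k: v / den[k] for k, v in num.items() if k in den and den[k] != 0}


# Hypothetical raw series: two API server instances in one namespace.
errors: List[Sample] = [
    ({"namespace": "default", "instance": "a"}, 1.0),
    ({"namespace": "default", "instance": "b"}, 1.0),
]
totals: List[Sample] = [
    ({"namespace": "default", "instance": "a"}, 100.0),
    ({"namespace": "default", "instance": "b"}, 100.0),
]

# Assumption: the numerator is grouped by 'namespace'.
numerator = sum_by(errors, by=["namespace"])

# Denominator grouped by 'instance': no label set matches the numerator.
buggy = divide(numerator, sum_by(totals, by=["instance"]))

# Denominator grouped by 'namespace': label sets line up and the ratio exists.
fixed = divide(numerator, sum_by(totals, by=["namespace"]))
```

Under these assumptions `buggy` is empty while `fixed` holds a single error ratio for the namespace, which is the kind of silent failure a wrong group-by label can cause in a recording rule.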

Two fixes for alerts that exist in rhobs-configuration but never fire because of missing/wrong metrics:

1. ClusterOperatorDown: Rewrite to use cluster_operator_up (which is in the remote write allowlist) instead of cluster_operator_conditions (which is NOT collected). The alert was always evaluating to empty.

2. OauthServiceDeploymentDegraded: Add the sre:oauth:deployment_pods_unavailable recording rule, ported from dynatrace-config. Without this recording rule, the metric never existed so the alert never fired.

NodeHighResourceUsage is tracked separately in SREP-3430 (requires ~30 recording rules with instance-type-specific baselines).

Jira: SREP-3275

Revert ClusterOperatorDown to use cluster_operator_conditions (the standard OpenShift metric) instead of the custom cluster_operator_up. Add cluster_operator_conditions to both the CMO federation match list and the remote write allowlist so it reaches the RHOBS cell.

Also add certmanager_certificate_expiration_timestamp_seconds to both allowlists so CertManagerCertExpirySoon can evaluate, and add certmanager_certificate_ready_status to the CMO federation match list (was already in remote write but not federated from CMO).

Remove sre:oauth:deployment_pods_unavailable recording rule from tenant-rules -- it depends on kube_deployment_status_replicas_unavailable which is not forwarded to the RHOBS cell. The recording rule should run on the MC (where the raw metric exists) and its result is already in the remote write allowlist.

Jira: SREP-3430
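The two-stage path described above (CMO federation, then remote write) lends itself to a simple coverage check. The following Python sketch is an illustration only: the function name `pipeline_gaps` and the trimmed "before" and "after" lists are hypothetical stand-ins built from the metrics named in this commit message, not the contents of the real template, and it assumes a raw metric must appear in both lists to reach the RHOBS cell.

```python
from typing import Dict, List, Set


def pipeline_gaps(
    required: Set[str],
    federation_match: Set[str],
    remote_write_allowlist: Set[str],
) -> Dict[str, List[str]]:
    """Return, per required raw metric, the pipeline stages that would drop it."""
    report: Dict[str, List[str]] = {}
    for metric in sorted(required):
        gaps: List[str] = []
        if metric not in federation_match:
            gaps.append("cmo_federation")
        if metric not in remote_write_allowlist:
            gaps.append("remote_write")
        if gaps:
            report[metric] = gaps
    return report


# Raw metrics the alerts in this commit depend on.
required = {
    "cluster_operator_conditions",
    "certmanager_certificate_expiration_timestamp_seconds",
    "certmanager_certificate_ready_status",
}

# Hypothetical, trimmed state before the change, as described in the message.
federation_before: Set[str] = set()
remote_write_before = {"certmanager_certificate_ready_status"}
before = pipeline_gaps(required, federation_before, remote_write_before)

# Hypothetical state after the change: every required metric is in both lists.
after = pipeline_gaps(required, set(required), set(required))
```

Here `before` reports certmanager_certificate_ready_status as missing only at the federation stage and the other two metrics as missing at both, while `after` is empty. Recording rule results produced on the MC are a separate case, since per the message they only need the remote write allowlist.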

…ck (#1266)

Add PrometheusRule objects to the RHOBS monitoring stack SelectorSyncSet so recording rules are evaluated on each MC where the raw metrics exist. Results are forwarded to RHOBS cells via the existing remote write allowlist.

Recording rules added:
- sre:oauth:deployment_pods_unavailable (for OauthServiceDeploymentDegraded)
- node:is_hypershift, hypershift_node:info
- hypershift_node:network:baseline_throughput_in_Gbps (10 instance types)
- hypershift_node:disk_io:baseline_throughput_in_Mbps (10 instance types)
- hypershift_node:consumption:{cpu,memory,network,filesystem,disk_io}
- hypershift_node:ready, hypershift_node:in_bad_condition
- hypershift_node:high_usage (7 threshold variants)

CMO federation match list additions (raw metrics needed by recording rules):
- kube_node_status_condition
- node_filesystem_avail_bytes, node_filesystem_size_bytes
- node_disk_written_bytes_total, node_disk_read_bytes_total

These replace recording rules that previously ran in the Dynatrace Prometheus namespace (openshift-observability-dynatrace).

Jira: SREP-3430

… (#1267)

Document the process for validating changes in non-production RHOBS cells before promoting to production via pinned SHA refs. Covers the full workflow from merge to validation to promotion to rollback.

Jira: SREP-3431

…#1268)

Add SC-specific metric collection, recording rules, and alert rules to enable RHOBS alerting on Service Clusters alongside MCs.

Monitoring stack template changes (SREP-3433, SREP-3434):
- Add ACM metrics to CMO federation match list and remote write allowlist: acm_managed_cluster_status_condition, acm_manifestwork_status_condition, acm_manifestwork_count
- Add ManifestWork recording rule results to remote write allowlist
- Deploy ACM ManifestWork recording rules as PrometheusRule on SCs

SC tenant rules (SREP-3435):
- acm-managed-clusters.yaml: ACMManagedClusterConditionUnknown, ACMManagedClusterKubeAPIServerUnavailable, ACMManagedClusterClientCertRotationFailed
- acm-grc.yaml: ACMPolicyControllerReconcileErrors
- acm-manifestwork.yaml: ACMManifestWorkAppliedHighFailureRate
- cert-manager.yaml: CertManagerCertExpirySoon, CertManagerCertNotReady
- observability.yaml: DeadMansSnitch

Jira: SREP-3432, SREP-3433, SREP-3434, SREP-3435

* SREP-3432: Enable RHOBS collection and alerting on Service Clusters

  Add SC-specific metric collection, recording rules, and alert rules to enable RHOBS alerting on Service Clusters alongside MCs.

  Monitoring stack template changes (SREP-3433, SREP-3434):
  - Add ACM metrics to CMO federation match list and remote write allowlist: acm_managed_cluster_status_condition, acm_manifestwork_status_condition, acm_manifestwork_count
  - Add ManifestWork recording rule results to remote write allowlist
  - Deploy ACM ManifestWork recording rules as PrometheusRule on SCs

  SC tenant rules (SREP-3435):
  - acm-managed-clusters.yaml: ACMManagedClusterConditionUnknown, ACMManagedClusterKubeAPIServerUnavailable, ACMManagedClusterClientCertRotationFailed
  - acm-grc.yaml: ACMPolicyControllerReconcileErrors
  - acm-manifestwork.yaml: ACMManifestWorkAppliedHighFailureRate
  - cert-manager.yaml: CertManagerCertExpirySoon, CertManagerCertNotReady
  - observability.yaml: DeadMansSnitch

  Jira: SREP-3432, SREP-3433, SREP-3434, SREP-3435

* Add sc.yaml generation script and make target

  Add generate-sc-rules.sh to produce a combined sc.yaml from the split domain files in resources/tenant-rules/sc/, matching the pattern used for hcp.yaml. Add sc-rules make target.

  Jira: SREP-3432

* Add manifests for rhobsp01sae1

  Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Add to bundle

  Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>