Rhobsp03ue1 #2

Draft
philipgough wants to merge 1057 commits into spaparaju:main from philipgough:rhobsp03ue1

@philipgough
No description provided.

moadz and others added 30 commits September 30, 2025 12:52
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
* Modifying template to use resourceTemplates on hive selector syncsets
Signed-off-by: Moad Zardab <mzardab@redhat.com>

Signed-off-by: Moad Zardab <mzardab@redhat.com>

* Modifying template to use resourceTemplates on hive selector syncsets
Signed-off-by: Moad Zardab <mzardab@redhat.com>

Signed-off-by: Moad Zardab <mzardab@redhat.com>

---------

Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
…lowlist (node/kube metrics) (#1029)

Signed-off-by: Moad Zardab <mzardab@redhat.com>
…(#1054)

Signed-off-by: Moad Zardab <mzardab@redhat.com>
Co-authored-by: Philip Gough <philip.p.gough@gmail.com>
Copilot AI and others added 30 commits February 6, 2026 09:43
* Initial plan

* Remove ThanosQuery configuration file from staging cluster

Co-authored-by: philipgough <5781491+philipgough@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: philipgough <5781491+philipgough@users.noreply.github.com>
Port alert rules from dynatrace-config to close coverage gaps
identified during parity validation (SREP-3426).

Alerts added:
- api-RapidErrorBudgetBurn: fast-burn SLO for API server (>90% failure in 3m)
- KubeAPIServerRestartingFrequently: crash-looping API server detection
- CoreClusterOperatorDown: core operator health (console, network, monitoring, dns)
- DefaultIngressControllerDegraded: ingress controller degradation
- CertManagerCertExpirySoon: proactive cert expiry warning (21 days)
- CertManagerCertNotReady: cert renewal failure
- SAEDeploymentDoesNotHaveExpectedReplicas: partial SAE degradation

Recording rules added:
- sre:kube_apiserver:container_restarts_total
- hcp_worker_nodes:available_count
- core_cluster_operator:down:filtered

All alerts include hypershift_cluster_alerts_disabled suppression.

KubeAPIErrorBudgetBurn is deferred (needs apiserver_request:error_rate_*
recording rules -- separate follow-up).

Jira: SREP-3427
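
The 21-day window behind CertManagerCertExpirySoon can be illustrated with a little arithmetic. The sketch below is not the alert expression from the rule file (which is not shown here); it assumes the alert compares an expiry timestamp in Unix seconds, such as a sample of certmanager_certificate_expiration_timestamp_seconds, against the current time. The function and constant names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
WARN_DAYS = 21  # warning window stated in the commit message


def cert_expires_soon(expiry_ts: float, now: float, warn_days: int = WARN_DAYS) -> bool:
    """Return True when fewer than warn_days remain before expiry_ts.

    expiry_ts stands in for an expiry timestamp in Unix seconds; the
    comparison is an assumed shape for the alert, not a copy of it.
    """
    return expiry_ts - now < warn_days * SECONDS_PER_DAY


if __name__ == "__main__":
    now = 1_700_000_000.0  # arbitrary fixed "current time" for the example
    print(cert_expires_soon(now + 20 * SECONDS_PER_DAY, now))  # 20 days left
    print(cert_expires_soon(now + 30 * SECONDS_PER_DAY, now))  # 30 days left
```

Note that under this assumed comparison an already expired certificate also counts as "expiring soon", since its remaining time is negative.
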
Port the KubeAPI SLO error budget burn alerting from dynatrace-config.
This is the last alert rule gap identified in parity validation.

Adds kube-api-error-budget.yaml with:
- 8 base counter recording rules (read/write totals, errors, latency)
- 14 error rate recording rules (7 windows x 2 verbs)
- 4 alert variants (fast/medium/slow/very-slow burn rates)
- Targets 99.99% API server availability

Also fixes a bug from the original dynatrace-config where the 2h read
error rate denominator used 'instance' instead of 'namespace' in the
group-by clause.

Jira: SREP-3429
Two fixes for alerts that exist in rhobs-configuration but never fire
because of missing/wrong metrics:

1. ClusterOperatorDown: Rewrite to use cluster_operator_up (which is
   in the remote write allowlist) instead of cluster_operator_conditions
   (which is NOT collected). The alert was always evaluating to empty.

2. OauthServiceDeploymentDegraded: Add the sre:oauth:deployment_pods_unavailable
   recording rule, ported from dynatrace-config. Without this recording rule,
   the metric never existed so the alert never fired.

NodeHighResourceUsage is tracked separately in SREP-3430 (requires ~30
recording rules with instance-type-specific baselines).

Jira: SREP-3275
Revert ClusterOperatorDown to use cluster_operator_conditions (the
standard OpenShift metric) instead of the custom cluster_operator_up.
Add cluster_operator_conditions to both the CMO federation match list
and the remote write allowlist so it reaches the RHOBS cell.

Also add certmanager_certificate_expiration_timestamp_seconds to both
allowlists so CertManagerCertExpirySoon can evaluate, and add
certmanager_certificate_ready_status to the CMO federation match list
(was already in remote write but not federated from CMO).

Remove sre:oauth:deployment_pods_unavailable recording rule from
tenant-rules -- it depends on kube_deployment_status_replicas_unavailable
which is not forwarded to the RHOBS cell. The recording rule should run
on the MC (where the raw metric exists) and its result is already in
the remote write allowlist.

Jira: SREP-3430
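
The reasoning in this commit comes down to a dependency check: a rule evaluated on the RHOBS cell can only use series that are actually forwarded there. The sketch below models that with plain sets. The metric names come from the commit messages, but the set contents are a simplified stand-in for the real allowlists, and the function name is hypothetical.

```python
from __future__ import annotations


def missing_inputs(rule_inputs: set[str], forwarded: set[str]) -> set[str]:
    """Return the metrics a rule needs that never reach the evaluating side."""
    return rule_inputs - forwarded


# Simplified stand-in for what reaches the RHOBS cell after this commit.
FORWARDED_TO_CELL = {
    "cluster_operator_conditions",
    "certmanager_certificate_expiration_timestamp_seconds",
    "sre:oauth:deployment_pods_unavailable",  # recording rule result
}

if __name__ == "__main__":
    # The recording rule needs a raw metric that exists only on the MC,
    # so evaluating it on the cell would always produce an empty result.
    print(missing_inputs({"kube_deployment_status_replicas_unavailable"},
                         FORWARDED_TO_CELL))
    # The alert consumes the recording rule result, which is forwarded.
    print(missing_inputs({"sre:oauth:deployment_pods_unavailable"},
                         FORWARDED_TO_CELL))
```

A non-empty result for a rule means it should run where the raw metric lives, with only its output forwarded.
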
…ck (#1266)

Add PrometheusRule objects to the RHOBS monitoring stack SelectorSyncSet
so recording rules are evaluated on each MC where the raw metrics exist.
Results are forwarded to RHOBS cells via the existing remote write
allowlist.

Recording rules added:
- sre:oauth:deployment_pods_unavailable (for OauthServiceDeploymentDegraded)
- node:is_hypershift, hypershift_node:info
- hypershift_node:network:baseline_throughput_in_Gbps (10 instance types)
- hypershift_node:disk_io:baseline_throughput_in_Mbps (10 instance types)
- hypershift_node:consumption:{cpu,memory,network,filesystem,disk_io}
- hypershift_node:ready, hypershift_node:in_bad_condition
- hypershift_node:high_usage (7 threshold variants)

CMO federation match list additions (raw metrics needed by recording rules):
- kube_node_status_condition
- node_filesystem_avail_bytes, node_filesystem_size_bytes
- node_disk_written_bytes_total, node_disk_read_bytes_total

These replace recording rules that previously ran in the Dynatrace
Prometheus namespace (openshift-observability-dynatrace).

Jira: SREP-3430
… (#1267)

Document the process for validating changes in non-production RHOBS
cells before promoting to production via pinned SHA refs. Covers
the full workflow from merge to validation to promotion to rollback.

Jira: SREP-3431
…#1268)

Add SC-specific metric collection, recording rules, and alert rules
to enable RHOBS alerting on Service Clusters alongside MCs.

Monitoring stack template changes (SREP-3433, SREP-3434):
- Add ACM metrics to CMO federation match list and remote write
  allowlist: acm_managed_cluster_status_condition,
  acm_manifestwork_status_condition, acm_manifestwork_count
- Add ManifestWork recording rule results to remote write allowlist
- Deploy ACM ManifestWork recording rules as PrometheusRule on SCs

SC tenant rules (SREP-3435):
- acm-managed-clusters.yaml: ACMManagedClusterConditionUnknown,
  ACMManagedClusterKubeAPIServerUnavailable,
  ACMManagedClusterClientCertRotationFailed
- acm-grc.yaml: ACMPolicyControllerReconcileErrors
- acm-manifestwork.yaml: ACMManifestWorkAppliedHighFailureRate
- cert-manager.yaml: CertManagerCertExpirySoon, CertManagerCertNotReady
- observability.yaml: DeadMansSnitch

Jira: SREP-3432, SREP-3433, SREP-3434, SREP-3435
* SREP-3432: Enable RHOBS collection and alerting on Service Clusters

* Add sc.yaml generation script and make target

Add generate-sc-rules.sh to produce a combined sc.yaml from the
split domain files in resources/tenant-rules/sc/, matching the
pattern used for hcp.yaml. Add sc-rules make target.

Jira: SREP-3432
* Add manifests for rhobsp01sae1

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Add to bundle

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…write (#1272)

Add rhobs_route_monitor_operator_* metrics to both the federation
ServiceMonitor match[] and the writeRelabelConfigs allowlist so that
the new RMO Prometheus metrics are shipped from MCs to RHOBS cells.

Depends on: openshift/route-monitor-operator#464
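
As a rough illustration of how a keep-style allowlist admits a whole metric family, the sketch below matches metric names against a combined regex. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mirrors. The pattern list is hypothetical; the actual writeRelabelConfigs regex is not shown in this commit message.

```python
import re

# Hypothetical allowlist entries, not the real template contents.
ALLOWLIST_PATTERNS = [
    "cluster_operator_conditions",
    "rhobs_route_monitor_operator_.*",
]

# Join the entries into one alternation, as a single keep rule would.
ALLOWLIST_RE = re.compile("|".join(f"(?:{p})" for p in ALLOWLIST_PATTERNS))


def is_forwarded(metric_name: str) -> bool:
    """Return True if the metric name would pass the keep rule."""
    return ALLOWLIST_RE.fullmatch(metric_name) is not None


if __name__ == "__main__":
    for name in (
        "rhobs_route_monitor_operator_probe_success",  # hypothetical metric name
        "cluster_operator_conditions",
        "node_cpu_seconds_total",
    ):
        print(name, is_forwarded(name))
```

Because the match is anchored, a name that merely contains an allowlisted name as a substring is still dropped.
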
…273)

Split alerts by evaluation location:
- RMO alerts in tenant-rules/hcp/synthetics.yaml (ThanosRuler, evaluates
  against MC metrics remote-written to RHOBS cells)
- Synthetics-api/agent alerts in observability/prometheusrules/ (app-sre
  Prometheus, evaluates against locally scraped RHOBS cell metrics)

Note: synthetics-agent instances on backplane clusters are not covered
by these alerts yet -- only RHOBS-cell-deployed instances.
…274)

Add absence-of-signal detection alerts:
- SyntheticsAgentReconcileStale: fires when agent hasn't reconciled
  in 5+ minutes (it runs every 30s). Uses the existing
  last_reconciliation_timestamp_seconds metric.
- SyntheticsAPIMetricsAbsent: fires when API metrics target disappears
- SyntheticsAgentMetricsAbsent: fires when agent metrics target disappears
- RMOMetricsAbsent: fires when RMO metrics stop arriving from all MCs
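
The staleness condition behind SyntheticsAgentReconcileStale can be sketched as a timestamp comparison. This assumes the alert compares the current time with the value of last_reconciliation_timestamp_seconds; the actual rule expression is not shown here, and the helper names are hypothetical.

```python
RECONCILE_INTERVAL_S = 30   # agent reconciles every 30s (from the commit message)
STALE_AFTER_S = 5 * 60      # alert threshold: 5 minutes


def reconcile_is_stale(last_ts: float, now: float,
                       stale_after: float = STALE_AFTER_S) -> bool:
    """Return True when the last reconciliation is older than stale_after."""
    return now - last_ts > stale_after


def missed_cycles(last_ts: float, now: float,
                  interval: float = RECONCILE_INTERVAL_S) -> int:
    """Number of whole reconcile intervals elapsed since last_ts."""
    return int((now - last_ts) // interval)


if __name__ == "__main__":
    now = 1_700_000_000.0  # arbitrary fixed "current time" for the example
    last = now - 6 * 60    # last reconcile six minutes ago
    print(reconcile_is_stale(last, now), missed_cycles(last, now))
```

With a 30s interval, the 5-minute threshold corresponds to about ten missed cycles, which leaves room for a few slow reconciles before the alert fires. A timestamp check cannot detect a metric that has disappeared entirely, which is what the separate absent-metrics alerts cover.
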
…(#1275)

The synthetics-api and synthetics-agent PrometheusRules were previously
in resources/observability/prometheusrules/ which deploys to app-sre
monitoring clusters. Alerts there route through the app-sre Alertmanager
to AppSRE PagerDuty -- not to the RHOBS infra PD services.

Move all API/agent alerts (including watchdog alerts) into the tenant
rules at resources/tenant-rules/hcp/synthetics.yaml alongside the RMO
alerts. These are evaluated by ThanosRuler on each RHOBS cell and fire
through the RHOBS cell Alertmanager which has service=rhobs-synthetics
routing to the rhobs-infra-* PagerDuty services.

Remove the now-obsolete observability PrometheusRule files.
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…lection (#1277)

Template for deploying a MonitoringStack to app-sre OSD clusters that
federates OCM namespace metrics (uhc-*, osd-fleet-manager-*) from the
cluster CMO Prometheus and remote-writes them to a RHOBS cell.

Deployed directly via app-interface saas files (not SelectorSyncSet)
since these clusters are not Hive-managed.
…sterRoleBinding (#1278)

Move the federation ServiceMonitor from openshift-monitoring to the
deploy namespace with namespaceSelector to target openshift-monitoring.
This avoids cross-namespace resource deployment via saas-deploy.

Remove the ClusterRoleBinding from the template since saas-deploy
should not create cluster-scoped resources. The ClusterRoleBinding
needs to be deployed separately via namespace openshiftResources.
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…ion (#1282)

Add an OpenShift template for deploying a token-refresher to non-Hive
clusters (e.g. app-sre OSD clusters) for forwarding OCM component logs
to RHOBS cells via OTLP.

This is the logging equivalent of the existing
ocm-component-monitoring-stack-template.yaml used for metrics. It
deploys a Secret, Deployment, Service, and ConfigMap into the target
namespace, using rhobs-ocm-* naming to avoid conflicts with existing
token-refresher resources.

Jira: https://issues.redhat.com/browse/SREP-3646
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…(#1286)

This metric is scraped by the blackbox exporter ServiceMonitor but was
filtered out by the remote write relabel config. Adding it enables the
synthetic monitoring duration panels in Grafana dashboards.
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
… (#1288)

Loki rejects log entries from openshift-validation-webhook (~1MB) that
exceed the default maxLineSize limit of 262144 bytes (256 KiB). When a
batch contains an oversized entry, the entire batch is rejected with
HTTP 400, blocking all log delivery to the RHOBS cell.

Increase maxLineSize from 262144 (256 KiB) to 2097152 (2 MiB) across all
production cells to unblock log ingestion.

Jira: https://issues.redhat.com/browse/SREP-3712
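
The size arithmetic and the batch behavior described above can be modeled in a few lines. This is a simplified model of the behavior reported in the commit message (one oversized entry causes the whole batch to be rejected), not Loki's implementation; the function name is hypothetical.

```python
from typing import Iterable

OLD_MAX_LINE_SIZE = 256 * 1024        # 262144 bytes
NEW_MAX_LINE_SIZE = 2 * 1024 * 1024   # 2097152 bytes


def batch_accepted(line_sizes: Iterable[int], max_line_size: int) -> bool:
    """Return True only if every entry in the batch fits within the limit.

    Models the reported behavior: a single oversized entry causes the
    entire batch to be rejected.
    """
    return all(size <= max_line_size for size in line_sizes)


if __name__ == "__main__":
    # Two small entries plus one entry of roughly 1MB, as described above.
    batch = [512, 2_048, 1_000_000]
    print(batch_accepted(batch, OLD_MAX_LINE_SIZE))
    print(batch_accepted(batch, NEW_MAX_LINE_SIZE))
```

The new limit leaves roughly a factor of two of headroom over the ~1MB entries, so the small entries batched with them are delivered again.
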

7 participants