Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
* Modifying template to use resourceTemplates on hive selector syncsets
Signed-off-by: Moad Zardab <mzardab@redhat.com>
* Modifying template to use resourceTemplates on hive selector syncsets
Signed-off-by: Moad Zardab <mzardab@redhat.com>
---------
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Signed-off-by: Moad Zardab <mzardab@redhat.com>
…lowlist (node/kube metrics) (#1029)
Signed-off-by: Moad Zardab <mzardab@redhat.com>
…cs to al…" (#1033) This reverts commit 4522634.
Adds kube_namespace_labels
…(#1054)
Signed-off-by: Moad Zardab <mzardab@redhat.com>
Co-authored-by: Philip Gough <philip.p.gough@gmail.com>
* Initial plan
* Remove ThanosQuery configuration file from staging cluster
Co-authored-by: philipgough <5781491+philipgough@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: philipgough <5781491+philipgough@users.noreply.github.com>
…" (#1259) This reverts commit 6847955.
Port alert rules from dynatrace-config to close coverage gaps identified
during parity validation (SREP-3426).
Alerts added:
- api-RapidErrorBudgetBurn: fast-burn SLO for API server (>90% failure in 3m)
- KubeAPIServerRestartingFrequently: crash-looping API server detection
- CoreClusterOperatorDown: core operator health (console, network, monitoring, dns)
- DefaultIngressControllerDegraded: ingress controller degradation
- CertManagerCertExpirySoon: proactive cert expiry warning (21 days)
- CertManagerCertNotReady: cert renewal failure
- SAEDeploymentDoesNotHaveExpectedReplicas: partial SAE degradation
Recording rules added:
- sre:kube_apiserver:container_restarts_total
- hcp_worker_nodes:available_count
- core_cluster_operator:down:filtered
All alerts include hypershift_cluster_alerts_disabled suppression.
KubeAPIErrorBudgetBurn is deferred (needs apiserver_request:error_rate_*
recording rules -- separate follow-up).
Jira: SREP-3427
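The 21-day window behind CertManagerCertExpirySoon can be sketched as a plain timestamp comparison. This is only an illustration: the 21-day figure comes from the commit message, while the function name and inputs are hypothetical and the actual alert expression is not shown in this PR listing.

```python
from datetime import datetime, timedelta, timezone

# From the commit message: warn 21 days before expiry.
EXPIRY_WARNING_WINDOW = timedelta(days=21)


def cert_expires_soon(expiration_ts: float, now_ts: float) -> bool:
    """Return True when the certificate expires inside the warning window.

    Hypothetical helper: expiration_ts stands in for the value of an
    expiration-timestamp metric (seconds since the Unix epoch).
    """
    remaining_seconds = expiration_ts - now_ts
    return remaining_seconds < EXPIRY_WARNING_WINDOW.total_seconds()


# Illustrative timestamps only.
now = datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp()
in_20_days = now + 20 * 86400
in_30_days = now + 30 * 86400
```

A certificate expiring in 20 days falls inside the window, while one expiring in 30 days does not.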
Port the KubeAPI SLO error budget burn alerting from dynatrace-config.
This is the last alert rule gap identified in parity validation.
Adds kube-api-error-budget.yaml with:
- 8 base counter recording rules (read/write totals, errors, latency)
- 14 error rate recording rules (7 windows x 2 verbs)
- 4 alert variants (fast/medium/slow/very-slow burn rates)
- Targets 99.99% API server availability
Also fixes a bug from the original dynatrace-config where the 2h read
error rate denominator used 'instance' instead of 'namespace' in the
group-by clause.
Jira: SREP-3429
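The arithmetic behind this commit can be sketched as follows. The 99.99% target, the four variant names, and the "7 windows x 2 verbs" count come from the commit message. The burn-rate factors and the window names are assumptions borrowed from the common multiwindow burn-rate convention and may differ from what kube-api-error-budget.yaml actually uses.

```python
from itertools import product

SLO_TARGET = 0.9999            # from the commit message
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

# Assumption: conventional multiwindow burn-rate factors, not taken from the PR.
BURN_RATES = {"fast": 14.4, "medium": 6.0, "slow": 3.0, "very-slow": 1.0}

# Assumption: illustrative window names; the commit only states there are 7.
WINDOWS = ["5m", "30m", "1h", "2h", "6h", "1d", "3d"]
VERBS = ["read", "write"]


def error_rate_threshold(burn_rate: float) -> float:
    """Error rate at which the budget burns `burn_rate` times too fast."""
    return burn_rate * ERROR_BUDGET


def error_rate_rule_keys() -> list[tuple[str, str]]:
    """One (window, verb) pair per error rate recording rule."""
    return list(product(WINDOWS, VERBS))


thresholds = {name: error_rate_threshold(rate) for name, rate in BURN_RATES.items()}
```

Under these assumptions the fast-burn variant triggers at an error rate of roughly 0.144%, and the window/verb product yields the 14 recording rules the commit describes.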
Two fixes for alerts that exist in rhobs-configuration but never fire
because of missing/wrong metrics:
1. ClusterOperatorDown: Rewrite to use cluster_operator_up (which is in
   the remote write allowlist) instead of cluster_operator_conditions
   (which is NOT collected). The alert was always evaluating to empty.
2. OauthServiceDeploymentDegraded: Add the
   sre:oauth:deployment_pods_unavailable recording rule, ported from
   dynatrace-config. Without this recording rule, the metric never
   existed so the alert never fired.
NodeHighResourceUsage is tracked separately in SREP-3430 (requires ~30
recording rules with instance-type-specific baselines).
Jira: SREP-3275
Revert ClusterOperatorDown to use cluster_operator_conditions (the
standard OpenShift metric) instead of the custom cluster_operator_up.
Add cluster_operator_conditions to both the CMO federation match list
and the remote write allowlist so it reaches the RHOBS cell.
Also add certmanager_certificate_expiration_timestamp_seconds to both
allowlists so CertManagerCertExpirySoon can evaluate, and add
certmanager_certificate_ready_status to the CMO federation match list
(was already in remote write but not federated from CMO).
Remove sre:oauth:deployment_pods_unavailable recording rule from
tenant-rules -- it depends on kube_deployment_status_replicas_unavailable
which is not forwarded to the RHOBS cell. The recording rule should run
on the MC (where the raw metric exists) and its result is already in the
remote write allowlist.
Jira: SREP-3430
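As the commit describes it, a raw metric reaches the RHOBS cell only when it appears in both the CMO federation match list and the remote write allowlist. A minimal sketch of that check, with hypothetical function and variable names and example lists that reflect only the state described in the message above:

```python
from __future__ import annotations


def forwarding_gaps(federated: set[str], remote_write: set[str]) -> dict[str, set[str]]:
    """Classify metrics by which of the two lists they appear in."""
    return {
        "reaches_cell": federated & remote_write,
        "federated_only": federated - remote_write,      # scraped but never shipped
        "remote_write_only": remote_write - federated,   # allowed but never scraped
    }


# Illustrative state before this commit: the ready-status metric was in the
# remote write allowlist but was not federated from CMO.
before = forwarding_gaps(
    federated=set(),
    remote_write={"certmanager_certificate_ready_status"},
)

# After the commit, the metrics named in the message are in both lists.
names = {
    "cluster_operator_conditions",
    "certmanager_certificate_expiration_timestamp_seconds",
    "certmanager_certificate_ready_status",
}
after = forwarding_gaps(federated=set(names), remote_write=set(names))
```

A check like this makes the "allowed but never federated" case visible, which is the gap this commit closes for certmanager_certificate_ready_status.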
…ck (#1266)
Add PrometheusRule objects to the RHOBS monitoring stack SelectorSyncSet
so recording rules are evaluated on each MC where the raw metrics exist.
Results are forwarded to RHOBS cells via the existing remote write
allowlist.
Recording rules added:
- sre:oauth:deployment_pods_unavailable (for OauthServiceDeploymentDegraded)
- node:is_hypershift, hypershift_node:info
- hypershift_node:network:baseline_throughput_in_Gbps (10 instance types)
- hypershift_node:disk_io:baseline_throughput_in_Mbps (10 instance types)
- hypershift_node:consumption:{cpu,memory,network,filesystem,disk_io}
- hypershift_node:ready, hypershift_node:in_bad_condition
- hypershift_node:high_usage (7 threshold variants)
CMO federation match list additions (raw metrics needed by recording rules):
- kube_node_status_condition
- node_filesystem_avail_bytes, node_filesystem_size_bytes
- node_disk_written_bytes_total, node_disk_read_bytes_total
These replace recording rules that previously ran in the Dynatrace
Prometheus namespace (openshift-observability-dynatrace).
Jira: SREP-3430
… (#1267)
Document the process for validating changes in non-production RHOBS
cells before promoting to production via pinned SHA refs. Covers the
full workflow from merge to validation to promotion to rollback.
Jira: SREP-3431
…#1268)
Add SC-specific metric collection, recording rules, and alert rules to
enable RHOBS alerting on Service Clusters alongside MCs.
Monitoring stack template changes (SREP-3433, SREP-3434):
- Add ACM metrics to CMO federation match list and remote write allowlist:
  acm_managed_cluster_status_condition, acm_manifestwork_status_condition,
  acm_manifestwork_count
- Add ManifestWork recording rule results to remote write allowlist
- Deploy ACM ManifestWork recording rules as PrometheusRule on SCs
SC tenant rules (SREP-3435):
- acm-managed-clusters.yaml: ACMManagedClusterConditionUnknown,
  ACMManagedClusterKubeAPIServerUnavailable,
  ACMManagedClusterClientCertRotationFailed
- acm-grc.yaml: ACMPolicyControllerReconcileErrors
- acm-manifestwork.yaml: ACMManifestWorkAppliedHighFailureRate
- cert-manager.yaml: CertManagerCertExpirySoon, CertManagerCertNotReady
- observability.yaml: DeadMansSnitch
Jira: SREP-3432, SREP-3433, SREP-3434, SREP-3435
* SREP-3432: Enable RHOBS collection and alerting on Service Clusters
Add SC-specific metric collection, recording rules, and alert rules to
enable RHOBS alerting on Service Clusters alongside MCs.
Monitoring stack template changes (SREP-3433, SREP-3434):
- Add ACM metrics to CMO federation match list and remote write allowlist:
  acm_managed_cluster_status_condition, acm_manifestwork_status_condition,
  acm_manifestwork_count
- Add ManifestWork recording rule results to remote write allowlist
- Deploy ACM ManifestWork recording rules as PrometheusRule on SCs
SC tenant rules (SREP-3435):
- acm-managed-clusters.yaml: ACMManagedClusterConditionUnknown,
  ACMManagedClusterKubeAPIServerUnavailable,
  ACMManagedClusterClientCertRotationFailed
- acm-grc.yaml: ACMPolicyControllerReconcileErrors
- acm-manifestwork.yaml: ACMManifestWorkAppliedHighFailureRate
- cert-manager.yaml: CertManagerCertExpirySoon, CertManagerCertNotReady
- observability.yaml: DeadMansSnitch
Jira: SREP-3432, SREP-3433, SREP-3434, SREP-3435
* Add sc.yaml generation script and make target
Add generate-sc-rules.sh to produce a combined sc.yaml from the split
domain files in resources/tenant-rules/sc/, matching the pattern used
for hcp.yaml. Add sc-rules make target.
Jira: SREP-3432
* Add manifests for rhobsp01sae1
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
* Add to bundle
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
---------
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…write (#1272)
Add rhobs_route_monitor_operator_* metrics to both the federation
ServiceMonitor match[] and the writeRelabelConfigs allowlist so that
RMO's new Prometheus metrics are shipped from MCs to RHOBS cells.
Depends on: openshift/route-monitor-operator#464
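The effect of an allowlist entry for a metric prefix can be sketched with a fully anchored regular expression, which is how Prometheus relabeling treats its `regex` field. The pattern below and the example metric suffix are assumptions for illustration; the actual writeRelabelConfigs entry is not shown in this PR listing.

```python
import re

# Assumption: an allowlist pattern covering the rhobs_route_monitor_operator_*
# prefix named in the commit message.
ALLOWLIST_PATTERN = re.compile(r"rhobs_route_monitor_operator_.*")


def is_forwarded(metric_name: str) -> bool:
    """Return True when the whole metric name matches the allowlist pattern.

    fullmatch mirrors the anchored matching used by Prometheus relabel rules:
    a pattern must cover the entire name, not just a substring.
    """
    return ALLOWLIST_PATTERN.fullmatch(metric_name) is not None


# Hypothetical metric names, used only to exercise the check.
examples = {
    "rhobs_route_monitor_operator_probe_total": is_forwarded(
        "rhobs_route_monitor_operator_probe_total"
    ),
    "up": is_forwarded("up"),
    "x_rhobs_route_monitor_operator_probe_total": is_forwarded(
        "x_rhobs_route_monitor_operator_probe_total"
    ),
}
```

Because the match is anchored, a name that merely contains the prefix somewhere in the middle is not forwarded.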
…273)
Split alerts by evaluation location:
- RMO alerts in tenant-rules/hcp/synthetics.yaml (ThanosRuler, evaluates
  against MC metrics remote-written to RHOBS cells)
- Synthetics-api/agent alerts in observability/prometheusrules/ (app-sre
  Prometheus, evaluates against locally scraped RHOBS cell metrics)
Note: synthetics-agent instances on backplane clusters are not covered
by these alerts yet -- only RHOBS-cell-deployed instances.
…274)
Add absence-of-signal detection alerts:
- SyntheticsAgentReconcileStale: fires when agent hasn't reconciled in
  5+ minutes (it runs every 30s). Uses the existing
  last_reconciliation_timestamp_seconds metric.
- SyntheticsAPIMetricsAbsent: fires when API metrics target disappears
- SyntheticsAgentMetricsAbsent: fires when agent metrics target disappears
- RMOMetricsAbsent: fires when RMO metrics stop arriving from all MCs
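The staleness condition behind SyntheticsAgentReconcileStale amounts to comparing the last reconciliation timestamp against the current time. A minimal sketch, assuming the 30-second interval and 5-minute threshold stated in the commit message; the function name and sample timestamps are hypothetical, and the real alert expression may differ.

```python
RECONCILE_INTERVAL_SECONDS = 30  # from the commit message
STALE_AFTER_SECONDS = 5 * 60     # from the commit message ("5+ minutes")

# Number of expected reconcile cycles that must be missed before the alert fires.
MISSED_CYCLES_BEFORE_ALERT = STALE_AFTER_SECONDS // RECONCILE_INTERVAL_SECONDS


def reconcile_is_stale(last_reconcile_ts: float, now_ts: float) -> bool:
    """Return True when the last reconciliation is older than the threshold."""
    return now_ts - last_reconcile_ts > STALE_AFTER_SECONDS


# Illustrative timestamps in seconds.
now = 1_000_000.0
recent = now - 45   # one and a half intervals ago
old = now - 301     # just past the five-minute threshold
```

With a 30-second interval, the five-minute threshold tolerates ten missed cycles, which leaves room for brief restarts without firing.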
…(#1275)
The synthetics-api and synthetics-agent PrometheusRules were previously
in resources/observability/prometheusrules/ which deploys to app-sre
monitoring clusters. Alerts there route through the app-sre Alertmanager
to AppSRE PagerDuty -- not to the RHOBS infra PD services.
Move all API/agent alerts (including watchdog alerts) into the tenant
rules at resources/tenant-rules/hcp/synthetics.yaml alongside the RMO
alerts. These are evaluated by ThanosRuler on each RHOBS cell and fire
through the RHOBS cell Alertmanager which has service=rhobs-synthetics
routing to the rhobs-infra-* PagerDuty services.
Remove the now-obsolete observability PrometheusRule files.
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…lection (#1277)
Template for deploying a MonitoringStack to app-sre OSD clusters that
federates OCM namespace metrics (uhc-*, osd-fleet-manager-*) from the
cluster CMO Prometheus and remote-writes them to a RHOBS cell.
Deployed directly via app-interface saas files (not SelectorSyncSet)
since these clusters are not Hive-managed.
…sterRoleBinding (#1278)
Move the federation ServiceMonitor from openshift-monitoring to the
deploy namespace with namespaceSelector to target openshift-monitoring.
This avoids cross-namespace resource deployment via saas-deploy.
Remove the ClusterRoleBinding from the template since saas-deploy should
not create cluster-scoped resources. The ClusterRoleBinding needs to be
deployed separately via namespace openshiftResources.
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…ion (#1282)
Add an OpenShift template for deploying a token-refresher to non-Hive
clusters (e.g. app-sre OSD clusters) for forwarding OCM component logs
to RHOBS cells via OTLP.
This is the logging equivalent of the existing
ocm-component-monitoring-stack-template.yaml used for metrics. It
deploys a Secret, Deployment, Service, and ConfigMap into the target
namespace, using rhobs-ocm-* naming to avoid conflicts with existing
token-refresher resources.
Jira: https://issues.redhat.com/browse/SREP-3646
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…(#1286)
This metric is scraped by the blackbox exporter ServiceMonitor but was
filtered out by the remote write relabel config. Adding it enables the
synthetic monitoring duration panels in Grafana dashboards.
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
… (#1288)
Loki rejects log entries from openshift-validation-webhook (~1MB) that
exceed the default 262144-byte (256KB) maxLineSize limit. When a batch
contains an oversized entry, the entire batch is dropped via HTTP 400,
blocking all log delivery to the RHOBS cell.
Increase maxLineSize from 262144 (256KB) to 2097152 (2MB) across all
production cells to unblock log ingestion.
Jira: https://issues.redhat.com/browse/SREP-3712
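The size arithmetic and the all-or-nothing batch behavior described above can be sketched as follows. The two limits come from the commit message. The function is a simplified, hypothetical model of the rejection behavior the message describes and is not Loki's implementation.

```python
OLD_MAX_LINE_SIZE = 262144    # 256 * 1024 bytes, the previous limit
NEW_MAX_LINE_SIZE = 2097152   # 2 * 1024 * 1024 bytes, the new limit


def batch_accepted(entry_sizes: list[int], max_line_size: int) -> bool:
    """Model the described behavior: one oversized entry rejects the whole batch."""
    return all(size <= max_line_size for size in entry_sizes)


# Illustrative batch: two small entries plus one entry of roughly 1MB,
# standing in for the openshift-validation-webhook log line.
batch = [200, 1_000_000, 512]

accepted_before = batch_accepted(batch, OLD_MAX_LINE_SIZE)
accepted_after = batch_accepted(batch, NEW_MAX_LINE_SIZE)
```

Under the old limit the roughly 1MB entry causes the two small entries to be dropped with it, while the 2MB limit lets the whole batch through.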
No description provided.