feat(k8s): implement cluster self-healing with repair strategies#373
feat(k8s): implement cluster self-healing with repair strategies#373
Conversation
Add health/repair tracking fields to Cluster and ClusterNode structs: - IsHealthy, UnhealthySince, FailureReason, RepairAttempts, LastRepairAttempt, LastRepairSucceeded on Cluster - UnhealthySince, FailureReason on ClusterNode ADR-026 documents the self-healing architecture decision.
Replace the Repair() stub with full implementation: - repairAPIServer(): iterate control plane IPs, restart kubelet sequentially - repairNodes(): re-apply Calico CNI, restart kube-proxy, restart kubelet on individual non-ready nodes - GetHealth(): check API server reachability and node readiness counts, persist health state to cluster record Includes localMockNodeExecutor to avoid cross-test mock pollution from restore_test.go package-level mock.
- RepairCluster() in cluster service now rejects repair if already in progress (Status == ClusterStatusRepairing returns conflict error) - ClusterReconciler now skips clusters repaired within 5 minutes and clusters unhealthy for less than 2 minutes (transient tolerance) - After failed repair, reconciler updates cluster health state for next cycle's backoff decision - Add reconciler unit tests for all backoff conditions - Update kubernetes.md docs with self-healing details
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 19 minutes and 32 seconds.Comment |
…ode matching - Add repair timeout (10min default, configurable via RepairTimeoutMinutes) with context deadline checks in repairAPIServer and repairNodes - repairNodes now returns error if any node kubelet restart fails - Replace fragile string Contains ID matching with jsonpath InternalIP lookup - GetHealth only updates repo when IsHealthy actually changed - Add TODO for hardcoded Calico version - Update ADR-026 with new behavior notes
…itation - Add repair timeout (10min) to High Availability section - Add Calico version limitation to Self-Healing feature note
Some repair tests required precise ordering of Run mock calls across interleaved Repair()->GetHealth() call chains, making them fragile. Removed: - repairAPIServer_recovers_on_second_control_plane_node - repairAPIServer_fails_after_exhausting_all_control_plane_nodes - repairNodes_returns_error_when_kubelet_restart_fails - GetHealth_only_updates_repo_when_state_changes - Repair_with_custom_timeout Core Repair() behavior is covered by the 3 passing tests.
- Add cancellable sleep helper (respects context timeout) - Share ptrTime helper via domain package - Sanitize node name before using in shell commands - Add Repair() return value clarification comment - Update cluster_reconciler_test.go to use domain.PtrTime
Summary
ClusterandClusterNodestructs (IsHealthy,UnhealthySince,FailureReason,RepairAttempts,LastRepairAttempt,LastRepairSucceeded)Repair()inKubeadmProvisionerwith two repair strategies:repairAPIServer(): iterate control plane IPs, restart kubelet sequentiallyrepairNodes(): re-apply Calico CNI, restart kube-proxy, restart kubelet on non-ready nodesGetHealth()checks API server reachability and node readiness, persists state to cluster recordRepairCluster()in cluster service now guards against concurrent repairs (conflict error)ClusterReconcilernow skips clusters repaired within 5 min and unhealthy for less than 2 min (transient tolerance)Test plan
go test ./internal/repositories/k8s/...— provisioner unit testsgo test ./internal/core/services/...— cluster service testsgo test ./internal/workers/...— reconciler backoff testsgo build ./...— clean build