[Draft] Feature: Circuit Breaker for SolisCloud API Throttling#3788
[Draft] Feature: Circuit Breaker for SolisCloud API Throttling#3788CraigCallender wants to merge 1 commit intospringfall2008:mainfrom
Conversation
|
I suppose the question is why fail the API requests rather than wait, is it because you have to wait too long for their rate throttling to clear? |
There was a problem hiding this comment.
Pull request overview
Adds a circuit breaker around SolisCloud API calls to prevent cascading retries/call storms when SolisCloud returns “throttled” error codes via HTTP 200 responses, improving stability and log/metric signal during rate limiting.
Changes:
- Introduces circuit breaker state + backoff logic in
SolisAPI, gating requests at_execute_requestand short-circuitingrun()when OPEN. - Updates API failure metrics to label known throttle codes as
rate_limitand avoids burning_with_retrytime once the breaker trips. - Adds circuit-breaker-focused tests and updates cspell dictionary.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| apps/predbat/solis.py | Implements circuit breaker state machine and integrates it into request execution, retry behavior, metrics, and the component run loop. |
| apps/predbat/tests/test_solis.py | Adds tests covering breaker transitions/backoff and ensures run() and _with_retry short-circuit when OPEN. |
| .cspell/custom-dictionary-workspace.txt | Adds “retrips” to project dictionary. |
| self._cb_trip_count += 1 | ||
| self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME) | ||
| self._cb_state = "OPEN" | ||
| self._cb_open_at = time.monotonic() | ||
| self._cb_failure_count = 0 | ||
| self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures") |
There was a problem hiding this comment.
The log message on non-throttle trips reports ... — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures, but the breaker actually trips after SOLIS_CB_FAILURE_THRESHOLD consecutive failures (and _cb_failure_count is reset on trip). This is misleading for debugging; consider logging the actual threshold / current failure count instead.
| self._cb_trip_count += 1 | |
| self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME) | |
| self._cb_state = "OPEN" | |
| self._cb_open_at = time.monotonic() | |
| self._cb_failure_count = 0 | |
| self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures") | |
| failure_count = self._cb_failure_count | |
| self._cb_trip_count += 1 | |
| self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME) | |
| self._cb_state = "OPEN" | |
| self._cb_open_at = time.monotonic() | |
| self._cb_failure_count = 0 | |
| self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {failure_count} consecutive failures") |
| @property | ||
| def _cb_is_open(self): | ||
| """Return True if the circuit breaker is blocking outgoing calls.""" | ||
| if self._cb_state == "CLOSED": | ||
| return False | ||
| if self._cb_state == "OPEN": | ||
| elapsed = time.monotonic() - self._cb_open_at | ||
| if elapsed >= self._cb_recovery_time: | ||
| self._cb_state = "HALF_OPEN" | ||
| self.log(f"Solis API: Circuit breaker transitioning to HALF_OPEN after {elapsed:.0f}s — allowing probe request") | ||
| return False | ||
| return True | ||
| # HALF_OPEN — let the next probe request through | ||
| return False |
There was a problem hiding this comment.
Circuit breaker HALF_OPEN currently allows every subsequent API call through because _cb_is_open returns False for HALF_OPEN and there is no “single probe” gating. In a run cycle that makes many Solis calls, this can burst the API immediately after the recovery window and re-trigger throttling. Consider allowing exactly one probe request in HALF_OPEN (e.g., a _cb_probe_in_flight / _cb_probe_used flag checked in _execute_request, blocking additional calls with a clear error until success/failure closes or re-opens the breaker).
| def _cb_record_failure(self, response_code=None): | ||
| """Record a failed API call and trip the circuit breaker when appropriate.""" | ||
| if response_code is not None and response_code in SOLIS_CB_THROTTLE_CODES: | ||
| # Throttle code — trip immediately regardless of failure count | ||
| self._cb_trip_count += 1 | ||
| self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME) | ||
| self._cb_state = "OPEN" | ||
| self._cb_open_at = time.monotonic() | ||
| self._cb_failure_count = 0 | ||
| self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — throttle code {response_code}") | ||
| else: | ||
| self._cb_failure_count += 1 | ||
| if self._cb_failure_count >= SOLIS_CB_FAILURE_THRESHOLD: | ||
| self._cb_trip_count += 1 | ||
| self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME) | ||
| self._cb_state = "OPEN" | ||
| self._cb_open_at = time.monotonic() | ||
| self._cb_failure_count = 0 | ||
| self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures") | ||
|
|
There was a problem hiding this comment.
In _cb_record_failure, failures while in HALF_OPEN are treated like CLOSED (incrementing _cb_failure_count and only tripping after the threshold unless it’s a throttle code). That contradicts the intended circuit-breaker behavior (a failed probe should immediately re-open with backoff) and can leave the breaker stuck in HALF_OPEN while repeated calls continue. Consider special-casing _cb_state == "HALF_OPEN" so any failure re-trips OPEN and updates _cb_recovery_time via backoff.
Feature: Circuit Breaker for SolisCloud API Throttling
Note
I'm currently testing this locally, which is why it's a draft PR. Just posting here to get any thoughts on it before I submit, but feel free to ignore it as well until I finalise it.
Problem
The SolisCloud API implements rate limiting in a non-standard way: instead of returning HTTP 429 responses, it returns HTTP 200 with an error code in the JSON payload (e.g.
B0600: "Datalogger returns data abnormally. Please try again later"). Predbat detects this error and logs it, but the existing_with_retrylogic treated it like any transient failure and immediately retried — and then continued executing the next API call in the same cycle.The problem compounds because operations like
write_time_windows_if_changed()in V2 mode can require up to ~40+ API calls per cycle (read → write → verify for each changed register across 6 slots). When the first call hits the rate limit, the retry loop exhausts its 30-second window retrying that single call, then the cascade continues — each subsequent call also fails and retries, rapidly exhausting the API token budget. By the time the next 60-second cycle begins, the quota is still depleted, so those calls fail too, leading to a self-reinforcing failure loop that cannot recover without manually stopping Predbat and waiting ~5 minutes.Changes
solis.py— Circuit breaker patternA three-state circuit breaker (CLOSED → OPEN → HALF_OPEN) has been added to the
SolisAPIclass, gating all outgoing API calls at_execute_request.New constants:
SOLIS_CB_FAILURE_THRESHOLD = 3— consecutive non-throttle failures before trippingSOLIS_CB_INITIAL_RECOVERY_TIME = 60— seconds before first retry after tripSOLIS_CB_MAX_RECOVERY_TIME = 900— 15-minute cap on recovery waitSOLIS_CB_BACKOFF_MULTIPLIER = 2— recovery time doubles on each successive tripSOLIS_CB_THROTTLE_CODES = {"B0600", "B0173", "B0115"}— codes that trip the breaker immediatelyState machine:
SolisAPIError("Circuit breaker OPEN")— no HTTP request is made. Recovery time uses exponential backoff: 60s → 120s → 240s → 480s → 900s (capped), resetting to 60s on full recovery.Known throttle codes trip the breaker immediately (on the first failure, not after the threshold), preventing any cascade.
_with_retrybails immediately when the circuit is not CLOSED, so the 30-second retry window is not wasted after a trip — critical for preventing the multi-call cascade inwrite_time_windows_if_changed.run()short-circuits when the circuit is OPEN: logs one clean "Circuit breaker OPEN, skipping API calls this cycle (Xs remaining)" message per cycle instead of per-call failures flooding the log, and still callspublish_entities()so HA entities stay alive with stale-but-valid cached data.record_api_calluses the"rate_limit"failure reason (already a valid metric label) for throttle-code failures, enabling Prometheus dashboards to distinguish throttling from other errors.tests/test_solis.py— 10 new circuit breaker teststest_cb_closed_to_open_on_threshold— CLOSED → OPEN afterSOLIS_CB_FAILURE_THRESHOLDconsecutive non-throttle failurestest_cb_immediate_trip_on_throttle_code— CLOSED → OPEN immediately on B0600/B0173/B0115, no threshold requiredtest_cb_open_blocks_calls— OPEN state raisesSolisAPIErrorwithout making any HTTP requesttest_cb_open_to_half_open_after_recovery— OPEN → HALF_OPEN after recovery time elapsestest_cb_half_open_to_closed_on_success— HALF_OPEN → CLOSED on successful probe; trip count and recovery time fully resettest_cb_half_open_retrips_on_failure— HALF_OPEN → OPEN on failed probe; recovery time doublestest_cb_backoff_sequence_and_cap— validates 60s → 120s → 240s → 480s → 900s sequence and captest_cb_full_recovery_resets_trip_count— confirms trip count and recovery time reset to initial values after full recoverytest_cb_with_retry_bails_on_open—_with_retrymakes exactly 1 call (no retries) when circuit is OPENtest_cb_run_skips_when_open—run()returnsFalseand callspublish_entities()but makes no API calls when circuit is OPEN