[Draft] Feature: Circuit Breaker for SolisCloud API Throttling#3788

Open
CraigCallender wants to merge 1 commit into springfall2008:main from CraigCallender:feature/solis-cloud-circuit-breaker

Conversation

@CraigCallender
Contributor

Feature: Circuit Breaker for SolisCloud API Throttling

Note

I'm currently testing this locally, which is why it's a draft PR. Just posting here to get any thoughts on it before I submit, but feel free to ignore it as well until I finalise it.

Problem

The SolisCloud API implements rate limiting in a non-standard way: instead of returning HTTP 429 responses, it returns HTTP 200 with an error code in the JSON payload (e.g. B0600: "Datalogger returns data abnormally. Please try again later"). Predbat detects this error and logs it, but the existing _with_retry logic treats it like any transient failure: it retries immediately and then carries on to the next API call in the same cycle.

The problem compounds because operations like write_time_windows_if_changed() in V2 mode can require roughly 40 or more API calls per cycle (read → write → verify for each changed register across 6 slots). When the first call hits the rate limit, the retry loop spends its entire 30-second window retrying that single call. The cascade then continues: each subsequent call also fails and retries, rapidly exhausting the API token budget. By the time the next 60-second cycle begins, the quota is still depleted, so those calls fail too. The result is a self-reinforcing failure loop that cannot recover without manually stopping Predbat and waiting ~5 minutes.
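
To make the "HTTP 200 with an error code" behaviour concrete, here is a minimal sketch of how a response might be classified. The function name `classify_response`, the `code` field name, and the success code `"0"` are assumptions for illustration; only the three throttle codes come from this PR.

```python
import json

# Throttle codes listed in this PR.
THROTTLE_CODES = {"B0600", "B0173", "B0115"}


def classify_response(http_status, body_text):
    """Return "ok", "rate_limit", or "error" for a SolisCloud-style response.

    Assumption: the JSON body carries a "code" field and "0" means success.
    """
    if http_status == 429:
        return "rate_limit"  # the conventional signal, which SolisCloud does not send
    if http_status != 200:
        return "error"
    try:
        payload = json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "error"
    if not isinstance(payload, dict):
        return "error"
    code = str(payload.get("code", ""))
    if code in THROTTLE_CODES:
        return "rate_limit"  # throttling hidden inside a 200 response
    return "ok" if code == "0" else "error"
```

A caller that only inspects the HTTP status would treat the B0600 case as a success at the transport level, which is why the payload has to be checked before deciding whether to retry.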

Changes

solis.py — Circuit breaker pattern

A three-state circuit breaker (CLOSED → OPEN → HALF_OPEN) has been added to the SolisAPI class, gating all outgoing API calls at _execute_request.

New constants:

  • SOLIS_CB_FAILURE_THRESHOLD = 3 — consecutive non-throttle failures before tripping
  • SOLIS_CB_INITIAL_RECOVERY_TIME = 60 — seconds before first retry after trip
  • SOLIS_CB_MAX_RECOVERY_TIME = 900 — 15-minute cap on recovery wait
  • SOLIS_CB_BACKOFF_MULTIPLIER = 2 — recovery time doubles on each successive trip
  • SOLIS_CB_THROTTLE_CODES = {"B0600", "B0173", "B0115"} — codes that trip the breaker immediately
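
As a quick check of how these constants interact, the sketch below computes the recovery time for successive trips using the same formula that appears in the reviewed code. The helper name `recovery_time` is hypothetical.

```python
SOLIS_CB_INITIAL_RECOVERY_TIME = 60
SOLIS_CB_MAX_RECOVERY_TIME = 900
SOLIS_CB_BACKOFF_MULTIPLIER = 2


def recovery_time(trip_count):
    """Seconds to stay OPEN after the given 1-based trip number."""
    if trip_count < 1:
        raise ValueError("trip_count must be >= 1")
    uncapped = SOLIS_CB_INITIAL_RECOVERY_TIME * SOLIS_CB_BACKOFF_MULTIPLIER ** (trip_count - 1)
    return min(uncapped, SOLIS_CB_MAX_RECOVERY_TIME)


print([recovery_time(n) for n in range(1, 7)])  # [60, 120, 240, 480, 900, 900]
```

The fifth trip would be 960s uncapped, so the 900s cap takes effect from that point on.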

State machine:

  • CLOSED (normal): all API calls pass through; non-throttle failures increment a counter; the breaker trips when the counter reaches the threshold.
  • OPEN (tripped): all API calls are blocked immediately with SolisAPIError("Circuit breaker OPEN") — no HTTP request is made. Recovery time uses exponential backoff: 60s → 120s → 240s → 480s → 900s (capped), resetting to 60s on full recovery.
  • HALF_OPEN (probing): after the recovery window elapses, one probe request is allowed through. Success → CLOSED (full reset). Failure → OPEN again with doubled recovery time.

Known throttle codes trip the breaker immediately (on the first failure, not after the threshold), preventing any cascade.
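
The state machine described above can be sketched as a small standalone class. This is an illustration of the described behaviour, not the PR's code: the PR keeps this state in `_cb_*` attributes on SolisAPI, and the class name, method names, and injectable `clock` parameter here are all hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreaker:
    """Hypothetical standalone breaker following the PR description."""

    def __init__(self, threshold=3, initial=60, maximum=900, multiplier=2, clock=time.monotonic):
        self.threshold = threshold
        self.initial = initial
        self.maximum = maximum
        self.multiplier = multiplier
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self.failures = 0
        self.trips = 0
        self.recovery = initial
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self):
        """Return True if a request may be sent now."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.recovery:
            self.state = State.HALF_OPEN
            self.probe_in_flight = False
        if self.state is State.CLOSED:
            return True
        if self.state is State.HALF_OPEN and not self.probe_in_flight:
            self.probe_in_flight = True  # exactly one probe is let through
            return True
        return False

    def record_success(self):
        """Full reset: back to CLOSED with the initial recovery time."""
        self.state = State.CLOSED
        self.failures = 0
        self.trips = 0
        self.recovery = self.initial
        self.probe_in_flight = False

    def record_failure(self, throttled=False):
        """Trip on a throttle code, a failed probe, or the failure threshold."""
        self.failures += 1
        if throttled or self.state is State.HALF_OPEN or self.failures >= self.threshold:
            self._trip()

    def _trip(self):
        self.trips += 1
        self.recovery = min(self.initial * self.multiplier ** (self.trips - 1), self.maximum)
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
        self.probe_in_flight = False
```

Passing a fake clock, for example `CircuitBreaker(clock=lambda: now[0])` with a mutable `now` list, lets the OPEN → HALF_OPEN transition be exercised without waiting for the recovery window.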

_with_retry bails immediately when the circuit is not CLOSED, so the 30-second retry window is not wasted after a trip — critical for preventing the multi-call cascade in write_time_windows_if_changed.
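
A minimal synchronous sketch of that bail-out, under stated assumptions: the real _with_retry is not shown in this PR excerpt, so the signature, the `is_closed` callback, the `delay` parameter, and the `BreakerOpenError` class are hypothetical stand-ins.

```python
import time


class BreakerOpenError(Exception):
    """Hypothetical stand-in for SolisAPIError("Circuit breaker OPEN")."""


def with_retry(call, is_closed, window=30.0, delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Retry call() for up to `window` seconds, but stop as soon as the breaker trips."""
    deadline = clock() + window
    while True:
        try:
            return call()
        except BreakerOpenError:
            raise  # blocked before any HTTP request was made; retrying cannot help
        except Exception:
            # Give up if the failure tripped the breaker or the window is used up.
            if not is_closed() or clock() + delay > deadline:
                raise
            sleep(delay)
```

With this shape, a throttle-code failure that trips the breaker results in exactly one call, and every later call in the same cycle fails fast instead of consuming its own 30-second window.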

run() short-circuits when the circuit is OPEN: logs one clean "Circuit breaker OPEN, skipping API calls this cycle (Xs remaining)" message per cycle instead of per-call failures flooding the log, and still calls publish_entities() so HA entities stay alive with stale-but-valid cached data.
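
The short-circuit can be pictured roughly as follows. This is a simplified sketch: the real run() lives on the component and its signature is not shown here, so `run_cycle` and its parameters are hypothetical. Only the log wording and the False return value are taken from this PR.

```python
def run_cycle(breaker_open, seconds_remaining, fetch, publish_entities, log):
    """One polling cycle; returns False when the breaker skipped the API calls."""
    if breaker_open:
        # One message per cycle instead of one failure message per blocked call.
        log(f"Circuit breaker OPEN, skipping API calls this cycle ({seconds_remaining:.0f}s remaining)")
        publish_entities()  # republish cached values so entities stay available
        return False
    fetch()
    publish_entities()
    return True
```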

record_api_call uses the "rate_limit" failure reason (already a valid metric label) for throttle-code failures, enabling Prometheus dashboards to distinguish throttling from other errors.

tests/test_solis.py — 10 new circuit breaker tests

  • test_cb_closed_to_open_on_threshold — CLOSED → OPEN after SOLIS_CB_FAILURE_THRESHOLD consecutive non-throttle failures
  • test_cb_immediate_trip_on_throttle_code — CLOSED → OPEN immediately on B0600/B0173/B0115, no threshold required
  • test_cb_open_blocks_calls — OPEN state raises SolisAPIError without making any HTTP request
  • test_cb_open_to_half_open_after_recovery — OPEN → HALF_OPEN after recovery time elapses
  • test_cb_half_open_to_closed_on_success — HALF_OPEN → CLOSED on successful probe; trip count and recovery time fully reset
  • test_cb_half_open_retrips_on_failure — HALF_OPEN → OPEN on failed probe; recovery time doubles
  • test_cb_backoff_sequence_and_cap — validates 60s → 120s → 240s → 480s → 900s sequence and cap
  • test_cb_full_recovery_resets_trip_count — confirms trip count and recovery time reset to initial values after full recovery
  • test_cb_with_retry_bails_on_open — _with_retry makes exactly 1 call (no retries) when circuit is OPEN
  • test_cb_run_skips_when_open — run() returns False and calls publish_entities() but makes no API calls when circuit is OPEN
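
Time-dependent transitions such as OPEN → HALF_OPEN can be tested without sleeping by backdating the trip timestamp. The sketch below is not taken from tests/test_solis.py: `is_blocking` is a simplified copy of the OPEN-state check quoted in this PR's review, and a SimpleNamespace stands in for the SolisAPI instance.

```python
import time
from types import SimpleNamespace


def is_blocking(api):
    """Simplified copy of the OPEN-state check quoted in this PR's review."""
    if api._cb_state == "OPEN":
        if time.monotonic() - api._cb_open_at >= api._cb_recovery_time:
            api._cb_state = "HALF_OPEN"
            return False
        return True
    return False


def test_open_to_half_open_after_recovery():
    api = SimpleNamespace(_cb_state="OPEN", _cb_recovery_time=60, _cb_open_at=time.monotonic())
    assert is_blocking(api) is True
    api._cb_open_at -= 61  # backdate the trip instead of sleeping for 60s
    assert is_blocking(api) is False
    assert api._cb_state == "HALF_OPEN"
```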

@springfall2008 requested a review from Copilot April 15, 2026 14:16
@springfall2008
Owner

I suppose the question is: why fail the API requests rather than wait? Is it because you would have to wait too long for their rate throttling to clear?

Contributor

Copilot AI left a comment

Pull request overview

Adds a circuit breaker around SolisCloud API calls to prevent cascading retries/call storms when SolisCloud returns “throttled” error codes via HTTP 200 responses, improving stability and log/metric signal during rate limiting.

Changes:

  • Introduces circuit breaker state + backoff logic in SolisAPI, gating requests at _execute_request and short-circuiting run() when OPEN.
  • Updates API failure metrics to label known throttle codes as rate_limit and avoids burning _with_retry time once the breaker trips.
  • Adds circuit-breaker-focused tests and updates cspell dictionary.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| apps/predbat/solis.py | Implements circuit breaker state machine and integrates it into request execution, retry behavior, metrics, and the component run loop. |
| apps/predbat/tests/test_solis.py | Adds tests covering breaker transitions/backoff and ensures run() and _with_retry short-circuit when OPEN. |
| .cspell/custom-dictionary-workspace.txt | Adds “retrips” to project dictionary. |

Comment thread apps/predbat/solis.py
Comment on lines +389 to +394
self._cb_trip_count += 1
self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME)
self._cb_state = "OPEN"
self._cb_open_at = time.monotonic()
self._cb_failure_count = 0
self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures")

Copilot AI Apr 15, 2026

The log message on non-throttle trips reports ... — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures, but the breaker actually trips after SOLIS_CB_FAILURE_THRESHOLD consecutive failures (and _cb_failure_count is reset on trip). This is misleading for debugging; consider logging the actual threshold / current failure count instead.

Suggested change
- self._cb_trip_count += 1
- self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME)
- self._cb_state = "OPEN"
- self._cb_open_at = time.monotonic()
- self._cb_failure_count = 0
- self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures")
+ failure_count = self._cb_failure_count
+ self._cb_trip_count += 1
+ self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME)
+ self._cb_state = "OPEN"
+ self._cb_open_at = time.monotonic()
+ self._cb_failure_count = 0
+ self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {failure_count} consecutive failures")

Copilot uses AI. Check for mistakes.
Comment thread apps/predbat/solis.py
Comment on lines +396 to +409
@property
def _cb_is_open(self):
    """Return True if the circuit breaker is blocking outgoing calls."""
    if self._cb_state == "CLOSED":
        return False
    if self._cb_state == "OPEN":
        elapsed = time.monotonic() - self._cb_open_at
        if elapsed >= self._cb_recovery_time:
            self._cb_state = "HALF_OPEN"
            self.log(f"Solis API: Circuit breaker transitioning to HALF_OPEN after {elapsed:.0f}s — allowing probe request")
            return False
        return True
    # HALF_OPEN — let the next probe request through
    return False

Copilot AI Apr 15, 2026

Circuit breaker HALF_OPEN currently allows every subsequent API call through because _cb_is_open returns False for HALF_OPEN and there is no “single probe” gating. In a run cycle that makes many Solis calls, this can burst the API immediately after the recovery window and re-trigger throttling. Consider allowing exactly one probe request in HALF_OPEN (e.g., a _cb_probe_in_flight / _cb_probe_used flag checked in _execute_request, blocking additional calls with a clear error until success/failure closes or re-opens the breaker).

Comment thread apps/predbat/solis.py
Comment on lines +376 to +395
def _cb_record_failure(self, response_code=None):
    """Record a failed API call and trip the circuit breaker when appropriate."""
    if response_code is not None and response_code in SOLIS_CB_THROTTLE_CODES:
        # Throttle code — trip immediately regardless of failure count
        self._cb_trip_count += 1
        self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME)
        self._cb_state = "OPEN"
        self._cb_open_at = time.monotonic()
        self._cb_failure_count = 0
        self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — throttle code {response_code}")
    else:
        self._cb_failure_count += 1
        if self._cb_failure_count >= SOLIS_CB_FAILURE_THRESHOLD:
            self._cb_trip_count += 1
            self._cb_recovery_time = min(SOLIS_CB_INITIAL_RECOVERY_TIME * (SOLIS_CB_BACKOFF_MULTIPLIER ** (self._cb_trip_count - 1)), SOLIS_CB_MAX_RECOVERY_TIME)
            self._cb_state = "OPEN"
            self._cb_open_at = time.monotonic()
            self._cb_failure_count = 0
            self.log(f"Warn: Solis API: Circuit breaker OPEN for {self._cb_recovery_time:.0f}s (trip {self._cb_trip_count}) — {self._cb_trip_count * SOLIS_CB_FAILURE_THRESHOLD} consecutive failures")

Copilot AI Apr 15, 2026

In _cb_record_failure, failures while in HALF_OPEN are treated like CLOSED (incrementing _cb_failure_count and only tripping after the threshold unless it’s a throttle code). That contradicts the intended circuit-breaker behavior (a failed probe should immediately re-open with backoff) and can leave the breaker stuck in HALF_OPEN while repeated calls continue. Consider special-casing _cb_state == "HALF_OPEN" so any failure re-trips OPEN and updates _cb_recovery_time via backoff.

@springfall2008 marked this pull request as ready for review April 16, 2026 18:55

3 participants