feat(alerts): record_alert primitive + AlertFeed/Octopus producers #3830
Conversation
Add a general-purpose alert recording mechanism so any component can surface a user-facing issue without building a bespoke publication pipeline.

- `PredBat.record_alert(category, severity, title, message, dedup_key, metadata, expires_at, action_url)` in `output.py`
- `ComponentBase.record_alert` delegate so components can call `self.record_alert(...)` naturally (same pattern as `record_status`)
- `self._active_alerts` dict tracks currently-active alerts keyed by `dedup_key` (or `category::title`); initialized in `PredBat.__init__`
- Publishes to `sensor.<prefix>.system_alerts` — state on/off, attribute `alerts` carries the list sorted critical > warning > info
- TTL-only lifecycle: producers re-record each cycle while the condition holds; alerts with `expires_at` in the past are pruned on the next publish. Fast-resolve is re-recording with `expires_at=now`. No explicit `clear_alert` — keeps the contract small.
- No consumers yet; separate commits wire `AlertFeed` and `octopus.py` (IOG smart control lost) to call this
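A minimal sketch of how a producer might use the delegate. Only the `record_alert` parameter list and the TTL-only lifecycle come from this PR; the component, condition, category, and values below are invented for illustration:

```python
from datetime import timedelta

def raise_comms_alert(component, now):
    """Illustrative producer: raise/refresh a warning while a fault persists."""
    component.record_alert(
        category="inverter",             # hypothetical category
        severity="warning",
        title="Inverter unreachable",
        message="No response from the inverter for 3 consecutive cycles",
        dedup_key="inverter:comms_lost",
        metadata={"retries": 3},
        # TTL-only lifecycle: re-recorded every cycle while the fault persists,
        # pruned automatically once it stops being refreshed and the TTL passes.
        expires_at=(now + timedelta(hours=1)).isoformat(),
    )
```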
Dual-publish: the existing HA entity `sensor.<prefix>_alertfeed_status` is unchanged (backward compat for HA users); weather alerts are also recorded via the new `record_alert()` primitive so they appear on `sensor.<prefix>.system_alerts` alongside other categories.

CAP severity maps to framework severity:

- Extreme -> critical
- Severe -> warning
- Moderate -> info
- Minor -> info

Dedup key: `weather:<event>:<onset>` - stable across cycles of the same CAP alert so TTL refresh works without duplicates. When `keep_reserve` applies, it is carried in `metadata` for downstream consumers to render a reserve-hold hint.
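The actual mapping in `alertfeed.py` (shown further down in this review) is a chained conditional; as an equivalent illustration only, it could be written as a lookup with an `info` default:

```python
# Equivalent illustration of the CAP -> framework severity mapping.
CAP_TO_FRAMEWORK = {"extreme": "critical", "severe": "warning", "moderate": "info", "minor": "info"}

def framework_severity(cap_severity):
    # Unknown or missing CAP severities fall back to "info".
    return CAP_TO_FRAMEWORK.get((cap_severity or "").lower(), "info")

def weather_dedup_key(event, onset):
    # Stable across cycles of the same CAP alert, so TTL refresh deduplicates.
    return "weather:{}:{}".format(event, onset if onset else "no-onset")
```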
When a customer's IOG charger lands in `SMART_CONTROL_NOT_AVAILABLE` (or any non-CAPABLE state) for more than 24 hours, raise a `severity=warning` system alert via `record_alert`. The alert nudges the customer to re-authorise MyEnergi in the Octopus app.

Implementation:

- Capture `currentState` from the dispatches query (already fetched, previously discarded) into each intelligent device dict
- Track `first_seen` per device in `self.smart_control_degraded_since`
- Clear tracking when `currentState` returns to `SMART_CONTROL_CAPABLE`
- After 24h degraded, re-record each sensor cycle with a 2h TTL so the alert stays alive while the condition persists and auto-expires once it clears (per the TTL-only lifecycle)

Closes the original motivating case: a customer lost IOG smart control for ~48h with no feedback; `plannedDispatches` silently returned empty and the only visible symptom was PredBat ignoring the cheap charging window.
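A minimal sketch of the detection logic described above. The 24h threshold, 2h TTL, `smart_control_degraded_since` dict, and `iog_smart_control_lost:<device_id>` dedup key come from this PR; the function name, `category`, and message text are assumptions for the example:

```python
from datetime import timedelta

CAPABLE = "SMART_CONTROL_CAPABLE"

def check_smart_control(octo, device_id, current_state, now):
    """Illustrative only: track degradation and re-record a warning after 24h."""
    if current_state == CAPABLE:
        # Back to normal: stop tracking; any existing alert simply ages out (TTL-only).
        octo.smart_control_degraded_since.pop(device_id, None)
        return
    first_seen = octo.smart_control_degraded_since.setdefault(device_id, now)
    if now - first_seen >= timedelta(hours=24):
        octo.record_alert(
            category="octopus",  # hypothetical category name
            severity="warning",
            title="Octopus smart control lost",
            message="Device {} has been {} for over 24h; re-authorise MyEnergi in the Octopus app".format(device_id, current_state),
            dedup_key="iog_smart_control_lost:{}".format(device_id),
            # Re-recorded each sensor cycle with a 2h TTL while the condition persists.
            expires_at=(now + timedelta(hours=2)).isoformat(),
        )
```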
Code review
Found 2 issues:

batpred/apps/predbat/output.py Lines 2471 to 2486 in 93d781b
batpred/apps/predbat/octopus.py Lines 375 to 381 in 93d781b

1. `_publish_system_alerts` compared `expires_at` strings lexicographically against `now_utc_exact.isoformat()`. Producers legitimately use different timezone offsets (CAP weather = +00:00, octopus code = local offset), and string ordering across offsets is not chronological. Parse `expires_at` via `datetime.fromisoformat` (handling a trailing 'Z' for older Pythons) and compare against a timezone-aware UTC now. Treat naive timestamps as UTC.
2. `smart_control_degraded_since` was in-memory only, so every AppDaemon or pod restart reset the 24h alert clock for IOG devices that had been non-capable for days. Persist it to the existing octopus YAML cache (alongside `intelligent_devices`, `saving_sessions`, etc.) as ISO strings; rehydrate into `datetime` on load. The 24h window now survives restarts.

🤖 Generated with Claude Code
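A small illustration of the first issue: why string ordering across offsets is not chronological, and what the parsed comparison looks like. The timestamp values and the `parse_expires` helper name are made up; only the failure mode and the `fromisoformat` approach come from the review:

```python
from datetime import datetime, timezone

# Two expiry timestamps produced with different offsets (made-up values).
a = "2024-06-01T10:00:00+01:00"    # 09:00 UTC
b = "2024-06-01T09:30:00+00:00"    # 09:30 UTC

# Lexicographic string ordering claims b is earlier, but chronologically a is earlier.
print(a < b)  # False (string order)

def parse_expires(value):
    """Parse an ISO-8601 expiry, tolerating a trailing 'Z' and naive values."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # treat naive as UTC
    return parsed

print(parse_expires(a) < parse_expires(b))  # True (chronological order)
```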
Code review
Found 1 issue:

batpred/apps/predbat/output.py Lines 2474 to 2481 in 6ef15be

Was calling `datetime.now(timezone.utc)` directly, ignoring the engine's canonical cycle time. `record_status`, `_maybe_raise_smart_control_alert`, and the `recorded_at` field on each alert all use `self.now_utc_exact` — pruning should too, so mocked-time tests and deterministic plan cycles see a consistent view of 'now'. `self.now_utc_exact` is aware (`datetime.now(self.local_tz)`); Python compares aware datetimes across timezones correctly, so the comparison against an aware `expires_dt` still works. Ref: PR review

🤖 Generated with Claude Code
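A quick illustration (made-up timezone and times) of why comparing an aware local-time "now" against an aware UTC expiry is safe:

```python
from datetime import datetime, timezone, timedelta

# Aware "now" in a local timezone (UTC+1 here, purely illustrative).
local_tz = timezone(timedelta(hours=1))
now_local = datetime(2024, 6, 1, 10, 0, tzinfo=local_tz)         # 09:00 UTC

# Aware expiry recorded in UTC.
expires_utc = datetime(2024, 6, 1, 9, 30, tzinfo=timezone.utc)   # 09:30 UTC

# Python normalises both to absolute instants before comparing,
# so the result is chronological regardless of the offsets used.
print(expires_utc < now_local)  # False: 09:30 UTC is after 09:00 UTC
```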
Code review
No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
Pull request overview
Adds a unified, user-facing alert recording mechanism across PredBat components and introduces initial alert producers (weather + Octopus Intelligent smart-control degradation), publishing aggregated alerts to a new HA entity.
Changes:
- Introduces `record_alert()` + TTL pruning + publishing of aggregated alerts in `output.py`.
- Adds `ComponentBase.record_alert()` delegation so components can emit alerts via the base instance.
- Migrates `AlertFeed` to also emit unified alerts and adds an Octopus IOG “smart control lost” detector/producer.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| apps/predbat/predbat.py | Initializes _active_alerts storage for user-facing alerts. |
| apps/predbat/output.py | Implements record_alert() and publishes/prunes aggregated system alerts. |
| apps/predbat/component_base.py | Adds a record_alert() delegate to the base PredBat instance. |
| apps/predbat/alertfeed.py | Re-publishes CAP weather alerts into the unified alert framework. |
| apps/predbat/octopus.py | Tracks IOG device currentState degradation and emits a warning alert after 24h. |
```python
self.smart_control_degraded_since = {}
for device_id, iso in raw.items():
    try:
        self.smart_control_degraded_since[device_id] = datetime.fromisoformat(iso)
```
load_octopus_cache() restores smart_control_degraded_since with datetime.fromisoformat(iso), but if the cached ISO string is naive (no offset) this will create a naive datetime. Later _maybe_raise_smart_control_alert() subtracts it from now (timezone-aware), which will raise a TypeError and break the update loop. Please normalise loaded timestamps to timezone-aware (e.g., assume UTC or local_tz when tzinfo is missing, and handle trailing 'Z' like output.py does).
Suggested change:

```diff
-self.smart_control_degraded_since[device_id] = datetime.fromisoformat(iso)
+if isinstance(iso, str) and iso.endswith("Z"):
+    iso = iso[:-1] + "+00:00"
+parsed = datetime.fromisoformat(iso)
+if parsed.tzinfo is None:
+    parsed = parsed.replace(tzinfo=timezone.utc)
+self.smart_control_degraded_since[device_id] = parsed
```
```python
# Active user-facing alerts keyed by (category, dedup_key). See
# record_alert()/clear_alert() in output.py.
```
The comment says alerts are keyed by (category, dedup_key) and references clear_alert(), but the implementation added in output.py keys only by a single string and there is no clear_alert() in this PR. Please update this comment to match the actual API/lifecycle to avoid misleading future changes.
Suggested change:

```diff
-# Active user-facing alerts keyed by (category, dedup_key). See
-# record_alert()/clear_alert() in output.py.
+# Active user-facing alerts keyed by the alert identifier string used
+# by the output layer. Updated via the alert-recording flow in output.py.
```
```python
key = dedup_key or "{}::{}".format(category, title)
self._active_alerts[key] = {
    "category": category,
    "severity": severity,
    "title": title,
    "message": message,
    "dedup_key": key,
```
record_alert() stores alerts under a plain key = dedup_key or "{category}::{title}". If callers provide a dedup_key without including the category (as in octopus.py), alerts from different categories could collide and overwrite each other. Consider composing the internal dict key from both category + dedup_key (while keeping the original dedup_key field intact for consumers).
Suggested change:

```diff
-key = dedup_key or "{}::{}".format(category, title)
-self._active_alerts[key] = {
-    "category": category,
-    "severity": severity,
-    "title": title,
-    "message": message,
-    "dedup_key": key,
+consumer_dedup_key = dedup_key or "{}::{}".format(category, title)
+internal_key = "{}::{}".format(category, dedup_key) if dedup_key else consumer_dedup_key
+self._active_alerts[internal_key] = {
+    "category": category,
+    "severity": severity,
+    "title": title,
+    "message": message,
+    "dedup_key": consumer_dedup_key,
```
```python
def record_alert(self, category, severity, title, message, dedup_key=None, metadata=None, expires_at=None, action_url=None):
    """
    Record a user-facing alert. Published as a list of dicts on the
    `sensor.<prefix>_system_alerts` entity (attribute `alerts`).
```
The docstring says alerts are published to sensor.<prefix>_system_alerts, but the code actually publishes to self.prefix + ".system_alerts" (a different entity id pattern). Please align the docstring (and any user-facing docs) with the actual published entity name.
Suggested change:

```diff
-    `sensor.<prefix>_system_alerts` entity (attribute `alerts`).
+    `<prefix>.system_alerts` entity (attribute `alerts`).
```
```python
# Surface SMART_CONTROL_NOT_AVAILABLE as a user-facing alert when
# it persists beyond 24h. The customer's charger has lost Octopus's
# smart control — PredBat will ignore IOG slots, but there is no
# feedback in the app beyond empty plannedDispatches. Alert nudges
# them to re-authorise MyEnergi in the Octopus app.
self._maybe_raise_smart_control_alert(device_id, device)


def _maybe_raise_smart_control_alert(self, device_id, device):
    """Check the device's currentState and raise / refresh a system alert
    if it has been non-capable for more than 24h. TTL-only: while the
    condition persists we re-record each sensor cycle to keep the alert
    alive; when currentState returns to capable we stop re-recording and
    the alert expires on its own."""
```
New smart-control degradation alert logic in _maybe_raise_smart_control_alert() is user-facing behavior (24h threshold + TTL refresh) but there are already Octopus tests under apps/predbat/tests/. Please add/extend tests to cover: tracking first_seen across cycles, alert firing after 24h, and clearing when SMART_CONTROL_CAPABLE returns (including cache restore path).
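A rough sketch of what such a test could look like. The fake component, method name, and assertions below are invented for illustration; they mirror the behaviour described in this PR (track `first_seen`, alert after 24h with the `iog_smart_control_lost:<device_id>` dedup key, clear on `SMART_CONTROL_CAPABLE`) rather than calling the real `octopus.py` code:

```python
from datetime import datetime, timedelta, timezone

class FakeOctopus:
    """Stand-in for the Octopus component, just enough to exercise the scenario."""
    def __init__(self):
        self.smart_control_degraded_since = {}
        self.alerts = []

    def record_alert(self, **kwargs):
        self.alerts.append(kwargs)

    # The real method lives in octopus.py; this mirrors the described behaviour.
    def maybe_alert(self, device_id, state, now):
        if state == "SMART_CONTROL_CAPABLE":
            self.smart_control_degraded_since.pop(device_id, None)
            return
        first_seen = self.smart_control_degraded_since.setdefault(device_id, now)
        if now - first_seen >= timedelta(hours=24):
            self.record_alert(dedup_key="iog_smart_control_lost:{}".format(device_id), severity="warning")

def test_alert_fires_after_24h_and_clears_on_capable():
    octo = FakeOctopus()
    t0 = datetime(2024, 6, 1, tzinfo=timezone.utc)
    octo.maybe_alert("dev1", "SMART_CONTROL_NOT_AVAILABLE", t0)
    assert not octo.alerts                                    # degraded, but under 24h
    octo.maybe_alert("dev1", "SMART_CONTROL_NOT_AVAILABLE", t0 + timedelta(hours=25))
    assert len(octo.alerts) == 1                              # fires once past 24h
    octo.maybe_alert("dev1", "SMART_CONTROL_CAPABLE", t0 + timedelta(hours=26))
    assert "dev1" not in octo.smart_control_degraded_since    # tracking cleared
```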
```python
# Use the engine's canonical "now" (aware, local tz) so mocked-time
# tests and deterministic plan cycles stay consistent. Python compares
# aware datetimes across timezones correctly.
now_dt = self.now_utc_exact if hasattr(self, "now_utc_exact") and self.now_utc_exact else None
```
_publish_system_alerts() only prunes expired alerts when now_dt is available, but PredBat itself does not define now_utc_exact (only ComponentBase does). In that case now_dt will be None, TTL pruning never happens, and recorded_at is always None. Recommend adding a now_utc_exact property/attribute on PredBat (e.g., set in update_time) or switching this logic to use an existing canonical timestamp like self.now_utc so TTL-only lifecycle works in both HA and standalone modes.
Suggested change:

```diff
-now_dt = self.now_utc_exact if hasattr(self, "now_utc_exact") and self.now_utc_exact else None
+now_dt = getattr(self, "now_utc_exact", None) or getattr(self, "now_utc", None)
+if now_dt is not None and now_dt.tzinfo is None:
+    now_dt = now_dt.replace(tzinfo=timezone.utc)
```
```python
# Treat naive timestamps as UTC for backward compatibility.
if expires_dt.tzinfo is None:
    expires_dt = expires_dt.replace(tzinfo=timezone.utc)
if expires_dt < now_dt:
```
Expiry pruning uses if expires_dt < now_dt, but the API description says re-recording with expires_at=now should be pruned on the next publish. Using a strict < can leave an alert around for an extra cycle when the timestamps are equal; consider using <= for the comparison so "expires at now" reliably clears.
Suggested change:

```diff
-if expires_dt < now_dt:
+if expires_dt <= now_dt:
```
```python
self.dashboard_item(
    self.prefix + ".system_alerts",
    state="on" if active else "off",
```
PR description says the new entity is published as sensor.<prefix>.system_alerts, but the implementation publishes to self.prefix + ".system_alerts". Please reconcile the PR description and the actual entity id pattern (and ideally keep it consistent with other published entities).
```python
# Also publish to the unified alerts framework so downstream consumers
# (dashboards, gateways, SaaS) see weather alerts alongside other
# categories. TTL-only lifecycle: we re-record each cycle, entries
# drop off when no longer active (stops being re-recorded + TTL expires).
for alert in alerts or []:
    expires = alert.get("expires")
    if not expires:
        continue
    cap_severity = (alert.get("severity") or "").lower()
    framework_severity = "critical" if cap_severity == "extreme" else "warning" if cap_severity == "severe" else "info"
    event = alert.get("event") or alert.get("title") or "Weather alert"
    onset = alert.get("onset")
    area = alert.get("areaDesc") or "your area"
    metadata = {
        "event": alert.get("event"),
        "severity_cap": alert.get("severity"),
        "certainty": alert.get("certainty"),
        "urgency": alert.get("urgency"),
        "area": area,
        "onset": str(onset) if onset else None,
    }
    if keep and keep > 0:
        metadata["action"] = "keep_reserve"
        metadata["keep_percent"] = keep
    dedup_key = "weather:{}:{}".format(event, str(onset) if onset else "no-onset")
    self.record_alert(
        category="weather",
        severity=framework_severity,
        title=event,
        message="{} until {} ({}/{}/{})".format(area, expires, alert.get("severity") or "unknown", alert.get("certainty") or "unknown", alert.get("urgency") or "unknown"),
        dedup_key=dedup_key,
        metadata=metadata,
        expires_at=expires.isoformat() if hasattr(expires, "isoformat") else str(expires),
    )
```
apply_alerts() now loops over potentially multiple CAP alerts and calls record_alert() for each one; record_alert() immediately publishes the aggregated system_alerts entity each time. This can cause multiple HA state updates per cycle. Consider batching (e.g., add a publish=False option and publish once after the loop, or have AlertFeed build alerts then call a single publish) to reduce churn.
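One way the batching could look, assuming a `publish` keyword is added to `record_alert()` and that `_publish_system_alerts()` exists as described elsewhere in this PR; the field list inside the dict is abbreviated and the signature change is a suggestion, not part of the diff:

```python
# Sketch only: let producers defer publishing, then flush once per cycle.
def record_alert(self, category, severity, title, message, dedup_key=None,
                 metadata=None, expires_at=None, action_url=None, publish=True):
    key = dedup_key or "{}::{}".format(category, title)
    self._active_alerts[key] = {
        "category": category,
        "severity": severity,
        "title": title,
        "message": message,
        "dedup_key": key,
        # ... remaining fields as in the real implementation
    }
    if publish:
        self._publish_system_alerts()

# AlertFeed.apply_alerts would then record everything and publish once:
#   for alert in alerts or []:
#       self.record_alert(..., publish=False)
#   self._publish_system_alerts()
```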
Summary
Adds a general-purpose alert recording mechanism so any predbat component can surface a user-facing issue, published on a new `sensor.<prefix>.system_alerts` entity. Migrates the existing weather `AlertFeed` to use it and adds a detector in `octopus.py` for Octopus losing smart control of an IOG device.

Why

Today predbat has `record_status` for operational state (one string per instance) and `AlertFeed` for weather specifically. There's no uniform way for components to raise "hey, user should see this". `octopus.py` fetches `currentState` on every Intelligent dispatches query but discards the value — so a customer whose charger has lost Octopus smart control sees no feedback; `plannedDispatches` silently returns empty.

API

TTL-only lifecycle (no `clear_alert`): producers re-record each cycle while the condition holds; alerts with `expires_at` in the past are pruned on next publish. Fast-resolve is re-recording with `expires_at=now`.

Published list sorted critical > warning > info on `sensor.<prefix>.system_alerts` (attribute `alerts`).

Changes

- `output.py` — `record_alert()` + `_publish_system_alerts()` with TTL prune
- `component_base.py` — `record_alert` delegate so components call `self.record_alert(...)`
- `predbat.py` — `self._active_alerts = {}` in `__init__`
- `alertfeed.py` — dual-publishes weather alerts via `record_alert` (keeps the old `sensor.<prefix>_alertfeed_status` entity intact for backward compat)
- `octopus.py` — captures `currentState` from the Intelligent dispatches query; raises a warning alert when a device has been non-CAPABLE for >24h (e.g. `SMART_CONTROL_NOT_AVAILABLE`)

Backward compatibility

- `sensor.<prefix>_alertfeed_status` unchanged — HA dashboards still work
- `record_alert` is additive, no existing API removed
- `sensor.<prefix>.system_alerts` is empty for HA users unless a producer fires

Test plan

- Weather alerts appear on `sensor.<prefix>.system_alerts` alongside the existing weather entity
- Device in `SMART_CONTROL_NOT_AVAILABLE` for >24h → warning alert appears with dedup key `iog_smart_control_lost:<device_id>`
- Device returns to `SMART_CONTROL_CAPABLE` → alert stops being re-recorded, drops off after TTL (2h)
- Alert re-recorded with `expires_at=now` → pruned on next publish

🤖 Generated with Claude Code
Tracking ticket
This PR implements part of the unified alert framework design tracked in Predictive-Cloud-Ltd/predbat-saas-infra#230 (from the SaaS side; batpred changes are kept minimal and work standalone for HA users). Also closes Predictive-Cloud-Ltd/predbat-saas-infra#290 — the IOG smart-control-lost customer case.