feat(alerts): record_alert primitive + AlertFeed/Octopus producers #3830

Open
mgazza wants to merge 5 commits into main from feat/record-alert-primitive

Conversation


mgazza (Collaborator) commented Apr 24, 2026

Summary

Adds a general-purpose alert recording mechanism so any predbat component can surface a user-facing issue, published on a new sensor.<prefix>.system_alerts entity. Migrates the existing weather AlertFeed to use it and adds a detector in octopus.py for Octopus losing smart control of an IOG device.

Why

Today predbat has record_status for operational state (one string per instance) and AlertFeed for weather specifically. There's no uniform way for components to raise "hey, user should see this". octopus.py fetches currentState on every Intelligent dispatches query but discards the value — so a customer whose charger has lost Octopus smart control sees no feedback; plannedDispatches silently returns empty.

API

self.record_alert(
    category,         # weather | system | optimization | api_keys | billing | announcement
    severity,         # critical | warning | info
    title, message,   # short + long user-facing text
    dedup_key=None,   # defaults to category::title; same key replaces
    metadata=None,    # arbitrary dict, for consumers to route/render
    expires_at=None,  # ISO-8601 — recommended, TTL-only lifecycle
    action_url=None,  # optional deep link
)

TTL-only lifecycle (no clear_alert): producers re-record each cycle while the condition holds; alerts with expires_at in the past are pruned on next publish. Fast-resolve is re-recording with expires_at=now.
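
A minimal producer sketch under that contract (the condition flag and alert text are made up for illustration; only record_alert and now_utc_exact come from predbat):

from datetime import timedelta

# Inside a ComponentBase subclass, called once per cycle.
def check_example_condition(self):
    if self.example_condition_active:  # hypothetical flag
        self.record_alert(
            category="system",
            severity="warning",
            title="Example condition",
            message="Something the user should look at",
            dedup_key="example_condition",
            expires_at=(self.now_utc_exact + timedelta(hours=2)).isoformat(),  # 2h TTL, refreshed each cycle
        )
    else:
        # Optional fast-resolve: expires_at=now means the entry is pruned on the
        # next publish; otherwise it simply ages out once re-recording stops.
        self.record_alert(
            category="system",
            severity="info",
            title="Example condition",
            message="Cleared",
            dedup_key="example_condition",
            expires_at=self.now_utc_exact.isoformat(),
        )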

Published list sorted critical > warning > info on sensor.<prefix>.system_alerts (attribute alerts).

Changes

  • output.py — record_alert() + _publish_system_alerts() with TTL prune
  • component_base.py — record_alert delegate so components call self.record_alert(...)
  • predbat.py — self._active_alerts = {} in __init__
  • alertfeed.py — dual-publishes weather alerts via record_alert (keeps the old sensor.<prefix>_alertfeed_status entity intact for backward compat)
  • octopus.py — captures currentState from the Intelligent dispatches query; raises a warning alert when a device has been non-CAPABLE for >24h (e.g. SMART_CONTROL_NOT_AVAILABLE)

Backward compatibility

  • sensor.<prefix>_alertfeed_status unchanged — HA dashboards still work
  • record_alert is additive, no existing API removed
  • The new sensor.<prefix>.system_alerts is empty for HA users unless a producer fires

Test plan

  • Feed a CAP weather alert → appears in sensor.<prefix>.system_alerts alongside the existing weather entity
  • Simulate SMART_CONTROL_NOT_AVAILABLE for >24h → warning alert appears with dedup key iog_smart_control_lost:<device_id>
  • Condition clears (return to SMART_CONTROL_CAPABLE) → alert stops being re-recorded, drops off after TTL (2h)
  • Re-record with expires_at=now → pruned on next publish

🤖 Generated with Claude Code


Tracking ticket

This PR implements part of the unified alert framework design tracked in Predictive-Cloud-Ltd/predbat-saas-infra#230 (from the SaaS side; batpred changes are kept minimal and work standalone for HA users). Also closes Predictive-Cloud-Ltd/predbat-saas-infra#290 — the IOG smart-control-lost customer case.

mgazza added 3 commits April 22, 2026 07:07
Add a general-purpose alert recording mechanism so any component can
surface a user-facing issue without building a bespoke publication
pipeline.

- PredBat.record_alert(category, severity, title, message, dedup_key,
  metadata, expires_at, action_url) in output.py
- ComponentBase.record_alert delegate so components can call
  self.record_alert(...) naturally (same pattern as record_status)
- self._active_alerts dict tracks currently-active alerts keyed by
  dedup_key (or category::title); initialized in PredBat.__init__
- Publishes to sensor.<prefix>.system_alerts — state on/off, attribute
  'alerts' carries the list sorted critical > warning > info
- TTL-only lifecycle: producers re-record each cycle while the
  condition holds; alerts with expires_at in the past are pruned on
  the next publish. Fast-resolve is re-recording with expires_at=now.
  No explicit clear_alert — keeps the contract small.
- No consumers yet; separate commits wire AlertFeed and octopus.py
  (IOG smart control lost) to call this
Dual-publish: existing HA entity sensor.<prefix>_alertfeed_status is
unchanged (backward-compat for HA users); weather alerts are also
recorded via the new record_alert() primitive so they appear on
sensor.<prefix>.system_alerts alongside other categories.

CAP severity maps to framework severity:
- Extreme  -> critical
- Severe   -> warning
- Moderate -> info
- Minor    -> info

Dedup key: weather:<event>:<onset> - stable across cycles of the
same CAP alert so TTL refresh works without duplicates. When
keep_reserve applies, it is carried in metadata for downstream
consumers to render a reserve-hold hint.
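
Condensed, the mapping and dedup-key scheme look roughly like this (the helper name is illustrative; the real code is inline in alertfeed.py):

def cap_to_framework_severity(cap_severity):
    """Map CAP severity (Extreme/Severe/Moderate/Minor) to the framework levels."""
    cap = (cap_severity or "").lower()
    if cap == "extreme":
        return "critical"
    if cap == "severe":
        return "warning"
    return "info"  # Moderate, Minor and anything unrecognised

# Stable dedup key per CAP alert so a TTL refresh replaces rather than duplicates:
# "weather:<event>:<onset>", e.g. "weather:Wind Warning:2026-04-24T06:00:00+00:00"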
When a customer's IOG charger lands in SMART_CONTROL_NOT_AVAILABLE
(or any non-CAPABLE state) for more than 24 hours, raise a system
severity=warning alert via record_alert. The alert nudges the
customer to re-authorise MyEnergi in the Octopus app.

Implementation:
- Capture currentState from the dispatches query (already fetched,
  previously discarded) into each intelligent device dict
- Track first_seen per device in self.smart_control_degraded_since
- Clear tracking when currentState returns to SMART_CONTROL_CAPABLE
- After 24h degraded, re-record each sensor cycle with a 2h TTL so
  the alert stays alive while the condition persists and auto-expires
  once it clears (per the TTL-only lifecycle)

Closes the original motivating case: a customer lost IOG smart
control for ~48h with no feedback; plannedDispatches silently
returned empty and the only visible symptom was PredBat ignoring
the cheap charging window.
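
Pieced together from this commit message and the review excerpts further down, the detector is roughly shaped like this (a simplified sketch, not the exact code; the state constant and message wording are assumptions):

from datetime import timedelta

def _maybe_raise_smart_control_alert(self, device_id, device):
    state = device.get("currentState")
    if not state or state == "SMART_CONTROL_CAPABLE":
        # Capable again (or unknown): stop tracking and stop re-recording;
        # any live alert ages out via its 2h TTL.
        self.smart_control_degraded_since.pop(device_id, None)
        return
    first_seen = self.smart_control_degraded_since.setdefault(device_id, self.now_utc_exact)
    if self.now_utc_exact - first_seen < timedelta(hours=24):
        return  # tolerate transient blips
    self.record_alert(
        category="system",
        severity="warning",
        title="Octopus smart control lost",
        message="Device {} has reported {} for over 24h; re-authorise MyEnergi in the Octopus app".format(device_id, state),
        dedup_key="iog_smart_control_lost:{}".format(device_id),
        metadata={"device_id": device_id, "current_state": state},
        expires_at=(self.now_utc_exact + timedelta(hours=2)).isoformat(),
    )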

mgazza commented Apr 24, 2026

Code review

Found 2 issues:

  1. _publish_system_alerts prunes expired alerts via lexicographic string compare on expires_at, but producers supply strings with different timezone offsets — weather alerts from CAP come with +00:00, the octopus smart-control alert uses now_utc_exact.isoformat() which carries a local offset (e.g. +01:00). String ordering across offset variants is not chronological, so alerts can be pruned too early or kept alive past their real expiry. Suggest normalising to UTC before compare, or parsing to datetime for the comparison.

batpred/apps/predbat/output.py

Lines 2471 to 2486 in 93d781b

def _publish_system_alerts(self):
    """Prune expired entries and publish the current active list."""
    now_iso = self.now_utc_exact.isoformat() if hasattr(self, "now_utc_exact") and self.now_utc_exact else None
    # Prune expired
    if now_iso:
        expired_keys = [k for k, a in self._active_alerts.items() if a.get("expires_at") and a["expires_at"] < now_iso]
        for k in expired_keys:
            del self._active_alerts[k]
    active = sorted(
        self._active_alerts.values(),
        key=lambda a: ({"critical": 0, "warning": 1, "info": 2}.get(a.get("severity", "info"), 2), a.get("recorded_at") or ""),
    )
    self.dashboard_item(

  2. smart_control_degraded_since is in-memory only. On AppDaemon reload or pod restart, the 24h clock resets — a customer whose device has been non-capable for days waits another 24h after each restart before the alert fires. Consistent with batpred's ephemeral-state convention elsewhere, but worth either persisting the first-seen timestamp to disk/cache or documenting the restart behaviour.

self.free_electricity_events = []
# Track when each intelligent device first reported a non-capable
# currentState (e.g. SMART_CONTROL_NOT_AVAILABLE). When the condition
# persists beyond 24h we raise a user-facing alert via record_alert.
self.smart_control_degraded_since = {}
# API request metrics for monitoring

🤖 Generated with Claude Code


1. _publish_system_alerts compared expires_at strings lexicographically
   against now_utc_exact.isoformat(). Producers legitimately use
   different timezone offsets (CAP weather = +00:00, octopus code =
   local offset), and string ordering across offsets is not
   chronological. Parse expires_at via datetime.fromisoformat (handling
   trailing 'Z' for older Pythons) and compare against a timezone-aware
   UTC now. Treat naive timestamps as UTC.

2. smart_control_degraded_since was in-memory only, so every AppDaemon
   or pod restart reset the 24h alert clock for IOG devices that had
   been non-capable for days. Persist it to the existing octopus YAML
   cache (alongside intelligent_devices, saving_sessions, etc.) as ISO
   strings; rehydrate into datetime on load. The 24h window now
   survives restarts.
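
A minimal sketch of the normalised comparison from point 1 (the standalone helper is illustrative; predbat does this inline in _publish_system_alerts):

from datetime import datetime, timezone

def is_expired(expires_at, now_dt):
    """True when an ISO-8601 expires_at string is at or before the aware now_dt."""
    if expires_at.endswith("Z"):
        expires_at = expires_at[:-1] + "+00:00"  # fromisoformat rejects 'Z' on older Pythons
    expires_dt = datetime.fromisoformat(expires_at)
    if expires_dt.tzinfo is None:
        expires_dt = expires_dt.replace(tzinfo=timezone.utc)  # treat naive as UTC
    return expires_dt <= now_dt

# Why lexicographic comparison breaks: as strings,
# "2026-04-24T10:30:00+01:00" > "2026-04-24T09:45:00+00:00",
# yet 10:30+01:00 is 09:30 UTC, i.e. the chronologically earlier instant.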

mgazza commented Apr 24, 2026

Code review

Found 1 issue:

  1. _publish_system_alerts checks self.now_utc_exact exists but then uses datetime.now(timezone.utc) as the reference — ignoring the engine's canonical "now". Batpred uses self.now_utc_exact consistently elsewhere (e.g. record_status, _maybe_raise_smart_control_alert, and the recorded_at timestamp assigned two lines above) so mocked-time tests and deterministic plan cycles stay consistent. Pruning may drift from the rest of the cycle's view of time. Fix: now_dt = self.now_utc_exact (it's already timezone-aware, so the aware/aware comparison works).

batpred/apps/predbat/output.py

Lines 2474 to 2481 in 6ef15be

# Parse expires_at to timezone-aware datetime and compare against a UTC
# now so producers can pass ISO strings with any offset (e.g. CAP
# weather alerts arrive as +00:00, octopus alerts as local offset).
# Lex comparison on mixed-offset ISO strings would order wrongly.
now_dt = datetime.now(timezone.utc) if hasattr(self, "now_utc_exact") and self.now_utc_exact else None
if now_dt is not None:
    expired_keys = []

🤖 Generated with Claude Code


Was calling datetime.now(timezone.utc) directly, ignoring the
engine's canonical cycle time. record_status,
_maybe_raise_smart_control_alert, and the recorded_at field on each alert all use
self.now_utc_exact — pruning should too so mocked-time tests and
deterministic plan cycles see a consistent view of 'now'.

self.now_utc_exact is aware (datetime.now(self.local_tz)); Python
compares aware datetimes across timezones correctly, so the compare
against an aware expires_dt still works.

Ref: PR review
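
For the record, aware datetimes compare by absolute instant regardless of offset, so mixing a local-tz "now" with a UTC expiry is safe:

from datetime import datetime, timezone, timedelta

local_tz = timezone(timedelta(hours=1))                     # e.g. BST, +01:00
now_local = datetime(2026, 4, 24, 10, 30, tzinfo=local_tz)  # 09:30 UTC
expires_utc = datetime(2026, 4, 24, 9, 45, tzinfo=timezone.utc)

# The "later" wall-clock time is actually the earlier instant.
assert now_local < expires_utc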

mgazza commented Apr 24, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code


Copilot AI (Contributor) left a comment


Pull request overview

Adds a unified, user-facing alert recording mechanism across PredBat components and introduces initial alert producers (weather + Octopus Intelligent smart-control degradation), publishing aggregated alerts to a new HA entity.

Changes:

  • Introduces record_alert() + TTL pruning + publishing of aggregated alerts in output.py.
  • Adds ComponentBase.record_alert() delegation so components can emit alerts via the base instance.
  • Migrates AlertFeed to also emit unified alerts and adds an Octopus IOG “smart control lost” detector/producer.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Summary per file:

  • apps/predbat/predbat.py — Initializes _active_alerts storage for user-facing alerts.
  • apps/predbat/output.py — Implements record_alert() and publishes/prunes aggregated system alerts.
  • apps/predbat/component_base.py — Adds a record_alert() delegate to the base PredBat instance.
  • apps/predbat/alertfeed.py — Re-publishes CAP weather alerts into the unified alert framework.
  • apps/predbat/octopus.py — Tracks IOG device currentState degradation and emits a warning alert after 24h.

Comment thread apps/predbat/octopus.py
self.smart_control_degraded_since = {}
for device_id, iso in raw.items():
    try:
        self.smart_control_degraded_since[device_id] = datetime.fromisoformat(iso)

Copilot AI Apr 25, 2026


load_octopus_cache() restores smart_control_degraded_since with datetime.fromisoformat(iso), but if the cached ISO string is naive (no offset) this will create a naive datetime. Later _maybe_raise_smart_control_alert() subtracts it from now (timezone-aware), which will raise a TypeError and break the update loop. Please normalise loaded timestamps to timezone-aware (e.g., assume UTC or local_tz when tzinfo is missing, and handle trailing 'Z' like output.py does).

Suggested change
self.smart_control_degraded_since[device_id] = datetime.fromisoformat(iso)
if isinstance(iso, str) and iso.endswith("Z"):
    iso = iso[:-1] + "+00:00"
parsed = datetime.fromisoformat(iso)
if parsed.tzinfo is None:
    parsed = parsed.replace(tzinfo=timezone.utc)
self.smart_control_degraded_since[device_id] = parsed

Comment thread apps/predbat/predbat.py
Comment on lines +476 to +477
# Active user-facing alerts keyed by (category, dedup_key). See
# record_alert()/clear_alert() in output.py.

Copilot AI Apr 25, 2026


The comment says alerts are keyed by (category, dedup_key) and references clear_alert(), but the implementation added in output.py keys only by a single string and there is no clear_alert() in this PR. Please update this comment to match the actual API/lifecycle to avoid misleading future changes.

Suggested change
# Active user-facing alerts keyed by (category, dedup_key). See
# record_alert()/clear_alert() in output.py.
# Active user-facing alerts keyed by the alert identifier string used
# by the output layer. Updated via the alert-recording flow in output.py.

Comment thread apps/predbat/output.py
Comment on lines +2458 to +2464
key = dedup_key or "{}::{}".format(category, title)
self._active_alerts[key] = {
    "category": category,
    "severity": severity,
    "title": title,
    "message": message,
    "dedup_key": key,

Copilot AI Apr 25, 2026


record_alert() stores alerts under a plain key = dedup_key or "{category}::{title}". If callers provide a dedup_key without including the category (as in octopus.py), alerts from different categories could collide and overwrite each other. Consider composing the internal dict key from both category + dedup_key (while keeping the original dedup_key field intact for consumers).

Suggested change
key = dedup_key or "{}::{}".format(category, title)
self._active_alerts[key] = {
    "category": category,
    "severity": severity,
    "title": title,
    "message": message,
    "dedup_key": key,
consumer_dedup_key = dedup_key or "{}::{}".format(category, title)
internal_key = "{}::{}".format(category, dedup_key) if dedup_key else consumer_dedup_key
self._active_alerts[internal_key] = {
    "category": category,
    "severity": severity,
    "title": title,
    "message": message,
    "dedup_key": consumer_dedup_key,

Comment thread apps/predbat/output.py
def record_alert(self, category, severity, title, message, dedup_key=None, metadata=None, expires_at=None, action_url=None):
    """
    Record a user-facing alert. Published as a list of dicts on the
    `sensor.<prefix>_system_alerts` entity (attribute `alerts`).

Copilot AI Apr 25, 2026


The docstring says alerts are published to sensor.<prefix>_system_alerts, but the code actually publishes to self.prefix + ".system_alerts" (a different entity id pattern). Please align the docstring (and any user-facing docs) with the actual published entity name.

Suggested change
`sensor.<prefix>_system_alerts` entity (attribute `alerts`).
`<prefix>.system_alerts` entity (attribute `alerts`).

Comment thread apps/predbat/octopus.py
Comment on lines +1872 to +1884
# Surface SMART_CONTROL_NOT_AVAILABLE as a user-facing alert when
# it persists beyond 24h. The customer's charger has lost Octopus's
# smart control — PredBat will ignore IOG slots, but there is no
# feedback in the app beyond empty plannedDispatches. Alert nudges
# them to re-authorise MyEnergi in the Octopus app.
self._maybe_raise_smart_control_alert(device_id, device)

def _maybe_raise_smart_control_alert(self, device_id, device):
    """Check the device's currentState and raise / refresh a system alert
    if it has been non-capable for more than 24h. TTL-only: while the
    condition persists we re-record each sensor cycle to keep the alert
    alive; when currentState returns to capable we stop re-recording and
    the alert expires on its own."""

Copilot AI Apr 25, 2026


New smart-control degradation alert logic in _maybe_raise_smart_control_alert() is user-facing behavior (24h threshold + TTL refresh) but there are already Octopus tests under apps/predbat/tests/. Please add/extend tests to cover: tracking first_seen across cycles, alert firing after 24h, and clearing when SMART_CONTROL_CAPABLE returns (including cache restore path).

Comment thread apps/predbat/output.py
# Use the engine's canonical "now" (aware, local tz) so mocked-time
# tests and deterministic plan cycles stay consistent. Python compares
# aware datetimes across timezones correctly.
now_dt = self.now_utc_exact if hasattr(self, "now_utc_exact") and self.now_utc_exact else None

Copilot AI Apr 25, 2026


_publish_system_alerts() only prunes expired alerts when now_dt is available, but PredBat itself does not define now_utc_exact (only ComponentBase does). In that case now_dt will be None, TTL pruning never happens, and recorded_at is always None. Recommend adding a now_utc_exact property/attribute on PredBat (e.g., set in update_time) or switching this logic to use an existing canonical timestamp like self.now_utc so TTL-only lifecycle works in both HA and standalone modes.

Suggested change
now_dt = self.now_utc_exact if hasattr(self, "now_utc_exact") and self.now_utc_exact else None
now_dt = getattr(self, "now_utc_exact", None) or getattr(self, "now_utc", None)
if now_dt is not None and now_dt.tzinfo is None:
    now_dt = now_dt.replace(tzinfo=timezone.utc)

Comment thread apps/predbat/output.py
# Treat naive timestamps as UTC for backward compatibility.
if expires_dt.tzinfo is None:
    expires_dt = expires_dt.replace(tzinfo=timezone.utc)
if expires_dt < now_dt:

Copilot AI Apr 25, 2026


Expiry pruning uses if expires_dt < now_dt, but the API description says re-recording with expires_at=now should be pruned on the next publish. Using a strict < can leave an alert around for an extra cycle when the timestamps are equal; consider using <= for the comparison so "expires at now" reliably clears.

Suggested change
if expires_dt < now_dt:
if expires_dt <= now_dt:

Comment thread apps/predbat/output.py
Comment on lines +2510 to +2512
self.dashboard_item(
    self.prefix + ".system_alerts",
    state="on" if active else "off",

Copilot AI Apr 25, 2026


PR description says the new entity is published as sensor.<prefix>.system_alerts, but the implementation publishes to self.prefix + ".system_alerts". Please reconcile the PR description and the actual entity id pattern (and ideally keep it consistent with other published entities).

Comment thread apps/predbat/alertfeed.py
Comment on lines +165 to +198
# Also publish to the unified alerts framework so downstream consumers
# (dashboards, gateways, SaaS) see weather alerts alongside other
# categories. TTL-only lifecycle: we re-record each cycle, entries
# drop off when no longer active (stops being re-recorded + TTL expires).
for alert in alerts or []:
    expires = alert.get("expires")
    if not expires:
        continue
    cap_severity = (alert.get("severity") or "").lower()
    framework_severity = "critical" if cap_severity == "extreme" else "warning" if cap_severity == "severe" else "info"
    event = alert.get("event") or alert.get("title") or "Weather alert"
    onset = alert.get("onset")
    area = alert.get("areaDesc") or "your area"
    metadata = {
        "event": alert.get("event"),
        "severity_cap": alert.get("severity"),
        "certainty": alert.get("certainty"),
        "urgency": alert.get("urgency"),
        "area": area,
        "onset": str(onset) if onset else None,
    }
    if keep and keep > 0:
        metadata["action"] = "keep_reserve"
        metadata["keep_percent"] = keep
    dedup_key = "weather:{}:{}".format(event, str(onset) if onset else "no-onset")
    self.record_alert(
        category="weather",
        severity=framework_severity,
        title=event,
        message="{} until {} ({}/{}/{})".format(area, expires, alert.get("severity") or "unknown", alert.get("certainty") or "unknown", alert.get("urgency") or "unknown"),
        dedup_key=dedup_key,
        metadata=metadata,
        expires_at=expires.isoformat() if hasattr(expires, "isoformat") else str(expires),
    )

Copilot AI Apr 25, 2026


apply_alerts() now loops over potentially multiple CAP alerts and calls record_alert() for each one; record_alert() immediately publishes the aggregated system_alerts entity each time. This can cause multiple HA state updates per cycle. Consider batching (e.g., add a publish=False option and publish once after the loop, or have AlertFeed build alerts then call a single publish) to reduce churn.
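
One possible shape for that batching (publish= is the reviewer's suggested parameter, not something this PR implements):

def record_alert(self, category, severity, title, message, dedup_key=None,
                 metadata=None, expires_at=None, action_url=None, publish=True):
    key = dedup_key or "{}::{}".format(category, title)
    self._active_alerts[key] = {
        "category": category, "severity": severity, "title": title,
        "message": message, "dedup_key": key, "metadata": metadata,
        "expires_at": expires_at, "action_url": action_url,
    }
    if publish:
        self._publish_system_alerts()

# AlertFeed would then pass publish=False inside its loop and call
# self._publish_system_alerts() once after it, so HA sees a single state
# update per cycle instead of one per CAP alert.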
