Decouple YAML save reply from full scanner reload #890
Conversation
`devices/update_config` previously awaited a full `scanner.scan()` on the WS hot path. For ~700-line configs (especially those with `packages:` blocks) the ESPHome YAML resolve inside `load_device_yaml` is hundreds of ms of CPU on the worker thread, and the save-confirmation popup waits for the whole round trip.

Add a wake-driven background reload worker to `DeviceScanner` and switch the save path to `scanner.request(filename)` — the WS reply acks on the atomic write alone, and `DEVICE_UPDATED` follows on the worker's next drain via the existing on-change pipeline. The worker shares a single `WakeWorker` primitive (`controllers/_wake_worker.py`) with `BuildSizeRefresher`, which already had the same pending-set + `asyncio.Event` + start/stop shape duplicated inline.

Drift-detection callers (`list_devices`, `start`, `delete`, `archive`, `unarchive`, `create`, `clone`, the REST shim's `poll()`) keep `await scanner.scan()`; the scanner's lock serialises them against the worker so semantics are preserved.
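For orientation, the save-path change in sketch form. `_persist_yaml_mutation` and `_write_yaml_atomic_async` are named in this PR, but the signatures and surrounding class below are assumptions, not the repo's code verbatim:

```python
# Sketch only: parameter names and the enclosing controller class are illustrative.

# Before: the WS reply waited for the full fleet rescan.
async def _persist_yaml_mutation(self, configuration: str, content: str) -> None:
    await self._write_yaml_atomic_async(configuration, content)
    await self._scanner.scan()  # hundreds of ms of YAML resolve on large configs

# After: ack on the atomic write; the reload happens on the worker's next drain.
async def _persist_yaml_mutation(self, configuration: str, content: str) -> None:
    await self._write_yaml_atomic_async(configuration, content)
    self._scanner.request(configuration)  # fire-and-forget; DEVICE_UPDATED follows shortly after
```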
Pull request overview
Decouples the YAML editor's save reply from the full scanner reload. `devices/update_config` previously awaited `scanner.scan()` on the WS hot path, which on large `!include`/`packages:` configs blocked the save-confirmation popup for hundreds of ms. The PR introduces a reusable `WakeWorker[T]` primitive, adds a wake-driven background reload worker to `DeviceScanner`, and switches `_persist_yaml_mutation` to a fire-and-forget `scanner.request(configuration)`. `BuildSizeRefresher` is refactored onto the same primitive without behavioural change. Drift-detection callers that need a post-scan view keep `await scanner.scan()`; the scanner's internal lock serialises them against the new worker.
Changes:
- New `WakeWorker[T]` primitive (`controllers/_wake_worker.py`) owning the pending set, wake event, and start/stop lifecycle.
- `DeviceScanner` gains `start`/`stop`/`request` and an internal `_run` drain loop; `_persist_yaml_mutation` calls `request()` instead of awaiting `scan()`.
- `BuildSizeRefresher` migrated to `WakeWorker`; tests updated to reach through `refresher._worker`.
- `RecordingScanner` test fake gains `request`/`start`/`stop` stubs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| esphome_device_builder/controllers/_wake_worker.py | New WakeWorker[T] scaffold (pending set + wake event + task lifecycle). |
| esphome_device_builder/controllers/_device_scanner.py | Adds start/stop/request + _run drain loop on top of the new worker. |
| esphome_device_builder/controllers/_build_size_refresher.py | Replaces inline pending/wake/task logic with the shared WakeWorker. |
| esphome_device_builder/controllers/devices/controller.py | Starts/stops the scanner worker; switches _persist_yaml_mutation from scan() to request(). |
| tests/controllers/test_wake_worker.py | New unit tests for the primitive. |
| tests/test_device_scanner_request.py | New unit tests for the scanner's wake-driven reload path. |
| tests/controllers/test_build_size_refresher.py | Updates internal-attr accesses to go through _worker. |
| tests/controllers/devices/test_get_update_config.py | Asserts ("request", ...) on the scanner, rewires the ordering test. |
| tests/controllers/devices/test_edit_friendly_name.py | Asserts ("request", ...) and drives the real scanner's worker in the e2e path. |
| tests/_recording_scanner.py | Adds request/start/stop stubs to the test fake. |
* `test_update_config_writes_before_requesting_reload` no longer reads the YAML inside a sync `request()` callback — that trips blockbuster on Linux CI. Instead, monkeypatch both `_write_yaml_atomic_async` and `scanner.request` to record the call order and assert write-then-request.
* Edit-friendly-name end-to-end test and the new scanner request tests previously waited on internal worker flags (`pending`, `_wake`, lock state). There's a window where all three look idle but the worker hasn't actually run `reload()` yet. Replace with a deadline-based `_wait_for(predicate)` that polls the user-observable outcome (the device's friendly_name, the recorded reload list).
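A deadline-based helper of that shape looks roughly like this (a sketch; the repo's `_wait_for` may differ in timeout, poll interval, and failure message):

```python
import asyncio
import time
from collections.abc import Callable

async def _wait_for(predicate: Callable[[], bool], *, timeout: float = 5.0) -> None:
    """Poll a user-observable predicate until it holds or the deadline passes."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise AssertionError("condition not reached before deadline")
        await asyncio.sleep(0.01)
```

Tests then park on outcomes, something like `await _wait_for(lambda: device.friendly_name == expected)`, instead of on worker internals.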
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #890      +/-   ##
==========================================
- Coverage   99.32%   99.31%   -0.01%
==========================================
  Files         189      190       +1
  Lines       13763    13823      +60
==========================================
+ Hits        13670    13729      +59
- Misses         93       94       +1
```
Flags with carried forward coverage won't be shown.
Tests previously polled internal worker flags (``pending`` / ``_wake.is_set`` / ``_lock.locked``) to wait for the worker to drain. That has a race window: between "pending cleared" and "reload actually starts," all three flags look idle but the work hasn't happened. ``WakeWorker`` now exposes:

* ``drain()`` async context manager that pairs the wake-receive with the idle-set on exit (so a run loop can't forget to signal completion).
* ``wait_idle()`` that parks until pending is empty AND no drain is in progress.
* ``stop()`` always sets idle on the way out so any ``wait_idle`` parked through shutdown unblocks rather than hanging.

Scanner and BuildSizeRefresher both wrap their per-iteration work in ``async with self._worker.drain():`` and tests use ``await worker.wait_idle()`` for synchronization.

Race-safe under asyncio's single-thread model:

* ``request()``'s pending.add + idle.clear + wake.set is one sync block, can't interleave with anything else.
* ``drain()``'s finally (``if not self.pending: self._idle.set()``) is one sync block, can't interleave with a concurrent request.
* A request that lands mid-drain is visible to the finally's ``if not self.pending`` check, so idle stays cleared until the next drain handles it.
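In sketch form, that surface is roughly the following. Attribute names beyond those quoted above are guesses, ``start()`` (task creation) is omitted, and the real class carries more than this:

```python
import asyncio
import contextlib
from contextlib import asynccontextmanager
from typing import Generic, TypeVar

T = TypeVar("T")

class WakeWorker(Generic[T]):
    """Pending set + wake event + idle event, per the description above (sketch only)."""

    def __init__(self) -> None:
        self.pending: set[T] = set()
        self._wake = asyncio.Event()
        self._idle = asyncio.Event()
        self._idle.set()  # nothing queued yet
        self._task: asyncio.Task[None] | None = None  # created by start(), not shown here

    def request(self, item: T) -> None:
        # One synchronous block: add the item, clear idle, wake the drain loop.
        self.pending.add(item)
        self._idle.clear()
        self._wake.set()

    @asynccontextmanager
    async def drain(self):
        # Pairs the wake-receive with the idle-set on exit, so a run loop
        # cannot forget to signal completion.
        await self._wake.wait()
        self._wake.clear()
        try:
            yield
        finally:
            if not self.pending:
                self._idle.set()

    async def wait_idle(self) -> None:
        # Parks until pending is empty and no drain is in progress.
        await self._idle.wait()

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._task
            self._task = None
        self._idle.set()  # unblock any wait_idle parked through shutdown
```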
Both ``DeviceScanner`` and ``BuildSizeRefresher`` had identical ``request`` / ``start`` / ``stop`` pass-throughs forwarding to a composed ``WakeWorker``, plus identical ``while True: async with self._worker.drain():`` boilerplate in their run loops. Switch to inheritance: ``WakeWorker`` now owns the loop and exposes ``_drain`` (required) and ``_on_start`` (optional no-op) hooks.

Consumers:

* ``DeviceScanner(WakeWorker[str])`` — drops the three pass-throughs, replaces ``_run`` with ``_drain``.
* ``BuildSizeRefresher(WakeWorker[str])`` — same, plus ``_on_start`` for the phase-A fleet sweep.

Tests:

* ``refresher._worker.X`` → ``refresher.X``; same for the scanner request tests. ``test_wake_worker.py`` rewritten to subclass the base for each scenario.
* ``RecordingScanner.start`` is now sync to match production.
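Roughly, the post-refactor shape (building on the sketch above; the internal cycle-manager name is taken from a later commit, and the subclass body is illustrative rather than the scanner's actual drain code):

```python
class WakeWorker(Generic[T]):
    # ...state plus request() / wait_idle() / stop() as sketched earlier...

    async def _on_start(self) -> None:
        """Optional hook run once before the drain loop; no-op by default."""

    async def _drain(self) -> None:
        """Required hook: consume self.pending; every consumer overrides this."""
        raise NotImplementedError

    async def _run_loop(self) -> None:
        await self._on_start()
        while True:
            async with self._drain_cycle():
                await self._drain()


class DeviceScanner(WakeWorker[str]):
    async def _drain(self) -> None:
        while self.pending:  # pop-as-you-go: requests that land mid-drain get picked up too
            filename = self.pending.pop()
            await self.reload(filename)
```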
@bluetoothbot review
* ``WakeWorker._run_loop`` now wraps each ``_drain`` call in a ``try/except Exception`` so a programming bug in a subclass's drain body doesn't kill the worker and strand every ``wait_idle`` waiter. Tested by ``test_drain_exception_logged_and_loop_continues``.
* ``WakeWorker.start`` retrieves and logs the prior task's exception when a crashed task is replaced, so the failure mode is visible in logs and doesn't surface as "Task exception was never retrieved" at GC. Tested by ``test_start_logs_and_replaces_crashed_task``.
* ``DeviceScanner._drain`` debug-logs when ``reload`` returns ``False`` (the queued YAML vanished between request and drain time) so the silent-skip case is observable.
* ``RecordingScanner`` docstring now lists the full mirrored surface (``request`` / ``start`` / ``stop``) and explicitly pins the sync/async signatures so a future drift surfaces.
* ``test_wake_worker.py`` drops the lone ``@pytest.mark.asyncio`` marker (the repo is auto-mode) and types ``caplog`` as ``pytest.LogCaptureFixture``.
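The crash-surfacing ``start`` is roughly this shape (a sketch assuming the worker keeps its task in ``self._task``; the logger name and message are illustrative):

```python
import asyncio
import logging

_LOGGER = logging.getLogger(__name__)

class WakeWorker:  # sketch: generics and the rest of the class omitted
    _task: asyncio.Task[None] | None = None

    def start(self) -> None:
        prior = self._task
        if prior is not None and prior.done() and not prior.cancelled():
            exc = prior.exception()  # retrieving it also avoids "Task exception was never retrieved" at GC
            if exc is not None:
                _LOGGER.error("%s task crashed; restarting: %r", type(self).__name__, exc)
        self._task = asyncio.create_task(self._run_loop())

    async def _run_loop(self) -> None:  # stub so the sketch stands alone
        ...
```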
Codecov flagged line 78 (the abstract ``raise NotImplementedError`` in the base ``_drain``) as uncovered — every concrete subclass in production and tests overrides it. A direct unit test now pins the contract so the abstract default stays covered at 100%.
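That pin is small; assuming the base class is directly instantiable and ``_drain`` takes no arguments, it is essentially:

```python
import pytest

from esphome_device_builder.controllers._wake_worker import WakeWorker

async def test_base_drain_raises_not_implemented() -> None:
    worker: WakeWorker[str] = WakeWorker()
    with pytest.raises(NotImplementedError):
        await worker._drain()
```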
PR Review — Decouple YAML save reply from full scanner reload

Solid PR. The …

One remaining footgun worth addressing before merge:

🟢 Suggestions

1. `_on_start` exception is not guarded — can deadlock `wait_idle` (`esphome_device_builder/controllers/_wake_worker.py`, L89-99)

Today this is masked because …

Suggested fix: wrap …

```python
async def _run_loop(self) -> None:
    try:
        await self._on_start()
    except Exception:
        _LOGGER.exception("Worker %s _on_start raised; continuing", type(self).__name__)
    if not self.pending:
        self._idle.set()
    while True:
        ...
```

Alternatively, tighten the contract in the …

2. Worth a one-line note: request/regen ordering is now non-deterministic (`esphome_device_builder/controllers/devices/controller.py`, L779-786)

Before this PR, …

Whether the worker reads StorageJSON's …

3. `request()` clears `_idle` even when no worker is running (`esphome_device_builder/controllers/_wake_worker.py`, L30-34)

In practice this is fine because the only production caller paths run through a fully-started controller, and …

Not a blocker; the current callers are all gated behind controller startup. Worth either a one-line note on …

Checklist

Summary

Solid PR. The …

One remaining footgun worth addressing before merge: …

Automated review by Kōan7614520
bluetoothbot flagged: if ``_drain`` raises while items are still in ``self.pending``, ``_drain_cycle.__aexit__`` neither sets ``_idle`` nor re-arms ``_wake``. The worker parks on ``_wake.wait()`` forever and any ``wait_idle`` waiter is stranded. Reachable in ``BuildSizeRefresher._drain`` because ``_persist_size`` is a sync callback outside the per-item try.

Three fixes:

* ``_drain_cycle`` now re-arms ``_wake`` when ``pending`` is non-empty on exit. ``Event.set()`` is idempotent so the normal "concurrent request set wake during drain" path is unchanged.
* ``start`` clears ``_idle`` so a ``wait_idle`` issued right after ``start`` parks past ``_on_start``; ``_run_loop`` re-sets idle when ``_on_start`` queued no work.
* ``_drain`` docstring spells out the swap-empty vs pop-as-you-go patterns and the contract that the base preserves leftover pending across exception-exiting drains.

Three new tests pin the contracts:

* ``test_drain_raising_with_items_pending_drains_remainder``
* ``test_wait_idle_after_start_with_no_on_start_work``
* ``test_wait_idle_after_start_waits_for_on_start_work``
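In sketch form, the re-arm looks like this (the exact ``_drain_cycle`` body may differ):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def _drain_cycle(self):  # method on WakeWorker (sketch)
    await self._wake.wait()
    self._wake.clear()
    try:
        yield
    finally:
        if self.pending:
            # Leftover work, e.g. _drain raised mid-batch: re-arm so the loop wakes
            # again instead of parking forever. Event.set() is idempotent, so the
            # normal "request landed mid-drain" path is unchanged.
            self._wake.set()
        else:
            self._idle.set()
```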
Thanks for the careful read — warning #1 is a real deadlock and you're right that the existing test missed it because …
@bluetoothbot review
Self-review found three CLAUDE.md drifts in the last batch: ``_drain`` docstring was 19 lines of teaching material when the contract is one sentence, ``_drain_cycle`` carried a multi-line "without this we used to deadlock" rationale that belongs in the PR body, and ``start`` / ``_run_loop`` had multi-line inline comments that say what a one-liner would. Net −33 lines; behavior unchanged, tests + coverage unchanged.
bluetoothbot flagged three follow-ups on top of the deadlock fix in 75bf618:

* ``_run_loop`` now wraps ``_on_start`` in the same ``try/except Exception`` as ``_drain``. Symmetric defense against a subclass's ``_on_start`` raising and stranding every ``wait_idle`` waiter behind the cleared ``_idle`` event.
* ``WakeWorker.request`` docstring spells out the worker-must-be-running invariant.
* ``_persist_yaml_mutation`` docstring makes the fire-and-forget contract explicit so the next reader doesn't try to reason about the post-reload device row before the next event tick.

Regression test ``test_on_start_exception_logged_and_loop_continues`` pins the new guard.
Thanks for the follow-up — all three landed in 06890ca: …
What does this implement/fix?
`devices/update_config` (the YAML editor's Save button) previously awaited a full `scanner.scan()` on the WS hot path. For ~700-line configs, especially ones with `!include`/`packages:`, the ESPHome YAML resolve inside `load_device_yaml` is hundreds of ms of CPU on the worker thread, and the save-confirmation popup waits for the whole round trip. (Reported in Discord by MasterOfNone.)
This PR moves the post-save reload off the WS reply:
- `DeviceScanner` gains a wake-driven background reload worker built on a new `WakeWorker[T]` primitive at `controllers/_wake_worker.py`. The worker drains a pending filename set when its `asyncio.Event` is set; reloads run under the existing scanner lock and fire `DEVICE_UPDATED` via the on-change pipeline.
- `_persist_yaml_mutation` switches from `await scanner.scan()` to `scanner.request(configuration)`. The save reply acks on the atomic disk write alone, and the device row refresh reaches subscribed clients moments later via the existing event bus.
- `BuildSizeRefresher` already had the same pending-set + wake-event + start/stop shape inlined; it now shares the new `WakeWorker` primitive too. Behaviour preserved (drain still uses `pop()`-style for the mid-walk-request semantics the original called out; see the sketch after this list).
- Drift-detection callers (`list_devices`, `start`, `delete`, `archive`, `unarchive`, `create`, `clone`, the REST shim's `poll()`) keep `await scanner.scan()`; they need the post-scan view in their response. The scanner's internal lock serialises them against the worker so semantics are preserved.
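For the `pop()`-style point above, a sketch of the two drain shapes; the names are generic and `process()` is a stand-in for the real per-item work, not the repo's method:

```python
async def drain_pop_as_you_go(worker) -> None:
    # Requests that land while this walk is running are handled in the same
    # pass, because each iteration re-reads worker.pending.
    while worker.pending:
        item = worker.pending.pop()
        await worker.process(item)

async def drain_swap_and_clear(worker) -> None:
    # Snapshot-and-clear: a request that lands mid-walk waits for the next
    # wake/drain cycle instead of extending the current one.
    batch, worker.pending = worker.pending, set()
    for item in batch:
        await worker.process(item)
```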
Related issue or feature (if applicable):
Types of changes
- bugfix
- new-feature
- enhancement
- breaking-change
- refactor
- docs
- maintenance
- ci
- dependencies

Frontend coordination
The wire shape of `devices/update_config` is unchanged; it still returns `None`. The frontend already receives the post-save device-row refresh via the existing `device_updated` event, so the "save, then see the device update" UX continues to work; it just arrives a fraction of a second after the popup instead of being blocked behind the reload.
Checklist
- … (`ruff`, `codespell`, yaml/json/python checks).
- … `tests/` where applicable.
- `components.json` has not been hand-edited (regenerate via `script/sync_components.py` if a sync is needed).
- … `docs/ARCHITECTURE.md` and/or `docs/API.md`.