Decouple YAML save reply from full scanner reload #890
Conversation
`devices/update_config` previously awaited a full `scanner.scan()` on the WS hot path. For ~700-line configs (especially those with `packages:` blocks) the ESPHome YAML resolve inside `load_device_yaml` is hundreds of ms of CPU on the worker thread, and the save-confirmation popup waits for the whole round trip.

Add a wake-driven background reload worker to `DeviceScanner` and switch the save path to `scanner.request(filename)` — the WS reply acks on the atomic write alone, and `DEVICE_UPDATED` follows on the worker's next drain via the existing on-change pipeline. The worker shares a single `WakeWorker` primitive (`controllers/_wake_worker.py`) with `BuildSizeRefresher`, which already had the same pending-set + `asyncio.Event` + start/stop shape duplicated inline.

Drift-detection callers (`list_devices`, `start`, `delete`, `archive`, `unarchive`, `create`, `clone`, the REST shim's `poll()`) keep `await scanner.scan()`; the scanner's lock serialises them against the worker so semantics are preserved.
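For orientation, the save-path change in sketch form. `_persist_yaml_mutation` and `_write_yaml_atomic_async` are named in this PR, but the signatures and surrounding class below are assumptions, not the repo's code verbatim:

```python
# Sketch only: parameter names and the enclosing controller class are illustrative.

# Before: the WS reply waited for the full fleet rescan.
async def _persist_yaml_mutation(self, configuration: str, content: str) -> None:
    await self._write_yaml_atomic_async(configuration, content)
    await self._scanner.scan()  # hundreds of ms of YAML resolve on large configs

# After: ack on the atomic write; the reload happens on the worker's next drain.
async def _persist_yaml_mutation(self, configuration: str, content: str) -> None:
    await self._write_yaml_atomic_async(configuration, content)
    self._scanner.request(configuration)  # fire-and-forget; DEVICE_UPDATED follows shortly after
```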
Pull request overview
Decouples the YAML editor's save reply from the full scanner reload. `devices/update_config` previously awaited `scanner.scan()` on the WS hot path, which on large `!include`/`packages:` configs blocked the save-confirmation popup for hundreds of ms. The PR introduces a reusable `WakeWorker[T]` primitive, adds a wake-driven background reload worker to `DeviceScanner`, and switches `_persist_yaml_mutation` to a fire-and-forget `scanner.request(configuration)`. `BuildSizeRefresher` is refactored onto the same primitive without behavioural change. Drift-detection callers that need a post-scan view keep `await scanner.scan()`; the scanner's internal lock serialises them against the new worker.
Changes:
- New `WakeWorker[T]` primitive (`controllers/_wake_worker.py`) owning the pending set, wake event, and start/stop lifecycle.
- `DeviceScanner` gains `start`/`stop`/`request` and an internal `_run` drain loop; `_persist_yaml_mutation` calls `request()` instead of awaiting `scan()`.
- `BuildSizeRefresher` migrated to `WakeWorker`; tests updated to reach through `refresher._worker`.
- `RecordingScanner` test fake gains `request`/`start`/`stop` stubs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| esphome_device_builder/controllers/_wake_worker.py | New WakeWorker[T] scaffold (pending set + wake event + task lifecycle). |
| esphome_device_builder/controllers/_device_scanner.py | Adds start/stop/request + _run drain loop on top of the new worker. |
| esphome_device_builder/controllers/_build_size_refresher.py | Replaces inline pending/wake/task logic with the shared WakeWorker. |
| esphome_device_builder/controllers/devices/controller.py | Starts/stops the scanner worker; switches _persist_yaml_mutation from scan() to request(). |
| tests/controllers/test_wake_worker.py | New unit tests for the primitive. |
| tests/test_device_scanner_request.py | New unit tests for the scanner's wake-driven reload path. |
| tests/controllers/test_build_size_refresher.py | Updates internal-attr accesses to go through _worker. |
| tests/controllers/devices/test_get_update_config.py | Asserts ("request", ...) on the scanner, rewires the ordering test. |
| tests/controllers/devices/test_edit_friendly_name.py | Asserts ("request", ...) and drives the real scanner's worker in the e2e path. |
| tests/_recording_scanner.py | Adds request/start/stop stubs to the test fake. |
* `test_update_config_writes_before_requesting_reload` no longer reads the YAML inside a sync `request()` callback — that trips blockbuster on Linux CI. Instead, monkeypatch both `_write_yaml_atomic_async` and `scanner.request` to record the call order and assert write-then-request.
* Edit-friendly-name end-to-end test and the new scanner request tests previously waited on internal worker flags (`pending`, `_wake`, lock state). There's a window where all three look idle but the worker hasn't actually run `reload()` yet. Replace with a deadline-based `_wait_for(predicate)` that polls the user-observable outcome (the device's friendly_name, the recorded reload list).
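A deadline-based helper of that shape looks roughly like this (a sketch; the repo's `_wait_for` may differ in timeout, poll interval, and failure message):

```python
import asyncio
import time
from collections.abc import Callable

async def _wait_for(predicate: Callable[[], bool], *, timeout: float = 5.0) -> None:
    """Poll a user-observable predicate until it holds or the deadline passes."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise AssertionError("condition not reached before deadline")
        await asyncio.sleep(0.01)
```

Tests then park on outcomes, something like `await _wait_for(lambda: device.friendly_name == expected)`, instead of on worker internals.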
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #890      +/-   ##
==========================================
- Coverage   99.32%   99.31%   -0.01%
==========================================
  Files         189      190       +1
  Lines       13763    13823      +60
==========================================
+ Hits        13670    13729      +59
- Misses         93       94       +1
```
Flags with carried forward coverage won't be shown.
Tests previously polled internal worker flags (``pending`` / ``_wake.is_set`` / ``_lock.locked``) to wait for the worker to drain. That has a race window: between "pending cleared" and "reload actually starts," all three flags look idle but the work hasn't happened. ``WakeWorker`` now exposes:

* ``drain()`` async context manager that pairs the wake-receive with the idle-set on exit (so a run loop can't forget to signal completion).
* ``wait_idle()`` that parks until pending is empty AND no drain is in progress.
* ``stop()`` always sets idle on the way out so any ``wait_idle`` parked through shutdown unblocks rather than hanging.

Scanner and BuildSizeRefresher both wrap their per-iteration work in ``async with self._worker.drain():`` and tests use ``await worker.wait_idle()`` for synchronization.

Race-safe under asyncio's single-thread model:

* ``request()``'s pending.add + idle.clear + wake.set is one sync block, can't interleave with anything else.
* ``drain()``'s finally (``if not self.pending: self._idle.set()``) is one sync block, can't interleave with a concurrent request.
* A request that lands mid-drain is visible to the finally's ``if not self.pending`` check, so idle stays cleared until the next drain handles it.
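In sketch form, that surface is roughly the following. Attribute names beyond those quoted above are guesses, ``start()`` (task creation) is omitted, and the real class carries more than this:

```python
import asyncio
import contextlib
from contextlib import asynccontextmanager
from typing import Generic, TypeVar

T = TypeVar("T")

class WakeWorker(Generic[T]):
    """Pending set + wake event + idle event, per the description above (sketch only)."""

    def __init__(self) -> None:
        self.pending: set[T] = set()
        self._wake = asyncio.Event()
        self._idle = asyncio.Event()
        self._idle.set()  # nothing queued yet
        self._task: asyncio.Task[None] | None = None  # created by start(), not shown here

    def request(self, item: T) -> None:
        # One synchronous block: add the item, clear idle, wake the drain loop.
        self.pending.add(item)
        self._idle.clear()
        self._wake.set()

    @asynccontextmanager
    async def drain(self):
        # Pairs the wake-receive with the idle-set on exit, so a run loop
        # cannot forget to signal completion.
        await self._wake.wait()
        self._wake.clear()
        try:
            yield
        finally:
            if not self.pending:
                self._idle.set()

    async def wait_idle(self) -> None:
        # Parks until pending is empty and no drain is in progress.
        await self._idle.wait()

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._task
            self._task = None
        self._idle.set()  # unblock any wait_idle parked through shutdown
```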
Both ``DeviceScanner`` and ``BuildSizeRefresher`` had identical ``request`` / ``start`` / ``stop`` pass-throughs forwarding to a composed ``WakeWorker``, plus identical ``while True: async with self._worker.drain():`` boilerplate in their run loops. Switch to inheritance: ``WakeWorker`` now owns the loop and exposes ``_drain`` (required) and ``_on_start`` (optional no-op) hooks.

Consumers:

* ``DeviceScanner(WakeWorker[str])`` — drops the three pass-throughs, replaces ``_run`` with ``_drain``.
* ``BuildSizeRefresher(WakeWorker[str])`` — same, plus ``_on_start`` for the phase-A fleet sweep.

Tests:

* ``refresher._worker.X`` → ``refresher.X``; same for the scanner request tests. ``test_wake_worker.py`` rewritten to subclass the base for each scenario.
* ``RecordingScanner.start`` is now sync to match production.
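Roughly, the post-refactor shape (building on the sketch above; the internal cycle-manager name is taken from a later commit, and the subclass body is illustrative rather than the scanner's actual drain code):

```python
class WakeWorker(Generic[T]):
    # ...state plus request() / wait_idle() / stop() as sketched earlier...

    async def _on_start(self) -> None:
        """Optional hook run once before the drain loop; no-op by default."""

    async def _drain(self) -> None:
        """Required hook: consume self.pending; every consumer overrides this."""
        raise NotImplementedError

    async def _run_loop(self) -> None:
        await self._on_start()
        while True:
            async with self._drain_cycle():
                await self._drain()


class DeviceScanner(WakeWorker[str]):
    async def _drain(self) -> None:
        while self.pending:  # pop-as-you-go: requests that land mid-drain get picked up too
            filename = self.pending.pop()
            await self.reload(filename)
```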
@bluetoothbot review
* ``WakeWorker._run_loop`` now wraps each ``_drain`` call in a ``try/except Exception`` so a programming bug in a subclass's drain body doesn't kill the worker and strand every ``wait_idle`` waiter. Tested by ``test_drain_exception_logged_and_loop_continues``.
* ``WakeWorker.start`` retrieves and logs the prior task's exception when a crashed task is replaced, so the failure mode is visible in logs and doesn't surface as "Task exception was never retrieved" at GC. Tested by ``test_start_logs_and_replaces_crashed_task``.
* ``DeviceScanner._drain`` debug-logs when ``reload`` returns ``False`` (the queued YAML vanished between request and drain time) so the silent-skip case is observable.
* ``RecordingScanner`` docstring now lists the full mirrored surface (``request`` / ``start`` / ``stop``) and explicitly pins the sync/async signatures so a future drift surfaces.
* ``test_wake_worker.py`` drops the lone ``@pytest.mark.asyncio`` marker (the repo is auto-mode) and types ``caplog`` as ``pytest.LogCaptureFixture``.
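The crash-surfacing ``start`` is roughly this shape (a sketch assuming the worker keeps its task in ``self._task``; the logger name and message are illustrative):

```python
import asyncio
import logging

_LOGGER = logging.getLogger(__name__)

class WakeWorker:  # sketch: generics and the rest of the class omitted
    _task: asyncio.Task[None] | None = None

    def start(self) -> None:
        prior = self._task
        if prior is not None and prior.done() and not prior.cancelled():
            exc = prior.exception()  # retrieving it also avoids "Task exception was never retrieved" at GC
            if exc is not None:
                _LOGGER.error("%s task crashed; restarting: %r", type(self).__name__, exc)
        self._task = asyncio.create_task(self._run_loop())

    async def _run_loop(self) -> None:  # stub so the sketch stands alone
        ...
```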
Codecov flagged line 78 (the abstract ``raise NotImplementedError`` in the base ``_drain``) as uncovered — every concrete subclass in production and tests overrides it. A direct unit test now pins the contract so the abstract default stays covered at 100%.
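That pin is small; assuming the base class is directly instantiable and ``_drain`` takes no arguments, it is essentially:

```python
import pytest

from esphome_device_builder.controllers._wake_worker import WakeWorker

async def test_base_drain_raises_not_implemented() -> None:
    worker: WakeWorker[str] = WakeWorker()
    with pytest.raises(NotImplementedError):
        await worker._drain()
```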
PR Review — Decouple YAML save reply from full scanner reload

Solid PR. The …

One remaining footgun worth addressing before merge:

🟢 Suggestions

1. `_on_start` exception is not guarded — can deadlock `wait_idle` (`esphome_device_builder/controllers/_wake_worker.py`, L89-99)

Today this is masked because …

Suggested fix: wrap …

```python
async def _run_loop(self) -> None:
    try:
        await self._on_start()
    except Exception:
        _LOGGER.exception("Worker %s _on_start raised; continuing", type(self).__name__)
    if not self.pending:
        self._idle.set()
    while True:
        ...
```

Alternatively, tighten the contract in the …

2. Worth a one-line note: request/regen ordering is now non-deterministic (`esphome_device_builder/controllers/devices/controller.py`, L779-786)

Before this PR, …

Whether the worker reads StorageJSON's …

3. `request()` clears `_idle` even when no worker is running (`esphome_device_builder/controllers/_wake_worker.py`, L30-34)

In practice this is fine because the only production caller paths run through a fully-started controller, and …

Not a blocker; the current callers are all gated behind controller startup. Worth either a one-line note on …

Checklist

Summary

Solid PR. The …

One remaining footgun worth addressing before merge: …

Automated review by Kōan7614520
bluetoothbot flagged: if ``_drain`` raises while items are still in ``self.pending``, ``_drain_cycle.__aexit__`` neither sets ``_idle`` nor re-arms ``_wake``. The worker parks on ``_wake.wait()`` forever and any ``wait_idle`` waiter is stranded. Reachable in ``BuildSizeRefresher._drain`` because ``_persist_size`` is a sync callback outside the per-item try.

Three fixes:

* ``_drain_cycle`` now re-arms ``_wake`` when ``pending`` is non-empty on exit. ``Event.set()`` is idempotent so the normal "concurrent request set wake during drain" path is unchanged.
* ``start`` clears ``_idle`` so a ``wait_idle`` issued right after ``start`` parks past ``_on_start``; ``_run_loop`` re-sets idle when ``_on_start`` queued no work.
* ``_drain`` docstring spells out the swap-empty vs pop-as-you-go patterns and the contract that the base preserves leftover pending across exception-exiting drains.

Three new tests pin the contracts:

* ``test_drain_raising_with_items_pending_drains_remainder``
* ``test_wait_idle_after_start_with_no_on_start_work``
* ``test_wait_idle_after_start_waits_for_on_start_work``
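In sketch form, the re-arm looks like this (the exact ``_drain_cycle`` body may differ):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def _drain_cycle(self):  # method on WakeWorker (sketch)
    await self._wake.wait()
    self._wake.clear()
    try:
        yield
    finally:
        if self.pending:
            # Leftover work, e.g. _drain raised mid-batch: re-arm so the loop wakes
            # again instead of parking forever. Event.set() is idempotent, so the
            # normal "request landed mid-drain" path is unchanged.
            self._wake.set()
        else:
            self._idle.set()
```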
Thanks for the careful read — warning #1 is a real deadlock and you're right that the existing test missed it because …
@bluetoothbot review
Self-review found three CLAUDE.md drifts in the last batch: ``_drain`` docstring was 19 lines of teaching material when the contract is one sentence, ``_drain_cycle`` carried a multi-line "without this we used to deadlock" rationale that belongs in the PR body, and ``start`` / ``_run_loop`` had multi-line inline comments that say what a one-liner would. Net −33 lines; behavior unchanged, tests + coverage unchanged.
bluetoothbot flagged three follow-ups on top of the deadlock fix in 75bf618:

* ``_run_loop`` now wraps ``_on_start`` in the same ``try/except Exception`` as ``_drain``. Symmetric defense against a subclass's ``_on_start`` raising and stranding every ``wait_idle`` waiter behind the cleared ``_idle`` event.
* ``WakeWorker.request`` docstring spells out the worker-must-be-running invariant.
* ``_persist_yaml_mutation`` docstring makes the fire-and-forget contract explicit so the next reader doesn't try to reason about the post-reload device row before the next event tick.

Regression test ``test_on_start_exception_logged_and_loop_continues`` pins the new guard.
Thanks for the follow-up — all three landed in 06890ca: …
What does this implement/fix?
`devices/update_config` (the YAML editor's Save button) previously awaited a full `scanner.scan()` on the WS hot path. For ~700-line configs, especially ones with `!include`/`packages:`, the ESPHome YAML resolve inside `load_device_yaml` is hundreds of ms of CPU on the worker thread, and the save-confirmation popup waits for the whole round trip. (Reported in Discord by MasterOfNone.)
This PR moves the post-save reload off the WS reply:
- `DeviceScanner` gains a wake-driven background reload worker built on a new `WakeWorker[T]` primitive at `controllers/_wake_worker.py`. The worker drains a pending filename set when its `asyncio.Event` is set; reloads run under the existing scanner lock and fire `DEVICE_UPDATED` via the on-change pipeline.
- `_persist_yaml_mutation` switches from `await scanner.scan()` to `scanner.request(configuration)`. The save reply acks on the atomic disk write alone, and the device row refresh reaches subscribed clients moments later via the existing event bus.
- `BuildSizeRefresher` already had the same pending-set + wake-event + start/stop shape inlined; it now shares the new `WakeWorker` primitive too. Behaviour preserved (drain still uses `pop()`-style for the mid-walk-request semantics the original called out; see the sketch after this list).
- Drift-detection callers (`list_devices`, `start`, `delete`, `archive`, `unarchive`, `create`, `clone`, the REST shim's `poll()`) keep `await scanner.scan()`; they need the post-scan view in their response. The scanner's internal lock serialises them against the worker so semantics are preserved.
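For the `pop()`-style point above, a sketch of the two drain shapes; the names are generic and `process()` is a stand-in for the real per-item work, not the repo's method:

```python
async def drain_pop_as_you_go(worker) -> None:
    # Requests that land while this walk is running are handled in the same
    # pass, because each iteration re-reads worker.pending.
    while worker.pending:
        item = worker.pending.pop()
        await worker.process(item)

async def drain_swap_and_clear(worker) -> None:
    # Snapshot-and-clear: a request that lands mid-walk waits for the next
    # wake/drain cycle instead of extending the current one.
    batch, worker.pending = worker.pending, set()
    for item in batch:
        await worker.process(item)
```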
Related issue or feature (if applicable):
Types of changes
- bugfix
- new-feature
- enhancement
- breaking-change
- refactor
- docs
- maintenance
- ci
- dependencies

Frontend coordination
The wire shape of `devices/update_config` is unchanged; it still returns `None`. The frontend already receives the post-save device-row refresh via the existing `device_updated` event, so the "save, then see the device update" UX continues to work; it just arrives a fraction of a second after the popup instead of being blocked behind the reload.
Checklist
- … (`ruff`, `codespell`, yaml/json/python checks).
- … `tests/` where applicable.
- `components.json` has not been hand-edited (regenerate via `script/sync_components.py` if a sync is needed).
- … `docs/ARCHITECTURE.md` and/or `docs/API.md`.