Problem Statement
Users of hal0 keep hitting chat-completions failures (502 dispatch.upstream_unavailable, 503 slot.loading at the wrong moments, mysterious latency spikes) because hal0 and the embedded Lemonade daemon hold overlapping state machines with eventually-consistent reconciliation. The dispatcher reads from hal0's local copy of "is this slot alive?" while the truth lives in Lemonade. Whenever the two views diverge (silent eviction, port re-allocation after a respawn, mid-request peer-drop, configuration split-brain), the user (or their agent) gets a failure that shouldn't exist.
We have shipped 3 patches in 7 days against symptoms of this single root cause:
The pending port-discovery fix (catches port-drift after orphan-process respawn)
This pattern will keep producing bugs until the architectural shape changes.
Solution
Introduce a single typed boundary (LemondAdapter) that is the only module in hal0 which talks to Lemonade. Drive hal0's slot state from Lemonade events (WebSocket /logs/stream + log-line parsing) instead of polling. Compute dispatcher routing from the adapter at dispatch time so port drift becomes impossible by construction. Move idle-eviction policy from Lemonade to hal0 so every eviction is initiated by us, never sprung on us.
The user sees: zero dispatch.upstream_unavailable for transient eviction events; only structured 503 slot.loading + Retry-After envelopes (which their OpenAI-compatible SDKs already handle); predictable first-call latency after agent session start.
Seven phases (0–6), each independently shippable. Phase 0 is a 40-LOC hotfix that ships this week. Phases 1–3 are the core architectural change (~3 weeks). Phases 4–6 are platform polish (~3–5 weeks).
User Stories
As a hal0 dashboard user, I want chats to succeed even when Lemonade has silently evicted the model, so that I don't see "ConnectError" 502s in the chat panel.
As an OpenAI-compatible SDK user (Python openai library, curl, agent harness), I want all transient backend issues to return 503 with Retry-After, so that my SDK's built-in retry logic handles the back-off cleanly.
As an OpenAI-compatible SDK user, I never want to receive a 502 dispatch.upstream_unavailable for a slot that hal0 can reach with a single internal re-load, so that I don't have to implement custom error handling for hal0-specific failure modes.
As an agent provider (Hermes, pi-coder, external MCP agent), I want predictable first-call latency after my session starts, so that warm-load happens before my first inference call rather than turning into a 30-second surprise.
As an agent provider, I want multi-modality chains (embed → chat → rerank) to succeed even if one slot was idle-evicted between calls, so that I don't have to manually pre-warm every slot before chaining.
As an agent author, I want streaming responses to never be cut mid-stream by an eviction, so that my agent doesn't have to handle "stream truncated, retry from scratch" semantics.
As an agent author, I want idempotent retry semantics on the inference path, so that retrying a failed dispatch never causes a double-execution of side effects (tool calls, etc.).
As an operator (the user running hal0 on their LXC), I want lemonade version upgrades to require touching at most one hal0 file, so that I'm not chasing schema deltas across LemonadeClient, LemonadeProvider, flm_trio, npu_swap_status, the idle driver, and the metrics shim individually.
As an operator, I want a Grafana-style dashboard of adapter SLOs (time-to-route-stable, eviction-recovery-time, /v1/load p95/p99, WS reconnection rate), so that I can detect upstream lemonade regressions before users do.
As an operator, I want dispatch.upstream_dead_attempting_recover to be a near-zero metric in production, so that I have evidence the architecture isn't compensating for state drift.
As an operator, I want hal0 to own when slots get evicted (not Lemonade's internal policy), so that I can tune idle eviction to my workload (long-form coding sessions vs. short Q&A) without changing lemond config.
As an operator, I want eviction policy decisions to be observable in the journal (slot.evicted_by_idle_policy with reason), so that I can diagnose "why did my slot unload?" without reading lemond logs.
As a hal0 contributor, I want the LemondAdapter to be the ONLY module that imports httpx for talking to lemond, so that adding a new failure mode (timeout, retry policy, circuit breaker) requires one file change, not seven.
As a hal0 contributor, I want all lemonade-related types (SlotEvent, RouteInfo, LoadedModel) to live in src/hal0/lemonade/ and be re-exported through a stable module API, so that I never have to grep for which file owns a given lemonade concept.
As a hal0 contributor, I want the dispatcher's _recover_evicted_slot recovery branch to be deleted in phase 3, so that the "two systems disagree" failure mode is impossible by construction rather than papered-over by retry logic.
As a hal0 contributor reading the v0.3 dispatcher today, I want to find one explicit document explaining the control-plane / data-plane split (ADR-0008 + this PRD's resulting ADR), so that I don't reverse-engineer the design from log lines and PR descriptions.
As a future-me debugging a state drift, I want the SlotManager's state for lemonade-backed slots to be derived (state = compute_from_events(history)) rather than authored (state = self._transition(...)), so that there is exactly one place to look when state is wrong.
As the AFK agent picking up phase 2, I want a working /logs/stream regex test suite + captured corpus before I commit, so that I don't ship a parser that breaks on the next lemonade release.
As the AFK agent, I want each phase to ship behind no feature flag (the adapter introduction is a pure refactor; subsequent phases preserve external behavior until the recovery branch is removed in phase 3), so that backing out a phase is git revert, not "untangle a flag from 5 codepaths."
As a Strix Halo home-AI user (the primary hal0 audience), I want hal0 to feel reliable enough that I run it as my primary inference endpoint for daily coding work, so that I'm not falling back to a cloud provider every time I hit a 502.
As a Strix Halo home-AI user, I want the dashboard's slot indicator dots to reflect ground truth (lemonade's view) within ~100ms of a state change, so that I can trust the UI when deciding whether to send another request.
As a Strix Halo home-AI user opening a new session, I want primary/embed/rerank to be warm-loaded in parallel within the same boot window, so that my first chat doesn't wait for serial cold-loads.
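As a hal0-memory or hal0-admin MCP client, I want the embedding slot to be reliably available for memory.add / memory.search calls, so that my agent's memory operations don't intermittently fail due to slot state confusion.
As a hal0-memory MCP user with the rerank slot wired (PR #365, feat(memory): pin embedding model + wire rerank slot into memory_search), I want the rerank slot's lifecycle to follow the same warm/idle policy as primary, so that memory_search isn't slower than primary chat for no reason.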
As a developer running the δ-tier harness (make harness), I want slot lifecycle tests to validate the new adapter-driven model, so that the harness catches regressions before they reach production.
Implementation Decisions
Modules to be built
LemondAdapter (new, deep module) — src/hal0/lemonade/adapter.py. Single entry point for all hal0 → lemonade interactions. Owns: httpx client to :13305, WebSocket subscription to /logs/stream, TTL cache for /v1/health, event fan-out to subscribers. Replaces direct LemonadeClient usage scattered across flm_trio.py, npu_swap_status.py, the idle driver, the metrics shim. Designed to be the only file that changes when lemonade ships a breaking schema update.
SlotEvent types (new) — src/hal0/lemonade/events.py. Typed dataclasses for ModelLoaded(name, port, pid), ModelEvicted(name, reason), ProcessCrashed(name, pid), LoadFailed(name, reason), PortChanged(name, old_port, new_port). These are the contract between the adapter and its subscribers.
LogStreamSubscriber (new, deep module) — src/hal0/lemonade/log_stream.py. Owns the WebSocket connection to /logs/stream, reconnection-with-backoff, and the regex parser that converts log lines into SlotEvents. Falls back to polling on WS unavailable.
RouteCache (new) — TTL-cached view of /v1/health.all_models_loaded[] mapping model_name → (port, pid, alive). Invalidated by SlotEvents.
PortValidator (new, phase 3) — lightweight TCP probe with its own TTL cache. The dispatcher's pre-dispatch alive check.
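The RouteCache described in the list above can be pictured roughly as follows. This is a minimal sketch, not the planned implementation: the `fetch` callback stands in for whatever reads /v1/health, the 2-second default TTL is an arbitrary assumption, and the `RouteInfo` fields are taken from the `(port, pid, alive)` tuple named in the text.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class RouteInfo:
    port: int
    pid: int
    alive: bool


class RouteCache:
    """TTL-cached model_name -> RouteInfo map, refreshed through a fetch callback."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, RouteInfo]],  # hypothetical: wraps GET /v1/health
        ttl_s: float = 2.0,                          # assumed value, not from the PRD
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._routes: Dict[str, RouteInfo] = {}
        self._fetched_at: Optional[float] = None

    def invalidate(self) -> None:
        """Call when a SlotEvent arrives; the next lookup refetches."""
        self._fetched_at = None

    def route_for(self, model_name: str) -> Optional[RouteInfo]:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl_s:
            self._routes = self._fetch()
            self._fetched_at = now
        return self._routes.get(model_name)
```

Injecting the clock and the fetch callback keeps the cache testable without a running lemond; an event subscriber would only need to call `invalidate()`.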
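Modules to be modified
Dispatcher (router.py): computes Upstream.url at dispatch time from adapter.route_for(slot) rather than reading a static value baked at startup. The recovery branch from PR #392 (fix(dispatch): recover slot after silent Lemonade eviction) is removed in phase 3 in favor of pre-dispatch port validation. The swap-window gate from PR #385 (fix(dispatch): gate slot forwards during swap window with structured 503) stays.
SlotManager (manager.py): for lemonade-backed slots, becomes a coordinator that subscribes to SlotEvents instead of polling. The recover_evicted_slot() helper is removed in phase 3. Idle policy (when to evict) moves here in phase 4.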
LemonadeProvider (providers/lemonade.py) — routes through LemondAdapter. In phase 0, immediately adds port-read-back-from-/v1/health after load so Upstream.url is freshly known.
UpstreamRegistry — Upstream.url becomes mutable (or, equivalently, computed by a callback). Required because the post-load port may differ from startup config.
flm_trio.py, npu_swap_status.py, lemonade/idle.py, lemonade/metrics_shim.py, lemonade/client.py — fold into adapter or route through it. lemonade/client.py likely remains as the low-level httpx wrapper that the adapter composes.
Interfaces (the deep-module contracts)

```python
from dataclasses import dataclass
from typing import Literal


# SlotEvent: the contract between adapter and subscribers
@dataclass
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass
class ModelEvicted:
    name: str
    reason: Literal["idle", "oom", "manual", "load_failure_cascade", "unknown"]


# (etc.; full list in `src/hal0/lemonade/events.py`)
```
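To illustrate the "derived, not authored" goal for SlotManager state, here is a hedged sketch of folding an event history into a slot state. The event classes repeat the fields listed for the SlotEvent types; the mapping from events to READY/OFFLINE/ERROR and the per-slot `name` argument are assumptions made for this example, not decisions recorded in this PRD.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple, Union


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass(frozen=True)
class ModelEvicted:
    name: str
    reason: str


@dataclass(frozen=True)
class ProcessCrashed:
    name: str
    pid: int


@dataclass(frozen=True)
class PortChanged:
    name: str
    old_port: int
    new_port: int


SlotEvent = Union[ModelLoaded, ModelEvicted, ProcessCrashed, PortChanged]


def compute_from_events(name: str, history: Iterable[SlotEvent]) -> Tuple[str, Optional[int]]:
    """Fold the event history for one slot into (state, port).

    The state mapping below is an illustrative assumption.
    """
    state, port = "OFFLINE", None
    for ev in history:
        if ev.name != name:
            continue  # events for other slots do not affect this one
        if isinstance(ev, ModelLoaded):
            state, port = "READY", ev.port
        elif isinstance(ev, PortChanged) and state == "READY":
            port = ev.new_port
        elif isinstance(ev, ModelEvicted):
            state, port = "OFFLINE", None
        elif isinstance(ev, ProcessCrashed):
            state, port = "ERROR", None
    return state, port
```

Because the function is pure, replaying a captured history is enough to reproduce a suspected state-drift bug.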
Architectural decisions
Control-plane / data-plane split preserved. The dispatcher still forwards inference traffic directly to the child llama-server (e.g. :8001), not via lemond. This was the design call in lemonade/client.py:5-6 and the alternative (route data plane through lemond :13305) is explicitly rejected: lemond is not designed to be a streaming reverse proxy under load.
Lemond is the source of truth for slot lifecycle. hal0 derives. The state.json files become cache/restore hints, not authoritative state.
Events are reactive, polling is the safety net. WebSocket /logs/stream drives state in real time (phase 2); a 30s /v1/health poll runs as the heartbeat that catches missed events.
Recovery becomes unnecessary by construction. Phase 3 deletes the dispatcher's recovery branch. If the adapter's pre-dispatch validation says "alive," the request goes. If "not alive," SlotLoading 503 fires. There is no in-band recovery on the request path.
Idle eviction policy lives in hal0, not lemond. Phase 4 has hal0 calling /v1/unload proactively. This removes the "lemonade silently evicted" class entirely because every eviction is initiated by us.
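The "alive, so forward; not alive, so 503" rule from the decisions above could look roughly like this. It is a sketch under stated assumptions: `Route`, `Forward`, `Reject`, `plan_dispatch`, the JSON error shape and the 5-second Retry-After default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union


@dataclass(frozen=True)
class Route:
    port: int
    alive: bool


@dataclass(frozen=True)
class Forward:
    upstream_url: str


@dataclass(frozen=True)
class Reject:
    status: int
    headers: Dict[str, str]
    body: dict


def plan_dispatch(
    slot: str,
    route_for: Callable[[str], Optional[Route]],  # hypothetical adapter lookup
    retry_after_s: int = 5,                        # assumed default
) -> Union[Forward, Reject]:
    """Decide before forwarding; there is deliberately no recovery branch."""
    route = route_for(slot)
    if route is not None and route.alive:
        # The URL is computed at dispatch time, so a respawn on a new port is picked up.
        return Forward(upstream_url=f"http://127.0.0.1:{route.port}")
    return Reject(
        status=503,
        headers={"Retry-After": str(retry_after_s)},
        body={"error": {"code": "slot.loading", "message": f"slot {slot!r} is loading"}},
    )
```

Returning a value instead of raising keeps the decision easy to assert on in dispatcher tests.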
Phase 0 ships independently. The 40-LOC port-discovery patch lands as v0.3.1 hotfix this week. It kills the port-drift bug without any architectural change.
Phasing (each phase is independently shippable and reversible)
Phase 0 (1 PR, this week, targets v0.3.1): fix(lemonade): port discovery after load. _await_ready reads backend_url and updates slot.port + Upstream.url.
Phase 1 (2–3 PRs, 1 week): Introduce LemondAdapter as a pass-through; consolidate scattered httpx callers; add TTL cache.
Phase 2: WebSocket /logs/stream subscription + log-line regex parser. SlotManager subscribes to events; polling becomes the safety net.
Phase 3: Pre-dispatch port validation (PortValidator); the dispatcher's recovery branch and the recover_evicted_slot() helper are removed.
Phase 4: Idle-eviction policy moves to hal0, which calls /v1/unload. Warm-load on agent session start.
Phase 5: Slot state for lemonade-backed slots is derived from events; state.json becomes a 1-hour-TTL snapshot.
Phase 6: /api/v1/adapter/events SSE stream for dashboard clients.
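API contracts (external)
/v1/chat/completions, /v1/embeddings, /v1/rerank, etc.: unchanged. Same OpenAI-compat envelope. Same 503 slot.loading + Retry-After semantics from PR #385 (fix(dispatch): gate slot forwards during swap window with structured 503).
/api/slots: unchanged response shape. The state field still surfaces PULLING/STARTING/WARMING/READY/SERVING/IDLE/OFFLINE/ERROR, just computed differently underneath.
New /api/v1/adapter/events SSE stream (phase 6): exposes the SlotEvent stream to dashboard clients so the slot indicator dots react in real time.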
Schema changes
state.json semantics change in phase 5: becomes a 1-hour-TTL snapshot for fast dashboard startup, not authoritative. Older state.json files are forward-compatible (the adapter will reconcile against lemond on startup).
No database schema changes.
No breaking changes to slot TOML config.
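As a rough illustration of state.json becoming a non-authoritative snapshot, the sketch below returns the file contents only as a startup hint and ignores it once it is older than the 1-hour TTL. Using the file's modification time as the snapshot age is an assumption, and `load_snapshot_hint` is a hypothetical helper name.

```python
import json
import time
from pathlib import Path
from typing import Optional

SNAPSHOT_TTL_S = 3600  # the 1-hour TTL named in the text


def load_snapshot_hint(path: Path, now: Optional[float] = None) -> Optional[dict]:
    """Return the snapshot as a hint, or None if it is absent, stale or unreadable."""
    now = time.time() if now is None else now
    try:
        age_s = now - path.stat().st_mtime  # assumption: mtime marks snapshot age
        if age_s > SNAPSHOT_TTL_S:
            return None
        return json.loads(path.read_text())
    except (OSError, ValueError):
        # A missing file or invalid JSON is not an error: lemond stays the source of truth.
        return None
```

A `None` result simply means the dashboard waits for the adapter's first reconciliation against lemond instead of showing a cached view.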
Testing Decisions
What makes a good test
Test external behavior, not implementation. Assert on the contract surface: "given a slot evicted by lemond, the next /v1/chat/completions returns 200 within N seconds" — not "verify _serving_enter was called twice."
Captured /v1/health fixtures. Lemond's response shape is the contract surface; check in real captured responses as test fixtures so phase 1's refactor is provably behavior-preserving.
Log-corpus tests for phase 2. A captured corpus of real /logs/stream lines per recent lemond release (10.5, 10.6, 10.7-rc), tagged with expected SlotEvent outputs. The regex parser is locked to this corpus; new lemond releases require updating the corpus.
Property tests for the state-derivation logic in phase 5. Given a random sequence of SlotEvents, the derived UX state matches a reference implementation. Catches edge cases in event ordering.
No mocking of lemond's HTTP layer in unit tests. Use httpx.MockTransport with realistic responses (the pattern already established in tests/dispatcher/test_forward.py and tests/dispatcher/test_serving_integration.py).
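A log-corpus test of the kind described above might be shaped like the following sketch. The log-line formats here are invented placeholders, since the real /logs/stream formats are exactly what the phase 2 spike has to capture; only the corpus-driven structure is the point.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass(frozen=True)
class ModelEvicted:
    name: str
    reason: str


# HYPOTHETICAL line formats: the real ones must come from the captured corpus.
_LOADED = re.compile(r"model loaded name=(?P<name>\S+) port=(?P<port>\d+) pid=(?P<pid>\d+)")
_EVICTED = re.compile(r"model evicted name=(?P<name>\S+) reason=(?P<reason>\w+)")


def parse_line(line: str) -> Optional[Union[ModelLoaded, ModelEvicted]]:
    """Convert one log line into a SlotEvent, or None for unrelated lines."""
    m = _LOADED.search(line)
    if m:
        return ModelLoaded(m["name"], int(m["port"]), int(m["pid"]))
    m = _EVICTED.search(line)
    if m:
        return ModelEvicted(m["name"], m["reason"])
    return None


# Each corpus entry pairs a captured line with the event it must produce.
CORPUS = [
    ("[info] model loaded name=primary port=8001 pid=4242", ModelLoaded("primary", 8001, 4242)),
    ("[info] model evicted name=embed reason=idle", ModelEvicted("embed", "idle")),
    ("[debug] heartbeat ok", None),
]


def check_corpus() -> None:
    for line, expected in CORPUS:
        assert parse_line(line) == expected, line
```

Keeping one corpus per lemond release would let a parser regression show up as a failing line rather than as a silent missed event.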
Modules to be tested
LemondAdapter: unit tests against httpx.MockTransport + captured fixtures. Test that route_for() returns fresh port info after a ModelLoaded event invalidates the cache.
LogStreamSubscriber: regex parser tested against captured log corpus (one test per SlotEvent variant). Reconnection logic tested with a mock WS server.
PortValidator: tested with a real ephemeral TCP listener (bind, validate alive, close, validate dead).
Dispatcher: existing tests/dispatcher/test_serving_integration.py extended with adapter-event-driven scenarios. In phase 3, the recovery-branch tests added by PR #392 (fix(dispatch): recover slot after silent Lemonade eviction) are deleted (the branch they cover no longer exists).
SlotManager: existing tests/slots/ extended with event-subscription tests. The recover_evicted_slot() tests are deleted in phase 3.
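The "bind, validate alive, close, validate dead" test for PortValidator can be sketched with the standard library alone. `port_alive` and `demo_bind_validate_close` are hypothetical names, the 0.2-second timeout is an assumption, and the TTL cache mentioned for PortValidator is left out to keep the example short.

```python
import socket
from typing import Tuple


def port_alive(port: int, host: str = "127.0.0.1", timeout_s: float = 0.2) -> bool:
    """Return True if a TCP connect to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def demo_bind_validate_close() -> Tuple[bool, bool]:
    """Bind an ephemeral listener, probe it, close it, then probe again."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    alive_while_open = port_alive(port)
    listener.close()
    alive_after_close = port_alive(port)
    return alive_while_open, alive_after_close
```

A connect-only probe says the port accepts connections, not that the model behind it is healthy, which is why the text pairs it with event-driven state rather than relying on it alone.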
Prior art in the codebase
tests/dispatcher/test_forward.py — httpx.MockTransport pattern for dispatcher unit tests.
tests/dispatcher/test_serving_integration.py — _RecordingSlotManager pattern for stand-in SlotManager + asserting on lifecycle events.
tests/providers/test_lemonade.py — captured fixture pattern for LemondClient responses.
tests/harness/ (δ-tier) — full lifecycle harness for installer → CLI → slot → uninstall. Adapter integration validated here at the system level.
Test additions per phase
Phase 0: 1 unit test asserting slot.port is updated to match /v1/health.backend_url after load.
Phase 1: refactor preserves behavior — existing 215 dispatcher+slots tests stay green; adapter gets ~20 new unit tests against MockTransport.
Out of Scope
Replacing Lemonade entirely. This was considered (go back to hal0-managed systemd llama-servers) and rejected — we'd lose FLM/NPU support, Kokoro voice, sdcpp image-gen, and we just finished migrating TO lemonade in v0.2.
Routing the data plane through lemond. Considered and rejected: lemond is not designed to be a streaming reverse proxy under load. This is explicitly documented in lemonade/client.py:5-6.
A native model-event WebSocket API from upstream lemonade. Would let us skip phase 2's log-parsing. Not on lemonade's roadmap as of 2026-05; revisit annually. If it ships, phase 2's log-parsing is replaced with a thin WS adapter; the rest of the plan is unchanged.
Changing the inference-path protocol. The dispatcher still speaks OpenAI-compat. Streaming still uses SSE. The new architecture is invisible to clients.
Multi-host lemonade. All of this assumes single-host lemonade running on 127.0.0.1. Multi-host would be a separate ADR.
Cross-slot transactional guarantees. If an agent chain needs primary + embed + rerank to all be live simultaneously, the warm-load helper in phase 4 makes a best-effort attempt but doesn't provide atomic guarantees. That's a v0.5 concern.
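To make "best-effort, not atomic" concrete, here is a minimal sketch of a parallel warm-load that reports per-slot outcomes instead of failing as a unit. The `load` coroutine is a hypothetical stand-in for whatever the adapter exposes to load a slot; nothing here is the planned phase 4 helper.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def warm_load(
    slots: Iterable[str],
    load: Callable[[str], Awaitable[None]],  # hypothetical per-slot load call
) -> Dict[str, bool]:
    """Load all slots concurrently and return {slot: succeeded}."""
    names = list(slots)
    # return_exceptions=True keeps one failed load from cancelling the others.
    results = await asyncio.gather(*(load(n) for n in names), return_exceptions=True)
    return {n: not isinstance(r, BaseException) for n, r in zip(names, results)}
```

A caller that needs all three slots live would still have to check the returned map itself, which is the gap the text defers to a later release.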
Further Notes
Timing: v0.3.1 vs v0.4 vs separate platform track
The user explicitly flagged this question as undecided. Three options:
Phase 0 as v0.3.1 hotfix only. Ship phase 0 this week, defer phases 1+ to v0.5. Lowest risk. Only fixes today's bug class.
Phases 0–3 bundled as v0.4 "Reactive Lemonade Integration" theme. ~3 weeks. v0.4 gets a coherent platform-reliability narrative. Right call if "platform reliability" resonates with the Strix Halo audience.
Separate platform track. Phase 0 → v0.3.1; phases 1–3 → v0.4.x point releases over the cycle. Decouples user-visible from infrastructural. Risk: platform work always loses to features in practice, so phases 4–6 may never happen.
Recommended hybrid:
| Phase | Release | Theme |
| --- | --- | --- |
| Phase 0 | v0.3.1 hotfix this week | Bug fix |
| Phases 1–3 | v0.4 milestone (~3 weeks) | "Reactive Lemonade Integration" |
| Phases 4–6 | v0.4.x point releases | Platform polish |
The trap to watch: if v0.4 is already promised feature-heavy (more agents, MCP, dashboard), phases 1–3 should slip to v0.5 — don't let infrastructure block features.
Spike questions before phase 2 starts
Does lemond's /logs/stream emit reliable load/unload events? 1-day spike to capture corpus and assess regex tractability.
WebSocket reconnection semantics — undocumented in lemonade today per memory hal0_lemonade_ws_protocol. Spike to characterize.
Upstream lemonade roadmap — is a native event API coming in 3 months? If yes, skip phase 2's log-parsing and wait.
Cross-references
ADR-0008 — Lemonade integration boundary. This PRD extends it with the adapter pattern; a follow-up ADR-0015 (or similar) should record the architectural shift.
Memory entries that informed this plan: hal0_dispatcher_silent_eviction_recovery, hal0_lemonade_port_drift, hal0_lemonade_gotchas, hal0_lemonade_internals, hal0_lemonade_ws_protocol, hal0_lemonade_threads_deadlock, hal0_lemonade_unload_gpu_cleanup_hang, hal0_lemonade_whisper_runpath_bug, hal0_lemonade_flm_npu_install, hal0_lemonade_ctx_size_lives_in_config_json, hal0_lemonade_rocm_device_perms, hal0_lemonade_hf_cache_gotchas, hal0_slot_backend_change_state_drift, hal0_hsa_gfx_override_stale_after_rocm_bundle_upgrade.
_recover_evicted_slot helper added here is the proof-of-concept for phase 3's pre-dispatch validation.
/home/halo/Development/Projects/hal0/Developer Docs/hal0-lemonade-slot-management-end-state.md.
Success metrics
dispatch.upstream_dead_attempting_recover → 0 in production (phase 3 deletes the branch; its absence is also the win).
dispatch.upstream_unavailable (502 rate) < 0.1% of inference calls.