
fix(watchdog): two-phase timeout + consume prompt_progress keepalives for llama.cpp#946

Merged
Aaronontheweb merged 4 commits into netclaw-dev:dev from
Aaronontheweb:fix/watchdog-prefill-liveness
May 9, 2026

Conversation

@Aaronontheweb
Collaborator

Summary

  • Two-phase watchdog timeout: Split the single FirstTokenTimeout (600s) into PrefillTimeout (1800s, covers queue wait + prefill) and FirstTokenTimeout (600s, inter-delta silence). Watchdog starts generous, promotes to the tighter budget on first streaming delta.
  • Consume llama-server prompt_progress events: Request return_progress: true in streaming payloads and fix ParseStreamingUpdates to yield keepalives for content-less data events instead of silently dropping them.
  • Forward SSE comment lines: Yield keepalive for non-data: SSE lines (comment keepalives, event-type lines) so the watchdog resets during prefill/queuing.
  • First-delta keepalive: Send watchdog-refresh when the first text/thinking delta is buffered (previously held until 2nd delta with no signal).
  • Operation name constants: Replace stringly-typed "llm-call", "tool-execution", "compaction" literals with constants on ProcessingWatchdog.
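The two-phase promotion described above can be sketched as a small timer wrapper. This is a minimal illustration under assumed names (`TwoPhaseWatchdog`, `Refresh`, `OnDelta` are hypothetical here), not the actual `ProcessingWatchdog` implementation:

```csharp
using System;
using System.Threading;

// Sketch: start with the generous prefill budget, promote to the tighter
// inter-delta budget on the first streaming delta. Names are illustrative.
sealed class TwoPhaseWatchdog : IDisposable
{
    readonly TimeSpan _prefillTimeout;    // generous: queue wait + prefill
    readonly TimeSpan _firstTokenTimeout; // tight: inter-delta silence
    readonly Timer _timer;
    bool _promoted;

    public TwoPhaseWatchdog(TimeSpan prefillTimeout, TimeSpan firstTokenTimeout, Action onTimeout)
    {
        _prefillTimeout = prefillTimeout;
        _firstTokenTimeout = firstTokenTimeout;
        _timer = new Timer(_ => onTimeout(), null, prefillTimeout, Timeout.InfiniteTimeSpan);
    }

    // Keepalive (prompt_progress, SSE comment line): reset whichever budget is active.
    public void Refresh() =>
        _timer.Change(_promoted ? _firstTokenTimeout : _prefillTimeout, Timeout.InfiniteTimeSpan);

    // First streaming delta: promote and restart exactly once.
    public void OnDelta()
    {
        _promoted = true;
        _timer.Change(_firstTokenTimeout, Timeout.InfiniteTimeSpan);
    }

    public void Dispose() => _timer.Dispose();
}
```

Note that `OnDelta` performs a single `Change` call, so promotion and refresh on the first delta cannot double-restart the timer.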

Context

Session D0AC6CKBK5K/1778174942.852979 hit two consecutive 600s watchdog timeouts on 2026-05-08. Server-side logs proved the server was healthy: 7m15s queued (slot contention), a 2m44s cold prefill of 91K tokens (a KV-cache bug forced full re-processing), and a cancellation at 84.4%, roughly 18 seconds from completing.

The watchdog couldn't distinguish "server is busy prefilling" from "server is dead" because both look like SSE silence. llama-server already sends `prompt_progress` events during prefill (PR #15827), but `ParseStreamingUpdates` was silently dropping them at the `contents.Count == 0 && finishReason is null` guard.
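The liveness signals involved can be summarized with a small classifier. This is a hypothetical sketch of the decision the fix makes per SSE line (the names `StreamSignal` and `Classify` are illustrative, not from the PR):

```csharp
using System;

// Sketch: any line that isn't a parseable content delta still proves the
// server is alive, so it becomes a watchdog keepalive instead of being dropped.
enum StreamSignal { Keepalive, Delta, Done }

static class SseLines
{
    public static StreamSignal Classify(string line)
    {
        // Non-data lines: ": comment" heartbeats and "event:" type lines.
        if (!line.StartsWith("data: ", StringComparison.Ordinal))
            return StreamSignal.Keepalive;

        var payload = line["data: ".Length..];
        if (payload == "[DONE]")
            return StreamSignal.Done;

        // Content-less JSON (e.g. prompt_progress during prefill) resets the
        // watchdog rather than being silently swallowed.
        return payload.Contains("\"content\"", StringComparison.Ordinal)
            ? StreamSignal.Delta
            : StreamSignal.Keepalive;
    }
}
```

The real parser inspects deserialized chunks rather than raw strings; the point is only that "no content" no longer means "no signal".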

Test plan

  • Existing watchdog tests updated with PrefillTimeout and pass
  • StreamsReasoningAndTextDeltas_FromOfficialSpectrum updated to assert keepalive from content-less initial chunk
  • Full test suite: 3,342 tests pass, 0 failures
  • dotnet slopwatch analyze clean
  • ./scripts/Add-FileHeaders.ps1 -Verify clean
  • Integration test against llama-server with --parallel 1 + concurrent requests to verify progress events refresh watchdog

The processing watchdog was killing healthy LLM requests during slot
contention and cold prefill (91K tokens, ~10 min silent). Three fixes:

1. Split watchdog into PrefillTimeout (1800s default) and InterDeltaTimeout
   (FirstTokenTimeout, 600s). Start generous, promote on first delta.

2. Request `return_progress: true` from llama-server and fix
   ParseStreamingUpdates to yield keepalives for content-less data events
   (e.g. prompt_progress) instead of silently dropping them.

3. Forward SSE comment lines as keepalives and send watchdog-refresh on
   first buffered text/thinking delta.
…restart

`Promote()` then `Refresh()` with the same timeout restarted the timer
twice on the first delta. Restructured so each delta triggers exactly one
restart, and extracted a shared `RestartLlmTimer()` to eliminate identical
method bodies.
…ants

Extract LlmCall, ToolExecution, Compaction constants on ProcessingWatchdog
and replace all 7 call sites in LlmSessionActor.
@Aaronontheweb added the reliability (Retries, resilience, graceful degradation) and context-pipeline (LLM context assembly: prompt layers, dynamic injection, memory recall, temporal grounding) labels on May 9, 2026
Collaborator Author

@Aaronontheweb Aaronontheweb left a comment


LGTM

body["stream_options"] = new JsonObject { ["include_usage"] = true };
// llama-server sends prefill progress as SSE data events when enabled.
// Harmless on servers that don't support it (unknown fields are ignored).
body["return_progress"] = true;
Collaborator Author


Encourages llama-server to send us progress updates during prefill and other long-running phases.

{
// Content-less data events (e.g. prompt_progress during prefill) — yield
// keepalive so the watchdog timer resets while the server is working.
yield return KeepaliveUpdate;
Collaborator Author


should prevent netclaw's watchdog from aggressively nuking sessions when the model is sending back progress reports, thinking-only updates, etc.

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) May 9, 2026 13:57
@Aaronontheweb changed the title from "fix(watchdog): two-phase timeout + consume prompt_progress keepalives" to "fix(watchdog): two-phase timeout + consume prompt_progress keepalives for llama.cpp" on May 9, 2026
@Aaronontheweb Aaronontheweb merged commit af2754d into netclaw-dev:dev May 9, 2026
6 checks passed
