
fix(watchdog): two-phase timeout + consume prompt_progress keepalives for llama.cpp#946

Merged
Aaronontheweb merged 4 commits into netclaw-dev:dev from
Aaronontheweb:fix/watchdog-prefill-liveness
May 9, 2026

Conversation

@Aaronontheweb
Collaborator

Summary

  • Two-phase watchdog timeout: Split the single FirstTokenTimeout (600s) into PrefillTimeout (1800s, covers queue wait + prefill) and FirstTokenTimeout (600s, inter-delta silence). Watchdog starts generous, promotes to the tighter budget on first streaming delta.
  • Consume llama-server prompt_progress events: Request return_progress: true in streaming payloads and fix ParseStreamingUpdates to yield keepalives for content-less data events instead of silently dropping them.
  • Forward SSE comment lines: Yield keepalive for non-data: SSE lines (comment keepalives, event-type lines) so the watchdog resets during prefill/queuing.
  • First-delta keepalive: Send watchdog-refresh when the first text/thinking delta is buffered (previously held until 2nd delta with no signal).
  • Operation name constants: Replace stringly-typed "llm-call", "tool-execution", "compaction" literals with constants on ProcessingWatchdog.
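The two-phase promotion described above can be sketched as a small timer wrapper. This is a minimal illustration under assumed names (`TwoPhaseWatchdog`, `Refresh`, `OnDelta` are hypothetical here), not the actual `ProcessingWatchdog` implementation:

```csharp
using System;
using System.Threading;

// Sketch: start with the generous prefill budget, promote to the tighter
// inter-delta budget on the first streaming delta. Names are illustrative.
sealed class TwoPhaseWatchdog : IDisposable
{
    readonly TimeSpan _prefillTimeout;    // generous: queue wait + prefill
    readonly TimeSpan _firstTokenTimeout; // tight: inter-delta silence
    readonly Timer _timer;
    bool _promoted;

    public TwoPhaseWatchdog(TimeSpan prefillTimeout, TimeSpan firstTokenTimeout, Action onTimeout)
    {
        _prefillTimeout = prefillTimeout;
        _firstTokenTimeout = firstTokenTimeout;
        _timer = new Timer(_ => onTimeout(), null, prefillTimeout, Timeout.InfiniteTimeSpan);
    }

    // Keepalive (prompt_progress, SSE comment line): reset whichever budget is active.
    public void Refresh() =>
        _timer.Change(_promoted ? _firstTokenTimeout : _prefillTimeout, Timeout.InfiniteTimeSpan);

    // First streaming delta: promote and restart exactly once.
    public void OnDelta()
    {
        _promoted = true;
        _timer.Change(_firstTokenTimeout, Timeout.InfiniteTimeSpan);
    }

    public void Dispose() => _timer.Dispose();
}
```

Note that `OnDelta` performs a single `Change` call, so promotion and refresh on the first delta cannot double-restart the timer.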

Context

Session D0AC6CKBK5K/1778174942.852979 hit two consecutive 600s watchdog timeouts on 2026-05-08. Server-side logs proved the server was healthy: 7m15s queued (slot contention), a 2m44s cold prefill of 91K tokens (a KV-cache bug forced full re-processing), and a cancellation at 84.4%, roughly 18 seconds from completing.

The watchdog couldn't distinguish "server is busy prefilling" from "server is dead" because both look like SSE silence. llama-server already sends `prompt_progress` events during prefill (PR #15827), but `ParseStreamingUpdates` was silently dropping them at the `contents.Count == 0 && finishReason is null` guard.
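The liveness signals involved can be summarized with a small classifier. This is a hypothetical sketch of the decision the fix makes per SSE line (the names `StreamSignal` and `Classify` are illustrative, not from the PR):

```csharp
using System;

// Sketch: any line that isn't a parseable content delta still proves the
// server is alive, so it becomes a watchdog keepalive instead of being dropped.
enum StreamSignal { Keepalive, Delta, Done }

static class SseLines
{
    public static StreamSignal Classify(string line)
    {
        // Non-data lines: ": comment" heartbeats and "event:" type lines.
        if (!line.StartsWith("data: ", StringComparison.Ordinal))
            return StreamSignal.Keepalive;

        var payload = line["data: ".Length..];
        if (payload == "[DONE]")
            return StreamSignal.Done;

        // Content-less JSON (e.g. prompt_progress during prefill) resets the
        // watchdog rather than being silently swallowed.
        return payload.Contains("\"content\"", StringComparison.Ordinal)
            ? StreamSignal.Delta
            : StreamSignal.Keepalive;
    }
}
```

The real parser inspects deserialized chunks rather than raw strings; the point is only that "no content" no longer means "no signal".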

Test plan

  • Existing watchdog tests updated with PrefillTimeout and pass
  • StreamsReasoningAndTextDeltas_FromOfficialSpectrum updated to assert keepalive from content-less initial chunk
  • Full test suite: 3,342 tests pass, 0 failures
  • dotnet slopwatch analyze clean
  • ./scripts/Add-FileHeaders.ps1 -Verify clean
  • Integration test against llama-server with --parallel 1 + concurrent requests to verify progress events refresh watchdog

The processing watchdog was killing healthy LLM requests during slot
contention and cold prefill (91K tokens, ~10 min silent). Three fixes:

1. Split watchdog into PrefillTimeout (1800s default) and InterDeltaTimeout
   (FirstTokenTimeout, 600s). Start generous, promote on first delta.

2. Request `return_progress: true` from llama-server and fix
   ParseStreamingUpdates to yield keepalives for content-less data events
   (e.g. prompt_progress) instead of silently dropping them.

3. Forward SSE comment lines as keepalives and send watchdog-refresh on
   first buffered text/thinking delta.
…restart

`Promote()` then `Refresh()` with the same timeout restarted the timer
twice on the first delta. Restructured so each delta triggers exactly one
restart, and extracted a shared `RestartLlmTimer()` to eliminate identical
method bodies.
…ants

Extract LlmCall, ToolExecution, Compaction constants on ProcessingWatchdog
and replace all 7 call sites in LlmSessionActor.
@Aaronontheweb added the reliability (Retries, resilience, graceful degradation) and context-pipeline (LLM context assembly: prompt layers, dynamic injection, memory recall, temporal grounding) labels on May 9, 2026
Collaborator Author

@Aaronontheweb Aaronontheweb left a comment


LGTM

body["stream_options"] = new JsonObject { ["include_usage"] = true };
// llama-server sends prefill progress as SSE data events when enabled.
// Harmless on servers that don't support it (unknown fields are ignored).
body["return_progress"] = true;
Collaborator Author


Encourages llama-server to send us progress updates during prefill and other long-running phases.

{
// Content-less data events (e.g. prompt_progress during prefill) — yield
// keepalive so the watchdog timer resets while the server is working.
yield return KeepaliveUpdate;
Collaborator Author


should prevent netclaw's watchdog from aggressively nuking sessions when the model is sending back progress reports, thinking-only updates, etc.

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) May 9, 2026 13:57
@Aaronontheweb changed the title from "fix(watchdog): two-phase timeout + consume prompt_progress keepalives" to "fix(watchdog): two-phase timeout + consume prompt_progress keepalives for llama.cpp" on May 9, 2026
@Aaronontheweb Aaronontheweb merged commit af2754d into netclaw-dev:dev May 9, 2026
6 checks passed
