Assistant responses lost on restart: dead event stream + DB write failure #406

@PureWeen

Problem

Assistant responses are visible in real-time but lost on app restart. User messages survive because they are written on send, but assistant responses from the same conversation disappear.

Observed in NET11-MANAGEMENT session

Session ID: 93a81981-e015-4f14-998e-0804c1d75cb6

Timeline

  • 06:27-06:30 — User sent 4 messages, assistant responded to each (visible in real-time)
  • 06:37 — User sent "do you keep losing messages?"
  • 06:38 — DB writes stopped (chat_history.db latest = 06:38:39)
  • 11:37 — App restarted, loaded from events.jsonl
  • Result: User messages from 06:27-06:30 are present, but ALL assistant responses between 06:27 and 06:37 are missing

Evidence from events.jsonl

After the last session.resume (11:37), the file contains only 1 user message and 3 assistant messages with actual content. A further 17 assistant.message events have content=null (tool-only rounds). The responses from 06:27-06:37 are not in the file at all. In total the file records 30 resumes and 12 shutdowns.
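The counts above can be reproduced with a short script. This is a minimal sketch, not the tool used for this report: it assumes each line of events.jsonl is one JSON object with a top-level `type` field and that assistant text lives under `data.content`. Both field paths are assumptions and may need adjusting to the real SDK schema.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Tally event types and split assistant messages by whether they carry content.

    Returns (counts_by_type, assistant_with_content, assistant_without_content).
    """
    counts = Counter()
    with_content = 0
    null_content = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        if not isinstance(event, dict):
            counts["<unparseable>"] += 1
            continue
        event_type = event.get("type", "<missing type>")
        counts[event_type] += 1
        if event_type == "assistant.message":
            # Assumed layout: the message text is stored under data.content.
            content = (event.get("data") or {}).get("content")
            if content:
                with_content += 1
            else:
                null_content += 1  # tool-only round, or content missing
    return counts, with_content, null_content
```

Passing an open file handle for events.jsonl as `lines` yields the per-type totals (for example the number of session.resume events) and the split between assistant messages with and without content.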

Evidence from chat_history.db

DB has 186 messages. The latest timestamp is 2026-03-18T06:38:39, so writes stopped about 5 hours before the latest user message in events.jsonl (11:37:27).

Root Cause

Dead event stream: The headless server idle timeout terminates the session. PolyPilot re-resumes via ResumeSessionAsync. The SDK creates a new session object, but the event file writer is broken — events flow in-memory (real-time rendering works) but never persist to events.jsonl.

DB write failure: The _chatDb.AddMessageAsync calls are fire-and-forget. After the event stream dies, DB writes also silently fail — possibly because the session ID changed on re-resume.

Why both safety nets fail

  1. events.jsonl: SDK bug — file writer breaks after server cleanup + re-resume
  2. chat_history.db: Fire-and-forget writes silently fail after event stream death
  3. LoadBestHistoryAsync (PR #391, "Fix multi-agent worker failures, session persistence, server health, and history recovery"): Correctly picks the newer source, but both sources are stale/incomplete
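To illustrate why a fire-and-forget write can fail without anyone noticing, here is a Python asyncio analogue. It is not PolyPilot code: `add_message` is a hypothetical stand-in for `_chatDb.AddMessageAsync`, and the "stale session ID" failure is only the hypothesis raised above. The sketch shows one way to keep the write non-blocking while still recording failures.

```python
import asyncio


async def add_message(session_id, text):
    """Hypothetical DB write that fails when the session ID is stale."""
    if session_id != "current":
        raise RuntimeError(f"unknown session {session_id!r}")


def fire_and_report(coro, failures):
    """Schedule coro without awaiting it, but record any exception it raises."""
    task = asyncio.create_task(coro)

    def _on_done(finished):
        # Without a callback like this, the exception is never observed
        # by the caller and the write fails silently.
        if not finished.cancelled() and finished.exception() is not None:
            failures.append(finished.exception())

    task.add_done_callback(_on_done)
    return task


async def demo():
    failures = []
    ok = fire_and_report(add_message("current", "hello"), failures)
    bad = fire_and_report(add_message("stale", "lost reply"), failures)
    # Wait for both writes without re-raising, so the demo can finish.
    await asyncio.gather(ok, bad, return_exceptions=True)
    await asyncio.sleep(0)  # give any pending callbacks a chance to run
    return failures
```

Running `asyncio.run(demo())` returns a list containing the one `RuntimeError` from the stale-session write. The same idea in the app would be to attach a continuation or logging wrapper to each unawaited write so failures show up in the logs.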

Reproduction Steps

  1. Start PolyPilot with an active conversation
  2. Wait ~35 minutes (or stop the headless server process)
  3. Send a message — PolyPilot auto-reconnects
  4. Have a multi-turn conversation (send messages, get responses)
  5. Restart PolyPilot (relaunch.sh)
  6. Expected: All messages from step 4 visible
  7. Actual: Only user messages survive; assistant responses gone

Diagnostic Data

Session: 93a81981-e015-4f14-998e-0804c1d75cb6
events.jsonl: 1.6MB, 1565 lines, 30 resumes, 12 shutdowns
chat_history.db: 186 messages, stopped at 06:38:39 UTC
events.jsonl latest user: 11:37:27 UTC
DB latest user: 06:37:26 UTC
Gap: 5 hours of lost assistant responses
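As a quick check of the gap figure, the two timestamps can be subtracted directly. The date for the events.jsonl timestamp is an assumption: only the time of day is given for it above, so this sketch assumes it falls on the same day as the DB timestamp.

```python
from datetime import datetime

# DB latest timestamp, as reported above.
db_latest = datetime.fromisoformat("2026-03-18T06:38:39")
# events.jsonl latest user message; the date is assumed to be the same day.
events_latest_user = datetime.fromisoformat("2026-03-18T11:37:27")

gap = events_latest_user - db_latest
print(gap)  # 4:58:48
```

Under that assumption the gap is 4 hours 58 minutes 48 seconds, which matches the rounded "5 hours" figure.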

Suggested Fix

  1. Detect dead event stream: After re-resume, verify events.jsonl mtime changes after first SDK event. If not, force fresh session creation.
  2. DB write resilience: After re-resume with new session ID, ensure AddMessageAsync uses the NEW session ID.
  3. Periodic DB sync: Periodically bulk-sync History to DB (every 30s) as safety net instead of relying on per-message fire-and-forget.
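The liveness check in fix 1 could look roughly like the following. This is a sketch of the idea in Python, not a proposed implementation for the app. The helper names are hypothetical, and comparing file size alongside mtime is an added assumption, included because mtime resolution can be coarse on some filesystems.

```python
import os


def snapshot(path):
    """Return (mtime_ns, size) for path, or None if the file does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size)


def event_log_is_live(before, after):
    """True if the log file exists and changed between the two snapshots."""
    return after is not None and after != before
```

The intended use is to take one snapshot right after re-resume, take a second one after the first SDK event has been received, and treat an unchanged file as a dead event stream that warrants a fresh session.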
