Description
Problem
Assistant responses are visible in real-time but lost on app restart. User messages survive because they are written on send, but assistant responses from the same conversation disappear.
Observed in NET11-MANAGEMENT session
Session ID: 93a81981-e015-4f14-998e-0804c1d75cb6
Timeline
- 06:27-06:30: User sent 4 messages, assistant responded to each (visible in real time)
- 06:37: User sent "do you keep losing messages?"
- 06:38: DB writes stopped (chat_history.db latest = 06:38:39)
- 11:37: App restarted, loaded from events.jsonl
- Result: User messages from 06:27-06:30 are present, but ALL assistant responses between 06:27 and 06:37 are missing
Evidence from events.jsonl
After the last session.resume (11:37), the file contains only 1 user message and 3 assistant messages with actual content. A further 17 assistant.message events have content=null (tool-only rounds). The responses from 06:27-06:37 are NOT in the file at all. In total the file records 30 resumes and 12 shutdowns.
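The counts above can be reproduced with a small script. The following is a minimal sketch, not the tooling used for this report: it assumes each line of events.jsonl is a JSON object with a `type` field and a `data.content` field, and that user messages use a `user.message` type. Only `session.resume` and `assistant.message` are named in this report; the rest of the schema is an assumption.

```python
import json
from collections import Counter


def summarize_after_last_resume(lines):
    """Tally messages that appear after the last session.resume event.

    Field names ("type", "data", "content") and the "user.message"
    event type are assumptions about the events.jsonl schema.
    """
    events = [json.loads(line) for line in lines if line.strip()]

    # Index of the last resume; -1 means no resume was seen, so use all events.
    last_resume = max(
        (i for i, e in enumerate(events) if e.get("type") == "session.resume"),
        default=-1,
    )

    counts = Counter()
    for event in events[last_resume + 1:]:
        kind = event.get("type")
        if kind == "assistant.message":
            # content=null marks a tool-only round with no visible text.
            has_content = bool((event.get("data") or {}).get("content"))
            key = "assistant_with_content" if has_content else "assistant_empty"
            counts[key] += 1
        elif kind == "user.message":
            counts["user"] += 1
    return dict(counts)
```

Calling it as `summarize_after_last_resume(open("events.jsonl", encoding="utf-8"))` would return the per-category counts for the most recent resume segment.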
Evidence from chat_history.db
The DB has 186 messages. Its latest timestamp is 2026-03-18T06:38:39, so writes stopped about 5 hours before the latest user message in events.jsonl (11:37:27).
Root Cause
Dead event stream: The headless server idle timeout terminates the session. PolyPilot re-resumes via ResumeSessionAsync. The SDK creates a new session object, but the event file writer is broken — events flow in-memory (real-time rendering works) but never persist to events.jsonl.
DB write failure: The _chatDb.AddMessageAsync calls are fire-and-forget. After the event stream dies, DB writes also silently fail — possibly because the session ID changed on re-resume.
Why both safety nets fail
- events.jsonl: SDK bug — file writer breaks after server cleanup + re-resume
- chat_history.db: Fire-and-forget writes silently fail after event stream death
- LoadBestHistoryAsync (PR #391, "Fix multi-agent worker failures, session persistence, server health, and history recovery"): Correctly picks the newer source, but BOTH are stale/incomplete
Reproduction Steps
1. Start PolyPilot with an active conversation
2. Wait ~35 minutes (or stop the headless server process)
3. Send a message; PolyPilot auto-reconnects
4. Have a multi-turn conversation (send messages, get responses)
5. Restart PolyPilot (relaunch.sh)

- Expected: All messages from step 4 visible
- Actual: Only user messages survive; assistant responses gone
Diagnostic Data
Session: 93a81981-e015-4f14-998e-0804c1d75cb6
events.jsonl: 1.6MB, 1565 lines, 30 resumes, 12 shutdowns
chat_history.db: 186 messages, stopped at 06:38:39 UTC
events.jsonl latest user: 11:37:27 UTC
DB latest user: 06:37:26 UTC
Gap: 5 hours of lost assistant responses
Related Issues
- #396 ("Send keep-alive pings to prevent server idle timeout killing sessions"): Keep-alive pings, merged as PR #402 ("Add keepalive ping to prevent server idle timeout (#396)"); should prevent the trigger
- #392 ("Upstream: Copilot CLI headless server breaks when global CLI cleans shared native module directory"): posix_spawn failures (same server cleanup root cause)
- PR #391 ("Fix multi-agent worker failures, session persistence, server health, and history recovery"): LoadBestHistoryAsync (mitigates but does not solve when both sources fail)
Suggested Fix
- Detect dead event stream: After re-resume, verify events.jsonl mtime changes after first SDK event. If not, force fresh session creation.
- DB write resilience: After re-resume with new session ID, ensure AddMessageAsync uses the NEW session ID.
- Periodic DB sync: Periodically bulk-sync History to DB (every 30s) as safety net instead of relying on per-message fire-and-forget.
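The first and third suggestions can be sketched as follows. This is an illustration under stated assumptions, not PolyPilot's implementation: the function names are hypothetical, and the `messages` table layout is invented for the example (the real chat_history.db schema is not shown in this report). The sync is keyed on the current session ID and is idempotent, so running it every 30s does not duplicate rows.

```python
import os
import sqlite3


def events_file_advanced(path, mtime_before_ns):
    """Return True if the file was modified after the recorded mtime.

    Record os.stat(path).st_mtime_ns before re-resume, then call this
    after the first SDK event arrives. False suggests a dead writer.
    """
    try:
        return os.stat(path).st_mtime_ns > mtime_before_ns
    except FileNotFoundError:
        return False


def sync_history(conn, session_id, history):
    """Upsert the whole in-memory history under the CURRENT session ID.

    `history` is a list of (role, content) pairs. The table layout is a
    hypothetical stand-in for the real schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "session_id TEXT, seq INTEGER, role TEXT, content TEXT, "
        "PRIMARY KEY (session_id, seq))"
    )
    rows = [
        (session_id, seq, role, content)
        for seq, (role, content) in enumerate(history)
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO messages VALUES (?, ?, ?, ?)", rows
        )
    cursor = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE session_id = ?", (session_id,)
    )
    return cursor.fetchone()[0]
```

Comparing mtimes is only a heuristic: filesystem timestamp granularity varies, so a real check might also compare file size or wait briefly before deciding the stream is dead.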