Skip to content

Investigate SDK premature session.idle — upstream fix or permanent workaround #389

@PureWeen

Description

@PureWeen

Context

The Copilot SDK/CLI emits premature session.idle events during long tool-executing turns, causing multi-agent workers to be collected with truncated responses. PR #375 added an elaborate workaround (RecoverFromPrematureIdleIfNeededAsync) using:

  1. ManualResetEventSlim signal set by re-arm events after premature idle
  2. events.jsonl mtime freshness detection (< 15s = CLI still writing)
  3. Multi-round recovery loop with bestResponse accumulation
  4. 120s recovery timeout

Why this is a hack

  • Relies on filesystem side-channel (file mtime) to determine SDK state
  • Normal completions can stall ~15s while freshness detection runs
  • Multi-round loop adds complexity and edge cases (OCE handling, bestResponse scoping)
  • The root cause is the SDK emitting session.idle before the turn is actually complete

Proposed investigation

  1. File upstream issue with Copilot SDK team documenting the premature idle behavior
  2. Request turn-scoped terminal events or turn IDs so consumers can distinguish real vs premature idle
  3. If upstream fix is not forthcoming, evaluate whether the workaround can be simplified (e.g., just use turn ID matching)

Priority

Medium — the workaround works but adds resilience debt in the most critical orchestration path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions