Skip to content

fix: skip stale check for ad-hoc instances#14

Closed
bloodcarter wants to merge 1 commit intoaannoo:mainfrom
bloodcarter:fix/adhoc-stale-cleanup
Closed

fix: skip stale check for ad-hoc instances#14
bloodcarter wants to merge 1 commit intoaannoo:mainfrom
bloodcarter:fix/adhoc-stale-cleanup

Conversation

@bloodcarter
Copy link
Copy Markdown
Contributor

Summary

  • Ad-hoc instances created via hcom start --as <name> get falsely marked as stale and cleaned up after ~10 seconds
  • Root cause: HEARTBEAT_THRESHOLD_NO_TCP (10s) applies to ad-hoc instances which have no heartbeat mechanism (no PTY thread, no TCP endpoint)
  • Fix: skip heartbeat-based stale check for tool="adhoc" instances, same as remote instances

The problem

When a script creates an ad-hoc identity for event polling:

hcom start --as controller
# ... poll events with hcom events --from agent --last 1 ...

The identity is killed by stale cleanup within seconds because:

  1. Ad-hoc has tcp_mode=0 and no notify endpoint → has_tcp=false
  2. Threshold becomes HEARTBEAT_THRESHOLD_NO_TCP (10 seconds)
  3. No PTY delivery thread sends heartbeats for ad-hoc
  4. After 10s with no heartbeat, status becomes inactive:stale:listening
  5. Cleanup removes it

The fix

3-line change in get_instance_status(): check data.tool == "adhoc" and skip the heartbeat timeout, same as is_remote instances.

Test plan

  • hcom start --as test && sleep 200 && hcom list test — survives 200s (was dying at 10s)
  • All 1243 existing tests pass
  • Normal PTY agents still get stale-checked correctly

🤖 Generated with Claude Code

Ad-hoc instances created via `hcom start --as <name>` have no PTY
process or TCP endpoint sending heartbeats. The 10-second
HEARTBEAT_THRESHOLD_NO_TCP causes them to be marked stale:listening
and cleaned up almost immediately, breaking script-controlled
workflows that use ad-hoc identities for event polling.

Fix: skip the heartbeat-based stale check entirely for tool="adhoc"
instances, same as remote instances. They stay alive until explicitly
stopped via `hcom stop`.
@bloodcarter
Copy link
Copy Markdown
Contributor Author

Closing — the root cause is deeper than the stale check. Ad-hoc identities created from within Claude Code tool calls are killed by the hooks lifecycle, not the stale checker. The instances.rs fix only helps for pure adhoc (outside Claude), which is a rarer case. Will investigate the hooks lifecycle separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant