
fix: reset last_active on agent restore to prevent heartbeat false-positives on startup#703

Open
Fail-Safe wants to merge 1 commit into RightNow-AI:main from Fail-Safe:fix/heartbeat-startup-false-positive

Conversation

@Fail-Safe
Contributor

Problem

Every daemon restart triggers a cascade of spurious crash-recovery cycles for all agents.

When agents are loaded from persistent storage, their last_active timestamp is restored from before the previous shutdown. If the daemon was down for longer than the heartbeat unresponsiveness threshold (default 180 s), the very first heartbeat tick — which fires ~30 s after startup — sees every restored agent as stale and immediately marks them all as Crashed, kicking off recovery attempts for agents that had only just been restored and hadn't had a chance to run.

Observed in logs:

INFO  openfang_kernel: Restored 6 agent(s) from persistent storage
...30 seconds later...
WARN  openfang_kernel::heartbeat: Agent is unresponsive agent=Athena inactive_secs=210 timeout_secs=180
WARN  openfang_kernel::heartbeat: Agent is unresponsive agent=Paige  inactive_secs=210 timeout_secs=180
WARN  openfang_kernel::heartbeat: Agent is unresponsive agent=Cortex inactive_secs=210 timeout_secs=180
WARN  openfang_kernel::kernel:    Unresponsive Running agent marked as Crashed for recovery agent=Athena
...

The recovery succeeds each time, but the unnecessary crash-recovery round-trip adds noise to logs, delays agent availability by 30–60 s on every restart, and burns recovery attempt budget.
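The failure mode above can be sketched with a minimal model of the staleness check. This is an illustration only: the struct, field, and function names are assumptions, not the actual openfang_kernel API, and the real code uses chrono timestamps rather than std::time.

```rust
use std::time::{Duration, SystemTime};

// Illustrative model of a tracked agent; field names are assumptions.
struct AgentEntry {
    name: String,
    last_active: SystemTime,
}

// Sketch of the heartbeat staleness test: inactive for longer than the
// unresponsiveness threshold (default 180 s) means "unresponsive".
fn is_unresponsive(agent: &AgentEntry, now: SystemTime, timeout: Duration) -> bool {
    now.duration_since(agent.last_active).unwrap_or_default() > timeout
}

fn main() {
    let timeout = Duration::from_secs(180);
    let now = SystemTime::now();

    // A restored agent whose last_active predates a 210 s daemon outage
    // is flagged on the very first tick, before its loop ever runs.
    let restored = AgentEntry {
        name: "Athena".to_string(),
        last_active: now - Duration::from_secs(210),
    };
    assert!(is_unresponsive(&restored, now, timeout));
    println!("{} flagged as unresponsive on first tick", restored.name);
}
```

Note that the check itself is correct; the problem is purely that the restored `last_active` input predates the outage.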

Fix

One line added to the agent restore loop in kernel.rs, alongside the existing state = AgentState::Running reset:

restored_entry.last_active = chrono::Utc::now();

This gives each restored agent a fresh heartbeat baseline consistent with how newly spawned agents are initialized (they also set last_active = Utc::now()). The heartbeat monitor then starts timing from the actual moment of restoration rather than from the pre-shutdown timestamp.
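A minimal sketch of the restore step with the fix applied. The struct and enum shapes here are assumptions for illustration; the actual kernel.rs uses chrono::Utc::now() rather than std::time.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq)]
enum AgentState { Running, Crashed }

struct AgentEntry {
    state: AgentState,
    last_active: SystemTime,
}

// Simplified restore-loop body: alongside the existing state reset,
// stamp last_active so the heartbeat times from the moment of
// restoration, matching how freshly spawned agents are initialized.
fn restore_entry(mut entry: AgentEntry) -> AgentEntry {
    entry.state = AgentState::Running;     // existing reset
    entry.last_active = SystemTime::now(); // the one-line fix
    entry
}

fn main() {
    // Entry loaded from persistent storage with a pre-shutdown timestamp.
    let loaded = AgentEntry {
        state: AgentState::Crashed,
        last_active: SystemTime::now() - Duration::from_secs(210),
    };
    let restored = restore_entry(loaded);
    assert_eq!(restored.state, AgentState::Running);
    // last_active is now fresh, well inside the 180 s threshold.
    assert!(restored.last_active.elapsed().unwrap() < Duration::from_secs(1));
    println!("restored agent has a fresh heartbeat baseline");
}
```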

Impact

  • No behavioural change for agents that are genuinely unresponsive after restoration
  • Agents that fail to start their loop after restore will still be caught by the heartbeat on the next check cycle (30 s later)
  • Eliminates the false-positive crash storm on every daemon restart

…sitives

When agents are loaded from persistent storage on daemon startup, their
last_active timestamp reflects when they were last active before the
previous shutdown. If the daemon was down for longer than the heartbeat
timeout (default 180 s), the first heartbeat tick immediately marks every
restored agent as unresponsive and triggers crash recovery — even though
all agents just started and haven't had a chance to run.

Fix: stamp last_active = Utc::now() alongside the state = Running reset
in the restore loop. This is consistent with how new agent spawns work
(they also set last_active to now) and gives each restored agent a clean
baseline from which the heartbeat can accurately track responsiveness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
