
fix: reset last_active on agent restore to prevent heartbeat false-positives on startup#703

Open
Fail-Safe wants to merge 1 commit into RightNow-AI:main from Fail-Safe:fix/heartbeat-startup-false-positive

Conversation

@Fail-Safe
Contributor

Problem

Every daemon restart triggers a cascade of spurious crash-recovery cycles for all agents.

When agents are loaded from persistent storage, their last_active timestamp is restored from before the previous shutdown. If the daemon was down for longer than the heartbeat unresponsiveness threshold (default 180 s), the very first heartbeat tick — which fires ~30 s after startup — sees every restored agent as stale and immediately marks them all as Crashed, kicking off recovery attempts for agents that had only just been restored and hadn't had a chance to run.

Observed in logs:

INFO  openfang_kernel: Restored 6 agent(s) from persistent storage
...30 seconds later...
WARN  openfang_kernel::heartbeat: Agent is unresponsive agent=Athena inactive_secs=210 timeout_secs=180
WARN  openfang_kernel::heartbeat: Agent is unresponsive agent=Paige  inactive_secs=210 timeout_secs=180
WARN  openfang_kernel::heartbeat: Agent is unresponsive agent=Cortex inactive_secs=210 timeout_secs=180
WARN  openfang_kernel::kernel:    Unresponsive Running agent marked as Crashed for recovery agent=Athena
...

The recovery succeeds each time, but the unnecessary crash-recovery round-trip adds noise to logs, delays agent availability by 30–60 s on every restart, and burns recovery attempt budget.
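The failure mode above can be sketched with a minimal model of the staleness check. This is an illustration only: the struct, field, and function names are assumptions, not the actual openfang_kernel API, and the real code uses chrono timestamps rather than std::time.

```rust
use std::time::{Duration, SystemTime};

// Illustrative model of a tracked agent; field names are assumptions.
struct AgentEntry {
    name: String,
    last_active: SystemTime,
}

// Sketch of the heartbeat staleness test: inactive for longer than the
// unresponsiveness threshold (default 180 s) means "unresponsive".
fn is_unresponsive(agent: &AgentEntry, now: SystemTime, timeout: Duration) -> bool {
    now.duration_since(agent.last_active).unwrap_or_default() > timeout
}

fn main() {
    let timeout = Duration::from_secs(180);
    let now = SystemTime::now();

    // A restored agent whose last_active predates a 210 s daemon outage
    // is flagged on the very first tick, before its loop ever runs.
    let restored = AgentEntry {
        name: "Athena".to_string(),
        last_active: now - Duration::from_secs(210),
    };
    assert!(is_unresponsive(&restored, now, timeout));
    println!("{} flagged as unresponsive on first tick", restored.name);
}
```

Note that the check itself is correct; the problem is purely that the restored `last_active` input predates the outage.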

Fix

One line added to the agent restore loop in kernel.rs, alongside the existing state = AgentState::Running reset:

restored_entry.last_active = chrono::Utc::now();

This gives each restored agent a fresh heartbeat baseline consistent with how newly spawned agents are initialized (they also set last_active = Utc::now()). The heartbeat monitor then starts timing from the actual moment of restoration rather than from the pre-shutdown timestamp.
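A minimal sketch of the restore step with the fix applied. The struct and enum shapes here are assumptions for illustration; the actual kernel.rs uses chrono::Utc::now() rather than std::time.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq)]
enum AgentState { Running, Crashed }

struct AgentEntry {
    state: AgentState,
    last_active: SystemTime,
}

// Simplified restore-loop body: alongside the existing state reset,
// stamp last_active so the heartbeat times from the moment of
// restoration, matching how freshly spawned agents are initialized.
fn restore_entry(mut entry: AgentEntry) -> AgentEntry {
    entry.state = AgentState::Running;     // existing reset
    entry.last_active = SystemTime::now(); // the one-line fix
    entry
}

fn main() {
    // Entry loaded from persistent storage with a pre-shutdown timestamp.
    let loaded = AgentEntry {
        state: AgentState::Crashed,
        last_active: SystemTime::now() - Duration::from_secs(210),
    };
    let restored = restore_entry(loaded);
    assert_eq!(restored.state, AgentState::Running);
    // last_active is now fresh, well inside the 180 s threshold.
    assert!(restored.last_active.elapsed().unwrap() < Duration::from_secs(1));
    println!("restored agent has a fresh heartbeat baseline");
}
```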

Impact

  • No behavioural change for agents that are genuinely unresponsive after restoration
  • Agents that fail to start their loop after restore will still be caught by the heartbeat on the next check cycle (30 s later)
  • Eliminates the false-positive crash storm on every daemon restart

…sitives

When agents are loaded from persistent storage on daemon startup, their
last_active timestamp reflects when they were last active before the
previous shutdown. If the daemon was down for longer than the heartbeat
timeout (default 180 s), the first heartbeat tick immediately marks every
restored agent as unresponsive and triggers crash recovery — even though
all agents just started and haven't had a chance to run.

Fix: stamp last_active = Utc::now() alongside the state = Running reset
in the restore loop. This is consistent with how new agent spawns work
(they also set last_active to now) and gives each restored agent a clean
baseline from which the heartbeat can accurately track responsiveness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
