fix: reset last_active on agent restore to prevent heartbeat false-positives on startup#703
Open
Fail-Safe wants to merge 1 commit intoRightNow-AI:mainfrom
Open
Conversation
…sitives When agents are loaded from persistent storage on daemon startup, their last_active timestamp reflects when they were last active before the previous shutdown. If the daemon was down for longer than the heartbeat timeout (default 180 s), the first heartbeat tick immediately marks every restored agent as unresponsive and triggers crash recovery — even though all agents just started and haven't had a chance to run. Fix: stamp last_active = Utc::now() alongside the state = Running reset in the restore loop. This is consistent with how new agent spawns work (they also set last_active to now) and gives each restored agent a clean baseline from which the heartbeat can accurately track responsiveness. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Every daemon restart triggers a cascade of spurious crash-recovery cycles for all agents.
When agents are loaded from persistent storage, their
last_activetimestamp is restored from before the previous shutdown. If the daemon was down for longer than the heartbeat unresponsiveness threshold (default 180 s), the very first heartbeat tick — which fires ~30 s after startup — sees every restored agent as stale and immediately marks them all asCrashed, kicking off recovery attempts for agents that had only just been restored and hadn't had a chance to run.Observed in logs:
The recovery succeeds each time, but the unnecessary crash-recovery round-trip adds noise to logs, delays agent availability by 30–60 s on every restart, and burns recovery attempt budget.
Fix
One line added to the agent restore loop in
kernel.rs, alongside the existingstate = AgentState::Runningreset:This gives each restored agent a fresh heartbeat baseline consistent with how newly spawned agents are initialized (they also set
last_active = Utc::now()). The heartbeat monitor then starts timing from the actual moment of restoration rather than from the pre-shutdown timestamp.Impact