feat(agent): heartbeat & error handling with audit log#146
Conversation
Extend agentHeartbeatSchema with currentTask telemetry and a narrow warning|fatal error block. Heartbeat-borne errors are now persisted to agent_errors via logAgentError (warning rows leave status untouched, fatal rows force status='error' and route active tasks through handleTaskFailure). Status transitions emit a structured logger.info line for audit; no-op transitions are suppressed to avoid heartbeat- volume noise. The transition decision is extracted as a pure decideHeartbeatTransition helper so the policy can be pinned without DB mocks, matching the classifyWorstSeverity pattern. Refs: spec/tickets/Agent_Heartbeat_&_Error_Handling.md Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Apply ce-code-review safe_auto fixes from run 20260516-021943-e0500813: - Bound heartbeat error message (max 4096) and context (max 16K chars serialized) at the zod boundary so a compromised agent cannot bloat agent_errors with multi-MB rows every ~30s. Mirror the caps onto the standalone POST /errors schema. - Pass projectId through from processHeartbeat into logAgentError so the helper skips its redundant SELECT on every error-bearing heartbeat. - Tighten decideHeartbeatTransition to use the AgentHeartbeat['status'] literal union instead of bare string; a status typo at a call site is now a compile error. - Update packages/openapi/agent-api.yaml Heartbeat schema with currentTask and error properties, document the narrow warning|fatal enum, the non-redundant heartbeat-vs-/errors channel contract, and the new size caps on AgentError. Cross-corroborated by sec-1+adv-1 (P1), maintainability+perf (P2), kieran-typescript (P2), api-contract+project-standards (P1). Residual non-safe-auto findings (transactional integrity, currentTask ownership check, atomic CTE, integration tests, fatal-strand recovery path) recorded in /tmp/compound-engineering/ce-code-review/20260516-021943-e0500813/residual.md. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Pull request overview
Implements the Agent Heartbeat & Error Handling ticket: extends agentHeartbeatSchema with optional currentTask telemetry and a narrow error: { severity: warning|fatal, message, context } block, persists heartbeat-borne errors to agent_errors, escalates fatal errors to status='error' plus handleTaskFailure on all active tasks, and emits a structured audit log line on real status transitions. A pure decideHeartbeatTransition helper is extracted so the policy can be unit-tested without DB mocks. The OpenAPI spec and standalone /errors route are updated with matching size caps.
Changes:
- New
agentHeartbeatErrorSchema/agentHeartbeatCurrentTaskSchema(with 4 KiB message / 16 KiB serialized context caps) added to the shared package and surfaced in the OpenAPI Heartbeat schema; standaloneAgentErrorgets the same caps. processHeartbeatrewritten arounddecideHeartbeatTransition: reads prior status, logs the error vialogAgentError(with project-id pass-through to skip a redundant SELECT), updates the row, emits a transition audit log, and runshandleTaskFailurefor fatals.- New unit tests for
decideHeartbeatTransition(6) and route-level schema validation tests forPOST /heartbeat(5).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/shared/src/schemas/index.ts | Adds heartbeat error/current-task schemas with size caps and extends agentHeartbeatSchema. |
| packages/shared/src/types/index.ts | Re-exports new AgentHeartbeatError and AgentHeartbeatCurrentTask types. |
| packages/openapi/agent-api.yaml | Documents currentTask and error on Heartbeat and tightens the standalone AgentError contract. |
| packages/backend/src/services/agents.ts | Adds decideHeartbeatTransition, status-transition logging, fatal-error task failure flow, and projectId pass-through to logAgentError. |
| packages/backend/src/routes/agent/index.ts | Imports new caps and applies matching bounds to the standalone /errors validation schema. |
| packages/backend/tests/unit/agents-service.test.ts | 6 unit tests for decideHeartbeatTransition. |
| packages/backend/tests/unit/agent-api-contract.test.ts | 5 new route-level schema-validation tests for POST /heartbeat. |
Comments suppressed due to low confidence (2)
packages/backend/src/services/agents.ts:434
- The
agent_errorsinsert and theagentsUPDATE happen as two separate, non-transactional statements (with the insert occurring first, before the row is locked or updated). If the UPDATE — or the subsequenthandleTaskFailureloop on a fatal heartbeat — throws, the system is left with an orphan error row attributed to an agent whose status was never transitioned and whose active tasks were never failed, breaking the invariant that fatal heartbeats produce a consistent(error_row, status='error', failed_tasks)triple. Wrapping the insert + update + task-failure loop indb.transaction(...)would restore atomicity. (PR description already flags this as residual #3/#4; flagging here so it's anchored to the diff.)
This issue also appears on line 378 of the same file.
if (data.error) {
await logAgentError({
agentId,
severity: data.error.severity,
message: data.error.message,
context: data.error.context,
taskId: data.currentTask?.taskId,
// Skip the redundant SELECT inside logAgentError — we already
// have projectId from priorRow.
projectId: priorRow?.projectId,
});
}
const updates: Record<string, unknown> = {
status: effectiveStatus,
lastSeenAt: new Date(),
updatedAt: new Date(),
};
if (data.capabilities) {
updates['capabilities'] = data.capabilities;
}
if (data.deviceInfo) {
updates['hardwareProfile'] = data.deviceInfo;
}
const [updated] = await db.update(agents).set(updates).where(eq(agents.id, agentId)).returning();
if (updated) {
emitAgentStatus(updated.projectId, updated.id, effectiveStatus);
if (transition.shouldLogTransition && transition.reason && priorStatus) {
logStatusTransition({
agentId: updated.id,
projectId: updated.projectId,
fromStatus: priorStatus,
toStatus: effectiveStatus,
reason: transition.reason,
});
}
}
// On fatal error, fail the agent's current tasks. `handleTaskFailure`
// applies the up-to-3-retry policy defined in the ticket; rolling a
// parallel path here would diverge. Dynamic import preserves the
// existing tasks.ts <-> agents.ts circular-import workaround.
if (isFatalError) {
const activeTasks = await db
.select({ id: tasks.id })
.from(tasks)
.where(and(eq(tasks.agentId, agentId), sql`${tasks.status} IN ('assigned', 'running')`));
const { handleTaskFailure } = await import('./tasks.js');
for (const activeTask of activeTasks) {
await handleTaskFailure(activeTask.id, agentId, data.error?.message ?? 'Agent fatal error');
}
}
packages/backend/src/services/agents.ts:389
currentTask.taskIdfrom the heartbeat payload is persisted directly intoagent_errors.task_idwith no check that the task belongs to the calling agent (notasks.agentId = agentIdlookup, and the column is not enforced by an FK constraint visible here). A compromised or buggy agent can attribute its errors to arbitrary task IDs — including tasks owned by other agents or other projects — which corrupts per-task error history and any UI that joinsagent_errorstotasks. Either validate ownership before persisting or droptaskIdwhen ownership cannot be established.
if (data.error) {
await logAgentError({
agentId,
severity: data.error.severity,
message: data.error.message,
context: data.error.context,
taskId: data.currentTask?.taskId,
// Skip the redundant SELECT inside logAgentError — we already
// have projectId from priorRow.
projectId: priorRow?.projectId,
});
}
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR extends agent heartbeats with optional inline ChangesAgent Heartbeat Enhancement
Sequence Diagram(s)sequenceDiagram
participant Agent
participant API as AgentAPI
participant Service as processHeartbeat
participant DB as agentsDB
participant Audit as logStatusTransition
Agent->>API: POST /api/v1/agent/heartbeat (status, error?, currentTask?)
API->>Service: processHeartbeat(agentId, heartbeat)
Service->>DB: SELECT prior agent row (status, projectId)
Service->>Service: decideHeartbeatTransition(payloadStatus, errorSeverity, priorStatus)
Service->>DB: UPDATE agents SET effectiveStatus, timestamps, capabilities
Service->>DB: INSERT agent_errors (suppressed SSE)
Service->>API: emit agent_status SSE (post-commit)
Service->>Audit: logStatusTransition(...) [only if transition]
Service->>Service: on fatal -> dynamically import handleTaskFailure -> fan out per active task
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 7 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (7 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsThese MCP integrations need to be re-authenticated in the Integrations settings: Linear, Notion Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (3)
packages/backend/tests/unit/agent-api-contract.test.ts (1)
163-242: ⚡ Quick winAdd heartbeat boundary tests for new size caps.
These cases validate enum/required fields, but they don’t cover the new
error.messagemax (4096) and serializederror.context16KiB cap. Adding pass/fail boundary assertions here would lock the contract and catch regressions early.As per coding guidelines, "Agent API contract tests should validate responses against OpenAPI spec to keep server and clients in sync."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/tests/unit/agent-api-contract.test.ts` around lines 163 - 242, Add boundary tests in agent-api-contract.test.ts for the AGENT_BASE/heartbeat endpoint to validate the new error.message max length (4096) and serialized error.context cap (16 * 1024 bytes): create one test that sends error.message exactly 4096 characters and error.context whose JSON string length is exactly 16KiB and assert res.status === 200, and a complementary test that sends error.message of 4097 chars and/or error.context whose JSON string length is 16KiB+1 and assert res.status === 400; reuse the existing request pattern (agentToken(TEST_AGENT_TOKEN), app.request to `${AGENT_BASE}/heartbeat`) and build payloads with JSON.stringify to measure serialized size to ensure the server enforces the limits.packages/backend/src/services/agents.ts (1)
353-353: 💤 Low valueConsider adding an explicit return type.
Per coding guidelines on exported functions. The implicit type works but documenting the contract helps downstream callers.
Suggested signature
-export async function processHeartbeat(agentId: number, data: AgentHeartbeat) { +export async function processHeartbeat(agentId: number, data: AgentHeartbeat): Promise<{ + agent: NonNullable<Awaited<ReturnType<typeof getAgentById>>> | null; + hasHighPriorityTasks: boolean; +}> {Or define a named interface if this return shape is reused elsewhere.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/src/services/agents.ts` at line 353, The exported function processHeartbeat currently relies on an implicit return type; update its signature to include an explicit return type (e.g., Promise<void> or a specific Promise<YourReturnShape>) to follow exported-function guidelines and clarify the contract for callers; modify the declaration "export async function processHeartbeat(agentId: number, data: AgentHeartbeat)" to include the appropriate explicit return type or define and reference a named interface/type if the return shape is reused elsewhere.packages/backend/tests/unit/agents-service.test.ts (1)
319-337: 💤 Low valueTest at lines 319-327 duplicates inputs from 272-283.
Both tests use
{ payloadStatus: 'online', errorSeverity: 'fatal', priorStatus: 'online' }. The earlier test already assertsshouldLogTransitionandreason. Consider consolidating or repurposing this test for an untested edge case (e.g., fatal error whenpriorStatus='error'— confirms no-op when already in error state).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/tests/unit/agents-service.test.ts` around lines 319 - 337, The two tests duplicate the same inputs for decideHeartbeatTransition; update the second test to cover a missing edge case instead of repeating `{ payloadStatus: 'online', errorSeverity: 'fatal', priorStatus: 'online' }`: modify the test that currently asserts suppression when priorStatus is null to instead change the duplicated test (the one that flips an online agent to fatal) into a case where errorSeverity: 'fatal' and priorStatus: 'error' (or otherwise already-in-error) and assert that shouldLogTransition is false and reason is null to confirm no-op when agent is already in error; keep references to the decideHeartbeatTransition call and its result assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/openapi/agent-api.yaml`:
- Around line 317-319: The OpenAPI contract allows heartbeat errors with
severity "fatal" to set agent.status='error', which conflicts with the auth
middleware check in auth.ts (if (!agent || agent.status === 'error') ...) and
leaves agents permanently blocked; either update packages/openapi/agent-api.yaml
to remove or rename the fatal severity so heartbeats cannot force
status='error', or implement recovery semantics: add an agent recovery endpoint
or admin-reset flow and update the middleware/auth logic to permit recovery
requests (or a special recovery token) from agents with status='error';
reference the heartbeat error schema in agent-api.yaml and the auth middleware
in packages/backend/src/middleware/auth.ts when making the change.
---
Nitpick comments:
In `@packages/backend/src/services/agents.ts`:
- Line 353: The exported function processHeartbeat currently relies on an
implicit return type; update its signature to include an explicit return type
(e.g., Promise<void> or a specific Promise<YourReturnShape>) to follow
exported-function guidelines and clarify the contract for callers; modify the
declaration "export async function processHeartbeat(agentId: number, data:
AgentHeartbeat)" to include the appropriate explicit return type or define and
reference a named interface/type if the return shape is reused elsewhere.
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 163-242: Add boundary tests in agent-api-contract.test.ts for the
AGENT_BASE/heartbeat endpoint to validate the new error.message max length
(4096) and serialized error.context cap (16 * 1024 bytes): create one test that
sends error.message exactly 4096 characters and error.context whose JSON string
length is exactly 16KiB and assert res.status === 200, and a complementary test
that sends error.message of 4097 chars and/or error.context whose JSON string
length is 16KiB+1 and assert res.status === 400; reuse the existing request
pattern (agentToken(TEST_AGENT_TOKEN), app.request to `${AGENT_BASE}/heartbeat`)
and build payloads with JSON.stringify to measure serialized size to ensure the
server enforces the limits.
In `@packages/backend/tests/unit/agents-service.test.ts`:
- Around line 319-337: The two tests duplicate the same inputs for
decideHeartbeatTransition; update the second test to cover a missing edge case
instead of repeating `{ payloadStatus: 'online', errorSeverity: 'fatal',
priorStatus: 'online' }`: modify the test that currently asserts suppression
when priorStatus is null to instead change the duplicated test (the one that
flips an online agent to fatal) into a case where errorSeverity: 'fatal' and
priorStatus: 'error' (or otherwise already-in-error) and assert that
shouldLogTransition is false and reason is null to confirm no-op when agent is
already in error; keep references to the decideHeartbeatTransition call and its
result assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7c3eea7e-ab5e-48ee-9427-658b05ffca0e
📒 Files selected for processing (7)
packages/backend/src/routes/agent/index.tspackages/backend/src/services/agents.tspackages/backend/tests/unit/agent-api-contract.test.tspackages/backend/tests/unit/agents-service.test.tspackages/openapi/agent-api.yamlpackages/shared/src/schemas/index.tspackages/shared/src/types/index.ts
Address findings from the pr-review-toolkit review pass:
- Fix false TextEncoder/ES2022 comment in the heartbeat error schema
(comment rot from an earlier autofix swap).
- Re-export the real decideHeartbeatTransition, classifyWorstSeverity,
classifyRecentErrors, and pickCurrentTaskByAgent from the
agent-api-contract mock.module factory. Without this, sibling test
files load undefined for these symbols depending on bun's per-file
load order — a known Linux-vs-macOS CI flake documented in GOTCHAS.md.
- Narrow errorSeverity in decideHeartbeatTransition to
AgentHeartbeatError['severity'] so the schema is the single source of
truth for the warning|fatal enum.
- Drop the dead `&& priorStatus` clause in the transition-log guard
(decideHeartbeatTransition already ensures shouldLogTransition is
false when priorStatus is null).
- Surface the silent-no-op cases: a missing agent row at heartbeat
time emits a logger.warn, and an agent_errors INSERT that returns no
row emits logger.error so a regression isn't invisible.
- Trim ticket references and the duplicated non-redundant-channel
paragraph in service code; the OpenAPI spec is the source of truth.
Tests:
- Add response-body assertions on accept-case heartbeats so a route
regression that drops `acknowledged` is caught.
- Add a back-compat baseline test for `{status:'online'}` only.
- Add boundary tests for the 4096-char message cap and 16K-char
context cap so a refactor that drops the limits ships visibly.
- Add a negative-progress rejection test for currentTask.
- Convert the new decideHeartbeatTransition tests to AAA structure to
match the existing classifyWorstSeverity tests in the same file.
361 backend tests pass; just ci-check clean.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/backend/tests/unit/agent-api-contract.test.ts (1)
365-386: ⚡ Quick winAdd an exact-limit acceptance test for
error.contextsize.You test
>16Krejection, but not acceptance at exactly the cap. Add the at-cap case to lock the boundary and prevent off-by-one regressions.As per coding guidelines, "Agent API contract tests must validate responses against OpenAPI spec to keep server and clients in sync."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/tests/unit/agent-api-contract.test.ts` around lines 365 - 386, Add a new unit test in agent-api-contract.test.ts that mirrors the existing oversized test but uses a context whose JSON serialization is exactly HEARTBEAT_ERROR_CONTEXT_MAX_CHARS (e.g., const atCap = { blob: 'x'.repeat(16 * 1024) }) and posts the same heartbeat payload via app.request to `${AGENT_BASE}/heartbeat` with agentToken(TEST_AGENT_TOKEN), then assert the response is accepted (status 200 or the expected success code) and also validate the response against the OpenAPI spec using the same response validation helper used in other contract tests so the boundary (exact cap) is locked and checked against the API spec.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 252-266: The test "rejects an error.severity outside the
warning|fatal enum" only asserts HTTP 400 but must also validate the error
response body shape against the API contract; update the test that posts to
`${AGENT_BASE}/heartbeat` (using agentToken/TEST_AGENT_TOKEN and app.request) to
parse the JSON body and assert the error envelope fields (for example
expect(body).toHaveProperty('error.code') and other required fields) and/or run
it through the existing OpenAPI response validator helper used elsewhere in the
suite so the response matches the spec; apply the same additional assertions to
the other listed failing tests (the blocks around lines 268-282, 284-298,
300-314, 343-363, 365-386).
---
Nitpick comments:
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 365-386: Add a new unit test in agent-api-contract.test.ts that
mirrors the existing oversized test but uses a context whose JSON serialization
is exactly HEARTBEAT_ERROR_CONTEXT_MAX_CHARS (e.g., const atCap = { blob:
'x'.repeat(16 * 1024) }) and posts the same heartbeat payload via app.request to
`${AGENT_BASE}/heartbeat` with agentToken(TEST_AGENT_TOKEN), then assert the
response is accepted (status 200 or the expected success code) and also validate
the response against the OpenAPI spec using the same response validation helper
used in other contract tests so the boundary (exact cap) is locked and checked
against the API spec.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a8f1da72-9fd1-43c7-947b-473d6f25ba2c
📒 Files selected for processing (4)
packages/backend/src/services/agents.tspackages/backend/tests/unit/agent-api-contract.test.tspackages/backend/tests/unit/agents-service.test.tspackages/shared/src/schemas/index.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/backend/src/services/agents.ts
- packages/backend/tests/unit/agents-service.test.ts
- packages/shared/src/schemas/index.ts
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Fixes Applied SuccessfullyFixed 2 file(s) based on 2 CodeRabbit feedback item(s). Files modified:
Commit: The latest autofix changes are on the Resolved
Deferred
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (3)
packages/backend/src/services/agents.ts:444
- The
agent_errorsINSERT (vialogAgentError), theagentsUPDATE, and thehandleTaskFailureloop are issued as separate, non-transactional statements. If any later step throws (e.g., the UPDATE fails,handleTaskFailurerejects mid-iteration), the heartbeat will have already persisted anagent_errorsrow and/or emitted SSE events viaemitAgentError/emitAgentStatus, but the agent row may not be marked'error'and some/all of the active tasks may not be failed. Dashboard SSE clients will then see stale state that disagrees with the DB. Consider wrapping insert + update + task-failure loop indb.transaction(...)and emitting SSE events only after the transaction commits.
This issue also appears on line 434 of the same file.
if (data.error) {
await logAgentError({
agentId,
severity: data.error.severity,
message: data.error.message,
context: data.error.context,
taskId: data.currentTask?.taskId,
// Skip the redundant SELECT inside logAgentError — we already
// have projectId from priorRow.
projectId: priorRow?.projectId,
});
}
const updates: Record<string, unknown> = {
status: effectiveStatus,
lastSeenAt: new Date(),
updatedAt: new Date(),
};
if (data.capabilities) {
updates['capabilities'] = data.capabilities;
}
if (data.deviceInfo) {
updates['hardwareProfile'] = data.deviceInfo;
}
const [updated] = await db.update(agents).set(updates).where(eq(agents.id, agentId)).returning();
if (updated) {
emitAgentStatus(updated.projectId, updated.id, effectiveStatus);
// `priorStatus` is non-null whenever `shouldLogTransition` is true
// (see decideHeartbeatTransition); the assertion below keeps TS happy
// without re-checking at runtime.
if (transition.shouldLogTransition && transition.reason) {
logStatusTransition({
agentId: updated.id,
projectId: updated.projectId,
fromStatus: priorStatus as string,
toStatus: effectiveStatus,
reason: transition.reason,
});
}
} else {
// Auth middleware verified the agent's bearer token, so the row was
// present a moment ago. A vanishing row mid-heartbeat means it was
// deleted concurrently. Surface it so an operator can correlate
// bursts of "ghost heartbeat" warnings with cleanup actions.
logger.warn(
{ agentId, status: effectiveStatus },
'Heartbeat for an agent row that no longer exists'
);
}
// On fatal error, fail the agent's current tasks. `handleTaskFailure`
// owns the retry policy; rolling a parallel path here would diverge.
// Dynamic import preserves the existing tasks.ts <-> agents.ts
// circular-import workaround.
if (isFatalError) {
const activeTasks = await db
.select({ id: tasks.id })
.from(tasks)
.where(and(eq(tasks.agentId, agentId), sql`${tasks.status} IN ('assigned', 'running')`));
const { handleTaskFailure } = await import('./tasks.js');
for (const activeTask of activeTasks) {
await handleTaskFailure(activeTask.id, agentId, data.error?.message ?? 'Agent fatal error');
}
}
packages/backend/src/services/agents.ts:402
- The SELECT of
priorRowfollowed by the unconditional UPDATE onagentsis a TOCTOU window: a concurrent heartbeat (or an admin-driven status mutation) between the two statements can be silently overwritten. BecauseprocessHeartbeatis invoked on the agent's natural heartbeat cadence, two near-simultaneous requests against the same agent row can interleave such that the audit log recordsfrom=A → to=Bwhile the row actually movedA → C → B. Consider a single UPDATE ... RETURNING that captures the prior status via a CTE, or use a row lock (SELECT ... FOR UPDATEinside a transaction) before the UPDATE.
const [priorRow] = await db
.select({ status: agents.status, projectId: agents.projectId })
.from(agents)
.where(eq(agents.id, agentId))
.limit(1);
const priorStatus = priorRow?.status ?? null;
const transition = decideHeartbeatTransition({
payloadStatus: data.status,
errorSeverity: data.error?.severity,
priorStatus,
});
const { effectiveStatus, isFatalError } = transition;
// Persist heartbeat-borne errors to agent_errors regardless of severity.
// Warnings are recorded but do not change status or fail the running
// task; fatals are recorded AND drive the status/task transitions
// below. The non-redundant-channel contract (heartbeat vs POST /errors)
// is documented on the Heartbeat schema in packages/openapi/agent-api.yaml.
if (data.error) {
await logAgentError({
agentId,
severity: data.error.severity,
message: data.error.message,
context: data.error.context,
taskId: data.currentTask?.taskId,
// Skip the redundant SELECT inside logAgentError — we already
// have projectId from priorRow.
projectId: priorRow?.projectId,
});
}
const updates: Record<string, unknown> = {
status: effectiveStatus,
lastSeenAt: new Date(),
updatedAt: new Date(),
};
if (data.capabilities) {
updates['capabilities'] = data.capabilities;
}
if (data.deviceInfo) {
updates['hardwareProfile'] = data.deviceInfo;
}
const [updated] = await db.update(agents).set(updates).where(eq(agents.id, agentId)).returning();
packages/backend/src/services/agents.ts:436
- A throw from
handleTaskFailurefor any single task aborts this loop, leaving earlier-iteration tasks failed and later-iteration tasks still inassigned/running. Since the agent has just been forced tostatus='error', those leftover tasks are stranded until the next sweep. Wrap the per-task call in try/catch and continue with logged failures, or include the loop in a surrounding transaction.
if (isFatalError) {
const activeTasks = await db
.select({ id: tasks.id })
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/backend/src/routes/agent/index.ts`:
- Around line 41-50: The agentValidationHook result parameter uses an incomplete
inline type; update it to match `@hono/zod-validator` 0.7.6 by replacing the
current ad-hoc type with the correct union that includes data and target or by
importing the official hook type: use a type like ({ success: true; data: T } |
{ success: false; error: ZodError; data: T }) & { target: Target } (or import
the provided generic hook type from `@hono/zod-validator`) so that
agentValidationHook(result, c) references a result that includes data, target
and a ZodError on failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 719edbf7-81ec-40d6-81ee-67b4a321a50a
📒 Files selected for processing (2)
packages/backend/src/routes/agent/index.tspackages/backend/tests/unit/agent-api-contract.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/backend/tests/unit/agent-api-contract.test.ts
Resolve the CodeRabbit-flagged fatal-heartbeat lockout (Residual #1). Before this change, the strict requireAgentToken middleware rejected any agent whose row was in status='error'. Once a fatal heartbeat forced the agent into that state, every subsequent request — including the heartbeat the agent would use to announce recovery — returned 401. Agents could not self-recover. Resolution: - Refactor requireAgentToken to a factory parameterized by allowErroredAgent. - Export two variants: the strict default (work endpoints) and a new requireAgentTokenAllowRecovery that admits status='error' agents so they can post a clean heartbeat. processHeartbeat is the only programmatic recovery path; the heartbeat handler transitions the agent back to 'online' on a non-fatal payload. - Wire requireAgentTokenAllowRecovery on POST /api/v1/agent/heartbeat only. /tasks/*, /errors, /benchmark, /resources/*, and /cracker/* keep the strict middleware so a broken agent does not pick up new work until it has recovered via /heartbeat first. - Document the recovery contract on the Heartbeat error.severity field in packages/openapi/agent-api.yaml. Tests: - Three new auth-middleware cases pin the recovery variant: accepts online agents (parity with strict), accepts error-state agents, and still rejects unknown tokens. - Two new agent-api-contract cases exercise the route-level integration: an errored agent posting a clean heartbeat receives 200, and the same errored agent gets 401 on /tasks/next (strict middleware unchanged). 366 backend tests pass; just ci-check clean. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Critical Finding ResolvedPer maintainer direction, the previously-deferred CodeRabbit Critical finding (fatal-heartbeat lockout) is now fixed in commit Resolution: split agent-token middleware into a strict variant (work endpoints) and a recovery-friendly variant ( Files modified:
366 backend tests pass; |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/backend/tests/unit/agent-api-contract.test.ts (1)
185-197: 💤 Low valueExtend existing 400 tests with envelope assertions for consistency.
The new heartbeat tests use
expectAgentValidationError(res)to assert the documented envelope shape. The older 400 tests for this endpoint (and/tasks/:id/report,/errors) only assert status. SinceagentValidationHookwas wired in this PR, consider updating these to catch contract drift.Applies to: lines 525-537, 553-565, 567-579.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/tests/unit/agent-api-contract.test.ts` around lines 185 - 197, Update the older 400 tests for the heartbeat, tasks report and errors endpoints to assert the documented error envelope using the shared helper instead of (or in addition to) just checking status: replace the plain expect(res.status).toBe(400) checks in the tests for "should return 400 for invalid heartbeat status enum" (heartbeat), the "/tasks/:id/report" 400 test, and the "/errors" 400 test with a call to expectAgentValidationError(res) so the response shape is validated by the agentValidationHook; locate the tests by their descriptions and use the helper expectAgentValidationError to assert the envelope.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/backend/tests/unit/auth-middleware.test.ts`:
- Around line 255-266: The test named "should accept a valid active agent (same
path as the strict middleware)" uses mockAgentResult with an invalid status
value; update the mock agent object assigned to mockAgentResult (used in this
test) to use a valid status from the enum (change status: 'active' to status:
'online') so the mocked agent matches the expected schema and the other
assertions (e.g., the request to '/agent-endpoint' expecting 200) behave
correctly.
---
Nitpick comments:
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 185-197: Update the older 400 tests for the heartbeat, tasks
report and errors endpoints to assert the documented error envelope using the
shared helper instead of (or in addition to) just checking status: replace the
plain expect(res.status).toBe(400) checks in the tests for "should return 400
for invalid heartbeat status enum" (heartbeat), the "/tasks/:id/report" 400
test, and the "/errors" 400 test with a call to expectAgentValidationError(res)
so the response shape is validated by the agentValidationHook; locate the tests
by their descriptions and use the helper expectAgentValidationError to assert
the envelope.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5d147780-2c64-48cf-a1ba-173e96b3ac1f
📒 Files selected for processing (5)
packages/backend/src/middleware/auth.tspackages/backend/src/routes/agent/index.tspackages/backend/tests/unit/agent-api-contract.test.tspackages/backend/tests/unit/auth-middleware.test.tspackages/openapi/agent-api.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/openapi/agent-api.yaml
Address every remaining residual from the multi-agent code review: * #2 — Integration coverage. New tests/integration/agent-heartbeat.test.ts exercises the full heartbeat path with a mocked drizzle client across 9 scenarios mapped 1:1 to plan U5: warning persistence, fatal task failure, fatal with no active tasks, transition log fires once, no log on no-op, currentTask.taskId carryover, ownership rejection, mid-loop continue-on-error, and context PII scrub. The file runs in an isolated bun:test phase (AGENT_HEARTBEAT_TEST_ISOLATED=1) per the GOTCHAS.md mock.module leakage pattern; package.json wires it as a dedicated phase in the test script. * #3 — Transactional integrity. processHeartbeat now wraps the prior- status read (SELECT ... FOR UPDATE), the agent_errors INSERT, and the agents UPDATE in db.transaction. A rollback no longer leaves an orphan error row referencing an un-updated agent. logAgentError accepts an optional drizzle client so it can participate in the outer transaction. * #4 — Continue-on-error fatal task fan-out. The handleTaskFailure loop now wraps each call in try/catch, logs failures with the failing task ID, and returns a {attempted, failed} summary so the caller / monitoring can detect partial fan-outs without losing sibling work. * #5 — currentTask ownership enforcement. New verifyTaskOwnership helper queries tasks WHERE id=? AND agent_id=? before the transaction opens; mismatches drop the task linkage on the persisted agent_errors row and log a warn so compromised tokens cannot attribute errors across agents. An FK on agent_errors.task_id (ON DELETE SET NULL) is the last-line guard against dangling rows. Migration 0007 backfills orphan task_ids before attaching the FK. * #6 — TOCTOU race on prior-status read. The atomic block from #3 uses .for('update') on the agents row, so concurrent heartbeats serialize on the row and cannot both observe the same prior status. Resolves the long-standing GOTCHAS.md "atomic status guards" warning for this code path. * #7 — Post-commit event emits. emitAgentError and emitAgentStatus fire AFTER db.transaction returns; SSE clients never see a status the DB later rolls back. logAgentError gains a suppressEvent flag so the heartbeat path can defer the emit without changing POST /errors behavior. * #8 — HeartbeatTransition discriminated union. Replaces the shouldLogTransition boolean + reason union with a kind:'noop' | kind:'transition' tagged variant. Invalid combinations ({shouldLogTransition: true, reason: null}) are now compile errors; the call-site triple guard collapses to one kind check, and fromStatus is only accessible on the transition arm. * #10 — PII scrub. New scrubAgentErrorContext walks the agent- supplied context jsonb (depth-capped) and redacts any object key matching /token|password|secret|api[_-]?key|authorization|cookie |bearer/i to "[REDACTED]". processHeartbeat scrubs before persisting so a leaked agent process or accidentally-serialized env block cannot deposit credentials into agent_errors. 396 backend tests pass (5 isolated phases + main); just ci-check clean. Migration 0007 backfills orphan agent_errors.task_id refs before attaching the FK so the change applies safely to existing production data. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
All residual review findings resolvedPer maintainer direction (no deferrals), every remaining residual is fixed in commit Code changes
Resolution map
396 total tests pass (372 main + 9 heartbeat-integration + 9 health-deterministic + 3 control-RBAC + 22 tasks + 19 queue-manager isolated phases). |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
Comments suppressed due to low confidence (1)
packages/backend/src/services/agents.ts:561
campaignsis already imported at the top of this file. The dynamicawait import('@hashhive/shared')here shadows the top-level binding and adds an unnecessary async resolution on every heartbeat that has anupdatedagent row. Use the static import.
// Check if there are high-priority pending tasks for this agent's project
let hasHighPriorityTasks = false;
if (updated) {
const { campaigns } = await import('@hashhive/shared');
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🧹 Nitpick comments (2)
packages/backend/src/services/agents.ts (1)
373-373: 💤 Low valueConsider extending pattern to cover additional credential-shaped keys.
Current pattern misses common variants like
credential,private,session, andjwt. Not blocking, but operators may inadvertently leak these.-const SECRET_KEY_PATTERN = /token|password|secret|api[_-]?key|authorization|cookie|bearer/i; +const SECRET_KEY_PATTERN = /token|password|secret|api[_-]?key|authorization|cookie|bearer|credential|private|session|jwt/i;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/src/services/agents.ts` at line 373, SECRET_KEY_PATTERN currently misses common credential-like keys; update the regular expression constant SECRET_KEY_PATTERN to include additional variants such as credential, private, session, jwt (and optionally token-related forms like access[_-]?token) so keys like "credential", "privateKey", "session_id", "jwt" are matched; modify the pattern used in packages/backend/src/services/agents.ts (the SECRET_KEY_PATTERN constant) to append these alternates into the alternation list while preserving existing case-insensitive flag and boundary/word considerations.packages/backend/tests/integration/agent-heartbeat.test.ts (1)
122-124: 💤 Low valueIntentional thenable pattern — suppress or acknowledge.
Biome flags
thenproperty as suspicious, but this is an intentional pattern to make the drizzle mock chain awaitable without calling.limit(). Consider adding a suppression comment for clarity:// Make the chain thenable so consumers that await it without .limit() resolve. + // biome-ignore lint/suspicious/noThenProperty: intentional thenable mock for drizzle chain (chain as { then: Promise<unknown[]>['then'] }).then = (resolve: (v: unknown) => void) => Promise.resolve(rows).then(resolve);Same applies to line 185.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/backend/tests/integration/agent-heartbeat.test.ts` around lines 122 - 124, The test uses an intentional thenable pattern by assigning (chain as { then: Promise<unknown[]>['then'] }).then = ... to make the drizzle mock chain awaitable; Biome flags this as suspicious, so add a short suppression/acknowledgement comment immediately above that assignment (and the analogous assignment around line 185) explaining that the then property is intentionally added to make the mock awaitable and should not be removed, or add the appropriate Biome suppression directive to silence the warning while keeping the thenable behavior in functions/mocks named like the chain mock.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/backend/tests/integration/agent-heartbeat.test.ts`:
- Around line 272-280: The beforeEach currently calls mockClear() on several
mocks which only clears call history and can leak queued mockResolvedValueOnce
values; update the beforeEach to call mockReset() for handleTaskFailureMock,
loggerMock.info, loggerMock.warn, loggerMock.error, emitAgentErrorMock, and
emitAgentStatusMock so each test starts with fully reset mocks (e.g., replace
all .mockClear() calls with .mockReset() in the beforeEach block that also calls
resetState()).
In `@packages/shared/src/db/migrations/meta/_journal.json`:
- Around line 54-59: Migration entry with idx 7 (tag "0007_light_ulik") has an
out-of-order "when" timestamp (1779052939557) that precedes migration 6;
regenerate the migration journal so the "when" for idx 7 is updated to a
correct, increasing timestamp by running the project's migration generator (run
"just db-generate"), then commit the updated _journal.json entry so the "when"
value for idx 7 is chronologically after migration 6.
---
Nitpick comments:
In `@packages/backend/src/services/agents.ts`:
- Line 373: SECRET_KEY_PATTERN currently misses common credential-like keys;
update the regular expression constant SECRET_KEY_PATTERN to include additional
variants such as credential, private, session, jwt (and optionally token-related
forms like access[_-]?token) so keys like "credential", "privateKey",
"session_id", "jwt" are matched; modify the pattern used in
packages/backend/src/services/agents.ts (the SECRET_KEY_PATTERN constant) to
append these alternates into the alternation list while preserving existing
case-insensitive flag and boundary/word considerations.
In `@packages/backend/tests/integration/agent-heartbeat.test.ts`:
- Around line 122-124: The test uses an intentional thenable pattern by
assigning (chain as { then: Promise<unknown[]>['then'] }).then = ... to make the
drizzle mock chain awaitable; Biome flags this as suspicious, so add a short
suppression/acknowledgement comment immediately above that assignment (and the
analogous assignment around line 185) explaining that the then property is
intentionally added to make the mock awaitable and should not be removed, or add
the appropriate Biome suppression directive to silence the warning while keeping
the thenable behavior in functions/mocks named like the chain mock.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 94987f31-bcc8-4862-9db1-8c9aa927287a
📒 Files selected for processing (9)
packages/backend/package.jsonpackages/backend/src/services/agents.tspackages/backend/tests/integration/agent-heartbeat.test.tspackages/backend/tests/unit/agent-api-contract.test.tspackages/backend/tests/unit/agents-service.test.tspackages/shared/src/db/migrations/0007_light_ulik.sqlpackages/shared/src/db/migrations/meta/0007_snapshot.jsonpackages/shared/src/db/migrations/meta/_journal.jsonpackages/shared/src/db/schema.ts
✅ Files skipped from review due to trivial changes (2)
- packages/shared/src/db/migrations/0007_light_ulik.sql
- packages/shared/src/db/migrations/meta/0007_snapshot.json
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/backend/tests/unit/agents-service.test.ts
- packages/backend/tests/unit/agent-api-contract.test.ts
Address every actionable review thread from CodeRabbit + Copilot:
* Validation hook now uses a structural type that mirrors the
@hono/zod-validator 0.7.6 Hook signature and walks every zod issue
(not just the first) so union-refinement failures appear in the 400
body instead of being dropped.
* Refine the secret-key scrub: replace the substring regex with a
whole-word + trailing-suffix check (isSecretKey) so descriptive names
like tokenCount, cookieDomain, bearerHostname, apiKeyName,
secretsManagerEnabled, and passwordAge are no longer falsely
redacted. Add positive and negative coverage.
* Move verifyTaskOwnership inside the heartbeat transaction with a
FOR UPDATE lock on the task row so a concurrent reassignment cannot
land between the ownership check and the agent_errors insert.
* Cache the dynamic import('./tasks.js') call (lazy module ref via
getHandleTaskFailure) so the import resolves once instead of on
every fatal-error heartbeat. Drop a redundant dynamic
import('@hashhive/shared') for `campaigns` (already imported
statically).
* Rename requireAgentTokenAllowRecovery to
requireAgentTokenForHeartbeatRecovery so the heartbeat-only intent
is harder to miss when adding new endpoints.
* Migration 0007: add IF EXISTS to the DROP INDEX so a hand-rebuilt
environment doesn't block the FK addition; fix journal timestamp
ordering (0007.when now > 0006.when).
* Replace mockClear() with mockReset() in the heartbeat integration
beforeEach (per GOTCHAS.md "Use mockReset() not mockClear() in
beforeEach"); reinstall the default handleTaskFailure impl so the
scenarios that rely on it keep working.
* Fix a stray status:'active' literal in the auth-middleware test;
the agent status enum only accepts online|busy|error|benchmarked.
* Replace the redundant decideHeartbeatTransition fatal-from-online
case with the previously-missing fatal-when-already-error edge case
so the test exercises the noop-on-same-resolved-status branch.
374 backend tests pass + 9 isolated heartbeat phase = 383 total in
this PR; just ci-check clean.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Summary
Implements Agent Heartbeat & Error Handling for the hashcat worker agent surface.
Implementation units
agentHeartbeatSchema+agentHeartbeatErrorSchema+agentHeartbeatCurrentTaskSchemain@hashhive/shared; correspondingAgentHeartbeatError/AgentHeartbeatCurrentTasktype exports.processHeartbeatrefactored with a puredecideHeartbeatTransitiondecision helper, transactional persistence, ownership-checked task linkage, structured status-transition audit log, and post-commit event emission.POST /api/v1/agent/heartbeatflows the new schema through unchanged.Hardening applied during review (commits
ca6a144->4c94b281e78d53d6c74cb98f92d97d66e26bb485)Every finding from the multi-agent review pass + CodeRabbit + the second-pass pr-review-toolkit run is addressed in-PR:
Schema + DoS bounds
error.messageat 4096 chars anderror.contextat 16 K serialized chars on both heartbeat and standalone/errorsschemas.warning|fatalenum on the OpenAPI spec.Logging + visibility
logger.infoaudit line on every real status transition; suppressed on no-op heartbeats.logger.warnwhen a heartbeat arrives for a vanished agent row.logger.errorwhenagent_errorsINSERT returns no row.logger.warnwhencurrentTask.taskIdis not owned by the calling agent (drops the task linkage).logger.errorwhenhandleTaskFailurethrows mid-loop (continue-on-error preserves sibling tasks).Security
scrubAgentErrorContext) redacts/token|password|secret|api[_-]?key|authorization|cookie|bearer/ikeys server-side before persisting.currentTask.taskIdis verified againsttasks WHERE id=? AND agent_id=?before persistence so a compromised agent token cannot attribute errors to another agent's tasks.agent_errors.task_id(ON DELETE SET NULL) as last-line guard; migration 0007 backfills orphan refs before attaching.requireAgentTokenAllowRecovery, heartbeat-only) variants so an errored agent can self-recover via a clean heartbeat.Atomicity + correctness
db.transactionwraps the heartbeat-error INSERT + agents UPDATE;SELECT ... FOR UPDATEon the agent row closes the TOCTOU race on prior-status read.emitAgentError,emitAgentStatus) fire AFTER the transaction commits so SSE clients never see rolled-back state.handleTaskFailurein try/catch; partial failures surface in the response as{attempted, failed}.HeartbeatTransitionrefactored to a discriminated union (kind: 'noop' | 'transition'); invalid combinations are compile errors and the call-site triple guard collapses to one kind check.Type tightening
decideHeartbeatTransitionparameter andHeartbeatTransition.effectiveStatususe theAgentHeartbeat['status']literal union — status typos at call sites are compile errors.errorSeverityparameter typed viaAgentHeartbeatError['severity'].Contract + ergonomics
{error: {code: 'VALIDATION_ERROR', message}}envelope via a centralizedagentValidationHook.decideHeartbeatTransition,classifyWorstSeverity,classifyRecentErrors,pickCurrentTaskByAgent, andscrubAgentErrorContextfrom itsmock.modulefactory to avoid Linux-vs-macOS load-order CI flakes (GOTCHAS.md pattern).logAgentErroraccepts an optional drizzle client (so it can participate in an outer transaction) and an optionalprojectIdpass-through (soprocessHeartbeatskips the redundant SELECT).OpenAPI spec
Heartbeatschema documentscurrentTask,error, the narrow severity enum, size caps, the non-redundant heartbeat-vs-/errorscontract, and the recovery path.Test coverage
AGENT_HEARTBEAT_TEST_ISOLATED=1)processHeartbeatpath with mocked drizzle; 1:1 mapped to plan U5Plan U5 scenarios covered by the new
tests/integration/agent-heartbeat.test.ts:handleTaskFailureinvocation (R3/R4/R5)currentTask.taskIdcarryover when ownedcurrentTask.taskIddropped when not owned (with audit log)handleTaskFailurethrow doesn't strand siblingserror.contextPII scrub before persistenceTest plan
just check(format, lint, type-check, build) — cleanjust ci-check(full test suite on top of check) — 372 main + 9 heartbeat integration + 53 other isolated = 396 total, 0 fail, 15 skipRefs
spec/tickets/Agent_Heartbeat_&_Error_Handling.mddocs/plans/2026-05-16-001-feat-agent-heartbeat-error-handling-plan.md(gitignored)packages/shared/src/db/migrations/0007_light_ulik.sql