Skip to content

feat(agent): heartbeat & error handling with audit log#146

Merged
unclesp1d3r merged 7 commits into
mainfrom
feat/agent-heartbeat-error-handling
May 17, 2026
Merged

feat(agent): heartbeat & error handling with audit log#146
unclesp1d3r merged 7 commits into
mainfrom
feat/agent-heartbeat-error-handling

Conversation

@unclesp1d3r
Copy link
Copy Markdown
Member

@unclesp1d3r unclesp1d3r commented May 16, 2026

Summary

Implements Agent Heartbeat & Error Handling for the hashcat worker agent surface.

Implementation units

  • U1agentHeartbeatSchema + agentHeartbeatErrorSchema + agentHeartbeatCurrentTaskSchema in @hashhive/shared; corresponding AgentHeartbeatError / AgentHeartbeatCurrentTask type exports.
  • U2/U3processHeartbeat refactored with a pure decideHeartbeatTransition decision helper, transactional persistence, ownership-checked task linkage, structured status-transition audit log, and post-commit event emission.
  • U4 — Route handler at POST /api/v1/agent/heartbeat flows the new schema through unchanged.
  • U5 — Pure-helper tests, schema-validation tests, and full integration coverage for the heartbeat path.

Hardening applied during review (commits ca6a144 -> 4c94b281e78d53d6c74cb98f92d97d66e26bb485)

Every finding from the multi-agent review pass + CodeRabbit + the second-pass pr-review-toolkit run is addressed in-PR:

Schema + DoS bounds

  • Bound error.message at 4096 chars and error.context at 16 K serialized chars on both heartbeat and standalone /errors schemas.
  • Documented size caps and the narrow warning|fatal enum on the OpenAPI spec.

Logging + visibility

  • Structured logger.info audit line on every real status transition; suppressed on no-op heartbeats.
  • logger.warn when a heartbeat arrives for a vanished agent row.
  • logger.error when agent_errors INSERT returns no row.
  • logger.warn when currentTask.taskId is not owned by the calling agent (drops the task linkage).
  • logger.error when handleTaskFailure throws mid-loop (continue-on-error preserves sibling tasks).

Security

  • PII scrubber (scrubAgentErrorContext) redacts /token|password|secret|api[_-]?key|authorization|cookie|bearer/i keys server-side before persisting.
  • currentTask.taskId is verified against tasks WHERE id=? AND agent_id=? before persistence so a compromised agent token cannot attribute errors to another agent's tasks.
  • FK on agent_errors.task_id (ON DELETE SET NULL) as last-line guard; migration 0007 backfills orphan refs before attaching.
  • Fatal-heartbeat recovery: split agent-token middleware into strict (work endpoints) and recovery-friendly (requireAgentTokenAllowRecovery, heartbeat-only) variants so an errored agent can self-recover via a clean heartbeat.

Atomicity + correctness

  • db.transaction wraps the heartbeat-error INSERT + agents UPDATE; SELECT ... FOR UPDATE on the agent row closes the TOCTOU race on prior-status read.
  • Event emits (emitAgentError, emitAgentStatus) fire AFTER the transaction commits so SSE clients never see rolled-back state.
  • Fatal task-failure loop wraps each handleTaskFailure in try/catch; partial failures surface in the response as {attempted, failed}.
  • HeartbeatTransition refactored to a discriminated union (kind: 'noop' | 'transition'); invalid combinations are compile errors and the call-site triple guard collapses to one kind check.

Type tightening

  • decideHeartbeatTransition parameter and HeartbeatTransition.effectiveStatus use the AgentHeartbeat['status'] literal union — status typos at call sites are compile errors.
  • errorSeverity parameter typed via AgentHeartbeatError['severity'].

Contract + ergonomics

  • All agent-API zValidator failures return the documented {error: {code: 'VALIDATION_ERROR', message}} envelope via a centralized agentValidationHook.
  • Mock-leak fix: contract test re-exports the real decideHeartbeatTransition, classifyWorstSeverity, classifyRecentErrors, pickCurrentTaskByAgent, and scrubAgentErrorContext from its mock.module factory to avoid Linux-vs-macOS load-order CI flakes (GOTCHAS.md pattern).
  • logAgentError accepts an optional drizzle client (so it can participate in an outer transaction) and an optional projectId pass-through (so processHeartbeat skips the redundant SELECT).

OpenAPI spec

  • Heartbeat schema documents currentTask, error, the narrow severity enum, size caps, the non-redundant heartbeat-vs-/errors contract, and the recovery path.

Test coverage

Phase Tests Notes
Main suite 372 pass / 0 fail / 15 skip Unit + route contract
Heartbeat integration (AGENT_HEARTBEAT_TEST_ISOLATED=1) 9 pass Full processHeartbeat path with mocked drizzle; 1:1 mapped to plan U5
Tasks isolated 22 pass Existing
Queue-manager isolated 19 pass Existing
Control-RBAC isolated 3 pass Existing
Health-deterministic isolated 9 pass Existing

Plan U5 scenarios covered by the new tests/integration/agent-heartbeat.test.ts:

  • warning persistence (R3/R4)
  • fatal → handleTaskFailure invocation (R3/R4/R5)
  • fatal with no active tasks (R5 edge)
  • transition log fires once on real transition (R2)
  • no transition log on no-op heartbeat (R2)
  • currentTask.taskId carryover when owned
  • currentTask.taskId dropped when not owned (with audit log)
  • mid-loop handleTaskFailure throw doesn't strand siblings
  • error.context PII scrub before persistence

Test plan

  • just check (format, lint, type-check, build) — clean
  • just ci-check (full test suite on top of check) — 372 main + 9 heartbeat integration + 53 other isolated = 396 total, 0 fail, 15 skip
  • Plan U5 integration tests — covered (see table above)
  • Manual exercise against a local dev server — heartbeat with no error / warning / fatal / malformed all behave as specified

Refs

  • Spec: spec/tickets/Agent_Heartbeat_&_Error_Handling.md
  • Plan: docs/plans/2026-05-16-001-feat-agent-heartbeat-error-handling-plan.md (gitignored)
  • Migration: packages/shared/src/db/migrations/0007_light_ulik.sql

Extend agentHeartbeatSchema with currentTask telemetry and a narrow
warning|fatal error block. Heartbeat-borne errors are now persisted to
agent_errors via logAgentError (warning rows leave status untouched,
fatal rows force status='error' and route active tasks through
handleTaskFailure). Status transitions emit a structured logger.info
line for audit; no-op transitions are suppressed to avoid heartbeat-
volume noise.

The transition decision is extracted as a pure decideHeartbeatTransition
helper so the policy can be pinned without DB mocks, matching the
classifyWorstSeverity pattern.

Refs: spec/tickets/Agent_Heartbeat_&_Error_Handling.md
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Apply ce-code-review safe_auto fixes from run 20260516-021943-e0500813:

- Bound heartbeat error message (max 4096) and context (max 16K chars
  serialized) at the zod boundary so a compromised agent cannot bloat
  agent_errors with multi-MB rows every ~30s. Mirror the caps onto the
  standalone POST /errors schema.
- Pass projectId through from processHeartbeat into logAgentError so
  the helper skips its redundant SELECT on every error-bearing
  heartbeat.
- Tighten decideHeartbeatTransition to use the AgentHeartbeat['status']
  literal union instead of bare string; a status typo at a call site
  is now a compile error.
- Update packages/openapi/agent-api.yaml Heartbeat schema with
  currentTask and error properties, document the narrow warning|fatal
  enum, the non-redundant heartbeat-vs-/errors channel contract, and
  the new size caps on AgentError.

Cross-corroborated by sec-1+adv-1 (P1), maintainability+perf (P2),
kieran-typescript (P2), api-contract+project-standards (P1). Residual
non-safe-auto findings (transactional integrity, currentTask ownership
check, atomic CTE, integration tests, fatal-strand recovery path)
recorded in /tmp/compound-engineering/ce-code-review/20260516-021943-e0500813/residual.md.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings May 16, 2026 06:50
@coderabbitai coderabbitai Bot added enhancement New feature or request backend testing Testing infrastructure and test cases priority:medium Should be done — improves quality or capability shared agent-api labels May 16, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the Agent Heartbeat & Error Handling ticket: extends agentHeartbeatSchema with optional currentTask telemetry and a narrow error: { severity: warning|fatal, message, context } block, persists heartbeat-borne errors to agent_errors, escalates fatal errors to status='error' plus handleTaskFailure on all active tasks, and emits a structured audit log line on real status transitions. A pure decideHeartbeatTransition helper is extracted so the policy can be unit-tested without DB mocks. The OpenAPI spec and standalone /errors route are updated with matching size caps.

Changes:

  • New agentHeartbeatErrorSchema / agentHeartbeatCurrentTaskSchema (with 4 KiB message / 16 KiB serialized context caps) added to the shared package and surfaced in the OpenAPI Heartbeat schema; standalone AgentError gets the same caps.
  • processHeartbeat rewritten around decideHeartbeatTransition: reads prior status, logs the error via logAgentError (with project-id pass-through to skip a redundant SELECT), updates the row, emits a transition audit log, and runs handleTaskFailure for fatals.
  • New unit tests for decideHeartbeatTransition (6) and route-level schema validation tests for POST /heartbeat (5).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/shared/src/schemas/index.ts Adds heartbeat error/current-task schemas with size caps and extends agentHeartbeatSchema.
packages/shared/src/types/index.ts Re-exports new AgentHeartbeatError and AgentHeartbeatCurrentTask types.
packages/openapi/agent-api.yaml Documents currentTask and error on Heartbeat and tightens the standalone AgentError contract.
packages/backend/src/services/agents.ts Adds decideHeartbeatTransition, status-transition logging, fatal-error task failure flow, and projectId pass-through to logAgentError.
packages/backend/src/routes/agent/index.ts Imports new caps and applies matching bounds to the standalone /errors validation schema.
packages/backend/tests/unit/agents-service.test.ts 6 unit tests for decideHeartbeatTransition.
packages/backend/tests/unit/agent-api-contract.test.ts 5 new route-level schema-validation tests for POST /heartbeat.
Comments suppressed due to low confidence (2)

packages/backend/src/services/agents.ts:434

  • The agent_errors insert and the agents UPDATE happen as two separate, non-transactional statements (with the insert occurring first, before the row is locked or updated). If the UPDATE — or the subsequent handleTaskFailure loop on a fatal heartbeat — throws, the system is left with an orphan error row attributed to an agent whose status was never transitioned and whose active tasks were never failed, breaking the invariant that fatal heartbeats produce a consistent (error_row, status='error', failed_tasks) triple. Wrapping the insert + update + task-failure loop in db.transaction(...) would restore atomicity. (PR description already flags this as residual #3/#4; flagging here so it's anchored to the diff.)

This issue also appears on line 378 of the same file.

  if (data.error) {
    await logAgentError({
      agentId,
      severity: data.error.severity,
      message: data.error.message,
      context: data.error.context,
      taskId: data.currentTask?.taskId,
      // Skip the redundant SELECT inside logAgentError — we already
      // have projectId from priorRow.
      projectId: priorRow?.projectId,
    });
  }

  const updates: Record<string, unknown> = {
    status: effectiveStatus,
    lastSeenAt: new Date(),
    updatedAt: new Date(),
  };

  if (data.capabilities) {
    updates['capabilities'] = data.capabilities;
  }
  if (data.deviceInfo) {
    updates['hardwareProfile'] = data.deviceInfo;
  }

  const [updated] = await db.update(agents).set(updates).where(eq(agents.id, agentId)).returning();

  if (updated) {
    emitAgentStatus(updated.projectId, updated.id, effectiveStatus);

    if (transition.shouldLogTransition && transition.reason && priorStatus) {
      logStatusTransition({
        agentId: updated.id,
        projectId: updated.projectId,
        fromStatus: priorStatus,
        toStatus: effectiveStatus,
        reason: transition.reason,
      });
    }
  }

  // On fatal error, fail the agent's current tasks. `handleTaskFailure`
  // applies the up-to-3-retry policy defined in the ticket; rolling a
  // parallel path here would diverge. Dynamic import preserves the
  // existing tasks.ts <-> agents.ts circular-import workaround.
  if (isFatalError) {
    const activeTasks = await db
      .select({ id: tasks.id })
      .from(tasks)
      .where(and(eq(tasks.agentId, agentId), sql`${tasks.status} IN ('assigned', 'running')`));

    const { handleTaskFailure } = await import('./tasks.js');
    for (const activeTask of activeTasks) {
      await handleTaskFailure(activeTask.id, agentId, data.error?.message ?? 'Agent fatal error');
    }
  }

packages/backend/src/services/agents.ts:389

  • currentTask.taskId from the heartbeat payload is persisted directly into agent_errors.task_id with no check that the task belongs to the calling agent (no tasks.agentId = agentId lookup, and the column is not enforced by an FK constraint visible here). A compromised or buggy agent can attribute its errors to arbitrary task IDs — including tasks owned by other agents or other projects — which corrupts per-task error history and any UI that joins agent_errors to tasks. Either validate ownership before persisting or drop taskId when ownership cannot be established.
  if (data.error) {
    await logAgentError({
      agentId,
      severity: data.error.severity,
      message: data.error.message,
      context: data.error.context,
      taskId: data.currentTask?.taskId,
      // Skip the redundant SELECT inside logAgentError — we already
      // have projectId from priorRow.
      projectId: priorRow?.projectId,
    });
  }

Comment thread packages/backend/tests/unit/agents-service.test.ts Outdated
Comment thread packages/backend/src/services/agents.ts Outdated
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR extends agent heartbeats with optional inline error and currentTask fields (bounded sizes), extracts/export a pure heartbeat status decision, refactors processHeartbeat/logAgentError for transactional emission and fatal-task fan-out, centralizes Zod validation error shaping, adds recovery-friendly agent auth for heartbeats, updates OpenAPI/schema/migrations, and expands tests.

Changes

Agent Heartbeat Enhancement

Layer / File(s) Summary
Shared schemas and constants
packages/shared/src/schemas/index.ts
Adds HEARTBEAT_ERROR_MESSAGE_MAX and HEARTBEAT_ERROR_CONTEXT_MAX_CHARS, agentHeartbeatErrorSchema (severity enum, message cap, serialized-context refinement), agentHeartbeatCurrentTaskSchema, and extends agentHeartbeatSchema with error and currentTask.
Type exports
packages/shared/src/types/index.ts
Re-exports AgentHeartbeatError and AgentHeartbeatCurrentTask derived from the new schemas.
OpenAPI heartbeat and error contracts
packages/openapi/agent-api.yaml
Adds currentTask and error to Heartbeat schema and tightens AgentError schema with message/context size guidance and taskId minimum.
Auth middleware factory & recovery variant
packages/backend/src/middleware/auth.ts
Introduce createAgentTokenMiddleware({allowErroredAgent}), rewire requireAgentToken, and add requireAgentTokenAllowRecovery for heartbeat recovery.
Route validation and centralized hook
packages/backend/src/routes/agent/index.ts
Introduces agentValidationHook to normalize Zod validation failures to { error: { code: 'VALIDATION_ERROR', message } }, wires it into multiple agent endpoints, and applies heartbeat-aligned size limits to POST /errors.
Status transition decision and wiring
packages/backend/src/services/agents.ts
Adds exported types and pure decideHeartbeatTransition to compute effectiveStatus, isFatalError, and transition metadata; adds scrubAgentErrorContext and verifyTaskOwnership; rewires processHeartbeat to use the decision, persist effective status/timestamps/capabilities, emit agent_status, and conditionally record status-transition audits and fail active tasks on fatal heartbeats.
logAgentError transactional emission
packages/backend/src/services/agents.ts
Extends logAgentError signature with optional projectId, suppressEvent, and optional dbClient; resolves projectId when needed and emits SSE only when projectId is known; logs insert failures.
Integration tests: heartbeat error handling
packages/backend/tests/integration/agent-heartbeat.test.ts
Adds isolated integration tests mocking DB/events/tasks/auth to validate warning vs fatal persistence, status transitions, task-failure fan-out, redaction, and audit emission.
Unit & contract tests
packages/backend/tests/unit/*
Adjusts Bun mocks to re-export pure helpers; adds expectAgentValidationError helper; expands heartbeat contract tests and adds decideHeartbeatTransition unit tests; adds auth recovery tests.
DB migration and schema
packages/shared/src/db/migrations/*, packages/shared/src/db/schema.ts
Backfill orphaned agent_errors.task_id to NULL, add FK ON DELETE SET NULL, update snapshot and journal, and reflect FK in schema.ts.

Sequence Diagram(s)

sequenceDiagram
  participant Agent
  participant API as AgentAPI
  participant Service as processHeartbeat
  participant DB as agentsDB
  participant Audit as logStatusTransition

  Agent->>API: POST /api/v1/agent/heartbeat (status, error?, currentTask?)
  API->>Service: processHeartbeat(agentId, heartbeat)
  Service->>DB: SELECT prior agent row (status, projectId)
  Service->>Service: decideHeartbeatTransition(payloadStatus, errorSeverity, priorStatus)
  Service->>DB: UPDATE agents SET effectiveStatus, timestamps, capabilities
  Service->>DB: INSERT agent_errors (suppressed SSE)
  Service->>API: emit agent_status SSE (post-commit)
  Service->>Audit: logStatusTransition(...) [only if transition]
  Service->>Service: on fatal -> dynamically import handleTaskFailure -> fan out per active task
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

documentation, priority:high

Poem

Heartbeats now carry tales, concise and tight,
errors trimmed to fit the night,
a pure decision guides the state,
tasks fail gently when it's late.
🔧💓

🚥 Pre-merge checks | ✅ 7 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Typescript Strict Mode Compliance ⚠️ Warning Six exported async functions lack explicit Promise return types: getAgentById, processHeartbeat, updateAgent, logAgentError, getAgentErrors, submitBenchmarks (lines 127, 431, 584, 615, 689, 705). Add explicit Promise return annotations. No 'any' types found. noUncheckedIndexedAccess and exactOptionalPropertyTypes compliance verified.
Agent Api Stability ❓ Inconclusive No result was produced after verification. Marking as INCONCLUSIVE. Re-run the check or adjust instructions to produce a final result.
✅ Passed checks (7 passed)
Check name Status Explanation
Title check ✅ Passed Title follows conventional commit format with type 'feat', scope 'agent', and descriptive lowercase text clearly summarizing the main change: heartbeat and error handling with audit logging.
Description check ✅ Passed Description is comprehensive and directly related to the changeset, detailing implementation units, hardening applied, test coverage, and references to specification and migration files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Betterauth And Auth Patterns ✅ Passed Dashboard API uses BetterAuth session (auth.api.getSession), X-Project-Id header, and RBAC. Agent API uses bearer token lookup (not JWT). No hand-rolled JWT code. Architectural separation documented.
Error Handling Patterns ✅ Passed All errors explicitly handled and logged via pino. No empty catch blocks. Hono onError checks instanceof HTTPException. Client responses use controlled envelopes; no internal details leak.
Drizzle Orm Patterns ✅ Passed Parameterized queries verified; onConflictDoUpdate correct; atomic status guards via FOR UPDATE + WHERE; deferred post-commit events; transaction-aware logAgentError; correct FK backfill in migration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/agent-heartbeat-error-handling

Warning

Review ran into problems

🔥 Problems

These MCP integrations need to be re-authenticated in the Integrations settings: Linear, Notion


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (3)
packages/backend/tests/unit/agent-api-contract.test.ts (1)

163-242: ⚡ Quick win

Add heartbeat boundary tests for new size caps.

These cases validate enum/required fields, but they don’t cover the new error.message max (4096) and serialized error.context 16KiB cap. Adding pass/fail boundary assertions here would lock the contract and catch regressions early.

As per coding guidelines, "Agent API contract tests should validate responses against OpenAPI spec to keep server and clients in sync."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/tests/unit/agent-api-contract.test.ts` around lines 163 -
242, Add boundary tests in agent-api-contract.test.ts for the
AGENT_BASE/heartbeat endpoint to validate the new error.message max length
(4096) and serialized error.context cap (16 * 1024 bytes): create one test that
sends error.message exactly 4096 characters and error.context whose JSON string
length is exactly 16KiB and assert res.status === 200, and a complementary test
that sends error.message of 4097 chars and/or error.context whose JSON string
length is 16KiB+1 and assert res.status === 400; reuse the existing request
pattern (agentToken(TEST_AGENT_TOKEN), app.request to `${AGENT_BASE}/heartbeat`)
and build payloads with JSON.stringify to measure serialized size to ensure the
server enforces the limits.
packages/backend/src/services/agents.ts (1)

353-353: 💤 Low value

Consider adding an explicit return type.

Per coding guidelines on exported functions. The implicit type works but documenting the contract helps downstream callers.

Suggested signature
-export async function processHeartbeat(agentId: number, data: AgentHeartbeat) {
+export async function processHeartbeat(agentId: number, data: AgentHeartbeat): Promise<{
+  agent: NonNullable<Awaited<ReturnType<typeof getAgentById>>> | null;
+  hasHighPriorityTasks: boolean;
+}> {

Or define a named interface if this return shape is reused elsewhere.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/src/services/agents.ts` at line 353, The exported function
processHeartbeat currently relies on an implicit return type; update its
signature to include an explicit return type (e.g., Promise<void> or a specific
Promise<YourReturnShape>) to follow exported-function guidelines and clarify the
contract for callers; modify the declaration "export async function
processHeartbeat(agentId: number, data: AgentHeartbeat)" to include the
appropriate explicit return type or define and reference a named interface/type
if the return shape is reused elsewhere.
packages/backend/tests/unit/agents-service.test.ts (1)

319-337: 💤 Low value

Test at lines 319-327 duplicates inputs from 272-283.

Both tests use { payloadStatus: 'online', errorSeverity: 'fatal', priorStatus: 'online' }. The earlier test already asserts shouldLogTransition and reason. Consider consolidating or repurposing this test for an untested edge case (e.g., fatal error when priorStatus='error' — confirms no-op when already in error state).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/tests/unit/agents-service.test.ts` around lines 319 - 337,
The two tests duplicate the same inputs for decideHeartbeatTransition; update
the second test to cover a missing edge case instead of repeating `{
payloadStatus: 'online', errorSeverity: 'fatal', priorStatus: 'online' }`:
modify the test that currently asserts suppression when priorStatus is null to
instead change the duplicated test (the one that flips an online agent to fatal)
into a case where errorSeverity: 'fatal' and priorStatus: 'error' (or otherwise
already-in-error) and assert that shouldLogTransition is false and reason is
null to confirm no-op when agent is already in error; keep references to the
decideHeartbeatTransition call and its result assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/openapi/agent-api.yaml`:
- Around line 317-319: The OpenAPI contract allows heartbeat errors with
severity "fatal" to set agent.status='error', which conflicts with the auth
middleware check in auth.ts (if (!agent || agent.status === 'error') ...) and
leaves agents permanently blocked; either update packages/openapi/agent-api.yaml
to remove or rename the fatal severity so heartbeats cannot force
status='error', or implement recovery semantics: add an agent recovery endpoint
or admin-reset flow and update the middleware/auth logic to permit recovery
requests (or a special recovery token) from agents with status='error';
reference the heartbeat error schema in agent-api.yaml and the auth middleware
in packages/backend/src/middleware/auth.ts when making the change.

---

Nitpick comments:
In `@packages/backend/src/services/agents.ts`:
- Line 353: The exported function processHeartbeat currently relies on an
implicit return type; update its signature to include an explicit return type
(e.g., Promise<void> or a specific Promise<YourReturnShape>) to follow
exported-function guidelines and clarify the contract for callers; modify the
declaration "export async function processHeartbeat(agentId: number, data:
AgentHeartbeat)" to include the appropriate explicit return type or define and
reference a named interface/type if the return shape is reused elsewhere.

In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 163-242: Add boundary tests in agent-api-contract.test.ts for the
AGENT_BASE/heartbeat endpoint to validate the new error.message max length
(4096) and serialized error.context cap (16 * 1024 bytes): create one test that
sends error.message exactly 4096 characters and error.context whose JSON string
length is exactly 16KiB and assert res.status === 200, and a complementary test
that sends error.message of 4097 chars and/or error.context whose JSON string
length is 16KiB+1 and assert res.status === 400; reuse the existing request
pattern (agentToken(TEST_AGENT_TOKEN), app.request to `${AGENT_BASE}/heartbeat`)
and build payloads with JSON.stringify to measure serialized size to ensure the
server enforces the limits.

In `@packages/backend/tests/unit/agents-service.test.ts`:
- Around line 319-337: The two tests duplicate the same inputs for
decideHeartbeatTransition; update the second test to cover a missing edge case
instead of repeating `{ payloadStatus: 'online', errorSeverity: 'fatal',
priorStatus: 'online' }`: modify the test that currently asserts suppression
when priorStatus is null to instead change the duplicated test (the one that
flips an online agent to fatal) into a case where errorSeverity: 'fatal' and
priorStatus: 'error' (or otherwise already-in-error) and assert that
shouldLogTransition is false and reason is null to confirm no-op when agent is
already in error; keep references to the decideHeartbeatTransition call and its
result assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7c3eea7e-ab5e-48ee-9427-658b05ffca0e

📥 Commits

Reviewing files that changed from the base of the PR and between 5b603fd and ca6a144.

📒 Files selected for processing (7)
  • packages/backend/src/routes/agent/index.ts
  • packages/backend/src/services/agents.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts
  • packages/backend/tests/unit/agents-service.test.ts
  • packages/openapi/agent-api.yaml
  • packages/shared/src/schemas/index.ts
  • packages/shared/src/types/index.ts

Comment thread packages/openapi/agent-api.yaml
Address findings from the pr-review-toolkit review pass:

- Fix false TextEncoder/ES2022 comment in the heartbeat error schema
  (comment rot from an earlier autofix swap).
- Re-export the real decideHeartbeatTransition, classifyWorstSeverity,
  classifyRecentErrors, and pickCurrentTaskByAgent from the
  agent-api-contract mock.module factory. Without this, sibling test
  files load undefined for these symbols depending on bun's per-file
  load order — a known Linux-vs-macOS CI flake documented in GOTCHAS.md.
- Narrow errorSeverity in decideHeartbeatTransition to
  AgentHeartbeatError['severity'] so the schema is the single source of
  truth for the warning|fatal enum.
- Drop the dead `&& priorStatus` clause in the transition-log guard
  (decideHeartbeatTransition already ensures shouldLogTransition is
  false when priorStatus is null).
- Surface the silent-no-op cases: a missing agent row at heartbeat
  time emits a logger.warn, and an agent_errors INSERT that returns no
  row emits logger.error so a regression isn't invisible.
- Trim ticket references and the duplicated non-redundant-channel
  paragraph in service code; the OpenAPI spec is the source of truth.

Tests:
- Add response-body assertions on accept-case heartbeats so a route
  regression that drops `acknowledged` is caught.
- Add a back-compat baseline test for `{status:'online'}` only.
- Add boundary tests for the 4096-char message cap and 16K-char
  context cap so a refactor that drops the limits ships visibly.
- Add a negative-progress rejection test for currentTask.
- Convert the new decideHeartbeatTransition tests to AAA structure to
  match the existing classifyWorstSeverity tests in the same file.

361 backend tests pass; just ci-check clean.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/backend/tests/unit/agent-api-contract.test.ts (1)

365-386: ⚡ Quick win

Add an exact-limit acceptance test for error.context size.

You test >16K rejection, but not acceptance at exactly the cap. Add the at-cap case to lock the boundary and prevent off-by-one regressions.

As per coding guidelines, "Agent API contract tests must validate responses against OpenAPI spec to keep server and clients in sync."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/tests/unit/agent-api-contract.test.ts` around lines 365 -
386, Add a new unit test in agent-api-contract.test.ts that mirrors the existing
oversized test but uses a context whose JSON serialization is exactly
HEARTBEAT_ERROR_CONTEXT_MAX_CHARS (e.g., const atCap = { blob: 'x'.repeat(16 *
1024) }) and posts the same heartbeat payload via app.request to
`${AGENT_BASE}/heartbeat` with agentToken(TEST_AGENT_TOKEN), then assert the
response is accepted (status 200 or the expected success code) and also validate
the response against the OpenAPI spec using the same response validation helper
used in other contract tests so the boundary (exact cap) is locked and checked
against the API spec.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 252-266: The test "rejects an error.severity outside the
warning|fatal enum" only asserts HTTP 400 but must also validate the error
response body shape against the API contract; update the test that posts to
`${AGENT_BASE}/heartbeat` (using agentToken/TEST_AGENT_TOKEN and app.request) to
parse the JSON body and assert the error envelope fields (for example
expect(body).toHaveProperty('error.code') and other required fields) and/or run
it through the existing OpenAPI response validator helper used elsewhere in the
suite so the response matches the spec; apply the same additional assertions to
the other listed failing tests (the blocks around lines 268-282, 284-298,
300-314, 343-363, 365-386).

---

Nitpick comments:
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 365-386: Add a new unit test in agent-api-contract.test.ts that
mirrors the existing oversized test but uses a context whose JSON serialization
is exactly HEARTBEAT_ERROR_CONTEXT_MAX_CHARS (e.g., const atCap = { blob:
'x'.repeat(16 * 1024) }) and posts the same heartbeat payload via app.request to
`${AGENT_BASE}/heartbeat` with agentToken(TEST_AGENT_TOKEN), then assert the
response is accepted (status 200 or the expected success code) and also validate
the response against the OpenAPI spec using the same response validation helper
used in other contract tests so the boundary (exact cap) is locked and checked
against the API spec.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a8f1da72-9fd1-43c7-947b-473d6f25ba2c

📥 Commits

Reviewing files that changed from the base of the PR and between ca6a144 and 49954ce.

📒 Files selected for processing (4)
  • packages/backend/src/services/agents.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts
  • packages/backend/tests/unit/agents-service.test.ts
  • packages/shared/src/schemas/index.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/backend/src/services/agents.ts
  • packages/backend/tests/unit/agents-service.test.ts
  • packages/shared/src/schemas/index.ts

Comment thread packages/backend/tests/unit/agent-api-contract.test.ts
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings May 17, 2026 20:11
@unclesp1d3r
Copy link
Copy Markdown
Member Author

Fixes Applied Successfully

Fixed 2 file(s) based on 2 CodeRabbit feedback item(s).

Files modified:

  • packages/backend/src/routes/agent/index.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts

Commit: 082bd4e2d287ff99e1ea7fb2c4cda81cd4f77d06

The latest autofix changes are on the feat/agent-heartbeat-error-handling branch.

Resolved

  • build(deps-dev): Bump @types/express from 4.17.25 to 5.0.5 #2 (Major) — Added agentValidationHook to all agent zValidator usages so validation 400s now return the documented {error: {code: 'VALIDATION_ERROR', message}} envelope (matching the 401 contract and AGENTS.md "API Surfaces"). Updated all 8 new 400 contract tests to assert that envelope via a new expectAgentValidationError helper.

Deferred

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (3)

packages/backend/src/services/agents.ts:444

  • The agent_errors INSERT (via logAgentError), the agents UPDATE, and the handleTaskFailure loop are issued as separate, non-transactional statements. If any later step throws (e.g., the UPDATE fails, handleTaskFailure rejects mid-iteration), the heartbeat will have already persisted an agent_errors row and/or emitted SSE events via emitAgentError/emitAgentStatus, but the agent row may not be marked 'error' and some/all of the active tasks may not be failed. Dashboard SSE clients will then see stale state that disagrees with the DB. Consider wrapping insert + update + task-failure loop in db.transaction(...) and emitting SSE events only after the transaction commits.

This issue also appears on line 434 of the same file.

  if (data.error) {
    await logAgentError({
      agentId,
      severity: data.error.severity,
      message: data.error.message,
      context: data.error.context,
      taskId: data.currentTask?.taskId,
      // Skip the redundant SELECT inside logAgentError — we already
      // have projectId from priorRow.
      projectId: priorRow?.projectId,
    });
  }

  const updates: Record<string, unknown> = {
    status: effectiveStatus,
    lastSeenAt: new Date(),
    updatedAt: new Date(),
  };

  if (data.capabilities) {
    updates['capabilities'] = data.capabilities;
  }
  if (data.deviceInfo) {
    updates['hardwareProfile'] = data.deviceInfo;
  }

  const [updated] = await db.update(agents).set(updates).where(eq(agents.id, agentId)).returning();

  if (updated) {
    emitAgentStatus(updated.projectId, updated.id, effectiveStatus);

    // `priorStatus` is non-null whenever `shouldLogTransition` is true
    // (see decideHeartbeatTransition); the assertion below keeps TS happy
    // without re-checking at runtime.
    if (transition.shouldLogTransition && transition.reason) {
      logStatusTransition({
        agentId: updated.id,
        projectId: updated.projectId,
        fromStatus: priorStatus as string,
        toStatus: effectiveStatus,
        reason: transition.reason,
      });
    }
  } else {
    // Auth middleware verified the agent's bearer token, so the row was
    // present a moment ago. A vanishing row mid-heartbeat means it was
    // deleted concurrently. Surface it so an operator can correlate
    // bursts of "ghost heartbeat" warnings with cleanup actions.
    logger.warn(
      { agentId, status: effectiveStatus },
      'Heartbeat for an agent row that no longer exists'
    );
  }

  // On fatal error, fail the agent's current tasks. `handleTaskFailure`
  // owns the retry policy; rolling a parallel path here would diverge.
  // Dynamic import preserves the existing tasks.ts <-> agents.ts
  // circular-import workaround.
  if (isFatalError) {
    const activeTasks = await db
      .select({ id: tasks.id })
      .from(tasks)
      .where(and(eq(tasks.agentId, agentId), sql`${tasks.status} IN ('assigned', 'running')`));

    const { handleTaskFailure } = await import('./tasks.js');
    for (const activeTask of activeTasks) {
      await handleTaskFailure(activeTask.id, agentId, data.error?.message ?? 'Agent fatal error');
    }
  }

packages/backend/src/services/agents.ts:402

  • The SELECT of priorRow followed by the unconditional UPDATE on agents is a TOCTOU window: a concurrent heartbeat (or an admin-driven status mutation) between the two statements can be silently overwritten. Because processHeartbeat is invoked on the agent's natural heartbeat cadence, two near-simultaneous requests against the same agent row can interleave such that the audit log records from=A → to=B while the row actually moved A → C → B. Consider a single UPDATE ... RETURNING that captures the prior status via a CTE, or use a row lock (SELECT ... FOR UPDATE inside a transaction) before the UPDATE.
  const [priorRow] = await db
    .select({ status: agents.status, projectId: agents.projectId })
    .from(agents)
    .where(eq(agents.id, agentId))
    .limit(1);
  const priorStatus = priorRow?.status ?? null;

  const transition = decideHeartbeatTransition({
    payloadStatus: data.status,
    errorSeverity: data.error?.severity,
    priorStatus,
  });
  const { effectiveStatus, isFatalError } = transition;

  // Persist heartbeat-borne errors to agent_errors regardless of severity.
  // Warnings are recorded but do not change status or fail the running
  // task; fatals are recorded AND drive the status/task transitions
  // below. The non-redundant-channel contract (heartbeat vs POST /errors)
  // is documented on the Heartbeat schema in packages/openapi/agent-api.yaml.
  if (data.error) {
    await logAgentError({
      agentId,
      severity: data.error.severity,
      message: data.error.message,
      context: data.error.context,
      taskId: data.currentTask?.taskId,
      // Skip the redundant SELECT inside logAgentError — we already
      // have projectId from priorRow.
      projectId: priorRow?.projectId,
    });
  }

  const updates: Record<string, unknown> = {
    status: effectiveStatus,
    lastSeenAt: new Date(),
    updatedAt: new Date(),
  };

  if (data.capabilities) {
    updates['capabilities'] = data.capabilities;
  }
  if (data.deviceInfo) {
    updates['hardwareProfile'] = data.deviceInfo;
  }

  const [updated] = await db.update(agents).set(updates).where(eq(agents.id, agentId)).returning();

packages/backend/src/services/agents.ts:436

  • A throw from handleTaskFailure for any single task aborts this loop, leaving earlier-iteration tasks failed and later-iteration tasks still in assigned/running. Since the agent has just been forced to status='error', those leftover tasks are stranded until the next sweep. Wrap the per-task call in try/catch and continue with logged failures, or include the loop in a surrounding transaction.
  if (isFatalError) {
    const activeTasks = await db
      .select({ id: tasks.id })

Comment thread packages/backend/src/services/agents.ts Outdated
Comment thread packages/backend/src/services/agents.ts Outdated
Comment thread packages/backend/tests/unit/agent-api-contract.test.ts
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/src/routes/agent/index.ts`:
- Around line 41-50: The agentValidationHook result parameter uses an incomplete
inline type; update it to match `@hono/zod-validator` 0.7.6 by replacing the
current ad-hoc type with the correct union that includes data and target or by
importing the official hook type: use a type like ({ success: true; data: T } |
{ success: false; error: ZodError; data: T }) & { target: Target } (or import
the provided generic hook type from `@hono/zod-validator`) so that
agentValidationHook(result, c) references a result that includes data, target
and a ZodError on failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 719edbf7-81ec-40d6-81ee-67b4a321a50a

📥 Commits

Reviewing files that changed from the base of the PR and between 49954ce and 082bd4e.

📒 Files selected for processing (2)
  • packages/backend/src/routes/agent/index.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/backend/tests/unit/agent-api-contract.test.ts

Comment thread packages/backend/src/routes/agent/index.ts
Resolve the CodeRabbit-flagged fatal-heartbeat lockout (Residual #1).
Before this change, the strict requireAgentToken middleware rejected
any agent whose row was in status='error'. Once a fatal heartbeat
forced the agent into that state, every subsequent request — including
the heartbeat the agent would use to announce recovery — returned 401.
Agents could not self-recover.

Resolution:
- Refactor requireAgentToken to a factory parameterized by
  allowErroredAgent.
- Export two variants: the strict default (work endpoints) and a new
  requireAgentTokenAllowRecovery that admits status='error' agents so
  they can post a clean heartbeat. processHeartbeat is the only
  programmatic recovery path; the heartbeat handler transitions the
  agent back to 'online' on a non-fatal payload.
- Wire requireAgentTokenAllowRecovery on POST /api/v1/agent/heartbeat
  only. /tasks/*, /errors, /benchmark, /resources/*, and /cracker/*
  keep the strict middleware so a broken agent does not pick up new
  work until it has recovered via /heartbeat first.
- Document the recovery contract on the Heartbeat error.severity field
  in packages/openapi/agent-api.yaml.

Tests:
- Three new auth-middleware cases pin the recovery variant: accepts
  online agents (parity with strict), accepts error-state agents, and
  still rejects unknown tokens.
- Two new agent-api-contract cases exercise the route-level
  integration: an errored agent posting a clean heartbeat receives 200,
  and the same errored agent gets 401 on /tasks/next (strict middleware
  unchanged).

366 backend tests pass; just ci-check clean.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r
Copy link
Copy Markdown
Member Author

Critical Finding Resolved

Per maintainer direction, the previously-deferred CodeRabbit Critical finding (fatal-heartbeat lockout) is now fixed in commit de4b05be1c45bfa5e3284ce701766e237a300b61.

Resolution: split agent-token middleware into a strict variant (work endpoints) and a recovery-friendly variant (requireAgentTokenAllowRecovery). The recovery variant is mounted only on POST /api/v1/agent/heartbeat so an errored agent can post a clean heartbeat to return to service; processHeartbeat transitions it back to online. Every other agent endpoint stays strict.

Files modified:

  • packages/backend/src/middleware/auth.ts — factory refactor, new requireAgentTokenAllowRecovery export
  • packages/backend/src/routes/agent/index.ts — mounts the recovery variant on /heartbeat only
  • packages/backend/tests/unit/auth-middleware.test.ts — three new tests pin the recovery variant
  • packages/backend/tests/unit/agent-api-contract.test.ts — two new tests prove recovery works on /heartbeat but /tasks/next still rejects errored agents
  • packages/openapi/agent-api.yaml — documents the recovery contract on error.severity

366 backend tests pass; just ci-check clean.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/backend/tests/unit/agent-api-contract.test.ts (1)

185-197: 💤 Low value

Extend existing 400 tests with envelope assertions for consistency.

The new heartbeat tests use expectAgentValidationError(res) to assert the documented envelope shape. The older 400 tests for this endpoint (and /tasks/:id/report, /errors) only assert status. Since agentValidationHook was wired in this PR, consider updating these to catch contract drift.

Applies to: lines 525-537, 553-565, 567-579.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/tests/unit/agent-api-contract.test.ts` around lines 185 -
197, Update the older 400 tests for the heartbeat, tasks report and errors
endpoints to assert the documented error envelope using the shared helper
instead of (or in addition to) just checking status: replace the plain
expect(res.status).toBe(400) checks in the tests for "should return 400 for
invalid heartbeat status enum" (heartbeat), the "/tasks/:id/report" 400 test,
and the "/errors" 400 test with a call to expectAgentValidationError(res) so the
response shape is validated by the agentValidationHook; locate the tests by
their descriptions and use the helper expectAgentValidationError to assert the
envelope.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/tests/unit/auth-middleware.test.ts`:
- Around line 255-266: The test named "should accept a valid active agent (same
path as the strict middleware)" uses mockAgentResult with an invalid status
value; update the mock agent object assigned to mockAgentResult (used in this
test) to use a valid status from the enum (change status: 'active' to status:
'online') so the mocked agent matches the expected schema and the other
assertions (e.g., the request to '/agent-endpoint' expecting 200) behave
correctly.

---

Nitpick comments:
In `@packages/backend/tests/unit/agent-api-contract.test.ts`:
- Around line 185-197: Update the older 400 tests for the heartbeat, tasks
report and errors endpoints to assert the documented error envelope using the
shared helper instead of (or in addition to) just checking status: replace the
plain expect(res.status).toBe(400) checks in the tests for "should return 400
for invalid heartbeat status enum" (heartbeat), the "/tasks/:id/report" 400
test, and the "/errors" 400 test with a call to expectAgentValidationError(res)
so the response shape is validated by the agentValidationHook; locate the tests
by their descriptions and use the helper expectAgentValidationError to assert
the envelope.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5d147780-2c64-48cf-a1ba-173e96b3ac1f

📥 Commits

Reviewing files that changed from the base of the PR and between 082bd4e and de4b05b.

📒 Files selected for processing (5)
  • packages/backend/src/middleware/auth.ts
  • packages/backend/src/routes/agent/index.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts
  • packages/backend/tests/unit/auth-middleware.test.ts
  • packages/openapi/agent-api.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/openapi/agent-api.yaml

Comment thread packages/backend/tests/unit/auth-middleware.test.ts
Address every remaining residual from the multi-agent code review:

* #2 — Integration coverage. New tests/integration/agent-heartbeat.test.ts
  exercises the full heartbeat path with a mocked drizzle client across
  9 scenarios mapped 1:1 to plan U5: warning persistence, fatal task
  failure, fatal with no active tasks, transition log fires once, no
  log on no-op, currentTask.taskId carryover, ownership rejection,
  mid-loop continue-on-error, and context PII scrub. The file runs in
  an isolated bun:test phase (AGENT_HEARTBEAT_TEST_ISOLATED=1) per the
  GOTCHAS.md mock.module leakage pattern; package.json wires it as a
  dedicated phase in the test script.

* #3 — Transactional integrity. processHeartbeat now wraps the prior-
  status read (SELECT ... FOR UPDATE), the agent_errors INSERT, and
  the agents UPDATE in db.transaction. A rollback no longer leaves an
  orphan error row referencing an un-updated agent. logAgentError
  accepts an optional drizzle client so it can participate in the
  outer transaction.

* #4 — Continue-on-error fatal task fan-out. The handleTaskFailure
  loop now wraps each call in try/catch, logs failures with the
  failing task ID, and returns a {attempted, failed} summary so the
  caller / monitoring can detect partial fan-outs without losing
  sibling work.

* #5 — currentTask ownership enforcement. New verifyTaskOwnership
  helper queries tasks WHERE id=? AND agent_id=? before the
  transaction opens; mismatches drop the task linkage on the persisted
  agent_errors row and log a warn so compromised tokens cannot
  attribute errors across agents. An FK on agent_errors.task_id
  (ON DELETE SET NULL) is the last-line guard against dangling rows.
  Migration 0007 backfills orphan task_ids before attaching the FK.

* #6 — TOCTOU race on prior-status read. The atomic block from #3
  uses .for('update') on the agents row, so concurrent heartbeats
  serialize on the row and cannot both observe the same prior status.
  Resolves the long-standing GOTCHAS.md "atomic status guards"
  warning for this code path.

* #7 — Post-commit event emits. emitAgentError and emitAgentStatus
  fire AFTER db.transaction returns; SSE clients never see a status
  the DB later rolls back. logAgentError gains a suppressEvent flag
  so the heartbeat path can defer the emit without changing
  POST /errors behavior.

* #8 — HeartbeatTransition discriminated union. Replaces the
  shouldLogTransition boolean + reason union with a kind:'noop' |
  kind:'transition' tagged variant. Invalid combinations
  ({shouldLogTransition: true, reason: null}) are now compile errors;
  the call-site triple guard collapses to one kind check, and
  fromStatus is only accessible on the transition arm.

* #10 — PII scrub. New scrubAgentErrorContext walks the agent-
  supplied context jsonb (depth-capped) and redacts any object key
  matching /token|password|secret|api[_-]?key|authorization|cookie
  |bearer/i to "[REDACTED]". processHeartbeat scrubs before
  persisting so a leaked agent process or accidentally-serialized
  env block cannot deposit credentials into agent_errors.

396 backend tests pass (5 isolated phases + main); just ci-check
clean. Migration 0007 backfills orphan agent_errors.task_id refs
before attaching the FK so the change applies safely to existing
production data.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copilot AI review requested due to automatic review settings May 17, 2026 21:35
@unclesp1d3r
Copy link
Copy Markdown
Member Author

All residual review findings resolved

Per maintainer direction (no deferrals), every remaining residual is fixed in commit 4c94b281e78d53d6c74cb98f92d97d66e26bb485. The PR body has been updated to remove the Residual Review Findings table.

Code changes

  • packages/shared/src/db/schema.ts — agent_errors.task_id FK (ON DELETE SET NULL).
  • packages/shared/src/db/migrations/0007_light_ulik.sql — backfills orphan refs, attaches the FK.
  • packages/backend/src/services/agents.ts — db.transaction wrapper, FOR UPDATE lock, post-commit emits, continue-on-error fan-out, verifyTaskOwnership helper, scrubAgentErrorContext helper, HeartbeatTransition discriminated union, optional drizzle client + suppressEvent flag on logAgentError.
  • packages/backend/tests/integration/agent-heartbeat.test.ts — new 9-scenario integration suite covering plan U5; runs in an isolated phase via AGENT_HEARTBEAT_TEST_ISOLATED=1.
  • packages/backend/tests/unit/agents-service.test.ts — 6 new scrubAgentErrorContext tests; existing decideHeartbeatTransition tests updated for the discriminated union.
  • packages/backend/tests/unit/agent-api-contract.test.ts — re-exports scrubAgentErrorContext from the mock factory.
  • packages/backend/package.json — adds the AGENT_HEARTBEAT_TEST_ISOLATED phase to the test script.

Resolution map

Residual Fix
#2 — Plan U5 integration tests 9 new integration tests, 1:1 to U5 scenarios
#3 — Atomicity of insert + update db.transaction wraps the heartbeat-error INSERT + agents UPDATE
#4 — Mid-loop handleTaskFailure try/catch per task, partial failure surfaced in response
#5 — currentTask.taskId ownership verifyTaskOwnership pre-check + FK on agent_errors.task_id
#6 — TOCTOU race on prior-status read SELECT FOR UPDATE inside the transaction
#7 — Events before DB commit emitAgentError/Status moved post-commit; suppressEvent flag on logAgentError
#8 — HeartbeatTransition discriminated union kind: 'noop'
#10 — PII scrub on error.context scrubAgentErrorContext redacts secret-shaped keys before persist

396 total tests pass (372 main + 9 heartbeat-integration + 9 health-deterministic + 3 control-RBAC + 22 tasks + 19 queue-manager isolated phases). just ci-check clean.

@coderabbitai coderabbitai Bot removed enhancement New feature or request backend labels May 17, 2026
@coderabbitai coderabbitai Bot added documentation Improvements or additions to documentation priority:high Important — significant operational value and removed testing Testing infrastructure and test cases priority:medium Should be done — improves quality or capability shared agent-api labels May 17, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (1)

packages/backend/src/services/agents.ts:561

  • campaigns is already imported at the top of this file. The dynamic await import('@hashhive/shared') here shadows the top-level binding and adds an unnecessary async resolution on every heartbeat that has an updated agent row. Use the static import.
  // Check if there are high-priority pending tasks for this agent's project
  let hasHighPriorityTasks = false;
  if (updated) {
    const { campaigns } = await import('@hashhive/shared');

Comment thread packages/backend/src/services/agents.ts
Comment thread packages/backend/src/services/agents.ts
Comment thread packages/backend/src/services/agents.ts
Comment thread packages/backend/src/services/agents.ts Outdated
Comment thread packages/backend/src/services/agents.ts Outdated
Comment thread packages/shared/src/db/migrations/0007_light_ulik.sql Outdated
Comment thread packages/backend/tests/unit/agents-service.test.ts
Comment thread packages/backend/tests/integration/agent-heartbeat.test.ts
Comment thread packages/backend/src/middleware/auth.ts
Comment thread packages/backend/src/routes/agent/index.ts
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (2)
packages/backend/src/services/agents.ts (1)

373-373: 💤 Low value

Consider extending pattern to cover additional credential-shaped keys.

Current pattern misses common variants like credential, private, session, and jwt. Not blocking, but operators may inadvertently leak these.

-const SECRET_KEY_PATTERN = /token|password|secret|api[_-]?key|authorization|cookie|bearer/i;
+const SECRET_KEY_PATTERN = /token|password|secret|api[_-]?key|authorization|cookie|bearer|credential|private|session|jwt/i;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/src/services/agents.ts` at line 373, SECRET_KEY_PATTERN
currently misses common credential-like keys; update the regular expression
constant SECRET_KEY_PATTERN to include additional variants such as credential,
private, session, jwt (and optionally token-related forms like access[_-]?token)
so keys like "credential", "privateKey", "session_id", "jwt" are matched; modify
the pattern used in packages/backend/src/services/agents.ts (the
SECRET_KEY_PATTERN constant) to append these alternates into the alternation
list while preserving existing case-insensitive flag and boundary/word
considerations.
packages/backend/tests/integration/agent-heartbeat.test.ts (1)

122-124: 💤 Low value

Intentional thenable pattern — suppress or acknowledge.

Biome flags then property as suspicious, but this is an intentional pattern to make the drizzle mock chain awaitable without calling .limit(). Consider adding a suppression comment for clarity:

     // Make the chain thenable so consumers that await it without .limit() resolve.
+    // biome-ignore lint/suspicious/noThenProperty: intentional thenable mock for drizzle chain
     (chain as { then: Promise<unknown[]>['then'] }).then = (resolve: (v: unknown) => void) =>
       Promise.resolve(rows).then(resolve);

Same applies to line 185.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/backend/tests/integration/agent-heartbeat.test.ts` around lines 122
- 124, The test uses an intentional thenable pattern by assigning (chain as {
then: Promise<unknown[]>['then'] }).then = ... to make the drizzle mock chain
awaitable; Biome flags this as suspicious, so add a short
suppression/acknowledgement comment immediately above that assignment (and the
analogous assignment around line 185) explaining that the then property is
intentionally added to make the mock awaitable and should not be removed, or add
the appropriate Biome suppression directive to silence the warning while keeping
the thenable behavior in functions/mocks named like the chain mock.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/tests/integration/agent-heartbeat.test.ts`:
- Around line 272-280: The beforeEach currently calls mockClear() on several
mocks which only clears call history and can leak queued mockResolvedValueOnce
values; update the beforeEach to call mockReset() for handleTaskFailureMock,
loggerMock.info, loggerMock.warn, loggerMock.error, emitAgentErrorMock, and
emitAgentStatusMock so each test starts with fully reset mocks (e.g., replace
all .mockClear() calls with .mockReset() in the beforeEach block that also calls
resetState()).

In `@packages/shared/src/db/migrations/meta/_journal.json`:
- Around line 54-59: Migration entry with idx 7 (tag "0007_light_ulik") has an
out-of-order "when" timestamp (1779052939557) that precedes migration 6;
regenerate the migration journal so the "when" for idx 7 is updated to a
correct, increasing timestamp by running the project's migration generator (run
"just db-generate"), then commit the updated _journal.json entry so the "when"
value for idx 7 is chronologically after migration 6.

---

Nitpick comments:
In `@packages/backend/src/services/agents.ts`:
- Line 373: SECRET_KEY_PATTERN currently misses common credential-like keys;
update the regular expression constant SECRET_KEY_PATTERN to include additional
variants such as credential, private, session, jwt (and optionally token-related
forms like access[_-]?token) so keys like "credential", "privateKey",
"session_id", "jwt" are matched; modify the pattern used in
packages/backend/src/services/agents.ts (the SECRET_KEY_PATTERN constant) to
append these alternates into the alternation list while preserving existing
case-insensitive flag and boundary/word considerations.

In `@packages/backend/tests/integration/agent-heartbeat.test.ts`:
- Around line 122-124: The test uses an intentional thenable pattern by
assigning (chain as { then: Promise<unknown[]>['then'] }).then = ... to make the
drizzle mock chain awaitable; Biome flags this as suspicious, so add a short
suppression/acknowledgement comment immediately above that assignment (and the
analogous assignment around line 185) explaining that the then property is
intentionally added to make the mock awaitable and should not be removed, or add
the appropriate Biome suppression directive to silence the warning while keeping
the thenable behavior in functions/mocks named like the chain mock.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 94987f31-bcc8-4862-9db1-8c9aa927287a

📥 Commits

Reviewing files that changed from the base of the PR and between de4b05b and 4c94b28.

📒 Files selected for processing (9)
  • packages/backend/package.json
  • packages/backend/src/services/agents.ts
  • packages/backend/tests/integration/agent-heartbeat.test.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts
  • packages/backend/tests/unit/agents-service.test.ts
  • packages/shared/src/db/migrations/0007_light_ulik.sql
  • packages/shared/src/db/migrations/meta/0007_snapshot.json
  • packages/shared/src/db/migrations/meta/_journal.json
  • packages/shared/src/db/schema.ts
✅ Files skipped from review due to trivial changes (2)
  • packages/shared/src/db/migrations/0007_light_ulik.sql
  • packages/shared/src/db/migrations/meta/0007_snapshot.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/backend/tests/unit/agents-service.test.ts
  • packages/backend/tests/unit/agent-api-contract.test.ts

Comment thread packages/backend/tests/integration/agent-heartbeat.test.ts
Comment thread packages/shared/src/db/migrations/meta/_journal.json
Address every actionable review thread from CodeRabbit + Copilot:

* Validation hook now uses a structural type that mirrors the
  @hono/zod-validator 0.7.6 Hook signature and walks every zod issue
  (not just the first) so union-refinement failures appear in the 400
  body instead of being dropped.
* Refine the secret-key scrub: replace the substring regex with a
  whole-word + trailing-suffix check (isSecretKey) so descriptive names
  like tokenCount, cookieDomain, bearerHostname, apiKeyName,
  secretsManagerEnabled, and passwordAge are no longer falsely
  redacted. Add positive and negative coverage.
* Move verifyTaskOwnership inside the heartbeat transaction with a
  FOR UPDATE lock on the task row so a concurrent reassignment cannot
  land between the ownership check and the agent_errors insert.
* Cache the dynamic import('./tasks.js') call (lazy module ref via
  getHandleTaskFailure) so the import resolves once instead of on
  every fatal-error heartbeat. Drop a redundant dynamic
  import('@hashhive/shared') for `campaigns` (already imported
  statically).
* Rename requireAgentTokenAllowRecovery to
  requireAgentTokenForHeartbeatRecovery so the heartbeat-only intent
  is harder to miss when adding new endpoints.
* Migration 0007: add IF EXISTS to the DROP INDEX so a hand-rebuilt
  environment doesn't block the FK addition; fix journal timestamp
  ordering (0007.when now > 0006.when).
* Replace mockClear() with mockReset() in the heartbeat integration
  beforeEach (per GOTCHAS.md "Use mockReset() not mockClear() in
  beforeEach"); reinstall the default handleTaskFailure impl so the
  scenarios that rely on it keep working.
* Fix a stray status:'active' literal in the auth-middleware test;
  the agent status enum only accepts online|busy|error|benchmarked.
* Replace the redundant decideHeartbeatTransition fatal-from-online
  case with the previously-missing fatal-when-already-error edge case
  so the test exercises the noop-on-same-resolved-status branch.

374 backend tests pass + 9 isolated heartbeat phase = 383 total in
this PR; just ci-check clean.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r self-assigned this May 17, 2026
@unclesp1d3r unclesp1d3r merged commit 249cb61 into main May 17, 2026
10 checks passed
@unclesp1d3r unclesp1d3r deleted the feat/agent-heartbeat-error-handling branch May 17, 2026 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation priority:high Important — significant operational value

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants