[Bug] Hermes integration: gateway repeatedly respawns on EADDRINUSE and can leave gateway process alive after host restart #94

@brucecbi

Summary

When TencentDB-Agent-Memory is used as a Hermes memory provider, the gateway supervisor exhibits two related lifecycle problems:

  1. Repeated EADDRINUSE failures on 127.0.0.1:8420 when multiple supervisors / threads race to (re)spawn the gateway, producing log spam and a recovery storm.
  2. The TencentDB gateway subprocess tree can outlive a host gateway restart/termination event. In one later cleanup event, the already-spawned gateway tree kept listening on :8420 after the host gateway process had exited, so subsequent host starts either hit EADDRINUSE or silently reused a process no longer supervised by that host instance.

Together these produce sustained gateway churn even when nothing else has changed.

Environment

  • OS: macOS (launchd-managed Hermes gateway)
  • Host agent: Hermes agent (multi-profile setup)
  • TencentDB-Agent-Memory gateway: local Node/tsx gateway invoked via npx tsx src/gateway/server.ts
  • Bind address: 127.0.0.1:8420
  • Hermes config: memory.provider: memory_tencentdb

Observed behavior

Symptom 1 — EADDRINUSE storm

Over a multi-day observation window, the host's agent.log contained:

WARNING memory-tencentdb: memory-tencentdb watchdog: Gateway unreachable; attempting to resurrect.
WARNING memory-tencentdb: memory-tencentdb Gateway appears down; attempting to resurrect.
INFO    memory-tencentdb.supervisor: Starting memory-tencentdb Gateway: sh -c '... npx tsx src/gateway/server.ts'
ERROR   memory-tencentdb.supervisor: memory-tencentdb Gateway process exited with code 1 during startup. ...
  code: 'EADDRINUSE'
  errno: -48
  syscall: 'listen'
  address: '127.0.0.1'
  port: 8420
INFO    memory-tencentdb.supervisor: memory-tencentdb Gateway already running at http://127.0.0.1:8420
INFO    memory-tencentdb: memory-tencentdb Gateway recovery succeeded.

This sequence repeats. In one observation window:

  • agent.log contains 55 occurrences of EADDRINUSE
  • TencentDB stderr log contains 118 occurrences of EADDRINUSE
  • agent.log contains 59 occurrences of Gateway appears down

Each individual watchdog recovery eventually succeeds, but the storm recurs on every restart cycle.

Symptom 2 — Gateway tree outliving host restart/termination

In one incident, the TencentDB gateway tree was first observed as a normal child tree of the Hermes host gateway:

$ ps -p 39754,39769,39770 -o pid,ppid,lstart,command
PID    PPID   STARTED                       COMMAND
39754  60654  Mon May 25 22:37:03 2026      npm exec tsx src/gateway/server.ts
39769  39754  Mon May 25 22:37:03 2026      node .../.bin/tsx src/gateway/server.ts
39770  39769  Mon May 25 22:37:04 2026      node ... src/gateway/server.ts

$ lsof -nP -iTCP:8420 -sTCP:LISTEN
node    39770    <user>   24u  IPv4  TCP 127.0.0.1:8420 (LISTEN)

The host gateway (PID 60654) had spawned this TencentDB gateway tree at 22:37:03. At this point the tree was not yet orphaned; it was still correctly parented under the host gateway.

During a later launchctl kickstart -k / restart cleanup of the host gateway, the host process exited (ps -p 60654 returned no rows), while the same TencentDB gateway process family continued to run and kept :8420 occupied. In that later cleanup state, the surviving subprocesses were no longer owned by the host gateway instance and had to be cleaned up explicitly via:

pkill -TERM -f 'tdai-memory-openclaw-plugin.*src/gateway/server.ts'

A subsequent host gateway restart would either:

  • Detect "Gateway already running at http://127.0.0.1:8420" via supervisor's reuse-existing path (silent reuse, but no actual supervision)
  • Or hit EADDRINUSE if it tried to spawn directly

This report intentionally separates the two observations: the 22:37 process listing proves which tree the host spawned and which process owned the port; the later restart/termination observation is what motivates the "outlives host" lifecycle claim.
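To make the "already running" check concrete, here is a minimal sketch of a TCP connect probe, assuming Node's built-in `net` module. The helper name `isPortListening` and the 500 ms default timeout are illustrative choices, not part of the existing supervisor. Note that a successful connect only shows that something is listening on the port; it says nothing about who owns that process or whether the gateway behind it is healthy.

```typescript
import { connect } from "node:net";

// Hypothetical helper: resolves true if something accepts TCP connections
// on host:port within timeoutMs, false on refusal, error, or timeout.
function isPortListening(host: string, port: number, timeoutMs = 500): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = connect({ host, port });
    const finish = (result: boolean) => {
      socket.destroy(); // always release the probe socket
      resolve(result);  // later calls to resolve are no-ops
    };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish(true));
    socket.once("timeout", () => finish(false));
    socket.once("error", () => finish(false)); // e.g. ECONNREFUSED when the port is free
  });
}
```

A supervisor could call `await isPortListening("127.0.0.1", 8420)` before spawning, then follow up with an ownership check rather than treating a listening port as proof of a supervised gateway.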

Expected behavior

The Hermes integration / gateway supervisor should:

  1. Pre-spawn health check: treat an already-healthy gateway on 127.0.0.1:8420 as reusable instead of spawning another process.
  2. Single-flight startup: use a startup lock, pidfile, or single-flight guard to prevent concurrent gateway spawns from multiple supervisor threads.
  3. Ownership tracking: distinguish between gateways spawned by this supervisor instance and externally-owned ones.
  4. Subprocess group cleanup: spawn the Node/tsx gateway with a clear ownership model and ensure host restart/termination takes down or deliberately detaches the entire subprocess tree (e.g. by sending a signal to the process group, or by using a parent process that monitors the host and propagates termination).
  5. Recall/capture non-blocking: ensure none of recall/capture/recovery can prevent the host agent from sending its main response (see related issue/enhancement to be filed separately).
  6. Clearer diagnostics: surface distinct log messages for:
    • gateway already running and healthy (no action)
    • gateway owned by this supervisor (action: nothing or graceful restart)
    • gateway owned by another process / orphan (action: warn, do not silently reuse)
    • gateway unhealthy (action: terminate + respawn)
    • gateway startup failed (action: backoff + retry with reason)
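As one possible reading of the ownership and diagnostics items above, the decision could be expressed as a small pure function that maps an observation to an action and a distinct log message. All names here (`GatewayObservation`, `decideGatewayAction`, the action strings) are hypothetical and only meant to illustrate the shape of the logic; startup-failure backoff is left out.

```typescript
// Hypothetical types: what the supervisor knows at decision time.
interface GatewayObservation {
  portOwnerPid: number | null;      // PID listening on the gateway port, or null if the port is free
  healthy: boolean;                 // result of a health probe against the gateway
  ownedPids: ReadonlySet<number>;   // PIDs this supervisor instance considers its own
}

type GatewayAction = "spawn" | "reuse" | "terminate-and-respawn" | "warn-external";

interface GatewayDecision {
  action: GatewayAction;
  message: string;
}

function decideGatewayAction(obs: GatewayObservation): GatewayDecision {
  if (obs.portOwnerPid === null) {
    return { action: "spawn", message: "no gateway listening; spawning" };
  }
  if (!obs.ownedPids.has(obs.portOwnerPid)) {
    // Externally owned or orphaned: warn instead of silently reusing.
    return {
      action: "warn-external",
      message: `gateway port held by external PID ${obs.portOwnerPid}; not reusing silently`,
    };
  }
  if (obs.healthy) {
    return { action: "reuse", message: `gateway PID ${obs.portOwnerPid} owned by this supervisor and healthy` };
  }
  return { action: "terminate-and-respawn", message: `owned gateway PID ${obs.portOwnerPid} unhealthy; restarting` };
}
```

In the process listing above, the listener (PID 39770) is a grandchild of the spawned `npm exec` process (PID 39754), so `ownedPids` would need to cover the whole spawned tree or process group, not only the PID returned by the spawn call.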

Impact

In a real Hermes-agent deployment over ~7 days, this caused:

  • Repeated gateway restart/recovery attempts (hundreds of log lines)
  • A local gateway process continuing to listen on :8420 after the host gateway restart/termination cleanup path
  • Memory provider appearing "active" (port held) even after host-side memory was supposed to be disabled; this part is primarily a Hermes config semantics problem and is only included here as impact context
  • Suspected response pipeline stalls after model completion but before message delivery (time-correlation only, needs upstream reproduction to confirm causality)

The last two points may be host-integration specific, but the gateway lifecycle and EADDRINUSE behavior are directly observable TencentDB-Agent-Memory-side symptoms.

Suggested fix sketch

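// pseudocode in src/gateway/supervisor.ts
// probeHealthy, existingHealthy, tryAcquireStartLock, waitForOther, findPortOwner,
// isOurChild, log, and logFile are placeholders to be defined by the supervisor.
import { spawn } from "node:child_process";

async function ensureGatewayRunning() {
  // 1. Health probe first: reuse a healthy gateway instead of spawning another
  if (await probeHealthy("127.0.0.1:8420")) {
    return existingHealthy;
  }

  // 2. Single-flight lock (pidfile or atomic file lock); released when this scope exits
  using lock = await tryAcquireStartLock();
  if (!lock) return await waitForOther();

  // 3. If port is held but unhealthy, identify owner
  const ownerPid = await findPortOwner(8420);
  if (ownerPid && !isOurChild(ownerPid)) {
    log.warn(`port 8420 held by external PID ${ownerPid}; refusing to spawn`);
    return null;
  }

  // 4. Spawn with an explicit ownership model so host restart/termination cleans or detaches the tree intentionally
  const child = spawn("npx", ["tsx", "src/gateway/server.ts"], {
    // detached: true makes the child a process-group leader, so signalling -child.pid
    // reaches the whole npx/tsx/node tree. With detached: false the child shares the
    // host's group and kill(-child.pid) fails with ESRCH, which the catch would hide.
    detached: true,
    stdio: ["ignore", logFile, logFile],
  });
  // Nothing kills the tree automatically on host exit; forward termination explicitly.
  // Installing this handler replaces Node's default exit on SIGTERM, so the host must still exit itself.
  process.once("SIGTERM", () => {
    if (child.pid === undefined) return; // spawn failed, nothing to signal
    try { process.kill(-child.pid, "SIGTERM"); } catch {} // group may already be gone
  });

  return child;
}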
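For the single-flight lock in step 2, one option is a lockfile created with the exclusive `wx` flag, which fails with EEXIST when another starter already holds it. The sketch below assumes Node's built-in `fs` module; the helper name `tryAcquireStartLockFile` and the returned release callback are hypothetical and differ from the `using`-style lock in the pseudocode above. It does not handle stale locks left behind by a crashed starter, which a real implementation would need to detect, for example by checking whether the recorded PID is still alive.

```typescript
import { closeSync, openSync, unlinkSync, writeSync } from "node:fs";

// Hypothetical helper: returns a release function if the lock was acquired,
// or null if another starter already holds it.
function tryAcquireStartLockFile(lockPath: string): (() => void) | null {
  let fd: number;
  try {
    fd = openSync(lockPath, "wx"); // "wx": create exclusively, fail if the file exists
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return null;
    throw err; // unexpected failure (permissions, missing directory, ...)
  }
  writeSync(fd, String(process.pid)); // record the holder for diagnostics
  closeSync(fd);
  return () => {
    try {
      unlinkSync(lockPath);
    } catch {
      // lockfile already removed; nothing to release
    }
  };
}
```

A starter that gets `null` back would wait and re-probe the gateway instead of spawning, which is what prevents concurrent supervisors from racing into EADDRINUSE.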

Related

  • This issue focuses on gateway lifecycle. A separate issue will be filed for "recall path should be best-effort sidecar / non-blocking" since that is a distinct surface area.
  • This issue does NOT include the host-side memory_enabled config naming confusion, which is upstream to the host (hermes-agent) and not to TencentDB-Agent-Memory.
  • A related case of L1 over-generalization producing persona pollution is documented in #48.
