[Bug] Hermes integration: gateway repeatedly respawns on EADDRINUSE and can leave gateway process alive after host restart

## Summary

When `TencentDB-Agent-Memory` is used as a Hermes memory provider, the gateway supervisor exhibits two related lifecycle problems:

1. **Repeated `EADDRINUSE: 127.0.0.1:8420`** failures when multiple supervisors / threads race to (re)spawn the gateway, producing log spam and a recovery storm.
2. **The TencentDB gateway subprocess tree can outlive a host gateway restart/termination event** — in one later cleanup event, the already-spawned gateway tree continued listening on `:8420` after the host gateway process had exited, leaving subsequent host starts to either hit `EADDRINUSE` or silently reuse a process no longer supervised by that host instance.

Together these produce sustained gateway churn even when nothing else has changed.

## Environment

- OS: macOS (`launchd`-managed Hermes gateway)
- Host agent: Hermes agent (multi-profile setup)
- TencentDB-Agent-Memory gateway: local Node/tsx gateway invoked via `npx tsx src/gateway/server.ts`
- Bind address: `127.0.0.1:8420`
- Hermes config: `memory.provider: memory_tencentdb`

## Observed behavior

### Symptom 1 — EADDRINUSE storm

Over a multi-day observation window, the host's `agent.log` contained:

```
WARNING memory-tencentdb: memory-tencentdb watchdog: Gateway unreachable; attempting to resurrect.
WARNING memory-tencentdb: memory-tencentdb Gateway appears down; attempting to resurrect.
INFO    memory-tencentdb.supervisor: Starting memory-tencentdb Gateway: sh -c '... npx tsx src/gateway/server.ts'
ERROR   memory-tencentdb.supervisor: memory-tencentdb Gateway process exited with code 1 during startup. ...
  code: 'EADDRINUSE'
  errno: -48
  syscall: 'listen'
  address: '127.0.0.1'
  port: 8420
INFO    memory-tencentdb.supervisor: memory-tencentdb Gateway already running at http://127.0.0.1:8420
INFO    memory-tencentdb: memory-tencentdb Gateway recovery succeeded.
```

This sequence repeats. In one observation window:

- `agent.log` contains **55 occurrences of `EADDRINUSE`**
- TencentDB stderr log contains **118 occurrences of `EADDRINUSE`**
- `agent.log` contains **59 occurrences of `Gateway appears down`**

The watchdog and recovery succeed in the steady state, but the storm recurs on each restart cycle.

### Symptom 2 — Gateway tree outliving host restart/termination

In one incident, the TencentDB gateway tree was first observed as a normal child tree of the Hermes host gateway:

```
$ ps -p 39754,39769,39770 -o pid,ppid,lstart,command
PID    PPID   STARTED                       COMMAND
39754  60654  Mon May 25 22:37:03 2026      npm exec tsx src/gateway/server.ts
39769  39754  Mon May 25 22:37:03 2026      node .../.bin/tsx src/gateway/server.ts
39770  39769  Mon May 25 22:37:04 2026      node ... src/gateway/server.ts

$ lsof -nP -iTCP:8420 -sTCP:LISTEN
node    39770    <user>   24u  IPv4  TCP 127.0.0.1:8420 (LISTEN)
```

The host gateway (PID 60654) had spawned this TencentDB gateway tree at 22:37:03. At this point the tree was **not** yet orphaned; it was still correctly parented under the host gateway.

During a later `launchctl kickstart -k` / restart cleanup of the host gateway, the host process exited (`ps -p 60654` returned no rows), while the same TencentDB gateway process family **continued to run** and kept `:8420` occupied. In that later cleanup state, the surviving subprocesses were no longer owned by the host gateway instance and had to be cleaned up explicitly via:

```
pkill -TERM -f 'tdai-memory-openclaw-plugin.*src/gateway/server.ts'
```

A subsequent host gateway restart would either:
- detect "Gateway already running at http://127.0.0.1:8420" via the supervisor's reuse-existing path (silent reuse, but no actual supervision), or
- hit `EADDRINUSE` if it tried to spawn directly.

This report intentionally separates the two observations: the 22:37 process listing proves that the host spawned the tree and which process owned the port; the later restart/termination observation is what motivates the "outlives host" lifecycle claim.
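The `ps -p 60654` check above can also be done from code. The following is a minimal sketch, assuming the supervisor records the host PID at spawn time (this report does not establish that it does). `isProcessAlive` and `isOrphaned` are hypothetical names, not part of the existing codebase.

```typescript
// Returns true when a process with the given PID currently exists.
// Signal 0 performs only the existence/permission check; nothing is delivered.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or anything else): treat as not running.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Hypothetical classification: a listener counts as orphaned when the host PID
// recorded at spawn time is gone while the listener itself is still alive.
function isOrphaned(listenerPid: number, recordedHostPid: number): boolean {
  return isProcessAlive(listenerPid) && !isProcessAlive(recordedHostPid);
}
```

PIDs can be reused, so a real check would probably also compare process start times, which is the information the `lstart` column in the listing above provides.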

## Expected behavior

The Hermes integration / gateway supervisor should:

1. **Pre-spawn health check**: treat an already-healthy gateway on `127.0.0.1:8420` as reusable instead of spawning another process.
2. **Single-flight startup**: use a startup lock, pidfile, or single-flight guard to prevent concurrent gateway spawns from multiple supervisor threads.
3. **Ownership tracking**: distinguish between gateways spawned by this supervisor instance and externally-owned ones.
4. **Subprocess group cleanup**: spawn the Node/tsx gateway with a clear ownership model and ensure host restart/termination takes down or deliberately detaches the entire subprocess tree (e.g. by sending a signal to the process group, or by using a parent process that monitors the host and propagates termination).
5. **Recall/capture non-blocking**: ensure none of recall/capture/recovery can prevent the host agent from sending its main response (see related issue/enhancement to be filed separately).
6. **Clearer diagnostics**: surface distinct log messages for:
   - gateway already running and healthy (no action)
   - gateway owned by this supervisor (action: nothing or graceful restart)
   - gateway owned by another process / orphan (action: warn, do not silently reuse)
   - gateway unhealthy (action: terminate + respawn)
   - gateway startup failed (action: backoff + retry with reason)
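One possible reading of the diagnostic cases above is a pure decision function that the supervisor calls before acting. This is only a sketch: `GatewayProbe`, `Decision`, and `classifyGateway` are hypothetical names, the probe fields are assumed to be filled in by checks this report does not specify, and the extra `spawn` case (port free, no earlier failure) is added for completeness.

```typescript
interface GatewayProbe {
  portHeld: boolean;        // something is listening on 127.0.0.1:8420
  healthy: boolean;         // the listener answered a health probe
  ownedByUs: boolean;       // the listener was spawned by this supervisor
  lastStartFailed: boolean; // the previous spawn exited during startup
}

type Action =
  | "none"
  | "warn"
  | "terminate-and-respawn"
  | "backoff-and-retry"
  | "spawn";

interface Decision {
  action: Action;
  message: string; // distinct log line per case
}

function classifyGateway(p: GatewayProbe): Decision {
  if (p.portHeld && !p.ownedByUs) {
    return { action: "warn", message: "gateway on :8420 is owned by another process or orphaned; not reusing silently" };
  }
  if (p.portHeld && p.healthy) {
    return { action: "none", message: "gateway owned by this supervisor is running and healthy" };
  }
  if (p.portHeld) {
    return { action: "terminate-and-respawn", message: "gateway owned by this supervisor is unhealthy" };
  }
  if (p.lastStartFailed) {
    return { action: "backoff-and-retry", message: "previous gateway startup failed" };
  }
  return { action: "spawn", message: "no gateway running" };
}
```

Keeping the decision separate from the side effects makes each branch easy to unit-test and gives every outcome its own log message.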

## Impact

In a real Hermes-agent deployment over ~7 days, this caused:

- Repeated gateway restart/recovery attempts (hundreds of log lines)
- A local gateway process continuing to listen on `:8420` after the host gateway restart/termination cleanup path
- Memory provider appearing "active" (port held) even after host-side memory was supposed to be disabled; this part is primarily a Hermes config semantics problem and is only included here as impact context
- Suspected response pipeline stalls after model completion but before message delivery (time-correlation only, needs upstream reproduction to confirm causality)

The last two points may be host-integration specific, but the gateway lifecycle and `EADDRINUSE` behavior are directly observable TencentDB-Agent-Memory-side symptoms.

## Suggested fix sketch

```typescript
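// Pseudocode for src/gateway/supervisor.ts. probeHealthy, existingHealthy,
// tryAcquireStartLock, waitForOther, findPortOwner, isOurChild, log, and logFd
// are placeholders, not existing APIs.
import { spawn } from "node:child_process";

async function ensureGatewayRunning() {
  // 1. Health probe first: reuse a healthy gateway instead of spawning another.
  if (await probeHealthy("127.0.0.1:8420")) {
    return existingHealthy;
  }

  // 2. Single-flight lock (pidfile or atomic file lock)
  using lock = await tryAcquireStartLock();
  if (!lock) return await waitForOther();

  // 3. If the port is held but unhealthy, identify the owner
  const ownerPid = await findPortOwner(8420);
  if (ownerPid && !isOurChild(ownerPid)) {
    log.warn(`port 8420 held by external PID ${ownerPid}; refusing to spawn`);
    return null;
  }

  // 4. Spawn the gateway as the leader of its own process group so the whole
  //    npm -> tsx -> node tree can be signalled at once on host termination.
  const child = spawn("npx", ["tsx", "src/gateway/server.ts"], {
    detached: true,                  // new process group; pgid === child.pid
    stdio: ["ignore", logFd, logFd], // logFd: an already-open file descriptor
  });
  process.on("SIGTERM", () => {
    // A negative PID targets the process group, not just the npx wrapper.
    try { if (child.pid) process.kill(-child.pid, "SIGTERM"); } catch {}
    // Node no longer exits on SIGTERM once a handler is installed.
    process.exit(143);
  });
  // This handler cannot run on SIGKILL, so an uncatchable host kill still
  // needs a gateway-side check that the host is alive.

  return child;
}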
```
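Step 2 of the sketch above leaves `tryAcquireStartLock` undefined. One possible shape, shown here as an assumption rather than the project's actual mechanism, is an exclusive-create lock file: opening with the `wx` flag fails with `EEXIST` if the file already exists, so only one starter can win. This variant takes a path and returns a release function instead of a disposable; the lock path in the demo is made up.

```typescript
import { closeSync, openSync, unlinkSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Tries to create the lock file exclusively. Returns a release function on
// success, or null when another starter already holds the lock.
function tryAcquireStartLock(lockPath: string): (() => void) | null {
  let fd: number;
  try {
    fd = openSync(lockPath, "wx"); // create-or-fail, never overwrite
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return null;
    throw err;
  }
  writeSync(fd, String(process.pid)); // record the holder for diagnostics
  closeSync(fd);
  return () => {
    try { unlinkSync(lockPath); } catch { /* already removed */ }
  };
}

// Demo path only: a real lock needs one fixed path shared by every starter.
const lockPath = join(tmpdir(), `gateway-start-demo-${process.pid}.lock`);
const release = tryAcquireStartLock(lockPath);
const second = tryAcquireStartLock(lockPath); // null while the lock is held
release?.();
```

A lock file left behind by a crashed holder would block every later start, so a real implementation would also need stale-lock detection, for example by checking whether the PID recorded in the file is still running.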

## Related

- This issue focuses on gateway lifecycle. A separate issue will be filed for "recall path should be best-effort sidecar / non-blocking" since that is a distinct surface area.
- This issue does NOT include the host-side `memory_enabled` config naming confusion, which is upstream to the host (hermes-agent) and not to TencentDB-Agent-Memory.
- A related case of L1 over-generalization producing persona pollution is documented in [#48](https://github.com/Tencent/TencentDB-Agent-Memory/issues/48#issuecomment-4544395405).

