Hal0ai · thinmintdev · May 29, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -77,6 +77,23 @@ servers and the dashboard v3 surfaces that consume them. Landed
   - Deduped the non-empty check in `Updater.apply()`; the extract
     step is the single source of truth.
 
+### Tests
+
+- **δ-harness coverage of Hermes `delegate_task` for 3 backends**
+  (Phase 0 OpenRouter prereq — DA must-fix #2). New δ-tier
+  pytest suite at `tests/harness/integration/test_delegate_task_*.py`
+  proves the `delegate_task → execution-backend` dispatch hop works
+  end-to-end for local + docker + modal with mocked
+  `BaseEnvironment` subclasses (no Modal credits, no docker pulls in
+  CI). The matrix test fans out one call across all three backends
+  and asserts each was invoked exactly once with a per-backend-shaped
+  payload. Findings catalogued at `tests/harness/FINDINGS.md` §46
+  including the upstream audit (R7's "7 backends" claim corrected
+  to 6 — local/docker/singularity/modal/daytona/ssh; Vercel Sandbox
+  not present in upstream pin `0554ef1a`). Gates V3a Hermes
+  observability per
+  `openrouter-research-2026-05-28/PLANNING.md` §3 Phase 0.
+
 ### Deferred
 
 - MCP-installed-server supervisor: start / stop / restart still
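
Reviewer sketch of the fan-out assertion the CHANGELOG entry above describes: one call spread across three backends, each invoked exactly once with its own payload shape. This is a minimal illustration, not the code in `_delegate_runner.py`; `RecordingBackend`, `dispatch`, and the task dictionaries are hypothetical names invented here.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingBackend:
    """Hypothetical fake backend that records every command it receives."""

    name: str
    calls: list = field(default_factory=list)

    def execute(self, command: str, **kwargs) -> dict:
        self.calls.append({"command": command, **kwargs})
        return {"backend": self.name, "output": f"{self.name}: {command}", "returncode": 0}


def dispatch(tasks, registry):
    """Route each task to the backend it names; unknown names raise KeyError."""
    results = []
    for task in tasks:
        backend = registry[task["backend"]]  # no silent fallback to local
        results.append(backend.execute(task["command"], **task.get("kwargs", {})))
    return results


registry = {name: RecordingBackend(name) for name in ("local", "docker", "modal")}
tasks = [
    {"backend": "local", "command": "echo hi", "kwargs": {"cwd": "/tmp"}},
    {"backend": "docker", "command": "echo hi", "kwargs": {"image": "python:3.12"}},
    {"backend": "modal", "command": "echo hi", "kwargs": {"sandbox_kwargs": {"cpu": 1}}},
]
results = dispatch(tasks, registry)

# Each backend saw exactly one call, carrying its own payload shape.
assert all(len(b.calls) == 1 for b in registry.values())
assert registry["docker"].calls[0]["image"] == "python:3.12"
assert "image" not in registry["local"].calls[0]
```

Looking the backend up with plain indexing rather than `.get(..., local)` is what makes an unregistered name fail loudly, which matches the `unknown_backend_raises_keyerror` row in the findings table.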

diff --git a/tests/harness/FINDINGS.md b/tests/harness/FINDINGS.md
@@ -1278,3 +1278,88 @@ memory chip) that previously rendered stale static data.
   `tests/agents/test_agent_memory_stats_endpoint.py`.
 - **Status:** ✅ landed in PR-11 (2026-05-28).
 
+## 46. δ-harness `delegate_task` 3-backend dispatch coverage — **gap / regression-guard + DA finding**
+
+DA must-fix #2 from the OpenRouter integration analysis
+(`openrouter-research-2026-05-28/notes/da-or.md` line 39 + PLANNING.md
+§4 #2) flagged that R7's "Hermes already ships 7 spawn backends"
+claim was unverified marketing.  The Phase 0 delegate harness
+addresses that with one parametrised dispatch test plus three
+per-backend test files exercising local / docker / modal at the
+δ-tier.  Coverage uses mocked backends so CI never spends Modal
+credits or pulls docker images.
+
+Findings from the upstream audit (pin `0554ef1a`):
+
+- **R7's "7 backends" claim overcounts by one.** Upstream ships
+  **6 execution-environment backends**: local, docker, singularity,
+  modal, daytona, ssh.  Vercel Sandbox does NOT exist in upstream as
+  a `BaseEnvironment` subclass.  Cite:
+  `~/src/hermes-agent/tools/terminal_tool.py:1039-1178`
+  (`_create_environment` factory),
+  `~/src/hermes-agent/tools/environments/__init__.py:1-12`
+  (docstring enumerates the six).
+- **Modal has two sub-modes** (`direct` + `managed`) selected via
+  `terminal.modal_mode`, plus an "auto" fallback.  Our fake covers
+  the direct mode + the credentials-missing degraded path.
+- **`BaseEnvironment` is the actual ABC**, not a per-backend "spawn
+  adapter" — every concrete backend subclasses it.  The DA framing
+  "delegate_task w/ Modal/Daytona path is not exercised in
+  tests/agents/" is accurate: upstream tests cover individual
+  environments in isolation but nothing exercises the
+  `delegate_task → backend dispatch` hop.
+
+The drift gate
+(`test_upstream_base_environment_still_has_expected_methods`) skips
+on machines without `~/src/hermes-agent` (CI, fresh laptops) and
+asserts the four-method contract on machines where the checkout is
+present.  When the weekly `hermes-sdk-diff` workflow (ADR-0018)
+bumps the pin, that gate fires first if upstream renamed a method.
+
+| Row                                                                         | Tier | Outcome | Notes |
+|-----------------------------------------------------------------------------|------|---------|-------|
+| `test_delegate_task_local / round_trips_simple_echo`                         | δ    | pass    | happy path; output reaches assistant response |
+| `test_delegate_task_local / records_invocation_count_and_payload`            | δ    | pass    | command + cwd captured verbatim |
+| `test_delegate_task_local / error_envelope_does_not_crash_parent`            | δ    | pass    | RuntimeError surfaces as per-task error |
+| `test_delegate_task_local / empty_goal_rejected_before_dispatch`             | δ    | pass    | mirrors upstream `tools/delegate_tool.py:2034` |
+| `test_delegate_task_docker / round_trips_with_image_kwargs`                  | δ    | pass    | image kwarg reaches the fake; output round-trips |
+| `test_delegate_task_docker / unavailable_degrades_gracefully`                | δ    | pass    | init_session raise → per-task error, parent intact |
+| `test_delegate_task_docker / payload_includes_container_kwargs`              | δ    | pass    | image / cpu / memory / disk / volumes / env captured |
+| `test_delegate_task_docker / nonzero_returncode_surfaces_as_error`           | δ    | pass    | exit 127 surfaces as inline error; output preserved |
+| `test_delegate_task_modal / round_trips_with_sandbox_kwargs`                 | δ    | pass    | sandbox_kwargs (cpu/memory/ephemeral_disk) captured |
+| `test_delegate_task_modal / token_missing_degrades_gracefully`               | δ    | pass    | MODAL_TOKEN missing → per-task error |
+| `test_delegate_task_modal / cold_start_latency_visible_in_duration`          | δ    | pass    | 200 ms simulated cold-start shows up in duration_ms |
+| `test_delegate_task_modal / multiple_commands_share_one_sandbox`             | δ    | pass    | init_session called once, execute called twice |
+| `test_delegate_task_dispatch_matrix / per_backend_round_trips[local]`        | δ    | pass    | dispatch matrix L1 |
+| `test_delegate_task_dispatch_matrix / per_backend_round_trips[docker]`       | δ    | pass    | dispatch matrix L2 |
+| `test_delegate_task_dispatch_matrix / per_backend_round_trips[modal]`        | δ    | pass    | dispatch matrix L3 |
+| `test_delegate_task_dispatch_matrix / fans_out_to_three_backends_in_one_call`| δ    | pass    | upstream batch-mode shape — three backends, one call |
+| `test_delegate_task_dispatch_matrix / unknown_backend_raises_keyerror`       | δ    | pass    | unregistered backend name fails loud, no silent local fallback |
+| `test_delegate_task_dispatch_matrix / upstream_base_environment_still_has_expected_methods` | δ    | skipped on CI / passes on dev | drift gate against `tools.environments.base.BaseEnvironment` |
+| `test_delegate_task_dispatch_matrix / all_fakes_implement_backend_contract`  | δ    | pass    | every fake implements `_BackendContract` (`init_session`/`execute`/`cleanup`) |
+
+- **Cite:** `tests/harness/integration/_delegate_fakes.py`,
+  `tests/harness/integration/_delegate_runner.py`,
+  `tests/harness/integration/test_delegate_task_local.py`,
+  `tests/harness/integration/test_delegate_task_docker.py`,
+  `tests/harness/integration/test_delegate_task_modal.py`,
+  `tests/harness/integration/test_delegate_task_dispatch_matrix.py`,
+  upstream `tools/environments/base.py:288` + `terminal_tool.py:1039`.
+- **Status:** ✅ landed in the Phase 0 delegate-harness PR (2026-05-29).
+  Gates V3a Hermes-observability per
+  `openrouter-research-2026-05-28/PLANNING.md` §3 Phase 0.
+
+### V3a observability scope decision
+
+Three backends (local + docker + modal) round-trip cleanly through
+the dispatch hop, so the V3a Hermes-observability scope stands as
+planned.  The three remaining upstream backends (singularity /
+daytona / ssh) are out of Phase 0 scope but can be added
+incrementally without rescoping V3a: §14 of the README walks
+through the per-backend add procedure.
+
+If the upstream-drift gate fires in a future
+`scripts/hermes-sdk-diff.sh --bump` run, the appropriate response is
+to re-shape `_delegate_fakes.py::_BackendContract` to match (or, if
+upstream rolls back to a smaller surface, scope down V3a's display).
+
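
Reviewer sketch of how a drift gate like the one described in §46 could work: skip when the upstream checkout is absent, otherwise assert that the expected methods still exist on `BaseEnvironment`. This is an assumption-laden illustration, not the test in the PR. The method tuple below lists only the three names the findings table mentions (the prose refers to a four-method contract without naming the fourth), and `missing_methods`, `load_upstream_base`, and `check_upstream_contract` are hypothetical helpers.

```python
import importlib
import sys
import unittest
from pathlib import Path

# Assumed names: illustrative only, taken from the `_BackendContract` row in
# the findings table. The real gate may check a different set.
EXPECTED_METHODS = ("init_session", "execute", "cleanup")


def missing_methods(cls, expected=EXPECTED_METHODS):
    """Return the expected method names that cls no longer exposes as callables."""
    return [name for name in expected if not callable(getattr(cls, name, None))]


def load_upstream_base(checkout: Path):
    """Import BaseEnvironment from a local checkout, or skip when it is absent."""
    if not checkout.is_dir():
        raise unittest.SkipTest(f"no upstream checkout at {checkout}")
    sys.path.insert(0, str(checkout))
    try:
        module = importlib.import_module("tools.environments.base")
    finally:
        sys.path.remove(str(checkout))
    return module.BaseEnvironment


def check_upstream_contract(checkout: Path = Path.home() / "src" / "hermes-agent"):
    """Fail with the list of renamed or dropped methods, if any."""
    missing = missing_methods(load_upstream_base(checkout))
    assert not missing, f"upstream BaseEnvironment dropped or renamed: {missing}"


class RenamedUpstream:  # stand-in for an upstream that renamed cleanup()
    def init_session(self): ...
    def execute(self, command): ...
    def teardown(self): ...


print(missing_methods(RenamedUpstream))  # → ['cleanup']
```

Reporting the missing names, rather than a bare boolean, tells whoever bumps the pin which part of `_BackendContract` needs re-shaping.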
diff --git a/tests/harness/README.md b/tests/harness/README.md
@@ -111,6 +111,12 @@ tests/harness/
     conftest.py           #         FakeWsServer fixture + harness_client
     test_v0_3_chat_roundtrip.py    # WS proxy → mock hermes round-trip
     test_v0_3_persona_activate.py  # persona swap + hot-reload nudge round-trip
+    _delegate_fakes.py             # Hermes execution-backend ABC + 3 fakes
+    _delegate_runner.py            # in-process delegate_task dispatch harness
+    test_delegate_task_local.py    # local-backend round-trip + error paths
+    test_delegate_task_docker.py   # docker-backend round-trip + degrade
+    test_delegate_task_modal.py    # modal-backend round-trip + cold-start
+    test_delegate_task_dispatch_matrix.py  # 3-backend fan-out + ABC drift gate
   reports/
     .api-handoff          # ephemeral handoff between tiers (HAL0_API_URL, HAL0_HOME, HAL0_SERVE_PID)
     installer.json        # per-tier reports, hal0.harness-report.v1
@@ -438,3 +444,63 @@ When picking where a new test lives, the rule:
 - A tier's `status` values expand (we currently use ok / missing).
 
 Additive optional fields don't require a bump.
+
+---
+
+## 14. δ-harness: `delegate_task` coverage
+
+Upstream Hermes-Agent's `delegate_task` tool spawns one or more child
+`AIAgent` threads.  Each child's tool loop runs shell commands through
+one of upstream's **execution-environment backends** declared in
+`tools/environments/` and selected by the `TERMINAL_ENV` env var.
+
+Three of those backends — local, docker, and modal — have δ-tier
+coverage at `tests/harness/integration/test_delegate_task_*.py`.  The
+matrix test (`test_delegate_task_dispatch_matrix.py`) gates the
+"fan-out across N backends in one call" shape upstream batch mode
+exposes.
+
+### Testing philosophy
+
+These tests run **mocked backends + real orchestration**: a hand-rolled
+`FakeDelegateRunner` (`_delegate_runner.py`) mirrors the dispatch
+loop upstream's `delegate_task` runs, and the three fake backends
+(`_delegate_fakes.py`) implement `_BackendContract`, a local mirror
+of upstream's `BaseEnvironment` ABC.  No real subprocess, no docker
+daemon, no Modal credit spend.  The tests prove **the dispatch hop
+works** end-to-end; the γ-tier suite on hal0-test LXC (`scripts/release-test.sh`)
+covers "does the real backend actually launch a real container".
+
+The trade-off: contributors can run these on any laptop in <1 second; CI
+never burns Modal credits; the upstream-contract drift gate
+(`test_upstream_base_environment_still_has_expected_methods`) catches
+upstream renames the moment a contributor with `~/src/hermes-agent`
+checked out re-runs the suite.
+
+### Adding a fourth backend (e.g. singularity, daytona, ssh)
+
+1. **Audit upstream first.**  Confirm the backend actually exists in
+   the upstream pin (`pyproject.toml [tool.hal0.upstream-hermes]
+   commit`).  `tools/environments/<name>.py` should have a concrete
+   `BaseEnvironment` subclass.  If it doesn't, the test is testing a
+   feature that doesn't ship — surface that as a finding in
+   `FINDINGS.md` before writing fakes.
+
+2. **Add the fake to `_delegate_fakes.py`.**  Subclass
+   `_BackendContract`, capture the backend-specific kwargs in
+   `backend_context` so tests can assert provisioning intent
+   (image / sandbox kwargs / connection config), and add a "feature
+   missing" knob (e.g. `unavailable: bool` or `token_missing: bool`)
+   for the degraded-path test.
+
+3. **Add a `test_delegate_task_<name>.py`.**  Mirror the existing
+   layout: round-trip + payload-shape + degraded-path + at-least-one
+   backend-unique edge case (e.g. cold-start latency for modal,
+   non-zero exit code for docker).
+
+4. **Extend `test_delegate_task_dispatch_matrix.py`.**  Add the
+   backend to the parametrised dispatch matrix + the fan-out test so
+   the "all backends round-trip uniformly" gate covers it.
+
+5. **Append to `FINDINGS.md` §46.**  One row per test added; cite
+   upstream backend name + the file:line that confirmed it exists.
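
Reviewer sketch of step 2 in the README procedure above, using ssh as the example: a fake that captures provisioning intent in `backend_context` and exposes an `unavailable` knob for the degraded-path test. The `BackendContract` class here is a stand-in written for this sketch. The real `_BackendContract` in `_delegate_fakes.py` may have a different shape, and `FakeSSHBackend` and `run_task` are hypothetical names.

```python
from abc import ABC, abstractmethod


class BackendContract(ABC):
    """Stand-in for `_BackendContract`; the real class may differ."""

    @abstractmethod
    def init_session(self) -> None: ...

    @abstractmethod
    def execute(self, command: str) -> dict: ...

    @abstractmethod
    def cleanup(self) -> None: ...


class FakeSSHBackend(BackendContract):
    """Hypothetical ssh fake: records provisioning intent, never opens a socket."""

    def __init__(self, host: str, user: str, unavailable: bool = False):
        self.backend_context = {"host": host, "user": user}
        self.unavailable = unavailable  # knob for the degraded-path test
        self.commands = []
        self.session_open = False

    def init_session(self) -> None:
        if self.unavailable:
            raise ConnectionError(f"ssh host {self.backend_context['host']} unreachable")
        self.session_open = True

    def execute(self, command: str) -> dict:
        self.commands.append(command)
        return {"output": f"ran: {command}", "returncode": 0}

    def cleanup(self) -> None:
        self.session_open = False


def run_task(backend: BackendContract, command: str) -> dict:
    """Run one command; a backend failure becomes a per-task error, not a crash."""
    try:
        backend.init_session()
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    try:
        return {"status": "ok", **backend.execute(command)}
    finally:
        backend.cleanup()


ok = run_task(FakeSSHBackend("build-box", "ci"), "echo hi")
degraded = run_task(FakeSSHBackend("build-box", "ci", unavailable=True), "echo hi")
assert ok["status"] == "ok" and ok["output"] == "ran: echo hi"
assert degraded["status"] == "error"
```

Keeping the failure knob on the fake itself lets the degraded-path test reuse the same fixture as the happy path, with one constructor argument flipped.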