91 changes: 91 additions & 0 deletions harness/README.md
@@ -0,0 +1,91 @@
# Harness families

`harness/` is the developer-only home for **harness families** — focused, local infrastructure that measures something about Shipgate's behavior or its adoption. It is **not packaged** into the `agents-shipgate` wheel.

The first family is [`harness.adoption`](adoption/), which drives coding agents (Claude Code, Codex, Cursor) across a matrix of (archetype, variant, prompt) cells and scores their behavior against the [adoption rubric](../docs/agent-adoption-harness.md).

This README documents the **layout convention** so future families — perf regression, false-positive baseline, framework-version drift, etc. — can be added with a shared shape and a shared dispatcher.

## Convention

A subpackage `harness/<name>/` is recognized as a harness family iff it satisfies all three rules:

| # | Rule | Why |
|---|------|-----|
| 1 | `harness/<name>/__init__.py` exists with a **non-empty docstring**. The first line becomes the family's one-line description in `python -m harness list`. | Discoverability + human-readable inventory. |
| 2 | `harness/<name>/cli.py` exists and exposes `app` — typically a `typer.Typer` instance, but any zero-arg callable suffices. | Single, predictable entry point that the dispatcher can introspect without running the harness. |
| 3 | `harness/<name>/__main__.py` exists and calls `app()`. | `python -m harness.<name>` is a working invocation regardless of how the dispatcher evolves. |

The convention is pinned by [`tests/harness/test_harness_layout.py`](../tests/harness/test_harness_layout.py) — every subpackage that LOOKS like a family but misses one of the three files fails the contract test loudly.
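
The file-level part of the convention can be checked without importing anything. The sketch below is not the contract test's code: `layout_problems` and `REQUIRED_FILES` are hypothetical names, and it only covers file presence and the docstring requirement from rule 1, assuming a family directory on disk. It does not verify that `cli.py` exposes `app` or that `__main__.py` calls it.

```python
from __future__ import annotations

import ast
from pathlib import Path

# The three files named by the convention table above.
REQUIRED_FILES = ("__init__.py", "cli.py", "__main__.py")


def layout_problems(package_dir: Path) -> list[str]:
    """Return human-readable problems; an empty list means the files conform."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (package_dir / name).is_file()
    ]
    init_py = package_dir / "__init__.py"
    if init_py.is_file():
        # Parse instead of importing, so the check never runs family code.
        tree = ast.parse(init_py.read_text(encoding="utf-8"))
        if not (ast.get_docstring(tree) or "").strip():
            problems.append("__init__.py has no docstring")
    return problems
```

Parsing with `ast` keeps the check side-effect free, which matters when a half-written family would fail on import.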

## Discovery and dispatch

```bash
# Show usage + every discovered family
python -m harness --help

# Tab-separated one-per-line listing (for piping)
python -m harness list

# Forward to a family's own CLI (identical to ``python -m harness.<name> ...``)
python -m harness adoption smoke
python -m harness adoption run --matrix benchmark/matrix.yaml
```
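
The `list` output is meant for piping, one `<name>\t<description>` pair per line. As a minimal sketch of consuming it from a script, the hypothetical helper `parse_listing` below turns captured output into a dictionary; it assumes only the tab-separated shape described above.

```python
from __future__ import annotations


def parse_listing(text: str) -> dict[str, str]:
    """Map family name to its one-line description from ``list`` output."""
    families: dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate a trailing blank line
        # Split on the first tab only; a description may itself contain tabs.
        name, _, description = line.partition("\t")
        families[name] = description
    return families
```

The input would typically come from `subprocess.run([sys.executable, "-m", "harness", "list"], capture_output=True, text=True, check=True).stdout`, run from the repository root.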

Forwarding is done via `subprocess` so the family's own `sys.argv[0]` matches a direct invocation exactly. Typer/Click `--help` output is byte-identical between `python -m harness.adoption --help` and `python -m harness adoption --help`.

The dispatcher returns:

- `0` when it handles the request itself: `--help`, `-h`, `help`, `list`, or no arguments.
- The family's own exit code on a forwarded run, so `0` only if the family itself exits `0`.
- `2` if you name an unknown harness (config-error convention, mirrors `agents-shipgate scan` exit codes).
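
The pass-through behavior can be seen in isolation. This sketch uses a hypothetical helper, `forward_exit_code`, and a throwaway `python -c` child instead of a real harness family, so it runs anywhere; it illustrates the `subprocess` pattern and is not the dispatcher itself.

```python
import subprocess
import sys


def forward_exit_code(child_argv: list[str]) -> int:
    """Run a child Python process and return its exit code unchanged."""
    # check=False: a non-zero child exit is data to pass along, not an exception.
    completed = subprocess.run([sys.executable, *child_argv], check=False)
    return completed.returncode


if __name__ == "__main__":
    # A child that exits 3 makes the parent exit 3 as well.
    sys.exit(forward_exit_code(["-c", "import sys; sys.exit(3)"]))
```

One consequence worth keeping in mind: a family that itself exits `2` cannot be told apart from an unknown-harness error by exit code alone. The `error: no harness named ...` line on stderr is what distinguishes the two.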

## Adding a new harness family

1. Pick a snake_case name. Examples: `perf_regression`, `false_positive_baseline`, `framework_version_drift`.
2. Create the three required files:
   ```
   harness/<name>/__init__.py   # docstring describes what the harness measures
   harness/<name>/cli.py        # exports ``app`` (Typer recommended)
   harness/<name>/__main__.py   # bootstrap sys.path, then ``app()``
   ```
   Use [`harness/adoption/__main__.py`](adoption/__main__.py) as the template for the `sys.path` bootstrap. Skipping that bootstrap means a sibling worktree's editable install can shadow the working tree under test.
3. Add any new shared runtime deps to [`harness/requirements.txt`](requirements.txt). Per-family `requirements.txt` files are not currently supported — if your family has conflicting deps, put it in a separate venv.
4. Drop tests under `tests/harness/`. The layout contract test picks the new family up automatically — no test wiring needed.
5. Document the rubric / what-it-measures in either:
   - the family's `cli.py` docstring (short),
   - `harness/<name>/README.md` (medium), or
   - `docs/agent-<name>-harness.md` (long, for adoption-class families).
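
Step 2 of the checklist can be scripted. The sketch below is illustrative only: `scaffold_family` and the three templates are hypothetical, the generated `app` is a plain function (rule 2 allows any zero-arg callable), and the `sys.path` bootstrap is an assumption modeled on the top-level dispatcher, not copied from `harness/adoption/__main__.py`. Treat the generated files as a starting point and compare them against the adoption template before committing.

```python
from __future__ import annotations

from pathlib import Path

INIT_TEMPLATE = '"""{summary}"""\n'

CLI_TEMPLATE = '''\
"""Command-line entry point for the {name} harness."""


def app() -> None:
    """Placeholder entry point; swap in a typer.Typer instance when ready."""
    print("{name}: nothing to measure yet")
'''

MAIN_TEMPLATE = '''\
"""Run the {name} harness via ``python -m harness.{name}``."""
import sys
from pathlib import Path

# Assumed bootstrap: put the repo root and its src/ ahead of any editable install.
_REPO_ROOT = Path(__file__).resolve().parents[2]
for _path in (_REPO_ROOT, _REPO_ROOT / "src"):
    if str(_path) not in sys.path:
        sys.path.insert(0, str(_path))

from harness.{name}.cli import app  # noqa: E402

if __name__ == "__main__":
    app()
'''


def scaffold_family(harness_dir: Path, name: str, summary: str) -> list[Path]:
    """Create the three required files for a new family and return their paths."""
    # Discovery skips underscore-prefixed subpackages, so reject them up front.
    if not name.isidentifier() or name.startswith("_"):
        raise ValueError(f"not a usable family name: {name!r}")
    package_dir = harness_dir / name
    package_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a family
    templates = {
        "__init__.py": INIT_TEMPLATE,
        "cli.py": CLI_TEMPLATE,
        "__main__.py": MAIN_TEMPLATE,
    }
    written: list[Path] = []
    for filename, template in templates.items():
        path = package_dir / filename
        path.write_text(template.format(name=name, summary=summary), encoding="utf-8")
        written.append(path)
    return written
```

Calling `scaffold_family(Path("harness"), "perf_regression", "...")` from the repository root with a one-line summary of your choice would produce a family that `python -m harness list` can pick up. The summary is written verbatim into a docstring, so it should not contain triple quotes.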

## What goes UNDER a family

Anything family-internal. The dispatcher only scans the top level of `harness/`. The adoption family uses:

```
harness/adoption/
├── __init__.py     # docstring (rule 1)
├── __main__.py     # ``python -m harness.adoption`` (rule 3)
├── cli.py          # exports ``app`` (rule 2)
├── context.py
├── matrix.py
├── overlay.py
├── workspace.py
├── drivers/        # pluggable drivers per agent IDE
├── observer/       # transcript / fs / redaction
├── scorer/         # rubric application
└── scripts/        # fixture sync, etc.
```

There is no requirement to mirror this layout. A leaner family (one cli.py + a single scorer module) is fine. A larger family can grow its own subdirectories.

## What harnesses are NOT

- **Not packaged.** Harnesses ship inside the repo but never inside the wheel. The `[project]` table in `pyproject.toml` does not include `harness/` in its sdist or wheel.
- **Not part of the public API.** Internal modules under `harness/<name>/` can change shape between releases without a STABILITY contract bump. The only stable surface is the **layout convention** documented here.
- **Not a replacement for unit tests.** Harnesses measure end-to-end behavior on realistic inputs (cold-agent runs, perf regressions on real repos, etc.). Use `tests/` for invariants on small inputs.

## Where this convention is enforced

- **Layout contract**: [`tests/harness/test_harness_layout.py`](../tests/harness/test_harness_layout.py) — parametrized over `discover_harnesses()`. A new family that satisfies the convention is automatically covered.
- **Discovery code**: [`harness/__init__.py`](__init__.py) defines `HarnessSpec` and `discover_harnesses()`.
- **Dispatcher**: [`harness/__main__.py`](__main__.py) implements the `python -m harness ...` entry points.
168 changes: 168 additions & 0 deletions harness/__init__.py
@@ -0,0 +1,168 @@
"""Harness families for agents-shipgate.

Each top-level subpackage under ``harness/`` is one harness family — a
focused, local-only piece of developer infrastructure that measures
something about Shipgate's behavior or its adoption. The first such
family is ``harness.adoption``, which drives coding agents across a
matrix of (archetype, variant, prompt) cells and scores their behavior
against the adoption rubric.

This module defines the **harness layout convention** so future
families (perf regression, false-positive baseline, framework-version
drift, etc.) can be added with a shared shape and a shared dispatcher.

## Convention (every family MUST follow)

A subpackage ``harness/<name>/`` is recognized as a harness family iff:

1. ``harness/<name>/__init__.py`` exists with a non-empty docstring.
The first line of the docstring becomes the family's one-line
description in ``python -m harness list``.
2. ``harness/<name>/cli.py`` exists and exposes ``app`` — typically a
``typer.Typer`` instance, but any zero-arg callable suffices.
3. ``harness/<name>/__main__.py`` exists and calls ``app()`` so that
``python -m harness.<name>`` is a working entry point.

Subdirectories under ``harness/<name>/`` (e.g. ``drivers/``,
``observer/``, ``scorer/``) are family-internal. Only top-level
subpackages of ``harness/`` are scanned for the convention.

Harness families are **not packaged** into the ``agents-shipgate``
wheel — they are developer infrastructure only. Shared runtime
dependencies live in ``harness/requirements.txt``; install with
``pip install -r harness/requirements.txt`` from a clone.

## Discovery and dispatch

- ``discover_harnesses()`` walks ``harness/*/`` and returns one
:class:`HarnessSpec` per conforming family.
- ``python -m harness list`` prints the discovered set (delegates to
``discover_harnesses()``).
- ``python -m harness <name> [args...]`` forwards to
``python -m harness.<name>`` so a future family is invokable through
the same dispatcher.
- ``tests/harness/test_harness_layout.py`` pins the convention with a
parametrized contract test — a new family that misses any of the
three required files fails the test loudly.

## Adding a new harness family (checklist)

1. Create ``harness/<name>/`` with ``__init__.py``, ``cli.py``,
``__main__.py``.
2. Make ``cli.py`` export ``app`` (Typer recommended for argv
parsing; the existing :mod:`harness.adoption.cli` is the canonical
template).
3. Make ``__main__.py`` call ``app()`` after bootstrapping
``sys.path`` the way :mod:`harness.adoption.__main__` does — so a
sibling-worktree editable install never wins over the colocated
``src/``.
4. Add shared runtime deps to ``harness/requirements.txt``.
5. Drop tests under ``tests/harness/`` (the layout contract test
picks the new family up automatically).
6. Document any new top-level entry-point flag or score rubric under
``docs/`` or the family's own ``README.md``.

See :mod:`harness.adoption` for the canonical example.
"""

from __future__ import annotations

import importlib
import pkgutil
from collections.abc import Callable
from dataclasses import dataclass
from pathlib import Path
from typing import Any

__all__ = ["HARNESS_DIR", "HarnessSpec", "discover_harnesses"]

# Filesystem root for harness discovery. Kept as a module-level constant
# so tests can sanity-check the package layout without re-deriving it.
HARNESS_DIR: Path = Path(__file__).resolve().parent

# Subpackages under ``harness/`` that are NOT harness families even if
# they satisfy the cli.py shape. ``tests`` is reserved because pytest
# can pick up a future ``harness/tests/`` directory; underscored names
# are private by convention. Add to this set if you introduce a
# non-family helper subpackage (don't add new public families here).
_EXCLUDED_SUBPACKAGES: frozenset[str] = frozenset({"tests"})


@dataclass(frozen=True)
class HarnessSpec:
    """Discovered metadata for one harness family.

    Attributes:
        name: The subpackage name (e.g. ``"adoption"``). Used as the
            argv selector for ``python -m harness <name>`` and as the
            stable identifier in the contract test.
        description: First line of the family's ``__init__.py``
            docstring. Empty string only if the docstring is itself
            empty, which the contract test rejects.
        app: The entry-point callable from ``harness.<name>.cli``.
            Conforming families expose ``typer.Typer`` instances; the
            convention only requires a callable so a future family
            using a different argv parser remains valid.
        module_path: The dotted module path (e.g.
            ``"harness.adoption"``). ``python -m <module_path>`` works
            via the family's ``__main__.py``.
        package_dir: Absolute filesystem path to the family's package
            directory. Tests and tooling use this to read sibling
            files (README.md, requirements.txt) without re-deriving
            ``HARNESS_DIR``.
    """

    name: str
    description: str
    app: Callable[..., Any]
    module_path: str
    package_dir: Path


def discover_harnesses() -> list[HarnessSpec]:
    """Walk ``harness/`` and return every conforming family.

    A subpackage conforms iff (a) its name does not start with an
    underscore and is not in :data:`_EXCLUDED_SUBPACKAGES`, (b) it has
    a ``cli.py`` module that can be imported, and (c) ``cli`` exposes
    a non-None ``app`` attribute. Non-conforming directories are
    silently skipped here; the contract test
    (``tests/harness/test_harness_layout.py::test_every_harness_subpackage_conforms``)
    is what FAILS LOUDLY if a subpackage looks like a harness but
    misses a required file.

    Ordering: results are sorted by ``name`` for deterministic
    enumeration in ``python -m harness list`` and parametrized tests.

    Import failures in ``cli.py`` are NOT swallowed: they propagate
    so the developer sees a real traceback instead of an empty list.
    """
    specs: list[HarnessSpec] = []
    for finder_info in pkgutil.iter_modules([str(HARNESS_DIR)]):
        # Plain modules directly under ``harness/`` are never families.
        if not finder_info.ispkg:
            continue
        name = finder_info.name
        if name.startswith("_") or name in _EXCLUDED_SUBPACKAGES:
            continue
        package_dir = HARNESS_DIR / name
        cli_path = package_dir / "cli.py"
        if not cli_path.exists():
            continue
        cli_module = importlib.import_module(f"harness.{name}.cli")
        app = getattr(cli_module, "app", None)
        if app is None:
            continue
        init_module = importlib.import_module(f"harness.{name}")
        doc = (init_module.__doc__ or "").strip()
        description = doc.splitlines()[0] if doc else ""
        specs.append(
            HarnessSpec(
                name=name,
                description=description,
                app=app,
                module_path=f"harness.{name}",
                package_dir=package_dir,
            )
        )
    specs.sort(key=lambda spec: spec.name)
    return specs
122 changes: 122 additions & 0 deletions harness/__main__.py
@@ -0,0 +1,122 @@
"""Top-level dispatcher for harness families.

Invocations::

    python -m harness                  Show usage + discovered families.
    python -m harness --help / -h      Same as above.
    python -m harness list             One family per line: ``<name>\\t<description>``.
    python -m harness <name> [args]    Forward to ``python -m harness.<name>``.

Forwarding is done via :mod:`subprocess` so the family's own
``__main__.py`` runs with ``sys.argv[0]`` set exactly as if it were
invoked directly with ``python -m harness.<name>``. This avoids the
Typer/Click prog-name detection corner cases that ``runpy``-based
forwarding hits, and keeps the family's own ``--help`` output
identical between direct and dispatched invocation.

The dispatcher is a developer convenience, not a packaged entry
point. Direct ``python -m harness.<name>`` invocation continues to
work; the dispatcher exists so future families don't each need a
custom invocation pattern in CI scripts and docs.
"""
from __future__ import annotations

import subprocess
import sys
from pathlib import Path

# Bootstrap sys.path the same way ``harness/adoption/__main__.py`` does
# so the colocated ``src/`` wins over any editable install from a
# sibling worktree. Without this, an editable ``agents_shipgate``
# install from a different worktree could shadow the working tree
# under test. We only need this for the in-process
# ``discover_harnesses()`` call below; the subprocess child inherits
# the environment, and Python's own ``-m`` flag sets up its sys.path.
_REPO_ROOT = Path(__file__).resolve().parents[1]
for _path in (_REPO_ROOT, _REPO_ROOT / "src"):
    _s = str(_path)
    if _s not in sys.path:
        sys.path.insert(0, _s)

from harness import HarnessSpec, discover_harnesses # noqa: E402

_USAGE_HEADER = """\
Usage: python -m harness <command> [args...]

Commands:
list One harness per line: ``<name>\\t<description>``.
<name> [args...] Forward to ``python -m harness.<name>``.
--help, -h, help Show this message.

Discovered harness families:
"""

_USAGE_FOOTER = """\

See harness/README.md for the convention every family must follow.
"""


def _format_families(specs: list[HarnessSpec]) -> str:
    if not specs:
        return " (none — add a family under harness/<name>/)"
    width = max(len(spec.name) for spec in specs)
    return "\n".join(
        f" {spec.name:<{width}} {spec.description}" for spec in specs
    )


def main(argv: list[str] | None = None) -> int:
    """Entry point for ``python -m harness``.

    Returns 0 for usage and ``list`` output, 2 on an unknown harness,
    or the forwarded family's own exit code after a dispatch. Argv
    parsing is intentionally hand-rolled (no Typer, no argparse) so
    this stays a thin dispatch shim with no surprises for the family's
    own argv layer.
    """
    args = list(sys.argv[1:] if argv is None else argv)
    specs = discover_harnesses()

    # ``--help`` / ``-h`` / ``help`` / no args → usage + family list.
    if not args or args[0] in ("--help", "-h", "help"):
        sys.stdout.write(_USAGE_HEADER)
        sys.stdout.write(_format_families(specs))
        sys.stdout.write("\n" + _USAGE_FOOTER)
        return 0

    # ``list`` → tab-separated one-per-line, for piping.
    if args[0] == "list":
        for spec in specs:
            sys.stdout.write(f"{spec.name}\t{spec.description}\n")
        return 0

    # Otherwise treat the first positional as a harness name and
    # forward to ``python -m harness.<name>`` via subprocess. Unknown
    # names are rejected with a routable error and exit 2 (config-
    # error convention shared with the main agents-shipgate CLI).
    name = args[0]
    by_name = {spec.name: spec for spec in specs}
    if name not in by_name:
        sys.stderr.write(f"error: no harness named {name!r}\n")
        available = ", ".join(spec.name for spec in specs) or "(none)"
        sys.stderr.write(f"available: {available}\n")
        sys.stderr.write(
            "Run ``python -m harness --help`` for the full convention.\n"
        )
        return 2

    # Forward via subprocess so the child's ``sys.argv[0]`` matches a
    # direct ``python -m harness.<name>`` invocation exactly. This
    # keeps the child's Typer/Click ``--help`` output indistinguishable
    # from direct invocation, which is the whole point of the
    # convention.
    completed = subprocess.run(
        [sys.executable, "-m", f"harness.{name}", *args[1:]],
        check=False,
    )
    return completed.returncode


if __name__ == "__main__":
    sys.exit(main())