diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..e7eb004 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,127 @@ +# Agents Guide + +> Onboarding for future agents (and humans new to the repo). Read this before touching code so you don't accidentally re-litigate decisions that already shipped. + +## Project north star + +openworkers is **the multi-agent system that refuses to make things up**. Every claim the system emits is either tied to a verifiable primary source or marked as unsupported. Two domains live in the codebase: + +1. **Thesis assistant** (the legacy flagship). Audits literature claims against arXiv / Semantic Scholar / CrossRef. Producing prose is explicitly out of scope — every output is structured JSON. +2. **Code audit** (the new flagship, in progress). Audits factual claims in technical artefacts (READMEs first, then PRs, compliance docs, architecture docs) against the actual codebase, language specs, and dependencies. + +The two domains share the same DNA: planner → researcher → checker → critic pipelines, structured output everywhere, a hard trust gate that *refuses* to verdict without evidence. 
+ +## Where things are + +``` +core/ + blackboard/ # Redis-backed shared state (thesis-only for now) + orchestrator/ + thesis_flow.py # ThesisOrchestrator — legacy, do not break + readme_flow.py # ReadmeAuditOrchestrator — new code-audit slice + compiler.py # PromptCompiler for thesis (blackboard → prompt vars) + router/ # Provider tier routing (quality/balanced/cheap) + memory/episodic.py # Qdrant episodic memory (thesis) + schemas.py # Thesis Pydantic models + schemas_audit.py # Code-audit Pydantic models (kept separate on purpose) + sources/ + base.py # SourceAdapter ABC — the new evidence-backend contract + local_repo.py # LocalRepoAdapter — grep over a local repo +providers/ + unified.py # UnifiedLLM: provider fallback, breakers, DRY_RUN path + thesis_agents.py # Thesis agent suite — untouched, keep passing + code_audit_agents.py # README planner, checker, critic + trust gate + budget.py # BudgetGuard (contextvars-scoped session ceiling) + resilience.py # Tenacity + pybreaker glue +prompts/ + *.md # Thesis templates (head_planner, specialist_*, ...) + code_audit/*.md # Audit templates (readme_planner, readme_checker, ...) +tools/mcp/ # Literature MCP tools; will migrate behind SourceAdapter +apps/ + cli/main.py # Single argparse CLI for both `thesis ...` and `audit ...` + api/ # FastAPI surface + mcp_server/ # MCP stdio server + worker/ # Async worker stub +tests/ + fixtures/sample_repo/ # Synthetic widgetlib repo for audit tests + code_audit/ # New audit tests + test_*.py # Thesis tests — DO NOT regress +``` + +## The trust gate (read this twice) + +For code audit, the invariant **"no verdict without evidence"** is enforced in code, not in prompts. In `providers/code_audit_agents.py::_enforce_trust_gate`: + +``` +for each claim: + if retrieved evidence is empty: + verdict = "unsupported" + confidence = 0.0 + evidence_paths = [] + notes = "No supporting evidence found in the repository." +``` + +This overwrites whatever the LLM said. 
A confidently hallucinating checker that returns `verified` for a claim with zero evidence gets corrected before the user ever sees the report. Do **not** move this logic into a prompt. The test `test_readme_audit_end_to_end` in `tests/code_audit/test_readme_flow.py` explicitly seeds a hallucinating checker stub and asserts the override fires. + +Mirror this pattern when you add new auditors (PR auditor, compliance auditor, etc.): keep the LLM creative, but enforce the trust invariant in Python. + +## The README-audit flow (current slice) + +1. **Planner (LLM)** — reads the README, extracts atomic factual claims with verbatim quotes + grep-friendly search hints. Schema: `ReadmeClaimList`. +2. **Researcher (deterministic Python)** — uses `LocalRepoAdapter.search_any(hints)` to retrieve evidence snippets from the repo. **No LLM call here** — it's just a filesystem grep with safety rails (path traversal guard, file-size cap, dir excludes). +3. **Checker (LLM + trust gate)** — judges each `(claim, evidence)` pair into `verified | drifted | unsupported | contradicted`. Trust gate runs after. +4. **Critic (LLM)** — adversarial pass: weak verdicts, missed claims, suggestions. + +The audited README is **excluded** from its own evidence pool — otherwise every fabricated claim could "verify itself" against the README quote. See `ReadmeAuditOrchestrator.audit` for the exclusion logic. + +## Coexistence rules + +- **Do not break the thesis path.** The full thesis test suite (`tests/test_*.py` minus `tests/code_audit/`) must stay green. Thesis is being deprecated *gradually*, not yanked. +- **Do not modify `core/schemas.py` to add audit fields.** `core/schemas_audit.py` is the audit-domain home. The two domains evolve independently until a real reason to merge appears. +- **Blackboard is thesis-only for now.** README audit deliberately skips it — claim/evidence flow is plain Python. When a second audit type ships and shared state actually buys something, fold the blackboard in. 
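As a rough illustration of the trust-gate and self-evidence rules described above, a minimal sketch might look like the following. It uses plain dataclasses rather than the project's Pydantic models, and the names `Verdict`, `drop_self_evidence`, and `enforce_trust_gate` are hypothetical stand-ins, not the actual `_enforce_trust_gate` implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    # Hypothetical stand-in for the project's ClaimVerdict model.
    claim_id: str
    verdict: str
    confidence: float = 0.0
    evidence_paths: list[str] = field(default_factory=list)
    notes: str = ""


def drop_self_evidence(paths: list[str], audited_path: str) -> list[str]:
    """Remove the audited document from its own evidence pool."""
    return [p for p in paths if p != audited_path]


def enforce_trust_gate(
    verdicts: list[Verdict], evidence_by_claim: dict[str, list[str]]
) -> list[Verdict]:
    """Force 'unsupported' on any claim that has no retrieved evidence."""
    gated = []
    for v in verdicts:
        if not evidence_by_claim.get(v.claim_id):
            # Overwrite whatever the LLM said: no evidence, no verdict.
            v = Verdict(
                claim_id=v.claim_id,
                verdict="unsupported",
                confidence=0.0,
                evidence_paths=[],
                notes="No supporting evidence found in the repository.",
            )
        gated.append(v)
    return gated


# Assumed toy data: claim c2 is "supported" only by the README itself.
evidence = {
    "c1": drop_self_evidence(["src/cli.py", "README.md"], "README.md"),
    "c2": drop_self_evidence(["README.md"], "README.md"),
}
llm_output = [
    Verdict("c1", "verified", 0.9, ["src/cli.py"]),
    Verdict("c2", "verified", 0.95, ["README.md"]),
]
result = enforce_trust_gate(llm_output, evidence)
```

In this sketch the gate runs after the LLM step and keys only on whether evidence was retrieved, so a confident but unsupported `verified` for `c2` is replaced regardless of the model's stated confidence.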
+ +## LLM routing & DRY_RUN + +- `UnifiedLLM` (in `providers/unified.py`) routes to Anthropic / OpenAI / DeepSeek by tier (quality / balanced / cheap), with per-provider circuit breakers and a fallback chain. +- Tests run with `DRY_RUN=true` by default (from `Settings`). Under DRY_RUN, `generate()` returns a placeholder JSON shaped from the `response_schema`. **Caveat:** array fields come back empty — useful for "did the wiring work?" smoke checks, useless for end-to-end behaviour. +- For end-to-end tests of new agent flows, do **not** rely on DRY_RUN. Set `DRY_RUN=false`, set the `THESIS__PROVIDER` / `_MODEL` env vars, and stub LLM responses via `UnifiedLLM.set_generate_fn(...)` — content-aware (route by which agent's system prompt is in play). See `tests/code_audit/test_readme_flow.py::_make_stub_unified` for the pattern. + +## Conventions + +- **No prose generation.** Every agent emits structured JSON validated against a Pydantic model. If you find yourself prompting for "a paragraph that summarises…", stop and write a schema instead. +- **Structured output schemas** are derived from Pydantic models via `_schema_for()` (one per file: see `providers/thesis_agents.py` and `providers/code_audit_agents.py`). +- **`from __future__ import annotations`** at the top of every new file. The project targets Python 3.9 but uses py3.10+ syntax via deferred evaluation. +- **Lint stack:** `ruff` + `black --line-length=100`, both run in CI. New files must pass both. `mypy` strict is enforced on `core/` and `providers/`. +- **Comments:** non-obvious *why* only. No "this function reads a file"–style narration. Comments explaining hidden invariants, past incidents, or workarounds are welcome. +- **Commit hygiene:** no `--no-verify`, no skipping hooks. Pre-commit hook failures are diagnostic signals, not nuisances to bypass. + +## Where the project is going (1.0 trajectory) + +See `ROADMAP.md` for the full picture. Short version: + +- ✅ Slice 1 (shipped): README auditor. 
+- 🚧 Next slices: PR auditor (`audit pr `), compliance auditor, architecture auditor. All slot in behind the same `SourceAdapter` + agent-suite + trust-gate pattern. +- 🚧 Layered source adapters: repo (highest trust) → language specs / RFCs → dependency source. The literature MCP tools will migrate behind the same contract. +- 🚧 Cherry-picked from the v1.0 plan: tool/source registry, light provider-registry abstraction (Ollama later for local inference on private repos), structlog audit trail. +- ⏸️ Deferred: PyPI packaging, Typer CLI rewrite, OTel, Smart truncation, Ollama. Not blocking the audit-track expansion. + +The thesis pipeline stays first-class through the transition, then is gradually deprecated as code-audit reaches feature parity. + +## How to add a new auditor (recipe) + +1. **Schema** — add audit-specific Pydantic models to `core/schemas_audit.py` (claim shape, verdict shape, report shape). +2. **Source adapter** — if a new evidence backend is needed (e.g., GitHub PR adapter), add a class to `core/sources/` implementing `SourceAdapter`. Keep the path-traversal / scope guard at the adapter boundary. +3. **Agents** — add `PlannerAgent`, `CheckerAgent`, `AuditCriticAgent` (or reuse) to `providers/code_audit_agents.py` or a sibling module. The checker's post-LLM step **must** call a trust gate equivalent to `_enforce_trust_gate`. +4. **Orchestrator** — add `core/orchestrator/_flow.py` following `readme_flow.py`. Stage order is planner → deterministic researcher → checker (+ gate) → critic. Exclude the audited artefact from its own evidence pool if applicable. +5. **Prompts** — add `prompts/code_audit/_*.md` templates with explicit JSON schemas in the body and "no prose, no markdown fences" rules. +6. **CLI** — register a new subcommand under `audit` in `apps/cli/main.py`. +7. 
**Fixture + test** — add a `tests/fixtures/_*/` repo and a `tests/code_audit/test__flow.py` with at least: an adapter-level test, an end-to-end test asserting verdict distribution, and a trust-gate test asserting that a hallucinating checker stub is overridden. +8. **Docs** — update README's "Code audit" section and `ROADMAP.md`. Add a `CHANGELOG.md` entry under `[Unreleased]`. + +## Things that look like good ideas but aren't + +- **Letting the LLM verdict without evidence.** No matter how good the model, the trust gate stays. The whole product premise rides on this. +- **Premature shared base class for orchestrators.** Wait until the *third* auditor lands before extracting `BaseAuditFlow`. Two examples isn't a pattern; three is. +- **Sharing `core/schemas.py` between domains.** Keep `schemas_audit.py` separate. The merge cost is low; the divergence cost of cross-domain field coupling is high. +- **Adding the README to its own evidence pool.** Already burned us once. Self-evidence makes hallucinations verify themselves. +- **Skipping `from __future__ import annotations` because "we're on 3.13 locally".** CI runs on 3.9. diff --git a/CHANGELOG.md b/CHANGELOG.md index a1e78cb..5e708e4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,8 @@ All notable changes to OpenWorkers are documented here. The format is loosely ba ## [Unreleased] ### Added +- **Code-audit track — README auditor (first slice).** New `openworkers audit readme ` CLI subcommand verifies every factual claim in a README against the actual repository, emitting `verified | drifted | unsupported | contradicted` verdicts with cited file paths. Pipeline: planner (LLM) → researcher (deterministic grep via new `LocalRepoAdapter`) → checker (LLM + post-LLM trust gate) → critic (LLM adversarial pass). 
New modules: `core/sources/` (`SourceAdapter` ABC + `LocalRepoAdapter`), `core/schemas_audit.py` (Pydantic audit models), `core/orchestrator/readme_flow.py` (`ReadmeAuditOrchestrator`), `providers/code_audit_agents.py` (planner / checker / critic + `_enforce_trust_gate` invariant), `prompts/code_audit/*.md` (audit templates). The trust gate is enforced in code, not delegated to prompts: any claim with no retrieved evidence is forced to `unsupported` regardless of LLM output. The audited README is excluded from its own evidence pool so fabricated claims cannot self-verify. `tests/code_audit/test_readme_flow.py` exercises the full flow with a stubbed `UnifiedLLM.generate_fn` and a `tests/fixtures/sample_repo/` fixture containing a deliberate mix of verified / drifted / contradicted / fabricated claims. Thesis pipeline untouched. +- **Contributor onboarding doc** `AGENTS.md` capturing project DNA, code-audit slice design, trust-gate invariant, conventions, and the recipe for adding new auditors. - **RAG over user PDFs** (first incremental v1.0 slice). New `tools/mcp/rag.py` with sentence-aware chunker, `RAGIndexer` (PDF/text → Qdrant via PyMuPDF + FastEmbed `BAAI/bge-small-en-v1.5`), and `RAGSearchTool` (registered as `rag_search` in `ToolRegistry`). Collections namespaced under `rag_*` so they cannot collide with `thesis_corpus` or `episodes`. New CLI: `thesis ingest add|list|delete`. New flag: `thesis research ... --rag-collection ` makes the researcher pull from the user collection alongside arXiv/SS. New field: `ResearchContext.rag_collection`. `tests/test_rag.py` covers chunking edge cases, BOM/text extraction, collection naming, indexer round-trip, privacy gating, and idempotent re-ingest. 
### Documentation diff --git a/README.md b/README.md index 5261b80..1864570 100644 --- a/README.md +++ b/README.md @@ -6,9 +6,29 @@ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Lint: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) -A multi-agent thesis assistant that searches real literature, audits citations, and produces structured critiques. **It does not write prose.** It runs a hierarchical pipeline (HEAD planner → researcher → checker → synthesizer → critic → HEAD supervisor) over a Redis blackboard, with provider-agnostic LLM routing across Anthropic, OpenAI, and DeepSeek and verified citations via arXiv, Semantic Scholar, and CrossRef. +A multi-agent system that **refuses to make things up**. Two domains live here today: -> **Project status:** 0.1.0 (pre-release). The pipeline runs end-to-end and ships an MCP server, a CLI, and a FastAPI app, but APIs may shift before 1.0. See [ROADMAP.md](ROADMAP.md) for the planned 1.0 direction. +- **Thesis assistant** — audits literature claims against arXiv / Semantic Scholar / CrossRef. +- **Code audit** *(new flagship, in progress)* — audits factual claims in technical artefacts (READMEs first, then PRs / compliance docs / architecture docs) against the actual codebase, language specs, and dependencies. + +Both domains share the same DNA: a hierarchical pipeline (planner → researcher → checker → critic) producing structured JSON, with provider-agnostic LLM routing across Anthropic, OpenAI, and DeepSeek and a hard trust gate that **refuses to verdict without evidence**. + +> **Project status:** 0.1.0 (pre-release). The thesis pipeline runs end-to-end; the code-audit track has landed its first slice (`openworkers audit readme `). APIs may shift before 1.0. 
See [ROADMAP.md](ROADMAP.md) for direction, [AGENTS.md](AGENTS.md) for contributor context. + +## Code audit *(new track)* + +`openworkers audit readme ` extracts every factual claim from a README and verdicts each one against the actual repository: + +| Verdict | Meaning | +|---|---| +| `verified` | Code clearly demonstrates the claim is true | +| `drifted` | A related but divergent implementation exists (renamed flag, changed default, etc.) | +| `contradicted` | Code directly disproves the claim | +| `unsupported` | No evidence in the repo — enforced in code, not delegated to the LLM | + +The pipeline is planner (LLM extracts claims) → researcher (deterministic grep via `LocalRepoAdapter`) → checker (LLM judges + trust gate forces `unsupported` when evidence is empty) → critic (adversarial pass). The audited README is excluded from its own evidence pool, so fabricated claims cannot verify themselves. + +Roadmap for this track: PR auditor (PR description vs. diff), compliance auditor (security/policy claims vs. code), architecture auditor (design doc vs. implementation). See [AGENTS.md](AGENTS.md) for the contributor recipe. ## What it does diff --git a/ROADMAP.md b/ROADMAP.md index 178840f..1a6ef3d 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -19,6 +19,19 @@ - ✅ Docker Compose stack (Redis, Qdrant, CLI, MCP) and CI matrix on Python 3.9 / 3.12 - ✅ **RAG over user PDFs** — `thesis ingest add paper.pdf --collection my_papers` chunks + embeds via FastEmbed (BAAI/bge-small-en-v1.5) into Qdrant; the researcher transparently retrieves from the user collection when `thesis research ... --rag-collection my_papers` is set. Collections are namespaced under `rag_*` so they cannot collide with the thesis corpus or episodic memory. +## Code-audit track (new flagship) + +A second domain alongside the thesis assistant: audit factual claims in technical artefacts against the codebase. 
Same DNA — multi-agent, structured JSON, never fabricates, trust gate refuses verdicts without evidence — applied to a domain where trustworthy automated review matters to OSS maintainers and contributors. See [AGENTS.md](AGENTS.md) for the contributor recipe and trust-gate invariant. + +- ✅ **README auditor** *(first slice, shipped)*. `openworkers audit readme ` extracts atomic claims from a README and verdicts each one against the actual codebase as `verified | drifted | unsupported | contradicted`. Trust gate is enforced in `providers/code_audit_agents.py::_enforce_trust_gate`, not in prompts. The audited README is excluded from its own evidence pool. New `SourceAdapter` abstraction (`core/sources/`) with `LocalRepoAdapter`. +- 🚧 **PR auditor** — `openworkers audit pr `. Verify the PR description against the actual diff; flag scope creep, missing tests, undocumented changes. Needs a `GitHubAdapter` implementing `SourceAdapter`. +- 🚧 **Compliance auditor** — `openworkers audit compliance `. Verify security/policy claims ("inputs sanitized", "no secrets", "auth required on X") against the code. +- 🚧 **Architecture auditor** — verify RFC / design-doc claims against implementation, language specs, and dependency source. +- 🚧 **Layered source adapters** — repo (highest trust) → language specs / RFCs (`SpecAdapter`) → dependency source (`DependencyAdapter`). The existing literature MCP tools (arXiv / Semantic Scholar / CrossRef) will migrate behind the same `SourceAdapter` contract. +- 📋 **Tool/source registry** — `@register_source(...)` decorator + entry-point discovery so users add their own evidence backends without forking. +- 📋 **Local-inference provider** — Ollama path so users can audit private/proprietary repos without sending code to a cloud LLM. +- 🚧 **Gradual thesis deprecation** — the thesis pipeline stays first-class through the transition; deprecation happens after the audit track reaches feature parity. 
+ ## Proposed for 1.0 The 1.0 line targets a polished, packaged release on PyPI. The themes: diff --git a/apps/cli/main.py b/apps/cli/main.py index b1dcf46..5998922 100644 --- a/apps/cli/main.py +++ b/apps/cli/main.py @@ -9,6 +9,7 @@ format_session_text, ) from core.memory.episodic import EpisodicMemory +from core.orchestrator.readme_flow import ReadmeAuditOrchestrator, format_report_text from core.orchestrator.thesis_flow import ThesisOrchestrator from core.router.engine import Router from core.schemas import ResearchContext @@ -230,6 +231,29 @@ async def cmd_corpus(args): return summary +async def cmd_audit_dispatch(args): + """Route `audit ` to its handler.""" + if args.audit_action == "readme": + return await cmd_audit_readme(args) + raise SystemExit(f"Unknown audit action: {args.audit_action}") + + +async def cmd_audit_readme(args): + """Run the README auditor on a local repo.""" + unified = create_unified_llm() + orch = ReadmeAuditOrchestrator(unified=unified) + report, critique = await orch.audit(repo_path=args.repo, readme_path=args.readme) + if args.format == "json": + payload = { + "report": report.model_dump(), + "critique": critique.model_dump(), + } + _output(payload, "json", args.output) + else: + _output(format_report_text(report, critique), "text", args.output) + return report + + async def cmd_ingest(args): from tools.mcp.rag import RAGIndexer @@ -355,6 +379,26 @@ def build_parser() -> argparse.ArgumentParser: p_corpus.add_argument("--year", type=int, default=0, help="Year of publication") add_output_args(p_corpus) + p_audit = sub.add_parser("audit", help="Audit technical artefacts against the codebase") + audit_sub = p_audit.add_subparsers(dest="audit_action", required=True) + + p_audit_readme = audit_sub.add_parser( + "readme", + help="Verify every factual claim in a README against the repository", + ) + p_audit_readme.add_argument( + "repo", + type=str, + help="Path to the repository to audit", + ) + p_audit_readme.add_argument( + "--readme", + 
type=str, + default=None, + help="Explicit README path (default: auto-discover under )", + ) + add_output_args(p_audit_readme) + p_ingest = sub.add_parser("ingest", help="Manage user RAG collections (PDF/text -> Qdrant)") ingest_sub = p_ingest.add_subparsers(dest="ingest_action", required=True) @@ -402,6 +446,7 @@ def main(): "sessions": cmd_sessions, "corpus": cmd_corpus, "ingest": cmd_ingest, + "audit": cmd_audit_dispatch, } handler = command_map.get(args.command) diff --git a/core/orchestrator/readme_flow.py b/core/orchestrator/readme_flow.py new file mode 100644 index 0000000..503419c --- /dev/null +++ b/core/orchestrator/readme_flow.py @@ -0,0 +1,268 @@ +"""README audit orchestrator. + +Parallels ``ThesisOrchestrator`` in spirit (planner → researcher → +checker → critic) but with three deliberate differences: + +1. The researcher is **deterministic Python**, not an LLM. Evidence + retrieval is a grep over the local repo via ``LocalRepoAdapter`` — + no fabrication risk, no per-claim API cost. + +2. The trustworthiness gate is enforced in code (``_enforce_trust_gate`` + in ``providers/code_audit_agents.py``), not entrusted to a prompt. + +3. There is no shared blackboard yet — claim/evidence state flows as + plain Python between agents. The blackboard layer will fold in + when a second audit type (PR / compliance) is added and the + shared state actually buys something. 
+""" + +from __future__ import annotations + +import asyncio +import logging +import os +import time +from collections import Counter +from pathlib import Path +from typing import Any, Callable + +from core.schemas_audit import ( + ALL_VERDICTS, + VERDICT_UNSUPPORTED, + AuditCritique, + AuditReport, + ClaimEvidence, + ClaimVerdict, + EvidenceRef, + ReadmeClaim, + ReadmeClaimList, +) +from core.sources.local_repo import LocalRepoAdapter +from providers.code_audit_agents import ( + AuditCriticAgent, + ReadmeCheckerAgent, + ReadmePlannerAgent, +) +from providers.unified import UnifiedLLM + +logger = logging.getLogger(__name__) + + +_AUDIT_PROMPT_DIR = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(__file__))), + "prompts", + "code_audit", +) + +_TEMPLATE_FILES = { + "readme_planner": "readme_planner.md", + "readme_checker": "readme_checker.md", + "audit_critic": "audit_critic.md", +} + + +def _render_audit_prompt(name: str, variables: dict[str, Any]) -> str: + """Tiny placeholder-substitution renderer for audit prompts. + + Deliberately not reusing PromptCompiler: that compiler is wired to + extract blackboard state, which this slice doesn't use. Audit + templates only need ``{{ var }}`` substitution. 
+ """ + filename = _TEMPLATE_FILES.get(name) + if not filename: + raise ValueError(f"Unknown audit template: {name}") + path = os.path.join(_AUDIT_PROMPT_DIR, filename) + try: + with open(path, encoding="utf-8") as f: + template = f.read() + except OSError: + return f"[Template {name} not found at {path}]" + for key, value in variables.items(): + template = template.replace("{{ " + key + " }}", str(value)) + return template + + +class ReadmeAuditOrchestrator: + """Run a README audit end-to-end against a local repo.""" + + def __init__( + self, + unified: UnifiedLLM, + adapter: LocalRepoAdapter | None = None, + prompt_renderer: Callable[[str, dict[str, Any]], str] | None = None, + max_evidence_per_claim: int = 5, + ) -> None: + self.unified = unified + self.adapter = adapter + self.render = prompt_renderer or _render_audit_prompt + self.max_evidence_per_claim = max_evidence_per_claim + self.planner = ReadmePlannerAgent(unified=unified, prompt_renderer=self.render) + self.checker = ReadmeCheckerAgent(unified=unified, prompt_renderer=self.render) + self.critic = AuditCriticAgent(unified=unified, prompt_renderer=self.render) + + async def audit( + self, + repo_path: Path | str, + readme_path: Path | str | None = None, + ) -> tuple[AuditReport, AuditCritique]: + start = time.time() + adapter = self.adapter or LocalRepoAdapter(repo_path) + # Re-bind adapter when caller passed an explicit repo_path: the + # default-construction branch above already pinned it; this + # branch handles the case where the caller reused an existing + # orchestrator across multiple repos. 
+ if self.adapter is None: + self.adapter = adapter + + readme_file = Path(readme_path) if readme_path else adapter.find_readme() + errors: list[str] = [] + if readme_file is None or not Path(readme_file).is_file(): + empty = AuditReport( + repo_path=str(adapter.root), + readme_path="", + verdicts=[], + summary=dict.fromkeys(ALL_VERDICTS, 0), + errors=["No README found in repo."], + ) + return empty, AuditCritique() + + readme_path_str = str(Path(readme_file).resolve()) + readme_text = Path(readme_file).read_text(encoding="utf-8", errors="replace") + # The README under audit must not count as evidence for its own + # claims — otherwise every fabricated claim 'verifies' itself. + try: + readme_rel = str(Path(readme_path_str).resolve().relative_to(adapter.root)) + except ValueError: + readme_rel = "" + + # ── Stage 1: Planner extracts claims ── + try: + planner_result = await self.planner.execute(readme_text, readme_path_str) + claim_list: ReadmeClaimList = planner_result["output"] + except Exception as e: + errors.append(f"planner: {e}") + claim_list = ReadmeClaimList(claims=[], readme_path=readme_path_str) + + # ── Stage 2: Researcher retrieves evidence (deterministic) ── + evidence = await asyncio.to_thread( + self._retrieve_all_evidence, claim_list.claims, adapter, readme_rel + ) + + # ── Stage 3: Checker renders verdicts (with trust gate) ── + if claim_list.claims: + try: + checker_result = await self.checker.execute(claim_list.claims, evidence) + verdicts: list[ClaimVerdict] = list(checker_result["output"].verdicts) + except Exception as e: + errors.append(f"checker: {e}") + verdicts = [ + ClaimVerdict( + claim_id=c.claim_id, + claim_text=c.claim_text, + verdict=VERDICT_UNSUPPORTED, + confidence=0.0, + evidence_paths=[], + notes=f"Checker failed: {e}", + ) + for c in claim_list.claims + ] + else: + verdicts = [] + + # ── Stage 4: Critic adversarial pass ── + try: + critic_result = await self.critic.execute(verdicts, readme_text) + critique: AuditCritique = 
critic_result["output"] + except Exception as e: + errors.append(f"critic: {e}") + critique = AuditCritique() + + summary = Counter(v.verdict for v in verdicts) + report = AuditReport( + repo_path=str(adapter.root), + readme_path=readme_path_str, + verdicts=verdicts, + summary={v: int(summary.get(v, 0)) for v in ALL_VERDICTS}, + errors=errors, + ) + elapsed_ms = int((time.time() - start) * 1000) + logger.info( + "readme_audit completed repo=%s claims=%d elapsed_ms=%d errors=%d", + adapter.root, + len(verdicts), + elapsed_ms, + len(errors), + ) + return report, critique + + def _retrieve_all_evidence( + self, + claims: list[ReadmeClaim], + adapter: LocalRepoAdapter, + exclude_path: str = "", + ) -> list[ClaimEvidence]: + out: list[ClaimEvidence] = [] + for claim in claims: + # Over-fetch then post-filter so excluding the README doesn't + # silently shrink the result set below ``max_evidence_per_claim``. + raw = adapter.search_any( + claim.search_hints, + limit=( + self.max_evidence_per_claim * 2 if exclude_path else self.max_evidence_per_claim + ), + ) + filtered = [s for s in raw if s.path != exclude_path][: self.max_evidence_per_claim] + refs = [ + EvidenceRef( + path=s.path, + line_start=s.line_start, + line_end=s.line_end, + text=s.text, + source=s.source, + ) + for s in filtered + ] + out.append(ClaimEvidence(claim_id=claim.claim_id, snippets=refs)) + return out + + +def format_report_text(report: AuditReport, critique: AuditCritique | None = None) -> str: + """Pretty-print an audit report for terminal output.""" + lines: list[str] = [] + lines.append(f"README audit — {report.repo_path}") + lines.append(f"README: {report.readme_path or '(not found)'}") + lines.append("") + if report.summary: + summary_line = " ".join(f"{k}={v}" for k, v in report.summary.items()) + lines.append(f"Summary: {summary_line}") + lines.append("") + for v in report.verdicts: + marker = { + "verified": "✓", + "drifted": "≠", + "contradicted": "✗", + "unsupported": "?", + 
}.get(v.verdict, "·") + lines.append(f"{marker} [{v.verdict.upper():12s}] {v.claim_id}: {v.claim_text}") + if v.evidence_paths: + for p in v.evidence_paths: + lines.append(f" ↳ {p}") + if v.notes: + lines.append(f" note: {v.notes}") + if report.errors: + lines.append("") + lines.append("Errors:") + for e in report.errors: + lines.append(f" - {e}") + if critique is not None: + lines.append("") + lines.append("Critic pass:") + for wv in critique.weak_verdicts: + lines.append(f" weak: {wv}") + for mc in critique.missed_claims: + lines.append(f" missed: {mc}") + for sg in critique.suggestions: + lines.append(f" suggest: {sg}") + if critique.overall_assessment: + lines.append(f" → {critique.overall_assessment}") + return "\n".join(lines) diff --git a/core/schemas_audit.py b/core/schemas_audit.py new file mode 100644 index 0000000..577b195 --- /dev/null +++ b/core/schemas_audit.py @@ -0,0 +1,89 @@ +"""Pydantic schemas for the code-audit domain. + +Kept separate from ``core/schemas.py`` so the legacy thesis types and the +new audit types evolve independently while the two domains coexist. 
+""" + +from __future__ import annotations + +from pydantic import BaseModel, Field + +VERDICT_VERIFIED = "verified" +VERDICT_DRIFTED = "drifted" +VERDICT_UNSUPPORTED = "unsupported" +VERDICT_CONTRADICTED = "contradicted" + +ALL_VERDICTS = ( + VERDICT_VERIFIED, + VERDICT_DRIFTED, + VERDICT_UNSUPPORTED, + VERDICT_CONTRADICTED, +) + + +class ReadmeClaim(BaseModel): + """A single atomic factual claim extracted from a README.""" + + claim_id: str + claim_text: str = Field(description="Verbatim quote from the README") + claim_type: str = Field( + default="other", + description="feature | install | usage | requirement | metric | api | other", + ) + search_hints: list[str] = Field( + default_factory=list, + description="Tokens/identifiers the researcher should grep for", + ) + + +class ReadmeClaimList(BaseModel): + claims: list[ReadmeClaim] = Field(default_factory=list) + readme_path: str = "" + + +class EvidenceRef(BaseModel): + """Adapter-agnostic citation handle. Mirrors ``EvidenceSnippet`` + in shape but is the JSON-serialisable form passed across the + LLM boundary. 
+ """ + + path: str + line_start: int = 0 + line_end: int = 0 + text: str = "" + source: str = "" + + +class ClaimEvidence(BaseModel): + claim_id: str + snippets: list[EvidenceRef] = Field(default_factory=list) + + +class ClaimVerdict(BaseModel): + claim_id: str + claim_text: str = "" + verdict: str = Field(description="verified | drifted | unsupported | contradicted") + confidence: float = 0.0 + evidence_paths: list[str] = Field(default_factory=list) + notes: str = "" + + +class ClaimVerdictList(BaseModel): + """LLM-side wrapper so the checker emits a single JSON object.""" + + verdicts: list[ClaimVerdict] = Field(default_factory=list) + + +class AuditReport(BaseModel): + repo_path: str + readme_path: str = "" + verdicts: list[ClaimVerdict] = Field(default_factory=list) + summary: dict[str, int] = Field(default_factory=dict) + errors: list[str] = Field(default_factory=list) + + +class AuditCritique(BaseModel): + weak_verdicts: list[str] = Field(default_factory=list) + missed_claims: list[str] = Field(default_factory=list) + suggestions: list[str] = Field(default_factory=list) + overall_assessment: str = "" diff --git a/core/sources/__init__.py b/core/sources/__init__.py new file mode 100644 index 0000000..c384f96 --- /dev/null +++ b/core/sources/__init__.py @@ -0,0 +1,13 @@ +"""SourceAdapter layer: pluggable evidence retrieval for audit pipelines. + +Each adapter implements a uniform contract (search, fetch, cite) so domain +flows can compose evidence from heterogeneous sources without hardcoding +which backend they speak to. The literature-domain tools under +``tools/mcp/`` will migrate behind this contract in a later slice; for the +README-audit slice we ship only the local-repo adapter. 
+""" + +from core.sources.base import EvidenceSnippet, SourceAdapter +from core.sources.local_repo import LocalRepoAdapter + +__all__ = ["EvidenceSnippet", "LocalRepoAdapter", "SourceAdapter"] diff --git a/core/sources/base.py b/core/sources/base.py new file mode 100644 index 0000000..001519f --- /dev/null +++ b/core/sources/base.py @@ -0,0 +1,51 @@ +from __future__ import annotations + +from abc import ABC, abstractmethod +from dataclasses import dataclass + + +@dataclass +class EvidenceSnippet: + """A single piece of evidence retrieved from a source. + + ``path`` is opaque to the adapter contract — it might be a file path, + a URL, a DOI, or a chunk id. The synthesizer treats it as a citation + handle: anything a human can navigate back to. + """ + + path: str + line_start: int + line_end: int + text: str + source: str = "" + + def cite(self) -> str: + if self.line_start <= 0: + return self.path + if self.line_end and self.line_end != self.line_start: + return f"{self.path}:{self.line_start}-{self.line_end}" + return f"{self.path}:{self.line_start}" + + +class SourceAdapter(ABC): + """Abstract contract every evidence backend implements.""" + + name: str = "unknown" + + @abstractmethod + def search(self, query: str, limit: int = 5) -> list[EvidenceSnippet]: + """Return up to ``limit`` snippets relevant to ``query``.""" + + def fetch(self, path: str, line_start: int = 0, line_end: int = 0) -> EvidenceSnippet: + """Return the canonical content for a citation handle. + + Default implementation: callers that only need search results can + ignore this; adapters that support deep-linked retrieval override. 
+ """ + return EvidenceSnippet( + path=path, + line_start=line_start, + line_end=line_end, + text="", + source=self.name, + ) diff --git a/core/sources/local_repo.py b/core/sources/local_repo.py new file mode 100644 index 0000000..d4ab88f --- /dev/null +++ b/core/sources/local_repo.py @@ -0,0 +1,199 @@ +from __future__ import annotations + +import re +from collections.abc import Iterable +from pathlib import Path + +from core.sources.base import EvidenceSnippet, SourceAdapter + +_DEFAULT_INCLUDE_SUFFIXES = { + ".py", + ".pyi", + ".js", + ".ts", + ".tsx", + ".jsx", + ".go", + ".rs", + ".java", + ".kt", + ".rb", + ".php", + ".cs", + ".c", + ".cc", + ".cpp", + ".h", + ".hpp", + ".swift", + ".sh", + ".bash", + ".zsh", + ".sql", + ".toml", + ".yaml", + ".yml", + ".json", + ".ini", + ".cfg", + ".md", + ".rst", + ".txt", + ".env", + ".dockerfile", + ".tf", +} + +_DEFAULT_EXCLUDE_DIRS = { + ".git", + "node_modules", + "__pycache__", + ".venv", + "venv", + "dist", + "build", + ".mypy_cache", + ".pytest_cache", + ".ruff_cache", + "qdrant_data", +} + +_MAX_BYTES_PER_FILE = 256 * 1024 +_CONTEXT_LINES = 2 + + +class LocalRepoAdapter(SourceAdapter): + """Evidence backend that grep-walks a repo from the filesystem. + + Why not shell out to ripgrep: we want zero external deps for the + smoke path; a pure-Python walker is fast enough on the kinds of + READMEs we audit (a few dozen claims, repos under a few thousand + files). Swap in ripgrep later if profiling demands it. 
+ """ + + name = "local_repo" + + def __init__( + self, + root: Path | str, + include_suffixes: set[str] | None = None, + exclude_dirs: set[str] | None = None, + ) -> None: + self.root = Path(root).resolve() + if not self.root.exists(): + raise FileNotFoundError(f"Repo root does not exist: {self.root}") + self.include_suffixes = include_suffixes or _DEFAULT_INCLUDE_SUFFIXES + self.exclude_dirs = exclude_dirs or _DEFAULT_EXCLUDE_DIRS + + def search(self, query: str, limit: int = 5) -> list[EvidenceSnippet]: + query = (query or "").strip() + if not query: + return [] + pattern = re.compile(re.escape(query), re.IGNORECASE) + return self._search_pattern(pattern, limit) + + def search_any(self, terms: Iterable[str], limit: int = 5) -> list[EvidenceSnippet]: + """Search for *any* of ``terms``. Useful when the planner emits + multiple search hints per claim — we want to retrieve evidence + when any hint matches, not require all. + """ + cleaned = [re.escape(t.strip()) for t in terms if t and t.strip()] + if not cleaned: + return [] + pattern = re.compile("|".join(cleaned), re.IGNORECASE) + return self._search_pattern(pattern, limit) + + def _search_pattern(self, pattern: re.Pattern[str], limit: int) -> list[EvidenceSnippet]: + hits: list[EvidenceSnippet] = [] + for path in self._walk(): + if len(hits) >= limit: + break + try: + size = path.stat().st_size + except OSError: + continue + if size > _MAX_BYTES_PER_FILE: + continue + try: + text = path.read_text(encoding="utf-8", errors="replace") + except OSError: + continue + lines = text.splitlines() + for i, line in enumerate(lines): + if pattern.search(line): + start = max(0, i - _CONTEXT_LINES) + end = min(len(lines), i + _CONTEXT_LINES + 1) + snippet_text = "\n".join(lines[start:end]) + rel = path.relative_to(self.root) + hits.append( + EvidenceSnippet( + path=str(rel), + line_start=start + 1, + line_end=end, + text=snippet_text, + source=self.name, + ) + ) + if len(hits) >= limit: + break + return hits + + def 
fetch(self, path: str, line_start: int = 0, line_end: int = 0) -> EvidenceSnippet: + target = (self.root / path).resolve() + # Refuse to read outside the repo root — a SourceAdapter must never + # leak files the user didn't authorise. This is the trustworthiness + # gate at the adapter boundary. + try: + target.relative_to(self.root) + except ValueError: + return EvidenceSnippet(path=path, line_start=0, line_end=0, text="", source=self.name) + if not target.is_file(): + return EvidenceSnippet(path=path, line_start=0, line_end=0, text="", source=self.name) + try: + lines = target.read_text(encoding="utf-8", errors="replace").splitlines() + except OSError: + return EvidenceSnippet(path=path, line_start=0, line_end=0, text="", source=self.name) + if line_start <= 0: + return EvidenceSnippet( + path=path, + line_start=1, + line_end=len(lines), + text="\n".join(lines), + source=self.name, + ) + start = max(1, line_start) + end = max(start, line_end or start) + return EvidenceSnippet( + path=path, + line_start=start, + line_end=end, + text="\n".join(lines[start - 1 : end]), + source=self.name, + ) + + def find_readme(self) -> Path | None: + for name in ("README.md", "README.rst", "README.txt", "readme.md", "README"): + candidate = self.root / name + if candidate.is_file(): + return candidate + return None + + def _walk(self) -> Iterable[Path]: + stack: list[Path] = [self.root] + while stack: + current = stack.pop() + try: + children = list(current.iterdir()) + except OSError: + continue + for child in children: + if child.is_dir(): + if child.name in self.exclude_dirs: + continue + stack.append(child) + elif child.is_file(): + if child.suffix.lower() in self.include_suffixes or child.name.lower() in { + "dockerfile", + "makefile", + }: + yield child diff --git a/prompts/code_audit/audit_critic.md b/prompts/code_audit/audit_critic.md new file mode 100644 index 0000000..508a088 --- /dev/null +++ b/prompts/code_audit/audit_critic.md @@ -0,0 +1,33 @@ +# SPECIALIST: AUDIT 
CRITIC + +You are the AUDIT CRITIC agent for the openworkers code-audit pipeline. + +## Your Role +Adversarial review of the checker's verdict list. Find: +- **Weak verdicts**: `verified` with thin evidence, or `drifted`/`contradicted` whose notes don't actually demonstrate divergence. +- **Missed claims**: factual statements in the README the planner failed to extract. +- **Suggestions**: concrete, actionable next steps for the human reviewer. + +## Input +The user message contains: +- `VERDICTS`: JSON list of `{claim_id, claim_text, verdict, confidence, evidence_paths, notes}`. +- The original README between `---BEGIN README---` / `---END README---`. + +## Output: AuditCritique (JSON) +Return one JSON object with this exact schema. No prose, no markdown fences. + +```json +{ + "weak_verdicts": ["claim-XX: "], + "missed_claims": [""], + "suggestions": [""], + "overall_assessment": "" +} +``` + +## Rules +- Be specific: `claim-04: evidence path is a comment, not the implementation` beats `claim-04: weak`. +- Quote `missed_claims` verbatim from the README. +- Do not invent verdicts. You are reviewing, not re-judging. +- If the audit is clean, return empty lists and say so in `overall_assessment`. +- No content outside the JSON object. diff --git a/prompts/code_audit/readme_checker.md b/prompts/code_audit/readme_checker.md new file mode 100644 index 0000000..a2524f0 --- /dev/null +++ b/prompts/code_audit/readme_checker.md @@ -0,0 +1,46 @@ +# SPECIALIST: README CHECKER + +You are the README CHECKER agent for the openworkers code-audit pipeline. + +## Your Role +For each claim, decide whether the retrieved repository evidence **supports**, **drifts from**, **contradicts**, or **fails to support** it. You are the trust gate: if there is no evidence, the verdict is `unsupported` — never `verified`. + +## Input +The user message contains: +- `CLAIMS`: JSON list of `{claim_id, claim_text, claim_type, search_hints}`. 
+- `EVIDENCE`: JSON list of `{claim_id, snippets: [{path, line_start, line_end, text, source}]}`. + +Each claim has exactly one evidence entry (possibly with an empty `snippets` list). + +## Output: ClaimVerdictList (JSON) +Return one JSON object with this exact schema. No prose, no markdown fences. + +```json +{ + "verdicts": [ + { + "claim_id": "claim-01", + "claim_text": "", + "verdict": "{{ verdict_values }}", + "confidence": 0.0, + "evidence_paths": [""], + "notes": "" + } + ] +} +``` + +## Verdict Rules +- **`verified`**: the snippets clearly demonstrate the claim is currently true. Cite the paths actually used. +- **`drifted`**: the codebase contains a related but **divergent** implementation — name renamed, signature changed, default changed, behaviour differs. Notes must state what the README says vs. what the code does. +- **`contradicted`**: the snippets directly disprove the claim (e.g. README says "no telemetry", code emits telemetry). +- **`unsupported`**: snippets are empty, irrelevant, or insufficient. **You must emit `unsupported` if the snippet list is empty.** Confidence 0.0. + +## Trustworthiness Gate +You **never** fabricate evidence. You **never** mark a claim `verified` without at least one snippet that materially supports it. If unsure, prefer `unsupported` over `verified`. + +## Format +- One verdict per input claim, same `claim_id`. +- `confidence` in [0.0, 1.0]. +- `evidence_paths` lists `path:line_start-line_end` strings drawn only from the provided snippets. +- No commentary outside the JSON object. diff --git a/prompts/code_audit/readme_planner.md b/prompts/code_audit/readme_planner.md new file mode 100644 index 0000000..4c4bb67 --- /dev/null +++ b/prompts/code_audit/readme_planner.md @@ -0,0 +1,40 @@ +# SPECIALIST: README PLANNER + +You are the README PLANNER agent for the openworkers code-audit pipeline. + +## Your Role +Read a project README and extract every **atomic factual claim** it makes about the codebase. 
Do not paraphrase; quote verbatim. Each claim must be independently verifiable against the repository. + +## Input +The README file at `{{ readme_path }}` is provided in the user message between `---BEGIN README---` / `---END README---` markers. + +## Output: ReadmeClaimList (JSON) +Return one JSON object with this exact schema. No prose, no markdown fences. + +```json +{ + "readme_path": "{{ readme_path }}", + "claims": [ + { + "claim_id": "claim-01", + "claim_text": "", + "claim_type": "feature | install | usage | requirement | metric | api | other", + "search_hints": ["", "", ""] + } + ] +} +``` + +## Rules +- **Atomic**: split compound claims into one claim per sentence-level fact. +- **Verbatim**: `claim_text` must be a direct quote (you may trim leading bullet markers / numbering). +- **Skip**: opinions, marketing prose, vision statements, license boilerplate, badges, links to external pages. +- **Include**: install commands, feature lists, supported platforms/versions, performance numbers, CLI commands, configuration flags, file paths, public API names. +- **Hints**: 2–6 grep-friendly tokens per claim — module names, function names, CLI flags, package names, file extensions. Skip generic English words. +- **Stable IDs**: use sequential `claim-01`, `claim-02`, … so downstream agents can reference them. +- If the README contains zero verifiable claims, return `"claims": []`. + +## Forbidden +- Inventing claims not present in the text. +- Producing prose, summary, or explanation outside the JSON. +- Wrapping JSON in markdown code fences. diff --git a/providers/code_audit_agents.py b/providers/code_audit_agents.py new file mode 100644 index 0000000..f84f801 --- /dev/null +++ b/providers/code_audit_agents.py @@ -0,0 +1,379 @@ +"""Code-audit agent suite. + +Mirrors the structure of ``providers/thesis_agents.py`` so the +orchestrator-level contract (each agent has ``execute(task, context) -> +dict``) stays uniform across domains. 
The thesis path remains untouched +while this slice lands. + +Trustworthiness gates live here, not in prompts: any claim with no +retrieved evidence is forced to ``unsupported`` *after* the LLM +responds. The LLM never gets to invent a ``verified`` verdict for a +claim with zero supporting snippets. +""" + +from __future__ import annotations + +import json +import logging +import re +import uuid +from typing import Any, Callable + +from pydantic import BaseModel + +from core.schemas_audit import ( + ALL_VERDICTS, + VERDICT_UNSUPPORTED, + AuditCritique, + ClaimEvidence, + ClaimVerdict, + ClaimVerdictList, + ReadmeClaim, + ReadmeClaimList, +) +from providers.unified import UnifiedLLM + +logger = logging.getLogger(__name__) + + +__all__ = [ + "AuditCriticAgent", + "ReadmeCheckerAgent", + "ReadmePlannerAgent", + "_schema_for", +] + + +_MODEL_SCHEMAS: dict[type[BaseModel], dict[str, Any]] = {} + + +def _schema_for(model_cls: type[BaseModel]) -> dict[str, Any]: + if model_cls not in _MODEL_SCHEMAS: + raw = model_cls.model_json_schema() + raw.pop("title", None) + _MODEL_SCHEMAS[model_cls] = raw + return _MODEL_SCHEMAS[model_cls] + + +def _parse_json_lenient(text: str) -> Any: + if not text or not text.strip(): + return {} + cleaned = text.strip() + fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", cleaned, re.DOTALL) + if fenced: + cleaned = fenced.group(1).strip() + cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned) + try: + return json.loads(cleaned) + except json.JSONDecodeError: + try: + return json.loads(cleaned.replace("'", '"')) + except json.JSONDecodeError: + logger.warning("Could not parse audit JSON: %s", text[:200]) + return {} + + +def _parse_structured(text: str, model_cls: type[BaseModel]) -> Any: + if not text or not text.strip(): + return model_cls() + cleaned = text.strip() + fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", cleaned, re.DOTALL) + if fenced: + cleaned = fenced.group(1).strip() + cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned) + for attempt 
in (cleaned, cleaned.replace("'", '"')): + try: + return model_cls.model_validate_json(attempt) + except Exception: + continue + parsed = _parse_json_lenient(cleaned) + try: + return model_cls.model_validate(parsed) + except Exception: + return model_cls() + + +class ReadmePlannerAgent: + """Extracts atomic factual claims from a README. + + Stateless: the orchestrator hands it the README text and a system + prompt; it returns ``ReadmeClaimList``. No blackboard reads here — + the README is the entire input and we want the prompt to be + deterministic and cacheable on its content. + """ + + def __init__(self, unified: UnifiedLLM, prompt_renderer: Callable[[str, dict[str, Any]], str]): + self.unified = unified + self.render = prompt_renderer + + async def execute( + self, + readme_text: str, + readme_path: str, + ) -> dict[str, Any]: + system_prompt = self.render("readme_planner", {"readme_path": readme_path}) + user_prompt = ( + "Extract every atomic factual claim from the README below. " + "Return a JSON object matching the ReadmeClaimList schema.\n\n" + f"README path: {readme_path}\n\n" + "---BEGIN README---\n" + f"{readme_text}\n" + "---END README---" + ) + response = await self.unified.generate( + prompt=user_prompt, + system_prompt=system_prompt, + mode="quality", + response_schema=_schema_for(ReadmeClaimList), + ) + parsed = _parse_structured(response.content, ReadmeClaimList) + parsed = _normalise_claim_list(parsed, readme_path) + return { + "agent": "readme_planner", + "tier": "head", + "status": "success", + "output": parsed, + "provider": response.provider_used, + "model": response.model, + "latency_ms": response.latency_ms, + "cost_estimate_usd": response.cost_estimate_usd, + "dry_run": response.dry_run, + "fallback_used": response.fallback_used, + } + + +def _normalise_claim_list(claim_list: ReadmeClaimList, readme_path: str) -> ReadmeClaimList: + """Fill in claim_ids and search_hints when the planner omits them. 
+ + The schema permits empty strings, but downstream agents key off + ``claim_id``. Generating ids here keeps the LLM template lenient + without losing the invariant that every claim is addressable. + """ + if not claim_list.readme_path: + claim_list.readme_path = readme_path + fixed: list[ReadmeClaim] = [] + for idx, claim in enumerate(claim_list.claims): + cid = claim.claim_id or f"claim-{idx + 1:02d}" + hints = list(claim.search_hints) if claim.search_hints else _derive_hints(claim.claim_text) + fixed.append( + ReadmeClaim( + claim_id=cid, + claim_text=claim.claim_text, + claim_type=claim.claim_type or "other", + search_hints=hints, + ) + ) + claim_list.claims = fixed + return claim_list + + +_HINT_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_./-]{2,}") +_STOPWORDS = { + "the", + "and", + "for", + "with", + "this", + "that", + "from", + "into", + "you", + "your", + "via", + "use", + "uses", + "using", + "can", + "will", + "are", + "was", + "were", + "all", + "any", + "one", + "two", + "three", + "have", + "has", + "had", +} + + +def _derive_hints(text: str) -> list[str]: + """Fallback hint extraction when the planner skips ``search_hints``.""" + hints: list[str] = [] + seen: set[str] = set() + for tok in _HINT_TOKEN.findall(text): + if tok.lower() in _STOPWORDS: + continue + if tok in seen: + continue + seen.add(tok) + hints.append(tok) + if len(hints) >= 6: + break + return hints + + +class ReadmeCheckerAgent: + """Renders verdicts for each (claim, evidence) pair. + + Critically, after the LLM returns, we *enforce* the trustworthiness + gate: any claim whose evidence list is empty is forced to + ``unsupported`` regardless of what the LLM said. This is the + invariant the project is built around: no verdict without + evidence. 
+ """ + + def __init__(self, unified: UnifiedLLM, prompt_renderer: Callable[[str, dict[str, Any]], str]): + self.unified = unified + self.render = prompt_renderer + + async def execute( + self, + claims: list[ReadmeClaim], + evidence: list[ClaimEvidence], + ) -> dict[str, Any]: + evidence_by_claim = {e.claim_id: e for e in evidence} + evidence_payload = [e.model_dump() for e in evidence] + claims_payload = [c.model_dump() for c in claims] + + system_prompt = self.render( + "readme_checker", + {"verdict_values": " | ".join(ALL_VERDICTS)}, + ) + user_prompt = ( + "Judge each claim against its retrieved evidence and return a " + "ClaimVerdictList JSON object.\n\n" + f"CLAIMS:\n{json.dumps(claims_payload, indent=2)}\n\n" + f"EVIDENCE:\n{json.dumps(evidence_payload, indent=2)}" + ) + response = await self.unified.generate( + prompt=user_prompt, + system_prompt=system_prompt, + mode="balanced", + response_schema=_schema_for(ClaimVerdictList), + ) + parsed = _parse_structured(response.content, ClaimVerdictList) + + verdicts = _enforce_trust_gate(parsed.verdicts, claims, evidence_by_claim) + parsed.verdicts = verdicts + + return { + "agent": "readme_checker", + "tier": "middle", + "status": "success", + "output": parsed, + "provider": response.provider_used, + "model": response.model, + "latency_ms": response.latency_ms, + "cost_estimate_usd": response.cost_estimate_usd, + "dry_run": response.dry_run, + "fallback_used": response.fallback_used, + } + + +def _enforce_trust_gate( + raw_verdicts: list[ClaimVerdict], + claims: list[ReadmeClaim], + evidence_by_claim: dict[str, ClaimEvidence], +) -> list[ClaimVerdict]: + """For each claim, ensure exactly one verdict exists and that a + no-evidence verdict is *always* ``unsupported``. + + Even if the LLM hallucinates a confident ``verified`` for a claim + with zero retrieved snippets, this function overwrites it. 
This is + the project's core invariant (refusing to verdict without + evidence), encoded as code, not a prompt instruction. + """ + by_id: dict[str, ClaimVerdict] = {v.claim_id: v for v in raw_verdicts} + out: list[ClaimVerdict] = [] + for claim in claims: + ev = evidence_by_claim.get(claim.claim_id) + snippets = ev.snippets if ev else [] + verdict = by_id.get(claim.claim_id) + if not verdict: + verdict = ClaimVerdict( + claim_id=claim.claim_id, + claim_text=claim.claim_text, + verdict=VERDICT_UNSUPPORTED, + confidence=0.0, + evidence_paths=[], + notes="No verdict returned by checker; defaulted to unsupported.", + ) + if not snippets: + # Hard reset: an unsupported verdict must carry no positive + # signal forward: zero confidence, no cited paths. If the LLM + # left a note suggesting verification, replace it with the + # honest one. + verdict.verdict = VERDICT_UNSUPPORTED + verdict.evidence_paths = [] + verdict.confidence = 0.0 + verdict.notes = "No supporting evidence found in the repository." + else: + # Keep a cited path if it names a retrieved snippet, either as a + # bare path or in the `path:line_start-line_end` form that the + # checker prompt asks for. + existing_paths = [ + p + for p in (verdict.evidence_paths or []) + if any(p == s.path or p.startswith(f"{s.path}:") for s in snippets) + ] + if not existing_paths: + existing_paths = [s.path for s in snippets] + verdict.evidence_paths = existing_paths + if verdict.verdict not in ALL_VERDICTS: + verdict.verdict = VERDICT_UNSUPPORTED + if not verdict.claim_text: + verdict.claim_text = claim.claim_text + out.append(verdict) + return out + + +class AuditCriticAgent: + """Adversarial pass over the verdict list. + + Looks for over-confident verdicts and weak evidence chains, and + surfaces likely missed claims. Stateless with respect to the blackboard; takes + the verdict list and original README as input.
+ """ + + def __init__(self, unified: UnifiedLLM, prompt_renderer: Callable[[str, dict[str, Any]], str]): + self.unified = unified + self.render = prompt_renderer + + async def execute( + self, + verdicts: list[ClaimVerdict], + readme_text: str, + ) -> dict[str, Any]: + system_prompt = self.render("audit_critic", {}) + verdicts_payload = [v.model_dump() for v in verdicts] + user_prompt = ( + "Critique the verdict list. Identify weak verdicts (low confidence " + "or shaky evidence), missed claims (factual statements in the README " + "the planner did not capture), and concrete suggestions. Return an " + "AuditCritique JSON object.\n\n" + f"VERDICTS:\n{json.dumps(verdicts_payload, indent=2)}\n\n" + "---BEGIN README---\n" + f"{readme_text}\n" + "---END README---" + ) + response = await self.unified.generate( + prompt=user_prompt, + system_prompt=system_prompt, + mode="quality", + response_schema=_schema_for(AuditCritique), + ) + parsed = _parse_structured(response.content, AuditCritique) + return { + "agent": "audit_critic", + "tier": "head", + "status": "success", + "output": parsed, + "provider": response.provider_used, + "model": response.model, + "latency_ms": response.latency_ms, + "cost_estimate_usd": response.cost_estimate_usd, + "dry_run": response.dry_run, + "fallback_used": response.fallback_used, + } + + +def new_claim_id() -> str: + return f"claim-{uuid.uuid4().hex[:8]}" diff --git a/tests/code_audit/__init__.py b/tests/code_audit/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/code_audit/test_readme_flow.py b/tests/code_audit/test_readme_flow.py new file mode 100644 index 0000000..40d8960 --- /dev/null +++ b/tests/code_audit/test_readme_flow.py @@ -0,0 +1,242 @@ +"""End-to-end test for the README auditor. + +Uses a stubbed ``UnifiedLLM.generate_fn`` rather than DRY_RUN so we can +exercise the full flow (planner → researcher → checker → critic) with +deterministic LLM responses. 
The DRY_RUN placeholder generator returns +empty arrays for any list field, which would leave us with zero claims +and an empty audit — not a useful regression target. +""" + +from __future__ import annotations + +import json +from pathlib import Path +from typing import Any + +import pytest + +from core.orchestrator.readme_flow import ReadmeAuditOrchestrator +from core.schemas_audit import ( + VERDICT_CONTRADICTED, + VERDICT_DRIFTED, + VERDICT_UNSUPPORTED, + VERDICT_VERIFIED, +) +from core.sources.local_repo import LocalRepoAdapter +from providers.unified import UnifiedLLM + +FIXTURE_REPO = Path(__file__).resolve().parent.parent / "fixtures" / "sample_repo" + + +_PLANNER_CLAIMS = { + "claims": [ + { + "claim_id": "claim-01", + "claim_text": "Install via `pip install widgetlib==1.2.0`.", + "claim_type": "install", + "search_hints": ["widgetlib", "1.2.0", "version"], + }, + { + "claim_id": "claim-02", + "claim_text": "Import `Widget` from `widgetlib` and call `render()` to produce HTML.", + "claim_type": "usage", + "search_hints": ["Widget", "render", "widgetlib"], + }, + { + "claim_id": "claim-03", + "claim_text": "Set `WIDGETLIB_DEBUG=1` to enable verbose logging.", + "claim_type": "feature", + "search_hints": ["WIDGETLIB_DEBUG"], + }, + { + "claim_id": "claim-04", + "claim_text": "Run `widgetctl --port 9000` to start the dashboard.", + "claim_type": "usage", + "search_hints": ["widgetctl", "--port"], + }, + { + "claim_id": "claim-05", + "claim_text": "The render pipeline ships with zero dependencies.", + "claim_type": "feature", + "search_hints": ["dependencies"], + }, + { + "claim_id": "claim-06", + "claim_text": "widgetlib never collects telemetry from your users.", + "claim_type": "feature", + "search_hints": ["telemetry", "emit_telemetry", "TELEMETRY_URL"], + }, + ], + "readme_path": str(FIXTURE_REPO / "README.md"), +} + + +_CHECKER_VERDICTS = { + "verdicts": [ + # The checker would normally have to infer drift here; the stub + # encodes the answer key. 
The trust-gate in + # _enforce_trust_gate is what's actually under test for claim-03. + { + "claim_id": "claim-01", + "claim_text": _PLANNER_CLAIMS["claims"][0]["claim_text"], + "verdict": VERDICT_DRIFTED, + "confidence": 0.85, + "evidence_paths": ["pyproject.toml"], + "notes": "README pins widgetlib==1.2.0; pyproject.toml ships version 0.9.0.", + }, + { + "claim_id": "claim-02", + "claim_text": _PLANNER_CLAIMS["claims"][1]["claim_text"], + "verdict": VERDICT_VERIFIED, + "confidence": 0.95, + "evidence_paths": ["widgetlib/widget.py", "widgetlib/__init__.py"], + "notes": "Widget class with render() exists and is exported from package init.", + }, + # claim-03: LLM hallucinates verified — trust gate must overwrite. + { + "claim_id": "claim-03", + "claim_text": _PLANNER_CLAIMS["claims"][2]["claim_text"], + "verdict": VERDICT_VERIFIED, + "confidence": 0.9, + "evidence_paths": ["widgetlib/__init__.py"], + "notes": "Hallucinated by checker — the trust gate must overwrite this.", + }, + { + "claim_id": "claim-04", + "claim_text": _PLANNER_CLAIMS["claims"][3]["claim_text"], + "verdict": VERDICT_DRIFTED, + "confidence": 0.8, + "evidence_paths": ["widgetlib/cli.py"], + "notes": "README documents --port; CLI implements --bind with default port 8000.", + }, + { + "claim_id": "claim-05", + "claim_text": _PLANNER_CLAIMS["claims"][4]["claim_text"], + "verdict": VERDICT_CONTRADICTED, + "confidence": 0.9, + "evidence_paths": ["pyproject.toml"], + "notes": "pyproject.toml declares a jinja2 dependency.", + }, + { + "claim_id": "claim-06", + "claim_text": _PLANNER_CLAIMS["claims"][5]["claim_text"], + "verdict": VERDICT_CONTRADICTED, + "confidence": 0.95, + "evidence_paths": ["widgetlib/telemetry.py"], + "notes": "widgetlib/telemetry.py emits events to a remote endpoint.", + }, + ] +} + + +_CRITIC_RESPONSE = { + "weak_verdicts": [], + "missed_claims": [], + "suggestions": ["Add CI step to run `openworkers audit readme` on every PR."], + "overall_assessment": "Audit caught one verified, 
two drifted, two contradicted, one unsupported.", +} + + +def _make_stub_unified() -> UnifiedLLM: + """Build a UnifiedLLM whose generate routes to a content-aware stub.""" + llm = UnifiedLLM() + llm.dry_run = False # bypass the placeholder path + llm.set_available_providers(["anthropic"]) + + async def fake_generate( + provider: str, + model: str, + prompt: str, + system_prompt: str, + response_schema: Any, + ) -> str: + # Route by which agent's system prompt is in play. + if "README PLANNER" in system_prompt: + return json.dumps(_PLANNER_CLAIMS) + if "README CHECKER" in system_prompt: + return json.dumps(_CHECKER_VERDICTS) + if "AUDIT CRITIC" in system_prompt: + return json.dumps(_CRITIC_RESPONSE) + return "{}" + + llm.set_generate_fn(fake_generate) + return llm + + +@pytest.fixture +def stubbed_unified(monkeypatch) -> UnifiedLLM: + monkeypatch.setenv("DRY_RUN", "false") + monkeypatch.setenv("THESIS_QUALITY_PROVIDER", "anthropic") + monkeypatch.setenv("THESIS_QUALITY_MODEL", "claude-sonnet-4-20250514") + monkeypatch.setenv("THESIS_BALANCED_PROVIDER", "anthropic") + monkeypatch.setenv("THESIS_BALANCED_MODEL", "claude-sonnet-4-20250514") + monkeypatch.setenv("THESIS_CHEAP_PROVIDER", "anthropic") + monkeypatch.setenv("THESIS_CHEAP_MODEL", "claude-sonnet-4-20250514") + return _make_stub_unified() + + +@pytest.mark.asyncio +async def test_local_repo_adapter_finds_identifiers(): + adapter = LocalRepoAdapter(FIXTURE_REPO) + # Real identifier present in source files. The adapter doesn't filter + # the README — that's the orchestrator's job — so just check that + # source-file hits exist in the result set. + widget_hits = adapter.search_any(["Widget", "render"], limit=50) + source_hits = [h for h in widget_hits if h.path.startswith("widgetlib/")] + assert source_hits, "Widget/render must appear in source files, not only the README" + # Fabricated env var: only the README mentions it; no source file does. 
+ debug_hits = adapter.search_any(["WIDGETLIB_DEBUG"], limit=50) + assert all(h.path.endswith("README.md") for h in debug_hits), ( + "WIDGETLIB_DEBUG must not appear anywhere except the README — " + "the orchestrator excludes the audited file so this becomes 'no evidence'." + ) + + +@pytest.mark.asyncio +async def test_readme_audit_end_to_end(stubbed_unified): + orch = ReadmeAuditOrchestrator(unified=stubbed_unified) + report, critique = await orch.audit(repo_path=FIXTURE_REPO) + + by_id: dict[str, Any] = {v.claim_id: v for v in report.verdicts} + assert len(report.verdicts) == 6, "Planner stub seeded 6 claims" + + # Real claim with evidence → verified + assert by_id["claim-02"].verdict == VERDICT_VERIFIED + assert by_id["claim-02"].evidence_paths, "verified verdict must cite evidence" + + # Drifted: version pin and CLI flag + assert by_id["claim-01"].verdict == VERDICT_DRIFTED + assert by_id["claim-04"].verdict == VERDICT_DRIFTED + + # Trust gate: fabricated claim must be unsupported regardless of LLM output + assert by_id["claim-03"].verdict == VERDICT_UNSUPPORTED, ( + "Trust gate failed: checker stub hallucinated 'verified' for a claim with " + "no retrieved evidence, but _enforce_trust_gate should have overwritten it." 
+ ) + assert by_id["claim-03"].evidence_paths == [] + assert by_id["claim-03"].confidence == 0.0 + + # Contradicted: zero-deps + no-telemetry + assert by_id["claim-05"].verdict == VERDICT_CONTRADICTED + assert by_id["claim-06"].verdict == VERDICT_CONTRADICTED + + # Summary tallies match + assert report.summary[VERDICT_VERIFIED] == 1 + assert report.summary[VERDICT_DRIFTED] == 2 + assert report.summary[VERDICT_CONTRADICTED] == 2 + assert report.summary[VERDICT_UNSUPPORTED] == 1 + assert sum(report.summary.values()) == len(report.verdicts) + + # Critic ran + assert critique.suggestions, "critic stub returned at least one suggestion" + + +@pytest.mark.asyncio +async def test_audit_handles_missing_readme(stubbed_unified, tmp_path): + """A repo with no README must still produce a structured report, not crash.""" + (tmp_path / "src.py").write_text("print('hello')\n") + orch = ReadmeAuditOrchestrator(unified=stubbed_unified) + report, critique = await orch.audit(repo_path=tmp_path) + assert report.verdicts == [] + assert report.errors == ["No README found in repo."] + assert critique.weak_verdicts == [] diff --git a/tests/fixtures/sample_repo/README.md b/tests/fixtures/sample_repo/README.md new file mode 100644 index 0000000..1d791a7 --- /dev/null +++ b/tests/fixtures/sample_repo/README.md @@ -0,0 +1,27 @@ +# widgetlib + +A tiny widget library. + +## Installation + +Install via `pip install widgetlib==1.2.0`. + +## Usage + +Import `Widget` from `widgetlib` and call `render()` to produce HTML. + +## Configuration + +Set `WIDGETLIB_DEBUG=1` to enable verbose logging. + +## CLI + +Run `widgetctl --port 9000` to start the dashboard. + +## Performance + +The render pipeline ships with zero dependencies. + +## Telemetry + +widgetlib never collects telemetry from your users. 
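The summary counts asserted in `test_readme_audit_end_to_end` follow from the checker stub once the trust gate demotes claim-03. The sketch below reproduces that arithmetic in isolation. It is only an illustration: the real tally lives in `core/orchestrator/readme_flow.py`, which is not shown here, so `gate`, `summarise`, and the `snippet_counts` values are hypothetical stand-ins rather than the project's code. The only property that matters is that claim-03 has zero retrieved snippets.

```python
from collections import Counter

ALL_VERDICTS = ("verified", "drifted", "unsupported", "contradicted")

# Verdict labels the checker stub returns for the six fixture claims.
stub_verdicts = {
    "claim-01": "drifted",
    "claim-02": "verified",
    "claim-03": "verified",  # hallucinated: no source file mentions WIDGETLIB_DEBUG
    "claim-04": "drifted",
    "claim-05": "contradicted",
    "claim-06": "contradicted",
}

# Hypothetical snippet counts per claim once the audited README is excluded.
snippet_counts = {
    "claim-01": 1,
    "claim-02": 2,
    "claim-03": 0,
    "claim-04": 1,
    "claim-05": 1,
    "claim-06": 1,
}


def gate(verdict: str, n_snippets: int) -> str:
    """Force 'unsupported' when nothing was retrieved or the label is unknown."""
    if n_snippets == 0 or verdict not in ALL_VERDICTS:
        return "unsupported"
    return verdict


def summarise(verdicts: dict[str, str], counts: dict[str, int]) -> dict[str, int]:
    """Tally gated verdicts, reporting every label even when its count is zero."""
    tally = Counter(gate(v, counts.get(cid, 0)) for cid, v in verdicts.items())
    return {name: tally.get(name, 0) for name in ALL_VERDICTS}


summary = summarise(stub_verdicts, snippet_counts)
print(summary)
# {'verified': 1, 'drifted': 2, 'unsupported': 1, 'contradicted': 2}
```

Applying the gate before counting is what turns the stub's two `verified` labels into the single `verified` the test expects.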
diff --git a/tests/fixtures/sample_repo/pyproject.toml b/tests/fixtures/sample_repo/pyproject.toml new file mode 100644 index 0000000..4bf21ce --- /dev/null +++ b/tests/fixtures/sample_repo/pyproject.toml @@ -0,0 +1,8 @@ +[project] +name = "widgetlib" +version = "0.9.0" +description = "Tiny widget library." +requires-python = ">=3.9" +dependencies = [ + "jinja2>=3.1.0", +] diff --git a/tests/fixtures/sample_repo/widgetlib/__init__.py b/tests/fixtures/sample_repo/widgetlib/__init__.py new file mode 100644 index 0000000..932b9b5 --- /dev/null +++ b/tests/fixtures/sample_repo/widgetlib/__init__.py @@ -0,0 +1,5 @@ +"""widgetlib — a tiny widget library.""" + +from widgetlib.widget import Widget + +__all__ = ["Widget"] diff --git a/tests/fixtures/sample_repo/widgetlib/cli.py b/tests/fixtures/sample_repo/widgetlib/cli.py new file mode 100644 index 0000000..e73893e --- /dev/null +++ b/tests/fixtures/sample_repo/widgetlib/cli.py @@ -0,0 +1,22 @@ +"""widgetctl entrypoint. + +README claims `--port 9000` is the dashboard launch flag — but this +implementation actually uses `--bind` and defaults to port 8000. +This drift is intentional for the audit-test fixture. 
+""" + +from __future__ import annotations + +import argparse + + +def main() -> int: + parser = argparse.ArgumentParser(prog="widgetctl") + parser.add_argument("--bind", default="127.0.0.1:8000") + args = parser.parse_args() + print(f"Starting widgetctl on {args.bind}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/fixtures/sample_repo/widgetlib/telemetry.py b/tests/fixtures/sample_repo/widgetlib/telemetry.py new file mode 100644 index 0000000..e09a197 --- /dev/null +++ b/tests/fixtures/sample_repo/widgetlib/telemetry.py @@ -0,0 +1,14 @@ +"""Contradicts the README's 'no telemetry' claim.""" + +from __future__ import annotations + +import os +import urllib.request + + +def emit_telemetry(event: str) -> None: + endpoint = os.environ.get("TELEMETRY_URL", "https://telemetry.example.com/widgetlib") + try: + urllib.request.urlopen(endpoint, data=event.encode(), timeout=1) + except Exception: + pass diff --git a/tests/fixtures/sample_repo/widgetlib/widget.py b/tests/fixtures/sample_repo/widgetlib/widget.py new file mode 100644 index 0000000..b878426 --- /dev/null +++ b/tests/fixtures/sample_repo/widgetlib/widget.py @@ -0,0 +1,11 @@ +"""Widget renders to HTML. Verifies README's Usage claim.""" + +from typing import Any + + +class Widget: + def __init__(self, payload: Any) -> None: + self.payload = payload + + def render(self) -> str: + return f"
{self.payload}
"