12 changes: 12 additions & 0 deletions README.md
@@ -4,6 +4,9 @@ Monorepo for the explicit-run `vitest-evals` shape:

- `packages/vitest-evals`: core suite API, judges, normalized harness/session
types, and reporter
- `packages/http`: engine-neutral HTTP interception and replay helpers
- `packages/http-vercel-sandbox`: Vercel Sandbox forwarded-request adapter for
`@vitest-evals/http`
- `packages/harness-ai-sdk`: `ai-sdk`-focused harness adapter
- `packages/harness-openai-agents`: `@openai/agents`-focused harness adapter
- `packages/harness-pi-ai`: `pi-ai`-focused harness adapter with tool replay
@@ -33,6 +36,8 @@ Monorepo for the explicit-run `vitest-evals` shape:
```text
packages/
  vitest-evals/
  http/
  http-vercel-sandbox/
  harness-ai-sdk/
  harness-openai-agents/
  harness-pi-ai/
@@ -251,6 +256,13 @@ otherwise. `record` always calls live and overwrites recordings; use it to
refresh fixtures intentionally. Recordings are stored under
`.vitest-evals/recordings/<tool-name>/`.

For applications that make outbound service calls outside local tool wrappers,
`@vitest-evals/http` exposes an engine-neutral HTTP interceptor primitive.
Vercel Sandbox support lives in `@vitest-evals/http-vercel-sandbox`, while
Docker proxies, MSW servers, Playwright routes, or fetch shims can adapt
traffic into `{ request, upstreamUrl, provider, engine }` and share fixture
chaining plus replay-aware recording.

`pnpm evals` fans out to each workspace package or app that exposes an `evals`
script. The shared eval CLI defaults replay to `auto` and writes recordings
under `.vitest-evals/recordings`, unless those environment variables are
34 changes: 34 additions & 0 deletions docs/architecture.md
@@ -25,8 +25,11 @@ packages/
harness.ts
index.ts
reporter.ts
replay.ts
judges/
legacy/
http/
http-vercel-sandbox/
harness-ai-sdk/
harness-openai-agents/
harness-pi-ai/
@@ -67,6 +70,37 @@ tool counts, retries, provider, and model. Provider-specific cost estimates are
not normalized because pricing semantics vary by runtime and can be stale; if a
harness needs to retain them, store them under `usage.metadata`.

### `packages/http`

Defines the engine-neutral HTTP interceptor package:

- `HttpInterceptRequest`
- `HttpInterceptor`
- `createHttpInterceptor(...)`
- `executeHttpWithReplay(...)`
- `createHttpReplayInterceptor(...)`

Engines such as Docker egress proxies, MSW, Playwright routing, or fetch shims
own the transport-specific work of constructing a Fetch `Request` for the
intended upstream URL. The package owns the fixture chain, deterministic
unhandled responses, and replay-backed request/response cassette behavior.

HTTP replay uses the same `VITEST_EVALS_REPLAY_MODE` and
`VITEST_EVALS_REPLAY_DIR` settings as tool replay, but it records serialized
HTTP request/response pairs instead of local tool inputs and outputs.

### `packages/http-vercel-sandbox`

Adapts Vercel Sandbox forwarded HTTP requests into `@vitest-evals/http`:

- validates forwarded host, scheme, port, and path headers
- strips Vercel proxy-only and hop-by-hop headers from the upstream request
- applies app-owned credential/header transforms
- routes traffic through interceptors before optional live fetch fallback

It intentionally does not own Vercel OIDC verification, requester
authorization, credential issuance, or sandbox network policy.
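
The header-stripping step above can be illustrated with a small standalone sketch. It is not the package's implementation: `stripProxyHeadersSketch` and the `x-example-forwarded-` prefix are hypothetical names chosen for illustration, since the actual Vercel forwarded header names are not listed here. The hop-by-hop list is the conventional set from the HTTP specifications.

```ts
// Conventional hop-by-hop headers that a proxy should not forward upstream.
const HOP_BY_HOP_HEADERS = [
  "connection",
  "keep-alive",
  "proxy-authenticate",
  "proxy-authorization",
  "te",
  "trailer",
  "transfer-encoding",
  "upgrade",
];

// Hypothetical prefix standing in for proxy-only forwarded headers.
const PROXY_ONLY_PREFIX = "x-example-forwarded-";

function stripProxyHeadersSketch(incoming: Headers): Headers {
  // Header names listed in `Connection` are hop-by-hop as well.
  const connectionTokens = (incoming.get("connection") ?? "")
    .split(",")
    .map((token) => token.trim().toLowerCase())
    .filter(Boolean);
  const blocked = new Set([...HOP_BY_HOP_HEADERS, ...connectionTokens]);

  const upstream = new Headers();
  incoming.forEach((value, name) => {
    const lower = name.toLowerCase();
    if (blocked.has(lower) || lower.startsWith(PROXY_ONLY_PREFIX)) {
      return; // Drop proxy-only and hop-by-hop headers.
    }
    upstream.append(lower, value);
  });
  return upstream;
}
```

App-owned credential headers would then be applied to the returned `Headers` before the upstream `Request` is constructed.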

### `packages/vitest-evals/src/index.ts`

Defines the harness-first public API:
13 changes: 13 additions & 0 deletions docs/development-guide.md
@@ -32,6 +32,7 @@ The repository is now harness-first:
When changing behavior, decide first which surface you are touching:

- root harness/judge API
- HTTP interception/replay packages
- reporter output
- GitHub JSON post-processing output
- a first-party harness package
@@ -59,6 +60,18 @@ Owns:
- reporter integration
- legacy compatibility exports

### `packages/http`

Owns engine-neutral HTTP interception and request/response replay. It depends
on the core replay primitive but stays outside the root package so engine
adapters can evolve independently.

### `packages/http-vercel-sandbox`

Owns Vercel Sandbox forwarded-request adaptation for `@vitest-evals/http`.
Keep Vercel-specific forwarded header parsing here, not in core and not in the
engine-neutral HTTP package.

### `packages/harness-ai-sdk`

Owns:
9 changes: 9 additions & 0 deletions docs/testing.md
@@ -53,6 +53,15 @@ Cover:
- matcher behavior for `toSatisfyJudge(...)`
- any task metadata the reporter depends on

### HTTP Interceptor Package Changes

Cover:

- interceptor chaining and pass-through behavior
- request/response serialization and redaction
- replay modes, cache keys, and recording metadata
- engine adapter assumptions such as cloned request bodies

### Reporter Changes

Cover:
1 change: 1 addition & 0 deletions packages/docs/astro.config.mjs
@@ -53,6 +53,7 @@ export default defineConfig({
],
},
{ label: "Tool Replay", link: "/docs/tool-replay" },
{ label: "HTTP Interceptors", link: "/docs/http-interceptors" },
{ label: "GitHub Reporting", link: "/docs/github" },
],
},
4 changes: 4 additions & 0 deletions packages/docs/src/content/docs/docs.mdx
@@ -89,6 +89,10 @@ keeps the documentation focused on the adapter boundary:
<strong>Tool Replay</strong>
<span>Record deterministic tool calls without hiding model behavior.</span>
</a>
<a class="api-link-card" href="/docs/http-interceptors/">
<strong>HTTP Interceptors</strong>
<span>Mock, record, or replay outbound HTTP through sandbox, proxy, or browser engines.</span>
</a>
<a class="api-link-card" href="/docs/github/">
<strong>GitHub Reporting</strong>
<span>Publish eval summaries and checks from workflow JSON output.</span>
199 changes: 199 additions & 0 deletions packages/docs/src/content/docs/docs/http-interceptors.mdx
@@ -0,0 +1,199 @@
---
title: HTTP Interceptors
description: Intercept, mock, record, or replay outbound HTTP through any engine that can expose Fetch requests.
editUrl: false
---

HTTP interceptors are the request-level sibling of tool replay. Use them when
an eval exercises a full application or sandbox where outbound service calls do
not pass through a local tool wrapper.

The primitive is engine-neutral: Vercel Sandbox forwarding, a Docker egress
proxy, MSW, Playwright routing, or a fetch shim can all adapt outbound traffic
into the same `HttpInterceptRequest` shape.

## Interceptor Shape

An interceptor receives the upstream `Request`, the parsed `upstreamUrl`, and
optional `provider` and `engine` labels. Return a `Response` to handle the
request, or `undefined` to let the next interceptor or the live transport decide.

```ts title="evals/interceptHttp.ts"
import {
  createHttpFixtureInterceptor,
  createHttpInterceptor,
  httpFixture,
  unhandledHttpResponse,
  type HttpInterceptRequest,
} from "@vitest-evals/http";

async function githubFixture(
  input: HttpInterceptRequest,
): Promise<Response | undefined> {
  if (
    input.provider !== "github" ||
    input.upstreamUrl.hostname !== "api.github.com"
  ) {
    return undefined;
  }

  if (input.request.method === "GET" && input.upstreamUrl.pathname === "/user") {
    return Response.json({ login: "eval-user" });
  }

  return new Response("missing GitHub fixture\n", { status: 501 });
}

export const interceptHttp = createHttpInterceptor([githubFixture], {
  unhandled: unhandledHttpResponse,
});
```

The composer clones the request before each interceptor so one fixture can
inspect a body without consuming it for the next fixture.
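
The cloning behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: `composeInterceptorsSketch`, `SketchInput`, and `SketchInterceptor` are hypothetical names, and the real composer also handles options such as `unhandled`.

```ts
// Hypothetical shapes mirroring the `{ request, upstreamUrl, provider, engine }` input.
type SketchInput = {
  request: Request;
  upstreamUrl: URL;
  provider?: string;
  engine?: string;
};
type SketchInterceptor = (
  input: SketchInput,
) => Promise<Response | undefined> | Response | undefined;

function composeInterceptorsSketch(
  interceptors: SketchInterceptor[],
): (input: SketchInput) => Promise<Response | undefined> {
  return async (input) => {
    for (const interceptor of interceptors) {
      // Each interceptor reads from its own clone, so the original body
      // stays unread for the next interceptor or the live transport.
      const response = await interceptor({
        ...input,
        request: input.request.clone(),
      });
      if (response) {
        return response;
      }
    }
    // No interceptor handled the request.
    return undefined;
  };
}
```

Because only clones are consumed, a fixture that calls `request.text()` or `request.json()` and then returns `undefined` does not prevent a later fixture from reading the same body.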

Use `createHttpFixtureInterceptor()` when direct fixture injection should be
declarative:

```ts title="evals/interceptHttp.ts"
const staticFixtures = createHttpFixtureInterceptor([
  httpFixture.get("/health", Response.json({ ok: true })),
  httpFixture.post(
    {
      hostname: "api.github.com",
      pathname: "/graphql",
    },
    async (input) =>
      Response.json({
        data: {
          request: await input.request.json(),
          viewer: { login: "eval-user" },
        },
      }),
  ),
]);
```

## Replay HTTP

Use `createHttpReplayInterceptor()` when the request should be recorded on a
cache miss and replayed on later runs. It uses the same
`VITEST_EVALS_REPLAY_MODE` and `VITEST_EVALS_REPLAY_DIR` settings as tool
replay.

```ts title="evals/interceptHttp.ts"
import {
  createHttpInterceptor,
  createHttpReplayInterceptor,
  httpFixture,
} from "@vitest-evals/http";

// `staticFixtures` and `githubFixture` come from the earlier examples in this file.
export const interceptHttp = createHttpInterceptor([
  staticFixtures,
  githubFixture,
  createHttpReplayInterceptor({
    name: "sandbox-egress",
    replay: {
      version: "v1",
      key: (request) => ({
        method: request.method,
        url: request.url,
        body: request.body ?? null,
      }),
    },
  }),
]);
```

Direct fixtures should usually come before replay so that hand-authored
responses win and only the remaining traffic is recorded or replayed. By
default, HTTP replay keys on method, URL, and body. Recordings include request
and response headers and bodies, with common sensitive headers such as
`authorization`, `cookie`, and `set-cookie` redacted. Use `sanitize` for body
redaction or fixture-specific response trimming. Use `redactHeaders` when a
contract test needs a narrower header redaction list, or `redactHeaders: false`
when the cassette is already fully sanitized elsewhere.
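
To make the redaction and keying rules concrete, here is a minimal standalone sketch. It assumes serialized headers are a plain string map and that redacted values are replaced with a `"[REDACTED]"` placeholder; `redactHeadersSketch`, `replayKeySketch`, and that placeholder are hypothetical and may not match what the package actually writes to a cassette.

```ts
// Default list taken from the sensitive headers named above.
const DEFAULT_REDACTED_HEADERS = ["authorization", "cookie", "set-cookie"];

function redactHeadersSketch(
  headers: Record<string, string>,
  redact: string[] | false = DEFAULT_REDACTED_HEADERS,
): Record<string, string> {
  if (redact === false) {
    // Caller asserts the recording is already sanitized elsewhere.
    return { ...headers };
  }
  const blocked = new Set(redact.map((name) => name.toLowerCase()));
  const result: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so compare in lowercase.
    result[name] = blocked.has(name.toLowerCase()) ? "[REDACTED]" : value;
  }
  return result;
}

// Builds a cache key from method, URL, and body. Uppercasing the method is
// an assumption of this sketch, not documented package behavior.
function replayKeySketch(request: {
  method: string;
  url: string;
  body?: string | null;
}): string {
  return JSON.stringify({
    method: request.method.toUpperCase(),
    url: request.url,
    body: request.body ?? null,
  });
}
```

Passing a shorter list narrows redaction to just those headers, and two requests with the same method, URL, and body map to the same key.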

## Vercel Sandbox

Use `@vitest-evals/http-vercel-sandbox` when Vercel Sandbox forwards egress
traffic back to your app. The adapter parses Vercel's forwarded headers,
reconstructs the intended upstream `Request`, strips proxy-only headers, and
then calls the generic HTTP interceptor chain.

```ts title="api/internal/sandbox-egress.ts"
import {
  createHttpInterceptor,
  createHttpReplayInterceptor,
  httpFixture,
} from "@vitest-evals/http";
import { proxyVercelSandboxHttp } from "@vitest-evals/http-vercel-sandbox";

const interceptHttp = createHttpInterceptor([
  createHttpReplayInterceptor({
    name: "sandbox-egress",
    replay: {
      version: "v1",
    },
  }),
]);

// verifyVercelOidc, providerForHost, and credentialHeadersFor are app-owned
// helpers, not part of these packages.
export async function ALL(request: Request): Promise<Response> {
  await verifyVercelOidc(request);

  return await proxyVercelSandboxHttp(request, {
    fixtures: [
      httpFixture.get("/health", Response.json({ ok: true })),
    ],
    interceptHttp,
    provider: ({ upstreamUrl }) => providerForHost(upstreamUrl.hostname),
    headers: ({ provider, upstreamUrl }) =>
      credentialHeadersFor(provider, upstreamUrl),
  });
}
```

Your app still owns OIDC verification, requester authorization, credential
issuance, and sandbox network policy. The package only adapts Vercel's
forwarded request format into `@vitest-evals/http`.

## Other Engine Adapters

Engine adapters should keep their own transport responsibilities. They should
only call the interceptor after they have reconstructed the upstream request
the app intended to make.

```ts title="sandbox-egress.ts"
import { interceptHttp } from "./evals/interceptHttp";

// forwardedUrlFromEngine, providerForHost, and credentialHeadersFor are
// engine- or app-owned helpers, not part of `@vitest-evals/http`.
export async function handleSandboxEgress(proxyRequest: Request) {
  const upstreamUrl = forwardedUrlFromEngine(proxyRequest);
  const provider = providerForHost(upstreamUrl.hostname);
  // GET and HEAD requests cannot carry a body, so only buffer one otherwise.
  const body =
    proxyRequest.body &&
    proxyRequest.method !== "GET" &&
    proxyRequest.method !== "HEAD"
      ? await proxyRequest.arrayBuffer()
      : undefined;
  const request = new Request(upstreamUrl, {
    method: proxyRequest.method,
    headers: credentialHeadersFor(provider, upstreamUrl),
    ...(body ? { body } : {}),
  });

  const intercepted = await interceptHttp({
    engine: "vercel-sandbox",
    provider,
    request,
    upstreamUrl,
  });
  if (intercepted) {
    return intercepted;
  }

  // No interceptor handled the request, so fall back to the live network.
  return fetch(request);
}
```

That keeps Vercel-specific forwarded headers, Docker proxy routing, browser
route APIs, credential injection, and live network policy outside
`vitest-evals`, while still sharing fixture and replay behavior across engines.
46 changes: 46 additions & 0 deletions packages/http-vercel-sandbox/README.md
@@ -0,0 +1,46 @@
# @vitest-evals/http-vercel-sandbox

Vercel Sandbox HTTP adapter for `@vitest-evals/http`.

## Install

```sh
npm install -D vitest-evals @vitest-evals/http @vitest-evals/http-vercel-sandbox
```

## Usage

```ts
import {
  createHttpInterceptor,
  createHttpReplayInterceptor,
  httpFixture,
} from "@vitest-evals/http";
import { proxyVercelSandboxHttp } from "@vitest-evals/http-vercel-sandbox";

const interceptHttp = createHttpInterceptor([
  createHttpReplayInterceptor({
    name: "sandbox-egress",
    replay: true,
  }),
]);

// providerForHost and credentialHeadersFor are app-owned helpers, not part
// of this package.
export async function ALL(request: Request): Promise<Response> {
  return await proxyVercelSandboxHttp(request, {
    fixtures: [
      httpFixture.get("/health", Response.json({ ok: true })),
    ],
    interceptHttp,
    provider: ({ upstreamUrl }) => providerForHost(upstreamUrl.hostname),
    headers: ({ provider, upstreamUrl }) =>
      credentialHeadersFor(provider, upstreamUrl),
  });
}
```

This package reconstructs Vercel Sandbox forwarded requests into the generic
`HttpInterceptRequest` shape. Your app remains responsible for OIDC
verification, requester authorization, and credential policy.

Direct `fixtures` run before `interceptHttp`, so Vercel Sandbox evals can mix
hand-authored responses with record/replay fallback.