vllm-project · franciscojavierarceo · Jun 17, 2026
@@ -0,0 +1,179 @@
+# Cassette Recorder
+
+`record_cassette.py` runs an embedded proxy between the script and an upstream API (OpenAI or vLLM). Every request and response is captured into a YAML cassette for use in replay tests.
+
+## How it works
+
+```
+[record_cassette.py] -> [proxy :7070] -> [OpenAI | vLLM]
+                         (cassette written here)
+```
+
+The proxy intercepts each turn, records the request body and response, then appends a `t<N>` entry to the output YAML.
+
+The recorder is interactive: for each turn it prompts for the input message and sends the request once you press Enter. Run it directly in a terminal and type each prompt by hand, or pipe the prompts in with `printf` or `echo` (one line per turn) to feed all turns non-interactively:
+
+```bash
+# interactive -- type each prompt when asked
+python tests/cassettes/record_cassette.py --mode responses --turns 2 --no-stream --vllm http://localhost:5050 --model Qwen/Qwen3-30B-A3B-FP8 --output out.yaml
+
+# non-interactive -- pipe prompts in (one line per turn)
+printf 'First prompt\nSecond prompt\n' | python tests/cassettes/record_cassette.py --mode responses --turns 2 --no-stream --vllm http://localhost:5050 --model Qwen/Qwen3-30B-A3B-FP8 --output out.yaml
+```
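
The piped form above can also be driven from Python, which may be convenient when the prompts come from a test fixture. The sketch below only builds the argument list and the stdin payload. `build_recorder_invocation` is a hypothetical helper, not part of `record_cassette.py`, and it uses only the flags shown in the command above.

```python
import sys


def build_recorder_invocation(prompts, output, model, vllm_url):
    """Return (argv, stdin_text) for a non-interactive recorder run.

    Hypothetical helper: mirrors the `printf ... | python ...` example,
    with one prompt per line and one line per turn.
    """
    for prompt in prompts:
        # A prompt containing a newline would be read as two turns.
        if "\n" in prompt:
            raise ValueError("prompts must not contain newlines")
    argv = [
        sys.executable,
        "tests/cassettes/record_cassette.py",
        "--mode", "responses",
        "--turns", str(len(prompts)),
        "--no-stream",
        "--vllm", vllm_url,
        "--model", model,
        "--output", output,
    ]
    stdin_text = "".join(prompt + "\n" for prompt in prompts)
    return argv, stdin_text


argv, stdin_text = build_recorder_invocation(
    ["First prompt", "Second prompt"],
    output="out.yaml",
    model="Qwen/Qwen3-30B-A3B-FP8",
    vllm_url="http://localhost:5050",
)
# To actually record (needs a running vLLM server):
#   import subprocess
#   subprocess.run(argv, input=stdin_text, text=True, check=True)
```

Deriving `--turns` from the number of prompts keeps the two in step, so the recorder should not end up waiting on stdin for a prompt that was never piped in.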
+
+The recorder scripts (`record_reasoning_cassettes.sh`, `record_tool_call_cassettes.sh`, etc.) use `printf` to feed fixed prompts per test so no manual input is needed.
+
+## Modes
+
+| Mode | Description |
+|------|-------------|
+| `responses` | Chains turns via `previous_response_id`. The only mode supported with `--vllm`, `--tools`, and `--tool-choice`. |
+| `conv` | Creates a conversation object, passes `conversation` id each turn. |
+| `isolation` | Two independent conversations (A and B) recorded into one cassette. |
+| `mixed` | Turn 1 uses `conversation` id, turns 2+ switch to `previous_response_id`. |
+| `store_true_then_store_false` | Turn 1: `store=true` with conversation id. Remaining turns: `store=false`, still pass conversation id. |
+
+## CLI options
+
+```
+--turns N              Number of turns
+--output PATH          Output YAML path
+--mode MODE            responses | conv | isolation | mixed | store_true_then_store_false  (default: conv)
+--stream / --no-stream Streaming or non-streaming (default: streaming)
+--model NAME           Model name sent in requests
+--no-store             Set store=false
+--vllm URL             vLLM upstream, e.g. http://localhost:8000 (responses mode only)
+--openai URL           OpenAI upstream (default https://api.openai.com)
+--tools FILE           JSON file containing a tools array (responses mode only)
+--tool-choice VALUE    "auto", "none", "required", or JSON e.g. '{"type":"function","name":"foo"}' (responses mode only)
+--proxy-port PORT      Local proxy port (default 7070)
+--branch-from TURN     Branch from this turn's response id (repeatable)
+--branch-turn-number N First turn number for the corresponding branch (repeatable)
+```
+
+## Cassette YAML structure
+
+Each cassette has a `turns` list. One entry is appended per request.
+
+**Single turn (`--turns 1`, non-streaming):**
+
+```yaml
+turns:
+- filename: t1
+  request:
+    method: POST
+    path: /v1/responses
+    body:
+      model: Qwen/Qwen3-30B-A3B-FP8
+      input: Reply with exactly one word: HELLO
+      stream: false
+      store: true
+    headers:
+      content-type: application/json
+    query_params: {}
+  response:
+    status_code: 200
+    headers:
+      content-type: application/json
+    body:
+      id: resp_abc123
+      output: [...]
+      usage: {...}
+```
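
The layout above maps directly onto nested dictionaries. Below is a minimal sketch of how one turn entry could be assembled, assuming only the field names shown in the example. `make_turn_entry` is a hypothetical name, and the recorder's actual implementation may differ.

```python
def make_turn_entry(n, path, request_body, status_code, response_body):
    """Build one `turns` entry in the shape shown above (illustrative only)."""
    return {
        "filename": f"t{n}",
        "request": {
            "method": "POST",
            "path": path,
            "body": request_body,
            "headers": {"content-type": "application/json"},
            "query_params": {},
        },
        "response": {
            "status_code": status_code,
            "headers": {"content-type": "application/json"},
            "body": response_body,
        },
    }


cassette = {"turns": []}
cassette["turns"].append(
    make_turn_entry(
        1,
        "/v1/responses",
        {
            "model": "Qwen/Qwen3-30B-A3B-FP8",
            "input": "Reply with exactly one word: HELLO",
            "stream": False,
            "store": True,
        },
        200,
        {"id": "resp_abc123"},
    )
)
```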
+
+**Two turns (`--turns 2`, non-streaming) -- `t2` adds `previous_response_id`:**
+
+```yaml
+turns:
+- filename: t1
+  request:
+    body:
+      input: "Remember the word APPLE. Just say: OK"
+      store: true
+  response:
+    body:
+      id: resp_abc123
+
+- filename: t2
+  request:
+    body:
+      input: What word did I ask you to remember?
+      previous_response_id: resp_abc123
+  response:
+    body:
+      id: resp_def456
+```
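
A replay test may want to confirm that a recorded cassette really is chained this way. The following sketch checks that each turn sends the `id` returned by the turn before it. It assumes the cassette has already been parsed into a dict (for example by a YAML loader) and that it is a linear `responses`-mode recording with `store: true` and no branches. `check_response_chain` is a hypothetical helper.

```python
def check_response_chain(cassette):
    """Return a list of problems found in the previous_response_id chain."""
    problems = []
    prev_id = None  # the first turn is expected to send no previous_response_id
    for turn in cassette["turns"]:
        sent = turn["request"]["body"].get("previous_response_id")
        if sent != prev_id:
            problems.append(
                f"{turn['filename']}: expected previous_response_id="
                f"{prev_id!r}, got {sent!r}"
            )
        prev_id = turn["response"]["body"].get("id")
    return problems


cassette = {
    "turns": [
        {
            "filename": "t1",
            "request": {"body": {"input": "Remember the word APPLE. Just say: OK", "store": True}},
            "response": {"body": {"id": "resp_abc123"}},
        },
        {
            "filename": "t2",
            "request": {
                "body": {
                    "input": "What word did I ask you to remember?",
                    "previous_response_id": "resp_abc123",
                }
            },
            "response": {"body": {"id": "resp_def456"}},
        },
    ]
}
problems = check_response_chain(cassette)  # empty list when the chain is intact
```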
+
+**Tool call turn -- `tool_choice` and `tools` appear in the request body:**
+
+```yaml
+turns:
+- filename: t1
+  request:
+    body:
+      input: What is the NVIDIA stock price?
+      tool_choice: auto
+      tools:
+      - type: function
+        name: get_stock_price
+        description: ...
+        parameters: {...}
+  response:
+    body:
+      output:
+      - type: function_call
+        name: get_stock_price
+        arguments: '{"ticker": "NVDA"}'
+```
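
Note that `arguments` is recorded as a JSON string, not as a mapping, so a consumer has to decode it separately. This sketch pulls the function calls out of a response body shaped like the example above. `extract_function_calls` is a hypothetical helper name.

```python
import json


def extract_function_calls(response_body):
    """Return (name, decoded_arguments) for each function_call output item."""
    calls = []
    for item in response_body.get("output", []):
        if item.get("type") == "function_call":
            # `arguments` is a JSON-encoded string and needs its own decode step.
            calls.append((item["name"], json.loads(item["arguments"])))
    return calls


body = {
    "output": [
        {
            "type": "function_call",
            "name": "get_stock_price",
            "arguments": '{"ticker": "NVDA"}',
        }
    ]
}
calls = extract_function_calls(body)
```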
+
+**Streaming turn -- `response.body` is replaced by `response.sse`, a list of raw SSE lines:**
+
+```yaml
+turns:
+- filename: t1
+  request:
+    body:
+      stream: true
+  response:
+    status_code: 200
+    headers:
+      content-type: text/event-stream; charset=utf-8
+    sse:
+    - "event: response.created\n"
+    - "data: {...}\n"
+    - "event: response.output_text.delta\n"
+    - "data: {...}\n"
+    - "event: response.completed\n"
+    - "data: {...}\n"
+```
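
Because `sse` holds raw lines, a replay test has to group them back into events. The sketch below is a simplified parser, not a full implementation of the SSE format: it handles only `event:` and `data:` lines and treats either a blank line or the next `event:` line as the end of an event. `iter_sse_events` is a hypothetical helper, and the `data` payloads in the usage example are placeholders, since the cassette example above elides them.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data_text) pairs from raw SSE lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            # A new event name ends any event that is still pending.
            if event is not None or data:
                yield event, "\n".join(data)
                data = []
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and (event is not None or data):
            # A blank line is the standard SSE event terminator.
            yield event, "\n".join(data)
            event, data = None, []
    if event is not None or data:
        yield event, "\n".join(data)


lines = [
    "event: response.created\n",
    'data: {"type": "response.created"}\n',  # placeholder payload
    "event: response.completed\n",
    'data: {"type": "response.completed"}\n',  # placeholder payload
]
events = list(iter_sse_events(lines))
event_names = [name for name, _ in events]
payloads = [json.loads(text) for _, text in events]
```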
+
+## Recorder scripts
+
+| Script | Cassettes | Backend |
+|--------|-----------|---------|
+| `record_text_only_cassettes.sh` | 10 text-only cassettes (responses + conv modes, streaming + non-streaming) | OpenAI (`OPENAI_API_KEY`) |
+| `record_reasoning_cassettes.sh` | 2 reasoning cassettes (single turn, streaming + non-streaming) | vLLM |
+| `record_tool_call_cassettes.sh` | 8 tool-call cassettes (4 tool_choice modes x streaming + non-streaming) | vLLM |
+
+### Text-only (OpenAI)
+
+```bash
+OPENAI_API_KEY=sk-... bash tests/cassettes/record_text_only_cassettes.sh
+MODEL=gpt-4o-mini OPENAI_API_KEY=sk-... bash tests/cassettes/record_text_only_cassettes.sh
+```
+
+### Reasoning (vLLM)
+
+```bash
+vllm serve Qwen/Qwen3-30B-A3B-FP8 --reasoning-parser deepseek_r1 --port 5050 > server.log 2>&1
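+# vllm serve runs in the foreground; run the recorder from a second terminal once the server is ready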
+VLLM_URL=http://0.0.0.0:5050 MODEL=Qwen/Qwen3-30B-A3B-FP8 bash tests/cassettes/record_reasoning_cassettes.sh
+```
+
+### Tool calls (vLLM)
+
+```bash
+vllm serve Qwen/Qwen3-30B-A3B-FP8 --tool-call-parser hermes --enable-auto-tool-choice --port 5050 > server.log 2>&1
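+# vllm serve runs in the foreground; run the recorder from a second terminal once the server is ready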
+VLLM_URL=http://0.0.0.0:5050 MODEL=Qwen/Qwen3-30B-A3B-FP8 bash tests/cassettes/record_tool_call_cassettes.sh
+```
@@ -325,6 +325,13 @@ def _prompt(label: str) -> str:
         sys.exit(0)
 
 
+def _inject_tools(body: dict, tools: list | None, tool_choice: Any) -> None:
+    if tools is not None:
+        body["tools"] = tools
+    if tool_choice is not None:
+        body["tool_choice"] = tool_choice
+
+
 def run_conv(
     client: httpx.Client,
     turns: int,
@@ -470,6 +477,8 @@ def run_responses(
     store: bool,
     branches: list[tuple[int, int | None]],
     proxy_url: str,
+    tools: list | None = None,
+    tool_choice: Any = None,
 ) -> None:
     response_ids: dict[int, str] = {}
     branch_map: dict[int, int] = {}
@@ -497,6 +506,7 @@ def run_responses(
         body: dict = {"model": model, "input": prompt, "stream": stream, "store": store}
         if previous_response_id and store:
             body["previous_response_id"] = previous_response_id
+        _inject_tools(body, tools, tool_choice)
         response_id = _send(client, body, stream, proxy_url)
         previous_response_id = response_id if store else None
         if response_id:
@@ -522,6 +532,7 @@ def run_responses(
             "store": store,
             "previous_response_id": branch_resp_id,
         }
+        _inject_tools(body, tools, tool_choice)
         _send(client, body, stream, proxy_url)
 
 
@@ -593,6 +604,21 @@ def run_responses(
     default=None,
     help="vLLM upstream URL, e.g. http://localhost:8000 (responses mode only, no auth).",
 )
+@click.option(
+    "--tools",
+    "tools_file",
+    metavar="FILE",
+    default=None,
+    type=click.Path(exists=True),
+    help="Path to a JSON file containing a tools array to inject into every request (responses mode only).",
+)
+@click.option(
+    "--tool-choice",
+    "tool_choice_raw",
+    metavar="VALUE",
+    default=None,
+    help='tool_choice value: "auto", "none", "required", or JSON e.g. \'{"type":"function","name":"foo"}\'.',
+)
 def main(
     turns: int,
     output: str,
@@ -605,6 +631,8 @@ def main(
     proxy_port: int,
     openai_url: str | None,
     vllm_url: str | None,
+    tools_file: str | None,
+    tool_choice_raw: str | None,
 ) -> None:
     """Interactive multi-turn cassette recorder (proxy embedded)."""
     if branch_turn_number and not branch_from:
@@ -625,6 +653,21 @@ def main(
             f"--vllm is only supported with --mode responses (got --mode {mode})."
         )
 
+    tools: list | None = None
+    if tools_file:
+        with open(tools_file, encoding="utf-8") as f:
+            tools = json.load(f)
+        if not isinstance(tools, list):
+            raise click.UsageError("--tools file must contain a JSON array.")
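
The validation above means the file must hold a bare JSON array, not an object that wraps one. The standalone sketch below illustrates the accepted and rejected shapes. `load_tools` is a hypothetical stand-in for the inline code, it raises `ValueError` where the script raises `click.UsageError` so that it needs only the standard library, and the tool definition is invented for illustration.

```python
import json
import os
import tempfile


def load_tools(path):
    """Load a tools array from a JSON file, rejecting non-array content."""
    with open(path, encoding="utf-8") as f:
        tools = json.load(f)
    if not isinstance(tools, list):
        raise ValueError("--tools file must contain a JSON array.")
    return tools


# Hypothetical tool definition, for illustration only.
example_tools = [
    {
        "type": "function",
        "name": "get_stock_price",
        "description": "Look up the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    }
]

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "tools.json")
    with open(good, "w", encoding="utf-8") as f:
        json.dump(example_tools, f)
    loaded = load_tools(good)

    # Wrapping the array in an object is rejected.
    bad = os.path.join(tmp, "not_a_list.json")
    with open(bad, "w", encoding="utf-8") as f:
        json.dump({"tools": example_tools}, f)
    try:
        load_tools(bad)
        rejected = False
    except ValueError:
        rejected = True
```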
+
+    tool_choice: Any = None
+    if tool_choice_raw:
+        stripped = tool_choice_raw.strip()
+        if stripped.startswith("{") or stripped.startswith("["):
+            tool_choice = json.loads(stripped)
+        else:
+            tool_choice = stripped
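
As written, a malformed JSON value for `--tool-choice` would surface as an uncaught `json.JSONDecodeError`. One possible way to report it more cleanly is sketched below. `parse_tool_choice` is a hypothetical helper, and it raises `ValueError` to stay within the standard library. In the script itself `click.UsageError` would be the natural choice.

```python
import json


def parse_tool_choice(raw):
    """Return None, a plain string, or decoded JSON for a --tool-choice value."""
    if not raw:
        return None
    stripped = raw.strip()
    if stripped.startswith("{") or stripped.startswith("["):
        try:
            return json.loads(stripped)
        except json.JSONDecodeError as exc:
            raise ValueError(f"--tool-choice is not valid JSON: {exc.msg}") from exc
    # Bare words such as "auto", "none", or "required" pass through unchanged.
    return stripped


choice_auto = parse_tool_choice("auto")
choice_func = parse_tool_choice(' {"type":"function","name":"foo"} ')
```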
+
     if vllm_url:
         target = vllm_url.rstrip("/")
         headers: dict = {}
@@ -660,7 +703,7 @@ def main(
             elif mode == "mixed":
                 run_mixed(client, turns, model, stream, store, proxy_url)
             elif mode == "responses":
-                run_responses(client, turns, model, stream, store, branches, proxy_url)
+                run_responses(client, turns, model, stream, store, branches, proxy_url, tools, tool_choice)
             elif mode == "store_true_then_store_false":
                 run_store_true_then_store_false(client, turns, model, stream, proxy_url)
     finally: