---
name: chat-perf
description: Run chat perf benchmarks and memory leak checks against the local dev build or any published VS Code version. Use when investigating chat rendering regressions, validating perf-sensitive changes to chat UI, or checking for memory leaks in the chat response pipeline.
---

# Chat Performance Testing

## When to use

- Before/after modifying chat rendering code (`chatListRenderer.ts`, `chatInputPart.ts`, markdown rendering)
- When changing the streaming response pipeline or SSE processing
- When modifying disposable/lifecycle patterns in chat components
- To compare performance between two VS Code releases
- In CI to gate PRs that touch chat UI code

## Quick start

```bash
# Run perf regression test (compares local dev build vs VS Code 1.115.0):
npm run perf:chat -- --scenario text-only --runs 3

# Run all scenarios with no baseline (just measure):
npm run perf:chat -- --no-baseline --runs 3

# Run memory leak check (10 messages in one session):
npm run perf:chat-leak

# Run leak check with more messages for accuracy:
npm run perf:chat-leak -- --messages 20 --verbose
```

## Perf regression test

**Script:** `scripts/chat-simulation/test-chat-perf-regression.js`
**npm:** `npm run perf:chat`

Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares.

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--runs <n>` | `5` | Runs per scenario. More = more stable. Use 5+ for CI. |
| `--scenario <id>` / `-s` | all | Scenario to test (repeatable). See `common/perf-scenarios.js`. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. Accepts path or version (`1.110.0`, `insiders`, commit hash). |
| `--baseline <path>` | — | Compare against a previously saved baseline JSON file. |
| `--baseline-build <ver>` | `1.115.0` | Version to download and benchmark as baseline. |
| `--no-baseline` | — | Skip baseline comparison entirely. |
| `--save-baseline` | — | Save results as the new baseline (requires `--baseline <path>`). |
| `--resume <path>` | — | Resume a previous run, adding more iterations to increase confidence. |
| `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
| `--no-cache` | — | Ignore cached baseline data, always run fresh. |
| `--ci` | — | CI mode: write Markdown summary to `ci-summary.md` (implies `--no-cache`). |
| `--verbose` | — | Print per-run details including response content. |

### Comparing two remote builds

```bash
# Compare 1.110.0 against 1.115.0 (no local build needed):
npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
```

### Resuming a run for more confidence

When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:

```bash
# Initial run with 3 iterations — may be inconclusive:
npm run perf:chat -- --scenario text-only --runs 3

# Add 3 more runs to the same results file (both test + baseline):
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 3

# Keep adding until confidence is reached:
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 5
```

`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges rawRuns, recomputes stats, and re-runs the comparison. The updated files are written back in-place. You can resume multiple times — samples accumulate.
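
The merge step can be sketched roughly as follows. This is an illustration, not the script's actual code: it assumes results keep per-metric sample arrays under a `rawRuns` key (the field name comes from this doc), and the helper name is hypothetical.

```javascript
// Illustrative resume-merge: concatenate the per-metric sample arrays
// from a fresh batch of runs onto a previously saved results object,
// so stats can be recomputed over the combined samples.
function mergeResults(prev, fresh) {
  const rawRuns = { ...prev.rawRuns };
  for (const [metric, runs] of Object.entries(fresh.rawRuns)) {
    // Keep earlier samples and append the new ones for this metric.
    rawRuns[metric] = [...(rawRuns[metric] ?? []), ...runs];
  }
  return { ...prev, rawRuns };
}
```

Because samples only accumulate, resuming never discards earlier measurements; it just tightens the statistics.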

### Statistical significance

Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is only flagged as `REGRESSION` when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.

With typical variance (cv ≈ 20%), you need:
- **n ≥ 5** per build to detect a 35% regression at 95% confidence
- **n ≥ 10** per build to detect a 20% regression reliably

Confidence levels reported: `high` (p < 0.01), `medium` (p < 0.05), `low` (p < 0.1), `none`.
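
The statistic behind that decision can be sketched as below. This is a simplified illustration, not the script's implementation: it computes the Welch t statistic and Welch–Satterthwaite degrees of freedom; the real test additionally maps `t` and `df` to a p-value.

```javascript
// Welch's t-test core: compare two unequal-variance samples
// (e.g. baseline vs. dev-build run times for one metric).
function mean(xs) { return xs.reduce((a, b) => a + b, 0) / xs.length; }
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}
function welch(a, b) {
  const va = variance(a) / a.length;   // squared standard error, sample a
  const vb = variance(b) / b.length;   // squared standard error, sample b
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch–Satterthwaite approximation for degrees of freedom
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```

With small `n`, `df` is small and the significance bar is correspondingly high, which is why a 20% difference over 3 runs can still be reported as likely noise.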

### Exit codes

- `0` — all metrics within threshold, or exceeding threshold but not statistically significant
- `1` — statistically significant regression detected, or all runs failed

### Scenarios

Scenarios are defined in `scripts/chat-simulation/common/perf-scenarios.js` and registered via `registerPerfScenarios()`. There are three categories:

- **Content-only** — plain streaming responses (e.g. `text-only`, `large-codeblock`, `rapid-stream`)
- **Tool-call** — multi-turn scenarios with tool invocations (e.g. `tool-read-file`, `tool-edit-file`)
- **Multi-turn user** — multi-turn conversations with user follow-ups, thinking blocks (e.g. `thinking-response`, `multi-turn-user`, `long-conversation`)

Run `npm run perf:chat -- --help` to see the full list of registered scenario IDs.

### Metrics collected

- **Timing:** time to first token, time to complete (prefers internal `code/chat/*` perf marks, falls back to client-side measurement)
- **Rendering:** layout count, style recalculation count, forced reflows, long tasks (>50ms)
- **Memory:** heap before/after (informational, noisy for single requests)

### Statistics

Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
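A compact sketch of that pipeline, assuming standard 1.5×IQR fences (helper names are illustrative, not the script's exports):

```javascript
// Summarize one metric's raw runs: drop IQR outliers, then report
// the median and coefficient of variation of the surviving samples.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos), hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}
function summarize(runs) {
  const s = [...runs].sort((a, b) => a - b);
  const q1 = quantile(s, 0.25), q3 = quantile(s, 0.75);
  const iqr = q3 - q1;
  // Keep samples within the usual 1.5×IQR fences.
  const kept = s.filter(x => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
  const median = quantile(kept, 0.5);
  const m = kept.reduce((a, b) => a + b, 0) / kept.length;
  const sd = Math.sqrt(
    kept.reduce((a, x) => a + (x - m) ** 2, 0) / (kept.length - 1));
  return { median, cv: sd / m, dropped: runs.length - kept.length };
}
```

One slow cold-start run among otherwise tight samples is exactly the case this handles: the outlier is dropped before the median and cv are computed.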

## Memory leak check

**Script:** `scripts/chat-simulation/test-chat-mem-leaks.js`
**npm:** `npm run perf:chat-leak`

Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute per-message growth rate, which is compared against a threshold.
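
The growth-rate estimate is ordinary least squares over the message index. A minimal sketch (the function name is hypothetical):

```javascript
// Least-squares slope over forced-GC heap samples (MB), one sample
// per message: a sustained positive slope suggests a leak.
function heapSlope(samples) {
  const n = samples.length;
  const mx = (n - 1) / 2;                               // mean of 0..n-1
  const my = samples.reduce((a, b) => a + b, 0) / n;    // mean heap
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - mx) * (samples[i] - my);
    den += (i - mx) ** 2;
  }
  return num / den; // MB per message
}
// e.g. heapSlope([100, 102, 104, 106]) → 2, flagged at --threshold 2
```

Fitting a slope over all samples, rather than just diffing first and last heap size, makes the check robust to a single noisy GC sample.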

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--messages <n>` / `-n` | `10` | Number of messages to send. More = more accurate slope. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. |
| `--threshold <MB>` | `2` | Max per-message heap growth in MB. |
| `--verbose` | — | Print per-message heap/DOM counts. |

### What it measures

- **Heap growth slope** (MB/message) — linear regression over forced-GC heap samples. A leak shows as sustained positive slope.
- **DOM node growth** (nodes/message) — catches rendering leaks where elements aren't cleaned up. Healthy chat virtualizes old messages so node count plateaus.

### Interpreting results

- `0.3–1.0 MB/msg` — normal (V8 internal overhead, string interning)
- `>2.0 MB/msg` — likely leak, investigate retained objects
- DOM nodes stable after first message — normal (chat list virtualization working)
- DOM nodes growing linearly — rendering leak, check disposable cleanup

## Architecture

```
scripts/chat-simulation/
├── common/
│   ├── mock-llm-server.js       # Mock CAPI server matching @vscode/copilot-api URL structure
│   ├── perf-scenarios.js        # Built-in scenario definitions (content, tool-call, multi-turn)
│   └── utils.js                 # Shared: paths, env setup, stats, launch helpers
├── config.jsonc                 # Default config (baseline version, runs, thresholds)
├── fixtures/                    # TypeScript fixture files used by tool-call scenarios
├── test-chat-perf-regression.js
└── test-chat-mem-leaks.js
```

### Mock server

The mock LLM server (`common/mock-llm-server.js`) implements the full CAPI URL structure from `@vscode/copilot-api`'s `DomainService`:

- `GET /models` — returns model metadata
- `POST /models/session` — returns `AutoModeAPIResponse` with `available_models` and `session_token`
- `POST /models/session/intent` — model router
- `POST /chat/completions` — SSE streaming response matching the scenario
- Agent, session, telemetry, and token endpoints

The copilot extension connects to this server via `IS_SCENARIO_AUTOMATION=1` mode with `overrideCapiUrl` and `overrideProxyUrl` settings. The `vscode-api-tests` extension must be disabled (`--disable-extension=vscode.vscode-api-tests`) because it contributes a duplicate `copilot` vendor that blocks the real extension's language model provider registration.

### Adding a scenario

1. Add a new entry to the appropriate object (`CONTENT_SCENARIOS`, `TOOL_CALL_SCENARIOS`, or `MULTI_TURN_SCENARIOS`) in `common/perf-scenarios.js` using the `ScenarioBuilder` API from `common/mock-llm-server.js`
2. The scenario is auto-registered by `registerPerfScenarios()` — no manual ID list to update
3. Run: `npm run perf:chat -- --scenario your-new-scenario --runs 1 --no-baseline --verbose`