---
name: chat-perf
description: Run chat perf benchmarks and memory leak checks against the local dev build or any published VS Code version. Use when investigating chat rendering regressions, validating perf-sensitive changes to chat UI, or checking for memory leaks in the chat response pipeline.
---

# Chat Performance Testing

## When to use

- Before/after modifying chat rendering code (`chatListRenderer.ts`, `chatInputPart.ts`, markdown rendering)
- When changing the streaming response pipeline or SSE processing
- When modifying disposable/lifecycle patterns in chat components
- To compare performance between two VS Code releases
- In CI to gate PRs that touch chat UI code

## Quick start

```bash
# Run perf regression test (compares local dev build vs VS Code 1.115.0):
npm run perf:chat -- --scenario text-only --runs 3

# Run all scenarios with no baseline (just measure):
npm run perf:chat -- --no-baseline --runs 3

# Run memory leak check (10 messages in one session):
npm run perf:chat-leak

# Run leak check with more messages for accuracy:
npm run perf:chat-leak -- --messages 20 --verbose
```

## Perf regression test

**Script:** `scripts/chat-simulation/test-chat-perf-regression.js`
**npm:** `npm run perf:chat`

Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares.

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--runs <n>` | `5` | Runs per scenario. More = more stable. Use 5+ for CI. |
| `--scenario <id>` / `-s` | all | Scenario to test (repeatable). See `common/perf-scenarios.js`. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. Accepts path or version (`1.110.0`, `insiders`, commit hash). |
| `--baseline <path>` | — | Compare against a previously saved baseline JSON file. |
| `--baseline-build <ver>` | `1.115.0` | Version to download and benchmark as baseline. |
| `--no-baseline` | — | Skip baseline comparison entirely. |
| `--save-baseline` | — | Save results as the new baseline (requires `--baseline <path>`). |
| `--resume <path>` | — | Resume a previous run, adding more iterations to increase confidence. |
| `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
| `--no-cache` | — | Ignore cached baseline data, always run fresh. |
| `--ci` | — | CI mode: write Markdown summary to `ci-summary.md` (implies `--no-cache`). |
| `--verbose` | — | Print per-run details including response content. |

### Comparing two remote builds

```bash
# Compare 1.110.0 against 1.115.0 (no local build needed):
npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
```

### Resuming a run for more confidence

When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:

```bash
# Initial run with 3 iterations — may be inconclusive:
npm run perf:chat -- --scenario text-only --runs 3

# Add 3 more runs to the same results file (both test + baseline):
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 3

# Keep adding until confidence is reached:
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 5
```

`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges rawRuns, recomputes stats, and re-runs the comparison. The updated files are written back in-place. You can resume multiple times — samples accumulate.
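
The merge step can be sketched roughly as follows. This is an illustration, not the script's actual code: it assumes results keep per-metric sample arrays under a `rawRuns` key (the field name comes from this doc), and the helper name is hypothetical.

```javascript
// Illustrative resume-merge: concatenate the per-metric sample arrays
// from a fresh batch of runs onto a previously saved results object,
// so stats can be recomputed over the combined samples.
function mergeResults(prev, fresh) {
  const rawRuns = { ...prev.rawRuns };
  for (const [metric, runs] of Object.entries(fresh.rawRuns)) {
    // Keep earlier samples and append the new ones for this metric.
    rawRuns[metric] = [...(rawRuns[metric] ?? []), ...runs];
  }
  return { ...prev, rawRuns };
}
```

Because samples only accumulate, resuming never discards earlier measurements; it just tightens the statistics.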

### Statistical significance

Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is only flagged as `REGRESSION` when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.

With typical variance (cv ≈ 20%), you need:
- **n ≥ 5** per build to detect a 35% regression at 95% confidence
- **n ≥ 10** per build to detect a 20% regression reliably

Confidence levels reported: `high` (p < 0.01), `medium` (p < 0.05), `low` (p < 0.1), `none`.
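
The statistic behind that decision can be sketched as below. This is a simplified illustration, not the script's implementation: it computes the Welch t statistic and Welch–Satterthwaite degrees of freedom; the real test additionally maps `t` and `df` to a p-value.

```javascript
// Welch's t-test core: compare two unequal-variance samples
// (e.g. baseline vs. dev-build run times for one metric).
function mean(xs) { return xs.reduce((a, b) => a + b, 0) / xs.length; }
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}
function welch(a, b) {
  const va = variance(a) / a.length;   // squared standard error, sample a
  const vb = variance(b) / b.length;   // squared standard error, sample b
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch–Satterthwaite approximation for degrees of freedom
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```

With small `n`, `df` is small and the significance bar is correspondingly high, which is why a 20% difference over 3 runs can still be reported as likely noise.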

### Exit codes

- `0` — all metrics within threshold, or exceeding threshold but not statistically significant
- `1` — statistically significant regression detected, or all runs failed

### Scenarios

Scenarios are defined in `scripts/chat-simulation/common/perf-scenarios.js` and registered via `registerPerfScenarios()`. There are three categories:

- **Content-only** — plain streaming responses (e.g. `text-only`, `large-codeblock`, `rapid-stream`)
- **Tool-call** — multi-turn scenarios with tool invocations (e.g. `tool-read-file`, `tool-edit-file`)
- **Multi-turn user** — multi-turn conversations with user follow-ups, thinking blocks (e.g. `thinking-response`, `multi-turn-user`, `long-conversation`)

Run `npm run perf:chat -- --help` to see the full list of registered scenario IDs.

### Metrics collected

- **Timing:** time to first token, time to complete (prefers internal `code/chat/*` perf marks, falls back to client-side measurement)
- **Rendering:** layout count, style recalculation count, forced reflows, long tasks (>50ms)
- **Memory:** heap before/after (informational, noisy for single requests)

### Statistics

Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
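A compact sketch of that pipeline, assuming standard 1.5×IQR fences (helper names are illustrative, not the script's exports):

```javascript
// Summarize one metric's raw runs: drop IQR outliers, then report
// the median and coefficient of variation of the surviving samples.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos), hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}
function summarize(runs) {
  const s = [...runs].sort((a, b) => a - b);
  const q1 = quantile(s, 0.25), q3 = quantile(s, 0.75);
  const iqr = q3 - q1;
  // Keep samples within the usual 1.5×IQR fences.
  const kept = s.filter(x => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
  const median = quantile(kept, 0.5);
  const m = kept.reduce((a, b) => a + b, 0) / kept.length;
  const sd = Math.sqrt(
    kept.reduce((a, x) => a + (x - m) ** 2, 0) / (kept.length - 1));
  return { median, cv: sd / m, dropped: runs.length - kept.length };
}
```

One slow cold-start run among otherwise tight samples is exactly the case this handles: the outlier is dropped before the median and cv are computed.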

## Memory leak check

**Script:** `scripts/chat-simulation/test-chat-mem-leaks.js`
**npm:** `npm run perf:chat-leak`

Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute per-message growth rate, which is compared against a threshold.
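
The growth-rate estimate is ordinary least squares over the message index. A minimal sketch (the function name is hypothetical):

```javascript
// Least-squares slope over forced-GC heap samples (MB), one sample
// per message: a sustained positive slope suggests a leak.
function heapSlope(samples) {
  const n = samples.length;
  const mx = (n - 1) / 2;                               // mean of 0..n-1
  const my = samples.reduce((a, b) => a + b, 0) / n;    // mean heap
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - mx) * (samples[i] - my);
    den += (i - mx) ** 2;
  }
  return num / den; // MB per message
}
// e.g. heapSlope([100, 102, 104, 106]) → 2, flagged at --threshold 2
```

Fitting a slope over all samples, rather than just diffing first and last heap size, makes the check robust to a single noisy GC sample.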

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--messages <n>` / `-n` | `10` | Number of messages to send. More = more accurate slope. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. |
| `--threshold <MB>` | `2` | Max per-message heap growth in MB. |
| `--verbose` | — | Print per-message heap/DOM counts. |

### What it measures

- **Heap growth slope** (MB/message) — linear regression over forced-GC heap samples. A leak shows as sustained positive slope.
- **DOM node growth** (nodes/message) — catches rendering leaks where elements aren't cleaned up. Healthy chat virtualizes old messages so node count plateaus.

### Interpreting results

- `0.3–1.0 MB/msg` — normal (V8 internal overhead, string interning)
- `>2.0 MB/msg` — likely leak, investigate retained objects
- DOM nodes stable after first message — normal (chat list virtualization working)
- DOM nodes growing linearly — rendering leak, check disposable cleanup

## Architecture

```
scripts/chat-simulation/
├── common/
│   ├── mock-llm-server.js       # Mock CAPI server matching @vscode/copilot-api URL structure
│   ├── perf-scenarios.js        # Built-in scenario definitions (content, tool-call, multi-turn)
│   └── utils.js                 # Shared: paths, env setup, stats, launch helpers
├── config.jsonc                 # Default config (baseline version, runs, thresholds)
├── fixtures/                    # TypeScript fixture files used by tool-call scenarios
├── test-chat-perf-regression.js
└── test-chat-mem-leaks.js
```

### Mock server

The mock LLM server (`common/mock-llm-server.js`) implements the full CAPI URL structure from `@vscode/copilot-api`'s `DomainService`:

- `GET /models` — returns model metadata
- `POST /models/session` — returns `AutoModeAPIResponse` with `available_models` and `session_token`
- `POST /models/session/intent` — model router
- `POST /chat/completions` — SSE streaming response matching the scenario
- Agent, session, telemetry, and token endpoints

The copilot extension connects to this server via `IS_SCENARIO_AUTOMATION=1` mode with `overrideCapiUrl` and `overrideProxyUrl` settings. The `vscode-api-tests` extension must be disabled (`--disable-extension=vscode.vscode-api-tests`) because it contributes a duplicate `copilot` vendor that blocks the real extension's language model provider registration.

### Adding a scenario

1. Add a new entry to the appropriate object (`CONTENT_SCENARIOS`, `TOOL_CALL_SCENARIOS`, or `MULTI_TURN_SCENARIOS`) in `common/perf-scenarios.js` using the `ScenarioBuilder` API from `common/mock-llm-server.js`
2. The scenario is auto-registered by `registerPerfScenarios()` — no manual ID list to update
3. Run: `npm run perf:chat -- --scenario your-new-scenario --runs 1 --no-baseline --verbose`