Extend OpenAiCompatibleChatClient for vLLM backend compatibility #619

@Aaronontheweb

Summary

OpenAiCompatibleChatClient is currently exercised against llama.cpp's
llama-server. An evaluation of vLLM (vllm.entrypoints.openai.api_server)
as an alternative OpenAI-compatible backend surfaces several compatibility
gaps that should be closed so the two servers are genuinely interchangeable
behind the same provider type.

vLLM is worth supporting: it has first-class AMD ROCm support, uses
PagedAttention for dynamic KV cache allocation (eliminating the
preallocation waste llama.cpp suffers at LLAMA_PARALLEL > 1), and is the
most widely deployed self-hosted OpenAI-compatible server in production.
Being able to swap backends without code changes keeps the provider
abstraction honest.

This issue tracks the work to:

  1. Identify and document the concrete API-surface deltas between
    llama-server and vLLM that OpenAiCompatibleChatClient currently
    depends on.
  2. Extend the client (and/or OpenAiCompatibleCapabilityResolver) to
    handle both backends transparently.
  3. Add integration coverage for vLLM so regressions are caught.

Known compatibility gaps

1. /props endpoint is llama.cpp-only

OpenAiCompatibleCapabilityResolver probes GET /props at startup to
extract:

  • default_generation_settings.params.n_ctx → context window
  • modalities.vision → vision support flag

vLLM does not expose /props. The resolver already has a /v1/models
fallback path, but that path reads meta.n_ctx_train from the model
descriptor — which is a llama.cpp-specific extension. vLLM's /v1/models
response is OpenAI-standard and does not include context window or
modality metadata in any documented field.

Action:

  • Verify the /v1/models fallback actually works against vLLM (spoiler:
    it probably returns a default or errors silently).
  • Add a vLLM-aware code path (a rough sketch follows this list). Options:
    • Query /v1/models for the served model id, then consult a small
      built-in table keyed by model family
    • Require an explicit ContextWindow / SupportsVision override in
      provider config when the backend can't self-describe
    • Probe /metrics (vLLM exposes Prometheus metrics including
      vllm:num_requests_running and vllm:gpu_cache_usage_perc, but not
      context window — so this is a dead end for capability detection)
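As a rough sketch of how the override-plus-table combination could look (all names here are hypothetical placeholders rather than existing Netclaw types, and the context-window numbers are illustrative only):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; not existing Netclaw APIs.
public sealed record ModelCapabilities(int ContextWindow, bool SupportsVision);

public static class VllmCapabilityFallback
{
    // Illustrative entries with placeholder context sizes; list more specific family
    // names first so they win over shorter ones.
    private static readonly (string Family, ModelCapabilities Caps)[] KnownModelFamilies =
    {
        ("qwen3-coder", new ModelCapabilities(ContextWindow: 131_072, SupportsVision: false)),
        ("qwen3",       new ModelCapabilities(ContextWindow: 32_768,  SupportsVision: false)),
    };

    public static ModelCapabilities Resolve(
        string servedModelId,       // id reported by GET /v1/models
        int? contextWindowOverride, // explicit provider-config override, if any
        bool? supportsVisionOverride)
    {
        // 1. Explicit provider-config overrides always win.
        if (contextWindowOverride is int ctx)
            return new ModelCapabilities(ctx, supportsVisionOverride ?? false);

        // 2. Otherwise consult the built-in table keyed by model family.
        foreach (var (family, caps) in KnownModelFamilies)
        {
            if (servedModelId.Contains(family, StringComparison.OrdinalIgnoreCase))
                return caps;
        }

        // 3. Fail fast rather than silently guessing a context window.
        throw new InvalidOperationException(
            $"Backend cannot self-describe and '{servedModelId}' is not a known model family; " +
            "set ContextWindow explicitly in the provider config.");
    }
}
```

The failure mode in step 3 is deliberate: silently guessing a context window is worse than forcing an explicit config value.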

2. Tool-call parser streaming format depends on vLLM parser choice

vLLM's tool-call parsing is plugin-based via --tool-call-parser <name>.
Different Qwen3-family variants use different parsers, and the streaming
behavior is not uniform across them:

  • hermes parser (vLLM's general Qwen3 recommendation) has a
    documented streaming bug
    (vllm-project/vllm#31871,
    open as of Feb 2026, v0.15.1 still affected): when stream: true is
    set, vLLM returns raw <tool_call>…</tool_call> XML inside
    delta.content with finish_reason: "stop" instead of the structured
    tool_calls delta array required by the OpenAI spec. Non-streaming
    requests are unaffected.
  • qwen3_coder parser (vLLM 0.10+, designed for Qwen3-Coder /
    Qwen3.5 A3B MoE agent models) is a separate code path. Whether it
    exhibits the same streaming degradation as hermes is unverified
    — it may be fine, it may have different quirks, it needs actual
    testing against a live vLLM instance.

Since OpenAiCompatibleChatClient always sends stream: true (line
156) and tool calls are a first-class feature of Netclaw sessions, both
parsers need to be exercised before declaring either one a safe target.

Good news: Netclaw already has a TextToolCallParser (lines 105-141
and 207-241 of OpenAiCompatibleChatClient.cs) that handles exactly
this pattern — text-mode <tool_call> XML emitted by models that don't
use structured JSON tool calls. It was added for a llama-server Qwen3
quirk and may provide partial coverage for the vLLM degraded format.
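For reference, a minimal sketch of the kind of extraction involved, operating on fully buffered content. This is not the actual TextToolCallParser implementation, and it deliberately ignores the handling of tags split across SSE chunks that the real parser needs:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch only: pull <tool_call>{...json...}</tool_call> blocks out of accumulated
// streamed text once the stream has completed.
public static class ToolCallTextExtraction
{
    private static readonly Regex ToolCallBlock = new(
        @"<tool_call>\s*(?<json>\{.*?\})\s*</tool_call>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    public static IReadOnlyList<string> ExtractToolCallJson(string bufferedContent)
    {
        var calls = new List<string>();
        foreach (Match m in ToolCallBlock.Matches(bufferedContent))
            calls.Add(m.Groups["json"].Value); // e.g. {"name":"get_weather","arguments":{...}}
        return calls;
    }
}
```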

Action:

  • Capture actual vLLM streaming output for a tool-calling request under
    both --tool-call-parser hermes and --tool-call-parser qwen3_coder,
    for single-tool and parallel-tool cases.
  • For each parser, compare the emitted format (tag names, attribute
    placement, argument JSON serialization, opening/closing tag
    boundaries across SSE chunks) against what TextToolCallParser
    currently expects.
  • Extend the parser for any gaps. If the two vLLM parsers emit
    meaningfully different shapes, the fallback path needs to handle both
    variants.
  • Add streaming regression tests against captured SSE transcripts so
    these paths stay covered.
  • Document which parser is the recommended target for each supported
    Qwen3 variant in provider config docs.

3. llama.cpp timings object extraction breaks on vLLM

llama.cpp's /v1/chat/completions response includes a top-level
timings object with fields like prompt_n, prompt_ms,
prompt_per_second, predicted_n, predicted_ms,
predicted_per_second, and (when prompt caching is active)
cached_tokens. This is a llama.cpp-specific extension — vLLM does not
emit it.

A downstream consumer in Netclaw (PR #615, "llama.cpp timings parsing")
extracts cached_tokens, prompt_ms, and predicted_per_second from
this shape and feeds them into eval-suite analysis (specifically the
Multi-Turn Cache Evolution table) and session-level cache telemetry.
On vLLM those fields don't exist, so:

  • The eval suite's cache-evolution metrics will silently populate as
    zero or null against a vLLM backend
  • Any cache-hit-rate debugging or regression detection driven off
    cached_tokens will break
  • Session telemetry that depends on per-request prompt_ms for latency
    analysis will lose fidelity

This is more than a "response extras tolerance" concern — it's a real
observability regression when the backend changes.

Action:

  • Identify the exact consumers of the PR #615 timings parser (grep for
    cached_tokens, prompt_ms, predicted_per_second).
  • Build a vLLM-equivalent extraction path. vLLM exposes equivalent data
    via different mechanisms:
    • Usage field: vLLM's usage object adds
      prompt_tokens_details.cached_tokens (OpenAI-standard prefix cache
      field, supported by vLLM for its automatic prefix cache).
      Mapping: timings.cached_tokens → usage.prompt_tokens_details.cached_tokens.
    • /metrics endpoint (Prometheus): vllm:prompt_tokens_total,
      vllm:generation_tokens_total,
      vllm:time_to_first_token_seconds,
      vllm:time_per_output_token_seconds. These are aggregate, not
      per-request, so they don't substitute for per-request prompt_ms
      at the request level.
    • Per-request timing: can be derived client-side from wall-clock
      measurements around the HTTP call (stream-start → first-chunk →
      last-chunk), which is arguably more accurate than trusting the
      backend's self-reported timings anyway.
  • Refactor the timings parser into a backend-agnostic interface:
    ITimingsExtractor with LlamaCppTimingsExtractor (parses the
    timings object) and VllmTimingsExtractor (reads
    usage.prompt_tokens_details.cached_tokens plus client-side wall
    clock). Select based on backend detection (a rough sketch follows this list).
  • Update the Multi-Turn Cache Evolution eval metric to use the
    abstracted extractor so the table populates correctly regardless of
    backend.
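A rough shape for that split. The interface and extractor names come from the bullet above; RequestTimeline, TimingsSnapshot, and the member names are hypothetical:

```csharp
using System;
using System.Text.Json;

// Hypothetical supporting types for illustration only.
public sealed record RequestTimeline(DateTimeOffset RequestStart, DateTimeOffset FirstChunk, DateTimeOffset LastChunk);
public sealed record TimingsSnapshot(int? CachedTokens, double? PromptMs, double? PredictedPerSecond);

public interface ITimingsExtractor
{
    // rawResponse is the deserialized /v1/chat/completions body; timeline is measured client-side.
    TimingsSnapshot Extract(JsonElement rawResponse, RequestTimeline timeline);
}

// Reads the llama.cpp-specific top-level "timings" object.
public sealed class LlamaCppTimingsExtractor : ITimingsExtractor
{
    public TimingsSnapshot Extract(JsonElement rawResponse, RequestTimeline timeline)
    {
        if (!rawResponse.TryGetProperty("timings", out var t))
            return new TimingsSnapshot(null, null, null);

        return new TimingsSnapshot(
            CachedTokens: t.TryGetProperty("cached_tokens", out var c) ? (int?)c.GetInt32() : null,
            PromptMs: t.TryGetProperty("prompt_ms", out var p) ? (double?)p.GetDouble() : null,
            PredictedPerSecond: t.TryGetProperty("predicted_per_second", out var s) ? (double?)s.GetDouble() : null);
    }
}

// Reads the OpenAI-standard usage.prompt_tokens_details.cached_tokens and derives
// latency from client-side wall-clock measurements.
public sealed class VllmTimingsExtractor : ITimingsExtractor
{
    public TimingsSnapshot Extract(JsonElement rawResponse, RequestTimeline timeline)
    {
        int? cached = null;
        if (rawResponse.TryGetProperty("usage", out var usage) &&
            usage.TryGetProperty("prompt_tokens_details", out var details) &&
            details.TryGetProperty("cached_tokens", out var c))
        {
            cached = c.GetInt32();
        }

        // Time-to-first-chunk stands in for prompt_ms; tokens/sec would need a token count.
        var promptMs = (timeline.FirstChunk - timeline.RequestStart).TotalMilliseconds;
        return new TimingsSnapshot(cached, promptMs, PredictedPerSecond: null);
    }
}
```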

4. Strict model field matching on vLLM

vLLM validates the request's model field against --served-model-name
and rejects mismatches. llama-server is looser — it typically ignores the
value because there's one model per process.

OpenAiCompatibleChatClient already sends ChatOptions.ModelId from
config, so this is largely a configuration concern — but the error path
matters: a misconfigured ModelId against a vLLM backend produces an
opaque 404 or 400 where against llama-server it would "just work."

Action:

  • Ensure the error body from vLLM's model-mismatch response is surfaced
    to users with a clear diagnostic ("the backend reports the served model
    name is X, but ChatOptions.ModelId is Y").
  • Document the requirement in provider config docs.
  • Consider a startup sanity check: after /v1/models is fetched,
    validate ChatOptions.ModelId against the returned list and fail fast
    with an actionable error (see the sketch below).
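A minimal sketch of that sanity check, assuming the served model ids have already been fetched from /v1/models (the method and type names are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical startup check run once after GET /v1/models succeeds.
public static class ServedModelCheck
{
    public static void EnsureModelIsServed(string configuredModelId, IReadOnlyList<string> servedModelIds)
    {
        if (servedModelIds.Contains(configuredModelId, StringComparer.Ordinal))
            return;

        throw new InvalidOperationException(
            $"ChatOptions.ModelId is '{configuredModelId}', but the backend reports serving: " +
            $"{string.Join(", ", servedModelIds)}. vLLM rejects mismatched model names; " +
            "update the provider config or the --served-model-name flag.");
    }
}
```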

5. Error shape and retry policy alignment

vLLM error responses use the OpenAI {"error": {...}} envelope but
different type values than llama-server (vLLM: BadRequestError,
invalid_request_error, etc.; llama-server: free-form strings). The
current error parser (lines 237-259) handles both the nested envelope
and a bare error: "string" fallback, so no breakage expected — but
verify the retry classification in RetryPolicy.cs treats vLLM
context_length_exceeded as non-retryable (it's a permanent failure at
this prompt length, not a transient error).
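A sketch of the intended classification, not the actual RetryPolicy.cs code; the helper name and the exact set of retryable statuses are assumptions:

```csharp
using System;

// Sketch only: a 400 carrying "context_length_exceeded" is a permanent failure for
// this prompt and must not be retried, while 429/5xx remain retryable.
public static class RetryClassificationSketch
{
    public static bool IsRetryable(int statusCode, string? errorTypeOrCode) =>
        statusCode switch
        {
            429 => true,      // rate limited: retry with backoff
            >= 500 => true,   // transient server errors
            400 when string.Equals(errorTypeOrCode, "context_length_exceeded",
                                   StringComparison.OrdinalIgnoreCase) => false,
            _ => false,       // everything else: fail fast
        };
}
```

Retrying a context-length failure only burns tokens and time; it can never succeed without shrinking the prompt.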

Additionally, vLLM pre-validates max_tokens against context length
before generation begins and returns 400 immediately; llama-server
clamps at runtime. Long-prompt sessions that currently "succeed" on
llama-server with truncated output may start failing on vLLM — arguably
a correctness improvement, but worth surfacing the error cleanly rather
than confusing users.

6. Minor: stop-sequence semantics

vLLM stops before emitting the stop string; llama-server historically
sometimes includes the stop token in the content stream. No code in
OpenAiCompatibleChatClient appears to depend on this, but flagged for
awareness in case any downstream code inspects content tails for stop
markers.

7. Minor: /v1/chat/completions response extras (non-timings)

Beyond the llama.cpp timings object (covered in Gap #3), both backends
emit additional non-standard fields:

  • vLLM may add prompt_token_ids / prompt_logprobs (diagnostic-only,
    not consumed by Netclaw)
  • Individual vLLM versions have added various telemetry fields

If strict-mode JSON deserialization is enabled anywhere, verify unknown
fields are tolerated on both sides.
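For what it's worth, System.Text.Json ignores unknown JSON properties by default; only an explicit opt-in makes them fatal. A small probe illustrating both modes (ChoiceProbe is a hypothetical stand-in for whatever DTO the client deserializes into; UnmappedMemberHandling requires .NET 8+):

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

public sealed class ChoiceProbe
{
    [JsonPropertyName("index")] public int Index { get; set; }
}

public static class UnknownFieldToleranceCheck
{
    public static void Run()
    {
        // Simulates a response fragment carrying a non-standard diagnostic field.
        const string json = @"{ ""index"": 0, ""prompt_logprobs"": null }";

        // Default options: the unknown "prompt_logprobs" member is silently ignored.
        _ = JsonSerializer.Deserialize<ChoiceProbe>(json);

        // Strict opt-in: the same payload throws JsonException.
        var strict = new JsonSerializerOptions
        {
            UnmappedMemberHandling = JsonUnmappedMemberHandling.Disallow
        };
        try { JsonSerializer.Deserialize<ChoiceProbe>(json, strict); }
        catch (JsonException) { /* this is the failure mode to avoid in the client */ }
    }
}
```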
