Extend OpenAiCompatibleChatClient for vLLM backend compatibility #619

@Aaronontheweb

Summary

OpenAiCompatibleChatClient is currently exercised against llama.cpp's
llama-server. An evaluation of vLLM (vllm.entrypoints.openai.api_server)
as an alternative OpenAI-compatible backend surfaces several compatibility
gaps that should be closed so the two servers are genuinely interchangeable
behind the same provider type.

vLLM is worth supporting: it has first-class AMD ROCm support, uses
PagedAttention for dynamic KV cache allocation (eliminating the
preallocation waste llama.cpp suffers at LLAMA_PARALLEL > 1), and is the
most widely deployed self-hosted OpenAI-compatible server in production.
Being able to swap backends without code changes keeps the provider
abstraction honest.

This issue tracks the work to:

  1. Identify and document the concrete API-surface deltas between
    llama-server and vLLM that OpenAiCompatibleChatClient currently
    depends on.
  2. Extend the client (and/or OpenAiCompatibleCapabilityResolver) to
    handle both backends transparently.
  3. Add integration coverage for vLLM so regressions are caught.

Known compatibility gaps

1. /props endpoint is llama.cpp-only

OpenAiCompatibleCapabilityResolver probes GET /props at startup to
extract:

  • default_generation_settings.params.n_ctx → context window
  • modalities.vision → vision support flag

vLLM does not expose /props. The resolver already has a /v1/models
fallback path, but that path reads meta.n_ctx_train from the model
descriptor — which is a llama.cpp-specific extension. vLLM's /v1/models
response is OpenAI-standard and does not include context window or
modality metadata in any documented field.

Action:

  • Verify the /v1/models fallback actually works against vLLM (spoiler:
    it probably returns a default or errors silently).
  • Add a vLLM-aware code path (a rough sketch follows this list). Options:
    • Query /v1/models for the served model id, then consult a small
      built-in table keyed by model family
    • Require an explicit ContextWindow / SupportsVision override in
      provider config when the backend can't self-describe
    • Probe /metrics (vLLM exposes Prometheus metrics including
      vllm:num_requests_running and vllm:gpu_cache_usage_perc, but not
      context window — so this is a dead end for capability detection)
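As a rough sketch of how the override-plus-table combination could look (all names here are hypothetical placeholders rather than existing Netclaw types, and the context-window numbers are illustrative only):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; not existing Netclaw APIs.
public sealed record ModelCapabilities(int ContextWindow, bool SupportsVision);

public static class VllmCapabilityFallback
{
    // Illustrative entries with placeholder context sizes; list more specific family
    // names first so they win over shorter ones.
    private static readonly (string Family, ModelCapabilities Caps)[] KnownModelFamilies =
    {
        ("qwen3-coder", new ModelCapabilities(ContextWindow: 131_072, SupportsVision: false)),
        ("qwen3",       new ModelCapabilities(ContextWindow: 32_768,  SupportsVision: false)),
    };

    public static ModelCapabilities Resolve(
        string servedModelId,       // id reported by GET /v1/models
        int? contextWindowOverride, // explicit provider-config override, if any
        bool? supportsVisionOverride)
    {
        // 1. Explicit provider-config overrides always win.
        if (contextWindowOverride is int ctx)
            return new ModelCapabilities(ctx, supportsVisionOverride ?? false);

        // 2. Otherwise consult the built-in table keyed by model family.
        foreach (var (family, caps) in KnownModelFamilies)
        {
            if (servedModelId.Contains(family, StringComparison.OrdinalIgnoreCase))
                return caps;
        }

        // 3. Fail fast rather than silently guessing a context window.
        throw new InvalidOperationException(
            $"Backend cannot self-describe and '{servedModelId}' is not a known model family; " +
            "set ContextWindow explicitly in the provider config.");
    }
}
```

The failure mode in step 3 is deliberate: silently guessing a context window is worse than forcing an explicit config value.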

2. Tool-call parser streaming format depends on vLLM parser choice

vLLM's tool-call parsing is plugin-based via --tool-call-parser <name>.
Different Qwen3-family variants use different parsers, and the streaming
behavior is not uniform across them:

  • hermes parser (vLLM's general Qwen3 recommendation) has a
    documented streaming bug
    (vllm-project/vllm#31871,
    open as of Feb 2026, v0.15.1 still affected): when stream: true is
    set, vLLM returns raw <tool_call>…</tool_call> XML inside
    delta.content with finish_reason: "stop" instead of the structured
    tool_calls delta array required by the OpenAI spec. Non-streaming
    requests are unaffected.
  • qwen3_coder parser (vLLM 0.10+, designed for Qwen3-Coder /
    Qwen3.5 A3B MoE agent models) is a separate code path. Whether it
    exhibits the same streaming degradation as hermes is unverified
    — it may be fine, it may have different quirks, it needs actual
    testing against a live vLLM instance.

Since OpenAiCompatibleChatClient always sends stream: true (line
156) and tool calls are a first-class feature of Netclaw sessions, both
parsers need to be exercised before declaring either one a safe target.

Good news: Netclaw already has a TextToolCallParser (lines 105-141
and 207-241 of OpenAiCompatibleChatClient.cs) that handles exactly
this pattern — text-mode <tool_call> XML emitted by models that don't
use structured JSON tool calls. It was added for a llama-server Qwen3
quirk and may provide partial coverage for the vLLM degraded format.
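For reference, a minimal sketch of the kind of extraction involved, operating on fully buffered content. This is not the actual TextToolCallParser implementation, and it deliberately ignores the handling of tags split across SSE chunks that the real parser needs:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch only: pull <tool_call>{...json...}</tool_call> blocks out of accumulated
// streamed text once the stream has completed.
public static class ToolCallTextExtraction
{
    private static readonly Regex ToolCallBlock = new(
        @"<tool_call>\s*(?<json>\{.*?\})\s*</tool_call>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    public static IReadOnlyList<string> ExtractToolCallJson(string bufferedContent)
    {
        var calls = new List<string>();
        foreach (Match m in ToolCallBlock.Matches(bufferedContent))
            calls.Add(m.Groups["json"].Value); // e.g. {"name":"get_weather","arguments":{...}}
        return calls;
    }
}
```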

Action:

  • Capture actual vLLM streaming output for a tool-calling request under
    both --tool-call-parser hermes and --tool-call-parser qwen3_coder,
    for single-tool and parallel-tool cases.
  • For each parser, compare the emitted format (tag names, attribute
    placement, argument JSON serialization, opening/closing tag
    boundaries across SSE chunks) against what TextToolCallParser
    currently expects.
  • Extend the parser for any gaps. If the two vLLM parsers emit
    meaningfully different shapes, the fallback path needs to handle both
    variants.
  • Add streaming regression tests against captured SSE transcripts so
    these paths stay covered.
  • Document which parser is the recommended target for each supported
    Qwen3 variant in provider config docs.

3. llama.cpp timings object extraction breaks on vLLM

llama.cpp's /v1/chat/completions response includes a top-level
timings object with fields like prompt_n, prompt_ms,
prompt_per_second, predicted_n, predicted_ms,
predicted_per_second, and (when prompt caching is active)
cached_tokens. This is a llama.cpp-specific extension — vLLM does not
emit it.

A downstream consumer in Netclaw (PR #615, "llama.cpp timings parsing")
extracts cached_tokens, prompt_ms, and predicted_per_second from
this shape and feeds them into eval-suite analysis (specifically the
Multi-Turn Cache Evolution table) and session-level cache telemetry.
On vLLM those fields don't exist, so:

  • The eval suite's cache-evolution metrics will silently populate as
    zero or null against a vLLM backend
  • Any cache-hit-rate debugging or regression detection driven off
    cached_tokens will break
  • Session telemetry that depends on per-request prompt_ms for latency
    analysis will lose fidelity

This is more than a "response extras tolerance" concern — it's a real
observability regression when the backend changes.

Action:

  • Identify the exact consumers of the PR #615 timings parser (grep for
    cached_tokens, prompt_ms, predicted_per_second).
  • Build a vLLM-equivalent extraction path. vLLM exposes equivalent data
    via different mechanisms:
    • Usage field: vLLM's usage object adds
      prompt_tokens_details.cached_tokens (OpenAI-standard prefix cache
      field, supported by vLLM for its automatic prefix cache).
      Mapping: timings.cached_tokens → usage.prompt_tokens_details.cached_tokens.
    • /metrics endpoint (Prometheus): vllm:prompt_tokens_total,
      vllm:generation_tokens_total,
      vllm:time_to_first_token_seconds,
      vllm:time_per_output_token_seconds. These are aggregate, not
      per-request, so they don't substitute for per-request prompt_ms
      at the request level.
    • Per-request timing: can be derived client-side from wall-clock
      measurements around the HTTP call (stream-start → first-chunk →
      last-chunk), which is arguably more accurate than trusting the
      backend's self-reported timings anyway.
  • Refactor the timings parser into a backend-agnostic interface:
    ITimingsExtractor with LlamaCppTimingsExtractor (parses the
    timings object) and VllmTimingsExtractor (reads
    usage.prompt_tokens_details.cached_tokens plus client-side wall
    clock). Select based on backend detection (a rough sketch follows this list).
  • Update the Multi-Turn Cache Evolution eval metric to use the
    abstracted extractor so the table populates correctly regardless of
    backend.
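A rough shape for that split. The interface and extractor names come from the bullet above; RequestTimeline, TimingsSnapshot, and the member names are hypothetical:

```csharp
using System;
using System.Text.Json;

// Hypothetical supporting types for illustration only.
public sealed record RequestTimeline(DateTimeOffset RequestStart, DateTimeOffset FirstChunk, DateTimeOffset LastChunk);
public sealed record TimingsSnapshot(int? CachedTokens, double? PromptMs, double? PredictedPerSecond);

public interface ITimingsExtractor
{
    // rawResponse is the deserialized /v1/chat/completions body; timeline is measured client-side.
    TimingsSnapshot Extract(JsonElement rawResponse, RequestTimeline timeline);
}

// Reads the llama.cpp-specific top-level "timings" object.
public sealed class LlamaCppTimingsExtractor : ITimingsExtractor
{
    public TimingsSnapshot Extract(JsonElement rawResponse, RequestTimeline timeline)
    {
        if (!rawResponse.TryGetProperty("timings", out var t))
            return new TimingsSnapshot(null, null, null);

        return new TimingsSnapshot(
            CachedTokens: t.TryGetProperty("cached_tokens", out var c) ? (int?)c.GetInt32() : null,
            PromptMs: t.TryGetProperty("prompt_ms", out var p) ? (double?)p.GetDouble() : null,
            PredictedPerSecond: t.TryGetProperty("predicted_per_second", out var s) ? (double?)s.GetDouble() : null);
    }
}

// Reads the OpenAI-standard usage.prompt_tokens_details.cached_tokens and derives
// latency from client-side wall-clock measurements.
public sealed class VllmTimingsExtractor : ITimingsExtractor
{
    public TimingsSnapshot Extract(JsonElement rawResponse, RequestTimeline timeline)
    {
        int? cached = null;
        if (rawResponse.TryGetProperty("usage", out var usage) &&
            usage.TryGetProperty("prompt_tokens_details", out var details) &&
            details.TryGetProperty("cached_tokens", out var c))
        {
            cached = c.GetInt32();
        }

        // Time-to-first-chunk stands in for prompt_ms; tokens/sec would need a token count.
        var promptMs = (timeline.FirstChunk - timeline.RequestStart).TotalMilliseconds;
        return new TimingsSnapshot(cached, promptMs, PredictedPerSecond: null);
    }
}
```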

4. Strict model field matching on vLLM

vLLM validates the request's model field against --served-model-name
and rejects mismatches. llama-server is looser — it typically ignores the
value because there's one model per process.

OpenAiCompatibleChatClient already sends ChatOptions.ModelId from
config, so this is largely a configuration concern — but the error path
matters: a misconfigured ModelId against a vLLM backend produces an
opaque 404 or 400 where against llama-server it would "just work."

Action:

  • Ensure the error body from vLLM's model-mismatch response is surfaced
    to users with a clear diagnostic ("the backend reports the served model
    name is X, but ChatOptions.ModelId is Y").
  • Document the requirement in provider config docs.
  • Consider a startup sanity check: after /v1/models is fetched,
    validate ChatOptions.ModelId against the returned list and fail fast
    with an actionable error (see the sketch below).
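A minimal sketch of that sanity check, assuming the served model ids have already been fetched from /v1/models (the method and type names are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical startup check run once after GET /v1/models succeeds.
public static class ServedModelCheck
{
    public static void EnsureModelIsServed(string configuredModelId, IReadOnlyList<string> servedModelIds)
    {
        if (servedModelIds.Contains(configuredModelId, StringComparer.Ordinal))
            return;

        throw new InvalidOperationException(
            $"ChatOptions.ModelId is '{configuredModelId}', but the backend reports serving: " +
            $"{string.Join(", ", servedModelIds)}. vLLM rejects mismatched model names; " +
            "update the provider config or the --served-model-name flag.");
    }
}
```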

5. Error shape and retry policy alignment

vLLM error responses use the OpenAI {"error": {...}} envelope but
different type values than llama-server (vLLM: BadRequestError,
invalid_request_error, etc.; llama-server: free-form strings). The
current error parser (lines 237-259) handles both the nested envelope
and a bare error: "string" fallback, so no breakage expected — but
verify the retry classification in RetryPolicy.cs treats vLLM
context_length_exceeded as non-retryable (it's a permanent failure at
this prompt length, not a transient error).
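A sketch of the intended classification, not the actual RetryPolicy.cs code; the helper name and the exact set of retryable statuses are assumptions:

```csharp
using System;

// Sketch only: a 400 carrying "context_length_exceeded" is a permanent failure for
// this prompt and must not be retried, while 429/5xx remain retryable.
public static class RetryClassificationSketch
{
    public static bool IsRetryable(int statusCode, string? errorTypeOrCode) =>
        statusCode switch
        {
            429 => true,      // rate limited: retry with backoff
            >= 500 => true,   // transient server errors
            400 when string.Equals(errorTypeOrCode, "context_length_exceeded",
                                   StringComparison.OrdinalIgnoreCase) => false,
            _ => false,       // everything else: fail fast
        };
}
```

Retrying a context-length failure only burns tokens and time; it can never succeed without shrinking the prompt.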

Additionally, vLLM pre-validates max_tokens against context length
before generation begins and returns 400 immediately; llama-server
clamps at runtime. Long-prompt sessions that currently "succeed" on
llama-server with truncated output may start failing on vLLM — arguably
a correctness improvement, but worth surfacing the error cleanly rather
than confusing users.

6. Minor: stop-sequence semantics

vLLM stops before emitting the stop string; llama-server historically
sometimes includes the stop token in the content stream. No code in
OpenAiCompatibleChatClient appears to depend on this, but flagged for
awareness in case any downstream code inspects content tails for stop
markers.

7. Minor: /v1/chat/completions response extras (non-timings)

Beyond the llama.cpp timings object (covered in Gap #3), both backends
emit additional non-standard fields:

  • vLLM may add prompt_token_ids / prompt_logprobs (diagnostic-only,
    not consumed by Netclaw)
  • Individual vLLM versions have added various telemetry fields

If strict-mode JSON deserialization is enabled anywhere, verify unknown
fields are tolerated on both sides.
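For what it's worth, System.Text.Json ignores unknown JSON properties by default; only an explicit opt-in makes them fatal. A small probe illustrating both modes (ChoiceProbe is a hypothetical stand-in for whatever DTO the client deserializes into; UnmappedMemberHandling requires .NET 8+):

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

public sealed class ChoiceProbe
{
    [JsonPropertyName("index")] public int Index { get; set; }
}

public static class UnknownFieldToleranceCheck
{
    public static void Run()
    {
        // Simulates a response fragment carrying a non-standard diagnostic field.
        const string json = @"{ ""index"": 0, ""prompt_logprobs"": null }";

        // Default options: the unknown "prompt_logprobs" member is silently ignored.
        _ = JsonSerializer.Deserialize<ChoiceProbe>(json);

        // Strict opt-in: the same payload throws JsonException.
        var strict = new JsonSerializerOptions
        {
            UnmappedMemberHandling = JsonUnmappedMemberHandling.Disallow
        };
        try { JsonSerializer.Deserialize<ChoiceProbe>(json, strict); }
        catch (JsonException) { /* this is the failure mode to avoid in the client */ }
    }
}
```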
