OpenAiCompatibleChatClient is currently exercised against llama.cpp's llama-server. An evaluation of vLLM (vllm.entrypoints.openai.api_server)
as an alternative OpenAI-compatible backend surfaces several compatibility
gaps that should be closed so the two servers are genuinely interchangeable
behind the same provider type.
vLLM is worth supporting: it has first-class AMD ROCm support, uses
PagedAttention for dynamic KV cache allocation (eliminating the
preallocation waste llama.cpp suffers at LLAMA_PARALLEL > 1), and is the
most widely deployed self-hosted OpenAI-compatible server in production.
Being able to swap backends without code changes keeps the provider
abstraction honest.
This issue tracks the work to:
Identify and document the concrete API-surface deltas between
llama-server and vLLM that OpenAiCompatibleChatClient currently
depends on.
Extend the client (and/or OpenAiCompatibleCapabilityResolver) to
handle both backends transparently.
Add integration coverage for vLLM so regressions are caught.
Known compatibility gaps
1. /props endpoint is llama.cpp-only
OpenAiCompatibleCapabilityResolver probes GET /props at startup to
extract:
default_generation_settings.params.n_ctx → context window
modalities.vision → vision support flag
vLLM does not expose /props. The resolver already has a /v1/models
fallback path, but that path reads meta.n_ctx_train from the model
descriptor — which is a llama.cpp-specific extension. vLLM's /v1/models
response is OpenAI-standard and does not include context window or
modality metadata in any documented field.
Action:
Verify the /v1/models fallback actually works against vLLM (spoiler:
it probably returns a default or errors silently).
Add a vLLM-aware code path (see the sketch after this list). Options:
Query /v1/models for the served model id, then consult a small
built-in table keyed by model family
Require an explicit ContextWindow / SupportsVision override in
provider config when the backend can't self-describe
Probe /metrics (vLLM exposes Prometheus metrics including vllm:num_requests_running and vllm:gpu_cache_usage_perc, but not
context window — so this is a dead end for capability detection)
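For illustration, here is a rough sketch of how the first two options could compose. It is not the existing resolver: the type names (BackendCapabilities, CapabilityProbeSketch), the family table, and its context-window values are all hypothetical placeholders. Only the /props and /v1/models endpoints and the default_generation_settings.params.n_ctx path come from the behavior described above.

```csharp
// Hypothetical sketch, not the existing OpenAiCompatibleCapabilityResolver.
// Assumes http.BaseAddress points at the backend root.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public sealed record BackendCapabilities(int? ContextWindow, bool? SupportsVision);

public static class CapabilityProbeSketch
{
    // Small built-in table keyed by a model-family substring. The values here are
    // placeholders; a real table would be curated per supported family.
    private static readonly Dictionary<string, int> FamilyContextWindows = new()
    {
        ["qwen3"] = 32768,
    };

    public static async Task<BackendCapabilities> ResolveAsync(
        HttpClient http, int? configuredContextWindow, bool? configuredVision)
    {
        // llama.cpp path: /props self-describes the loaded model.
        // (modalities.vision probing omitted for brevity.)
        using (var props = await http.GetAsync("props"))
        {
            if (props.IsSuccessStatusCode)
            {
                using var doc = JsonDocument.Parse(await props.Content.ReadAsStringAsync());
                if (doc.RootElement.TryGetProperty("default_generation_settings", out var gen)
                    && gen.TryGetProperty("params", out var p)
                    && p.TryGetProperty("n_ctx", out var nCtx))
                {
                    return new BackendCapabilities(nCtx.GetInt32(), configuredVision);
                }
            }
        }

        // vLLM path: /v1/models is OpenAI-standard and carries no context metadata,
        // so match the served model id against the family table, then fall back to
        // the explicit provider-config override.
        using (var models = await http.GetAsync("v1/models"))
        {
            if (models.IsSuccessStatusCode)
            {
                using var doc = JsonDocument.Parse(await models.Content.ReadAsStringAsync());
                foreach (var model in doc.RootElement.GetProperty("data").EnumerateArray())
                {
                    var id = model.GetProperty("id").GetString() ?? string.Empty;
                    foreach (var (family, contextWindow) in FamilyContextWindows)
                    {
                        if (id.Contains(family, StringComparison.OrdinalIgnoreCase))
                            return new BackendCapabilities(contextWindow, configuredVision);
                    }
                }
            }
        }

        return new BackendCapabilities(configuredContextWindow, configuredVision);
    }
}
```

The third option (probing /metrics) is intentionally left out of the sketch since, as noted above, it cannot yield the context window.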
2. Tool-call parser streaming format depends on vLLM parser choice
vLLM's tool-call parsing is plugin-based via --tool-call-parser <name>.
Different Qwen3-family variants use different parsers, and the streaming
behavior is not uniform across them:
hermes parser (vLLM's general Qwen3 recommendation) has a
documented streaming bug
(vllm-project/vllm#31871,
open as of Feb 2026, v0.15.1 still affected): when stream: true is
set, vLLM returns raw <tool_call>…</tool_call> XML inside delta.content with finish_reason: "stop" instead of the structured tool_calls delta array required by the OpenAI spec. Non-streaming
requests are unaffected.
qwen3_coder parser (vLLM 0.10+, designed for Qwen3-Coder /
Qwen3.5 A3B MoE agent models) is a separate code path. Whether it
exhibits the same streaming degradation as hermes is unverified
— it may be fine, it may have different quirks, it needs actual
testing against a live vLLM instance.
Since OpenAiCompatibleChatClient always sends stream: true (line
156) and tool calls are a first-class feature of Netclaw sessions, both
parsers need to be exercised before declaring either one a safe target.
Good news: Netclaw already has a TextToolCallParser (lines 105-141
and 207-241 of OpenAiCompatibleChatClient.cs) that handles exactly
this pattern — text-mode <tool_call> XML emitted by models that don't
use structured JSON tool calls. It was added for a llama-server Qwen3
quirk and may provide partial coverage for the vLLM degraded format.
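For reference, a simplified sketch of that text-mode pattern follows. It is not the real TextToolCallParser: the regex, the buffering scheme, and the assumption that each block wraps a single JSON object are illustrative only, and the capture work in the action list below is what should confirm the exact shape vLLM emits per parser.

```csharp
// Simplified illustration of the degraded streaming shape; not the real parser.
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ToolCallTextScanSketch
{
    // Assumes each tool call arrives as <tool_call>{ ...json... }</tool_call> text;
    // blocks may be split across multiple SSE delta.content chunks.
    private static readonly Regex ToolCallBlock = new(
        @"<tool_call>\s*(?<payload>\{.*?\})\s*</tool_call>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Append each streamed delta.content fragment and return the JSON payload of every
    // tool-call block whose closing tag has now arrived.
    public static List<string> ExtractCompletedToolCalls(StringBuilder buffer, string deltaContent)
    {
        buffer.Append(deltaContent);
        var text = buffer.ToString();
        var results = new List<string>();
        var consumedUpTo = 0;

        foreach (Match match in ToolCallBlock.Matches(text))
        {
            results.Add(match.Groups["payload"].Value); // e.g. {"name": "...", "arguments": {...}}
            consumedUpTo = match.Index + match.Length;
        }

        if (consumedUpTo > 0)
            buffer.Remove(0, consumedUpTo); // keep any partial trailing block for the next chunk

        return results;
    }
}
```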
Action:
Capture actual vLLM streaming output for a tool-calling request under both --tool-call-parser hermes and --tool-call-parser qwen3_coder, for single-tool and parallel-tool cases (a capture sketch follows this list).
For each parser, compare the emitted format (tag names, attribute
placement, argument JSON serialization, opening/closing tag
boundaries across SSE chunks) against what TextToolCallParser
currently expects.
Extend the parser for any gaps. If the two vLLM parsers emit
meaningfully different shapes, the fallback path needs to handle both
variants.
Add streaming regression tests against captured SSE transcripts so
these paths stay covered.
Document which parser is the recommended target for each supported
Qwen3 variant in provider config docs.
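A rough capture helper for the first action item, assuming a plain OpenAI-style request is enough to trigger a tool call. The tool definition, prompt, and class name are placeholders; the output is simply the raw SSE lines, so later regression tests can replay exactly what the server sent.

```csharp
// Rough transcript-capture sketch; request body and names are placeholders.
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class SseTranscriptCapture
{
    public static async Task CaptureAsync(Uri baseAddress, string model, string outputPath)
    {
        using var http = new HttpClient { BaseAddress = baseAddress };

        // Minimal tool-calling request; run it once per --tool-call-parser configuration
        // and again with a prompt that should trigger parallel tool calls.
        var body = """
        {
          "model": "__MODEL__",
          "stream": true,
          "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
          "tools": [{"type": "function", "function": {"name": "get_weather",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]}}}]
        }
        """.Replace("__MODEL__", model);

        using var request = new HttpRequestMessage(HttpMethod.Post, "v1/chat/completions")
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };

        using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        await using var stream = await response.Content.ReadAsStreamAsync();
        using var reader = new StreamReader(stream);
        await using var writer = new StreamWriter(outputPath);

        string? line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            await writer.WriteLineAsync(line); // each raw "data: {...}" SSE line, verbatim
        }
    }
}
```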
3. llama.cpp timings object extraction breaks on vLLM
llama.cpp's /v1/chat/completions response includes a top-level timings object with fields like prompt_n, prompt_ms, prompt_per_second, predicted_n, predicted_ms, predicted_per_second, and (when prompt caching is active) cached_tokens. This is a llama.cpp-specific extension — vLLM does not
emit it.
A downstream consumer in Netclaw (PR #615, "llama.cpp timings parsing")
extracts cached_tokens, prompt_ms, and predicted_per_second from
this shape and feeds them into eval-suite analysis (specifically the
Multi-Turn Cache Evolution table) and session-level cache telemetry.
On vLLM those fields don't exist, so:
The eval suite's cache-evolution metrics will silently populate as
zero or null against a vLLM backend
Any cache-hit-rate debugging or regression detection driven off cached_tokens will break
Session telemetry that depends on per-request prompt_ms for latency
analysis will lose fidelity
This is more than a "response extras tolerance" concern — it's a real
observability regression when the backend changes.
Action:
Keep the existing llama.cpp extraction (cached_tokens, prompt_ms, predicted_per_second).
Build a vLLM-equivalent extraction path. vLLM exposes equivalent data
via different mechanisms:
Usage field: vLLM's usage object adds prompt_tokens_details.cached_tokens (OpenAI-standard prefix cache
field, supported by vLLM for its automatic prefix cache).
Mapping: timings.cached_tokens ↔ usage.prompt_tokens_details.cached_tokens.
/metrics endpoint (Prometheus): vllm:prompt_tokens_total, vllm:generation_tokens_total, vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds. These are aggregate, not
per-request, so they don't substitute for per-request prompt_ms
at the request level.
Per-request timing: can be derived client-side from wall-clock
measurements around the HTTP call (stream-start → first-chunk →
last-chunk), which is arguably more accurate than trusting the
backend's self-reported timings anyway.
Refactor the timings parser into a backend-agnostic interface: ITimingsExtractor with LlamaCppTimingsExtractor (parses the timings object) and VllmTimingsExtractor (reads usage.prompt_tokens_details.cached_tokens plus client-side wall
clock). Select based on backend detection. A rough interface sketch follows this list.
Update the Multi-Turn Cache Evolution eval metric to use the
abstracted extractor so the table populates correctly regardless of
backend.
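The sketch below shows one possible shape for that interface, assuming a complete (or re-assembled) response body is available as JSON. RequestTimings and the wall-clock parameters are placeholder names, and mapping time-to-first-chunk onto prompt_ms and decode-phase tokens/sec onto predicted_per_second is an approximation, not an exact equivalent.

```csharp
// Sketch only; member names and the RequestTimings shape are placeholders.
using System;
using System.Text.Json;

public sealed record RequestTimings(int? CachedTokens, double? PromptMs, double? PredictedPerSecond);

public interface ITimingsExtractor
{
    RequestTimings Extract(JsonElement responseRoot, TimeSpan wallClockTotal, TimeSpan wallClockToFirstChunk);
}

// llama.cpp: read the backend-reported top-level "timings" extension directly.
public sealed class LlamaCppTimingsExtractor : ITimingsExtractor
{
    public RequestTimings Extract(JsonElement root, TimeSpan total, TimeSpan firstChunk)
    {
        if (!root.TryGetProperty("timings", out var t))
            return new RequestTimings(null, null, null);

        return new RequestTimings(
            t.TryGetProperty("cached_tokens", out var c) ? c.GetInt32() : (int?)null,
            t.TryGetProperty("prompt_ms", out var p) ? p.GetDouble() : (double?)null,
            t.TryGetProperty("predicted_per_second", out var s) ? s.GetDouble() : (double?)null);
    }
}

// vLLM: cached tokens come from usage.prompt_tokens_details.cached_tokens; latency and
// throughput are approximated from client-side wall-clock measurements.
public sealed class VllmTimingsExtractor : ITimingsExtractor
{
    public RequestTimings Extract(JsonElement root, TimeSpan total, TimeSpan firstChunk)
    {
        int? cached = null;
        int? completionTokens = null;

        if (root.TryGetProperty("usage", out var usage))
        {
            if (usage.TryGetProperty("prompt_tokens_details", out var details)
                && details.TryGetProperty("cached_tokens", out var c))
                cached = c.GetInt32();

            if (usage.TryGetProperty("completion_tokens", out var ct))
                completionTokens = ct.GetInt32();
        }

        // Approximations: time-to-first-chunk stands in for prompt_ms, and completion
        // tokens over decode wall-clock stands in for predicted_per_second.
        double? perSecond = completionTokens > 0 && total > firstChunk
            ? completionTokens.Value / (total - firstChunk).TotalSeconds
            : (double?)null;

        return new RequestTimings(cached, firstChunk.TotalMilliseconds, perSecond);
    }
}
```

Which extractor to construct would be decided by whatever backend-detection signal the resolver already produces (e.g. whether /props answered).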
4. Strict model field matching on vLLM
vLLM validates the request's model field against --served-model-name
and rejects mismatches. llama-server is looser — it typically ignores the
value because there's one model per process.
OpenAiCompatibleChatClient already sends ChatOptions.ModelId from
config, so this is largely a configuration concern — but the error path
matters: a misconfigured ModelId against a vLLM backend produces an
opaque 404 or 400 where against llama-server it would "just work."
Action:
Ensure the error body from vLLM's model-mismatch response is surfaced
to users with a clear diagnostic ("the backend reports the served model
name is X, but ChatOptions.ModelId is Y").
Document the requirement in provider config docs.
Consider a startup sanity check: after /v1/models is fetched,
validate ChatOptions.ModelId against the returned list and fail fast
with an actionable error.
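A sketch of that startup check, assuming that throwing on mismatch is acceptable; the class name, the exception choice, and the exact-match comparison are placeholders (some deployments alias model names, so the comparison rule needs a decision of its own).

```csharp
// Startup sanity-check sketch; names and failure behaviour are placeholders.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class ModelIdSanityCheck
{
    public static async Task ValidateAsync(HttpClient http, string configuredModelId)
    {
        using var response = await http.GetAsync("v1/models");
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        List<string> served = doc.RootElement.GetProperty("data")
            .EnumerateArray()
            .Select(m => m.GetProperty("id").GetString())
            .Where(id => !string.IsNullOrEmpty(id))
            .Select(id => id!)
            .ToList();

        if (!served.Contains(configuredModelId, StringComparer.Ordinal))
        {
            // Fail fast with both sides of the mismatch so the user can fix either the
            // provider config or the server's --served-model-name.
            throw new InvalidOperationException(
                $"Backend serves [{string.Join(", ", served)}], but ChatOptions.ModelId is " +
                $"'{configuredModelId}'.");
        }
    }
}
```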
5. Error shape and retry policy alignment
vLLM error responses use the OpenAI {"error": {...}} envelope but
different type values than llama-server (vLLM: BadRequestError, invalid_request_error, etc.; llama-server: free-form strings). The
current error parser (lines 237-259) handles both the nested envelope
and a bare error: "string" fallback, so no breakage expected — but
verify the retry classification in RetryPolicy.cs treats vLLM context_length_exceeded as non-retryable (it's a permanent failure at
this prompt length, not a transient error).
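This is not RetryPolicy.cs's actual shape, just the classification rule spelled out; treating 429 and 5xx as retryable here is an assumption to be checked against the existing policy.

```csharp
// Classification rule only; the real logic lives in RetryPolicy.cs and may differ.
public static class VllmErrorRetryClassification
{
    public static bool IsRetryable(int statusCode, string? errorCode) =>
        statusCode switch
        {
            // context_length_exceeded is permanent for this prompt length: resending the
            // identical request can never succeed, so never retry it.
            400 when errorCode == "context_length_exceeded" => false,
            429 => true,      // assumed transient: rate limited / queue full
            >= 500 => true,   // assumed transient server-side failure
            _ => false,
        };
}
```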
Additionally, vLLM pre-validates max_tokens against context length before generation begins and returns 400 immediately; llama-server
clamps at runtime. Long-prompt sessions that currently "succeed" on
llama-server with truncated output may start failing on vLLM — arguably
a correctness improvement, but worth surfacing the error cleanly rather
than confusing users.
6. Minor: stop-sequence semantics
vLLM stops before emitting the stop string; llama-server historically
sometimes includes the stop token in the content stream. No code in OpenAiCompatibleChatClient appears to depend on this, but flagged for
awareness in case any downstream code inspects content tails for stop
markers.
7. Minor: /v1/chat/completions response extras (non-timings)
Beyond the llama.cpp timings object (covered in Gap #3), both backends
emit additional non-standard fields:
prompt_token_ids / prompt_logprobs (diagnostic-only, not consumed by Netclaw)
If strict-mode JSON deserialization is enabled anywhere, verify unknown
fields are tolerated on both sides.
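A quick way to pin that down, assuming xUnit is available for tests: System.Text.Json skips unmapped members by default, so a fixture like the one below only fails if strict handling (e.g. JsonUnmappedMemberHandling.Disallow) were ever applied to the response DTOs. UsageDto here is a stand-in, not the real DTO.

```csharp
// Sketch of a tolerance test; UsageDto stands in for the real response DTOs.
using System.Text.Json;
using Xunit;

public sealed class ExtraFieldToleranceTests
{
    private sealed record UsageDto(int prompt_tokens, int completion_tokens);

    [Fact]
    public void Unknown_backend_extras_are_ignored()
    {
        // Payload carries vLLM-style extras (prompt_token_ids) plus an arbitrary
        // future field; default System.Text.Json options should skip both.
        const string payload = """
        {"prompt_tokens": 10, "completion_tokens": 5,
         "prompt_token_ids": [1, 2, 3], "some_future_field": {"nested": true}}
        """;

        var usage = JsonSerializer.Deserialize<UsageDto>(payload);

        Assert.NotNull(usage);
        Assert.Equal(10, usage!.prompt_tokens);
        Assert.Equal(5, usage.completion_tokens);
    }
}
```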
References
vllm-project/vllm#31871: streaming tool-call bug (affects the hermes parser specifically; unverified for qwen3_coder)