Add multi-image and prompt token id support to vLLM server #5228
bangawayoo wants to merge 6 commits into huggingface:main
Conversation
thanks for the pr. to align closely with the vllm api, I'd prefer not to have
- `GenerateRequest`: make `prompts` optional; add `prompt_token_ids: list[list[int]] | None`; change `images` to accept per-prompt lists of base64 strings (`list[list[str] | str | None] | None`); add `mm_processor_kwargs: dict | None`
- generate handler: build vLLM prompt dicts using token IDs directly when provided; decode multi-image lists into `multi_modal_data` with a list of PIL Images; forward `mm_processor_kwargs` per prompt
- `VLLMClient.generate()`: add `prompt_token_ids` and `mm_processor_kwargs` params; handle per-prompt image lists in the base64 conversion step
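The image normalization described in the first bullet can be sketched in isolation. This is not the PR's actual code; the helper name and return shape are illustrative, showing only how a `None`, single-string, or list-of-strings element would map to zero, one, or several decoded images per prompt:

```python
import base64


def normalize_images(images, n_prompts):
    """Illustrative sketch: turn the PR's `images` field into one list of
    decoded byte blobs per prompt. Each element of `images` may be None,
    a single base64 string, or a list of base64 strings."""
    if images is None:
        # No images at all: an empty list for each prompt.
        return [[] for _ in range(n_prompts)]
    per_prompt = []
    for entry in images:
        if entry is None:
            per_prompt.append([])
        elif isinstance(entry, str):
            # Single base64 string: one image for this prompt.
            per_prompt.append([base64.b64decode(entry)])
        else:
            # List of base64 strings: multiple images for this prompt.
            per_prompt.append([base64.b64decode(s) for s in entry])
    return per_prompt
```

In the real handler each blob would then be opened as a PIL Image and placed under `multi_modal_data`.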
vLLM's multimodal preprocessor hangs when it receives raw token IDs alongside images — it goes through a dummy-text code path that is incompatible with models like Qwen3-VL. When prompt_token_ids are provided with images, decode them to text so vLLM uses the working text+images preprocessing path instead. Also add request-level logging for debugging multimodal requests.
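The dispatch rule this commit describes can be condensed into a small function. This is a sketch, not the PR's code: `build_prompt_row` and `decode_fn` are hypothetical names, with `decode_fn` standing in for the tokenizer's `decode` (e.g. called with `skip_special_tokens=False`):

```python
def build_prompt_row(prompt, has_images, decode_fn):
    """Sketch of the commit's dispatch rule. `prompt` is either a text
    string or a list of token IDs; `decode_fn` stands in for
    tokenizer.decode."""
    is_token_ids = not isinstance(prompt, str)
    if is_token_ids and has_images:
        # vLLM's multimodal preprocessor mishandles raw token IDs plus
        # images, so fall back to the working text+images path.
        return {"prompt": decode_fn(prompt)}
    if is_token_ids:
        # Text-only request: token IDs can be passed through directly.
        return {"prompt_token_ids": prompt}
    return {"prompt": prompt}
```

Note the trade-off: decoding and re-tokenizing reintroduces exactly the BPE-merge ambiguity that `prompt_token_ids` was meant to avoid, which is why this path is limited to the multimodal case.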
…c field

Merge the separate `prompt_token_ids` parameter into `prompts`, which now accepts both `list[str]` (text) and `list[list[int]]` (token IDs). This aligns more closely with the vLLM API convention and simplifies the interface.
Force-pushed from 82bede3 to 787f15c.
Thanks for the suggestion. That makes sense.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
if image is not None:
    row["multi_modal_data"] = {"image": Image.open(BytesIO(base64.b64decode(image)))}
...
for i in range(n_prompts):
    imgs = images_per_prompt[i]
```
Lost strict length validation between images and prompts (Low Severity)
The old code used zip(request.prompts, request.images, strict=True) which raised a clear ValueError when the images and prompts lists had different lengths. The new code uses images_per_prompt[i] with range(n_prompts), which silently ignores extra images when len(images) > len(prompts). This could mask a caller error where more images than prompts are provided, leading to some images being quietly dropped without any indication.
```python
if is_token_ids:
    if has_images:
        # Decode to text so vLLM uses the text+images code path.
        prompt_text = _tokenizer.decode(request.prompts[i], skip_special_tokens=False)
```
I don't think we want to decode the tokens.
```python
        len(prompt_text),
    )
else:
    row = {"prompt_token_ids": request.prompts[i]}
```
thanks, this PR is doing multiple things at once: ideally we would decouple things to ease review. did you test the code? I see `prompt_token_ids` which is not a valid arg in vllm.
Thank you for reviewing this. It looks like a262d9f (#5227) already adds support for multi-images, so I'll close this PR to avoid duplication. Regarding your question about


What does this PR do?
Context
Many vision-language models require multiple images per prompt (e.g., Alpamayo-R1 uses 16 camera frames for autonomous driving), but the current `/generate/` endpoint only supports one image per prompt. Additionally, GRPO training rollouts need to send pre-tokenized prompts (`prompt_token_ids`) to avoid re-tokenization issues from BPE merge ambiguities (see #5225). This PR adds both capabilities to the vLLM server and client.
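The re-tokenization issue can be shown with a toy vocabulary. The vocab, IDs, and greedy-longest-match encoder below are entirely made up for illustration; real BPE tokenizers exhibit the same effect, which is why sending raw `prompt_token_ids` matters for GRPO rollouts:

```python
# Toy illustration of BPE merge ambiguity (vocab and IDs are invented).
vocab = {"ab": 0, "a": 1, "b": 2}
inv = {v: k for k, v in vocab.items()}


def decode(ids):
    return "".join(inv[i] for i in ids)


def encode(text):
    # Greedy longest-match, which is how BPE-style tokenizers effectively
    # behave: adjacent "a" and "b" always merge into the "ab" token.
    out, i = [], 0
    while i < len(text):
        if text[i : i + 2] in vocab:
            out.append(vocab[text[i : i + 2]])
            i += 2
        else:
            out.append(vocab[text[i]])
            i += 1
    return out


sampled = [1, 2]                     # a model may sample "a", "b" separately
roundtrip = encode(decode(sampled))  # greedy re-encoding merges them
```

Here `roundtrip` is `[0]`, not `[1, 2]`: the decoded text re-tokenizes to different IDs than the model actually produced, so training on the re-tokenized sequence would compute log-probs for tokens the policy never emitted.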
This enables GRPO training with multimodal VLMs that require multiple images served via `trl vllm-serve`.

Changes
`vllm_serve.py` — `/generate/` endpoint:
- `GenerateRequest.prompts` is now optional (`list[str] | None`). Either `prompts` or `prompt_token_ids` must be provided.
- Added `prompt_token_ids: list[list[int]] | None` for pre-tokenized prompts.
- Changed `images` from `list[str] | None` to `list[list[str] | str | None] | None` — each element can be a single base64 string (one image) or a list of base64 strings (multiple images per prompt).
- Added `mm_processor_kwargs: dict | None`, forwarded to vLLM per prompt (e.g., `{"min_pixels": 163840, "max_pixels": 196608}`).
- When `prompt_token_ids` is provided alongside images, the handler decodes token IDs to text before dispatching to the vLLM worker. This is necessary because vLLM's multimodal preprocessor uses a dummy-text code path for raw token IDs + images that is incompatible with some vision models (e.g., Qwen3-VL).

`vllm_client.py` — `VLLMClient.generate()`:
- `prompts` is now optional. Added `prompt_token_ids: list[list[int]] | None` and `mm_processor_kwargs: dict | None` parameters.
- `images` can be a single `PIL.Image` or a `list[PIL.Image]` for multi-image prompts.
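Putting the client-side fields together, a request body along these lines could be assembled. This is a hedged sketch, not the PR's implementation: `make_payload` is a hypothetical helper, field names follow the PR description, and images are passed as raw bytes here (rather than PIL Images) to keep the example self-contained:

```python
import base64


def make_payload(prompts=None, prompt_token_ids=None, images=None,
                 mm_processor_kwargs=None):
    """Hypothetical sketch of assembling a /generate/ JSON payload with
    the fields this PR adds. Each `images` element may be None, raw bytes
    (one image), or a list of raw bytes (multiple images per prompt)."""
    def b64(entry):
        if entry is None:
            return None
        if isinstance(entry, (bytes, bytearray)):
            return base64.b64encode(entry).decode()
        return [base64.b64encode(b).decode() for b in entry]

    payload = {}
    if prompts is not None:
        payload["prompts"] = prompts
    if prompt_token_ids is not None:
        payload["prompt_token_ids"] = prompt_token_ids
    if images is not None:
        payload["images"] = [b64(e) for e in images]
    if mm_processor_kwargs is not None:
        payload["mm_processor_kwargs"] = mm_processor_kwargs
    return payload
```

Omitted fields stay out of the payload entirely, matching the schema's `None` defaults and keeping old-style `prompts` + single-image calls unchanged on the wire.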
All changes are additive. Existing callers using `prompts: list[str]` with `images: list[str]` are unaffected:
- New parameters have `None` defaults and are ignored when not provided.
- Token-ID decoding only happens when `prompt_token_ids` is used with images.
- `images: ["<b64>"]` is a valid `list[str | ...]` and flows through the same logic as before.

Tests
28 passed, 8 skipped (require 3+ GPUs), 1 xfailed (pre-existing upstream issue).
Note
Medium Risk
Extends the public `/generate/` request schema and changes prompt dispatching logic (including token-id decoding when images are present), which could affect multimodal generation behavior and backward compatibility if clients rely on the old payload shape.

Overview
Adds multi-image-per-prompt and pre-tokenized prompt (token IDs) support end-to-end for `trl vllm-serve` and `VLLMClient.generate()`.

The client now accepts `prompts` as either text or `list[list[int]]`, encodes `images` as per-prompt base64 strings or lists of strings, and forwards optional `mm_processor_kwargs` to the server.

The server `/generate/` endpoint updates its Pydantic schema accordingly, normalizes single vs multi-image inputs, forwards `mm_processor_kwargs` per prompt, and conditionally decodes token-id prompts to text when images are present to avoid vLLM multimodal preprocessing hangs; logging around dispatch/worker responses was also expanded.

Tests add CPU-only coverage for image encoding shapes and payload forwarding, plus an integration test that generates from `prompt_token_ids` against a live server.

Written by Cursor Bugbot for commit 1aeeb74.