Add multi-image and prompt token id support to vLLM server #5228
bangawayoo wants to merge 6 commits into huggingface:main
Conversation
thanks for the pr. to align closely with the vllm api, I'd prefer not to have
- `GenerateRequest`: make `prompts` optional; add `prompt_token_ids: list[list[int]] | None`; change `images` to accept per-prompt lists of base64 strings (`list[list[str] | str | None] | None`); add `mm_processor_kwargs: dict | None`
- generate handler: build vLLM prompt dicts using token IDs directly when provided; decode multi-image lists into `multi_modal_data` with a list of PIL Images; forward `mm_processor_kwargs` per prompt
- `VLLMClient.generate()`: add `prompt_token_ids` and `mm_processor_kwargs` params; handle per-prompt image lists in the base64 conversion step
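The image normalization described in the first bullet can be sketched in isolation. This is not the PR's actual code; the helper name and return shape are illustrative, showing only how a `None`, single-string, or list-of-strings element would map to zero, one, or several decoded images per prompt:

```python
import base64


def normalize_images(images, n_prompts):
    """Illustrative sketch: turn the PR's `images` field into one list of
    decoded byte blobs per prompt. Each element of `images` may be None,
    a single base64 string, or a list of base64 strings."""
    if images is None:
        # No images at all: an empty list for each prompt.
        return [[] for _ in range(n_prompts)]
    per_prompt = []
    for entry in images:
        if entry is None:
            per_prompt.append([])
        elif isinstance(entry, str):
            # Single base64 string: one image for this prompt.
            per_prompt.append([base64.b64decode(entry)])
        else:
            # List of base64 strings: multiple images for this prompt.
            per_prompt.append([base64.b64decode(s) for s in entry])
    return per_prompt
```

In the real handler each blob would then be opened as a PIL Image and placed under `multi_modal_data`.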
vLLM's multimodal preprocessor hangs when it receives raw token IDs alongside images — it goes through a dummy-text code path that is incompatible with models like Qwen3-VL. When prompt_token_ids are provided with images, decode them to text so vLLM uses the working text+images preprocessing path instead. Also add request-level logging for debugging multimodal requests.
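The dispatch rule this commit describes can be condensed into a small function. This is a sketch, not the PR's code: `build_prompt_row` and `decode_fn` are hypothetical names, with `decode_fn` standing in for the tokenizer's `decode` (e.g. called with `skip_special_tokens=False`):

```python
def build_prompt_row(prompt, has_images, decode_fn):
    """Sketch of the commit's dispatch rule. `prompt` is either a text
    string or a list of token IDs; `decode_fn` stands in for
    tokenizer.decode."""
    is_token_ids = not isinstance(prompt, str)
    if is_token_ids and has_images:
        # vLLM's multimodal preprocessor mishandles raw token IDs plus
        # images, so fall back to the working text+images path.
        return {"prompt": decode_fn(prompt)}
    if is_token_ids:
        # Text-only request: token IDs can be passed through directly.
        return {"prompt_token_ids": prompt}
    return {"prompt": prompt}
```

Note the trade-off: decoding and re-tokenizing reintroduces exactly the BPE-merge ambiguity that `prompt_token_ids` was meant to avoid, which is why this path is limited to the multimodal case.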
…c field

Merge the separate `prompt_token_ids` parameter into `prompts`, which now accepts both `list[str]` (text) and `list[list[int]]` (token IDs). This aligns more closely with the vLLM API convention and simplifies the interface.
Force-pushed from 82bede3 to 787f15c.
Thanks for the suggestion. That makes sense.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
if image is not None:
    row["multi_modal_data"] = {"image": Image.open(BytesIO(base64.b64decode(image)))}
...
for i in range(n_prompts):
    imgs = images_per_prompt[i]
```
Lost strict length validation between images and prompts (Low Severity)
The old code used zip(request.prompts, request.images, strict=True) which raised a clear ValueError when the images and prompts lists had different lengths. The new code uses images_per_prompt[i] with range(n_prompts), which silently ignores extra images when len(images) > len(prompts). This could mask a caller error where more images than prompts are provided, leading to some images being quietly dropped without any indication.
```python
if is_token_ids:
    if has_images:
        # Decode to text so vLLM uses the text+images code path.
        prompt_text = _tokenizer.decode(request.prompts[i], skip_special_tokens=False)
```
I don't think we want to decode the tokens.
```python
        len(prompt_text),
    )
else:
    row = {"prompt_token_ids": request.prompts[i]}
```
thanks, this PR is doing multiple things at once: ideally we would decouple things to ease review. did you test the code? I see `prompt_token_ids` which is not a valid arg in vllm.
Thank you for reviewing this. It looks like a262d9f (#5227) already adds support for multi-images, so I'll close this PR to avoid duplication. Regarding your question about


What does this PR do?
Context
Many vision-language models require multiple images per prompt (e.g., Alpamayo-R1 uses 16 camera frames for autonomous driving), but the current `/generate/` endpoint only supports one image per prompt. Additionally, GRPO training rollouts need to send pre-tokenized prompts (`prompt_token_ids`) to avoid re-tokenization issues from BPE merge ambiguities (see #5225). This PR adds both capabilities to the vLLM server and client.
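The re-tokenization issue can be shown with a toy vocabulary. The vocab, IDs, and greedy-longest-match encoder below are entirely made up for illustration; real BPE tokenizers exhibit the same effect, which is why sending raw `prompt_token_ids` matters for GRPO rollouts:

```python
# Toy illustration of BPE merge ambiguity (vocab and IDs are invented).
vocab = {"ab": 0, "a": 1, "b": 2}
inv = {v: k for k, v in vocab.items()}


def decode(ids):
    return "".join(inv[i] for i in ids)


def encode(text):
    # Greedy longest-match, which is how BPE-style tokenizers effectively
    # behave: adjacent "a" and "b" always merge into the "ab" token.
    out, i = [], 0
    while i < len(text):
        if text[i : i + 2] in vocab:
            out.append(vocab[text[i : i + 2]])
            i += 2
        else:
            out.append(vocab[text[i]])
            i += 1
    return out


sampled = [1, 2]                     # a model may sample "a", "b" separately
roundtrip = encode(decode(sampled))  # greedy re-encoding merges them
```

Here `roundtrip` is `[0]`, not `[1, 2]`: the decoded text re-tokenizes to different IDs than the model actually produced, so training on the re-tokenized sequence would compute log-probs for tokens the policy never emitted.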
This enables GRPO training with multimodal VLMs that require multiple images served via `trl vllm-serve`.

Changes
`vllm_serve.py` — `/generate/` endpoint:
- `GenerateRequest.prompts` is now optional (`list[str] | None`). Either `prompts` or `prompt_token_ids` must be provided.
- Added `prompt_token_ids: list[list[int]] | None` for pre-tokenized prompts.
- Changed `images` from `list[str] | None` to `list[list[str] | str | None] | None` — each element can be a single base64 string (one image) or a list of base64 strings (multiple images per prompt).
- Added `mm_processor_kwargs: dict | None`, forwarded to vLLM per prompt (e.g., `{"min_pixels": 163840, "max_pixels": 196608}`).
- When `prompt_token_ids` is provided alongside images, the handler decodes token IDs to text before dispatching to the vLLM worker. This is necessary because vLLM's multimodal preprocessor uses a dummy-text code path for raw token IDs + images that is incompatible with some vision models (e.g., Qwen3-VL).

`vllm_client.py` — `VLLMClient.generate()`:
- `prompts` is now optional. Added `prompt_token_ids: list[list[int]] | None` and `mm_processor_kwargs: dict | None` parameters.
- `images` can be a single `PIL.Image` or a `list[PIL.Image]` for multi-image prompts.
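Putting the client-side fields together, a request body along these lines could be assembled. This is a hedged sketch, not the PR's implementation: `make_payload` is a hypothetical helper, field names follow the PR description, and images are passed as raw bytes here (rather than PIL Images) to keep the example self-contained:

```python
import base64


def make_payload(prompts=None, prompt_token_ids=None, images=None,
                 mm_processor_kwargs=None):
    """Hypothetical sketch of assembling a /generate/ JSON payload with
    the fields this PR adds. Each `images` element may be None, raw bytes
    (one image), or a list of raw bytes (multiple images per prompt)."""
    def b64(entry):
        if entry is None:
            return None
        if isinstance(entry, (bytes, bytearray)):
            return base64.b64encode(entry).decode()
        return [base64.b64encode(b).decode() for b in entry]

    payload = {}
    if prompts is not None:
        payload["prompts"] = prompts
    if prompt_token_ids is not None:
        payload["prompt_token_ids"] = prompt_token_ids
    if images is not None:
        payload["images"] = [b64(e) for e in images]
    if mm_processor_kwargs is not None:
        payload["mm_processor_kwargs"] = mm_processor_kwargs
    return payload
```

Omitted fields stay out of the payload entirely, matching the schema's `None` defaults and keeping old-style `prompts` + single-image calls unchanged on the wire.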
All changes are additive. Existing callers using `prompts: list[str]` with `images: list[str]` are unaffected:
- New parameters have `None` defaults and are ignored when not provided.
- Token-ID decoding only happens when `prompt_token_ids` is used with images.
- `images: ["<b64>"]` is a valid `list[str | ...]` and flows through the same logic as before.

Tests
28 passed, 8 skipped (require 3+ GPUs), 1 xfailed (pre-existing upstream issue).
Note
Medium Risk
Extends the public `/generate/` request schema and changes prompt dispatching logic (including token-id decoding when images are present), which could affect multimodal generation behavior and backward compatibility if clients rely on the old payload shape.

Overview
Adds multi-image-per-prompt and pre-tokenized prompt (token IDs) support end-to-end for `trl vllm-serve` and `VLLMClient.generate()`.

The client now accepts `prompts` as either text or `list[list[int]]`, encodes `images` as per-prompt base64 strings or lists of strings, and forwards optional `mm_processor_kwargs` to the server.

The server `/generate/` endpoint updates its Pydantic schema accordingly, normalizes single vs multi-image inputs, forwards `mm_processor_kwargs` per prompt, and conditionally decodes token-id prompts to text when images are present to avoid vLLM multimodal preprocessing hangs; logging around dispatch/worker responses was also expanded.

Tests add CPU-only coverage for image encoding shapes and payload forwarding, plus an integration test that generates from `prompt_token_ids` against a live server.

Written by Cursor Bugbot for commit 1aeeb74.