
Add multi-image and prompt token id support to vLLM server#5228

Closed
bangawayoo wants to merge 6 commits into huggingface:main from bangawayoo:feat/multi-image-prompt-token-ids

Conversation


@bangawayoo bangawayoo commented Mar 6, 2026

What does this PR do?

Context

Many vision-language models require multiple images per prompt (e.g., Alpamayo-R1 uses 16 camera frames for autonomous driving), but the current /generate/ API only supports one image per prompt. Additionally, GRPO training rollouts need to send pre-tokenized prompts (prompt_token_ids) to avoid re-tokenization issues from BPE merge ambiguities (see #5225).

This PR adds both capabilities to the vLLM server and client.

This enables GRPO training with multimodal VLMs that require multiple images served via trl vllm-serve:

# Training rollout sends pre-tokenized prompts with multiple camera images
result = vllm_client.generate(
    prompt_token_ids=[[151644, 8948, ...]],
    images=[[pil_cam1, pil_cam2, pil_cam3, pil_cam4]],
    mm_processor_kwargs={"min_pixels": 163840, "max_pixels": 196608},
    temperature=0.6,
    max_tokens=256,
)

Changes

vllm_serve.py: /generate/ endpoint

  • GenerateRequest.prompts is now optional (list[str] | None). Either prompts or prompt_token_ids must be provided.
  • Added prompt_token_ids: list[list[int]] | None for pre-tokenized prompts.
  • Changed images from list[str] | None to list[list[str] | str | None] | None — each element can be a single base64 string (one image) or a list of base64 strings (multiple images per prompt).
  • Added mm_processor_kwargs: dict | None forwarded to vLLM per prompt (e.g., {"min_pixels": 163840, "max_pixels": 196608}).
  • When prompt_token_ids is provided alongside images, the handler decodes token IDs to text before dispatching to the vLLM worker. This is necessary because vLLM's multimodal preprocessor uses a dummy-text code path for raw token IDs + images that is incompatible with some vision models (e.g., Qwen3-VL).
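The decode-on-images behavior in the last bullet can be sketched as follows. This is a simplified sketch, not the actual handler: `build_prompt_row` and its signature are illustrative, and `tokenizer` is assumed to expose a Hugging Face-style `decode()`.

```python
def build_prompt_row(prompt, images, tokenizer):
    """Build a vLLM prompt dict. `prompt` is either a text string or a list
    of token IDs; `images` is a list of decoded images (or None)."""
    is_token_ids = not isinstance(prompt, str)
    if is_token_ids and images:
        # vLLM's multimodal preprocessor takes a dummy-text path for raw
        # token IDs + images, which is incompatible with some vision models
        # (e.g. Qwen3-VL), so decode back to text and use the working
        # text+images path instead.
        row = {"prompt": tokenizer.decode(prompt, skip_special_tokens=False)}
    elif is_token_ids:
        row = {"prompt_token_ids": list(prompt)}
    else:
        row = {"prompt": prompt}
    if images:
        row["multi_modal_data"] = {"image": images}
    return row
```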

vllm_client.py: VLLMClient.generate()

  • prompts is now optional. Added prompt_token_ids: list[list[int]] | None and mm_processor_kwargs: dict | None parameters.
  • Image encoding supports per-prompt lists: each element in images can be a single PIL.Image or a list[PIL.Image] for multi-image prompts.
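The per-prompt encoding rule can be sketched like this. It is a sketch only: raw bytes stand in for the PIL images the real client serializes before base64-encoding, and the function name is hypothetical.

```python
import base64

def encode_images_field(images):
    """Encode the client-side `images` argument. Each element may be a
    single image, a list of images (multi-image prompt), or None; here an
    image is represented by its raw bytes."""
    def to_b64(img_bytes):
        return base64.b64encode(img_bytes).decode("utf-8")

    encoded = []
    for entry in images:
        if entry is None:
            encoded.append(None)
        elif isinstance(entry, (list, tuple)):
            encoded.append([to_b64(img) for img in entry])
        else:
            encoded.append(to_b64(entry))
    return encoded
```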

Backward compatibility

All changes are additive. Existing callers using prompts: list[str] with images: list[str] are unaffected:

  • New fields have None defaults and are ignored when not provided.
  • The token-ID-to-text decode only triggers when prompt_token_ids is used with images.
  • Single-image images: ["<b64>"] is a valid list[str | ...] and flows through the same logic as before.
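The single-image compatibility path can be illustrated with a small normalization helper (hypothetical name; the real logic lives in the /generate/ handler):

```python
def normalize_images(images, n_prompts):
    """Normalize the `images` field so every prompt maps to either None or a
    list of base64 strings. A bare string (the old single-image form) becomes
    a one-element list and flows through the same multi-image logic."""
    if images is None:
        return [None] * n_prompts
    if len(images) != n_prompts:
        raise ValueError("images must have the same length as prompts")
    normalized = []
    for entry in images:
        if entry is None:
            normalized.append(None)
        elif isinstance(entry, str):
            normalized.append([entry])
        else:
            normalized.append(list(entry))
    return normalized
```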

Tests

28 passed, 8 skipped (require 3+ GPUs), 1 xfailed (pre-existing upstream issue).

Full pytest output
tests/test_vllm_client_server.py::TestChunkList::test_even_split PASSED
tests/test_vllm_client_server.py::TestChunkList::test_uneven_split PASSED
tests/test_vllm_client_server.py::TestChunkList::test_more_chunks_than_elements PASSED
tests/test_vllm_client_server.py::TestChunkList::test_n_equals_len PASSED
tests/test_vllm_client_server.py::TestChunkList::test_n_is_1 PASSED
tests/test_vllm_client_server.py::TestChunkList::test_single_element_list PASSED
tests/test_vllm_client_server.py::TestChunkList::test_any_dtype PASSED
tests/test_vllm_client_server.py::TestExtractLogprobs::test_extract_logprobs_sorts_by_rank_and_replaces_nan PASSED
tests/test_vllm_client_server.py::TestExtractLogprobs::test_extract_logprobs_returns_none_token_ids_when_logprobs_missing PASSED
tests/test_vllm_client_server.py::TestVLLMClientImageEncoding::test_single_image_per_prompt PASSED
tests/test_vllm_client_server.py::TestVLLMClientImageEncoding::test_multi_image_per_prompt PASSED
tests/test_vllm_client_server.py::TestVLLMClientImageEncoding::test_none_image_in_list PASSED
tests/test_vllm_client_server.py::TestVLLMClientImageEncoding::test_prompt_token_ids_forwarded PASSED
tests/test_vllm_client_server.py::TestVLLMClientImageEncoding::test_neither_prompts_nor_token_ids_raises PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_generate PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_chat PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_generate_with_params PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_update_model_params PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_generate_with_prompt_token_ids PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_reset_prefix_cache PASSED
tests/test_vllm_client_server.py::TestVLLMClientServer::test_logprobs_match_with_non_default_sampling XFAIL
tests/test_vllm_client_server.py::TestVLLMClientServerBaseURL::test_generate PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerBaseURL::test_chat PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerBaseURL::test_generate_with_params PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerBaseURL::test_update_model_params PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerBaseURL::test_reset_prefix_cache PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerTP::test_generate SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerTP::test_chat SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerTP::test_update_model_params SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerTP::test_reset_prefix_cache SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerDP::test_generate SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerDP::test_chat SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerDP::test_update_model_params SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerDP::test_reset_prefix_cache SKIPPED (requires 3+ GPUs)
tests/test_vllm_client_server.py::TestVLLMClientServerDeviceParameter::test_init_communicator_with_device_int PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerDeviceParameter::test_init_communicator_with_device_string PASSED
tests/test_vllm_client_server.py::TestVLLMClientServerDeviceParameter::test_init_communicator_with_torch_device PASSED

============= 28 passed, 8 skipped, 1 xfailed in 299.77s (0:04:59) =============

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Note

Medium Risk
Extends the public /generate/ request schema and changes prompt dispatching logic (including token-id decoding when images are present), which could affect multimodal generation behavior and backward compatibility if clients rely on the old payload shape.

Overview
Adds multi-image-per-prompt and pre-tokenized prompt (token IDs) support end-to-end for trl vllm-serve and VLLMClient.generate().

The client now accepts prompts as either text or list[list[int]], encodes images as per-prompt base64 strings or lists of strings, and forwards optional mm_processor_kwargs to the server.

The server /generate/ endpoint updates its Pydantic schema accordingly, normalizes single vs multi-image inputs, forwards mm_processor_kwargs per prompt, and conditionally decodes token-id prompts to text when images are present to avoid vLLM multimodal preprocessing hangs; logging around dispatch/worker responses was also expanded.

Tests add CPU-only coverage for image encoding shapes and payload forwarding, plus an integration test that generates from prompt_token_ids against a live server.

Written by Cursor Bugbot for commit 1aeeb74.

@qgallouedec
Member

Thanks for the PR. To align more closely with the vLLM API, I'd prefer not to have prompt_token_ids and to use prompts for both text and token IDs.

- GenerateRequest: make `prompts` optional, add `prompt_token_ids:
  list[list[int]] | None`, change `images` to accept per-prompt lists
  of base64 strings (`list[list[str] | str | None] | None`), add
  `mm_processor_kwargs: dict | None`
- generate handler: build vLLM prompt dicts using token IDs directly
  when provided; decode multi-image lists into `multi_modal_data`
  with a list of PIL Images; forward `mm_processor_kwargs` per prompt
- VLLMClient.generate(): add `prompt_token_ids` and `mm_processor_kwargs`
  params; handle per-prompt image lists in the base64 conversion step
vLLM's multimodal preprocessor hangs when it receives raw token IDs
alongside images — it goes through a dummy-text code path that is
incompatible with models like Qwen3-VL. When prompt_token_ids are
provided with images, decode them to text so vLLM uses the working
text+images preprocessing path instead.

Also add request-level logging for debugging multimodal requests.
…c field

Merge the separate `prompt_token_ids` parameter into `prompts`, which now
accepts both `list[str]` (text) and `list[list[int]]` (token IDs). This
aligns more closely with the vLLM API convention and simplifies the
interface.
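Under that convention, dispatch only needs a per-element type check. A sketch with an illustrative name, assuming text prompts arrive as str and token-ID prompts as lists of ints:

```python
def build_vllm_rows(prompts):
    """Map the merged `prompts` field to vLLM-style prompt dicts: strings
    become {"prompt": ...}, token-ID lists become {"prompt_token_ids": ...}."""
    rows = []
    for p in prompts:
        if isinstance(p, str):
            rows.append({"prompt": p})
        else:
            rows.append({"prompt_token_ids": list(p)})
    return rows
```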
@bangawayoo bangawayoo force-pushed the feat/multi-image-prompt-token-ids branch from 82bede3 to 787f15c Compare March 9, 2026 04:12
@bangawayoo
Author

Thanks for the suggestion. That makes sense.
I updated prompts to support both texts and ints.

@bangawayoo bangawayoo changed the title Add multi-image and prompt_token_ids support to vLLM server Add multi-image and prompt token id support to vLLM server Mar 9, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


if image is not None:
    row["multi_modal_data"] = {"image": Image.open(BytesIO(base64.b64decode(image)))}

for i in range(n_prompts):
    imgs = images_per_prompt[i]

Lost strict length validation between images and prompts

Low Severity

The old code used zip(request.prompts, request.images, strict=True) which raised a clear ValueError when the images and prompts lists had different lengths. The new code uses images_per_prompt[i] with range(n_prompts), which silently ignores extra images when len(images) > len(prompts). This could mask a caller error where more images than prompts are provided, leading to some images being quietly dropped without any indication.


if is_token_ids:
    if has_images:
        # Decode to text so vLLM uses the text+images code path.
        prompt_text = _tokenizer.decode(request.prompts[i], skip_special_tokens=False)
Member

I don't think we want to decode the tokens

len(prompt_text),
)
else:
row = {"prompt_token_ids": request.prompts[i]}
Member

prompt_token_ids?

@qgallouedec
Member

Thanks. This PR is doing multiple things at once:

ideally we would decouple them to ease review.

Did you test the code? I see prompt_token_ids, which is not a valid arg in vLLM.

@bangawayoo
Author

Thank you for reviewing this. It looks like a262d9f (#5227) already adds support for multiple images, so I'll close this PR to avoid duplication.

Regarding your question about prompt_token_ids: I was using vllm version 0.11.0, which does support this argument (vllm/inputs/parse.py#L127).

@bangawayoo bangawayoo closed this Mar 10, 2026
