fix(transformers): include image entries in chat-template content for VLM#6
fix(transformers): include image entries in chat-template content for VLM#6AlanKharebov wants to merge 1 commit into
Conversation
… VLM
The VLM (processor) path was applying the chat template with content as
a plain prompt string. For Qwen3-VL the template then emits no
<|vision_start|><|image_pad|><|vision_end|> placeholder tokens, so when
the processor splices in the image features the forward pass raises:
ValueError: Image features and image tokens do not match:
tokens: 0, features 1333
Fix: when there are images and a processor (VLM mode), pass structured
content (image entries + text entry) to apply_chat_template so the
placeholder tokens land in the prompt.
Closes #5
There was a problem hiding this comment.
Pull request overview
This PR fixes VLM image generation failures in the Transformers adapter by ensuring that, when images are provided and a processor is in use, apply_chat_template() receives structured chat content that includes image placeholder entries (so the processor emits the required image placeholder tokens).
Changes:
- Build
messages[0]["content"]as a multimodal content list ([{type: "image"}, …, {type: "text", …}]) whenkwargs["images"]is present and the head is in VLM/processor mode. - Preserve prior behavior for text-only calls (content remains a plain string prompt).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| images_for_template = kwargs.get("images") | ||
| if images_for_template and self._processor is not None: | ||
| content: Any = [{"type": "image"} for _ in images_for_template] + [ | ||
| {"type": "text", "text": prompt}, | ||
| ] |
|
Closing — after checking |
Fixes #5.
Summary
Qwen3-VL generate fails with
Image features and image tokens do not match: tokens: 0, features 1333becauseTransformersAdapter.generate()applies the chat template with content as a plain prompt string — the processor then emits no image placeholder tokens, so image features have nothing to bind to.Change
When there are images and the head uses a processor (VLM mode), pass structured content to
apply_chat_template:For text-only / LLM paths, content stays a string (no behavior change).
Verification
02dd56d, restarted MultiHead, confirmedqwen3-vl-8bloads and the prior crash no longer triggers.vision-vlm(Qwen3-VL-32B-Thinking) — needs ~24 GB VRAM and I'm on a 4080. Same processor family so should be equivalent, but worth a smoke test on a 4090 before merge.Test plan
qwen3-vl-8b, POST/heads/qwen3-vl-8b/generatewith{ prompt, images: [<base64>] }— expect text response, not the ValueError.vision-vlmon 24 GB+ hardware.