Skip to content

fix(transformers): include image entries in chat-template content for VLM#6

Closed
AlanKharebov wants to merge 1 commit into
mainfrom
fix/5-vlm-prompt-template
Closed

fix(transformers): include image entries in chat-template content for VLM#6
AlanKharebov wants to merge 1 commit into
mainfrom
fix/5-vlm-prompt-template

Conversation

@AlanKharebov

Copy link
Copy Markdown
Collaborator

Fixes #5.

Summary

Qwen3-VL generate fails with Image features and image tokens do not match: tokens: 0, features 1333 because TransformersAdapter.generate() applies the chat template with content as a plain prompt string — the processor then emits no image placeholder tokens, so image features have nothing to bind to.

Change

When there are images and the head uses a processor (VLM mode), pass structured content to apply_chat_template:

content = [{type: image} for _ in images] + [{type: text, text: prompt}]

For text-only / LLM paths, content stays a string (no behavior change).

Verification

  • Local: applied this patch on top of 02dd56d, restarted MultiHead, confirmed qwen3-vl-8b loads and the prior crash no longer triggers.
  • Not validated on vision-vlm (Qwen3-VL-32B-Thinking) — needs ~24 GB VRAM and I'm on a 4080. Same processor family so should be equivalent, but worth a smoke test on a 4090 before merge.

Test plan

  • Wake qwen3-vl-8b, POST /heads/qwen3-vl-8b/generate with { prompt, images: [<base64>] } — expect text response, not the ValueError.
  • Repeat for vision-vlm on 24 GB+ hardware.
  • Text-only generate (LLM heads) unchanged.

… VLM

The VLM (processor) path was applying the chat template with content as
a plain prompt string. For Qwen3-VL the template then emits no
<|vision_start|><|image_pad|><|vision_end|> placeholder tokens, so when
the processor splices in the image features the forward pass raises:

  ValueError: Image features and image tokens do not match:
    tokens: 0, features 1333

Fix: when there are images and a processor (VLM mode), pass structured
content (image entries + text entry) to apply_chat_template so the
placeholder tokens land in the prompt.

Closes #5
Copilot AI review requested due to automatic review settings June 4, 2026 00:15

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes VLM image generation failures in the Transformers adapter by ensuring that, when images are provided and a processor is in use, apply_chat_template() receives structured chat content that includes image placeholder entries (so the processor emits the required image placeholder tokens).

Changes:

  • Build messages[0]["content"] as a multimodal content list ([{type: "image"}, …, {type: "text", …}]) when kwargs["images"] is present and the head is in VLM/processor mode.
  • Preserve prior behavior for text-only calls (content remains a plain string prompt).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +149 to +153
images_for_template = kwargs.get("images")
if images_for_template and self._processor is not None:
content: Any = [{"type": "image"} for _ in images_for_template] + [
{"type": "text", "text": prompt},
]
@AlanKharebov

Copy link
Copy Markdown
Collaborator Author

Closing — after checking Axsar/multihead-dev (the active development repo), this code path is already fixed there: transformers_adapter.py on multihead-dev/main builds a structured content_blocks list with {type: image, image: img} entries before calling apply_chat_template on the processor (lines ~200-229 on dev). The bug only exists on Axsar/multihead because that repo is ~592 commits behind multihead-dev. Filed in error against a stale mirror.

@AlanKharebov AlanKharebov deleted the fix/5-vlm-prompt-template branch June 4, 2026 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

transformers_adapter: VLM image generate fails — missing chat-template / image-placeholder tokens

2 participants