fix(transformers): include image entries in chat-template content for VLM by AlanKharebov · Pull Request #6 · Axsar/multihead

AlanKharebov · 2026-06-04T00:15:40Z

Fixes #5.

Summary

Qwen3-VL generate fails with Image features and image tokens do not match: tokens: 0, features 1333 because TransformersAdapter.generate() applies the chat template with content as a plain prompt string — the processor then emits no image placeholder tokens, so image features have nothing to bind to.

Change

When there are images and the head uses a processor (VLM mode), pass structured content to apply_chat_template:

content = [{type: image} for _ in images] + [{type: text, text: prompt}]

For text-only / LLM paths, content stays a string (no behavior change).

Verification

Local: applied this patch on top of 02dd56d, restarted MultiHead, confirmed qwen3-vl-8b loads and the prior crash no longer triggers.
Not validated on vision-vlm (Qwen3-VL-32B-Thinking) — needs ~24 GB VRAM and I'm on a 4080. Same processor family so should be equivalent, but worth a smoke test on a 4090 before merge.

Test plan

Wake qwen3-vl-8b, POST /heads/qwen3-vl-8b/generate with { prompt, images: [<base64>] } — expect text response, not the ValueError.
Repeat for vision-vlm on 24 GB+ hardware.
Text-only generate (LLM heads) unchanged.

… VLM The VLM (processor) path was applying the chat template with content as a plain prompt string. For Qwen3-VL the template then emits no <|vision_start|><|image_pad|><|vision_end|> placeholder tokens, so when the processor splices in the image features the forward pass raises: ValueError: Image features and image tokens do not match: tokens: 0, features 1333 Fix: when there are images and a processor (VLM mode), pass structured content (image entries + text entry) to apply_chat_template so the placeholder tokens land in the prompt. Closes #5

Copilot

Pull request overview

This PR fixes VLM image generation failures in the Transformers adapter by ensuring that, when images are provided and a processor is in use, apply_chat_template() receives structured chat content that includes image placeholder entries (so the processor emits the required image placeholder tokens).

Changes:

Build messages[0]["content"] as a multimodal content list ([{type: "image"}, …, {type: "text", …}]) when kwargs["images"] is present and the head is in VLM/processor mode.
Preserve prior behavior for text-only calls (content remains a plain string prompt).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            images_for_template = kwargs.get("images")
+            if images_for_template and self._processor is not None:
+                content: Any = [{"type": "image"} for _ in images_for_template] + [
+                    {"type": "text", "text": prompt},
+                ]


AlanKharebov · 2026-06-04T04:56:35Z

Closing — after checking Axsar/multihead-dev (the active development repo), this code path is already fixed there: transformers_adapter.py on multihead-dev/main builds a structured content_blocks list with {type: image, image: img} entries before calling apply_chat_template on the processor (lines ~200-229 on dev). The bug only exists on Axsar/multihead because that repo is ~592 commits behind multihead-dev. Filed in error against a stale mirror.

Copilot AI review requested due to automatic review settings June 4, 2026 00:15

Copilot started reviewing on behalf of AlanKharebov June 4, 2026 00:15 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread src/multihead/adapters/transformers_adapter.py

Comment on lines +149 to +153

images_for_template = kwargs.get("images")

if images_for_template and self._processor is not None:

content: Any = [{"type": "image"} for _ in images_for_template] + [

{"type": "text", "text": prompt},

]

AlanKharebov closed this Jun 4, 2026

AlanKharebov deleted the fix/5-vlm-prompt-template branch June 4, 2026 04:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(transformers): include image entries in chat-template content for VLM#6

fix(transformers): include image entries in chat-template content for VLM#6
AlanKharebov wants to merge 1 commit into
mainfrom
fix/5-vlm-prompt-template

AlanKharebov commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

AlanKharebov commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AlanKharebov commented Jun 4, 2026

Summary

Change

Verification

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

AlanKharebov commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants