(retriever) Add VLM image captioning via vLLM #1660

Draft
edknv wants to merge 20 commits into NVIDIA:main from edknv:edwardk/retriever-image-caption

@edknv (Collaborator) commented Mar 19, 2026

Description

  • Add a .caption() pipeline stage to both batch and in-process ingestors that generates text descriptions for extracted images using a VLM (Nemotron Nano 12B v2 VL via vLLM locally, or a remote NIM endpoint).
  • Use nv-ingest-api's extract_image_like_objects_from_pdfium_page during PDF extraction to detect, merge, and crop image-like objects (images, shapes, forms) from each page into the images column.
  • The caption stage filters out small images (< 32px), sends the remaining images to the VLM, and writes captions back as images[i]["text"]. Optionally, it prepends surrounding page text to the VLM prompt, limited by context_text_max_chars.
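
For reviewers, the caption-stage flow described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `caption_images`, `build_prompt`, the `caption_fn` callable, the default prompt string, and the `width`/`height` keys are hypothetical names, and it assumes "< 32px" means the shorter side of the image and that context text is truncated to `context_text_max_chars` before being prepended. Only the 32px threshold, `context_text_max_chars`, and the `images[i]["text"]` output field come from the description.

```python
from typing import Any, Callable, Dict, List

MIN_IMAGE_SIZE_PX = 32  # threshold stated in the PR description


def build_prompt(base_prompt: str, page_text: str, context_text_max_chars: int) -> str:
    """Optionally prepend truncated page text to the VLM prompt (assumed behavior)."""
    if context_text_max_chars <= 0 or not page_text:
        return base_prompt
    context = page_text[:context_text_max_chars]
    return f"{context}\n\n{base_prompt}"


def caption_images(
    images: List[Dict[str, Any]],
    page_text: str,
    caption_fn: Callable[[Dict[str, Any], str], str],
    base_prompt: str = "Describe this image.",  # hypothetical default prompt
    context_text_max_chars: int = 0,
) -> List[Dict[str, Any]]:
    """Caption every sufficiently large image and write the result to image["text"]."""
    prompt = build_prompt(base_prompt, page_text, context_text_max_chars)
    for image in images:
        # Assumption: an image is "small" when its shorter side is under 32 px.
        if min(image["width"], image["height"]) < MIN_IMAGE_SIZE_PX:
            continue
        # caption_fn stands in for the local vLLM model or the remote NIM endpoint.
        image["text"] = caption_fn(image, prompt)
    return images


if __name__ == "__main__":
    def fake_vlm(image: Dict[str, Any], prompt: str) -> str:
        return f"caption for {image['width']}x{image['height']} image"

    page_images = [{"width": 16, "height": 200}, {"width": 640, "height": 480}]
    for img in caption_images(page_images, "Quarterly revenue chart.", fake_vlm, context_text_max_chars=200):
        print(img)
```

Passing the model call in as `caption_fn` keeps the filtering and write-back logic independent of whether captions come from a local model or a remote endpoint.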

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?
