docs: add Ollama local VLM guide #16
abhijeetnardele24-hash wants to merge 1 commit into vericontext:main
Conversation
Signed-off-by: abhijeet nardele <234410808+abhijeetnardele24-hash@users.noreply.github.com>
@abhijeetnardele24-hash is attempting to deploy a commit to Kiyeon Jeon's projects Team on Vercel. A member of the Team first needs to authorize it.
Pull request overview
Adds documentation for using Ollama as a local Vision-Language Model (VLM) provider in parsemux, aiming to enable “free local image description” with qwen2.5vl:7b and to close issue #9.
Changes:
- Adds an “Ollama local VLM” section to the main README with a quick-start command.
- Introduces a new dedicated setup/performance guide at
docs/ollama-guide.md.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| README.md | Adds a short Ollama local VLM quick-start section and links to the full guide. |
| docs/ollama-guide.md | New guide describing Ollama installation, usage, and performance expectations. |
```bash
ollama pull qwen2.5vl:7b
parsemux parse doc.pdf --extract-images --describe-images
```

When no VLM key is provided, parsemux falls back to Ollama automatically.
See [docs/ollama-guide.md](docs/ollama-guide.md) for setup details and performance notes.
The example command here (`parsemux parse ... --describe-images` with no `--vlm-key`/`--llm-key`) won't currently trigger image description: in `src/parsemux/core/engine.py` the VLM step is gated behind `if vlm_key:` (derived from `request.vlm_api_key` / `request.llm_api_key` / `PARSEMUX_VLM_API_KEY`). With no key provided, the engine skips the VLM entirely, so this section's "falls back to Ollama automatically" claim is inaccurate. Either (a) adjust the docs to require providing a key/env var (even for Ollama), or (b) update the engine to allow Ollama descriptions with an empty key when `--vlm-provider ollama` is passed (or when the provider auto-detects to Ollama).
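Option (b) could look roughly like the sketch below. This is a hypothetical illustration, not the actual parsemux code: `resolve_provider`, `should_run_vlm`, the `VLMProvider` enum values, and the key-prefix checks are all assumed names that mirror the behavior described in this review.

```python
from enum import Enum


class VLMProvider(Enum):
    ANTHROPIC = "anthropic"
    OPENAI = "openai"
    OLLAMA = "ollama"


def resolve_provider(vlm_key, explicit=None):
    """Assumed shape of the auto-detect logic: explicit flag wins,
    otherwise guess from the key prefix, defaulting to Ollama."""
    if explicit:
        return VLMProvider(explicit)
    key = vlm_key or ""
    if key.startswith("sk-ant-"):  # check the longer prefix first
        return VLMProvider.ANTHROPIC
    if key.startswith("sk-"):
        return VLMProvider.OPENAI
    return VLMProvider.OLLAMA  # no recognizable prefix: local server


def should_run_vlm(describe_images, vlm_key, provider):
    # Current gate is effectively `if vlm_key:`, which skips keyless
    # Ollama. Proposed gate: also run when the provider is Ollama.
    return describe_images and (bool(vlm_key) or provider is VLMProvider.OLLAMA)
```

With this gate, `--describe-images` and no key would still reach the local Ollama server, while cloud providers keep requiring a non-empty key.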
When you do not provide `--vlm-key` or `--llm-key`, parsemux auto-detects the VLM provider as Ollama.

```bash
parsemux parse doc.pdf --extract-images --describe-images
```

This flow:

- extracts images from the document
- sends them to the local Ollama server at `http://localhost:11434`
- writes image descriptions back into the parse result

You can also set the provider explicitly:

```bash
parsemux parse doc.pdf --extract-images --describe-images --vlm-provider ollama
```
This section says that omitting `--vlm-key`/`--llm-key` will auto-detect Ollama and run image description, but the current engine only runs the VLM step when a non-empty key is present (`if vlm_key:` in `src/parsemux/core/engine.py`). As written, both example commands will parse and extract images but will not describe them unless a key/env var is provided. Please either update these instructions to include the required key/env var (or explicitly document the current limitation), or update the engine so Ollama can run without a key when selected.
kiyeonjeon21 left a comment
Great guide! One factual correction needed:

> When no VLM key is provided, parsemux falls back to Ollama automatically.

This isn't accurate with the current code. When `describe_images=True` but no key is provided (no `--vlm-key`, no `PARSEMUX_VLM_API_KEY` in `.env`), the VLM step is skipped entirely; it does not auto-fallback to Ollama.

The user must explicitly either:

- have Ollama running and pass `--vlm-provider ollama`, or
- have no key at all, in which case the auto-detect logic in `vlm.py` sends keyless requests to Ollama as the default

Actually, looking at the code again: `_detect_provider()` does fall back to `VLMProvider.OLLAMA` when no key prefix matches, but the engine only calls the VLM when `vlm_key` is truthy. So with an empty key, the VLM step is skipped.

Fix: change the README/guide wording to:

> To use Ollama, explicitly pass the provider:
>
> ```bash
> parsemux parse doc.pdf --extract-images --describe-images --vlm-provider ollama
> ```

Or note that users can set `PARSEMUX_VLM_API_KEY=ollama` (any non-empty string) to trigger the Ollama path.
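To make the env-var workaround concrete, here is an illustrative sketch (not the actual parsemux code) of why any non-empty `PARSEMUX_VLM_API_KEY` value unblocks the VLM step today: the engine's gate is a plain truthiness check on the key, so a dummy value like `ollama` passes it, and provider auto-detect then routes the unrecognized "key" to the local Ollama server.

```python
import os

# Simulate a user putting this in their .env or shell profile.
# The value just needs to be non-empty; it is never sent as a real key.
os.environ["PARSEMUX_VLM_API_KEY"] = "ollama"

vlm_key = os.environ.get("PARSEMUX_VLM_API_KEY", "")
run_vlm_step = bool(vlm_key)  # mirrors the `if vlm_key:` gate in engine.py
```

With an unset or empty variable, `run_vlm_step` is `False` and the description step is skipped, which is exactly the behavior this review is flagging.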
Once updated, I'll approve and merge. Thanks!
Adds a dedicated Ollama guide and README documentation for free local image description with qwen2.5vl:7b.
Closes #9