12 changes: 12 additions & 0 deletions README.md
@@ -52,6 +52,18 @@ parsemux parse doc.pdf --extract-images --describe-images --vlm-key sk-...
Provider is auto-detected from key prefix (`sk-` → OpenAI, `sk-ant-` → Anthropic, `AI` → Google).
Default models: gpt-5.4-nano, claude-haiku-4.5, gemini-2.5-flash, qwen2.5vl:7b (local).

### Ollama local VLM

For free local image description, install Ollama and pull the default local vision model:

```bash
ollama pull qwen2.5vl:7b
parsemux parse doc.pdf --extract-images --describe-images
```

When no VLM key is provided, parsemux falls back to Ollama automatically.
See [docs/ollama-guide.md](docs/ollama-guide.md) for setup details and performance notes.
Comment on lines +59 to +65

Copilot AI Apr 2, 2026

The example command here (parsemux parse ... --describe-images with no --vlm-key/--llm-key) won’t currently trigger image description: in src/parsemux/core/engine.py the VLM step is gated behind if vlm_key: (derived from request.vlm_api_key / request.llm_api_key / PARSEMUX_VLM_API_KEY). With no key provided, the engine skips VLM entirely, so this section’s “falls back to Ollama automatically” claim is inaccurate. Either (a) adjust the docs to require providing a key/env var (even for Ollama), or (b) update the engine to allow Ollama descriptions with an empty key when --vlm-provider ollama (or when provider auto-detects to Ollama).


### Start your own server

```bash
50 changes: 50 additions & 0 deletions docs/ollama-guide.md
@@ -0,0 +1,50 @@
# Ollama local VLM guide

Use Ollama when you want free, local image description for extracted document images.

## Install Ollama

1. Install Ollama from <https://ollama.com/download>.
2. Start the Ollama service on your machine.
3. Pull the default local vision model used by parsemux:

```bash
ollama pull qwen2.5vl:7b
```

Parsemux defaults to `qwen2.5vl:7b` for local image description.

## Run parsemux with local image description

When you do not provide `--vlm-key` or `--llm-key`, parsemux auto-detects the VLM provider as Ollama.

```bash
parsemux parse doc.pdf --extract-images --describe-images
```

This flow:

- extracts images from the document
- sends them to the local Ollama server at `http://localhost:11434`
- writes image descriptions back into the parse result
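Under the hood, a description request to a local Ollama server looks roughly like this. The `/api/generate` endpoint, base64 `images` field, and `stream` flag are Ollama's public API; the prompt text and how parsemux actually builds the request are assumptions for illustration:

```python
import base64
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_describe_payload(image_bytes: bytes, model: str = "qwen2.5vl:7b") -> dict:
    """Build an Ollama /api/generate payload for one extracted image."""
    return {
        "model": model,
        "prompt": "Describe this image for a document parsing pipeline.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return a single JSON object instead of a stream
    }

def describe_image(image_bytes: bytes) -> str:
    """POST the payload to the local Ollama server and return the description."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(build_describe_payload(image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

`describe_image` requires a running Ollama server on the default port; `build_describe_payload` can be inspected without one.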

You can also set the provider explicitly:

```bash
parsemux parse doc.pdf --extract-images --describe-images --vlm-provider ollama
```
Comment on lines +19 to +35

Copilot AI Apr 2, 2026


This section says that omitting --vlm-key/--llm-key will auto-detect Ollama and run image description, but the current engine only runs the VLM step when a non-empty key is present (if vlm_key: in src/parsemux/core/engine.py). As written, both example commands will parse and extract images but will not describe them unless a key/env var is provided. Please either update these instructions to include the required key/env var (or explicitly document the current limitation), or update the engine so Ollama can run without a key when selected.


## Performance expectations

Ollama is the zero-cost option, but it trades speed for privacy and local control.

- Speed: slower than hosted APIs, especially on CPU-only machines
- Quality: good enough for many document images, charts, and screenshots, but usually below top cloud vision models
- Privacy: best option when documents must stay on your machine
- Cost: reported as `0.0` direct API cost inside parsemux, since no hosted API is billed

For best results:

- use a machine with a capable GPU if available
- keep document batches small when testing locally
- expect longer runtimes for image-heavy PDFs