LRTSG is a zero-shot Vision-Language Reasoning (VLR) pipeline. It takes an image and a natural-language question, and returns a contextually grounded answer plus a verified Scene Graph that documents which spatial / attribute / relational facts the answer rests on.
The pipeline is agentic: six small agents coordinate to plan, observe, structure, ground, audit, and answer. By making the model's intermediate reasoning explicit (a Scene Graph with bounding boxes and per-edge audit logs), the system substantially reduces hallucinations compared to a single end-to-end prompt.
LRTSG runs in two modes via a single env var:
| Mode | Engine | Use when |
|---|---|---|
| Local (default) | Qwen3-VL-30B-A3B-Instruct served by llama-server |
You have a CUDA GPU with ≥ 22 GB VRAM. No API costs, fully offline. |
| API | Google Gemini | You don't have a local GPU, or you want to compare against a closed-source model. |
Both modes use the same agent code, prompts, and Scene Graph schema. Only the underlying VLM call swaps.
image + question
│
▼
┌──────────────┐ Classifies the question (spatial / attribute / etc.) and
│ Planner │ picks target entities to focus on.
└──────────────┘
│
▼
┌──────────────┐ Generates per-entity observations: concrete nouns,
│ Narrator │ abstract nouns, attributes, interactions.
└──────────────┘
│
▼
┌──────────────┐ Compiles the narration into a Scene Graph
│ Cartographer│ (nodes + subject-predicate-object edges).
└──────────────┘
│
▼
┌──────────────┐ Zero-shot bbox detection per node (normalized 0-1000
│ Localizer │ coordinates).
└──────────────┘
│
▼
┌──────────────┐ Per edge: crops the encapsulating image region, asks the
│ Auditor │ model to verify the relationship visually. On a fail,
│ (loop) │ re-runs the Narrator with the critique appended.
└──────────────┘ Up to `max_retries` rounds.
│
▼
┌──────────────┐ Answers the question using only the audit-passed edges,
│ Oracle │ citing the supporting edges in its rationale.
└──────────────┘
│
▼
answer + verified Scene Graph + visualization
All six agents inherit from a single BaseAgent that
exposes generate(prompt, image) and generate_json(prompt, schema, image).
The agent code is identical across local and API modes.
BaseAgent.generate* does not call any model directly. It delegates to a
VLMBackend chosen at runtime from the VLM_BACKEND env var:
Agents (Planner, Narrator, Cartographer, Auditor, Oracle, Localizer)
│ generate / generate_json
▼
agents/base.py (thin shim)
│
▼
┌─────────────┴─────────────┬───────────────────────┐
▼ ▼ ▼
OpenAICompatBackend GeminiBackend TransformersBackend
(default; HTTP) (Google Gemini API) (in-process; advanced)
│
▼
llama-server.exe (port 8080)
│
▼
Qwen3-VL-30B-A3B-Instruct GGUF + mmproj
OpenAICompatBackend speaks the OpenAI chat-completions HTTP protocol.
That protocol is implemented by many engines (llama.cpp, vLLM, LM Studio,
Ollama, …); the model running underneath is whatever you load into the
server. By default the project pairs it with a Qwen3-VL GGUF served by
llama.cpp.
The Localizer auto-pairs to the active backend: with gemini it uses
Gemini's spatial-detection prompt; with the local backends it uses
Qwen3-VL's native bbox_2d grounding output (already in the same 0-1000
normalized form).
LRTSG ships two entry points that share the same agents, prompts, Scene Graph schema, and VLM backend:
Linear — main.py |
Agentic — agentic_main.py |
|
|---|---|---|
| Control flow | Fixed order: Planner → Narrator → Cartographer → Localizer → Auditor → Oracle | An LLM reasons each step and decides which tool to call next |
| Adapts to the question? | No — same steps every time | Yes — skips / repeats / reorders tools per question |
| Self-evaluation | No | Yes — self_critique (ReAct) / image-aware Evaluator (CodeAct) |
| Backend | get_backend() (VLM_BACKEND) |
Same get_backend() — coordinator included |
The agentic entry point is the "make the control flow itself agentic" step: the
LLM owns the loop, versus the linear pipeline's hard-coded order. It runs on
the same local VLM deployment — including the fully in-process transformers
backend — with no separate coordinator server and no AutoGen. The
coordinator's per-step decision is produced with backend.generate_json(...)
(a small {thought, action, action_input} object), the same structured-output
path every agent already uses.
# ReAct (default): tool-calling loop; self-critiques the draft before committing
python agentic_main.py --image tests/kitchen_test.png \
--question "Is the toaster silver?"
# CodeAct: a Writer LLM emits a Python program, an Executor runs it, and an
# image-aware Evaluator judges it and triggers a rewrite on failure (VISPROG-style)
python agentic_main.py --image tests/kitchen_test.png \
--question "If the toaster were removed, what is next to the coffee machine?" \
--mode codeact --max-retries 3Outputs go to output_react.json / output_codeact.json (per-step history,
the final scene graph, and the answer). --compact shrinks the ReAct tool set +
prompt for very context-limited models; the in-process backend has a large
context window, so full (default) mode is recommended.
| Tool | Wraps | Notes |
|---|---|---|
plan_scene |
Planner | classify question, pick targets |
narrate_scene |
Narrator | per-entity observations (accepts a critique) |
build_scene_graph |
Cartographer | narration → nodes + edges |
localize_entities |
Localizer | ground entities → bounding boxes |
audit_edge |
Auditor | visually verify one edge |
verify_spatial |
(pure geometry) | deterministic bbox check for spatial predicates — always available, serverless |
answer_question |
Oracle | draft answer from the verified graph |
self_critique |
Critic | score the draft; gates committing the answer |
segment_entities |
SAM (optional) | pixel-accurate masks; see below |
Tool availability ≠ tool usage. The LLM decides in the loop whether a tool helps for the current question — SAM is just one more tool it can elect to call.
The agentic pipeline is fully serverless without SAM:
verify_spatialalways works using bounding-box geometry (zero VRAM, no network).segment_entitiesreturns a clean "not configured" message when no SAM backend is present, so the LLM routes around it.
To enable pixel-accurate masks, point a SAM server via .env
(SAM_IP/SAM_PORT/SAM_API_KEY). SAM 2.1 large adds only ~3 GB VRAM on top of
Qwen3-VL. A future in-process SAM backend can be slotted into
agentic/tools.py's segment_entities without touching the
coordinators.
A complete end-to-end run on the bundled tests/kitchen_test.png, with the
default local backend (Qwen3-VL-30B-A3B-Instruct UD-Q4_K_XL on llama.cpp).
| Image | Question |
|---|---|
![]() |
"Is the toaster silver?" |
python main.py --image tests/kitchen_test.png \
--question "Is the toaster silver?" \
--visualize[12:55:39] [PIPELINE] >>> START: Processing question: 'Is the toaster silver?'
[12:55:39] [PIPELINE] >>> PLANNER: Analyzing question to generate focus area...
[12:55:41] [PIPELINE] >>> PLANNER: Question Type: attribute, Target Entities: ['toaster']
[12:55:41] [PIPELINE] >>> NARRATOR: Generating initial scene narration...
[12:55:46] [PIPELINE] >>> CARTOGRAPHER: Constructing Scene Graph v1...
[12:55:49] [PIPELINE] >>> LOCALIZER: Grounding entities to pixel coordinates...
[12:55:52] [PIPELINE] >>> LOCALIZER: Localized 5/5 nodes.
[12:55:52] [PIPELINE] >>> AUDITOR: Starting iterative edge verification...
[12:55:52] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[sits on]--> countertop
[12:55:55] [PIPELINE] >>> AUDITOR: ✓ VALID (Conf: 0.95)
[12:55:55] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[next to]--> coffee maker
[12:55:56] [PIPELINE] >>> AUDITOR: ✓ VALID (Conf: 0.95)
[12:55:56] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[next to]--> potted plant
[12:55:56] [PIPELINE] >>> AUDITOR: ✓ VALID (Conf: 0.95)
[12:55:56] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[has a cord]--> cord
[12:55:56] [PIPELINE] >>> AUDITOR: ✓ VALID (Conf: 0.95)
[12:55:56] [PIPELINE] >>> ORACLE: Generating final answer using verified scene graph...
[12:55:59] [PIPELINE] >>> COMPLETE: Pipeline execution finished.
Total wall time: ~20 seconds. Audit pass-rate: 4/4 edges. Localizer recall: 5/5 entities grounded.
The pipeline writes output.json and visualization.png.
Annotated image:
The 5 verified entities (toaster, coffee maker, potted plant, countertop, cord) are drawn with red rectangles in their grounded positions.
Yes, the toaster is silver. This is confirmed by the scene graph, which describes the toaster (n1) with the attribute "silver" in its node description.
Coordinates in box are in normalized 0–1000 form
(ymin, xmin, ymax, xmax); convert to pixels by multiplying each axis by
the image's (height, width) and dividing by 1000.
uncertain_edges is the list of edges the Auditor rejected — empty here
because every relationship the Cartographer proposed was visually
confirmed. When this list is non-empty, those edges are excluded from the
graph the Oracle reasons over, which is how LRTSG keeps unverified claims
out of the final answer.
LRTSG/
├── agents/
│ ├── base.py # Thin shim over a VLMBackend
│ ├── planner.py # Question decomposition & planning
│ ├── narrator.py # Visual observation & narration
│ ├── cartographer.py # Scene Graph construction
│ ├── localizer.py # Localizer wrapper (auto-pairs to backend)
│ ├── auditor.py # Per-edge visual verification
│ ├── oracle.py # Final answer + edge citations
│ ├── critic.py # Self-evaluation of a draft answer (agentic only)
│ └── backends/
│ ├── base.py # VLMBackend interface
│ ├── factory.py # get_backend() reads VLM_BACKEND
│ ├── openai_compat_backend.py # llama-server / vLLM / LM Studio / ...
│ ├── gemini_backend.py # Google Gemini REST API
│ └── transformers_backend.py # In-process HF transformers (advanced)
├── agentic/ # Agentic loop (runs on the local backend, no server)
│ ├── tools.py # Tool functions + FUNCTION_MAP
│ ├── agent.py # ReAct: LLM-driven loop over a blackboard state
│ └── codeact.py # CodeAct: Writer → Executor → image-aware Evaluator
├── core/
│ ├── orchestrator.py # Wires all agents together (linear mode)
│ └── scene_graph.py # Pydantic models: Node, Edge, Graph, BoundingBox
├── vision/
│ ├── localizer.py # Localizer factory + GeminiLocalizer
│ ├── qwen_localizer.py # Qwen3-VL bbox grounding
│ └── utils.py # draw_bboxes, crop_bbox, crop_encapsulated_region
├── tests/
│ ├── kitchen_test.png # Example input image
│ └── test_vision.py # Unit tests for vision/utils
├── app.py # Streamlit UI (linear mode)
├── main.py # CLI entry point (linear mode)
├── agentic_main.py # CLI entry point (agentic — react / codeact)
├── smoke_local.py # Standalone sanity script for the local stack
├── requirements.txt # Always-required dependencies
├── requirements-local.txt # Optional dependencies for local backends
├── .env.example # Template for env vars
├── output.json # Example output from a previous run
├── visualization.png # Example annotated image
├── LOCAL_VLM_PLAN.md # Design history of the local-VLM integration
└── README.md # This file
- Any machine with Python 3.10+
- A
GOOGLE_API_KEY - Internet connection
- NVIDIA GPU with CUDA 12.x driver, ≥ 22 GB VRAM at runtime (RTX 4090 / 5090 / A6000 / similar). RTX 50-series cards need CUDA 12.8.
- ~20 GB free disk for the GGUF + projector
- Windows or Linux (instructions below assume Windows; Linux is similar)
The pipeline has been verified on:
- Windows 11, RTX 5090 32 GB (Blackwell, sm_120), CUDA 12.8
- Python 3.10 (conda env), torch 2.7+cu128, transformers 5.1
- llama.cpp release b8994, CUDA 12.4 build (forward-compatible with the CUDA 12.8 driver)
conda activate <your-env> # or any Python 3.10+ environment
pip install -r requirements.txt # core deps for both modes
pip install -r requirements-local.txt # optional: only needed for in-process
# transformers backend, not for llama.cpphf download unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF \
--include "Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" \
--include "mmproj-BF16.gguf"This downloads ~18.8 GB into your HuggingFace cache (HF_HOME if set,
otherwise ~/.cache/huggingface/).
Download llama-bXXXX-bin-win-cuda-12.4-x64.zip from the latest
llama.cpp release,
unzip somewhere outside the repo (e.g. D:\llama.cpp\).
You do not need the cudart-* archive if you already have CUDA 12.x
installed.
The CUDA 12.4 binary is forward-compatible with CUDA 12.5 / 12.6 / 12.7 / 12.8 drivers. RTX 50-series cards JIT-compile the kernels for sm_120 on first run.
D:\llama.cpp\llama-server.exe ^
-m "<HF_CACHE>\hub\models--unsloth--Qwen3-VL-30B-A3B-Instruct-GGUF\snapshots\<HASH>\Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" ^
--mmproj "<HF_CACHE>\hub\models--unsloth--Qwen3-VL-30B-A3B-Instruct-GGUF\snapshots\<HASH>\mmproj-BF16.gguf" ^
-ngl 999 -c 8192 --port 8080Replace <HF_CACHE> and <HASH> with your local paths. The server takes
~20 s to load and uses ~21 GB VRAM. It exposes an OpenAI-compatible API
at http://127.0.0.1:8080/v1.
You can verify it's running:
curl http://127.0.0.1:8080/v1/modelsCopy .env.example to .env. The defaults already target the local
server, so no edits are needed for this path:
VLM_BACKEND=openai_compat
OPENAI_COMPAT_BASE_URL=http://localhost:8080/v1python main.py --image tests/kitchen_test.png --question "Is the toaster silver?" --visualizeEnd-to-end runtime is ~20 s on an RTX 5090. Outputs go to output.json
(the verified Scene Graph + answer) and visualization.png (the input
image with bounding boxes drawn on the verified entities).
Windows + Chinese locale note: If you see
UnicodeEncodeError: 'gbk' codec can't encode...when running throughconda run, prefix the command withset PYTHONIOENCODING=utf-8(cmd) or$env:PYTHONIOENCODING="utf-8"(PowerShell), or invoke the env's Python directly instead of going throughconda run.
pip install -r requirements.txtCopy .env.example to .env and set:
VLM_BACKEND=gemini
GOOGLE_API_KEY=your_key_here
GEMINI_MODEL=gemini-1.5-pro # or gemini-2.0-flash for cheaper / fasterpython main.py --image tests/kitchen_test.png --question "Is the toaster silver?" --visualizeFor interactive exploration:
streamlit run app.pyOpens at http://localhost:8501.
The sidebar lets you:
- Toggle backend (
openai_compat↔gemini) - See live server-reachability status when using the local path
- Adjust the audit refinement loop (
max_retries) - Copy a ready-to-use
llama-serverlaunch command
# Which backend powers the agents.
# openai_compat - local llama-server / vLLM / LM Studio (default)
# gemini - Google Gemini REST API
# transformers - In-process HF transformers (advanced; bnb4 is broken on
# Qwen3-VL-30B-A3B's MoE expert layers — use openai_compat
# for that model)
VLM_BACKEND=openai_compat
# OpenAI-compatible backend
OPENAI_COMPAT_BASE_URL=http://localhost:8080/v1
OPENAI_COMPAT_API_KEY=sk-no-key # llama.cpp ignores this
OPENAI_COMPAT_MODEL=local # llama.cpp ignores this too
# Localizer (zero-shot bbox detection)
# Leave blank to auto-pair to backend:
# gemini -> GeminiLocalizer
# openai_compat -> Qwen3VLLocalizer (uses the active backend)
LOCALIZER_PROVIDER=
# In-process transformers backend (only used if VLM_BACKEND=transformers)
LOCAL_VLM_MODEL=Qwen/Qwen3-VL-8B-Instruct
LOCAL_VLM_QUANT=none # none (bf16) | bnb4 | awq | fp8
LOCAL_VLM_DEVICE=cuda
# Gemini API
GOOGLE_API_KEY=
GEMINI_MODEL=gemini-1.5-proEach agent's system_instruction lives in its respective file under
agents/. To change how the Narrator describes scenes, for example, edit
agents/narrator.py.
max_retriesin core/orchestrator.py controls how many refinement rounds the Auditor runs per failed edge (default: 2).padding_pctin thecrop_encapsulated_regioncall inside the Orchestrator widens the crop the Auditor sees (default: 0.1, i.e. 10 %).
- Add fields to the Pydantic models in core/scene_graph.py.
- Update the corresponding agent prompts so they populate the new fields.
Any GGUF that llama.cpp can serve will work — change the -m and
--mmproj paths in the launch command. Vision models need an mmproj file;
text-only models don't (and obviously won't help with this pipeline).
Create a class in agents/backends/ that subclasses
VLMBackend and implements generate_text and
generate_json. Then register it in
agents/backends/factory.py.
The local llama-server.exe isn't running, or it's bound to a different
port. Start it with the launch command from the sidebar's "How to start
the local server" expander.
Your Windows locale defaults to GBK; the orchestrator prints ✓ / ✗
which can't be encoded. Set PYTHONIOENCODING=utf-8 before running, or
invoke Python directly instead of through conda run (which pipes stdout
through its own GBK encoder).
Known incompatibility: bnb's 4-bit packer doesn't handle Qwen3-VL's MoE
expert weight layouts. Use the openai_compat backend with a GGUF
checkpoint instead. Documented in
LOCAL_VLM_PLAN.md.
Expected on the very first call to a new model: CUDA kernels JIT-compile for your GPU's compute capability (especially on RTX 50-series sm_120), and the vision encoder warms up. Subsequent calls drop to 0.3–3 s thanks to llama.cpp's prompt cache.
The Auditor crops the encapsulating region of two entities, which on high-res images can produce a lot of vision tokens. Two fixes:
- Reduce
-c(context length) on the llama-server command line. - Resize input images to ≤ 1024 px on the longest side before feeding them to the pipeline.
| Stage | Latency |
|---|---|
| Server load (one-time) | ~20 s |
| First inference (warmup) | ~30 s |
| Steady-state per agent call | 0.3–3 s |
| End-to-end pipeline (6 agents + 4 audit edges) | ~20 s |
| VRAM at runtime | ~21 GB / 32 GB |
- DepthAnything intermediate-image tools — separate workstream; the proposal mentions them as future enrichment. (SAM is integrated as an optional agentic tool — see "Two ways to run: linear vs. agentic".)
- Benchmark harness for SPHERE-VLM / VisuLogic / similar — out of scope for this iteration.
- Fine-tuning — the pipeline is purely zero-shot, by design.
- vLLM as the local server — viable drop-in (just change
OPENAI_COMPAT_BASE_URL), but vLLM prefers Linux/WSL.
- LOCAL_VLM_PLAN.md — design history of the local integration: which approaches were tried, which failed, and why.
- llama.cpp — the inference engine used by the default backend.
- Qwen3-VL — model documentation, including the official prompts for spatial grounding.
- Unsloth GGUFs — pre-quantized checkpoints used by the default local setup.



{ "question": "Is the toaster silver?", "answer": "Yes, the toaster is silver. ...", "verified_graph": { "nodes": [ { "id": "n1", "label": "toaster", "attributes": [ "silver", "stainless steel", "metallic", "shiny", "rectangular", "two-slot", "has a digital display", "has a control panel", "has a lever", "has a knob", "has a cord" ], "box": {"ymin": 515, "xmin": 555, "ymax": 789, "xmax": 790, "label": "toaster"} }, {"id": "n2", "label": "coffee maker", "box": {...}}, {"id": "n3", "label": "potted plant", "box": {...}}, {"id": "n4", "label": "countertop", "box": {...}}, {"id": "n5", "label": "cord", "box": {...}} ], "edges": [ {"subject_id": "n1", "predicate": "sits on", "object_id": "n4"}, {"subject_id": "n1", "predicate": "next to", "object_id": "n2"}, {"subject_id": "n1", "predicate": "next to", "object_id": "n3"}, {"subject_id": "n1", "predicate": "has a cord","object_id": "n5"} ] }, "uncertain_edges": [] }