Skip to content

Ashindustry007/LRTSG---Language-Reasoning-Task-with-Scene-Graph

Repository files navigation

LRTSG: Language-Reasoning Task with Scene Graph

LRTSG is a zero-shot Vision-Language Reasoning (VLR) pipeline. It takes an image and a natural-language question, and returns a contextually grounded answer plus a verified Scene Graph that documents which spatial / attribute / relational facts the answer rests on.

The pipeline is agentic: six small agents coordinate to plan, observe, structure, ground, audit, and answer. By making the model's intermediate reasoning explicit (a Scene Graph with bounding boxes and per-edge audit logs), the system substantially reduces hallucinations compared to a single end-to-end prompt.

LRTSG runs in two modes via a single env var:

Mode Engine Use when
Local (default) Qwen3-VL-30B-A3B-Instruct served by llama-server You have a CUDA GPU with ≥ 22 GB VRAM. No API costs, fully offline.
API Google Gemini You don't have a local GPU, or you want to compare against a closed-source model.

Both modes use the same agent code, prompts, and Scene Graph schema. Only the underlying VLM call swaps.


Pipeline at a glance

image + question
     │
     ▼
┌──────────────┐  Classifies the question (spatial / attribute / etc.) and
│  Planner     │  picks target entities to focus on.
└──────────────┘
     │
     ▼
┌──────────────┐  Generates per-entity observations: concrete nouns,
│  Narrator    │  abstract nouns, attributes, interactions.
└──────────────┘
     │
     ▼
┌──────────────┐  Compiles the narration into a Scene Graph
│  Cartographer│  (nodes + subject-predicate-object edges).
└──────────────┘
     │
     ▼
┌──────────────┐  Zero-shot bbox detection per node (normalized 0-1000
│  Localizer   │  coordinates).
└──────────────┘
     │
     ▼
┌──────────────┐  Per edge: crops the encapsulating image region, asks the
│  Auditor     │  model to verify the relationship visually. On a fail,
│  (loop)      │  re-runs the Narrator with the critique appended.
└──────────────┘  Up to `max_retries` rounds.
     │
     ▼
┌──────────────┐  Answers the question using only the audit-passed edges,
│  Oracle      │  citing the supporting edges in its rationale.
└──────────────┘
     │
     ▼
answer + verified Scene Graph + visualization

All six agents inherit from a single BaseAgent that exposes generate(prompt, image) and generate_json(prompt, schema, image). The agent code is identical across local and API modes.


Architecture: the backend abstraction

BaseAgent.generate* does not call any model directly. It delegates to a VLMBackend chosen at runtime from the VLM_BACKEND env var:

Agents (Planner, Narrator, Cartographer, Auditor, Oracle, Localizer)
                 │ generate / generate_json
                 ▼
         agents/base.py (thin shim)
                 │
                 ▼
   ┌─────────────┴─────────────┬───────────────────────┐
   ▼                           ▼                       ▼
OpenAICompatBackend      GeminiBackend         TransformersBackend
(default; HTTP)          (Google Gemini API)   (in-process; advanced)
   │
   ▼
llama-server.exe (port 8080)
   │
   ▼
Qwen3-VL-30B-A3B-Instruct GGUF + mmproj

OpenAICompatBackend speaks the OpenAI chat-completions HTTP protocol. That protocol is implemented by many engines (llama.cpp, vLLM, LM Studio, Ollama, …); the model running underneath is whatever you load into the server. By default the project pairs it with a Qwen3-VL GGUF served by llama.cpp.

The Localizer auto-pairs to the active backend: with gemini it uses Gemini's spatial-detection prompt; with the local backends it uses Qwen3-VL's native bbox_2d grounding output (already in the same 0-1000 normalized form).


Two ways to run: linear vs. agentic

LRTSG ships two entry points that share the same agents, prompts, Scene Graph schema, and VLM backend:

Linear — main.py Agentic — agentic_main.py
Control flow Fixed order: Planner → Narrator → Cartographer → Localizer → Auditor → Oracle An LLM reasons each step and decides which tool to call next
Adapts to the question? No — same steps every time Yes — skips / repeats / reorders tools per question
Self-evaluation No Yes — self_critique (ReAct) / image-aware Evaluator (CodeAct)
Backend get_backend() (VLM_BACKEND) Same get_backend() — coordinator included

The agentic entry point is the "make the control flow itself agentic" step: the LLM owns the loop, versus the linear pipeline's hard-coded order. It runs on the same local VLM deployment — including the fully in-process transformers backend — with no separate coordinator server and no AutoGen. The coordinator's per-step decision is produced with backend.generate_json(...) (a small {thought, action, action_input} object), the same structured-output path every agent already uses.

Two agentic modes

# ReAct (default): tool-calling loop; self-critiques the draft before committing
python agentic_main.py --image tests/kitchen_test.png \
    --question "Is the toaster silver?"

# CodeAct: a Writer LLM emits a Python program, an Executor runs it, and an
# image-aware Evaluator judges it and triggers a rewrite on failure (VISPROG-style)
python agentic_main.py --image tests/kitchen_test.png \
    --question "If the toaster were removed, what is next to the coffee machine?" \
    --mode codeact --max-retries 3

Outputs go to output_react.json / output_codeact.json (per-step history, the final scene graph, and the answer). --compact shrinks the ReAct tool set + prompt for very context-limited models; the in-process backend has a large context window, so full (default) mode is recommended.

Tools the coordinator can choose

Tool Wraps Notes
plan_scene Planner classify question, pick targets
narrate_scene Narrator per-entity observations (accepts a critique)
build_scene_graph Cartographer narration → nodes + edges
localize_entities Localizer ground entities → bounding boxes
audit_edge Auditor visually verify one edge
verify_spatial (pure geometry) deterministic bbox check for spatial predicates — always available, serverless
answer_question Oracle draft answer from the verified graph
self_critique Critic score the draft; gates committing the answer
segment_entities SAM (optional) pixel-accurate masks; see below

Tool availability ≠ tool usage. The LLM decides in the loop whether a tool helps for the current question — SAM is just one more tool it can elect to call.

SAM is optional (off by default)

The agentic pipeline is fully serverless without SAM:

  • verify_spatial always works using bounding-box geometry (zero VRAM, no network).
  • segment_entities returns a clean "not configured" message when no SAM backend is present, so the LLM routes around it.

To enable pixel-accurate masks, point a SAM server via .env (SAM_IP/SAM_PORT/SAM_API_KEY). SAM 2.1 large adds only ~3 GB VRAM on top of Qwen3-VL. A future in-process SAM backend can be slotted into agentic/tools.py's segment_entities without touching the coordinators.


Sample run

A complete end-to-end run on the bundled tests/kitchen_test.png, with the default local backend (Qwen3-VL-30B-A3B-Instruct UD-Q4_K_XL on llama.cpp).

Input

Image Question
Input image "Is the toaster silver?"

Command

python main.py --image tests/kitchen_test.png \
               --question "Is the toaster silver?" \
               --visualize

Console transcript

[12:55:39] [PIPELINE] >>> START: Processing question: 'Is the toaster silver?'
[12:55:39] [PIPELINE] >>> PLANNER: Analyzing question to generate focus area...
[12:55:41] [PIPELINE] >>> PLANNER: Question Type: attribute, Target Entities: ['toaster']
[12:55:41] [PIPELINE] >>> NARRATOR: Generating initial scene narration...
[12:55:46] [PIPELINE] >>> CARTOGRAPHER: Constructing Scene Graph v1...
[12:55:49] [PIPELINE] >>> LOCALIZER: Grounding entities to pixel coordinates...
[12:55:52] [PIPELINE] >>> LOCALIZER: Localized 5/5 nodes.
[12:55:52] [PIPELINE] >>> AUDITOR: Starting iterative edge verification...
[12:55:52] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[sits on]--> countertop
[12:55:55] [PIPELINE] >>> AUDITOR:   ✓ VALID (Conf: 0.95)
[12:55:55] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[next to]--> coffee maker
[12:55:56] [PIPELINE] >>> AUDITOR:   ✓ VALID (Conf: 0.95)
[12:55:56] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[next to]--> potted plant
[12:55:56] [PIPELINE] >>> AUDITOR:   ✓ VALID (Conf: 0.95)
[12:55:56] [PIPELINE] >>> AUDITOR: Verifying edge: toaster --[has a cord]--> cord
[12:55:56] [PIPELINE] >>> AUDITOR:   ✓ VALID (Conf: 0.95)
[12:55:56] [PIPELINE] >>> ORACLE: Generating final answer using verified scene graph...
[12:55:59] [PIPELINE] >>> COMPLETE: Pipeline execution finished.

Total wall time: ~20 seconds. Audit pass-rate: 4/4 edges. Localizer recall: 5/5 entities grounded.

Outputs

The pipeline writes output.json and visualization.png. Annotated image:

Verified Scene Graph visualization

The 5 verified entities (toaster, coffee maker, potted plant, countertop, cord) are drawn with red rectangles in their grounded positions.

Final answer

Yes, the toaster is silver. This is confirmed by the scene graph, which describes the toaster (n1) with the attribute "silver" in its node description.

Verified Scene Graph (excerpt from output.json)

{
  "question": "Is the toaster silver?",
  "answer": "Yes, the toaster is silver. ...",
  "verified_graph": {
    "nodes": [
      {
        "id": "n1",
        "label": "toaster",
        "attributes": [
          "silver", "stainless steel", "metallic", "shiny",
          "rectangular", "two-slot", "has a digital display",
          "has a control panel", "has a lever", "has a knob", "has a cord"
        ],
        "box": {"ymin": 515, "xmin": 555, "ymax": 789, "xmax": 790,
                "label": "toaster"}
      },
      {"id": "n2", "label": "coffee maker", "box": {...}},
      {"id": "n3", "label": "potted plant", "box": {...}},
      {"id": "n4", "label": "countertop",  "box": {...}},
      {"id": "n5", "label": "cord",         "box": {...}}
    ],
    "edges": [
      {"subject_id": "n1", "predicate": "sits on",   "object_id": "n4"},
      {"subject_id": "n1", "predicate": "next to",   "object_id": "n2"},
      {"subject_id": "n1", "predicate": "next to",   "object_id": "n3"},
      {"subject_id": "n1", "predicate": "has a cord","object_id": "n5"}
    ]
  },
  "uncertain_edges": []
}

Coordinates in box are in normalized 0–1000 form (ymin, xmin, ymax, xmax); convert to pixels by multiplying each axis by the image's (height, width) and dividing by 1000.

uncertain_edges is the list of edges the Auditor rejected — empty here because every relationship the Cartographer proposed was visually confirmed. When this list is non-empty, those edges are excluded from the graph the Oracle reasons over, which is how LRTSG keeps unverified claims out of the final answer.


Project structure

LRTSG/
├── agents/
│   ├── base.py                  # Thin shim over a VLMBackend
│   ├── planner.py               # Question decomposition & planning
│   ├── narrator.py              # Visual observation & narration
│   ├── cartographer.py          # Scene Graph construction
│   ├── localizer.py             # Localizer wrapper (auto-pairs to backend)
│   ├── auditor.py               # Per-edge visual verification
│   ├── oracle.py                # Final answer + edge citations
│   ├── critic.py                # Self-evaluation of a draft answer (agentic only)
│   └── backends/
│       ├── base.py              # VLMBackend interface
│       ├── factory.py           # get_backend() reads VLM_BACKEND
│       ├── openai_compat_backend.py   # llama-server / vLLM / LM Studio / ...
│       ├── gemini_backend.py    # Google Gemini REST API
│       └── transformers_backend.py    # In-process HF transformers (advanced)
├── agentic/                     # Agentic loop (runs on the local backend, no server)
│   ├── tools.py                 # Tool functions + FUNCTION_MAP
│   ├── agent.py                 # ReAct: LLM-driven loop over a blackboard state
│   └── codeact.py               # CodeAct: Writer → Executor → image-aware Evaluator
├── core/
│   ├── orchestrator.py          # Wires all agents together (linear mode)
│   └── scene_graph.py           # Pydantic models: Node, Edge, Graph, BoundingBox
├── vision/
│   ├── localizer.py             # Localizer factory + GeminiLocalizer
│   ├── qwen_localizer.py        # Qwen3-VL bbox grounding
│   └── utils.py                 # draw_bboxes, crop_bbox, crop_encapsulated_region
├── tests/
│   ├── kitchen_test.png         # Example input image
│   └── test_vision.py           # Unit tests for vision/utils
├── app.py                       # Streamlit UI (linear mode)
├── main.py                      # CLI entry point (linear mode)
├── agentic_main.py              # CLI entry point (agentic — react / codeact)
├── smoke_local.py               # Standalone sanity script for the local stack
├── requirements.txt             # Always-required dependencies
├── requirements-local.txt       # Optional dependencies for local backends
├── .env.example                 # Template for env vars
├── output.json                  # Example output from a previous run
├── visualization.png            # Example annotated image
├── LOCAL_VLM_PLAN.md            # Design history of the local-VLM integration
└── README.md                    # This file

Hardware & software requirements

For the API path (Gemini)

  • Any machine with Python 3.10+
  • A GOOGLE_API_KEY
  • Internet connection

For the local path (Qwen3-VL via llama.cpp)

  • NVIDIA GPU with CUDA 12.x driver, ≥ 22 GB VRAM at runtime (RTX 4090 / 5090 / A6000 / similar). RTX 50-series cards need CUDA 12.8.
  • ~20 GB free disk for the GGUF + projector
  • Windows or Linux (instructions below assume Windows; Linux is similar)

The pipeline has been verified on:

  • Windows 11, RTX 5090 32 GB (Blackwell, sm_120), CUDA 12.8
  • Python 3.10 (conda env), torch 2.7+cu128, transformers 5.1
  • llama.cpp release b8994, CUDA 12.4 build (forward-compatible with the CUDA 12.8 driver)

Quick start — local Qwen3-VL (recommended)

1. Install Python dependencies

conda activate <your-env>           # or any Python 3.10+ environment
pip install -r requirements.txt     # core deps for both modes
pip install -r requirements-local.txt   # optional: only needed for in-process
                                        # transformers backend, not for llama.cpp

2. Download the Qwen3-VL GGUF + multimodal projector

hf download unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF \
    --include "Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" \
    --include "mmproj-BF16.gguf"

This downloads ~18.8 GB into your HuggingFace cache (HF_HOME if set, otherwise ~/.cache/huggingface/).

3. Get the llama.cpp Windows + CUDA binary

Download llama-bXXXX-bin-win-cuda-12.4-x64.zip from the latest llama.cpp release, unzip somewhere outside the repo (e.g. D:\llama.cpp\). You do not need the cudart-* archive if you already have CUDA 12.x installed.

The CUDA 12.4 binary is forward-compatible with CUDA 12.5 / 12.6 / 12.7 / 12.8 drivers. RTX 50-series cards JIT-compile the kernels for sm_120 on first run.

4. Start the model server (keep this terminal open)

D:\llama.cpp\llama-server.exe ^
  -m "<HF_CACHE>\hub\models--unsloth--Qwen3-VL-30B-A3B-Instruct-GGUF\snapshots\<HASH>\Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" ^
  --mmproj "<HF_CACHE>\hub\models--unsloth--Qwen3-VL-30B-A3B-Instruct-GGUF\snapshots\<HASH>\mmproj-BF16.gguf" ^
  -ngl 999 -c 8192 --port 8080

Replace <HF_CACHE> and <HASH> with your local paths. The server takes ~20 s to load and uses ~21 GB VRAM. It exposes an OpenAI-compatible API at http://127.0.0.1:8080/v1.

You can verify it's running:

curl http://127.0.0.1:8080/v1/models

5. Point the project at the local server

Copy .env.example to .env. The defaults already target the local server, so no edits are needed for this path:

VLM_BACKEND=openai_compat
OPENAI_COMPAT_BASE_URL=http://localhost:8080/v1

6. Run a query

python main.py --image tests/kitchen_test.png --question "Is the toaster silver?" --visualize

End-to-end runtime is ~20 s on an RTX 5090. Outputs go to output.json (the verified Scene Graph + answer) and visualization.png (the input image with bounding boxes drawn on the verified entities).

Windows + Chinese locale note: If you see UnicodeEncodeError: 'gbk' codec can't encode... when running through conda run, prefix the command with set PYTHONIOENCODING=utf-8 (cmd) or $env:PYTHONIOENCODING="utf-8" (PowerShell), or invoke the env's Python directly instead of going through conda run.


Quick start — Gemini API (no local GPU needed)

1. Install dependencies

pip install -r requirements.txt

2. Configure your API key

Copy .env.example to .env and set:

VLM_BACKEND=gemini
GOOGLE_API_KEY=your_key_here
GEMINI_MODEL=gemini-1.5-pro    # or gemini-2.0-flash for cheaper / faster

3. Run a query

python main.py --image tests/kitchen_test.png --question "Is the toaster silver?" --visualize

Streamlit UI

For interactive exploration:

streamlit run app.py

Opens at http://localhost:8501.

Streamlit UI

The sidebar lets you:

  • Toggle backend (openai_compatgemini)
  • See live server-reachability status when using the local path
  • Adjust the audit refinement loop (max_retries)
  • Copy a ready-to-use llama-server launch command

Configuration reference (.env)

# Which backend powers the agents.
# openai_compat  - local llama-server / vLLM / LM Studio (default)
# gemini         - Google Gemini REST API
# transformers   - In-process HF transformers (advanced; bnb4 is broken on
#                  Qwen3-VL-30B-A3B's MoE expert layers — use openai_compat
#                  for that model)
VLM_BACKEND=openai_compat

# OpenAI-compatible backend
OPENAI_COMPAT_BASE_URL=http://localhost:8080/v1
OPENAI_COMPAT_API_KEY=sk-no-key      # llama.cpp ignores this
OPENAI_COMPAT_MODEL=local            # llama.cpp ignores this too

# Localizer (zero-shot bbox detection)
# Leave blank to auto-pair to backend:
#   gemini        -> GeminiLocalizer
#   openai_compat -> Qwen3VLLocalizer (uses the active backend)
LOCALIZER_PROVIDER=

# In-process transformers backend (only used if VLM_BACKEND=transformers)
LOCAL_VLM_MODEL=Qwen/Qwen3-VL-8B-Instruct
LOCAL_VLM_QUANT=none                 # none (bf16) | bnb4 | awq | fp8
LOCAL_VLM_DEVICE=cuda

# Gemini API
GOOGLE_API_KEY=
GEMINI_MODEL=gemini-1.5-pro

Customizing the pipeline

Tweak agent prompts

Each agent's system_instruction lives in its respective file under agents/. To change how the Narrator describes scenes, for example, edit agents/narrator.py.

Adjust audit sensitivity

  • max_retries in core/orchestrator.py controls how many refinement rounds the Auditor runs per failed edge (default: 2).
  • padding_pct in the crop_encapsulated_region call inside the Orchestrator widens the crop the Auditor sees (default: 0.1, i.e. 10 %).

Extend the Scene Graph schema

  • Add fields to the Pydantic models in core/scene_graph.py.
  • Update the corresponding agent prompts so they populate the new fields.

Swap the local model

Any GGUF that llama.cpp can serve will work — change the -m and --mmproj paths in the launch command. Vision models need an mmproj file; text-only models don't (and obviously won't help with this pipeline).

Add a new backend

Create a class in agents/backends/ that subclasses VLMBackend and implements generate_text and generate_json. Then register it in agents/backends/factory.py.


Troubleshooting

"Server not reachable" in the Streamlit sidebar

The local llama-server.exe isn't running, or it's bound to a different port. Start it with the launch command from the sidebar's "How to start the local server" expander.

UnicodeEncodeError: 'gbk' codec can't encode character...

Your Windows locale defaults to GBK; the orchestrator prints / which can't be encoded. Set PYTHONIOENCODING=utf-8 before running, or invoke Python directly instead of through conda run (which pipes stdout through its own GBK encoder).

bitsandbytes 4-bit fails on Qwen3-VL-30B-A3B

Known incompatibility: bnb's 4-bit packer doesn't handle Qwen3-VL's MoE expert weight layouts. Use the openai_compat backend with a GGUF checkpoint instead. Documented in LOCAL_VLM_PLAN.md.

First inference is slow (~30 s) on the local backend

Expected on the very first call to a new model: CUDA kernels JIT-compile for your GPU's compute capability (especially on RTX 50-series sm_120), and the vision encoder warms up. Subsequent calls drop to 0.3–3 s thanks to llama.cpp's prompt cache.

OOM during the Auditor step

The Auditor crops the encapsulating region of two entities, which on high-res images can produce a lot of vision tokens. Two fixes:

  1. Reduce -c (context length) on the llama-server command line.
  2. Resize input images to ≤ 1024 px on the longest side before feeding them to the pipeline.

Performance reference (RTX 5090, Q4_K_XL, kitchen_test.png)

Stage Latency
Server load (one-time) ~20 s
First inference (warmup) ~30 s
Steady-state per agent call 0.3–3 s
End-to-end pipeline (6 agents + 4 audit edges) ~20 s
VRAM at runtime ~21 GB / 32 GB

What's intentionally out of scope

  • DepthAnything intermediate-image tools — separate workstream; the proposal mentions them as future enrichment. (SAM is integrated as an optional agentic tool — see "Two ways to run: linear vs. agentic".)
  • Benchmark harness for SPHERE-VLM / VisuLogic / similar — out of scope for this iteration.
  • Fine-tuning — the pipeline is purely zero-shot, by design.
  • vLLM as the local server — viable drop-in (just change OPENAI_COMPAT_BASE_URL), but vLLM prefers Linux/WSL.

Further reading

  • LOCAL_VLM_PLAN.md — design history of the local integration: which approaches were tried, which failed, and why.
  • llama.cpp — the inference engine used by the default backend.
  • Qwen3-VL — model documentation, including the official prompts for spatial grounding.
  • Unsloth GGUFs — pre-quantized checkpoints used by the default local setup.

About

Given an image I and a natural language question q, output an answer a to an arbitrary relationship reasoning question with no task-specific training, fine-tuning, or labeled examples. Scope & Assumptions.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors