react-video-agent

A ReAct (Reasoning + Acting) agent that answers natural language questions about video files.

Overview

The agent receives a video file and a question, then iteratively calls tools in a THINK → ACT → OBSERVE loop to derive an answer.

Question
  ↓
THINK → ACT → OBSERVE  (up to MAX_ITERATIONS times)
  ↓
Answer

Available tools:

Tool	Role
`global_browse_tool`	Semantic search over the entire video for events and subjects
`clip_search_tool`	Semantic search for clips matching a text description
`frame_inspect_tool`	Visual analysis of frames in a given time range via VLM (enabled when `LITE_MODE=False`)

Background

This project is a restructured implementation based on DeepVideoDiscovery by Microsoft.

Key changes from the original:

Abstracted LLM backend to allow easy model swapping
Separated video processing, tools, and agent core into distinct layers
Reorganized module structure inspired by react-from-scratch

Setup

Requirements: Python 3.12 or higher

pip install -e .

Environment variables: Create a .env file in the project root.

# For OpenAI API
OPENAI_API_KEY=sk-...

# For Azure OpenAI, edit endpoint settings directly in src/config/settings.py

Usage

python agent_run.py <path/to/video> "<question>"

Example:

python agent_run.py ./video.mp4 "Where was the main character at the end?"

On the first run, the following steps are performed automatically:

Decode video into frames → video_database/<video_id>/frames/
Generate captions → video_database/<video_id>/captions/captions.json
Build vector DB → video_database/<video_id>/database.json

Subsequent runs reuse the cache, so different questions on the same video respond much faster.

Configuration

Key settings in src/config/settings.py:

Setting	Default	Description
`LITE_MODE`	`True`	When `True`, uses subtitle text only (skips frame-level VLM analysis)
`MAX_ITERATIONS`	`3`	Maximum number of agent loop iterations
`AOAI_ORCHESTRATOR_LLM_MODEL_NAME`	`o3`	Model used for orchestration
`AOAI_TOOL_VLM_MODEL_NAME`	`gpt-4.1-mini`	Model used for tool-level VLM calls
`GLOBAL_BROWSE_TOPK`	`300`	Maximum number of clips returned by `global_browse_tool`

Code Structure

src/
├── config/
│   ├── logging.py      # Logging setup
│   └── settings.py     # Configuration values (model names, endpoints, etc.)
├── llm/
│   ├── base.py         # Abstract LLM interface (BaseLLM)
│   └── openai.py       # OpenAI / Azure OpenAI implementation (OpenAILLM)
├── react/
│   └── agent.py        # ReAct agent core (DVDCoreAgent)
├── tools/
│   ├── clip_search.py  # clip_search_tool
│   ├── frame_inspect.py # frame_inspect_tool
│   └── global_browse.py # global_browse_tool
├── utils/
│   ├── retry.py        # Exponential backoff retry decorator
│   └── schema.py       # JSON schema auto-generation for OpenAI Function Calling
└── video/
    ├── caption.py      # Frame captioning pipeline
    ├── database.py     # Vector DB construction and management
    └── utils.py        # Video download and frame extraction

Extending the Agent

Adding a new LLM backend

Subclass BaseLLM from src/llm/base.py and pass the instance to DVDCoreAgent.

# src/llm/qwen.py (example)
from src.llm.base import BaseLLM

class QwenLLM(BaseLLM):
    def __init__(self, base_url: str, model_name: str):
        self.base_url = base_url
        self.model_name = model_name

    def call_with_tools(self, messages, tools=None, **kwargs) -> dict:
        # Call an OpenAI-compatible endpoint (e.g. vLLM / Ollama)
        ...

    def get_embeddings(self, text) -> list:
        ...

# Usage in agent_run.py
from src.llm.qwen import QwenLLM

agent = DVDCoreAgent(
    video_db_path=video_db_path,
    video_caption_path=caption_file,
    max_iterations=15,
    llm=QwenLLM(base_url="http://localhost:8000", model_name="qwen2.5"),
)

Adding a new tool

Create a new file in src/tools/

# src/tools/my_tool.py (example)
from typing import Annotated as A
from src.utils.schema import doc as D

def my_tool(
    query: A[str, D("Search query")],
) -> str:
    """Tool description (used as a prompt sent to OpenAI)."""
    ...
    return result

Add the function to self.tools in src/react/agent.py

from src.tools.my_tool import my_tool

self.tools = [frame_inspect_tool, clip_search_tool, global_browse_tool, my_tool, finish]

The agent will automatically recognize and call the new tool.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
src		src
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
README_ja.md		README_ja.md
REFACTORING_PLAN.md		REFACTORING_PLAN.md
agent_run.py		agent_run.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

react-video-agent

Overview

Background

Setup

Usage

Configuration

Code Structure

Extending the Agent

Adding a new LLM backend

Adding a new tool

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

react-video-agent

Overview

Background

Setup

Usage

Configuration

Code Structure

Extending the Agent

Adding a new LLM backend

Adding a new tool

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages