
# Vision Agent 🎬🧠


Real-time multimodal AI video agent that watches, listens, detects, and reasons about video as it streams, with bounding-box overlays and 2-tier agent intelligence.

Stream live video → YOLOv8 detects objects with bounding boxes → the agent returns an instant deterministic reply (<500 ms) plus a polished LLM reply with provenance → everything measured and displayed in real time.

Built for the WeMakeDevs Vision Possible Hackathon — powered by Vision Agents SDK.


## 🚀 Live Demo

| Resource | URL |
|---|---|
| Backend | https://vision-agent-1.onrender.com |
| API Docs | https://vision-agent-1.onrender.com/docs |
| Health Check | https://vision-agent-1.onrender.com/health |

> Note: free-tier Render instances spin down after inactivity, so the first request may take ~30-50 s while the instance cold-starts.


## ✨ Features

| Feature | Detail |
|---|---|
| 📡 Live Webcam Stream | 2 s chunk streaming with real-time YOLO detection + bounding-box canvas overlays |
| 🎯 Bounding Box Overlays | Canvas-drawn bboxes with labels, confidence %, and per-label color coding |
| 📊 Real-Time Metrics | 6-metric dashboard: chunks, avg latency, P90, model FPS, frames, objects |
| 🤖 2-Tier Agent | FastReply (<500 ms, deterministic) + PolishReply (LLM, ~3-8 s, background) |
| 🔗 Provenance Links | Every agent response cites its sources: detection data, transcript, notes |
| 📤 Video Upload | Drag-and-drop MP4/MOV/WebM, full pipeline analysis |
| 🎙️ Audio Transcription | OpenAI Whisper API (cloud) |
| 🔍 Object Detection | YOLOv8n per-frame with full [x1,y1,x2,y2] bounding boxes (see the sketch below) |
| 🧠 AI Notes | LLM-generated summary, concepts, formulas, viva questions |
| 💬 Click-to-Ask | Click a detected label → agent answers with context |
| 🧪 Quiz Generator | MCQs + short-answer questions with auto-scoring |
| Dual LLM | Gemini + OpenAI with auto-fallback + quota handling |
| 🌐 URL Ingestion | Paste YouTube/Vimeo URL → auto-download + full analysis |
| 📐 LaTeX Formulas | MathJax-rendered formulas from lectures |
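
To make that detection format concrete, here is a minimal sketch of producing [x1,y1,x2,y2] boxes with the ultralytics API. It illustrates the format only and is not the repo's `vision_worker.py`; the input filename is a placeholder.

```python
from ultralytics import YOLO

# Load the nano model (matches the YOLO_MODEL default below).
model = YOLO("yolov8n.pt")
result = model("frame.jpg")[0]  # one image in -> first (only) result out

detections = [
    {
        "label": model.names[int(box.cls)],
        "confidence": round(float(box.conf), 3),
        "bbox": [float(v) for v in box.xyxy[0]],  # [x1, y1, x2, y2] in pixels
    }
    for box in result.boxes
]
print(detections)
```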

## 🏗️ Architecture

```
┌──────────────┐    ┌───────────────────────────────────────────────────┐
│  Browser UI  │───▶│                FastAPI Server                     │
│              │    │                                                   │
│  🎥 Live     │    │  /stream_chunk ──▶ vision_worker.py (YOLO)       │
│  Stream +    │    │                    ↳ bboxes + latency tracking    │
│  Canvas      │    │  /stream_status ──▶ real-time metrics            │
│  Overlays    │    │                                                   │
│              │    │  /ask ──▶ agent_core.py                           │
│  📊 Metrics  │    │           ├─ Tier A: FastReply (<500ms)           │
│  Dashboard   │    │           └─ Tier B: PolishReply (LLM + jobs.py) │
│              │    │                                                   │
│  🤖 Agent    │    │  /analyze ──▶ ffmpeg ──▶ whisper ──▶ YOLO        │
│  Chat Panel  │    │  /generate_notes ──▶ llm_provider.py (async job) │
│              │    │  /ingest_url ──▶ yt-dlp ──▶ full pipeline        │
└──────────────┘    └───────────────────────────────────────────────────┘
```

### 2-Tier Agent Intelligence

```
User clicks "person" label
         │
         ▼
┌─ Tier A: FastReply ─────────────────┐
│ Template + YOLO labels + transcript │  < 500 ms
│ "I see person ×3 in the video"      │  deterministic
│ Source: detection, transcript       │  cached
└─────────────────────────────────────┘
         │
         ▼ (background LLM job)
┌─ Tier B: PolishReply ───────────────┐
│ Full context → Gemini/OpenAI        │  ~3-8 s
│ Detailed analysis with reasoning    │  provenance links
│ Auto-fallback on quota/timeout      │  polls /jobs/{id}
└─────────────────────────────────────┘
```
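
A minimal client-side sketch of this flow using `requests`. The field names (`question`, `reply`, `job_id`, `status`, `result`) are assumptions based on the endpoint descriptions below, so check `/docs` for the real schema.

```python
import time

import requests

BASE = "http://localhost:8000"

# Tier A: /ask returns the deterministic FastReply immediately,
# plus (assumed) a job id for the background PolishReply.
fast = requests.post(f"{BASE}/ask", json={"question": "What is the person doing?"}).json()
print("FastReply:", fast.get("reply"))

# Tier B: poll /jobs/{id} until the background LLM job settles.
job_id = fast.get("job_id")
while job_id:
    job = requests.get(f"{BASE}/jobs/{job_id}").json()
    if job.get("status") in ("done", "failed"):
        print("PolishReply:", job.get("result"))
        break
    time.sleep(1)  # poll once per second
```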

## 🚀 Quick Start

### Option A: Docker (recommended)

```bash
OPENAI_API_KEY="sk-..." GEMINI_API_KEY="..." docker compose up --build
# Open http://localhost:8000
```

### Option B: Local Setup

**Prerequisites**

- Python 3.10+
- ffmpeg on PATH
- API key: OpenAI and/or Gemini

#### Windows PowerShell

```powershell
cd vision-agent\backend
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements.txt
$env:OPENAI_API_KEY = "sk-..."
$env:GEMINI_API_KEY = "..."
uvicorn main:app --reload --port 8000
```

#### Linux / macOS

```bash
cd vision-agent/backend
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export OPENAI_API_KEY="sk-..." GEMINI_API_KEY="..."
uvicorn main:app --reload --port 8000
```

> 💡 No API key? The server still runs: transcription returns a placeholder and notes fall back to pre-generated samples, so judges can see the full UI and stream demo immediately.
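
Once the server is up, a quick smoke test (the default port and a JSON health response are assumed here):

```python
import requests

resp = requests.get("http://localhost:8000/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # JSON body is an assumption; see /docs for the schema
```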

## 📡 API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/` | Upload & live-stream UI |
| GET | `/demo` | Interactive demo (QA, quiz, timeline) |
| POST | `/upload` | Upload video → extract frames |
| POST | `/analyze` | Full pipeline: frames + transcript + detection |
| POST | `/generate_notes` | Async LLM notes generation (returns job_id) |
| POST | `/ask` | 2-tier agent: fast reply + background LLM polish |
| POST | `/generate_quiz` | MCQ + short-answer quiz |
| POST | `/stream_chunk` | Stream 2 s chunk → YOLO bboxes + transcript |
| GET | `/stream_status` | Real-time metrics: latency, FPS, detections |
| POST | `/stream_finalize` | Stitch chunks + full analysis |
| POST | `/ingest_url` | Download from URL + full pipeline |
| GET | `/jobs/{id}` | Poll async job progress |
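
As an example of wiring these together, a hedged sketch of the upload path. The multipart field name and the job response fields (`job_id`, `status`, `result`) are assumptions; consult `/docs` before relying on them.

```python
import time

import requests

BASE = "http://localhost:8000"

# 1. Upload a local video (the multipart field name "file" is an assumption).
with open("lecture.mp4", "rb") as f:
    print("upload:", requests.post(f"{BASE}/upload", files={"file": f}).status_code)

# 2. Run the full pipeline: frames + transcript + detection.
print("analyze:", requests.post(f"{BASE}/analyze").status_code)

# 3. Kick off async notes generation, then poll the job until it settles.
job_id = requests.post(f"{BASE}/generate_notes").json().get("job_id")
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}").json()
    if job.get("status") in ("done", "failed"):
        print(job.get("result"))
        break
    time.sleep(2)
```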

## ⚙️ Environment Variables

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (none) | Transcription (Whisper), notes, QA, quiz |
| `GEMINI_API_KEY` | (none) | Gemini LLM for notes + agent reasoning |
| `LLM_PROVIDER` | `auto` | `gemini`, `openai`, or `auto` (try both; sketched below) |
| `CLOUDFLARE_ACCOUNT_ID` | (none) | Cloudflare Workers AI account id (optional) |
| `CLOUDFLARE_API_TOKEN` | (none) | Cloudflare Workers AI API token (optional) |
| `CLOUDFLARE_MODEL` | `@cf/qwen/qwen1.5-14b-chat-awq` | Cloudflare Workers AI model id |
| `GEMINI_MODEL` | `gemini-2.0-flash` | Gemini model |
| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI chat model |
| `YOLO_MODEL` | `yolov8n.pt` | YOLO model file (nano/small/medium) |
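
To illustrate the `LLM_PROVIDER=auto` behavior described above, a hypothetical sketch of provider selection with fallback; the real logic lives in `llm_provider.py` and may differ.

```python
import os

def provider_order() -> list[str]:
    """Providers in the order they should be tried (hypothetical logic)."""
    mode = os.environ.get("LLM_PROVIDER", "auto")
    if mode != "auto":
        return [mode]
    order = []
    if os.environ.get("GEMINI_API_KEY"):
        order.append("gemini")
    if os.environ.get("OPENAI_API_KEY"):
        order.append("openai")
    return order

def call_provider(provider: str, prompt: str) -> str:
    # Placeholder for the real SDK calls (google-generativeai / openai).
    raise NotImplementedError(provider)

def complete(prompt: str) -> str:
    """Try each configured provider; fall back on quota errors or timeouts."""
    last_err: Exception | None = None
    for provider in provider_order():
        try:
            return call_provider(provider, prompt)
        except Exception as err:
            last_err = err  # quota/timeout: move on to the next provider
    raise RuntimeError(f"All providers failed: {last_err}")
```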

## Live Coach (Pose + Rep Counting)

The Live Stream tab now includes a fitness coach panel (exercise selection, rep counter, form cues, and optional browser voice feedback).

- Backend dependency: `mediapipe` (installed via `backend/requirements.txt`). If it is missing, the rest of the app still works, but rep counting is disabled (see the sketch below).
- No paid TTS required: the UI uses the browser's built-in Speech Synthesis when enabled.
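
For orientation, a minimal sketch of angle-based rep counting on MediaPipe pose landmarks (a bicep-curl example with illustrative thresholds; not the repo's actual coach logic):

```python
import math

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def joint_angle(a, b, c) -> float:
    """Angle at joint b (degrees) formed by landmarks a-b-c."""
    ang = abs(math.degrees(
        math.atan2(c.y - b.y, c.x - b.x) - math.atan2(a.y - b.y, a.x - b.x)
    ))
    return ang if ang <= 180 else 360 - ang

reps, extended = 0, False
cap = cv2.VideoCapture(0)
with mp_pose.Pose() as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not results.pose_landmarks:
            continue
        lm = results.pose_landmarks.landmark
        elbow = joint_angle(
            lm[mp_pose.PoseLandmark.LEFT_SHOULDER],
            lm[mp_pose.PoseLandmark.LEFT_ELBOW],
            lm[mp_pose.PoseLandmark.LEFT_WRIST],
        )
        if elbow > 160:                # arm extended: start of a curl
            extended = True
        elif elbow < 40 and extended:  # arm curled after extension: one rep
            reps, extended = reps + 1, False
            print("reps:", reps)
cap.release()
```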

## 🛠️ Tech Stack

- **Backend:** Python 3.11, FastAPI, Uvicorn
- **Vision:** YOLOv8 (ultralytics), OpenCV, `vision_worker.py` singleton
- **Audio:** OpenAI Whisper API
- **LLM:** Gemini + OpenAI with provider abstraction (`llm_provider.py`)
- **Agent:** 2-tier reasoning engine (`agent_core.py`)
- **Jobs:** Thread-safe async job store (`jobs.py`; sketched below)
- **Math:** MathJax 3 (LaTeX rendering)
- **Streaming:** ffmpeg, MediaRecorder API, growing-file WebM strategy
- **Frontend:** Vanilla HTML/CSS/JS, Canvas API for bounding boxes, dark glassmorphism
- **Deploy:** Docker, GitHub Actions CI
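
As an illustration of the job-store idea, a hypothetical minimal version (not the repo's `jobs.py`): a dict of job records guarded by a lock, updated by worker threads and read by the `/jobs/{id}` poller.

```python
import threading
import uuid

class JobStore:
    """Minimal thread-safe job store: create, update, and read job records."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}
        self._lock = threading.Lock()

    def create(self) -> str:
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "pending", "result": None}
        return job_id

    def update(self, job_id: str, **fields) -> None:
        with self._lock:
            self._jobs[job_id].update(fields)

    def get(self, job_id: str) -> dict:
        with self._lock:
            return dict(self._jobs[job_id])  # copy so callers can't mutate state

# Usage: a worker thread finishes the LLM call and marks the job done.
store = JobStore()
job_id = store.create()
threading.Thread(
    target=lambda: store.update(job_id, status="done", result="polished reply")
).start()
```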

## 📁 Project Structure

```
vision-agent/
├── backend/
│   ├── main.py              # FastAPI app (all routes)
│   ├── agent_core.py        # 2-tier agent: FastReply + PolishReply
│   ├── vision_worker.py     # Singleton YOLO worker with latency tracking
│   ├── detect.py            # YOLOv8 detection with bounding boxes
│   ├── frame_extractor.py   # OpenCV frame extraction
│   ├── transcribe.py        # OpenAI Whisper transcription
│   ├── llm_provider.py      # Gemini/OpenAI provider abstraction
│   ├── jobs.py              # Thread-safe async job store
│   ├── streaming.py         # Real-time chunk streaming + /stream_status
│   ├── url_ingest.py        # URL download + pipeline
│   ├── requirements.txt
│   ├── analysis/sample/     # Pre-generated outputs for judges
│   └── static/
│       ├── index.html       # Upload & live-stream UI with canvas overlays
│       └── demo.html        # Interactive demo (QA, quiz, timeline)
├── Dockerfile
├── docker-compose.yml
├── README.md
├── SUBMISSION_NOTES.md
├── BLOG_POST.md
└── LICENSE
```

## 📊 Performance Metrics

| Step | Time | Notes |
|---|---|---|
| Frame extraction (30 s video) | ~1-2 s | OpenCV at 1 fps |
| Whisper transcription (cloud) | ~2-5 s | Depends on audio length |
| YOLOv8 detection (30 frames) | ~3-6 s | Nano model, CPU |
| LLM notes generation | ~3-8 s | Gemini 2.0 Flash or GPT-4o-mini |
| FastReply (agent Tier A) | <500 ms | Deterministic, cached |
| PolishReply (agent Tier B) | ~3-8 s | Full LLM analysis |
| Stream chunk (end-to-end) | ~2-5 s | Transcode + extract + detect |
| Total pipeline | ~10-20 s | Full upload → analysis |

## License

MIT — see LICENSE


Built with ❤️ for the WeMakeDevs Vision Possible Hackathon — powered by Vision Agents SDK, Gemini, & OpenAI

