Real-time visual assistance for blind and visually impaired users, powered by Vision Agents.
LightLens streams live camera video to an AI agent that describes surroundings, warns about hazards, and gives step-by-step walking directions using clock-face references ("chair at 2 o'clock, about 3 steps").
LightLens is built on Vision Agents, a framework for building real-time voice and video AI applications. Vision Agents handles the hard parts (WebRTC transport, LLM orchestration, video frame distribution, function calling, and session management) so we can focus on the visual assistance logic.
Agent + Gemini Realtime LLM: Vision Agents connects the user's microphone and camera to Google's Gemini Realtime model for native speech-to-speech conversation with sub-50ms latency. The user speaks, Gemini sees the live video and hears the audio, and speaks back, all through Vision Agents' Agent class and the gemini.Realtime plugin.
Video Processors: Vision Agents distributes video frames to custom processors at independent FPS rates. LightLens uses three:
| Processor | Base Class | FPS | What It Does |
|---|---|---|---|
| YOLO | VideoProcessorPublisher | 20 | Detects objects (person, chair, car, etc.) with bounding boxes |
| MiDaS | VideoProcessorPublisher | 15 | Estimates depth/distance for a 3x3 spatial grid |
| Navigation | Processor | every 5s | Fuses YOLO + MiDaS into step-by-step walking directions |
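The fusion step in the Navigation processor can be sketched roughly as below. This is a simplified stand-in, not the real processor: the actual code consumes live YOLO boxes and a MiDaS depth grid, and the field-of-view and step-length constants here are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical simplified input: the real processor consumes live YOLO
# detections and a MiDaS 3x3 depth grid; these names are illustrative.
@dataclass
class Detection:
    label: str
    x_center: float   # normalized 0..1 across the frame width
    depth_m: float    # estimated distance in meters

STEP_LENGTH_M = 0.75  # assumed average walking step

def clock_direction(x_center: float, fov_deg: float = 120.0) -> int:
    """Map a horizontal frame position to a clock-face hour."""
    # Offset from frame center in degrees; one clock hour spans 30 degrees.
    angle = (x_center - 0.5) * fov_deg
    hour = round(angle / 30.0)
    return (12 + hour - 1) % 12 + 1   # wrap so an offset of 0 reads as 12

def describe(det: Detection) -> str:
    steps = max(1, round(det.depth_m / STEP_LENGTH_M))
    return f"{det.label} at {clock_direction(det.x_center)} o'clock, about {steps} steps"

# An object on the right edge of the frame, just over two meters away:
print(describe(Detection("chair", x_center=0.9, depth_m=2.2)))
# → chair at 2 o'clock, about 3 steps
```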
Function Calling: Vision Agents' @llm.register_function() decorator lets us register Python functions that Gemini can call mid-conversation. LightLens registers get_steps_to_nearest_object() so the agent can answer "how do I get to the chair?" with real-time sensor data.
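The general shape of that pattern looks like the sketch below. This is a local stand-in for illustration, not the framework's implementation, and the signature of get_steps_to_nearest_object here is an assumption.

```python
import json
from typing import Callable

# Minimal stand-in for the function-calling pattern; the real
# @llm.register_function() decorator lives in Vision Agents.
class ToolRegistry:
    def __init__(self) -> None:
        self.tools: dict[str, Callable] = {}

    def register_function(self):
        def wrap(fn: Callable) -> Callable:
            self.tools[fn.__name__] = fn  # exposed to the model by name
            return fn
        return wrap

    def call(self, name: str, arguments: str):
        """Dispatch a model-issued tool call (function name + JSON args)."""
        return self.tools[name](**json.loads(arguments))

llm = ToolRegistry()

@llm.register_function()
def get_steps_to_nearest_object(label: str) -> str:
    # In LightLens this would read live YOLO + MiDaS state; hardcoded here.
    return f"{label} at 2 o'clock, about 3 steps"

# Simulate the model calling the tool mid-conversation:
print(llm.call("get_steps_to_nearest_object", '{"label": "chair"}'))
# → chair at 2 o'clock, about 3 steps
```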
Edge Transport (Stream.io): Vision Agents' getstream.Edge plugin handles WebRTC video/audio transport and chat through Stream's global edge network. The framework manages call creation, user tokens, and session lifecycle.
HTTP Server Mode: Vision Agents' Runner + AgentLauncher serve the agent as a FastAPI application with built-in session management (POST /sessions, DELETE /sessions/{id}), health checks, and concurrency limits.
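The session lifecycle behind those endpoints can be sketched as a toy in-memory manager. This is an illustration of the lifecycle only, not the Runner's actual code; the limit value and error handling are assumptions.

```python
import uuid

# Toy sketch of the session lifecycle served over HTTP
# (POST /sessions creates, DELETE /sessions/{id} tears down).
class SessionManager:
    def __init__(self, max_sessions: int = 4):
        self.max_sessions = max_sessions  # concurrency limit (illustrative)
        self.active: dict[str, dict] = {}

    def create(self, user_id: str) -> str:
        if len(self.active) >= self.max_sessions:
            raise RuntimeError("concurrency limit reached")  # e.g. HTTP 429
        session_id = uuid.uuid4().hex
        self.active[session_id] = {"user": user_id}
        return session_id

    def delete(self, session_id: str) -> None:
        self.active.pop(session_id, None)  # idempotent, like HTTP DELETE

sessions = SessionManager(max_sessions=1)
sid = sessions.create("alice")
print(len(sessions.active))  # → 1
sessions.delete(sid)
print(len(sessions.active))  # → 0
```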
```
Browser (React)                       Backend (Python)
┌──────────────┐   Stream WebRTC    ┌───────────────────────┐
│  Video Call  │◄──────────────────►│  Gemini Realtime LLM  │
│  Chat Panel  │                    │                       │
│              │                    │ YOLO Processor (20fps)│
│  YOLO Panel  │   WebSocket /ws    │ MiDaS Processor(15fps)│
│  MiDaS Panel │◄──────────────────►│  Nav Processor (5s)   │
│  Nav Panel   │                    │                       │
└──────────────┘                    └───────────────────────┘
```
- Python 3.13+
- Node.js 18+
- Stream.io account (API key + secret)
- Google AI Studio Gemini API key
```
git clone <repo-url>
cd LightLens
```

Create the environment file:

```
cp ai/.env.example ai/.env
```

Edit ai/.env and fill in your keys:

```
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GEMINI_API_KEY=your_gemini_api_key
...
```
```
cd ai
uv sync
uv run main.py serve
```

The backend starts on http://localhost:8000.
```
cd frontend
npm install
npm run dev
```

The frontend starts on http://localhost:5173.
Open http://localhost:5173 in your browser, click Connect, and allow camera + microphone access. The AI agent will join the call and start describing what it sees.
```
ai/
├── main.py              # Entry point
├── app.py               # FastAPI app setup
├── config.py            # Environment variables
├── agent.py             # Agent creation + LLM function tools
├── ws_manager.py        # WebSocket broadcast manager
├── instructions.md      # Agent behavior prompt (16 rules)
├── processors/
│   ├── yolo_processor.py
│   ├── midas_processor.py
│   └── navigation_processor.py
└── routes/
    ├── token.py         # POST /api/token
    └── ws.py            # WebSocket /ws
frontend/
├── src/
│   ├── App.tsx          # Main layout
│   ├── api.ts           # Backend API calls
│   ├── hooks/           # useStreamCall, useWebSocket
│   └── components/      # UI panels
└── vite.config.ts       # Dev proxy to backend
```
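The fan-out that ws_manager.py performs (pushing processor results to every connected /ws client) can be sketched with per-client queues. The queue-per-client design below is an assumption for illustration, not the actual implementation.

```python
import asyncio

# Stripped-down sketch of a WebSocket broadcast manager: each connected
# /ws client gets a queue, and processor results are fanned out to all.
class BroadcastManager:
    def __init__(self) -> None:
        self.clients: set[asyncio.Queue] = set()

    def connect(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.clients.add(q)
        return q

    def disconnect(self, q: asyncio.Queue) -> None:
        self.clients.discard(q)

    async def broadcast(self, message: dict) -> None:
        for q in self.clients:
            await q.put(message)  # each client drains its own queue

async def main():
    mgr = BroadcastManager()
    q = mgr.connect()
    await mgr.broadcast({"type": "yolo", "objects": ["chair"]})
    print(await q.get())  # → {'type': 'yolo', 'objects': ['chair']}

asyncio.run(main())
```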
MIT