LightLens

Real-time visual assistance for blind and visually impaired users, powered by Vision Agents.

LightLens streams live camera video to an AI agent that describes surroundings, warns about hazards, and gives step-by-step walking directions using clock-face references ("chair at 2 o'clock, about 3 steps").
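The clock-face phrasing can be computed from an object's horizontal bearing and estimated distance. A minimal sketch of that conversion (the helper names and the 0.7 m step length are illustrative assumptions, not the project's actual code):

```python
import math

STEP_LENGTH_M = 0.7  # assumed average walking step, in meters

def bearing_to_clock(angle_deg: float) -> int:
    """Map a horizontal bearing (degrees, 0 = straight ahead,
    positive = to the right) to the nearest clock-face hour."""
    hour = round(angle_deg / 30) % 12  # 30 degrees per clock hour
    return 12 if hour == 0 else hour

def describe(obj: str, angle_deg: float, distance_m: float) -> str:
    """Render one spoken hint in the README's clock-face style."""
    steps = max(1, round(distance_m / STEP_LENGTH_M))
    return f"{obj} at {bearing_to_clock(angle_deg)} o'clock, about {steps} steps"

print(describe("chair", 58, 2.1))  # → "chair at 2 o'clock, about 3 steps"
```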

How It Works

LightLens is built on Vision Agents, a framework for building real-time voice and video AI applications. Vision Agents handles the hard parts (WebRTC transport, LLM orchestration, video frame distribution, function calling, and session management) so we can focus on the visual assistance logic.

Vision Agents in This Project

Agent + Gemini Realtime LLM: Vision Agents connects the user's microphone and camera to Google's Gemini Realtime model for native speech-to-speech conversation with sub-50ms latency. The user speaks; Gemini sees the live video, hears the audio, and speaks back, all through Vision Agents' Agent class and the gemini.Realtime plugin.

Video Processors: Vision Agents distributes video frames to custom processors at independent FPS rates. LightLens uses three:

Processor    Base Class                Rate       What It Does
YOLO         VideoProcessorPublisher   20 fps     Detects objects (person, chair, car, etc.) with bounding boxes
MiDaS        VideoProcessorPublisher   15 fps     Estimates depth/distance for a 3x3 spatial grid
Navigation   Processor                 every 5s   Fuses YOLO + MiDaS into step-by-step walking directions
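To illustrate the fusion step, here is a simplified sketch of picking the closest detected object by looking up its cell in a 3x3 depth grid. The data structures (cell-indexed detections, meters per cell) are assumptions for illustration; the real processors' outputs are not shown in this README:

```python
# Hypothetical MiDaS output: estimated distance in meters per grid cell.
depth_grid = [
    [4.0, 3.5, 5.0],   # top row: far
    [2.5, 1.4, 3.0],   # middle row
    [1.2, 0.9, 2.0],   # bottom row: near (ground plane)
]

# Hypothetical YOLO output, mapped onto the same 3x3 grid.
detections = [
    {"label": "chair", "row": 1, "col": 1},
    {"label": "person", "row": 0, "col": 2},
]

def nearest_object(dets, grid):
    """Return the detection whose grid cell has the smallest depth."""
    return min(dets, key=lambda d: grid[d["row"]][d["col"]])

obj = nearest_object(detections, depth_grid)
print(obj["label"], depth_grid[obj["row"]][obj["col"]])  # → chair 1.4
```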

Function Calling: Vision Agents' @llm.register_function() decorator lets us register Python functions that Gemini can call mid-conversation. LightLens registers get_steps_to_nearest_object() so the agent can answer "how do I get to the chair?" with real-time sensor data.

Edge Transport (Stream.io): Vision Agents' getstream.Edge plugin handles WebRTC video/audio transport and chat through Stream's global edge network. The framework manages call creation, user tokens, and session lifecycle.
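Stream user tokens are JWTs signed HS256 with the API secret. A minimal stdlib sketch of minting one (in this project the getstream plugin / Stream server SDK does this for you; the function name is an assumption):

```python
# Hand-rolled Stream-style user token for illustration only;
# use the official Stream SDK in real code.
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def stream_user_token(api_secret: str, user_id: str) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"user_id": user_id}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(api_secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

token = stream_user_token("demo-secret", "lightlens-user")
print(token.count("."))  # a JWT has three dot-separated parts → prints 2
```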

HTTP Server Mode: Vision Agents' Runner + AgentLauncher serve the agent as a FastAPI application with built-in session management (POST /sessions, DELETE /sessions/{id}), health checks, and concurrency limits.

Architecture

Browser (React)                          Backend (Python)
┌──────────────┐    Stream WebRTC    ┌───────────────────────┐
│  Video Call  │◄──────────────────► │  Gemini Realtime LLM  │
│  Chat Panel  │                     │                       │
│              │                     │ YOLO Processor (20fps)│
│  YOLO Panel  │    WebSocket /ws    │ MiDaS Processor(15fps)│
│  MiDaS Panel │◄──────────────────► │  Nav Processor (5s)   │
│  Nav Panel   │                     │                       │
└──────────────┘                     └───────────────────────┘
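The WebSocket leg on the right can be sketched as a small broadcast manager that fans processor results out to every connected panel. The class name, message shape, and send_text interface are assumptions for illustration; the project's ws_manager.py may differ:

```python
import asyncio
import json

class WSManager:
    """Fan out processor results (YOLO boxes, depth grid, nav text)
    to every connected frontend panel over the /ws socket."""

    def __init__(self):
        self.clients = set()

    def connect(self, ws):
        self.clients.add(ws)

    def disconnect(self, ws):
        self.clients.discard(ws)

    async def broadcast(self, kind: str, payload) -> None:
        msg = json.dumps({"type": kind, "data": payload})
        for ws in list(self.clients):
            try:
                await ws.send_text(msg)
            except Exception:
                self.disconnect(ws)  # drop dead connections

# Minimal demo with a fake socket that records what it was sent:
class FakeSocket:
    def __init__(self):
        self.sent = []
    async def send_text(self, text):
        self.sent.append(text)

mgr = WSManager()
sock = FakeSocket()
mgr.connect(sock)
asyncio.run(mgr.broadcast("nav", "chair at 2 o'clock, about 3 steps"))
print(sock.sent[0])
```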

Prerequisites

uv (the backend is run with uv sync / uv run)
Node.js and npm (for the frontend)
Stream API key and secret
Google Gemini API key

Setup

1. Clone and configure

git clone <repo-url>
cd LightLens

Create the environment file:

cp ai/.env.example ai/.env

Edit ai/.env and fill in your keys:

STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GEMINI_API_KEY=your_gemini_api_key
...

2. Start the backend

cd ai
uv sync
uv run main.py serve

The backend starts on http://localhost:8000.

3. Start the frontend

cd frontend
npm install
npm run dev

The frontend starts on http://localhost:5173.

4. Use it

Open http://localhost:5173 in your browser, click Connect, and allow camera + microphone access. The AI agent will join the call and start describing what it sees.

Project Structure

ai/
├── main.py              # Entry point
├── app.py               # FastAPI app setup
├── config.py            # Environment variables
├── agent.py             # Agent creation + LLM function tools
├── ws_manager.py        # WebSocket broadcast manager
├── instructions.md      # Agent behavior prompt (16 rules)
├── processors/
│   ├── yolo_processor.py
│   ├── midas_processor.py
│   └── navigation_processor.py
└── routes/
    ├── token.py         # POST /api/token
    └── ws.py            # WebSocket /ws

frontend/
├── src/
│   ├── App.tsx          # Main layout
│   ├── api.ts           # Backend API calls
│   ├── hooks/           # useStreamCall, useWebSocket
│   └── components/      # UI panels
└── vite.config.ts       # Dev proxy to backend

License

MIT
