Multimodal Video Insight Search

Technical Documentation

Project Overview

This project implements a multimodal video search engine that lets users query video content in natural language. The system locates specific moments within a video file by computing the semantic similarity between a text query and frames sampled from the video at 1 FPS.
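
The matching idea can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats; the function names `cosine_similarity` and `best_frame` are hypothetical and not taken from the project notebook, and the toy 3-dimensional vectors stand in for real 512-dimensional CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def best_frame(frame_embeddings, text_embedding):
    """Return (index, score) of the frame most similar to the query."""
    scores = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors: one entry per sampled frame, plus one query vector.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
query = [0.0, 2.0, 0.0]
index, score = best_frame(frames, query)
```

Because frames are sampled at 1 FPS, the winning index maps directly to a timestamp in seconds.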

Key Capability: Search inside videos with natural language — find the exact moment that best matches your description.

🏗️ System Architecture

flowchart TD
    A[📹 Video Input] --> B[⏱️ Temporal Sampling\nOpenCV - 1 FPS]
    B --> C[🖼️ Frame Extraction]
    C --> D[🔍 CLIP Vision Encoder\nopenai/clip-vit-base-patch32]
    E[📝 Text Query] --> F[🔤 CLIP Text Encoder]
    D --> G[🧠 512-dim Image Embeddings]
    F --> H[🧠 512-dim Text Embedding]
    G & H --> I[📐 Cosine Similarity]
    I --> J[🔥 Find Global Maximum]
    J --> K[⏰ Best Timestamp + Confidence]
    
    style A fill:#ff9,stroke:#333
    style K fill:#9f9,stroke:#333,stroke-width:3px
    style I fill:#bbf,stroke:#333
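
The temporal sampling stage in the diagram can be sketched as an index calculation. This is an illustration under assumptions, not the project's code: `sample_frame_indices` and `index_to_timestamp` are hypothetical helpers, and the frame count and frame rate are passed in directly (with OpenCV they would typically come from `cv2.VideoCapture.get` with `cv2.CAP_PROP_FRAME_COUNT` and `cv2.CAP_PROP_FPS`).

```python
def sample_frame_indices(total_frames, video_fps, sample_hz=1.0):
    """Indices of the source frames to keep when sampling at `sample_hz`."""
    if video_fps <= 0 or sample_hz <= 0:
        raise ValueError("frame rates must be positive")
    duration_s = total_frames / video_fps
    indices = []
    k = 0
    while k / sample_hz < duration_s:
        # Nearest source frame to the k-th sampling instant.
        idx = int(round(k / sample_hz * video_fps))
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices


def index_to_timestamp(sample_index, sample_hz=1.0):
    """Timestamp in seconds of the k-th sampled frame."""
    return sample_index / sample_hz


# A 3-second clip at 30 FPS keeps one frame per second.
kept = sample_frame_indices(total_frames=90, video_fps=30.0)
```

Keeping the sampling rate as a parameter makes it easy to trade search resolution against the number of frames that must be encoded.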

🛠️ Technical Specifications

Core Technologies

- Model: OpenAI CLIP ViT-B/32 (Vision Transformer)
- Backend: PyTorch + Hugging Face Transformers
- Video Processing: OpenCV + PIL
- UI/UX: Gradio
- Embedding Dimension: 512
- Precision: float16 (optimized memory usage)
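
As a rough worked example of what the 512-dimensional float16 embeddings cost in memory, the sketch below counts only the stored frame embeddings, not model weights or decoded frames. The helper `embedding_bytes` is hypothetical, and the 1 Hz default mirrors the sampling rate shown in the architecture diagram.

```python
EMBEDDING_DIM = 512
BYTES_PER_VALUE = {"float16": 2, "float32": 4}


def embedding_bytes(duration_s, sample_hz=1.0, dtype="float16"):
    """Bytes needed to hold one embedding per sampled frame."""
    n_frames = int(duration_s * sample_hz)
    return n_frames * EMBEDDING_DIM * BYTES_PER_VALUE[dtype]


# One hour of video sampled at 1 FPS gives 3600 embeddings.
one_hour_fp16 = embedding_bytes(3600)                   # 3_686_400 bytes
one_hour_fp32 = embedding_bytes(3600, dtype="float32")  # 7_372_800 bytes
```

At float16, an hour of video therefore needs about 3.5 MiB of embedding storage, half of what float32 would use.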

The Moondream2 Pivot

Initially, the project explored Moondream2, a lightweight Vision-Language Model (VLM). Repeated compatibility issues with the Transformers library, particularly rope_scaling and pad_token_id errors in custom modeling files, led to a pivot to CLIP.

subgraph Preprocessing
    A[Video File] --> B[Decode with OpenCV]
    B --> C[Sample Frames @ 1Hz]
    C --> D[Resize 224×224]
    D --> E[Normalize<br/>ImageNet Stats]
end

subgraph Encoding
    E --> F[CLIP Vision Encoder]
    G[Text Query] --> H[CLIP Text Encoder]

    F --> I[Image Embeddings<br/>512-dim]
    H --> J[Text Embedding<br/>512-dim]
end

subgraph Matching
    I --> K[Cosine Similarity]
    J --> K
    K --> L[Softmax over Time]
    L --> M[Max Score + Timestamp]
end

classDef proc fill:#e1f5fe,stroke:#01579b
classDef enc fill:#f3e5f5,stroke:#4a148c
classDef match fill:#e8f5e9,stroke:#1b5e20

class A,B,C,D,E proc
class F,G,H,I,J enc
class K,L,M match
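
The Matching stage above (cosine similarity, softmax over time, then max score and timestamp) might look like the following sketch. The function names are hypothetical, and the `temperature` value of 0.01 is an assumption made for illustration; the source does not state what scaling, if any, is applied before the softmax.

```python
import math


def softmax_over_time(similarities, temperature=1.0):
    """Turn per-frame similarities into a distribution over time steps."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def best_timestamp(similarities, sample_hz=1.0, temperature=0.01):
    """Return (timestamp_seconds, confidence) of the best-matching frame."""
    probs = softmax_over_time(similarities, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best / sample_hz, probs[best]


# Cosine similarities for three frames sampled one second apart.
sims = [0.20, 0.31, 0.25]
timestamp, confidence = best_timestamp(sims)
```

A small temperature sharpens the distribution so that the confidence reflects how clearly one frame stands out; with `temperature=1.0` the raw cosine gaps are too small to separate frames meaningfully.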

🚀 Deployment

The system is deployed using Gradio, providing:

- Drag & drop video upload
- Natural language search box
- Real-time visualization of similarity scores over time
- Timestamped result with confidence percentage
- Local tunnel support for easy sharing
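
The timestamped result with a confidence percentage could be rendered with a small formatting helper like the one below. This is a sketch: `format_result` and its output layout are assumptions for illustration, not the project's actual UI code.

```python
def format_result(timestamp_s, confidence):
    """Render a match as 'HH:MM:SS (xx.x%)' for display in the UI."""
    total = int(timestamp_s)
    minutes, seconds = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d} ({confidence * 100:.1f}%)"


label = format_result(83, 0.875)  # "00:01:23 (87.5%)"
```

A string like this can be returned from the search callback and shown in a text output component.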

Future Enhancements

- Multi-query support
- Temporal segment retrieval (not just single frame)
- Integration with alternative embedding models (e.g. SigLIP, CLIP ViT-B/16)
- Video indexing for large archives
- Audio + Visual multimodal fusion
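
One possible approach to temporal segment retrieval is to score fixed-length windows instead of single frames. The sketch below is only an illustration of that idea, not a planned design: `best_segment` is a hypothetical helper that picks the window with the highest mean similarity, using a running sum so each score is visited once.

```python
def best_segment(scores, window, sample_hz=1.0):
    """Return (start_s, end_s, mean_score) of the best window of frames."""
    if window <= 0 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    window_sum = sum(scores[:window])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(scores) - window + 1):
        # Slide the window: add the new frame, drop the oldest one.
        window_sum += scores[start + window - 1] - scores[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    start_s = best_start / sample_hz
    end_s = (best_start + window) / sample_hz
    return start_s, end_s, best_sum / window


# Per-frame similarities at 1 FPS; the best 2-second span covers frames 2 and 3.
start_s, end_s, mean_score = best_segment([0.1, 0.2, 0.9, 0.8, 0.1], window=2)
```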
