
yasirusman85/Multimodal-Video-Insight-Search


Multimodal Video Insight Search

Technical Documentation


Project Overview

This project implements a multimodal video search engine that lets users query video content in natural language. The system locates specific moments within a video file by computing the semantic similarity between a text query and frames sampled from the video at regular intervals (1 FPS).

Key Capability: Search inside videos with natural language and find the exact moment that best matches your description.


🏗️ System Architecture

flowchart TD
    A[📹 Video Input] --> B[⏱️ Temporal Sampling\nOpenCV - 1 FPS]
    B --> C[🖼️ Frame Extraction]
    C --> D[CLIP Vision Encoder\nopenai/clip-vit-base-patch32]
    E[Text Query] --> F[🔤 CLIP Text Encoder]
    D --> G[🧠 512-dim Image Embeddings]
    F --> H[🧠 512-dim Text Embedding]
    G & H --> I[Cosine Similarity]
    I --> J[🔥 Find Global Maximum]
    J --> K[⏰ Best Timestamp + Confidence]
    
    style A fill:#ff9,stroke:#333
    style K fill:#9f9,stroke:#333,stroke-width:3px
    style I fill:#bbf,stroke:#333
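The matching stage of the diagram above (cosine similarity followed by a global maximum) can be sketched as follows. This is a minimal illustration and not the project's actual code: it uses only the standard library and toy 3-dimensional vectors in place of 512-dim CLIP embeddings, and the names `cosine_similarity`, `best_moment`, and `seconds_per_frame` are hypothetical. In the real pipeline the same arithmetic would presumably run as PyTorch tensor operations.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_moment(
    frame_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
    seconds_per_frame: float = 1.0,  # 1 FPS sampling, as in the diagram
) -> Tuple[float, float]:
    """Return (timestamp in seconds, similarity) of the best-matching frame."""
    if not frame_embeddings:
        raise ValueError("no frames to search")
    scores = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    # Global maximum over all sampled frames.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index * seconds_per_frame, scores[best_index]


# Toy data: three "frames" and one "query" (hypothetical values).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
query = [0.0, 2.0, 0.0]
timestamp, score = best_moment(frames, query)
print(f"best match at {timestamp:.0f}s with similarity {score:.2f}")
```

Because frames are sampled once per second, the index of the best frame maps directly to a timestamp in seconds.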

🛠️ Technical Specifications

Core Technologies

Model: OpenAI CLIP ViT-B/32 (Vision Transformer)
Backend: PyTorch + Hugging Face Transformers
Video Processing: OpenCV + PIL
UI/UX: Gradio
Embedding Dimension: 512
Precision: float16 (reduced memory usage)
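As an illustration of how this stack might be wired together, the sketch below loads the CLIP checkpoint named in the architecture diagram through Hugging Face Transformers. It is an assumption-laden sketch rather than the project's code: the helper names `choose_dtype_name` and `load_clip` are hypothetical, and falling back to float32 on CPU is a choice made here because half precision is typically only beneficial on a GPU. The heavy imports are deferred so the module can be imported without PyTorch installed.

```python
MODEL_NAME = "openai/clip-vit-base-patch32"  # checkpoint named in the diagram
EMBEDDING_DIM = 512                          # projection size for ViT-B/32


def choose_dtype_name(has_cuda: bool) -> str:
    """Pick float16 on GPU; fall back to float32 on CPU (an assumption)."""
    return "float16" if has_cuda else "float32"


def load_clip(model_name: str = MODEL_NAME):
    """Load the CLIP model and processor; returns (model, processor, device)."""
    # Imported lazily so this file can be imported without these packages.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    has_cuda = torch.cuda.is_available()
    device = "cuda" if has_cuda else "cpu"
    dtype = getattr(torch, choose_dtype_name(has_cuda))

    model = CLIPModel.from_pretrained(model_name, torch_dtype=dtype)
    model = model.to(device).eval()  # inference only, no gradients needed
    processor = CLIPProcessor.from_pretrained(model_name)
    return model, processor, device
```

With a model loaded this way, image and text embeddings would typically be obtained from `model.get_image_features(...)` and `model.get_text_features(...)` on inputs prepared by the processor.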

The Moondream2 Pivot

Initially, the project explored Moondream2, a lightweight Vision-Language Model (VLM). However, repeated compatibility issues with the Transformers library, particularly rope_scaling and pad_token_id errors in custom modeling files, led to a pivot to CLIP.

subgraph Preprocessing
    A[Video File] --> B[Decode with OpenCV]
    B --> C[Sample Frames @ 1Hz]
    C --> D[Resize 224×224]
    D --> E[Normalize<br/>CLIP mean/std]
end

subgraph Encoding
    E --> F[CLIP Vision Encoder]
    G[Text Query] --> H[CLIP Text Encoder]

    F --> I[Image Embeddings<br/>512-dim]
    H --> J[Text Embedding<br/>512-dim]
end

subgraph Matching
    I --> K[Cosine Similarity]
    J --> K
    K --> L[Softmax over Time]
    L --> M[Max Score + Timestamp]
end

classDef proc fill:#e1f5fe,stroke:#01579b
classDef enc fill:#f3e5f5,stroke:#4a148c
classDef match fill:#e8f5e9,stroke:#1b5e20

class A,B,C,D,E proc
class F,G,H,I,J enc
class K,L,M match
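Two steps of the pipeline above lend themselves to a short sketch: sampling frames at 1 Hz and applying a softmax over time to the per-frame similarities. The code below is a hedged illustration, not the repository's implementation. The function names are hypothetical, and the `scale` of 100.0 is an assumption in the spirit of CLIP's learned logit scale; without some scaling, a softmax over raw cosine similarities is nearly uniform. OpenCV is imported lazily so the pure-Python helpers run without it.

```python
import math
from typing import List, Sequence


def sample_frame_indices(
    native_fps: float, frame_count: int, sample_hz: float = 1.0
) -> List[int]:
    """Indices of the frames to keep when sampling at `sample_hz`."""
    if native_fps <= 0 or sample_hz <= 0:
        raise ValueError("fps and sampling rate must be positive")
    # Never step by less than one frame, even if sample_hz > native_fps.
    step = max(1.0, native_fps / sample_hz)
    n_samples = math.ceil(frame_count / step)
    return [min(frame_count - 1, round(k * step)) for k in range(n_samples)]


def softmax_over_time(scores: Sequence[float], scale: float = 100.0) -> List[float]:
    """Turn per-frame similarities into a distribution over timestamps."""
    logits = [s * scale for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def read_sampled_frames(video_path: str, sample_hz: float = 1.0) -> list:
    """Decode the sampled frames as RGB arrays (requires opencv-python)."""
    import cv2  # lazy import: only needed when actually decoding video

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"cannot open video: {video_path}")
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for index in sample_frame_indices(fps, count, sample_hz):
            cap.set(cv2.CAP_PROP_POS_FRAMES, index)
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # OpenCV decodes to BGR; CLIP preprocessing expects RGB.
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        return frames
    finally:
        cap.release()


# A 3-second clip at 30 FPS keeps one frame per second.
print(sample_frame_indices(30.0, 90))
# Hypothetical similarities for those three frames.
probabilities = softmax_over_time([0.20, 0.30, 0.25])
print(max(range(3), key=probabilities.__getitem__))
```

The highest softmax probability can then be reported as the confidence for the best timestamp; resizing and normalization are left to the CLIP processor in this sketch.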

🚀 Deployment

The system is deployed using Gradio, providing:

Drag & drop video upload
Natural language search box
Real-time visualization of similarity scores over time
Timestamped result with confidence percentage
Local tunnel support for easy sharing
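A Gradio front end with these features could look roughly like the sketch below. It is a minimal, assumed layout and not the project's actual interface: `format_result` and `build_demo` are hypothetical names, `search_fn` stands in for whatever function returns a timestamp and a confidence, and the similarity plot is omitted for brevity. Gradio is imported lazily so the formatting helper can be used on its own.

```python
from typing import Callable, Tuple


def format_result(timestamp_s: float, confidence: float) -> str:
    """Render a timestamp (seconds) and confidence (0..1) for display."""
    hours, remainder = divmod(int(timestamp_s), 3600)
    minutes, seconds = divmod(remainder, 60)
    return (
        f"{hours:02d}:{minutes:02d}:{seconds:02d} "
        f"(confidence {confidence * 100:.1f}%)"
    )


def build_demo(search_fn: Callable[[str, str], Tuple[float, float]]):
    """Wrap a (video_path, query) -> (timestamp_s, confidence) function."""
    import gradio as gr  # lazy import: only needed to build the UI

    def run(video_path: str, query: str) -> str:
        timestamp_s, confidence = search_fn(video_path, query)
        return format_result(timestamp_s, confidence)

    return gr.Interface(
        fn=run,
        inputs=[gr.Video(label="Video"), gr.Textbox(label="Search query")],
        outputs=gr.Textbox(label="Best match"),
        title="Multimodal Video Insight Search",
    )


print(format_result(83.4, 0.875))
# To serve the UI with a public share link (hypothetical search function):
# build_demo(my_search_fn).launch(share=True)
```

Passing `share=True` to `launch` is how Gradio exposes a temporary public URL, which matches the tunnel-based sharing listed above.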

Future Enhancements

Multi-query support
Temporal segment retrieval (not just a single frame)
Integration with faster or more accurate embedding models (e.g. SigLIP, CLIP ViT-B/16)
Video indexing for large archives
Audio + visual multimodal fusion
