This project implements a video similarity search system using the Timesformer transformer model (pretrained on Kinetics-400) to generate video embeddings, and FAISS for efficient nearest neighbor search. Users can upload a video, and the app will return visually similar videos from the UCF101 dataset using precomputed embedding indexes.
Features:
- Video Embedding: Uses Timesformer to extract powerful video representations.
- Similarity Search: Efficiently retrieves similar videos using FAISS vector search.
- Interactive Frontend: Built with Streamlit for easy video upload and result visualization.
- GIF Previews: Generates GIF previews for both uploaded and retrieved videos.
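As a minimal sketch of how such a GIF preview can be generated with Pillow (the app may use a different library; the synthetic frames and the `preview.gif` filename here are placeholders for frames decoded from a real video):

```python
from PIL import Image

# build a tiny synthetic clip; in the app, frames would come from the video
frames = [Image.new("RGB", (64, 64), (32 * i, 0, 128)) for i in range(8)]

# save as an animated GIF preview (duration is ms per frame, loop=0 loops forever)
frames[0].save("preview.gif", save_all=True, append_images=frames[1:],
               duration=100, loop=0)
```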
Dataset and demo:
- The system uses the UCF101 action recognition dataset, which contains 13,320 videos across 101 action categories.
- The embedder.ipynb notebook (inside src/) supports generating Timesformer embeddings for all 101 classes, enabling full-scale similarity search.
- For a quick test and a faster demo experience, a precomputed FAISS index is included in the demo_folder/embeddings/ directory, which allows the app to run immediately without requiring full dataset processing.
Requirements:
- Python 3.8+
- Jupyter Notebook
- PyTorch
- transformers, datasets, pandas, scikit-learn, FAISS, Streamlit, and other standard ML libraries
Setup:
- Clone the repository.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run embedder.ipynb to generate embeddings for the videos.
- Use frontend.py to search for similar videos and compare the different pooling strategies:

  ```bash
  streamlit run frontend.py
  ```
Models:
- Timesformer (default, Hugging Face)
- Easily extensible to other video transformer models
- Leverages CLS pooling to generate contextually informed embeddings
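The pooling step reduces the transformer's per-token outputs to a single clip vector: CLS pooling takes the first token of the last hidden state, while mean pooling averages the patch tokens. A sketch on a dummy tensor shaped like a Timesformer output (the hidden size of 768 and the token count are illustrative assumptions, not read from the checkpoint):

```python
import torch

# dummy last_hidden_state: (batch, 1 + num_patch_tokens, hidden_size)
last_hidden_state = torch.randn(2, 1 + 8 * 196, 768)

cls_emb = last_hidden_state[:, 0]            # CLS pooling: first token only
mean_emb = last_hidden_state[:, 1:].mean(1)  # mean pooling over patch tokens

# L2-normalize so FAISS inner-product search behaves like cosine similarity
cls_emb = torch.nn.functional.normalize(cls_emb, dim=-1)
```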
Results:
- Retrieves and displays the top-k most similar videos to a given query using transformer-based embeddings and FAISS.
- Visual previews (GIFs) make it easy to assess retrieval quality.
- Achieves high retrieval accuracy, with combined recall@1: 0.9797, recall@3: 0.9737, and recall@5: 0.9564, indicating that the correct class is almost always among the top results.
Recall@k measures how often the correct item appears within the top-k retrieved results. A higher recall@k indicates better retrieval performance, meaning the system is more likely to present relevant results to the user quickly.
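The metric can be computed directly from the retrieved label lists. A small illustrative implementation (the class labels below are made up for the example):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose correct label appears in the top-k results."""
    hits = [1 if relevant[i] in retrieved[i][:k] else 0 for i in range(len(relevant))]
    return sum(hits) / len(hits)

# hypothetical retrieved class labels for 4 queries, best match first
retrieved = [["surfing", "skiing"], ["skiing", "surfing"],
             ["boxing", "skiing"], ["skiing", "boxing"]]
relevant = ["surfing", "surfing", "skiing", "skiing"]

print(recall_at_k(retrieved, relevant, 1))  # 0.5
print(recall_at_k(retrieved, relevant, 2))  # 1.0
```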
The system is efficient and scalable, capable of handling large video datasets and real-time search scenarios by indexing normalized embeddings with FAISS.