A semantic search system built on text embeddings and vector similarity. Instead of matching keywords, this project retrieves news articles by meaning, ranking them by cosine similarity to the query.
Traditional search relies on keyword matching. Semantic search converts text into vector embeddings, allowing queries to match documents with similar meaning even when the exact words differ.
This project demonstrates a minimal pipeline for semantic search using the BBC News dataset.
Example query:
Query: "companies investing in artificial intelligence"
Returns articles related to technology and business investments in AI, even if those exact words are not present.
The project uses the BBC News dataset available on Hugging Face.
Dataset characteristics:
- ~2200 news articles
- Categories: business, politics, sport, tech, entertainment
- Well suited for small semantic search experiments
Source: https://huggingface.co/datasets/SetFit/bbc-news
Pipeline:
Documents
↓
Text Embeddings
↓
Vector Store
↓
Query Embedding
↓
k-Nearest Neighbor Search
↓
Top-k Similar Documents
- Load the dataset from Hugging Face.
- Extract article text.
- Generate embeddings for each article using an embedding model.
- Store embeddings in a vector index.
- Convert user queries into embeddings.
- Use k-Nearest Neighbors (kNN) with cosine similarity to retrieve the most similar articles.
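The retrieval step can be sketched with scikit-learn's `NearestNeighbors`. This is only an illustration: the 3-dimensional vectors below are made-up stand-ins for real article embeddings, and the helper name `nearest_articles` is hypothetical, not part of the project.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_articles(doc_vecs, query_vec, k=2):
    """Return (indices, cosine distances) of the k documents closest to the query."""
    # metric="cosine" makes scikit-learn rank by cosine distance,
    # which is 1 - cosine similarity (smaller means more similar).
    index = NearestNeighbors(n_neighbors=k, metric="cosine")
    index.fit(doc_vecs)
    query = np.asarray(query_vec, dtype=float).reshape(1, -1)
    distances, indices = index.kneighbors(query)
    return indices[0].tolist(), distances[0].tolist()


# Toy "embeddings" (assumption: real ones would come from an embedding model).
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.0, 0.0])

indices, distances = nearest_articles(doc_vecs, query_vec, k=2)
print(indices, distances)
```

Documents 0 and 1 point in nearly the same direction as the query, so they are returned first. With real embeddings the same call works unchanged; only `doc_vecs` and `query_vec` differ.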
Cosine similarity measures the cosine of the angle between two vectors, so it compares their direction rather than their magnitude. Embeddings that point in similar directions correspond to semantically related text.
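As a small worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is an illustrative NumPy version with made-up inputs, not the project's own code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Same direction, different length: similarity is (approximately) 1.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1, 0], [0, 1])
# Opposite directions: similarity is -1.
opposite = cosine_similarity([1, 0], [-1, 0])

print(same_direction, orthogonal, opposite)
```

Because the result depends only on direction, a long article and a short query can still score highly if their embeddings point the same way.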
- Python
- Hugging Face Datasets
- Sentence Transformers
- Scikit-learn
- Google Colab
Install dependencies:
pip install datasets sentence-transformers scikit-learn
Load the dataset:
from datasets import load_dataset
dataset = load_dataset("SetFit/bbc-news")
Generate embeddings and run similarity search.
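One possible way to do the embedding and search step is sketched here. Several details are assumptions rather than facts from this README: the model name `all-MiniLM-L6-v2`, the `train` split, the `text` column name, and the helper names `load_model`, `embed`, `search`, and `main`. The search uses a plain dot product, which equals cosine similarity only because the embeddings are L2-normalized.

```python
import numpy as np


def load_model(model_name="all-MiniLM-L6-v2"):
    """Load a Sentence Transformers model (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def embed(model, texts):
    """Encode texts into L2-normalized embedding vectors."""
    return model.encode(texts, normalize_embeddings=True)


def search(doc_vecs, query_vec, k=5):
    """Return the k (index, score) pairs with the highest cosine similarity.

    Assumes doc_vecs and query_vec are already L2-normalized, so the
    dot product equals cosine similarity.
    """
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


def main():
    from datasets import load_dataset

    # Assumption: the dataset exposes a "train" split with a "text" column.
    dataset = load_dataset("SetFit/bbc-news", split="train")
    texts = list(dataset["text"])

    model = load_model()
    doc_vecs = embed(model, texts)
    query_vec = embed(model, ["companies investing in artificial intelligence"])[0]

    for i, score in search(doc_vecs, query_vec, k=5):
        print(f"{score:.3f}  {texts[i][:80]}")
```

Calling `main()` downloads the dataset and the model, so it needs network access on first run. The `search` function on its own works with any array of normalized vectors.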
- Semantic document search
- Knowledge base retrieval
- Retrieval-Augmented Generation (RAG)
- Internal company search tools
- Recommendation systems
- Replace brute-force kNN with a scalable vector index or database (e.g., FAISS or Pinecone)
- Add a simple search UI
- Support larger document collections
- Integrate with a RAG pipeline for question answering