Semantic Search with LLM Embeddings (BBC News)

A semantic search system built with text embeddings and vector similarity. Instead of matching keywords, it retrieves news articles by meaning, ranking them by their cosine similarity to the query.

Overview

Traditional search relies on keyword matching. Semantic search converts text into vector embeddings, allowing queries to match documents with similar meaning even when the exact words differ.

This project demonstrates a minimal pipeline for semantic search using the BBC News dataset.

Example query:

Query: "companies investing in artificial intelligence"

Returns articles related to technology and business investments in AI, even if those exact words are not present.
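To make the limitation of keyword matching concrete, here is a minimal sketch that scores the example query against a paraphrased sentence using plain word overlap (Jaccard similarity). The document sentence is made up for illustration and is not taken from the dataset; `tokens` and `keyword_overlap` are hypothetical helper names.

```python
import re


def tokens(text: str) -> set[str]:
    """Return the set of lowercase word tokens in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap between the word sets of query and document."""
    q, d = tokens(query), tokens(document)
    return len(q & d) / len(q | d) if q | d else 0.0


query = "companies investing in artificial intelligence"
# Made-up sentence for illustration, not taken from the dataset.
document = "Tech firms pour money into machine learning startups"

print(keyword_overlap(query, document))  # 0.0: no shared words
```

The two sentences describe the same topic but share no words, so a keyword scorer gives them zero overlap. An embedding model is expected to place them close together instead.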

Dataset

The project uses the BBC News dataset available on Hugging Face.

Dataset characteristics:

~2200 news articles
Categories: business, politics, sport, tech, entertainment
Well suited for small semantic search experiments

Source: https://huggingface.co/datasets/SetFit/bbc-news

Architecture

Pipeline:

Documents
   ↓
Text Embeddings
   ↓
Vector Store
   ↓
Query Embedding
   ↓
k-Nearest Neighbor Search
   ↓
Top-k Similar Documents

How It Works

Load the dataset from Hugging Face.
Extract article text.
Generate embeddings for each article using an embedding model.
Store embeddings in a vector index.
Convert user queries into embeddings.
Use k-Nearest Neighbors (kNN) with cosine similarity to retrieve the most similar articles.
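The indexing and retrieval steps above can be sketched with scikit-learn's `NearestNeighbors`. This is a sketch under assumptions rather than the notebook's actual code: the toy 3-dimensional vectors stand in for real embeddings, and `build_index` and `search` are hypothetical helper names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def build_index(doc_embeddings: np.ndarray) -> NearestNeighbors:
    """Fit an exact (brute-force) kNN index that uses cosine distance."""
    index = NearestNeighbors(metric="cosine", algorithm="brute")
    index.fit(doc_embeddings)
    return index


def search(index: NearestNeighbors, query_embedding: np.ndarray, k: int = 3):
    """Return (document_index, cosine_similarity) pairs, best match first."""
    distances, indices = index.kneighbors(
        query_embedding.reshape(1, -1), n_neighbors=k
    )
    # scikit-learn reports cosine *distance*, which is 1 - cosine similarity.
    return [(int(i), 1.0 - float(d)) for i, d in zip(indices[0], distances[0])]


# Toy 3-dimensional "embeddings" standing in for real model output.
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
index = build_index(doc_embeddings)
query_embedding = np.array([1.0, 0.05, 0.0])

for doc_id, score in search(index, query_embedding, k=2):
    print(doc_id, round(score, 3))
```

Brute-force search compares the query against every stored vector, which is fine for a few thousand articles and returns exact results.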

Cosine similarity measures the cosine of the angle between two vectors. Values close to 1 mean the vectors point in nearly the same direction, which for embeddings indicates semantically related text.
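For two vectors u and v, cosine similarity is (u · v) / (‖u‖ ‖v‖). A small pure-Python sketch of that formula follows; in practice a library routine would normally be used, and the function name here is only illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite -> -1.0
```

The score depends only on direction, not on vector length, so a long article and a short query can still score highly against each other.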

Tech Stack

Python
Hugging Face Datasets
Sentence Transformers
Scikit-learn
Google Colab

Running the Project

Install dependencies:

pip install datasets sentence-transformers scikit-learn

Load the dataset:

from datasets import load_dataset
dataset = load_dataset("SetFit/bbc-news")

Then generate embeddings and run the similarity search. The repository's SemanticSearch.ipynb notebook can be opened in Google Colab for this.
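One possible end-to-end version of these steps is sketched below. Several details are assumptions rather than facts from this README: the model name `all-MiniLM-L6-v2`, the `text` column, the use of the `train` split only, and the helper names `top_k` and `run_demo`. The sketch also ranks with a plain dot product over normalized vectors, which is equivalent to cosine similarity, instead of scikit-learn's kNN.

```python
import numpy as np


def top_k(doc_embeddings: np.ndarray, query_embedding: np.ndarray, k: int = 5):
    """Rank documents by cosine similarity; assumes unit-length vectors."""
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = doc_embeddings @ query_embedding
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]


def run_demo(query: str = "companies investing in artificial intelligence") -> None:
    """Download data and model, embed the articles, and print the top matches."""
    # Heavy dependencies are imported here so top_k stays usable on its own.
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    dataset = load_dataset("SetFit/bbc-news", split="train")
    texts = list(dataset["text"])  # assumed column name

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    doc_embeddings = model.encode(texts, normalize_embeddings=True)
    query_embedding = model.encode(query, normalize_embeddings=True)

    for doc_id, score in top_k(doc_embeddings, query_embedding, k=5):
        print(f"{score:.3f}  {texts[doc_id][:80]}")
```

Calling `run_demo()` in a notebook cell downloads the dataset and the model on first use, so it needs network access and takes a little while to embed all articles.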

Example Use Cases

Semantic document search
Knowledge base retrieval
Retrieval-Augmented Generation (RAG)
Internal company search tools
Recommendation systems

Future Improvements

Replace exact (brute-force) kNN with a scalable vector index or database (e.g. FAISS, Pinecone)
Add a simple search UI
Support larger document collections
Integrate with a RAG pipeline for question answering
