eerhshr/basic-semantic-search

Semantic Search with LLM Embeddings (BBC News)

A semantic search system built on text embeddings and vector similarity. Instead of matching keywords, this project retrieves news articles by meaning, ranking them by cosine similarity to the query.

Overview

Traditional search relies on keyword matching. Semantic search converts text into vector embeddings, allowing queries to match documents with similar meaning even when the exact words differ.

This project demonstrates a minimal pipeline for semantic search using the BBC News dataset.

Example query:

Query: "companies investing in artificial intelligence"

Returns articles related to technology and business investments in AI, even if those exact words are not present.
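To illustrate why keyword matching falls short here, the sketch below counts the words a query shares with a headline. The headline is hypothetical, written for this illustration rather than taken from the dataset, and `keyword_overlap` is an illustrative helper, not part of the project.

```python
def keyword_overlap(query, document):
    """Return the set of lowercase words shared by the query and the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return query_words & doc_words


query = "companies investing in artificial intelligence"
# Hypothetical headline, written for this illustration (not from the dataset).
headline = "Tech firms pour money into machine learning startups"

shared = keyword_overlap(query, headline)
print(shared)  # prints set(): no word in common, despite the related meaning
```

A keyword-based ranker would score this headline at zero, while an embedding model is expected to place the two texts close together in vector space.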

Dataset

The project uses the BBC News dataset available on Hugging Face.

Dataset characteristics:

  • ~2200 news articles
  • Categories: business, politics, sport, tech, entertainment
  • Well suited for small semantic search experiments

Source: https://huggingface.co/datasets/SetFit/bbc-news
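A quick way to check the dataset characteristics listed above is to count articles per category. This is a minimal sketch; it assumes the dataset exposes a `train` split with a `label_text` column holding the category name, which should be verified against the dataset card. `load_bbc_news` and `category_counts` are illustrative helper names.

```python
from collections import Counter


def load_bbc_news():
    """Download the dataset from Hugging Face (needs network access)."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    return load_dataset("SetFit/bbc-news")


def category_counts(label_texts):
    """Count how many articles fall into each category name."""
    return Counter(label_texts)
```

In a notebook, `category_counts(load_bbc_news()["train"]["label_text"])` would then show the per-category article counts for that split, if the column name assumption holds.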

Architecture

Pipeline:

Documents
   ↓
Text Embeddings
   ↓
Vector Store
   ↓
Query Embedding
   ↓
k-Nearest Neighbor Search
   ↓
Top-k Similar Documents

How It Works

  1. Load the dataset from Hugging Face.
  2. Extract article text.
  3. Generate embeddings for each article using an embedding model.
  4. Store embeddings in a vector index.
  5. Convert user queries into embeddings.
  6. Use k-Nearest Neighbors (kNN) with cosine similarity to retrieve the most similar articles.

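Cosine similarity is the cosine of the angle between two vectors, so it depends on their direction rather than their magnitude. Embeddings that point in similar directions score close to 1, which lets the system find semantically related text.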
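The retrieval step can be sketched in plain Python: compute the cosine similarity between the query vector and every document vector, then keep the k highest scores. The 3-dimensional vectors below are toy values chosen for illustration; real embedding models produce vectors with hundreds of dimensions, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Brute-force kNN: score every document, return the k best as (index, score)."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (illustrative values only).
docs = [
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.0, 0.0, 1.0],  # orthogonal to the query
]
query = [1.0, 0.0, 0.0]

for index, score in top_k(query, docs, k=2):
    print(index, round(score, 3))
```

Scoring every document is linear in the collection size, which is acceptable for a few thousand articles.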

Tech Stack

  • Python
  • Hugging Face Datasets
  • Sentence Transformers
  • Scikit-learn
  • Google Colab

Running the Project

Install dependencies:

pip install datasets sentence-transformers scikit-learn

Load the dataset:

from datasets import load_dataset
dataset = load_dataset("SetFit/bbc-news")

Then embed the article text with a Sentence Transformers model and run a cosine-similarity kNN search over the resulting vectors.
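One way this last step could look, using the libraries from the tech stack. This is a sketch under stated assumptions, not necessarily the project's own code: the model name `all-MiniLM-L6-v2`, the `train` split, and the `text` column are assumptions, and `build_index`, `search`, and `main` are illustrative names.

```python
def build_index(texts, model_name="all-MiniLM-L6-v2"):
    """Embed the articles and fit a brute-force cosine kNN index."""
    from sentence_transformers import SentenceTransformer
    from sklearn.neighbors import NearestNeighbors

    model = SentenceTransformer(model_name)  # assumed model; any embedding model works
    embeddings = model.encode(texts, show_progress_bar=True)
    index = NearestNeighbors(metric="cosine", algorithm="brute")
    index.fit(embeddings)
    return model, index


def search(query, model, index, texts, k=5):
    """Return the k most similar articles as (text, cosine_similarity) pairs."""
    query_vec = model.encode([query])
    distances, indices = index.kneighbors(query_vec, n_neighbors=k)
    # scikit-learn reports cosine *distance*, which is 1 - cosine similarity.
    return [(texts[int(i)], 1.0 - float(d)) for i, d in zip(indices[0], distances[0])]


def main():
    from datasets import load_dataset

    dataset = load_dataset("SetFit/bbc-news")
    # Assumes a "train" split with a "text" column.
    texts = list(dataset["train"]["text"])
    model, index = build_index(texts)
    query = "companies investing in artificial intelligence"
    for text, score in search(query, model, index, texts, k=3):
        print(f"{score:.3f}  {text[:80]}")
```

Calling `main()` in a Colab cell downloads the dataset and the model on first use, then prints the three closest articles with their similarity scores.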

Example Use Cases

  • Semantic document search
  • Knowledge base retrieval
  • Retrieval-Augmented Generation (RAG)
  • Internal company search tools
  • Recommendation systems

Future Improvements

  • Replace brute-force kNN with a scalable vector index or database (e.g., FAISS or Pinecone)
  • Add a simple search UI
  • Support larger document collections
  • Integrate with a RAG pipeline for question answering
