Vector DB API

Overview

This project implements a REST API for indexing and querying documents within a Vector Database. It allows users to manage libraries, documents, and chunks, as well as perform k-Nearest Neighbor vector searches over indexed content.

Features

CRUD operations for Libraries, Documents, and Chunks
Automatic document chunking with customizable parameters
Vector indexing using HNSW (Hierarchical Navigable Small World) and Brute Force algorithms
Configurable similarity measures (Cosine, Euclidean, Dot Product)
Flexible configuration through settings file
Containerized with Docker and deployable on Kubernetes

Tech Stack

Backend: Python + FastAPI + Pydantic
Database: SQLite
Vector Embeddings: Cohere API

API Endpoints

/api/v1/libraries: CRUD operations for libraries
/api/v1/documents: CRUD operations for documents
/api/v1/chunks: CRUD operations for chunks
/api/v1/search: Vector search endpoint

Configuration

Key settings can be adjusted in app/core/config.py, including:

Chunking parameters (token range, padding)
Indexing algorithm selection and parameters
Similarity measure selection
Embedding model selection

Automatic Document Chunking

The system employs an intelligent chunking algorithm to divide documents into meaningful segments:

Documents are first tokenized using the Cohere tokenizer.
Chunks are created within a specified token range (e.g., 70-130 tokens), configurable in settings.
The algorithm starts at the maximum token limit and backtrack to find natural break points:
- First, it looks for paragraph breaks (\n).
- If no paragraph break is found, it searches for sentence endings.
- If no sentence end is found, it finds the nearest space.
Optionally, chunk padding can be applied to include additional context before and after each chunk.

This approach ensures that chunks maintain semantic coherence while adhering to size constraints, enhancing the quality of vector representations and search results.

Indexing Algorithms

Brute Force

Time Complexity:
- Search: O(n * d)
- Insert: O(1)
Space Complexity: O(n * d)

Where n is the number of vectors and d is the vector dimension.

Analysis: Simple to implement and effective for small datasets. Becomes inefficient for large-scale searches.

HNSW (Hierarchical Navigable Small World)

Time Complexity:
- Search: O(log(n) * d) average case
- Insert: O(log(n) * d) average case
Space Complexity: O(n _ d _ M)

Where n is the number of vectors, d is the vector dimension, and M is the max number of connections per node.

Analysis: Significantly faster searches for large datasets, scales well, but uses more memory due to graph structure.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
alembic		alembic
app		app
scripts		scripts
tests		tests
vector_index		vector_index
.gitattributes		.gitattributes
.gitignore		.gitignore
Demo.mp4		Demo.mp4
Dockerfile		Dockerfile
README.md		README.md
alembic.ini		alembic.ini
doc1.txt		doc1.txt
doc2.txt		doc2.txt
essay.txt		essay.txt
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
reset_db.py		reset_db.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vector DB API

Overview

Features

Tech Stack

API Endpoints

Configuration

Automatic Document Chunking

Indexing Algorithms

Brute Force

HNSW (Hierarchical Navigable Small World)

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Vector DB API

Overview

Features

Tech Stack

API Endpoints

Configuration

Automatic Document Chunking

Indexing Algorithms

Brute Force

HNSW (Hierarchical Navigable Small World)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages