HashIndex: LLM-optimized Document Indexing without vector search


HashIndex ⚡️

Ultra-fast, LLM-optimized document indexing in Python.

Built by the team at Pardus AI – The fastest AI Data Analysis Platform.

License: MIT


HashIndex is the core indexing engine we use at Pardus AI to process 50MB+ CSVs and PDFs in seconds. We are open-sourcing our Python implementation so you can build better RAG pipelines without the bloat of LangChain.

Want to analyze documents without coding? Try our no-code platform: Pardus AI Dashboard (Free for huge files).

Installation

# Clone the repository
git clone https://github.com/JasonHonKL/HashIndex.git
cd HashIndex

# Install with uv (recommended - faster and more reliable)
uv venv                    # Create virtual environment
uv sync                    # Install dependencies and package in editable mode
source .venv/bin/activate  # Activate the virtual environment (Linux/Mac)
# or
.venv\Scripts\activate     # Activate the virtual environment (Windows)

# Alternatively, install with pip
pip install -e .

Usage

As a Python Library

from hashindex import index_pdf, query_index, HashIndex

# Index a PDF document
index = index_pdf("document.pdf")

# Save the index
index.save("document.index.json")

# Load an existing index
index = HashIndex.load("document.index.json")

# Query the index
answer = query_index(index, "What is the main conclusion?")
print(answer)
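
Since the save and load calls above let an index be reused across runs, a small caching helper can avoid re-indexing the same document. The following is only a sketch: `load_or_build` is a hypothetical helper, not part of HashIndex, and it takes the build and load functions as arguments so that it assumes nothing about the library beyond the `save()` method shown above.

```python
from pathlib import Path
from typing import Any, Callable


def load_or_build(
    index_path: str,
    build: Callable[[], Any],
    load: Callable[[str], Any],
) -> Any:
    """Return a saved index if index_path exists; otherwise build one and save it.

    Hypothetical helper for illustration; not part of the HashIndex package.
    """
    if Path(index_path).exists():
        # Reuse the index written by an earlier run.
        return load(index_path)
    index = build()
    # Relies only on the save(path) method shown in the usage example above.
    index.save(index_path)
    return index


# With HashIndex installed, the call might look like this:
# from hashindex import index_pdf, HashIndex
# index = load_or_build(
#     "document.index.json",
#     build=lambda: index_pdf("document.pdf"),
#     load=HashIndex.load,
# )
```

The helper does not detect a changed source document; delete the saved index file to force a rebuild.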

Using the CLI

cp .env.example .env

Then edit the configuration in .env. Almost all model provider APIs are supported.

Advanced Usage

from hashindex import HashIndex, Model, ListKeys, GetSummary, GetContent

# Create a custom model
model = Model(model="anthropic/claude-3.5-sonnet")

# Work with index objects directly
index = HashIndex()
# ... customize indexing logic ...

# Use verbose=False for silent operation
from hashindex import index_pdf, query_index
index = index_pdf("document.pdf", verbose=False)
answer = query_index(index, "Your question", verbose=False)

# Access pages directly
for key, obj in index.PageTable.items():
    print(f"{key}: {obj.summary}")

Comparative Analysis

HashIndex outperforms standard paradigms in specific Long-Context Narrative tasks where causality matters more than keyword matching.

| Method | Topology | Context Management | Robustness (Unstructured Data) | Latency |
| --- | --- | --- | --- | --- |
| Vector RAG | Disconnected Chunks | Additive (FIFO overflow) | High | Low (O(1)) |
| PageIndex | Hierarchical Tree | Path-Dependent | Low (Requires Clean Headers) | High (O(log n)) |
| RAPTOR | Recursive Tree | Cluster-Based | Medium | Medium |
| HashIndex (Ours) | Hash Table | Dynamic Pruning (Agent-led) | High (Mechanical Split) | Medium-Low |

By treating document chunks as Hash Table entries rather than Vector Embeddings, HashIndex avoids the 'Lost in the Middle' phenomenon common in vector search.
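
To make the hash-table idea concrete, here is a toy sketch of the general pattern: split text mechanically, key each chunk in a dictionary, and prune entries by judging their summaries. It is not HashIndex's implementation. All names (`Page`, `build_table`, `list_keys`, `get_summary`, `get_content`, `prune`) are hypothetical, the summary is a plain truncation where a real system would presumably call an LLM, and the pruning judge is an ordinary function standing in for an agent.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    summary: str  # short description an agent can scan cheaply
    content: str  # full chunk text, fetched only when needed


def mechanical_split(text: str, size: int) -> list[str]:
    """Cut text into fixed-size chunks without relying on headers or layout."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_table(text: str, size: int = 40) -> dict[str, Page]:
    """Store each chunk under a short hash key (position included to avoid collisions)."""
    table: dict[str, Page] = {}
    for i, chunk in enumerate(mechanical_split(text, size)):
        key = hashlib.sha1(f"{i}:{chunk}".encode("utf-8")).hexdigest()[:8]
        # Placeholder summary (assumption): truncation instead of an LLM call.
        table[key] = Page(summary=chunk[:20].strip(), content=chunk)
    return table


def list_keys(table: dict[str, Page]) -> list[str]:
    return list(table)


def get_summary(table: dict[str, Page], key: str) -> str:
    return table[key].summary


def get_content(table: dict[str, Page], key: str) -> str:
    return table[key].content


def prune(table: dict[str, Page], keep: Callable[[str], bool]) -> dict[str, Page]:
    """Keep only entries whose summary the judge accepts, rather than evicting by age."""
    return {k: p for k, p in table.items() if keep(p.summary)}


if __name__ == "__main__":
    doc = "alpha " * 10 + "beta " * 10
    table = build_table(doc, size=40)
    relevant = prune(table, keep=lambda s: "beta" in s)
    for key in list_keys(relevant):
        print(key, repr(get_summary(relevant, key)))
```

In this sketch the judge sees only summaries, so a chunk whose relevant text starts after its first 20 characters is dropped; a better summary would reduce that risk.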

Citation

If you use HashIndex in your research or project, please cite it as follows:

@software{HashIndex2026,
  author = {Hon, Jason and {Pardus AI Team}},
  title = {HashIndex: LLM-optimized Document Indexing without vector search},
  year = {2026},
  publisher = {Pardus AI}
}
