# Querdex

Reasoning-first document intelligence system.

Querdex indexes any document into a hierarchical tree, then uses a two-tier LLM search to answer questions with cited sources. It works without an LLM (falling back to keyword heuristics), and can optionally plug in Anthropic or OpenAI for higher-quality results.
- How it works
- Installation
- Quick Start (CLI)
- LLM Setup
- CLI Reference
- Python API
- Supported File Types
- Environment Variables
## How it works

```
Document
   │
   ▼
Ingestion ──► parse into pages/sections (Section[])
   │
   ▼
Indexing ───► build hierarchical tree (TreeNode) + entity map + knowledge graph
   │
   ▼
Storage ────► persist to SQLite (sections, tree, entities, graph, query cache)
   │
   ▼
Query
   ├─ Tier 1: LLM (or keyword) batch-prune of tree nodes
   ├─ Tier 2: LLM (or heuristic) per-node relevance scoring
   ├─ Retrieval: pull section text for selected nodes
   └─ Answer: LLM synthesizes answer with source citations
   │
   ▼
Adaptive ───► update node summaries based on query feedback (runs in background)
```
One of three query routes is selected automatically:
- single_doc — standard hierarchical search on one document
- multi_doc — virtual super-tree across up to 3 documents
- graph — entity-seeded graph walk for relationship queries ("how does X relate to Y?")
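The route selection above can be sketched roughly as follows. `QueryRoute` and `pick_route` are illustrative names, not Querdex's actual API, and the real selector presumably uses richer signals than these simple string cues:

```python
from enum import Enum

class QueryRoute(Enum):
    SINGLE_DOC = "single_doc"
    MULTI_DOC = "multi_doc"
    GRAPH = "graph"

# Hypothetical cue words suggesting a relationship query
RELATION_CUES = ("relate", "relationship", "connection", "between")

def pick_route(query: str, doc_ids: list[str]) -> QueryRoute:
    q = query.lower()
    if any(cue in q for cue in RELATION_CUES):
        return QueryRoute.GRAPH        # entity-seeded graph walk
    if len(doc_ids) > 1:
        return QueryRoute.MULTI_DOC    # virtual super-tree (up to 3 docs)
    return QueryRoute.SINGLE_DOC       # standard hierarchical search

print(pick_route("How does X relate to Y?", ["a"]).value)  # graph
```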
## Installation

Base install (no LLM, uses keyword heuristics):

```shell
pip install querdex
```

With Anthropic (Claude):

```shell
pip install "querdex[anthropic]"
```

With OpenAI (GPT):

```shell
pip install "querdex[openai]"
```

Development:

```shell
git clone <repo>
cd querdex
uv sync --extra dev

# or with an LLM provider:
uv sync --extra dev --extra anthropic
uv sync --extra dev --extra openai
```

Requirements: Python 3.11+
## Quick Start (CLI)

Index a document:

```shell
querdex index ./report.pdf --doc-id annual-report
```

Output:

```
Indexed doc_id=annual-report version=1
Nodes=12 max_depth=3
```

Query it:

```shell
querdex query --doc-id annual-report --query "What was the Q3 revenue?"
```

Output:

```
Query ID: 3f8a1c...
Intent: single_doc | Cache hit: False

Q3 revenue was $1.2B, up 8% year-over-year (Revenue Analysis, pages 4-6).
```
Multi-turn conversation:

```shell
# First turn
querdex query --doc-id annual-report \
  --query "What were the risk factors?" \
  --session-id session_001

# Second turn — context from the first turn is carried over
querdex query --doc-id annual-report \
  --query "Which of those risks materialised?" \
  --session-id session_001
```

When the document changes, Querdex only rebuilds the affected parts:

```shell
querdex index ./report_v2.pdf --doc-id annual-report
```

Delete a document and all its data:

```shell
querdex delete --doc-id annual-report
```

By default the database is stored at `./index_store/querdex.db`. To change it:

```shell
querdex --db /path/to/my.db index ./report.pdf --doc-id demo
querdex --db /path/to/my.db query --doc-id demo --query "summary?"
```

Without any LLM configured, Querdex falls back to keyword/heuristic matching — it always produces an answer, just a less precise one.
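A minimal sketch of what such a keyword fallback can look like: score each tree node by how many query terms its summary contains. This is illustrative only; Querdex's actual heuristic is not documented here.

```python
import re

def keyword_score(query: str, summary: str) -> float:
    """Fraction of query terms (3+ chars) that appear in the node summary."""
    terms = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if len(t) > 2}
    text = summary.lower()
    hits = sum(1 for t in terms if t in text)
    return hits / len(terms) if terms else 0.0

# Hypothetical node summaries
nodes = {
    "Revenue Analysis": "Q3 revenue grew 8% year-over-year to $1.2B",
    "Risk Factors": "supply chain and currency risks",
}
best = max(nodes, key=lambda n: keyword_score("What was the Q3 revenue?", nodes[n]))
print(best)  # Revenue Analysis
```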
## LLM Setup

Anthropic:

```shell
export QUERDEX_LLM_PROVIDER=anthropic
export QUERDEX_LLM_API_KEY=sk-ant-...

# Optional: override model defaults
export QUERDEX_LLM_TIER1_MODEL=claude-haiku-4-5-20251001  # fast, cheap (batch prune)
export QUERDEX_LLM_TIER2_MODEL=claude-sonnet-4-6          # powerful (deep reasoning + answers)
```

OpenAI:

```shell
export QUERDEX_LLM_PROVIDER=openai
export QUERDEX_LLM_API_KEY=sk-...

# Optional: override model defaults
export QUERDEX_LLM_TIER1_MODEL=gpt-4o-mini  # fast, cheap
export QUERDEX_LLM_TIER2_MODEL=gpt-4o       # powerful
```

How the two tiers are used:
| Tier | Model | Purpose |
|---|---|---|
| Tier 1 | cheap/fast | Single batched call to prune all tree nodes to the relevant few |
| Tier 2 | powerful | Per-node deep reasoning to confirm relevance + score confidence |
| Answer | powerful | Synthesise a cited answer from the retrieved section text |
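The flow in the table can be sketched end to end. `StubLLM` stands in for a real provider client, and every name here is illustrative rather than Querdex's internal API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: str
    summary: str
    text: str

class StubLLM:
    """Stand-in for a provider client; the real tiers call an LLM."""
    def tier1_batch_prune(self, query, nodes):
        # One batched call: keep nodes whose summary shares a word with the query
        q = set(query.lower().split())
        return [n.id for n in nodes if q & set(n.summary.lower().split())]
    def tier2_score(self, query, text):
        return 0.9 if any(w in text.lower() for w in query.lower().split()) else 0.1
    def tier2_answer(self, query, chunks):
        return " ".join(chunks)

def two_tier_search(tree_nodes, query, llm):
    candidate_ids = set(llm.tier1_batch_prune(query, tree_nodes))   # Tier 1: prune
    scored = [(n, llm.tier2_score(query, n.text))                   # Tier 2: score
              for n in tree_nodes if n.id in candidate_ids]
    chunks = [n.text for n, conf in scored if conf >= 0.5]          # Retrieval
    return llm.tier2_answer(query, chunks)                          # Answer

nodes = [
    Node("a", "revenue analysis", "Q3 revenue was $1.2B."),
    Node("b", "office locations", "HQ is in Berlin."),
]
print(two_tier_search(nodes, "revenue", StubLLM()))  # Q3 revenue was $1.2B.
```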
## CLI Reference

```shell
querdex [--db PATH] <command> [options]
```

| Command | Description |
|---|---|
| `index <file>` | Index a document. Auto-detects format from extension. |
| `query` | Query an indexed document. |
| `delete` | Remove a document and all its data from the store. |

### `querdex index <file_path> [--doc-id ID]`

| Argument | Default | Description |
|---|---|---|
| `file_path` | required | Path to the document to index |
| `--doc-id` | auto-generated from filename + hash | Stable identifier for this document |

### `querdex query --doc-id ID --query TEXT [--session-id ID]`

| Argument | Default | Description |
|---|---|---|
| `--doc-id` | required | Document to query |
| `--query` | required | Natural-language question |
| `--session-id` | none | Enables multi-turn context (pass the same ID across turns) |

### `querdex delete --doc-id ID`
## Python API

For integration into your own application:

```python
import asyncio

from querdex.services import build_engine

# build_engine reads QUERDEX_LLM_* env vars automatically
engine = build_engine("./index_store/querdex.db")

# Index a document
doc = asyncio.run(engine.index_document("./report.pdf", doc_id="annual-report"))
print(f"Indexed: {doc.doc_id} | nodes={doc.stats.total_nodes}")

# Query
result = engine.query_document("annual-report", "What was Q3 revenue?")
print(result.answer)
print(f"Confidence: {result.confidence:.0%}")
for source in result.source_nodes:
    print(f"  Source: {source.title}, pages {source.pages}")

# Multi-turn query
result2 = engine.query_document(
    "annual-report",
    "What caused that increase?",
    session_id="my-session-001",
)

# Re-index after the document changes
doc_v2 = asyncio.run(engine.reindex_document("./report_v2.pdf", doc_id="annual-report"))

# Delete
engine.store.delete_document("annual-report")

# Always close when done
engine.store.close()
```

To configure the LLM client explicitly instead of via environment variables:

```python
from querdex.llm.anthropic_client import AnthropicLLMClient
from querdex.services.engine import QuerdexEngine
from querdex.storage import SQLiteStore

llm = AnthropicLLMClient(
    api_key="sk-ant-...",
    tier1_model="claude-haiku-4-5-20251001",
    tier2_model="claude-sonnet-4-6",
)
store = SQLiteStore("./querdex.db")
engine = QuerdexEngine(store, llm_client=llm)
```

For tests, a fake client can stand in for a real provider:

```python
from querdex.llm.fake_client import FakeLLMClient
from querdex.query.answering import AnswerGenerator

fake = FakeLLMClient(
    default='{"answer": "Revenue was $1.2B.", "confidence": 0.9}'
)
gen = AnswerGenerator(llm_client=fake)
# chunks: retrieved section text from a prior search
answer, confidence, sources = gen.generate("What was revenue?", chunks)
```

## Supported File Types

| Extension | Parser | Notes |
|---|---|---|
| `.txt` | TextParser | Plain text, split by paragraphs |
| `.md`, `.markdown` | MarkdownParser | Heading-aware section splitting |
| `.html`, `.htm` | HTMLParser | Strips tags, extracts text blocks |
| `.docx` | DOCXParser | Microsoft Word, paragraph-level |
| `.pdf` | PDFParser | Page-level; OCR optional (see below) |
| `.py` | PythonCodeParser | Function/class-level chunking |
| `.js`, `.ts`, `.jsx`, `.tsx` | JSCodeParser | Function-level chunking |
| `.csv` | CSVParser | Row-batched sections |
| `.db`, `.sqlite` | SQLiteParser | Table-level sections |
| `.mp3`, `.wav`, `.m4a`, `.mp4`, `.mov` | AudioVideoParser | Transcript-based (requires Whisper or similar) |
| `.url` | URLParser | Fetches and parses the web page at that URL |
| URL string | URLParser | Pass a URL string directly as the file path |
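Extension-based dispatch like the table describes can be sketched as below. The mapping mirrors the table; the dispatch logic and function name are illustrative, not Querdex's internals:

```python
from pathlib import Path

# Parser names taken from the table above (subset shown)
PARSER_BY_EXT = {
    ".txt": "TextParser",
    ".md": "MarkdownParser", ".markdown": "MarkdownParser",
    ".html": "HTMLParser", ".htm": "HTMLParser",
    ".docx": "DOCXParser",
    ".pdf": "PDFParser",
    ".py": "PythonCodeParser",
    ".csv": "CSVParser",
}

def pick_parser(path: str) -> str:
    if path.startswith(("http://", "https://")):
        return "URLParser"  # URL strings bypass extension detection
    ext = Path(path).suffix.lower()
    try:
        return PARSER_BY_EXT[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext}") from None

print(pick_parser("./report.pdf"))  # PDFParser
```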
For scanned PDFs, enable OCR via environment variables:

```shell
# Tesseract (local)
export QUERDEX_OCR_ENABLED=true
export QUERDEX_OCR_PROVIDER=tesseract   # default when OCR enabled
export QUERDEX_TESSERACT_CMD=tesseract  # path to tesseract binary

# Cloud OCR (custom endpoint)
export QUERDEX_OCR_ENABLED=true
export QUERDEX_OCR_PROVIDER=cloud
export QUERDEX_OCR_ENDPOINT=https://your-ocr-api.com/v1/ocr
export QUERDEX_OCR_API_KEY=your-key
```

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `QUERDEX_LLM_PROVIDER` | (none) | `anthropic` or `openai`. If unset, heuristic mode is used. |
| `QUERDEX_LLM_API_KEY` | (none) | API key for the selected provider |
| `QUERDEX_LLM_TIER1_MODEL` | `claude-haiku-4-5-20251001` / `gpt-4o-mini` | Fast model for batch node pruning |
| `QUERDEX_LLM_TIER2_MODEL` | `claude-sonnet-4-6` / `gpt-4o` | Powerful model for deep reasoning and answers |
| `QUERDEX_OCR_ENABLED` | `false` | Enable OCR for scanned PDFs |
| `QUERDEX_OCR_PROVIDER` | `tesseract` | `tesseract` or `cloud` |
| `QUERDEX_TESSERACT_CMD` | `tesseract` | Path to the Tesseract binary |
| `QUERDEX_OCR_ENDPOINT` | (none) | Endpoint URL for the cloud OCR provider |
| `QUERDEX_OCR_API_KEY` | (none) | API key for the cloud OCR provider |
## License

MIT