Querdex

Reasoning-first document intelligence system.

Querdex indexes any document into a hierarchical tree, then uses a two-tier LLM search to answer questions with cited sources. It works without an LLM (keyword heuristics), and optionally plugs in Anthropic or OpenAI for higher-quality results.



How it works

Document
   │
   ▼
Ingestion ──► parse into pages/sections (Section[])
   │
   ▼
Indexing ───► build hierarchical tree (TreeNode) + entity map + knowledge graph
   │
   ▼
Storage ────► persist to SQLite (sections, tree, entities, graph, query cache)
   │
   ▼
Query
  ├─ Tier 1: LLM (or keyword) batch-prune of tree nodes
  ├─ Tier 2: LLM (or heuristic) per-node relevance scoring
  ├─ Retrieval: pull section text for selected nodes
  └─ Answer: LLM synthesizes answer with source citations
   │
   ▼
Adaptive ───► update node summaries based on query feedback (runs in background)

One of three query routes is selected automatically:

  • single_doc — standard hierarchical search on one document
  • multi_doc — virtual super-tree across up to 3 documents
  • graph — entity-seeded graph walk for relationship queries ("how does X relate to Y?")
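The route choice can be pictured as a small classifier over the query and the set of documents. The sketch below is a hypothetical heuristic for illustration, not Querdex's actual routing logic:

```python
def select_route(query: str, doc_ids: list[str]) -> str:
    """Pick one of the three query routes (illustrative heuristic)."""
    relationship_cues = ("relate", "relationship", "connection", "between")
    if any(cue in query.lower() for cue in relationship_cues):
        return "graph"       # entity-seeded graph walk
    if len(doc_ids) > 1:
        return "multi_doc"   # virtual super-tree across up to 3 documents
    return "single_doc"      # standard hierarchical search

route = select_route("How does X relate to Y?", ["annual-report"])
```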

Installation

Base install (no LLM, uses keyword heuristics):

pip install querdex

With Anthropic (Claude):

pip install querdex[anthropic]

With OpenAI (GPT):

pip install querdex[openai]

Development:

git clone <repo>
cd querdex
uv sync --extra dev
# or with an LLM provider:
uv sync --extra dev --extra anthropic
uv sync --extra dev --extra openai

Requirements: Python 3.11+


Quick Start (CLI)

1. Index a document

querdex index ./report.pdf --doc-id annual-report

Output:

Indexed doc_id=annual-report version=1
Nodes=12 max_depth=3

2. Query it

querdex query --doc-id annual-report --query "What was the Q3 revenue?"

Output:

Query ID: 3f8a1c...
Intent: single_doc | Cache hit: False
Q3 revenue was $1.2B, up 8% year-over-year (Revenue Analysis, pages 4-6).

3. Multi-turn conversation (session)

# First turn
querdex query --doc-id annual-report \
  --query "What were the risk factors?" \
  --session-id session_001

# Second turn — context from first turn is carried over
querdex query --doc-id annual-report \
  --query "Which of those risks materialised?" \
  --session-id session_001

4. Re-index an updated document

When the document changes, Querdex only rebuilds the affected parts:

querdex index ./report_v2.pdf --doc-id annual-report

5. Delete a document

querdex delete --doc-id annual-report

Custom database path

By default the database is stored at ./index_store/querdex.db. To change it:

querdex --db /path/to/my.db index ./report.pdf --doc-id demo
querdex --db /path/to/my.db query --doc-id demo --query "summary?"

LLM Setup

Without any LLM configured, Querdex falls back to keyword/heuristic matching. It always produces an answer, just a less precise one.
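The keyword fallback idea amounts to ranking sections by term overlap with the question. A minimal sketch of that idea (illustrative only; not Querdex's exact scoring):

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms (length > 2) that appear in the candidate text."""
    terms = {t for t in query.lower().split() if len(t) > 2}
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms)

# Rank candidate sections by overlap with the question
sections = {
    "Revenue Analysis": "Q3 revenue was $1.2B, up 8% year-over-year.",
    "Risk Factors": "Currency exposure and supply-chain concentration.",
}
best = max(sections, key=lambda name: keyword_score("Q3 revenue", sections[name]))
```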

Anthropic (Claude)

export QUERDEX_LLM_PROVIDER=anthropic
export QUERDEX_LLM_API_KEY=sk-ant-...

# Optional: override model defaults
export QUERDEX_LLM_TIER1_MODEL=claude-haiku-4-5-20251001   # fast, cheap (batch prune)
export QUERDEX_LLM_TIER2_MODEL=claude-sonnet-4-6            # powerful (deep reasoning + answers)

OpenAI (GPT)

export QUERDEX_LLM_PROVIDER=openai
export QUERDEX_LLM_API_KEY=sk-...

# Optional: override model defaults
export QUERDEX_LLM_TIER1_MODEL=gpt-4o-mini   # fast, cheap
export QUERDEX_LLM_TIER2_MODEL=gpt-4o         # powerful

How the two tiers are used:

| Tier   | Model      | Purpose |
| ------ | ---------- | ------- |
| Tier 1 | cheap/fast | Single batched call to prune all tree nodes to the relevant few |
| Tier 2 | powerful   | Per-node deep reasoning to confirm relevance and score confidence |
| Answer | powerful   | Synthesise a cited answer from the retrieved section text |
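The two-tier flow can be sketched with stub callables standing in for the LLM calls. This is a hypothetical outline of the pattern, with `cheap_prune`, `deep_score`, and the threshold all invented for illustration:

```python
def two_tier_search(nodes, cheap_prune, deep_score, keep=5, threshold=0.5):
    """Tier 1: one batched call prunes the tree; Tier 2: per-node scoring."""
    # Tier 1 -- a single batched call returns the few candidate nodes
    candidates = cheap_prune(nodes)[:keep]
    # Tier 2 -- per-node deep reasoning confirms relevance with a confidence
    scored = [(node, deep_score(node)) for node in candidates]
    return [node for node, conf in scored if conf >= threshold]

# Stub "LLM" callables for illustration
nodes = ["intro", "revenue", "risks", "outlook"]
prune = lambda ns: [n for n in ns if n != "intro"]
score = lambda n: 0.9 if n == "revenue" else 0.3
selected = two_tier_search(nodes, prune, score)  # ['revenue']
```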

CLI Reference

querdex [--db PATH] <command> [options]
| Command        | Description |
| -------------- | ----------- |
| `index <file>` | Index a document. Auto-detects format from extension. |
| `query`        | Query an indexed document. |
| `delete`       | Remove a document and all its data from the store. |

index

querdex index <file_path> [--doc-id ID]
| Argument    | Default                            | Description |
| ----------- | ---------------------------------- | ----------- |
| `file_path` | required                           | Path to the document to index |
| `--doc-id`  | auto-generated from filename+hash  | Stable identifier for this document |

query

querdex query --doc-id ID --query TEXT [--session-id ID]
| Argument       | Default  | Description |
| -------------- | -------- | ----------- |
| `--doc-id`     | required | Document to query |
| `--query`      | required | Natural language question |
| `--session-id` | none     | Enables multi-turn context (pass same ID across turns) |

delete

querdex delete --doc-id ID

Python API

For integration into your own application:

import asyncio
from querdex.services import build_engine

# build_engine reads QUERDEX_LLM_* env vars automatically
engine = build_engine("./index_store/querdex.db")

# Index a document
doc = asyncio.run(engine.index_document("./report.pdf", doc_id="annual-report"))
print(f"Indexed: {doc.doc_id} | nodes={doc.stats.total_nodes}")

# Query
result = engine.query_document("annual-report", "What was Q3 revenue?")
print(result.answer)
print(f"Confidence: {result.confidence:.0%}")
for source in result.source_nodes:
    print(f"  Source: {source.title}, pages {source.pages}")

# Multi-turn query
result2 = engine.query_document(
    "annual-report",
    "What caused that increase?",
    session_id="my-session-001",
)

# Re-index after the document changes
doc_v2 = asyncio.run(engine.reindex_document("./report_v2.pdf", doc_id="annual-report"))

# Delete
engine.store.delete_document("annual-report")

# Always close when done
engine.store.close()

Passing an LLM client directly

from querdex.llm.anthropic_client import AnthropicLLMClient
from querdex.services.engine import QuerdexEngine
from querdex.storage import SQLiteStore

llm = AnthropicLLMClient(
    api_key="sk-ant-...",
    tier1_model="claude-haiku-4-5-20251001",
    tier2_model="claude-sonnet-4-6",
)
store = SQLiteStore("./querdex.db")
engine = QuerdexEngine(store, llm_client=llm)

Using the FakeLLMClient in tests

from querdex.llm.fake_client import FakeLLMClient
from querdex.query.answering import AnswerGenerator

fake = FakeLLMClient(
    default='{"answer": "Revenue was $1.2B.", "confidence": 0.9}'
)
gen = AnswerGenerator(llm_client=fake)
chunks = ["Q3 revenue was $1.2B."]  # retrieved section text (illustrative)
answer, confidence, sources = gen.generate("What was revenue?", chunks)

Supported File Types

| Extension                      | Parser           | Notes |
| ------------------------------ | ---------------- | ----- |
| .txt                           | TextParser       | Plain text, split by paragraphs |
| .md, .markdown                 | MarkdownParser   | Heading-aware section splitting |
| .html, .htm                    | HTMLParser       | Strips tags, extracts text blocks |
| .docx                          | DOCXParser       | Microsoft Word, paragraph-level |
| .pdf                           | PDFParser        | Page-level; OCR optional (see below) |
| .py                            | PythonCodeParser | Function/class level chunking |
| .js, .ts, .jsx, .tsx           | JSCodeParser     | Function-level chunking |
| .csv                           | CSVParser        | Row-batched sections |
| .db, .sqlite                   | SQLiteParser     | Table-level sections |
| .mp3, .wav, .m4a, .mp4, .mov   | AudioVideoParser | Transcript-based (requires Whisper or similar) |
| .url                           | URLParser        | Fetches and parses the web page at that URL |
| URL string                     | URLParser        | Pass a URL string directly as the file path |
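Format detection by extension can be pictured as a lookup table. The sketch below is hypothetical internals (parser names taken from the table above, dispatch logic invented for illustration):

```python
from pathlib import Path

# Subset of the extension-to-parser mapping, for illustration
PARSERS = {
    ".txt": "TextParser", ".md": "MarkdownParser", ".markdown": "MarkdownParser",
    ".html": "HTMLParser", ".htm": "HTMLParser", ".docx": "DOCXParser",
    ".pdf": "PDFParser", ".py": "PythonCodeParser", ".csv": "CSVParser",
}

def pick_parser(path: str) -> str:
    """Choose a parser name from the file extension (or a URL string)."""
    if path.startswith(("http://", "https://")):
        return "URLParser"  # URL strings go straight to the web parser
    return PARSERS.get(Path(path).suffix.lower(), "TextParser")

parser = pick_parser("./report.pdf")  # PDFParser
```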

PDF OCR

For scanned PDFs, enable OCR via environment variables:

# Tesseract (local)
export QUERDEX_OCR_ENABLED=true
export QUERDEX_OCR_PROVIDER=tesseract         # default when OCR enabled
export QUERDEX_TESSERACT_CMD=tesseract        # path to tesseract binary

# Cloud OCR (custom endpoint)
export QUERDEX_OCR_ENABLED=true
export QUERDEX_OCR_PROVIDER=cloud
export QUERDEX_OCR_ENDPOINT=https://your-ocr-api.com/v1/ocr
export QUERDEX_OCR_API_KEY=your-key

Environment Variables

| Variable                  | Default                                  | Description |
| ------------------------- | ---------------------------------------- | ----------- |
| QUERDEX_LLM_PROVIDER      | (none)                                   | anthropic or openai. If unset, heuristic mode is used. |
| QUERDEX_LLM_API_KEY       | (none)                                   | API key for the selected provider |
| QUERDEX_LLM_TIER1_MODEL   | claude-haiku-4-5-20251001 / gpt-4o-mini  | Fast model for batch node pruning |
| QUERDEX_LLM_TIER2_MODEL   | claude-sonnet-4-6 / gpt-4o               | Powerful model for deep reasoning and answers |
| QUERDEX_OCR_ENABLED       | false                                    | Enable OCR for scanned PDFs |
| QUERDEX_OCR_PROVIDER      | tesseract                                | tesseract or cloud |
| QUERDEX_TESSERACT_CMD     | tesseract                                | Path to Tesseract binary |
| QUERDEX_OCR_ENDPOINT      | (none)                                   | Endpoint URL for cloud OCR provider |
| QUERDEX_OCR_API_KEY       | (none)                                   | API key for cloud OCR provider |
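Reading these variables into a config object might look like the following. This is a sketch of the pattern, not Querdex's actual config loader; the `LLMConfig` class and `MODEL_DEFAULTS` table are invented here:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMConfig:
    provider: Optional[str]
    api_key: Optional[str]
    tier1_model: str
    tier2_model: str

# Per-provider model defaults, matching the table above
MODEL_DEFAULTS = {
    "anthropic": ("claude-haiku-4-5-20251001", "claude-sonnet-4-6"),
    "openai": ("gpt-4o-mini", "gpt-4o"),
}

def load_llm_config(env=os.environ) -> LLMConfig:
    provider = env.get("QUERDEX_LLM_PROVIDER")  # unset => heuristic mode
    t1, t2 = MODEL_DEFAULTS.get(provider, ("", ""))
    return LLMConfig(
        provider=provider,
        api_key=env.get("QUERDEX_LLM_API_KEY"),
        tier1_model=env.get("QUERDEX_LLM_TIER1_MODEL", t1),
        tier2_model=env.get("QUERDEX_LLM_TIER2_MODEL", t2),
    )

cfg = load_llm_config({"QUERDEX_LLM_PROVIDER": "openai"})
```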

License

MIT
