Merged

Dev #109

29 commits
a47f207
feat(python): add Answer wrapper and refactor engine interface
zTgx Apr 22, 2026
d5389a2
refactor(docs): update project structure and rename engine description
zTgx Apr 22, 2026
e106de0
docs(CLAUDE.md): update development workflow paths
zTgx Apr 22, 2026
2e8a47b
feat(pyproject.toml): update project configuration and dependencies
zTgx Apr 22, 2026
63354a6
refactor(core): remove exclusion rules and example files
zTgx Apr 22, 2026
65208c7
docs(CLAUDE.md): update examples section documentation
zTgx Apr 22, 2026
d7362e3
refactor(examples): update single_doc_challenge example with new API
zTgx Apr 22, 2026
6c3ecb5
docs(HISTORY): add history tracking file
zTgx Apr 22, 2026
1291e8f
docs(HISTORY): add comprehensive history file with version changelog
zTgx Apr 22, 2026
edf97d7
chore(workspace): update project metadata and version management
zTgx Apr 22, 2026
67113e7
feat(index): add concept extraction stage with LLM support
zTgx Apr 22, 2026
451f3fa
feat(agent): add reasoning trace collection and verification stage
zTgx Apr 22, 2026
03d3cc2
feat: split monorepo into modular crates and add agent functionality
zTgx Apr 23, 2026
5abf1d8
feat: add vectorless-rerank module and reorganize query types
zTgx Apr 23, 2026
a1f8373
refactor(vectorless-llm): update import paths and add dev dependency
zTgx Apr 23, 2026
9a7cb06
refactor: update DocumentTree import path in test modules
zTgx Apr 23, 2026
db2396d
refactor(builder): update indexer client import path
zTgx Apr 23, 2026
060345c
refactor(engine): remove explicit type annotation in source_path mapping
zTgx Apr 23, 2026
c94953a
refactor(engine): update DocumentTree type reference in indexer
zTgx Apr 23, 2026
5d0f7d1
refactor: reorganize imports and reorder statements across modules
zTgx Apr 23, 2026
60c8fab
refactor(engine): update DocContext creation and improve module imports
zTgx Apr 23, 2026
968936c
refactor(engine): re-export types from sub-crates for better API access
zTgx Apr 23, 2026
275c5a3
refactor(vectorless-config): move DocumentGraphConfig to vectorless-g…
zTgx Apr 23, 2026
2757aa6
feat: remove vectorless core module and related functionality
zTgx Apr 23, 2026
9d76eba
refactor(docs): update project structure documentation with fine-grai…
zTgx Apr 23, 2026
9ea317c
refactor(config): reorder imports in types module
zTgx Apr 23, 2026
2af0f29
feat: remove all example files and documentation
zTgx Apr 23, 2026
05e5fc9
feat: add single-document reasoning challenge example
zTgx Apr 23, 2026
4cc38f4
Merge pull request #108 from vectorlessflow/feat-understanding
zTgx Apr 23, 2026
89 changes: 58 additions & 31 deletions CLAUDE.md
@@ -1,6 +1,6 @@
# CLAUDE.md

Vectorless is a reasoning-native document intelligence engine written in Rust.
Vectorless is a Document Understanding Engine for AI written in Rust.

## Principles

@@ -10,32 +10,56 @@ Vectorless is a reasoning-native document intelligence engine written in Rust.

## Project Structure

- `rust/` - Rust core engine
- `src/client/` - Client API (EngineBuilder, Engine) - facade layer, no business logic
- `src/document/` - Document data structures (DocumentTree, NavigationIndex, ReasoningIndex)
- `src/index/` - Compile pipeline (8-stage, checkpointing, incremental update)
- `src/retrieval/` - Retrieval dispatch layer (preprocessing, dispatch, postprocessing, cache, streaming)
- `src/query/` - Query understanding and planning (intent classification, rewrite, decomposition)
- `src/agent/` - Retrieval execution (Worker: doc navigation, Orchestrator: supervisor loop + multi-doc fusion)
- `src/rerank/` - Result reranking and answer synthesis (dedup, scoring, fusion, synthesis)
- `src/scoring/` - Scoring and ranking strategies (BM25, relevance scoring, score combination)
- `src/llm/` - LLM client (connection pool, memo/caching, throttle/rate-limiting, fallback)
- `src/storage/` - Persistence (Workspace, LRU cache, backend abstraction file/memory)
- `src/graph/` - Cross-document relationship graph
- `src/metrics/` - Metrics collection and reporting
- `src/events/` - Event system for progress monitoring
- `src/config/` - Configuration types and validation
- `src/error.rs` - Unified error types
- `src/utils/` - Utility functions (token counting, fingerprinting, validation)
- `examples/` - Rust examples (flow, indexing, pdf, batch, etc.)
- `python/` - Python SDK (PyO3 bindings) + CLI
Cargo workspace with 17 fine-grained Rust crates + pure Python SDK:

```
vectorless-core/
├── vectorless-error/ # Error types (Result, Error enum)
├── vectorless-document/ # Document types (Document, Tree, NavigationIndex, ReasoningIndex)
├── vectorless-config/ # Configuration hub (aggregates all config types)
├── vectorless-utils/ # Utilities (fingerprinting, token counting, validation)
├── vectorless-scoring/ # Scoring (BM25, keyword extraction)
├── vectorless-graph/ # Cross-document relationship graph
├── vectorless-events/ # Event system for progress monitoring
├── vectorless-metrics/ # Metrics collection and reporting
├── vectorless-llm/ # LLM client (pool, memo/cache, throttle, fallback)
├── vectorless-storage/ # Persistence (Workspace, LRU cache, file/memory backends)
├── vectorless-query/ # Query understanding (intent classification, rewrite)
├── vectorless-index/ # Compile pipeline (10-stage, checkpointing, incremental update)
├── vectorless-agent/ # Retrieval execution (Worker navigation + Orchestrator fusion)
├── vectorless-retrieval/ # Retrieval dispatch layer (dispatcher, cache, streaming)
├── vectorless-rerank/ # Result reranking (dedup, BM25 scoring, fusion)
├── vectorless-engine/ # Facade (Engine, EngineBuilder) — re-exports public API
└── vectorless-py/ # PyO3 bindings (compiled into Python native module)
```

- `vectorless/` - Pure Python SDK (high-level wrappers, CLI, config loading, integrations)
- `examples/` - Python examples (primary, for Python ecosystem)
- `docs/` - Docusaurus documentation site
- `samples/` - Sample files
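
The tree above describes `vectorless-engine` as a facade that re-exports the public API of the sub-crates. A minimal sketch of that pattern follows, with in-file modules standing in for separate crates. Everything beyond the names `Engine`, `EngineBuilder`, `DocumentTree`, `Error` and `Result` is an assumption made for illustration: the `title` field, the `endpoint` and `build` methods, and the `MissingEndpoint` variant are hypothetical and not taken from the real crates.

```rust
// Stand-in for the `vectorless-document` crate (hypothetical contents).
mod vectorless_document {
    #[derive(Debug, Default, PartialEq)]
    pub struct DocumentTree {
        pub title: String, // hypothetical field
    }
}

// Stand-in for the `vectorless-error` crate (hypothetical variant).
mod vectorless_error {
    #[derive(Debug, PartialEq)]
    pub enum Error {
        MissingEndpoint,
    }
    pub type Result<T> = std::result::Result<T, Error>;
}

// Stand-in for the `vectorless-engine` facade crate.
mod vectorless_engine {
    // Re-exports: callers import these types from the facade only.
    pub use super::vectorless_document::DocumentTree;
    pub use super::vectorless_error::{Error, Result};

    #[derive(Debug)]
    pub struct Engine {
        endpoint: String,
    }

    impl Engine {
        pub fn endpoint(&self) -> &str {
            &self.endpoint
        }
    }

    #[derive(Default)]
    pub struct EngineBuilder {
        endpoint: Option<String>,
    }

    impl EngineBuilder {
        pub fn new() -> Self {
            Self::default()
        }

        // Hypothetical setter; the real builder's options may differ.
        pub fn endpoint(mut self, url: &str) -> Self {
            self.endpoint = Some(url.to_owned());
            self
        }

        // Rejects a missing or empty endpoint before constructing the engine.
        pub fn build(self) -> Result<Engine> {
            let endpoint = self
                .endpoint
                .filter(|e| !e.is_empty())
                .ok_or(Error::MissingEndpoint)?;
            Ok(Engine { endpoint })
        }
    }
}
```

With this layout a caller writes `vectorless_engine::DocumentTree` without naming the crate that defines it, so moving a type between sub-crates need not change downstream imports.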

### Dependency Layers

```
Layer 0: error · document · utils · scoring (no workspace deps)
Layer 1: graph · events · config · metrics (depends on Layer 0)
Layer 2: llm · storage (depends on Layer 0–1)
Layer 3: query (depends on Layer 0–2)
Layer 4: index · agent (depends on Layer 0–3)
Layer 5: retrieval · rerank (depends on Layer 0–4)
Layer 6: engine (facade) · vectorless-py (bindings) (depends on all)
```

### Compilation Isolation

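Changing a module recompiles only that crate plus the upstream facade:
- Change `agent` → agent, retrieval, rerank, engine, py recompile; index/llm/storage are untouched
- Change `llm` → llm and the layers above it recompile; index/agent/stage do not recompile
- Change `document` → everything recompiles (core types, expected behavior)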
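
The layer table can be turned into a rough estimate of what a change touches. The sketch below is a simplified model, not how Cargo decides: it assumes every crate in a higher layer depends on the changed crate, so it reproduces the `agent` case exactly and can overestimate elsewhere, because Cargo follows actual dependency edges. `LAYERS` and `rebuild_set` are hypothetical names, and the short crate names are the ones used in the layer table.

```rust
/// Workspace crates grouped by dependency layer (Layer 0 first).
static LAYERS: [&[&str]; 7] = [
    &["error", "document", "utils", "scoring"],
    &["graph", "events", "config", "metrics"],
    &["llm", "storage"],
    &["query"],
    &["index", "agent"],
    &["retrieval", "rerank"],
    &["engine", "py"],
];

/// Estimates the rebuild set for a change to `changed`: the crate itself
/// followed by every crate in a higher layer. Crates in the same layer are
/// assumed to be independent of each other.
/// Returns `None` when the crate name is not in the table.
fn rebuild_set(changed: &str) -> Option<Vec<&'static str>> {
    for (i, crates) in LAYERS.iter().enumerate() {
        if let Some(name) = crates.iter().copied().find(|c| *c == changed) {
            let mut out = vec![name];
            for upper in &LAYERS[i + 1..] {
                out.extend_from_slice(upper);
            }
            return Some(out);
        }
    }
    None
}
```

Under this model `rebuild_set("agent")` yields agent, retrieval, rerank, engine and py, while `index`, which sits in the same layer, is left out.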

### Retrieval Call Flow

```
Engine.query()
Engine.ask()
→ retrieval/dispatcher
→ query/understand() → QueryPlan (LLM intent + concepts + strategy)
→ Orchestrator (always, single or multi-doc)
@@ -49,16 +73,17 @@ Engine.query()
## Build Commands

```bash
# Rust core
cd rust
cargo build # Build
cargo test # Run tests
# Build (workspace)
cargo build # Build all crates
cargo test # Run tests (488 tests across all crates)
cargo clippy # Lint
cargo fmt # Format code

# Build specific crate (fast — only that crate + dependents)
cargo build -p vectorless-agent

# Python SDK
cd python
pip install -e . # Install in editable mode
pip install -e . # Install in editable mode (from project root, uses maturin)

# Docs site
cd docs
@@ -145,7 +170,9 @@ When uncertain whether an operation is safe, **default to asking user confirmati

## Common Development Workflow

1. **Adding features**: Implement in appropriate `rust/src/` module, add tests
1. **Adding features**: Implement in the appropriate `vectorless-core/vectorless-*/` crate, add tests
2. **Fixing bugs**: Add failing test case first, fix and ensure tests pass
3. **Python bindings**: Update `python/src/lib.rs` (PyO3) when Rust APIs change
4. **Committing code**: Use semantic commit messages, format: `type(scope): description`
3. **Adding crates**: New modules get their own crate under `vectorless-core/`, add to workspace Cargo.toml
4. **Python bindings**: Update `vectorless-core/vectorless-py/src/lib.rs` (PyO3) when Rust APIs change
5. **Python SDK**: Update `vectorless/` when API surface changes
6. **Committing code**: Use semantic commit messages, format: `type(scope): description`
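
The commit format `type(scope): description` in the last step can be checked mechanically. The following is a minimal sketch, not project tooling: `CommitHeader`, `parse_commit_header` and the `KNOWN_TYPES` list are hypothetical, and the allowed types are an assumption based on the prefixes seen in this PR plus `fix`. The scope is treated as optional because several commits in this PR omit it.

```rust
/// A parsed commit header of the form `type(scope): description`.
#[derive(Debug, PartialEq)]
struct CommitHeader<'a> {
    kind: &'a str,
    scope: Option<&'a str>,
    description: &'a str,
}

/// Assumed set of accepted commit types (not taken from project config).
const KNOWN_TYPES: [&str; 5] = ["feat", "fix", "refactor", "docs", "chore"];

/// Parses the first line of a commit message.
/// Returns `None` when the line does not follow the format.
fn parse_commit_header(line: &str) -> Option<CommitHeader<'_>> {
    // Split at the first ": " into `type(scope)` and the description.
    let (prefix, description) = line.split_once(": ")?;
    let description = description.trim();
    if description.is_empty() {
        return None;
    }

    // The scope, when present, is wrapped in parentheses after the type.
    let (kind, scope) = match prefix.split_once('(') {
        Some((kind, rest)) => {
            let scope = rest.strip_suffix(')')?;
            if scope.is_empty() {
                return None;
            }
            (kind, Some(scope))
        }
        None => (prefix, None),
    };

    if !KNOWN_TYPES.contains(&kind) {
        return None;
    }
    Some(CommitHeader { kind, scope, description })
}
```

A header such as `feat(index): add concept extraction stage with LLM support` parses with scope `index`, whereas a merge commit title has no `type: ` prefix and is rejected.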
24 changes: 21 additions & 3 deletions Cargo.toml
@@ -1,10 +1,28 @@
[workspace]
members = ["rust", "python"]
members = [
"vectorless-core/vectorless-error",
"vectorless-core/vectorless-document",
"vectorless-core/vectorless-config",
"vectorless-core/vectorless-utils",
"vectorless-core/vectorless-scoring",
"vectorless-core/vectorless-graph",
"vectorless-core/vectorless-events",
"vectorless-core/vectorless-metrics",
"vectorless-core/vectorless-llm",
"vectorless-core/vectorless-storage",
"vectorless-core/vectorless-query",
"vectorless-core/vectorless-index",
"vectorless-core/vectorless-agent",
"vectorless-core/vectorless-retrieval",
"vectorless-core/vectorless-rerank",
"vectorless-core/vectorless-engine",
"vectorless-core/vectorless-py",
]
resolver = "2"

[workspace.package]
version = "0.1.32"
description = "Reasoning-based Document Engine"
version = "0.1.12"
description = "Document Understanding Engine for AI"
edition = "2024"
authors = ["zTgx <beautifularea@gmail.com>"]
license = "Apache-2.0"
99 changes: 99 additions & 0 deletions HISTORY.md
@@ -0,0 +1,99 @@
# HISTORY

## 0.1.11 (2026-04-21)

- Project description updated to "reasoning-based document engine"
- Core principles documentation (Reason don't vector, Model fails we fail, No thought no answer)
- Updated homepage with three core principles and key features

## 0.1.10 (2026-04-21)

- Description generation enabled by default
- `timeout_secs` option for Python indexing
- Agent-based navigation documentation

## 0.1.9 (2026-04-20)

- **Agent-based retrieval architecture**: replaced pilot/search with Orchestrator + Workers
- Navigation commands: `ls`, `cd`, `cat`, `grep`, `find`, `head`, `pwd`, `wc`
- Orchestrator supervisor loop with dynamic re-planning
- Query understanding pipeline with `QueryPlan`
- Evidence evaluation and replanning modules
- `NavigationIndex` with `DocCard` and `SectionCard`
- LLM-based confidence scoring (replaced BM25)
- Unified rerank pipeline (replaced synthesis/fusion)
- `DocCard` catalog in workspace storage
- Shared concurrency control for LLM clients
- Memoization for LLM operations in retrieval pipeline
- LLM request timeout configuration

## 0.1.8 (2026-04-16)

- GitHub Actions workflow for automated releases
- Endpoint parameter support for API configuration
- Custom config option in `EngineBuilder`
- Enhanced error messages with detailed failure info
- Endpoint validation in engine builder

## 0.1.7 (2026-04-15)

- Runtime metrics reports (LLM, Pilot, Retrieval)
- Recursive option for `from_dir` method
- Directory indexing support via `IndexContext`
- Centralized `LlmPool` configuration system
- Shared LLM client injected into pipeline context
- Pipeline checkpoint for resumable indexing
- `source_path` field and updated `QueryContext` API

## 0.1.6 (2026-04-15)

- `IndexMetrics` binding with detailed indexing statistics
- `StrategyPreference` for controlling retrieval strategies
- Pure Pilot search algorithm, beam search with backtracking
- Per-step reasoning support in search algorithms
- Binary pruning and pre-filtering for wide nodes
- LLM-based query complexity detection
- Cross-document strategy with graph-based boosting
- Synonym expansion for improved query recall
- Default summary strategy changed to Full

## 0.1.4 (2026-04-13)

- PDF parser: switch to `pdf-extract` for reliable text extraction
- Concurrent LLM verification for TOC entries
- PDF indexing example

## 0.1.3 (2026-04-13)

- Internal module naming cleanup (`_` prefix for private functions)

## 0.1.2 (2026-04-13)

- Search-from functionality and ToC-based navigation
- Reasoning chain (replacing navigation trace)
- Adaptive budget controller for pipeline token management
- Structural path constraints and hints extraction
- Reasoning index for fast retrieval path resolution
- Document graph system for cross-document relationships
- Streaming retrieval with `RetrieveEvent` support
- Multi-document query support
- Incremental indexing with content and logic fingerprinting
- Parallel processing for multiple document sources
- Pipeline checkpoint and content merging/splitting support

## 0.1.1 (2026-04-08)

- Workspace-managed dependencies and configuration
- LLM pilot functionality and summary generation
- Query decomposition support
- LLM-first search with TOC-based location
- Restructured Python examples

## 0.1.0 (2026-04-07)

Initial Python SDK release.

- PyO3 bindings for the Rust engine core
- Basic `Engine` class with `index()` and `query()` methods
- `pyproject.toml` with maturin build backend
- Ruff formatting configuration
28 changes: 0 additions & 28 deletions examples/batch_indexing/README.md

This file was deleted.
