Add ontology normalization and config-driven Phase 3 entity resolution by vmoudyp · Pull Request #3 · sam234990/BookRAG

vmoudyp · 2026-03-06T23:55:05Z

Summary

- Add ontology-driven canonical entity normalization across config, indexing, graph persistence, and retrieval
- Add a short ontology usage guide and conservative Phase 3 tenant/global entity-resolution config
- Preserve ontology metadata in tenant-global VDB records and add targeted tests
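
As a rough illustration of what ontology-driven canonical normalization can look like, here is a minimal sketch. The `Ontology` class, its `aliases` mapping, and the `normalize` method are hypothetical names chosen for this example; the PR's actual data model is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical ontology: canonical entity names mapped to known aliases."""

    aliases: dict[str, list[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Build a case-insensitive reverse lookup: alias -> canonical name.
        self._lookup: dict[str, str] = {}
        for canonical, names in self.aliases.items():
            self._lookup[canonical.casefold()] = canonical
            for name in names:
                self._lookup[name.casefold()] = canonical

    def normalize(self, mention: str) -> str:
        """Return the canonical name, or the whitespace-cleaned mention if unknown."""
        cleaned = " ".join(mention.split())
        return self._lookup.get(cleaned.casefold(), cleaned)


onto = Ontology({"European Union": ["EU", "E.U."]})
print(onto.normalize("eu"))
print(onto.normalize("  Unknown   Corp "))
```

Unknown mentions pass through unchanged apart from whitespace cleanup, which keeps the step conservative: nothing is merged unless the ontology lists it.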

Validation

.venv/bin/python -m pytest tests/test_ontology_integration.py -q
.venv/bin/python -m pytest tests/test_ontology_integration.py tests/test_legal_heading_detector.py tests/test_pdf_refiner_lang.py -q

Pull Request opened by Augment Code with guidance from the PR author

- Add DoclingConfig dataclass (Core/configs/docling_config.py) with ocr_engine, force_full_page_ocr, images_scale, lang fields
- Update SystemConfig to include parser (default: 'mineru') and docling config fields (Core/configs/system_config.py)
- Add Docling adapter module (Core/provider/extract_pdf_info_docling.py) with parse_doc_with_docling() that maps Docling output to the canonical pdf_list format consumed by the BookRAG pipeline
- Update build_tree_from_pdf() to branch on cfg.parser; MinerU path unchanged; Docling path added with isolated cache dir (Core/pipelines/doc_tree_builder.py)
- Add example config for scanned documents (config/docling.yaml) with force_full_page_ocr: true and easyocr engine

MinerU remains the default parser. No existing behaviour changes when parser is unset or set to 'mineru'.
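
The config shape and parser branch described in this commit could look roughly like the sketch below. The field names and the 'mineru' default come from the commit message; the field types, the other default values, and the `select_parser` helper are assumptions made for illustration and are not taken from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class DoclingConfig:
    # Field names come from the commit message; types and defaults are assumed.
    ocr_engine: str = "easyocr"
    force_full_page_ocr: bool = False
    images_scale: float = 1.0
    lang: list[str] = field(default_factory=lambda: ["en"])


@dataclass
class SystemConfig:
    parser: str = "mineru"  # MinerU stays the default, per the commit message
    docling: DoclingConfig = field(default_factory=DoclingConfig)


def select_parser(cfg: SystemConfig) -> str:
    """Hypothetical helper: report which parsing path would run for this config."""
    parser = (cfg.parser or "mineru").lower()  # unset falls back to MinerU
    if parser in ("mineru", "docling"):
        return parser
    raise ValueError(f"Unknown parser: {cfg.parser!r}")


print(select_parser(SystemConfig()))
print(select_parser(SystemConfig(parser="docling")))
```

Treating an unset parser the same as 'mineru' mirrors the backward-compatibility note above.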

Phase 1 - FalkorDB & Config Foundation:
- Add Core/configs/falkordb_config.py: FalkorDBConfig dataclass with graph_name_for_doc() and graph_name_for_global() helpers
- Add Core/configs/mongodb_config.py: MongoDBConfig dataclass with tenant_db_name() helper
- Update Core/configs/system_config.py: add tenant_id, doc_id, falkordb, mongodb fields; backward-compatible (all optional)
- Rewrite Core/Index/Graph.py: optional FalkorDB persistence alongside NetworkX; new methods _save_to_falkordb, _load_from_falkordb, _get_fdb_subgraph, save_to_global_graph, get_global_subgraph; fix pre-existing bug: node_link_graph now uses edges='links' to match node_link_data save format (NetworkX 3.x compatibility)
- Update Core/Index/GBCIndex.py: tenant/doc_id-namespaced VDB paths, pass falkordb config to Graph.load_from_dir, add rebuild_global_vdb()
- Update Core/pipelines/kg_builder.py: pass tenant_id, doc_id, falkordb_cfg from config into Graph constructor

Phase 2 - FastAPI Multi-User API:
- api/main.py: FastAPI app with lifespan, CORS, health endp…
- … get_current_user, require_admin, check_doc_access, filter_accessible_docs
- api/db/mongodb.py: Motor async CRUD for tenants, …
- …pi/models/requests.py: Pydantic request/response models
- api/routers/auth.py: POST /auth/register, POST /auth/login
- api/routers/tenants.py: tenant management (admin-gated)
- api/routers/documents.py: PDF upload + background indexing, list, status
- api/routers/chat.py: POST /chat/query, session create, message history
- api/services/indexing.py: background GBC index build in thread pool with MongoDB status tracking

Phase 3 - Cross-Document Entity Resolution:
- … …korDB graph, HAS_MENTION edge creation, update global VDB for canonical entities
- api/services/chat.py: parallel per-doc GBC RAG queries (up to 5 docs) with answer synthesis for cross-doc mode
- Entity resolution runs automatically after indexing (non-fatal)

Config updates (from previous session):
- config/gbc.yaml: upgrade LLM/VLM to Qwen3.5-35B-A3B-AWQ
- Core/configs/vlm_config.py: update defaults for Qwen3.5 backend
- Core/prompts/kg_prompt.py: language-agnostic NER prompt
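
The parallel per-document fan-out mentioned for cross-doc mode can be sketched with the standard library alone. This is only an illustration under assumptions: `query_docs_in_parallel` and the `query_one` callback are hypothetical names, and isolating per-document failures is a design choice of this sketch rather than confirmed behaviour of api/services/chat.py. Only the cap of 5 documents comes from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_DOCS = 5  # cap taken from the commit message ("up to 5 docs")


def query_docs_in_parallel(
    question: str,
    doc_ids: list[str],
    query_one: Callable[[str, str], str],
) -> dict[str, str]:
    """Run query_one(doc_id, question) for at most MAX_DOCS documents."""
    selected = doc_ids[:MAX_DOCS]
    if not selected:
        return {}
    answers: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        futures = {d: pool.submit(query_one, d, question) for d in selected}
        for doc_id, future in futures.items():
            try:
                answers[doc_id] = future.result()
            except Exception as exc:  # assumption: one failing doc should not sink the rest
                answers[doc_id] = f"[error: {exc}]"
    return answers


def fake_query(doc_id: str, question: str) -> str:
    # Stand-in for a real per-document RAG query.
    return f"{doc_id}: answer to {question!r}"


results = query_docs_in_parallel("Who signed?", [f"doc{i}" for i in range(7)], fake_query)
print(sorted(results))
```

The per-document answers would then be passed to a synthesis step, which is outside the scope of this sketch.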

…erver for AI integration

- Added path traversal prevention in the raw document download endpoint.
- Standardized error messages in entity management endpoints to "Internal server error".
- Updated permission granting logic to require document ownership for non-admin users.
- Refactored chat service to include caching for GBC indexes and improved query handling with history relevance filtering.
- Introduced a new MCP server implementation to expose BookRAG API for AI agents, including resource and tool mappings.
- Added detailed documentation for MCP server setup and usage.
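
A common way to implement the path traversal prevention mentioned above is to resolve the requested path and check that it stays inside the storage root. The sketch below shows that general technique. The `safe_resolve` helper and the example directory are hypothetical, and the PR's actual check may differ. It relies on `Path.is_relative_to`, available in Python 3.9+.

```python
from pathlib import Path


def safe_resolve(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # Joining with an absolute user_path discards base, so the check below
    # also catches inputs such as "/etc/passwd".
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise PermissionError(f"Path escapes storage root: {user_path!r}")
    return candidate


storage_root = Path("/tmp/bookrag_demo")  # hypothetical storage directory
print(safe_resolve(storage_root, "tenant1/doc.pdf").name)
```

Resolving before comparing matters: a plain string prefix check would accept inputs like `docs/../../secret`.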

…features

- Updated Pydantic models in `requests.py` to include detailed field descriptions and added new fields for document metadata (created_at, document_date).
- Introduced new session management endpoints in `chat.py` for listing and deleting chat sessions, including pagination support.
- Enhanced message retrieval in `get_messages` to support pagination and return total message count.
- Added document deletion functionality in `documents.py`, including cleanup of associated files and database records.
- Updated indexing service to propagate document dates for temporal awareness during indexing.
- Modified various configuration files to update model names, API endpoints, and parameters for improved performance and compatibility.
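
The paginated message retrieval with a total count could follow a shape like the one below. This is a sketch only: the `paginate` helper, the `skip`/`limit` parameter names, and the response keys are assumptions for illustration, not the actual `get_messages` signature.

```python
from typing import Any


def paginate(items: list[Any], skip: int = 0, limit: int = 20) -> dict[str, Any]:
    """Return one page of items plus the total count across all pages."""
    if skip < 0 or limit < 1:
        raise ValueError("skip must be >= 0 and limit must be >= 1")
    return {
        "total": len(items),  # count before slicing, so clients can compute page numbers
        "skip": skip,
        "limit": limit,
        "items": items[skip : skip + limit],
    }


messages = [f"message {i}" for i in range(45)]
page = paginate(messages, skip=40, limit=20)
print(page["total"], len(page["items"]))
```

In a database-backed endpoint the slice would normally be pushed into the query (skip and limit on the cursor) with a separate count, rather than loading every message into memory.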

- Introduced `run_e2e_test.py` to facilitate the testing of the knowledge graph construction process.
- Configurations are loaded from `gbc.yaml`, with paths set for PDF input and output directory.
- The script loads an existing document tree, initializes a token tracker, and builds the knowledge graph.
- Outputs performance metrics including total nodes, edges, and token usage after graph construction.

vmoudyp added 11 commits March 2, 2026 20:21

feat: implement entity management features including list, rename, merge, and split operations

c5f6db5

feat: implement refresh token management with storage, validation, and revocation; enhance document upload size handling

0bc18ea

feat: Integrate language detection and legal heading detection features; enhance document processing with language-aware capabilities

c11b888

Refactor code structure for improved readability and maintainability

66f7ac3

Add ontology-driven entity normalization

ccdcd4b

Add config-driven Phase 3 entity resolution

4c6136d

vmoudyp wants to merge 11 commits into sam234990:main from vmoudyp:feature/docling-integration
