Add ontology normalization and config-driven Phase 3 entity resolution #3
Open
vmoudyp wants to merge 11 commits into
- Add DoclingConfig dataclass (Core/configs/docling_config.py) with ocr_engine, force_full_page_ocr, images_scale, lang fields
- Update SystemConfig to include parser (default: 'mineru') and docling config fields (Core/configs/system_config.py)
- Add Docling adapter module (Core/provider/extract_pdf_info_docling.py) with parse_doc_with_docling() that maps Docling output to the canonical pdf_list format consumed by the BookRAG pipeline
- Update build_tree_from_pdf() to branch on cfg.parser; MinerU path unchanged; Docling path added with isolated cache dir (Core/pipelines/doc_tree_builder.py)
- Add example config for scanned documents (config/docling.yaml) with force_full_page_ocr: true and easyocr engine

MinerU remains the default parser. No existing behaviour changes when parser is unset or set to 'mineru'.
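The parser switch described in this commit could look roughly like the sketch below. Only the field names (`ocr_engine`, `force_full_page_ocr`, `images_scale`, `lang`, `parser`) come from the commit message; the default values, the `select_parser` helper, and the error handling are assumptions made for illustration and are not taken from the repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DoclingConfig:
    # Field names are from the commit message; every default here is assumed.
    ocr_engine: str = "easyocr"
    force_full_page_ocr: bool = False
    images_scale: float = 2.0
    lang: List[str] = field(default_factory=lambda: ["en"])


@dataclass
class SystemConfig:
    # 'mineru' is the documented default parser.
    parser: Optional[str] = "mineru"
    docling: DoclingConfig = field(default_factory=DoclingConfig)


def select_parser(cfg: SystemConfig) -> str:
    """Hypothetical helper: decide which parsing path to take.

    An unset parser falls back to MinerU, mirroring the statement that
    behaviour is unchanged when parser is unset or set to 'mineru'.
    """
    parser = (cfg.parser or "mineru").lower()
    if parser in ("mineru", "docling"):
        return parser
    raise ValueError(f"Unknown parser: {cfg.parser!r}")
```

Under these assumptions, `select_parser(SystemConfig())` keeps the MinerU path and only an explicit `parser="docling"` selects the Docling adapter.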
Phase 1 - FalkorDB & Config Foundation:
- Add Core/configs/falkordb_config.py: FalkorDBConfig dataclass with graph_name_for_doc() and graph_name_for_global() helpers
- Add Core/configs/mongodb_config.py: MongoDBConfig dataclass with tenant_db_name() helper
- Update Core/configs/system_config.py: add tenant_id, doc_id, falkordb, mongodb fields; backward-compatible (all optional)
- Rewrite Core/Index/Graph.py: optional FalkorDB persistence alongside NetworkX; new methods _save_to_falkordb, _load_from_falkordb, _get_fdb_subgraph, save_to_global_graph, get_global_subgraph; fix pre-existing bug: node_link_graph now uses edges='links' to match node_link_data save format (NetworkX 3.x compatibility)
- Update Core/Index/GBCIndex.py: tenant/doc_id-namespaced VDB paths, pass falkordb config to Graph.load_from_dir, add rebuild_global_vdb()
- Update Core/pipelines/kg_builder.py: pass tenant_id, doc_id, falkordb_cfg from config into Graph constructor

Phase 2 - FastAPI Multi-User API:
- api/main.py: FastAPI app with lifespan, CORS, health endpoint
- …, get_current_user, require_admin, check_doc_access, filter_accessible_docs
- api/db/mongodb.py: Motor async CRUD for tenants, …
- api/models/requests.py: Pydantic request/response models
- api/routers/auth.py: POST /auth/register, POST /auth/login
- api/routers/tenants.py: tenant management (admin-gated)
- api/routers/documents.py: PDF upload + background indexing, list, status
- api/routers/chat.py: POST /chat/query, session create, message history
- api/services/indexing.py: background GBC index build in thread pool with MongoDB status tracking

Phase 3 - Cross-Document Entity Resolution:
- … FalkorDB graph, HAS_MENTION edge creation, update global VDB for canonical entities
- api/services/chat.py: parallel per-doc GBC RAG queries (up to 5 docs) with answer synthesis for cross-doc mode
- Entity resolution runs automatically after indexing (non-fatal)

Config updates (from previous session):
- config/gbc.yaml: upgrade LLM/VLM to Qwen3.5-35B-A3B-AWQ
- Core/configs/vlm_config.py: update defaults for Qwen3.5 backend
- Core/prompts/kg_prompt.py: language-agnostic NER prompt
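To make the per-tenant and per-document namespacing concrete, here is a minimal sketch of what the `graph_name_for_doc()` and `graph_name_for_global()` helpers might look like. The method names come from the commit message; the argument lists, the `host`/`port`/`graph_prefix` fields, the `_safe` helper, and the naming scheme itself are assumptions and may differ from the actual implementation.

```python
import re
from dataclasses import dataclass


def _safe(part: str) -> str:
    # Hypothetical helper: keep graph names to a conservative character set.
    return re.sub(r"[^A-Za-z0-9_]", "_", part)


@dataclass
class FalkorDBConfig:
    # All three fields and their defaults are assumed for this sketch.
    host: str = "localhost"
    port: int = 6379
    graph_prefix: str = "bookrag"

    def graph_name_for_doc(self, tenant_id: str, doc_id: str) -> str:
        """One graph per (tenant, document) pair (assumed naming scheme)."""
        return f"{self.graph_prefix}_{_safe(tenant_id)}_doc_{_safe(doc_id)}"

    def graph_name_for_global(self, tenant_id: str) -> str:
        """One cross-document graph per tenant (assumed naming scheme)."""
        return f"{self.graph_prefix}_{_safe(tenant_id)}_global"
```

Deriving both names from the tenant ID keeps one tenant's per-document graphs and its global entity graph separate from every other tenant's, which is presumably the point of the helpers.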
…rge, and split operations
…erver for AI integration

- Added path traversal prevention in the raw document download endpoint.
- Standardized error messages in entity management endpoints to "Internal server error".
- Updated permission granting logic to require document ownership for non-admin users.
- Refactored chat service to include caching for GBC indexes and improved query handling with history relevance filtering.
- Introduced a new MCP server implementation to expose BookRAG API for AI agents, including resource and tool mappings.
- Added detailed documentation for MCP server setup and usage.
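A common way to implement the path traversal check mentioned in this commit is to resolve the requested path and confirm it still lies under the storage root. The sketch below illustrates that general technique; the function name `resolve_download_path` and the choice of `PermissionError` are hypothetical, and the actual endpoint may do this differently.

```python
from pathlib import Path


def resolve_download_path(base_dir: Path, requested: str) -> Path:
    """Return the resolved file path, or raise if it escapes base_dir.

    Hypothetical helper: resolving both paths collapses '..' segments and
    symlinks before the containment check is made.
    """
    base = base_dir.resolve()
    candidate = (base / requested).resolve()
    # The candidate must be strictly inside the base directory.
    if base not in candidate.parents:
        raise PermissionError("Requested path is outside the document directory")
    return candidate
```

With this check, a request such as `../secret` or an absolute path like `/etc/passwd` is rejected before any file is opened, while `doc.pdf` resolves normally.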
…d revocation; enhance document upload size handling
…features

- Updated Pydantic models in `requests.py` to include detailed field descriptions and added new fields for document metadata (created_at, document_date).
- Introduced new session management endpoints in `chat.py` for listing and deleting chat sessions, including pagination support.
- Enhanced message retrieval in `get_messages` to support pagination and return total message count.
- Added document deletion functionality in `documents.py`, including cleanup of associated files and database records.
- Updated indexing service to propagate document dates for temporal awareness during indexing.
- Modified various configuration files to update model names, API endpoints, and parameters for improved performance and compatibility.
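The paginated message retrieval described above, which returns a page together with the total count, can be sketched as follows. The `paginate` helper, its `skip`/`limit` parameter names, and the response keys are assumptions for illustration. The real `get_messages` presumably queries MongoDB instead of slicing an in-memory list.

```python
from typing import Any, Dict, List


def paginate(items: List[Any], skip: int = 0, limit: int = 50) -> Dict[str, Any]:
    """Hypothetical helper: return one page of items plus the total count."""
    if skip < 0 or limit < 1:
        raise ValueError("skip must be >= 0 and limit must be >= 1")
    return {
        "total": len(items),  # count before pagination, so clients can page
        "skip": skip,
        "limit": limit,
        "items": items[skip:skip + limit],
    }
```

Returning `total` alongside the page lets a client work out how many pages remain without issuing a separate count request.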
…es; enhance document processing with language-aware capabilities
- Introduced `run_e2e_test.py` to facilitate the testing of the knowledge graph construction process.
- Configurations are loaded from `gbc.yaml`, with paths set for PDF input and output directory.
- The script loads an existing document tree, initializes a token tracker, and builds the knowledge graph.
- Outputs performance metrics including total nodes, edges, and token usage after graph construction.
Summary
Validation
- `.venv/bin/python -m pytest tests/test_ontology_integration.py -q`
- `.venv/bin/python -m pytest tests/test_ontology_integration.py tests/test_legal_heading_detector.py tests/test_pdf_refiner_lang.py -q`

Pull Request opened by Augment Code with guidance from the PR author