Add ontology normalization and config-driven Phase 3 entity resolution#3

Open
vmoudyp wants to merge 11 commits into sam234990:main from vmoudyp:feature/docling-integration

vmoudyp commented Mar 6, 2026

Summary

  • add ontology-driven canonical entity normalization across config, indexing, graph persistence, and retrieval
  • add a short ontology usage guide and conservative Phase 3 tenant/global entity-resolution config
  • preserve ontology metadata in tenant-global VDB records and add targeted tests
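As a rough illustration of what ontology-driven canonical normalization can look like, the sketch below maps entity mentions to a canonical name through an alias table. The `Ontology` class, its `aliases` field, and the `normalize` method are hypothetical names chosen for this example; the PR's actual ontology format and matching rules are not shown in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical ontology: canonical entity names mapped to known aliases."""

    aliases: dict[str, list[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Build a case-insensitive lookup from every alias (and the canonical
        # name itself) to the canonical name.
        self._lookup: dict[str, str] = {}
        for canonical, names in self.aliases.items():
            for name in [canonical, *names]:
                self._lookup[name.strip().casefold()] = canonical

    def normalize(self, mention: str) -> str:
        """Return the canonical name, or the stripped mention if it is unknown."""
        cleaned = mention.strip()
        return self._lookup.get(cleaned.casefold(), cleaned)


# Example data is illustrative only.
ontology = Ontology(aliases={"European Union": ["EU", "E.U."]})
```

Unknown mentions pass through unchanged in this sketch, which keeps normalization conservative: an entity is only rewritten when the ontology explicitly lists it.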

Validation

  • .venv/bin/python -m pytest tests/test_ontology_integration.py -q
  • .venv/bin/python -m pytest tests/test_ontology_integration.py tests/test_legal_heading_detector.py tests/test_pdf_refiner_lang.py -q

Pull Request opened by Augment Code with guidance from the PR author

vmoudyp added 11 commits March 2, 2026 20:21
- Add DoclingConfig dataclass (Core/configs/docling_config.py)
  with ocr_engine, force_full_page_ocr, images_scale, lang fields

- Update SystemConfig to include parser (default: 'mineru') and
  docling config fields (Core/configs/system_config.py)

- Add Docling adapter module (Core/provider/extract_pdf_info_docling.py)
  with parse_doc_with_docling() that maps Docling output to the
  canonical pdf_list format consumed by the BookRAG pipeline

- Update build_tree_from_pdf() to branch on cfg.parser;
  MinerU path unchanged; Docling path added with isolated cache dir
  (Core/pipelines/doc_tree_builder.py)

- Add example config for scanned documents (config/docling.yaml)
  with force_full_page_ocr: true and easyocr engine

MinerU remains the default parser. No existing behaviour changes
when parser is unset or set to 'mineru'.
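A minimal sketch of the config shape and parser branch described in this commit. The field names (`ocr_engine`, `force_full_page_ocr`, `images_scale`, `lang`) and the `'mineru'` default come from the commit message; the field types, the default values, and the `select_parser` helper are assumptions made for illustration and stand in for the branch inside `build_tree_from_pdf()`.

```python
from dataclasses import dataclass, field


@dataclass
class DoclingConfig:
    # Field names follow the commit message; types and defaults are assumed.
    ocr_engine: str = "easyocr"
    force_full_page_ocr: bool = False
    images_scale: float = 1.0
    lang: list[str] = field(default_factory=lambda: ["en"])


@dataclass
class SystemConfig:
    parser: str = "mineru"  # MinerU remains the default parser
    docling: DoclingConfig = field(default_factory=DoclingConfig)


def select_parser(cfg: SystemConfig) -> str:
    """Hypothetical helper: decide which parsing path to take."""
    # An unset parser falls back to MinerU, so existing configs keep working.
    parser = (cfg.parser or "mineru").lower()
    if parser in ("mineru", "docling"):
        return parser
    raise ValueError(f"Unknown parser: {cfg.parser!r}")
```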

Phase 1 - FalkorDB & Config Foundation:
- Add Core/configs/falkordb_config.py: FalkorDBConfig dataclass with
  graph_name_for_doc() and graph_name_for_global() helpers
- Add Core/configs/mongodb_config.py: MongoDBConfig dataclass with
  tenant_db_name() helper
- Update Core/configs/system_config.py: add tenant_id, doc_id,
  falkordb, mongodb fields; backward-compatible (all optional)
- Rewrite Core/Index/Graph.py: optional FalkorDB persistence alongside
  NetworkX; new methods _save_to_falkordb, _load_from_falkordb,
  _get_fdb_subgraph, save_to_global_graph, get_global_subgraph;
  fix pre-existing bug: node_link_graph now uses edges='links' to match
  node_link_data save format (NetworkX 3.x compatibility)
- Update Core/Index/GBCIndex.py: tenant/doc_id-namespaced VDB paths,
  pass falkordb config to Graph.load_from_dir, add rebuild_global_vdb()
- Update Core/pipelines/kg_builder.py: pass tenant_id, doc_id,
  falkordb_cfg from config into Graph constructor
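The naming helpers listed above could look roughly like the following. Only the class and method names (`FalkorDBConfig.graph_name_for_doc`, `graph_name_for_global`, `MongoDBConfig.tenant_db_name`) come from the commit message; the method parameters, the other fields, and the naming scheme itself are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass
class FalkorDBConfig:
    host: str = "localhost"        # assumed field
    port: int = 6379               # assumed field
    graph_prefix: str = "bookrag"  # assumed field

    def graph_name_for_doc(self, tenant_id: str, doc_id: str) -> str:
        # One graph per (tenant, document) pair; the format is an assumption.
        return f"{self.graph_prefix}:{tenant_id}:{doc_id}"

    def graph_name_for_global(self, tenant_id: str) -> str:
        # One shared graph per tenant for cross-document entities.
        return f"{self.graph_prefix}:{tenant_id}:global"


@dataclass
class MongoDBConfig:
    uri: str = "mongodb://localhost:27017"  # assumed field
    db_prefix: str = "bookrag"              # assumed field

    def tenant_db_name(self, tenant_id: str) -> str:
        # Separate database per tenant; the format is an assumption.
        return f"{self.db_prefix}_{tenant_id}"
```

Deriving every graph and database name from the tenant and document IDs in one place keeps per-tenant isolation consistent across the indexing and API layers.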

Phase 2 - FastAPI Multi-User API:
- api/main.py: FastAPI app with lifespan, CORS, health endpoint
- …,
  get_current_user, require_admin, check_doc_access, filter_accessible_docs
- api/db/mongodb.py: Motor async CRUD for tenants, …
- api/models/requests.py: Pydantic request/response models
- api/routers/auth.py: POST /auth/register, POST /auth/login
- api/routers/tenants.py: tenant management (admin-gated)
- api/routers/documents.py: PDF upload + background indexing, list, status
- api/routers/chat.py: POST /chat/query, session create, message history
- api/services/indexing.py: background GBC index build in thread pool
  with MongoDB status tracking
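To illustrate the background-indexing pattern from `api/services/indexing.py`, here is a minimal sketch that runs a build in a thread pool and records its status. Everything in it is hypothetical: an in-memory dict stands in for the MongoDB status collection, `build_index` is a placeholder for the real GBC index build, and the status strings are invented for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# In-memory stand-in for the MongoDB status collection (assumption).
status_store: dict[str, str] = {}
executor = ThreadPoolExecutor(max_workers=2)


def build_index(doc_id: str) -> None:
    """Placeholder for the real GBC index build."""
    if doc_id == "bad":
        raise RuntimeError("parse failed")


def _run(doc_id: str) -> None:
    # Record each state transition so a status endpoint can report progress.
    status_store[doc_id] = "indexing"
    try:
        build_index(doc_id)
    except Exception as exc:
        status_store[doc_id] = f"failed: {exc}"
    else:
        status_store[doc_id] = "ready"


def submit_indexing(doc_id: str) -> Future:
    """Queue a document for indexing without blocking the caller."""
    status_store[doc_id] = "queued"
    return executor.submit(_run, doc_id)
```

Catching the exception inside the worker means a failed build is visible through the status record instead of being lost with the future.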

Phase 3 - Cross-Document Entity Resolution:
- … FalkorDB graph,
  HAS_MENTION edge creation, update global VDB for canonical entities
- api/services/chat.py: parallel per-doc GBC RAG queries (up to 5 docs)
  with answer synthesis for cross-doc mode
- Entity resolution runs automatically after indexing (non-fatal)
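The cross-document query flow can be sketched as a capped fan-out followed by a synthesis step. The cap of 5 documents comes from the commit message; `query_doc`, `synthesize`, and `cross_doc_query` are hypothetical names, and the real service presumably calls the GBC RAG pipeline and an LLM where this sketch uses placeholders.

```python
import asyncio

MAX_DOCS = 5  # cap taken from the commit message


async def query_doc(doc_id: str, question: str) -> str:
    """Placeholder for a per-document GBC RAG query."""
    await asyncio.sleep(0)
    return f"[{doc_id}] answer to {question!r}"


def synthesize(answers: list[str]) -> str:
    """Placeholder synthesis step: simply joins the per-document answers."""
    return "\n".join(answers)


async def cross_doc_query(doc_ids: list[str], question: str) -> str:
    selected = doc_ids[:MAX_DOCS]
    # Run the per-document queries concurrently; a failure in one document
    # is returned as an exception object instead of cancelling the others.
    results = await asyncio.gather(
        *(query_doc(doc_id, question) for doc_id in selected),
        return_exceptions=True,
    )
    answers = [r for r in results if not isinstance(r, BaseException)]
    return synthesize(answers)
```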

Config updates (from previous session):
- config/gbc.yaml: upgrade LLM/VLM to Qwen3.5-35B-A3B-AWQ
- Core/configs/vlm_config.py: update defaults for Qwen3.5 backend
- Core/prompts/kg_prompt.py: language-agnostic NER prompt
…erver for AI integration

- Added path traversal prevention in the raw document download endpoint.
- Standardized error messages in entity management endpoints to "Internal server error".
- Updated permission granting logic to require document ownership for non-admin users.
- Refactored chat service to include caching for GBC indexes and improved query handling with history relevance filtering.
- Introduced a new MCP server implementation to expose BookRAG API for AI agents, including resource and tool mappings.
- Added detailed documentation for MCP server setup and usage.
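One common way to prevent path traversal in a download endpoint is to resolve the requested path and confirm it still sits under the storage directory. The sketch below shows that check with `pathlib` (Python 3.9+ for `is_relative_to`); the `safe_resolve` helper and the directory name are assumptions, not the code from this commit.

```python
from pathlib import Path


def safe_resolve(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise PermissionError("path escapes the document directory")
    return candidate
```

An absolute `user_path` replaces the base when joined, so it is rejected by the same check as a `../` sequence.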
…d revocation; enhance document upload size handling
…features

- Updated Pydantic models in `requests.py` to include detailed field descriptions and added new fields for document metadata (created_at, document_date).
- Introduced new session management endpoints in `chat.py` for listing and deleting chat sessions, including pagination support.
- Enhanced message retrieval in `get_messages` to support pagination and return total message count.
- Added document deletion functionality in `documents.py`, including cleanup of associated files and database records.
- Updated indexing service to propagate document dates for temporal awareness during indexing.
- Modified various configuration files to update model names, API endpoints, and parameters for improved performance and compatibility.
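The pagination behaviour described for `get_messages` (a page of results plus the total count) can be sketched as below. The `paginate` function, its parameter names, and the response keys are hypothetical; the actual endpoint likely applies skip and limit in the database query instead of slicing a list in memory.

```python
def paginate(items: list, page: int = 1, page_size: int = 20) -> dict:
    """Return one page of items together with the total item count."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return {
        "total": len(items),  # lets clients compute the number of pages
        "page": page,
        "page_size": page_size,
        "items": items[start:start + page_size],
    }
```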
…es; enhance document processing with language-aware capabilities
- Introduced `run_e2e_test.py` to facilitate the testing of the knowledge graph construction process.
- Configurations are loaded from `gbc.yaml`, with paths set for PDF input and output directory.
- The script loads an existing document tree, initializes a token tracker, and builds the knowledge graph.
- Outputs performance metrics including total nodes, edges, and token usage after graph construction.