Easy Knowledge Retriever is a powerful and flexible library for building Retrieval-Augmented Generation (RAG) systems with integrated Knowledge Graph support. It allows you to easily ingest documents, build a structured knowledge base (combining vector embeddings and graph relations), and perform advanced queries using Large Language Models (LLMs).
- Multimodal Ingestion: Parse and ingest PDF data containing images, tables, equations, and more, based on MinerU.
- Hybrid Retrieval: Combines vector similarity search with knowledge graph exploration for more context-aware answers.
- Smart Graph Re-ranking: Uses local centrality algorithms (PageRank) to filter and prioritize the most semantically relevant graph edges for the user query.
- Knowledge Graph Integration: Automatically extracts entities and relationships from your text documents.
- Modular Storage: Supports various backends for Key-Value pairs, Vector Stores, and Graph Storage (e.g., JSON, NanoVectorDB, NetworkX, Neo4j, Milvus).
- LLM Agnostic: Designed to work with OpenAI-compatible LLM APIs (OpenAI, Gemini via OpenAI adapter, etc.).
- Async Support: Built with `asyncio` for high-performance ingestion and retrieval.
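The graph re-ranking feature above can be sketched with a minimal, pure-Python PageRank. This is an illustration only, not the library's implementation — in practice the ranking runs over the extracted knowledge graph (e.g. via NetworkX), and the toy edge list below is hypothetical:

```python
def pagerank(edges, damping=0.85, iters=50):
    """Plain power-iteration PageRank over an undirected edge list."""
    nodes = {n for edge in edges for n in edge}
    nbrs = {n: [] for n in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node shares its rank equally among its neighbours.
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[m] / len(nbrs[m]) for m in nbrs[n])
            for n in nodes
        }
    return rank

# Toy graph: entities that might be extracted around a "forest fire" query.
edges = [("fire", "forest"), ("fire", "smoke"), ("forest", "tree"), ("fire", "tree")]
scores = pagerank(edges)
top = max(scores, key=scores.get)  # the most central entity
```

Edges touching the highest-scoring entities would be kept; peripheral ones filtered out.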
You can install the library via pip:
```shell
pip install easy-knowledge-retriever
```

This guide will show you how to build a database from PDF documents and then query it.
During this step, documents are processed, chunked, and embedded, and entities/relations are extracted to build the Knowledge Graph.
```python
import asyncio
import os

from easy_knowledge_retriever import EasyKnowledgeRetriever
from easy_knowledge_retriever.llm.service import OpenAILLMService, OpenAIEmbeddingService
from easy_knowledge_retriever.kg.json_kv_impl import JsonKVStorage
from easy_knowledge_retriever.kg.nano_vector_db_impl import NanoVectorDBStorage
from easy_knowledge_retriever.kg.networkx_impl import NetworkXStorage
from easy_knowledge_retriever.kg.json_doc_status_impl import JsonDocStatusStorage


async def build_database():
    # 1. Configure services
    # Replace with your actual API keys and endpoints
    embedding_service = OpenAIEmbeddingService(
        api_key="your-embedding-api-key",
        base_url="https://api.openai.com/v1",  # or compatible
        model="text-embedding-3-small",
        embedding_dim=1536,
    )
    llm_service = OpenAILLMService(
        model="gpt-4o",
        api_key="your-llm-api-key",
        base_url="https://api.openai.com/v1",
    )

    # 2. Initialize the retriever with specific storage backends
    working_dir = "./rag_data"
    rag = EasyKnowledgeRetriever(
        working_dir=working_dir,
        llm_service=llm_service,
        embedding_service=embedding_service,
        kv_storage=JsonKVStorage(),
        vector_storage=NanoVectorDBStorage(cosine_better_than_threshold=0.2),
        graph_storage=NetworkXStorage(),
        doc_status_storage=JsonDocStatusStorage(),
    )
    await rag.initialize_storages()

    try:
        # 3. Ingest documents
        pdf_path = "./documents/example.pdf"
        if os.path.exists(pdf_path):
            print(f"Ingesting {pdf_path}...")
            await rag.ingest(pdf_path)
            print("Ingestion complete.")
        else:
            print("Please provide a valid PDF path.")
    finally:
        # Always finalize to save state
        await rag.finalize_storages()


if __name__ == "__main__":
    asyncio.run(build_database())
```

Once the database is built, you can query it.
```python
import asyncio

from easy_knowledge_retriever import EasyKnowledgeRetriever
from easy_knowledge_retriever.retrieval import MixRetrieval
from easy_knowledge_retriever.llm.service import OpenAILLMService, OpenAIEmbeddingService
from easy_knowledge_retriever.kg.json_kv_impl import JsonKVStorage
from easy_knowledge_retriever.kg.nano_vector_db_impl import NanoVectorDBStorage
from easy_knowledge_retriever.kg.networkx_impl import NetworkXStorage
from easy_knowledge_retriever.kg.json_doc_status_impl import JsonDocStatusStorage


async def query_knowledge_base():
    # 1. Re-initialize services (same config as build)
    embedding_service = OpenAIEmbeddingService(
        api_key="your-embedding-api-key",
        base_url="https://api.openai.com/v1",
        model="text-embedding-3-small",
        embedding_dim=1536,
    )
    llm_service = OpenAILLMService(
        model="gpt-4o",
        api_key="your-llm-api-key",
        base_url="https://api.openai.com/v1",
    )

    # 2. Load the existing retriever
    working_dir = "./rag_data"
    rag = EasyKnowledgeRetriever(
        working_dir=working_dir,
        llm_service=llm_service,
        embedding_service=embedding_service,
        kv_storage=JsonKVStorage(),
        vector_storage=NanoVectorDBStorage(cosine_better_than_threshold=0.2),
        graph_storage=NetworkXStorage(),
        doc_status_storage=JsonDocStatusStorage(),
    )
    await rag.initialize_storages()

    try:
        # 3. Perform a query using the MixRetrieval strategy,
        # which combines vector search and the knowledge graph
        query_text = "What does the document say about forest fires?"
        print(f"Querying: {query_text}")
        result = await rag.aquery(query_text, retrieval=MixRetrieval())
        print("\nAnswer:")
        print(result)
    finally:
        await rag.finalize_storages()


if __name__ == "__main__":
    asyncio.run(query_knowledge_base())
```

Easy Knowledge Retriever integrates MinerU (based on magic-pdf) to handle complex PDF documents, preserving layouts and extracting images.
The ingestion process orchestrates several advanced steps to transform raw documents into a rich knowledge base:
- Parsing with MinerU: The system uses MinerU (based on Magic-PDF) to extract text, tables, and images with high structural fidelity.
- Multimodal Enrichment: Extracted images are processed by a Vision Language Model (VLM, e.g., GPT-4o). The VLM generates descriptive summaries which are injected directly into the text context, making visual data searchable.
- Page-Aware Chunking: The text is split into chunks using a sliding window approach that preserves the mapping to original page numbers for precise citations.
- Knowledge Graph Extraction: An LLM extracts entities and relationships from the chunks. It performs iterative gleaning to ensure no details are missed, building a structured graph of knowledge alongside vector embeddings.
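The page-aware chunking step above can be illustrated with a simplified sliding-window chunker. The function name and parameters here are hypothetical — the library's real chunker may differ (e.g. token-based sizes, cross-page windows):

```python
def chunk_pages(pages, size=200, overlap=50):
    """Split per-page texts into overlapping chunks, recording the page number."""
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        start = 0
        while start < len(text):
            chunks.append({"page": page_no, "text": text[start:start + size]})
            if start + size >= len(text):
                break
            start += size - overlap  # slide the window, keeping `overlap` chars
    return chunks

pages = ["A" * 300]  # stand-in for one page of extracted text
chunks = chunk_pages(pages, size=200, overlap=50)
# Each chunk keeps its source page number, enabling precise citations later.
```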
Ensure that the mineru dependencies are installed (included in requirements.txt) and that you are using a Vision-capable LLM model.
The usage remains simple:

```python
# ... Initialize RAG as shown in Quick Start ...

# Ingest a complex PDF with images; the system automatically
# handles image extraction and summarization.
await rag.ingest("./documents/complex_report_with_charts.pdf")
```

The retrieval process employs a hybrid strategy that orchestrates parallel searches to capture both semantic meaning and explicit knowledge connections.
- Keyword Extraction: An LLM extracts both high-level concepts (for thematic search) and low-level entities (for specific details) from the user query.
- Parallel Search:
- Local Search: Navigates the Knowledge Graph using low-level entities to find direct neighbors and details.
- Global Search: Explores broader relationships in the Knowledge Graph using high-level concepts.
- Vector Search: Finds semantically similar text chunks from the vector database using the query embedding.
- Fusion & Context Building: Results from all sources are merged, deduplicated, and mapped back to their original source text chunks. This comprehensive context is then provided to the LLM to generate an accurate, grounded response.
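The fusion step above can be sketched as a round-robin merge with deduplication. This is a simplified illustration (the function and sample data are hypothetical), not the library's actual fusion logic:

```python
def fuse_results(*result_lists):
    """Merge ranked (chunk_id, text) lists from multiple retrievers,
    dropping duplicates while favouring items that ranked early anywhere."""
    seen = set()
    fused = []
    # Interleave: take rank-0 items from every source, then rank-1, ...
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results):
                chunk_id, text = results[rank]
                if chunk_id not in seen:
                    seen.add(chunk_id)
                    fused.append(text)
    return fused

local = [("c1", "entity detail"), ("c2", "neighbour fact")]    # graph results
vector = [("c3", "similar passage"), ("c1", "entity detail")]  # vector results
context = fuse_results(local, vector)  # "c1" appears only once
```

The fused list would then be formatted into the prompt context for the LLM.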
Easy Knowledge Retriever offers flexible retrieval strategies to suit different use cases:
- Naive (`naive`): Standard vector search on text chunks. Best for simple exact matches.
- Local (`local`): Entity-focused graph retrieval. Best for specific details about entities.
- Global (`global`): Relation-focused graph retrieval. Best for broad thematic questions.
- Hybrid (`hybrid`): Combines Local and Global graph retrieval.
- Mix (`mix`): Combines graph (Hybrid) and vector (Naive) retrieval. Recommended default for best performance.
- HybridMix (`hybrid_mix`): Advanced chunk search combining dense (vector) and sparse (BM25) search with fusion.
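As an illustration of dense + sparse fusion (the kind of merge `hybrid_mix` performs), Reciprocal Rank Fusion is a common technique for combining two rankings; whether the library uses exactly this formula is an assumption here:

```python
def rrf(dense, sparse, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in (dense, sparse):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Documents that rank well in BOTH lists float to the top.
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # vector-similarity order
sparse = ["d3", "d1", "d4"]  # BM25 order
fused = rrf(dense, sparse)
```

Note that `d1` wins: it is first in the dense list and second in the sparse one, beating documents that appear high in only one list.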
For detailed workflows and comparisons, see Retrieval Strategies Documentation.
You can swap out storage implementations by creating instances of different classes from `easy_knowledge_retriever.kg.*`:
- Vector Storage: `NanoVectorDBStorage` (local, lightweight), `MilvusStorage` (scalable).
- Graph Storage: `NetworkXStorage` (in-memory/JSON, simple), `Neo4jStorage` (robust graph DB).
- KV Storage: `JsonKVStorage`, `RedisKVStorage` (if available), etc.
Example for Neo4j:
```python
from easy_knowledge_retriever.kg.neo4j_impl import Neo4jStorage

graph_storage = Neo4jStorage(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="password",
)
```

For a complete, up-to-date list of all services (LLM, Vector/KV/Graph/Doc Status) and their configuration options, see:
- docs/ServiceCatalog.md
To set up the project for development:
- Clone the repository.
- Install dependencies: `pip install -r requirements.txt`.
- Install the package in editable mode: `pip install -e .`.

(Instructions for running tests if applicable)

```shell
pytest
```

The system is evaluated using the RAGAS framework on two distinct datasets to assess performance across different domains and document types.
This dataset consists of plain text files covering general topics such as Forest Fires and Childbirth. It tests the system's ability to retrieve information from unstructured text.
Results (Dataset 1):
- Faithfulness: 0.99. The Faithfulness metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency.
- Context Recall: 1.0. Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved; higher recall means fewer relevant documents were left out.
- Answer Relevancy: 0.78 (Gemini 2.0 Flash Lite) / 0.81 (Gemini 2.5 Flash Lite). Answer Relevancy assesses how pertinent the generated answer is to the given prompt: incomplete or redundant answers receive lower scores, and higher scores indicate better relevancy.
This dataset comprises scientific research papers in PDF format, specifically focusing on Deep Reinforcement Learning (DRL) for Autonomous Vehicle Intersection Management. It evaluates the system's capability to handle complex documents, scientific terminology, and multi-modal content.
Results (Dataset 2):
- Faithfulness: 1.0 (in both cases)
- Context Recall: 1.0 (in both cases)
- Answer Relevancy:
- 0.92 with Gemini 2.5 Flash Lite (Standard Retrieval without reranker).
- 0.96 with Gemini 2.5 Flash Lite using a Reranker and HybridMixRetrieval (combining Hybrid Vector Search + Knowledge Graph + Query Decomposition).
This project draws inspiration and references from the following projects:
This project is licensed under the Creative Commons Attribution–NonCommercial–ShareAlike 4.0 International (CC BY‑NC‑SA 4.0).
- You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- You may not use the material for commercial purposes.
- If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Full legal text: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Summary (EN): https://creativecommons.org/licenses/by-nc-sa/4.0/