Advanced OCR, Vector Search & Knowledge Graph Extraction for Historical Medical Documents
- Advanced OCR Engine: Gemini AI-powered text extraction with image preprocessing
- Vector Search: Google File Search for semantic document retrieval
- Knowledge Graph: CAMEL-AI + Neo4j for entity extraction and relationship mapping
- RAG Support: Retrieval Augmented Generation for contextual answers
graph TD
User([User]) -->|Upload PDF| Flask
subgraph "Pipeline A: Semantic Search"
Flask -->|1. Extract Metadata| Meta[Metadata Extractor]
Meta -->|2. Index Document| VectorDB[(☁️ Google Vector Store)]
User -->|Query| VectorDB
end
subgraph "Pipeline B: Knowledge Graph"
Flask -->|1. Vision OCR| OCR[👁️ OCR Engine]
OCR -->|2. Raw Text| Agent[🤖 Camel-AI Agent]
Agent -->|3. Extract Entities| GEMINI[GEMINI_2-0-FLASH]
GEMINI -->|4. Commit Data| Neo4j[(🧠 Neo4j Graph DB)]
end
- Python 3.11+
- Google Cloud Platform account (for Gemini AI)
- Neo4j database
- Groq account
- Clone the repository:
    git clone <repository-url>
    cd file-search
- Install dependencies:
    uv sync
- Set up environment variables. Create a .env file in the project root:
    # Google AI Services
    GEMINI_API_KEY=your_gemini_api_key_here
    # Neo4j Database
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USERNAME=neo4j
    NEO4J_PASSWORD=your_neo4j_password
    # Optional
    PORT=5000
- Run the application:
    uv run main.py
- Open your browser and navigate to http://localhost:5000
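To catch a missing variable before the app starts, the environment can be checked up front. The following is a minimal sketch, not the project's actual startup code: `Settings` and `load_settings` are hypothetical names, the default port of 5000 is an assumption based on the URL above, and the variables are assumed to be already exported into the process environment (the sketch does not parse the .env file itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the .env example; PORT is optional.
REQUIRED_KEYS = ("GEMINI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values read from the environment."""
    gemini_api_key: str
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, failing early if a required key is unset or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        neo4j_uri=env["NEO4J_URI"],
        neo4j_username=env["NEO4J_USERNAME"],
        neo4j_password=env["NEO4J_PASSWORD"],
        port=int(env.get("PORT", "5000")),  # assumed default, matching the URL above
    )
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.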
- Upload Documents: Drag and drop PDF files onto the upload area
- Ask Questions: Use the chat interface to query processed documents
- View Library: Monitor processed documents in the sidebar
- Delete Documents: Remove documents from the search index
- Enhancement Levels: light, medium, aggressive
- DPI Settings: 200 or 300 for scan quality
- Medical Context: Specialized prompts for medical terminology
- Preprocessing: Image enhancement and deskewing
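The OCR options above can be sketched as a small validated settings object. This is an illustration only: `OcrSettings` and `page_pixels` are hypothetical names and do not describe the actual interface of ocr_engine.py. The helper shows how the DPI choice determines the size of the rendered page image.

```python
from dataclasses import dataclass

# Allowed values as listed in the OCR configuration.
ENHANCEMENT_LEVELS = ("light", "medium", "aggressive")
SUPPORTED_DPI = (200, 300)


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical settings object; rejects values outside the documented options."""
    enhancement: str
    dpi: int

    def __post_init__(self) -> None:
        if self.enhancement not in ENHANCEMENT_LEVELS:
            raise ValueError(f"enhancement must be one of {ENHANCEMENT_LEVELS}")
        if self.dpi not in SUPPORTED_DPI:
            raise ValueError(f"dpi must be one of {SUPPORTED_DPI}")

    def page_pixels(self, width_in: float, height_in: float) -> tuple[int, int]:
        """Pixel size of a page of the given physical size (inches) at this DPI."""
        return round(width_in * self.dpi), round(height_in * self.dpi)
```

For a US Letter page (8.5 by 11 inches), `OcrSettings("medium", 300).page_pixels(8.5, 11)` gives 2550 by 3300 pixels, against 1700 by 2200 at 200 DPI, so the higher setting produces an image with 2.25 times as many pixels.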
- Chunk Size: 512 tokens per chunk
- Overlap: 50 tokens between chunks
- Model: Gemini-2.0-flash for search and responses
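To make the chunk size and overlap settings concrete, here is a minimal sketch of sliding-window chunking. It only illustrates what the two numbers mean: the actual chunking is presumably performed by the vector store service, and `chunk_tokens` is a hypothetical helper that works on any pre-tokenized list.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 512, overlap: int = 50) -> List[Sequence[T]]:
    """Split `tokens` into windows of `size` items, each sharing `overlap` items with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # each window starts this many tokens after the previous one
    chunks: List[Sequence[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With the settings above, each new chunk starts 462 tokens after the previous one, so a 1000-token document yields three chunks starting at tokens 0, 462, and 924.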
- Node Types:
  - ClinicalObservation
  - TherapeuticOutcome
  - ContextualFactor
  - MechanisticConcept
  - TherapeuticApproach
  - SourceText
- Relationship Types:
  - co_occurs_with
  - preceded_by/followed_by
  - modified_by
  - responds_to
  - associated_with
  - results_in
  - described_in
  - contradicts/corroborates
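Because Cypher does not allow node labels or relationship types to be passed as query parameters, one common approach is to check them against the schema before building the query text. The sketch below assumes that approach; it is not taken from kg_agents.py. `build_merge_query` and the `name` property are hypothetical, and the paired entries (preceded_by/followed_by, contradicts/corroborates) are treated as separate relationship types, which is also an assumption.

```python
# Schema as listed above; the paired relationship entries are split into separate types.
NODE_TYPES = frozenset({
    "ClinicalObservation", "TherapeuticOutcome", "ContextualFactor",
    "MechanisticConcept", "TherapeuticApproach", "SourceText",
})
RELATIONSHIP_TYPES = frozenset({
    "co_occurs_with", "preceded_by", "followed_by", "modified_by",
    "responds_to", "associated_with", "results_in", "described_in",
    "contradicts", "corroborates",
})


def build_merge_query(source_type: str, relation: str, target_type: str) -> str:
    """Return a Cypher MERGE statement for one extracted triple.

    Labels and the relationship type are validated against the schema because
    they are interpolated into the query text; the entity names stay as
    parameters ($source_name, $target_name).
    """
    for node_type in (source_type, target_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"Unknown node type: {node_type!r}")
    if relation not in RELATIONSHIP_TYPES:
        raise ValueError(f"Unknown relationship type: {relation!r}")
    return (
        f"MERGE (a:{source_type} {{name: $source_name}}) "
        f"MERGE (b:{target_type} {{name: $target_name}}) "
        f"MERGE (a)-[:{relation}]->(b)"
    )
```

The returned string can then be executed with the entity names supplied as parameters, for example through the Neo4j Python driver's `session.run`. Rejecting unknown types up front also keeps model output that strays from the schema out of the graph.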
├── main.py # Flask application entry point
├── file_search.py # Vector search engine (Google AI)
├── ocr_engine.py # Advanced OCR with Gemini Vision
├── kg_agents.py # Knowledge graph extraction agent
├── templates/
│ └── index.html # Web interface
├── uploads/ # Temporary file storage
├── debug_images/ # OCR preprocessing debug output
├── pyproject.toml # Python dependencies
└── Procfile # Heroku deployment config