AI-Based Digitization & Digital Archive Builder for Libraries
A production-grade desktop application that converts scanned documents into searchable, metadata-rich digital archives using a guided workflow.
LibraDigit AI is an offline-first desktop application designed for librarians, archivists, and digitization teams to:
- Convert scanned PDFs/images to searchable documents using OCR.
- Detect scanned (image-based) PDFs automatically and apply OCR to them.
- Run advanced OCR with AI-powered layout analysis that detects tables, forms, signatures, and page structure.
- Convert handwritten notes into formatted, searchable PDFs.
- Clean and improve OCR text accuracy.
- Add comprehensive metadata (title, author, year, subject, keywords).
- Generate structured digital archives with organized folder hierarchies.
- Search across an entire archive using a dedicated Full-Text Search engine.
- Analyze digitization progress with a built-in statistics dashboard.
- Scanned PDF OCR with Handwritten Support: Automatically detects PDFs with embedded images and applies intelligent OCR. Switches to handwritten mode (LSTM) when handwriting is detected on any page.
- Intelligent Layout Understanding: Automatically detects page structure including headers, footers, stamps, and signatures.
- Table & Form Extraction: Identifies and extracts structured data from tables and form fields with checkbox detection.
- Auto-Orientation Correction: Automatically detects and corrects page rotation (0°, 90°, 180°, 270°).
- Handwritten Text Recognition: Specialized LSTM neural network for improved handwriting accuracy (75-92%).
- Enhanced Preprocessing: CLAHE enhancement, adaptive thresholding, and advanced denoising for better accuracy.
- Handwritten to PDF: Convert handwritten notes directly to professionally formatted, searchable PDF documents.
- Full-Text Search (FTS5): Powered by SQLite's FTS5, search instantly through thousands of archived documents.
- Content-Aware Snippets: Search results show exactly where terms appear with keyword highlighting.
- Universal Metadata Search: Find documents by Title, Author, Keywords, or any content within the text.
- Workflow Visualization: Track project distribution across Upload, OCR, Cleanup, Metadata, and Archived stages.
- Storage Metrics: Real-time tracking of disk space usage by your digital collection.
- Activity Trends: Weekly activity charts showing your digitization team's productivity.
- Top Subjects: Bar charts showcasing the most represented subjects in your archive.
- Secure Offline Auth: Uses bcryptjs to hash passwords for local authentication.
- First-Run Setup: Guided password setup on the first launch.
- Privacy-First: Zero cloud dependency; all data, hashes, and files stay exclusively on your local machine.
- Responsive Design: Optimized for everything from desktop monitors to mobile devices.
- Multi-tab Synchronization: Log out or delete a project in one browser tab, and all other tabs will instantly synchronize.
- Premium Aesthetics: High-end dark theme with smooth gradients and micro-animations.
- BagIt Packaging: Implements the international BagIt standard for robust, verifiable data packages.
- XMP Metadata Embedding: Metadata (Title, Author, etc.) is embedded directly into the PDF binary, traveling with the file even when shared.
- MD5 Manifests: Per-file MD5 checksums let you verify, at any later date, that archived files have not been corrupted.
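As a rough illustration of how an MD5 manifest supports integrity checks, the sketch below writes one `<md5>  <relative path>` line per payload file and later re-hashes the files to report mismatches. It uses only the Python standard library. The function names `md5_of`, `write_manifest`, and `verify_manifest` are hypothetical and are not taken from the LibraDigit AI backend, which lists Bagit-Python for packaging.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> Path:
    """Write manifest-md5.txt covering every file under bag_dir/data."""
    lines = []
    for file_path in sorted((bag_dir / "data").rglob("*")):
        if file_path.is_file():
            relative = file_path.relative_to(bag_dir).as_posix()
            lines.append(f"{md5_of(file_path)}  {relative}")
    manifest = bag_dir / "manifest-md5.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(bag_dir: Path) -> List[str]:
    """Return relative paths that are missing or whose checksum has changed."""
    damaged = []
    manifest = bag_dir / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file() or md5_of(target) != expected:
            damaged.append(relative)
    return damaged
```

An empty list from `verify_manifest` means every listed file still matches its recorded checksum. MD5 is adequate for spotting accidental corruption, but it does not protect against deliberate tampering.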
- Node.js (v18 or higher)
  - Download: https://nodejs.org/
- Python (v3.8 or higher)
  - Download: https://www.python.org/downloads/
- Tesseract OCR (for OCR functionality)
  - Windows: Download installer from https://github.com/UB-Mannheim/tesseract/wiki
  - macOS: brew install tesseract
  - Linux: sudo apt-get install tesseract-ocr
- OpenCV (for advanced image processing)
  - Installed automatically via requirements.txt
  - Required for: Advanced OCR, handwritten text recognition, table detection
- ReportLab (for PDF generation)
  - Installed automatically via requirements.txt
  - Required for: Handwritten to PDF conversion
    cd "LibraDigit AI"
    # Install frontend packages
    npm install
    # Install backend packages (includes OpenCV, NumPy, ReportLab)
    cd backend
    pip install -r requirements.txt
    cd ..

The easiest way to run the application is using the combined dev script:

    npm run dev

This will:
- Start the React frontend (Vite)
- Start the Python backend (Flask)
- Launch the Electron desktop window
- Launch & Setup: On first run, create your master password.
- Upload Document: Drag and drop a PDF or image file (PDF, PNG, JPEG, TIFF). Scanned PDFs are automatically detected.
- Choose OCR Method:
  - Standard OCR: Fast text extraction for printed documents and scanned PDFs
  - Advanced OCR: AI-powered analysis with table detection, form recognition, and layout understanding (images only)
  - Handwritten to PDF: Convert handwritten notes to formatted, searchable PDFs (images only)
- Run OCR: Tesseract converts image text into a searchable layer. For scanned PDFs, pages are automatically rendered as images at 300 DPI.
- Clean Text: Use the side-by-side rich text editor to correct OCR typos.
- Add Metadata: Add descriptive details (Subject, Year, Author).
- Generate Archive: The system builds the BagIt package and embeds your metadata.
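The README does not say how scanned PDFs are recognised, so the following is only a sketch of one common heuristic: a PDF is treated as scanned when most of its pages contain almost no extractable text but at least one embedded image. The thresholds `min_chars` and `min_ratio` and all function names are assumptions for illustration. `page_stats` and `render_pages` import PyMuPDF (`fitz`) lazily, so the classification logic can be tried without it installed.

```python
from typing import Iterable, List, Tuple

# (extractable character count, embedded image count) for one page
PageStats = Tuple[int, int]


def looks_scanned(pages: Iterable[PageStats], min_chars: int = 20,
                  min_ratio: float = 0.5) -> bool:
    """Return True when at least min_ratio of the pages are image-only."""
    stats = list(pages)
    if not stats:
        return False
    image_only = sum(1 for chars, images in stats
                     if chars < min_chars and images > 0)
    return image_only / len(stats) >= min_ratio


def page_stats(pdf_path: str) -> List[PageStats]:
    """Collect per-page statistics with PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return [(len(page.get_text().strip()), len(page.get_images()))
                for page in doc]


def render_pages(pdf_path: str, dpi: int = 300) -> List[bytes]:
    """Render every page to PNG bytes at the given DPI for an OCR engine."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]
```

A caller might run `looks_scanned(page_stats(path))` and, if it returns `True`, pass the output of `render_pages(path)` to Tesseract. The 300 DPI default mirrors the rendering resolution mentioned in the workflow above.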
For documents with complex layouts:
- Upload your document (image format recommended)
- Toggle "Advanced OCR Analysis" switch
- Click "Run Advanced OCR"
- View comprehensive results including:
  - Detected tables and their contents
  - Form fields and checkboxes (with fill status)
  - Page orientation corrections
  - Headers, footers, stamps, and signatures
  - Enhanced text extraction with layout preservation
For handwritten documents:
- Upload a clear image of handwritten notes (300+ DPI recommended)
- Select the appropriate language
- Click "Convert Handwritten to PDF"
- Receive a professionally formatted PDF with:
  - Extracted and structured text
  - Detected headings and paragraphs
  - Bullet points and lists
  - Diagrams and technical content
  - Complete metadata
For a detailed step-by-step visual guide on installing the Electron desktop application, please refer to:
public/install_guide.html (included in the distribution package).
This guide covers:
- System Requirements (Tesseract OCR)
- SmartScreen Security Bypass (for internal tools)
- First-time Account Setup
Click "Archive Search" in the sidebar to perform lightning-fast keyword searches across your entire processed collection.
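To give a feel for what an FTS5-backed search with highlighted snippets involves, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `docs`, its columns, and the helper functions are assumptions for illustration and do not reflect the actual LibraDigit AI schema. It also assumes the local SQLite build includes the FTS5 extension, which is usual but not guaranteed.

```python
import sqlite3
from typing import List, Tuple


def build_index(conn: sqlite3.Connection) -> None:
    """Create a hypothetical FTS5 table holding metadata and OCR text."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs "
        "USING fts5(title, author, keywords, body)"
    )


def add_document(conn: sqlite3.Connection, title: str, author: str,
                 keywords: str, body: str) -> None:
    conn.execute(
        "INSERT INTO docs (title, author, keywords, body) VALUES (?, ?, ?, ?)",
        (title, author, keywords, body),
    )


def search(conn: sqlite3.Connection, query: str,
           limit: int = 10) -> List[Tuple[str, str]]:
    """Return (title, excerpt) pairs, best match first.

    snippet() with column index -1 lets FTS5 pick the matching column;
    matched terms are wrapped in [ ] and the excerpt is at most 8 tokens.
    """
    sql = (
        "SELECT title, snippet(docs, -1, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn)
    add_document(conn, "Parish Register 1890", "St. Mary's", "baptism, marriage",
                 "Entries of baptisms and marriages recorded by the parish clerk.")
    add_document(conn, "Harbour Ledger", "Port Authority", "shipping, trade",
                 "Cargo manifests and customs duties for vessels entering the harbour.")
    for title, excerpt in search(conn, "customs"):
        print(title, "->", excerpt)
```

Because every column is indexed, the same query matches titles, authors, keywords, and body text. Raw user input can contain FTS5 query syntax, so a real application would typically quote or sanitise it before passing it to `MATCH`.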
Documents are organized using a standard preservation hierarchy:
Archive/
└── Subject/
    └── Year/
        └── Author_Year_Title/
            ├── data/
            │   └── Author_Year_Title.pdf (Final PDF with embedded metadata)
            ├── bag-info.txt (Archive package metadata)
            └── manifest-md5.txt (Checksums for file integrity)
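The hierarchy above can be derived mechanically from a document's metadata. The sketch below shows one way to do that. The sanitising rule in `slug` (replacing runs of unsafe characters with a hyphen) and the function names are assumptions, since the README does not describe how LibraDigit AI normalises folder names.

```python
import re
from pathlib import Path


def slug(value: str) -> str:
    """Replace runs of characters that are unsafe in file names with a hyphen."""
    cleaned = re.sub(r"[^\w\-]+", "-", value.strip())
    return cleaned.strip("-") or "Unknown"


def bag_path(root: Path, subject: str, year: int, author: str, title: str) -> Path:
    """Build Archive/Subject/Year/Author_Year_Title from metadata fields."""
    name = f"{slug(author)}_{year}_{slug(title)}"
    return root / slug(subject) / str(year) / name


def pdf_path(bag: Path) -> Path:
    """The final PDF lives in the bag's data/ folder and shares the bag's name."""
    return bag / "data" / f"{bag.name}.pdf"


if __name__ == "__main__":
    bag = bag_path(Path("Archive"), "Local History", 1923,
                   "J. Smith", "Town Records: Vol. 1")
    print(bag.as_posix())
    print(pdf_path(bag).as_posix())
```

Keeping the subject and year as separate folder levels means an archive can be browsed by topic or date in an ordinary file manager, without the application running.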
- React 18 (Vite)
- Lucide React (Icons)
- Recharts (Analytics)
- Bcryptjs (Local Auth)
- Axios (API)
- Electron (Cross-platform desktop engine)
- Flask (Python API)
- SQLite 3 (Database & FTS5 Search Engine)
- Tesseract OCR (Text Extraction with LSTM neural networks)
- PyMuPDF (fitz) (PDF rendering for scanned PDF OCR at 300 DPI)
- OpenCV (Advanced image processing & computer vision)
- NumPy (Numerical operations for image analysis)
- PyPDF2 & ReportLab (PDF Metadata, Generation & Manipulation)
- Bagit-Python (Packaging standard)
LibraDigit AI/
├── backend/ # Flask server & OCR engines
│   ├── advanced_ocr_processor.py # Advanced OCR with layout analysis
│   ├── handwritten_to_pdf.py # Handwritten text converter
│   ├── metadata_extractor.py # Metadata extraction
│   ├── batch_processor.py # Batch operations
│   └── server.py # Main Flask API
├── src/
│   ├── components/ # UI elements (Charts, Loaders, Sidebar)
│   │   └── AdvancedOCRResults.jsx # Advanced OCR results display
│   ├── pages/ # Full views (Analytics, Search, Dashboard)
│   ├── context/ # Multi-tab sync & Global state
│   └── index.css # Design system & Desktop/Mobile styles
├── Archive/ # Final BagIt collections
├── Documentation/ # Feature documentation
│   ├── ADVANCED_OCR_DOCUMENTATION.md
│   ├── HANDWRITTEN_TO_PDF_DOCUMENTATION.md
│   └── QUICK_START_ADVANCED_OCR.md
├── package.json # Frontend scripts
└── README.md # This guide
- Advanced OCR Documentation - Complete guide to advanced OCR features
- Handwritten to PDF Guide - Handwritten text conversion documentation
- Quick Start Guide - Get started with advanced features quickly
- JSON Serialization Fix - Technical troubleshooting guide
Built with ❤️ for librarians and archivists worldwide | github.com/carthworks
