AI-Based Digitization & Digital Archive Builder for Libraries - production-grade desktop application that converts scanned documents into searchable, metadata-rich digital archives using a guided workflow.

carthworks/LibraDigitAI

LibraDigit AI

AI-Based Digitization & Digital Archive Builder for Libraries

A production-grade desktop application that converts scanned documents into searchable, metadata-rich digital archives using a guided workflow.

🎯 Overview

LibraDigit AI is an offline-first desktop application designed for librarians, archivists, and digitization teams to:

  • ✅ Convert scanned PDFs/images to searchable documents using OCR.
  • ✅ Automatic Scanned PDF Detection - Intelligently detects image-based PDFs and applies OCR automatically.
  • ✅ Advanced OCR with AI-powered layout analysis - Detect tables, forms, signatures, and page structure.
  • ✅ Handwritten text to PDF conversion - Transform handwritten notes into formatted, searchable PDFs.
  • ✅ Clean and improve OCR text accuracy.
  • ✅ Add comprehensive metadata (title, author, year, subject, keywords).
  • ✅ Generate structured digital archives with organized folder hierarchies.
  • ✅ Search across an entire archive using a dedicated Full-Text Search engine.
  • ✅ Analyze digitization progress with a built-in statistics dashboard.

🚀 Key Features

🤖 Advanced OCR & AI Analysis

  • Scanned PDF OCR with Handwritten Support: Automatically detects PDFs with embedded images and applies intelligent OCR. Switches to handwritten mode (LSTM) when handwriting is detected on any page.
  • Intelligent Layout Understanding: Automatically detects page structure including headers, footers, stamps, and signatures.
  • Table & Form Extraction: Identifies and extracts structured data from tables and form fields with checkbox detection.
  • Auto-Orientation Correction: Automatically detects and corrects page rotation (0°, 90°, 180°, 270°).
  • Handwritten Text Recognition: Specialized LSTM neural network for improved handwriting accuracy (75-92%).
  • Enhanced Preprocessing: CLAHE enhancement, adaptive thresholding, and advanced denoising for better accuracy.
  • Handwritten to PDF: Convert handwritten notes directly to professionally formatted, searchable PDF documents.

🔍 Extensive Search Facility

  • Full-Text Search (FTS5): Powered by SQLite's FTS5, search instantly through thousands of archived documents.
  • Content-Aware Snippets: Search results show exactly where terms appear with keyword highlighting.
  • Universal Metadata Search: Find documents by Title, Author, Keywords, or any content within the text.
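
To make the search features above concrete, here is a small sketch of an FTS5 index queried through Python's built-in `sqlite3` module. The table name `documents`, its three columns, and the bracket markers used for highlighting are assumptions made for this example; the application's real schema is not shown in this README. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 index from (title, author, body) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one FTS5 virtual table with three searchable columns.
    conn.execute("CREATE VIRTUAL TABLE documents USING fts5(title, author, body)")
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", docs)
    return conn


def search(conn, query, limit=10):
    """Return (title, snippet) pairs, best match first."""
    # snippet(): column -1 lets FTS5 pick the matching column; '[' and ']'
    # wrap each hit; '...' marks truncation; 8 is the maximum token count.
    sql = """
        SELECT title, snippet(documents, -1, '[', ']', '...', 8)
        FROM documents
        WHERE documents MATCH ?
        ORDER BY rank
        LIMIT ?
    """
    return conn.execute(sql, (query, limit)).fetchall()
```

A plain term such as `manuscripts` searches every column, while FTS5's column filter syntax (for example `author:Iyer`) restricts the match to one metadata field.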

📊 Analytics & Statistics

  • Workflow Visualization: Track project distribution across Upload, OCR, Cleanup, Metadata, and Archived stages.
  • Storage Metrics: Real-time tracking of disk space usage by your digital collection.
  • Activity Trends: Weekly activity charts showing your digitization team's productivity.
  • Top Subjects: Bar charts showcasing the most represented subjects in your archive.

🔒 Secure & Private

  • Secure Offline Auth: Hashes passwords with bcryptjs for local authentication.
  • First-Run Setup: Guided password setup on the first launch.
  • Privacy-First: Zero cloud dependency; all data, hashes, and files stay exclusively on your local machine.

📱 Responsive & Modern UI

  • Responsive Design: Optimized for everything from desktop monitors to mobile devices.
  • Multi-tab Synchronization: Log out or delete a project in one browser tab, and all other tabs will instantly synchronize.
  • Premium Aesthetics: High-end dark theme with smooth gradients and micro-animations.

📦 Archival Standards

  • BagIt Packaging: Implements the international BagIt standard for robust, verifiable data packages.
  • XMP Metadata Embedding: Metadata (Title, Author, etc.) is embedded directly into the PDF binary, traveling with the file even when shared.
  • MD5 Manifests: Each package records an MD5 checksum for every payload file, so later verification can detect corruption even years after archiving.
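
The integrity check behind the manifest can be illustrated with the standard library alone. The sketch below writes a `manifest-md5.txt` in the usual BagIt line format (checksum, whitespace, relative path) and re-verifies it later. The function names are hypothetical, and the application itself lists Bagit-Python for packaging, so treat this as an explanation of the idea rather than the code the project runs.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of(path: Path) -> str:
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> Path:
    """Write manifest-md5.txt listing every file under data/."""
    lines = [
        f"{md5_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted((bag_dir / "data").rglob("*"))
        if p.is_file()
    ]
    manifest = bag_dir / "manifest-md5.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(bag_dir: Path) -> List[str]:
    """Return payload paths that are missing or whose checksum changed."""
    damaged = []
    text = (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or md5_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

An empty list from `verify_manifest` means every payload file still matches the checksum recorded when the package was built.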

📋 Prerequisites

Required Software

  1. Node.js (v18 or higher)

  2. Python (v3.8 or higher)

  3. Tesseract OCR (for OCR functionality)

Additional Dependencies for Advanced Features

  1. OpenCV (for advanced image processing)

    • Installed automatically via requirements.txt
    • Required for: Advanced OCR, handwritten text recognition, table detection
  2. ReportLab (for PDF generation)

    • Installed automatically via requirements.txt
    • Required for: Handwritten to PDF conversion

🛠️ Installation

1. Clone or Download the Project

cd "LibraDigit AI"

2. Install Dependencies

# Install frontend packages
npm install

# Install backend packages (includes OpenCV, NumPy, ReportLab)
cd backend
pip install -r requirements.txt
cd ..

🎮 Running the Application

Development Mode

The easiest way to run the application is using the combined dev script:

npm run dev

This will:

  • Start the React frontend (Vite)
  • Start the Python backend (Flask)
  • Launch the Electron desktop window

📖 Usage Guide

Creating Your First Project

  1. Launch & Setup: On first run, create your master password.
  2. Upload Document: Drag and drop a PDF or image file (PDF, PNG, JPEG, TIFF). Scanned PDFs are automatically detected.
  3. Choose OCR Method:
    • Standard OCR: Fast text extraction for printed documents and scanned PDFs
    • Advanced OCR: AI-powered analysis with table detection, form recognition, and layout understanding (images only)
    • Handwritten to PDF: Convert handwritten notes to formatted, searchable PDFs (images only)
  4. Run OCR: Tesseract converts image text into a searchable layer. For scanned PDFs, pages are automatically rendered as images at 300 DPI.
  5. Clean Text: Use the side-by-side rich text editor to correct OCR typos.
  6. Add Metadata: Add descriptive details (Subject, Year, Author).
  7. Generate Archive: The system builds the BagIt package and embeds your metadata.
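
Steps 2 and 4 mention automatic detection of scanned PDFs and rendering pages at 300 DPI. One plausible way to do both with PyMuPDF is sketched below. The heuristic (a page with images but almost no extractable text counts as scanned, and a majority of such pages marks the file as scanned), the 20-character floor, and all function names are assumptions for illustration; the README does not describe the actual detection rule.

```python
from typing import Iterable, List, Tuple


def looks_scanned(pages: Iterable[Tuple[int, int]], min_chars: int = 20) -> bool:
    """Decide whether a PDF is image-based (hypothetical heuristic).

    `pages` holds one (text_characters, image_count) pair per page.
    """
    pages = list(pages)
    scanned = sum(1 for chars, images in pages if images > 0 and chars < min_chars)
    # Assumed rule: more than half of the pages must look scanned.
    return bool(pages) and scanned > len(pages) / 2


def page_stats(pdf_path: str) -> List[Tuple[int, int]]:
    """Collect (text_characters, image_count) per page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so looks_scanned() needs no extras

    with fitz.open(pdf_path) as doc:
        return [
            (len(page.get_text().strip()), len(page.get_images(full=True)))
            for page in doc
        ]


def render_pages(pdf_path: str, dpi: int = 300) -> List[bytes]:
    """Render each page to PNG bytes at the given resolution for OCR."""
    import fitz

    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]
```

With these helpers, `looks_scanned(page_stats(path))` decides whether OCR is needed, and `render_pages(path)` produces the images that would be passed to Tesseract.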

🤖 Using Advanced OCR

For documents with complex layouts:

  1. Upload your document (image format recommended)
  2. Toggle "Advanced OCR Analysis" switch
  3. Click "Run Advanced OCR"
  4. View comprehensive results including:
    • Detected tables and their contents
    • Form fields and checkboxes (with fill status)
    • Page orientation corrections
    • Headers, footers, stamps, and signatures
    • Enhanced text extraction with layout preservation

✍️ Converting Handwritten Notes to PDF

For handwritten documents:

  1. Upload a clear image of handwritten notes (300+ DPI recommended)
  2. Select the appropriate language
  3. Click "Convert Handwritten to PDF"
  4. Receive a professionally formatted PDF with:
    • Extracted and structured text
    • Detected headings and paragraphs
    • Bullet points and lists
    • Diagrams and technical content
    • Complete metadata

📚 Installation Guide

For a detailed step-by-step visual guide on installing the Electron desktop application, please refer to: public/install_guide.html (included in the distribution package).

This guide covers:

  • System Requirements (Tesseract OCR)
  • SmartScreen Security Bypass (for internal tools)
  • First-time Account Setup

Searching the Archive

Click "Archive Search" in the sidebar to perform lightning-fast keyword searches across your entire processed collection.

Archive Structure (BagIt Standard)

Documents are organized using a standard preservation hierarchy:

Archive/
  └── Subject/
      └── Year/
          └── Author_Year_Title/
              ├── data/
              │   └── Author_Year_Title.pdf   (Final PDF with embedded metadata)
              ├── bag-info.txt                (Archive package metadata)
              └── manifest-md5.txt            (Checksums for file integrity)
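
The hierarchy above maps directly onto a path-building function. The sketch below derives the package directory and the final PDF path from a document's metadata. The helper names and the rule for sanitizing names (runs of non-word characters become underscores) are assumptions; the README does not state how the application cleans metadata values.

```python
import re
from pathlib import Path


def slug(value: str) -> str:
    """Replace characters that are awkward in file names with underscores."""
    cleaned = re.sub(r"[^\w\-]+", "_", value.strip())
    return cleaned.strip("_") or "Unknown"


def bag_paths(root: Path, subject: str, year: int, author: str, title: str):
    """Return (bag_dir, pdf_path) following Subject/Year/Author_Year_Title."""
    name = f"{slug(author)}_{year}_{slug(title)}"
    bag_dir = root / slug(subject) / str(year) / name
    return bag_dir, bag_dir / "data" / f"{name}.pdf"
```

Falling back to `Unknown` for empty values keeps every document inside the same three-level layout even when a metadata field was left blank.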

🔧 Technology Stack

Frontend & UI

  • React 18 (Vite)
  • Lucide React (Icons)
  • Recharts (Analytics)
  • Bcryptjs (Local Auth)
  • Axios (API)

Desktop

  • Electron (Cross-platform desktop engine)

Backend & Engine

  • Flask (Python API)
  • SQLite 3 (Database & FTS5 Search Engine)
  • Tesseract OCR (Text Extraction with LSTM neural networks)
  • PyMuPDF (fitz) (PDF rendering for scanned PDF OCR at 300 DPI)
  • OpenCV (Advanced image processing & computer vision)
  • NumPy (Numerical operations for image analysis)
  • PyPDF2 & ReportLab (PDF Metadata, Generation & Manipulation)
  • Bagit-Python (Packaging standard)

🎨 Project Structure

LibraDigit AI/
├── backend/                    # Flask server & OCR engines
│   ├── advanced_ocr_processor.py    # Advanced OCR with layout analysis
│   ├── handwritten_to_pdf.py        # Handwritten text converter
│   ├── metadata_extractor.py        # Metadata extraction
│   ├── batch_processor.py           # Batch operations
│   └── server.py                    # Main Flask API
├── src/
│   ├── components/             # UI elements (Charts, Loaders, Sidebar)
│   │   └── AdvancedOCRResults.jsx   # Advanced OCR results display
│   ├── pages/                  # Full views (Analytics, Search, Dashboard)
│   ├── context/                # Multi-tab sync & Global state
│   └── index.css               # Design system & Desktop/Mobile styles
├── Archive/                    # Final BagIt collections
├── Documentation/              # Feature documentation
│   ├── ADVANCED_OCR_DOCUMENTATION.md
│   ├── HANDWRITTEN_TO_PDF_DOCUMENTATION.md
│   └── QUICK_START_ADVANCED_OCR.md
├── package.json                # Frontend scripts
└── README.md                   # This guide

📚 Additional Documentation


Built with ❤️ for librarians and archivists worldwide | github.com/carthworks
