lyomagit/llm-kb-parser

kbparser

Local document → rich JSON parser for LLM knowledge-base ingestion.

Supported formats: PDF, DOCX, DOC (via LibreOffice), XLS, XLSX.

Install

# From a clone of the repository, install into a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

# Verify environment
kbparser doctor

Usage

# Parse a single file
kbparser parse document.pdf

# Parse with options
kbparser parse document.docx --out ./output --profile fidelity --overwrite

# Parse a directory (batch mode)
kbparser parse ./documents/ --out ./output

# OCR for scanned PDFs
kbparser parse scanned.pdf --lang rus+eng
# or via environment variable:
KBPARSER_OCR_LANGS=rus+eng kbparser parse scanned.pdf

# Check runtime dependencies
kbparser doctor

# Version info
kbparser --version
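
When the parser is driven from a script instead of a shell, the same command line can be assembled programmatically. The sketch below is a hypothetical helper (`build_parse_command` is not part of kbparser); it uses only the flags shown above and assumes the `kbparser` executable is on PATH.

```python
from typing import List, Optional


def build_parse_command(
    source: str,
    out: Optional[str] = None,
    profile: Optional[str] = None,
    overwrite: bool = False,
    lang: Optional[str] = None,
) -> List[str]:
    """Assemble a `kbparser parse` argv list from the flags shown above."""
    cmd = ["kbparser", "parse", source]
    if out is not None:
        cmd += ["--out", out]
    if profile is not None:
        cmd += ["--profile", profile]
    if overwrite:
        cmd.append("--overwrite")
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even where kbparser is not installed.
    print(" ".join(build_parse_command("scanned.pdf", out="./output", lang="rus+eng")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.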

Desktop App

The package also ships a small Tkinter desktop app:

kbparser-gui

For distributable builds, install the build extras and run:

pip install ".[dev,build]"
python scripts/build_apps.py

The build writes self-contained PyInstaller artifacts under dist/apps/:

  • kbparser / kbparser.exe — command-line parser
  • KBParser.app on macOS, or KBParser.exe on Windows — desktop app

Set KBPARSER_DIST_ROOT=/tmp/kbparser-apps to write the build output outside the repository, which is useful on macOS when the checkout lives in a synced Desktop/iCloud folder.

GitHub Actions also builds Windows and macOS archives from .github/workflows/build-apps.yml.

Note: the bundled apps include Python and the Python package dependencies. LibreOffice (for legacy .doc files) and Tesseract (for scanned-PDF OCR) are external tools; the app looks for them in system paths, in explicit environment overrides, or in a portable tools/ sidecar next to the app.

Android Phase 0

Android is a separate Kotlin/Compose + Chaquopy spike, not a PyInstaller build. The Phase 0 app lives under android/ and embeds Python 3.13 with a narrow mobile facade:

  • Supported now: .xls, .xlsx through kbparser.mobile.facade.
  • Disabled now: .doc, .docx, .pdf, OCR.
  • Reason: the desktop dependencies (pydantic-core, PyMuPDF, LibreOffice, Tkinter, and the Tesseract CLI) cannot be assumed to work on Android.

Build the debug APK with:

ANDROID_HOME="$HOME/Library/Android/sdk" \
ANDROID_SDK_ROOT="$HOME/Library/Android/sdk" \
./android/gradlew -p android :app:assembleDebug

The generated artifact is android/app/build/outputs/apk/debug/app-debug.apk. The .github/workflows/android.yml workflow covers the same Phase 0 debug APK in GitHub Actions.

Output

Each parsed document produces a JSON file containing:

  • document — canonical representation: source metadata, sections, blocks, tables, pages, sheets, assets, relationships, warnings
  • records — retrieval-oriented projections for RAG/KB pipelines: chunk, reference_chunk, diagram_chunk, table, section_summary_seed, sheet_region

Output includes versioning fields: schema_version, records_version, parser_version.

Batch mode additionally writes manifest.json with per-file results, timings, record counts, and runtime metadata.
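
As a rough illustration of consuming the output, the sketch below loads one parsed file and counts records by kind. It assumes that `document` and `records` are top-level keys and that `records` is a list of objects. The `type` field name is hypothetical and not taken from the schema, so check `model.py` for the actual field before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_output(path: Path) -> dict:
    """Load one kbparser output file and count its records by kind."""
    data = json.loads(path.read_text(encoding="utf-8"))
    records = data.get("records", [])
    # "type" is an assumed field name for the record kind
    # (chunk, table, sheet_region, ...); adjust it to the real schema.
    kinds = Counter(record.get("type", "unknown") for record in records)
    return {
        "has_document": "document" in data,
        "record_count": len(records),
        "kinds": dict(kinds),
    }
```

Calling `summarize_output(Path("output/document.json"))` on a parsed file (the path is illustrative) returns a small dictionary suitable for a quick sanity check before the records are fed into a RAG pipeline.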

Profiles

| Profile | Description |
| --- | --- |
| fidelity | Maximum extraction fidelity (default) |
| balanced | Balance between fidelity and chunk cleanliness |
| text-lite | Minimal extraction, text-focused |

Optional Dependencies

| Dependency | Required for | Install |
| --- | --- | --- |
| LibreOffice | .doc parsing (converted to .docx) | macOS: brew install --cask libreoffice; Windows: official 64-bit stable installer |
| Tesseract | OCR on scanned PDF pages | macOS: brew install tesseract tesseract-lang; Windows: UB Mannheim 64-bit installer |

Run kbparser doctor to verify that these are available. The desktop app also has a Setup tools button that shows download links and the portable sidecar layout.

The tools are looked up in the following order:

  1. Explicit env overrides: KBPARSER_LIBREOFFICE, KBPARSER_TESSERACT, and TESSDATA_PREFIX.
  2. KBPARSER_TOOLS_DIR, for example a shared tools directory.
  3. Portable tools/ next to the packaged app.
  4. User tools folder:
    • macOS: ~/Library/Application Support/KBParser/tools
    • Windows: %LOCALAPPDATA%\KBParser\tools
  5. Standard system locations and PATH.

Portable sidecar examples:

  • tools/LibreOffice/program/soffice.exe
  • tools/LibreOffice.app/Contents/MacOS/soffice
  • tools/Tesseract-OCR/tesseract.exe
  • tools/Tesseract-OCR/tessdata/rus.traineddata
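
The discovery order and sidecar layout above can be sketched roughly as follows. This is an illustration, not the project's actual lookup code: `find_tool`, `user_tools_dir`, and the dictionaries are hypothetical names, an invalid env override is assumed to fall through to the next step, "standard system locations" is reduced to a PATH search, and `TESSDATA_PREFIX` handling is left out.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import List, Optional

# Env overrides named in step 1 of the discovery order.
ENV_OVERRIDES = {"libreoffice": "KBPARSER_LIBREOFFICE", "tesseract": "KBPARSER_TESSERACT"}
# Relative paths inside a tools directory, taken from the sidecar examples.
SIDECAR_PATHS = {
    "libreoffice": ["LibreOffice/program/soffice.exe", "LibreOffice.app/Contents/MacOS/soffice"],
    "tesseract": ["Tesseract-OCR/tesseract.exe"],
}
# Executable names tried on PATH (an assumption for this sketch).
PATH_NAMES = {"libreoffice": ["soffice", "libreoffice"], "tesseract": ["tesseract"]}


def user_tools_dir() -> Optional[Path]:
    """Return the per-user tools folder from step 4, if the platform has one."""
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "KBParser" / "tools"
    if sys.platform == "win32":
        local = os.environ.get("LOCALAPPDATA")
        return Path(local) / "KBParser" / "tools" if local else None
    return None


def find_tool(name: str, app_dir: Optional[Path] = None) -> Optional[Path]:
    """Resolve a tool by walking the documented discovery order."""
    # 1. Explicit env override.
    override = os.environ.get(ENV_OVERRIDES[name])
    if override and Path(override).is_file():
        return Path(override)

    # 2-4. Tools directories: shared dir, portable sidecar, user folder.
    tool_dirs: List[Path] = []
    shared = os.environ.get("KBPARSER_TOOLS_DIR")
    if shared:
        tool_dirs.append(Path(shared))
    if app_dir is not None:
        tool_dirs.append(app_dir / "tools")
    user_dir = user_tools_dir()
    if user_dir is not None:
        tool_dirs.append(user_dir)
    for base in tool_dirs:
        for rel in SIDECAR_PATHS[name]:
            candidate = base / rel
            if candidate.is_file():
                return candidate

    # 5. Fall back to PATH.
    for exe in PATH_NAMES[name]:
        found = shutil.which(exe)
        if found:
            return Path(found)
    return None
```

For example, `find_tool("tesseract", app_dir=Path(sys.executable).parent)` would check a `tools/` folder next to the running executable before falling back to PATH.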

Layout

src/kbparser/
├── __init__.py          # package version
├── cli.py               # CLI entrypoint + doctor command
├── dispatcher.py        # format detection → parser dispatch
├── export.py            # deterministic JSON writer
├── ids.py               # deterministic ID generation (SHA-256)
├── model.py             # canonical pydantic schema
├── normalize/           # normalization utilities
├── parsers/
│   ├── base.py          # parser protocol + shared helpers
│   ├── docx.py          # DOCX parser
│   ├── doc.py           # DOC parser (LibreOffice conversion)
│   ├── pdf.py           # PDF parser (pymupdf + pdfplumber + OCR)
│   ├── excel.py         # XLS/XLSX parser (openpyxl + xlrd)
│   └── ocr.py           # Tesseract OCR helper
├── records/
│   └── chunker.py       # records/chunk builder
└── validation/
    └── validator.py     # structural validation
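
The layout mentions deterministic IDs (SHA-256) and a deterministic JSON writer. A minimal sketch of both ideas follows, under stated assumptions: `make_id` and `dump_deterministic` are hypothetical names, and the ID format, separator, and truncation length are illustrative choices rather than what `ids.py` or `export.py` actually do.

```python
import hashlib
import json
from typing import Any


def make_id(kind: str, *parts: str, length: int = 16) -> str:
    """Derive a stable ID from a kind label and identifying parts via SHA-256."""
    # A unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join((kind, *parts)).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"{kind}_{digest[:length]}"


def dump_deterministic(obj: Any) -> str:
    """Serialize to JSON with sorted keys so equal data gives equal bytes."""
    return json.dumps(obj, sort_keys=True, ensure_ascii=False, indent=2) + "\n"


if __name__ == "__main__":
    block_id = make_id("chunk", "document.pdf", "page-1", "block-0")
    print(dump_deterministic({"text": "hello", "id": block_id}))
```

Because neither function depends on time, randomness, or dictionary insertion order, parsing the same input twice yields byte-identical output, which keeps diffs and re-ingestion checks simple.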

Development

pip install -e ".[dev,lint]"
pytest                   # run tests
ruff check src/ tests/   # lint
mypy src/kbparser/       # type check
