Local document → rich JSON parser for LLM knowledge-base ingestion.
Supported formats: PDF, DOCX, DOC (via LibreOffice), XLS, XLSX.
# Clone and install in a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
# Verify environment
kbparser doctor

# Parse a single file
kbparser parse document.pdf
# Parse with options
kbparser parse document.docx --out ./output --profile fidelity --overwrite
# Parse a directory (batch mode)
kbparser parse ./documents/ --out ./output
# OCR for scanned PDFs
kbparser parse scanned.pdf --lang rus+eng
# or via environment variable:
KBPARSER_OCR_LANGS=rus+eng kbparser parse scanned.pdf
# Check runtime dependencies
kbparser doctor
# Version info
kbparser --version

The package also ships a small Tkinter desktop app:
kbparser-gui

For distributable builds, install the build extras and run:
pip install ".[dev,build]"
python scripts/build_apps.py

The build writes self-contained PyInstaller artifacts under dist/apps/:
- kbparser/kbparser.exe: command-line parser
- KBParser.app on macOS, or KBParser.exe on Windows: desktop app
Set KBPARSER_DIST_ROOT=/tmp/kbparser-apps to write the build output outside
the repository, which is useful on macOS when the checkout lives in a synced
Desktop/iCloud folder.
GitHub Actions also builds Windows and macOS archives from .github/workflows/build-apps.yml.
Note: the bundled apps include Python and Python package dependencies. For
legacy .doc and scanned-PDF OCR, the app now detects LibreOffice/Tesseract
from system paths, explicit env overrides, or a portable tools/ sidecar next
to the app.
Android is a separate Kotlin/Compose + Chaquopy spike, not a PyInstaller build.
The Phase 0 app lives under android/ and embeds Python 3.13 with a narrow
mobile facade:
- Supported now: .xls, .xlsx through kbparser.mobile.facade.
- Disabled now: .doc, .docx, .pdf, OCR.
- Reason: desktop pydantic-core, PyMuPDF, LibreOffice, Tkinter, and Tesseract CLI are not Android-safe assumptions.
Build the debug APK with:
ANDROID_HOME="$HOME/Library/Android/sdk" \
ANDROID_SDK_ROOT="$HOME/Library/Android/sdk" \
./android/gradlew -p android :app:assembleDebug

The generated artifact is android/app/build/outputs/apk/debug/app-debug.apk.
GitHub Actions also has .github/workflows/android.yml for the Phase 0 debug
APK.
Each parsed document produces a JSON file containing:
- document: the canonical representation, covering source metadata, sections, blocks, tables, pages, sheets, assets, relationships, and warnings
- records: retrieval-oriented projections for RAG/KB pipelines (chunk, reference_chunk, diagram_chunk, table, section_summary_seed, sheet_region)
Output includes versioning fields: schema_version, records_version, parser_version.
Batch mode additionally writes manifest.json with per-file results, timings, record counts, and runtime metadata.
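
A downstream pipeline will usually want to check the versioning fields and count records before ingesting a file. The following is a minimal sketch, not part of kbparser itself. It assumes the versioning fields sit at the top level of the JSON, that records is a list of objects, and that each record names its kind under a "type" key; all three are assumptions, so adjust them to the actual schema.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any


def summarize_output(path: Path) -> dict[str, Any]:
    """Summarize one parsed-document JSON file (hypothetical helper)."""
    data = json.loads(path.read_text(encoding="utf-8"))

    # ASSUMPTION: "records" is a list of objects, each with a "type" key
    # such as "chunk" or "table". The real key name may differ.
    records = data.get("records", [])
    kinds = Counter(
        rec.get("type", "unknown") for rec in records if isinstance(rec, dict)
    )

    # ASSUMPTION: the versioning fields are top-level keys.
    return {
        "schema_version": data.get("schema_version"),
        "records_version": data.get("records_version"),
        "parser_version": data.get("parser_version"),
        "record_counts": dict(kinds),
    }
```

Calling summarize_output(Path("output/document.json")) on each output file gives a quick way to reject files whose schema_version a consumer does not yet understand.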
| Profile | Description |
|---|---|
| fidelity | Maximum extraction fidelity (default) |
| balanced | Balance between fidelity and chunk cleanliness |
| text-lite | Minimal extraction, text-focused |
| Dependency | Required for | Install |
|---|---|---|
| LibreOffice | .doc parsing (converted to .docx) | macOS: brew install --cask libreoffice; Windows: official 64-bit stable installer |
| Tesseract | OCR on scanned PDF pages | macOS: brew install tesseract tesseract-lang; Windows: UB Mannheim 64-bit installer |
Run kbparser doctor to verify these are available. The desktop app also has a
Setup tools button with download links and portable sidecar layout.
Robust discovery order:
- Explicit env overrides: KBPARSER_LIBREOFFICE, KBPARSER_TESSERACT, and TESSDATA_PREFIX.
- KBPARSER_TOOLS_DIR, for example a shared tools directory.
- Portable tools/ next to the packaged app.
- User tools folder:
  - macOS: ~/Library/Application Support/KBParser/tools
  - Windows: %LOCALAPPDATA%\KBParser\tools
- Standard system locations and PATH.
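
The discovery order above can be sketched roughly as follows. This is an illustration only, not kbparser's actual implementation: find_tool and its parameters are hypothetical names, the environment variables are the ones listed above, and the "standard system locations" step is reduced to a plain PATH lookup.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def find_tool(
    name: str,
    env_var: str,
    sidecar_relpaths: list[str],
    app_dir: Path,
    user_tools_dir: Path,
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return the first existing executable, following the documented order.

    Hypothetical helper: the signature is an assumption for illustration.
    """
    # 1. Explicit per-tool override, e.g. KBPARSER_TESSERACT.
    explicit = env.get(env_var)
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    # 2-4. Tool directories, highest priority first:
    # KBPARSER_TOOLS_DIR, tools/ next to the app, then the user tools folder.
    roots: list[Path] = []
    if env.get("KBPARSER_TOOLS_DIR"):
        roots.append(Path(env["KBPARSER_TOOLS_DIR"]))
    roots.append(app_dir / "tools")
    roots.append(user_tools_dir)
    for root in roots:
        for rel in sidecar_relpaths:
            candidate = root / rel
            if candidate.is_file():
                return candidate

    # 5. Fall back to PATH (standard install locations are not probed here).
    found = shutil.which(name, path=env.get("PATH"))
    return Path(found) if found else None
```

For Tesseract on Windows, a call might pass name="tesseract", env_var="KBPARSER_TESSERACT", and sidecar_relpaths=["Tesseract-OCR/tesseract.exe"]. Passing env explicitly keeps the function easy to test without touching the real environment.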
Portable sidecar examples:
- tools/LibreOffice/program/soffice.exe
- tools/LibreOffice.app/Contents/MacOS/soffice
- tools/Tesseract-OCR/tesseract.exe
- tools/Tesseract-OCR/tessdata/rus.traineddata
src/kbparser/
├── __init__.py # package version
├── cli.py # CLI entrypoint + doctor command
├── dispatcher.py # format detection → parser dispatch
├── export.py # deterministic JSON writer
├── ids.py # deterministic ID generation (SHA-256)
├── model.py # canonical pydantic schema
├── normalize/ # normalization utilities
├── parsers/
│ ├── base.py # parser protocol + shared helpers
│ ├── docx.py # DOCX parser
│ ├── doc.py # DOC parser (LibreOffice conversion)
│ ├── pdf.py # PDF parser (pymupdf + pdfplumber + OCR)
│ ├── excel.py # XLS/XLSX parser (openpyxl + xlrd)
│ └── ocr.py # Tesseract OCR helper
├── records/
│ └── chunker.py # records/chunk builder
└── validation/
└── validator.py # structural validation
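
The layout mentions deterministic ID generation (SHA-256) in ids.py and a deterministic JSON writer in export.py. The sketch below shows one common way to get both properties. It is not the code in those modules: make_id, dump_deterministic, the separator, and the 16-character truncation are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any


def make_id(kind: str, *parts: str, length: int = 16) -> str:
    """Derive a stable ID from content-defining parts using SHA-256.

    Hypothetical helper: the ID format is an assumption, not kbparser's.
    """
    # Join with a unit separator so ("ab", "c") and ("a", "bc") hash differently.
    payload = "\x1f".join((kind, *parts)).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"{kind}_{digest[:length]}"


def dump_deterministic(obj: Any) -> str:
    """Serialize to JSON so that equal data always yields identical text."""
    # sort_keys removes dependence on dict insertion order.
    return json.dumps(obj, sort_keys=True, ensure_ascii=False, indent=2) + "\n"
```

Because the ID depends only on its inputs, parsing the same document twice yields the same IDs, and sorted keys keep the resulting files byte-for-byte comparable in diffs.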
pip install -e ".[dev,lint]"
pytest # run tests
ruff check src/ tests/ # lint
mypy src/kbparser/ # type check