Locus - PDF Semantic Search

Find the exact page that answers your question.

A lightweight desktop app for students and researchers to search PDF folders using natural language.

Features

Hybrid search: BM25 keyword retrieval + semantic reranking (FastEmbed)
Two index modes: Fast (quick startup) and Deep (precomputed embeddings)
Chunked indexing: better precision while keeping page numbers
Optional OCR: for scanned PDFs or image-only pages
Multilingual search: cross-lingual matching with the multilingual model
Open PDF at page: jump directly to the relevant page
Model manager: download/delete models and choose fusion method
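The chunked-indexing feature above can be pictured with a minimal sketch. This is not Locus's actual code: the function name `chunk_pages`, the word-based splitting, and the `max_words` / `overlap` defaults are all assumptions chosen for illustration. It only shows how each chunk can carry its source page number so a hit can still be opened at the right page.

```python
def chunk_pages(pages, max_words=120, overlap=20):
    """Split per-page text into overlapping word chunks tagged with a page number.

    `pages` is a list of strings, one per PDF page (page 1 first).
    All names and defaults here are hypothetical, not taken from Locus.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    step = max_words - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for start in range(0, len(words), step):
            chunks.append({
                "page": page_number,  # kept so a result can open the PDF at this page
                "text": " ".join(words[start:start + max_words]),
            })
            # Stop once this chunk already reaches the end of the page.
            if start + max_words >= len(words):
                break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_pages(["a b c d e", "", "f g"], max_words=3, overlap=1):
        print(chunk)
```

Chunks never span two pages in this sketch, which keeps the page reference unambiguous at the cost of occasionally splitting a sentence that continues across a page break.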

Quick Start

Option A: Windows EXE

Download the latest release from the Releases page and run Locus.exe.

Option B: Run from source

# Install dependencies
pip install -r requirements.txt

# Run
python gui.py

How to Use

Click Browse and select a folder containing PDFs.
Click Load Index (or Rebuild Index when files/models change).
Choose index mode:
- Fast Index: faster startup, good for small collections
- Deep Index: slower startup, best recall
Type a query and press Search.
Double-click a result to open the PDF at the correct page.

Search Quality (Models)

Balanced / High / Best control embedding model size and accuracy.
Multilingual enables cross-lingual search.

Tip: Use the Manage Models window to download/delete models.

OCR

OCR is off by default and can be enabled in the OCR selector.
Fast mode: OCR only for image-heavy pages with little text.
Deep mode: OCR for all pages that contain images.

OCR results are cached to speed up later runs.
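The Fast and Deep OCR rules above can be read as a per-page decision. The sketch below is one possible interpretation, not the app's implementation: the function name `should_ocr`, the `image_area_ratio` input, and both thresholds are hypothetical, since the README does not say how "image-heavy" or "little text" are measured.

```python
def should_ocr(ocr_enabled, index_mode, text_chars, image_count,
               image_area_ratio=0.0, min_text_chars=50, min_image_ratio=0.5):
    """Decide whether a single page should be sent to OCR.

    text_chars       -- number of characters of extractable text on the page
    image_count      -- number of images found on the page
    image_area_ratio -- fraction of the page covered by images (assumed metric)
    Thresholds are illustrative guesses, not values used by Locus.
    """
    if not ocr_enabled:
        return False          # OCR is off by default
    if image_count == 0:
        return False          # nothing to recognize on a page without images
    if index_mode == "deep":
        return True           # Deep: every page that contains images
    if index_mode == "fast":
        # Fast: only image-heavy pages that have little extractable text
        return image_area_ratio >= min_image_ratio and text_chars < min_text_chars
    raise ValueError(f"unknown index mode: {index_mode!r}")


if __name__ == "__main__":
    print(should_ocr(True, "fast", text_chars=10, image_count=1, image_area_ratio=0.8))
    print(should_ocr(True, "fast", text_chars=2000, image_count=1, image_area_ratio=0.8))
```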

Score Fusion

Choose the fusion method in the Manage Models window:

RRF (Rank Fusion) (default)
Percentile Blend
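To make the two fusion options concrete, here is a minimal sketch of how a BM25 ranking and a semantic ranking could be combined. It assumes the standard reciprocal rank fusion formula with the conventional constant k = 60, and a simple equal-weight blend of percentile ranks. The function names, the constant, and the weight are illustrative assumptions; Locus may compute either method differently.

```python
from bisect import bisect_right


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def percentile_ranks(scores):
    """Map each raw score to the fraction of documents scoring at or below it."""
    values = sorted(scores.values())
    n = len(values)
    return {doc_id: bisect_right(values, s) / n for doc_id, s in scores.items()}


def percentile_blend(bm25_scores, semantic_scores, weight=0.5):
    """Blend two score sets after converting each to percentile ranks.

    `weight` is the share given to BM25 (hypothetical default of 0.5).
    Documents missing from one set get a percentile of 0 for that set.
    """
    p_bm25 = percentile_ranks(bm25_scores)
    p_sem = percentile_ranks(semantic_scores)
    blended = {
        doc_id: weight * p_bm25.get(doc_id, 0.0) + (1 - weight) * p_sem.get(doc_id, 0.0)
        for doc_id in set(p_bm25) | set(p_sem)
    }
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(rrf_fuse([["p3", "p1", "p2"], ["p1", "p2", "p3"]]))
    print(percentile_blend({"a": 10.0, "b": 5.0, "c": 1.0},
                           {"a": 0.2, "b": 0.9, "c": 0.5}))
```

RRF only looks at positions in each list, so the raw BM25 and embedding scores never need to share a scale. The percentile blend keeps a numeric score per result, which is one reason a tool might offer both.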

Caches

Caches are stored outside your PDF folder:

Index cache:
- Windows: %LOCALAPPDATA%\Locus\index_cache
- macOS: ~/Library/Caches/Locus/index_cache
- Linux: ~/.cache/Locus/index_cache
OCR cache:
- Windows: %LOCALAPPDATA%\Locus\ocr_cache
- macOS: ~/Library/Caches/Locus/ocr_cache
- Linux: ~/.cache/Locus/ocr_cache
Model cache:
- Windows: %LOCALAPPDATA%\Locus\fastembed_cache
- macOS: ~/Library/Caches/Locus/fastembed_cache
- Linux: ~/.cache/Locus/fastembed_cache

Use Manage Models to clear index or OCR cache.
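If you want to locate or inspect these folders from a script, the layout listed above can be reproduced with a small helper. This is a sketch based only on the paths in this section: the function name `locus_cache_dir` and its parameters are hypothetical, and the fallback used when `LOCALAPPDATA` is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def locus_cache_dir(name, platform=None, env=None, home=None):
    """Return the expected cache folder for `name`.

    `name` is one of "index_cache", "ocr_cache", or "fastembed_cache".
    `platform`, `env`, and `home` default to the current system and exist
    mainly so the function can be exercised without touching the real one.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if platform.startswith("win"):
        # Assumed fallback when LOCALAPPDATA is not set.
        base = Path(env.get("LOCALAPPDATA", home / "AppData" / "Local"))
        return base / "Locus" / name
    if platform == "darwin":
        return home / "Library" / "Caches" / "Locus" / name
    return home / ".cache" / "Locus" / name


if __name__ == "__main__":
    for cache in ("index_cache", "ocr_cache", "fastembed_cache"):
        print(locus_cache_dir(cache))
```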

Requirements

Python 3.8+
PDF viewer with page navigation support (SumatraPDF recommended on Windows)

Dependencies:

PyMuPDF
rank-bm25
fastembed
numpy
customtkinter
rapidocr-onnxruntime

FAQ

Why is indexing slow?
Deep mode precomputes embeddings for every chunk, and OCR can be expensive. Switch to Fast mode, or lower the OCR quality setting.

Why don't I see a score in RRF mode?
RRF is rank-based; numeric scores are hidden by design.

License

MIT - free for personal and educational use.
