paper-brain

Local-only research paper processing pipeline for offline academic literature analysis.

Run a local LLM, feed it PDFs, and get structured Paper Cards, taxonomy classification, and knowledge graph entries — all without sending data to any cloud API.

Features

100% local: No cloud models, no external API calls. All LLM inference runs on your machine (http://127.0.0.1:8080/v1).
PDF text extraction: Uses PyMuPDF (fitz) to extract text from academic PDFs.
Paper Card generation: Produces structured Markdown summaries with metadata, methods, equations, results, and limitations.
Adaptive chunking: Automatically handles long papers by chunking → per-chunk analysis → aggregation.
Taxonomy classification: Classifies papers into a growing taxonomy of 34+ categories.
Knowledge graph construction: Extracts typed entity-relation graphs from each paper.
Reasoning-model aware: Falls back to the reasoning_content field when a model returns its chain-of-thought trace there instead of in content.

Installation

Requirements

Python 3.9+
A locally running OpenAI-compatible LLM endpoint (e.g., llama.cpp, vllm, ollama)

Install

pip install paper-brain

Or install from source:

git clone https://github.com/your-username/paper-brain.git
cd paper-brain
pip install -e .

Dependencies

Automatically installed:

openai>=1.0.0 — API-compatible client for the local LLM endpoint
PyMuPDF>=1.23.0 — PDF text extraction
pyyaml>=6.0 — Configuration file parsing
tenacity>=8.0.0 — Retry logic for LLM calls

Quick Start

1. Configure

Copy the example config and adjust:

cp config.yaml.example config.yaml

Edit config.yaml to point to your local LLM:

llm:
  base_url: "http://127.0.0.1:8080/v1"
  api_key: "your-api-key-here"        # often "local-not-needed"
  model: "your-model-name.gguf"
  temperature: 0.2
  max_tokens: 8192
  timeout: 600

2. Process a single paper

paper-brain --input path/to/paper.pdf

3. Batch process a directory

paper-brain --batch path/to/pdf/library/

4. Verbose mode (debug logging)

paper-brain --input paper.pdf --verbose

Output Structure

data/
├── paper_cards/          # Structured Markdown summaries
│   └── grover_*.md
├── taxonomy/
│   └── taxonomy.json     # Evolving category tree (auto-updated)
├── graph/                # Knowledge graph entries (JSONL)
│   └── grover_*.jsonl
└── logs/                 # Pipeline logs & API call history
    ├── pipeline_*.log
    └── api_calls.jsonl
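
Files under data/graph/ and data/logs/api_calls.jsonl are JSONL, so each non-empty line is one JSON value. The record fields are not documented here, so the sketch below makes no assumptions about them and simply loads every line. The helper name `read_jsonl` is hypothetical and not part of paper-brain.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load a JSONL file into a list, one parsed JSON value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("data/logs/api_calls.jsonl")` would return the logged calls as a list that can be inspected in a REPL.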

Sample Paper Card output

# Paper Card

## Metadata
- Title: Grover algorithm with zero theoretical failure rate
- Authors: G. L. Long
- Year: 2001
- Venue / Source: Physical Review A, Volume 64, 022307

## 1. Core Problem
...

## 2. Method
...

## 3. Key Equations
...

Architecture

PDF Input → Text Extraction (PyMuPDF) → Adaptive Chunking
                                              │
                                    ┌─────────┴─────────┐
                                    │                   │
                              Short path          Long path
                              (single call)   (chunk → aggregate)
                                    │                   │
                                    └─────────┬─────────┘
                                              │
                                    Local LLM Inference
                                    (127.0.0.1:8080/v1)
                                              │
                    ┌───────────┬──────────────┼──────────────┐
                    │           │              │              │
               Metadata    Paper Card     Taxonomy      Knowledge
             Extraction   Generation   Classification   Graph Entry
                    │           │              │              │
                    └───────────┴──────────────┴──────────────┘
                                              │
                                       Output Persistence
                                   (Markdown / JSONL / Log)

Configuration Reference

| Key | Default | Description |
| --- | --- | --- |
| `llm.base_url` | `http://127.0.0.1:8080/v1` | Local LLM endpoint |
| `llm.api_key` | `local-not-needed` | API key (if required) |
| `llm.model` | (none) | Model name (e.g., `Qwen3.6-35B-A3B.gguf`) |
| `llm.temperature` | `0.2` | Generation temperature |
| `llm.max_tokens` | `8192` | Maximum output tokens (high for reasoning models) |
| `llm.timeout` | `600` | Request timeout in seconds |
| `llm.max_retries` | `3` | Retry attempts on failure |
| `chunking.max_chunk_size` | `8000` | Max characters per chunk for long papers |
| `chunking.overlap` | `100` | Overlap between chunks in characters |
| `pipeline.skip_existing` | `true` | Skip already-processed papers |
| `pipeline.log_api_calls` | `true` | Log every LLM API call to JSONL |
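
To get a feel for how `chunking.max_chunk_size` and `chunking.overlap` interact, the sketch below estimates the number of chunks for a text of a given length. It assumes fixed-size windows that advance by `max_chunk_size - overlap` characters; the real splitter cuts at paragraph/sentence boundaries, so actual counts may differ. The function name `estimate_chunks` is hypothetical.

```python
import math


def estimate_chunks(text_len, max_chunk_size=8000, overlap=100):
    """Rough chunk count, assuming fixed-size sliding windows (an approximation)."""
    if overlap >= max_chunk_size:
        raise ValueError("overlap must be smaller than max_chunk_size")
    if text_len <= max_chunk_size:
        return 1  # short path: the whole text fits in one chunk
    stride = max_chunk_size - overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((text_len - max_chunk_size) / stride)
```

Under these assumptions, a 100,000-character paper yields about 13 chunks with the defaults, and about 7 with `max_chunk_size=16000` and `overlap=50`.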

How It Works

PDF Text Extraction: Opens the PDF with PyMuPDF, iterates pages, extracts text per page.
Adaptive Chunking: If text exceeds the chunk threshold, it's split at paragraph/sentence boundaries with configurable overlap.
Metadata Extraction: The first ~3000 characters are sent to the LLM to extract title, authors, year, venue, and DOI.
Paper Card Generation: The full text (or chunk summaries for long papers) is sent to the LLM with a structured prompt that produces a complete Paper Card.
Taxonomy Classification: The Paper Card is classified into an existing taxonomy. If a paper doesn't fit, a new category is proposed.
Knowledge Graph Entry: Entities (concepts, methods, phenomena) and typed relations are extracted as a JSON graph.
Persistence: All outputs are saved to disk. API calls are logged for debugging.
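
The adaptive chunking step above can be sketched roughly as follows. This is a minimal illustration, not the code in src/paper_brain: it packs whole paragraphs into chunks, hard-splits any paragraph that is too large on its own (the sentence-boundary splitting described above is left out for brevity), and prefixes each later chunk with the tail of the previous one as overlap. The name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text, max_chunk_size=8000, overlap=100):
    """Split text at paragraph boundaries into overlapping chunks (sketch)."""
    if overlap >= max_chunk_size:
        raise ValueError("overlap must be smaller than max_chunk_size")
    if len(text) <= max_chunk_size:
        return [text]  # short path: no chunking needed

    budget = max_chunk_size - overlap  # room left for new text in each chunk
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Hard-split paragraphs that exceed the budget on their own.
    pieces = []
    for para in paragraphs:
        for i in range(0, len(para), budget):
            pieces.append(para[i:i + budget])

    # Greedily pack pieces until the next one would overflow the budget.
    bodies, current = [], ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) <= budget:
            current = candidate
        else:
            bodies.append(current)
            current = piece
    if current:
        bodies.append(current)
    if not bodies:
        return []  # text was whitespace only

    # Prefix every chunk after the first with the tail of the previous body.
    chunks = [bodies[0]]
    for prev, cur in zip(bodies, bodies[1:]):
        tail = prev[-overlap:] if overlap else ""
        chunks.append(tail + cur)
    return chunks
```

Because each body is capped at `max_chunk_size - overlap`, no chunk exceeds `max_chunk_size` after the overlap prefix is added.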

Tips

Warm up your LLM first: The pipeline sends a warm-up request on startup. Cold starts can take ~60s for large models.
CLI overrides: Command-line arguments (--input, --batch, --verbose) take precedence over config.yaml.
Reasoning models: If your LLM outputs content under reasoning_content rather than content, the client auto-falls back.
Chunk tuning: For very long papers (100K+ characters), increase chunking.max_chunk_size and decrease chunking.overlap to reduce the total number of API calls.
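
The reasoning-model fallback mentioned in the tips can be illustrated with a small sketch. It assumes the response message is available as a plain dict shaped like the `message` object of an OpenAI-compatible chat completion; the helper name `extract_text` is hypothetical and the actual client may differ.

```python
def extract_text(message):
    """Return content if present, otherwise fall back to reasoning_content."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    # Some reasoning models leave content empty and put their output here.
    return (message.get("reasoning_content") or "").strip()
```

A message with an empty or missing `content` and a populated `reasoning_content` returns the latter; if both are empty, the result is an empty string and the caller can decide whether to retry.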

Limitations

Scanned-image PDFs are not supported (no OCR).
The quality of extraction and classification depends on the underlying LLM's instruction-following ability.
The taxonomy is initialized with 34 categories covering quantum computing, ML, and physics. Papers outside these domains will trigger new category proposals.

License

MIT
