Local-only research paper processing pipeline for offline academic literature analysis.
Run a local LLM, feed it PDFs, and get structured Paper Cards, taxonomy classification, and knowledge graph entries — all without sending data to any cloud API.
- 100% local: No cloud models, no external API calls. All LLM inference runs on your machine (http://127.0.0.1:8080/v1).
- PDF text extraction: Uses PyMuPDF (fitz) to extract text from academic PDFs.
- Paper Card generation: Produces structured Markdown summaries with metadata, methods, equations, results, and limitations.
- Adaptive chunking: Automatically handles long papers by chunking → per-chunk analysis → aggregation.
- Taxonomy classification: Classifies papers into a growing taxonomy of 34+ categories.
- Knowledge graph construction: Extracts typed entity-relation graphs from each paper.
- Reasoning model aware: Falls back to `reasoning_content` for models that output CoT traces.

Install with pip:

    pip install paper-brain

Or install from source:

    git clone https://github.com/your-username/paper-brain.git
    cd paper-brain
    pip install -e .

Automatically installed:
- `openai>=1.0.0`: API-compatible client for the local LLM endpoint
- `PyMuPDF>=1.23.0`: PDF text extraction
- `pyyaml>=6.0`: Configuration file parsing
- `tenacity>=8.0.0`: Retry logic for LLM calls

Copy the example config and adjust:

    cp config.yaml.example config.yaml

Edit config.yaml to point to your local LLM:

    llm:
      base_url: "http://127.0.0.1:8080/v1"
      api_key: "your-api-key-here"  # often "local-not-needed"
      model: "your-model-name.gguf"
      temperature: 0.2
      max_tokens: 8192
      timeout: 600

Process a single paper:

    paper-brain --input path/to/paper.pdf

Process a directory of PDFs in batch mode:

    paper-brain --batch path/to/pdf/library/

Enable verbose output:

    paper-brain --input paper.pdf --verbose

Output layout:

    data/
    ├── paper_cards/          # Structured Markdown summaries
    │   └── grover_*.md
    ├── taxonomy/
    │   └── taxonomy.json     # Evolving category tree (auto-updated)
    ├── graph/                # Knowledge graph entries (JSONL)
    │   └── grover_*.jsonl
    └── logs/                 # Pipeline logs & API call history
        ├── pipeline_*.log
        └── api_calls.jsonl

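The graph output is stored as JSONL, meaning one JSON object per line. The sketch below shows one way such a file could be written and read back. The record fields (`source`, `source_type`, `relation`, `target`, `target_type`), the helper names, and the sample entry are assumptions made for illustration; they are not the schema paper-brain actually emits.

```python
import json
from pathlib import Path


def write_graph_entries(path, entries):
    """Write each entry as one JSON object per line (JSONL)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_graph_entries(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical typed relation; field names and values are illustrative only.
example_entries = [
    {
        "source": "Grover algorithm",
        "source_type": "method",
        "relation": "addresses",
        "target": "search failure rate",
        "target_type": "concept",
    },
]
```

Because every line is an independent JSON document, entries from many papers can be concatenated or streamed without parsing a whole file at once.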
# Paper Card
## Metadata
- Title: Grover algorithm with zero theoretical failure rate
- Authors: G. L. Long
- Year: 2001
- Venue / Source: Physical Review A, Volume 64, 022307
## 1. Core Problem
...
## 2. Method
...
## 3. Key Equations
...

Pipeline architecture:

    PDF Input → Text Extraction (PyMuPDF) → Adaptive Chunking
                                                    │
                                          ┌─────────┴─────────┐
                                          │                   │
                                     Short path           Long path
                                    (single call)    (chunk → aggregate)
                                          │                   │
                                          └─────────┬─────────┘
                                                    │
                                           Local LLM Inference
                                           (127.0.0.1:8080/v1)
                                                    │
                         ┌───────────┬──────────────┼──────────────┐
                         │           │              │              │
                     Metadata   Paper Card      Taxonomy       Knowledge
                    Extraction  Generation   Classification   Graph Entry
                         │           │              │              │
                         └───────────┴──────────────┴──────────────┘
                                                    │
                                           Output Persistence
                                        (Markdown / JSONL / Log)

| Key | Default | Description |
|---|---|---|
| `llm.base_url` | `http://127.0.0.1:8080/v1` | Local LLM endpoint |
| `llm.api_key` | `local-not-needed` | API key (if required) |
| `llm.model` | (none) | Model name (e.g., Qwen3.6-35B-A3B.gguf) |
| `llm.temperature` | `0.2` | Generation temperature |
| `llm.max_tokens` | `8192` | Maximum output tokens (high for reasoning models) |
| `llm.timeout` | `600` | Request timeout in seconds |
| `llm.max_retries` | `3` | Retry attempts on failure |
| `chunking.max_chunk_size` | `8000` | Max characters per chunk for long papers |
| `chunking.overlap` | `100` | Overlap between chunks in characters |
| `pipeline.skip_existing` | `true` | Skip already-processed papers |
| `pipeline.log_api_calls` | `true` | Log every LLM API call to JSONL |

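
To illustrate how the defaults listed in the configuration table could combine with a partial config.yaml, here is a minimal sketch of a recursive merge. The `DEFAULTS` dictionary mirrors the documented keys; `merge_config` and `user_cfg` are hypothetical names, and the real loader may work differently.

```python
import copy

# Defaults as documented in the configuration table.
DEFAULTS = {
    "llm": {
        "base_url": "http://127.0.0.1:8080/v1",
        "api_key": "local-not-needed",
        "model": None,  # no default; the user must set this
        "temperature": 0.2,
        "max_tokens": 8192,
        "timeout": 600,
        "max_retries": 3,
    },
    "chunking": {"max_chunk_size": 8000, "overlap": 100},
    "pipeline": {"skip_existing": True, "log_api_calls": True},
}


def merge_config(defaults, overrides):
    """Recursively overlay user overrides on a copy of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# In practice this dict would come from parsing config.yaml with pyyaml
# (yaml.safe_load); it is inlined here to keep the sketch stdlib-only.
user_cfg = {"llm": {"model": "your-model-name.gguf"}, "chunking": {"overlap": 200}}
config = merge_config(DEFAULTS, user_cfg)
```

With this approach a config file only needs to list the keys that differ from the defaults, and the defaults themselves are never mutated.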
- PDF Text Extraction: Opens the PDF with PyMuPDF, iterates pages, extracts text per page.
- Adaptive Chunking: If text exceeds the chunk threshold, it's split at paragraph/sentence boundaries with configurable overlap.
- Metadata Extraction: The first ~3000 characters are sent to the LLM to extract title, authors, year, venue, and DOI.
- Paper Card Generation: The full text (or chunk summaries for long papers) is sent to the LLM with a structured prompt that produces a complete Paper Card.
- Taxonomy Classification: The Paper Card is classified into an existing taxonomy. If a paper doesn't fit, a new category is proposed.
- Knowledge Graph Entry: Entities (concepts, methods, phenomena) and typed relations are extracted as a JSON graph.
- Persistence: All outputs are saved to disk. API calls are logged for debugging.
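
The adaptive chunking step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function name `chunk_text`, the boundary search order (paragraph break, then sentence end, then any space), and the choice to search only the second half of each window are all hypothetical. The parameter defaults follow `chunking.max_chunk_size` and `chunking.overlap`.

```python
def chunk_text(text, max_chunk_size=8000, overlap=100):
    """Split text into chunks of at most max_chunk_size characters.

    Cuts at a paragraph break if one exists in the second half of the
    window, otherwise at a sentence end, otherwise at a space, otherwise
    at the hard limit. Consecutive chunks share `overlap` characters.
    """
    if overlap * 2 >= max_chunk_size:
        # Guarantees that every iteration moves the start index forward.
        raise ValueError("overlap must be less than half of max_chunk_size")
    if len(text) <= max_chunk_size:
        return [text]  # short path: the whole paper fits in one call

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for sep in ("\n\n", ". ", " "):
                cut = window.rfind(sep, max_chunk_size // 2)
                if cut != -1:
                    end = start + cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # step back so neighbouring chunks overlap
    return chunks
```

Each chunk after the first begins with the last `overlap` characters of its predecessor, so a sentence that straddles a cut still appears whole in at least one chunk when it is shorter than the overlap.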
- Warm up your LLM first: The pipeline sends a warm-up request on startup. Cold starts can take ~60s for large models.
- Prefer short config overrides: CLI arguments (`--input`, `--batch`, `--verbose`) override config.yaml.
- Reasoning models: If your LLM outputs content under `reasoning_content` rather than `content`, the client falls back automatically.
- Chunk tuning: For very long papers (100K+ chars), increase `max_chunk_size` and decrease `overlap` to reduce total API calls.
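
The `reasoning_content` fallback mentioned in the tips can be pictured with the sketch below. It assumes the response message behaves like an object with optional `content` and `reasoning_content` attributes, which is how some OpenAI-compatible servers expose reasoning output. The function name `extract_text` and the `SimpleNamespace` stand-ins are hypothetical and only imitate a real response.

```python
from types import SimpleNamespace


def extract_text(message):
    """Return message.content, or message.reasoning_content if content is empty."""
    content = getattr(message, "content", None)
    if content and content.strip():
        return content
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning and reasoning.strip():
        return reasoning
    raise ValueError("response has neither content nor reasoning_content")


# Stand-ins for the message object of a chat completion response.
normal = SimpleNamespace(content="# Paper Card ...", reasoning_content=None)
reasoning_only = SimpleNamespace(content="", reasoning_content="trace text ...")
```

Raising on a fully empty response, rather than returning an empty string, lets retry logic treat it like any other failed call.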
- Scanned-image PDFs are not supported (no OCR).
- The quality of extraction and classification depends on the underlying LLM's instruction-following ability.
- The taxonomy is initialized with 34 categories covering quantum computing, ML, and physics. Papers outside these domains will trigger new category proposals.
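
The new-category behaviour can be sketched as a small tree update. The layout used here (nested `children` dicts with a `papers` list per node), the function name `apply_classification`, and the sample identifiers are assumptions for illustration; the actual structure of taxonomy.json is not documented here.

```python
def apply_classification(taxonomy, category_path, paper_id):
    """Attach paper_id to the node at category_path, creating missing nodes.

    Returns True if at least one new category had to be created.
    """
    node = taxonomy
    created = False
    for name in category_path:
        children = node.setdefault("children", {})
        if name not in children:
            children[name] = {"papers": [], "children": {}}
            created = True
        node = children[name]
    papers = node.setdefault("papers", [])
    if paper_id not in papers:
        papers.append(paper_id)
    return created


# Hypothetical starting taxonomy with a single top-level category.
taxonomy = {"children": {"quantum computing": {"papers": [], "children": {}}}}
fits_existing = apply_classification(taxonomy, ["quantum computing"], "paper_a")
needs_new = apply_classification(taxonomy, ["biology", "genomics"], "paper_b")
```

Returning a flag for newly created categories gives the caller a natural place to log or review proposals before they become permanent.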
MIT