Skip to content

Suxeca/paper-brain

Repository files navigation

paper-brain

Local-only research paper processing pipeline for offline academic literature analysis.

Run a local LLM, feed it PDFs, and get structured Paper Cards, taxonomy classification, and knowledge graph entries — all without sending data to any cloud API.


Features

  • 100% local: No cloud models, no external API calls. All LLM inference runs on your machine (http://127.0.0.1:8080/v1).
  • PDF text extraction: Uses PyMuPDF (fitz) to extract text from academic PDFs.
  • Paper Card generation: Produces structured Markdown summaries with metadata, methods, equations, results, and limitations.
  • Adaptive chunking: Automatically handles long papers by chunking → per-chunk analysis → aggregation.
  • Taxonomy classification: Classifies papers into a growing taxonomy of 34+ categories.
  • Knowledge graph construction: Extracts typed entity-relation graphs from each paper.
  • Reasoning model aware: Falls back to reasoning_content for models that output CoT traces.

Installation

Requirements

Install

pip install paper-brain

Or install from source:

git clone https://github.com/your-username/paper-brain.git
cd paper-brain
pip install -e .

Dependencies

Automatically installed:

  • openai>=1.0.0 — API-compatible client for the local LLM endpoint
  • PyMuPDF>=1.23.0 — PDF text extraction
  • pyyaml>=6.0 — Configuration file parsing
  • tenacity>=8.0.0 — Retry logic for LLM calls

Quick Start

1. Configure

Copy the example config and adjust:

cp config.yaml.example config.yaml

Edit config.yaml to point to your local LLM:

llm:
  base_url: "http://127.0.0.1:8080/v1"
  api_key: "your-api-key-here"        # often "local-not-needed"
  model: "your-model-name.gguf"
  temperature: 0.2
  max_tokens: 8192
  timeout: 600

2. Process a single paper

paper-brain --input path/to/paper.pdf

3. Batch process a directory

paper-brain --batch path/to/pdf/library/

4. Verbose mode (debug logging)

paper-brain --input paper.pdf --verbose

Output Structure

data/
├── paper_cards/          # Structured Markdown summaries
│   └── grover_*.md
├── taxonomy/
│   └── taxonomy.json     # Evolving category tree (auto-updated)
├── graph/                # Knowledge graph entries (JSONL)
│   └── grover_*.jsonl
└── logs/                 # Pipeline logs & API call history
    ├── pipeline_*.log
    └── api_calls.jsonl

Sample Paper Card output

# Paper Card

## Metadata
- Title: Grover algorithm with zero theoretical failure rate
- Authors: G. L. Long
- Year: 2001
- Venue / Source: Physical Review A, Volume 64, 022307

## 1. Core Problem
...

## 2. Method
...

## 3. Key Equations
...

Architecture

PDF Input → Text Extraction (PyMuPDF) → Adaptive Chunking
                                              │
                                    ┌─────────┴─────────┐
                                    │                   │
                              Short path          Long path
                              (single call)   (chunk → aggregate)
                                    │                   │
                                    └─────────┬─────────┘
                                              │
                                    Local LLM Inference
                                    (127.0.0.1:8080/v1)
                                              │
                    ┌───────────┬──────────────┼──────────────┐
                    │           │              │              │
               Metadata    Paper Card     Taxonomy      Knowledge
             Extraction   Generation   Classification   Graph Entry
                    │           │              │              │
                    └───────────┴──────────────┴──────────────┘
                                              │
                                       Output Persistence
                                   (Markdown / JSONL / Log)

Configuration Reference

Key Default Description
llm.base_url http://127.0.0.1:8080/v1 Local LLM endpoint
llm.api_key local-not-needed API key (if required)
llm.model Model name (e.g., Qwen3.6-35B-A3B.gguf)
llm.temperature 0.2 Generation temperature
llm.max_tokens 8192 Maximum output tokens (high for reasoning models)
llm.timeout 600 Request timeout in seconds
llm.max_retries 3 Retry attempts on failure
chunking.max_chunk_size 8000 Max characters per chunk for long papers
chunking.overlap 100 Overlap between chunks in characters
pipeline.skip_existing true Skip already-processed papers
pipeline.log_api_calls true Log every LLM API call to JSONL

How It Works

  1. PDF Text Extraction: Opens the PDF with PyMuPDF, iterates pages, extracts text per page.
  2. Adaptive Chunking: If text exceeds the chunk threshold, it's split at paragraph/sentence boundaries with configurable overlap.
  3. Metadata Extraction: The first ~3000 characters are sent to the LLM to extract title, authors, year, venue, and DOI.
  4. Paper Card Generation: The full text (or chunk summaries for long papers) is sent to the LLM with a structured prompt that produces a complete Paper Card.
  5. Taxonomy Classification: The Paper Card is classified into an existing taxonomy. If a paper doesn't fit, a new category is proposed.
  6. Knowledge Graph Entry: Entities (concepts, methods, phenomena) and typed relations are extracted as a JSON graph.
  7. Persistence: All outputs are saved to disk. API calls are logged for debugging.

Tips

  • Warm up your LLM first: The pipeline sends a warm-up request on startup. Cold starts can take ~60s for large models.
  • Prefer short config overrides: CLI arguments (--input, --batch, --verbose) override config.yaml.
  • Reasoning models: If your LLM outputs content under reasoning_content rather than content, the client auto-falls back.
  • Chunk tuning: For very long papers (100K+ chars), increase max_chunk_size and decrease overlap to reduce total API calls.

Limitations

  • Scanned-image PDFs are not supported (no OCR).
  • The quality of extraction and classification depends on the underlying LLM's instruction-following ability.
  • The taxonomy is initialized with 34 categories covering quantum computing, ML, and physics. Papers outside these domains will trigger new category proposals.

License

MIT

About

Local-only research paper processing pipeline for offline academic literature analysis

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages