📚 AI-Powered Book Chapter Extraction & Summarization - Transform PDF/EPUB books into structured chapters with ChatGPT and Claude AI summaries

apathi/Book_Summarizer

📚 BookSummarizer - AI-Powered Book Chapter Extraction & Summarization

Transform books into structured chapters and AI-powered study guides

A Python system for processing PDF and EPUB books by extracting individual chapters and generating AI-powered study summaries using ChatGPT and Claude APIs.


💼 Technical Skills Showcase

🤖 AI/ML Engineering

  • Multi-LLM Integration: Dual AI provider architecture (OpenAI GPT-4o + Anthropic Claude-3.5-Sonnet)
  • Prompt Engineering: Sophisticated prompt templates with structured output formatting for study guide generation
  • Large Context Optimization: Handles 120K-180K token contexts with intelligent content validation
  • Token Budget Management: Dynamic content length validation and truncation strategies
  • Temperature Tuning: Optimized for consistent, factual outputs (temp=0.1)
  • API Security: Environment-based credential management, zero hardcoded secrets
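
The token budget idea above can be illustrated with a minimal sketch. The function names and the four-characters-per-token heuristic are assumptions made for illustration; the project may count tokens with a real tokenizer instead. Only the 120K and 180K limits come from this README.

```python
# Hypothetical sketch: names and the chars-per-token heuristic are assumptions,
# not taken from the project's code.
CONTEXT_LIMITS = {"chatgpt": 120_000, "claude": 180_000}  # figures quoted in this README
CHARS_PER_TOKEN = 4  # rough heuristic for English text


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count (rounded up)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fit_to_budget(text: str, provider: str, reserve: int = 8_000) -> tuple[str, bool]:
    """Return (text, truncated) so that the text fits the provider's budget.

    `reserve` leaves room for the prompt template and the model's reply.
    """
    budget = CONTEXT_LIMITS[provider] - reserve
    if estimate_tokens(text) <= budget:
        return text, False
    # Cut on a character boundary that corresponds to the token budget.
    return text[: budget * CHARS_PER_TOKEN], True
```

A caller could log a warning whenever the second return value is `True`, so that truncated chapters are easy to spot in verbose mode.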

🏗️ Software Architecture & Design Patterns

  • Orchestrator Pattern: Ultra-minimal 45-line main orchestrator with specialized processors
  • Dependency Injection: Clean separation of concerns across 9 specialized modules
  • Factory Pattern: Dynamic processor selection based on file format detection
  • Strategy Pattern: Pluggable AI providers with consistent interface
  • Modular Design: 3,612 LOC across 14 modules with single-responsibility principle
  • Clean Code: Type hints, comprehensive docstrings, PEP 8 compliance
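
As a rough illustration of the factory idea, the sketch below picks a processor class from the file extension. The class names and the `create_processor` helper are hypothetical stand-ins, not the project's actual API.

```python
from pathlib import Path


class PdfProcessor:
    """Stand-in for the PDF workflow (hypothetical)."""

    def process(self, path: Path) -> str:
        return f"pdf workflow for {path.name}"


class EpubProcessor:
    """Stand-in for the EPUB workflow (hypothetical)."""

    def process(self, path: Path) -> str:
        return f"epub workflow for {path.name}"


# Registry mapping a lowercase file suffix to the processor that handles it.
PROCESSORS = {".pdf": PdfProcessor, ".epub": EpubProcessor}


def create_processor(path: str):
    """Return a processor instance chosen by file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PROCESSORS[suffix]()
    except KeyError:
        raise ValueError(f"Unsupported book format: {suffix or '(none)'}") from None
```

With a registry like this, an orchestrator only needs `create_processor(path).process(Path(path))`, which is one way a main module can stay very short.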

📊 Document Processing & Data Engineering

  • Multi-Format Parsing: PDF (PyPDF2, pdfplumber, PyMuPDF) and EPUB (ebooklib) processing
  • HTML Parsing: BeautifulSoup4 for EPUB content extraction and cleaning
  • TOC Algorithms: Custom table-of-contents parsing with multi-page support
  • Chapter Detection: Intelligent boundary detection with section hierarchy analysis
  • Image Preservation: EPUB image extraction and PDF conversion pipeline
  • Browser Automation: Playwright-based HTML→PDF conversion with image retention
  • Batch Processing: Recursive directory traversal with parallel processing capability
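
To make the TOC parsing and boundary detection bullets more concrete, here is a simplified sketch. It assumes TOC lines of the form "Title ..... 12" (a title, dot or space leaders, then a page number); real books vary, and the project's parser is likely more involved. All names are illustrative.

```python
import re

# A title, then at least two dots/spaces as leaders, then a trailing page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]{2,}(?P<page>\d+)\s*$")


def parse_toc(lines):
    """Return (title, start_page) pairs for lines that look like TOC entries."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries


def chapter_ranges(entries, total_pages):
    """Turn start pages into (title, first_page, last_page) triples.

    Each chapter ends one page before the next one starts; the last chapter
    runs to the end of the book.
    """
    ranges = []
    for index, (title, start) in enumerate(entries):
        if index + 1 < len(entries):
            end = entries[index + 1][1] - 1
        else:
            end = total_pages
        ranges.append((title, start, end))
    return ranges
```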

🔧 DevOps & Quality Assurance

  • Regression Testing: Custom test framework with baseline comparisons (test_regression.py)
  • CI/CD Ready: Proper exit codes, automated testing, version control integration
  • Error Handling: Comprehensive validation with user-friendly error messages
  • Logging System: Verbose mode for debugging and production monitoring
  • Documentation: 6 comprehensive guides (Architecture, Technical Specs, Installation, Usage, Troubleshooting)
  • Report Generation: Automated JSON reports with processing metadata and statistics
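
A report of this kind could be written roughly as follows. The field names are assumptions chosen for illustration; only the `<book>_processing_report.json` file name pattern is taken from the example output shown later in this README.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_report(output_dir, book_name, chapters):
    """Write a JSON processing report and return its path.

    `chapters` is a list of plain dicts; the keys used here are hypothetical.
    """
    report = {
        "book": book_name,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "chapter_count": len(chapters),
        "chapters": chapters,
    }
    path = Path(output_dir) / f"{book_name}_processing_report.json"
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```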

🛠️ Full-Stack Development Skills

  • CLI Development: Argparse-based interface with batch mode and recursive processing
  • File System Operations: Cross-platform path handling, directory management, cleanup
  • Environment Management: python-dotenv for configuration, .env file support
  • Image Processing: Pillow for format conversion and manipulation
  • Markdown Generation: Dynamic study guide creation with Notion-compatible formatting
  • API Integration: RESTful API consumption with proper error handling and retry logic
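
The summarizer command line used later in this README (`--batch`, `--recursive`, `-r`) could be declared with argparse along these lines. This is a sketch, not the project's parser: treating `-r` as the short form of `--recursive` and accepting `--verbose` here are assumptions inferred from the usage examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser for a chapter summarizer CLI."""
    parser = argparse.ArgumentParser(
        description="Summarize chapter PDFs (illustrative sketch)."
    )
    parser.add_argument("path", help="Chapter PDF, or a directory when --batch is set")
    parser.add_argument("--batch", action="store_true",
                        help="Treat PATH as a directory of chapter PDFs")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="Descend into subdirectories in batch mode")
    parser.add_argument("--verbose", action="store_true",
                        help="Print progress details")
    return parser
```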

✨ Book Summarizer - Key Highlights

  • 📄 PDF Processing: Extract chapters from PDFs with table of contents parsing
  • 📚 EPUB Support: Process EPUB books with image preservation and smart chapter detection
  • 🎯 Production-Ready: Comprehensive regression testing with baselines for backward compatibility
  • 🧠 Dual AI Support: Generate study guides using ChatGPT (120K context) & Claude (180K context)
  • 📊 Smart Processing: 8-step PDF workflow, 4-step EPUB workflow with intelligent detection
  • 🖼️ Image Intelligence: EPUB images automatically preserved during PDF conversion
  • 📁 Auto-Organization: Section-based folders (A, B, C) or flat structure based on book type
  • 🔧 Modular Design: Ultra-minimal 45-line orchestrator with specialized processors
  • 🧪 Quality Assured: Regression tests prevent processing errors across 50+ chapters

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • Virtual environment (.venv/)
  • OpenAI and/or Anthropic API keys for summarization

Key Dependencies

  • PyPDF2, pdfplumber - PDF processing
  • ebooklib, beautifulsoup4 - EPUB processing
  • playwright - HTML→PDF conversion
  • openai, anthropic - AI summarization
  • python-dotenv - Environment management

Installation

# Activate virtual environment
source .venv/bin/activate

# Install dependencies  
pip install -r requirements.txt
playwright install chromium

# Set up API keys
cp .env.example .env
# Edit .env with your API keys
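
Before running the summarizers, it can help to confirm which API keys are actually visible to Python. The sketch below assumes the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, which are the defaults the official SDKs read; check `.env.example` for the names this project really uses. It also assumes the `.env` file has already been loaded into the environment (for example by python-dotenv).

```python
import os

# Assumed variable names (the SDK defaults); the project's .env.example may differ.
REQUIRED_KEYS = {"chatgpt": "OPENAI_API_KEY", "claude": "ANTHROPIC_API_KEY"}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, key in REQUIRED_KEYS.items() if env.get(key, "").strip()]


if __name__ == "__main__":
    print("Providers with keys configured:", available_providers() or "none")
```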

Basic Usage

Process a Book

# Process PDF book
python book_processor.py "books/your-book.pdf" --output "book_chapters" --verbose

# Process EPUB book  
python book_processor.py "books/your-book.epub" --output "book_chapters" --verbose

Generate AI Summaries

cd AI_summarizer

# Single chapter
python chatgpt_summarizer.py "../book_chapters/book-name_chapters/Chapter_01-Title.pdf"
python claude_summarizer.py "../book_chapters/book-name_chapters/Chapter_01-Title.pdf"

# Batch process all chapters
python chatgpt_summarizer.py "../book_chapters/book-name_chapters" --batch --recursive
python claude_summarizer.py "../book_chapters/book-name_chapters" --batch --recursive

Example workflow

# Extract chapters, then summarize them with both providers
python book_processor.py "books/cracking-the-pm-career.pdf" --output "study_materials" --verbose
cd AI_summarizer  
python chatgpt_summarizer.py "../study_materials/cracking-the-pm-career_chapters" --batch -r
python claude_summarizer.py "../study_materials/cracking-the-pm-career_chapters" --batch -r

Result:

study_materials/cracking-the-pm-career_chapters/
├── A._Foreword/
│   ├── Chapter_01-Introduction.pdf
│   ├── Chapter_01-Introduction_chatgpt_summary.md
│   └── Chapter_01-Introduction_claude_summary.md
├── C._Product_Skills/
│   └── [more chapters with summaries...]
└── cracking-the-pm-career_processing_report.json
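
The naming convention visible in this tree (each summary sits next to its chapter PDF, with `_<provider>_summary.md` appended to the stem) can be expressed in a few lines. The helper name is hypothetical; the pattern itself is read off the listing above.

```python
from pathlib import Path


def summary_path(chapter_pdf, provider):
    """Return the Markdown summary path that sits beside a chapter PDF."""
    pdf = Path(chapter_pdf)
    return pdf.with_name(f"{pdf.stem}_{provider}_summary.md")
```

A batch run could use such a helper to skip chapters whose summary file already exists.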

Run Regression Tests

# Test all books for backward compatibility
python test_regression.py

# Verbose output for debugging
python test_regression.py --verbose

# Test specific book only
python test_regression.py --book cracking-the-pm-career

📁 Project Structure

BookProcessor/
├── book_processor.py              # 🚀 Main CLI entry point
├── test_regression.py             # 🧪 Regression test suite
├── AI_summarizer/                 # 🤖 AI integration tools
│   ├── chatgpt_summarizer.py     # OpenAI ChatGPT integration
│   ├── claude_summarizer.py      # Anthropic Claude integration
│   ├── pdf_text_extractor.py     # PDF text extraction utilities
│   └── prompt_template.py        # Shared AI prompt templates
├── book_processing/               # ⚙️ Core processing engine
│   ├── main.py                   # Processing orchestrator (45 lines)
│   ├── pdf_processor.py          # PDF workflow (200 lines)
│   ├── epub_processor.py         # EPUB workflow with image extraction (400 lines)
│   ├── toc_parser.py             # Table of Contents parsing
│   ├── chapter_detector.py       # Chapter detection algorithms
│   ├── epub_image_extractor.py   # EPUB image extraction & processing
│   ├── html_to_pdf_converter.py  # HTML→PDF conversion (Playwright/WeasyPrint)
│   ├── report_generator.py       # Processing reports & summaries
│   └── utils.py                  # Shared utility functions
├── books/                         # 📚 Source PDF/EPUB books
├── book_chapters/                 # 📁 Processed output (chapters & summaries)
├── docs/                          # 📖 Comprehensive documentation (6 guides)
└── requirements.txt               # 📦 Python dependencies

🏗️ System Architecture

Clean, modular design with clear separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│                    CLI Entry Point                          │
│                 book_processor.py                           │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                 Orchestrator                                │
│              book_processing/main.py                        │
│                   (45 lines)                                │
└─────────────┬───────────────────────┬───────────────────────┘
              │                       │
              ▼                       ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│     PDF Processor       │ │    EPUB Processor       │
│   Complete Workflow     │ │   Complete Workflow     │
│     (200 lines)         │ │     (400 lines)         │
└─────────┬───────────────┘ └─────────┬───────────────┘
          │                           │
          ▼                           ▼
┌─────────────────────────────────────────────────────────────┐
│                 Shared Components                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ TOC Parser  │ │Chapter      │ │Report       │            │
│  │             │ │Detector     │ │Generator    │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │EPUB Image   │ │HTML→PDF     │ │Utilities    │            │
│  │Extractor    │ │Converter    │ │             │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
└─────────────────────────────────────────────────────────────┘

Design Principles:

  • Delegation Pattern: Main orchestrator delegates to specialized processors
  • Single Responsibility: Each module has one clear purpose
  • Dependency Injection: Processors receive dependencies at initialization
  • Error Boundaries: Graceful degradation with informative error messages

🔧 Supported Book Formats

PDF Books

  • Sectioned Structure: Books with hierarchical organization (A. Section, B. Section)
    • Example: cracking-the-pm-career.pdf
  • Flat Structure: Simple chapter sequences (Chapter 1, Chapter 2)
    • Example: the-pm-interview.pdf
  • Part Structure: Books with Part I, Part II organization
    • Example: AI Product Managers Handbook
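
One simple way to tell these three layouts apart is to inspect the TOC titles. The sketch below is a heuristic written for illustration, assuming section titles look like "A. Foreword" and part titles like "Part I"; it is not the project's detection logic.

```python
import re

SECTION = re.compile(r"^[A-Z]\.\s+\S")      # e.g. "A. Foreword"
PART = re.compile(r"^Part\s+[IVXLC]+\b")    # e.g. "Part I", "Part IV"


def classify_structure(titles):
    """Return 'part', 'sectioned', or 'flat' for a list of TOC titles."""
    if any(PART.match(title) for title in titles):
        return "part"
    if any(SECTION.match(title) for title in titles):
        return "sectioned"
    return "flat"
```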

EPUB Books

  • Direct Extraction: Chapter-by-chapter processing
  • Image Preservation: Embedded images converted to PDFs
  • Smart Classification: Distinguishes chapters from supporting content
    • Example: decode-and-conquer.epub

🤖 AI Summarization

Supported Providers

  • OpenAI ChatGPT: GPT-4o model (120K context, 4K output)
  • Anthropic Claude: Claude-3.5-Sonnet (180K context, 8K output)
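
The pluggable-provider idea can be sketched as a small strategy object that carries each provider's limits plus a callable that produces the summary. Everything here is illustrative: the `Provider` type, the prompt text, and the offline stand-in callables are assumptions, and the output limits are rounded from the "4K" and "8K" figures above. A real implementation would call the OpenAI or Anthropic SDK inside `complete`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Provider:
    name: str
    context_tokens: int
    max_output_tokens: int
    complete: Callable[[str], str]  # takes a prompt, returns Markdown text


def summarize(chapter_text: str, provider: Provider) -> str:
    """Build the prompt once and hand it to whichever provider was chosen."""
    prompt = f"Write a study guide for this chapter:\n\n{chapter_text}"
    return provider.complete(prompt)


# Offline stand-ins so the sketch runs without API keys or network access.
PROVIDERS = {
    "chatgpt": Provider("chatgpt", 120_000, 4_000, lambda p: f"[chatgpt] {len(p)} chars"),
    "claude": Provider("claude", 180_000, 8_000, lambda p: f"[claude] {len(p)} chars"),
}
```

Because `summarize` depends only on the `Provider` interface, adding a third provider would mean adding one more entry to the registry.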

Output Format

  • Notion-Ready: Direct import to Notion workspaces
  • Study-Optimized: Frameworks, examples, and memory aids
  • Markdown Format: Clean, portable documentation

Sample Output

See sample_output/ folder for example chapter summaries generated by both ChatGPT and Claude.

📖 Documentation

See /docs folder for comprehensive documentation:

  • Architecture & Data Flow
  • Technical Specifications
  • Installation & Setup Guide
  • Usage Examples & Workflows
  • Troubleshooting Guide

🔒 Security

  • API keys stored in .env files (excluded from version control)
  • Input validation for file formats and accessibility
  • Automatic cleanup of temporary files
  • Error handling with graceful degradation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAI for ChatGPT API
  • Anthropic for Claude API
  • PyPDF2/pdfplumber for PDF processing
  • ebooklib for EPUB handling
  • Playwright for HTML→PDF conversion

⭐ Status: Production-ready for all book formats with comprehensive regression testing

⭐ If this project helps you, please consider giving it a star!
