Transform books into structured chapters and AI-powered study guides
A Python system that splits PDF and EPUB books into individual chapters and generates a study summary for each chapter using the OpenAI (ChatGPT) and Anthropic (Claude) APIs.
- Multi-LLM Integration: Dual AI provider architecture (OpenAI GPT-4o + Anthropic Claude-3.5-Sonnet)
- Prompt Engineering: Sophisticated prompt templates with structured output formatting for study guide generation
- Large Context Optimization: Handles 120K-180K token contexts with intelligent content validation
- Token Budget Management: Dynamic content length validation and truncation strategies
- Temperature Tuning: Optimized for consistent, factual outputs (temp=0.1)
- API Security: Environment-based credential management, zero hardcoded secrets
- Orchestrator Pattern: Ultra-minimal 45-line main orchestrator with specialized processors
- Dependency Injection: Clean separation of concerns across 9 specialized modules
- Factory Pattern: Dynamic processor selection based on file format detection
- Strategy Pattern: Pluggable AI providers with consistent interface
- Modular Design: 3,612 LOC across 14 modules with single-responsibility principle
- Clean Code: Type hints, comprehensive docstrings, PEP 8 compliance
- Multi-Format Parsing: PDF (PyPDF2, pdfplumber, PyMuPDF) and EPUB (ebooklib) processing
- HTML Parsing: BeautifulSoup4 for EPUB content extraction and cleaning
- TOC Algorithms: Custom table-of-contents parsing with multi-page support
- Chapter Detection: Intelligent boundary detection with section hierarchy analysis
- Image Preservation: EPUB image extraction and PDF conversion pipeline
- Browser Automation: Playwright-based HTML→PDF conversion with image retention
- Batch Processing: Recursive directory traversal with parallel processing capability
- Regression Testing: Custom test framework with baseline comparisons (test_regression.py)
- CI/CD Ready: Proper exit codes, automated testing, version control integration
- Error Handling: Comprehensive validation with user-friendly error messages
- Logging System: Verbose mode for debugging and production monitoring
- Documentation: 6 comprehensive guides (Architecture, Technical Specs, Installation, Usage, Troubleshooting)
- Report Generation: Automated JSON reports with processing metadata and statistics
- CLI Development: Argparse-based interface with batch mode and recursive processing
- File System Operations: Cross-platform path handling, directory management, cleanup
- Environment Management: python-dotenv for configuration, .env file support
- Image Processing: Pillow for format conversion and manipulation
- Markdown Generation: Dynamic study guide creation with Notion-compatible formatting
- API Integration: RESTful API consumption with proper error handling and retry logic
- PDF Processing: Extract chapters from PDFs with table of contents parsing
- EPUB Support: Process EPUB books with image preservation and smart chapter detection
- Production-Ready: Comprehensive regression testing with baselines for backward compatibility
- Dual AI Support: Generate study guides using ChatGPT (120K context) and Claude (180K context)
- Smart Processing: 8-step PDF workflow, 4-step EPUB workflow with intelligent detection
- Image Intelligence: EPUB images automatically preserved during PDF conversion
- Auto-Organization: Section-based folders (A, B, C) or flat structure based on book type
- Modular Design: Ultra-minimal 45-line orchestrator with specialized processors
- Quality Assured: Regression tests prevent processing errors across 50+ chapters
- Python 3.12+
- Virtual environment (.venv/)
- OpenAI and/or Anthropic API keys for summarization

- PyPDF2, pdfplumber - PDF processing
- ebooklib, beautifulsoup4 - EPUB processing
- playwright - HTML→PDF conversion
- openai, anthropic - AI summarization
- python-dotenv - Environment management
# Activate virtual environment
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
playwright install chromium
# Set up API keys
cp .env.example .env
# Edit .env with your API keys

# Process PDF book
python book_processor.py "books/your-book.pdf" --output "book_chapters" --verbose
# Process EPUB book
python book_processor.py "books/your-book.epub" --output "book_chapters" --verbose

cd AI_summarizer
# Single chapter
python chatgpt_summarizer.py "../book_chapters/book-name_chapters/Chapter_01-Title.pdf"
python claude_summarizer.py "../book_chapters/book-name_chapters/Chapter_01-Title.pdf"
# Batch process all chapters
python chatgpt_summarizer.py "../book_chapters/book-name_chapters" --batch --recursive
python claude_summarizer.py "../book_chapters/book-name_chapters" --batch --recursive

# Example workflow
python book_processor.py "books/cracking-the-pm-career.pdf" --output "study_materials" --verbose
cd AI_summarizer
python chatgpt_summarizer.py "../study_materials/cracking-the-pm-career_chapters" --batch -r
python claude_summarizer.py "../study_materials/cracking-the-pm-career_chapters" --batch -r

Result:
study_materials/cracking-the-pm-career_chapters/
├── A._Foreword/
│   ├── Chapter_01-Introduction.pdf
│   ├── Chapter_01-Introduction_chatgpt_summary.md
│   └── Chapter_01-Introduction_claude_summary.md
├── C._Product_Skills/
│   └── [more chapters with summaries...]
└── cracking-the-pm-career_processing_report.json
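
The batch commands and the result tree above imply a simple naming scheme: each chapter PDF gets a sibling `_chatgpt_summary.md` or `_claude_summary.md` file. The sketch below shows one way that discovery and naming could be done with `pathlib`. The function names `find_chapter_pdfs` and `summary_path` are hypothetical and are not taken from the project's code.

```python
from pathlib import Path


def find_chapter_pdfs(root: Path, recursive: bool = True) -> list[Path]:
    """Return chapter PDFs under root, sorted for a stable processing order."""
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(p for p in root.glob(pattern) if p.is_file())


def summary_path(chapter_pdf: Path, provider: str) -> Path:
    """Map Chapter_01-Title.pdf to Chapter_01-Title_<provider>_summary.md."""
    return chapter_pdf.with_name(f"{chapter_pdf.stem}_{provider}_summary.md")


if __name__ == "__main__":
    # Illustrative path taken from the example workflow above.
    root = Path("study_materials/cracking-the-pm-career_chapters")
    for pdf in find_chapter_pdfs(root):
        print(pdf, "->", summary_path(pdf, "chatgpt"))
```

Writing the summary next to its source chapter keeps each section folder self-contained, which matches the layout shown in the result tree.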
# Test all books for backward compatibility
python test_regression.py
# Verbose output for debugging
python test_regression.py --verbose
# Test specific book only
python test_regression.py --book cracking-the-pm-career

BookProcessor/
├── book_processor.py              # Main CLI entry point
├── test_regression.py             # Regression test suite
├── AI_summarizer/                 # AI integration tools
│   ├── chatgpt_summarizer.py      # OpenAI ChatGPT integration
│   ├── claude_summarizer.py       # Anthropic Claude integration
│   ├── pdf_text_extractor.py      # PDF text extraction utilities
│   └── prompt_template.py         # Shared AI prompt templates
├── book_processing/               # Core processing engine
│   ├── main.py                    # Processing orchestrator (45 lines)
│   ├── pdf_processor.py           # PDF workflow (200 lines)
│   ├── epub_processor.py          # EPUB workflow with image extraction (400 lines)
│   ├── toc_parser.py              # Table of Contents parsing
│   ├── chapter_detector.py        # Chapter detection algorithms
│   ├── epub_image_extractor.py    # EPUB image extraction & processing
│   ├── html_to_pdf_converter.py   # HTML→PDF conversion (Playwright/WeasyPrint)
│   ├── report_generator.py        # Processing reports & summaries
│   └── utils.py                   # Shared utility functions
├── books/                         # Source PDF/EPUB books
├── book_chapters/                 # Processed output (chapters & summaries)
├── docs/                          # Comprehensive documentation (6 guides)
└── requirements.txt               # Python dependencies
Clean, modular design with clear separation of concerns:
┌─────────────────────────────────────────────────────────────┐
│                       CLI Entry Point                       │
│                      book_processor.py                      │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                        Orchestrator                         │
│                   book_processing/main.py                   │
│                         (45 lines)                          │
└─────────────┬───────────────────────┬───────────────────────┘
              │                       │
              ▼                       ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│      PDF Processor      │ │     EPUB Processor      │
│    Complete Workflow    │ │    Complete Workflow    │
│       (200 lines)       │ │       (400 lines)       │
└─────────┬───────────────┘ └─────────┬───────────────┘
          │                           │
          ▼                           ▼
┌─────────────────────────────────────────────────────────────┐
│                      Shared Components                      │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ TOC Parser  │ │Chapter      │ │Report       │            │
│  │             │ │Detector     │ │Generator    │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │EPUB Image   │ │HTML→PDF     │ │Utilities    │            │
│  │Extractor    │ │Converter    │ │             │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
└─────────────────────────────────────────────────────────────┘
Design Principles:
- Delegation Pattern: Main orchestrator delegates to specialized processors
- Single Responsibility: Each module has one clear purpose
- Dependency Injection: Processors receive dependencies at initialization
- Error Boundaries: Graceful degradation with informative error messages
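
To make the delegation and dependency-injection principles above concrete, here is a minimal sketch of an orchestrator that picks a processor by file extension. All class and method names (`BookProcessor`, `PdfProcessor`, `EpubProcessor`, `Orchestrator`, `process`, `run`) are illustrative assumptions; the real `book_processing/main.py` may be organized differently.

```python
from pathlib import Path
from typing import Protocol


class BookProcessor(Protocol):
    """Interface every format-specific processor is assumed to share."""

    def process(self, book: Path, output_dir: Path) -> str: ...


class PdfProcessor:
    def process(self, book: Path, output_dir: Path) -> str:
        # Placeholder for the real PDF workflow.
        return f"PDF workflow: {book.name} -> {output_dir}"


class EpubProcessor:
    def process(self, book: Path, output_dir: Path) -> str:
        # Placeholder for the real EPUB workflow.
        return f"EPUB workflow: {book.name} -> {output_dir}"


class Orchestrator:
    """Delegates to a processor chosen by file extension."""

    def __init__(self, processors: dict[str, BookProcessor]) -> None:
        # Processors are injected, so tests can pass in stubs.
        self._processors = processors

    def run(self, book: Path, output_dir: Path) -> str:
        suffix = book.suffix.lower()
        try:
            processor = self._processors[suffix]
        except KeyError:
            raise ValueError(f"Unsupported format: {suffix or 'none'}") from None
        return processor.process(book, output_dir)


orchestrator = Orchestrator({".pdf": PdfProcessor(), ".epub": EpubProcessor()})
print(orchestrator.run(Path("books/your-book.pdf"), Path("book_chapters")))
```

Keeping the extension-to-processor mapping outside the orchestrator means a new format can be supported by registering one more entry, without touching the dispatch logic.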
- Sectioned Structure: Books with hierarchical organization (A. Section, B. Section)
  - Example: cracking-the-pm-career.pdf
- Flat Structure: Simple chapter sequences (Chapter 1, Chapter 2)
  - Example: the-pm-interview.pdf
- Part Structure: Books with Part I, Part II organization
  - Example: AI Product Managers Handbook
- Direct Extraction: Chapter-by-chapter processing
- Image Preservation: Embedded images converted to PDFs
- Smart Classification: Distinguishes chapters from supporting content
- Example: decode-and-conquer.epub
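
The three PDF layouts listed above (sectioned, flat, part) can in principle be told apart from the top-level table-of-contents titles alone. The following is a rough sketch of that idea using regular expressions. The patterns and the function name `classify_structure` are assumptions made for illustration and do not reflect the project's actual detection logic.

```python
import re

# "A. Foreword", "B. The PM Role": a capital letter, a dot, then a title.
SECTION_RE = re.compile(r"^[A-Z]\.\s+\S")
# "Part I", "Part IV", "Part 2": Roman or Arabic numbering.
PART_RE = re.compile(r"^Part\s+([IVXLC]+|\d+)\b")


def classify_structure(toc_titles: list[str]) -> str:
    """Return 'part', 'sectioned', or 'flat' for a list of top-level TOC titles."""
    titles = [t.strip() for t in toc_titles]
    if any(PART_RE.match(t) for t in titles):
        return "part"
    if any(SECTION_RE.match(t) for t in titles):
        return "sectioned"
    return "flat"


print(classify_structure(["A. Foreword", "B. The PM Role"]))
print(classify_structure(["Part I: Foundations", "Chapter 1"]))
print(classify_structure(["Chapter 1", "Chapter 2"]))
```

A result of `sectioned` would correspond to the section-based output folders (A, B, C), while `flat` would map to a single folder of chapters.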
- OpenAI ChatGPT: GPT-4o model (120K context, 4K output)
- Anthropic Claude: Claude-3.5-Sonnet (180K context, 8K output)
- Notion-Ready: Direct import to Notion workspaces
- Study-Optimized: Frameworks, examples, and memory aids
- Markdown Format: Clean, portable documentation
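
The context limits quoted above (120K for ChatGPT, 180K for Claude) mean a chapter may need to be validated, and sometimes truncated, before it is sent. Below is a minimal sketch of that check. It assumes a rough four-characters-per-token heuristic instead of a real tokenizer, and the `prompt_overhead` reserve and the function names are hypothetical.

```python
# Input budgets as quoted in this README.
CONTEXT_LIMITS = {"chatgpt": 120_000, "claude": 180_000}
# Rough heuristic for English text; a real tokenizer would be more accurate.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count using ceiling division on the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fit_to_budget(text: str, provider: str, prompt_overhead: int = 2_000) -> tuple[str, bool]:
    """Return (text, was_truncated), cutting at a paragraph break when possible."""
    budget = CONTEXT_LIMITS[provider] - prompt_overhead  # assumed reserve for the prompt template
    if estimate_tokens(text) <= budget:
        return text, False
    max_chars = budget * CHARS_PER_TOKEN
    cut = text.rfind("\n\n", 0, max_chars)  # last paragraph break inside the budget
    if cut <= 0:
        cut = max_chars  # no break found: fall back to a hard cut
    return text[:cut].rstrip(), True
```

Cutting at a paragraph boundary avoids sending a half sentence to the model, at the cost of dropping slightly more text than a hard cut would.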
See the sample_output/ folder for example chapter summaries generated by both ChatGPT and Claude.
See the docs/ folder for comprehensive documentation:
- Architecture & Data Flow
- Technical Specifications
- Installation & Setup Guide
- Usage Examples & Workflows
- Troubleshooting Guide
- API keys stored in .env files (excluded from version control)
- Input validation for file formats and accessibility
- Automatic cleanup of temporary files
- Error handling with graceful degradation
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for ChatGPT API
- Anthropic for Claude API
- PyPDF2/pdfplumber for PDF processing
- ebooklib for EPUB handling
- Playwright for HTML→PDF conversion

Status: Production-ready for all book formats with comprehensive regression testing

If this project helps you, please consider giving it a star!