From bf28229d9a92e35248b23364dbbff1458e1c2b4c Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 14 Feb 2026 08:30:02 +0000 Subject: [PATCH] Add PDF to Markdown Claude Code skill - Create skill/ directory with complete Claude Code skill implementation - Add pdf2md.py: standalone PDF extraction script (no external LLM API) - Add skill metadata files (JSON and YAML formats) - Add comprehensive prompts for Markdown conversion - Add detailed documentation (README, USAGE_GUIDE, INDEX) - Include conversion guidelines and quick reference - Add example usage scripts The skill reuses existing PDF processing code from file_worker.py but replaces LLM API calls with Claude Code's native vision capabilities. Features: - No external API required (uses Claude's vision) - Supports page ranges and custom DPI - Comprehensive Markdown conversion rules - Handles tables, math, code, and complex layouts - Includes troubleshooting and best practices https://claude.ai/code/session_012AMzzn5nwxZQaUGGvTnfyS --- skill/INDEX.md | 201 +++++++++++++++ skill/README.md | 313 +++++++++++++++++++++++ skill/USAGE_GUIDE.md | 504 +++++++++++++++++++++++++++++++++++++ skill/conversion_prompt.md | 131 ++++++++++ skill/example_usage.sh | 55 ++++ skill/pdf2md.py | 257 +++++++++++++++++++ skill/prompt.md | 94 +++++++ skill/skill.json | 23 ++ skill/skill.yaml | 48 ++++ skill/skill_main.md | 177 +++++++++++++ 10 files changed, 1803 insertions(+) create mode 100644 skill/INDEX.md create mode 100644 skill/README.md create mode 100644 skill/USAGE_GUIDE.md create mode 100644 skill/conversion_prompt.md create mode 100644 skill/example_usage.sh create mode 100755 skill/pdf2md.py create mode 100644 skill/prompt.md create mode 100644 skill/skill.json create mode 100644 skill/skill.yaml create mode 100644 skill/skill_main.md diff --git a/skill/INDEX.md b/skill/INDEX.md new file mode 100644 index 0000000..9e881c9 --- /dev/null +++ b/skill/INDEX.md @@ -0,0 +1,201 @@ +# PDF to Markdown Skill - File Index + +This directory 
contains a Claude Code skill for converting PDF files to Markdown format. + +## File Structure + +``` +skill/ +├── INDEX.md # This file - overview of all files +├── README.md # Main documentation and installation guide +├── USAGE_GUIDE.md # Detailed usage examples and workflows +│ +├── skill.json # Skill metadata (JSON format) +├── skill.yaml # Skill metadata (YAML format) +│ +├── skill_main.md # Main skill prompt with complete workflow +├── prompt.md # Detailed conversion guidelines for Claude +├── conversion_prompt.md # Quick reference guide for Markdown conversion +│ +├── pdf2md.py # PDF extraction utility (standalone) +└── example_usage.sh # Example shell script demonstrating usage +``` + +## Quick Reference + +| File | Purpose | When to Use | +|------|---------|-------------| +| **README.md** | Main documentation | Start here for overview and setup | +| **USAGE_GUIDE.md** | Detailed examples | Learn how to use the skill in practice | +| **skill_main.md** | Main skill prompt | Reference for the complete workflow | +| **conversion_prompt.md** | Quick guide | Quick lookup for Markdown syntax | +| **pdf2md.py** | Extraction script | Run this to extract PDF pages to images | +| **example_usage.sh** | Example script | See working examples | + +## Getting Started + +1. **Read:** Start with `README.md` +2. **Install:** `pip install pymupdf pypdf2` +3. **Extract:** Run `python3 pdf2md.py your_file.pdf --output-dir ./images` +4. **Convert:** Use Claude Code to convert images to Markdown +5. 
**Learn More:** See `USAGE_GUIDE.md` for detailed examples + +## File Descriptions + +### Documentation Files + +- **INDEX.md** (this file) + - Overview of all files in the skill directory + - Quick reference table + - Getting started guide + +- **README.md** + - Main documentation + - Feature overview + - Installation instructions + - Basic usage examples + - Comparison with main library + +- **USAGE_GUIDE.md** + - Detailed workflow examples + - Common use cases (academic papers, documentation, etc.) + - Advanced options + - Troubleshooting guide + - Best practices + +### Configuration Files + +- **skill.json** + - Skill metadata in JSON format + - Command definitions + - Dependency specifications + - Used by skill loading systems that expect JSON + +- **skill.yaml** + - Skill metadata in YAML format + - Same information as skill.json but in YAML + - More human-readable + - Includes examples and extended metadata + +### Prompt Files + +- **skill_main.md** + - Main skill execution prompt + - Complete workflow description + - Step-by-step instructions for Claude Code + - Error handling guidelines + - Quality checklist + +- **prompt.md** + - Original conversion guidelines + - Comprehensive Markdown rules + - Element-by-element conversion guide + - Quality standards + +- **conversion_prompt.md** + - Quick reference guide + - Condensed conversion rules + - Common patterns and tips + - Easy lookup format + +### Executable Files + +- **pdf2md.py** + - Standalone Python script + - Extracts PDF pages to images + - Self-contained (no imports from parent project) + - Command-line interface + - Supports page ranges, custom DPI, output directories + +- **example_usage.sh** + - Shell script with examples + - Demonstrates common usage patterns + - Includes test with sample PDF if available + +## Workflow Overview + +``` +User provides PDF + ↓ +Run pdf2md.py + ↓ +Extract pages as JPG images + ↓ +Claude Code reads images + ↓ +Convert to Markdown (using prompt guidelines) + ↓ 
+Combine pages + ↓ +Save final Markdown file +``` + +## Key Features + +- **No External LLM APIs**: Uses Claude Code's native vision +- **Standalone Script**: `pdf2md.py` works independently +- **Comprehensive Guides**: Multiple documentation levels +- **Flexible Configuration**: JSON or YAML metadata +- **Reference Prompts**: Multiple prompt files for different needs + +## Dependencies + +### Python Packages (required for pdf2md.py) +- `pymupdf` (fitz) - PDF to image conversion +- `pypdf2` - PDF page extraction + +### System Requirements +- Python 3.9+ (the `list[str]` return annotation in `pdf2md.py` fails at import time on older versions) +- Sufficient disk space for temporary images +- Read/write permissions for output directories + +## Version Information + +- **Skill Version:** 1.0.0 +- **Compatible with:** Claude Code (with vision support) +- **Based on:** MarkPDFDown library v1.1.2 + +## License + +This skill inherits the license from the parent markpdfdown project. + +## Contributing + +To improve this skill: +1. Test with various PDF types +2. Document edge cases in USAGE_GUIDE.md +3. Add examples to example_usage.sh +4. Refine conversion prompts based on results +5. Submit issues or pull requests + +## Support + +For help: +1. Check README.md for basic usage +2. Review USAGE_GUIDE.md for detailed examples +3. Test with the sample PDF: `../tests/fixtures/pdfs/input_tables.pdf` +4. 
Check the troubleshooting section in USAGE_GUIDE.md + +## Quick Commands + +```bash +# Extract PDF pages +python3 skill/pdf2md.py document.pdf --output-dir ./images + +# Extract specific range +python3 skill/pdf2md.py document.pdf --start 5 --end 10 --output-dir ./images + +# High resolution +python3 skill/pdf2md.py document.pdf --dpi 600 --output-dir ./images + +# Get help +python3 skill/pdf2md.py --help + +# Run example script +bash skill/example_usage.sh +``` + +--- + +**Last Updated:** 2026-02-14 +**Maintainer:** MarkPDFDown Project diff --git a/skill/README.md b/skill/README.md new file mode 100644 index 0000000..c6205ce --- /dev/null +++ b/skill/README.md @@ -0,0 +1,313 @@ +# PDF to Markdown Claude Code Skill + +A Claude Code skill for converting PDF files to Markdown format using Claude's native vision capabilities. + +## Overview + +This skill enables Claude Code to convert PDF documents into well-formatted Markdown without relying on external LLM APIs. It leverages: +- **Existing PDF processing code** from the markpdfdown library +- **Claude's vision capabilities** to analyze and convert page images +- **Interactive conversion** with Claude Code handling the entire process + +## Features + +- ✅ **No External API Required**: Uses Claude Code's built-in vision instead of calling external LLMs +- ✅ **Full PDF Support**: Handles multi-page PDFs with page range selection +- ✅ **High Quality**: Preserves document structure, tables, math formulas, and code blocks +- ✅ **Customizable**: Adjust DPI, page ranges, and output locations +- ✅ **Interactive**: Claude Code guides you through the conversion process + +## Installation + +### Prerequisites + +Make sure you have the required Python packages installed: + +```bash +pip install pymupdf pypdf2 +``` + +Or install from the parent project: + +```bash +cd .. +pip install -e . +``` + +### Skill Setup + +1. Copy the `skill` folder to your Claude Code skills directory, or use it directly from this repository + +2. 
No access to the markpdfdown source tree is needed: the `pdf2md.py` script is standalone and carries its own copies of the utility functions it uses + +## Usage + +### Basic Usage + +In Claude Code, use the skill to convert a PDF: + +``` +/pdf2md document.pdf +``` + +This will: +1. Extract all pages from `document.pdf` as images +2. Convert each page to Markdown using Claude's vision +3. Combine all pages into a single Markdown file +4. Save the output as `document.md` + +### With Options + +**Convert specific pages:** +``` +/pdf2md research_paper.pdf --start 1 --end 10 +``` + +**Custom output file:** +``` +/pdf2md slides.pdf --output my_notes.md +``` + +**Higher resolution:** +``` +/pdf2md document.pdf --dpi 600 +``` + +**All options combined:** +``` +/pdf2md book.pdf --start 5 --end 20 --output chapter1.md --dpi 300 +``` + +## How It Works + +### Architecture + +``` +User Input (PDF file) + ↓ +pdf2md.py (extraction script) + ↓ +PDF Pages → High-res Images (JPG) + ↓ +Claude Code (vision analysis) + ↓ +Image → Markdown (per page) + ↓ +Combined Markdown Document + ↓ +Output File (.md) +``` + +### Workflow + +1. **PDF Extraction**: + - The `pdf2md.py` script reuses PDF processing logic copied from `file_worker.py` in the main library + - Converts PDF pages to high-resolution JPG images (default 300 DPI) + - Saves images to a temporary directory + +2. **Image Analysis**: + - Claude Code reads each extracted image + - Analyzes content using vision capabilities + - Converts to Markdown following detailed guidelines + +3. **Markdown Generation**: + - Preserves document structure (headings, lists, tables) + - Converts math to LaTeX (`$inline$` and `$$block$$`) + - Formats code blocks with language tags + - Maintains text formatting (bold, italic, code) + +4. 
**Output**: + - Combines all page Markdown with proper spacing + - Saves to specified output file + - Optionally cleans up temporary images + +## Conversion Guidelines + +The skill follows comprehensive Markdown conversion rules defined in `skill_main.md`: + +### Supported Elements + +| Element | Example Output | +|---------|---------------| +| Headings | `# Title`, `## Section`, `### Subsection` | +| Text Formatting | `**bold**`, `*italic*`, `` `code` `` | +| Lists | `- item` or `1. item` (with nesting) | +| Tables | Markdown table format with alignment | +| Math | `$E=mc^2$` (inline), `$$...$$` (block) | +| Code Blocks | ````python ... ``` ```` | +| Images | `![alt](url)` or descriptive text | +| Footnotes | `[^1]` with definitions | + +### Quality Standards + +- **Accuracy**: All text captured precisely +- **Structure**: Logical document hierarchy preserved +- **Formatting**: Consistent Markdown style +- **Completeness**: No content omitted (except decorative elements) + +## File Structure + +``` +skill/ +├── README.md # This file +├── skill.json # Skill metadata (JSON format) +├── skill.yaml # Skill metadata (YAML format) +├── skill_main.md # Main skill prompt with workflow +├── prompt.md # Detailed conversion guidelines +├── conversion_prompt.md # Quick reference guide +└── pdf2md.py # PDF extraction utility script +``` + +## Dependencies + +This skill reuses code from the parent markpdfdown library: + +- `src/markpdfdown/core/file_worker.py` - PDF and image processing +- `src/markpdfdown/core/utils.py` - File type detection and validation + +Required Python packages: +- `pymupdf` (fitz) - PDF to image conversion +- `pypdf2` - PDF page extraction +- `pathlib` - Path handling (built-in) + +## Examples + +### Example 1: Research Paper + +Input: `research_paper.pdf` (15 pages) + +``` +/pdf2md research_paper.pdf --start 1 --end 15 +``` + +Output: `research_paper.md` with: +- Title and abstract +- Section headings (Introduction, Methods, Results, etc.) 
+- Math formulas in LaTeX +- Tables formatted as Markdown +- References as a numbered list + +### Example 2: Technical Documentation + +Input: `api_docs.pdf` (50 pages) + +``` +/pdf2md api_docs.pdf --start 10 --end 25 --output api_reference.md +``` + +Output: `api_reference.md` with: +- API endpoint descriptions +- Code examples with syntax highlighting +- Parameter tables +- Example requests/responses + +### Example 3: Presentation Slides + +Input: `slides.pdf` (30 slides) + +``` +/pdf2md slides.pdf --output presentation_notes.md +``` + +Output: `presentation_notes.md` with: +- Each slide as a section +- Bullet points preserved +- Images described textually +- Code snippets formatted + +## Troubleshooting + +### Common Issues + +**Issue**: "PDF file not found" +- **Solution**: Check the file path is correct and the file exists + +**Issue**: "Unsupported file type" +- **Solution**: Ensure the file is a valid PDF or supported image format (JPG, PNG, BMP, GIF) + +**Issue**: "Invalid page range" +- **Solution**: Check that start/end page numbers are within the document's page count + +**Issue**: Images not displaying +- **Solution**: Verify the temporary image directory is accessible and has write permissions + +### Debug Mode + +To see detailed extraction info: + +```bash +# Run the extraction script directly +python3 skill/pdf2md.py document.pdf --output-dir ./debug_images +``` + +This will show: +- Total pages extracted +- Image file paths +- Any extraction errors + +## Customization + +### Adjusting DPI + +Higher DPI = better quality but larger files and slower processing: +- **150 DPI**: Fast, lower quality, smaller files +- **300 DPI**: Balanced (default) +- **600 DPI**: High quality, larger files, slower + +### Modifying Conversion Rules + +Edit `skill_main.md` or `conversion_prompt.md` to customize how Claude converts content: +- Change heading level logic +- Adjust table formatting +- Modify math formula handling +- Add custom patterns for specific document 
types + +### Adding Post-Processing + +You can add custom post-processing steps in the skill workflow: +- Auto-generate table of contents +- Add metadata headers +- Clean up specific formatting patterns +- Validate Markdown syntax + +## Comparison with Main Library + +| Feature | Main Library (markpdfdown) | This Skill | +|---------|---------------------------|------------| +| LLM Backend | External API (OpenAI, OpenRouter, etc.) | Claude Code (built-in) | +| API Key Required | ✅ Yes | ❌ No | +| Offline Use | ❌ No | ✅ Yes (if Claude Code is available) | +| Cost | Pay per API call | Free (part of Claude Code usage) | +| Customization | Config file | Interactive with Claude | +| Batch Processing | ✅ Yes (CLI) | Manual (interactive) | + +## Contributing + +This skill is part of the markpdfdown project. To contribute: + +1. Test the skill with various PDF types +2. Report issues or suggest improvements +3. Submit pull requests with enhancements + +## License + +This skill inherits the license from the parent markpdfdown project. + +## Credits + +Built on top of: +- **markpdfdown**: The original PDF to Markdown converter +- **PyMuPDF**: PDF rendering engine +- **Claude Code**: Anthropic's AI-powered coding assistant + +## Support + +For issues or questions: +1. Check this README +2. Review the skill prompt files +3. Test with the standalone script: `python3 skill/pdf2md.py --help` +4. Report issues to the markpdfdown project + +--- + +**Happy Converting! 📄 → 📝** diff --git a/skill/USAGE_GUIDE.md b/skill/USAGE_GUIDE.md new file mode 100644 index 0000000..2b6da30 --- /dev/null +++ b/skill/USAGE_GUIDE.md @@ -0,0 +1,504 @@ +# PDF to Markdown Skill - Usage Guide + +This guide demonstrates how to use the PDF to Markdown skill with Claude Code. 
+ +## Quick Start + +### Step 1: Prepare Your Environment + +Ensure you have the required dependencies installed: + +```bash +pip install pymupdf pypdf2 +``` + +### Step 2: Extract PDF Pages to Images + +Use the `pdf2md.py` script to convert PDF pages to images: + +```bash +python3 skill/pdf2md.py your_file.pdf --output-dir ./pdf_images +``` + +**Example:** +```bash +python3 skill/pdf2md.py research_paper.pdf --output-dir ./pdf_images --start 1 --end 10 +``` + +This will: +- Extract pages 1-10 from `research_paper.pdf` +- Convert them to 300 DPI JPG images +- Save them to `./pdf_images/` directory +- Output: `page_0001.jpg`, `page_0002.jpg`, etc. + +### Step 3: Convert Images to Markdown with Claude Code + +Now you can ask Claude Code to convert each image to Markdown. Claude has vision capabilities and can read the images directly. + +**Example conversation:** + +``` +User: I've extracted pages from my PDF to ./pdf_images/. Please convert them all to Markdown. + +Claude: I'll read each image and convert it to Markdown format. + +[Claude reads page_0001.jpg] + +Here's the Markdown for page 1: + +# Introduction to Machine Learning + +Machine learning is a subset of artificial intelligence... + +[Continues with remaining pages...] +``` + +### Step 4: Combine and Save + +Claude will combine all pages into a single Markdown document and save it to your desired output file. + +--- + +## Detailed Workflow Example + +Let's walk through a complete example with a real document. + +### Example: Converting a Research Paper + +**Input:** `research_paper.pdf` (20 pages) + +**Goal:** Convert pages 1-5 to Markdown + +#### 1. Extract Pages + +```bash +cd /path/to/markpdfdown-skill +python3 skill/pdf2md.py research_paper.pdf \ + --output-dir ./temp_images \ + --start 1 \ + --end 5 \ + --dpi 300 +``` + +**Output:** +``` +Successfully extracted 5 images from 20 pages +Images saved to: ./temp_images + +Extracted images: + 1. ./temp_images/page_0001.jpg + 2. ./temp_images/page_0002.jpg + 3. 
./temp_images/page_0003.jpg + 4. ./temp_images/page_0004.jpg + 5. ./temp_images/page_0005.jpg +``` + +#### 2. Review Conversion Guidelines + +Before starting, review the conversion rules in `skill_main.md` or `conversion_prompt.md` to understand how Claude should format the output. + +Key guidelines: +- Headings: `#`, `##`, `###` based on hierarchy +- Math: `$inline$` and `$$block$$` LaTeX format +- Tables: Markdown table syntax +- Code: ` ```language ... ``` ` + +#### 3. Convert Each Page + +**Manual approach:** + +Ask Claude Code: +``` +Please read ./temp_images/page_0001.jpg and convert it to Markdown following +the guidelines in skill/conversion_prompt.md +``` + +Claude will analyze the image and produce Markdown output. + +**Batch approach:** + +Ask Claude Code: +``` +Please convert all images in ./temp_images/ to Markdown. For each image: +1. Read the image +2. Convert to Markdown following skill/conversion_prompt.md +3. Save each page's Markdown +4. Combine all pages into research_paper.md with proper spacing +``` + +#### 4. Review and Refine + +After conversion, review the output: +- Check table formatting +- Verify math formulas +- Ensure code blocks have correct language tags +- Confirm heading hierarchy + +If needed, ask Claude to make corrections: +``` +In research_paper.md, please fix the table on page 3 - some columns are misaligned +``` + +--- + +## Common Use Cases + +### Use Case 1: Academic Papers + +**Characteristics:** +- Abstract, sections, references +- Math formulas +- Tables and figures +- Citations + +**Example:** +```bash +python3 skill/pdf2md.py paper.pdf --output-dir ./paper_images --dpi 300 +``` + +**Expected Markdown structure:** +```markdown +# Title of Paper + +## Abstract +... + +## 1. Introduction +... + +### 1.1 Background +... + +## 2. Methods +... + +### 2.1 Dataset +... + +## References +1. Author et al. (2020)... 
+``` + +### Use Case 2: Technical Documentation + +**Characteristics:** +- Code examples +- API specifications +- Tables of parameters +- Diagrams + +**Example:** +```bash +python3 skill/pdf2md.py docs.pdf --start 10 --end 30 --output-dir ./docs_images +``` + +**Expected Markdown:** +````markdown +## API Endpoint: /users + +### Request + +```http +GET /api/v1/users +``` + +### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| id | integer | Yes | User ID | +| name | string | No | User name filter | + +### Response + +```json +{ + "users": [...] +} +``` +```` + +### Use Case 3: Presentation Slides + +**Characteristics:** +- Each slide is a section +- Bullet points +- Images and diagrams + +**Example:** +```bash +python3 skill/pdf2md.py slides.pdf --output-dir ./slides_images +``` + +**Expected Markdown:** +```markdown +## Slide 1: Introduction + +- Topic overview +- Key objectives +- Agenda + +## Slide 2: Background + +- Historical context +- Current challenges +- Opportunities + +... 
+``` + +### Use Case 4: Financial Reports + +**Characteristics:** +- Complex tables +- Numbers and currencies +- Headers/footers +- Multi-column layouts + +**Example:** +```bash +python3 skill/pdf2md.py annual_report.pdf --start 50 --end 60 --dpi 600 +``` + +**Tips:** +- Use higher DPI (600) for better table recognition +- Pay special attention to number alignment +- May need manual review for complex financial tables + +--- + +## Advanced Options + +### Custom DPI + +Adjust resolution based on content: + +```bash +# Low resolution (faster, smaller files) +python3 skill/pdf2md.py doc.pdf --dpi 150 + +# Standard resolution (balanced) +python3 skill/pdf2md.py doc.pdf --dpi 300 + +# High resolution (better quality, slower) +python3 skill/pdf2md.py doc.pdf --dpi 600 +``` + +**When to use higher DPI:** +- Small text or complex diagrams +- Tables with fine details +- Mathematical formulas with subscripts/superscripts + +### Selective Page Extraction + +Extract non-consecutive pages by running multiple commands: + +```bash +# Extract pages 1-5 +python3 skill/pdf2md.py book.pdf --start 1 --end 5 --output-dir ./chapter1 + +# Extract pages 20-30 +python3 skill/pdf2md.py book.pdf --start 20 --end 30 --output-dir ./chapter2 +``` + +### Custom Output Organization + +Organize output by document structure: + +```bash +# Introduction +python3 skill/pdf2md.py thesis.pdf --start 1 --end 10 --output-dir ./intro + +# Methods +python3 skill/pdf2md.py thesis.pdf --start 11 --end 30 --output-dir ./methods + +# Results +python3 skill/pdf2md.py thesis.pdf --start 31 --end 50 --output-dir ./results +``` + +Then ask Claude to convert each section separately. + +--- + +## Troubleshooting + +### Problem: Text is too small to read + +**Solution:** Increase DPI +```bash +python3 skill/pdf2md.py doc.pdf --dpi 600 +``` + +### Problem: Table columns are misaligned + +**Solutions:** +1. Use higher DPI for better image quality +2. Ask Claude to review the table specifically +3. 
Manually adjust the Markdown table after conversion + +### Problem: Math formulas not recognized + +**Solutions:** +1. Ensure formulas are clear in the image (check DPI) +2. Ask Claude to focus on mathematical content +3. Provide examples of the LaTeX format you want + +### Problem: Multi-column text is out of order + +**Solution:** Claude should read left-to-right, top-to-bottom. If not: +``` +Please re-read this page and maintain the reading order: left column first +(top to bottom), then right column (top to bottom) +``` + +### Problem: Code blocks missing language tags + +**Solution:** Ask Claude to add them: +``` +Please review the Markdown and add appropriate language tags to all code blocks +``` + +--- + +## Best Practices + +### 1. Check Image Quality First + +After extraction, quickly review 1-2 images to ensure quality: +```bash +# On Linux with image viewer +eog ./pdf_images/page_0001.jpg + +# On macOS +open ./pdf_images/page_0001.jpg +``` + +### 2. Provide Context to Claude + +When asking Claude to convert, provide context: +``` +This is a research paper in computer science. Please convert the images to +Markdown, paying special attention to: +- Mathematical formulas (use LaTeX) +- Code snippets (likely Python) +- Algorithm descriptions +``` + +### 3. Process in Batches + +For large documents, process in smaller batches: +- 5-10 pages at a time +- This makes it easier to review and catch errors +- Easier to provide specific feedback + +### 4. Iterate and Refine + +Don't expect perfect results on first try: +1. First pass: Get basic structure +2. Second pass: Fix tables and formulas +3. Final pass: Polish formatting and consistency + +### 5. Save Intermediate Results + +Save Markdown for each page separately before combining: +``` +./output/ + page_01.md + page_02.md + page_03.md + ... 
+ combined.md +``` + +This makes it easier to: +- Identify problematic pages +- Make targeted corrections +- Regenerate only specific pages if needed + +--- + +## Integration with Claude Code + +### Automated Workflow + +You can create a simple script to automate the entire process: + +```bash +#!/bin/bash +# convert_pdf.sh + +PDF_FILE=$1 +OUTPUT_MD=${2:-output.md} +TEMP_DIR="./temp_pdf_images" + +# Extract images +echo "Extracting images from PDF..." +python3 skill/pdf2md.py "$PDF_FILE" --output-dir "$TEMP_DIR" + +# Now ask Claude Code to process the images +echo "Images extracted to $TEMP_DIR" +echo "Next: Ask Claude Code to convert images to $OUTPUT_MD" +``` + +Usage: +```bash +./convert_pdf.sh research_paper.pdf paper.md +``` + +### Custom Prompts + +Create custom conversion prompts for specific document types: + +**For code documentation:** +```markdown +Please convert this page to Markdown: +- Code blocks should use appropriate language tags +- API endpoints should be formatted as headings +- Parameter tables should use Markdown table syntax +- Keep inline code in backticks +``` + +**For academic papers:** +```markdown +Please convert this page to Markdown: +- Convert all math to LaTeX (inline: $...$, block: $$...$$) +- Section numbers should be part of the heading +- Keep reference formatting consistent +- Convert figures to descriptive text with > blockquote +``` + +--- + +## Examples Gallery + +See `skill/examples/` directory for: +- Sample PDFs +- Extracted images +- Converted Markdown +- Before/after comparisons + +(Note: Add actual examples when available) + +--- + +## Getting Help + +If you encounter issues: + +1. **Check the extraction:** Verify images are clear and readable +2. **Review guidelines:** See `skill_main.md` for conversion rules +3. **Test with sample:** Try the test PDF first: `tests/fixtures/pdfs/input_tables.pdf` +4. 
**Ask Claude:** Claude Code can help troubleshoot conversion issues + +--- + +## Next Steps + +After mastering basic conversion: + +1. **Customize prompts** for your specific document types +2. **Create templates** for common formats +3. **Build automation scripts** for repeated tasks +4. **Contribute examples** to help others + +Happy converting! diff --git a/skill/conversion_prompt.md b/skill/conversion_prompt.md new file mode 100644 index 0000000..1a03bfe --- /dev/null +++ b/skill/conversion_prompt.md @@ -0,0 +1,131 @@ +# Quick Markdown Conversion Reference + +This is a condensed reference for converting PDF page images to Markdown. + +## Structure Elements + +| Element | Markdown Syntax | Example | +|---------|----------------|---------| +| Title | `# Title` | `# Introduction to AI` | +| Section | `## Section` | `## Background` | +| Subsection | `### Subsection` | `### Related Work` | +| Paragraph | Text with blank lines | Regular paragraph text | +| Bold | `**text**` | `**important**` | +| Italic | `*text*` | `*emphasis*` | +| Code | `` `code` `` | `` `function()` `` | +| Link | `[text](url)` | `[Google](https://google.com)` | + +## Lists + +**Unordered:** +```markdown +- First item +- Second item + - Nested item + - Another nested +``` + +**Ordered:** +```markdown +1. First step +2. Second step + 1. Sub-step + 2. 
Another sub-step +``` + +## Tables + +```markdown +| Column 1 | Column 2 | Column 3 | +|----------|----------|----------| +| Data 1 | Data 2 | Data 3 | +| Data 4 | Data 5 | Data 6 | +``` + +## Math + +**Inline:** `$E = mc^2$` + +**Block:** +``` +$$ +\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n +$$ +``` + +## Code Blocks + +````markdown +```python +def hello(): + print("Hello!") +``` +```` + +## Images & Figures + +```markdown +![Image description](url) + +> **Figure 1**: Description of diagram or chart +``` + +## Footnotes + +```markdown +Some text with a footnote[^1] + +[^1]: The footnote content +``` + +## Common Patterns + +### Research Papers +- Title: `#` +- Abstract: `## Abstract` +- Sections: `##` (Introduction, Methods, Results, etc.) +- References: `## References` with numbered list + +### Technical Documentation +- Use code blocks for commands/code +- Tables for specifications +- Nested lists for procedures + +### Presentations/Slides +- Each slide title: `##` +- Bullet points: `-` or `1.` +- Keep formatting simple + +## Tips + +1. **Accuracy First**: Get the text right before worrying about perfect formatting +2. **Preserve Structure**: Maintain the document's logical hierarchy +3. **Clean Output**: No explanations, just pure Markdown +4. **Consistent Style**: Use the same patterns throughout +5. **Test Math**: Ensure LaTeX formulas are valid + +## What to Skip + +- Page numbers (unless important) +- Headers/footers (unless important) +- Watermarks +- Purely decorative elements +- Redundant spacing/formatting + +## Multi-Column Handling + +For multi-column layouts: +1. Read left column top to bottom +2. Then right column top to bottom +3. Combine in reading order +4. 
Maintain paragraph breaks + +## Quality Checks + +- [ ] All visible text captured +- [ ] Headings properly leveled +- [ ] Tables formatted correctly +- [ ] Math in LaTeX +- [ ] Code blocks have language tags +- [ ] Links preserved +- [ ] Structure is logical diff --git a/skill/example_usage.sh b/skill/example_usage.sh new file mode 100644 index 0000000..0d294e3 --- /dev/null +++ b/skill/example_usage.sh @@ -0,0 +1,55 @@ +#!/bin/bash +# Example usage of the PDF to Markdown skill + +echo "=== PDF to Markdown Skill - Example Usage ===" +echo "" + +# Example 1: Basic usage +echo "Example 1: Extract images from a PDF" +echo "Command: python3 pdf2md.py input.pdf" +echo "" + +# Example 2: With page range +echo "Example 2: Extract specific pages" +echo "Command: python3 pdf2md.py document.pdf --start 1 --end 10" +echo "" + +# Example 3: Custom output directory +echo "Example 3: Custom output directory" +echo "Command: python3 pdf2md.py report.pdf --output-dir ./my_images" +echo "" + +# Example 4: High resolution +echo "Example 4: High resolution extraction" +echo "Command: python3 pdf2md.py slides.pdf --dpi 600" +echo "" + +# Example 5: Full options +echo "Example 5: All options" +echo "Command: python3 pdf2md.py book.pdf --output-dir ./chapters --start 5 --end 20 --dpi 300" +echo "" + +echo "=== Testing with Sample PDF ===" +echo "" + +# Check if sample PDF exists +if [ -f "../tests/fixtures/pdfs/input_tables.pdf" ]; then + echo "Found sample PDF: tests/fixtures/pdfs/input_tables.pdf" + echo "Running extraction..." + echo "" + + python3 pdf2md.py ../tests/fixtures/pdfs/input_tables.pdf --output-dir ./test_output + + echo "" + echo "Check ./test_output for extracted images" +else + echo "Sample PDF not found. Please provide your own PDF file." + echo "Usage: python3 pdf2md.py your_file.pdf" +fi + +echo "" +echo "=== Next Steps ===" +echo "1. Review the extracted images in the output directory" +echo "2. Use Claude Code to convert each image to Markdown" +echo "3. 
Combine all Markdown pages into a single document" +echo "" diff --git a/skill/pdf2md.py b/skill/pdf2md.py new file mode 100755 index 0000000..0181b72 --- /dev/null +++ b/skill/pdf2md.py @@ -0,0 +1,257 @@ +#!/usr/bin/env python3 +""" +PDF to Markdown Conversion Tool for Claude Code +This script extracts images from PDF pages for Claude Code to convert to Markdown. +""" + +import sys +import os +from pathlib import Path +from typing import Optional, Tuple + + +# ============================================================================ +# Utility functions (copied from markpdfdown to avoid dependency issues) +# ============================================================================ + +def detect_file_type(file_data: bytes, extension: str = None) -> Optional[str]: + """ + Detect file type from binary data or extension. + + Args: + file_data: Binary file data + extension: File extension (optional) + + Returns: + File type string (pdf, jpg, png, etc.) or None + """ + if not file_data: + return None + + # PDF file magic number + if file_data.startswith(b"%PDF-"): + return "pdf" + + # JPEG file magic numbers + elif file_data.startswith(b"\xff\xd8\xff"): + return "jpg" + + # PNG file magic number + elif file_data.startswith(b"\x89\x50\x4e\x47"): + return "png" + + # BMP file magic number + elif file_data.startswith(b"\x42\x4d"): + return "bmp" + + # GIF file magic number + elif file_data.startswith(b"GIF87a") or file_data.startswith(b"GIF89a"): + return "gif" + + # Fallback to extension if provided + if extension: + ext = extension.lower().lstrip('.') + if ext in ['pdf', 'jpg', 'jpeg', 'png', 'bmp', 'gif']: + return ext + + return None + + +def validate_page_range( + start_page: int, end_page: Optional[int], total_pages: int +) -> Tuple[int, int]: + """ + Validate and normalize page range. 
+ + Args: + start_page: Starting page number (1-based) + end_page: Ending page number (1-based, None means last page) + total_pages: Total number of pages in document + + Returns: + Tuple of (normalized_start, normalized_end) + + Raises: + ValueError: If page range is invalid + """ + if start_page < 1: + raise ValueError("Start page must be >= 1") + + if start_page > total_pages: + raise ValueError(f"Start page {start_page} exceeds total pages {total_pages}") + + # Handle end_page = None (means last page) + if end_page is None: + end_page = total_pages + + if end_page < start_page: + raise ValueError(f"End page {end_page} must be >= start page {start_page}") + + if end_page > total_pages: + end_page = total_pages + + return start_page, end_page + + +def extract_pdf_images( + pdf_path: str, + output_dir: str, + start_page: int = 1, + end_page: Optional[int] = None, + dpi: int = 300, +) -> Tuple[list[str], int]: + """ + Extract images from PDF pages. + + Args: + pdf_path: Path to the PDF file + output_dir: Directory to save extracted images + start_page: Starting page number (1-based) + end_page: Ending page number (1-based, None for last page) + dpi: Resolution for image extraction + + Returns: + Tuple of (list of image paths, total page count) + """ + # Create output directory + output_path = Path(output_dir) + output_path.mkdir(parents=True, exist_ok=True) + + # Validate PDF file exists + if not os.path.exists(pdf_path): + raise FileNotFoundError(f"PDF file not found: {pdf_path}") + + # Detect file type + with open(pdf_path, "rb") as f: + file_data = f.read() + + file_type = detect_file_type(file_data, Path(pdf_path).suffix) + + if file_type not in ["pdf", "jpg", "jpeg", "png", "bmp", "gif"]: + raise ValueError(f"Unsupported file type: {file_type}") + + # Handle image files + if file_type in ["jpg", "jpeg", "png", "bmp", "gif"]: + # For image files, just return the original path + return [pdf_path], 1 + + # Handle PDF files + try: + import PyPDF2 + except 
ImportError: + raise ImportError("PyPDF2 is required for PDF processing. Install with: pip install pypdf2") + + try: + import fitz # PyMuPDF + except ImportError: + raise ImportError("PyMuPDF is required for PDF to image conversion. Install with: pip install pymupdf") + + # Read PDF and get page count + with open(pdf_path, "rb") as f: + pdf_reader = PyPDF2.PdfReader(f) + total_pages = len(pdf_reader.pages) + + # Validate and normalize page range + start_page, end_page = validate_page_range(start_page, end_page, total_pages) + + # Extract specified pages + pdf_writer = PyPDF2.PdfWriter() + with open(pdf_path, "rb") as f: + pdf_reader = PyPDF2.PdfReader(f) + for page_num in range(start_page - 1, end_page): + pdf_writer.add_page(pdf_reader.pages[page_num]) + + # Save extracted pages to temporary file + temp_pdf_path = output_path / "temp_extracted.pdf" + with open(temp_pdf_path, "wb") as f: + pdf_writer.write(f) + + # Convert pages to images using PyMuPDF + doc = fitz.open(str(temp_pdf_path)) + image_paths = [] + + for page_index in range(len(doc)): + page = doc[page_index] + # Render page to pixmap + mat = fitz.Matrix(dpi / 72, dpi / 72) # 72 is default DPI + pix = page.get_pixmap(matrix=mat) + + # Generate output filename + page_number = start_page + page_index + image_filename = f"page_{page_number:04d}.jpg" + image_path = output_path / image_filename + + # Save image + pix.save(str(image_path), "jpeg") + image_paths.append(str(image_path)) + + doc.close() + + # Clean up temporary PDF + if temp_pdf_path.exists(): + temp_pdf_path.unlink() + + return image_paths, total_pages + + +def main(): + """Main entry point for the script.""" + import argparse + + parser = argparse.ArgumentParser( + description="Extract images from PDF for Claude Code conversion" + ) + parser.add_argument( + "input", + help="Input PDF file path" + ) + parser.add_argument( + "--output-dir", + default="./pdf_images", + help="Output directory for extracted images (default: ./pdf_images)" + ) + 
parser.add_argument( + "--start", + type=int, + default=1, + help="Start page number (1-based, default: 1)" + ) + parser.add_argument( + "--end", + type=int, + default=None, + help="End page number (1-based, default: last page)" + ) + parser.add_argument( + "--dpi", + type=int, + default=300, + help="Image resolution (default: 300)" + ) + + args = parser.parse_args() + + try: + image_paths, total_pages = extract_pdf_images( + args.input, + args.output_dir, + args.start, + args.end, + args.dpi, + ) + + print(f"Successfully extracted {len(image_paths)} images from {total_pages} pages") + print(f"Images saved to: {args.output_dir}") + print("\nExtracted images:") + for i, img_path in enumerate(image_paths, 1): + print(f" {i}. {img_path}") + + return 0 + + except Exception as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skill/prompt.md b/skill/prompt.md new file mode 100644 index 0000000..20dc488 --- /dev/null +++ b/skill/prompt.md @@ -0,0 +1,94 @@ +# PDF to Markdown Conversion Skill + +You are a helpful assistant that converts PDF document images to Markdown format. + +## Your Task + +You will receive images extracted from PDF pages. For each image, you need to: + +1. **Analyze the content** carefully and convert it to well-structured Markdown +2. **Preserve the document structure** including headings, paragraphs, lists, tables, and code blocks +3. **Convert mathematical formulas** to LaTeX format (inline: `$formula$`, block: `$$formula$$`) +4. **Format tables** using Markdown table syntax +5. **Preserve formatting** like bold, italic, and code +6. 
**Handle special elements** like images, diagrams, and charts with appropriate descriptions + +## Conversion Guidelines + +### Headings +- Convert document titles to `# Heading` +- Section headings to `## Section` +- Subsections to `### Subsection` +- Use appropriate heading levels based on visual hierarchy + +### Text Formatting +- **Bold text**: `**bold**` +- *Italic text*: `*italic*` +- `Code or monospace`: `` `code` `` +- Links: `[text](url)` if URLs are visible + +### Lists +- Unordered lists: `- item` or `* item` +- Ordered lists: `1. item`, `2. item`, etc. +- Nested lists: indent with 2 or 4 spaces + +### Tables +```markdown +| Header 1 | Header 2 | Header 3 | +|----------|----------|----------| +| Cell 1 | Cell 2 | Cell 3 | +| Cell 4 | Cell 5 | Cell 6 | +``` +- Align columns properly +- Preserve cell content and structure + +### Mathematical Formulas +- Inline math: `$E = mc^2$` +- Block math: +``` +$$ +\int_{a}^{b} f(x) dx = F(b) - F(a) +$$ +``` +- Use proper LaTeX syntax + +### Code Blocks +````markdown +```python +def hello(): + print("Hello, World!") +``` +```` +- Specify language when identifiable +- Preserve indentation and formatting + +### Images and Diagrams +- For images: `![Image description](image_url_if_available)` +- For diagrams/charts: Provide a text description in a blockquote: +```markdown +> **Figure 1**: Description of the diagram or chart content +``` + +### Special Cases +- **Headers/Footers**: Include if they contain important information, otherwise skip +- **Page numbers**: Skip unless contextually important +- **Watermarks**: Ignore +- **Multi-column layouts**: Convert to single column, maintaining reading order +- **Footnotes**: Use `[^1]` notation with definitions at the end + +## Output Format + +- Output ONLY the Markdown content +- Do NOT include explanations or meta-comments +- Do NOT wrap the output in code blocks (no ````markdown` wrapper) +- Ensure proper spacing between elements (blank lines between paragraphs, sections, 
etc.) +- Maintain logical document flow + +## Quality Standards + +- **Accuracy**: Ensure text is accurate and complete +- **Structure**: Preserve the logical structure of the document +- **Readability**: Make the Markdown clean and easy to read +- **Completeness**: Don't omit content unless it's clearly decorative or redundant + +Begin the conversion when you receive the page image. diff --git a/skill/skill.json b/skill/skill.json new file mode 100644 index 0000000..cc3394c --- /dev/null +++ b/skill/skill.json @@ -0,0 +1,23 @@ +{ + "name": "pdf2md", + "version": "1.0.0", + "description": "Convert PDF files to Markdown format using Claude Code", + "author": "MarkPDFDown", + "main": "pdf2md.py", + "dependencies": { + "pymupdf": ">=1.25.3", + "pypdf2": ">=3.0.1" + }, + "commands": { + "pdf2md": { + "description": "Convert PDF to Markdown", + "usage": "pdf2md [options]", + "options": { + "--output": "Output markdown file path (default: .md)", + "--start": "Start page number (1-based, default: 1)", + "--end": "End page number (default: last page)", + "--dpi": "Image resolution for conversion (default: 300)" + } + } + } +} diff --git a/skill/skill.yaml b/skill/skill.yaml new file mode 100644 index 0000000..aac1058 --- /dev/null +++ b/skill/skill.yaml @@ -0,0 +1,48 @@ +name: pdf2md +version: 1.0.0 +description: Convert PDF files to Markdown format using Claude Code's vision capabilities +author: MarkPDFDown + +# Skill metadata +metadata: + category: document-processing + tags: + - pdf + - markdown + - conversion + - vision + requires: + - python3 + - pymupdf + - pypdf2 + +# Command definition +command: + name: pdf2md + description: Convert PDF to Markdown + usage: | + /pdf2md [options] + + Options: + --output Output markdown file (default: .md) + --start Start page number (default: 1) + --end End page number (default: last page) + --dpi Image resolution (default: 300) + + examples: + - description: Convert entire PDF + command: /pdf2md document.pdf + + - description: 
Convert specific page range + command: /pdf2md document.pdf --start 1 --end 10 + + - description: Convert with custom output + command: /pdf2md document.pdf --output my_notes.md + +# Main prompt that will be executed +prompt_file: skill_main.md + +# Additional resources +resources: + - conversion_prompt.md + - pdf2md.py diff --git a/skill/skill_main.md b/skill/skill_main.md new file mode 100644 index 0000000..34c0414 --- /dev/null +++ b/skill/skill_main.md @@ -0,0 +1,177 @@ +# PDF to Markdown Conversion Skill + +You are executing the PDF to Markdown conversion skill. Your task is to convert a PDF document into well-formatted Markdown. + +## Workflow + +Follow these steps to convert a PDF to Markdown: + +### Step 1: Extract PDF Information + +First, use the PDF extraction script to convert PDF pages into images: + +```bash +python3 skill/pdf2md.py --output-dir [--start ] [--end ] [--dpi ] +``` + +This will: +- Extract pages from the PDF as high-resolution images +- Save them to the specified output directory +- Print the list of extracted image files + +### Step 2: Process Each Page Image + +For each extracted image, you need to: + +1. **Read the image** using the Read tool to view its content +2. **Analyze the content** and convert it to Markdown following the conversion guidelines +3. **Save the Markdown** for this page + +### Step 3: Combine All Pages + +After processing all pages: +1. Combine all page Markdown into a single document +2. Add appropriate spacing between pages (use `\n\n` between pages) +3. Ensure consistent formatting throughout + +### Step 4: Save Final Output + +Write the complete Markdown to the output file specified by the user (or `.md` by default). + +## Conversion Guidelines + +When converting each page image to Markdown, follow these rules: + +### Document Structure + +- **Headings**: Use `#`, `##`, `###` etc. 
based on visual hierarchy + - Main title: `# Title` + - Sections: `## Section` + - Subsections: `### Subsection` + +- **Paragraphs**: Separate with blank lines + +- **Lists**: + - Unordered: `- item` or `* item` + - Ordered: `1. item`, `2. item` + - Nested: indent with 2-4 spaces + +### Text Formatting + +- **Bold**: `**text**` +- **Italic**: `*text*` +- **Code**: `` `code` `` +- **Links**: `[text](url)` when URLs are visible + +### Tables + +Format as Markdown tables: +```markdown +| Header 1 | Header 2 | Header 3 | +|----------|----------|----------| +| Cell 1 | Cell 2 | Cell 3 | +``` + +- Align columns properly +- Preserve all table content +- Use appropriate cell separators + +### Mathematical Formulas + +- **Inline math**: `$formula$` + - Example: `$E = mc^2$` + +- **Block math**: `$$formula$$` + - Example: + ``` + $$ + \int_{a}^{b} f(x) dx = F(b) - F(a) + $$ + ``` + +- Use proper LaTeX syntax +- Preserve all mathematical notation accurately + +### Code Blocks + +````markdown +```language +code here +``` +```` + +- Specify programming language when identifiable +- Preserve indentation and formatting +- Common languages: python, javascript, java, cpp, etc. 
+ +### Images and Diagrams + +- **Photos/Images**: `![Description](url_if_available)` +- **Diagrams/Charts**: Provide descriptive text + ```markdown + > **Figure N**: Detailed description of the diagram, chart, or illustration + ``` + +### Special Elements + +- **Headers/Footers**: Include only if they contain important information +- **Page Numbers**: Omit unless contextually important +- **Watermarks**: Ignore +- **Multi-column Text**: Convert to single column, maintain reading order (left-to-right, top-to-bottom) +- **Footnotes**: Use `[^1]` notation: + ```markdown + Text with footnote[^1] + + [^1]: Footnote content here + ``` + +## Output Requirements + +- **Clean Markdown**: Output only the Markdown content, no meta-comments +- **No Code Block Wrappers**: Don't wrap the entire output in ````markdown` blocks +- **Proper Spacing**: Use blank lines between sections, paragraphs, and elements +- **Accuracy**: Ensure all text is captured accurately +- **Completeness**: Don't omit content unless it's purely decorative + +## Quality Checklist + +Before finalizing each page: +- [ ] All text has been captured +- [ ] Headings use appropriate levels +- [ ] Tables are properly formatted +- [ ] Math formulas use correct LaTeX syntax +- [ ] Code blocks specify language +- [ ] Lists are properly formatted +- [ ] Document structure is logical and readable + +## Example Usage + +If the user runs: +``` +/pdf2md research_paper.pdf --start 1 --end 5 +``` + +You should: +1. Run: `python3 skill/pdf2md.py research_paper.pdf --output-dir ./temp_pdf_images --start 1 --end 5` +2. Read each generated image (page_0001.jpg, page_0002.jpg, etc.) +3. Convert each image to Markdown +4. Combine all pages with `\n\n` separators +5. Save to `research_paper.md` +6. 
Clean up temporary images (optional) + +## Error Handling + +If you encounter errors: +- **PDF not found**: Verify the file path and inform the user +- **Invalid page range**: Check that the start page is >= 1 and within the document, and that the end page is >= the start page +- **Image read errors**: Ensure images were extracted successfully +- **Conversion issues**: Ask the user for clarification if content is unclear + +## Notes + +- This skill leverages your native vision capabilities to read PDF page images +- No external LLM API is used - you perform all analysis directly +- The PDF extraction script (`pdf2md.py`) is standalone: it reuses PDF processing code copied from `file_worker.py` in the markpdfdown library rather than importing it +- Focus on accuracy and maintaining document structure + +Begin the conversion process when the user invokes this skill!
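As a rough illustration of Steps 3 and 4 of the workflow in `skill_main.md` (combine all pages, then save the final output), the sketch below joins per-page Markdown strings with a blank line and writes the result to a file. The helper names `combine_pages` and `write_combined` are hypothetical and are not part of `pdf2md.py`; dropping empty pages and ending the file with a single trailing newline are assumptions, not requirements stated by the skill.

```python
from pathlib import Path
from typing import Iterable


def combine_pages(page_markdown: Iterable[str]) -> str:
    """Join per-page Markdown with one blank line ("\n\n") between pages.

    Hypothetical helper: illustrates the combine step only.
    """
    # Strip surrounding whitespace so pages do not stack extra blank lines.
    cleaned = [page.strip() for page in page_markdown]
    # Assumption: pages that are empty after stripping (blank pages) are dropped.
    return "\n\n".join(page for page in cleaned if page) + "\n"


def write_combined(page_markdown: Iterable[str], output_file: str) -> Path:
    """Write the combined Markdown to output_file and return its path."""
    out = Path(output_file)
    out.write_text(combine_pages(page_markdown), encoding="utf-8")
    return out
```

For the `/pdf2md research_paper.pdf --start 1 --end 5` example, this would be called as `write_combined(pages, "research_paper.md")`, where `pages` holds the Markdown produced for `page_0001.jpg` through `page_0005.jpg`, in page order.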