Documentation Scraper

A Python web scraper that extracts content from documentation pages, converts it to MDX format, and automatically downloads the images it references.

Features

Scrapes content from documentation websites
Downloads images locally and updates MDX references
Converts HTML content to clean MDX format
Automatically creates organized output directory structure
Handles network issues and parsing errors
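
To illustrate the "downloads images locally and updates MDX references" feature, here is a minimal sketch of how image references could be rewritten to local paths. It is not taken from scraper.py: the function name `rewrite_image_refs`, the regular expression, and the `images` directory default are assumptions for illustration, and the sketch only computes what to download rather than fetching anything.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Matches Markdown images of the form ![alt](src); image titles are not handled.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_refs(markdown, page_url, images_dir="images"):
    """Point image references at local files (hypothetical helper).

    Returns (new_markdown, downloads), where downloads maps each absolute
    image URL to the relative local path it should be saved under.
    """
    downloads = {}

    def replace(match):
        alt, src = match.group(1), match.group(2)
        # Resolve relative and root-relative sources against the page URL.
        absolute = urljoin(page_url, src)
        filename = posixpath.basename(urlparse(absolute).path) or "image"
        local_path = f"{images_dir}/{filename}"
        downloads[absolute] = local_path
        return f"![{alt}]({local_path})"

    return IMAGE_PATTERN.sub(replace, markdown), downloads
```

Two images that share a filename would map to the same local path in this sketch; a real implementation would need to handle such collisions.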

Prerequisites

System Requirements

Python 3.6 or higher
pip (Python package installer)

Installation for Different Operating Systems

macOS

# Install Python via Homebrew (if not already installed)
brew install python3

# Verify installation
python3 --version
pip3 --version

Windows

Download Python from python.org
During installation, check "Add Python to PATH"
Verify installation:
```
python --version
pip --version
```

Linux (Ubuntu/Debian)

# Update package list
sudo apt update

# Install Python and pip
sudo apt install python3 python3-pip

# Verify installation
python3 --version
pip3 --version

Setup

Clone or download this repository

git clone <repository-url>
cd docs-scraper

Create and activate a virtual environment

macOS/Linux:

python3 -m venv venv
source venv/bin/activate

Windows:

python -m venv venv
venv\Scripts\activate

Install dependencies

pip install requests beautifulsoup4 html2text

Configuration

Setting the Target URL

Open scraper.py in your text editor

Find the target_url variable (around lines 88-90):

target_url = (
    "https://docs.frankieone.com/docs/welcome-to-frankie"
)

Replace with your desired documentation URL

Finding the Right Content Element

The scraper needs to identify the main content area of the webpage. By default, it looks for an <article> tag.

To customize the content selector:

Inspect the target webpage:
- Right-click on the main content area
- Select "Inspect" (labeled "Inspect Element" in some browsers)
- Look for the HTML element containing the documentation content
Common content selectors:
- <main> - Main content area
- <article> - Article content
- <div class="content"> - Content div with class
- <div id="main-content"> - Content div with ID

Update the scraper:

Find line 59 in scraper.py:
```
content_div = soup.find("article")
```

Replace with appropriate selector:

# For main tag
content_div = soup.find("main")

# For div with class
content_div = soup.find("div", class_="content")

# For div with ID
content_div = soup.find("div", id="main-content")

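# For multiple classes (a CSS selector matches them in any order;
# class_="docs-content main-area" would only match that exact attribute string)
content_div = soup.select_one("div.docs-content.main-area")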
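
If you are unsure which selector a site uses, one option is to try several in order. The sketch below is an assumption, not part of scraper.py: `find_content` and `DEFAULT_SELECTORS` are hypothetical names, and `soup` is expected to be the BeautifulSoup object the scraper already builds (the helper relies only on its `select_one()` method).

```python
# Candidate CSS selectors, tried in order; adjust the list for the target site.
DEFAULT_SELECTORS = ("article", "main", "div.content", "div#main-content")


def find_content(soup, selectors=DEFAULT_SELECTORS):
    """Return the first element matching any selector, or None if none match.

    `soup` is any object with a BeautifulSoup-style select_one() method.
    """
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None
```

With this helper, the line in scraper.py could become `content_div = find_content(soup)`, which still yields `None` when nothing matches, so the existing "Could not find the main content div" check should keep working.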

Usage

Basic Usage

Activate virtual environment (if not already active):

macOS/Linux:
```
source venv/bin/activate
```
Windows:
```
venv\Scripts\activate
```
Run the scraper:
```
python scraper.py
```

Advanced Usage

Scraping Multiple URLs

Edit scraper.py to include multiple URLs:

urls = [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3"
]

for url in urls:
    scrape_and_save_as_mdx(url)
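
When scraping many pages, you may want one failing URL not to stop the rest. The following is a minimal sketch under the assumption that the scrape function raises an exception on failure (it may instead handle errors internally, in which case the wrapper is harmless). `scrape_all` is a hypothetical helper, and the scrape function is passed in as an argument so the wrapper does not depend on scraper.py internals.

```python
def scrape_all(urls, scrape_fn):
    """Call scrape_fn(url) for each URL and keep going after failures.

    Returns a list of (url, exception) pairs for the URLs that failed.
    """
    failures = []
    for url in urls:
        try:
            scrape_fn(url)
        except Exception as exc:  # broad on purpose: keep the batch running
            print(f"Failed: {url} ({exc})")
            failures.append((url, exc))
    return failures
```

Usage would look like `failures = scrape_all(urls, scrape_and_save_as_mdx)`, after which `failures` lists any pages worth retrying.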

Custom Output Directory

scrape_and_save_as_mdx(target_url, output_dir="my_custom_docs")

Troubleshooting

Common Issues

"Could not find the main content div"
- The content selector is incorrect
- Inspect the webpage and update the selector in line 59

"ModuleNotFoundError: No module named 'requests'"
- Virtual environment not activated
- Dependencies not installed
- Run: pip install requests beautifulsoup4 html2text

"command not found: python"
- On macOS/Linux, use python3 instead of python
- Or activate the virtual environment first
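
To narrow down a ModuleNotFoundError, a quick check like the one below can report which dependencies are missing from the active environment. This is a standalone sketch, not part of the repository; `missing_modules` and `REQUIRED_MODULES` are hypothetical names. Run it with the same interpreter you use for scraper.py.

```python
import importlib.util

# Import names for the packages installed in Setup.
# Note that the beautifulsoup4 package is imported as "bs4".
REQUIRED_MODULES = ("requests", "bs4", "html2text")


def missing_modules(module_names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```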

Deactivating Virtual Environment

When you're done:

deactivate

Output

MDX files: Saved to output_mdx/ directory
Images: Downloaded to output_mdx/images/ directory
File naming: Based on URL path (e.g., welcome-to-frankie.mdx)
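
As an illustration of URL-based file naming, the sketch below derives a filename from the last path segment. The actual logic in scraper.py is not shown in this README, so treat `mdx_filename_from_url` and its `index` fallback as assumptions.

```python
from urllib.parse import urlparse


def mdx_filename_from_url(url, default="index"):
    """Derive an .mdx filename from the last non-empty segment of the URL path."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    # Fall back to a default name for URLs with an empty path.
    name = segments[-1] if segments else default
    return f"{name}.mdx"
```

For the default target URL in this README, the function returns `welcome-to-frankie.mdx`, matching the example above.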

Dependencies

requests: HTTP requests and web scraping
beautifulsoup4: HTML parsing and element selection
html2text: HTML to Markdown/MDX conversion
