A Python web scraper that extracts content from documentation pages and converts it to MDX format, downloading images automatically.
- Scrapes content from documentation websites
- Downloads images locally and updates MDX references
- Converts HTML content to clean MDX format
- Automatically creates organized output directory structure
- Handles network and parsing errors
- Python 3.6 or higher
- pip (Python package installer)
macOS:
# Install Python via Homebrew (if not already installed)
brew install python3
# Verify installation
python3 --version
pip3 --version
Windows:
- Download Python from python.org
- During installation, check "Add Python to PATH"
- Verify installation:
python --version
pip --version
Linux (Ubuntu/Debian):
# Update package list
sudo apt update
# Install Python and pip
sudo apt install python3 python3-pip
# Verify installation
python3 --version
pip3 --version
- Clone or download this repository
git clone <repository-url>
cd docs-scraper
- Create and activate a virtual environment
macOS/Linux:
python3 -m venv venv
source venv/bin/activate
Windows:
python -m venv venv
venv\Scripts\activate
- Install dependencies
pip install requests beautifulsoup4 html2text
- Open scraper.py in your text editor
- Find the target_url variable (around lines 88-90):
target_url = (
    "https://docs.frankieone.com/docs/welcome-to-frankie"
)
- Replace with your desired documentation URL
The scraper needs to identify the main content area of the webpage. By default, it looks for an <article> tag.
To customize the content selector:
- Inspect the target webpage:
- Right-click on the main content area
- Select "Inspect" or "Inspect Element" (the label varies by browser)
- Look for the HTML element containing the documentation content
- Common content selectors:
- <main> - Main content area
- <article> - Article content
- <div class="content"> - Content div with class
- <div id="main-content"> - Content div with ID
- Update the scraper:
- Find line 59 in scraper.py:
content_div = soup.find("article")
- Replace with appropriate selector:
# For main tag
content_div = soup.find("main")
# For div with class
content_div = soup.find("div", class_="content")
# For div with ID
content_div = soup.find("div", id="main-content")
# For multiple classes (matches the exact class attribute string, in this order)
content_div = soup.find("div", class_="docs-content main-area")
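When the right selector is not known in advance, one option is to try several candidates in order and keep the first match. The sketch below is an illustration only: `find_content` and `CANDIDATE_SELECTORS` are hypothetical names, not part of scraper.py, and it uses Beautiful Soup's CSS-selector method `select_one` instead of the `soup.find(...)` call on line 59.

```python
from bs4 import BeautifulSoup

# Hypothetical candidate list, tried in order; adjust it to the target site.
CANDIDATE_SELECTORS = ["article", "main", "div.content", "div#main-content"]


def find_content(html, selectors=CANDIDATE_SELECTORS):
    """Return the first element matching any candidate CSS selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None


# Small inline page used only for demonstration.
html = '<html><body><div id="main-content"><h1>Docs</h1></div></body></html>'
content = find_content(html)
print(content.get("id") if content is not None else "no match")
```

With CSS selectors, a multi-class match such as `div.docs-content.main-area` does not depend on the order of the classes in the HTML, which may be more forgiving than matching the exact class string.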
- Activate virtual environment (if not already active):
macOS/Linux:
source venv/bin/activate
Windows:
venv\Scripts\activate
- Run the scraper:
python scraper.py
Edit scraper.py to include multiple URLs:
urls = [
"https://docs.example.com/page1",
"https://docs.example.com/page2",
"https://docs.example.com/page3"
]
for url in urls:
    scrape_and_save_as_mdx(url)
To save to a custom output directory, pass output_dir:
scrape_and_save_as_mdx(target_url, output_dir="my_custom_docs")
- "Could not find the main content div"
- The content selector is incorrect
- Inspect the webpage and update the selector in line 59
- "ModuleNotFoundError: No module named 'requests'"
- Virtual environment not activated
- Dependencies not installed
- Run:
pip install requests beautifulsoup4 html2text
- "command not found: python"
- On macOS/Linux, use python3 instead of python
- Or activate the virtual environment first
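For the ModuleNotFoundError case above, a quick way to see which dependencies are missing from the active environment is to ask Python whether each module can be found. This is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_packages` are hypothetical names introduced for illustration.

```python
import importlib.util

# Maps import name to pip package name. They can differ:
# beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "html2text": "html2text",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All dependencies are importable.")
```

If packages are reported missing right after installing them, the virtual environment is probably not the one currently active.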
When you're done, deactivate the virtual environment:
deactivate
- MDX files: Saved to the output_mdx/ directory
- Images: Downloaded to the output_mdx/images/ directory
- File naming: Based on URL path (e.g., welcome-to-frankie.mdx)
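The naming scheme above can be sketched as follows. This is an assumption about how a URL could map to output paths, not the exact logic in scraper.py: `mdx_filename` and `local_image_path` are hypothetical helpers, and the `index` fallback for a bare domain is an invented default.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def mdx_filename(url):
    """Derive an .mdx filename from the last segment of the URL path."""
    path = urlparse(url).path.rstrip("/")
    slug = PurePosixPath(path).name or "index"  # "index" is an assumed fallback
    return slug + ".mdx"


def local_image_path(image_url, output_dir="output_mdx"):
    """Map a remote image URL to a path under <output_dir>/images/."""
    name = PurePosixPath(urlparse(image_url).path).name
    return str(PurePosixPath(output_dir) / "images" / name)


print(mdx_filename("https://docs.frankieone.com/docs/welcome-to-frankie"))
print(local_image_path("https://example.com/assets/diagram.png"))
```

Two pages whose URLs end in the same segment would map to the same filename under this scheme, so a batch run over many URLs may need a way to disambiguate them.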
- requests: HTTP requests and web scraping
- beautifulsoup4: HTML parsing and element selection
- html2text: HTML to Markdown/MDX conversion