marvelken/Docs--Scrapper

Documentation Scraper

A Python web scraper that extracts content from documentation pages, converts it to MDX format, and downloads the referenced images automatically.

Features

  • Scrapes content from documentation websites
  • Downloads images locally and updates MDX references
  • Converts HTML content to clean MDX format
  • Automatically creates organized output directory structure
  • Handles network errors and parsing errors

Prerequisites

System Requirements

  • Python 3.6 or higher
  • pip (Python package installer)

Installation for Different Operating Systems

macOS

# Install Python via Homebrew (if not already installed)
brew install python3

# Verify installation
python3 --version
pip3 --version

Windows

  1. Download Python from python.org
  2. During installation, check "Add Python to PATH"
  3. Verify installation:
    python --version
    pip --version

Linux (Ubuntu/Debian)

# Update package list
sudo apt update

# Install Python and pip
sudo apt install python3 python3-pip

# Verify installation
python3 --version
pip3 --version

Setup

  1. Clone or download this repository

    git clone <repository-url>
    cd Docs--Scrapper
  2. Create and activate a virtual environment

    macOS/Linux:

    python3 -m venv venv
    source venv/bin/activate

    Windows:

    python -m venv venv
    venv\Scripts\activate
  3. Install dependencies

    pip install requests beautifulsoup4 html2text
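
To confirm that the install step worked inside the active environment, a quick check like the following can help. It is a minimal sketch, not part of the repository: the `missing_packages` helper and the `REQUIRED` mapping are hypothetical names introduced here for illustration.

```python
import importlib.util

# pip name -> import name. Note that beautifulsoup4 is imported as "bs4".
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "html2text": "html2text",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```

Run it with the same interpreter you will use for scraper.py, so that it inspects the virtual environment rather than the system Python.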

Configuration

Setting the Target URL

  1. Open scraper.py in your text editor
  2. Find the target_url variable (around lines 88-90):
    target_url = (
        "https://docs.frankieone.com/docs/welcome-to-frankie"
    )
  3. Replace with your desired documentation URL

Finding the Right Content Element

The scraper needs to identify the main content area of the webpage. By default, it looks for an <article> tag.

To customize the content selector:

  1. Inspect the target webpage:

    • Right-click on the main content area
    • Select "Inspect" (labeled "Inspect Element" in some browsers)
    • Look for the HTML element containing the documentation content
  2. Common content selectors:

    • <main> - Main content area
    • <article> - Article content
    • <div class="content"> - Content div with class
    • <div id="main-content"> - Content div with ID
  3. Update the scraper:

    • Find line 59 in scraper.py:
      content_div = soup.find("article")
    • Replace with appropriate selector:
      # For main tag
      content_div = soup.find("main")
      
      # For div with class
      content_div = soup.find("div", class_="content")
      
      # For div with ID
      content_div = soup.find("div", id="main-content")
      
      # For multiple classes (CSS selector; matches regardless of class order)
      content_div = soup.select_one("div.docs-content.main-area")
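
If you are unsure which selector a site uses, one option is to try several candidates in order and keep the first match. The sketch below assumes Beautiful Soup's CSS selector support (`select_one`). The `find_content` helper and the `CANDIDATE_SELECTORS` list are hypothetical and do not come from scraper.py.

```python
from bs4 import BeautifulSoup

# Candidate CSS selectors, tried in order; adjust them to match the target site.
CANDIDATE_SELECTORS = ["article", "main", "div.content", "div#main-content"]


def find_content(html, selectors=CANDIDATE_SELECTORS):
    """Return (element, selector) for the first selector that matches.

    Returns (None, None) when no candidate matches.
    """
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element, selector
    return None, None


sample_html = "<html><body><nav>Menu</nav><main><h1>Guide</h1></main></body></html>"
element, matched = find_content(sample_html)
print(matched)  # main
```

Reporting which selector matched makes it easier to hard-code the right one on line 59 afterwards.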

Usage

Basic Usage

  1. Activate virtual environment (if not already active):

    macOS/Linux:

    source venv/bin/activate

    Windows:

    venv\Scripts\activate
  2. Run the scraper:

    python scraper.py

Advanced Usage

Scraping Multiple URLs

Edit scraper.py to include multiple URLs:

urls = [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3"
]

for url in urls:
    scrape_and_save_as_mdx(url)
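
When scraping many pages, it can be useful to record which URLs failed instead of stopping at the first error. The following is a sketch under the assumption that the scrape function raises an exception on failure. If `scrape_and_save_as_mdx` handles its own errors internally, the wrapper simply reports every URL as succeeded. `scrape_all` and `fake_scrape` are hypothetical names used only for this example.

```python
def scrape_all(urls, scrape_fn):
    """Call scrape_fn on each URL and return (succeeded, failed).

    failed holds (url, error message) pairs, so one bad page does not
    stop the rest of the batch.
    """
    succeeded, failed = [], []
    for url in urls:
        try:
            scrape_fn(url)
            succeeded.append(url)
        except Exception as exc:  # broad on purpose: keep the batch going
            failed.append((url, str(exc)))
    return succeeded, failed


def fake_scrape(url):
    """Hypothetical stand-in for scrape_and_save_as_mdx, used for the demo."""
    if url.endswith("page2"):
        raise ValueError("Could not find the main content div")


demo_urls = [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3",
]
ok, bad = scrape_all(demo_urls, fake_scrape)
print(len(ok), "succeeded;", len(bad), "failed")  # 2 succeeded; 1 failed
```

In scraper.py you would pass the real function instead: `scrape_all(urls, scrape_and_save_as_mdx)`.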

Custom Output Directory

scrape_and_save_as_mdx(target_url, output_dir="my_custom_docs")

Troubleshooting

Common Issues

  1. "Could not find the main content div"

    • The content selector is incorrect
    • Inspect the webpage and update the selector on line 59 of scraper.py
  2. "ModuleNotFoundError: No module named 'requests'"

    • Virtual environment not activated
    • Dependencies not installed
    • Run: pip install requests beautifulsoup4 html2text
  3. "command not found: python"

    • On macOS/Linux, use python3 instead of python
    • Or activate the virtual environment first
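
Issues 2 and 3 often come down to running a different interpreter than the one inside the virtual environment. A short check, sketched below with standard-library calls only, shows which interpreter is running and whether a venv-created environment is active. The `in_virtualenv` helper name is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

If the second line prints False, activate the environment and run the install command again.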

Deactivating Virtual Environment

When you're done:

deactivate

Output

  • MDX files: Saved to output_mdx/ directory
  • Images: Downloaded to output_mdx/images/ directory
  • File naming: Based on URL path (e.g., welcome-to-frankie.mdx)
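
The naming rule can be illustrated with a small sketch that takes the last non-empty segment of the URL path as the file name. This is an assumption about how scraper.py derives names, not a copy of its code. The `mdx_path_for` helper and the "index" fallback for URLs without a path are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def mdx_path_for(url, output_dir="output_mdx"):
    """Build an output path from the last non-empty segment of the URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    slug = segments[-1] if segments else "index"  # fallback name is an assumption
    return Path(output_dir) / f"{slug}.mdx"


print(mdx_path_for("https://docs.frankieone.com/docs/welcome-to-frankie").name)
# welcome-to-frankie.mdx
```

Note that two URLs ending in the same segment would map to the same file under this rule, so pages scraped in a batch could overwrite each other.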

Dependencies

  • requests: HTTP requests for fetching web pages
  • beautifulsoup4: HTML parsing and element selection
  • html2text: HTML to Markdown/MDX conversion
