pixlie · brainless · Jul 9, 2025 · Jul 9, 2025 · Jul 9, 2025
diff --git a/README.md b/README.md
@@ -1,303 +1,112 @@
 # SmartCrawler
 
-A smart web crawler that uses AI to intelligently select and analyze web pages based on your specific objectives. SmartCrawler automatically discovers sitemaps, selects the most relevant URLs using Claude AI, and provides detailed analysis of the content it finds.
-
-## 📋 Table of Contents
-
-- [Features](#features)
-- [Quick Start](#quick-start)
-- [Installation](#installation)
-- [WebDriver Setup](#webdriver-setup)
-- [Usage](#usage)
-- [Examples](#examples)
-- [Configuration](#configuration)
-- [Troubleshooting](#troubleshooting)
-- [License](#license)
+A web crawler that uses WebDriver to extract and parse HTML content from web pages with intelligent duplicate detection and template pattern recognition.
 
 ## ✨ Features
 
-- **🤖 AI-Powered URL Selection**: Uses Claude AI to intelligently select relevant URLs from sitemaps
-- **🗺️ Automatic Sitemap Discovery**: Finds and parses XML sitemaps across multiple domains
-- **📄 Smart Content Analysis**: AI-powered analysis of scraped content for objective-specific insights
-- **🌐 Multi-Domain Support**: Crawl multiple websites in a single session
-- **⚡ Dynamic Content Loading**: Scrolls through pages to capture JavaScript-rendered content
-- **📊 Structured Output**: Results saved in JSON format for further analysis
-- **🛡️ Respectful Crawling**: Built-in rate limiting and respectful crawling practices
+- **🌐 Multi-URL Crawling**: Crawl multiple URLs in a single session
+- **🔍 Intelligent Duplicate Detection**: Automatically identifies and filters duplicate content patterns across domains
+- **📋 Template Pattern Recognition**: Detects variable patterns in content (e.g., "42 comments" → "{count} comments")
+- **🌳 Structured HTML Tree**: Provides filtered HTML tree view with duplicate marking
+- **⚡ WebDriver Integration**: Uses WebDriver for dynamic content handling
+- **📊 Verbose Output**: Detailed HTML tree analysis with filtering information
 
 ## 🚀 Quick Start
 
-1. **Download SmartCrawler** from the [releases page](https://github.com/brainless/SmartCrawler/releases)
-2. **Set up WebDriver** (Firefox or Chrome)
-3. **Get Claude API key** from [Anthropic](https://console.anthropic.com/)
-4. **Run SmartCrawler** with your objective
-
-## 📦 Installation
-
-### Option 1: Download Pre-built Binary (Recommended)
-
-**Windows:**
-1. Download `smart-crawler-windows-x64.zip` from [releases](https://github.com/brainless/SmartCrawler/releases)
-2. Extract the ZIP file to a folder (e.g., `C:\SmartCrawler\`)
-3. Add the folder to your PATH or run from the folder directly
-
-**macOS:**
-1. Download `smart-crawler-macos-x64.tar.gz` (Intel) or `smart-crawler-macos-arm64.tar.gz` (Apple Silicon)
-2. Extract: `tar -xzf smart-crawler-macos-*.tar.gz`
-3. Move to applications: `sudo mv smart-crawler /usr/local/bin/`
-4. Make executable: `chmod +x /usr/local/bin/smart-crawler`
-
-**Linux:**
-1. Download `smart-crawler-linux-x64.tar.gz`
-2. Extract: `tar -xzf smart-crawler-linux-x64.tar.gz`
-3. Move to bin: `sudo mv smart-crawler /usr/local/bin/`
-4. Make executable: `chmod +x /usr/local/bin/smart-crawler`
-
-### Option 2: Package Installers
+1. **Install SmartCrawler** - Download from [releases](https://github.com/pixlie/SmartCrawler/releases) or build from source
+2. **Set up WebDriver** - Install Firefox/Chrome and corresponding WebDriver
+3. **Start crawling** - Run SmartCrawler with your target URLs
 
-**Windows MSI Installer:**
-1. Download `smart-crawler-[version].msi` from releases
-2. Double-click to install
-3. SmartCrawler will be available in your PATH
-
-**Linux DEB Package (Ubuntu/Debian):**
-```bash
-wget https://github.com/brainless/SmartCrawler/releases/latest/download/smart-crawler-[version].deb
-sudo dpkg -i smart-crawler-[version].deb
-```
-
-**Linux RPM Package (RHEL/CentOS/Fedora):**
 ```bash
-wget https://github.com/brainless/SmartCrawler/releases/latest/download/smart-crawler-[version].rpm
-sudo rpm -i smart-crawler-[version].rpm
-```
-
-**macOS DMG:**
-1. Download `smart-crawler-[version].dmg` from releases
-2. Open the DMG and copy `smart-crawler` to `/usr/local/bin/`
+# Basic usage
+smart-crawler --link "https://example.com"
 
-### Option 3: Build from Source
+# Multiple URLs with verbose output
+smart-crawler --link "https://example.com" --link "https://another.com" --verbose
 
-If you have Rust installed:
-```bash
-git clone https://github.com/brainless/SmartCrawler.git
-cd SmartCrawler
-cargo build --release
-# Binary will be in target/release/smart-crawler
+# Template detection mode
+smart-crawler --link "https://example.com" --template --verbose
 ```
 
-## 🌐 WebDriver Setup
-
-SmartCrawler requires a WebDriver to control a browser for scraping. Choose one:
+## 📖 Documentation
 
-### Firefox (GeckoDriver) - Recommended
+### Getting Started
 
-**Windows:**
-1. Download `geckodriver.exe` from [Mozilla releases](https://github.com/mozilla/geckodriver/releases)
-2. Place in a folder in your PATH or the same folder as SmartCrawler
-3. Install Firefox browser if not already installed
+Choose your operating system for detailed setup instructions:
 
-**macOS:**
-```bash
-# Using Homebrew (recommended)
-brew install geckodriver
+- **[Windows Setup](docs/getting-started-windows.md)** - Complete Windows installation guide
+- **[macOS Setup](docs/getting-started-macos.md)** - macOS installation and setup
+- **[Linux Setup](docs/getting-started-linux.md)** - Linux installation for various distributions
 
-# Or download manually from releases page
-```
+### Usage
 
-**Linux:**
-```bash
-# Ubuntu/Debian
-sudo apt update
-sudo apt install firefox-esr
-wget https://github.com/mozilla/geckodriver/releases/latest/download/geckodriver-v0.33.0-linux64.tar.gz
-tar -xzf geckodriver-v0.33.0-linux64.tar.gz
-sudo mv geckodriver /usr/local/bin/
-
-# RHEL/CentOS/Fedora
-sudo dnf install firefox
-# Then download geckodriver as above
-```
+- **[CLI Options](docs/cli-options.md)** - Complete command-line reference and examples
 
-### Chrome (ChromeDriver) - Alternative
+### Development
 
-**Windows:**
-1. Download ChromeDriver from [Chrome for Testing](https://googlechromelabs.github.io/chrome-for-testing/)
-2. Place `chromedriver.exe` in your PATH or SmartCrawler folder
-3. Install Chrome browser if not already installed
+- **[Development Guide](docs/development.md)** - Setup, building, testing, and contributing instructions
 
-**macOS:**
-```bash
-# Using Homebrew
-brew install chromedriver
+## 🔧 System Requirements
 
-# Or download manually
-```
+- **Operating System**: Windows 10+, macOS 10.15+, or Linux
+- **Browser**: Firefox (recommended) or Chrome
+- **WebDriver**: GeckoDriver (Firefox) or ChromeDriver (Chrome)
+- **Memory**: 512MB RAM minimum, 1GB recommended
 
-**Linux:**
-```bash
-# Ubuntu/Debian
-sudo apt update
-sudo apt install google-chrome-stable
-wget https://chromedriver.storage.googleapis.com/[VERSION]/chromedriver_linux64.zip
-unzip chromedriver_linux64.zip
-sudo mv chromedriver /usr/local/bin/
-
-# Check Chrome version: google-chrome --version
-# Download matching ChromeDriver version
-```
+## 📋 Quick Reference
 
-## 🎯 Usage
+### Basic Commands
 
-### 1. Set up your Claude API key
+```bash
+# Crawl a single URL
+smart-crawler --link "https://example.com"
 
-**Windows (Command Prompt):**
-```cmd
-set ANTHROPIC_API_KEY=your-api-key-here
-```
+# Crawl with detailed output
+smart-crawler --link "https://example.com" --verbose
 
-**Windows (PowerShell):**
-```powershell
-$env:ANTHROPIC_API_KEY="your-api-key-here"
-```
+# Template detection mode
+smart-crawler --link "https://example.com" --template --verbose
 
-**macOS/Linux:**
-```bash
-export ANTHROPIC_API_KEY="your-api-key-here"
+# Multiple URLs
+smart-crawler --link "https://site1.com" --link "https://site2.com"
 ```
 
-### 2. Start WebDriver
+### WebDriver Setup
 
-**For Firefox (GeckoDriver):**
 ```bash
-# Windows
-geckodriver.exe --port 4444
-
-# macOS/Linux  
+# Start Firefox WebDriver
 geckodriver --port 4444
-```
-
-**For Chrome (ChromeDriver):**
-```bash
-# Windows
-chromedriver.exe --port=4444
 
-# macOS/Linux
+# Start Chrome WebDriver
 chromedriver --port=4444
 ```
 
-### 3. Run SmartCrawler
+## 🛠️ Development
 
-**Basic usage:**
-```bash
-smart-crawler --objective "Find pricing information" --domains "example.com" --max-urls 5
-```
-
-**With output file:**
-```bash
-smart-crawler -o "Find contact information" -d "company.com,business.org" -m 10 -O results.json
-```
-
-**Command-line options:**
-```
--o, --objective <OBJECTIVE>    What information to look for [REQUIRED]
--d, --domains <DOMAINS>        Comma-separated domains to crawl [REQUIRED]
--m, --max-urls <NUMBER>        Maximum URLs per domain [default: 10]
-    --delay <MILLISECONDS>     Delay between requests [default: 1000]
--O, --output <FILE>           Save results to JSON file
--v, --verbose                 Enable detailed logging
-```
+For developers interested in contributing to SmartCrawler or building from source:
 
-## 📝 Examples
+- **[Development Guide](docs/development.md)** - Complete setup, building, testing, and contributing instructions
 
-### E-commerce Price Research
-```bash
-smart-crawler \
-  --objective "Find product pricing, discounts, and shipping costs" \
-  --domains "shop1.com,shop2.com,competitor.com" \
-  --max-urls 15 \
-  --output pricing-research.json
-```
-
-### Company Information Gathering
-```bash
-smart-crawler \
-  -o "Find company contact information, team members, and office locations" \
-  -d "company.com" \
-  -m 8 \
-  --delay 2000 \
-  -v
-```
-
-### Technical Documentation Search
-```bash
-smart-crawler \
-  --objective "Find API documentation, integration guides, and developer resources" \
-  --domains "docs.example.com,api.service.com" \
-  --max-urls 20 \
-  --output api-docs.json
-```
-
-### News and Content Analysis
-```bash
-smart-crawler \
-  -o "Find recent news articles about artificial intelligence and machine learning" \
-  -d "news-site.com,tech-blog.com" \
-  -m 12 \
-  --delay 1500
-```
-
-## ⚙️ Configuration
-
-### Environment Variables
-
-- `ANTHROPIC_API_KEY`: Your Claude API key (required)
-- `RUST_LOG`: Set logging level (`debug`, `info`, `warn`, `error`)
-
-### Best Practices
-
-- **Specific objectives**: Use clear, specific objectives for better URL selection
-- **Conservative limits**: Start with lower `--max-urls` values (5-10) to test
-- **Respectful delays**: Use delays of 1000ms or more to avoid overwhelming servers
-- **Check robots.txt**: Review target sites' crawling policies
-- **Monitor API usage**: Claude API has usage limits and costs
-
-## 🔧 Troubleshooting
-
-### Common Issues
-
-**"WebDriver connection failed"**
-- Ensure WebDriver (geckodriver/chromedriver) is running on port 4444
-- Check that the browser is installed
-- Try restarting the WebDriver
-
-**"ANTHROPIC_API_KEY not found"**
-- Set the environment variable in your terminal session
-- Verify the API key is correct and has sufficient credits
-
-**"Permission denied" (macOS/Linux)**
-- Make the binary executable: `chmod +x smart-crawler`
-- For system-wide installation, use `sudo` when moving to `/usr/local/bin/`
+## 📄 License
 
-**"No URLs selected by LLM"**
-- Try a more specific objective
-- Increase `--max-urls` limit
-- Use `--verbose` to see what URLs were found
+This project is licensed under the GPL-3.0 license - see the [LICENSE](LICENSE) file for details.
 
-**Rate limiting/blocked requests**
-- Increase `--delay` between requests
-- Some sites block crawlers; respect robots.txt
-- Consider using fewer concurrent requests
+## 🔗 Links
 
-### Getting Help
+- [GitHub Repository](https://github.com/pixlie/SmartCrawler)
+- [Issue Tracker](https://github.com/pixlie/SmartCrawler/issues)
+- [Releases](https://github.com/pixlie/SmartCrawler/releases)
+- [Documentation](docs/)
 
-- Check the [Issues page](https://github.com/brainless/SmartCrawler/issues) for known problems
-- Create a new issue with detailed error messages and system information
-- Include your command and any error output when reporting problems
+## 🆘 Support
 
-## 📄 License
+If you encounter issues:
 
-GPL-3.0 License - see [LICENSE](LICENSE) file for details.
+1. Check the [getting started guides](docs/) for your operating system
+2. Review the [CLI options documentation](docs/cli-options.md)
+3. Search existing [GitHub issues](https://github.com/pixlie/SmartCrawler/issues)
+4. Create a new issue with detailed error information
 
 ---
 
-**Note**: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' terms of service and robots.txt files. Be mindful of rate limits and server resources when crawling.
+**Note**: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service.