From 543ec3a431b5be7fb7584a223643c541819ddef0 Mon Sep 17 00:00:00 2001 From: Sumit Datta Date: Wed, 9 Jul 2025 08:43:24 +0530 Subject: [PATCH 1/2] chore: restructure documentation with docs/ folder and OS-specific guides MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Create docs/ folder with structured documentation - Add OS-specific getting started guides: - Windows installation and setup guide - macOS installation and setup guide - Linux installation for various distributions - Create comprehensive CLI options documentation - Update README.md to be minimal with links to detailed docs - Improve documentation organization for better user experience Addresses issue #64 πŸ€– Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude --- README.md | 318 ++++++++------------------------ docs/cli-options.md | 115 ++++++++++++ docs/getting-started-linux.md | 272 +++++++++++++++++++++++++++ docs/getting-started-macos.md | 228 +++++++++++++++++++++++ docs/getting-started-windows.md | 152 +++++++++++++++ 5 files changed, 844 insertions(+), 241 deletions(-) create mode 100644 docs/cli-options.md create mode 100644 docs/getting-started-linux.md create mode 100644 docs/getting-started-macos.md create mode 100644 docs/getting-started-windows.md diff --git a/README.md b/README.md index 932fa6f..d8b3612 100644 --- a/README.md +++ b/README.md @@ -1,303 +1,139 @@ # SmartCrawler -A smart web crawler that uses AI to intelligently select and analyze web pages based on your specific objectives. SmartCrawler automatically discovers sitemaps, selects the most relevant URLs using Claude AI, and provides detailed analysis of the content it finds. - -## πŸ“‹ Table of Contents - -- [Features](#features) -- [Quick Start](#quick-start) -- [Installation](#installation) -- [WebDriver Setup](#webdriver-setup) -- [Usage](#usage) -- [Examples](#examples) -- [Configuration](#configuration) -- [Troubleshooting](#troubleshooting) -- [License](#license) +A web crawler that uses WebDriver to extract and parse HTML content from web pages with intelligent duplicate detection and template pattern recognition. ## ✨ Features -- **πŸ€– AI-Powered URL Selection**: Uses Claude AI to intelligently select relevant URLs from sitemaps -- **πŸ—ΊοΈ Automatic Sitemap Discovery**: Finds and parses XML sitemaps across multiple domains -- **πŸ“„ Smart Content Analysis**: AI-powered analysis of scraped content for objective-specific insights -- **🌐 Multi-Domain Support**: Crawl multiple websites in a single session -- **⚑ Dynamic Content Loading**: Scrolls through pages to capture JavaScript-rendered content -- **πŸ“Š Structured Output**: Results saved in JSON format for further analysis -- **πŸ›‘οΈ Respectful Crawling**: Built-in rate limiting and respectful crawling practices +- **🌐 Multi-URL Crawling**: Crawl multiple URLs in a single session +- **πŸ” Intelligent Duplicate Detection**: Automatically identifies and filters duplicate content patterns across domains +- **πŸ“‹ Template Pattern Recognition**: Detects variable patterns in content (e.g., "42 comments" β†’ "{count} comments") +- **🌳 Structured HTML Tree**: Provides filtered HTML tree view with duplicate marking +- **⚑ WebDriver Integration**: Uses WebDriver for dynamic content handling +- **πŸ“Š Verbose Output**: Detailed HTML tree analysis with filtering information ## πŸš€ Quick Start -1. **Download SmartCrawler** from the [releases page](https://github.com/brainless/SmartCrawler/releases) -2. **Set up WebDriver** (Firefox or Chrome) -3. **Get Claude API key** from [Anthropic](https://console.anthropic.com/) -4. **Run SmartCrawler** with your objective - -## πŸ“¦ Installation - -### Option 1: Download Pre-built Binary (Recommended) - -**Windows:** -1. Download `smart-crawler-windows-x64.zip` from [releases](https://github.com/brainless/SmartCrawler/releases) -2. Extract the ZIP file to a folder (e.g., `C:\SmartCrawler\`) -3. Add the folder to your PATH or run from the folder directly - -**macOS:** -1. Download `smart-crawler-macos-x64.tar.gz` (Intel) or `smart-crawler-macos-arm64.tar.gz` (Apple Silicon) -2. Extract: `tar -xzf smart-crawler-macos-*.tar.gz` -3. Move to applications: `sudo mv smart-crawler /usr/local/bin/` -4. Make executable: `chmod +x /usr/local/bin/smart-crawler` - -**Linux:** -1. Download `smart-crawler-linux-x64.tar.gz` -2. Extract: `tar -xzf smart-crawler-linux-x64.tar.gz` -3. Move to bin: `sudo mv smart-crawler /usr/local/bin/` -4. Make executable: `chmod +x /usr/local/bin/smart-crawler` +1. **Install SmartCrawler** - Download from [releases](https://github.com/pixlie/SmartCrawler/releases) or build from source +2. **Set up WebDriver** - Install Firefox/Chrome and corresponding WebDriver +3. **Start crawling** - Run SmartCrawler with your target URLs -### Option 2: Package Installers - -**Windows MSI Installer:** -1. Download `smart-crawler-[version].msi` from releases -2. Double-click to install -3. SmartCrawler will be available in your PATH - -**Linux DEB Package (Ubuntu/Debian):** ```bash -wget https://github.com/brainless/SmartCrawler/releases/latest/download/smart-crawler-[version].deb -sudo dpkg -i smart-crawler-[version].deb -``` - -**Linux RPM Package (RHEL/CentOS/Fedora):** -```bash -wget https://github.com/brainless/SmartCrawler/releases/latest/download/smart-crawler-[version].rpm -sudo rpm -i smart-crawler-[version].rpm -``` - -**macOS DMG:** -1. Download `smart-crawler-[version].dmg` from releases -2. Open the DMG and copy `smart-crawler` to `/usr/local/bin/` +# Basic usage +smart-crawler --link "https://example.com" -### Option 3: Build from Source +# Multiple URLs with verbose output +smart-crawler --link "https://example.com" --link "https://another.com" --verbose -If you have Rust installed: -```bash -git clone https://github.com/brainless/SmartCrawler.git -cd SmartCrawler -cargo build --release -# Binary will be in target/release/smart-crawler +# Template detection mode +smart-crawler --link "https://example.com" --template --verbose ``` -## 🌐 WebDriver Setup +## πŸ“– Documentation -SmartCrawler requires a WebDriver to control a browser for scraping. Choose one: +### Getting Started -### Firefox (GeckoDriver) - Recommended +Choose your operating system for detailed setup instructions: -**Windows:** -1. Download `geckodriver.exe` from [Mozilla releases](https://github.com/mozilla/geckodriver/releases) -2. Place in a folder in your PATH or the same folder as SmartCrawler -3. Install Firefox browser if not already installed +- **[Windows Setup](docs/getting-started-windows.md)** - Complete Windows installation guide +- **[macOS Setup](docs/getting-started-macos.md)** - macOS installation and setup +- **[Linux Setup](docs/getting-started-linux.md)** - Linux installation for various distributions -**macOS:** -```bash -# Using Homebrew (recommended) -brew install geckodriver +### Usage -# Or download manually from releases page -``` - -**Linux:** -```bash -# Ubuntu/Debian -sudo apt update -sudo apt install firefox-esr -wget https://github.com/mozilla/geckodriver/releases/latest/download/geckodriver-v0.33.0-linux64.tar.gz -tar -xzf geckodriver-v0.33.0-linux64.tar.gz -sudo mv geckodriver /usr/local/bin/ - -# RHEL/CentOS/Fedora -sudo dnf install firefox -# Then download geckodriver as above -``` +- **[CLI Options](docs/cli-options.md)** - Complete command-line reference and examples -### Chrome (ChromeDriver) - Alternative +## πŸ”§ System Requirements -**Windows:** -1. Download ChromeDriver from [Chrome for Testing](https://googlechromelabs.github.io/chrome-for-testing/) -2. Place `chromedriver.exe` in your PATH or SmartCrawler folder -3. Install Chrome browser if not already installed +- **Operating System**: Windows 10+, macOS 10.15+, or Linux +- **Browser**: Firefox (recommended) or Chrome +- **WebDriver**: GeckoDriver (Firefox) or ChromeDriver (Chrome) +- **Memory**: 512MB RAM minimum, 1GB recommended -**macOS:** -```bash -# Using Homebrew -brew install chromedriver +## πŸ“‹ Quick Reference -# Or download manually -``` +### Basic Commands -**Linux:** ```bash -# Ubuntu/Debian -sudo apt update -sudo apt install google-chrome-stable -wget https://chromedriver.storage.googleapis.com/[VERSION]/chromedriver_linux64.zip -unzip chromedriver_linux64.zip -sudo mv chromedriver /usr/local/bin/ - -# Check Chrome version: google-chrome --version -# Download matching ChromeDriver version -``` +# Crawl a single URL +smart-crawler --link "https://example.com" -## 🎯 Usage +# Crawl with detailed output +smart-crawler --link "https://example.com" --verbose -### 1. Set up your Claude API key +# Template detection mode +smart-crawler --link "https://example.com" --template --verbose -**Windows (Command Prompt):** -```cmd -set ANTHROPIC_API_KEY=your-api-key-here +# Multiple URLs +smart-crawler --link "https://site1.com" --link "https://site2.com" ``` -**Windows (PowerShell):** -```powershell -$env:ANTHROPIC_API_KEY="your-api-key-here" -``` +### WebDriver Setup -**macOS/Linux:** ```bash -export ANTHROPIC_API_KEY="your-api-key-here" -``` - -### 2. Start WebDriver - -**For Firefox (GeckoDriver):** -```bash -# Windows -geckodriver.exe --port 4444 - -# macOS/Linux +# Start Firefox WebDriver geckodriver --port 4444 -``` -**For Chrome (ChromeDriver):** -```bash -# Windows -chromedriver.exe --port=4444 - -# macOS/Linux +# Start Chrome WebDriver chromedriver --port=4444 ``` -### 3. Run SmartCrawler +## πŸ› οΈ Development -**Basic usage:** -```bash -smart-crawler --objective "Find pricing information" --domains "example.com" --max-urls 5 -``` +### Building from Source -**With output file:** ```bash -smart-crawler -o "Find contact information" -d "company.com,business.org" -m 10 -O results.json -``` - -**Command-line options:** -``` --o, --objective What information to look for [REQUIRED] --d, --domains Comma-separated domains to crawl [REQUIRED] --m, --max-urls Maximum URLs per domain [default: 10] - --delay Delay between requests [default: 1000] --O, --output Save results to JSON file --v, --verbose Enable detailed logging +git clone https://github.com/pixlie/SmartCrawler.git +cd SmartCrawler +cargo build --release ``` -## πŸ“ Examples +### Running Tests -### E-commerce Price Research ```bash -smart-crawler \ - --objective "Find product pricing, discounts, and shipping costs" \ - --domains "shop1.com,shop2.com,competitor.com" \ - --max-urls 15 \ - --output pricing-research.json +cargo test ``` -### Company Information Gathering -```bash -smart-crawler \ - -o "Find company contact information, team members, and office locations" \ - -d "company.com" \ - -m 8 \ - --delay 2000 \ - -v -``` - -### Technical Documentation Search -```bash -smart-crawler \ - --objective "Find API documentation, integration guides, and developer resources" \ - --domains "docs.example.com,api.service.com" \ - --max-urls 20 \ - --output api-docs.json -``` +### Development Commands -### News and Content Analysis ```bash -smart-crawler \ - -o "Find recent news articles about artificial intelligence and machine learning" \ - -d "news-site.com,tech-blog.com" \ - -m 12 \ - --delay 1500 -``` - -## βš™οΈ Configuration - -### Environment Variables +# Format code +cargo fmt -- `ANTHROPIC_API_KEY`: Your Claude API key (required) -- `RUST_LOG`: Set logging level (`debug`, `info`, `warn`, `error`) +# Run linter +cargo clippy -### Best Practices - -- **Specific objectives**: Use clear, specific objectives for better URL selection -- **Conservative limits**: Start with lower `--max-urls` values (5-10) to test -- **Respectful delays**: Use delays of 1000ms or more to avoid overwhelming servers -- **Check robots.txt**: Review target sites' crawling policies -- **Monitor API usage**: Claude API has usage limits and costs - -## πŸ”§ Troubleshooting - -### Common Issues +# Debug mode +RUST_LOG=debug cargo run -- --link "https://example.com" +``` -**"WebDriver connection failed"** -- Ensure WebDriver (geckodriver/chromedriver) is running on port 4444 -- Check that the browser is installed -- Try restarting the WebDriver +## 🀝 Contributing -**"ANTHROPIC_API_KEY not found"** -- Set the environment variable in your terminal session -- Verify the API key is correct and has sufficient credits +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Commit your changes (`git commit -m 'Add amazing feature'`) +4. Push to the branch (`git push origin feature/amazing-feature`) +5. Open a Pull Request -**"Permission denied" (macOS/Linux)** -- Make the binary executable: `chmod +x smart-crawler` -- For system-wide installation, use `sudo` when moving to `/usr/local/bin/` +## πŸ“„ License -**"No URLs selected by LLM"** -- Try a more specific objective -- Increase `--max-urls` limit -- Use `--verbose` to see what URLs were found +This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. -**Rate limiting/blocked requests** -- Increase `--delay` between requests -- Some sites block crawlers; respect robots.txt -- Consider using fewer concurrent requests +## πŸ”— Links -### Getting Help +- [GitHub Repository](https://github.com/pixlie/SmartCrawler) +- [Issue Tracker](https://github.com/pixlie/SmartCrawler/issues) +- [Releases](https://github.com/pixlie/SmartCrawler/releases) +- [Documentation](docs/) -- Check the [Issues page](https://github.com/brainless/SmartCrawler/issues) for known problems -- Create a new issue with detailed error messages and system information -- Include your command and any error output when reporting problems +## πŸ†˜ Support -## πŸ“„ License +If you encounter issues: -GPL-3.0 License - see [LICENSE](LICENSE) file for details. +1. Check the [getting started guides](docs/) for your operating system +2. Review the [CLI options documentation](docs/cli-options.md) +3. Search existing [GitHub issues](https://github.com/pixlie/SmartCrawler/issues) +4. Create a new issue with detailed error information --- -**Note**: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' terms of service and robots.txt files. Be mindful of rate limits and server resources when crawling. \ No newline at end of file +**Note**: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service. \ No newline at end of file diff --git a/docs/cli-options.md b/docs/cli-options.md new file mode 100644 index 0000000..6a8210d --- /dev/null +++ b/docs/cli-options.md @@ -0,0 +1,115 @@ +# CLI Options + +SmartCrawler provides a simple command-line interface to crawl web pages and extract structured HTML content. + +## Basic Usage + +```bash +smart-crawler --link [OPTIONS] +``` + +## Required Arguments + +### `--link ` +- **Description**: URL to crawl (can be specified multiple times) +- **Required**: Yes +- **Multiple values**: Yes + +**Examples:** +```bash +# Crawl a single URL +smart-crawler --link "https://example.com" + +# Crawl multiple URLs +smart-crawler --link "https://example.com" --link "https://another.com" +``` + +## Optional Arguments + +### `--verbose` +- **Description**: Enable verbose output showing filtered HTML node tree +- **Type**: Flag (no value required) +- **Default**: Disabled + +When enabled, this option shows the complete HTML tree structure with duplicate node filtering applied. + +**Example:** +```bash +smart-crawler --link "https://example.com" --verbose +``` + +### `--template` +- **Description**: Enable template detection mode to identify patterns like '{count} comments' in HTML content +- **Type**: Flag (no value required) +- **Default**: Disabled + +When enabled, this option: +- Detects variable patterns in text content (e.g., "42 comments" becomes "{count} comments") +- Skips domain-wide duplicate filtering to show template patterns clearly +- Useful for identifying common content patterns across pages + +**Example:** +```bash +smart-crawler --link "https://example.com" --template --verbose +``` + +### `-h, --help` +- **Description**: Print help information +- **Type**: Flag + +### `-V, --version` +- **Description**: Print version information +- **Type**: Flag + +## Advanced Usage Examples + +### Crawl with Template Detection +```bash +smart-crawler --link "https://news-site.com" --template --verbose +``` + +### Crawl Multiple Domains +```bash +smart-crawler \ + --link "https://example.com" \ + --link "https://another.com" \ + --link "https://third-site.org" \ + --verbose +``` + +### Debug Mode Output +```bash +# Set log level for detailed debugging +RUST_LOG=debug smart-crawler --link "https://example.com" --verbose +``` + +## Output Format + +SmartCrawler outputs crawling results in a structured format: + +``` +=== Crawling Results === +URL: https://example.com +Title: Example Domain +Domain: example.com +--- +``` + +With `--verbose` enabled, it also shows the HTML tree structure with duplicate filtering applied. + +With `--template` enabled, it shows template patterns instead of actual values and skips duplicate filtering. + +## Exit Codes + +- `0`: Success +- `1`: Error (invalid arguments, connection failure, etc.) + +## Environment Variables + +- `RUST_LOG`: Set logging level (`debug`, `info`, `warn`, `error`) + +## Notes + +- URLs are automatically validated and deduplicated +- SmartCrawler requires a WebDriver server running on port 4444 +- See the [Getting Started guides](README.md#getting-started) for WebDriver setup instructions \ No newline at end of file diff --git a/docs/getting-started-linux.md b/docs/getting-started-linux.md new file mode 100644 index 0000000..2b11ec4 --- /dev/null +++ b/docs/getting-started-linux.md @@ -0,0 +1,272 @@ +# Getting Started on Linux + +This guide will help you set up SmartCrawler on Linux systems. + +## Prerequisites + +- Linux distribution (Ubuntu, Debian, CentOS, Fedora, etc.) +- Root/sudo access for installation +- Internet connection for downloads + +## Step 1: Install SmartCrawler + +### Option A: Download Pre-built Binary (Recommended) + +1. Go to the [SmartCrawler releases page](https://github.com/pixlie/SmartCrawler/releases) +2. Download `smart-crawler-linux-x64.tar.gz` +3. Extract and install: + ```bash + # Extract the downloaded file + tar -xzf smart-crawler-linux-x64.tar.gz + + # Move to a directory in your PATH + sudo mv smart-crawler /usr/local/bin/ + + # Make it executable + chmod +x /usr/local/bin/smart-crawler + ``` + +### Option B: Package Installers + +#### Ubuntu/Debian (DEB Package) +```bash +# Download and install the DEB package +wget https://github.com/pixlie/SmartCrawler/releases/latest/download/smart-crawler-[version].deb +sudo dpkg -i smart-crawler-[version].deb + +# Install dependencies if needed +sudo apt-get install -f +``` + +#### RHEL/CentOS/Fedora (RPM Package) +```bash +# Download and install the RPM package +wget https://github.com/pixlie/SmartCrawler/releases/latest/download/smart-crawler-[version].rpm +sudo rpm -i smart-crawler-[version].rpm + +# Or with dnf/yum +sudo dnf install smart-crawler-[version].rpm +``` + +### Option C: Build from Source + +If you have Rust installed: +```bash +git clone https://github.com/pixlie/SmartCrawler.git +cd SmartCrawler +cargo build --release +# Binary will be in target/release/smart-crawler +``` + +## Step 2: Set Up WebDriver + +SmartCrawler requires a WebDriver server to control a browser. Choose one: + +### Option A: Firefox with GeckoDriver (Recommended) + +#### Ubuntu/Debian +```bash +# Install Firefox +sudo apt update +sudo apt install firefox + +# Install GeckoDriver +wget https://github.com/mozilla/geckodriver/releases/latest/download/geckodriver-v0.33.0-linux64.tar.gz +tar -xzf geckodriver-v0.33.0-linux64.tar.gz +sudo mv geckodriver /usr/local/bin/ +chmod +x /usr/local/bin/geckodriver +``` + +#### CentOS/RHEL/Fedora +```bash +# Install Firefox +sudo dnf install firefox + +# Install GeckoDriver +wget https://github.com/mozilla/geckodriver/releases/latest/download/geckodriver-v0.33.0-linux64.tar.gz +tar -xzf geckodriver-v0.33.0-linux64.tar.gz +sudo mv geckodriver /usr/local/bin/ +chmod +x /usr/local/bin/geckodriver +``` + +#### Arch Linux +```bash +# Install Firefox and GeckoDriver +sudo pacman -S firefox geckodriver +``` + +### Option B: Chrome with ChromeDriver + +#### Ubuntu/Debian +```bash +# Install Chrome +wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo apt-key add - +echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list +sudo apt update +sudo apt install google-chrome-stable + +# Install ChromeDriver +# First check Chrome version +google-chrome --version + +# Download matching ChromeDriver version +CHROME_VERSION=$(google-chrome --version | cut -d' ' -f3 | cut -d'.' -f1-3) +wget https://chromedriver.storage.googleapis.com/LATEST_RELEASE_${CHROME_VERSION} +CHROMEDRIVER_VERSION=$(cat LATEST_RELEASE_${CHROME_VERSION}) +wget https://chromedriver.storage.googleapis.com/${CHROMEDRIVER_VERSION}/chromedriver_linux64.zip +unzip chromedriver_linux64.zip +sudo mv chromedriver /usr/local/bin/ +chmod +x /usr/local/bin/chromedriver +``` + +#### CentOS/RHEL/Fedora +```bash +# Install Chrome +sudo dnf install google-chrome-stable + +# Install ChromeDriver (follow similar steps as Ubuntu above) +``` + +## Step 3: Test Your Setup + +1. **Open a terminal** +2. **Start WebDriver** (choose one): + ```bash + # For Firefox (GeckoDriver) + geckodriver --port 4444 + + # For Chrome (ChromeDriver) + chromedriver --port=4444 + ``` +3. **Open a new terminal** +4. **Test SmartCrawler**: + ```bash + smart-crawler --link "https://example.com" + ``` + +## Step 4: Run Your First Crawl + +```bash +# Basic crawl +smart-crawler --link "https://example.com" + +# Crawl with verbose output +smart-crawler --link "https://example.com" --verbose + +# Crawl with template detection +smart-crawler --link "https://example.com" --template --verbose + +# Crawl multiple sites +smart-crawler --link "https://example.com" --link "https://another.com" +``` + +## Troubleshooting + +### "WebDriver connection failed" +- Ensure WebDriver is running on port 4444 +- Check that the browser is installed +- Try restarting the WebDriver +- Verify no other application is using port 4444 + +### "Permission denied" errors +- Make sure the binary is executable: + ```bash + chmod +x /usr/local/bin/smart-crawler + ``` +- Check that `/usr/local/bin` is in your PATH: + ```bash + echo $PATH + ``` + +### "smart-crawler: command not found" +- If you didn't install to `/usr/local/bin`, add the location to your PATH: + ```bash + export PATH=$PATH:/path/to/smart-crawler + ``` +- Or run with the full path: + ```bash + /path/to/smart-crawler --link "https://example.com" + ``` + +### Browser/WebDriver issues +- Check if browser is installed correctly: + ```bash + firefox --version + google-chrome --version + ``` +- Verify WebDriver is accessible: + ```bash + geckodriver --version + chromedriver --version + ``` + +### Port already in use +- Kill any existing WebDriver processes: + ```bash + pkill geckodriver + pkill chromedriver + ``` +- Check what's using port 4444: + ```bash + sudo netstat -tlnp | grep 4444 + ``` + +### Missing dependencies +- Install missing libraries: + ```bash + # Ubuntu/Debian + sudo apt install libssl-dev pkg-config + + # CentOS/RHEL/Fedora + sudo dnf install openssl-devel pkgconfig + ``` + +## Distribution-Specific Notes + +### Ubuntu/Debian +- Use `apt` for package management +- Firefox ESR is available via `firefox-esr` package +- Chrome installation requires adding Google's repository + +### CentOS/RHEL/Fedora +- Use `dnf` or `yum` for package management +- EPEL repository may be needed for some packages +- Chrome is available through Google's repository + +### Arch Linux +- Use `pacman` for package management +- Both Firefox and GeckoDriver are available in official repositories +- Chrome is available in AUR as `google-chrome` + +### Alpine Linux +- Use `apk` for package management +- May need additional setup for glibc compatibility + +## Next Steps + +- Read the [CLI Options documentation](cli-options.md) for advanced usage +- Learn more about template detection for content pattern analysis +- Explore verbose mode for detailed HTML tree analysis + +## Getting Help + +If you encounter issues: + +1. Check the [troubleshooting section](#troubleshooting) above +2. Visit the [GitHub Issues page](https://github.com/pixlie/SmartCrawler/issues) +3. Search for existing solutions or create a new issue +4. Include your Linux distribution, browser version, and error messages + +## Additional Resources + +- [Firefox Download](https://www.firefox.com/) +- [Chrome Download](https://www.google.com/chrome/) +- [GeckoDriver Releases](https://github.com/mozilla/geckodriver/releases) +- [ChromeDriver Downloads](https://googlechromelabs.github.io/chrome-for-testing/) +- [Rust Installation](https://rustup.rs/) (if building from source) + +### Distribution-Specific Resources +- [Ubuntu Packages](https://packages.ubuntu.com/) +- [Debian Packages](https://packages.debian.org/) +- [Fedora Packages](https://packages.fedoraproject.org/) +- [Arch Linux AUR](https://aur.archlinux.org/) \ No newline at end of file diff --git a/docs/getting-started-macos.md b/docs/getting-started-macos.md new file mode 100644 index 0000000..1c04650 --- /dev/null +++ b/docs/getting-started-macos.md @@ -0,0 +1,228 @@ +# Getting Started on macOS + +This guide will help you set up SmartCrawler on macOS systems. + +## Prerequisites + +- macOS 10.15 (Catalina) or later +- Administrator access for installation +- Internet connection for downloads + +## Step 1: Install SmartCrawler + +### Option A: Download Pre-built Binary (Recommended) + +1. Go to the [SmartCrawler releases page](https://github.com/pixlie/SmartCrawler/releases) +2. Download the appropriate binary for your Mac: + - `smart-crawler-macos-x64.tar.gz` for Intel Macs + - `smart-crawler-macos-arm64.tar.gz` for Apple Silicon Macs (M1/M2/M3) +3. Extract and install: + ```bash + # Extract the downloaded file + tar -xzf smart-crawler-macos-*.tar.gz + + # Move to a directory in your PATH + sudo mv smart-crawler /usr/local/bin/ + + # Make it executable + chmod +x /usr/local/bin/smart-crawler + ``` + +### Option B: DMG Package + +1. Download `smart-crawler-[version].dmg` from the releases page +2. Open the DMG file +3. Copy `smart-crawler` to `/usr/local/bin/` or your preferred location +4. Make it executable: + ```bash + chmod +x /usr/local/bin/smart-crawler + ``` + +### Option C: Build from Source + +If you have Rust installed: +```bash +git clone https://github.com/pixlie/SmartCrawler.git +cd SmartCrawler +cargo build --release +# Binary will be in target/release/smart-crawler +``` + +## Step 2: Set Up WebDriver + +SmartCrawler requires a WebDriver server to control a browser. Choose one: + +### Option A: Firefox with GeckoDriver (Recommended) + +1. **Install Firefox** (if not already installed): + ```bash + # Using Homebrew (recommended) + brew install firefox + + # Or download from firefox.com + ``` + +2. **Install GeckoDriver**: + ```bash + # Using Homebrew (recommended) + brew install geckodriver + + # Or download manually from GitHub releases + wget https://github.com/mozilla/geckodriver/releases/latest/download/geckodriver-v0.33.0-macos.tar.gz + tar -xzf geckodriver-v0.33.0-macos.tar.gz + sudo mv geckodriver /usr/local/bin/ + ``` + +### Option B: Chrome with ChromeDriver + +1. **Install Chrome** (if not already installed): + ```bash + # Using Homebrew + brew install google-chrome + + # Or download from chrome.com + ``` + +2. **Install ChromeDriver**: + ```bash + # Using Homebrew + brew install chromedriver + + # Or download manually - check your Chrome version first + google-chrome --version + # Then download matching version from Chrome for Testing + ``` + +## Step 3: Handle macOS Security + +macOS may block unsigned executables. If you get a security warning: + +1. **Allow the binary to run**: + ```bash + # Remove quarantine attribute + xattr -d com.apple.quarantine /usr/local/bin/smart-crawler + + # Or allow in System Preferences + # System Preferences > Security & Privacy > General > Allow anyway + ``` + +2. **For WebDriver binaries**: + ```bash + # If you downloaded manually + xattr -d com.apple.quarantine /usr/local/bin/geckodriver + xattr -d com.apple.quarantine /usr/local/bin/chromedriver + ``` + +## Step 4: Test Your Setup + +1. **Open Terminal** +2. **Start WebDriver** (choose one): + ```bash + # For Firefox (GeckoDriver) + geckodriver --port 4444 + + # For Chrome (ChromeDriver) + chromedriver --port=4444 + ``` +3. **Open a new Terminal window** +4. **Test SmartCrawler**: + ```bash + smart-crawler --link "https://example.com" + ``` + +## Step 5: Run Your First Crawl + +```bash +# Basic crawl +smart-crawler --link "https://example.com" + +# Crawl with verbose output +smart-crawler --link "https://example.com" --verbose + +# Crawl with template detection +smart-crawler --link "https://example.com" --template --verbose + +# Crawl multiple sites +smart-crawler --link "https://example.com" --link "https://another.com" +``` + +## Troubleshooting + +### "WebDriver connection failed" +- Ensure WebDriver is running on port 4444 +- Check that the browser is installed +- Try restarting the WebDriver +- Verify no other application is using port 4444 + +### "Permission denied" errors +- Make sure the binary is executable: + ```bash + chmod +x /usr/local/bin/smart-crawler + ``` +- Check that `/usr/local/bin` is in your PATH: + ```bash + echo $PATH + ``` + +### "smart-crawler: command not found" +- If you didn't install to `/usr/local/bin`, add the location to your PATH: + ```bash + export PATH=$PATH:/path/to/smart-crawler + ``` +- Or run with the full path: + ```bash + /path/to/smart-crawler --link "https://example.com" + ``` + +### macOS Security Warnings +- Remove quarantine attributes: + ```bash + xattr -d com.apple.quarantine /usr/local/bin/smart-crawler + ``` +- Or go to System Preferences > Security & Privacy > General and click "Allow anyway" + +### Port already in use +- Kill any existing WebDriver processes: + ```bash + pkill geckodriver + pkill chromedriver + ``` + +## Advanced Setup with Homebrew + +If you use Homebrew, you can install everything in one go: + +```bash +# Install browsers and WebDriver +brew install firefox geckodriver + +# Or for Chrome +brew install google-chrome chromedriver + +# Download SmartCrawler binary and install +# (Follow Option A above for binary installation) +``` + +## Next Steps + +- Read the [CLI Options documentation](cli-options.md) for advanced usage +- Learn more about template detection for content pattern analysis +- Explore verbose mode for detailed HTML tree analysis + +## Getting Help + +If you encounter issues: + +1. Check the [troubleshooting section](#troubleshooting) above +2. Visit the [GitHub Issues page](https://github.com/pixlie/SmartCrawler/issues) +3. Search for existing solutions or create a new issue +4. Include your macOS version, browser version, and error messages + +## Additional Resources + +- [Homebrew](https://brew.sh/) - Package manager for macOS +- [Firefox Download](https://www.firefox.com/) +- [Chrome Download](https://www.google.com/chrome/) +- [GeckoDriver Releases](https://github.com/mozilla/geckodriver/releases) +- [ChromeDriver Downloads](https://googlechromelabs.github.io/chrome-for-testing/) +- [Rust Installation](https://rustup.rs/) (if building from source) \ No newline at end of file diff --git a/docs/getting-started-windows.md b/docs/getting-started-windows.md new file mode 100644 index 0000000..b1fd000 --- /dev/null +++ b/docs/getting-started-windows.md @@ -0,0 +1,152 @@ +# Getting Started on Windows + +This guide will help you set up SmartCrawler on Windows systems. + +## Prerequisites + +- Windows 10 or later +- Administrator access for installation +- Internet connection for downloads + +## Step 1: Install SmartCrawler + +### Option A: Download Pre-built Binary (Recommended) + +1. Go to the [SmartCrawler releases page](https://github.com/pixlie/SmartCrawler/releases) +2. Download the latest `smart-crawler-windows-x64.zip` +3. Extract the ZIP file to a folder (e.g., `C:\SmartCrawler\`) +4. **Optional**: Add the folder to your system PATH: + - Press `Win + X` and select "System" + - Click "Advanced system settings" + - Click "Environment Variables" + - Under "System variables", find and select "Path", then click "Edit" + - Click "New" and add your SmartCrawler folder path + - Click "OK" to save + +### Option B: MSI Installer + +1. Download `smart-crawler-[version].msi` from the releases page +2. Double-click the MSI file to install +3. Follow the installation wizard +4. SmartCrawler will be automatically added to your PATH + +### Option C: Build from Source + +If you have Rust installed: +```powershell +git clone https://github.com/pixlie/SmartCrawler.git +cd SmartCrawler +cargo build --release +# Binary will be in target\release\smart-crawler.exe +``` + +## Step 2: Set Up WebDriver + +SmartCrawler requires a WebDriver server to control a browser. Choose one: + +### Option A: Firefox with GeckoDriver (Recommended) + +1. **Install Firefox** (if not already installed): + - Download from [firefox.com](https://www.firefox.com/) + - Run the installer + +2. **Install GeckoDriver**: + - Download `geckodriver.exe` from [Mozilla GeckoDriver releases](https://github.com/mozilla/geckodriver/releases) + - Extract the file and place it in: + - The same folder as `smart-crawler.exe`, OR + - A folder in your system PATH (e.g., `C:\Windows\System32`) + +### Option B: Chrome with ChromeDriver + +1. **Install Chrome** (if not already installed): + - Download from [chrome.com](https://www.google.com/chrome/) + - Run the installer + +2. **Install ChromeDriver**: + - Check your Chrome version: Go to `chrome://version/` in Chrome + - Download the matching ChromeDriver from [Chrome for Testing](https://googlechromelabs.github.io/chrome-for-testing/) + - Extract `chromedriver.exe` and place it in: + - The same folder as `smart-crawler.exe`, OR + - A folder in your system PATH + +## Step 3: Test Your Setup + +1. **Open Command Prompt or PowerShell** +2. **Start WebDriver** (choose one): + ```cmd + # For Firefox (GeckoDriver) + geckodriver.exe --port 4444 + + # For Chrome (ChromeDriver) + chromedriver.exe --port=4444 + ``` +3. **Open a new Command Prompt/PowerShell window** +4. **Test SmartCrawler**: + ```cmd + smart-crawler --link "https://example.com" + ``` + +## Step 4: Run Your First Crawl + +```cmd +# Basic crawl +smart-crawler --link "https://example.com" + +# Crawl with verbose output +smart-crawler --link "https://example.com" --verbose + +# Crawl with template detection +smart-crawler --link "https://example.com" --template --verbose + +# Crawl multiple sites +smart-crawler --link "https://example.com" --link "https://another.com" +``` + +## Troubleshooting + +### "WebDriver connection failed" +- Ensure WebDriver is running on port 4444 +- Check that the browser is installed +- Try restarting the WebDriver +- Verify no other application is using port 4444 + +### "'smart-crawler' is not recognized" +- If you didn't add SmartCrawler to PATH, run it with the full path: + ```cmd + C:\SmartCrawler\smart-crawler.exe --link "https://example.com" + ``` +- Or add the folder to your PATH (see Step 1) + +### "Permission denied" errors +- Run Command Prompt as Administrator +- Check that the executable has proper permissions + +### Port already in use +- Kill any existing WebDriver processes: + ```cmd + taskkill /F /IM geckodriver.exe + taskkill /F /IM chromedriver.exe + ``` + +## Next Steps + +- Read the [CLI Options documentation](cli-options.md) for advanced usage +- Learn more about template detection for content pattern analysis +- Explore verbose mode for detailed HTML tree analysis + +## Getting Help + +If you encounter issues: + +1. Check the [troubleshooting section](#troubleshooting) above +2. Visit the [GitHub Issues page](https://github.com/pixlie/SmartCrawler/issues) +3. Search for existing solutions or create a new issue +4. Include your Windows version, browser version, and error messages + +## Additional Resources + +- [Firefox Download](https://www.firefox.com/) +- [Chrome Download](https://www.google.com/chrome/) +- [GeckoDriver Releases](https://github.com/mozilla/geckodriver/releases) +- [ChromeDriver Downloads](https://googlechromelabs.github.io/chrome-for-testing/) +- [Rust Installation](https://rustup.rs/) (if building from source) \ No newline at end of file From bada5212e702f6f6a894b076b239c1447f2bb1f7 Mon Sep 17 00:00:00 2001 From: Sumit Datta Date: Wed, 9 Jul 2025 08:50:19 +0530 Subject: [PATCH 2/2] docs: move development and contributing info to separate guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Create comprehensive development.md with setup, building, testing, and contributing instructions - Remove development sections from README to keep it minimal - Add link to development guide in README documentation section - Include architecture overview, debugging tips, and performance guidance - Cover security considerations and future improvement areas πŸ€– Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude --- README.md | 45 ++---- docs/development.md | 329 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 338 insertions(+), 36 deletions(-) create mode 100644 docs/development.md diff --git a/README.md b/README.md index d8b3612..40dc198 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,10 @@ Choose your operating system for detailed setup instructions: - **[CLI Options](docs/cli-options.md)** - Complete command-line reference and examples +### Development + +- **[Development Guide](docs/development.md)** - Setup, building, testing, and contributing instructions + ## πŸ”§ System Requirements - **Operating System**: Windows 10+, macOS 10.15+, or Linux @@ -73,50 +77,19 @@ smart-crawler --link "https://site1.com" --link "https://site2.com" # Start Firefox WebDriver geckodriver --port 4444 -# Start Chrome WebDriver +# Start Chrome WebDriver chromedriver --port=4444 ``` ## πŸ› οΈ Development -### Building from Source - -```bash -git clone https://github.com/pixlie/SmartCrawler.git -cd SmartCrawler -cargo build --release -``` - -### Running Tests - -```bash -cargo test -``` - -### Development Commands - -```bash -# Format code -cargo fmt - -# Run linter -cargo clippy - -# Debug mode -RUST_LOG=debug cargo run -- --link "https://example.com" -``` - -## 🀝 Contributing +For developers interested in contributing to SmartCrawler or building from source: -1. Fork the repository -2. Create a feature branch (`git checkout -b feature/amazing-feature`) -3. Commit your changes (`git commit -m 'Add amazing feature'`) -4. Push to the branch (`git push origin feature/amazing-feature`) -5. Open a Pull Request +- **[Development Guide](docs/development.md)** - Complete setup, building, testing, and contributing instructions ## πŸ“„ License -This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. +This project is licensed under the GPL-3.0 license - see the [LICENSE](LICENSE) file for details. ## πŸ”— Links @@ -136,4 +109,4 @@ If you encounter issues: --- -**Note**: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service. \ No newline at end of file +**Note**: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service. diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 0000000..ba684ff --- /dev/null +++ b/docs/development.md @@ -0,0 +1,329 @@ +# Development Guide + +This guide covers how to set up a development environment, build from source, contribute to the project, and understand the codebase. + +## πŸ› οΈ Development Environment Setup + +### Prerequisites + +- **Rust**: Install from [rustup.rs](https://rustup.rs/) +- **Git**: For version control +- **WebDriver**: Firefox (GeckoDriver) or Chrome (ChromeDriver) for testing + +### Building from Source + +```bash +# Clone the repository +git clone https://github.com/pixlie/SmartCrawler.git +cd SmartCrawler + +# Build in release mode +cargo build --release + +# The binary will be in target/release/smart-crawler +``` + +### Development Build + +```bash +# Build in development mode (faster compilation, includes debug info) +cargo build + +# Run directly with cargo +cargo run -- --link "https://example.com" +``` + +## πŸ§ͺ Testing + +### Running Tests + +```bash +# Run all tests +cargo test + +# Run tests with output +cargo test -- --nocapture + +# Run specific test +cargo test test_name + +# Run tests in a specific module +cargo test html_parser::tests +``` + +### Test Structure + +The project includes several types of tests: + +- **Unit tests**: Located in each module (`src/` files) +- **Integration tests**: In the `tests/` directory +- **Real-world tests**: For testing against actual websites (normally ignored) + +### Running Real-World Tests + +```bash +# Run real-world tests (requires WebDriver) +cargo test --test real_world_tests -- --ignored +``` + +## πŸ”§ Development Commands + +### Code Quality + +```bash +# Format code +cargo fmt + +# Check formatting +cargo fmt --check + +# Run linter +cargo clippy + +# Run linter with warnings as errors +cargo clippy -- -D warnings +``` + +### Debug Mode + +```bash +# Run with debug logging +RUST_LOG=debug cargo run -- --link "https://example.com" + +# Run with specific module logging +RUST_LOG=smart_crawler::html_parser=debug cargo run -- --link "https://example.com" +``` + +### Profiling and Performance + +```bash +# Build with profiling +cargo build --release --features profiling + +# Run with timing information +RUST_LOG=info cargo run --release -- --link "https://example.com" +``` + +## πŸ“ Project Structure + +``` +SmartCrawler/ +β”œβ”€β”€ src/ +β”‚ β”œβ”€β”€ main.rs # Main application entry point +β”‚ β”œβ”€β”€ lib.rs # Library exports +β”‚ β”œβ”€β”€ browser.rs # WebDriver browser integration +β”‚ β”œβ”€β”€ cli.rs # Command-line argument parsing +β”‚ β”œβ”€β”€ html_parser.rs # HTML parsing and tree building +β”‚ β”œβ”€β”€ storage.rs # URL storage and duplicate detection +β”‚ β”œβ”€β”€ template_detection.rs # Template pattern detection +β”‚ └── utils.rs # Utility functions +β”œβ”€β”€ tests/ +β”‚ └── real_world_tests.rs # Integration tests +β”œβ”€β”€ docs/ # Documentation +β”œβ”€β”€ CLAUDE.md # Development workflow guide +└── Cargo.toml # Project dependencies +``` + +## πŸ—οΈ Architecture Overview + +### Core Components + +1. **CLI Interface** (`cli.rs`): Parses command-line arguments +2. **Browser Integration** (`browser.rs`): WebDriver integration for page loading +3. **HTML Parser** (`html_parser.rs`): Parses HTML into structured tree +4. **Storage System** (`storage.rs`): Manages URLs and duplicate detection +5. **Template Detection** (`template_detection.rs`): Identifies content patterns +6. **Utilities** (`utils.rs`): Common helper functions + +### Data Flow + +``` +CLI Arguments β†’ Browser β†’ HTML Source β†’ HTML Parser β†’ Storage β†’ Duplicate Analysis β†’ Output + ↓ + Template Detection +``` + +### Key Features + +- **Domain-level duplicate detection**: Identifies similar content across pages +- **Template pattern recognition**: Converts variable content to template patterns +- **HTML tree filtering**: Shows structured view with duplicate marking +- **Multi-URL crawling**: Processes multiple URLs with smart discovery + +## 🀝 Contributing + +### Getting Started + +1. **Fork the repository** on GitHub +2. **Clone your fork**: + ```bash + git clone https://github.com/your-username/SmartCrawler.git + cd SmartCrawler + ``` +3. **Create a feature branch**: + ```bash + git checkout -b feature/your-feature-name + ``` + +### Development Workflow + +Follow the workflow described in `CLAUDE.md`: + +1. **Create a new branch** for each feature/fix +2. **Add tests** for any new functionality +3. **Run formatters and linters** before committing +4. **Write descriptive commit messages** +5. **Push to your fork** and create a pull request + +### Code Style + +- Follow Rust standard formatting (`cargo fmt`) +- Address all clippy warnings (`cargo clippy`) +- Write comprehensive tests for new features +- Document public APIs with doc comments +- Use meaningful variable and function names + +### Commit Message Format + +``` +type: brief description + +- Detailed explanation of changes +- Reference any related issues +- Include any breaking changes + +πŸ€– Generated with [Claude Code](https://claude.ai/code) + +Co-Authored-By: Claude +``` + +### Pull Request Process + +1. **Ensure tests pass**: Run `cargo test` +2. **Check code quality**: Run `cargo fmt` and `cargo clippy` +3. **Update documentation**: Add or update relevant docs +4. **Describe your changes**: Write a clear PR description +5. **Link related issues**: Reference any GitHub issues + +### Code Review Guidelines + +- Be constructive and respectful +- Focus on code quality and maintainability +- Test the changes locally when possible +- Ask questions if something isn't clear + +## πŸ› Debugging + +### Common Issues + +**Build failures**: +- Ensure Rust is up to date: `rustup update` +- Check dependencies: `cargo update` + +**Test failures**: +- Ensure WebDriver is running for integration tests +- Check that target websites are accessible + +**WebDriver issues**: +- Verify WebDriver version matches browser version +- Check that port 4444 is available +- Ensure browser is installed and accessible + +### Debug Tools + +```bash +# Check dependencies +cargo tree + +# Update dependencies +cargo update + +# Clean build artifacts +cargo clean + +# Verbose build output +cargo build --verbose +``` + +## πŸ“š Learning Resources + +### Rust Resources + +- [The Rust Programming Language](https://doc.rust-lang.org/book/) +- [Rust by Example](https://doc.rust-lang.org/rust-by-example/) +- [Rust Standard Library](https://doc.rust-lang.org/std/) + +### WebDriver Resources + +- [WebDriver Specification](https://w3c.github.io/webdriver/) +- [GeckoDriver Documentation](https://firefox-source-docs.mozilla.org/testing/geckodriver/) +- [ChromeDriver Documentation](https://chromedriver.chromium.org/) + +### HTML Parsing + +- [scraper crate documentation](https://docs.rs/scraper/) +- [HTML5 Specification](https://html.spec.whatwg.org/) + +## πŸ”’ Security + +### Security Considerations + +- Never commit API keys or secrets +- Validate all user inputs +- Respect robots.txt and website terms of service +- Use HTTPS for all network requests where possible + +### Reporting Security Issues + +If you discover a security vulnerability: + +1. **Do NOT** create a public GitHub issue +2. Email the maintainers directly +3. Provide detailed information about the vulnerability +4. Allow time for a fix before public disclosure + +## πŸ“ˆ Performance + +### Optimization Tips + +- Use `--release` builds for performance testing +- Profile with `perf` or similar tools +- Monitor memory usage during development +- Test with various website sizes and structures + +### Benchmarking + +```bash +# Build optimized version +cargo build --release + +# Time execution +time target/release/smart-crawler --link "https://example.com" + +# Profile memory usage +valgrind target/release/smart-crawler --link "https://example.com" +``` + +## 🎯 Future Improvements + +Areas for contribution: + +- **Performance optimization**: Faster HTML parsing and duplicate detection +- **Additional output formats**: JSON, CSV, XML export options +- **Enhanced filtering**: More sophisticated duplicate detection algorithms +- **UI improvements**: Better progress reporting and error messages +- **Platform support**: Windows-specific optimizations +- **Documentation**: More examples and use cases + +## πŸ“ž Getting Help + +- **GitHub Issues**: For bug reports and feature requests +- **Discussions**: For general questions and ideas +- **Code Review**: For feedback on contributions +- **Documentation**: For clarification on usage + +Remember to search existing issues before creating new ones! + +--- + +This development guide is continuously updated. If you find any information missing or outdated, please contribute improvements! \ No newline at end of file