Quickcrawl

🚀 Web Scraping API for AI Agents — Scrape, crawl, and map websites with a single binary.

Playground: Try it online



⚡ Overview

Quickcrawl is a Go-based web scraping service built for AI agents. Whether you're scraping a single page, crawling an entire website, or mapping out link structures — Quickcrawl handles the heavy lifting with a multi-layered architecture that combines HTTP fetching, browser automation, and LLM-powered structured extraction.


✨ Features

Feature Description
🌐 Web Scraping Convert any URL to Markdown, HTML, Plain Text, or Links
🔄 Async Crawling BFS website crawler with depth/page limits and rate limiting
🗺️ URL Mapping Discover all URLs on a site instantly without scraping content
🧠 JavaScript Rendering Auto-detect SPAs and render via LightPanda or Chrome
📊 LLM Extraction Send a JSON schema, get validated structured data back
🔍 Web Search DuckDuckGo-powered search for AI agent integration
🤖 MCP Server Built-in stdio transport for seamless AI agent integration
📦 Multi-format Output Markdown, HTML, RawHTML, PlainText, Links, JSON

📊 Benchmark

85.4% scrape success rate across 1,000 diverse URLs from the Firecrawl Scrape Content Dataset v1

Tested against Firecrawl v2.5 on the same dataset:

Feature Quickcrawl Firecrawl
Coverage 85.4% 79.4%
Avg Scrape Latency 1,841.9ms 4,048ms
Self-hosting Single binary Multi-container (~4GB+)
Cost / 1K scrapes $0 (self-hosted) $9

Quality Metrics

Metric Value
Content Recall 44.03%
Noise Rejection 86.65%
Content Matches 376
Noise Leaks 114

Use Cases

Build RAG pipelines with clean LLM-ready markdown, give AI agents real-time web access, monitor content changes, extract structured data, convert HTML to clean markdown, or archive web pages at scale.


🤖 MCP Server

The Quickcrawl MCP server provides AI agents with web scraping capabilities:

Available Tools

Tool Description
quickcrawl_scrape Scrape a single URL
quickcrawl_crawl Start async website crawl
quickcrawl_check_crawl_status Check crawl job status
quickcrawl_cancel_crawl Cancel running crawl
quickcrawl_map Discover URLs on a site
quickcrawl_search Search DuckDuckGo

OpenCode Integration

Add Quickcrawl to your OpenCode configuration:

{
  "mcp": {
    "quickcrawl": {
      "type": "local",
      "command": ["npx", "-y", "@mabudalam/quickcrawl-mcp"],
      "enabled": true
    }
  }
}

Configuration

Renderer selection in MCP:

  • Omit renderer or set renderer: "auto" to use the configured fallback chain
  • Set renderer: "lightpanda" or "chrome" to hard-pin a single renderer
  • Pinned renderers imply renderJs: true unless renderJs: false is set explicitly

Example MCP tool arguments:

{
  "url": "https://www.notion.so/",
  "formats": ["markdown"],
  "renderJs": true,
  "renderer": "chrome",
  "onlyMainContent": true
}

💻 CLI

The Quickcrawl CLI provides standalone command-line access to all features. No server or Python needed.

Installation

# Quick install (recommended)
curl -fsSL https://raw.githubusercontent.com/MabudAlam/quickcrawl/main/install.sh | sh

# Or from source
go install github.com/MabudAlam/quickcrawl/cli@latest

# Download from GitHub releases
curl -L https://github.com/MabudAlam/QuickCrawl/releases/latest/download/quickcrawl_darwin_arm64.tar.gz | tar -xz
./quickcrawl --help

Usage

# Scrape a single URL
quickcrawl scrape https://example.com
quickcrawl scrape https://example.com --formats html,markdown

# Crawl a website
quickcrawl crawl https://example.com --max-pages 10 --max-depth 3

# Discover URLs (without scraping content)
quickcrawl map https://example.com --max-depth 2

# Search DuckDuckGo
quickcrawl search "golang web scraping"
quickcrawl search "python" --scrape --formats markdown

CLI Examples

See cli/cmd/ for all subcommands.


🐍 Python SDK

Python SDK for Quickcrawl — scrape, crawl, and map websites from Python code.

Installation

# From PyPI (coming soon)
pip install quickcrawl

# From GitHub
pip install git+https://github.com/MabudAlam/quickcrawl.git@python-sdk#subdirectory=python

# Or clone and install
git clone https://github.com/MabudAlam/quickcrawl
cd quickcrawl/python
pip install -e .

Quick Start

from quickcrawl import QuickCrawlClient

# CLI mode (zero config, auto-downloads binary)
with QuickCrawlClient() as client:
    result = client.scrape("https://example.com")
    print(result["markdown"])

# HTTP mode (connect to deployed server)
client = QuickCrawlClient(api_url="https://your-server.com", api_key="...")
result = client.scrape("https://example.com")

Python SDK Examples

See python/examples/ for usage examples.


πŸ—οΈ Architecture

graph TD
    Client["Client (HTTP / MCP)"]
    Server["Quickcrawl Server"]

    Client --> Server

    Server --> Router["Gin Router"]
    Router --> Handlers["API Handlers"]

    Handlers --> Renderer["Renderer Layer"]
    Renderer --> HTTPFetcher["HTTP Fetcher"]
    Renderer --> Browser["Browser (CDP)"]
    Browser --> LightPanda["LightPanda"]
    Browser --> Chrome["Chrome DevTools"]

    Renderer --> Extractor["Extractor"]
    Extractor --> Markdown["Markdown"]
    Extractor --> HTML["HTML"]
    Extractor --> PlainText["Plain Text"]
    Extractor --> Links["Links"]

    Handlers --> Crawler["Crawler"]
    Crawler --> Robots["Robots.txt"]
    Crawler --> Sitemap["Sitemap"]
    Crawler --> RateLimit["Rate Limiter"]

    Handlers --> Search["DuckDuckGo Search"]
    Search --> LLM["LLM Extraction"]
    LLM --> JSONSchema["JSON Schema Output"]

Renderer Fallback Chain

Request
    ↓
HTTP Fetcher (plain HTML, fastest)
    ↓ (if SPA detected or JS requested)
LightPanda (headless browser, fast)
    ↓ (if LightPanda unavailable)
Chrome DevTools (full browser, most compatible)
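
The same decision logic can be sketched in Go. This is an illustrative sketch only (the real implementation lives in internal/renderer); the renderer type and the stand-in functions below are hypothetical, not Quickcrawl's actual APIs:

package main

import (
    "context"
    "errors"
    "fmt"
)

// renderer is a stand-in for one layer of the chain (HTTP fetcher,
// LightPanda, or Chrome). Quickcrawl's real interfaces may differ.
type renderer func(ctx context.Context, url string) (string, error)

// renderWithFallback tries each renderer in order and returns the first
// successful result, mirroring the HTTP -> LightPanda -> Chrome chain.
func renderWithFallback(ctx context.Context, url string, chain []renderer) (string, error) {
    var lastErr error
    for _, r := range chain {
        html, err := r(ctx, url)
        if err == nil {
            return html, nil
        }
        lastErr = err
    }
    return "", fmt.Errorf("all renderers failed: %w", lastErr)
}

func main() {
    httpFetch := func(ctx context.Context, url string) (string, error) {
        // Pretend the page is a JS-heavy SPA so the chain falls through.
        return "", errors.New("SPA detected, JS rendering required")
    }
    lightpanda := func(ctx context.Context, url string) (string, error) {
        return "<html>rendered by LightPanda</html>", nil
    }
    chrome := func(ctx context.Context, url string) (string, error) {
        return "<html>rendered by Chrome DevTools</html>", nil
    }

    html, err := renderWithFallback(context.Background(), "https://example.com",
        []renderer{httpFetch, lightpanda, chrome})
    if err != nil {
        panic(err)
    }
    fmt.Println(html)
}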

🔌 API Endpoints

🌐 Scraping — /v1/scrape

Method Path Description
POST /v1/scrape Scrape a single URL with multiple output formats

Request:

{
  "url": "https://www.mabud.dev/",
  "formats": ["markdown"],
  "onlyMainContent": true,
  "renderJs": false,
  "topK": 5
}

Response:

{
  "success": true,
  "data": {
    "markdown": "# Hey, I'm Mabud.\n\nI build AI native systems...",
    "metadata": {
      "title": "Mabud Alam",
      "description": "Software Engineer",
      "ogpTitle": "Mabud Alam",
      "ogpDescription": "Software Engineer",
      "sourceURL": "https://www.mabud.dev/",
      "language": "en",
      "statusCode": 200,
      "renderedMode": "http",
      "timeTaken": 866
    }
  }
}
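
The request above can also be sent from any HTTP client. A minimal Go sketch, assuming the server is running locally on port 3000 with no API key configured:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Build the /v1/scrape request body shown above.
    body, _ := json.Marshal(map[string]any{
        "url":             "https://www.mabud.dev/",
        "formats":         []string{"markdown"},
        "onlyMainContent": true,
    })

    resp, err := http.Post("http://localhost:3000/v1/scrape", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Decode only the fields used here from the response envelope.
    var out struct {
        Success bool `json:"success"`
        Data    struct {
            Markdown string `json:"markdown"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        panic(err)
    }
    fmt.Println(out.Data.Markdown)
}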

🔄 Crawling — /v1/crawl

Method Path Description
POST /v1/crawl Start an async BFS crawl of a website
GET /v1/crawl/:id Check crawl status and retrieve results
DELETE /v1/crawl/:id Cancel a running crawl job

Start Crawl Request:

{
  "url": "https://www.mabud.dev/",
  "maxDepth": 2,
  "maxPages": 100,
  "formats": ["markdown"],
  "onlyMainContent": true,
  "renderJs": false
}

Check Status Request: GET /v1/crawl/:id

No body required.

Cancel Request: DELETE /v1/crawl/:id

No body required.
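
Because crawls run asynchronously, a client typically starts the job and then polls the status endpoint until it finishes. A hedged Go sketch of that loop; the id and status field names and the terminal state values are assumptions for illustration, not documented response fields:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

const base = "http://localhost:3000"

func main() {
    // Start the crawl.
    body, _ := json.Marshal(map[string]any{
        "url":      "https://www.mabud.dev/",
        "maxDepth": 2,
        "maxPages": 10,
        "formats":  []string{"markdown"},
    })
    resp, err := http.Post(base+"/v1/crawl", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    var started struct {
        ID string `json:"id"` // assumed field name for the job id
    }
    json.NewDecoder(resp.Body).Decode(&started)
    resp.Body.Close()

    // Poll GET /v1/crawl/:id until the job reaches a terminal state.
    for {
        st, err := http.Get(base + "/v1/crawl/" + started.ID)
        if err != nil {
            panic(err)
        }
        var status struct {
            Status string `json:"status"` // assumed field name
        }
        json.NewDecoder(st.Body).Decode(&status)
        st.Body.Close()

        fmt.Println("crawl status:", status.Status)
        // "completed", "failed", "cancelled" are assumed terminal values.
        if status.Status == "completed" || status.Status == "failed" || status.Status == "cancelled" {
            break
        }
        time.Sleep(2 * time.Second)
    }
}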

πŸ—ΊοΈ Mapping β€” /v1/map

Method Path Description
POST /v1/map Discover all URLs on a site instantly

Request:

{
  "url": "https://www.mabud.dev/",
  "maxDepth": 2,
  "useSitemap": true,
  "timeout": 30000
}

Response:

{
  "success": true,
  "data": {
    "links": [
      "https://www.mabud.dev/blog",
      "https://www.mabud.dev/projects",
      "https://www.mabud.dev/resume"
    ]
  }
}

πŸ” Search β€” /v1/search

Method Path Description
POST /v1/search Search DuckDuckGo and scrape results in parallel
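
The search request body is not documented above. A hypothetical Go call is sketched below, with field names guessed from the CLI flags (query, --scrape, --formats); treat every field as an assumption and check the handler in internal/api before relying on it:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Hypothetical /v1/search body; field names mirror the CLI flags
    // and are NOT confirmed by the documentation above.
    body, _ := json.Marshal(map[string]any{
        "query":   "golang web scraping",
        "scrape":  true,
        "formats": []string{"markdown"},
    })
    resp, err := http.Post("http://localhost:3000/v1/search", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Dump the raw response for inspection.
    raw, _ := io.ReadAll(resp.Body)
    fmt.Println(string(raw))
}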

πŸ₯ Health β€” /health

Method Path Description
GET /health Health check with browser availability and active job count

🧠 LLM Extraction

Quickcrawl supports JSON Schema-based extraction for structured data:

{
  "url": "https://news.example.com",
  "formats": ["markdown", "json"],
  "onlyMainContent": true,
  "extract": {
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "author": { "type": "string" },
        "publishDate": { "type": "string" }
      },
      "required": ["title", "author"]
    },
    "prompt": "Extract article title, author, and publish date",
    "responseFormat": "article"
  },
  "chunkStrategy": { "type": "sentence" },
  "query": "article title and author",
  "filterMode": "bm25",
  "topK": 5
}

Chunking Strategies

Strategy Description
sentence Split by sentence boundaries
paragraph Split by paragraph boundaries
regex Split by custom regex pattern
topic Split by topic changes

Filter Modes

Mode Description
bm25 BM25 algorithm for relevance scoring
rrf Reciprocal Rank Fusion
hybrid Combine bm25 and rrf

βš™οΈ Configuration

Config file: quickcrawl.toml

[server]
host = "0.0.0.0"
port = 3000
rate_limit_rps = 10

[renderer]
mode = "auto"           # auto, none, chrome, lightpanda
page_timeout_ms = 30000
pool_size = 4
render_js_default = false

[renderer.lightpanda]
ws_url = ""

[renderer.chrome]
ws_url = ""

[crawler]
max_concurrency = 40
requests_per_second = 40.0
respect_robots_txt = true
default_max_depth = 2
default_max_pages = 100

[extraction.llm]
provider = "openai"
model = "gpt-4o-mini"
api_key = ""
base_url = ""
max_tokens = 4000
temperature = 0.7

Or via environment variables:

SERVER__PORT=3000
RENDERER__MODE=auto
RENDERER__CHROME__WS_URL=ws://127.0.0.1:9222/devtools/browser/...
CRAWLER__MAX_CONCURRENCY=40
EXTRACTION__LLM__API_KEY=your-key

📂 Project Structure

quickcrawl/
├── cmd/
│   ├── server/              # HTTP API server
│   └── mcp/                 # MCP server for AI agents
├── cli/                     # Standalone CLI binary
│   ├── main.go              # Entry point
│   └── cmd/                 # Cobra subcommands
├── internal/
│   ├── api/                 # HTTP handlers, routes, middleware
│   ├── crawler/             # BFS crawler, robots.txt, sitemap
│   ├── extractor/           # HTML cleaning, markdown, chunking
│   ├── renderer/            # HTTP, browser fetching via CDP
│   ├── search/              # DuckDuckGo integration
│   ├── mcp/                 # MCP tool implementation
│   └── types/               # Type definitions
├── playground/              # Web UI playground
├── npm/                     # NPM wrapper package
├── python/                  # Python SDK
│   ├── examples/            # Python usage examples
│   └── README.md            # Python SDK documentation
├── bench/                   # Benchmarks
├── scripts/                 # Release scripts
└── workflows/               # CI/CD workflows

👨‍💻 Tech Stack

Tech Use Case
Go 1.21+ Core backend, CLI, server
Gin HTTP framework
Cobra CLI framework
goquery HTML parsing and DOM manipulation
lightpanda Headless browser automation (CDP over WebSocket)
Chrome DevTools Browser automation via CDP WebSocket
MCP SDK Model Context Protocol server
slog Structured logging

🧪 Playground UI

Quickcrawl includes a web-based playground for testing:

playground/
├── app/
│   └── playground/
│       └── page.tsx         # Main playground UI
├── components/
│   └── response-viewer.tsx  # Response display components
└── lib/
    └── api-client.ts        # API client functions

Access at http://localhost:3000/playground when the server is running.


🚀 Deployment

Docker

Docker images are published to GitHub Container Registry:

# Server
docker build -f infra/Dockerfile.server -t quickcrawl .
docker run -p 3000:3000 quickcrawl

# Playground
docker build -f infra/Dockerfile.playground -t quickcrawl-playground .
docker run -p 3000:3000 quickcrawl-playground

Railway

Deploy on Railway

  1. Click the button above
  2. Connect your GitHub repository
  3. Configure environment variables
  4. Deploy

Roadmap

  1. Add Page Interaction
  2. Hooks & Auth
  3. Improve Search
  4. Better SPA Handling
  5. Auto mode improvements for JS rendering
  6. Add support for the https://github.com/h4ckf0r0day/obscura headless browser
  7. Focus on scalability
  8. Cache System
  9. Improve the SDK performance
