Playground: Try it online
Quickcrawl is a Go-based web scraping service built for AI agents. Whether you're scraping a single page, crawling an entire website, or mapping out link structures, Quickcrawl handles the heavy lifting with a multi-layered architecture that combines HTTP fetching, browser automation, and LLM-powered structured extraction.
| Feature | Description |
|---|---|
| Web Scraping | Convert any URL to Markdown, HTML, Plain Text, or Links |
| Async Crawling | BFS website crawler with depth/page limits and rate limiting |
| URL Mapping | Discover all URLs on a site instantly without scraping content |
| JavaScript Rendering | Auto-detect SPAs and render via LightPanda or Chrome |
| LLM Extraction | Send a JSON schema, get validated structured data back |
| Web Search | DuckDuckGo-powered search for AI agent integration |
| MCP Server | Built-in stdio transport for seamless AI agent integration |
| Multi-format Output | Markdown, HTML, RawHTML, PlainText, Links, JSON |
85.4% scrape success rate across 1,000 diverse URLs from the Firecrawl Scrape Content Dataset v1
Tested against Firecrawl v2.5 on the same dataset:
| Feature | Quickcrawl | Firecrawl |
|---|---|---|
| Coverage | 85.4% | 79.4% |
| Avg Scrape Latency | 1,841.9ms | 4,048ms |
| Self-hosting | Single binary | Multi-container (~4GB+) |
| Cost / 1K scrapes | $0 (self-hosted) | $9 |
Content quality metrics from the same benchmark run:

| Metric | Value |
|---|---|
| Content Recall | 44.03% |
| Noise Rejection | 86.65% |
| Content Matches | 376 |
| Noise Leaks | 114 |
Build RAG pipelines with clean LLM-ready markdown, give AI agents real-time web access, monitor content changes, extract structured data, convert HTML to clean markdown, or archive web pages at scale.
Quickcrawl MCP server provides AI agents with web scraping capabilities:
| Tool | Description |
|---|---|
| `quickcrawl_scrape` | Scrape a single URL |
| `quickcrawl_crawl` | Start async website crawl |
| `quickcrawl_check_crawl_status` | Check crawl job status |
| `quickcrawl_cancel_crawl` | Cancel running crawl |
| `quickcrawl_map` | Discover URLs on a site |
| `quickcrawl_search` | Search DuckDuckGo |
Add Quickcrawl to your OpenCode configuration:
```json
{
  "mcp": {
    "quickcrawl": {
      "type": "local",
      "command": ["npx", "-y", "@mabudalam/quickcrawl-mcp"],
      "enabled": true
    }
  }
}
```

Renderer selection in MCP:
- Omit `renderer` or set `renderer: "auto"` to use the configured fallback chain
- Set `renderer: "lightpanda"` or `"chrome"` to hard-pin a single renderer
- Pinned renderers imply `renderJs: true` unless `renderJs: false` is set explicitly
Example MCP tool arguments:
```json
{
  "url": "https://www.notion.so/",
  "formats": ["markdown"],
  "renderJs": true,
  "renderer": "chrome",
  "onlyMainContent": true
}
```

The Quickcrawl CLI provides standalone command-line access to all features. No server or Python needed.
```bash
# Quick install (recommended)
curl -fsSL https://raw.githubusercontent.com/MabudAlam/quickcrawl/main/install.sh | sh

# Or from source
go install github.com/MabudAlam/quickcrawl/cli@latest

# Download from GitHub releases
curl -L https://github.com/MabudAlam/QuickCrawl/releases/latest/download/quickcrawl_darwin_arm64.tar.gz | tar -xz
./quickcrawl --help
```

```bash
# Scrape a single URL
quickcrawl scrape https://example.com
quickcrawl scrape https://example.com --formats html,markdown
# Crawl a website
quickcrawl crawl https://example.com --max-pages 10 --max-depth 3
# Discover URLs (without scraping content)
quickcrawl map https://example.com --max-depth 2
# Search DuckDuckGo
quickcrawl search "golang web scraping"
quickcrawl search "python" --scrape --formats markdownSee cli/cmd/ for all subcommands:
- `scrape.go` – Scrape single URLs
- `crawl.go` – Crawl websites
- `map.go` – Discover URLs
- `search.go` – Web search
Python SDK for Quickcrawl: scrape, crawl, and map websites from Python code.
```bash
# From PyPI (coming soon)
pip install quickcrawl

# From GitHub
pip install git+https://github.com/MabudAlam/quickcrawl.git@python-sdk#subdirectory=python

# Or clone and install
git clone https://github.com/MabudAlam/quickcrawl
cd quickcrawl/python
pip install -e .
```

```python
from quickcrawl import QuickCrawlClient

# CLI mode (zero config, auto-downloads binary)
with QuickCrawlClient() as client:
    result = client.scrape("https://example.com")
    print(result["markdown"])

# HTTP mode (connect to a deployed server)
client = QuickCrawlClient(api_url="https://your-server.com", api_key="...")
result = client.scrape("https://example.com")
```

See `python/examples/`:
- `01_scrape.py` – Scrape a single URL
- `02_crawl.py` – Crawl an entire website
- `03_map.py` – Discover URLs without scraping
- `04_formats.py` – Multiple output formats
- `05_cloud.py` – Connect to a deployed server
- `06_search.py` – Web search with scraping
- `perplexity.py` – Perplexity-style AI research agent with Google ADK
```mermaid
graph TD
    Client["Client (HTTP / MCP)"]
    Server["Quickcrawl Server"]
    Client --> Server
    Server --> Router["Gin Router"]
    Router --> Handlers["API Handlers"]
    Handlers --> Renderer["Renderer Layer"]
    Renderer --> HTTPFetcher["HTTP Fetcher"]
    Renderer --> Browser["Browser (CDP)"]
    Browser --> LightPanda["LightPanda"]
    Browser --> Chrome["Chrome DevTools"]
    Renderer --> Extractor["Extractor"]
    Extractor --> Markdown["Markdown"]
    Extractor --> HTML["HTML"]
    Extractor --> PlainText["Plain Text"]
    Extractor --> Links["Links"]
    Handlers --> Crawler["Crawler"]
    Crawler --> Robots["Robots.txt"]
    Crawler --> Sitemap["Sitemap"]
    Crawler --> RateLimit["Rate Limiter"]
    Handlers --> Search["DuckDuckGo Search"]
    Search --> LLM["LLM Extraction"]
    LLM --> JSONSchema["JSON Schema Output"]
```
Renderer fallback chain:

```text
Request
  ↓
HTTP Fetcher (plain HTML, fastest)
  ↓ (if SPA detected or JS requested)
LightPanda (headless browser, fast)
  ↓ (if LightPanda unavailable)
Chrome DevTools (full browser, most compatible)
```
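The same fallback chain, sketched in Python. The helper names and the SPA heuristic are illustrative assumptions, not Quickcrawl's internal API (which lives in `internal/renderer`):

```python
import requests

SPA_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__")  # assumed heuristic

def looks_like_spa(html: str) -> bool:
    # Pages that ship an empty mount point usually need JS rendering.
    return any(marker in html for marker in SPA_MARKERS)

def render_with_lightpanda(url: str) -> str:
    raise NotImplementedError("placeholder: drive LightPanda over CDP")

def render_with_chrome(url: str) -> str:
    raise NotImplementedError("placeholder: drive Chrome over CDP")

def render(url: str, render_js: bool = False) -> str:
    html = requests.get(url, timeout=30).text       # 1. plain HTTP fetch (fastest)
    if not render_js and not looks_like_spa(html):
        return html
    try:
        return render_with_lightpanda(url)          # 2. LightPanda (fast)
    except Exception:
        return render_with_chrome(url)              # 3. Chrome DevTools (most compatible)
```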
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/scrape` | Scrape a single URL with multiple output formats |
Request:
```json
{
  "url": "https://www.mabud.dev/",
  "formats": ["markdown"],
  "onlyMainContent": true,
  "renderJs": false,
  "topK": 5
}
```

Response:
```json
{
  "success": true,
  "data": {
    "markdown": "# Hey, I'm Mabud.\n\nI build AI native systems...",
    "metadata": {
      "title": "Mabud Alam",
      "description": "Software Engineer",
      "ogpTitle": "Mabud Alam",
      "ogpDescription": "Software Engineer",
      "sourceURL": "https://www.mabud.dev/",
      "language": "en",
      "statusCode": 200,
      "renderedMode": "http",
      "timeTaken": 866
    }
  }
}
```
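A minimal Python call to this endpoint (a local server on port 3000 is assumed; request and response fields are as documented above):

```python
import requests

# Assumes a Quickcrawl server listening on localhost:3000.
resp = requests.post(
    "http://localhost:3000/v1/scrape",
    json={"url": "https://www.mabud.dev/", "formats": ["markdown"], "onlyMainContent": True},
    timeout=60,
)
body = resp.json()
if body.get("success"):
    print(body["data"]["markdown"])
```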
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/crawl` | Start an async BFS crawl of a website |
| `GET` | `/v1/crawl/:id` | Check crawl status and retrieve results |
| `DELETE` | `/v1/crawl/:id` | Cancel a running crawl job |
Start Crawl Request:
```json
{
  "url": "https://www.mabud.dev/",
  "maxDepth": 2,
  "maxPages": 100,
  "formats": ["markdown"],
  "onlyMainContent": true,
  "renderJs": false
}
```

Check Status Request: `GET /v1/crawl/:id`
No body required.
Cancel Request: `DELETE /v1/crawl/:id`
No body required.
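The full job lifecycle from Python, sketched below. The server address and the location of the job id and status fields in the responses are assumptions; adjust them to the actual payload:

```python
import time
import requests

BASE = "http://localhost:3000"  # assumed local server

# Start an async crawl (request body as documented above)
job = requests.post(f"{BASE}/v1/crawl", json={
    "url": "https://www.mabud.dev/",
    "maxDepth": 2,
    "maxPages": 100,
    "formats": ["markdown"],
}, timeout=30).json()
job_id = job["data"]["id"]  # assumed response shape

# Poll GET /v1/crawl/:id until the job reaches a terminal state
while True:
    status = requests.get(f"{BASE}/v1/crawl/{job_id}", timeout=30).json()
    if status["data"].get("status") in ("completed", "failed", "cancelled"):  # assumed values
        break
    time.sleep(2)

# To abort instead: requests.delete(f"{BASE}/v1/crawl/{job_id}")
```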
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/map` | Discover all URLs on a site instantly |
Request:
```json
{
  "url": "https://www.mabud.dev/",
  "maxDepth": 2,
  "useSitemap": true,
  "timeout": 30000
}
```

Response:
```json
{
  "success": true,
  "data": {
    "links": [
      "https://www.mabud.dev/blog",
      "https://www.mabud.dev/projects",
      "https://www.mabud.dev/resume"
    ]
  }
}
```
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/search` | Search DuckDuckGo and scrape results in parallel |
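The request body for `/v1/search` is not documented here; the sketch below infers field names from the CLI flags (`quickcrawl search "..." --scrape --formats markdown`), so treat them as assumptions:

```python
import requests

# Field names are guesses inferred from the CLI flags; check the API source.
resp = requests.post("http://localhost:3000/v1/search", json={
    "query": "golang web scraping",  # assumed field
    "scrape": True,                  # assumed field
    "formats": ["markdown"],         # assumed field
}, timeout=60)
print(resp.json())
```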
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Health check with browser availability and active job count |
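A quick liveness check from Python (local server assumed):

```python
import requests

# Reports browser availability and active job count, per the table above.
print(requests.get("http://localhost:3000/health", timeout=10).json())
```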
Quickcrawl supports JSON Schema-based extraction for structured data:
```json
{
  "url": "https://news.example.com",
  "formats": ["markdown", "json"],
  "onlyMainContent": true,
  "extract": {
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "author": { "type": "string" },
        "publishDate": { "type": "string" }
      },
      "required": ["title", "author"]
    },
    "prompt": "Extract article title, author, and publish date",
    "responseFormat": "article"
  },
  "chunkStrategy": { "type": "sentence" },
  "query": "article title and author",
  "filterMode": "bm25",
  "topK": 5
}
```

Supported `chunkStrategy` types:

| Strategy | Description |
|---|---|
| `sentence` | Split by sentence boundaries |
| `paragraph` | Split by paragraph boundaries |
| `regex` | Split by custom regex pattern |
| `topic` | Split by topic changes |
Supported `filterMode` values:

| Mode | Description |
|---|---|
| `bm25` | BM25 algorithm for relevance scoring |
| `rrf` | Reciprocal Rank Fusion |
| `hybrid` | Combine bm25 and rrf |
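Putting it together: a sketch of one request that combines schema extraction with paragraph chunking and hybrid filtering. The server address and the key under which the structured result is returned are assumptions:

```python
import requests

payload = {
    "url": "https://news.example.com",
    "formats": ["markdown", "json"],
    "onlyMainContent": True,
    "extract": {
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "author": {"type": "string"}},
            "required": ["title", "author"],
        },
        "prompt": "Extract article title and author",
    },
    "chunkStrategy": {"type": "paragraph"},
    "query": "article title and author",
    "filterMode": "hybrid",
    "topK": 5,
}
resp = requests.post("http://localhost:3000/v1/scrape", json=payload, timeout=120)
data = resp.json()["data"]
print(data.get("json"))  # assumed key for the structured result
```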
Config file: `quickcrawl.toml`

```toml
[server]
host = "0.0.0.0"
port = 3000
rate_limit_rps = 10
[renderer]
mode = "auto" # auto, none, chrome, lightpanda
page_timeout_ms = 30000
pool_size = 4
render_js_default = false
[renderer.lightpanda]
ws_url = ""
[renderer.chrome]
ws_url = ""
[crawler]
max_concurrency = 40
requests_per_second = 40.0
respect_robots_txt = true
default_max_depth = 2
default_max_pages = 100
[extraction.llm]
provider = "openai"
model = "gpt-4o-mini"
api_key = ""
base_url = ""
max_tokens = 4000
temperature = 0.7
```

Or via environment variables (`__` separates nested TOML keys):

```bash
SERVER__PORT=3000
RENDERER__MODE=auto
RENDERER__CHROME__WS_URL=ws://127.0.0.1:9222/devtools/browser/...
CRAWLER__MAX_CONCURRENCY=40
EXTRACTION__LLM__API_KEY=your-key
```

```text
quickcrawl/
├── cmd/
│   ├── server/       # HTTP API server
│   └── mcp/          # MCP server for AI agents
├── cli/              # Standalone CLI binary
│   ├── main.go       # Entry point
│   └── cmd/          # Cobra subcommands
├── internal/
│   ├── api/          # HTTP handlers, routes, middleware
│   ├── crawler/      # BFS crawler, robots.txt, sitemap
│   ├── extractor/    # HTML cleaning, markdown, chunking
│   ├── renderer/     # HTTP, browser fetching via CDP
│   ├── search/       # DuckDuckGo integration
│   ├── mcp/          # MCP tool implementation
│   └── types/        # Type definitions
├── playground/       # Web UI playground
├── npm/              # NPM wrapper package
├── python/           # Python SDK
│   ├── examples/     # Python usage examples
│   └── README.md     # Python SDK documentation
├── bench/            # Benchmarks
├── scripts/          # Release scripts
└── workflows/        # CI/CD workflows
```
| Tech | Use Case |
|---|---|
| Go 1.21+ | Core backend, CLI, server |
| Gin | HTTP framework |
| Cobra | CLI framework |
| goquery | HTML parsing and DOM manipulation |
| lightpanda | Headless browser automation (CDP over WebSocket) |
| Chrome DevTools | Browser automation via CDP WebSocket |
| MCP SDK | Model Context Protocol server |
| slog | Structured logging |
Quickcrawl includes a web-based playground for testing:
```text
playground/
├── app/
│   └── playground/
│       └── page.tsx             # Main playground UI
├── components/
│   └── response-viewer.tsx      # Response display components
└── lib/
    └── api-client.ts            # API client functions
```
Access at `http://localhost:3000/playground` when the server is running.
Docker images are published to GitHub Container Registry:
```bash
# Server
docker build -f infra/Dockerfile.server -t quickcrawl .
docker run -p 3000:3000 quickcrawl

# Playground
docker build -f infra/Dockerfile.playground -t quickcrawl-playground .
docker run -p 3000:3000 quickcrawl-playground
```

- Click the button above
- Connect your GitHub repository
- Configure environment variables
- Deploy
- Add Page Interaction
- Hooks & Auth
- Improve Search
- Better SPA Handling
- Auto mode improvements for JS rendering
- Add support for [obscura](https://github.com/h4ckf0r0day/obscura) headless browsing
- Focus on scalability
- Cache System
- Improve SDK performance