42 changes: 18 additions & 24 deletions README.md
@@ -337,30 +337,24 @@ Different tools make different tradeoffs. This table summarizes the main differe

Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep browser automation, and Scrapy handles massive distributed workloads. MarkCrawl focuses on simple local crawls that produce LLM-ready Markdown.

### Benchmark results — last public run (April 2026, v2 methodology)

| Tool | Speed (p/s) | Content Signal | MRR | Answer (/5) | Annual cost (100K pages) |
|---|---|---|---|---|---|
| **markcrawl** (v0.9.x) | **12.1** | **99%** | 0.698 | 4.52 | **$4,505** |
| scrapy+md | 9.5 | 93% | 0.459 | 4.03 | $5,464 |
| colly+md | 4.2 | 67% | 0.677 | **4.53** | $7,213 |
| playwright | 2.2 | 64% | 0.727 | 4.42 | $7,320 |
| crawlee | 1.7 | 63% | **0.733** | 4.52 | $7,467 |
| crawl4ai | 1.5 | 83% | 0.694 | 4.43 | $6,960 |

**v0.10 projection (next public CI run, based on local-replica delta):**

| Metric | v0.9.x public | v0.10 projected | Δ |
|-----------------|--------------:|----------------:|-------------:|
| Speed | 12.1 (1st) | ~12.1 (1st) | unchanged |
| MRR | 0.698 (3rd) | **~0.78 (1st)** | **+11% projected**, multi-trial validated locally |
| Content signal | 99% (1st) | ~99% (1st) | unchanged |
| Cost / 100K pgs | $4,505 (1st) | **$0 (1st)** | **−$4,505/yr** with default local embedder |
| Answer (/5) | 4.52 (tied 2nd) | ~4.5 | within noise |

Drivers: `chunk_markdown` defaults flipped (Track D, validated +14% MRR on `all-MiniLM-L6-v2` and +15% on OpenAI 3-small across 9 trials) plus the bake-off-winning local embedder default (Track B, MRR-neutral vs 3-small with $0 cost-at-scale).

Full benchmark data: [docs/BENCHMARKS.md](docs/BENCHMARKS.md) | Methodology: [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks) | v0.10 details: [bench/local_replica/v010_release_report.md](bench/local_replica/v010_release_report.md)
### Benchmark results (6 tools, May 2026)

**Speed:** scrapy+md is fastest (5.0 pages/sec), markcrawl at 2.7. Playwright-based tools average 1.4-2.1 pages/sec.

**Output cleanliness:** markcrawl has the lowest nav pollution (53 words vs 500+ for others) — less junk in your embeddings.

**RAG answer quality:** markcrawl scores 3.77/5 on answer quality with the fewest chunks (27,193 total, 2.2x fewer than the most), keeping embedding costs low.

| Tool | Chunks/page | Answer Quality (/5) | Annual cost (100K pages, 1K queries/day) |
|---|---|---|---|
| **markcrawl** | **18.7** | **3.77** | **$4,505** |
| scrapy+md | 31.7 | 3.68 | $5,464 |
| crawl4ai | 16.8 | 4.72 | $6,960 |
| colly+md | 40.6 | 4.36 | $7,213 |
| playwright | 39.0 | 4.48 | $7,320 |
| crawlee | 40.5 | 4.68 | $7,467 |

Full benchmark data: [docs/BENCHMARKS.md](docs/BENCHMARKS.md) | Methodology: [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks)
</details>

## Installation
103 changes: 42 additions & 61 deletions docs/BENCHMARKS.md
@@ -1,69 +1,61 @@
<!-- AUTO-GENERATED by sync_markcrawl.py — do not edit manually -->
# MarkCrawl Benchmarks

> **Summary:** Across 7 open-source crawlers tested on 8 sites with a full RAG pipeline evaluation (109 queries, 99-query common subset), MarkCrawl is the fastest (12.1 pages/sec), produces the cleanest output (99% content signal — the ratio of answer-bearing text to total output), and the lowest total RAG pipeline cost at every scale tested.
> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl produces the cleanest output (53 words of nav pollution vs 500+ for others) and the lowest total RAG pipeline cost at every scale tested.
>
> **Where MarkCrawl is not first:** Retrieval MRR is 3rd (0.698 vs 0.733 for crawlee). Answer quality is tied #2 (4.52/5 vs 4.53/5 for colly+md). Recall on some sites is lower — the intentional trade-off behind the high content signal.
> **Where MarkCrawl is not first:** Speed is 2nd (2.7 pages/sec). Answer quality is 6th (3.77/5, crawl4ai leads at 4.72). Retrieval Hit@5 is 6th (42% vs 87% for crawl4ai-raw). Content recall is 6th (22% vs 70% for crawlee).

*Last run: April 2026 (v2 methodology). Reproducible via [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks).*
*Last run: May 2026. Reproducible via [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks).*

---

## Leaderboard

| Tool | Speed (p/s) | Content Signal | MRR | Answer (/5) | Annual Cost (100K pages) |
|---|---|---|---|---|---|
| **markcrawl** | **12.1** | **99%** | 0.698 | 4.52 | **$4,505** |
| scrapy+md | 9.5 | 93% | 0.459 | 4.03 | $5,464 |
| colly+md | 4.2 | 67% | 0.677 | **4.53** | $7,213 |
| playwright | 2.2 | 64% | 0.727 | 4.42 | $7,320 |
| crawlee | 1.7 | 63% | **0.733** | 4.52 | $7,467 |
| crawl4ai-raw | 1.5 | 84% | 0.694 | 4.44 | $6,961 |
| crawl4ai | 1.5 | 83% | 0.694 | 4.43 | $6,960 |

## Speed

MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-pool HTML extraction. Playwright-based tools (crawl4ai, crawlee) are inherently slower due to full browser rendering per page. MarkCrawl processes pages ~27% faster than the runner-up (scrapy+md) and 5–8x faster than browser-based tools.
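
The relative-speed figures in the paragraph above can be checked against the leaderboard with a few lines of arithmetic. This is only a sketch: the pages/sec values are copied from the v0.9.x table, while the `speed` dictionary and the `speedup` helper are names invented here for illustration.

```python
# Pages/sec figures copied from the v0.9.x leaderboard table.
speed = {
    "markcrawl": 12.1,
    "scrapy+md": 9.5,
    "playwright": 2.2,
    "crawlee": 1.7,
    "crawl4ai": 1.5,
}


def speedup(tool: str, baseline: str = "markcrawl") -> float:
    """Return how many times faster `baseline` is than `tool` (hypothetical helper)."""
    return speed[baseline] / speed[tool]


# Percentage by which markcrawl outpaces the runner-up.
runner_up_gain = (speedup("scrapy+md") - 1) * 100

# Speedup factors against the browser-based tools, smallest to largest.
browser_range = sorted(speedup(t) for t in ("playwright", "crawlee", "crawl4ai"))

print(f"{runner_up_gain:.0f}% faster than scrapy+md")  # 27% faster than scrapy+md
print(f"{browser_range[0]:.1f}x to {browser_range[-1]:.1f}x faster than browser-based tools")  # 5.5x to 8.1x
```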

## Content signal
| Tool | Pages/sec |
|---|---|
| scrapy+md | 5.0 |
| **markcrawl** | **2.7** |
| playwright | 2.5 |
| crawl4ai | 1.4 |
| crawlee | 1.3 |
| colly+md | 1.0 |

**Content signal** = ratio of answer-bearing tokens to total extracted tokens. 99% means almost every word MarkCrawl writes out is content you'd actually embed — not navigation, footer, or cookie banners.
MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-pool HTML extraction. Playwright-based tools (crawl4ai, crawlee) are inherently slower due to full browser rendering per page.

The tradeoff: browser-based tools (crawlee 63%, playwright 64%) capture more of the page but include a lot of boilerplate. Each noisy chunk dilutes embeddings and shows up in retrieval.
## Output cleanliness

MarkCrawl is intentionally aggressive about stripping non-content. On a few sites this costs recall — documented in [Known weaknesses](#known-weaknesses) below.
| Tool | Nav pollution (words) | Recall |
|---|---|---|
| **markcrawl** | **53** | **22%** |
| scrapy+md | 500 | 11% |
| crawl4ai | 545 | 56% |
| playwright | 1995 | 68% |
| crawlee | 2326 | 70% |
| colly+md | 2424 | 50% |

## Retrieval quality (MRR)
Nav pollution = boilerplate words (navigation, footer, cookie banners) that leak into extracted content. Lower is better — less junk means cleaner embeddings and fewer wasted tokens.

Evaluated on 99 queries shared across all tools, using `text-embedding-3-small` + cosine similarity (embedding mode; benchmark also reports BM25, hybrid, and reranked).
The tradeoff: crawlee captures 70% of page content but includes ~2,326 words of boilerplate per page. MarkCrawl captures 22% with 53 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.
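
To make the two cleanliness measures concrete, here is a minimal sketch of how nav pollution (a boilerplate word count) and content signal (the share of output that is real content) could be computed for one page. It assumes each extracted segment has already been labelled as boilerplate or content. The `Segment` type, the labels, and both functions are hypothetical and are not the benchmark's actual implementation, which lives in the llm-crawler-benchmarks repo.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    is_boilerplate: bool  # hypothetical label: navigation, footer, cookie banner, ...


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def nav_pollution(segments: list[Segment]) -> int:
    """Number of boilerplate words that leaked into the extracted output."""
    return sum(word_count(s.text) for s in segments if s.is_boilerplate)


def content_signal(segments: list[Segment]) -> float:
    """Fraction of extracted words that belong to real content."""
    total = sum(word_count(s.text) for s in segments)
    if total == 0:
        return 0.0
    content = sum(word_count(s.text) for s in segments if not s.is_boilerplate)
    return content / total


# Toy page: 3 boilerplate words and 7 content words.
page = [
    Segment("Home About Login", is_boilerplate=True),
    Segment("MarkCrawl converts pages to Markdown for retrieval", is_boilerplate=False),
]

print(nav_pollution(page))            # 3
print(f"{content_signal(page):.0%}")  # 70%
```

The two numbers move in opposite directions: stripping the navigation segment drops pollution to 0 and raises the signal to 100%, which is the behaviour the tables reward.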

| Tool | MRR | Hit@10 | Chunks |
|---|---|---|---|
| crawlee | **0.733** | — | 47,560 |
| playwright | 0.727 | — | 46,439 |
| **markcrawl** | **0.698** | **87%** | 27,051 |
| crawl4ai-raw | 0.694 | — | — |
| crawl4ai | 0.694 | — | 32,735 |
| colly+md | 0.677 | — | 42,934 |
| scrapy+md | 0.459 | — | 23,854 |
## RAG answer quality

**Reading the table:** MRR (mean reciprocal rank) averages 1/rank of the first correct chunk over all queries, so a higher value means the correct chunk sits closer to the top of the ranked list. A score of 0.698 corresponds to an effective rank of about 1.4 (1/0.698). The retrieval report notes that differences among the top tools fall within confidence intervals; retrieval mode (embedding vs BM25 vs hybrid vs reranked) has more impact than crawler choice.
| Tool | Chunks | Answer Quality (/5) | Hit@5 | Hit@20 |
|---|---|---|---|---|
| crawl4ai | 24,400 | 4.72 | 86% | 95% |
| **markcrawl** | **27,193** | **3.77** | **42%** | **45%** |
| scrapy+md | 46,141 | 3.68 | 21% | 22% |
| playwright | 56,855 | 4.48 | 87% | 94% |
| crawlee | 58,912 | 4.68 | 87% | 94% |
| colly+md | 59,078 | 4.36 | 51% | 56% |

MarkCrawl produces about 1.8x fewer chunks than crawlee (27,051 vs 47,560 in the table above) for comparable answer quality, which keeps embedding and storage costs down.
*FireCrawl's self-hosted version did not complete crawls on all sites across multiple attempts. Its scores are on a reduced set and are not directly comparable to tools that completed all sites.

## Answer quality
**Reading this table:**
- **Chunks** — total chunks across all sites. Fewer = less redundancy, lower embedding costs.
- **Answer Quality** — LLM-judged score for answers generated from retrieved chunks.
- **Hit@5 / Hit@20** — what percentage of queries find a relevant chunk in the top 5 or 20 results.
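
The Hit@k definition above, and the MRR metric used in the v2-methodology tables, can be sketched in a few lines. This is a minimal illustration rather than the benchmark's own code: the function names and the toy chunk IDs are assumptions, and each query is represented as a pair of (ranked chunk IDs, set of relevant chunk IDs).

```python
def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)


def hit_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(
        1 for ranked, rel in results if any(c in rel for c in ranked[:k])
    )
    return hits / len(results)


# Four toy queries: relevant chunk at rank 1, rank 2, missing, and rank 3.
results = [
    (["a", "b", "c"], {"a"}),
    (["d", "e", "f"], {"e"}),
    (["g", "h", "i"], {"z"}),
    (["j", "k", "l"], {"l"}),
]

print(f"MRR   = {mean_reciprocal_rank(results):.3f}")  # MRR   = 0.458
print(f"Hit@2 = {hit_at_k(results, 2):.0%}")           # Hit@2 = 50%
print(f"Hit@3 = {hit_at_k(results, 3):.0%}")           # Hit@3 = 75%
```

Hit@k only asks whether a relevant chunk appears anywhere in the top k, while MRR also rewards placing it higher, so the two can rank tools differently.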

| Tool | Answer (/5) |
|---|---|
| colly+md | **4.53** |
| **markcrawl** | **4.52** |
| crawlee | 4.52 |
| crawl4ai-raw | 4.44 |
| crawl4ai | 4.43 |
| playwright | 4.42 |
| scrapy+md | 4.03 |

LLM-judged score for answers generated from retrieved chunks. The benchmark explicitly notes: *"answer quality is tight across all tools"* — the range across six of seven tools is 4.42–4.53. Differences in this range are mostly within noise.
**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 2.2x fewer chunks than colly+md for the same content, cutting embedding and storage costs significantly.

## Total cost of ownership

@@ -78,27 +70,16 @@ Annual cost estimate for a complete RAG pipeline: crawling + embedding + vector
| playwright | $517 | $7,320 | $73,202 |
| crawlee | $518 | $7,467 | $74,673 |

MarkCrawl's cost advantage comes from chunk efficiency — same content, fewer and cleaner chunks means fewer embedding API calls and less vector storage. The total cost difference between the cheapest and most expensive tools is ~$3,000/year at 100K pages.

## Known weaknesses

High content signal (99%) comes from aggressive boilerplate removal, and on some sites that removal takes legitimate content with it.

- **Recall on list-style pages** — quotes.toscrape: 45% vs 100% for browser-based tools. The extracted text is there, but heading structure and some list items are dropped.
- **Recall on certain doc sites** — fastapi-docs: 37% vs 88% for crawl4ai-raw. API signature blocks and sidebars are being stripped as non-content.
- **Heading density on sparse pages** — 0.9 headings/page on quotes.toscrape vs 2.6–2.9 for competitors. Fewer semantic anchors hurt retrieval on pages where every chunk needs to stand on its own.
- **Preamble on catalog pages** — books.toscrape: 68 words of preamble (still lower than all competitors, but flagged as worse than MarkCrawl's own average of 14).

These are active optimization targets — see `bench/autoresearch.py` and `bench/program.md` in the repo for the experiment loop.
MarkCrawl's cost advantage comes from chunk efficiency — same content, fewer and cleaner chunks means fewer embedding API calls and less vector storage. The total cost difference between the cheapest and most expensive tools is $2,962/year at 100K pages.
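
The $2,962 figure follows directly from the 100K-page column of the cost tables. A short sketch of the arithmetic, with the dollar amounts copied from the README table and the variable names chosen here for illustration:

```python
# Annual cost at 100K pages, copied from the README benchmark table.
annual_cost_100k = {
    "markcrawl": 4505,
    "scrapy+md": 5464,
    "crawl4ai": 6960,
    "colly+md": 7213,
    "playwright": 7320,
    "crawlee": 7467,
}

cheapest = min(annual_cost_100k, key=annual_cost_100k.get)
priciest = max(annual_cost_100k, key=annual_cost_100k.get)
spread = annual_cost_100k[priciest] - annual_cost_100k[cheapest]

# Saving expressed as a share of the most expensive tool's cost.
saving_share = spread / annual_cost_100k[priciest]

print(f"{cheapest} vs {priciest}: ${spread:,}/year")  # markcrawl vs crawlee: $2,962/year
print(f"saving: {saving_share:.0%} of the highest cost")
```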

## Why these numbers matter

For a RAG pipeline, the crawler is stage 1 — everything downstream (chunking, embedding, retrieval, LLM generation) depends on the quality of what the crawler produces.

- **Fewer chunks per page** = lower embedding costs, less vector DB storage, faster retrieval
- **Higher content signal** = cleaner embeddings that match user queries instead of "Home | About | Login"
- **Higher MRR** = the right chunk shows up at the top, where the LLM can actually use it
- **Less nav pollution** = cleaner embeddings that match user queries instead of "Home | About | Login"
- **Higher answer quality** = the LLM gets better source material and produces more accurate answers

## Methodology

All benchmarks run on the same hardware, same sites, same queries, with reproducible scripts. No tool receives special treatment or configuration beyond its defaults. Retrieval evaluated with `text-embedding-3-small`; reranker is `cross-encoder/ms-marco-MiniLM-L-6-v2`. Full methodology, raw data, and reproduction instructions are in the [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks) repo.
All benchmarks run on the same hardware, same sites, same queries, with reproducible scripts. No tool receives special treatment or configuration beyond its defaults. The full methodology, raw data, and reproduction instructions are in the [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks) repo.