A scraper-first, Pi-native, and local-first extension for the Pi ecosystem.
pi-scraper reads known URLs and sites. Use it to scrape, summarize one page, crawl, map URLs, diff snapshots, retrieve stored results, or download/extract deterministic/structured data, including a CloakBrowser-backed browser mode with C++ fingerprint patches and persistent sessions.
Install the extension via the Pi CLI:
pi install npm:pi-scraper

Tip: Ask naturally; Pi can choose the right web tool automatically:
- "Read https://example.com as markdown."
- "List all URLs available from https://example.com."
- "Crawl https://example.com, up to 25 pages."
- "Compare https://example.com against my homepage snapshot."
- "Open https://example.com/login in browser mode, save the session, then scrape /dashboard."
pi-scraper starts with the cheapest fetch path and escalates its scraping strategy only when needed, balancing speed and capability.
| Mode | JS Support | Speed | Best Use Case |
|---|---|---|---|
| fast | No | Fastest | Static HTML, documentation, and quick text extraction. |
| fingerprint | No | Fast | Sites that block simple bots (uses TLS fingerprinting). |
| readable | No | Moderate | Articles and blogs where noise reduction is critical. |
| browser | Yes | Slow | Heavily JS-rendered sites (uses CloakBrowser by default). |
| auto | Auto | Varies | Default. Automatically selects the best path based on signals. |
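The escalation idea can be sketched as a small decision function. This is a hypothetical illustration only: the signal names (status, textLength, hasNoscriptHint), the thresholds, and nextMode itself are assumptions, not pi-scraper's actual selection logic.

```typescript
// Hypothetical sketch: signals and thresholds are assumptions, not pi-scraper internals.
type Mode = "fast" | "fingerprint" | "readable" | "browser";

interface FetchSignals {
  status: number;           // HTTP status of the previous attempt
  textLength: number;       // characters of visible text extracted
  hasNoscriptHint: boolean; // page asks the visitor to enable JavaScript
}

// Return the next, more capable mode, or null when the result looks usable.
function nextMode(current: Mode, s: FetchSignals): Mode | null {
  const blocked = s.status === 403 || s.status === 429;
  const needsJs = s.hasNoscriptHint || s.textLength < 200;
  if (current === "fast" && blocked) return "fingerprint";
  if (current !== "browser" && needsJs) return "browser";
  if (current === "fingerprint" && blocked) return "browser";
  return null;
}
```

Under these assumptions, a blocked fast fetch moves to fingerprint, and a near-empty page moves straight to browser.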
| Tool | Capability | Best For | Contract |
|---|---|---|---|
| web_scrape | Local | Reading a single URL as Markdown, Text, or HTML. | 268 tokens |
| web_crawl | Resumable | BFS crawling to build local datasets or context packages. | 179 tokens |
| web_map | Discovery | Inventorying URLs via robots.txt, sitemaps, and llms.txt. | 51 tokens |
| web_batch | Bulk | Scraping multiple independent URLs concurrently. | 151 tokens |
| web_extract | Structured | Deterministic, selector-based, or LLM-backed extraction. | 407 tokens |
| web_get_result | Retrieval | Accessing stored results, job manifests, or snapshots. | 55 tokens |

Note: Contract is the total number of tokens in the tool declaration.
| Area | Parameters | Description |
|---|---|---|
| Shared | sessionId, saveSession, clearSession, stealth, autoWait, browserBackend, proxy, headers, provider | Sessions, browser controls, and LLM provider selection. |
| Scrape | url, urls, mode, format, refresh, respectRobots, timeoutSeconds | Targets and fetch behavior. Summarize moved to web_extract action=summarize. |
| Limits | maxBytes, maxChars, onlyMainContent | Size limits and content cleaning. |
| RAG chunks | chunks, maxTokens, overlapTokens | chunks=true returns paragraph-bounded chunks[] alongside full markdown. |
| Filtering | include, exclude, linesMatching, contextLines, caseSensitive | Glob patterns and line-based content filtering. |
| Redirection | followAlternates, followMetaRefresh | Controls for non-standard redirects. |
| Snapshots | snapshotName, snapshotTag, diff, compareTag, maxSnapshotAgeSeconds | Versioning and diffing baselines. |
| Crawl | action, maxPages, maxDepth, sameOrigin, concurrency, resume, crawlId, compile, seed, seedSitemap, status, limit, extract, strategy | BFS/DFS/best-first discovery, limits, and state management. Strategy shown in progress output. |
| Extract | action, extractor, prompt, schema, selector, selectorType, attribute, adaptive, bullets, sentences, identifier, autoSave, threshold, extractSchema | Vertical, ad-hoc (with grounded[] source spans), selector, and summarize (action=summarize, sentences/bullets) extraction. |
| Patterns | markers, contains, excerpts, regexes, sections, jsonPaths, sourceFormat, length | Deterministic inspection: strings, regex, and ranges. |
| Strategy Extraction | selectors (field→selector map), query, topN, minScore, flags | New: css-extract, xpath-extract, regex-extract, cosine |
| Proxy | proxy; HTTP_PROXY/HTTPS_PROXY/ALL_PROXY; NO_PROXY | Explicit proxy URL, proxy arrays for crawl rotation, or env-var auto-config when proxy is omitted. |
| Map | url, maxSitemaps | Site-wide discovery of robots.txt and sitemaps. |
| Storage | saveToFile | true or {dir, filename, maxBytes} for disk storage. |
| Retrieval | responseId, jobId, snapshotUrl, snapshotName, snapshotTag | Retrieve stored payloads and job manifests. |
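As a rough illustration of what chunks=true with maxTokens and overlapTokens could produce, the sketch below packs whole paragraphs into chunks and carries trailing paragraphs forward as overlap. Counting tokens as whitespace-separated words, and the chunkMarkdown function itself, are assumptions made for the example; pi-scraper's tokenizer and boundary rules may differ.

```typescript
// Assumption: tokens are approximated by whitespace-separated words.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

// Pack whole paragraphs into chunks of at most maxTokens (a single oversized
// paragraph still becomes its own chunk), repeating up to overlapTokens of
// trailing paragraphs at the start of the next chunk.
function chunkMarkdown(markdown: string, maxTokens: number, overlapTokens: number): string[] {
  const paragraphs = markdown.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;
  for (const p of paragraphs) {
    const n = countTokens(p);
    if (current.length > 0 && size + n > maxTokens) {
      chunks.push(current.join("\n\n"));
      // Walk backwards over the finished chunk to collect the overlap.
      const carried: string[] = [];
      let carriedSize = 0;
      for (const para of [...current].reverse()) {
        const t = countTokens(para);
        if (carriedSize + t > overlapTokens) break;
        carried.unshift(para);
        carriedSize += t;
      }
      current = carried;
      size = carriedSize;
    }
    current.push(p);
    size += n;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Because chunk boundaries fall only between paragraphs, no chunk ever starts or ends mid-sentence in this sketch.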
pi-scraper is stateless by default. Use sessionId when you need to maintain state (cookies, login, cart) across multiple calls.
- sessionId: A unique key for the session.
- saveSession: Persist cookies to disk (useful across Pi reloads).
- clearSession: Wipe the session state.
- fingerprint: Use mode: "fingerprint" to bypass basic bot blocks using browser-grade TLS fingerprints without the overhead of a full browser.
// Example: Log in and then scrape a protected page
web_scrape({ url: "https://example.com/login", sessionId: "user-1", saveSession: true })
web_scrape({ url: "https://example.com/dashboard", sessionId: "user-1" })
pi-scraper can route requests through explicit proxy URLs or standard proxy environment variables. SSRF protection still runs before network I/O and on redirects; SOCKS target hostnames are resolved locally before CONNECT so private/reserved addresses can still be blocked.
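To make the SSRF idea concrete, here is a minimal sketch of the kind of check that can run on a resolved IPv4 address before connecting and again on each redirect. The function name isBlockedIPv4 and the exact list of ranges are assumptions for illustration; pi-scraper's own rules (including IPv6 handling) are not shown in this document.

```typescript
// Returns true when a dotted-quad IPv4 address is private, loopback, or reserved.
// Hypothetical helper: the range list is an assumption, not pi-scraper's actual policy.
function isBlockedIPv4(ip: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
  if (!m) return true; // fail closed on anything that is not a plain IPv4 literal
  const octets = m.slice(1).map(Number);
  if (octets.some((n) => n > 255)) return true;
  const a = octets[0] ?? 0;
  const b = octets[1] ?? 0;
  return (
    a === 0 ||                            // 0.0.0.0/8 "this network"
    a === 10 ||                           // 10.0.0.0/8 private
    a === 127 ||                          // 127.0.0.0/8 loopback
    (a === 100 && b >= 64 && b <= 127) || // 100.64.0.0/10 carrier-grade NAT
    (a === 169 && b === 254) ||           // 169.254.0.0/16 link-local
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12 private
    (a === 192 && b === 168) ||           // 192.168.0.0/16 private
    a >= 224                              // multicast and reserved
  );
}
```

Resolving the hostname locally first is what makes a check like this possible; with proxy-side DNS the client never sees the address it would need to inspect.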
Pass proxy when you want a specific route for a call:
web_scrape({
url: "https://example.com",
proxy: "http://proxy.example:8080"
})
Supported proxy URL schemes for static fetch modes (fast and readable):
- http:// and https://: HTTP proxy URLs.
- socks5:// and socks://: SOCKS5 proxy URLs.
- socks4://: SOCKS4 proxy URLs. SOCKS4 requires an IPv4 DNS result for the target.
pi-scraper intentionally rejects socks5h:// and socks4a://: those schemes require proxy-side DNS, which would bypass local DNS/SSRF validation.
mode: "fingerprint" can use HTTP(S) proxies. SOCKS proxies are accepted only for literal-IP targets; hostname targets fail closed with guidance to use fast/readable or an HTTP(S) proxy.
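The scheme rules above for static fetch modes can be sketched as a small validator. The function name checkProxyUrl and the returned shape are hypothetical; only the accepted and rejected schemes come from the text.

```typescript
// Schemes accepted for static fetch modes, and schemes rejected because they
// delegate DNS to the proxy. checkProxyUrl is a hypothetical helper.
const ALLOWED_SCHEMES = new Set(["http:", "https:", "socks5:", "socks:", "socks4:"]);
const REMOTE_DNS_SCHEMES = new Set(["socks5h:", "socks4a:"]);

function checkProxyUrl(proxy: string): { ok: boolean; reason?: string } {
  let protocol: string;
  try {
    protocol = new URL(proxy).protocol; // always lowercased, with a trailing ":"
  } catch {
    return { ok: false, reason: "not a valid URL" };
  }
  if (REMOTE_DNS_SCHEMES.has(protocol)) {
    return { ok: false, reason: "proxy-side DNS would bypass local DNS/SSRF validation" };
  }
  if (!ALLOWED_SCHEMES.has(protocol)) {
    return { ok: false, reason: `unsupported proxy scheme ${protocol}` };
  }
  return { ok: true };
}
```

Rejecting socks5h:// and socks4a:// up front keeps the failure explicit instead of silently downgrading to local DNS.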
When proxy is omitted, pi-scraper automatically checks standard environment variables:
- https:// targets: HTTPS_PROXY → https_proxy → ALL_PROXY → all_proxy.
- http:// targets: HTTP_PROXY → http_proxy → ALL_PROXY → all_proxy.
NO_PROXY / no_proxy bypasses env-derived proxies. It supports *, comma-separated host rules, domains and subdomains (example.com, .example.com), host:port including default ports, and IPv6 hosts with or without brackets.
HTTPS_PROXY=http://127.0.0.1:8080 \
NO_PROXY=localhost,127.0.0.1,.internal.example \
pi

An explicit proxy parameter always wins over env vars.
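The lookup order described above can be sketched as follows. This is a simplified illustration: resolveProxy and bypassesProxy are hypothetical names, the environment is passed in as a plain object, and the NO_PROXY matcher covers only *, exact hosts, and domain suffixes (host:port rules and IPv6 are left out). It also assumes that example.com and .example.com both match the domain and its subdomains.

```typescript
type Env = Record<string, string | undefined>;

// Simplified NO_PROXY matcher: "*", exact hosts, and domain suffixes only.
function bypassesProxy(host: string, noProxy: string): boolean {
  return noProxy
    .split(",")
    .map((rule) => rule.trim().toLowerCase())
    .filter(Boolean)
    .some((rule) => {
      if (rule === "*") return true;
      const domain = rule.startsWith(".") ? rule.slice(1) : rule;
      return host === domain || host.endsWith("." + domain);
    });
}

function resolveProxy(target: string, env: Env, explicit?: string): string | undefined {
  if (explicit) return explicit; // an explicit proxy parameter always wins
  const url = new URL(target);
  const noProxy = env["NO_PROXY"] ?? env["no_proxy"] ?? "";
  if (bypassesProxy(url.hostname.toLowerCase(), noProxy)) return undefined;
  const names = url.protocol === "https:"
    ? ["HTTPS_PROXY", "https_proxy", "ALL_PROXY", "all_proxy"]
    : ["HTTP_PROXY", "http_proxy", "ALL_PROXY", "all_proxy"];
  for (const name of names) {
    const value = env[name];
    if (value) return value; // first non-empty variable in priority order
  }
  return undefined; // no proxy configured: fetch directly
}
```

With the environment from the shell example, an https:// request to example.com would go through http://127.0.0.1:8080, while a request to any host under internal.example would be fetched directly.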
web_crawl accepts a proxy array and resolves a proxy before each page scrape. For example, five pages with proxy: ["a", "b", "c"] use a, b, c, a, b.
web_crawl({
url: "https://docs.example.com",
maxPages: 25,
proxy: [
"http://proxy-us.example:8080",
"socks5://127.0.0.1:9050"
]
})
Single-string proxy keeps one proxy for the crawl. Omitting proxy uses env-var auto-config when available, otherwise direct fetches.
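The rotation rule can be expressed in a few lines. proxyForPage is a hypothetical helper written for this example; it assumes simple round-robin by page index, which matches the a, b, c, a, b sequence described above.

```typescript
// Pick the proxy for the Nth page scrape (0-based) using round-robin rotation.
// Hypothetical helper, not pi-scraper's internal API.
function proxyForPage(proxy: string | string[] | undefined, pageIndex: number): string | undefined {
  if (proxy === undefined) return undefined;    // caller falls back to env vars or direct fetch
  if (typeof proxy === "string") return proxy;  // a single proxy is used for the whole crawl
  if (proxy.length === 0) return undefined;
  return proxy[pageIndex % proxy.length];
}

const rotation = [0, 1, 2, 3, 4].map((i) => proxyForPage(["a", "b", "c"], i));
// rotation is ["a", "b", "c", "a", "b"]
```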
Extract structured data using CSS selectors, XPath, or plain text search.
| Parameter | Description |
|---|---|
| selector | The CSS selector, XPath expression, or text to find. |
| attribute | Extract a specific attribute (e.g., href) instead of text. |
| adaptive | Enable relocation if the page layout changes. Fingerprint-based first, then text-anchor healing. |
| limit | Maximum elements to return. |
{
"url": "https://example.com/products",
"selector": ".product-card",
"identifier": "products-v1",
"autoSave": true,
"limit": 5
}

web_extract action=adhoc uses a model adapter to extract structured data from page text. After the LLM responds, pi-scraper post-processes the output to locate each extracted string in the cleaned source text.
Result shape:
{
"data": { "title": "Super Widget", "price": "$19.99" },
"grounded": [
{ "field": "title", "value": "Super Widget", "sourceSpan": { "start": 23, "end": 35 } },
{ "field": "price", "value": "$19.99", "sourceSpan": { "start": 43, "end": 49 } }
]
}

- sourceSpan: character offsets into the cleaned text the LLM consumed (exact, case-insensitive, or whitespace-collapsed match).
- sourceSpan: null means the value could not be verified in the source (not a failure; the field is still returned).
- Tool summary shows (verified/total fields source-grounded).
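The grounding step can be approximated with three progressively looser matches. The sketch below is an illustration under assumptions: locate and escapeRegExp are hypothetical helpers, and whitespace-collapsed matching is implemented here with a regular expression that allows any run of whitespace between words, which may not be how pi-scraper does it.

```typescript
interface SourceSpan { start: number; end: number }

// Escape characters that have special meaning in a regular expression.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Locate an extracted value in the source text: exact match first, then
// case-insensitive, then whitespace-tolerant. Returns null when not found.
function locate(value: string, text: string): SourceSpan | null {
  const trimmed = value.trim();
  if (trimmed === "") return null;
  const exact = text.indexOf(trimmed);
  if (exact !== -1) return { start: exact, end: exact + trimmed.length };
  const words = trimmed.split(/\s+/).map(escapeRegExp);
  const patterns = [escapeRegExp(trimmed), words.join("\\s+")];
  for (const pattern of patterns) {
    const m = new RegExp(pattern, "i").exec(text);
    if (m) return { start: m.index, end: m.index + m[0].length };
  }
  return null;
}
```

A null result corresponds to sourceSpan: null in the result shape: the field is still returned, just not counted as verified.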
For web_extract action=summarize or action=adhoc when no explicit adapter is provided:
- Falls back to the user's locally configured Pi model (OpenAI, Anthropic, Gemini, Bedrock, etc.) if no registered adapter is available.
- Uses lazy dynamic imports of @earendil-works/pi-ai, so users who only use deterministic scraping and crawling carry no extra install footprint.
- Also participates in the cross-extension pi:model-adapter/* event protocol so provider extensions can lend their LLM transport.
web_crawl is a high-concurrency crawler that supports pausing, resuming, and multiple path traversal strategies to build local datasets or context packages.
Configure how the crawler discovers and explores links using the strategy parameter:
- bfs (Breadth-First Search, default): Explores level by level (all links at depth 1, then depth 2, etc.). Best for general site scanning and sitemap building.
- dfs (Depth-First Search): Explores deep into a single branch (e.g., following nested subdirectories or article links) before backtracking. Well suited to systematically drilling down nested document files.
- best-first: Sorts and prioritizes links dynamically based on structural indicators (giving priority to documentation indexes, category pages, and main article hubs).
- TUI Progress Feedback: The live crawler progress bar and terminal TUI cards dynamically render the active strategy so you can monitor traversals.
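The three strategies differ mainly in how the next URL is taken from the frontier. The sketch below runs them over an in-memory link graph; crawlOrder, the graph shape, and the keyword-based score function are all assumptions for illustration, not pi-scraper's crawler or its actual structural indicators.

```typescript
type Strategy = "bfs" | "dfs" | "best-first";
type LinkGraph = Record<string, string[]>;

// Hypothetical scoring: prefer URLs that look like documentation indexes or hubs.
const score = (url: string): number => (/(docs|index|category)/.test(url) ? 1 : 0);

// Return the order in which pages would be visited, up to maxPages.
function crawlOrder(graph: LinkGraph, seed: string, strategy: Strategy, maxPages: number): string[] {
  const frontier: string[] = [seed];
  const seen = new Set<string>([seed]);
  const visited: string[] = [];
  while (frontier.length > 0 && visited.length < maxPages) {
    // best-first: keep the highest-scoring URL at the front of the queue.
    if (strategy === "best-first") frontier.sort((a, b) => score(b) - score(a));
    // dfs takes the newest URL (stack); bfs and best-first take the front (queue).
    const url = strategy === "dfs" ? frontier.pop() : frontier.shift();
    if (url === undefined) break;
    visited.push(url);
    for (const link of graph[url] ?? []) {
      if (!seen.has(link)) {
        seen.add(link);
        frontier.push(link);
      }
    }
  }
  return visited;
}
```

In this sketch bfs visits every depth-1 link before any depth-2 link, while dfs follows the most recently discovered branch to its end before backtracking.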
mode: "browser" uses CloakBrowser by default, a patched Chromium binary with 48 C++-level fingerprint patches.
| Backend | Default | Browser | Stealth level | Requirement |
|---|---|---|---|---|
| "cloak" | Yes | CloakBrowser Chromium 145 | C++ source-level (48 patches) | Bundled |
| "playwright" | No | Stock Playwright Chromium | JS page.evaluate() via stealth=true | npm install playwright |
CloakBrowser does not need stealth=true: all anti-detection patches (navigator.webdriver, canvas, WebGL, audio, fonts, GPU, screen, WebRTC, network timing) are applied at the C++ binary level, undetectable by any JS-level bot detection.
Test results from CloakBrowser:
- reCAPTCHA v3 score: 0.9 (human)
- Cloudflare Turnstile: PASS
- FingerprintJS: PASS
- BrowserScan: NORMAL (4/4)
- 30+ detection sites: passed
When using CloakBrowser with sessionId + saveSession=true:
web_scrape url="https://example.com" mode=browser sessionId="my-session" saveSession=true
CloakBrowser uses launchPersistentContext() which writes cookies, localStorage, and session state to a disk profile at ~/.pi/browser-sessions/<sessionId>/. This:
- Avoids incognito/private-mode detection (BrowserScan penalizes incognito by ~10%)
- Survives Pi restarts and process reloads
- Keeps login state across multiple scrape calls
To persist an authenticated login flow:
- Log in and Save the Session. Open the login page in browser mode. Specifying saveSession=true writes the cookies and session state to your local profile.

  web_scrape url="https://example.com/login" mode=browser sessionId="site-session" saveSession=true

- Scrape Authenticated Content. Subsequent calls using the same sessionId automatically inherit the authenticated state (cookies, local storage, etc.).

  web_scrape url="https://example.com/dashboard" mode=browser sessionId="site-session"

- Clear the Session when Done (Optional). Wipe the saved session and context from your local disk.

  web_scrape url="https://example.com" mode=browser sessionId="site-session" clearSession=true
| Option | Type | Description |
|---|---|---|
| timezone | string | IANA timezone (e.g. "America/New_York"). Set via binary flag (undetectable). |
| locale | string | BCP 47 locale (e.g. "en-US"). Set via --lang binary flag. |
| proxy | string | Browser backend proxy URL. Static fetch proxy schemes and env-var behavior are documented in the Proxy Configuration section above. |
These are safe to set even with the Playwright backend (ignored or applied via JS patches).
For well-known sites, pi-scraper uses optimized "vertical" extractors that hit APIs directly, bypassing slow HTML scraping.
| Vertical | Platforms / Sites | Extracted Data / Possibilities |
|---|---|---|
| GitHub Repo | GitHub | Metadata, README, File Tree, Languages, Topics. |
| GitHub Issue | GitHub | Issue body, comments, participants, labels, status. |
| GitHub PR | GitHub | Pull request body, diff stats, reviews, comments. |
| GitHub Release | GitHub | Release notes, tag info, assets, author metadata. |
| GitIngest | GitHub / gitingest.com | Prompt-friendly codebase digest, directory structure, and full-content download URL. |
| npm Package | npmjs.com | Manifest JSON, versions, dependencies, README. |
| PyPI Package | pypi.org | Package metadata, versions, author, description. |
| crates.io | crates.io | Rust crate metadata, versions, dependencies. |
| Docker Hub | hub.docker.com | Image metadata, tags, architectures, layers. |
| HF Model | huggingface.co | Model cards, metadata, files, community stats. |
| HF Dataset | huggingface.co | Dataset cards, configuration, metadata, previews. |
| Hacker News | ycombinator.com | Story/Comment trees via Firebase API. |
| arXiv | arxiv.org | Academic paper metadata and Atom feeds. |
| DeepWiki | deepwiki.io | Structured wiki content and metadata. |
| Docs Site | Docusaurus, RTD | Sections, sidebar navigation, and page metadata. |
| docstrings | TS/JS/Py/Rs | Exported symbols, types, and function signatures. |
| YouTube Metadata | youtube.com | Video title, views, channel name, duration, and description. |
| YouTube Transcriptions | youtube.com | Full transcripts in plain-text and timed segments. |
| YouTube Comments | youtube.com | Preview of top video comments and engagement stats. |
| Reddit Post | reddit.com | Post content, scoring, flairs, and author metadata. |
| Reddit Thread | reddit.com | Full nested comment trees (retains original thread depth). |
| Reddit List | reddit.com | Subreddit listings (hot/new/top) and search results. |
| OSS Analytics | ossinsight.io | Real-time repository metrics, stars, and contribution trends. |
| OSS Trending | ossinsight.io | Daily/weekly trending repositories and collections. |
| OSS Rankings | ossinsight.io | Collection-based rankings and ecosystem comparison data. |
// Get structured data for an npm package
web_extract({ action: "vertical", url: "https://www.npmjs.com/package/undici" })
// Get YouTube video metadata, transcript, and comment preview
web_extract({ action: "vertical", extractor: "youtube", url: "https://www.youtube.com/watch?v=arj7oStGLkU" })
Most verticals are backed by declarative YAML manifests in verticals/*.yaml. You can extend or override them without changing the public tool API: keep calling web_extract({ action: "vertical", ... }).
Manifest load order is layered:
- Built-ins: package verticals/*.yaml
- User overrides/additions: ~/.pi/scraper/verticals/*.{yaml,yml,json,jsonc}
- Project overrides/additions: .pi/scraper/verticals/*.{yaml,yml,json,jsonc}
A user or project manifest with the same name replaces the lower-priority manifest; a new name adds a new vertical. Run web_extract({ action: "list" }) to see each vertical's source, whether it is declarative, and whether it overrides another manifest.
See the bundled verticals/ folder for examples.
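The layering rule (built-ins, then user, then project, with same-name manifests replacing lower layers) can be sketched as a map keyed by name. The VerticalManifest and ResolvedManifest shapes and the layerManifests function are assumptions for illustration; real manifests carry many more fields.

```typescript
type ManifestSource = "builtin" | "user" | "project";

interface VerticalManifest { name: string; source: ManifestSource; kind: string }
interface ResolvedManifest extends VerticalManifest { overrides?: ManifestSource }

// Later layers win: built-ins first, then user manifests, then project manifests.
function layerManifests(
  builtin: VerticalManifest[],
  user: VerticalManifest[],
  project: VerticalManifest[],
): ResolvedManifest[] {
  const byName = new Map<string, ResolvedManifest>();
  for (const manifest of [...builtin, ...user, ...project]) {
    const previous = byName.get(manifest.name);
    // Record which layer was replaced so a listing can report the override.
    byName.set(manifest.name, previous ? { ...manifest, overrides: previous.source } : { ...manifest });
  }
  return Array.from(byName.values());
}
```

A manifest with a new name simply adds an entry, while one that reuses a name replaces the lower-priority definition and remembers where it came from.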
Minimal custom vertical example:
version: 1
name: my_docs
kind: api-json
description: Example docs metadata from a JSON endpoint.
urlPatterns:
  - https://docs.example.com/:slug+
request:
  urlTemplate: https://docs.example.com/api/pages/{{slug|encodePathSegments}}
extract:
  title: $.title
  updatedAt: $.updated_at
  summary: $.summary

Supported manifest styles include api-json, api-json-aggregate, api-json-chain, http-workflow, api-xml, selector, pattern, html-extract, text-extract, code-extract, and bounded recipe primitives.
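To show how a urlPatterns entry and a urlTemplate might fit together in the example manifest, here is a sketch of pattern matching and template expansion. The semantics are assumptions inferred from the names: :slug+ is treated as capturing one or more path segments, and encodePathSegments as percent-encoding each segment while keeping the slashes. matchPattern and expandTemplate are hypothetical helpers, not pi-scraper functions.

```typescript
// Assumed semantics: ":name+" captures one or more path segments.
function matchPattern(pattern: string, url: string): Record<string, string> | null {
  const names: string[] = [];
  const source = pattern
    .split(/(:[A-Za-z_]+\+)/)
    .map((part) => {
      const m = /^:([A-Za-z_]+)\+$/.exec(part);
      if (m && m[1] !== undefined) {
        names.push(m[1]);
        return "(.+)";
      }
      return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // literal text
    })
    .join("");
  const match = new RegExp(`^${source}$`).exec(url);
  if (!match) return null;
  const params: Record<string, string> = {};
  names.forEach((name, i) => {
    params[name] = match[i + 1] ?? "";
  });
  return params;
}

// Assumed filter: encode each path segment but keep the "/" separators.
const encodePathSegments = (value: string): string =>
  value.split("/").map((segment) => encodeURIComponent(segment)).join("/");

function expandTemplate(template: string, params: Record<string, string>): string {
  return template.replace(/\{\{(\w+)(?:\|(\w+))?\}\}/g, (_all: string, name: string, filter?: string) => {
    const raw = params[name] ?? "";
    return filter === "encodePathSegments" ? encodePathSegments(raw) : raw;
  });
}
```

Under these assumptions, https://docs.example.com/guides/intro yields slug = guides/intro, and the request goes to https://docs.example.com/api/pages/guides/intro.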
If you are asking an LLM/agent to create or override one, use this prompt:
Use the pi-scraper web-scraping skill's custom vertical manifest reference to create a YAML vertical for this site. Choose the right manifest style, place it in the correct project or user verticals folder, and verify it with web_extract.
Large results are stored automatically. You can retrieve them later using web_get_result.
| Data | Path |
|---|---|
| SQLite Index | ~/.pi/scraper/index.db |
| Payload Blobs | ~/.pi/scraper/blobs/ |
| Downloads | ~/.pi/scraper/downloads/ |
Add saveToFile: true to persist PDFs, images, or archives to disk.
{ "url": "https://arxiv.org/pdf/1706.03762", "saveToFile": true }

Control the fetch limit per request (default: 30 MB).

{ "url": "https://example.com/large.zip", "maxBytes": 104857600 }

Use web_map for fast discovery of a domain's structure without downloading full page bodies. It is an "inventory-only" tool.
What it discovers:
- robots.txt: Respects crawl delays and discovers sitemap links.
- Sitemaps: Automatically parses sitemap.xml and gzipped sitemaps.
- llms.txt: Finds specialized manifests designed for AI consumption.
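The robots.txt part of discovery can be sketched with a small parser that collects Sitemap lines and a crawl delay. parseRobots and the RobotsInfo shape are hypothetical, and the sketch simplifies real robots.txt handling: it ignores User-agent grouping and takes the first Crawl-delay it sees.

```typescript
interface RobotsInfo { sitemaps: string[]; crawlDelaySeconds: number | null }

// Collect Sitemap URLs and the first Crawl-delay from a robots.txt body.
// Simplified: User-agent groups and Allow/Disallow rules are not interpreted.
function parseRobots(body: string): RobotsInfo {
  const info: RobotsInfo = { sitemaps: [], crawlDelaySeconds: null };
  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "sitemap" && value !== "") info.sitemaps.push(value);
    if (field === "crawl-delay" && value !== "" && info.crawlDelaySeconds === null) {
      const seconds = Number(value);
      if (Number.isFinite(seconds)) info.crawlDelaySeconds = seconds;
    }
  }
  return info;
}
```

The sitemap URLs found this way would then be fetched and parsed to build the inventory, without downloading any page bodies.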
// Inventory all known URLs for a domain
{ "url": "https://example.com", "action": "inventory" }

- SSRF Protection: Built-in validation at the connect and redirect layers.
- Robots.txt: Full respect for site crawling rules (configurable).
- Memory Efficient: Large responses are streamed and stored locally.
- Incremental Enforcement: maxBytes limits are enforced during the stream.
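Enforcing maxBytes during the stream means counting bytes as chunks arrive and stopping as soon as the limit is crossed, rather than checking the size after the whole body is buffered. The sketch below shows the counting logic over a synchronous iterable for brevity; a real response body would be an asynchronous stream, and readWithLimit is a hypothetical helper.

```typescript
// Read chunks until the source ends or the running total exceeds maxBytes.
// Throws as soon as the limit is crossed, so later chunks are never pulled.
function readWithLimit(chunks: Iterable<Uint8Array>, maxBytes: number): Uint8Array {
  const parts: Uint8Array[] = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      throw new Error(`response exceeded maxBytes (${maxBytes})`);
    }
    parts.push(chunk);
  }
  // Concatenate the accepted chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.byteLength;
  }
  return out;
}
```

For reference, the maxBytes value 104857600 used earlier is 100 × 1024 × 1024 bytes.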
Use the /scrape-config slash command to manage your settings interactively or via the CLI:
/scrape-config status # View current settings
/scrape-config scrape-mode browser # Set default mode to browser
/scrape-config robots off # Disable robots.txt respect
/scrape-config cache clear # Wipe the local response cache

If you are contributing to or building on top of pi-scraper:
- Node.js: >=22.19.0
- Pi: >=0.74.0
npm install # Install dependencies
npm run typecheck # Verify types
npm test # Run unit tests
npm run test:tools # Run tool smoke tests

To use stock Playwright Chromium instead of CloakBrowser:
npm install playwright
npx playwright install chromium

web_scrape url="https://example.com" mode=browser browserBackend=playwright stealth=true
This project is licensed under the MIT License. See LICENSE for details.