diff --git a/skills/just-scrape/SKILL.md b/skills/just-scrape/SKILL.md index 18acea8..fa3b81e 100644 --- a/skills/just-scrape/SKILL.md +++ b/skills/just-scrape/SKILL.md @@ -1,12 +1,14 @@ --- name: just-scrape -description: "CLI tool for AI-powered web scraping, data extraction, search, and crawling via ScrapeGraph AI. Use when the user needs to scrape websites, extract structured data from URLs, convert pages to markdown, crawl multi-page sites, search the web for information, automate browser interactions (login, click, fill forms), get raw HTML, discover sitemaps, or generate JSON schemas. Triggers on tasks involving: (1) extracting data from websites, (2) web scraping or crawling, (3) converting webpages to markdown, (4) AI-powered web search with extraction, (5) browser automation, (6) generating output schemas for scraping. The CLI is just-scrape (npm package just-scrape)." +description: "CLI tool for AI-powered web scraping, data extraction, search, crawling, and page-change monitoring via ScrapeGraph AI (SDK v2). Use when the user needs to scrape webpages into one or more formats (markdown, html, screenshot, links, images, summary, branding, structured json), extract structured data from a URL with AI, search the web with optional AI extraction, crawl multi-page sites, monitor pages for changes on a schedule, or browse request history. The CLI is just-scrape (npm package just-scrape)." --- # Web Scraping with just-scrape AI-powered web scraping CLI by [ScrapeGraph AI](https://scrapegraphai.com). Get an API key at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com). +> **v1.0+ uses the scrapegraph-js v2 SDK.** The legacy commands `smart-scraper`, `search-scraper`, `markdownify`, `sitemap`, `agentic-scraper`, and `generate-schema` have been removed. Use `scrape --format ...` for multi-format output, `extract` for structured data, and `monitor` for page-change tracking. + ## Setup Always install or run the `@latest` version to ensure you have the most recent features and fixes. @@ -30,262 +32,266 @@ API key resolution order: `SGAI_API_KEY` env var → `.env` file → `~/.scrapeg | Need | Command | |---|---| -| Extract structured data from a known URL | `smart-scraper` | -| Search the web and extract from results | `search-scraper` | -| Convert a page to clean markdown | `markdownify` | +| Convert a page to markdown / HTML / screenshot / links / images / summary / branding | `scrape` | +| Extract structured JSON from a known URL with AI | `extract` (or `scrape --format json -p ...`) | +| Search the web (optionally extract from results) | `search` | | Crawl multiple pages from a site | `crawl` | -| Get raw HTML | `scrape` | -| Automate browser actions (login, click, fill) | `agentic-scraper` | -| Generate a JSON schema from description | `generate-schema` | -| Get all URLs from a sitemap | `sitemap` | -| Check credit balance | `credits` | +| Watch a page for changes on a schedule (cron / webhook) | `monitor` | | Browse past requests | `history` | -| Validate API key | `validate` | +| Check credit balance | `credits` | +| Validate API key / health check | `validate` | ## Common Flags All commands support `--json` for machine-readable output (suppresses banner, spinners, prompts). 
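+
+For example, to capture any command's output for scripting:
+
+```bash
+just-scrape scrape https://example.com --json | jq .
+```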
-Scraping commands share these optional flags:
-- `--stealth` — bypass anti-bot detection (+4 credits)
-- `--headers <json>` — custom HTTP headers as JSON string
-- `--schema <json>` — enforce output JSON schema
-
-## Commands
-
-### Smart Scraper
+Most scraping commands share these optional flags:
+- `--stealth` — bypass anti-bot detection
+- `--mode <mode>` (`-m`) — fetch mode (`js` for JS-heavy SPAs)
+- `--scrolls <n>` — infinite-scroll passes (0–100, where supported)
+- `--country <code>` — geo-target by ISO country code
+- `--headers <json>` / `--cookies <json>` — custom HTTP headers / cookies (where supported)
+- `--schema <json>` — enforce output JSON schema (for AI-extraction commands / `--format json`)
+- `--html-mode <mode>` — extraction mode for HTML/markdown output (e.g. `reader` for cleaner article extraction)

-Extract structured data from any URL using AI.
+## Output Formats (for `scrape` / `crawl` / `monitor`)

-```bash
-just-scrape smart-scraper <url> -p <prompt>
-just-scrape smart-scraper <url> -p <prompt> --schema <json>
-just-scrape smart-scraper <url> -p <prompt> --scrolls <n>   # infinite scroll (0-100)
-just-scrape smart-scraper <url> -p <prompt> --pages <n>     # multi-page (1-100)
-just-scrape smart-scraper <url> -p <prompt> --stealth       # anti-bot (+4 credits)
-just-scrape smart-scraper <url> -p <prompt> --cookies <json> --headers <json>
-just-scrape smart-scraper <url> -p <prompt> --plain-text
-```
+`--format` (`-f`) accepts a single format or a comma-separated list:

-```bash
-# E-commerce extraction
-just-scrape smart-scraper https://store.example.com/shoes -p "Extract all product names, prices, and ratings"
+`markdown`, `html`, `screenshot`, `branding`, `links`, `images`, `summary`, and (for `scrape` only) `json`.

-# Strict schema + scrolling
-just-scrape smart-scraper https://news.example.com -p "Get headlines and dates" \
-  --schema '{"type":"object","properties":{"articles":{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"date":{"type":"string"}}}}}}' \
-  --scrolls 5
+Default: `markdown`.

-# JS-heavy SPA behind anti-bot
-just-scrape smart-scraper https://app.example.com/dashboard -p "Extract user stats" \
-  --stealth
-```
+## Commands

-### Search Scraper
+### Scrape

-Search the web and extract structured data from results.
+Fetch a URL and return one or more formats.
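+
+For example, you can sanity-check what a multi-format response contains by listing its top-level keys (a sketch; the exact output shape isn't documented here):
+
+```bash
+just-scrape scrape https://example.com -f markdown,links --json | jq 'keys'
+```
+
+All options: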
```bash
-just-scrape search-scraper <query>
-just-scrape search-scraper <query> --num-results <n>   # sources to scrape (3-20, default 3)
-just-scrape search-scraper <query> --no-extraction     # markdown only (2 credits vs 10)
-just-scrape search-scraper <query> --schema <json>
-just-scrape search-scraper <query> --stealth --headers <json>
+just-scrape scrape <url>                      # markdown (default)
+just-scrape scrape <url> -f html
+just-scrape scrape <url> -f markdown,links,images
+just-scrape scrape <url> -f screenshot
+just-scrape scrape <url> -f branding          # logos, colors, fonts
+just-scrape scrape <url> -f summary
+just-scrape scrape <url> -f json -p "Extract all products"
+just-scrape scrape <url> -f json -p <prompt> --schema <json>
+just-scrape scrape <url> --html-mode reader   # cleaner article extraction
+just-scrape scrape <url> --mode js --stealth --scrolls 5
+just-scrape scrape <url> --country DE
```

```bash
-# Research across sources
-just-scrape search-scraper "Best Python web frameworks in 2025" --num-results 10
+# Page → markdown
+just-scrape scrape https://blog.example.com/article

-# Cheap markdown-only
-just-scrape search-scraper "React vs Vue comparison" --no-extraction --num-results 5
+# Multi-format in one call
+just-scrape scrape https://example.com -f markdown,html,links --json > page.json

-# Structured output
-just-scrape search-scraper "Top 5 cloud providers pricing" \
-  --schema '{"type":"object","properties":{"providers":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"free_tier":{"type":"string"}}}}}}'
+# Structured JSON via scrape (no separate extract call)
+just-scrape scrape https://store.example.com -f json \
+  -p "Extract all product names and prices" \
+  --schema '{"type":"object","properties":{"products":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}}}}}}'
+
+# JS-heavy SPA behind anti-bot
+just-scrape scrape https://app.example.com/dashboard --mode js --stealth
```

-### Markdownify
+### Extract

-Convert any webpage to clean markdown.
+Extract structured data from a URL using AI. Equivalent to `scrape -f json` but with a dedicated endpoint optimized for extraction.

```bash
-just-scrape markdownify <url>
-just-scrape markdownify <url> --stealth    # +4 credits
-just-scrape markdownify <url> --headers <json>
+just-scrape extract <url> -p <prompt>
+just-scrape extract <url> -p <prompt> --schema <json>
+just-scrape extract <url> -p <prompt> --scrolls <n>   # 0-100
+just-scrape extract <url> -p <prompt> --stealth --mode js
+just-scrape extract <url> -p <prompt> --cookies <json> --headers <json>
+just-scrape extract <url> -p <prompt> --html-mode reader
+just-scrape extract <url> -p <prompt> --country <code>
```

```bash
-just-scrape markdownify https://blog.example.com/my-article
-just-scrape markdownify https://protected.example.com --stealth
-just-scrape markdownify https://docs.example.com/api --json | jq -r '.result' > api-docs.md
+# E-commerce
+just-scrape extract https://store.example.com/shoes \
+  -p "Extract all product names, prices, and ratings"
+
+# Strict schema + scrolling
+just-scrape extract https://news.example.com -p "Get headlines and dates" \
+  --schema '{"type":"object","properties":{"articles":{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"date":{"type":"string"}}}}}}' \
+  --scrolls 5
+
+# Authenticated request via cookies
+just-scrape extract https://app.example.com/dashboard -p "Extract user stats" \
+  --cookies '{"session":"abc123"}' --stealth
```

-### Crawl
+### Search

-Crawl multiple pages and extract data from each.
+Search the web; optionally extract structured data from the results.
```bash
-just-scrape crawl <url> -p <prompt>
-just-scrape crawl <url> -p <prompt> --max-pages <n>       # default 10
-just-scrape crawl <url> -p <prompt> --depth <n>           # default 1
-just-scrape crawl <url> --no-extraction --max-pages <n>   # markdown only (2 credits/page)
-just-scrape crawl <url> -p <prompt> --schema <json>
-just-scrape crawl <url> -p <prompt> --rules <json>        # include_paths, same_domain
-just-scrape crawl <url> -p <prompt> --no-sitemap
-just-scrape crawl <url> -p <prompt> --stealth
+just-scrape search <query>                     # markdown by default
+just-scrape search <query> --num-results <n>   # 1-20, default 3
+just-scrape search <query> -p <prompt>         # AI extraction over results
+just-scrape search <query> -p <prompt> --schema <json>
+just-scrape search <query> --format html       # markdown (default) or html
+just-scrape search <query> --country us        # 2-letter geo code
+just-scrape search <query> --time-range past_week   # past_hour | past_24_hours | past_week | past_month | past_year
+just-scrape search <query> --stealth --headers <json>
```

```bash
-# Crawl docs site
-just-scrape crawl https://docs.example.com -p "Extract all code snippets" --max-pages 20 --depth 3
+# Plain web search, top 10 results
+just-scrape search "Best Python web frameworks in 2026" --num-results 10

-# Filter to blog pages only
-just-scrape crawl https://example.com -p "Extract article titles" \
-  --rules '{"include_paths":["/blog/*"],"same_domain":true}' --max-pages 50
+# Search + structured extraction
+just-scrape search "Top 5 cloud providers pricing" \
+  -p "Extract provider name and free-tier details" \
+  --schema '{"type":"object","properties":{"providers":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"free_tier":{"type":"string"}}}}}}'

-# Raw markdown, no AI extraction (cheaper)
-just-scrape crawl https://example.com --no-extraction --max-pages 10
+# Recent news only
+just-scrape search "AI regulation EU" --time-range past_week --country eu
```

-### Scrape
+### Crawl

-Get raw HTML content from a URL.
+Crawl pages starting from a URL. Runs as an async job that is polled until completion.

```bash
-just-scrape scrape <url>
-just-scrape scrape <url> --stealth        # +4 credits
-just-scrape scrape <url> --branding       # extract logos/colors/fonts (+2 credits)
-just-scrape scrape <url> --country-code <code>
+just-scrape crawl <url>
+just-scrape crawl <url> -f markdown,links
+just-scrape crawl <url> --max-pages <n>            # default 50, max 1000
+just-scrape crawl <url> --max-depth <n>            # default 2
+just-scrape crawl <url> --max-links-per-page <n>   # default 10
+just-scrape crawl <url> --allow-external           # follow off-domain links
+just-scrape crawl <url> --include-patterns <json>  # JSON array of regex strings
+just-scrape crawl <url> --exclude-patterns <json>
+just-scrape crawl <url> --mode js --stealth
```

```bash
-just-scrape scrape https://example.com
-just-scrape scrape https://store.example.com --stealth --country-code DE
-just-scrape scrape https://example.com --branding
-```
+# Crawl docs site to depth 3, get markdown
+just-scrape crawl https://docs.example.com --max-pages 50 --max-depth 3

-### Agentic Scraper
+# Same-domain crawl, blog only
+just-scrape crawl https://example.com \
+  --include-patterns '["^https://example\\.com/blog/.*"]' \
+  --exclude-patterns '[".*\\.pdf$"]' \
+  --max-pages 100

-Browser automation with AI — login, click, navigate, fill forms. Steps are comma-separated strings.
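+
+# Follow off-domain links too (a sketch; bound depth/pages to keep the crawl small)
+just-scrape crawl https://example.com --allow-external --max-depth 1 --max-pages 30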
-
-```bash
-just-scrape agentic-scraper <url> -s <steps>
-just-scrape agentic-scraper <url> -s <steps> --ai-extraction -p <prompt>
-just-scrape agentic-scraper <url> -s <steps> --schema <json>
-just-scrape agentic-scraper <url> -s <steps> --use-session   # persist browser session
-```
-
-```bash
-# Login + extract dashboard
-just-scrape agentic-scraper https://app.example.com/login \
-  -s "Fill email with user@test.com,Fill password with secret,Click Sign In" \
-  --ai-extraction -p "Extract all dashboard metrics"
-
-# Multi-step form
-just-scrape agentic-scraper https://example.com/wizard \
-  -s "Click Next,Select Premium plan,Fill name with John,Click Submit"
-
-# Persistent session across runs
-just-scrape agentic-scraper https://app.example.com \
-  -s "Click Settings" --use-session
+# Multi-format per page
+just-scrape crawl https://example.com -f markdown,links,images --max-pages 20
```

-### Generate Schema
+### Monitor

-Generate a JSON schema from a natural language description.
+Schedule a page to be re-scraped on a cron interval and (optionally) post diffs to a webhook.

-```bash
-just-scrape generate-schema <description>
-just-scrape generate-schema <description> --existing-schema <json>
-```
+Actions: `create`, `list`, `get`, `update`, `delete`, `pause`, `resume`, `activity`.

```bash
-just-scrape generate-schema "E-commerce product with name, price, ratings, and reviews array"
-
-# Refine an existing schema
-just-scrape generate-schema "Add an availability field" \
-  --existing-schema '{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}}}'
+just-scrape monitor create --url <url> --interval <interval> [--name <name>] [-f <formats>] [--webhook-url <url>] [--mode js] [--stealth]
+just-scrape monitor list
+just-scrape monitor get --id <id>
+just-scrape monitor update --id <id> [--name ...] [--interval ...] [-f ...] [--webhook-url ...]
+just-scrape monitor pause --id <id>
+just-scrape monitor resume --id <id>
+just-scrape monitor delete --id <id>
+just-scrape monitor activity --id <id> [--limit <n>] [--cursor <cursor>]   # max 100/page
```

-### Sitemap
-
-Get all URLs from a website's sitemap.
+`--interval` accepts a cron expression (`0 * * * *`) or a shorthand (`1h`, `30m`, `1d`).

```bash
-just-scrape sitemap <url>
-just-scrape sitemap https://example.com --json | jq -r '.urls[]'
+# Watch a pricing page hourly, alert via webhook
+just-scrape monitor create \
+  --url https://store.example.com/pricing \
+  --interval 1h \
+  --name "Pricing tracker" \
+  -f markdown \
+  --webhook-url https://hooks.example.com/pricing
+
+# Inspect recent ticks
+just-scrape monitor activity --id mon_abc123 --limit 50 --json | jq '.ticks[]'
+
+# Pause / resume / delete
+just-scrape monitor pause --id mon_abc123
+just-scrape monitor resume --id mon_abc123
+just-scrape monitor delete --id mon_abc123
```

### History

-Browse request history. Interactive by default (arrow keys to navigate, select to view details).
+Browse request history. Interactive by default (arrow keys to navigate, select to view details). Pass an ID after the service to view a specific request.

```bash
-just-scrape history                   # interactive browser
-just-scrape history <request-id>      # specific request
+just-scrape history                   # all services, interactive
+just-scrape history <service>         # filter by service
+just-scrape history <service> <id>    # specific request
just-scrape history --page <n>
-just-scrape history --page-size <n>   # max 100
+just-scrape history --page-size <n>   # default 20, max 100
just-scrape history --json
```

-Services: `markdownify`, `smartscraper`, `searchscraper`, `scrape`, `crawl`, `agentic-scraper`, `sitemap`
+Services: `scrape`, `extract`, `search`, `crawl`, `monitor`.
```bash -just-scrape history smartscraper -just-scrape history crawl --json --page-size 100 | jq '.requests[] | {id: .request_id, status}' +just-scrape history extract +just-scrape history crawl --json --page-size 100 | jq '.[] | {id, status}' +just-scrape history scrape req_abc123 --json ``` ### Credits & Validate ```bash just-scrape credits -just-scrape credits --json | jq '.remaining_credits' -just-scrape validate +just-scrape credits --json | jq '.remaining' +just-scrape validate # health check + key validation ``` ## Common Patterns -### Generate schema then scrape with it +### Pipe JSON for scripting ```bash -just-scrape generate-schema "Product with name, price, and reviews" --json | jq '.schema' > schema.json -just-scrape smart-scraper https://store.example.com -p "Extract products" --schema "$(cat schema.json)" +# Crawl, then re-extract structured data per page +just-scrape crawl https://example.com -f links --max-pages 20 --json \ + | jq -r '.pages[].url' \ + | while read url; do + just-scrape extract "$url" -p "Extract title and author" --json >> results.jsonl + done ``` -### Pipe JSON for scripting +### Multi-format snapshot ```bash -just-scrape sitemap https://example.com --json | jq -r '.urls[]' | while read url; do - just-scrape smart-scraper "$url" -p "Extract title" --json >> results.jsonl -done +just-scrape scrape https://example.com \ + -f markdown,html,screenshot,links,images,branding \ + --json > snapshot.json ``` -### Protected sites +### Authenticated / protected sites ```bash -# JS-heavy SPA behind Cloudflare -just-scrape smart-scraper https://protected.example.com -p "Extract data" --stealth +# Session cookie + custom headers +just-scrape extract https://app.example.com -p "Extract data" \ + --cookies '{"session":"abc123"}' \ + --headers '{"Authorization":"Bearer token"}' \ + --stealth -# With custom cookies/headers -just-scrape smart-scraper https://example.com -p "Extract data" \ - --cookies '{"session":"abc123"}' --headers '{"Authorization":"Bearer token"}' +# JS-heavy SPA +just-scrape scrape https://protected.example.com --mode js --stealth ``` -## Credit Costs - -| Feature | Extra Credits | -|---|---| -| `--stealth` | +4 per request | -| `--branding` (scrape only) | +2 | -| `search-scraper` extraction | 10 per request | -| `search-scraper --no-extraction` | 2 per request | -| `crawl --no-extraction` | 2 per page | - ## Environment Variables -```bash -SGAI_API_KEY=sgai-... # API key -JUST_SCRAPE_TIMEOUT_S=300 # Request timeout in seconds (default 120) -JUST_SCRAPE_DEBUG=1 # Debug logging to stderr -``` +| Variable | Description | Default | +|---|---|---| +| `SGAI_API_KEY` | ScrapeGraph API key | — | +| `SGAI_API_URL` | Override API base URL | `https://api.scrapegraphai.com/api/v2` | +| `SGAI_TIMEOUT` | Request timeout (seconds) | `120` | +| `SGAI_DEBUG` | Debug logging to stderr (`1` to enable) | `0` | + +Legacy aliases (still bridged for back-compat): `JUST_SCRAPE_API_URL` → `SGAI_API_URL`, `JUST_SCRAPE_TIMEOUT_S` / `SGAI_TIMEOUT_S` → `SGAI_TIMEOUT`, `JUST_SCRAPE_DEBUG` → `SGAI_DEBUG`.
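+
+For example (the key value is a placeholder):
+
+```bash
+export SGAI_API_KEY=sgai-...
+SGAI_DEBUG=1 SGAI_TIMEOUT=300 just-scrape scrape https://example.com --json
+```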