diff --git a/user_guide/05-chatlas-integration.qmd b/user_guide/05-chatlas-integration.qmd new file mode 100644 index 0000000..cc61aa9 --- /dev/null +++ b/user_guide/05-chatlas-integration.qmd @@ -0,0 +1,313 @@ +--- +title: "Using RAG with chatlas" +guide-section: "Getting Started" +--- + +While raghilda builds the knowledge store, [chatlas](https://posit-dev.github.io/chatlas/) can handle the conversation part. The integration point between the two is a Python function that you register as a tool with chatlas. When the LLM decides it needs information from your store, it calls that function, receives the relevant chunks, and incorporates them into its answer. + +This page will walk you through the pattern step by step. It assumes you already have a populated store (look over [Core Concepts](00-getting-started.qmd) or [Crawling and Ingestion](04-crawling-and-ingestion.qmd) if you need to build one first). + +## Connecting to a store + +Let's start by connecting to an existing store (a `DuckDBStore`). Using `.connect(read_only=True)` is recommended when the store is only used for retrieval: + +```{python} +#| eval: false +from raghilda.store import DuckDBStore + +store = DuckDBStore.connect("quarto_docs.db", read_only=True) +print(f"Store contains {store.size()} documents") +``` + +Any raghilda store backend works here: `DuckDBStore`, `ChromaDBStore`, or `OpenAIStore`. The rest of the code is identical regardless of the backend. + +## Defining a search tool + +chatlas discovers tools through plain Python functions. The function's docstring and type hints tell the model what the tool does and what arguments it accepts. A retrieval tool might look like this: + +```{python} +#| eval: false +import json + +def search_docs(query: str, num_results: int = 5) -> str: + """ + Search the documentation for relevant information. + + Parameters + ---------- + query + A description of what to look for. + num_results + The number of relevant passages to return (default of `5`). 
+ """ + chunks = store.retrieve(query, top_k=num_results, deoverlap=True) + return json.dumps( + [{"text": chunk.text, "context": chunk.context} for chunk in chunks] + ) +``` + +There are a few things we should take note of: + +- The function captures the `store` variable from the surrounding scope. This is a normal Python closure: as long as `store` is defined before the function is called, the reference works. +- The docstring is sent to the model as part of the tool description. Write it for the LLM: be specific about when the tool should be used and what `query=` should contain. +- The return value must be a string because LLM tool-calling APIs transmit results as text. JSON works really well here because it preserves structure without requiring the model to parse anything unusual. +- `deoverlap=True` (the default) merges overlapping chunks from the same document so the model receives coherent passages rather than repetitive fragments. + +The goal is a function that returns enough context for the model to answer accurately, but not so much that it drowns the prompt in noise. Start with a simple version like the one above and refine the docstring and return format once you can observe how the model uses the results. + +## Registering the tool and chatting + +Pass the function to `chat.register_tool()`. After registration, the model can call it whenever it determines that retrieval would help answer a prompt: + +```{python} +#| eval: false +from chatlas import ChatOpenAI + +chat = ChatOpenAI( + model="gpt-5.5", + system_prompt=( + "You are a helpful assistant that answers questions about Quarto. " + "Use the search_docs tool to find relevant information before answering." 
+ ), +) +chat.register_tool(search_docs) + +chat.chat("How do I add citations to a Quarto document?") +``` + +When you call `.chat()`, chatlas sends the prompt to the model, displays any tool calls the model makes (including the query it passes to your function), and then streams the final answer to the terminal. You see the full round trip without needing to wire up any display logic yourself. + +The system prompt matters. Instructing the model to use the tool before answering reduces the chance that it falls back on its training data alone. + +## Interactive and programmatic use + +chatlas provides several ways to consume responses depending on context. + +**Console mode** for interactive exploration: + +```{python} +#| eval: false +chat.console() +``` + +This opens a REPL where you can ask questions and see tool calls in real time. Type `exit` or press `Ctrl+C` to quit. + +**Streaming** for applications that display output incrementally: + +```{python} +#| eval: false +for chunk in chat.stream("What formats does Quarto support?"): + print(chunk, end="", flush=True) +``` + +**Async** for concurrent workloads (note that `await` requires an `async def` context, so this form is typically used inside an async framework like FastAPI or an `asyncio.run()` entrypoint): + +```{python} +#| eval: false +response = await chat.chat_async("How do I create a Quarto presentation?") +print(response) +``` + +All three modes use the same registered tools and conversation history. The choice depends on where your code runs: `.console()` for quick experimentation in a terminal, `.stream()` for user-facing applications where perceived latency matters, and `.chat_async()` for server-side code that handles multiple requests concurrently. + +## Tailoring retrieval to the tool's purpose + +The tool function is where you control retrieval quality. 
Here are adjustments worth considering: + +Every `RetrievedChunk` carries an `.origin` attribute that records where the chunk came from (typically a URL or file path). Including it in the JSON response lets the model cite its sources when answering: + +```{python} +#| eval: false +def search_docs(query: str, num_results: int = 5) -> str: + """Search the documentation for relevant information.""" + chunks = store.retrieve(query, top_k=num_results, deoverlap=True) + return json.dumps([ + { + "text": chunk.text, + "context": chunk.context, + "source": chunk.origin, + } + for chunk in chunks + ]) +``` + +Adding `"source": chunk.origin` to the returned dictionary is all it takes. Once the model sees URLs or paths alongside the text, it can reference them in its answer without any additional prompting. + +When a store indexes content from multiple sources or sections, you can pass an `attributes_filter=` argument to `retrieve()` to restrict results to a subset. The filter uses a SQL-like expression (`"section = 'guide'"`) that matches against the attributes defined in your store's schema: + +```{python} +#| eval: false +def search_guides(query: str) -> str: + """Search only the user guide section of the documentation.""" + chunks = store.retrieve( + query, + top_k=5, + attributes_filter="section = 'guide'", + ) + return json.dumps([{"text": chunk.text} for chunk in chunks]) +``` + +Here only chunks whose `section` attribute equals `'guide'` are considered. This keeps retrieval focused and avoids pulling in, for example, API reference text when the user asks a conceptual question. See [Attribute Filters](03-attribute-filters.qmd) for more on defining and using attribute schemas. + +You can also register several tool functions on the same chat, each backed by a different filter or even a different store. 
The model decides which tool to invoke based on the docstrings, so give each function a clear description of what it covers: + +```{python} +#| eval: false +def search_api_reference(query: str) -> str: + """Search the API reference for function signatures and parameters.""" + chunks = store.retrieve( + query, + top_k=3, + attributes_filter="section = 'reference'", + ) + return json.dumps([{"text": chunk.text} for chunk in chunks]) + +def search_tutorials(query: str) -> str: + """Search the tutorials for step-by-step instructions and examples.""" + chunks = store.retrieve( + query, + top_k=5, + attributes_filter="section = 'tutorial'", + ) + return json.dumps([{"text": chunk.text} for chunk in chunks]) + +chat.register_tool(search_api_reference) +chat.register_tool(search_tutorials) +``` + +With two tools registered, a question like `"What arguments does ChatOpenAI accept?"` will typically route to `search_api_reference`, while `"How do I set up streaming in a Shiny app?"` will typically route to `search_tutorials`. The model makes the choice on each turn, and you can observe which tool it selects by watching the tool-call display in `.chat()` or `.console()`. + +None of these adjustments require any changes to chatlas itself. The retrieval logic lives entirely in your tool functions, which means you can iterate on what gets returned, how many results to include, and how to filter without touching the chat configuration. That separation is deliberate: it keeps the conversational layer stable while you tune retrieval independently. + +## Choosing a model provider + +Because the retrieval logic lives in a plain Python function, the choice of model provider is independent of raghilda. chatlas supports hosted APIs, cloud platforms, and local inference servers. The tool registration interface is the same in every case.
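A minimal sketch of what a provider-independent setup can look like: a helper that registers retrieval tools on whatever chat object it is handed. `SupportsTools`, `attach_tools`, and `RecordingChat` are hypothetical names introduced for illustration; only the `register_tool()` method name comes from chatlas, and the stand-in class exists so the example runs without an API key.

```python
from typing import Callable, Iterable, Protocol, TypeVar


class SupportsTools(Protocol):
    """Anything exposing a chatlas-style register_tool() method (hypothetical name)."""

    def register_tool(self, func: Callable[..., str]) -> None: ...


C = TypeVar("C", bound=SupportsTools)


def attach_tools(chat: C, tools: Iterable[Callable[..., str]]) -> C:
    """Register every tool on the given chat object and return that same object."""
    for tool in tools:
        chat.register_tool(tool)
    return chat


class RecordingChat:
    """Offline stand-in for a provider class; it only records tool names."""

    def __init__(self) -> None:
        self.tool_names: list[str] = []

    def register_tool(self, func: Callable[..., str]) -> None:
        self.tool_names.append(func.__name__)


def search_docs(query: str, num_results: int = 5) -> str:
    """Placeholder retrieval tool; a real one would call store.retrieve()."""
    return "[]"


chat = attach_tools(RecordingChat(), [search_docs])
print(chat.tool_names)
```

In real code the first argument would be a chatlas chat object such as `ChatOpenAI(...)`; the helper itself never needs to know which provider it received.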
+ +Anthropic's Claude models tend to follow tool-calling instructions closely and produce well-structured answers: + +```{python} +#| eval: false +from chatlas import ChatAnthropic + +chat = ChatAnthropic(model="claude-opus-4-8") +chat.register_tool(search_docs) +``` + +Google's Gemini models offer a generous free tier, which is useful for prototyping before committing to a paid API: + +```{python} +#| eval: false +from chatlas import ChatGoogle + +chat = ChatGoogle(model="gemini-3.5-flash") +chat.register_tool(search_docs) +``` + +Ollama runs models locally, so nothing leaves your machine. This matters when the store contains proprietary or sensitive material: + +```{python} +#| eval: false +from chatlas import ChatOllama + +chat = ChatOllama(model="Llama-3.3-8B-Instruct") +chat.register_tool(search_docs) +``` + +The [chatlas model choice documentation](https://posit-dev.github.io/chatlas/get-started/models.html) lists all available providers. Switching between them requires changing only the constructor call; the registered tools, system prompt, and conversation history carry over if you assign them to a new chat object. + +## A full example + +The following script builds a store from a documentation site and starts an interactive RAG chat session. It reuses an existing store if one is already present. 
+ +```{python} +#| eval: false +from pathlib import Path + +from chatlas import ChatOpenAI + +from raghilda.chunker import MarkdownChunker +from raghilda.crawl import CrawlScope, WebCrawler +from raghilda.embedding import EmbeddingOpenAI +from raghilda.store import DuckDBStore + +DB_PATH = Path("chatlas_docs.db") + + +def build_store() -> DuckDBStore: + store = DuckDBStore.create( + location=str(DB_PATH), + embed=EmbeddingOpenAI(), + name="chatlas_docs", + title="Chatlas Documentation", + overwrite=True, + ) + crawler = WebCrawler(cache_dir=True, max_workers=4) + scope = CrawlScope( + roots=["https://posit-dev.github.io/chatlas/"], + depth=1, + include_types=["html"], + ) + chunker = MarkdownChunker() + summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, + ) + store.build_index() + print(f"Indexed {summary.inserted} documents") + return store + + +def get_store() -> DuckDBStore: + if DB_PATH.exists(): + return DuckDBStore.connect(str(DB_PATH), read_only=True) + return build_store() + + +def main(): + import json + + store = get_store() + + def search_chatlas_docs(query: str, num_results: int = 5) -> str: + """ + Search the chatlas documentation. + + Use this tool when the user asks about chatlas features, + API usage, model providers, tool calling, or streaming. + + Parameters + ---------- + query + A description of what to look for. + num_results + Number of passages to return (default of 5). + """ + chunks = store.retrieve(query, top_k=num_results, deoverlap=True) + return json.dumps( + [{"text": chunk.text, "context": chunk.context} for chunk in chunks] + ) + + chat = ChatOpenAI( + model="gpt-5.5", + system_prompt=( + "You answer questions about the chatlas Python library. " + "Always use the search tool before answering." 
+ ), + ) + chat.register_tool(search_chatlas_docs) + chat.console() + + +if __name__ == "__main__": + main() +``` + +This script separates store construction from chat setup so the expensive indexing step only runs once. On subsequent runs it reconnects to the existing database and goes straight to the interactive session. The same structure works for any documentation site or local file collection: swap the `CrawlScope` roots and adjust the system prompt to match your domain. + +## Next steps + +- The [Core Concepts](00-getting-started.qmd) guide covers building a store from scratch. +- The [Chunking](02-chunking.qmd) guide explains how to tune chunk size and overlap for better retrieval quality. +- The [Attribute Filters](03-attribute-filters.qmd) guide shows how to scope retrieval by metadata. +- The [chatlas documentation](https://posit-dev.github.io/chatlas/get-started/tools.html) has more detail on tool calling, streaming, and structured output. diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd new file mode 100644 index 0000000..878aaab --- /dev/null +++ b/user_guide/52-cloudflare-crawler.qmd @@ -0,0 +1,439 @@ +--- +title: "Cloudflare Browser Rendering Crawler" +guide-section: "Store Backends" +--- + +The `CloudflareCrawler` delegates all page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/). If you have a Cloudflare account, this provides several benefits: + +- you offload potentially very long-running crawling tasks, and possibly Markdown conversion, from the local host to Cloudflare's servers +- websites that load their content entirely through JavaScript (which generally makes text retrieval problematic) are handled by Cloudflare +- you don't have to run a headless browser locally; you simply receive ready-to-chunk text from the paid service + +This guide covers how to set up and use `CloudflareCrawler` to build a store.
It assumes some familiarity with the general crawling workflow described in [Crawling and Ingestion](04-crawling-and-ingestion.qmd). + +## Prerequisites + +`CloudflareCrawler` requires two credentials from a Cloudflare account that has the Browser Rendering API enabled: + +- Account ID: found on your Cloudflare dashboard under the account overview page +- API Token: a bearer token with permission to access the Browser Rendering API + +You should store both as environment variables rather than hardcoding them in scripts: + +```{python} +#| eval: false +import os + +account_id = os.environ["CLOUDFLARE_ACCOUNT_ID"] +api_token = os.environ["CLOUDFLARE_API_TOKEN"] +``` + +If the credentials are missing or the account does not have Browser Rendering enabled, the crawler will raise an error on the first API call. + +## Basic usage + +A complete crawl-to-store pipeline needs four big pieces: + +1. a store to write into +2. a crawler to fetch and render pages +3. a scope describing which pages to include +4. a chunker to split the rendered Markdown into retrieval-sized pieces + +You also need to decide on an embedding provider for the store (or defer that by passing `embed=None` if you plan to add embeddings later). + +The example below uses all of the defaults for the crawler beyond the required credentials. That essentially means that: + +- browser rendering is on +- all discovery methods are active +- no caching is configured +- no filtering patterns are applied + +This is a pretty reasonable starting point for an initial exploration of a site before tuning scope and caching for production use. + +```{python} +#| eval: false +import os + +from raghilda.chunker import MarkdownChunker +from raghilda.crawl import CloudflareCrawler, CrawlScope +from raghilda.embedding import EmbeddingOpenAI +from raghilda.store import DuckDBStore + +# 1. 
Create a store with an embedding provider +store = DuckDBStore.create( + location="rendered_docs.db", + embed=EmbeddingOpenAI(), + name="rendered_docs", + title="Rendered Documentation", + overwrite=True, +) + +# 2. Set up the crawler with Cloudflare credentials +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], +) + +# 3. Define the crawl scope +scope = CrawlScope( + roots=["https://example.com/docs/"], + depth=2, +) + +# 4. Chunk and ingest the crawled pages +chunker = MarkdownChunker() + +summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, +) + +# 5. Build retrieval indexes +store.build_index() + +print(summary) +``` + +``` +IngestSummary(inserted=47, replaced=0, skipped=0) +``` + +This posts a crawl job to Cloudflare, polls until it completes, retrieves the rendered Markdown for each discovered page, chunks it, and writes everything to the store. The `depth=2` setting tells Cloudflare to follow links up to two levels from the root URL. + +The `store.build_index()` call at the end creates an HNSW vector index and a BM25 keyword index over the stored chunks. These indexes make subsequent `retrieve()` calls fast. Building them after all documents are ingested is more efficient than updating them incrementally during writes, which is why it appears as a separate step at the end. + +The printed `IngestSummary` reports how many documents were newly added, how many had their content replaced, and how many were skipped because identical content was already in the store. + +## How it differs from WebCrawler + +`WebCrawler` fetches raw HTML with `requests` and converts it to Markdown locally. `CloudflareCrawler` offloads both the fetching and the Markdown conversion to Cloudflare's infrastructure. The practical differences are: + +- JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. 
Single-page applications, dynamically loaded documentation sites, and client-rendered dashboards all work without additional configuration. +- No local browser required: you do not need Playwright, Selenium, or any headless browser installed. +- Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM. +- Markdown can arrive pre-converted: the API can return Markdown directly, with HTML-to-Markdown conversion performed by Cloudflare. This is optional: you can still choose to receive raw HTML and then perform HTML-to-Markdown conversion locally using raghilda's built-in converter or another converter of your choice. + +The tradeoff is that `CloudflareCrawler` requires a Cloudflare account with Browser Rendering access and incurs API usage costs. For static HTML sites, `WebCrawler` is simpler (and free). + +## Browser rendering + +The `render=` parameter controls whether Cloudflare executes JavaScript before extracting page content. It defaults to `True`, which is the right choice for any site that relies on client-side rendering. When enabled, Cloudflare loads each page in a headless browser, waits for scripts to finish, and extracts text from the fully populated DOM. + +You can set it explicitly for clarity: + +```{python} +#| eval: false +# JavaScript is executed (default) +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + render=True, +) +``` + +Set `render=False` if the target site is server-rendered HTML and does not need JavaScript execution.
This reduces crawl time and Cloudflare API usage: + +```{python} +#| eval: false +# Skip JavaScript execution for static sites +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + render=False, +) +``` + +If you are unsure whether a site needs rendering, start with the default (`render=True`) and inspect a few pages. If the returned Markdown already contains the expected content, switching to `render=False` will speed up the crawl and reduce your Cloudflare API usage without losing any text. + +## Page discovery with the source parameter + +The `source=` parameter controls how Cloudflare discovers pages on the target site. The default is `"all"`, which combines multiple discovery methods: + +```{python} +#| eval: false +# Use all available discovery methods (default) +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + source="all", +) +``` + +Other options for `source=` are: + +- `"sitemap"`: only discover pages listed in the site's `sitemap.xml`. This is efficient for well-maintained sites where the sitemap is comprehensive. +- `"crawl"`: follow links from the rendered DOM, similar to traditional web crawling but with JavaScript execution. +- `"urls"`: only process the explicitly provided root URLs, without following any links. + +For a documentation site with a complete sitemap, `"sitemap"` is typically the fastest option because it avoids rendering every page just to find links: + +```{python} +#| eval: false +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + source="sitemap", +) + +scope = CrawlScope( + roots=["https://example.com/docs/"], + depth=0, +) +``` + +With `source="sitemap"` and `depth=0`, Cloudflare reads the sitemap from the root and returns all listed pages without further link-following. 
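One way to read the discovery options is as a small decision rule. The sketch below is a hypothetical helper (`pick_source` is not part of raghilda) that maps two facts about a site to one of the `source=` values described in this section; the ordering of the checks is an assumption for illustration, not library behavior.

```python
from typing import Literal

# The four values this guide documents for the source= parameter.
Source = Literal["all", "sitemap", "crawl", "urls"]


def pick_source(follow_links: bool, has_complete_sitemap: bool) -> Source:
    """Suggest a source= value from two facts about the target site."""
    if not follow_links:
        # Only the explicitly listed root URLs are wanted.
        return "urls"
    if has_complete_sitemap:
        # The sitemap already enumerates the pages, so link discovery is unnecessary.
        return "sitemap"
    # Otherwise keep the documented default, which combines discovery methods.
    return "all"


print(pick_source(follow_links=True, has_complete_sitemap=True))
```

The returned string can be passed to the crawler as `CloudflareCrawler(..., source=pick_source(...))`.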
+ +## Filtering with Cloudflare-style patterns + +The `include_patterns=` and `exclude_patterns=` fields on `CrawlScope` behave differently depending on which crawler you use. With `WebCrawler`, they are Python regular expressions matched locally. With `CloudflareCrawler`, they are forwarded directly to the Cloudflare API as Cloudflare wildcard patterns, where `**` matches any number of path segments: + +```{python} +#| eval: false +scope = CrawlScope( + roots=["https://example.com/"], + depth=2, + include_patterns=["https://example.com/docs/**"], + exclude_patterns=[ + "https://example.com/docs/archive/**", + "https://example.com/docs/internal/**", + ], + limit=500, +) +``` + +The `include_patterns` list restricts the crawl to URLs matching at least one pattern. The `exclude_patterns` list removes URLs that match any pattern. The `limit=` field caps the total number of pages returned. + +Two additional scope fields are relevant for Cloudflare crawls: + +- `include_external_links=True` allows the crawler to follow links to other domains. +- `include_subdomains=True` allows the crawler to follow links to subdomains of the root host (e.g., `docs.example.com` when crawling from `example.com`). + +Both default to `False`, which keeps the crawl focused on the root domain. + +## Caching + +`CloudflareCrawler` accepts the same caching parameters as `WebCrawler`, though the underlying behavior differs slightly because results come from Cloudflare's API rather than direct HTTP requests. 
Enable caching with `cache_dir=True` to store results under `.raghilda/cache/cloudflare`, or pass a custom path: + +```{python} +#| eval: false +from datetime import timedelta + +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + cache_stale_after=timedelta(days=1), +) +``` + +With caching enabled, the crawler stores both the crawl job results (list of discovered URLs and their rendered Markdown) and the individual page records on disk. On subsequent runs, fresh cached results are reused without making any Cloudflare API calls. + +The `cache_stale_after=` parameter controls when cached results are considered stale. When a cached entry is stale, the crawler sends a new request to Cloudflare with a `maxAge` hint asking for updated content. When `cache_stale_after` is not set, cached entries never expire. + +To force a completely fresh crawl that bypasses the cache entirely, pass `cache_force_refresh=True` to `markdown_documents()`: + +```{python} +#| eval: false +# Ignore all cached results, re-crawl everything +documents = crawler.markdown_documents(scope, cache_force_refresh=True) +``` + +The cache validates its entries against a signature that includes the `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused. + +## Incremental updates with `modified_since=` + +For stores that need regular updates, the `modified_since=` parameter restricts the crawl to pages modified after a given Unix timestamp. 
This reduces the number of pages Cloudflare processes on each run: + +```{python} +#| eval: false +import time + +# Only include pages modified in the last 7 days +one_week_ago = int(time.time()) - (7 * 24 * 60 * 60) + +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + modified_since=one_week_ago, +) +``` + +Combined with `store.ingest()`, this gives you a lightweight refresh job where only recently changed pages are fetched and upserted (while unchanged documents are skipped by the store's own deduplication logic). + +## Polling behavior + +Cloudflare processes crawl jobs asynchronously. After submitting a job, the crawler polls for completion at regular intervals. Two parameters control this behavior: + +- `poll_interval=5.0`: seconds to wait between status checks. The default of 5 seconds is reasonable for most jobs. +- `max_poll_attempts=60`: maximum number of polls before raising a `TimeoutError`. With the default interval, this gives a 5-minute window (60 attempts at 5 seconds each). + +For large sites that take longer to crawl, widen the polling window by raising either value: + +```{python} +#| eval: false +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + poll_interval=10.0, + max_poll_attempts=120, +) +``` + +This configuration waits up to 20 minutes (120 attempts at 10 seconds each) for a crawl job to finish. + +## Inspecting discovered pages + +Before committing to a full ingest of a large site, you may want to spot-check what Cloudflare discovers and returns. `CloudflareCrawler` exposes lower-level methods for this kind of inspection.
Use `origins()` to see what pages Cloudflare finds without converting them into documents: + +```{python} +#| eval: false +for origin in crawler.origins(scope): + print(origin) +``` + +Use `fetch_raw()` to retrieve the full `FetchedSource` for a single page, which includes metadata like the HTTP status code and content type: + +```{python} +#| eval: false +source = crawler.fetch_raw("https://example.com/docs/getting-started") +print(f"Status: {source.status_code}") +print(f"Fetched at: {source.fetched_at}") +print(f"Body at: {source.body_path}") +``` + +Use `fetch_markdown()` to get a single `MarkdownDocument`: + +```{python} +#| eval: false +doc = crawler.fetch_markdown("https://example.com/docs/getting-started") +print(doc.content[:500]) +``` + +These methods are useful for debugging scope configuration or inspecting what Cloudflare returns before committing to a full ingest. + +## Full example + +The following script builds a store from a JavaScript-rendered documentation site. It uses caching so that repeated runs during development avoid redundant API calls, and it sets a one-day staleness window for production refresh jobs. 
+ +```{python} +#| eval: false +import os +from datetime import timedelta +from pathlib import Path + +from raghilda.chunker import MarkdownChunker +from raghilda.crawl import CloudflareCrawler, CrawlScope +from raghilda.embedding import EmbeddingOpenAI +from raghilda.store import DuckDBStore + +DB_PATH = Path("spa_docs.db") + + +def build_store() -> DuckDBStore: + store = DuckDBStore.create( + location=str(DB_PATH), + embed=EmbeddingOpenAI(), + name="spa_docs", + title="SPA Documentation", + overwrite=True, + ) + + crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + cache_stale_after=timedelta(days=1), + render=True, + source="all", + max_workers=4, + ) + + scope = CrawlScope( + roots=["https://my-spa-docs.example.com/"], + depth=2, + include_patterns=["https://my-spa-docs.example.com/**"], + exclude_patterns=["https://my-spa-docs.example.com/internal/**"], + limit=1000, + ) + + chunker = MarkdownChunker(chunk_size=1600, target_overlap=0.5) + + summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, + ) + + store.build_index() + + print(f"Inserted: {summary.inserted}") + print(f"Replaced: {summary.replaced}") + print(f"Skipped: {summary.skipped}") + + return store + + +def refresh_store() -> None: + store = DuckDBStore.connect(str(DB_PATH)) + + crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + cache_stale_after=timedelta(days=1), + render=True, + max_workers=4, + ) + + scope = CrawlScope( + roots=["https://my-spa-docs.example.com/"], + depth=2, + include_patterns=["https://my-spa-docs.example.com/**"], + limit=1000, + ) + + chunker = MarkdownChunker(chunk_size=1600, target_overlap=0.5) + + summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, + ) + + print(f"Refresh complete: 
{summary.inserted} new, {summary.replaced} updated, {summary.skipped} unchanged") + + +if __name__ == "__main__": + if DB_PATH.exists(): + refresh_store() + else: + build_store() +``` + +The script has two paths: an initial build that creates the store from scratch, and a refresh path that reconnects and upserts only changed content. The cache and the store's own deduplication (`skip_if_unchanged=True` in `upsert()`) work together to keep refresh runs fast: the cache avoids re-fetching pages whose cached entries are still fresh, and the store avoids re-computing embeddings for documents that are identical to what is already stored. + +## When to use CloudflareCrawler + +Choosing between `CloudflareCrawler` and `WebCrawler` comes down to whether the target site needs JavaScript to produce its content and whether you are willing to depend on an external service. Neither crawler is strictly better; they serve different situations. + +Use `CloudflareCrawler` when: + +- the target site renders content with JavaScript (React, Vue, Angular, etc.) +- you need Cloudflare's sitemap-aware discovery rather than manual link following +- you want pre-converted Markdown without running local conversion logic +- the site is large enough that Cloudflare's distributed infrastructure crawls it faster than concurrent local requests + +Use `WebCrawler` when: + +- the site is static HTML that does not require JavaScript execution +- you want to avoid external API dependencies and costs +- you need fine-grained control over HTTP headers, cookies, or authentication during fetching +- the crawl is small enough that local `requests` calls finish quickly + +Both crawlers implement the same interface (`origins()`, `fetch_raw()`, `fetch_markdown()`, `markdown_documents()`), so switching between them usually requires changing only the constructor. The one exception is `include_patterns=` and `exclude_patterns=` on `CrawlScope`: they are regular expressions with `WebCrawler` and Cloudflare wildcard patterns with `CloudflareCrawler`, so any patterns need translating. The rest of your pipeline (chunking, ingestion, retrieval) remains unchanged.
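A rough sketch of how pipeline code can depend on the shared method names rather than on a specific crawler class. The `Crawler` protocol, `preview_origins`, and `FakeCrawler` are hypothetical names for illustration, and the method signatures are assumptions; only the four method names are taken from this guide.

```python
from typing import Any, Iterable, Iterator, Protocol


class Crawler(Protocol):
    """The four shared methods named in this guide; signatures are assumed."""

    def origins(self, scope: Any) -> Iterable[str]: ...
    def fetch_raw(self, url: str) -> Any: ...
    def fetch_markdown(self, url: str) -> Any: ...
    def markdown_documents(self, scope: Any) -> Iterable[Any]: ...


def preview_origins(crawler: Crawler, scope: Any, limit: int = 5) -> list[str]:
    """Return the first few discovered origins, whichever crawler is passed in."""
    found: list[str] = []
    for origin in crawler.origins(scope):
        if len(found) >= limit:
            break
        found.append(str(origin))
    return found


class FakeCrawler:
    """Offline stand-in so the sketch runs without raghilda or network access."""

    def origins(self, scope: Any) -> Iterator[str]:
        # Pretend ten pages were discovered under the given root.
        yield from (f"{scope}page-{i}" for i in range(10))

    def fetch_raw(self, url: str) -> Any:
        return None

    def fetch_markdown(self, url: str) -> Any:
        return None

    def markdown_documents(self, scope: Any) -> Iterable[Any]:
        return []


preview = preview_origins(FakeCrawler(), "https://example.com/docs/", limit=3)
print(preview)
```

Because `preview_origins` only calls `origins()`, it would accept a `WebCrawler` or a `CloudflareCrawler` instance in place of the stand-in, provided the real signatures match the assumed ones.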
+ +## Conclusion + +`CloudflareCrawler` lets you build retrieval stores from sites that would otherwise be inaccessible to a simple HTTP client. The rendering, link discovery, and Markdown conversion all happen on Cloudflare's infrastructure, so your local code stays focused on chunking and ingestion. Combined with caching, filtering patterns, and `modified_since=`, you can keep a store current without redundant API calls or full re-crawls. For sites that do not need JavaScript execution, `WebCrawler` remains the simpler and cheaper option, and the shared crawler interface means you can switch between the two without restructuring your pipeline.