From 9862b443dd8fb7513b6b201c6d47ad6492847278 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 17:28:02 -0400 Subject: [PATCH 01/28] Add Chatlas integration guide --- user_guide/05-chatlas-integration.qmd | 310 ++++++++++++++++++++++++++ 1 file changed, 310 insertions(+) create mode 100644 user_guide/05-chatlas-integration.qmd diff --git a/user_guide/05-chatlas-integration.qmd b/user_guide/05-chatlas-integration.qmd new file mode 100644 index 0000000..5490154 --- /dev/null +++ b/user_guide/05-chatlas-integration.qmd @@ -0,0 +1,310 @@ +--- +title: "Using RAG with Chatlas" +guide-section: "Getting Started" +--- + +While raghilda builds the knowledge store, [chatlas](https://posit-dev.github.io/chatlas/) can handle the conversation part. The integration point between the two is a Python function that you register as a tool with chatlas. When the LLM decides it needs information from your store, it calls that function, receives the relevant chunks, and incorporates them into its answer. + +This page will walk you through the pattern step by step. It assumes you already have a populated store (look over [Core Concepts](00-getting-started.qmd) or [Crawling and Ingestion](04-crawling-and-ingestion.qmd) if you need to build one first). + +## Connecting to a store + +Let's start by connecting to an existing store (a `DuckDBStore`). Using `.connect(read_only=True)` is recommended when the store is only used for retrieval: + +```{python} +#| eval: false +from raghilda.store import DuckDBStore + +store = DuckDBStore.connect("quarto_docs.db", read_only=True) +print(f"Store contains {store.size()} documents") +``` + +Any raghilda store backend works here: `DuckDBStore`, `ChromaDBStore`, or `OpenAIStore`. The rest of the code is identical regardless of the backend. + +## Defining a search tool + +chatlas discovers tools through plain Python functions. The function's docstring and type hints tell the model what the tool does and what arguments it accepts. 
A retrieval tool might look like this: + +```{python} +#| eval: false +import json + +def search_docs(query: str, num_results: int = 5) -> str: + """ + Search the documentation for relevant information. + + Parameters + ---------- + query + A description of what to look for. + num_results + The number of relevant passages to return (default of `5`). + """ + chunks = store.retrieve(query, top_k=num_results, deoverlap=True) + return json.dumps( + [{"text": chunk.text, "context": chunk.context} for chunk in chunks] + ) +``` + +There are a few things we should take note of: + +- the docstring is sent to the model as part of the tool description. Write it for the LLM: be specific about when the tool should be used and what `query=` should contain. +- the return value must be a string. JSON works really well here because it preserves structure without requiring the model to parse anything unusual. +- `deoverlap=True` (the default) merges overlapping chunks from the same document so the model receives coherent passages rather than repetitive fragments. + +The goal is a function that returns enough context for the model to answer accurately, but not so much that it drowns the prompt in noise. Start with a simple version like the one above and refine the docstring and return format once you can observe how the model uses the results. + +## Registering the tool and chatting + +Pass the function to `chat.register_tool()`. After registration, the model can call it whenever it determines that retrieval would help answer a prompt: + +```{python} +#| eval: false +from chatlas import ChatOpenAI + +chat = ChatOpenAI( + model="gpt-5.5", + system_prompt=( + "You are a helpful assistant that answers questions about Quarto. " + "Use the search_docs tool to find relevant information before answering." + ), +) +chat.register_tool(search_docs) + +chat.chat("How do I add citations to a Quarto document?") +``` + +The system prompt matters. 
Instructing the model to use the tool before answering reduces the chance that it falls back on its training data alone. + +## Interactive and programmatic use + +chatlas provides several ways to consume responses depending on context. + +**Console mode** for interactive exploration: + +```{python} +#| eval: false +chat.console() +``` + +This opens a REPL where you can ask questions and see tool calls in real time. Type `exit` or press `Ctrl+C` to quit. + +**Streaming** for applications that display output incrementally: + +```{python} +#| eval: false +for chunk in chat.stream("What formats does Quarto support?"): + print(chunk, end="", flush=True) +``` + +**Async** for concurrent workloads: + +```{python} +#| eval: false +response = await chat.chat_async("How do I create a Quarto presentation?") +print(response) +``` + +All three modes use the same registered tools and conversation history. The choice depends on where your code runs: `.console()` for quick experimentation in a terminal, `.stream()` for user-facing applications where perceived latency matters, and `.chat_async()` for server-side code that handles multiple requests concurrently. + +## Tailoring retrieval to the tool's purpose + +The tool function is where you control retrieval quality. Here are adjustments worth considering: + +Every `RetrievedChunk` carries an `.origin` attribute that records where the chunk came from (typically a URL or file path). Including it in the JSON response lets the model cite its sources when answering: + +```{python} +#| eval: false +def search_docs(query: str, num_results: int = 5) -> str: + """Search the documentation for relevant information.""" + chunks = store.retrieve(query, top_k=num_results, deoverlap=True) + return json.dumps([ + { + "text": chunk.text, + "context": chunk.context, + "source": chunk.origin, + } + for chunk in chunks + ]) +``` + +Adding `"source": chunk.origin` to the returned dictionary is all it takes. 
Once the model sees URLs or paths alongside the text, it can reference them in its answer without any additional prompting. + +When a store indexes content from multiple sources or sections, you can pass an `attributes_filter=` argument to `retrieve()` to restrict results to a subset. The filter uses a SQL-like expression (`"section = 'guide'"`) that matches against the attributes defined in your store's schema: + +```{python} +#| eval: false +def search_guides(query: str) -> str: + """Search only the user guide section of the documentation.""" + chunks = store.retrieve( + query, + top_k=5, + attributes_filter="section = 'guide'", + ) + return json.dumps([{"text": chunk.text} for chunk in chunks]) +``` + +Here only chunks whose `section` attribute equals `'guide'` are considered. This keeps retrieval focused and avoids pulling in, for example, API reference text when the user asks a conceptual question. See [Attribute Filters](03-attribute-filters.qmd) for more on defining and using attribute schemas. + +You can also register several tool functions on the same chat, each backed by a different filter or even a different store. 
The model decides which tool to invoke based on the docstrings, so give each function a clear description of what it covers: + +```{python} +#| eval: false +def search_api_reference(query: str) -> str: + """Search the API reference for function signatures and parameters.""" + chunks = store.retrieve( + query, + top_k=3, + attributes_filter="section = 'reference'", + ) + return json.dumps([{"text": chunk.text} for chunk in chunks]) + +def search_tutorials(query: str) -> str: + """Search the tutorials for step-by-step instructions and examples.""" + chunks = store.retrieve( + query, + top_k=5, + attributes_filter="section = 'tutorial'", + ) + return json.dumps([{"text": chunk.text} for chunk in chunks]) + +chat.register_tool(search_api_reference) +chat.register_tool(search_tutorials) +``` + +With two tools registered, a question like ``"What arguments does `ChatOpenAI` accept?"`` is likely to route to `search_api_reference`, while `"How do I set up streaming in a Shiny app?"` is likely to route to `search_tutorials`. The model makes the choice on each turn, and you can observe which tool it selects by watching the tool-call display in `.chat()` or `.console()`. + +None of these adjustments require any changes to chatlas itself. The retrieval logic lives entirely in your tool functions, which means you can iterate on what gets returned, how many results to include, and how to filter without touching the chat configuration. That separation is deliberate: it keeps the conversational layer stable while you tune retrieval independently. + +## Choosing a model provider + +Because the retrieval logic lives in a plain Python function, the choice of model provider is independent of raghilda. chatlas supports hosted APIs, cloud platforms, and local inference servers. The tool registration interface is the same in every case. 
+ +Anthropic's Claude models tend to follow tool-calling instructions closely and produce well-structured answers: + +```{python} +#| eval: false +from chatlas import ChatAnthropic + +chat = ChatAnthropic(model="claude-opus-4-8") +chat.register_tool(search_docs) +``` + +Google's Gemini models offer a generous free tier, which is useful for prototyping before committing to a paid API: + +```{python} +#| eval: false +from chatlas import ChatGoogle + +chat = ChatGoogle(model="gemini-3.5-flash") +chat.register_tool(search_docs) +``` + +Ollama runs models locally, so nothing leaves your machine. This matters when the store contains proprietary or sensitive material: + +```{python} +#| eval: false +from chatlas import ChatOllama + +chat = ChatOllama(model="Llama-3.3-8B-Instruct") +chat.register_tool(search_docs) +``` + +The [chatlas model choice documentation](https://posit-dev.github.io/chatlas/get-started/models.html) lists all available providers. Switching between them requires changing only the constructor call; the registered tools, system prompt, and conversation history carry over if you assign them to a new chat object. + +## A full example + +The following script builds a store from a documentation site and starts an interactive RAG chat session. It reuses an existing store if one is already present. 
+ +```{python} +#| eval: false +from pathlib import Path + +from chatlas import ChatOpenAI + +from raghilda.chunker import MarkdownChunker +from raghilda.crawl import CrawlScope, WebCrawler +from raghilda.embedding import EmbeddingOpenAI +from raghilda.store import DuckDBStore + +DB_PATH = Path("chatlas_docs.db") + + +def build_store() -> DuckDBStore: + store = DuckDBStore.create( + location=str(DB_PATH), + embed=EmbeddingOpenAI(), + name="chatlas_docs", + title="Chatlas Documentation", + overwrite=True, + ) + crawler = WebCrawler(cache_dir=True, max_workers=4) + scope = CrawlScope( + roots=["https://posit-dev.github.io/chatlas/"], + depth=1, + include_types=["html"], + ) + chunker = MarkdownChunker() + summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, + ) + store.build_index() + print(f"Indexed {summary.inserted} documents") + return store + + +def get_store() -> DuckDBStore: + if DB_PATH.exists(): + return DuckDBStore.connect(str(DB_PATH), read_only=True) + return build_store() + + +def main(): + import json + + store = get_store() + + def search_chatlas_docs(query: str, num_results: int = 5) -> str: + """ + Search the chatlas documentation. + + Use this tool when the user asks about chatlas features, + API usage, model providers, tool calling, or streaming. + + Parameters + ---------- + query + A description of what to look for. + num_results + Number of passages to return (default of 5). + """ + chunks = store.retrieve(query, top_k=num_results, deoverlap=True) + return json.dumps( + [{"text": chunk.text, "context": chunk.context} for chunk in chunks] + ) + + chat = ChatOpenAI( + model="gpt-5.5", + system_prompt=( + "You answer questions about the chatlas Python library. " + "Always use the search tool before answering." 
+ ), + ) + chat.register_tool(search_chatlas_docs) + chat.console() + + +if __name__ == "__main__": + main() +``` + +This script separates store construction from chat setup so the expensive indexing step only runs once. On subsequent runs it reconnects to the existing database and goes straight to the interactive session. The same structure works for any documentation site or local file collection: swap the `CrawlScope` roots and adjust the system prompt to match your domain. + +## Next steps + +- The [Core Concepts](00-getting-started.qmd) guide covers building a store from scratch. +- The [Chunking](02-chunking.qmd) guide explains how to tune chunk size and overlap for better retrieval quality. +- The [Attribute Filters](03-attribute-filters.qmd) guide shows how to scope retrieval by metadata. +- The [chatlas documentation](https://posit-dev.github.io/chatlas/get-started/tools.html) has more detail on tool calling, streaming, and structured output. From 91feca0384e6c60727ead6d7ff99e0cbabff9a40 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 20:24:07 -0400 Subject: [PATCH 02/28] Lowercase 'chatlas' in guide title --- user_guide/05-chatlas-integration.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/user_guide/05-chatlas-integration.qmd b/user_guide/05-chatlas-integration.qmd index 5490154..c19095c 100644 --- a/user_guide/05-chatlas-integration.qmd +++ b/user_guide/05-chatlas-integration.qmd @@ -1,5 +1,5 @@ --- -title: "Using RAG with Chatlas" +title: "Using RAG with chatlas" guide-section: "Getting Started" --- From 8779cfed80f6009074b52b639dd0db7879973248 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 20:24:27 -0400 Subject: [PATCH 03/28] Clarify closure capture and return format in guide --- user_guide/05-chatlas-integration.qmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/user_guide/05-chatlas-integration.qmd b/user_guide/05-chatlas-integration.qmd index 
c19095c..8d7c1d3 100644 --- a/user_guide/05-chatlas-integration.qmd +++ b/user_guide/05-chatlas-integration.qmd @@ -48,8 +48,9 @@ def search_docs(query: str, num_results: int = 5) -> str: There are a few things we should take note of: -- the docstring is sent to the model as part of the tool description. Write it for the LLM: be specific about when the tool should be used and what `query=` should contain. -- the return value must be a string. JSON works really well here because it preserves structure without requiring the model to parse anything unusual. +- The function captures the `store` variable from the surrounding scope. This is a normal Python closure: as long as `store` is defined before the function is called, the reference works. +- The docstring is sent to the model as part of the tool description. Write it for the LLM: be specific about when the tool should be used and what `query=` should contain. +- The return value must be a string because LLM tool-calling APIs transmit results as text. JSON works really well here because it preserves structure without requiring the model to parse anything unusual. - `deoverlap=True` (the default) merges overlapping chunks from the same document so the model receives coherent passages rather than repetitive fragments. The goal is a function that returns enough context for the model to answer accurately, but not so much that it drowns the prompt in noise. Start with a simple version like the one above and refine the docstring and return format once you can observe how the model uses the results. 
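The closure and string-return points in the bullets above can be tried in isolation with a stand-in store. This is a minimal sketch, not raghilda's API: `FakeChunk`, `FakeStore`, and `make_search_tool` are hypothetical names that exist only so the example runs without a real store or an API key.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeChunk:
    # Hypothetical stand-in for a retrieved chunk; only the two
    # attributes used by the guide's tool function are modelled.
    text: str
    context: str


class FakeStore:
    # Hypothetical stand-in for a store: retrieve() ignores the query
    # and returns the first top_k canned chunks.
    def __init__(self, chunks):
        self._chunks = chunks

    def retrieve(self, query, top_k=5, deoverlap=True):
        return self._chunks[:top_k]


def make_search_tool(store):
    # `store` is captured by the inner function (a closure), so the tool
    # keeps working after make_search_tool() has returned.
    def search_docs(query: str, num_results: int = 5) -> str:
        """Search the documentation for relevant information."""
        chunks = store.retrieve(query, top_k=num_results, deoverlap=True)
        # Serialize to a JSON string so the result can be sent as text.
        return json.dumps(
            [{"text": c.text, "context": c.context} for c in chunks]
        )

    return search_docs


search_docs = make_search_tool(
    FakeStore(
        [
            FakeChunk("Use a .bib file.", "Citations"),
            FakeChunk("Set `bibliography:` in YAML.", "Citations"),
        ]
    )
)
result = search_docs("citations", num_results=1)
```

Building the tool through a factory is one way to avoid relying on a module-level `store` variable. The module-level form shown in the guide works just as well as long as `store` exists before the first call.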
From 83d160b3b81a9df1c95dc2203de796a95d569c33 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 20:24:40 -0400 Subject: [PATCH 04/28] Document .chat() streaming and tool calls --- user_guide/05-chatlas-integration.qmd | 2 ++ 1 file changed, 2 insertions(+) diff --git a/user_guide/05-chatlas-integration.qmd b/user_guide/05-chatlas-integration.qmd index 8d7c1d3..e254483 100644 --- a/user_guide/05-chatlas-integration.qmd +++ b/user_guide/05-chatlas-integration.qmd @@ -75,6 +75,8 @@ chat.register_tool(search_docs) chat.chat("How do I add citations to a Quarto document?") ``` +When you call `.chat()`, chatlas sends the prompt to the model, displays any tool calls the model makes (including the query it passes to your function), and then streams the final answer to the terminal. You see the full round trip without needing to wire up any display logic yourself. + The system prompt matters. Instructing the model to use the tool before answering reduces the chance that it falls back on its training data alone. 
## Interactive and programmatic use From dcc0486ec3eb7a604495d432795deb2a26074200 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 20:24:56 -0400 Subject: [PATCH 05/28] Clarify async usage in chatlas guide --- user_guide/05-chatlas-integration.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/user_guide/05-chatlas-integration.qmd b/user_guide/05-chatlas-integration.qmd index e254483..cc61aa9 100644 --- a/user_guide/05-chatlas-integration.qmd +++ b/user_guide/05-chatlas-integration.qmd @@ -100,7 +100,7 @@ for chunk in chat.stream("What formats does Quarto support?"): print(chunk, end="", flush=True) ``` -**Async** for concurrent workloads: +**Async** for concurrent workloads (note that `await` requires an `async def` context, so this form is typically used inside an async framework like FastAPI or an `asyncio.run()` entrypoint): ```{python} #| eval: false From 07181d80f9558c496f8989a897d2f197394a3227 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:57:45 -0400 Subject: [PATCH 06/28] Add Cloudflare Browser Rendering Crawler guide --- user_guide/52-cloudflare-crawler.qmd | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 user_guide/52-cloudflare-crawler.qmd diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd new file mode 100644 index 0000000..0478af8 --- /dev/null +++ b/user_guide/52-cloudflare-crawler.qmd @@ -0,0 +1,9 @@ +--- +title: "Cloudflare Browser Rendering Crawler" +guide-section: "Store Backends" +--- + +Some websites load their content entirely through JavaScript. A conventional HTTP request to such a site returns an empty shell or a loading spinner rather than the actual text. The `CloudflareCrawler` addresses this by delegating page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/), which executes JavaScript and returns the fully rendered page content as Markdown. 
Because the conversion to Markdown happens on Cloudflare's servers, you receive ready-to-chunk text without needing to run a headless browser locally. + +This guide covers how to set up and use `CloudflareCrawler` to build a store from JavaScript-heavy sites. It assumes familiarity with the general crawling workflow described in [Crawling and Ingestion](04-crawling-and-ingestion.qmd). + From 35377f03450dc2fe74f9611b511e5827a423e579 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:58:10 -0400 Subject: [PATCH 07/28] Add CloudflareCrawler prerequisites --- user_guide/52-cloudflare-crawler.qmd | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 0478af8..900a2d1 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -7,3 +7,22 @@ Some websites load their content entirely through JavaScript. A conventional HTT This guide covers how to set up and use `CloudflareCrawler` to build a store from JavaScript-heavy sites. It assumes familiarity with the general crawling workflow described in [Crawling and Ingestion](04-crawling-and-ingestion.qmd). +## Prerequisites + +`CloudflareCrawler` requires two credentials from a Cloudflare account that has the Browser Rendering API enabled: + +- Account ID: found on your Cloudflare dashboard under the account overview page +- API Token: a bearer token with permission to access the Browser Rendering API + +You should store both as environment variables rather than hardcoding them in scripts: + +```{python} +#| eval: false +import os + +account_id = os.environ["CLOUDFLARE_ACCOUNT_ID"] +api_token = os.environ["CLOUDFLARE_API_TOKEN"] +``` + +If the credentials are missing or the account does not have Browser Rendering enabled, the crawler will raise an error on the first API call. 
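Because a missing credential only surfaces on the first API call, it can help to check the environment before constructing the crawler. The helper below is a small sketch that uses only the standard library. `require_env` is a hypothetical name and is not part of raghilda.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    # Collect every missing variable so the error lists them all at once.
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: os.environ[name] for name in names}
```

Calling `require_env("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")` at the top of a script raises a `RuntimeError` naming whichever variable is unset, rather than failing later in the middle of a crawl.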
+ From 10d7c80789bed8cf57b40d97bc44f268cdc6e18c Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:58:31 -0400 Subject: [PATCH 08/28] Add basic usage section to Cloudflare crawler docs --- user_guide/52-cloudflare-crawler.qmd | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 900a2d1..853c0ec 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -26,3 +26,23 @@ api_token = os.environ["CLOUDFLARE_API_TOKEN"] If the credentials are missing or the account does not have Browser Rendering enabled, the crawler will raise an error on the first API call. +## Basic usage + +A complete crawl-to-store pipeline needs four big pieces: + +1. a store to write into +2. a crawler to fetch and render pages +3. a scope describing which pages to include +4. a chunker to split the rendered Markdown into retrieval-sized pieces + +You also need to decide on an embedding provider for the store (or defer that by passing `embed=None` if you plan to add embeddings later). + +The example below uses all of the defaults for the crawler beyond the required credentials. That essentially means that: + +- browser rendering is on +- all discovery methods are active +- no caching is configured +- no filtering patterns are applied + +This is a pretty reasonable starting point for an initial exploration of a site before tuning scope and caching for production use. 
+ From 84717a49fc2edd4d3a3a6b35cf105360ebb9de5c Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:58:58 -0400 Subject: [PATCH 09/28] Add Cloudflare crawler example to docs --- user_guide/52-cloudflare-crawler.qmd | 39 ++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 853c0ec..d390187 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -46,3 +46,42 @@ The example below uses all of the defaults for the crawler beyond the required c This is a pretty reasonable starting point for an initial exploration of a site before tuning scope and caching for production use. +```{python} +#| eval: false +import os + +# 1. Create a store with an embedding provider +store = DuckDBStore.create( + location="rendered_docs.db", + embed=EmbeddingOpenAI(), + name="rendered_docs", + title="Rendered Documentation", + overwrite=True, +) + +# 2. Set up the crawler with Cloudflare credentials +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], +) + +# 3. Define the crawl scope +scope = CrawlScope( + roots=["https://example.com/docs/"], + depth=2, +) + +# 4. Chunk and ingest the crawled pages +chunker = MarkdownChunker() + +summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, +) + +# 5. 
Build retrieval indexes +store.build_index() + +print(summary) +``` From 652836f07621dcd2ea572e5a54896a463d4f002e Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:59:12 -0400 Subject: [PATCH 10/28] Add IngestSummary output example --- user_guide/52-cloudflare-crawler.qmd | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index d390187..bc1e157 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -85,3 +85,8 @@ store.build_index() print(summary) ``` + +``` +IngestSummary(inserted=47, replaced=0, skipped=0) +``` + From f2c1089f15da5fc085a457846b1852a19d9b1329 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:59:26 -0400 Subject: [PATCH 11/28] Add raghilda imports for Cloudflare crawler --- user_guide/52-cloudflare-crawler.qmd | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index bc1e157..9ba6c30 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -50,6 +50,11 @@ This is a pretty reasonable starting point for an initial exploration of a site #| eval: false import os +from raghilda.chunker import MarkdownChunker +from raghilda.crawl import CloudflareCrawler, CrawlScope +from raghilda.embedding import EmbeddingOpenAI +from raghilda.store import DuckDBStore + # 1. 
Create a store with an embedding provider store = DuckDBStore.create( location="rendered_docs.db", From 29d99396d380b4e563ff0ca34f07dae1ca3e0250 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 21:59:44 -0400 Subject: [PATCH 12/28] Explain Cloudflare crawler ingestion and indexing --- user_guide/52-cloudflare-crawler.qmd | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 9ba6c30..9805e9c 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -95,3 +95,9 @@ print(summary) IngestSummary(inserted=47, replaced=0, skipped=0) ``` +This posts a crawl job to Cloudflare, polls until it completes, retrieves the rendered Markdown for each discovered page, chunks it, and writes everything to the store. The `depth=2` setting tells Cloudflare to follow links up to two levels from the root URL. + +The `store.build_index()` call at the end creates an HNSW vector index and a BM25 keyword index over the stored chunks. These indexes make subsequent `retrieve()` calls fast. Building them after all documents are ingested is more efficient than updating them incrementally during writes, which is why it appears as a separate step at the end. + +The printed `IngestSummary` reports how many documents were newly added, how many had their content replaced, and how many were skipped because identical content was already in the store. 
+ From 1b8b419d521dcdbd98e9348309c6fe183f2ebed4 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:00:09 -0400 Subject: [PATCH 13/28] Add CloudflareCrawler vs WebCrawler section --- user_guide/52-cloudflare-crawler.qmd | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 9805e9c..8c1a0c3 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -101,3 +101,14 @@ The `store.build_index()` call at the end creates an HNSW vector index and a BM2 The printed `IngestSummary` reports how many documents were newly added, how many had their content replaced, and how many were skipped because identical content was already in the store. +## How it differs from WebCrawler + +`WebCrawler` fetches raw HTML with `requests` and converts it to Markdown locally. `CloudflareCrawler` offloads both the fetching and the Markdown conversion to Cloudflare's infrastructure. The practical differences are: + +- JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration. +- No local browser required: you do not need Playwright, Selenium, or any headless browser installed. +- Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM. +- Markdown arrives pre-converted: the API returns Markdown directly, so there is no local HTML-to-Markdown conversion step. + +The tradeoff is that `CloudflareCrawler` requires a Cloudflare account with Browser Rendering access and incurs API usage costs. For static HTML sites, `WebCrawler` is simpler (and free). 
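One rough way to choose between the two crawlers is to check whether the raw, unrendered HTML already contains the text you care about. The sketch below uses only the standard library. `visible_text` and `needs_rendering` are hypothetical helpers, and the heuristic (search for a known phrase) is an assumption for illustration, not something raghilda does.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    # Collects text nodes, skipping the contents of <script> and <style>.
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def needs_rendering(raw_html: str, expected_phrase: str) -> bool:
    # If the phrase is absent from the static HTML, the content is
    # probably injected by JavaScript after the page loads.
    return expected_phrase.lower() not in visible_text(raw_html).lower()


static_page = (
    "<html><body><h1>Install guide</h1><p>Run pip install.</p></body></html>"
)
spa_shell = (
    "<html><body><div id='root'></div>"
    "<script>render('Install guide')</script></body></html>"
)
```

Fetching `raw_html` is left out on purpose, since any HTTP client will do. If `needs_rendering()` returns `False` for a few representative pages, the site is a plausible candidate for `WebCrawler`.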
+ From ac56bfd95f759b0ec6d845e2158260e1c721fb0a Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:00:27 -0400 Subject: [PATCH 14/28] Document CloudflareCrawler render parameter --- user_guide/52-cloudflare-crawler.qmd | 30 ++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 8c1a0c3..772cae9 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -112,3 +112,33 @@ The printed `IngestSummary` reports how many documents were newly added, how man The tradeoff is that `CloudflareCrawler` requires a Cloudflare account with Browser Rendering access and incurs API usage costs. For static HTML sites, `WebCrawler` is simpler (and free). +## Browser rendering + +The `render=` parameter controls whether Cloudflare executes JavaScript before extracting page content. It defaults to `True`, which is the right choice for any site that relies on client-side rendering. When enabled, Cloudflare loads each page in a headless browser, waits for scripts to finish, and extracts text from the fully populated DOM. + +Setting it explicitly for clarity: + +```{python} +#| eval: false +# JavaScript is executed (default) +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + render=True, +) +``` + +We should set `render=False` if the target site is server-rendered HTML and does not need JavaScript execution. This reduces crawl time and Cloudflare API usage: + +```{python} +#| eval: false +# Skip JavaScript execution for static sites +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + render=False, +) +``` + +If you are unsure whether a site needs rendering, start with the default (`render=True`) and inspect a few pages. 
If the returned Markdown already contains the expected content, switching to `render=False` will speed up the crawl and reduce your Cloudflare API usage without losing any text. + From 465dde354a4f9abc93fcbba0d686c8d5ed88541d Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:00:49 -0400 Subject: [PATCH 15/28] Add docs for CloudflareCrawler source parameter --- user_guide/52-cloudflare-crawler.qmd | 38 ++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 772cae9..98b4af0 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -142,3 +142,41 @@ crawler = CloudflareCrawler( If you are unsure whether a site needs rendering, start with the default (`render=True`) and inspect a few pages. If the returned Markdown already contains the expected content, switching to `render=False` will speed up the crawl and reduce your Cloudflare API usage without losing any text. +## Page discovery with the source parameter + +The `source=` parameter controls how Cloudflare discovers pages on the target site. The default is `"all"`, which combines multiple discovery methods: + +```{python} +#| eval: false +# Use all available discovery methods (default) +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + source="all", +) +``` + +Other options for `source=` are: + +- `"sitemap"`: only discover pages listed in the site's `sitemap.xml`. This is efficient for well-maintained sites where the sitemap is comprehensive. +- `"crawl"`: follow links from the rendered DOM, similar to traditional web crawling but with JavaScript execution. +- `"urls"`: only process the explicitly provided root URLs, without following any links. 
+ +For a documentation site with a complete sitemap, `"sitemap"` is typically the fastest option because it avoids rendering every page just to find links: + +```{python} +#| eval: false +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + source="sitemap", +) + +scope = CrawlScope( + roots=["https://example.com/docs/"], + depth=0, +) +``` + +With `source="sitemap"` and `depth=0`, Cloudflare reads the sitemap from the root and returns all listed pages without further link-following. + From 4dad7f708d849e8432e0de2f52c2679fd9065ce9 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:01:09 -0400 Subject: [PATCH 16/28] Document Cloudflare crawler filtering patterns --- user_guide/52-cloudflare-crawler.qmd | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 98b4af0..7ef23fe 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -180,3 +180,30 @@ scope = CrawlScope( With `source="sitemap"` and `depth=0`, Cloudflare reads the sitemap from the root and returns all listed pages without further link-following. +## Filtering with Cloudflare-style patterns + +The `include_patterns=` and `exclude_patterns=` fields on `CrawlScope` behave differently depending on which crawler you use. With `WebCrawler`, they are Python regular expressions matched locally. 
With `CloudflareCrawler`, they are forwarded directly to the Cloudflare API as Cloudflare wildcard patterns, where `**` matches any number of path segments: + +```{python} +#| eval: false +scope = CrawlScope( + roots=["https://example.com/"], + depth=2, + include_patterns=["https://example.com/docs/**"], + exclude_patterns=[ + "https://example.com/docs/archive/**", + "https://example.com/docs/internal/**", + ], + limit=500, +) +``` + +The `include_patterns` list restricts the crawl to URLs matching at least one pattern. The `exclude_patterns` list removes URLs that match any pattern. The `limit=` field caps the total number of pages returned. + +Two additional scope fields are relevant for Cloudflare crawls: + +- `include_external_links=True` allows the crawler to follow links to other domains. +- `include_subdomains=True` allows the crawler to follow links to subdomains of the root host (e.g., `docs.example.com` when crawling from `example.com`). + +Both default to `False`, which keeps the crawl focused on the root domain. + From a2c6eaad36730313f1dbb875296d02cc434c54bd Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:01:30 -0400 Subject: [PATCH 17/28] Add CloudflareCrawler caching documentation --- user_guide/52-cloudflare-crawler.qmd | 30 ++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 7ef23fe..5aa0711 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -207,3 +207,33 @@ Two additional scope fields are relevant for Cloudflare crawls: Both default to `False`, which keeps the crawl focused on the root domain. +## Caching + +`CloudflareCrawler` accepts the same caching parameters as `WebCrawler`, though the underlying behavior differs slightly because results come from Cloudflare's API rather than direct HTTP requests. 
Enable caching with `cache_dir=True` to store results under `.raghilda/cache/cloudflare`, or pass a custom path: + +```{python} +#| eval: false +from datetime import timedelta + +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + cache_stale_after=timedelta(days=1), +) +``` + +With caching enabled, the crawler stores both the crawl job results (list of discovered URLs and their rendered Markdown) and the individual page records on disk. On subsequent runs, fresh cached results are reused without making any Cloudflare API calls. + +The `cache_stale_after=` parameter controls when cached results are considered stale. When a cached entry is stale, the crawler sends a new request to Cloudflare with a `maxAge` hint asking for updated content. When `cache_stale_after` is not set, cached entries never expire. + +To force a completely fresh crawl that bypasses the cache entirely, pass `cache_force_refresh=True` to `markdown_documents()`: + +```{python} +#| eval: false +# Ignore all cached results, re-crawl everything +documents = crawler.markdown_documents(scope, cache_force_refresh=True) +``` + +The cache validates its entries against a signature that includes the `account_id=`, `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused. 
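
The staleness rule described in this section can be sketched in a few lines. This is an illustration of the documented behavior only (an entry older than `cache_stale_after` is stale, and an unset window means entries never expire). The `is_stale()` function is hypothetical and is not raghilda's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    fetched_at: datetime,
    stale_after: Optional[timedelta],
    now: datetime,
) -> bool:
    """Hypothetical staleness check mirroring the rule described in the text."""
    if stale_after is None:
        # No staleness window configured: the entry never expires
        return False
    return now - fetched_at > stale_after


now = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)
fetched = now - timedelta(hours=30)

# The entry is 30 hours old and the window is 24 hours
print(is_stale(fetched, timedelta(days=1), now))
# No window is set, so the same entry is still considered fresh
print(is_stale(fetched, None, now))
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check easy to test.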
+ From 2524c841db6b6641ec27079d07d21ca36ad4c8a1 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:01:49 -0400 Subject: [PATCH 18/28] Document modified_since incremental updates --- user_guide/52-cloudflare-crawler.qmd | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 5aa0711..f583271 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -237,3 +237,24 @@ documents = crawler.markdown_documents(scope, cache_force_refresh=True) The cache validates its entries against a signature that includes the `account_id=`, `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused. +## Incremental updates with `modified_since=` + +For stores that need regular updates, the `modified_since=` parameter restricts the crawl to pages modified after a given Unix timestamp. This reduces the number of pages Cloudflare processes on each run: + +```{python} +#| eval: false +import time + +# Only include pages modified in the last 7 days +one_week_ago = int(time.time()) - (7 * 24 * 60 * 60) + +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + modified_since=one_week_ago, +) +``` + +Combined with `store.ingest()`, this gives you a lightweight refresh job where only recently changed pages are fetched and upserted (while unchanged documents are skipped by the store's own deduplication logic). 
+ From 41c95ca4a153f1e67adcaf280ce7ffa567625a8c Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:02:15 -0400 Subject: [PATCH 19/28] Document CloudflareCrawler polling and inspection --- user_guide/52-cloudflare-crawler.qmd | 51 ++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index f583271..7d3c8f0 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -258,3 +258,54 @@ crawler = CloudflareCrawler( Combined with `store.ingest()`, this gives you a lightweight refresh job where only recently changed pages are fetched and upserted (while unchanged documents are skipped by the store's own deduplication logic). +## Polling behavior + +Cloudflare processes crawl jobs asynchronously. After submitting a job, the crawler polls for completion at regular intervals. Two parameters control this behavior: + +- `poll_interval=5.0`: seconds to wait between status checks. The default of 5 seconds is reasonable for most jobs. +- `max_poll_attempts=60`: maximum number of polls before raising a `TimeoutError`. With the default interval, this gives a 5-minute window. + +For large sites that take longer to crawl, we should increase the timeout: + +```{python} +#| eval: false +crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + poll_interval=10.0, + max_poll_attempts=120, +) +``` + +This configuration waits up to 20 minutes for a crawl job to finish. + +## Inspecting discovered pages + +Before committing to a full ingest of a large site, you may want to spot-check what Cloudflare discovers and returns. `CloudflareCrawler` exposes lower-level methods for this kind of inspection. 
Use `origins()` to see what pages Cloudflare finds without converting them into documents: + +```{python} +#| eval: false +for origin in crawler.origins(scope): + print(origin) +``` + +Use `fetch_raw()` to retrieve the full `FetchedSource` for a single page, which includes metadata like the HTTP status code and content type: + +```{python} +#| eval: false +source = crawler.fetch_raw("https://example.com/docs/getting-started") +print(f"Status: {source.status_code}") +print(f"Fetched at: {source.fetched_at}") +print(f"Body at: {source.body_path}") +``` + +Use `fetch_markdown()` to get a single `MarkdownDocument`: + +```{python} +#| eval: false +doc = crawler.fetch_markdown("https://example.com/docs/getting-started") +print(doc.content[:500]) +``` + +These methods are useful for debugging scope configuration or inspecting what Cloudflare returns before committing to a full ingest. + From 8dc3d54eb4f7baa4f43342d25ea5615ffe916138 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:02:47 -0400 Subject: [PATCH 20/28] Add full CloudflareCrawler example to guide --- user_guide/52-cloudflare-crawler.qmd | 94 ++++++++++++++++++++++++++++ 1 file changed, 94 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 7d3c8f0..ff7deb7 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -309,3 +309,97 @@ print(doc.content[:500]) These methods are useful for debugging scope configuration or inspecting what Cloudflare returns before committing to a full ingest. +## Full example + +The following script builds a store from a JavaScript-rendered documentation site. It uses caching so that repeated runs during development avoid redundant API calls, and it sets a one-day staleness window for production refresh jobs. 
+ +```{python} +#| eval: false +import os +from datetime import timedelta +from pathlib import Path + +DB_PATH = Path("spa_docs.db") + + +def build_store() -> DuckDBStore: + store = DuckDBStore.create( + location=str(DB_PATH), + embed=EmbeddingOpenAI(), + name="spa_docs", + title="SPA Documentation", + overwrite=True, + ) + + crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + cache_stale_after=timedelta(days=1), + render=True, + source="all", + max_workers=4, + ) + + scope = CrawlScope( + roots=["https://my-spa-docs.example.com/"], + depth=2, + include_patterns=["https://my-spa-docs.example.com/**"], + exclude_patterns=["https://my-spa-docs.example.com/internal/**"], + limit=1000, + ) + + chunker = MarkdownChunker(chunk_size=1600, target_overlap=0.5) + + summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, + ) + + store.build_index() + + print(f"Inserted: {summary.inserted}") + print(f"Replaced: {summary.replaced}") + print(f"Skipped: {summary.skipped}") + + return store + + +def refresh_store() -> None: + store = DuckDBStore.connect(str(DB_PATH)) + + crawler = CloudflareCrawler( + account_id=os.environ["CLOUDFLARE_ACCOUNT_ID"], + api_token=os.environ["CLOUDFLARE_API_TOKEN"], + cache_dir=True, + cache_stale_after=timedelta(days=1), + render=True, + max_workers=4, + ) + + scope = CrawlScope( + roots=["https://my-spa-docs.example.com/"], + depth=2, + include_patterns=["https://my-spa-docs.example.com/**"], + limit=1000, + ) + + chunker = MarkdownChunker(chunk_size=1600, target_overlap=0.5) + + summary = store.ingest( + crawler.markdown_documents(scope), + prepare=chunker.chunk, + max_workers=4, + ) + + print(f"Refresh complete: {summary.inserted} new, {summary.replaced} updated, {summary.skipped} unchanged") + + +if __name__ == "__main__": + if DB_PATH.exists(): + refresh_store() + else: + build_store() +``` + From 
ec63edc7d606846c0867375561342e1f652489a3 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:03:05 -0400 Subject: [PATCH 21/28] Add raghilda imports for Cloudflare crawler --- user_guide/52-cloudflare-crawler.qmd | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index ff7deb7..e0b4465 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -319,6 +319,11 @@ import os from datetime import timedelta from pathlib import Path +from raghilda.chunker import MarkdownChunker +from raghilda.crawl import CloudflareCrawler, CrawlScope +from raghilda.embedding import EmbeddingOpenAI +from raghilda.store import DuckDBStore + DB_PATH = Path("spa_docs.db") From 58e3048d423a3f9a958e473ce19752c0ab3ad4cc Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:03:32 -0400 Subject: [PATCH 22/28] Add guidance for when to use CloudflareCrawler --- user_guide/52-cloudflare-crawler.qmd | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index e0b4465..fefcad8 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -408,3 +408,23 @@ if __name__ == "__main__": build_store() ``` +## When to use CloudflareCrawler + +Choosing between `CloudflareCrawler` and `WebCrawler` comes down to whether the target site needs JavaScript to produce its content and whether you are willing to depend on an external service. Neither crawler is strictly better as they serve different situations. + +Use `CloudflareCrawler` when: + +- the target site renders content with JavaScript (React, Vue, Angular, etc.) 
+- you need Cloudflare's sitemap-aware discovery rather than manual link following +- you want pre-converted Markdown without running local conversion logic +- the site is large enough that Cloudflare's distributed infrastructure crawls it faster than sequential local requests + +Use `WebCrawler` when: + +- the site is static HTML that does not require JavaScript execution +- you want to avoid external API dependencies and costs +- you need fine-grained control over HTTP headers, cookies, or authentication during fetching +- the crawl is small enough that local `requests` calls are fast enough + +Both crawlers implement the same interface (`origins()`, `fetch_raw()`, `fetch_markdown()`, `markdown_documents()`), so switching between them requires changing only the constructor. The rest of your pipeline (chunking, ingestion, retrieval) remains unchanged. + From 214f35b045e1f7fd89b839e37536e9fc76c7b225 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:03:51 -0400 Subject: [PATCH 23/28] Add conclusion to CloudflareCrawler guide --- user_guide/52-cloudflare-crawler.qmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index fefcad8..5a1e61e 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -428,3 +428,6 @@ Use `WebCrawler` when: Both crawlers implement the same interface (`origins()`, `fetch_raw()`, `fetch_markdown()`, `markdown_documents()`), so switching between them requires changing only the constructor. The rest of your pipeline (chunking, ingestion, retrieval) remains unchanged. +## Conclusion + +`CloudflareCrawler` lets you build retrieval stores from sites that would otherwise be inaccessible to a simple HTTP client. The rendering, link discovery, and Markdown conversion all happen on Cloudflare's infrastructure, so your local code stays focused on chunking and ingestion. 
Combined with caching, filtering patterns, and `modified_since=`, you can keep a store current without redundant API calls or full re-crawls. For sites that do not need JavaScript execution, `WebCrawler` remains the simpler and cheaper option, and the shared crawler interface means you can switch between the two without restructuring your pipeline. From 94634fbfac68d27c634d37db3cd73ec00fd4315d Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Sun, 14 Jun 2026 22:04:03 -0400 Subject: [PATCH 24/28] Explain refresh path and caching/deduplication --- user_guide/52-cloudflare-crawler.qmd | 2 ++ 1 file changed, 2 insertions(+) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 5a1e61e..ac95c8f 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -408,6 +408,8 @@ if __name__ == "__main__": build_store() ``` +The script has two paths: an initial build that creates the store from scratch, and a refresh path that reconnects and upserts only changed content. The cache and the store's own deduplication (`skip_if_unchanged=True` in `upsert()`) work together to keep refresh runs fast: the cache avoids re-fetching pages whose content has not changed, and the store avoids re-computing embeddings for documents that are identical to what is already stored. + ## When to use CloudflareCrawler Choosing between `CloudflareCrawler` and `WebCrawler` comes down to whether the target site needs JavaScript to produce its content and whether you are willing to depend on an external service. Neither crawler is strictly better as they serve different situations. 
From 1f32f6f9d15acd9478d3ec908f27ecfac6b0977a Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Mon, 15 Jun 2026 10:16:15 -0400 Subject: [PATCH 25/28] Update user_guide/52-cloudflare-crawler.qmd Co-authored-by: Tomasz Kalinowski --- user_guide/52-cloudflare-crawler.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index ac95c8f..ac75303 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -419,7 +419,7 @@ Use `CloudflareCrawler` when: - the target site renders content with JavaScript (React, Vue, Angular, etc.) - you need Cloudflare's sitemap-aware discovery rather than manual link following - you want pre-converted Markdown without running local conversion logic -- the site is large enough that Cloudflare's distributed infrastructure crawls it faster than sequential local requests +- the site is large enough that Cloudflare's distributed infrastructure crawls it faster than concurrent local requests Use `WebCrawler` when: From 0bc44c8d6a54abe29bb964e97048600980c6f260 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Mon, 15 Jun 2026 14:38:01 -0400 Subject: [PATCH 26/28] Omit mention of `account_id` in cache invalidation --- user_guide/52-cloudflare-crawler.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index ac75303..6945264 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -235,7 +235,7 @@ To force a completely fresh crawl that bypasses the cache entirely, pass `cache_ documents = crawler.markdown_documents(scope, cache_force_refresh=True) ``` -The cache validates its entries against a signature that includes the `account_id=`, `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. 
This prevents stale results from a different configuration being reused. +The cache validates its entries against a signature that includes the `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused. ## Incremental updates with `modified_since=` From 9a4f582682e134f9d072f010c456cc6a5866bda9 Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Mon, 15 Jun 2026 14:49:11 -0400 Subject: [PATCH 27/28] Revise opening of Cloudflare crawler guide page --- user_guide/52-cloudflare-crawler.qmd | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 6945264..74ec7a4 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -3,9 +3,13 @@ title: "Cloudflare Browser Rendering Crawler" guide-section: "Store Backends" --- -Some websites load their content entirely through JavaScript. A conventional HTTP request to such a site returns an empty shell or a loading spinner rather than the actual text. The `CloudflareCrawler` addresses this by delegating page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/), which executes JavaScript and returns the fully rendered page content as Markdown. Because the conversion to Markdown happens on Cloudflare's servers, you receive ready-to-chunk text without needing to run a headless browser locally. +The `CloudflareCrawler` delegates all page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/). This provides a lot of benefits if you have an account with Cloudflare: -This guide covers how to set up and use `CloudflareCrawler` to build a store from JavaScript-heavy sites. 
It assumes familiarity with the general crawling workflow described in [Crawling and Ingestion](04-crawling-and-ingestion.qmd). +- you effectively offload potentially very long-running crawling tasks, and possibly Markdown conversion, from the local host to Cloudflare's servers +- those websites that load their content entirely through JavaScript (which make text retrieval generally problematic) are handled by Cloudflare +- you don't have to run a headless browser locally, you simply receive ready-to-chunk text from the paid service + +This guide covers how to set up and use `CloudflareCrawler` to build a store. It assumes some familiarity with the general crawling workflow described in [Crawling and Ingestion](04-crawling-and-ingestion.qmd). ## Prerequisites @@ -108,7 +112,7 @@ The printed `IngestSummary` reports how many documents were newly added, how man - JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration. - No local browser required: you do not need Playwright, Selenium, or any headless browser installed. - Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM. -- Markdown arrives pre-converted: the API returns Markdown directly, so there is no local HTML-to-Markdown conversion step. +- Markdown arrives pre-converted: the API returns Markdown directly, and, you can optionally choose any HTML-to-Markdown converter instead of Cloudflare's. The tradeoff is that `CloudflareCrawler` requires a Cloudflare account with Browser Rendering access and incurs API usage costs. For static HTML sites, `WebCrawler` is simpler (and free). 
From 3177765963cbdc147da71c0054d4fe11c29c31dd Mon Sep 17 00:00:00 2001 From: Richard Iannone Date: Mon, 15 Jun 2026 15:11:37 -0400 Subject: [PATCH 28/28] Update user_guide/52-cloudflare-crawler.qmd Co-authored-by: Tomasz Kalinowski --- user_guide/52-cloudflare-crawler.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/user_guide/52-cloudflare-crawler.qmd b/user_guide/52-cloudflare-crawler.qmd index 74ec7a4..878aaab 100644 --- a/user_guide/52-cloudflare-crawler.qmd +++ b/user_guide/52-cloudflare-crawler.qmd @@ -112,7 +112,7 @@ The printed `IngestSummary` reports how many documents were newly added, how man - JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration. - No local browser required: you do not need Playwright, Selenium, or any headless browser installed. - Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM. -- Markdown arrives pre-converted: the API returns Markdown directly, and, you can optionally choose any HTML-to-Markdown converter instead of Cloudflare's. +- Markdown can arrive pre-converted: the API can return Markdown directly, with HTML-to-Markdown conversion performed by Cloudflare. This is optional: you can still choose to receive raw HTML and then perform HTML-to-Markdown conversion locally using Raghilda's built-in converter or another converter of your choice. The tradeoff is that `CloudflareCrawler` requires a Cloudflare account with Browser Rendering access and incurs API usage costs. For static HTML sites, `WebCrawler` is simpler (and free).