docs: add chatlas integration and `CloudflareCrawler` user guide articles by rich-iannone · Pull Request #77 · posit-dev/raghilda

rich-iannone · 2026-06-15T00:27:43Z

This PR adds two user guide pages: (1) 05-chatlas-integration.qmd and (2) 52-cloudflare-crawler.qmd.

The first covers how to integrate raghilda's RAG capabilities with the chatlas package, including connecting to a knowledge store, defining and registering retrieval tools for LLMs, tailoring retrieval logic, and working with different model providers.

The second guide page documents the `CloudflareCrawler` for building stores from JavaScript-rendered sites via Cloudflare's Browser Rendering API. The guide covers credentials setup, browser rendering controls, page discovery options, Cloudflare-style filtering patterns, caching, incremental updates with `modified_since=`, polling configuration, and lower-level inspection methods. A full example script demonstrates both initial builds and incremental refreshes.

t-kalinowski

These guides read great! They will be very helpful to the package.
I made some minor comments below (some of those comments should probably be converted to issues / todos).

t-kalinowski · 2026-06-15T12:42:25Z

+guide-section: "Store Backends"
+---
+
+Some websites load their content entirely through JavaScript. A conventional HTTP request to such a site returns an empty shell or a loading spinner rather than the actual text. The `CloudflareCrawler` addresses this by delegating page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/), which executes JavaScript and returns the fully rendered page content as Markdown. Because the conversion to Markdown happens on Cloudflare's servers, you receive ready-to-chunk text without needing to run a headless browser locally.


This is one of the benefits of the Cloudflare backend, but I'm not sure it's the one to lead with. Cloudflare lets you offload the potentially very long-running crawling task, and possibly Markdown conversion, from the local host to Cloudflare's servers (for a fee).

I rewrote this opening and led with the other, more compelling benefit.

t-kalinowski · 2026-06-15T12:42:51Z

+- JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration.
+- No local browser required: you do not need Playwright, Selenium, or any headless browser installed.
+- Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM.
+- Markdown arrives pre-converted: the API returns Markdown directly, so there is no local HTML-to-Markdown conversion step.


I believe this is optional: users can easily choose to use any Markdown converter instead of the Cloudflare one.

Thanks! Clarified this.

t-kalinowski · 2026-06-15T12:43:54Z

+
+## Browser rendering
+
+The `render=` parameter controls whether Cloudflare executes JavaScript before extracting page content. It defaults to `True`, which is the right choice for any site that relies on client-side rendering. When enabled, Cloudflare loads each page in a headless browser, waits for scripts to finish, and extracts text from the fully populated DOM.


This might be a good feature to add to WebCrawler.

Opened this issue to that effect: #78

t-kalinowski · 2026-06-15T13:01:28Z

+
+## Filtering with Cloudflare-style patterns
+
+The `include_patterns=` and `exclude_patterns=` fields on `CrawlScope` behave differently depending on which crawler you use. With `WebCrawler`, they are Python regular expressions matched locally. With `CloudflareCrawler`, they are forwarded directly to the Cloudflare API as Cloudflare wildcard patterns, where `**` matches any number of path segments:


We can probably change the API to remove this difference: we can treat patterns as globs instead of regexes by default, and have behavior align between the different crawler backends. What do you think? For regexes, we can also optionally accept pre-compiled regexes in addition to a str.

I agree that the APIs should be aligned. Will open an issue with a proposed interface.

I opened #79.
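
To make the difference discussed in this thread concrete, here is a minimal sketch of how a wildcard pattern could be translated into a regular expression locally, and how a single matcher might accept either a glob string or a pre-compiled regex. The names `glob_to_regex` and `matches` are hypothetical and are not part of raghilda or the Cloudflare API. Only the `**` behavior comes from the guide text; treating a single `*` as confined to one path segment is an assumption made for illustration.

```python
import re
from typing import Union


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regex (illustrative only).

    `**` matches any run of characters, including `/`, so it spans any
    number of path segments. A single `*` is ASSUMED to stay within one
    segment (it does not cross a `/`).
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            # Everything else is matched literally.
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(url: str, pattern: Union[str, re.Pattern]) -> bool:
    """Hypothetical unified matcher.

    A plain string is treated as a glob and must match the whole URL;
    a pre-compiled regex is used as-is and may match anywhere in the URL.
    """
    if isinstance(pattern, re.Pattern):
        return pattern.search(url) is not None
    return glob_to_regex(pattern).fullmatch(url) is not None
```

With this sketch, `matches("https://example.com/docs/a/b/page", "https://example.com/docs/**")` succeeds because `**` spans several segments, while the same URL does not match `https://example.com/docs/*`. Whether raghilda adopts an interface like this is left to the issue opened above.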

t-kalinowski · 2026-06-15T13:05:42Z

+documents = crawler.markdown_documents(scope, cache_force_refresh=True)
+```
+
+The cache validates its entries against a signature that includes the `account_id=`, `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused.


Hmm, we should probably omit account_id from the key used by the cache - that shouldn't matter.

Added #80 to address this. Will also change the text here in the guide page to omit that.
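
The signature-based invalidation described in this thread could look roughly like the following sketch, which hashes only the settings that affect crawl output and leaves `account_id` out, as suggested in the review comment. The helpers `cache_signature` and `is_cache_valid` are hypothetical and do not reflect raghilda's actual implementation; the `"all"` value for `source` is a placeholder.

```python
import hashlib
import json
from typing import Optional


def cache_signature(
    *, render: bool, source: str, modified_since: Optional[str]
) -> str:
    """Hash the settings that affect crawl output (hypothetical helper).

    account_id is deliberately not an input: per the review comment,
    it should not influence whether cached results can be reused.
    """
    payload = json.dumps(
        {"render": render, "source": source, "modified_since": modified_since},
        sort_keys=True,  # stable key order gives a stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_cache_valid(stored_signature: str, **settings) -> bool:
    """A cached entry is reusable only if its signature matches the current settings."""
    return stored_signature == cache_signature(**settings)


# Placeholder settings for illustration.
sig = cache_signature(render=True, source="all", modified_since=None)
same_config = is_cache_valid(sig, render=True, source="all", modified_since=None)
changed_render = is_cache_valid(sig, render=False, source="all", modified_since=None)
```

Here `same_config` is `True` and `changed_render` is `False`, so a run with a different `render=` setting would not reuse results cached under the earlier configuration.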

Co-authored-by: Tomasz Kalinowski <tomasz@posit.co>

rich-iannone added 5 commits June 14, 2026 17:28

Add Chatlas integration guide

9862b44

Lowercase 'chatlas' in guide title

91feca0

Clarify closure capture and return format in guide

8779cfe

Document .chat() streaming and tool calls

83d160b

Clarify async usage in chatlas guide

dcc0486

github-actions Bot requested a deployment to pr-77 June 15, 2026 00:31 In progress

rich-iannone added 19 commits June 14, 2026 21:57

Add Cloudflare Browser Rendering Crawler guide

07181d8

Add CloudflareCrawler prerequisites

35377f0

Add basic usage section to Cloudflare crawler docs

10d7c80

Add Cloudflare crawler example to docs

84717a4

Add IngestSummary output example

652836f

Add raghilda imports for Cloudflare crawler

f2c1089

Explain Cloudflare crawler ingestion and indexing

29d9939

Add CloudflareCrawler vs WebCrawler section

1b8b419

Document CloudflareCrawler render parameter

ac56bfd

Add docs for CloudflareCrawler source parameter

465dde3

Document Cloudflare crawler filtering patterns

4dad7f7

Add CloudflareCrawler caching documentation

a2c6eaa

Document modified_since incremental updates

2524c84

Document CloudflareCrawler polling and inspection

41c95ca

Add full CloudflareCrawler example to guide

8dc3d54

Add raghilda imports for Cloudflare crawler

ec63edc

Add guidance for when to use CloudflareCrawler

58e3048

Add conclusion to CloudflareCrawler guide

214f35b

Explain refresh path and caching/deduplication

94634fb

rich-iannone changed the title ~~docs: add chatlas user guide article~~ docs: add chatlas integration and CloudflareCrawler user guide articles Jun 15, 2026

github-actions Bot requested a deployment to pr-77 June 15, 2026 02:07 In progress

t-kalinowski approved these changes Jun 15, 2026

Update user_guide/52-cloudflare-crawler.qmd

1f32f6f

Co-authored-by: Tomasz Kalinowski <tomasz@posit.co>

github-actions Bot requested a deployment to pr-77 June 15, 2026 14:19 In progress

rich-iannone added 2 commits June 15, 2026 14:38

Omit mention of account_id in cache invalidation

0bc44c8

Revise opening of Cloudflare crawler guide page

9a4f582

github-actions Bot requested a deployment to pr-77 June 15, 2026 18:52 In progress

t-kalinowski reviewed Jun 15, 2026

Comment thread user_guide/52-cloudflare-crawler.qmd Outdated

Update user_guide/52-cloudflare-crawler.qmd

3177765

Co-authored-by: Tomasz Kalinowski <tomasz@posit.co>

github-actions Bot requested a deployment to pr-77 June 15, 2026 19:15 In progress

rich-iannone merged commit 991c649 into main Jun 15, 2026
4 checks passed

rich-iannone deleted the docs-chatlas-user-guide branch June 15, 2026 19:38

