
docs: add chatlas integration and CloudflareCrawler user guide articles #77

Merged
rich-iannone merged 28 commits into main from docs-chatlas-user-guide on Jun 15, 2026


@rich-iannone commented on Jun 15, 2026

Member

This PR adds two user guide pages: (1) 05-chatlas-integration.qmd and (2) 52-cloudflare-crawler.qmd.

The first covers how to integrate raghilda's RAG capabilities with the chatlas package, including connecting to a knowledge store, defining and registering retrieval tools for LLMs, tailoring retrieval logic, and working with different model providers.

The second guide page documents the CloudflareCrawler for building stores from JavaScript-rendered sites via Cloudflare's Browser Rendering API. The guide covers credentials setup, browser rendering controls, page discovery options, Cloudflare-style filtering patterns, caching, incremental updates with modified_since=, polling configuration, and lower-level inspection methods. A full example script demonstrates both initial builds and incremental refreshes.

@rich-iannone changed the title from "docs: add chatlas user guide article" to "docs: add chatlas integration and CloudflareCrawler user guide articles" on Jun 15, 2026

@t-kalinowski left a comment

Member

These guides read great! They will be very helpful to the package.
I made some minor comments below (some of those comments should probably be converted to issues / todos).

Comment thread user_guide/52-cloudflare-crawler.qmd Outdated
guide-section: "Store Backends"
---

Some websites load their content entirely through JavaScript. A conventional HTTP request to such a site returns an empty shell or a loading spinner rather than the actual text. The `CloudflareCrawler` addresses this by delegating page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/), which executes JavaScript and returns the fully rendered page content as Markdown. Because the conversion to Markdown happens on Cloudflare's servers, you receive ready-to-chunk text without needing to run a headless browser locally.

Member

This is one of the benefits of the Cloudflare backend, but I'm not sure it's the one to lead with. Cloudflare lets you offload the potentially very long-running crawling task, and possibly Markdown conversion, from the local host to Cloudflare's servers (for a fee).

Member Author

I rewrote this opening to lead with the other, more compelling benefit.

Comment thread user_guide/52-cloudflare-crawler.qmd Outdated
- JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration.
- No local browser required: you do not need Playwright, Selenium, or any headless browser installed.
- Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM.
- Markdown arrives pre-converted: the API returns Markdown directly, so there is no local HTML-to-Markdown conversion step.

Member

I believe this is optional: users can easily choose any Markdown converter instead of the Cloudflare one.

Member Author

Thanks! Clarified this.


## Browser rendering

The `render=` parameter controls whether Cloudflare executes JavaScript before extracting page content. It defaults to `True`, which is the right choice for any site that relies on client-side rendering. When enabled, Cloudflare loads each page in a headless browser, waits for scripts to finish, and extracts text from the fully populated DOM.

Member

This might be a good feature to add to WebCrawler.

Member Author

Opened this issue to that effect: #78


## Filtering with Cloudflare-style patterns

The `include_patterns=` and `exclude_patterns=` fields on `CrawlScope` behave differently depending on which crawler you use. With `WebCrawler`, they are Python regular expressions matched locally. With `CloudflareCrawler`, they are forwarded directly to the Cloudflare API as Cloudflare wildcard patterns, where `**` matches any number of path segments:

Member

We can probably change the API to remove this difference: we can treat patterns as globs instead of regexes by default, and have behavior align between the different crawler backends. What do you think? For regexes, we can also optionally accept pre-compiled regexes in addition to a str.
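A minimal sketch of what such an aligned interface could look like, under stated assumptions: plain strings are treated as wildcard patterns and pre-compiled regexes are accepted as-is. The names `PatternLike`, `glob_to_regex`, and `make_matcher` are hypothetical and not part of raghilda. The guide only states that `**` matches any number of path segments, so the single-`*` behavior here (matching within one segment) is an assumption.

```python
import re
from typing import Callable, Union

# Hypothetical alias: a pattern is either a wildcard string or a compiled regex.
PatternLike = Union[str, re.Pattern]


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a wildcard pattern into a regex.

    Assumed semantics: `**` matches any characters, including `/`;
    a single `*` matches within one path segment only.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            # Everything else is matched literally.
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def make_matcher(pattern: PatternLike) -> Callable[[str], bool]:
    """Return a URL predicate for a wildcard string or a compiled regex."""
    if isinstance(pattern, re.Pattern):
        # Compiled regexes keep regex semantics (unanchored search).
        return lambda url: pattern.search(url) is not None
    compiled = glob_to_regex(pattern)
    # Wildcard strings must match the whole URL.
    return lambda url: compiled.fullmatch(url) is not None


docs_only = make_matcher("https://example.com/docs/**")
versioned_api = make_matcher(re.compile(r"/api/v\d+/"))
```

With this shape, a crawler backend that matches locally could call the returned predicate, while a backend that forwards patterns to a remote API could pass wildcard strings through unchanged and reject or translate compiled regexes.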

Member Author

I agree that the APIs should be aligned. Will open an issue with a proposed interface.

Member Author

I opened #79.

Comment thread user_guide/52-cloudflare-crawler.qmd Outdated
```python
documents = crawler.markdown_documents(scope, cache_force_refresh=True)
```

The cache validates its entries against a signature that includes the `account_id=`, `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused.
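The signature check described above can be sketched roughly as follows. This is an illustration of the general technique (hashing the relevant settings and comparing digests), not raghilda's actual implementation. The function names, the key tuples, and the example setting values are all hypothetical. The key set is a parameter so that a setting such as `account_id` can be left out of the signature.

```python
import hashlib
import json

# Settings the guide text lists as part of the signature.
GUIDE_KEYS = ("account_id", "render", "source", "modified_since")
# Hypothetical variant that leaves account_id out of the key.
REDUCED_KEYS = ("render", "source", "modified_since")


def cache_signature(settings, keys):
    """Hash the selected settings into a stable hex digest."""
    subset = {key: settings.get(key) for key in keys}
    # sort_keys gives a deterministic serialization; default=str covers dates.
    payload = json.dumps(subset, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_is_valid(stored_signature, settings, keys):
    """A cache entry is reusable only if the signature still matches."""
    return stored_signature == cache_signature(settings, keys)


# Example values are placeholders, not real configuration.
first_run = {
    "account_id": "acct-1",
    "render": True,
    "source": "all",
    "modified_since": None,
}
stored = cache_signature(first_run, GUIDE_KEYS)

changed_render = dict(first_run, render=False)
print(cache_is_valid(stored, first_run, GUIDE_KEYS))       # True
print(cache_is_valid(stored, changed_render, GUIDE_KEYS))  # False
```

Under `REDUCED_KEYS`, two runs that differ only in `account_id` produce the same signature, so the cached results would be shared between accounts.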

Member

Hmm, we should probably omit account_id from the key used by the cache - that shouldn't matter.

Member Author

Opened #80 to address this. I will also change the guide text here to omit `account_id`.

Comment thread user_guide/52-cloudflare-crawler.qmd Outdated
Co-authored-by: Tomasz Kalinowski <tomasz@posit.co>
Comment thread user_guide/52-cloudflare-crawler.qmd Outdated
Co-authored-by: Tomasz Kalinowski <tomasz@posit.co>
@rich-iannone merged commit 991c649 into main on Jun 15, 2026
4 checks passed
@rich-iannone deleted the docs-chatlas-user-guide branch on June 15, 2026 19:38

2 participants