docs: add chatlas integration and CloudflareCrawler user guide articles #77
t-kalinowski
left a comment
These guides read great! They will be very helpful to the package.
I made some minor comments below (some of those comments should probably be converted to issues / todos).
| guide-section: "Store Backends" | ||
| --- | ||
| Some websites load their content entirely through JavaScript. A conventional HTTP request to such a site returns an empty shell or a loading spinner rather than the actual text. The `CloudflareCrawler` addresses this by delegating page fetching and rendering to [Cloudflare's Browser Rendering API](https://developers.cloudflare.com/browser-rendering/), which executes JavaScript and returns the fully rendered page content as Markdown. Because the conversion to Markdown happens on Cloudflare's servers, you receive ready-to-chunk text without needing to run a headless browser locally. |
This is one of the benefits of the Cloudflare backend, but I'm not sure it's the one to lead with. Cloudflare lets you offload the potentially very long-running crawling task, and possibly Markdown conversion, from the local host to Cloudflare's servers (for a fee).
I rewrote this opening to lead with the other, more compelling benefit.
| - JavaScript rendering: `CloudflareCrawler` executes JavaScript before extracting content. Single-page applications, dynamically-loaded documentation sites, and client-rendered dashboards all work without additional configuration. | ||
| - No local browser required: you do not need Playwright, Selenium, or any headless browser installed. | ||
| - Cloudflare handles link discovery: instead of parsing anchor tags from raw HTML (which may not exist until JavaScript runs), Cloudflare discovers links from the rendered DOM. | ||
| - Markdown arrives pre-converted: the API returns Markdown directly, so there is no local HTML-to-Markdown conversion step. |
I believe this is optional: users can easily choose to use any Markdown converter instead of the Cloudflare one.
Thanks! Clarified this.
| ## Browser rendering | ||
| The `render=` parameter controls whether Cloudflare executes JavaScript before extracting page content. It defaults to `True`, which is the right choice for any site that relies on client-side rendering. When enabled, Cloudflare loads each page in a headless browser, waits for scripts to finish, and extracts text from the fully populated DOM. |
This might be a good feature to add to WebCrawler.
| ## Filtering with Cloudflare-style patterns | ||
| The `include_patterns=` and `exclude_patterns=` fields on `CrawlScope` behave differently depending on which crawler you use. With `WebCrawler`, they are Python regular expressions matched locally. With `CloudflareCrawler`, they are forwarded directly to the Cloudflare API as Cloudflare wildcard patterns, where `**` matches any number of path segments: |
We can probably change the API to remove this difference: we can treat patterns as globs instead of regexes by default, and have behavior align between the different crawler backends. What do you think? For regexes, we can also optionally accept pre-compiled regexes in addition to a str.
I agree that the APIs should be aligned. Will open an issue with a proposed interface.
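To make the glob-by-default idea above concrete, here is a minimal sketch of how wildcard patterns could be matched locally so that both crawler backends interpret them the same way. The helper names `glob_to_regex`, `as_matcher`, and `url_matches` are hypothetical and not part of raghilda. The semantics are also an assumption: `**` matches across path segments (as the guide describes), `*` is assumed to stay within a single segment, and a pre-compiled regex is passed through unchanged.

```python
import re
from typing import Union

PatternLike = Union[str, "re.Pattern[str]"]


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard pattern into a compiled regex.

    Assumed semantics (illustrative only):
      `**` matches any characters, including `/`
      `*`  matches any characters except `/`
    Every other character is matched literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def as_matcher(pattern: PatternLike) -> "re.Pattern[str]":
    """Treat strings as globs; pass pre-compiled regexes through unchanged."""
    if isinstance(pattern, re.Pattern):
        return pattern
    return glob_to_regex(pattern)


def url_matches(url: str, pattern: PatternLike) -> bool:
    """Return True when the whole URL matches the pattern."""
    return as_matcher(pattern).fullmatch(url) is not None


if __name__ == "__main__":
    print(url_matches("https://example.com/docs/a/b", "https://example.com/docs/**"))
    print(url_matches("https://example.com/docs/a/b", "https://example.com/docs/*"))
```

With something along these lines, a string pattern would mean the same thing regardless of backend, and users who need full regex power could opt in explicitly by passing `re.compile(...)`.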
| documents = crawler.markdown_documents(scope, cache_force_refresh=True) | ||
| ``` | ||
|
|
||
| The cache validates its entries against a signature that includes the `account_id=`, `render=`, `source=`, and `modified_since=` settings. If any of these change between runs, the cache is automatically invalidated. This prevents stale results from a different configuration being reused. |
Hmm, we should probably omit account_id from the key used by the cache - that shouldn't matter.
Added #80 to address. Will also change the text here in the guide page to omit that.
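For reference, a minimal sketch of what a cache signature that leaves out `account_id` could look like. This is not the package's actual implementation: the function name `cache_signature`, the choice of JSON plus SHA-256, and the value types (a string or `None` for `modified_since`) are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Optional


def cache_signature(
    *,
    render: bool,
    source: str,
    modified_since: Optional[str],
    account_id: Optional[str] = None,
) -> str:
    """Hash only the settings that affect crawl results.

    `account_id` is accepted but deliberately ignored, so switching
    accounts would not invalidate an otherwise valid cache entry.
    """
    payload = {
        "render": render,
        "source": source,
        "modified_since": modified_since,
    }
    # sort_keys gives a stable serialization, so equal settings hash equally.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


if __name__ == "__main__":
    a = cache_signature(render=True, source="all", modified_since=None, account_id="acct-1")
    b = cache_signature(render=True, source="all", modified_since=None, account_id="acct-2")
    c = cache_signature(render=False, source="all", modified_since=None, account_id="acct-1")
    print(a == b)  # same crawl settings, different account
    print(a == c)  # different render setting
```

The values `"all"`, `"acct-1"`, and `"acct-2"` are placeholders, not documented settings.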
Co-authored-by: Tomasz Kalinowski <tomasz@posit.co>
This PR adds two user guide pages: (1) `05-chatlas-integration.qmd` and (2) `52-cloudflare-crawler.qmd`.
The first covers how to integrate raghilda's RAG capabilities with the chatlas package, including connecting to a knowledge store, defining and registering retrieval tools for LLMs, tailoring retrieval logic, and working with different model providers.
The second guide page documents the `CloudflareCrawler` for building stores from JavaScript-rendered sites via Cloudflare's Browser Rendering API. The guide covers credentials setup, browser rendering controls, page discovery options, Cloudflare-style filtering patterns, caching, incremental updates with `modified_since=`, polling configuration, and lower-level inspection methods. A full example script demonstrates both initial builds and incremental refreshes.