Currently, `include_patterns` and `exclude_patterns` in `CrawlScope` behave very differently depending on the crawler backend. `WebCrawler` interprets them as Python regular expressions, whereas the new `CloudflareCrawler` forwards the values to the Cloudflare API, which uses its own style of wildcard patterns (for example, `**` matches any number of path segments).
This is likely to confuse users, and it also forces them to write different pattern values when switching backends.
To fix this, we should unify the default pattern syntax to globs (like `*`, `**`) for both crawlers. Globs seem more intuitive for URL path matching anyway, and I think they are well understood. Under the hood, we can convert globs to regexes for `WebCrawler` and keep passing them to the Cloudflare API as is.
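As a rough sketch of what the conversion for `WebCrawler` could look like: the helper name `glob_to_regex` is hypothetical, and the semantics (`*` stays within one path segment, `**` crosses segments, everything else is literal) are my assumption and would need checking against what the Cloudflare API actually does.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Hypothetical helper: translate a URL glob into a compiled regex.

    Assumed semantics (to be verified against the Cloudflare API):
      *   matches any run of characters except "/"
      **  matches any run of characters, including "/"
    Every other character is matched literally.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")      # may cross path segments
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")   # stays inside one path segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


one_level = glob_to_regex("https://example.com/docs/*")
any_depth = glob_to_regex("https://example.com/docs/**")
```

With this sketch, `one_level.fullmatch(...)` accepts `https://example.com/docs/intro` but not `https://example.com/docs/a/b`, while `any_depth` accepts both. Whether patterns should be anchored with `fullmatch` or searched is a separate decision this sketch does not settle.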
In addition, we should support Python regexes for more advanced use cases by accepting pre-compiled regexes in addition to a `str`.
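One way the "`str` or pre-compiled regex" input could be handled is sketched below. The names `PatternLike`, `normalize_pattern`, and `_glob_to_regex` are hypothetical, and the glob semantics are the same assumption as above (`*` within a segment, `**` across segments).

```python
import re
from typing import Union

# Hypothetical alias: a pattern is either a glob string or a compiled regex.
PatternLike = Union[str, re.Pattern]


def _glob_to_regex(glob: str) -> re.Pattern:
    # Split on "**" and "*" while keeping the wildcards as tokens.
    tokens = re.split(r"(\*\*|\*)", glob)
    mapping = {"**": ".*", "*": "[^/]*"}
    # Wildcards are mapped to regex fragments; everything else is escaped.
    return re.compile("".join(mapping.get(t, re.escape(t)) for t in tokens))


def normalize_pattern(pattern: PatternLike) -> re.Pattern:
    """Return a compiled regex for either kind of input."""
    if isinstance(pattern, re.Pattern):
        return pattern              # advanced use: already a regex, use as is
    return _glob_to_regex(pattern)  # default: treat the string as a glob


year_archive = re.compile(r"/blog/\d{4}/")
same_object = normalize_pattern(year_archive) is year_archive
from_glob = normalize_pattern("/blog/*")
```

This sketch leaves one question open: what `CloudflareCrawler` should do when it receives a pre-compiled regex, since it only forwards values to the Cloudflare API.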
One thing to decide before starting this work is which glob-to-regex converter to use. Python has a built-in one (for example, `fnmatch.translate`), and I believe there are third-party libraries for this as well.
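To help with that choice, here is a quick check of how the standard library's `fnmatch.translate` behaves on URL-like input. The example paths are made up. As far as I can tell, it does not distinguish `*` from `**`, and it treats `?` as a single-character wildcard, which may clash with URL query strings.

```python
import fnmatch
import re

# fnmatch.translate turns a shell-style pattern into a regex string.
single_star = re.compile(fnmatch.translate("/docs/*"))

# "*" matches one path segment...
matches_one_segment = single_star.match("/docs/intro") is not None
# ...but it also matches across "/" boundaries, so it behaves like "**".
matches_two_segments = single_star.match("/docs/guides/intro") is not None

# "?" is a wildcard for any single character, not a literal "?".
query = re.compile(fnmatch.translate("/search?q=*"))
question_mark_is_wildcard = query.match("/searchXq=abc") is not None
```

All three values come out as `True`, so `fnmatch.translate` alone would probably not give the segment-aware `*` versus `**` behavior described above. Python 3.13 also added `glob.translate`, which might be a closer fit, but that depends on which Python versions we need to support.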