enh: simplify `CrawlScope` pattern API

Currently, `include_patterns` and `exclude_patterns` in `CrawlScope` behave very differently depending on the crawler backend. `WebCrawler` interprets them as Python regular expressions, but the new `CloudflareCrawler` forwards the values to the Cloudflare API, which uses its own style of wildcard patterns (for example, `**` matches any number of path segments).

This will likely be very confusing for users. It also forces them to rewrite their pattern values when switching backends.

To fix this, we should unify the default pattern syntax to globs (`*`, `**`) for both crawlers. Globs seem more intuitive for URL path matching anyway, and I think they are well understood. Under the hood, we can convert globs to regexes for `WebCrawler` and keep passing them to the Cloudflare API as we do now.
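To make the conversion step concrete, here is a minimal sketch of what a glob-to-regex helper for `WebCrawler` could look like. The name `glob_to_regex` is hypothetical, and the semantics are assumptions: `**` matches across `/`, while `*` and `?` stay within one path segment. Whether that matches the Cloudflare API's behavior exactly would need to be checked.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a glob into an anchored regex (hypothetical helper).

    Assumed semantics, not confirmed against the Cloudflare API:
    `**` matches any characters including `/`,
    `*` matches any characters except `/`,
    `?` matches exactly one character except `/`.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")      # crosses path segments
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")   # stays inside one segment
            i += 1
        elif glob[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            # Everything else is literal, so regex metacharacters are escaped.
            parts.append(re.escape(glob[i]))
            i += 1
    # \Z anchors the end so the whole URL has to match.
    return re.compile("".join(parts) + r"\Z")


single = glob_to_regex("https://example.com/docs/*")
deep = glob_to_regex("https://example.com/docs/**")
```

With these assumptions, `single` matches `https://example.com/docs/intro` but not `https://example.com/docs/guides/setup`, while `deep` matches both.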

We should also support Python regexes for more advanced use cases, by accepting pre-compiled regexes (`re.Pattern`) in addition to `str`.
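A rough sketch of how the `str` versus pre-compiled distinction could be handled on the `WebCrawler` side. `PatternLike` and `to_regex` are hypothetical names, and `fnmatch.translate` is only a stand-in for whichever converter we end up choosing.

```python
import re
from fnmatch import translate
from typing import Union

# Hypothetical alias: a pattern is either a glob string or a compiled regex.
PatternLike = Union[str, re.Pattern]


def to_regex(pattern: PatternLike) -> re.Pattern:
    """Normalize a pattern to a compiled regex (illustrative helper)."""
    if isinstance(pattern, re.Pattern):
        # Pre-compiled regexes pass through untouched for advanced use cases.
        return pattern
    # Plain strings are treated as globs; the converter here is a placeholder.
    return re.compile(translate(pattern))


glob_rule = to_regex("https://example.com/blog/*")
regex_rule = to_regex(re.compile(r"https://example\.com/blog/\d{4}/"))
```

The type of the value tells us which syntax the user meant, so no extra flag is needed. What `CloudflareCrawler` should do when it receives a compiled regex is left open in this sketch.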

One thing to decide before starting is which glob-to-regex converter to use. Python has a built-in one (`fnmatch.translate`), and I believe there are third-party libraries for this as well.
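One point that may matter for this choice: the standard library's `fnmatch.translate` is not path-aware, so `*` also matches across `/` and behaves the same as `**`. A quick check, using made-up example URLs:

```python
import fnmatch
import re

rule = re.compile(fnmatch.translate("https://example.com/docs/*"))

# fnmatch's `*` matches any characters, including `/`.
one_level = rule.match("https://example.com/docs/intro") is not None
nested = rule.match("https://example.com/docs/guides/setup") is not None

print(one_level, nested)  # prints: True True
```

If we want `*` to stop at a path segment and only `**` to cross segments, the built-in converter alone would not be enough.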

enh: simplify `CrawlScope` pattern API #79
