enh: simplify CrawlScope pattern API #79

@rich-iannone

Currently, include_patterns and exclude_patterns in CrawlScope behave very differently depending on the crawler backend. WebCrawler interprets them as Python regular expressions, but the new CloudflareCrawler forwards the values to the Cloudflare API, which uses its own style of wildcard patterns (for example, ** matches any number of path segments).

This will likely be very confusing for users. It also forces them to rewrite their pattern values when switching backends.

To improve this, we should unify the default pattern syntax on globs (such as * and **) for both crawlers. Globs seem more intuitive for URL path matching anyway, and I think they are well understood. Under the hood, we can convert globs to regexes for WebCrawler and keep passing them to the Cloudflare API as is.
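
To make the conversion concrete, here is a minimal sketch of what a glob-to-regex step for WebCrawler could look like. The helper name `glob_to_regex` is hypothetical, and the semantics are an assumption: `**` matches across path segments (as described above for Cloudflare), while a single `*` is assumed to stay within one segment.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL glob into a compiled regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")  # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # assumed: '*' stays within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return re.compile("".join(parts))


single = glob_to_regex("https://example.com/docs/*")
deep = glob_to_regex("https://example.com/docs/**")

# fullmatch() requires the whole URL to match the pattern
print(bool(single.fullmatch("https://example.com/docs/intro")))
print(bool(single.fullmatch("https://example.com/docs/a/b")))
print(bool(deep.fullmatch("https://example.com/docs/a/b")))
```

Whether matching should be anchored to the full URL or only to the path is a design choice this sketch does not settle.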

We should also support Python regexes for more advanced use cases, by accepting pre-compiled regexes in addition to a str.
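
A rough sketch of how the two accepted types could be told apart follows. `PatternLike` and `normalize_pattern` are hypothetical names, and `fnmatch.translate` is used only as a stand-in for whichever converter ends up being chosen.

```python
import fnmatch
import re
from typing import Union

# Hypothetical alias: a glob string or a pre-compiled regex
PatternLike = Union[str, re.Pattern]


def normalize_pattern(value: PatternLike) -> re.Pattern:
    """Return a compiled regex for a glob string or a pre-compiled regex."""
    if isinstance(value, re.Pattern):
        # Advanced use: a compiled regex is passed through untouched
        return value
    if isinstance(value, str):
        # Default: plain strings are treated as globs.
        # Stand-in converter only; the real one is still to be chosen.
        return re.compile(fnmatch.translate(value))
    raise TypeError(
        f"expected str or re.Pattern, got {type(value).__name__}"
    )


include = [normalize_pattern(p) for p in ("/docs/*", re.compile(r"/v\d+/.*"))]
print(all(isinstance(p, re.Pattern) for p in include))
```

Dispatching on the type keeps the common case (a glob string) simple while leaving an explicit opt-in for regexes.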

One consideration for getting this work underway is choosing a glob-to-regex converter utility. Python has a built-in one, and I believe there are third-party libraries for this as well.
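
One thing worth checking when evaluating candidates, assuming the built-in converter in question is the standard library's `fnmatch.translate`: it does not treat `/` as a separator, so `*` and `**` end up matching the same things. The helper `fnmatch_matches` below is just for illustration.

```python
import fnmatch
import re


def fnmatch_matches(glob: str, path: str) -> bool:
    """Check a path against a glob using fnmatch's regex translation."""
    return re.compile(fnmatch.translate(glob)).match(path) is not None


# fnmatch's '*' also matches '/', so both calls print True
print(fnmatch_matches("/docs/*", "/docs/a/b"))
print(fnmatch_matches("/docs/**", "/docs/a/b"))
```

If a single `*` is meant to stop at a path segment boundary, this converter would not be enough on its own.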
