Add TechRadar publisher #925
addie9800
left a comment
Thank you so much for contributing to Fundus! 🚀 This PR already looks really good. I only have a couple of minor changes to ask for, and then we can go ahead and merge this PR.
| domain="https://www.techradar.com/", | ||
| parser=TechRadarParser, | ||
| sources=[ | ||
| Sitemap("https://www.techradar.com/sitemap.xml", reverse=True), |
I would add a filter so that only sitemaps of the format https://www.techradar.com/sitemap-yyyy-mm.xml are crawled (you can use the sitemap_filter attribute and the regex_filter class). This reduces the load on the website, since the remaining sitemaps only contain pages that Fundus cannot parse anyway.
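A minimal sketch of the kind of pattern this suggestion implies, using only the standard library. The regex, the helper name `is_monthly_sitemap`, and the sample URLs other than the main sitemap are assumptions for illustration and are not taken from the PR. Before wiring a pattern like this into sitemap_filter, it is worth checking whether the Fundus filter drops or keeps the URLs it matches.

```python
import re

# Assumed pattern for the monthly sitemaps described in the comment
# (https://www.techradar.com/sitemap-yyyy-mm.xml); not taken from the PR.
MONTHLY_SITEMAP = re.compile(r"^https://www\.techradar\.com/sitemap-\d{4}-\d{2}\.xml$")


def is_monthly_sitemap(url: str) -> bool:
    """Return True only for URLs shaped like sitemap-yyyy-mm.xml."""
    return MONTHLY_SITEMAP.match(url) is not None


# Hypothetical sitemap URLs, used only to exercise the pattern.
candidates = [
    "https://www.techradar.com/sitemap.xml",
    "https://www.techradar.com/sitemap-2024-05.xml",
    "https://www.techradar.com/sitemap-deals.xml",
]

# Keep only the monthly sitemaps.
wanted = [url for url in candidates if is_monthly_sitemap(url)]
print(wanted)
```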
| "Those results, though, will likely be below the AI Overviews that already sit atop those classic results. If anything, Overviews may be even richer and more accurate thanks to the intelligent query guidance you received in the search box. Scrolling down below them might be pointless.", | ||
| "It doesn't take much imagination to envision a future in which the AI Overviews are your Google Search results, and there is nothing below because it's not as useful, or at least it doesn't \"speak\" to you in the same way the overviews do. They seem to get you because they're designed to respond to your intention in a way that traditional search results could never do.", | ||
| "For some, this is progress. For me? The jury's still out.", | ||
| "What about you? Share your thoughts on Google's new Intelligent Search Box in the comments below." |
I would suggest also adding lines like this to the bloat regex.
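As a rough illustration of what such a bloat pattern could look like, the sketch below filters call-to-action paragraphs with the standard library. The pattern and the names `BLOAT` and `strip_bloat` are hypothetical, and how the parser in this PR actually applies its bloat regex is not shown here.

```python
import re
from typing import List

# Hypothetical bloat pattern: matches "share your thoughts ... in the comments below"
# call-to-action lines like the last paragraph of the test case above.
BLOAT = re.compile(r"(?i)share your thoughts .* in the comments below")


def strip_bloat(paragraphs: List[str]) -> List[str]:
    """Drop every paragraph that matches the bloat pattern."""
    return [p for p in paragraphs if not BLOAT.search(p)]


# Two paragraphs copied from the test data quoted in the review.
paragraphs = [
    "For some, this is progress. For me? The jury's still out.",
    "What about you? Share your thoughts on Google's new Intelligent Search Box in the comments below.",
]

print(strip_bloat(paragraphs))
```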
| upper_boundary_selector=XPath("//article"), | ||
| image_selector=XPath("//article//figure//img"), | ||
| caption_selector=XPath("./ancestor::figure//figcaption"), | ||
| author_selector=re.compile(r"(?i)image credit[s]?: (?P<credits>.*)"), |
The credits sometimes seem to end in a slash (/). Perhaps you can add an optional slash to the end of the (non-selecting) part of the regex so that it gets filtered out, e.g. here.
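One possible way to do this, sketched with the standard library: make the credits group lazy and let an optional trailing slash be consumed outside the group. The adjusted pattern, the helper `extract_credits`, and the sample captions are assumptions for illustration. Whether Fundus applies author_selector with a search or a full match is not confirmed here.

```python
import re
from typing import Optional

# Variant of the author_selector regex from the diff: the credits group is lazy,
# and an optional trailing slash (with surrounding whitespace) is matched outside it.
CREDIT = re.compile(r"(?i)image credit[s]?: (?P<credits>.*?)\s*/?\s*$")


def extract_credits(caption: str) -> Optional[str]:
    """Return the credits without a trailing slash, or None if no credit is found."""
    match = CREDIT.search(caption)
    return match.group("credits") if match else None


# Hypothetical captions, used only to exercise the pattern.
print(extract_credits("Image credit: Future /"))
print(extract_credits("Image credits: Shutterstock / Future"))
print(extract_credits("A phone on a desk"))
```

A slash between two names is kept, because only a slash at the very end of the caption can be matched by the optional part.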
| return self.precomputed.ld.bf_search("headline") | ||
|
|
||
| @attribute | ||
| def topics(self) -> List[str]: |
The list of topics at the end of an article sometimes seems to be more comprehensive than the ones used in the metadata, e.g. here. If you have a selector for the elements, you can pass that into generic_nodes_to_text.
Summary
Validation
python -m pytest -q -W ignore::DeprecationWarning -> 795 passed
ruff format --check src -> 231 files already formatted
ruff check src -> All checks passed!
mypy src/fundus/publishers/uk/techradar.py src/fundus/publishers/uk/__init__.py -> Success: no issues found in 2 source files
Note: full local mypy src was also checked on Python 3.12 and reports three existing errors outside this patch (src/fundus/parser/utility.py and src/fundus/publishers/kr/hankook_ilbo.py). I left those unrelated files untouched.