Add TechRadar publisher #925

Open
maelrx wants to merge 1 commit into flairNLP:master from maelrx:add-techradar-publisher

@maelrx maelrx commented May 21, 2026

Summary

  • Add a UK TechRadar publisher parser.
  • Register TechRadar with sitemap and news sitemap sources plus a URL filter for noisy paths.
  • Extract article body, title, authors, publishing date, topics, and images.
  • Add generated parser test data and update the supported publishers table.

Validation

  • python -m pytest -q -W ignore::DeprecationWarning -> 795 passed
  • ruff format --check src -> 231 files already formatted
  • ruff check src -> All checks passed!
  • mypy src/fundus/publishers/uk/techradar.py src/fundus/publishers/uk/__init__.py -> Success: no issues found in 2 source files

Note: I also ran a full local mypy src on Python 3.12. It reports three existing errors outside this patch (in src/fundus/parser/utility.py and src/fundus/publishers/kr/hankook_ilbo.py). I left those unrelated files untouched.

@addie9800 addie9800 self-assigned this May 21, 2026
Collaborator

@addie9800 addie9800 left a comment

Thank you so much for contributing to Fundus! 🚀 This PR already looks really good. I only have a couple of minor changes I would ask you to make, and then we can go ahead and merge this PR.

domain="https://www.techradar.com/",
parser=TechRadarParser,
sources=[
Sitemap("https://www.techradar.com/sitemap.xml", reverse=True),

I would add a filter so that only sitemaps of the format https://www.techradar.com/sitemap-yyyy-mm.xml are crawled. (You can use the sitemap_filter attribute and the regex_filter class.) This reduces the load on the website, since the remaining sitemaps only contain pages that Fundus cannot parse anyway.
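
A minimal sketch of the pattern such a filter could use, written with only the standard re module. The helper name is_monthly_sitemap is hypothetical, and how the pattern is passed to sitemap_filter and regex_filter (including whether a match means keep or discard) is not shown in this thread, so that part should be checked against Fundus before wiring it in.

```python
import re

# Matches only monthly sitemaps such as https://www.techradar.com/sitemap-2026-05.xml.
# Assumption: "yyyy-mm" means a four-digit year followed by a two-digit month.
MONTHLY_SITEMAP = re.compile(
    r"^https://www\.techradar\.com/sitemap-\d{4}-(0[1-9]|1[0-2])\.xml$"
)


def is_monthly_sitemap(url: str) -> bool:
    """Return True if the URL looks like a sitemap-yyyy-mm.xml sitemap."""
    return MONTHLY_SITEMAP.match(url) is not None
```

With this pattern the sitemap index itself (sitemap.xml) does not match, so only the dated sitemaps would be selected.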

"Those results, though, will likely be below the AI Overviews that already sit atop those classic results. If anything, Overviews may be even richer and more accurate thanks to the intelligent query guidance you received in the search box. Scrolling down below them might be pointless.",
"It doesn't take much imagination to envision a future in which the AI Overviews are your Google Search results, and there is nothing below because it's not as useful, or at least it doesn't \"speak\" to you in the same way the overviews do. They seem to get you because they're designed to respond to your intention in a way that traditional search results could never do.",
"For some, this is progress. For me? The jury's still out.",
"What about you? Share your thoughts on Google's new Intelligent Search Box in the comments below."

I would suggest also adding lines like this to the bloat regex.
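
As an illustration, assuming the comment refers to the closing call-to-action paragraph in the test data above, a pattern along these lines would match it. The name CALL_TO_ACTION is hypothetical, and where the parser keeps its bloat regex is not shown in this excerpt.

```python
import re

# Hypothetical pattern for reader call-to-action paragraphs at the end of an article.
CALL_TO_ACTION = re.compile(r"(?i)^what about you\?.*\bin the comments below\.?$")

paragraphs = [
    "For some, this is progress. For me? The jury's still out.",
    "What about you? Share your thoughts on Google's new Intelligent Search Box in the comments below.",
]

# Keep only paragraphs that do not match the bloat pattern.
cleaned = [p for p in paragraphs if not CALL_TO_ACTION.search(p)]
```

Anchoring the pattern at both ends keeps it from removing ordinary paragraphs that merely mention the comments section.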

upper_boundary_selector=XPath("//article"),
image_selector=XPath("//article//figure//img"),
caption_selector=XPath("./ancestor::figure//figcaption"),
author_selector=re.compile(r"(?i)image credit[s]?: (?P<credits>.*)"),

The credits sometimes seem to end in /. Perhaps you can add an optional slash to the end of the (non-selecting) part of the regex so that it gets filtered out, e.g. here.
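
One possible way to do this, sketched with the standard re module only. The pattern is a variant of the author_selector shown above; the helper extract_credits and the sample captions are hypothetical, and how Fundus applies the pattern internally is not covered here.

```python
import re
from typing import Optional

# The lazy "credits" group plus the trailing "\s*/?\s*$" keeps an optional
# final slash (and surrounding whitespace) out of the captured credits.
AUTHOR_PATTERN = re.compile(r"(?i)image credits?: (?P<credits>.*?)\s*/?\s*$")


def extract_credits(caption: str) -> Optional[str]:
    """Return the credited name(s) from a caption, or None if there is no credit."""
    match = AUTHOR_PATTERN.search(caption)
    return match.group("credits") if match else None
```

Slashes inside the credits, as in "Getty Images / Future", are left alone because only a slash directly before the end of the string is excluded.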

return self.precomputed.ld.bf_search("headline")

@attribute
def topics(self) -> List[str]:

The list of topics at the end of an article sometimes seems to be more comprehensive than the one used in the metadata, e.g. here. If you have a selector for the elements, you can pass that into generic_nodes_to_text.
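
A rough sketch of the idea, using only the standard library and made-up markup: select the topic elements with a path expression and turn them into a list of strings. The tags class, the snippet, and topics_from_tag_list are all hypothetical; in the parser itself the selected nodes would be passed to generic_nodes_to_text as suggested, which is not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, simplified markup for a tag list at the end of an article.
SNIPPET = """
<article>
  <p>Body text.</p>
  <ul class="tags">
    <li><a href="/tag/google">Google</a></li>
    <li><a href="/tag/search">Search</a></li>
    <li><a href="/tag/ai">AI</a></li>
  </ul>
</article>
"""


def topics_from_tag_list(markup: str) -> List[str]:
    """Collect the text of every link inside the (hypothetical) tag list."""
    root = ET.fromstring(markup)
    nodes = root.findall(".//ul[@class='tags']/li/a")
    return [node.text.strip() for node in nodes if node.text and node.text.strip()]


topics = topics_from_tag_list(SNIPPET)
```

The real selector depends on TechRadar's actual markup, which would need to be inspected on a live article page.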
