
Update robots.txt to disallow bots #430

Open

jetpham wants to merge 1 commit into master from jet/robots_txt

Conversation

@jetpham (Member) commented Feb 16, 2026

Currently, the robots.txt only blocks scraping and indexing of our 86 page:

User-agent: *
Disallow: /wiki/86
Disallow: /86
Disallow: /index.php?page=86
Noindex: /wiki/86
Noindex: /86
Noindex: /index.php?page=86

This PR blocks all user agents by default:

User-agent: *
Disallow: /

But it still allows some user agents, namely search engines and AI tools, to search the website:


User-agent: Googlebot
Allow: /
Disallow: /wiki/Special:
Disallow: /wiki/86
Disallow: /index.php?
Disallow: /api.php

User-agent: Bingbot
Allow: /
Disallow: /wiki/Special:
Disallow: /wiki/86
Disallow: /index.php?
Disallow: /api.php

...

But these agents are still blocked from our 86 page.

@jetpham requested review from ElanHR, mcint and nthmost February 16, 2026 23:02
@jetpham self-assigned this Feb 16, 2026
@SuperQ (Collaborator) left a comment


We intentionally want to allow well-behaved search engines like Google/Bing/etc.

Blocking all search engines isn't needed or wanted.

Re-reading this a bit, I'm not sure it's useful to explicitly list UAs like this. Any non-compliant scraper is not going to respect the robots.txt anyway. I think we should stick to a single UA policy.
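
For illustration, a single-UA policy might collapse the per-bot sections into one wildcard block; this sketch simply reuses the disallow rules already proposed in this PR:

# Sketch: one wildcard policy reusing the PR's disallow rules
User-agent: *
Disallow: /wiki/Special:
Disallow: /wiki/86
Disallow: /index.php?
Disallow: /api.php

Compliant crawlers would then get identical rules regardless of their declared name, and non-compliant ones ignore the file either way.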

@mcint (Contributor) commented Feb 17, 2026

  1. It's better to devote this energy to Caddyfile improvements based on declared User-Agents. We can rate limit them, preferentially shed their load early, and apply other behavior keyed additionally to IP address range, among other signals (see the sketch after this list).

  2. Check out https://wikipedia.org/robots.txt. Confirm the same for mediawiki.org and commons.wikimedia.org too.

I don't know if all Special: pages should be blocked (does that block search?), and I'm not sure about the prefix syntax, nor whether most parsers respect it.
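
For illustration, a minimal Caddyfile sketch of matching on declared User-Agents; the site address, backend port, and bot names here are assumptions, and actual rate limiting or load shedding would need a third-party Caddy module, so this sketch just rejects matched agents outright:

# A minimal sketch, not the project's actual Caddyfile.
# The site address, backend port, and bot names are assumptions.
example.org {
	# Named matcher: User-Agent header matching a few example crawlers
	@scrapers header_regexp ua User-Agent (?i)(GPTBot|CCBot|Bytespider)

	# Reject matched agents early; rate limiting instead of a hard 403
	# would require a third-party module such as a rate-limit plugin.
	respond @scrapers "Crawling not permitted" 403

	reverse_proxy localhost:8080
}

Unlike robots.txt, this is enforced server-side, so it also applies to agents that ignore crawler conventions.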

