robots.txt Generator — with AI-crawler directives

Generate a production-ready robots.txt in your browser. The generator includes 2026 AI-crawler directives (GPTBot, ClaudeBot, Google-Extended, anthropic-ai, PerplexityBot) plus blocks for the SEO scrapers (AhrefsBot, SemrushBot, MJ12bot, DotBot, BLEXBot) that inflate your bandwidth bill.

Use cases

Why a 2026 robots.txt matters

A good robots.txt is one of the cheapest SEO wins. A bad one can cost you rankings, forfeit your AI-training opt-out, and leave you with a bandwidth bill inflated by SEO scrapers.

Block AI scrapers from training

Opt out of GPTBot, Google-Extended, anthropic-ai and ClaudeBot training without affecting your Google Search visibility.

Allow Google but block AI

Keep Googlebot indexing your site for search while opting out of Gemini training via Google-Extended.

Cut bandwidth costs

AhrefsBot, SemrushBot, MJ12bot, DotBot and BLEXBot can account for 20-40% of crawler traffic. Block them if you don't use those tools.

Hide admin paths

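Tell compliant crawlers to skip /wp-admin/, /api/internal/ and /staging/. A Disallow rule stops crawling, not indexing, and the file itself is public, so use authentication or a noindex header for anything that must stay private.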

Whitelist specific bots

Allow Googlebot, Bingbot, DuckDuckBot — block everyone else. Useful for content sites that don't want syndication scrapers.

WordPress, Ghost & static sites

Presets cover the most common stacks — start with sensible defaults, customize in seconds.
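
The "whitelist specific bots" use case above can be sketched as a short file plus a quick check. This is a minimal illustration, not the generator's actual output: it uses Python's standard-library urllib.robotparser, which is only an approximation of how real crawlers parse the file, and yourdomain.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Three named search crawlers share one group; every other agent falls
# through to the catch-all group and is disallowed.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Allow: /

User-agent: *
Disallow: /
"""

whitelist_parser = RobotFileParser()
whitelist_parser.parse(WHITELIST_ROBOTS_TXT.splitlines())

page = "https://yourdomain.com/articles/"  # placeholder URL
googlebot_ok = whitelist_parser.can_fetch("Googlebot", page)  # True
ahrefs_ok = whitelist_parser.can_fetch("AhrefsBot", page)     # False
```

A crawler that ignores robots.txt is unaffected by this file, so the whitelist only filters compliant bots.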

Step by step

How to build your robots.txt

1. Pick a preset (or start from scratch)

The "Allow AI + Block SEO scrapers" preset is a safe modern default. Or start blank and add rules one user-agent at a time.

2. Tune user-agents and paths

Each user-agent block accepts a list of Allow and Disallow paths plus an optional Crawl-delay. The output updates live as you type.

3. Add your sitemap URL

Append Sitemap: https://yourdomain.com/sitemap.xml — Google, Bing, Yandex and DuckDuckGo all auto-discover sitemaps from robots.txt.

4. Upload to your root

Save as robots.txt and serve it at https://yourdomain.com/robots.txt with Content-Type: text/plain. That's it.
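
The steps above amount to turning a list of user-agent groups into text. The sketch below shows one way that could look; the Group class and render_robots_txt function are hypothetical names chosen for illustration and are not the generator's own code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Group:
    """One User-agent block (hypothetical structure, for illustration only)."""
    agents: List[str]
    allow: List[str] = field(default_factory=list)
    disallow: List[str] = field(default_factory=list)
    crawl_delay: Optional[int] = None


def render_robots_txt(groups, sitemaps=()):
    """Render groups and sitemap URLs as robots.txt text."""
    lines = []
    for group in groups:
        lines += [f"User-agent: {agent}" for agent in group.agents]
        lines += [f"Allow: {path}" for path in group.allow]
        lines += [f"Disallow: {path}" for path in group.disallow]
        if group.crawl_delay is not None:
            lines.append(f"Crawl-delay: {group.crawl_delay}")
        lines.append("")  # a blank line separates groups
    lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines).rstrip("\n") + "\n"


robots_txt = render_robots_txt(
    [
        Group(agents=["GPTBot"], disallow=["/"]),
        Group(agents=["*"], allow=["/"], disallow=["/wp-admin/"]),
    ],
    sitemaps=["https://yourdomain.com/sitemap.xml"],
)
```

Saving the returned string as robots.txt at the site root and serving it as text/plain covers step 4.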

FAQ

Frequently asked questions

robots.txt is a plain-text file at the root of a website (https://yourdomain.com/robots.txt) that tells web crawlers which paths they may or may not request. It was introduced in 1994 by Martijn Koster as the Robots Exclusion Protocol and formalized as RFC 9309 in 2022. Compliance is voluntary — well-behaved crawlers like Googlebot, Bingbot, GPTBot and ClaudeBot honor it; malicious scrapers ignore it.

To block AI crawlers, add a User-agent block per crawler with Disallow: /. The 2026 AI crawler set you typically want to control is: GPTBot (OpenAI training), ChatGPT-User (live ChatGPT browsing), Google-Extended (training for Gemini, formerly Bard, separate from Googlebot), ClaudeBot and anthropic-ai (Anthropic), PerplexityBot (Perplexity), Amazonbot, Applebot-Extended, Bytespider, CCBot (Common Crawl). Blocking Google-Extended does not affect your Google Search ranking, because Google-Extended and Googlebot are separate user-agent tokens.
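
As a rough sketch of that pattern, the snippet below builds one block per crawler from the list above and then checks the result with Python's standard-library urllib.robotparser. The parser stands in for a compliant crawler and is an assumption of this example; yourdomain.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Agent tokens taken from the list above.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Amazonbot", "Applebot-Extended", "Bytespider", "CCBot",
]

# One "User-agent / Disallow: /" block per crawler, separated by blank lines.
ai_block_txt = "\n".join(f"User-agent: {agent}\nDisallow: /\n" for agent in AI_AGENTS)

ai_parser = RobotFileParser()
ai_parser.parse(ai_block_txt.splitlines())

post = "https://yourdomain.com/blog/post"  # placeholder URL
gptbot_allowed = ai_parser.can_fetch("GPTBot", post)        # False
googlebot_allowed = ai_parser.can_fetch("Googlebot", post)  # True: no group names it
```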

robots.txt does not protect content from scrapers. It is a courtesy directive, not a security mechanism. The file itself is publicly readable, and malicious bots routinely ignore it (or use it as a roadmap to your sensitive paths). For real protection use authentication, rate limiting, or WAF rules, or remove the pages entirely. Use robots.txt only to manage crawl budget and to opt out of training by compliant AI vendors.

Disallow tells the crawler not to fetch the specified path; Allow explicitly permits a sub-path that would otherwise be blocked by a broader Disallow. Example: Disallow: /admin/ blocks the whole admin folder; Allow: /admin/public/ then re-permits one subfolder. The most specific (longest) matching rule wins, regardless of the order in which the rules appear. Allow was absent from the original 1994 convention but is part of RFC 9309 and is supported by every major crawler, including Bingbot and GPTBot.
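
The "most specific rule wins" behavior can be sketched in a few lines. This is a simplified illustration with a hypothetical is_allowed helper: it handles plain path prefixes only (no * or $ wildcards) and is written by hand because Python's built-in urllib.robotparser applies rules in file order rather than by longest match.

```python
def is_allowed(rules, path):
    """Return True if `path` may be fetched under longest-match precedence.

    `rules` is a list of ("allow" | "disallow", prefix) pairs for a single
    user-agent group. Wildcards are deliberately not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are ignored
        is_allow = kind == "allow"
        # The longer prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


admin_rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
settings_ok = is_allowed(admin_rules, "/admin/settings")     # False
logo_ok = is_allowed(admin_rules, "/admin/public/logo.png")  # True
blog_ok = is_allowed(admin_rules, "/blog/")                  # True
```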

About

About robots.txt and the 2026 crawler ecosystem

The Robots Exclusion Protocol was invented in 1994 by Martijn Koster after his web server was overwhelmed by a misbehaving crawler. For nearly thirty years it was a community convention; in September 2022 the IETF published it as RFC 9309, a Standards Track specification. The grammar is intentionally tiny: the RFC defines User-agent, Allow and Disallow, and crawlers widely support two extensions on top of it, Sitemap and an optional Crawl-delay field that not all crawlers honor.

The 2026 AI-crawler landscape

Where there used to be Googlebot and Bingbot, there are now dozens of agents. The major AI vendors split their crawlers in two: one for live retrieval and one for model training. OpenAI uses GPTBot for training and ChatGPT-User for live browsing. Anthropic uses ClaudeBot for general crawling and anthropic-ai as a legacy training agent. Google's Google-Extended is a control token rather than a separate crawler: it lets you opt out of Gemini training while Googlebot continues to crawl your site for Google Search.
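
Because training and live retrieval use different tokens, a file can refuse one and permit the other. The sketch below is one possible policy, not a recommendation, checked with Python's standard-library urllib.robotparser as a stand-in for a compliant crawler; yourdomain.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Opt out of training (GPTBot, Google-Extended) but keep live browsing
# (ChatGPT-User) and ordinary search crawling (Googlebot) open.
SPLIT_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

split_parser = RobotFileParser()
split_parser.parse(SPLIT_ROBOTS_TXT.splitlines())

guide = "https://yourdomain.com/guides/"  # placeholder URL
training_ok = split_parser.can_fetch("GPTBot", guide)        # False
browsing_ok = split_parser.can_fetch("ChatGPT-User", guide)  # True
search_ok = split_parser.can_fetch("Googlebot", guide)       # True
```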

robots.txt is not security

This bears repeating because content sites still get caught out: a malicious scraper will not respect your robots.txt. If you have content that genuinely cannot be public, use authentication, IP-based rate limits, or a WAF. robots.txt should be treated as crawl-budget management plus a polite "please don't train on me" notice to compliant AI vendors.

Sitemap directive

The Sitemap: directive is auto-discovered by every major search engine. It is conventionally placed at the bottom of robots.txt, but it is independent of the user-agent groups and can appear anywhere in the file. The sitemap it points to does not need to live on the same domain. You can list multiple sitemaps (one per line). Adding it is strictly better than not adding it, even when you also submit through Google Search Console.
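
To see how a consumer might discover those lines, the sketch below reads them back with RobotFileParser.site_maps() from Python's standard library (available since Python 3.8). Both URLs are placeholders, and the second host is an assumed example of a sitemap served from a different domain.

```python
from urllib.robotparser import RobotFileParser

SITEMAP_ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://cdn.example.net/sitemaps/news.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(SITEMAP_ROBOTS_TXT.splitlines())

# A list of URLs in file order, or None when the file has no Sitemap lines.
sitemaps = sitemap_parser.site_maps()
```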

Crawl-delay quirks

Crawl-delay is unofficial: it is not part of RFC 9309. Bingbot and BLEXBot honor it; Googlebot ignores it entirely, and Yandex now manages crawl rate through its webmaster tools rather than this directive. Search Console's crawl rate limiter was retired in January 2024, so if you need to throttle Googlebot, temporarily return 500, 503 or 429 responses; the directive in robots.txt has no effect.
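
The sketch below shows how a per-agent Crawl-delay reads back through Python's standard-library urllib.robotparser. It demonstrates only what this parser reports for a given token; whether a real crawler obeys the value is up to that crawler, and ExampleBot is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

DELAY_ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 5
Disallow: /search/

User-agent: *
Disallow: /search/
"""

delay_parser = RobotFileParser()
delay_parser.parse(DELAY_ROBOTS_TXT.splitlines())

bing_delay = delay_parser.crawl_delay("Bingbot")      # 5 (seconds between requests)
other_delay = delay_parser.crawl_delay("ExampleBot")  # None: the catch-all group sets no delay
```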
