Robots.txt for AI crawlers: a practical guide to GPTBot, ClaudeBot, PerplexityBot, and the rest
Most robots.txt files don't address AI crawlers at all. Here's a clear, no-nonsense guide to which AI bots exist, what they want, and what your robots.txt should say to each one.
If you look at robots.txt files across the modern web, you'll find most of them haven't been seriously updated since 2018. They address Googlebot and Bingbot. They block crawlers from /admin/. They reference a sitemap. That's about it.
Meanwhile, AI crawlers are reading your site every day — GPTBot, ClaudeBot, PerplexityBot, GrokBot, Google-Extended, and a dozen others. Each one wants different things. Each one will assume defaults if you don't address it explicitly.
This guide explains what each AI crawler does, what you should consider when handling it, and provides a working robots.txt template you can adapt.
The crawlers that matter
There are roughly 21 AI-related crawlers actively scanning websites in 2026. Here are the ones worth understanding:
OpenAI bots
- GPTBot — OpenAI's training crawler. Visits your site to collect data for training future GPT models. If you don't want OpenAI to train on your content, block this.
- ChatGPT-User — Fetches pages on-demand when a ChatGPT user asks something that requires browsing. This is NOT training — it's real-time browsing. Blocking it means ChatGPT can't browse your site for users.
- OAI-SearchBot — Powers OpenAI's search functionality. Newer crawler, increasingly important.
Anthropic bots
- ClaudeBot — Anthropic's training crawler. Like GPTBot, used for model training.
- Claude-User / Claude-Web — On-demand fetching for Claude users browsing in real-time.
- anthropic-ai — Legacy user agent, still used in some contexts.
Perplexity bots
- PerplexityBot — Crawls for Perplexity's search index. Critical for being cited in Perplexity answers.
- Perplexity-User — Real-time fetcher for live Perplexity queries.
Google AI bots
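- Google-Extended — Controls whether Google may use your content to train and ground its Gemini models. It is a robots.txt control token rather than a separate crawler: the fetching is done by Google's existing crawlers. Note: this is SEPARATE from Googlebot. Blocking Google-Extended doesn't affect search rankings, and it doesn't remove you from AI Overviews, which are part of Search and governed by Googlebot.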
Other significant bots
- GrokBot — xAI's crawler for Grok
- Applebot-Extended — Apple Intelligence's training crawler (separate from Applebot)
- Bingbot — Microsoft's crawler, also feeds Copilot
- DuckAssistBot — DuckDuckGo's AI assistant
- Meta-ExternalAgent — Meta's AI crawler
- Cohere-AI — Cohere's training crawler
- Mistral-AI — Mistral's crawler
- You-Bot — You.com's crawler
- Amazonbot — Amazon's AI crawler (used in Rufus, Alexa)
- CCBot — Common Crawl, used as training data by many AI companies
What should you do?
There are three strategies, and the right one depends on your business.
Strategy A: Be visible — recommended for most businesses
If you sell products or services, publish content, or have any business reason to be cited by AI engines, you want AI crawlers to read your site.
The reasoning: if AI engines never see your content, they can never cite you. Being cited drives discovery, brand awareness, and increasingly, sales. Blocking AI crawlers to protect your content typically loses you more visibility than it preserves value.
Allow most crawlers, block sensitive areas only.
Strategy B: Block AI training, allow real-time fetching
If you're a publisher with original content (news site, research firm, premium publication), you might want to allow AI engines to fetch and cite your content in real-time, but not train on it.
The trade-off: this is a finer-grained position that not all crawlers respect. Some bots ignore robots.txt entirely. Some interpret directives inconsistently. Strategy B is theoretically cleaner but practically harder.
Strategy C: Block AI entirely
If you have strong reasons to keep your content out of AI systems (legal, contractual, or competitive), block AI crawlers comprehensively. Accept that you'll be invisible to AI search and answer engines.
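Strategies B and C differ only in which named groups receive a blanket Disallow. The following is a minimal Python sketch of that difference, not a definitive generator. The bot grouping follows the descriptions earlier in this guide, the paths are hypothetical, and the lists are deliberately incomplete, so extend them before relying on the output.

```python
# Grouping follows the crawler descriptions earlier in this guide (an assumption;
# adjust it to your own reading of each vendor's documentation).
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"]
FETCH_AND_SEARCH_BOTS = ["ChatGPT-User", "OAI-SearchBot", "Claude-User",
                         "PerplexityBot", "Perplexity-User"]
PRIVATE_PATHS = ["/api/", "/admin/"]  # hypothetical paths


def group(agent, disallowed_paths):
    """Return one robots.txt group: a User-agent line plus its Disallow lines."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines)


def render_robots(block_realtime=False):
    """Render Strategy B (block_realtime=False) or Strategy C (block_realtime=True)."""
    groups = [group("*", PRIVATE_PATHS)]
    # Training crawlers are blocked outright in both strategies.
    for bot in TRAINING_BOTS:
        groups.append(group(bot, ["/"]))
    # Real-time and search bots are blocked only in Strategy C.
    for bot in FETCH_AND_SEARCH_BOTS:
        groups.append(group(bot, ["/"] if block_realtime else PRIVATE_PATHS))
    # Blank lines separate the groups.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots(block_realtime=False))
```

Calling render_robots(block_realtime=True) produces the Strategy C variant. Either way the file remains a request, and crawlers that ignore robots.txt will ignore this output too.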
A working robots.txt template
Here's a template that follows Strategy A — visible to AI engines, with reasonable defaults. Edit to match your situation:
# robots.txt
# https://yoursite.com/robots.txt

# Default: allow everything that's not explicitly disallowed
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /account/
Disallow: /auth/

# Allow AI crawlers explicitly — important enough to call out by name
User-agent: GPTBot
Allow: /
Disallow: /api/
Disallow: /admin/

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /
Disallow: /api/
Disallow: /admin/

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /
Disallow: /api/

User-agent: Perplexity-User
Allow: /

User-agent: GrokBot
Allow: /
Disallow: /api/

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bingbot
Allow: /

User-agent: CCBot
Disallow: /

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml
A few notes on this template:
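- The default User-agent: * group allows most crawlers but blocks admin/API paths. Reasonable for almost every site.
- AI bots are addressed by name so the defaults don't apply to them. This matters in both directions: a crawler that finds a group with its own name follows that group and ignores the * group entirely, so any Disallow lines you still want must be repeated inside the named group. In this template, a group that contains only Allow: / leaves /admin/ and /api/ open to that bot.
- CCBot is blocked in this template because Common Crawl data is used by dozens of AI companies whose individual bots you can't enumerate. Blocking CCBot is a reasonable middle position. If you want broader AI visibility, allow it.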
- Real-time fetchers (ChatGPT-User, Claude-User, Perplexity-User) are unrestricted because they only visit when a user asks for your content. Blocking them just makes you invisible to live AI browsing.
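To see how named groups and the * group interact, you can parse a file with Python's standard urllib.robotparser. This is a rough sketch: ROBOTS_TXT is a trimmed, hypothetical variant of the template, SomeOtherBot is a made-up name, and a real crawler's matching logic may differ in detail from this parser's.

```python
from urllib.robotparser import RobotFileParser

# Trimmed, hypothetical variant of the template above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /admin/
"""


def allowed(robots_txt, agent, path):
    """Return True if this parser would let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("SomeOtherBot", "ChatGPT-User", "GPTBot"):
        print(agent, "/admin/", allowed(ROBOTS_TXT, agent, "/admin/"))
```

In this file the unnamed bot falls back to the * group and is refused /admin/, while ChatGPT-User is allowed in because its own group carries no Disallow line. GPTBot is refused only because its group repeats the rule.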
Things people get wrong
1. Treating Google-Extended like Googlebot. They're different. Google-Extended controls whether your content is used for Gemini training and grounding. Googlebot controls search indexing, which includes AI Overviews. Blocking Google-Extended doesn't affect your search rankings.
2. Forgetting that AI bots don't strictly follow standards. Some crawlers ignore robots.txt entirely. Some respect only partial rules. Don't assume blocking a bot in robots.txt means it actually stops crawling — for sensitive content, use authentication or firewalls.
3. Blocking too aggressively then complaining about invisibility. Blocking GPTBot only opts you out of training, but if you also block OAI-SearchBot and ChatGPT-User, ChatGPT has no way to fetch or cite your pages. Many sites that complain "AI doesn't know about us" have disallowed the crawlers themselves.
4. Not updating for years. New AI crawlers appear every quarter. A robots.txt from 2023 doesn't address GrokBot, OAI-SearchBot, or DuckAssistBot. Review yours every 6 months.
5. Using bot blocking for content protection. robots.txt is a polite request, not an enforcement mechanism. If your content is genuinely sensitive, don't rely on robots.txt — use authentication.
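For points 2 and 5, enforcement has to happen on the server. As one illustration, here is a minimal WSGI middleware sketch that refuses requests whose User-Agent header contains a blocked token. The names block_bots, is_blocked_agent, and BLOCKED_TOKENS are hypothetical, and the choice of bots is arbitrary. User-agent strings are trivially spoofed, so this only stops crawlers that identify themselves honestly; genuinely sensitive content still needs authentication.

```python
BLOCKED_TOKENS = ("GPTBot", "CCBot")  # hypothetical choice of bots to refuse


def is_blocked_agent(user_agent, blocked_tokens=BLOCKED_TOKENS):
    """True if the User-Agent header contains any blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in blocked_tokens)


def block_bots(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so matching user agents get a 403 instead of content."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT"), blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)
    return middleware
```

The same check can usually be expressed as a rule in a web server or CDN firewall, which avoids running application code for refused requests.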
How to know your robots.txt is working
Three quick checks:
1. Fetch it directly. curl -i https://yoursite.com/robots.txt should return your file with a 200 status and Content-Type: text/plain (the -i flag prints the response headers). If it returns HTML or a 404, your server is misconfigured.
2. Validate the syntax. Run it through any robots.txt validator. Common mistakes: missing blank lines between User-agent blocks, typos in bot names, conflicting directives.
3. Check it explicitly addresses AI bots. If your robots.txt only mentions User-agent: * and Googlebot, you have AEO (answer engine optimization) work to do. AI crawlers need their own entries.
AISEOLab's free scan does all three checks automatically and tells you exactly which AI crawlers your robots.txt addresses and which it doesn't.
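If you prefer to script checks 1 and 3 yourself, the sketch below shows one way to do it in Python. It is an illustration under stated assumptions: AI_BOTS is a partial list taken from this guide, the function names are hypothetical, and fetching the file (for example with urllib.request) is left out so the logic stays testable offline.

```python
# Partial list of bot names from this guide; extend as needed.
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
           "PerplexityBot", "Perplexity-User", "Google-Extended"]


def named_agents(robots_txt):
    """Collect every User-agent value in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def unaddressed_bots(robots_txt, bots=AI_BOTS):
    """Check 3: return the bots that have no group of their own."""
    agents = named_agents(robots_txt)
    return [bot for bot in bots if bot.lower() not in agents]


def looks_like_robots_response(status, content_type):
    """Check 1: HTTP 200 and a text/plain media type (parameters ignored)."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/plain"
```

Pass the status code and Content-Type header from your own HTTP client to looks_like_robots_response, and the response body to unaddressed_bots. An empty list from the latter means every bot in AI_BOTS has an explicit entry.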
What about the headers approach?
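Some sites prefer setting bot directives via HTTP headers instead of robots.txt. The X-Robots-Tag header can carry per-bot rules, though not the same ones: it tells a crawler what it may do with a page it has already fetched (for example, noindex), while robots.txt controls whether the page is fetched at all. Support among AI crawlers is not consistently documented, so treat the bot-specific form below as an illustration: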
X-Robots-Tag: gptbot: noindex
X-Robots-Tag: claudebot: noindex
This works for advanced cases — different rules per URL or per response. For most sites, a clear robots.txt is simpler and easier to audit.
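As a rough illustration of per-URL rules, the WSGI sketch below attaches X-Robots-Tag values based on the request path. RULES, its path prefixes, and the function names are all hypothetical, and the bot-scoped values simply reuse the form shown above. Whether a given AI crawler honors them is an assumption you would need to verify.

```python
# Hypothetical per-path rules: URL prefix -> X-Robots-Tag values to send.
RULES = {
    "/drafts/": ["noindex"],
    "/reports/": ["gptbot: noindex", "claudebot: noindex"],
}


def x_robots_tags(path, rules=RULES):
    """Return the header values for the longest rule prefix matching `path`."""
    matches = [prefix for prefix in rules if path.startswith(prefix)]
    if not matches:
        return []
    return rules[max(matches, key=len)]


def add_x_robots_tag(app, rules=RULES):
    """Wrap a WSGI app so matching responses carry X-Robots-Tag headers."""
    def middleware(environ, start_response):
        tags = x_robots_tags(environ.get("PATH_INFO", "/"), rules)

        def tagged_start_response(status, headers, exc_info=None):
            # Append one X-Robots-Tag header per configured value.
            extra = [("X-Robots-Tag", tag) for tag in tags]
            return start_response(status, list(headers) + extra, exc_info)

        return app(environ, tagged_start_response)
    return middleware
```

Keeping the rules in one mapping makes them easier to audit than headers scattered across handlers, which is the main drawback of the header approach noted above.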
A note on the future
AI crawler practices are evolving fast. Two trends to watch:
- Paid AI access programmes. OpenAI, Anthropic, and others are beginning to offer publishers paid licensing in exchange for training rights. This may eventually change how robots.txt is interpreted.
- Standardised AI bot meta-directives. Industry groups are discussing formal extensions to robots.txt for AI-specific rules. Today's directives are de facto, not formal standards.
Both are worth monitoring. For now, the practical work is making sure your robots.txt explicitly addresses each major AI crawler — which most sites still don't do.
Closing
A working robots.txt is the lowest-effort, highest-leverage change you can make for AI visibility. It takes 15 minutes. It tells every major AI engine that honors robots.txt how to treat your site.
If you'd rather not write it from scratch, AISEOLab's free scan generates a robots.txt tailored to your site and shows you exactly which crawlers your current file addresses (and which it misses). One click, one upload, done.
Questions about a specific crawler or scenario? Email hello@aiseolab.ai. We've seen most edge cases.