Robots.txt for AI crawlers: a practical guide to GPTBot, ClaudeBot, PerplexityBot, and the rest

Most robots.txt files don't address AI crawlers at all. Here's a clear, no-nonsense guide to which AI bots exist, what they want, and what your robots.txt should say to each one.

AISEOLab · 29 May 2026 · 7 min read

If you look at robots.txt files across the modern web, you'll find most of them haven't been seriously updated since 2018. They address Googlebot and Bingbot. They block crawlers from /admin/. They reference a sitemap. That's about it.

Meanwhile, AI crawlers are reading your site every day: GPTBot, ClaudeBot, PerplexityBot, GrokBot, Google-Extended, and a dozen others. Each one wants different things. Each one falls back to your User-agent: * rules (or to crawling everything, if you have none) when you don't address it explicitly.

This guide explains what each AI crawler does, what you should consider when handling it, and provides a working robots.txt template you can adapt.

The crawlers that matter

There are roughly 21 AI-related crawlers actively scanning websites in 2026. Here are the ones worth understanding:

OpenAI bots

GPTBot — OpenAI's training crawler. Visits your site to collect data for training future GPT models. If you don't want OpenAI to train on your content, block this.
ChatGPT-User — Fetches pages on-demand when a ChatGPT user asks something that requires browsing. This is NOT training — it's real-time browsing. Blocking it means ChatGPT can't browse your site for users.
OAI-SearchBot — Powers OpenAI's search functionality. Newer crawler, increasingly important.

Anthropic bots

ClaudeBot — Anthropic's training crawler. Like GPTBot, used for model training.
Claude-User / Claude-Web — On-demand fetching for Claude users browsing in real-time.
anthropic-ai — Legacy user agent, still used in some contexts.

Perplexity bots

PerplexityBot — Crawls for Perplexity's search index. Critical for being cited in Perplexity answers.
Perplexity-User — Real-time fetcher for live Perplexity queries.

Google AI bots

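Google-Extended — Controls whether Google uses your content to train and ground its Gemini models (Gemini was formerly Bard). It is a robots.txt control token rather than a separate crawler. Note: this is SEPARATE from Googlebot. Blocking Google-Extended doesn't affect search indexing or rankings, and it doesn't control AI Overviews, which are part of Search and follow your Googlebot rules.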

Other significant bots

GrokBot — xAI's crawler for Grok
Applebot-Extended — Apple Intelligence's training crawler (separate from Applebot)
Bingbot — Microsoft's crawler, also feeds Copilot
DuckAssistBot — DuckDuckGo's AI assistant
Meta-ExternalAgent — Meta's AI crawler
Cohere-AI — Cohere's training crawler
Mistral-AI — Mistral's crawler
You-Bot — You.com's crawler
Amazonbot — Amazon's AI crawler (used in Rufus, Alexa)
CCBot — Common Crawl, used as training data by many AI companies

What should you do?

There are three strategies, and the right one depends on your business.

Strategy A: Be visible — recommended for most businesses

If you sell products or services, publish content, or have any business reason to be cited by AI engines, you want AI crawlers to read your site.

The reasoning: if AI engines never see your content, they can never cite you. Being cited drives discovery, brand awareness, and increasingly, sales. Blocking AI crawlers to protect your content typically loses you more visibility than it preserves value.

Allow most crawlers, block sensitive areas only.

Strategy B: Block AI training, allow real-time fetching

If you're a publisher with original content (news site, research firm, premium publication), you might want to allow AI engines to fetch and cite your content in real-time, but not train on it.

The trade-off: this is a finer-grained position that not all crawlers respect. Some bots ignore robots.txt entirely. Some interpret directives inconsistently. Strategy B is theoretically cleaner but practically harder.
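As a rough sketch of what Strategy B can look like, the snippet below holds a candidate robots.txt in a string and checks it with Python's standard urllib.robotparser. The split between training tokens and everything else follows the crawler descriptions earlier in this guide. The /admin/ and /articles/launch paths are placeholders, and the result only shows how a standards-following parser reads the file, not how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical Strategy B file: training tokens blocked, everything else open.
STRATEGY_B = """\
# Training crawlers: blocked everywhere
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
Disallow: /

# Everyone else, including real-time fetchers: only private paths blocked
User-agent: *
Disallow: /admin/
"""


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots_txt and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "ChatGPT-User", "Claude-User", "Perplexity-User"):
        print(agent, can_fetch(STRATEGY_B, agent, "/articles/launch"))
```

Listing several User-agent lines above one rule set keeps the blocked tokens in a single group, so adding a new training crawler is a one-line change.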

Strategy C: Block AI entirely

If you have strong reasons to keep your content out of AI systems (legal, contractual, or competitive), block AI crawlers comprehensively. Accept that you'll be invisible to AI search and answer engines.
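For Strategy C, the file is easier to keep current if it is generated from a list of tokens. The sketch below assumes the tokens named in this guide; block_all_ai is a hypothetical helper, not part of any library. Bingbot is left out on purpose, because blocking it would also remove you from Bing search.

```python
# Tokens taken from the crawler list in this guide; extend as new bots appear.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-Web", "anthropic-ai",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "GrokBot", "Applebot-Extended",
    "DuckAssistBot", "Meta-ExternalAgent", "Cohere-AI",
    "Mistral-AI", "You-Bot", "Amazonbot", "CCBot",
]


def block_all_ai(agents: list[str]) -> str:
    """Return one robots.txt group that disallows every path for each agent."""
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Paste the output above your existing User-agent: * group.
    print(block_all_ai(AI_AGENTS))
```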

A working robots.txt template

Here's a template that follows Strategy A — visible to AI engines, with reasonable defaults. Edit to match your situation:

# robots.txt
# https://yoursite.com/robots.txt

# Default group: applies only to crawlers not named in a group below
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /account/
Disallow: /auth/

# AI crawlers, named explicitly. A named crawler follows only its own group,
# so the sensitive paths are repeated here. Everything else is allowed.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: GrokBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Bingbot
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /account/
Disallow: /auth/

# Common Crawl: blocked (see the notes below)
User-agent: CCBot
Disallow: /

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

A few notes on this template:

Default User-agent: * allows most crawlers but blocks admin/API paths. Reasonable for almost every site.
AI bots are addressed by name, which documents your intent and gives you one place to change their rules. A crawler that finds a group naming it follows only that group and ignores User-agent: *, so the sensitive-path rules are repeated in the AI group rather than inherited.
CCBot is blocked in this template because Common Crawl data is used by dozens of AI companies whose individual bots you can't enumerate. Blocking CCBot is a reasonable middle position. If you want broader AI visibility, allow it.
Real-time fetchers (ChatGPT-User, Claude-User, Perplexity-User) get the same access as the indexing crawlers because they only visit when a user asks for your content. Blocking them just makes you invisible to live AI browsing.
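To see what happens to the default rules once a crawler is named, here is a small check using Python's standard urllib.robotparser. The two-group file and the paths are made up for illustration. One caveat: the stdlib parser applies the first matching rule in a group rather than the longest match, so on files that mix Allow and Disallow its answers can differ from a crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: /admin/ is blocked only in the default group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /api/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # GPTBot has its own group, so the /admin/ rule under * does not apply to it.
    print("GPTBot /admin/users:", allowed(ROBOTS_TXT, "GPTBot", "/admin/users"))
    print("GPTBot /api/v1:", allowed(ROBOTS_TXT, "GPTBot", "/api/v1"))
    # A crawler with no group of its own falls back to the * rules.
    print("SomeOtherBot /admin/users:", allowed(ROBOTS_TXT, "SomeOtherBot", "/admin/users"))
```

The practical consequence is that any path you want closed to a named crawler has to appear in that crawler's own group.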

Things people get wrong

1. Treating Google-Extended like Googlebot. They're different. Google-Extended controls whether your content is used to train and ground Gemini models. Googlebot controls search indexing, and AI Overviews with it. Blocking Google-Extended doesn't affect your search rankings.

2. Forgetting that AI bots don't strictly follow standards. Some crawlers ignore robots.txt entirely. Some respect only partial rules. Don't assume blocking a bot in robots.txt means it actually stops crawling — for sensitive content, use authentication or firewalls.

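3. Blocking too aggressively then complaining about invisibility. If you block OAI-SearchBot and ChatGPT-User, ChatGPT can't search or browse your site, so it has little to cite. Blocking GPTBot alone only opts you out of training. Many sites that complain "AI doesn't know about us" have disallowed the crawlers themselves.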

4. Not updating for years. New AI crawlers appear every quarter. A robots.txt from 2023 doesn't address GrokBot, OAI-SearchBot, or DuckAssistBot. Review yours every 6 months.

5. Using bot blocking for content protection. robots.txt is a polite request, not an enforcement mechanism. If your content is genuinely sensitive, don't rely on robots.txt — use authentication.
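Since robots.txt cannot enforce anything, the refusal has to happen on the server. The sketch below is a minimal WSGI middleware that returns 403 when the User-Agent header contains a listed token. The token list and the function name are assumptions for illustration, and because User-Agent strings are trivially spoofed, this only stops bots that identify themselves honestly. Content that is actually sensitive still needs authentication.

```python
# Hypothetical list of tokens to refuse; compared case-insensitively.
BLOCKED_TOKENS = ("ccbot", "gptbot")


def block_user_agents(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from listed user agents get a 403."""
    tokens = tuple(token.lower() for token in blocked_tokens)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in tokens):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Anything else passes through to the wrapped application untouched.
        return app(environ, start_response)

    return middleware
```

Wrapping an existing application is one line, for example application = block_user_agents(application).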

How to know your robots.txt is working

Three quick checks:

1. Fetch it directly. curl https://yoursite.com/robots.txt should return your file with Content-Type: text/plain. If it returns HTML or 404s, your server has the wrong configuration.

2. Validate the syntax. Run it through any robots.txt validator. Common mistakes: missing blank lines between User-agent blocks, typos in bot names, conflicting directives.

3. Check it explicitly addresses AI bots. If your robots.txt only mentions User-agent: * and Googlebot, you have answer engine optimisation (AEO) work to do. AI crawlers need their own entries if you want to treat them differently from the default.
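The first and third checks can be scripted with the standard library alone. In the sketch below, the AI_AGENTS tuple is an assumed subset of the tokens discussed in this guide, and fetch_robots needs network access, so only the parsing helpers are exercised offline.

```python
import urllib.request

# Assumed subset of the tokens discussed in this guide.
AI_AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
             "Claude-User", "PerplexityBot", "Perplexity-User", "Google-Extended")


def named_agents(robots_txt: str) -> set[str]:
    """Return the lowercased tokens that appear on User-agent lines."""
    names = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def missing_agents(robots_txt: str, expected=AI_AGENTS) -> list[str]:
    """List the expected tokens that the file never names."""
    present = named_agents(robots_txt)
    return [agent for agent in expected if agent.lower() not in present]


def fetch_robots(base_url: str):
    """Fetch /robots.txt and return (status, content_type, body).

    A 404 or other error status raises urllib.error.HTTPError.
    """
    url = base_url.rstrip("/") + "/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", "replace")
        return resp.status, resp.headers.get_content_type(), body
```

A typical run calls fetch_robots("https://yoursite.com"), confirms the content type is text/plain, and passes the body to missing_agents.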

AISEOLab's free scan does all three checks automatically and tells you exactly which AI crawlers your robots.txt addresses and which it doesn't.

What about the headers approach?

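Some sites set bot directives via HTTP headers instead of, or alongside, robots.txt. The X-Robots-Tag header is not an exact equivalent, though. robots.txt controls whether a crawler fetches a URL, while X-Robots-Tag arrives with the response and controls what may be done with it (noindex, nofollow, and similar). A crawler has to fetch the page to see the header. The header can be scoped to one crawler by prefixing its token:

X-Robots-Tag: gptbot: noindex
X-Robots-Tag: claudebot: noindex

The scoped syntax is documented for search crawlers such as Googlebot. Whether a given AI crawler honours it is something to verify with that vendor. This works for advanced cases: different rules per URL or per response. For most sites, a clear robots.txt is simpler and easier to audit.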
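For the per-URL case, the header is usually added by the web server or application. Here is a minimal WSGI sketch that attaches X-Robots-Tag values by path prefix. The /drafts/ prefix and the add_x_robots_tag name are hypothetical, and whether an AI crawler acts on the header is an assumption to confirm with each vendor.

```python
def add_x_robots_tag(app, rules):
    """Wrap a WSGI app and add X-Robots-Tag headers by path prefix.

    `rules` is a list of (path_prefix, header_value) pairs; every pair whose
    prefix matches the request path contributes one header.
    """
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        extra = [("X-Robots-Tag", value)
                 for prefix, value in rules if path.startswith(prefix)]

        def patched_start(status, headers, exc_info=None):
            # Append our headers to whatever the wrapped app already set.
            return start_response(status, list(headers) + extra, exc_info)

        return app(environ, patched_start)

    return middleware


# Example rule set using the header values shown above (prefix is made up).
RULES = [
    ("/drafts/", "gptbot: noindex"),
    ("/drafts/", "claudebot: noindex"),
]
```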

A note on the future

AI crawler practices are evolving fast. Two trends to watch:

Paid AI access programmes. OpenAI, Anthropic, and others are beginning to offer publishers paid licensing in exchange for training rights. This may eventually change how robots.txt is interpreted.
Standardised AI bot meta-directives. Industry groups are discussing formal extensions to robots.txt for AI-specific rules. Today's directives are de facto, not formal standards.

Both are worth monitoring. For now, the practical work is making sure your robots.txt explicitly addresses each major AI crawler — which most sites still don't do.

Closing

A working robots.txt is the lowest-effort, highest-leverage change you can make for AI visibility. It takes 15 minutes. It tells every major AI crawler that honours robots.txt what it may read on your site, and the change takes effect the next time each crawler re-fetches the file.

If you'd rather not write it from scratch, AISEOLab's free scan generates a robots.txt tailored to your site and shows you exactly which crawlers your current file addresses (and which it misses). One click, one upload, done.

Questions about a specific crawler or scenario? Email hello@aiseolab.ai. We've seen most edge cases.

