AI Crawler Management & robots.txt
This is the complete reference of known AI crawler user agents as of 2025-2026. Use this to configure robots.txt and monitor crawl traffic.
AI Crawler User Agent Reference
OpenAI
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| GPTBot | Scrapes data for training OpenAI's models | Yes |
| OAI-SearchBot | Real-time search result generation for ChatGPT search | Yes |
| ChatGPT-User | User-initiated browsing (when a user asks ChatGPT to visit a URL) | Partially — may not fully respect |
Anthropic
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| ClaudeBot | Training data collection | Yes |
| Claude-SearchBot | Improves search result quality | Yes |
| Claude-User | Fetches pages for user queries | Yes |
Perplexity
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| PerplexityBot | Indexing and search | Yes |
| Perplexity-User | User-initiated fetches | NO |
Google
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| Google-Extended | LLM training data collection (a robots.txt control token honored by Googlebot rather than a separate crawler) | Yes |
| Googlebot | Standard web crawling (also feeds AI Overviews) | Yes |
Apple
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| Applebot-Extended | Powers Siri, Spotlight, Safari AI features | Yes |
Meta
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| Meta-ExternalAgent | Training language models (19% of AI crawler traffic in 2025) | Yes |
| FacebookBot | Content aggregation | Yes |
ByteDance
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| Bytespider | Training data for ByteDance models | NO |
Others
| User Agent | Purpose | Respects robots.txt |
|---|---|---|
| YouBot | You.com search and AI | Yes |
| PhindBot | Phind developer search | Yes |
| Andibot | Andi search | Yes |
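Because Bytespider and Perplexity-User are reported to ignore robots.txt, a Disallow rule only documents intent; actual blocking has to happen at the server or edge layer (a WAF rule, an nginx map on the User-Agent header, or framework middleware). A minimal request-time check, sketched here as a standalone helper:

```typescript
// Crawlers reported to ignore robots.txt (see the tables above).
// A Disallow rule will not stop them, so enforce at request time.
const NON_COMPLIANT_BOTS = ["Bytespider", "Perplexity-User"];

// Returns true when the request should be rejected (e.g. with a 403).
function shouldBlock(userAgent: string): boolean {
  return NON_COMPLIANT_BOTS.some((bot) => userAgent.includes(bot));
}
```

Whether to block at all is a policy decision; some sites prefer to rate-limit rather than reject outright.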
robots.txt Templates
Template 1: Maximum Visibility (Recommended for Most Sites)
Allow all AI crawlers to access all content. This maximizes your chances of being indexed and cited by AI platforms.
```
# ============================================
# AI Crawler Policy — Maximum Visibility
# ============================================

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /

# Apple
User-agent: Applebot-Extended
Allow: /

# Meta
User-agent: Meta-ExternalAgent
Allow: /

User-agent: FacebookBot
Allow: /

# Other AI
User-agent: YouBot
Allow: /

User-agent: PhindBot
Allow: /

# Standard crawlers
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Template 2: Allow Search, Block Training
Allow AI search crawlers (so your content appears in AI answers) but block training crawlers (so your content is not used to train models).
```
# ============================================
# AI Crawler Policy — Search Only, No Training
# ============================================

# OpenAI — allow search, block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic — allow search, block training
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity — allow search
User-agent: PerplexityBot
Allow: /

# Google — block training, allow standard crawl
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

# Apple — block training
User-agent: Applebot-Extended
Disallow: /

# Meta — block training
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Allow: /

# Block known non-compliant crawlers (they ignore this, but document intent)
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Template 3: Selective Access
Allow AI crawlers to access public content but block premium/gated content.
```
# ============================================
# AI Crawler Policy — Selective Access
# ============================================

# OpenAI
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/v1/
Disallow: /premium/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/v1/
Disallow: /premium/

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Apply same pattern for other crawlers...

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
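Before deploying a template, it can help to sanity-check which paths a given bot may fetch. Below is a simplified matcher sketch: it uses plain prefix matching only, whereas full RFC 9309 matching also supports `*` and `$` wildcards, so treat it as a rough check rather than a compliant parser.

```typescript
type Rule = { allow: boolean; path: string };
type Group = { agents: string[]; rules: Rule[] };

// Parse robots.txt into user-agent groups (stacked User-agent lines
// share one group, as in the templates above).
function parseGroups(robotsTxt: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim();
    if (!line || !line.includes(":")) continue;
    const idx = line.indexOf(":");
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      if (!lastWasAgent || current === null) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if ((key === "allow" || key === "disallow") && current !== null) {
        current.rules.push({ allow: key === "allow", path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// RFC 9309 longest-match rule: the most specific matching path wins;
// on a tie, Allow wins; no matching rule means the path is allowed.
function isAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  const groups = parseGroups(robotsTxt);
  const ua = userAgent.toLowerCase();
  let rules = groups.filter((g) => g.agents.includes(ua)).flatMap((g) => g.rules);
  if (rules.length === 0) {
    // No group names this bot: fall back to the wildcard group.
    rules = groups.filter((g) => g.agents.includes("*")).flatMap((g) => g.rules);
  }
  let best: Rule | null = null;
  for (const r of rules) {
    if (r.path && path.startsWith(r.path)) {
      if (best === null || r.path.length > best.path.length ||
          (r.path.length === best.path.length && r.allow)) {
        best = r;
      }
    }
  }
  return best === null ? true : best.allow;
}
```

For example, against Template 3's GPTBot group, `isAllowed(txt, "GPTBot", "/premium/report")` comes back false while `/blog/...` paths come back true.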
SSR Is Mandatory for AI Crawlability
Most AI crawlers do not execute JavaScript. If your content is rendered client-side (SPA/CSR), it is invisible to AI crawlers.
Required: Use one of these rendering strategies:
- Server-Side Rendering (SSR): HTML generated on each request
- Static Site Generation (SSG): HTML generated at build time
- Incremental Static Regeneration (ISR): Static pages regenerated on a schedule
Verification: Use `curl -A "GPTBot" https://yourdomain.com/page` and check that the response HTML contains your full content, not empty `<div id="root"></div>` containers.
```bash
# Test what AI crawlers see
curl -s -A "GPTBot" https://yourdomain.com/your-page | grep -c "<article>"

# Compare with a JavaScript-capable render.
# If curl returns empty but the browser shows content, you have a CSR problem.
```
Crawl Traffic Scale (2025)
Understanding the scale of AI crawling helps set expectations:
- Over 560,000 websites include AI bot directives in robots.txt
- GPTBot grew from 5% to 30% of AI crawler share between May 2024 and May 2025
- Meta-ExternalAgent accounts for 19% of AI crawler traffic
- Crawl-to-referral ratios are extremely asymmetric:
  - OpenAI: 1,700:1 (1,700 crawl requests per referral visit)
  - Anthropic: 73,000:1 (73,000 crawl requests per referral visit)
- This means AI platforms crawl your content heavily but send very few visitors back
- The value is in being cited (brand mention, authority signal), not in receiving traffic
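To make the asymmetry concrete, the ratios above translate into referral estimates with simple division (this assumes the reported ratios hold uniformly across sites, which they may not):

```typescript
// Estimate referral visits for a given crawl volume and crawl:referral ratio.
function estimatedReferrals(crawlRequests: number, ratio: number): number {
  return Math.round(crawlRequests / ratio);
}

// At OpenAI's reported 1,700:1 ratio, one million crawl requests imply
// on the order of ~588 referral visits; at Anthropic's 73,000:1,
// the same crawl volume implies only ~14.
```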
Site Architecture for AI Consumption
AI crawlers parse your site structure to understand content relationships. Optimize architecture for machine readability:
Semantic HTML
Use proper HTML5 semantic elements — AI crawlers use these to understand content structure:
```html
<article>
  <header>
    <h1>Primary Topic</h1>
    <time datetime="2025-12-15">Last Updated: December 15, 2025</time>
  </header>
  <section>
    <h2>Subtopic</h2>
    <p>Content organized in self-contained passages of 134-167 words...</p>
    <table>
      <caption>Comparison of Options</caption>
      <thead>
        <tr><th>Feature</th><th>Option A</th><th>Option B</th></tr>
      </thead>
      <tbody>
        <tr><td>Price</td><td>$29/mo</td><td>$49/mo</td></tr>
      </tbody>
    </table>
  </section>
  <section>
    <h2>FAQ</h2>
    <dl>
      <dt>What is the pricing?</dt>
      <dd>Plans start at $29/month for up to 10,000 events.</dd>
    </dl>
  </section>
</article>
```
Heading Hierarchy
Maintain clean nesting — never skip levels:
```
H1 (one per page) — Page title
  H2 — Major section
    H3 — Subsection
      H4 — Detail (use sparingly)
  H2 — Next major section
    H3 — Subsection
```
Clean URLs
AI systems use URL structure to infer content topics:
```
Good: /docs/api/authentication
Bad:  /d?id=38172&cat=3

Good: /blog/geo-optimization-guide-2025
Bad:  /blog/post-38172
```
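Clean slugs like the "Good" examples are usually generated from titles rather than written by hand. A minimal slug helper, sketched here (slugification rules vary by site, e.g. for accented characters or stop words):

```typescript
// Convert a title into a clean, hyphenated URL slug.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics into hyphens
    .replace(/^-+|-+$/g, "");    // trim leading/trailing hyphens
}

// slugify("GEO Optimization Guide 2025") → "geo-optimization-guide-2025"
```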
Next.js Middleware for Tracking AI Crawler Hits
Monitor which AI crawlers are visiting your site and which pages they access:
```typescript
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

const AI_CRAWLERS: Record<string, string> = {
  'GPTBot': 'openai-training',
  'OAI-SearchBot': 'openai-search',
  'ChatGPT-User': 'openai-user',
  'ClaudeBot': 'anthropic-training',
  'Claude-SearchBot': 'anthropic-search',
  'Claude-User': 'anthropic-user',
  'PerplexityBot': 'perplexity',
  'Perplexity-User': 'perplexity-user',
  'Google-Extended': 'google-ai',
  'Applebot-Extended': 'apple-ai',
  'Meta-ExternalAgent': 'meta-training',
  'Bytespider': 'bytedance',
  'YouBot': 'you-com',
  'PhindBot': 'phind',
};

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') || '';
  const response = NextResponse.next();

  for (const [botName, category] of Object.entries(AI_CRAWLERS)) {
    if (ua.includes(botName)) {
      console.log(JSON.stringify({
        event: 'ai_crawler_hit',
        crawler: botName,
        category,
        path: request.nextUrl.pathname,
        timestamp: new Date().toISOString(),
      }));
      response.headers.set('X-AI-Crawler', botName);
      break;
    }
  }

  return response;
}

export const config = {
  matcher: [
    '/((?!_next/static|_next/image|favicon.ico).*)',
  ],
};
```

Production logging: Replace `console.log` with your analytics service (e.g., send to a database, a Vercel Analytics custom event, or a dedicated AI crawler analytics endpoint).
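When routing these events to an analytics pipeline, it helps to keep the detection step as a pure function that can be unit-tested apart from Next.js. A sketch (the map here is abbreviated; in practice you would reuse the full AI_CRAWLERS table):

```typescript
type CrawlerHit = { crawler: string; category: string };

// Abbreviated crawler map for illustration; extend with the full table.
const KNOWN_CRAWLERS: Record<string, string> = {
  GPTBot: "openai-training",
  ClaudeBot: "anthropic-training",
  PerplexityBot: "perplexity",
};

// Pure classification helper: given a User-Agent header, return the
// matching crawler name and category, or null for ordinary traffic.
function classifyCrawler(userAgent: string): CrawlerHit | null {
  for (const [crawler, category] of Object.entries(KNOWN_CRAWLERS)) {
    if (userAgent.includes(crawler)) return { crawler, category };
  }
  return null;
}
```

The middleware then reduces to calling `classifyCrawler(ua)` and logging the non-null result, which keeps the matching logic testable without mocking requests.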
Sitemap Strategy
Submit your XML sitemap to both Google Search Console and Bing Webmaster Tools:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/docs/getting-started</loc>
    <lastmod>2025-12-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
</urlset>
```
Key points:
- Include accurate `<lastmod>` dates — AI systems use these to evaluate freshness
- Use the IndexNow protocol for instant Bing indexing (see the platform-specific-geo skill)
- Prioritize high-value content pages in your sitemap structure
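Keeping `<lastmod>` accurate is easier when sitemap entries are generated from content metadata rather than edited by hand. A minimal renderer sketch for one `<url>` element (field names follow the sitemaps.org schema; escaping is left as a caveat):

```typescript
type SitemapEntry = {
  loc: string;
  lastmod: string;      // ISO date, e.g. "2025-12-15"
  changefreq?: string;  // e.g. "weekly"
  priority?: number;    // 0.0 to 1.0
};

// Render one <url> element for an XML sitemap.
// Note: for production, XML-escape &, <, > in loc.
function renderUrl(e: SitemapEntry): string {
  const lines = [
    "  <url>",
    `    <loc>${e.loc}</loc>`,
    `    <lastmod>${e.lastmod}</lastmod>`,
  ];
  if (e.changefreq) lines.push(`    <changefreq>${e.changefreq}</changefreq>`);
  if (e.priority !== undefined) lines.push(`    <priority>${e.priority}</priority>`);
  lines.push("  </url>");
  return lines.join("\n");
}
```

Wiring this to your CMS's last-edited timestamps keeps freshness signals truthful without manual upkeep.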
Related Skills
- Entity-Based Optimization for AI Knowledge Graphs
- GEO Content Strategy — Writing for AI Citation
- Generative Engine Optimization (GEO) Fundamentals
- Measuring & Monitoring LLM Visibility
- llms.txt Standard Implementation
- Platform-Specific GEO — ChatGPT, Perplexity, Google AI Overviews