
# AI Crawler Management & robots.txt

This is the complete reference of known AI crawler user agents as of 2025-2026. Use this to configure robots.txt and monitor crawl traffic.



## AI Crawler User Agent Reference


### OpenAI

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| GPTBot | Scrapes data for training OpenAI's models | Yes |
| OAI-SearchBot | Real-time search result generation for ChatGPT search | Yes |
| ChatGPT-User | User-initiated browsing (when a user asks ChatGPT to visit a URL) | Partially — may not fully respect |

### Anthropic

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| ClaudeBot | Training data collection | Yes |
| Claude-SearchBot | Improves search result quality | Yes |
| Claude-User | Fetches pages for user queries | Yes |

### Perplexity

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| PerplexityBot | Indexing and search | Yes |
| Perplexity-User | User-initiated fetches | **No** |

### Google

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Google-Extended | LLM training data collection | Yes |
| Googlebot | Standard web crawling (also feeds AI Overviews) | Yes |

### Apple

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Applebot-Extended | Powers Siri, Spotlight, and Safari AI features | Yes |

### Meta

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Meta-ExternalAgent | Training language models (19% of AI crawler traffic in 2025) | Yes |
| FacebookBot | Content aggregation | Yes |

### ByteDance

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Bytespider | Training data for ByteDance models | **No** |

### Others

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| YouBot | You.com search and AI | Yes |
| PhindBot | Phind developer search | Yes |
| Andibot | Andi search | Yes |
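The user agents above can be matched against raw access-log lines to get a quick picture of AI crawl volume. A minimal sketch in TypeScript (the crawler list mirrors the tables above; the log lines are assumed to contain the raw User-Agent string, as in common combined-log formats):

```typescript
// Count hits per AI crawler across a batch of access-log lines.
const KNOWN_AI_CRAWLERS = [
  "GPTBot", "OAI-SearchBot", "ChatGPT-User",
  "ClaudeBot", "Claude-SearchBot", "Claude-User",
  "PerplexityBot", "Perplexity-User",
  "Google-Extended", "Applebot-Extended",
  "Meta-ExternalAgent", "FacebookBot",
  "Bytespider", "YouBot", "PhindBot", "Andibot",
];

function countCrawlerHits(logLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of logLines) {
    for (const bot of KNOWN_AI_CRAWLERS) {
      if (line.includes(bot)) {
        counts.set(bot, (counts.get(bot) ?? 0) + 1);
        break; // attribute each line to at most one crawler
      }
    }
  }
  return counts;
}
```

Running this over a day of logs shows which platforms are actually crawling you, which is the first step before tightening or loosening robots.txt policy.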

## robots.txt Templates

### Template 1: Maximum Visibility (Recommended for Most Sites)

Allow all AI crawlers to access all content. This maximizes your chances of being indexed and cited by AI platforms.

```
# ============================================
# AI Crawler Policy — Maximum Visibility
# ============================================

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /

# Apple
User-agent: Applebot-Extended
Allow: /

# Meta
User-agent: Meta-ExternalAgent
Allow: /

User-agent: FacebookBot
Allow: /

# Other AI
User-agent: YouBot
Allow: /

User-agent: PhindBot
Allow: /

# Standard crawlers
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

### Template 2: Allow Search, Block Training

Allow AI search crawlers (so your content appears in AI answers) but block training crawlers (so your content is not used to train models).

```
# ============================================
# AI Crawler Policy — Search Only, No Training
# ============================================

# OpenAI — allow search, block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic — allow search, block training
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity — allow search
User-agent: PerplexityBot
Allow: /

# Google — block training, allow standard crawl
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

# Apple — block training
User-agent: Applebot-Extended
Disallow: /

# Meta — block training
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Allow: /

# Block known non-compliant crawlers (they ignore this, but document intent)
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

### Template 3: Selective Access

Allow AI crawlers to access public content but block premium/gated content.

```
# ============================================
# AI Crawler Policy — Selective Access
# ============================================

# OpenAI
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/v1/
Disallow: /premium/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/v1/
Disallow: /premium/

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Apply same pattern for other crawlers...

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
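Repeating the same Allow/Disallow block for every crawler is tedious and error-prone, so the selective template can be generated from a single policy map. A sketch in TypeScript (the policy object shape is an illustration for generating the file, not part of any robots.txt standard):

```typescript
// Generate robots.txt groups from a per-crawler policy map.
type CrawlerPolicy = { allow?: string[]; disallow?: string[] };

function buildRobotsTxt(
  policies: Record<string, CrawlerPolicy>,
  sitemapUrl: string,
): string {
  const groups = Object.entries(policies).map(([agent, p]) => {
    const lines = [`User-agent: ${agent}`];
    for (const path of p.allow ?? []) lines.push(`Allow: ${path}`);
    for (const path of p.disallow ?? []) lines.push(`Disallow: ${path}`);
    return lines.join("\n");
  });
  return groups.join("\n\n") + `\n\nSitemap: ${sitemapUrl}\n`;
}

const robots = buildRobotsTxt(
  {
    GPTBot: { allow: ["/blog/", "/docs/"], disallow: ["/app/", "/premium/"] },
    "OAI-SearchBot": { allow: ["/"] },
    "*": { allow: ["/"] },
  },
  "https://yourdomain.com/sitemap.xml",
);
```

With this approach, adding a new crawler or path rule means changing one entry in the map rather than editing every group by hand.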

## SSR Is Mandatory for AI Crawlability

Most AI crawlers do not execute JavaScript. If your content is rendered client-side (SPA/CSR), it is invisible to AI crawlers.

**Required:** Use one of these rendering strategies:

- **Server-Side Rendering (SSR)**: HTML generated on each request
- **Static Site Generation (SSG)**: HTML generated at build time
- **Incremental Static Regeneration (ISR)**: Static pages regenerated on a schedule

**Verification:** Run `curl -A "GPTBot" https://yourdomain.com/page` and check that the response HTML contains your full content, not an empty `<div id="root"></div>` container.

```bash
# Test what AI crawlers see
curl -s -A "GPTBot" https://yourdomain.com/your-page | grep -c "<article>"

# Compare with a JavaScript-capable render
# If curl returns empty but browser shows content, you have a CSR problem
```
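The same check can be automated in a build or CI step: if the fetched HTML is little more than an empty root container, the page is effectively invisible to non-JS crawlers. A rough heuristic sketch, assuming the common `root`/`__next`/`app` container ids and a 200-character visible-text threshold (both thresholds are illustrative assumptions):

```typescript
// Heuristic: does this HTML look like an empty client-side-rendered shell?
function looksLikeCsrShell(html: string): boolean {
  // Strip scripts and styles, drop all tags, then measure the visible text.
  const visible = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // An empty root container is the classic CSR signature.
  const hasEmptyRoot = /<div id="(root|__next|app)">\s*<\/div>/i.test(html);
  return hasEmptyRoot || visible.length < 200;
}
```

Fetch each page with an AI-crawler User-Agent (as in the curl example above) and run the response through this check to catch CSR regressions before they ship.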

## Crawl Traffic Scale (2025)

Understanding the scale of AI crawling helps set expectations:

- Over 560,000 websites include AI bot directives in robots.txt
- GPTBot grew from 5% to 30% of AI crawler share between May 2024 and May 2025
- Meta-ExternalAgent accounts for 19% of AI crawler traffic
- The crawl-to-referral ratio is extremely asymmetric:
  - OpenAI: 1,700:1 (1,700 crawl requests per referral visit)
  - Anthropic: 73,000:1 (73,000 crawl requests per referral visit)
- AI platforms crawl your content heavily but send very few visitors back
- The value is in being cited (brand mention, authority signal), not in receiving traffic
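To make those ratios concrete, the expected referral traffic implied by a given crawl volume is a one-line calculation (the 170,000-request figure below is an illustrative number, not a statistic from this reference):

```typescript
// Expected referral visits implied by a crawl-to-referral ratio.
function expectedReferrals(crawlRequests: number, ratio: number): number {
  return Math.floor(crawlRequests / ratio);
}

// 170,000 OpenAI crawl requests at ~1,700:1 → about 100 referral visits.
expectedReferrals(170_000, 1_700);
```

This is why citation presence, not referral counts, is the metric to optimize for.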

## Site Architecture for AI Consumption

AI crawlers parse your site structure to understand content relationships. Optimize architecture for machine readability.

### Semantic HTML

Use proper HTML5 semantic elements — AI crawlers use these to understand content structure:

```html
<article>
  <header>
    <h1>Primary Topic</h1>
    <time datetime="2025-12-15">Last Updated: December 15, 2025</time>
  </header>

  <section>
    <h2>Subtopic</h2>
    <p>Content organized in self-contained passages of 134-167 words...</p>

    <table>
      <caption>Comparison of Options</caption>
      <thead>
        <tr><th>Feature</th><th>Option A</th><th>Option B</th></tr>
      </thead>
      <tbody>
        <tr><td>Price</td><td>$29/mo</td><td>$49/mo</td></tr>
      </tbody>
    </table>
  </section>

  <section>
    <h2>FAQ</h2>
    <dl>
      <dt>What is the pricing?</dt>
      <dd>Plans start at $29/month for up to 10,000 events.</dd>
    </dl>
  </section>
</article>
```

### Heading Hierarchy

Maintain clean nesting — never skip levels:

```
H1 (one per page) — Page title
  H2 — Major section
    H3 — Subsection
      H4 — Detail (use sparingly)
  H2 — Next major section
    H3 — Subsection
```
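The no-skipped-levels rule can be checked mechanically, for example in a content linter. A sketch that validates a page's heading levels in document order (the single-H1 requirement and the step-down-freely rule follow the hierarchy shown above):

```typescript
// Validate heading levels (1-6) in document order: exactly one H1,
// starting at H1, and never jumping more than one level deeper
// than the previous heading. Moving back up any number of levels is fine.
function validateHeadingOrder(levels: number[]): boolean {
  if (levels.length === 0 || levels[0] !== 1) return false;
  if (levels.filter((l) => l === 1).length !== 1) return false;
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) return false; // skipped a level
  }
  return true;
}
```

Feed it the sequence of heading levels extracted from a rendered page (e.g. `[1, 2, 3, 2, 3]` for the hierarchy diagram above) to catch H2→H4 jumps before publishing.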

### Clean URLs

AI systems use URL structure to infer content topics:

```
Good:  /docs/api/authentication
Bad:   /d?id=38172&cat=3
Good:  /blog/geo-optimization-guide-2025
Bad:   /blog/post-38172
```
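Slugs like `/blog/geo-optimization-guide-2025` are usually derived from the page title. A minimal slug helper, one common way to sketch it (this is a generic pattern, not tied to any particular framework):

```typescript
// Turn a page title into a clean, hyphenated URL slug.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumeric runs to one hyphen
    .replace(/^-+|-+$/g, "");    // trim leading/trailing hyphens
}

slugify("GEO Optimization Guide (2025)"); // "geo-optimization-guide-2025"
```

Combined with a topical directory prefix (`/blog/`, `/docs/`), this produces the "Good" URL shapes shown above.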

## Next.js Middleware for Tracking AI Crawler Hits

Monitor which AI crawlers are visiting your site and which pages they access:

```typescript
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

const AI_CRAWLERS: Record<string, string> = {
  'GPTBot': 'openai-training',
  'OAI-SearchBot': 'openai-search',
  'ChatGPT-User': 'openai-user',
  'ClaudeBot': 'anthropic-training',
  'Claude-SearchBot': 'anthropic-search',
  'Claude-User': 'anthropic-user',
  'PerplexityBot': 'perplexity',
  'Perplexity-User': 'perplexity-user',
  'Google-Extended': 'google-ai',
  'Applebot-Extended': 'apple-ai',
  'Meta-ExternalAgent': 'meta-training',
  'Bytespider': 'bytedance',
  'YouBot': 'you-com',
  'PhindBot': 'phind',
};

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') || '';
  const response = NextResponse.next();

  for (const [botName, category] of Object.entries(AI_CRAWLERS)) {
    if (ua.includes(botName)) {
      console.log(JSON.stringify({
        event: 'ai_crawler_hit',
        crawler: botName,
        category,
        path: request.nextUrl.pathname,
        timestamp: new Date().toISOString(),
      }));

      response.headers.set('X-AI-Crawler', botName);
      break;
    }
  }

  return response;
}

export const config = {
  matcher: [
    '/((?!_next/static|_next/image|favicon.ico).*)',
  ],
};
```

**Production logging:** Replace `console.log` with your analytics service (e.g., send to a database, a Vercel Analytics custom event, or a dedicated AI crawler analytics endpoint).
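As one illustration of swapping out `console.log`, the hit could be posted to an internal collection endpoint. A sketch (the `/api/crawler-hits` URL is a hypothetical example endpoint, not a real API):

```typescript
// Build a structured crawler-hit event and post it to an internal
// collection endpoint (the endpoint URL below is a hypothetical example).
type CrawlerHit = {
  event: "ai_crawler_hit";
  crawler: string;
  category: string;
  path: string;
  timestamp: string;
};

function buildCrawlerHit(crawler: string, category: string, path: string): CrawlerHit {
  return {
    event: "ai_crawler_hit",
    crawler,
    category,
    path,
    timestamp: new Date().toISOString(),
  };
}

async function reportCrawlerHit(hit: CrawlerHit): Promise<void> {
  // keepalive lets the request complete even as the response returns.
  await fetch("https://yourdomain.com/api/crawler-hits", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(hit),
    keepalive: true,
  });
}
```

Inside the middleware loop, `reportCrawlerHit(buildCrawlerHit(botName, category, request.nextUrl.pathname))` would replace the `console.log` call; fire-and-forget is usually acceptable here since losing an occasional event is cheaper than delaying the response.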

## Sitemap Strategy

Submit your XML sitemap to both Google Search Console and Bing Webmaster Tools:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/docs/getting-started</loc>
    <lastmod>2025-12-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
</urlset>
```

Key points:

- Include accurate `<lastmod>` dates — AI systems use these to evaluate freshness
- Use the IndexNow protocol for instant Bing indexing (see the platform-specific-geo skill)
- Prioritize high-value content pages in your sitemap structure
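Keeping `<lastmod>` accurate is easiest when sitemap entries are generated from content metadata rather than edited by hand. A minimal sketch that emits entries matching the XML above (the helper names are illustrative, and `changefreq` is omitted for brevity):

```typescript
// Build a sitemap <url> entry with an accurate lastmod date.
function sitemapEntry(loc: string, lastModified: Date, priority = 0.5): string {
  const lastmod = lastModified.toISOString().slice(0, 10); // YYYY-MM-DD
  return [
    "  <url>",
    `    <loc>${loc}</loc>`,
    `    <lastmod>${lastmod}</lastmod>`,
    `    <priority>${priority.toFixed(1)}</priority>`,
    "  </url>",
  ].join("\n");
}

function buildSitemap(entries: string[]): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}
```

Wiring `lastModified` to the real update time of each page (e.g. a CMS timestamp or file mtime) is what makes the freshness signal trustworthy.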
