
# AI Crawler Management & robots.txt

This is the complete reference of known AI crawler user agents as of 2025-2026. Use this to configure robots.txt and monitor crawl traffic.



## AI Crawler User Agent Reference


### OpenAI

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| GPTBot | Scrapes data for training OpenAI's models | Yes |
| OAI-SearchBot | Real-time search result generation for ChatGPT search | Yes |
| ChatGPT-User | User-initiated browsing (when a user asks ChatGPT to visit a URL) | Partially — may not fully respect |

### Anthropic

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| ClaudeBot | Training data collection | Yes |
| Claude-SearchBot | Improves search result quality | Yes |
| Claude-User | Fetches pages for user queries | Yes |

### Perplexity

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| PerplexityBot | Indexing and search | Yes |
| Perplexity-User | User-initiated fetches | **No** |

### Google

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Google-Extended | LLM training data collection | Yes |
| Googlebot | Standard web crawling (also feeds AI Overviews) | Yes |

### Apple

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Applebot-Extended | Powers Siri, Spotlight, and Safari AI features | Yes |

### Meta

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Meta-ExternalAgent | Training language models (19% of AI crawler traffic in 2025) | Yes |
| FacebookBot | Content aggregation | Yes |

### ByteDance

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Bytespider | Training data for ByteDance models | **No** |

### Others

| User Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| YouBot | You.com search and AI | Yes |
| PhindBot | Phind developer search | Yes |
| Andibot | Andi search | Yes |
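The user agents above can be matched against raw access-log lines to get a quick picture of AI crawl volume. A minimal sketch in TypeScript (the crawler list mirrors the tables above; the log lines are assumed to contain the raw User-Agent string, as in common combined-log formats):

```typescript
// Count hits per AI crawler across a batch of access-log lines.
const KNOWN_AI_CRAWLERS = [
  "GPTBot", "OAI-SearchBot", "ChatGPT-User",
  "ClaudeBot", "Claude-SearchBot", "Claude-User",
  "PerplexityBot", "Perplexity-User",
  "Google-Extended", "Applebot-Extended",
  "Meta-ExternalAgent", "FacebookBot",
  "Bytespider", "YouBot", "PhindBot", "Andibot",
];

function countCrawlerHits(logLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of logLines) {
    for (const bot of KNOWN_AI_CRAWLERS) {
      if (line.includes(bot)) {
        counts.set(bot, (counts.get(bot) ?? 0) + 1);
        break; // attribute each line to at most one crawler
      }
    }
  }
  return counts;
}
```

Running this over a day of logs shows which platforms are actually crawling you, which is the first step before tightening or loosening robots.txt policy.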

## robots.txt Templates

### Template 1: Maximum Visibility (Recommended for Most Sites)

Allow all AI crawlers to access all content. This maximizes your chances of being indexed and cited by AI platforms.

```
# ============================================
# AI Crawler Policy — Maximum Visibility
# ============================================

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /

# Apple
User-agent: Applebot-Extended
Allow: /

# Meta
User-agent: Meta-ExternalAgent
Allow: /

User-agent: FacebookBot
Allow: /

# Other AI
User-agent: YouBot
Allow: /

User-agent: PhindBot
Allow: /

# Standard crawlers
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

### Template 2: Allow Search, Block Training

Allow AI search crawlers (so your content appears in AI answers) but block training crawlers (so your content is not used to train models).

```
# ============================================
# AI Crawler Policy — Search Only, No Training
# ============================================

# OpenAI — allow search, block training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic — allow search, block training
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity — allow search
User-agent: PerplexityBot
Allow: /

# Google — block training, allow standard crawl
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

# Apple — block training
User-agent: Applebot-Extended
Disallow: /

# Meta — block training
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Allow: /

# Block known non-compliant crawlers (they ignore this, but document intent)
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

### Template 3: Selective Access

Allow AI crawlers to access public content but block premium/gated content.

```
# ============================================
# AI Crawler Policy — Selective Access
# ============================================

# OpenAI
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/v1/
Disallow: /premium/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Allow: /pricing/
Allow: /about/
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/v1/
Disallow: /premium/

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Apply same pattern for other crawlers...

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
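Repeating the same Allow/Disallow block for every crawler is tedious and error-prone, so the selective template can be generated from a single policy map. A sketch in TypeScript (the policy object shape is an illustration for generating the file, not part of any robots.txt standard):

```typescript
// Generate robots.txt groups from a per-crawler policy map.
type CrawlerPolicy = { allow?: string[]; disallow?: string[] };

function buildRobotsTxt(
  policies: Record<string, CrawlerPolicy>,
  sitemapUrl: string,
): string {
  const groups = Object.entries(policies).map(([agent, p]) => {
    const lines = [`User-agent: ${agent}`];
    for (const path of p.allow ?? []) lines.push(`Allow: ${path}`);
    for (const path of p.disallow ?? []) lines.push(`Disallow: ${path}`);
    return lines.join("\n");
  });
  return groups.join("\n\n") + `\n\nSitemap: ${sitemapUrl}\n`;
}

const robots = buildRobotsTxt(
  {
    GPTBot: { allow: ["/blog/", "/docs/"], disallow: ["/app/", "/premium/"] },
    "OAI-SearchBot": { allow: ["/"] },
    "*": { allow: ["/"] },
  },
  "https://yourdomain.com/sitemap.xml",
);
```

With this approach, adding a new crawler or path rule means changing one entry in the map rather than editing every group by hand.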

## SSR Is Mandatory for AI Crawlability

Most AI crawlers do not execute JavaScript. If your content is rendered client-side (SPA/CSR), it is invisible to AI crawlers.

**Required:** Use one of these rendering strategies:

- **Server-Side Rendering (SSR)**: HTML generated on each request
- **Static Site Generation (SSG)**: HTML generated at build time
- **Incremental Static Regeneration (ISR)**: Static pages regenerated on a schedule

**Verification:** Run `curl -A "GPTBot" https://yourdomain.com/page` and check that the response HTML contains your full content, not an empty `<div id="root"></div>` container.

```bash
# Test what AI crawlers see
curl -s -A "GPTBot" https://yourdomain.com/your-page | grep -c "<article>"

# Compare with a JavaScript-capable render
# If curl returns empty but browser shows content, you have a CSR problem
```
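The same check can be automated in a build or CI step: if the fetched HTML is little more than an empty root container, the page is effectively invisible to non-JS crawlers. A rough heuristic sketch, assuming the common `root`/`__next`/`app` container ids and a 200-character visible-text threshold (both thresholds are illustrative assumptions):

```typescript
// Heuristic: does this HTML look like an empty client-side-rendered shell?
function looksLikeCsrShell(html: string): boolean {
  // Strip scripts and styles, drop all tags, then measure the visible text.
  const visible = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // An empty root container is the classic CSR signature.
  const hasEmptyRoot = /<div id="(root|__next|app)">\s*<\/div>/i.test(html);
  return hasEmptyRoot || visible.length < 200;
}
```

Fetch each page with an AI-crawler User-Agent (as in the curl example above) and run the response through this check to catch CSR regressions before they ship.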

## Crawl Traffic Scale (2025)

Understanding the scale of AI crawling helps set expectations:

- Over 560,000 websites include AI bot directives in robots.txt
- GPTBot grew from 5% to 30% of AI crawler share between May 2024 and May 2025
- Meta-ExternalAgent accounts for 19% of AI crawler traffic
- The crawl-to-referral ratio is extremely asymmetric:
  - OpenAI: 1,700:1 (1,700 crawl requests per referral visit)
  - Anthropic: 73,000:1 (73,000 crawl requests per referral visit)
- AI platforms crawl your content heavily but send very few visitors back
- The value is in being cited (brand mention, authority signal), not in receiving traffic
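To make those ratios concrete, the expected referral traffic implied by a given crawl volume is a one-line calculation (the 170,000-request figure below is an illustrative number, not a statistic from this reference):

```typescript
// Expected referral visits implied by a crawl-to-referral ratio.
function expectedReferrals(crawlRequests: number, ratio: number): number {
  return Math.floor(crawlRequests / ratio);
}

// 170,000 OpenAI crawl requests at ~1,700:1 → about 100 referral visits.
expectedReferrals(170_000, 1_700);
```

This is why citation presence, not referral counts, is the metric to optimize for.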

## Site Architecture for AI Consumption

AI crawlers parse your site structure to understand content relationships. Optimize architecture for machine readability.

### Semantic HTML

Use proper HTML5 semantic elements — AI crawlers use these to understand content structure:

```html
<article>
  <header>
    <h1>Primary Topic</h1>
    <time datetime="2025-12-15">Last Updated: December 15, 2025</time>
  </header>

  <section>
    <h2>Subtopic</h2>
    <p>Content organized in self-contained passages of 134-167 words...</p>

    <table>
      <caption>Comparison of Options</caption>
      <thead>
        <tr><th>Feature</th><th>Option A</th><th>Option B</th></tr>
      </thead>
      <tbody>
        <tr><td>Price</td><td>$29/mo</td><td>$49/mo</td></tr>
      </tbody>
    </table>
  </section>

  <section>
    <h2>FAQ</h2>
    <dl>
      <dt>What is the pricing?</dt>
      <dd>Plans start at $29/month for up to 10,000 events.</dd>
    </dl>
  </section>
</article>
```

### Heading Hierarchy

Maintain clean nesting — never skip levels:

```
H1 (one per page) — Page title
  H2 — Major section
    H3 — Subsection
      H4 — Detail (use sparingly)
  H2 — Next major section
    H3 — Subsection
```
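The no-skipped-levels rule can be checked mechanically, for example in a content linter. A sketch that validates a page's heading levels in document order (the single-H1 requirement and the step-down-freely rule follow the hierarchy shown above):

```typescript
// Validate heading levels (1-6) in document order: exactly one H1,
// starting at H1, and never jumping more than one level deeper
// than the previous heading. Moving back up any number of levels is fine.
function validateHeadingOrder(levels: number[]): boolean {
  if (levels.length === 0 || levels[0] !== 1) return false;
  if (levels.filter((l) => l === 1).length !== 1) return false;
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) return false; // skipped a level
  }
  return true;
}
```

Feed it the sequence of heading levels extracted from a rendered page (e.g. `[1, 2, 3, 2, 3]` for the hierarchy diagram above) to catch H2→H4 jumps before publishing.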

### Clean URLs

AI systems use URL structure to infer content topics:

```
Good:  /docs/api/authentication
Bad:   /d?id=38172&cat=3
Good:  /blog/geo-optimization-guide-2025
Bad:   /blog/post-38172
```
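Slugs like `/blog/geo-optimization-guide-2025` are usually derived from the page title. A minimal slug helper, one common way to sketch it (this is a generic pattern, not tied to any particular framework):

```typescript
// Turn a page title into a clean, hyphenated URL slug.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumeric runs to one hyphen
    .replace(/^-+|-+$/g, "");    // trim leading/trailing hyphens
}

slugify("GEO Optimization Guide (2025)"); // "geo-optimization-guide-2025"
```

Combined with a topical directory prefix (`/blog/`, `/docs/`), this produces the "Good" URL shapes shown above.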

## Next.js Middleware for Tracking AI Crawler Hits

Monitor which AI crawlers are visiting your site and which pages they access:

```typescript
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

const AI_CRAWLERS: Record<string, string> = {
  'GPTBot': 'openai-training',
  'OAI-SearchBot': 'openai-search',
  'ChatGPT-User': 'openai-user',
  'ClaudeBot': 'anthropic-training',
  'Claude-SearchBot': 'anthropic-search',
  'Claude-User': 'anthropic-user',
  'PerplexityBot': 'perplexity',
  'Perplexity-User': 'perplexity-user',
  'Google-Extended': 'google-ai',
  'Applebot-Extended': 'apple-ai',
  'Meta-ExternalAgent': 'meta-training',
  'Bytespider': 'bytedance',
  'YouBot': 'you-com',
  'PhindBot': 'phind',
};

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') || '';
  const response = NextResponse.next();

  for (const [botName, category] of Object.entries(AI_CRAWLERS)) {
    if (ua.includes(botName)) {
      console.log(JSON.stringify({
        event: 'ai_crawler_hit',
        crawler: botName,
        category,
        path: request.nextUrl.pathname,
        timestamp: new Date().toISOString(),
      }));

      response.headers.set('X-AI-Crawler', botName);
      break;
    }
  }

  return response;
}

export const config = {
  matcher: [
    '/((?!_next/static|_next/image|favicon.ico).*)',
  ],
};
```

**Production logging:** Replace `console.log` with your analytics service (e.g., send to a database, a Vercel Analytics custom event, or a dedicated AI crawler analytics endpoint).
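As one illustration of swapping out `console.log`, the hit could be posted to an internal collection endpoint. A sketch (the `/api/crawler-hits` URL is a hypothetical example endpoint, not a real API):

```typescript
// Build a structured crawler-hit event and post it to an internal
// collection endpoint (the endpoint URL below is a hypothetical example).
type CrawlerHit = {
  event: "ai_crawler_hit";
  crawler: string;
  category: string;
  path: string;
  timestamp: string;
};

function buildCrawlerHit(crawler: string, category: string, path: string): CrawlerHit {
  return {
    event: "ai_crawler_hit",
    crawler,
    category,
    path,
    timestamp: new Date().toISOString(),
  };
}

async function reportCrawlerHit(hit: CrawlerHit): Promise<void> {
  // keepalive lets the request complete even as the response returns.
  await fetch("https://yourdomain.com/api/crawler-hits", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(hit),
    keepalive: true,
  });
}
```

Inside the middleware loop, `reportCrawlerHit(buildCrawlerHit(botName, category, request.nextUrl.pathname))` would replace the `console.log` call; fire-and-forget is usually acceptable here since losing an occasional event is cheaper than delaying the response.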

## Sitemap Strategy

Submit your XML sitemap to both Google Search Console and Bing Webmaster Tools:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/docs/getting-started</loc>
    <lastmod>2025-12-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
</urlset>
```

Key points:

- Include accurate `<lastmod>` dates — AI systems use these to evaluate freshness
- Use the IndexNow protocol for instant Bing indexing (see the platform-specific-geo skill)
- Prioritize high-value content pages in your sitemap structure
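Keeping `<lastmod>` accurate is easiest when sitemap entries are generated from content metadata rather than edited by hand. A minimal sketch that emits entries matching the XML above (the helper names are illustrative, and `changefreq` is omitted for brevity):

```typescript
// Build a sitemap <url> entry with an accurate lastmod date.
function sitemapEntry(loc: string, lastModified: Date, priority = 0.5): string {
  const lastmod = lastModified.toISOString().slice(0, 10); // YYYY-MM-DD
  return [
    "  <url>",
    `    <loc>${loc}</loc>`,
    `    <lastmod>${lastmod}</lastmod>`,
    `    <priority>${priority.toFixed(1)}</priority>`,
    "  </url>",
  ].join("\n");
}

function buildSitemap(entries: string[]): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}
```

Wiring `lastModified` to the real update time of each page (e.g. a CMS timestamp or file mtime) is what makes the freshness signal trustworthy.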
