AI crawlers are systematically scanning websites across the internet, harvesting content to train large language models like ChatGPT, Claude, and Gemini. If you’re concerned about your content being used without permission or compensation, this guide covers everything you need to know about blocking these bots.
Why Block AI Crawlers?
Before diving into the how, let’s address the why. There are several legitimate reasons to block AI crawlers:
- Content monetisation: AI companies profit from your content without compensation
- Competitive disadvantage: AI trained on your content can help competitors create similar material
- Server costs: High-volume crawling increases infrastructure costs without business benefit
- Content misrepresentation: AI may present outdated or inaccurate versions of your content
Method 1: Block AI Crawlers with robots.txt
The simplest method is adding rules to your robots.txt file. While this relies on bots voluntarily complying, most legitimate AI crawlers respect these directives.
Add these rules to your robots.txt file (located at yoursite.com/robots.txt):
# Block OpenAI (ChatGPT)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
# Block Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
# Block Google AI (Gemini)
User-agent: Google-Extended
Disallow: /
# Block Common Crawl (used by many AI models)
User-agent: CCBot
Disallow: /
# Block Perplexity
User-agent: PerplexityBot
Disallow: /
# Block ByteDance (TikTok's AI)
User-agent: Bytespider
Disallow: /
# Block Apple AI
User-agent: Applebot-Extended
Disallow: /
# Block Amazon AI
User-agent: Amazonbot
Disallow: /
# Block Meta AI
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Block Cohere
User-agent: cohere-ai
Disallow: /
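If you would rather keep the file short, the robots.txt standard (RFC 9309) allows several User-agent lines to share a single group of rules, so the agents above can also be grouped like this:

```
# Compact alternative: one rule group covering several AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Bytespider
Disallow: /
```

Both forms are equivalent for compliant crawlers; the per-agent layout is simply easier to annotate with comments.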
Complete AI Crawler Reference
Here’s a comprehensive list of known AI crawlers and their operators:
| User Agent | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT |
| ClaudeBot | Anthropic | Training data for Claude |
| Claude-Web | Anthropic | Legacy Claude crawler |
| Google-Extended | Google | Training data for Gemini |
| CCBot | Common Crawl | Dataset used by many AI models |
| PerplexityBot | Perplexity | AI search engine |
| Bytespider | ByteDance | TikTok's AI (Doubao) |
| Applebot-Extended | Apple | Apple AI training |
| Amazonbot | Amazon | Alexa and Amazon AI |
| FacebookBot | Meta | Meta AI training |
| Meta-ExternalAgent | Meta | Meta AI training |
| cohere-ai | Cohere | Enterprise AI models |
Method 2: HTTP Headers
Some crawlers and scraping tools respect the non-standard `noai` directives delivered via the X-Robots-Tag HTTP header. Adoption is far from universal, but adding this response header signals that your content should not be used for AI training:
X-Robots-Tag: noai, noimageai
For Apache (.htaccess):
Header set X-Robots-Tag "noai, noimageai"
For Nginx:
add_header X-Robots-Tag "noai, noimageai";
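Once the header is deployed, you can confirm it from the command line. This is a rough sketch using a hypothetical `check_noai` helper built on curl and grep (not part of any standard tooling):

```shell
# check_noai URL: fetch the response headers and report whether a
# "noai" signal is present in the X-Robots-Tag header.
check_noai() {
  if curl -sI "$1" | grep -qi 'noai'; then
    echo "noai header present"
  else
    echo "noai header missing"
  fi
}
```

Run it against your own domain, e.g. `check_noai https://yoursite.com/`.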
Method 3: JavaScript Rendering as a Defence
Here’s where it gets interesting. A recent study by Vercel and MERJ found that most AI crawlers struggle significantly with JavaScript-rendered content.
This finding applies to several major AI crawlers:
- GPTBot (OpenAI)
- ClaudeBot (Anthropic)
- PerplexityBot
- Applebot (Apple)
By using heavy JavaScript rendering, you’re essentially putting on an invisibility cloak that works against most AI crawlers. There’s one notable exception: Googlebot, which still manages to render JavaScript effectively.
Trade-offs to Consider
While JavaScript rendering can deter AI bots, it comes with trade-offs:
- It may hide your content from AI-powered search results (reducing discoverability)
- Users relying on AI assistants to find content won’t see your pages
- Heavy JavaScript can slow page load times
Strategic recommendations:
- Use JavaScript rendering selectively to protect your most valuable content
- Start with crawler-friendly content, then enhance with JavaScript
- If you depend on AI search for traffic, use a more balanced approach
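A quick way to see what a non-JS crawler receives is to fetch the raw HTML and search for a phrase from your page: if the phrase is missing, that content only exists after JavaScript runs. This sketch uses a hypothetical `page_has_text` helper:

```shell
# page_has_text URL PHRASE: succeeds if PHRASE appears in the raw HTML,
# i.e. the HTML as served before any JavaScript executes.
page_has_text() {
  curl -s "$1" | grep -qi "$2"
}
```

For example, `page_has_text https://yoursite.com/post "a sentence from the article" && echo visible` tells you whether that sentence is exposed to crawlers that don't render JavaScript.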
How to Verify Your Blocks Are Working
After implementing your blocking rules, verify they’re working correctly.
Test robots.txt
Use curl to simulate an AI crawler request:
# Test as GPTBot
curl -A "GPTBot" https://yoursite.com/robots.txt
# Test as ClaudeBot
curl -A "ClaudeBot" https://yoursite.com/robots.txt
You should see your Disallow rules in the response.
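To make that spot-check repeatable across many agents, you can wrap it in a small helper. `check_blocked` below is a hypothetical, deliberately rough check (not a full robots.txt parser): it looks for a `Disallow: /` rule within a few lines of the given User-agent line:

```shell
# check_blocked FILE AGENT: succeeds if FILE contains a
# "User-agent: AGENT" line followed shortly by "Disallow: /".
check_blocked() {
  grep -i -A 5 "^User-agent: $2" "$1" | grep -qi '^Disallow: */ *$'
}
```

Fetch your live file first (`curl -s https://yoursite.com/robots.txt > robots.txt`), then run checks such as `check_blocked robots.txt GPTBot && echo "GPTBot blocked"`.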
Check Server Logs
Monitor your server logs for AI crawler activity. Look for user agents containing:
- GPTBot
- ClaudeBot
- CCBot
- Bytespider
- PerplexityBot
In Apache logs:
grep -E "GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot" /var/log/apache2/access.log
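If you want a tally per crawler rather than raw log lines, you can count matches instead. `ai_bot_counts` is a hypothetical helper that assumes the user agent string appears somewhere in each log line (as it does in the common combined log format):

```shell
# ai_bot_counts LOGFILE: print a count of log lines per known AI crawler,
# sorted with the most active crawler first.
ai_bot_counts() {
  grep -Eo 'GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Amazonbot' "$1" \
    | sort | uniq -c | sort -rn
}
```

For example, `ai_bot_counts /var/log/apache2/access.log` gives a quick picture of which bots are hitting your site hardest.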
Limitations to Be Aware Of
No blocking method is foolproof:
- robots.txt is voluntary: Malicious crawlers may ignore it entirely
- User agent spoofing: Some crawlers disguise themselves as regular browsers
- Distributed crawling: Crawlers may use many different IP addresses
- Third-party datasets: Your content may already exist in training datasets like Common Crawl
For comprehensive protection, consider combining multiple methods: robots.txt rules, HTTP headers, JavaScript rendering for sensitive content, and regular log monitoring.
Conclusion
Who would have thought that JavaScript would become an unexpected ally in the battle against AI content scraping? While it’s an interesting development, a layered approach works best. Start with robots.txt rules to block well-behaved crawlers, add HTTP headers for additional signalling, and consider JavaScript rendering for your most valuable content.
Remember that the AI crawling landscape is constantly evolving. Today’s blocking tactic might become obsolete tomorrow as crawlers become more sophisticated. Stay informed about new AI crawlers and update your blocking strategy accordingly.