AI crawlers are systematically scanning websites across the internet, harvesting content to train large language models like ChatGPT, Claude, and Gemini. If you’re concerned about your content being used without permission or compensation, this guide covers everything you need to know about blocking these bots.
Why Block AI Crawlers?
Before diving into the how, let’s address the why. There are several legitimate reasons to block AI crawlers:
- Content monetisation: AI companies profit from your content without compensation
- Competitive disadvantage: AI trained on your content can help competitors create similar material
- Server costs: High-volume crawling increases infrastructure costs without business benefit
- Content misrepresentation: AI may present outdated or inaccurate versions of your content
Method 1: Block AI Crawlers with robots.txt
The simplest method is adding rules to your robots.txt file. While this relies on bots voluntarily complying, most legitimate AI crawlers respect these directives.
Add these rules to your robots.txt file (located at yoursite.com/robots.txt):
# Block OpenAI (ChatGPT)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
# Block Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
# Block Google AI (Gemini)
User-agent: Google-Extended
Disallow: /
# Block Common Crawl (used by many AI models)
User-agent: CCBot
Disallow: /
# Block Perplexity
User-agent: PerplexityBot
Disallow: /
# Block ByteDance (TikTok's AI)
User-agent: Bytespider
Disallow: /
# Block Apple AI
User-agent: Applebot-Extended
Disallow: /
# Block Amazon AI
User-agent: Amazonbot
Disallow: /
# Block Meta AI
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Block Cohere
User-agent: cohere-ai
Disallow: /
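If you would rather keep the file short, the robots.txt standard (RFC 9309) allows several User-agent lines to share a single group of rules, so the agents above can also be grouped like this:

```
# Compact alternative: one rule group covering several AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Bytespider
Disallow: /
```

Both forms are equivalent for compliant crawlers; the per-agent layout is simply easier to annotate with comments.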
Complete AI Crawler Reference
Here’s a comprehensive list of known AI crawlers and their operators:
| User Agent | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT |
| ClaudeBot | Anthropic | Training data for Claude |
| Claude-Web | Anthropic | Legacy Claude crawler |
| Google-Extended | Google | Training data for Gemini |
| CCBot | Common Crawl | Dataset used by many AI models |
| PerplexityBot | Perplexity | AI search engine |
| Bytespider | ByteDance | TikTok's AI (Doubao) |
| Applebot-Extended | Apple | Apple AI training |
| Amazonbot | Amazon | Alexa and Amazon AI |
| FacebookBot | Meta | Meta AI training |
| Meta-ExternalAgent | Meta | Meta AI training |
| cohere-ai | Cohere | Enterprise AI models |
Method 2: HTTP Headers
Some crawlers and scraping tools respect the non-standard `noai` directives delivered via the X-Robots-Tag HTTP header. Adoption is far from universal, but adding this response header signals that your content should not be used for AI training:
X-Robots-Tag: noai, noimageai
For Apache (.htaccess):
Header set X-Robots-Tag "noai, noimageai"
For Nginx:
add_header X-Robots-Tag "noai, noimageai";
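Once the header is deployed, you can confirm it from the command line. This is a rough sketch using a hypothetical `check_noai` helper built on curl and grep (not part of any standard tooling):

```shell
# check_noai URL: fetch the response headers and report whether a
# "noai" signal is present in the X-Robots-Tag header.
check_noai() {
  if curl -sI "$1" | grep -qi 'noai'; then
    echo "noai header present"
  else
    echo "noai header missing"
  fi
}
```

Run it against your own domain, e.g. `check_noai https://yoursite.com/`.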
Method 3: JavaScript Rendering as a Defence
Here’s where it gets interesting. A recent study by Vercel and MERJ found that most AI crawlers struggle significantly with JavaScript-rendered content.
This finding applies to several major AI crawlers:
- GPTBot (OpenAI)
- ClaudeBot (Anthropic)
- PerplexityBot
- Applebot (Apple)
By using heavy JavaScript rendering, you’re essentially putting on an invisibility cloak that works against most AI crawlers. There’s one notable exception: Googlebot, which still manages to render JavaScript effectively.
Trade-offs to Consider
While JavaScript rendering can deter AI bots, it comes with trade-offs:
- It may hide your content from AI-powered search results (reducing discoverability)
- Users relying on AI assistants to find content won’t see your pages
- Heavy JavaScript can slow page load times
Strategic recommendations:
- Use JavaScript rendering selectively to protect your most valuable content
- Start with crawler-friendly content, then enhance with JavaScript
- If you depend on AI search for traffic, use a more balanced approach
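A quick way to see what a non-JS crawler receives is to fetch the raw HTML and search for a phrase from your page: if the phrase is missing, that content only exists after JavaScript runs. This sketch uses a hypothetical `page_has_text` helper:

```shell
# page_has_text URL PHRASE: succeeds if PHRASE appears in the raw HTML,
# i.e. the HTML as served before any JavaScript executes.
page_has_text() {
  curl -s "$1" | grep -qi "$2"
}
```

For example, `page_has_text https://yoursite.com/post "a sentence from the article" && echo visible` tells you whether that sentence is exposed to crawlers that don't render JavaScript.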
How to Verify Your Blocks Are Working
After implementing your blocking rules, verify they’re working correctly.
Test robots.txt
Use curl to simulate an AI crawler request:
# Test as GPTBot
curl -A "GPTBot" https://yoursite.com/robots.txt
# Test as ClaudeBot
curl -A "ClaudeBot" https://yoursite.com/robots.txt
You should see your Disallow rules in the response.
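To make that spot-check repeatable across many agents, you can wrap it in a small helper. `check_blocked` below is a hypothetical, deliberately rough check (not a full robots.txt parser): it looks for a `Disallow: /` rule within a few lines of the given User-agent line:

```shell
# check_blocked FILE AGENT: succeeds if FILE contains a
# "User-agent: AGENT" line followed shortly by "Disallow: /".
check_blocked() {
  grep -i -A 5 "^User-agent: $2" "$1" | grep -qi '^Disallow: */ *$'
}
```

Fetch your live file first (`curl -s https://yoursite.com/robots.txt > robots.txt`), then run checks such as `check_blocked robots.txt GPTBot && echo "GPTBot blocked"`.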
Check Server Logs
Monitor your server logs for AI crawler activity. Look for user agents containing:
- GPTBot
- ClaudeBot
- CCBot
- Bytespider
- PerplexityBot
In Apache logs:
grep -E "GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot" /var/log/apache2/access.log
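If you want a tally per crawler rather than raw log lines, you can count matches instead. `ai_bot_counts` is a hypothetical helper that assumes the user agent string appears somewhere in each log line (as it does in the common combined log format):

```shell
# ai_bot_counts LOGFILE: print a count of log lines per known AI crawler,
# sorted with the most active crawler first.
ai_bot_counts() {
  grep -Eo 'GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Amazonbot' "$1" \
    | sort | uniq -c | sort -rn
}
```

For example, `ai_bot_counts /var/log/apache2/access.log` gives a quick picture of which bots are hitting your site hardest.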
Limitations to Be Aware Of
No blocking method is foolproof:
- robots.txt is voluntary: Malicious crawlers may ignore it entirely
- User agent spoofing: Some crawlers disguise themselves as regular browsers
- Distributed crawling: Crawlers may use many different IP addresses
- Third-party datasets: Your content may already exist in training datasets like Common Crawl
For comprehensive protection, consider combining multiple methods: robots.txt rules, HTTP headers, JavaScript rendering for sensitive content, and regular log monitoring.
Conclusion
Who would have thought that JavaScript would become an unexpected ally in the battle against AI content scraping? While it’s an interesting development, a layered approach works best. Start with robots.txt rules to block well-behaved crawlers, add HTTP headers for additional signalling, and consider JavaScript rendering for your most valuable content.
Remember that the AI crawling landscape is constantly evolving. Today’s blocking tactic might become obsolete tomorrow as crawlers become more sophisticated. Stay informed about new AI crawlers and update your blocking strategy accordingly.