How to Block AI Crawlers from Scraping Your Website

Use robots.txt rules, HTTP headers, and JavaScript rendering to stop AI bots from crawling your site.

AI crawlers are systematically scanning websites across the internet, harvesting content to train large language models like ChatGPT, Claude, and Gemini. If you’re concerned about your content being used without permission or compensation, this guide covers everything you need to know about blocking these bots.

Why Block AI Crawlers?

Before diving into the how, let’s address the why. There are several legitimate reasons to block AI crawlers:

  • Content monetisation: AI companies profit from your content without compensation
  • Competitive disadvantage: AI trained on your content can help competitors create similar material
  • Server costs: High-volume crawling increases infrastructure costs without business benefit
  • Content misrepresentation: AI may present outdated or inaccurate versions of your content

Method 1: Block AI Crawlers with robots.txt

The simplest method is adding rules to your robots.txt file. While this relies on bots voluntarily complying, most legitimate AI crawlers respect these directives.

Add these rules to your robots.txt file (located at yoursite.com/robots.txt):

# Block OpenAI (ChatGPT)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Block Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

# Block Google AI (Gemini)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl (used by many AI models)
User-agent: CCBot
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

# Block ByteDance (TikTok's AI)
User-agent: Bytespider
Disallow: /

# Block Apple AI
User-agent: Applebot-Extended
Disallow: /

# Block Amazon AI
User-agent: Amazonbot
Disallow: /

# Block Meta AI
User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Block Cohere
User-agent: cohere-ai
Disallow: /

Complete AI Crawler Reference

Here’s a comprehensive list of known AI crawlers and their operators:

User Agent          Operator       Purpose
GPTBot              OpenAI         Training data for ChatGPT
ChatGPT-User        OpenAI         Real-time browsing for ChatGPT
ClaudeBot           Anthropic      Training data for Claude
Google-Extended     Google         Training data for Gemini
CCBot               Common Crawl   Dataset used by many AI models
PerplexityBot       Perplexity     AI search engine
Bytespider          ByteDance      TikTok's AI (Doubao)
Amazonbot           Amazon         Alexa and Amazon AI
Meta-ExternalAgent  Meta           Meta AI training
cohere-ai           Cohere         Enterprise AI models
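If you keep this list in one place, you can generate the robots.txt rules from it instead of editing them by hand. A minimal Python sketch (the agent list mirrors the robots.txt example above):

```python
# AI crawler user agents to block (mirrors the reference table above).
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "Google-Extended", "CCBot", "PerplexityBot", "Bytespider",
    "Applebot-Extended", "Amazonbot", "FacebookBot",
    "Meta-ExternalAgent", "cohere-ai",
]

def build_robots_txt(agents):
    """Return robots.txt content with a Disallow-all block per user agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(blocks) + "\n"

if __name__ == "__main__":
    print(build_robots_txt(AI_CRAWLERS))
```

Regenerating the file from a single list makes it easy to add new crawlers as they appear.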

Method 2: HTTP Headers

Some AI companies respect HTTP headers that communicate your preferences about AI training. Note that noai is an emerging convention rather than a formal standard, so treat it as an extra signal rather than a guarantee. Add this response header to indicate that your content should not be used for AI training:

X-Robots-Tag: noai, noimageai

For Apache (.htaccess):

Header set X-Robots-Tag "noai, noimageai"

For Nginx:

add_header X-Robots-Tag "noai, noimageai";
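If you can't change the web-server configuration, the same header can be added in application code. Here's a minimal, framework-agnostic WSGI sketch in Python; demo_app is a placeholder for your real application:

```python
# WSGI middleware that appends the X-Robots-Tag header to every response.
def noai_middleware(app):
    def wrapped(environ, start_response):
        def custom_start(status, headers, exc_info=None):
            # Append the AI opt-out header alongside the app's own headers.
            headers = headers + [("X-Robots-Tag", "noai, noimageai")]
            return start_response(status, headers, exc_info)
        return app(environ, custom_start)
    return wrapped

def demo_app(environ, start_response):
    # Placeholder application; swap in your real WSGI app.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

app = noai_middleware(demo_app)
```

Most Python frameworks (Flask, Django) are WSGI apps under the hood, so this wrapping pattern applies broadly; equivalents exist for Node, PHP, and other stacks.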

Method 3: JavaScript Rendering as a Defence

Here’s where it gets interesting. A recent study by Vercel and MERJ revealed that most AI crawlers struggle significantly with JavaScript-rendered content: they may fetch JavaScript files, but they don’t execute them, so client-rendered content never reaches them.

This finding applies to several major AI crawlers:

  • GPTBot (OpenAI)
  • ClaudeBot (Anthropic)
  • PerplexityBot
  • Applebot (Apple)

By using heavy JavaScript rendering, you’re essentially putting on an invisibility cloak that works against most AI crawlers. There’s one notable exception: Googlebot, which still manages to render JavaScript effectively.

Trade-offs to Consider

While JavaScript rendering can deter AI bots, it comes with trade-offs:

  • It may hide your content from AI-powered search results (reducing discoverability)
  • Users relying on AI assistants to find content won’t see your pages
  • Heavy JavaScript can slow page load times

Strategic recommendations:

  • Use JavaScript rendering selectively to protect your most valuable content
  • Start with crawler-friendly content, then enhance with JavaScript
  • If you depend on AI search for traffic, use a more balanced approach

How to Verify Your Blocks Are Working

After implementing your blocking rules, verify they’re working correctly.

Test robots.txt

Use curl to simulate an AI crawler request:

# Test as GPTBot
curl -A "GPTBot" https://yoursite.com/robots.txt

# Test as ClaudeBot
curl -A "ClaudeBot" https://yoursite.com/robots.txt

You should see your Disallow rules in the response.
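You can also check your rules programmatically. Python’s standard-library robots.txt parser applies the same matching logic well-behaved crawlers use; in production you would point it at https://yoursite.com/robots.txt, but here it parses an inline example:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; in production, use set_url() + read()
# to fetch your live file instead.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://yoursite.com/article"))      # False
print(parser.can_fetch("Mozilla/5.0", "https://yoursite.com/article")) # True
```

This is a quick way to catch typos in user-agent names or path rules before a crawler ever hits your site.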

Check Server Logs

Monitor your server logs for AI crawler activity. Look for user agents containing:

  • GPTBot
  • ClaudeBot
  • CCBot
  • Bytespider
  • PerplexityBot

In Apache logs:

grep -E "GPTBot|ClaudeBot|CCBot|Bytespider" /var/log/apache2/access.log
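For a summary rather than raw grep output, a short Python sketch can tally hits per AI crawler from an access log. The log path in the comment is an assumption; point it at your own file:

```python
import re
from collections import Counter

# User-agent substrings to look for in access-log lines.
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot"]
BOT_PATTERN = re.compile("|".join(AI_BOTS))

def count_ai_hits(lines):
    """Count access-log lines per matched AI crawler user agent."""
    counts = Counter()
    for line in lines:
        match = BOT_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

# Example usage (path is an assumption for your setup):
#   with open("/var/log/apache2/access.log") as f:
#       print(count_ai_hits(f))
```

Running this on a schedule gives you a rough sense of whether crawler traffic drops after you deploy your robots.txt rules.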

Limitations to Be Aware Of

No blocking method is foolproof:

  • robots.txt is voluntary: Malicious crawlers may ignore it entirely
  • User agent spoofing: Some crawlers disguise themselves as regular browsers
  • Distributed crawling: Crawlers may use many different IP addresses
  • Third-party datasets: Your content may already exist in training datasets like Common Crawl

For comprehensive protection, consider combining multiple methods: robots.txt rules, HTTP headers, JavaScript rendering for sensitive content, and regular log monitoring.

Conclusion

Who would have thought that JavaScript would become an unexpected ally in the battle against AI content scraping? While it’s an interesting development, a layered approach works best. Start with robots.txt rules to block well-behaved crawlers, add HTTP headers for additional signalling, and consider JavaScript rendering for your most valuable content.

Remember that the AI crawling landscape is constantly evolving. Today’s blocking tactic might become obsolete tomorrow as crawlers become more sophisticated. Stay informed about new AI crawlers and update your blocking strategy accordingly.
