Robots.txt is a powerful tool that acts as a gatekeeper for your website, telling search engines and other bots which parts of your site they can crawl and which they should avoid. Whether you’re looking to solve technical crawling issues or simply want to keep certain sections out of search engines’ reach, understanding how to craft effective robots.txt rules is essential for any website owner.
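The file itself is plain text served from the root of your domain, for example at https://example.com/robots.txt (example.com is a placeholder), and it consists of one or more groups that each begin with a user-agent line. A minimal sketch, with illustrative paths:
# A minimal robots.txt (the paths below are placeholders)
User-agent: *
Disallow: /tmp/
Disallow: /cart/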
What can you do with robots.txt?
Robots.txt is incredibly versatile. You can create simple rules or complex instructions targeting specific URL patterns. Here’s what you can achieve:
Target multiple bots with the same rule
Rule template: List multiple user-agents followed by the disallow rule.
Example:
User-agent: [first bot name]
User-agent: [second bot name]
Disallow: [path to restrict]
For instance, if you want to keep both Googlebot and Bingbot away from your search results pages, you could write:
User-agent: googlebot
User-agent: bingbot
Disallow: /search-results/
Block specific file types
Rule template: Specify a user-agent and use a wildcard to match the file extension: * stands for any sequence of characters, and the trailing $ anchors the rule to the end of the URL. (Major crawlers such as Googlebot and Bingbot support these wildcards, but not every bot does.)
Example:
User-agent: [bot name]
Disallow: /*.[file extension]$
If you wanted to prevent all bots from accessing your PDF documents, you might use:
User-agent: *
Disallow: /*.pdf$
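One subtlety worth noting: because $ requires the URL to end in .pdf, a URL that carries query parameters slips past the rule. A sketch with hypothetical paths:
# /docs/annual-report.pdf     -> blocked (the URL ends in .pdf)
# /docs/annual-report.pdf?v=2 -> not blocked (the query string means the URL no longer ends in .pdf)
User-agent: *
Disallow: /*.pdf$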
Allow crawling of some areas while restricting others
Rule template: Use allow and disallow in sequence for the same bot.
Example:
User-agent: [bot name]
Allow: [parent directory]/
Disallow: [parent directory]/[subdirectory]/
For a website with public articles but private drafts:
User-agent: *
Allow: /articles/
Disallow: /articles/drafts/
Block specific bots while allowing others
Rule template: Create a general rule for all bots, then specific rules for exceptions.
Example:
User-agent: *
Allow: /
User-agent: [specific bot to restrict]
Disallow: /
Allow: [limited access path]
To block an AI training bot while allowing search engines full access, you could write the rules below. The bot obeys only the group that names it specifically, and Allow: /$ leaves just the homepage crawlable, since $ anchors the rule to the end of the URL:
User-agent: *
Allow: /
User-agent: ai-training-bot
Disallow: /
Allow: /$
Add comments for clarity
Rule template: Use the # symbol to add notes.
Example:
# [your comment here]
User-agent: [bot name]
Disallow: [path]
For example, a note for your own or your team’s reference:
# Blocking access to our upcoming product pages until launch
User-agent: *
Disallow: /products/upcoming/
Useful robots.txt rules for website owners
Blocking your entire site from all crawlers
Rule template:
User-agent: *
Disallow: /
This tells all bots not to crawl any page on your site. Remember that this prevents crawling, not necessarily indexing: a page that other sites link to can still appear in search results. To keep a page out of the index entirely, add a noindex directive to the page itself and leave it crawlable so bots can actually read that directive.
Restricting access to specific directories
Rule template:
User-agent: [bot name]
Disallow: /[directory name]/
For example, to keep all bots out of your admin area:
User-agent: *
Disallow: /admin/
Remember that robots.txt isn’t for securing private content—it’s publicly visible and merely a request, not a strict barrier.
Allowing access to specific crawlers only
Rule template:
User-agent: [allowed bot]
Allow: /
User-agent: *
Disallow: /
If you want only Google News to access your site:
User-agent: googlebot-news
Allow: /
User-agent: *
Disallow: /
Blocking a single crawler
Rule template:
User-agent: [bot to block]
Disallow: /
User-agent: *
Allow: /
To block just one problematic bot (the catch-all Allow group isn’t strictly required, since bots with no matching group can crawl everything by default, but it makes your intent explicit):
User-agent: aggressive-crawler
Disallow: /
User-agent: *
Allow: /
Blocking specific pages
Rule template:
User-agent: [bot name]
Disallow: /[filename.html]
If you have a temporary page you don’t want crawled:
User-agent: *
Disallow: /temporary-promotion.html
Allowing access to only one directory
Rule template:
User-agent: [bot name]
Disallow: /
Allow: /[public directory]/
For a site under development with only a press area public:
User-agent: *
Disallow: /
Allow: /press-releases/
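This works because major crawlers such as Googlebot resolve a conflict between Allow and Disallow by following the most specific rule, meaning the one with the longest matching path. A sketch of how the rules above play out, using hypothetical URLs:
# /press-releases/launch.html -> allowed (Allow: /press-releases/ is the longest match)
# /internal/notes.html        -> blocked (only Disallow: / matches)
User-agent: *
Disallow: /
Allow: /press-releases/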
Managing image crawling
Rule template for blocking a specific image:
User-agent: [image bot]
Disallow: /[path to image]
Rule template for blocking all images:
User-agent: [image bot]
Disallow: /
To keep Google’s image crawler away from your product prototype images:
User-agent: googlebot-image
Disallow: /images/prototypes/
Blocking specific file types
Rule template:
User-agent: [bot name]
Disallow: /*.[file extension]$
To prevent all bots from crawling your spreadsheets:
User-agent: *
Disallow: /*.xlsx$
Allowing ad bots while blocking other crawlers
Rule template:
User-agent: *
Disallow: /
User-agent: [ad bot]
Allow: /
For a site that blocks regular crawlers but still serves ads (Mediapartners-Google is the crawler AdSense uses to analyze pages so it can show relevant ads):
User-agent: *
Disallow: /
User-agent: mediapartners-google
Allow: /
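Rules like these can sit side by side in one file, and each crawler follows only the group that matches it most specifically, ignoring the rest. As a sketch, here is how a few of the examples above might be combined into a single robots.txt (the bot names and paths are the illustrative ones used earlier):
# Default rules for all other bots
User-agent: *
Disallow: /admin/
Disallow: /products/upcoming/
# Keep the hypothetical AI training bot out entirely
User-agent: ai-training-bot
Disallow: /
# Let the AdSense crawler analyze pages for ad targeting
User-agent: mediapartners-google
Allow: /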
Conclusion
Robots.txt is a simple yet powerful tool for managing how bots interact with your website. By implementing the right rules, you can control which parts of your site are crawled, by which bots, and under what circumstances. While robots.txt can help manage bot traffic, it shouldn’t be used as a security measure for sensitive content. With these examples and guidelines, you can create an effective robots.txt file tailored to your website’s specific needs. If you need help with this for your site, contact Kahunam for a consultation.