If you’re running a website, you might notice various web crawlers accessing your content. Among these, Google’s crawlers are essential for getting your site indexed and ranked in search results. However, a user-agent string is easy to fake, so spammers or troublemakers sometimes label their requests as Googlebot to access your site. This article explains how you can verify whether a crawler is genuinely from Google or an impostor.
Understanding Google’s crawler types
Google uses three main categories of crawlers to access websites.
- The first type includes common crawlers like Googlebot, which respect robots.txt rules and are used for Google’s main products.
- The second category consists of special-case crawlers such as AdsBot, which perform specific functions for Google products and may or may not follow robots.txt rules.
- The third type includes user-triggered fetchers, which are tools that fetch content when requested by a user and typically ignore robots.txt rules.
How to verify Googlebot and other Google crawlers
Manual verification method
For most website owners, the manual verification method is sufficient. This approach uses command line tools and is perfect for one-off lookups. The process involves four simple steps.
Step 1: Run a reverse DNS lookup
Open your command line interface and type the ‘host’ command followed by the IP address from your logs. The ‘host’ utility is available in Terminal on Mac and most Linux systems; Windows Command Prompt does not include it, so use ‘nslookup’ there instead. For example:
host 66.249.66.1
This command returns the reverse DNS (PTR) record for that IP address, which is the hostname registered for it.
Step 2: Check the domain name
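Look at the result from Step 1 and verify that the hostname ends with googlebot.com, google.com, or googleusercontent.com. The match must be at the very end of the name and on a dot boundary: crawl-66-249-66-1.googlebot.com passes, while googlebot.com.example.net and fakegooglebot.com do not. If the name doesn’t end with one of these domains, the crawler is not from Google.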
For a genuine Googlebot, you might see something like:
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com
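The suffix check in Step 2 is easy to get subtly wrong in a script, so here is a minimal Python sketch of it. The function name is_google_hostname and the constant GOOGLE_SUFFIXES are illustrative names chosen for this example; the three domains come from the step above.

```python
# Domains listed in Step 2. The leading dot forces a match on a label
# boundary, so "fakegooglebot.com" is not mistaken for "googlebot.com".
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_google_hostname(hostname: str) -> bool:
    """Return True if the hostname sits under one of the Google crawler domains."""
    # Reverse lookups often print a trailing dot; drop it and normalize case.
    host = hostname.rstrip(".").lower()
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return host.endswith(GOOGLE_SUFFIXES)


print(is_google_hostname("crawl-66-249-66-1.googlebot.com."))
print(is_google_hostname("googlebot.com.example.net"))
```

A plain substring test such as "googlebot.com" in hostname would accept the second name, which is why the sketch anchors the comparison to the end of the string.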
Step 3: Run a forward DNS lookup
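Next, you need to confirm the hostname by running a forward DNS lookup. This step matters because whoever controls an IP address also controls its reverse DNS record and can make it return any name, including one that ends in googlebot.com. Use the ‘host’ command again, but this time with the domain name you found in Step 2: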
host crawl-66-249-66-1.googlebot.com
Step 4: Compare the IP addresses
Check that the IP address returned in Step 3 matches the original IP address from your logs. If they match, the crawler is genuinely from Google. If not, it’s an impostor.
A genuine result would look like:
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Here are two more examples of how this process works:
For a geo-specific Googlebot:
host 35.247.243.240
240.243.247.35.in-addr.arpa domain name pointer geo-crawl-35-247-243-240.geo.googlebot.com.
host geo-crawl-35-247-243-240.geo.googlebot.com
geo-crawl-35-247-243-240.geo.googlebot.com has address 35.247.243.240
For a special-case crawler:
host 66.249.90.77
77.90.249.66.in-addr.arpa domain name pointer rate-limited-proxy-66-249-90-77.google.com.
host rate-limited-proxy-66-249-90-77.google.com
rate-limited-proxy-66-249-90-77.google.com has address 66.249.90.77
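If you want to script the four steps, the following Python sketch shows one way to do it with the standard library. It is an illustration under assumptions, not an official tool: the function names (reverse_lookup, forward_lookup, is_google_crawler) are made up for this example, and the lookups are passed in as parameters so the logic can be tried without network access.

```python
import ipaddress
import socket

# Domains from Step 2; the leading dot enforces a label boundary.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def reverse_lookup(ip: str) -> str:
    """Step 1: return the hostname from the IP's reverse DNS record."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> set:
    """Step 3: return every address the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_google_crawler(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Run the reverse lookup, suffix check, forward lookup, and comparison."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS record for this IP
    # Step 2: the hostname must end with one of the Google domains.
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
    except OSError:
        return False  # hostname does not resolve
    # Step 4: the original IP must be among the forward-resolved addresses.
    return ipaddress.ip_address(ip) in resolved
```

Calling is_google_crawler("66.249.66.1") performs live DNS queries, so the result depends on your resolver and on the current DNS records. For bulk log processing you would likely want to cache results per IP, since each check costs two lookups.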
Automatic verification method
If you need to verify Google crawlers on a larger scale, an automatic solution might be more suitable. This method involves matching the crawler’s IP address against published lists of Google crawler IP ranges.
Google provides JSON files with IP ranges for different types of crawlers, one file for each of the following:
- Common crawlers (like Googlebot)
- Special crawlers (like AdsBot)
- User-triggered fetchers (users)
- User-triggered fetchers (Google)
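Matching an IP address against those ranges can be sketched in Python with the standard ipaddress module. The embedded JSON below is a hand-written sample, not real published data, and the field names (prefixes, ipv4Prefix, ipv6Prefix) are an assumption about the file layout that you should check against the file you actually download. The function names are illustrative.

```python
import ipaddress
import json

# Hand-written sample in the assumed shape of a range file.
# Replace it with the contents of the file you download from Google.
SAMPLE_RANGES = """
{
  "prefixes": [
    {"ipv4Prefix": "66.249.66.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
"""


def load_networks(raw_json: str) -> list:
    """Parse the JSON text into a list of IPv4/IPv6 network objects."""
    networks = []
    for entry in json.loads(raw_json)["prefixes"]:
        # Each entry is assumed to carry either an IPv4 or an IPv6 CIDR block.
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """Return True if the IP falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(
        net.version == address.version and address in net for net in networks
    )


networks = load_networks(SAMPLE_RANGES)
print(ip_in_ranges("66.249.66.1", networks))
print(ip_in_ranges("203.0.113.9", networks))
```

Because the published ranges can change over time, a production setup would presumably re-download the files on a schedule rather than embed them as the sample does here.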
Conclusion
Verifying whether a crawler is genuinely from Google is an important step in protecting your website from potential spammers. The manual method is straightforward for occasional checks: run a reverse DNS lookup, verify the domain name, run a forward DNS lookup, and compare the IP addresses. For larger sites, automatic verification using Google’s published IP ranges offers a more scalable solution.
Keep in mind that these checks only identify impostors; they don’t block anything on their own. Once you know which requests are not genuinely from Google, you can decide how to handle them, for example by blocking or rate-limiting them, without risking the legitimate crawling your search visibility depends on.