Managing your crawl budget can help you rise up the rankings in Google search. It’s thought of as an advanced topic, but many of the website improvements required to manage the crawl budget are within reach of even small teams. In this article we’ll look at what we mean by crawl budget, how the budget for your site is determined and what steps you can take to optimise it.
What is the crawl budget?
Crawl budget can be thought of as the amount of energy that Google assigns to “crawling” your website. “GoogleBot” is the name of Google’s web crawling software. It traverses web pages and parses the content it finds. It also follows links on those pages to discover new URLs. Through this process, the crawler can find new content and add it to Google’s search index. Google runs this process at enormous scale, covering approximately 50 billion web pages and growing. This is part of the information that Google uses to determine how relevant your content is to a search query.
This process requires energy and computing power, so Google’s algorithms make a decision based on the content and setup of your website on how often and to what extent it should crawl the content on your site. In effect, it “budgets” part of its computing power to crawling your website, giving us the term – Crawl Budget.
How is the crawl budget determined?
The web is nearly infinite, so Google can’t explore and index every existing URL. Even with Google’s financial resources and technical know-how, crawling all of the web all of the time carries a cost, and not just a financial one; the environmental impact is also a consideration.
Consequently, Googlebot is picky about how much time it spends on your site.
The crawl budget is determined by two main factors: crawl rate limit and crawl demand.
Crawl Rate Limit
Googlebot needs to collect website data without overloading your website’s server. To prevent server overload, Googlebot calculates the crawl rate limit. This limit is the maximum number of simultaneous connections Googlebot can use to collect data on your website, as well as the wait time between fetching new pages. Googlebot calculates the limit to gather all essential content without stressing your server.
The crawl rate limit can increase or decrease based on several factors:
- Crawlability: The limit increases if the website responds quickly. This means Googlebot can use more connections to collect data. If the website slows down or responds with server errors, the limit decreases, and Googlebot reduces the crawl frequency.
- Google’s own crawl capacity: Google has many bots, but they’re not infinite. Its systems are powerful and redundant, but they are still subject to maintenance issues, spam and abuse like the rest of us, so Google has to choose how to spread its crawling resources across the web.
Crawl Demand
Crawl demand refers to Google’s appetite for crawling your website.
To explain, consider a popular news website versus the website for a small community knitting group. Which one do you think would be bigger? Would be updated more frequently? Would have better quality content? Or would bring the most value to Google’s users?
Well, sure, there’s a lot of poor-quality journalism out there, but the bigger news site would nearly always win this popularity contest. And regularly crawling the news website rather than the knitting group helps Google achieve its goal more effectively: serving highly relevant results that keep people coming back to Google to search, which in turn helps Google make more money from search advertising.
“If you really want to win with SEO, don’t lose sight of what keeps Google’s share price rising.”
– Scott Dooley, Founder, Kahunam
The official line is that Google allocates crawling effort based on a multitude of factors, such as your site’s size, update frequency, quality, and relevance compared to other websites.
Several factors play a crucial role in determining crawl demand:
- Content Freshness: Google prioritises crawling websites which frequently update their content to improve the quality of the information which is delivered to searchers. The logic being that old content which isn’t updated is more likely to contain out of date information.
- Website Authority: Websites with high domain authority and positive user signals are crawled more frequently. One example of a positive user signal is a user searching for your brand name in Google Search. Another is when a searcher, after visiting your page, does not return to the Google results to click through to other pages; this indicates to Google that your page has successfully answered the query. Building a memorable brand, a strong backlink profile and a positive user experience can enhance your website’s authority and attract more crawls.
- Site-wide Events: Major website changes like restructuring or content migration can trigger increased crawl activity to reindex the updated content. Informing Google about such changes through tools like the URL Inspection tool can help ensure efficient crawling and indexing.
In summary, Google considers crawl rate and crawl demand, then determines the crawl budget of a website as the set of URLs Googlebot can and wants to crawl. These factors don’t necessarily move together: even if the crawl rate limit hasn’t been reached, Googlebot may still crawl a site less if it determines crawl demand to be low.
How to manage crawl budget efficiently?
By managing your crawl budget well, you help search engines find and index your important pages more reliably, keeping your listings in Google fresher and potentially helping them rank higher. Here are some tips on how to manage your crawl budget.
Tell Google which URLs should be crawled and which should be ignored
If too much time is spent crawling unnecessary URLs, Googlebot may decide it isn’t worth crawling the rest of your site or increasing your crawl budget. Some methods to control URLs better:
Consolidate duplicate content
Duplicate content is when the same text or information is found on multiple websites or pages online, which can cause issues for search engines trying to decide which one to prioritize in search results.
Remove duplicate content to focus crawling efforts on unique content instead of multiple URLs. If you can’t remove it, use the following techniques to tell Google which version should be treated as the priority for crawling (a minimal example follows the list):
- Using redirects: A strong signal indicating that the target of the redirect will become the canonical page. However, you should avoid long redirect chains as they negatively impact crawling efficiency.
- Using the rel="canonical" link attribute: A canonical link is a tag added to a webpage’s HTML code that tells search engines which version of a page is the preferred or main version when similar content exists on multiple pages, helping to avoid duplicate content issues. It’s a strong signal indicating that the specified URL will become the canonical URL.
- Using inclusion in the sitemap: A weaker signal suggesting that a URL present in the sitemap will become the canonical page.
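For example, here is what the canonical approach might look like in practice. This is a minimal sketch with hypothetical URLs: a product page that is reachable at several parameterised addresses points Google at the single version you want treated as canonical.

<!-- Placed in the <head> of every duplicate variant, e.g. /shoes?colour=red or /shoes?sort=price -->
<link rel="canonical" href="https://www.example.com/shoes/" />

The redirect option works in a similar way: pointing each duplicate URL at the canonical one with a single 301 redirect (avoiding chains) sends an even stronger signal.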
Block URL crawling with robots.txt
Some pages may be important for users but not necessary to appear in search results.
If you can’t consolidate content as described above, block unimportant pages using robots.txt:
User-agent: *
Disallow: /path-to-unimportant-page/
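Faceted navigation, internal search results and tracking parameters are common sources of near-infinite, low-value URLs. As a rough sketch using hypothetical paths (Googlebot supports the * wildcard in robots.txt), the rules might look like this:

User-agent: *
# Hypothetical examples of low-value crawl paths
Disallow: /search
Disallow: /*?sort=
Disallow: /*?sessionid=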
Return a 404 or 410 status code for permanently deleted pages
Google won’t forget a known URL, but a 404 (or 410) status is a useful signal that there is nothing to crawl at that URL, so Googlebot can deprioritise it. By contrast, URLs blocked with robots.txt stay in the crawl queue much longer and will be recrawled once the block is removed.
Remove soft 404 errors
A soft 404 is a page that tells visitors the content doesn’t exist but still returns a 200 (OK) status code. Google will continue to crawl pages with soft 404 errors, wasting your crawl budget. Check the Index Coverage report in Google Search Console to identify soft 404 errors.
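As an illustration of both of these points, here is a minimal sketch of a page handler returning the right status codes. It assumes a Python/Flask application and a hypothetical product catalogue; the idea is simply that permanently removed pages return 410 and missing pages return a real 404 rather than a friendly message with a 200 status.

from flask import Flask, abort

app = Flask(__name__)

# Hypothetical data: live products and slugs that were permanently removed.
PRODUCTS = {"blue-widget": "Blue Widget"}
REMOVED = {"discontinued-widget"}

@app.route("/products/<slug>")
def product(slug):
    if slug in REMOVED:
        # 410 Gone: a clear, deliberate "this page was removed" signal.
        abort(410)
    if slug not in PRODUCTS:
        # A friendly message with a real 404 status code, not a 200,
        # so Google doesn't record the page as a soft 404.
        return "Sorry, we couldn't find that product.", 404
    return f"<h1>{PRODUCTS[slug]}</h1>"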
Regularly update your sitemap
Google frequently reads your sitemap, so include all content you want Google to crawl in it. If your site has newly updated content, use the <lastmod> tag.
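A minimal sitemap entry using <lastmod> might look like the sketch below (the URL and date are hypothetical). Only update the value when the page content genuinely changes, otherwise Google may learn to ignore it.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/crawl-budget/</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
</urlset>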
Test and review logs to ensure that Googlebot has no trouble crawling your site
Improving your website’s crawlability doesn’t guarantee an increase in your crawl budget. Google determines the most reasonable crawl rate based on the need to crawl data, as described above. However, crawlability issues may prevent Google from crawling your website at the desired level.
How to know if Googlebot sees crawlability issues on your website?
Use the Crawl Stats report to review how Googlebot crawls your website. This report shows when Google encounters crawlability issues on your website. If your website reports errors or crawlability warnings, check the Server Connectivity graph and identify instances where Googlebot requests exceed the red line limit. Click on the graph to see which URLs are experiencing issues and try to find correlations between these URLs and the issues on your website.
Additionally, you can use the URL Inspection Tool to check some URLs on your website. If this tool returns a Server Connectivity Overload warning, it means Googlebot cannot crawl all discovered URLs from your website.
How to solve this problem?
- Block crawl activities on certain pages.
- Increase page loading and rendering speed.
- Boost server performance. If Google has been consistently crawling your website at its maximum crawl budget but some important URLs are still not crawled or updated as desired, you may need to allocate additional server resources so Google can request more pages from your website.
- Check server connectivity over time in the Crawl Stats report to see if Google’s crawl rate regularly exceeds the limit. If so, increase your serving capacity for a month and see whether the number of crawl requests increases over the same period.
Instruct Google to crawl specific high-value content that you think it should pay attention to
Google allocates the necessary time to index all high-quality and valuable content found on your website. If you believe Googlebot is missing important content, the reason could be that it hasn’t found that content, the content is blocked from Google, or the website is restricting Google’s access (or Google is deliberately trying not to overload the website).
How to check whether specific content has been crawled?
Search Console doesn’t provide crawl logs that can be filtered by URL or path, but you can check your website’s server logs to see whether Googlebot has crawled specific URLs. Note that crawling is separate from indexing: Google may crawl a URL without adding it to the index.
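As a rough sketch of what that check looks like, the snippet below scans an Nginx/Apache-style access log (the path and log format are assumptions, adjust them to your setup) and counts Googlebot requests per URL. Bear in mind that user agents can be spoofed, so for a rigorous check you should also verify that the requests really come from Google’s published IP ranges.

import re
from collections import Counter

# Assumed log location and combined log format; adjust for your server.
LOG_PATH = "/var/log/nginx/access.log"
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"$')

googlebot_hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("ua"):
            googlebot_hits[match.group("path")] += 1

# URLs Googlebot requested most often; anything absent from this list
# hasn't been crawled in the period the log covers.
for path, count in googlebot_hits.most_common(20):
    print(count, path)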
How to solve this problem?
- Inform Google about your new pages: update your sitemap to reflect the new URLs.
- Check the rules in your robots.txt file to ensure you’re not accidentally blocking pages.
- Review your crawling priorities (in other words, use your crawl budget sensibly): manage the URLs you expose and improve how efficiently your site can be crawled.
- Check that you still have spare serving capacity. Googlebot will slow its crawling if it detects that your server is struggling to respond to its requests.
In some cases, even if a page has been crawled, it may not appear in search results if it doesn’t provide sufficient value or if users don’t have a need to access the content on that page.
Ensure Google can crawl your site within a reasonable timeframe
If your website has recently been updated, for example you added a new article or changed the title tags on a page, but you’re not seeing these changes in the Google SERPs yet, then it’s likely that the pages have not been crawled by Google yet.
Google tries to crawl and index pages within a reasonable timeframe. For most websites, that means a given page is revisited no sooner than every three days or so. Some sites get different treatment, for example news websites, where content remains fresh and relevant for only a short period of time.
How to know if Google is slow to crawl your site?
Check your website’s logs to see when Googlebot collected data from specific URLs. To find out the indexing date, use the URL Inspection Tool, or use Google Search to find URLs you’ve updated.
What you should do
- Use the <lastmod> tag in the sitemap to indicate when a URL’s content was last updated.
- Use a simple URL structure to help Google find your pages.
- Provide standard <a> links that Google can crawl to help it find your pages (see the example after this list).
- Do not submit the same sitemap multiple times a day with little change.
- Don’t include URLs in the sitemap that you don’t want to appear in search results. This could waste your crawl budget on pages you don’t want Google to index.
- Use a News sitemap if your website contains news content.
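On the point about standard links, the difference looks something like this: the first link below is crawlable because it is an <a> element with a resolvable href, while the second (a hypothetical JavaScript-only pattern) gives Googlebot nothing to follow.

<!-- Crawlable: a standard link with an href Googlebot can follow -->
<a href="/blog/crawl-budget/">Crawl budget guide</a>

<!-- Not reliably crawlable: navigation handled only by JavaScript, no href -->
<span onclick="window.location='/blog/crawl-budget/'">Crawl budget guide</span>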
Take action when Google is doing excessive crawling on your website
Google’s web crawler has algorithms to avoid overloading websites by reducing the speed of data collection when necessary. However, if you notice Googlebot overloading your website, you can take some emergency measures.
How to know if Googlebot is crawling your site too much?
Monitor your server to check if Googlebot is sending too many requests to your website.
What you should do:
- Temporarily return an HTTP 503 or 429 response code to Googlebot requests during server overload situations (a sketch follows this list). Googlebot will retry these URLs after about 2 days. However, be aware that returning an “unavailable” code for more than a few days may cause Google to permanently slow or stop crawling those URLs, so you also need to take the steps below.
- Once the crawl rate has decreased, stop returning HTTP 503 or 429 response codes to crawl requests. Returning these codes for more than 2 days may cause Google to remove those URLs from the index.
- Monitor crawl activity and server capacity over time to ensure the issue doesn’t recur.
- If one of the AdsBot crawlers is causing the problem, you have probably created dynamic Search Ads targets for your website. In this case, Google will crawl them every 3 weeks. If your server can’t handle these crawls, limit your ad targets or increase your serving capacity.
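Here is a minimal sketch of the first two points, assuming a Python/Flask application and a hypothetical server_is_overloaded() health check. It returns a temporary 503 with a Retry-After hint only while the server is actually struggling, and serves pages normally again as soon as the pressure eases.

from flask import Flask, Response, request

app = Flask(__name__)

def server_is_overloaded():
    # Hypothetical health check: plug in your own load or queue-depth metric.
    return False

@app.before_request
def shed_crawler_load():
    user_agent = request.headers.get("User-Agent", "")
    if server_is_overloaded() and "Googlebot" in user_agent:
        # Temporary back-off signal; don't keep serving 503/429 for more than
        # a couple of days, or the affected URLs may drop out of the index.
        return Response("Service temporarily unavailable", status=503,
                        headers={"Retry-After": "3600"})

@app.route("/")
def home():
    return "Normal response when the server is healthy."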
Tips to improve Googlebot’s crawling of your website
Increase page load speed
Google’s crawling capacity is limited by bandwidth, time, and the availability of its various Googlebot instances. If your server responds faster, Google can crawl more of your website. However, Google only wants to crawl high-quality content, so if you only speed up low-quality pages, Googlebot still won’t crawl much more of your site. Conversely, if Google believes it is missing high-quality content, it may raise your crawl limit.
Here’s how you can optimize your pages and resources for crawling:
- Use robots.txt to prevent Googlebot from fetching large but unimportant resources. Make sure you only block unimportant resources, meaning those resources unnecessary for understanding the meaning of the page (such as decorative images).
- Pay attention to long redirect chains, as these negatively impact crawling.
- Server response time and rendering time are both important, including the time to load and execute embedded resources like images and scripts. Watch out for resources that are large or slow to load but still needed for indexing.
Specify changed content with HTTP status codes
Google generally supports the If-Modified-Since and If-None-Match headers on its HTTP crawl requests. Google’s crawlers don’t send these headers on every request; it depends on the use case (for example, AdsBot is more likely to set the If-Modified-Since and If-None-Match headers).
If the crawler sends the If-Modified-Since header, its value is the date and time Google last crawled that content. Based on that value, the server can choose to return an HTTP 304 (Not Modified) status code with no response body, in which case Google will reuse the version of the content it fetched previously. If the content is newer than the date specified in the crawler’s If-Modified-Since header, the server can return an HTTP 200 (OK) status code along with the response body.
In addition to request headers, you can send an HTTP 304 (Not Modified) status code and no response content for any Googlebot requests if the content hasn’t changed since Googlebot last accessed that URL.
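A minimal sketch of that behaviour, again assuming a Python/Flask handler and a hypothetical last-modified timestamp for the page: if Googlebot’s If-Modified-Since date is on or after the content’s last change, the handler answers 304 with no body; otherwise it serves the full page with a Last-Modified header.

from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical: when this page's content last changed.
LAST_MODIFIED = datetime(2024, 5, 1, tzinfo=timezone.utc)

@app.route("/blog/crawl-budget/")
def article():
    header = request.headers.get("If-Modified-Since")
    if header:
        try:
            since = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            since = None
        if since and since.tzinfo and LAST_MODIFIED <= since:
            # Nothing has changed since Googlebot's last crawl:
            # 304 with an empty body lets Google reuse its stored copy.
            return Response(status=304)
    response = Response("<html>... full page HTML ...</html>", mimetype="text/html")
    response.headers["Last-Modified"] = format_datetime(LAST_MODIFIED, usegmt=True)
    return response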
Hide URLs you don’t want to appear in search results
Wasting server resources on unnecessary pages diverts crawl budget away from the pages you consider important and can slow down the discovery of new or recently updated content on your website.
What you should do:
- Use robots.txt if you don’t want Google to crawl a resource or page.
- If multiple pages use the same resource (such as shared images or JavaScript files), reference the resource from the same URL on each page. Then, Google can store the resource in the cache and reuse it without requesting it multiple times.
- Use robots.txt only for pages or resources you don’t want crawled at all over the long term, not as a temporary switch.
- Keep your robots.txt file stable; frequently adding or removing pages or directories is not an effective way to manage your website’s crawl limit.
- Avoid rotating sitemaps or implementing temporary hiding mechanisms to redistribute the crawl budget.
Conclusion
Managing your crawl budget is a crucial aspect of SEO, allowing search engines to efficiently discover and index your most important pages. By implementing the tips identified in this article, you can make the most of your crawl budget, potentially giving you an edge over your competitors in search results.
Remember, the key lies in balancing both crawl rate and crawl demand. And it’s not just about disallowing specific pages or directories in your robots.txt: building a better website overall, one that is responsive, full of high-quality content and offers a positive user experience, can signal to Google that your site deserves a higher crawl budget. This, in turn, can lead to more frequent indexing, improved search results and, ultimately, a stronger online presence – now we’re talking!