There is no benchmark for the crawl budget
A few days ago, answering a user's question on Reddit, John Mueller reiterated that there is no benchmark for Google's crawl budget, and therefore no optimal "reference number" to aim for with our site interventions.
What we can do, in practical terms, is reduce waste on the "useless" pages of our site (those with no ranking keywords and no visits) so as to optimize the attention Google devotes to the content that is truly important to us and can deliver more in terms of traffic.
It is from this perspective that the Content Overview of SEOZoom should be interpreted and used: it catalogues the web pages of the site and groups them according to their performance on search engines, so that we know clearly where to intervene, how to do it and when it is time to delete unnecessary or duplicated pages that are wasting crawl budget.
What crawl budget means to Google
Simply put, the crawl budget is the number of URLs that Googlebot can crawl (depending on the speed of the site) and wants to crawl (depending on demand). Conceptually, it is the balance between Googlebot's effort not to overload the server and Google's overall desire to crawl the domain.
Taking care of this aspect can increase the frequency with which search engine robots visit the site's pages; the higher that frequency, the faster the index will pick up page updates. Optimizing the crawl budget can therefore help keep popular content fresh and prevent older content from becoming stale in the index.
How to enhance crawl budget
One of the most immediate ways to optimize the crawl budget is to limit the number of low-value URLs on the website, which, as mentioned, can take valuable time and resources away from the crawling of a site's most important pages.
Low-value pages include those with duplicate content, soft error pages, faceted navigation and session identifiers, as well as hacked pages, infinite spaces and proxies and, of course, low-quality content and spam. A first task is therefore to check the site for these problems, also reviewing the crawl error reports in Search Console and keeping server errors to a minimum.
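For instance, a few robots.txt rules can keep Googlebot away from faceted-navigation and session-ID URLs. This is only a sketch: the parameter names (sessionid, sort, color) are hypothetical placeholders to adapt to your own URL structure.

```
User-agent: *
# Block crawling of hypothetical faceted-navigation and session parameters
Disallow: /*?*sessionid=
Disallow: /*?*sort=
Disallow: /*?*color=
```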
Why work on this factor
Since, as stated and confirmed by official sources, there is no benchmark or ideal value to strive for, the whole discussion on crawl budget is based on abstractions and theories. Of course, Google is generally slower to crawl all the pages of a small site that is rarely updated or gets little traffic than those of a larger site with many daily changes and a significant amount of organic traffic.
The problem lies in quantifying "often" and "much", and above all in identifying a single number that fits both huge, powerful sites and small blogs; in theory, for instance, a crawl budget of a given value X could be a problem for a major website, while for a blog with a few hundred pages and a few hundred visitors a day it could be the highest level it can reach, difficult to improve.
Prioritizing pages that are relevant to us
For this reason, a serious analysis of this budget must focus on the overall management of the site, trying to improve the crawl frequency of the important pages (those that convert or attract traffic) through different strategies, rather than trying to optimize the overall crawl frequency of the entire site.
Quick tactics to achieve this goal are redirects that steer Googlebot away from less important pages (blocking them from crawling) and internal links that channel more importance toward the pages you want to promote (which, it goes without saying, must offer quality content). If we work well in this direction, also using our SEOZoom tools to verify which URLs are worth focusing resources on, we can increase the frequency of Googlebot's visits to the site, since Google should in theory see more value in sending traffic to pages that it indexes, updates and ranks.
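As a purely illustrative sketch (the URL and anchor text are invented), an internal link that channels importance from a popular article to a page we want to promote could look like this:

```html
<!-- Inside a high-traffic article: a descriptive internal link to a priority page -->
<p>
  For a step-by-step walkthrough, see our
  <a href="/guides/crawl-budget-optimization/">guide to crawl budget optimization</a>.
</p>
```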
Possible optimization interventions on the site
In addition to those described above, there are also some specific interventions that can help manage the site's crawl budget better: nothing particularly "new", since they are well-known signals of a website's health.
The first piece of advice is almost trivial: allow the crawling of the site's important pages in the robots.txt file, a simple but decisive step to keep crawled and blocked resources under control. Likewise, it is good to take care of the XML sitemap, to give the robots a simpler and faster way to understand where internal links lead; just remember to use only canonical URLs in the sitemap and to keep it updated and referenced in the most recent version of robots.txt.
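By way of example, here is a minimal sitemap sketch (example.com and its URLs are placeholders) listing only canonical URLs, together with the robots.txt line that points crawlers to it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- List only canonical URLs and keep <lastmod> aligned with real updates -->
  <url>
    <loc>https://www.example.com/guides/crawl-budget-optimization/</loc>
    <lastmod>2021-03-15</lastmod>
  </url>
</urlset>
```

```
# In robots.txt, point crawlers to the sitemap
Sitemap: https://www.example.com/sitemap.xml
```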
It is then advisable to check for, or avoid altogether, redirect chains, which force Googlebot to crawl multiple URLs: with an excessive number of redirects, the search engine's crawler may stop crawling before reaching the page it needs to index. While 301s and 302s should simply be limited, other HTTP status codes are even more harmful: pages returning 404 or 410 technically consume crawl budget and, on top of that, damage the site's user experience. No less annoying are 5xx server errors, which is why it is good to run a periodic analysis and health check of the site, perhaps using our SEO spider!
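To give an idea, flattening a chain means pointing every old URL straight at the final destination instead of hopping through intermediate redirects. The snippet below is a sketch in Apache syntax with hypothetical paths:

```apache
# Before: /old-page -> /interim-page -> /new-page (two hops for Googlebot)
# After: every legacy URL redirects directly to the final destination
Redirect 301 /old-page /new-page
Redirect 301 /interim-page /new-page
```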
Another consideration concerns URL parameters, because crawlers count separate URLs as separate pages, which wastes a valuable part of the budget and also risks raising doubts about duplicate content. In the case of multilingual sites, we must then make the best possible use of the hreflang tag, informing Google as clearly as possible about the geolocalized versions of the pages, both with the HTTP header (or the link elements in the HTML head) and with the annotations placed next to the <loc> element of a given URL in the sitemap.
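For example, hreflang annotations can live directly in the sitemap, next to the <loc> element of each URL. In this sketch example.com and its paths are placeholders, and each <url> entry lists all of its alternates, including itself:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en/page/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page/"/>
    <xhtml:link rel="alternate" hreflang="it" href="https://www.example.com/it/pagina/"/>
  </url>
  <url>
    <loc>https://www.example.com/it/pagina/</loc>
    <xhtml:link rel="alternate" hreflang="it" href="https://www.example.com/it/pagina/"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page/"/>
  </url>
</urlset>
```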
A basic choice to improve crawling and simplify Googlebot's interpretation could be to always prefer HTML over other languages: even if Google is learning to handle JavaScript more effectively (and there are many techniques for the SEO optimization of JavaScript), good old HTML remains the code that gives us the most guarantees.