Crawl budget is the central theme of the second episode of SEO Mythbusting season 2 (recorded before the Covid-19 emergency, to be precise), in which Google Developer Advocate Martin Splitt tries to dispel myths and clear up frequent doubts about SEO topics. Here are the most interesting insights on optimizing the crawl budget that the search engine dedicates to each site!
Understanding the crawl budget and recognizing false myths
The host of the episode is Alexis Sanders, Senior Account Manager at the marketing agency Merkle, who asks the Googler to focus on a topic that many of her clients struggle to understand, the crawl budget, starting with definitions and management tips.
Martin Splitt begins by explaining that “when we talk about Google Search, indexing and crawling, we have to make a sort of compromise: Google wants to crawl the maximum amount of information in the shortest possible time, without overloading the servers”; in other words, it has to find the crawl limit or crawl rate (already covered in this article).
The definition of crawl rate
To be precise, the crawl rate is defined as the maximum number of parallel requests that Googlebot can make simultaneously without overloading a server; in practice, it indicates the maximum amount of stress Google can put on a server without crashing anything or causing problems through this effort.
Google also needs to monitor its own resources
But Google must pay attention not only to other sites’ resources but also to its own, because “the Web is huge and we cannot crawl everything all the time, so we have to make some distinctions,” explains Splitt.
For example, he continues, “a news site probably changes quite often, so we probably have to keep up with it at a high frequency; by contrast, a site about the history of kimchi will probably not change as often, because history does not move at the same rapid pace as the news industry”.
What is crawl demand
This explains what crawl demand is, that is, the frequency with which Googlebot crawls a site (or, more precisely, a type of site) based on how likely it is to be updated. As Splitt says in the video, “we try to understand whether we should come back more often or whether we can just check in from time to time”.
This decision is based on the profile Google builds of the site at the first crawl, when it essentially takes a “fingerprint” of its content, identifying the topic of the page (which will also be used for deduplication at a later stage) and analyzing the date of the last change.
The site can communicate these dates to Googlebot – for example, through structured data or other time elements on the page – and Google “more or less keeps track of the frequency and type of changes: if we detect that the frequency of changes is very low, then we will not crawl the site particularly often”.
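As a minimal sketch of this kind of hint (the headline and dates below are hypothetical placeholders, not taken from the episode), a page could expose its last-change date through a schema.org Article block with datePublished and dateModified:

```typescript
// Hedged sketch: building a schema.org Article JSON-LD block that exposes dateModified.
const articleJsonLd = {
  "@context": "https://schema.org",
  "@type": "Article",
  headline: "A short history of kimchi", // hypothetical page
  datePublished: "2023-11-02",
  dateModified: "2024-05-01",
};

// Serialized into the page <head>, this gives crawlers a machine-readable last-change date.
const scriptTag =
  `<script type="application/ld+json">${JSON.stringify(articleJsonLd)}</script>`;
console.log(scriptTag);
```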
Crawl frequency does not reflect content quality
The Googler’s next clarification is very interesting: this crawl rate has nothing to do with quality, he says. “You can have fantastic content that fits perfectly and never changes”; the question in this case is simply “whether Google needs to crawl the site frequently or whether it can rather leave it alone for a certain period of time”.
To help Google answer this question correctly, webmasters have several tools at their disposal for giving it “hints”: in addition to the aforementioned structured data, you can use ETags, HTTP headers such as Last-Modified, or the last-change date in the sitemap. However, it is important that these updates are genuine: “If you only update the date in the sitemap and we realize that there is no real change on the site, or that the changes are minimal, you are not helping” Google identify the likely frequency of changes.
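On the HTTP side, a minimal sketch of such hints (using Node’s built-in node:http module; the date, ETag value and port are hypothetical) could send Last-Modified and an ETag and answer conditional requests with 304 Not Modified when nothing has changed:

```typescript
import { createServer } from "node:http";

// Hypothetical last-change date for the page served below.
const lastModified = new Date("2024-05-01T08:00:00Z");

createServer((req, res) => {
  const since = req.headers["if-modified-since"];
  // If the requester (e.g. a crawler) already has this version, answer 304 with no body.
  if (since && new Date(since) >= lastModified) {
    res.writeHead(304);
    res.end();
    return;
  }
  res.writeHead(200, {
    "Content-Type": "text/html",
    "Last-Modified": lastModified.toUTCString(),
    // Any stable identifier for this version of the content works as an ETag.
    ETag: '"content-version-3"',
  });
  res.end("<html><body>...page body...</body></html>");
}).listen(8080);
```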
When does the crawl budget become an issue?
According to Splitt, the crawl budget is a topic that should only concern or worry huge sites, “let’s say with millions of urls and pages”, unless you “have a shoddy and unreliable server”. But in that case, “rather than on crawl budget, you should focus on server optimization,” he continues.
Usually, according to the Googler, crawl budget is talked about when there is no real problem behind it; for him, genuine crawl budget issues arise when a site notices that Google discovers, but for a long time does not crawl, the pages its owner cares about, even though those pages have no problems or errors whatsoever.
In most cases, however, Google decides to crawl but not index pages because “it is not worth it, given the poor quality of the content”. Typically, these URLs are marked as “Excluded” in the Index coverage report of Google Search Console, the video clarifies.
Crawl frequency is not a quality signal
Having a site crawled frequently does not necessarily help it in Google’s eyes, because crawl frequency is not a quality signal, explains Splitt: for the search engine it is perfectly fine to “have something crawled, indexed and that doesn’t change anymore”, requiring no further visits from the bot.
More targeted advice comes for e-commerce sites: if there are many small, very similar pages with similar content, you should question their usefulness and ask whether they even need to exist, or whether it would make sense to extend the content and make it better. For instance, if they are merely products that differ by one small feature, they could be grouped into a single page with a descriptive text covering all the possible variations (instead of having ten small pages, one for each possibility), as in the sketch below.
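As a hedged illustration of that grouping idea (the product data, URLs and the use of a canonical link are assumptions for the example, not something Splitt prescribes), the variant pages could all point to one combined page that describes every option:

```typescript
// Hypothetical product variants that differ only by one small feature.
type Variant = { sku: string; color: string; url: string };

const variants: Variant[] = [
  { sku: "TSHIRT-RED-M", color: "red", url: "/t-shirt?color=red" },
  { sku: "TSHIRT-BLUE-M", color: "blue", url: "/t-shirt?color=blue" },
  { sku: "TSHIRT-GREEN-M", color: "green", url: "/t-shirt?color=green" },
];

// One combined page instead of many thin ones.
const canonicalUrl = "/t-shirt";

// Each variant URL declares the combined page as canonical...
const canonicalTag = `<link rel="canonical" href="${canonicalUrl}">`;

// ...and the combined page's copy mentions every variation explicitly.
const description = `Available in ${variants.map((v) => v.color).join(", ")}.`;

console.log(canonicalTag);
console.log(description);
```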
Google’s weight on servers
The crawl budget is therefore connected to a number of issues: among those already mentioned are page duplication and page discovery, but server speed is a sensitive matter as well. If the site relies on a server that occasionally collapses, Google may struggle to understand whether this happens because of the server’s poor characteristics or because of an overload caused by its own requests.
Speaking of managing server resources, Splitt also explains how the bot’s activity works (and therefore what appears in the log files) during the early stages of Google’s discovery of a site or when, for example, you perform a server migration: initially there is an increase in crawling activity, followed by a slight reduction, which then continues in waves. Often, however, a change of servers does not require a new discovery by Google (unless you switch from something broken to something that actually works!), and in that case the bot’s crawling activity remains as stable as it was before the switch.
Crawl budget and migrations: Google’s suggestions
The management of Googlebot’s crawling activity during a full site migration is also rather delicate: the advice in the video for those in the middle of such a situation is to progressively update the sitemap and report to Google what is changing, so as to inform the search engine that there are meaningful changes to follow and verify.
This strategy gives webmasters a little control over how Google discovers changes during a migration, although in principle you can also simply wait for the operation to complete.
What matters is to make sure that both servers are functioning and working regularly, without sudden momentary collapses or error status codes; it is also important to set up the redirects correctly and to verify that no significant resources on the new site are blocked by a robots.txt file that was not properly updated after the migration.
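A minimal sketch of such a post-migration spot check (assuming Node 18+ with the global fetch API; the domains and paths are placeholders) could look like this:

```typescript
// Hedged sketch: old URLs should redirect to the new host, and the new site's
// robots.txt should not accidentally block everything.
async function checkMigration(): Promise<void> {
  const oldUrls = [
    "https://old.example.com/",
    "https://old.example.com/guides/kimchi-history",
  ];

  for (const url of oldUrls) {
    // redirect: "manual" lets us inspect the raw status code and Location header.
    const res = await fetch(url, { redirect: "manual" });
    const location = res.headers.get("location") ?? "(none)";
    console.log(`${url} -> ${res.status} ${location}`); // expect a 301/308 to the new domain
  }

  const robots = await (await fetch("https://www.example.com/robots.txt")).text();
  // A leftover "Disallow: /" from a staging setup would block the whole new site.
  if (/^\s*Disallow:\s*\/\s*$/m.test(robots)) {
    console.warn("robots.txt on the new site blocks everything - review it before go-live");
  }
}

checkMigration().catch(console.error);
```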
How the crawl budget works
A question from Alexis Sanders brings us back to the central theme of the episode and allows Martin Splitt to explain how the crawl budget works and at what level of the site it applies: usually, Google operates at the site level, so it considers everything on the same domain. Subdomains are sometimes included in the crawl and sometimes excluded, while for CDNs “there is nothing to worry about”.
Another practical suggestion from the Googler concerns the management of user-generated content, or more generally the anxiety about the number of pages on a site: “You can tell us not to index or not to crawl content that is of low quality”, Splitt reminds us, so for him crawl budget optimization is “something that concerns the content side more than the technical infrastructure side” (an approach perfectly in line with that of our SEOZoom tools!).
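As a hedged example of telling Google not to index low-quality pages (the route check below is a hypothetical placeholder; the X-Robots-Tag header itself is a standard mechanism), a server could mark thin user-generated pages as noindex:

```typescript
import { createServer } from "node:http";

// Hypothetical, site-specific test for thin user-generated pages
// (e.g. empty profiles or one-line comment pages).
function isThinUserContent(path: string): boolean {
  return path.startsWith("/profile/");
}

createServer((req, res) => {
  const headers: Record<string, string> = { "Content-Type": "text/html" };
  if (isThinUserContent(req.url ?? "/")) {
    // Keeps the page available to users but asks search engines not to index it.
    headers["X-Robots-Tag"] = "noindex";
  }
  res.writeHead(200, headers);
  res.end("<html><body>...page body...</body></html>");
}).listen(8081);
```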
Do not waste Google’s resources and time
Ultimately, working on the crawl budget also means not wasting Google’s time and resources, including through a number of very technical measures such as cache management.
In particular, Splitt explains that Google “tries to be as aggressive as possible when caching sub-resources, such as CSS, JavaScript, API calls, all that sort of thing”. If a site has “API calls that are not GET requests, then we can’t cache them, and so you have to be very careful about making POST requests or something similar, which some APIs do by default”, because Google cannot “cache them and this will consume the site’s crawl budget faster”.
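To illustrate the point (the endpoint and payload here are hypothetical), the same read-only lookup is far more cache-friendly when a page’s JavaScript expresses it as a GET request instead of a POST:

```typescript
// Hedged sketch: two ways a page's script might fetch the same read-only data.
async function loadProducts(): Promise<void> {
  // Hard to cache: a POST request, even though it only reads data.
  const viaPost = await fetch("https://www.example.com/api/products", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ category: "kimchi" }),
  });

  // Cache-friendly: the same lookup as a GET with query parameters,
  // which crawling and rendering infrastructure can cache and reuse.
  const viaGet = await fetch("https://www.example.com/api/products?category=kimchi");

  console.log(viaPost.status, viaGet.status);
}

loadProducts().catch(console.error);
```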