Crawling: what it is, how it works, and why scanning is needed

It all starts here. Crawling is how search engines like Google discover, explore, and organize content on the web, navigating billions of connected pages and the endless hyperlinks that form this complex and dynamic network. Without crawling there would be no ranking and, clearly, no SEO, because web pages would remain invisible to search engines and could not be found by potential visitors. In short, the functioning of search (and our work to gain online visibility) rests on the visits made by bots: understanding how crawling works is therefore essential to understanding how search engines index sites and assign them a position in search results.

What crawling is for search engines

Crawling is the process by which search engines send automated programs across the Web to find new and updated content, which is then added to the search engines' indexes.

More specifically, in IT, crawling defines the entire process of accessing a Web site and retrieving its data through a computer program or software.

These programs, often called crawlers, bots or spiders, navigate through the various links on a web page, just as a human visitor would, but with the specific goal of scanning the content of the pages themselves.

The type of content varies (a web page, an image, a video, a PDF, and so on), but regardless of format, content is discovered through links, whether on already known pages or in sitemaps that a site provides directly.

This crawling is not limited to simply reading text: web crawlers identify and analyze every element on the page, from titles to images, from text to hyperlinks, down to the underlying metadata and HTML code. The information gathered during crawling is then stored and organized in giant databases known as “indexes,” which are updated continuously and are critical to enabling search engines to return relevant and timely results to users performing a search.

In concrete terms, crawling is thus the first step in a larger process that allows search engines to “figure out” what pages exist on the Web, categorize them, and determine their relevance. Without proper crawling, a site risks remaining ignored by search engines and, consequently, invisible to users. Therefore, optimizing pages to facilitate the work of crawlers becomes a crucial aspect of SEO.

What does crawling mean

This complex activity is called crawling.

The word comes from the verb to crawl and evokes the idea of gradual but steady movement, just as a spider or an infant barely able to move independently does.

The reference to the spider is by no means coincidental and harks back to the concept of the early Web: Tim Berners-Lee, the inventor of the World Wide Web, used the term “web” to emphasize how it is composed of a vast and interconnected set of documents readable through browsers. Just as a spider web catches everything it encounters on its threads, the Web is a structure in which pages of different origins are linked by hyperlinks (links).

Returning to crawling, we can then understand why the automated bots that do the crawling are called crawlers or, often, spiders: just like arachnids, they follow the strands of links that make up the Web, automatically searching for new Web pages or updating known ones on behalf of the search engine, so as to gather content and define the best paths to map the Web.

The addition of synonyms such as crawler or bot reinforces this idea, further describing the nature of exploration: “crawling” and “scanning” suggest a continuous, precise, and almost imperceptible action that scours the Web without leaving any corners unexplored. Crawlers thus proceed through the web following links from one page to another, “crawling” through online content to gather and catalog information useful for indexing, performing operations in a meticulous and systematic way.

What are crawlers and what do they do

A crawler is thus an automated program that performs the important task of exploring the Web. Each crawler or spider is tasked with navigating through links, one site at a time, and automatically “scanning” the pages it encounters, allowing search engines to detect new content, updates or changes to existing web pages, and store them in their indexes.

The operation of a crawler can be simplified by imagining it as a digital explorer traversing vast territories of web pages, following each link and meticulously mapping what it finds along the way.

However, not all crawlers are created equal. Some, such as the well-known Googlebot, are specifically designed to perform this process efficiently on a global scale. Googlebot is, for most websites, the main crawler: its activity is continuous and is intended to constantly update Google’s index with the new information it finds.

A key aspect of the crawler’s role is its ability to follow the “trails” left by links on web pages. These hyperlinks guide the crawler to new pages, extending its mapping and enriching the index with new and relevant content. Without an effective crawler, a search engine’s entire ecosystem would be at risk of collapsing, as it would not be able to offer up-to-date and useful results to users. Therefore, understanding and optimizing your online infrastructure for the crawler is essential for improving SEO and maintaining a strong and visible presence on search engines.

What is the purpose of the crawler? Purpose and functions

The crawler is the essential pillar of a search engine’s operation. Its main mission is to systematically explore the web, scanning every page, link, and resource it can reach. This process is not random, but extremely organized: crawlers follow an algorithm that determines which pages to visit, what content to explore, and how much time to devote to each resource, all with the aim of keeping search engine indexes up to date.

An index is, in short, a colossal database that collects all the relevant information from the pages being crawled. When we perform a Google search, for example, we are not searching the web directly, but we are querying this very index, which has been created and updated through the relentless work of crawlers. This is why the speed and quality of crawling are crucial: without an up-to-date index, the search engine would not be able to provide relevant and timely results.

A key aspect of the crawler’s role also involves assessing the value of a page. During the crawling process, the bot collects not only textual content but also signals that indicate the quality and relevance of a resource. Elements such as the presence of quality hyperlinks, page loading speed, HTML tag structure, and semantic consistency of content are all factors that a crawler considers. These signals help define how much a resource should be valued within the index and, consequently, what ranking it can achieve in search results.

SEO crawling: how crawling affects search engines

Crawling and SEO are linked by a mutually influential relationship, where the quality and effectiveness of web page crawling can largely determine the success or failure of an optimization strategy. But how does this process take place? And what is its direct impact on SEO?

When crawlers crawl a website, they simultaneously evaluate a number of technical and content aspects. This is where the importance of having well-structured content and an SEO-oriented architecture comes into play. A site that has a clear structure, with consistent use of headings and a well-defined semantic hierarchy, makes the crawler’s job easier, allowing it to quickly understand what the main content is and how its parts relate to each other. This not only improves crawling but also indexing, as a well-organized site offers more intuitive navigation, leading the crawler through the desired paths.

Content itself plays a crucial role. If a site’s pages offer relevant keywords, original text, and real value to the user, the likelihood increases that the crawler will identify those pages as important and worthy of ranking high in search results. However, content is not the only element to consider. Site performance, such as loading speed and responsiveness on different devices, is an additional factor that can positively influence crawling and, consequently, search engine rankings.

Another aspect to consider is the conscious management of the crawl budget – that is, the number of pages a crawler is willing to crawl in a given period of time: if a website wastes the crawl budget on irrelevant pages or low-quality content, it risks penalizing the visibility of its most important pages. Ensuring that the crawler focuses on the most valuable content is therefore a key step in optimizing SEO.

Why the crawl budget is important

The crawl budget represents the amount of resources Googlebot devotes to exploring the pages of a website in a given time period. This concept implies that there is a maximum limit of pages that the bot will be able to crawl during a visit, which makes it crucial to administer it carefully.

The crawl budget is influenced by several factors, including page popularity and site health. Pages that receive numerous incoming links, are frequently updated, and attract many visitors tend to get more attention from crawlers. In parallel, sites that respond quickly to bot requests and have no technical errors facilitate exploration and can effectively extend their crawl budget.

To maximize the effectiveness of the crawl budget there are several tools and strategies that can be adopted. It is important to make sure that crawlers focus their efforts on the most relevant pages on the site, thus preventing valuable resources from being consumed and wasted on duplicate content, pages with poor quality or unused content. Using the robots.txt file to exclude non-crucial sections is a useful practice to optimize the use of the crawl budget. In addition, improving page loading speed not only improves user experience, but also allows bots to explore more pages in less time.
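As a minimal sketch of the robots.txt practice described above, the snippet below parses a small set of rules with Python's standard `urllib.robotparser` and checks which URLs a crawler would be allowed to fetch. The rules, the paths, and the `example.com` URLs are hypothetical and serve only as illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the excluded sections are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/blog/post",
    "https://example.com/cart/checkout",
    "https://example.com/search?q=shoes",
):
    print(url, "->", "allowed" if is_crawlable(url) else "blocked")
```

Checking rules this way before publishing them can help confirm that only the non-crucial sections are excluded and that important pages remain reachable.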

Crawling: what it is and how it works for Google

Looking specifically at Google, crawling is the search engine’s way of figuring out what pages exist on the Web: there is no central registry of all web pages, so Google must constantly search for new and updated pages to add them to its list of known pages.

 

Spiders and crawling - from Moz

The crawling process begins with a list of URLs from previous crawls and sitemaps provided by site owners. Google uses web crawlers, specifically Googlebot (the name of its program, which runs on a huge number of computers and crawls billions of pages on the web), to visit these addresses, read the information they contain, and follow the links on those pages.

The crawlers will revisit the pages already in the list to see if they have been changed and will also scan the newly detected pages. During this process, crawlers have to make important decisions, such as prioritizing when and what to crawl, making sure that the website can handle the server requests made by Google.

More specifically, in the crawling phase, Googlebot retrieves some publicly accessible Web pages, then follows the links there to find new URLs; by jumping along this path of links, the crawler is able to find new content and add it to the Index, which we know is a huge database of discovered URLs, from which (but here we are already at the later stages of Search) they are later retrieved when a user searches for information to which the content of that URL provides a relevant answer.

Scanning is also called “URL Discovery,” indicating precisely how Google discovers new information to add to its catalog. Usually, the way Google finds a new Web site is by following links from one Web site to another, as mentioned: just as we users do when we explore content on the Web, crawlers go from page to page and store information about what they find on those pages and other publicly accessible content, which ends up in the Google Search index.

Some pages are known because Google has already visited them, other pages are discovered when Googlebot follows a link back to them (e.g., a hub page, such as a category page, links to a new blog post), and still others are discovered when we send Google a sitemap for crawling.

Either way, when Googlebot finds the URL of a page it may visit or “crawl” the page to discover its contents. It is important to understand, in fact, that Googlebot does not crawl all the pages it has detected, partly because some pages may not be authorized for crawling by the site owner, while others may not be accessible without being logged into the site.

During the crawl, Google renders the page and executes any JavaScript it finds using a recent version of Chrome, much as a common browser does when displaying a page we visit. Rendering is important because websites often rely on JavaScript to display content on the page, and without rendering Google may not see this content, as the official guide explains.

Crawling for Google: frequency, speed and budget

Googlebot uses an algorithmic process to determine which sites to crawl, how often to do so, and how many pages to retrieve from each site. Google’s crawlers are also programmed to try not to crawl the site too quickly to avoid overloading it. This mechanism is based on the site’s responses (for example, HTTP 500 errors mean “slow down”) and settings in Search Console.

Successfully crawled pages are processed and forwarded to Google indexing to prepare the content for publication in search results; the search engine’s systems view the page content as the browser would and take note of key signals, from keywords to website updates, storing all this information in the search index.

Because the Web and other content are constantly changing, Google’s crawling processes run continuously to keep up, learning how often content that has already been examined changes and crawling it again as necessary, and also discovering new content as new links to those pages or information appear.

As the reference guide also makes clear, Google never accepts payment for crawling a site more frequently, true to its promise to provide the same tools to all websites to ensure the best possible results for users.

In addition, Google is very careful not to overload the servers of the sites it crawls, so the frequency of crawls depends on three factors:

  • Crawl rate or crawl speed: maximum number of simultaneous connections a crawler can use to crawl a site.
  • Crawl demand: how much content is desired by Google.
  • Crawl budget: number of URLs that Google can and wants to crawl.

There are also three common problems that can prevent or block Googlebot from accessing and crawling a site:

  • Problems with the server running the site
  • Network problems
  • Rules in the robots.txt file that prevent page access by Googlebot

As we will see in more detail, the set of tools in Search Console can serve “content authors to help us crawl their content better,” as the official documentation puts it, alongside established standards such as sitemaps or the robots.txt file, which let site owners indicate how often Googlebot should visit their content or whether it should not be included in the search index.

The importance of crawling for Google and for sites

To better understand the weight this activity has for Google, and thus for SEO, we can think of the analogy proposed by Lizzy Harvey on web.dev: crawling is “like reading all the books in a library.” Before search engines can serve up any search results, they must get as much information from the web as possible, and so they use the crawler, a program that travels from site to site and acts like a browser.

This check includes the HTML and any content mentioned in the HTML, such as images, videos, or JavaScript. Crawlers also extract links from HTML documents so that the crawler can also visit linked URLs, again with the goal of finding new pages on the Web.
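To illustrate the link extraction step just described, here is a minimal sketch built on Python's standard `html.parser` module. It is not how Googlebot is implemented; the class name `LinkExtractor`, the helper `extract_links`, and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs found in <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Parse an HTML document and return the links it contains, in order."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


sample = '<a href="/a">A</a><a href="https://other.example/b">B</a><a>no href</a>'
print(extract_links(sample, "https://example.com/"))
```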

Technically speaking, crawlers do not actively click on links or buttons, but send URLs to a queue to be crawled at a later time. When a new URL is accessed, no cookies, service workers, or local storage (such as IndexedDB) are available.
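The queue-based behavior described above can be sketched as a breadth-first traversal: discovered URLs are appended to a queue (often called a frontier) and visited later, instead of being "clicked" immediately. The in-memory `SITE` graph below is a hypothetical stand-in for real pages, so the example runs without any network access.

```python
from collections import deque

# Hypothetical site: each path maps to the links found on that page.
SITE = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/post-1", "/"],
    "/about": [],
    "/blog/post-1": ["/blog"],
}


def crawl(seeds, fetch_links):
    """Visit URLs in discovery order, queueing each new link exactly once."""
    frontier = deque(seeds)  # URLs waiting to be crawled
    seen = set(seeds)        # URLs already queued, to avoid loops
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl(["/"], lambda url: SITE.get(url, []))
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as `/blog` and `/` do here.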

The crawlers attempt to retrieve each URL to determine the status of the document. If a book in a library is missing or damaged, no one can read it; in the same way, if a document returns an error status code, the bots cannot use any of its contents, although they may retry the URL at a later time. This ensures that only publicly accessible documents enter the index. If the crawlers encounter a redirect status code, such as 301 or 302, they follow the redirect to the new URL and continue there. When they get a positive response, and have therefore found a user-accessible document, they check whether they are allowed to crawl it and then download the content.

Returning to the previous definitions, crawl rate represents the maximum number of simultaneous connections a crawler can use to crawl a site. Crawl demand, on the other hand, depends on “how much content is desired by Google” and is “influenced by URLs that have not been crawled by Google before, and Google’s estimate of how often content changes on known URLs.”
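The decision flow in this paragraph can be summarized in a few lines of code. This is only a simplified sketch: the function name `next_step` and the returned labels are hypothetical, and treating 307 and 308 like 301 and 302 is an assumption that goes beyond the two codes named in the text.

```python
# 301 and 302 come from the text; 307 and 308 are added here as an assumption.
REDIRECTS = {301, 302, 307, 308}


def next_step(status: int) -> str:
    """Map an HTTP status code to the crawler's next step (simplified)."""
    if status in REDIRECTS:
        return "follow_redirect"
    if 200 <= status < 300:
        return "check_crawl_rules_then_download"
    # Error responses: the content is unusable, but the URL may be retried.
    return "skip_and_maybe_retry_later"


for code in (200, 301, 404, 503):
    print(code, "->", next_step(code))
```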

Google calculates a site’s crawl rate periodically, based on the site’s responsiveness or, in other words, the share of crawling traffic it can actually handle: if the site is fast and consistent in responding to crawlers, the rate goes up if there is demand for indexing; if, on the other hand, the site slows down or responds with server errors, the rate goes down and Google crawls less.
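The feedback loop described here (a faster, healthier site earns a higher rate; slowness or server errors lower it) can be illustrated with a toy rule. This is not Google's actual algorithm: the function `adjust_rate`, the halving step, and every threshold below are invented purely for illustration.

```python
def adjust_rate(current: int, response_ms: float, status: int, *,
                min_rate: int = 1, max_rate: int = 10,
                slow_ms: float = 1000.0) -> int:
    """Return the new number of parallel connections for the next period.

    Hypothetical rule: back off sharply on server errors, 429, or slow
    responses; otherwise increase gently up to max_rate.
    """
    overloaded = status >= 500 or status == 429 or response_ms > slow_ms
    if overloaded:
        return max(min_rate, current // 2)
    return min(max_rate, current + 1)


print(adjust_rate(4, 200, 200))  # healthy response: rate goes up
print(adjust_rate(4, 200, 503))  # server error: rate goes down
```

Backing off quickly and recovering slowly is a common design choice for polite clients, because it protects a struggling server before testing its capacity again.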

When Googlebot can crawl a site efficiently, it enables a site to quickly get new content indexed in search results and helps Google discover changes to existing content.

How to handle Google scans on a site

Talking about crawling also means addressing a topic that has become increasingly popular in recent years and that often troubles SEOs and those who work on sites, namely the crawl budget, which we have already defined as the amount of time (expressed as a number of URLs) that Googlebot can and will devote to crawling a site; in other words, the combination of crawl rate and crawl demand.

To guide us through the analysis of how Google’s crawling mechanism works, we can refer to an episode of the Google Search Console Training series presented, as on previous occasions, by Search Advocate Daniel Waisberg, who gives a quick but comprehensive overview of how Google crawls pages and then dwells on Search Console’s Crawl Statistics report, which lets us check Googlebot’s ability to crawl a given site and provides data on crawl requests, average response time, and more.

As a disclaimer, the Googler explains that such topics are more relevant for those working on a large website, while those with a project with a few thousand pages need not worry too much about them (although, he says, “it never hurts to learn something new, and who knows, your site might become the next big thing”).

How to reduce Googlebot crawl speed the right way

In the rare cases when Google’s crawlers overload servers, you can set a limit on crawl speed using settings in Search Console or other on-site interventions.

As a recent official Google page makes clear, to reduce Googlebot crawl speed we can essentially:

  • Use Search Console to temporarily reduce the crawl speed.
  • Return an HTTP status code 500, 503 or 429 to Googlebot when it crawls too fast.

A 4xx code identifies a client error: the server returns a signal indicating that the client request was wrong in some sense and for some reason. In most cases, errors in this category are rather benign, Google says, such as “not found,” “forbidden,” or “I’m a teapot” (status 418, born as an April Fools’ joke and the subject of one of Google’s most famous Easter eggs), because they do not suggest that something is wrong with the server itself.

The only exception is 429, which stands for “too many requests”: this error is a clear signal to any well-behaved robot, including Googlebot, that it must slow down because it is overloading the server.

However, again with the exception of 429, no 4xx error is suitable for rate limiting Googlebot, precisely because these codes do not suggest a problem with the server: they do not say that it is overloaded, nor that it has encountered a critical error and cannot respond to the request. They simply mean that the client request was bad or wrong in some way. There is no sensible way to associate, for example, a 404 error with server overload: an influx of 404s could simply result from a user accidentally linking to the wrong pages on the site, which is no reason for Googlebot to slow its crawling. The same is true for status codes 403, 410, and 418.

Then there is another aspect to consider: all 4xx HTTP status codes (again, except 429) will cause content to be removed from Google Search. Even worse, serving the robots.txt file with a 4xx HTTP status code makes it practically useless, because it will be treated as if it did not exist: all the rules it sets, including the directives on areas that should not be crawled, are ignored, and those areas become open to crawling.

Ultimately, then, Google strongly urges us not to use 404 and other 4xx client errors to reduce Googlebot’s crawling frequency, even though this seems to be a trending strategy among website owners and some content delivery networks (CDNs).
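As a rough sketch of the recommended alternative, the function below decides what a server could answer when it is overloaded: a 503 with a `Retry-After` header rather than a 404 or another 4xx code. The function name, the capacity check, and the 120-second value are hypothetical; `Retry-After` is a standard HTTP header, but how a given crawler reacts to it is an assumption here.

```python
def throttle_response(active_requests: int, capacity: int,
                      retry_after_seconds: int = 120):
    """Return (status, headers) for an incoming crawler request.

    When the server is at capacity, answer 503 instead of any 4xx code,
    so the response signals temporary overload rather than a bad request.
    """
    if active_requests >= capacity:
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}


print(throttle_response(active_requests=80, capacity=50))
print(throttle_response(active_requests=10, capacity=50))
```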

What Google’s Crawl Stats report is and how to use it

In this regard, it is far more effective to learn how to use the dedicated tool in Google Search Console, the Crawl Stats report, which shows how often Google crawls the site and what the responses were, and provides statistics on Google’s crawling behavior that help us understand and optimize the crawling process.

The most recent version of this tool was released in late 2020 (as also announced on Google Search News in November 2020) and provides data that answers questions such as:

  • What is the overall availability of the site?
  • What is the average page response for a crawl request?
  • How many requests has Google made to the site in the last 90 days?

The Crawl Statistics report is available only for properties at the root directory level: site owners can find it by logging into Search Console and going to the “Settings” page.

When the report opens, a summary page appears, which includes a crawl trends graph, details on host status, and a detailed analysis of the crawl request.

The graph on scanning trends

Specifically, the scan trends graph reports information on three metrics:

  • Total scan requests for site URLs (successful or unsuccessful). Requests for resources hosted outside the site are not counted, so if images are served on another domain (such as a CDN network) they will not appear here.
  • Total size of downloads from the site during scanning. Page resources that are used by multiple pages and that Google has cached are only requested the first time (when they are cached).
  • Average page response time for a crawl request to retrieve page content. This metric does not include retrieval of page resources such as scripts, images, and other linked or embedded content, and it does not take into account page rendering time.

When analyzing this data, Waisberg recommends looking for “major spikes, dips, and trends over time”: for example, if you notice a significant drop in total crawl requests, you should make sure that no one has added a new robots.txt file to the site; if the site responds slowly to Googlebot it could be a sign that the server cannot handle all the requests, just as a steady increase in average response time is another “indicator that the servers may not be handling all the load,” although it may not immediately affect crawl speed as much as it does the user experience.
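The advice to look for spikes and dips can be turned into a simple check on exported data. The sketch below compares each day's crawl requests with the average of the preceding week; the function `find_anomalies`, the 7-day window, the 50% threshold, and the sample numbers are all assumptions for illustration, not values from Search Console.

```python
from statistics import mean


def find_anomalies(daily_requests, window=7, threshold=0.5):
    """Flag days whose value deviates from the trailing average.

    Returns (day_index, "spike" | "dip") for each day that differs from
    the mean of the previous `window` days by at least `threshold`
    (0.5 means 50%).
    """
    anomalies = []
    for i in range(window, len(daily_requests)):
        baseline = mean(daily_requests[i - window:i])
        if baseline == 0:
            continue  # no meaningful relative change to compute
        change = (daily_requests[i] - baseline) / baseline
        if abs(change) >= threshold:
            anomalies.append((i, "spike" if change > 0 else "dip"))
    return anomalies


# Hypothetical daily totals of crawl requests.
requests_per_day = [100] * 7 + [100, 30, 100, 220]
print(find_anomalies(requests_per_day))
```

A flagged dip would be the cue to check, for example, whether a new robots.txt file was added around that date.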

Host state analysis

Host status data allows you to check the general availability of a site over the past 90 days. Errors in this section indicate that Google cannot crawl the site for technical reasons.

Again, there are three categories that provide details on host status:

  • Robots.txt fetch: the percentage of errors while fetching the robots.txt file. It is not mandatory to have a robots.txt file, but the request must return a 200 or 404 response (a valid file, filled in or empty, or a file that does not exist); if Googlebot runs into a connection problem, such as a 503, it will stop crawling the site.
  • DNS resolution: indicates when the DNS server did not recognize the host name or did not respond during crawling. In case of errors, it is suggested to contact the registrar to verify that the site is configured correctly and that the server is connected to the Internet.
  • Server connectivity: shows when the server is not responding or has not provided a complete response for the URL during a crawl. If you notice spikes or consistent connectivity problems, it is suggested that you talk to your provider to increase capacity or resolve availability issues.

A substantial error in any of the categories can result in reduced availability. There are three host state values that appear in the report: if Google has found at least one such error on the site in the past week, a red icon alert with an exclamation point appears; if the error is older than a week and dates back to the past 90 days, a white icon with a green checkmark appears, signaling precisely that there have been problems in the past (temporary or resolved in the meantime), which can be verified through server logs or with a developer; finally, if there have been no substantial availability problems in the past 90 days, everything is fine and a green icon with a white checkmark appears.
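The three host status values follow a simple date rule, which can be written out as a small function. The name `host_status` and the returned labels are hypothetical; only the 7-day and 90-day boundaries come from the description above.

```python
from datetime import date, timedelta
from typing import Optional


def host_status(last_error: Optional[date], today: date) -> str:
    """Classify host availability from the date of the last substantial error."""
    if last_error is None or today - last_error > timedelta(days=90):
        return "no_issues"     # green icon with a white checkmark
    if today - last_error <= timedelta(days=7):
        return "recent_issue"  # red icon with an exclamation point
    return "past_issue"        # white icon with a green checkmark


today = date(2024, 1, 31)
print(host_status(date(2024, 1, 28), today))
print(host_status(date(2023, 12, 15), today))
print(host_status(None, today))
```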

Googlebot scan requests

The crawl request tabs break the data down in different ways that help us understand what Google’s crawlers found on the site. In this case, there are four breakdowns:

  • Crawl response: the responses Google received while crawling the site, grouped by type as a percentage of all crawl responses. Common response types are 200, 301, 404 or server errors.
  • File types scanned: shows the file types returned by the request (the percentage value of which refers to the responses received for that type, not bytes retrieved); the most common are HTML, images, video, or JavaScript.
  • Purpose of crawl: shows the reason for crawling the site, such as discovering a URL new to Google or refreshing a known page through a re-crawl.
  • Type of Googlebot: shows the type of user agent used to perform the crawl request, such as smartphone, desktop, image, and others.
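A similar breakdown by response type can be approximated from a site's own server logs. The sketch below assumes the status codes of crawler requests have already been extracted into a list; the function name `response_breakdown` and the sample data are hypothetical.

```python
from collections import Counter


def response_breakdown(status_codes):
    """Return each status code's share of all responses, as a percentage."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: round(100 * n / total, 1) for code, n in counts.items()}


# Hypothetical status codes taken from crawler requests in a server log.
print(response_breakdown([200, 200, 200, 301, 404]))
```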

Communicating to search engines how to crawl the site

To recap, to understand and optimize Google crawling we can use the Search Console’s Crawl Statistics report, starting with the page summary graph to analyze crawl volume and trends, continuing with host status details to check overall site availability, and finally, checking the breakdown of crawl requests to understand what Googlebot finds when it crawls the site.

These are the basics of using the Crawl Stats report to ensure that Googlebot can crawl the site efficiently for Search, to be followed up with the necessary crawl budget optimization and general interventions to ensure that our site can actually enter the Google Index and then begin the climb toward more visible positions.

With the understanding that the crawl budget (that is, the number of URLs Google can and will crawl on websites each day, repetita iuvant) is a parameter “relevant for large websites, because Google needs to prioritize what to crawl first, how much to crawl, and how frequently to crawl again,” it is still useful to know how to guide the way search engine crawlers crawl our site.

The basics of crawling - from Moz

In that sense, as Moz’s work (from which we have drawn some of the images on the page) summarizes well, there are some optimizations we can implement to tell Googlebot how we want it to crawl the content we publish on the Web; telling search engines directly how to crawl our pages can give us more and better control over what ends up in the Index.

Site interventions to optimize crawler crawling

Before we get into the details of what needs to be done, however, let’s digress one last time. Usually, we focus on the work necessary to ensure that Google can find our important pages, and that is certainly a good thing. However, we should not forget that there are probably pages that we do not want Googlebot to find, such as old URLs with thin content, duplicate URLs (such as sort parameters and e-commerce filters), special promo code pages, staging or test pages, and so on.

This is also what crawling management is for, allowing us to steer crawlers away from certain pages and sections of the site. And these are the common and most effective methods.

  • Robots.txt

We have mentioned it several times: robots.txt files are located in the root directory of Web sites and suggest, via specific directives, which parts of the site search engines should and should not crawl, as well as the speed at which they crawl the site (some crawlers honor a crawl-delay directive, while Googlebot ignores it).

  • Sitemap

Sitemaps can also be useful: they are, as the name makes clear, a list of URLs on the site that crawlers can use to discover and index content. One of the easiest ways to make sure Google finds your pages with the highest priority is to create a file that meets Google’s standards and submit it through Google Search Console. Although submitting a sitemap does not replace the need for good site navigation, it can certainly help crawlers follow a path to all important pages.
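As a minimal sketch of what such a file looks like, the snippet below builds a basic XML sitemap with Python's standard `xml.etree.ElementTree`. The namespace is the one defined by the sitemaps.org protocol; the function name `build_sitemap` and the `example.com` URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Return a minimal XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/blog/first-post",
])
print(sitemap_xml)
```

The resulting string would typically be saved as a file on the site and then submitted through Search Console, as described above.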

Sometimes, navigation errors can prevent crawlers from seeing the entire site: this is the case of mobile navigation showing different results than desktop navigation, JavaScript-enabled (and not HTML-enabled) menu items, customization or display of navigation unique to a specific type of visitor over others (which could appear as cloaking to crawlers), failure to link to a primary page of the site in the navigation, hidden text within non-text content, content hidden behind login forms, and so on.

According to experts, it is essential for the website to have clear navigation and useful URL folder structures.

At the same time, a clean information architecture should be set up, following the practice of organizing and labeling content in a way that improves efficiency and findability for users, on the premise that the best information architecture is intuitive, that is, it allows users not to think much about scrolling through the site or finding something.

  • Optimizing the crawl budget

Finally, there are the technical interventions to optimize the crawl budget, which is the average number of URLs Googlebot scans on the site before exiting, and thus serves to prevent Googlebot from wasting time scanning unimportant pages and risking ignoring important ones. The crawl budget is very important on very large sites with tens of thousands of URLs, but it is never a bad idea to prevent crawlers from accessing content that we are definitely not interested in. What we need to make sure is not to block a crawler’s access to pages on which we have added other directives, such as canonical or noindex tags: if Googlebot is blocked from a page, it will not be able to see the instructions there.
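The last warning lends itself to an automated check: given the robots.txt rules and a list of pages that carry canonical or noindex directives, flag the pages a crawler could never read. This sketch uses Python's standard `urllib.robotparser`; the function name, the rules, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_directive_pages(robots_txt: str, pages_with_directives,
                            user_agent: str = "Googlebot"):
    """Return the pages whose on-page directives cannot be seen by the crawler.

    A page that is disallowed in robots.txt is never fetched, so any
    canonical or noindex tag placed on it goes unread.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in pages_with_directives
            if not parser.can_fetch(user_agent, url)]


rules = "User-agent: *\nDisallow: /old/\n"
pages = ["https://example.com/old/page", "https://example.com/new/page"]
print(blocked_directive_pages(rules, pages))
```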
