How does Google Search work? How many times have we asked ourselves this question, both as “normal” users and as people who work in search marketing? So let’s try to analyze, in a simple but in-depth way, how the Search system works, focusing in particular on how Google discovers web pages, crawls them and publishes them, but also on what all this means for sites and for SEO.
What is Google Search
Search engines are the digital version of a librarian and help users get the appropriate answers and information for their query and need by scanning the full index of results they have available.
Thus, search engines are answer machines, searchable databases of Web content, and exist to discover, understand, and organize Web content to deliver the most relevant results to the questions posed by users.
Broadly speaking, search engines are composed of two main parts: a search index (representing the digital library of information about Web pages) and search algorithms, i.e., computer programs charged with comparing search index results to identify the most appropriate ones based on the rules provided.
Google Search is exactly that: a fully automated search engine that uses software, called web crawlers, to regularly explore the Web and find pages to add to its Index. As frequently reiterated, the mission of Google and its search system is “to organize the world’s information and make it universally accessible and useful,” and this requires continuous work to map the Web and other sources so that each user can access the information that the algorithms deem most relevant and useful, using as a compass the criteria we usually refer to as Google’s 200 ranking factors. The results are then presented in various ways in SERPs, based on what is most appropriate for the type of information that specific person is looking for.
Google’s statistics and numbers
Google made its official debut in 1997 and rather quickly established itself as “the” search engine of the Web: it has been the most visited site in the world for several years now, and it is so popular that its name (or a derivative of it) has become synonymous with online search itself in several languages, as evidenced by the English verb to google, the German googeln and the Italian googlare.
Numbers help us understand the dominance of this behemoth in the search engine market: as of February 2023, Statcounter certifies that Google holds about 94 percent of the worldwide search market share, relegating its main competitors to residual shares (the second-ranked engine, Bing, does not reach 3 percent of users).
Speaking of figures and statistics, the data revealing the amount of work the search engine does at every moment are impressive, and they also tie in more or less directly with the size of its Index. Specifically, Internet Live Stats counts that in 2023 Google processes nearly 100 thousand searches every single second, which means over 8.5 billion searches per day and over 3.1 trillion per year.
According to Siteefy, as of September 4, 2021, Google had about 25 billion web pages in its index, while the World Wide Web Size Project puts the number of web pages indexed by Google at about 50 billion; in absolute terms, there are about 1.13 billion websites in the world (even though 82% of them are inactive!).
Thus, every time we enter a query in the search box, Google starts analyzing “thousands, sometimes millions, of web pages or other content that might be a match” for our original intent and, thanks to its systems, tries to present the most useful information in response to what we have asked.
Why it is important to know how Search works
Let’s continue to give some more numbers that make us understand the value of this huge system: according to BrightEdge, 68 percent of all online experiences start with a search engine, and organic searches are responsible for 53.3 percent of all website traffic.
In order to intercept organic traffic, however, we need to be visible, and only by understanding the basics of search and Google Search can we make our content discoverable to users. The first piece of the SEO puzzle is to make sure that content is visible to search engines in the first place, because a site that cannot be found and read by crawlers will never be able to appear in SERPs and be clicked on by people.
On a general level, then, two key points related to presence on Google should be kept in mind:
- Even if we follow all the official directions and guidelines on the basics of Google Search, Google does not guarantee that it will crawl the page, index it, or publish it.
- Despite what we may read out there, Google does not accept payment for crawling a particular site more frequently or improving its ranking, nor is there a link between organic ranking and spending on search engine advertising.
Then there is another aspect that we must not overlook: ranking is not “eternal,” and not only because “panta rhei,” to put it à la Heraclitus. In addition to the inevitable evolution of technologies, possible changes in search intent, and transformations in the context (e.g., the emergence of new competitors or optimizations of other sites), the search engine itself is constantly changing. As Google says in its guide to core updates, the Big G team is always working to incrementally refine the efficiency of the search engine and ensure that users always find the most useful and reliable answers. Here, too, numbers clarify the scale of these interventions: in 2022 alone, Google officially reported more than 800,000 experiments, which led to more than 4,000 improvements to Search (values that are constantly growing, as a comparison with the data on changes in 2020 shows).
To be precise, as seen in the graphic below, there were:
- 4,725 launches (i.e., changes actually implemented at the end of a rigorous review process by Google’s most experienced engineers and data scientists).
- 13,280 experiments with real-time traffic (to assess whether user engagement with respect to the new feature is positive and will be useful for everyone).
- 894,660 search quality tests (with the work of quality raters, who do not directly affect ranking, but help Google set quality standards for results and maintain a high level of quality worldwide).
- 72,367 side-by-side experiments (a kind of A/B testing with two different sets of search results, to figure out which one is preferred between the one with the expected change and the “classic” one).
In short, Search is not a static service, so acquiring some basic knowledge can help us keep up, solve any crawling problems, get pages indexed, and find out how to optimize the appearance and presence of our site in Google Search, which is ultimately the goal of SEO, starting with an intuitive assumption: the better Google understands the site, the better it can match it to people searching for that kind of content and answers.
The three stages of Google Search
Quickly analyzing the Google Search system, we can identify three stages in the process of discovering, evaluating, and publishing pages (and not all pages make it through every step):
- Crawling. Through automated programs called crawlers, such as Googlebot, Google downloads text, images and videos from pages found on the Internet.
- Indexing. Google analyzes the text, images, and video files on the page and stores the information in the Google Index, its large database. Google’s search index “is like a library, but it contains more information than all the libraries in the world combined” and is continually expanded and updated with data on web pages, images, books, videos, facts, and more.
- Publication of search results. When a user performs searches on Google Search, algorithms return information relevant to their query in a split second: results are presented in a variety of ways, based on what is most useful for the type of information the person is actually looking for.
How page crawling works
The first step is called crawling and is used to figure out what pages exist on the Web: as the search engine’s official documentation explains, there is no central registry of all web pages, so Google must constantly search for new and updated pages and add them to its list of known pages, doing what is called “URL discovery.”
Much of the work is done by software known as crawlers (but also robots, bots, or spiders), which automatically visit publicly accessible web pages and follow the links on those pages, just as a user browsing content on the Web does; running on a huge number of computers, crawlers scan billions of pages on the Web, moving from page to page and storing information about what they find on those pages and other publicly accessible content in the Google Search index.
Some pages are known because Google has already visited them, others are discovered when Google follows a link from a known page to a new one (for example, a hub page, such as a category page, linking to a new blog post), and still others are discovered when a Sitemap, i.e., a list of pages to crawl, is submitted to Google.
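To make that last discovery path concrete, here is a minimal sketch of what a Sitemap file can look like, following the standard sitemaps.org format; the URLs and dates are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal Sitemap sketch (placeholder URLs and dates) -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-02-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/new-post</loc>
    <lastmod>2023-02-20</lastmod>
  </url>
</urlset>
```

Submitting such a file (for example via Search Console) does not force Google to crawl every URL listed, but it gives the crawler an explicit list of pages we consider worth visiting.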
When Google finds the URL of a page, it might visit – technically “crawl” – the page to discover its contents and perform the retrieval operation. Specifically, Googlebot uses an algorithmic process to determine which sites to crawl, how often to do so, and how many pages to retrieve from each site, so as to avoid overloading it with too many requests. Crawl rate (how fast Googlebot can make requests without straining the server) and crawl demand (how much of the site Google wants to crawl) together form the crawl budget, i.e., the number of URLs that Googlebot can and wants to crawl, which can be a relevant element in improving the ranking opportunities of our most strategic pages.
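To make the idea more tangible, here is a toy sketch of how a budget could combine those two quantities; this is a didactic illustration only, with invented names and numbers, not Google’s actual computation.

```python
# Toy illustration of the crawl budget idea: the URLs actually fetched from a
# host are limited both by how fast the server can safely be crawled (crawl
# rate) and by how much Google wants to crawl (crawl demand).
# Invented names and numbers, purely for illustration.

def crawl_budget(max_fetches_per_day: int, urls_worth_crawling: int) -> int:
    """Budget = whichever is smaller: server capacity or crawl demand."""
    return min(max_fetches_per_day, urls_worth_crawling)

# A small, frequently updated site may get its whole queue crawled...
print(crawl_budget(max_fetches_per_day=5_000, urls_worth_crawling=800))     # 800
# ...while a huge site on a slow server is limited by server capacity.
print(crawl_budget(max_fetches_per_day=2_000, urls_worth_crawling=50_000))  # 2000
```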
During crawling, Google renders the page and executes any JavaScript it finds using a recent version of Chrome, similar to how a browser renders the pages we visit. Rendering is important because websites often rely on JavaScript to display content on the page, and without rendering Google might not see that content.
In any case, Googlebot does not crawl all the pages it has detected: some resources may be blocked from crawling by the site owner, and others may not be accessible without logging in to the site. Specifically, there are at least three common problems with Googlebot accessing sites that prevent crawling:
- Problems with the server running the site
- Network problems
- Rules in the robots.txt file that prevent Googlebot from accessing the page
Content authors and site owners/operators can help Google crawl their pages better by using the reports in Search Console or through the aforementioned established standards, such as Sitemaps or the robots.txt file, which specify how often crawlers should visit content or exclude certain pages and resources from crawling.
Basically, there are various reasons why we may want to block search engine crawlers from part or all of the site or instruct search engines to avoid storing certain pages in their index. However, if we want our content to be found by Search users, it is crucial to make it accessible to crawlers and indexable, otherwise our site risks being virtually invisible.
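For reference, here is a minimal robots.txt sketch with placeholder paths; keep in mind that a Disallow rule blocks crawling but is not, by itself, a guarantee that a URL stays out of the index (a blocked page can still be indexed if other sites link to it).

```
# robots.txt served at https://www.example.com/robots.txt (placeholder paths)

# Rules for Google's crawler only
User-agent: Googlebot
Disallow: /internal-search/

# Rules for every other crawler
User-agent: *
Disallow: /checkout/
Allow: /

# Point crawlers to the Sitemap
Sitemap: https://www.example.com/sitemap.xml
```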
Technical details about crawling
Continuing with the library metaphor, according to Lizzy Harvey crawling is “like reading all the books in the library.” Before search engines can show any search results, in fact, they have to retrieve as much information from the Web as possible, and for that they use a crawler, a program that travels from site to site and acts like a browser.
The crawlers attempt to retrieve each URL to determine the status of the document and ensure that only publicly accessible documents enter the index: just as a missing or damaged book cannot be read, a resource that returns an error status code gives the crawlers nothing usable, and they may retry the URL at a later time.
Specifically, if the crawlers discover a redirect status code (such as 301 or 302), they follow the redirect to a new URL and continue there; when they get a positive response, a sign that they have found a user-accessible document, they check whether it is allowed to crawl and then download the content.
This check includes the HTML and any content mentioned in the HTML, such as images, videos, or JavaScript. The crawlers also extract links from HTML documents to also visit linked URLs since, as we said before, following links is how crawlers find new pages on the Web. Speaking of links, in older versions of Google’s document there was an explicit reference to the fact that “links within advertisements, links for which you have paid on other sites, links in comments, and other links that do not comply with the Guidelines are not followed” – now gone, although almost certainly the way it works has remained the same.
It is important to know, however, that crawlers do not actively click on links or buttons, but send URLs to a queue to be crawled at a later time; also, when a new URL is accessed, there are no cookies, service workers, or local storage (such as IndexedDB).
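To tie these technical details together, here is a deliberately simplified sketch of such a fetch loop in Python; it assumes the third-party requests and beautifulsoup4 packages, uses placeholder URLs, and is a teaching aid rather than a description of Googlebot’s actual implementation.

```python
# Deliberately simplified crawl loop: fetch a URL, honor robots.txt, follow
# redirects, skip error responses, and extract links into a queue (no clicking).
# Teaching sketch only; not how Googlebot actually works.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def allowed_by_robots(url: str, user_agent: str = "MyToyBot") -> bool:
    """Check the site's robots.txt before downloading the page."""
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser(urljoin(root, "/robots.txt"))
    try:
        parser.read()
    except OSError:
        return True  # no readable robots.txt: assume crawling is allowed
    return parser.can_fetch(user_agent, url)

def crawl(seed_urls, max_pages=50):
    queue = deque(seed_urls)  # discovered URLs wait here for a later visit
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        if not allowed_by_robots(url):
            continue  # crawling this URL is not allowed
        response = requests.get(url, timeout=10, allow_redirects=False)
        fetched += 1
        if response.status_code in (301, 302, 307, 308):
            target = urljoin(url, response.headers.get("Location", ""))
            if target and target not in seen:
                seen.add(target)
                queue.append(target)  # continue at the redirect target later
            continue
        if response.status_code != 200:
            continue  # error status: none of the content is usable
        # Extract links and queue them: following links is how new pages are found.
        for anchor in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        yield url, response.text  # hand the document over to the indexing stage

# Usage sketch (placeholder seed URL):
# for page_url, html in crawl(["https://www.example.com/"]):
#     print("fetched", page_url)
```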
Indexing, or the organization of information
After finding a web page, crawlers analyze its content to figure out what it is about and to organize Google’s collection of information: this is the phase called indexing, in which Google views the page’s content much as a browser would and takes note of key signals, processing and analyzing text content and key content tags and attributes, such as <title> elements and alt attributes, images, videos, and more.
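As an illustration, here is a minimal HTML sketch (with placeholder content) of the kind of on-page elements mentioned above that indexing systems read.

```html
<!-- Minimal sketch of on-page signals (placeholder content) -->
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Bicycle repair in Paris | Example Workshop</title>
    <meta name="description" content="Same-day bicycle repairs in central Paris.">
  </head>
  <body>
    <h1>Bicycle repair in Paris</h1>
    <p>We fix road bikes, city bikes and e-bikes.</p>
    <!-- The alt attribute describes the image to systems that cannot "see" it -->
    <img src="workshop.jpg" alt="Mechanic truing a bicycle wheel in the workshop">
  </body>
</html>
```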
The Google Search index contains hundreds of billions of web pages and its size exceeds 100,000,000 gigabytes: it is like the index at the end of a book and presents an entry for each word displayed on each web page that has been indexed. In fact, when Google indexes a web page, it adds it to the entries for all the words it contains.
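To visualize the “index at the end of a book” analogy, here is a toy inverted index in Python; it is a didactic sketch of the general concept, not a description of Google’s actual data structures.

```python
# Toy inverted index: for every word, store which pages contain it.
# Didactic sketch only; placeholder pages and text.
from collections import defaultdict

pages = {
    "https://example.com/repair": "bicycle repair shop in paris",
    "https://example.com/shop": "buy a modern bicycle online",
}

inverted_index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        inverted_index[word].add(url)  # add the page to this word's entry

# Looking up a query word returns every indexed page that contains it.
print(sorted(inverted_index["bicycle"]))
# ['https://example.com/repair', 'https://example.com/shop']
```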
Because the Web and other content is constantly changing, the search engine’s crawling processes run continuously to keep up, learning how often already-examined content changes and re-crawling it as necessary, and also discovering new content as new links to those pages or pieces of information appear.
One curious aspect is that the Google Search index contains more than what is on the Web because, as the search engine’s own documentation states, “useful information may be available from other sources.” In fact, there are multiple indexes for different types of information, which is gathered through crawling, collaborations, data feed submissions, and Google’s encyclopedia of facts, the Knowledge Graph. These different indexes allow a user to search within millions of books from the largest collections, find travel schedules through a local public transportation company, or find data provided by public sources such as the World Bank.
Technical details about indexing
From a technical point of view, indexing takes place as part of a fully automated crawl, without human intervention, and each web crawler works in its own specific way, following the algorithms and machine learning systems of its search engine.
This step is also used by Google to determine whether a page is a duplicate of another page on the Internet or whether it is a canonical page, i.e., the one that can be shown in search results as the most representative of a cluster of pages with similar content found on the Internet (the other pages in the cluster, remember, are considered alternate versions and might be published in different contexts, for example if the user is searching from a mobile device or is looking for a very specific page in that cluster).
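Site owners can give Google a hint about which URL in such a cluster they consider canonical, for example with a link element like the one below (placeholder URLs); it is a signal, not a directive.

```html
<!-- Placed in the <head> of duplicate or parameterized versions of the page,
     e.g. https://www.example.com/blog/new-post?utm_source=newsletter -->
<link rel="canonical" href="https://www.example.com/blog/new-post">
```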
Indexing is not guaranteed and not all pages processed by Google are then actually indexed. This may also depend on the content on the page and its metadata, and major indexing problems include:
- Low quality of content on the page.
- Robots meta tag rules that do not allow indexing (see the example after this list).
- Website design that may make indexing difficult.
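As for the second point, the typical rule that keeps an otherwise crawlable page out of the index is a robots meta tag like this, placed in the page’s <head>:

```html
<!-- The page can still be crawled, but it will not be indexed or shown in results -->
<meta name="robots" content="noindex">
```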
It is Lizzy Harvey again who provides more analysis on this activity, which begins when the crawler, after retrieving a document, passes the content to the search engine to add it to the index: at this point, the search engine performs rendering (i.e., in a nutshell, displays the code of the page as a browser would, with some limitations, to understand how it looks to users) and analyzes the content to understand it.
Specifically, search engines look at a number of signals that describe the content and context of the page, such as keywords, title, links, headings, text, and many other things, which allow the search engines themselves to answer any query with the best possible page.
One final clarification: the Index still represents a kind of database of Web sites pre-approved by Google, which has checked the sources and information and deemed those pages safe for its users. Thus, searching on Google (and in general searching on a search engine) does not mean searching within the entire World Wide Web nor searching the entire Internet (for example, our queries will not bring us results from the notorious and infamous dark web), but searching within the pages selected by the web crawlers of that specific search engine, in a restricted database.
Ranking and publishing of search results
The last activity kicks off when a person enters a query: Google’s computers search the index for matching pages, then return the results deemed most useful, of best quality, and most relevant to that query. Ranking or ordering of pages occurs based on the query, but often the order can change over time if better information becomes available.
In general, it can be assumed that the higher the ranking of a page, the more relevant the search engine considers that page and site to be with respect to the query.
Given the vast amount of information available, finding what we are looking for would be virtually impossible without a tool to organize the data: Google’s ranking systems are designed for this very purpose and, in a fully automated way, sort through hundreds of billions of web pages and other content in the search index to provide useful and relevant results in a fraction of a second.
Relevance is determined by taking into account hundreds of factors, such as location, language, and the user’s device (computer or phone): for example, a search for “bicycle repair shops” shows different results to a user in Paris than to a user in Hong Kong.
This extra work goes beyond simply matching the query with the keywords in the index: to provide useful results, Google might consider context, alternative wording, and more. For example, “silicon valley” might refer to the geographic region or to the television show, but if the query is “silicon valley cast,” results about the region would not be very useful. Other queries may be indirect, such as “the song in pulp fiction,” and search engines should interpret the user’s intent and show results for the music tracks featured in the movie.
Still on the subject of factors, the words used in the search, the relevance and usability of the pages, the reliability of the sources, and the settings of the user’s device can all influence the information shown in SERPs. The importance attached to each factor changes depending on the type of search: for example, the publication date of the content plays a bigger role for searches about current topics than for searches about dictionary definitions, as codified by the so-called Query Deserves Freshness algorithm.
As noted in the specific insights, Google identifies five broad categories of major factors that determine the results of a query, namely:
- Meaning.
- Relevance.
- Quality.
- Usability.
- Context.
Then there are cases when a page is indexed and is recognized as indexed by Search Console, but we do not see it appear in search results; the causes of this phenomenon could be as follows:
- The content of the page is not relevant to users’ queries.
- The quality of the content is low.
- Robots meta tag rules prevent publication.
The refinement of results and SERP features
The search features displayed on the search results page also change according to the user’s query. For example, a search for “bicycle repair shops” is likely to show local results and no image results; however, a search for “modern bicycle” is likely to show results related to images, not local results.
The appearance of additional boxes, features, and functionality also serves the search engine’s mission of solving the searcher’s query as quickly and effectively as possible: the best-known examples are featured snippets (short excerpts shown above organic links that succinctly answer the user’s query), Local Maps, rich results (results enriched with multimedia compared to classic text snippets), and knowledge panels, but the list of features is huge and constantly growing, as shown in our in-depth look at the gallery of results displayed in Google SERPs thanks to information retrieved from structured data.
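As a reference, here is a minimal JSON-LD sketch of the kind of structured data involved, with placeholder values; valid markup makes a page eligible for rich results, it does not guarantee them.

```html
<!-- Minimal structured data sketch (placeholder values) -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example city bicycle",
  "image": "https://www.example.com/bike.jpg",
  "offers": {
    "@type": "Offer",
    "price": "499.00",
    "priceCurrency": "EUR",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```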
Site, documents and pages: the Google vocabulary
Going back to the old version of Google’s document (now gone), the section “What is a document?” was interesting: it described the mechanism Google used to determine what counted as a document, with details on how the system grouped multiple pages with identical content, even at different URLs, into a single document, and how it determined canonical URLs.
Starting with definitions, we then discover that “internally, Google represents the Web as a (huge) set of documents. Each document represents one or more Web pages,” which may be “identical or very similar, but represent essentially the same content, reachable from different URLs.” In detail, “different URLs in a document may point to exactly the same page or to the same page with small variations intended for users on different devices.”
Google “chooses one of the URLs in a document and defines it as the document’s canonical URL”: it will be “the one Google crawls and indexes the most often,” while “other URLs are considered duplicate or alternate and may occasionally be crawled or published based on user request.” For example, “if the canonical URL is the URL for mobile devices, Google will likely still publish the desktop (alternate) URL for users performing desktop searches.”
Focusing on the glossary, specifically, in Google Search the following terms have this specific meaning:
- Document is a collection of similar pages, which “includes a canonical URL and possibly alternate URLs if your site has duplicate pages.” Google chooses the best URL to show in search results based on platform (mobile/desktop device), user language (hreflang versions are considered separate documents, it explains), location, and many other variables. Google “detects related pages on your site through organic crawling or through features implemented on your site, such as redirects or <link rel=alternate/canonical> tags,” while “related pages from other organizations can only be marked as alternatives if they are explicitly coded by your site (through redirects or link tags).” A markup sketch of these tags follows this glossary.
- URL is “the URL used to reach a particular piece of content on a site,” and it is clarified that a site “may resolve different URLs on the same page.”
- Page refers to “a given web page, reached through one or more URLs,” and there may “be different versions of a page, depending on the user’s platform (mobile device, desktop, tablet, and so on).”
- Version means “a variant of the page, generally classified as mobile, desktop, and AMP (although AMP may itself have mobile and desktop versions).” Each “version may have a different or identical URL depending on the configuration of the site,” and it is again reiterated that “language variants are not considered different versions, but rather different documents.”
- Canonical page or URL is “the URL that Google considers most representative of the document,” which Google always crawls, while “duplicate URLs in the document are occasionally crawled.”
- Alternate/duplicate page or URL is “the URL of the document that Google may occasionally crawl”; such URLs are published if Google recognizes them as “suitable for the user and the request (e.g., an alternative URL for desktop requests will be published for desktop users, rather than a canonical URL for mobile devices).”
- Site, a term “generally used synonymously with website (a conceptually related set of web pages), but sometimes used synonymously with a Search Console property, although a property may be defined in effect only as part of a site. A site can include subdomains (and even organizations, for properly linked AMP pages).”
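To tie the glossary together, here is the kind of markup, with placeholder URLs, through which a site can declare these relationships explicitly; the exact tags depend on the site’s configuration.

```html
<!-- In the <head> of https://www.example.com/page (desktop, canonical URL) -->
<!-- A separate mobile URL is an alternate of the same document -->
<link rel="alternate" media="only screen and (max-width: 640px)"
      href="https://m.example.com/page">

<!-- Language versions, by contrast, are separate documents linked via hreflang -->
<link rel="alternate" hreflang="en" href="https://www.example.com/page">
<link rel="alternate" hreflang="it" href="https://www.example.com/it/page">

<!-- While in the <head> of https://m.example.com/page (mobile, alternate URL): -->
<!-- <link rel="canonical" href="https://www.example.com/page"> -->
```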
How Google Search works: summary and basics
To recap this huge and complex mechanism, we can refer to episode number 4 of the “Search for Beginners” series that Google published on its YouTube channel to clarify the main doubts about how Google Search works and, in particular, to offer the simple definitions of crawling, indexing, and ranking, which we said are the three main stages of the process.
As in previous episodes, the target audience is first and foremost online business sites, but information on how the search engine crawls to discover web pages, organizes the results in the Index, and ranks them to show them to users and potential customers can be useful for any type of business on the Web.
How the Google Search system works
It all starts from a basic notion: Google Search “is a powerful tool” that allows people to “find, share and access an almost infinite amount of content, regardless of how or when they connect.” Therefore, “if you have an online business and you want your customers to find you,” understanding how the search engine works is key.
First of all, Google has to “realize that your business has a presence on the web”: whether it is a site, a blog, a social media profile or a Google My Business listing, “Google goes through a whole journey to find your business, categorize it and show it to your potential customers,” the video reminds us.
Scanning the Web to find new content
The first fundamental step of this process is scanning: “Google constantly searches for new content to add to its huge catalogue” through a discovery activity called crawling. Generally, the narrator explains, “Google discovers new pages by following links from page to page,” finding content it has never seen before.
Googlebot’s intervention
Whenever Googlebot, which as we know is Google’s crawler, finds a new site, “it has to understand what the content is all about”: this process is called indexing. Briefly, this means that “just as you would organize the inventory of your store, whether it’s shoes, sweaters or dresses, Google analyzes the content of your page and saves this information to its Index,” which is considered the biggest database in the whole world.
Third step: ranking
Once the first two technical processes are complete, what happens after “Google has found your website and Googlebot knows that you have an online shop that sells clothes”? Now it’s time to talk about ranking: whenever a user launches a search query, Google’s systems run through hundreds of billions of web pages within the search index, looking for the most useful and relevant ones in a fraction of a second.
A typical search generates thousands, or even millions, of web pages providing potentially relevant information: Google’s job is to “determine the highest quality and most relevant answers, returning the content providing the best user experience and the most appropriate results.”
Ranking and user factors
Alongside the classic ranking factors, there are other elements that affect the results, such as “the user’s location, language and device type,” as the video says. For instance, a search about “buying a nice shirt” could show very different results to a user searching from New York compared to one living in Miami: in the first case a long-sleeved shirt will probably be needed, while in Florida a lighter shirt would be more useful.
Difference between organic and paid results
What we all need to remember, Big G keeps stressing, is that “Google Search results are organic and generated through sophisticated algorithms that make thousands of calculations for each search in a fraction of a second, based only on the relevance of a page to a user.”
And so, “Google never accepts payments from anyone to be included in organic search results or to alter a page’s ranking in any way,” it explains in the video, perhaps also to directly answer the now well-known attack by the Wall Street Journal.
The case of in-SERP ads is different: this is the kind of advertising that appears among search results but “is clearly labeled and so easy to distinguish from the rest of the page,” they tell us from Mountain View (but maybe the message is not that clear to everyone, as we were saying a few days ago!).
A process only lasting a few seconds
So, to wrap up: Google explores the Web to find new content, indexes that content by categorizing it like a catalogue, and its ranking systems analyze the index to show users only the most relevant results. Then it is the SEO’s turn to get involved and try to improve rankings on Google from a strategic and business-friendly perspective!