Canonicalization, myths to know and bust according to Google

SEO admin 9 September 2020

Canonicalization does not mean grouping pages by topic. It is a system for prioritizing one URL out of a set of pages with identical or nearly identical content, in order to reduce duplication. That statement opens the new episode of SEO Mythbusting season 2, the Google series on YouTube that tackles the main myths of SEO.

Main myths about canonicalization

In the latest video, host Martin Splitt’s guest is Rachel Costello (Technical SEO Consultant at Builtvisible and, at the time of recording, Technical SEO & Content Manager at Deepcrawl), and the central theme is canonicalization.

That is, as the Googler summarizes, the “management of duplicates” among the pages a site publishes: signaling the preferred version of a page to show in Search and removing duplicates. This avoids the risk of keyword cannibalization and spares Google from crawling the same things several times. Google does not want multiple unnecessary crawls or renderings, nor does it want to serve the same content under different URLs, as these would not be good search results.

In general, the myths on this topic include doubts as to whether canonicalization is a signal or a directive, whether it can be used as a redirect, how the site’s preferences weigh against the users’, and more.

Misinterpretations of the canonical

In Costello’s experience, two misconceptions are most prevalent. The first: “people think it’s a directive, set a canonical tag and it will be accepted”. In reality, the canonical is an HTML hint that a site can set to signal to a search engine which URL is the main one for a page or piece of content.

Another frequent misconception is using canonicalization as a redirect: “If you have a product page that is not available, add a canonical to that category page”, says the expert by way of example, adding that “it doesn’t work quite so”, because “the contents must be identical or almost identical”, as Martin Splitt confirms.
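
As an illustration of what that hint looks like in practice, here is a minimal Python sketch that reads the canonical link out of a page’s HTML using only the standard library. The page markup and the example.com URL are hypothetical, and the sketch only shows what the site declares; it says nothing about which URL a search engine ends up choosing.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens; a valueless rel is None.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page declares none."""
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical


# Hypothetical page that declares a clean URL as its preferred version.
page = """
<html><head>
  <title>Blue widget</title>
  <link rel="canonical" href="https://example.com/widgets/blue">
</head><body>...</body></html>
"""

print(find_canonical(page))
```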

Google’s explanation

It falls to Google’s Developer Advocate to clear up these doubts and explain thoroughly what canonicalization is. First of all, it is not a directive, that is, an instruction search engines are required to follow. It is a signal: a hint that helps search engines understand what we want to canonicalize (what we want to give importance and priority in Search), but that search engines can decide whether or not to use.

Canonicalization is not a directive

When it comes to canonicalization, Splitt says, “we’re talking about detecting content or the same content or very similar content that exists on different addresses and different URLs”, and Google can “do many different things to identify these things”. For example, it can simply crawl multiple pages and find that they carry the same content, check whether the URLs share the same links and the same kind of context, or just use the canonical tag.

Google uses many different signals to “understand if something has the same content or not”, and the canonical tag is just one of them. For it to be effective, however, the tag must be set correctly: it will not work to “put it on pages that do not have the same content, but it is not good, either, to put it on each of the identical pages”.
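
A site owner can sanity-check this on their own pages. The sketch below assumes we already know a group of URLs carries the same content, and reads Splitt’s last remark as the case where every duplicate declares itself canonical; that reading, the function name and the URLs are assumptions made for illustration, not something stated in the video.

```python
from typing import Dict, Optional


def audit_cluster(declared: Dict[str, Optional[str]]) -> str:
    """Check the canonicals declared by a group of duplicate URLs.

    `declared` maps each duplicate URL to the canonical URL it declares,
    or to None when the page declares nothing.
    """
    targets = {target for target in declared.values() if target is not None}
    if not targets:
        return "no canonical declared"
    if len(targets) > 1:
        # The duplicates disagree, so the hint is contradictory.
        return "conflicting canonicals"
    return "consistent"


# Hypothetical cluster: both variants point at the same clean URL.
agreeing = {
    "https://example.com/shoes": "https://example.com/shoes",
    "https://example.com/shoes?ref=mail": "https://example.com/shoes",
}

# Hypothetical cluster: every duplicate names itself as the canonical.
self_referencing = {
    "https://example.com/shoes": "https://example.com/shoes",
    "https://example.com/shoes?ref=mail": "https://example.com/shoes?ref=mail",
}

print(audit_cluster(agreeing))
print(audit_cluster(self_referencing))
```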

How to signal canonicalization

Canonicalizing a page correctly avoids leaving the choice of the best page to show in search results entirely to Googlebot. In addition to the dedicated tag there are, as mentioned, other signals that Google takes into account to group URLs with similar content and deduplicate them.

These include redirects between pages, internal links, outbound links, what the sitemap indicates, hreflang, and clean or shortened URLs.

Canonicalization is not a redirect

Nor should the canonical tag be used to make a redirect, warns Splitt: it does not work as one, although the two are often confused. Rachel Costello confirms this, saying she has noticed how people try in every way to consolidate link equity in a single page, and then use the canonical as a desperate attempt to achieve that goal.

This is another mistake, because – Martin Splitt reiterates – canonicalization comes into play and only makes sense when “you cross the same content on different platforms or channels in slightly different places, for whatever reason you are doing it”.

For out-of-stock and unavailable products, by contrast, you simply need to redirect “to something similar that makes sense to the user at that point”, or have the page return a 404 to tell Google that “this is the current situation but it could come back”.
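
The choice described above can be sketched as a small decision function. The function name, the URL and the use of a 301 status for the redirect are assumptions for illustration; the video does not say which redirect status to use.

```python
from typing import Optional, Tuple


def unavailable_product_response(
    replacement_url: Optional[str],
) -> Tuple[int, Optional[str]]:
    """Pick an HTTP response for a product page that is no longer available.

    Returns (status_code, redirect_target). A canonical tag is deliberately
    not an option here: the category page is not a duplicate of the product.
    """
    if replacement_url:
        # Assumption: a permanent redirect to a similar, useful page.
        return 301, replacement_url
    # No sensible replacement: report the page as not found.
    return 404, None


print(unavailable_product_response("https://example.com/widgets/blue-v2"))
print(unavailable_product_response(None))
```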

Canonicalization and waste of crawl budget

It is important to use the canonical tag properly, because otherwise we risk wasting crawl budget.

If we have identical pages and have not set canonicalization (or have set it wrongly, or keep switching the chosen page), Googlebot will come back to crawl all of that content, which is useless and harmful to the site’s economy.

Using the canonical as a redirect is even worse: the search engine is faced with pages marked as identical that actually are not, and will therefore keep crawling all of them.

Duplication and deduplication, signals for Google

In the video they then move on to the technical factors Google considers when deduplicating the content of a site. These are all automatic signals, because the work on duplication and deduplication is done “without much human interaction”, says Splitt, but “Google appreciates content fingerprinting” and tries to understand “what is the essence, what is the information, how it relates to the structure of the site, what is written on the sitemap; in short, we are faced with a number of different factors, mostly technical”.
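
The video does not describe how Google’s fingerprinting works. Purely to illustrate the general idea, the sketch below hashes a normalized version of a page’s text with SHA-256, so that pages differing only in case or whitespace get the same fingerprint. The normalization rules and sample texts are assumptions, and an exact hash like this cannot detect content that is merely “very similar”.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical page texts: a and b differ only in case and spacing.
a = "Blue widget.\n  In stock now."
b = "blue widget. in stock   now."
c = "Red widget. In stock now."

print(fingerprint(a) == fingerprint(b))  # True: treated as duplicates
print(fingerprint(a) == fingerprint(c))  # False: different content
```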

Basically, Google assigns a score on an ongoing basis, so it does not settle these questions once and then stick to the same decision forever: “We always look at the fresh content taken from the crawling, and then we look at the page – this changes, this has changed, now it is very close to the previous version, now something that was a duplication is not anymore, because the content has been modified”.

Sometimes, Splitt continues, “especially when practically everything is shown in the same URL structure and is like versions in different language of the same thing, but with the same content, then we could end up with a very similar score”. If Google sees two versions, “let’s say one 0.49 and one 0.51 of what we think is a duplicate of the other, then it is really hard to choose which will be the canonical page”.

Complicating things further, the situation can change: Google may crawl differently, or change the way the crawler fetches data, and even the pages visited earlier can lead Google “to have some sort of jump between these two numbers”.

And then there is the canonical: a clear signal that helps search engines and avoids confusing the algorithms working out which of the analyzed pages are duplicates. “Because, if we have two equal contents, how do we know which one to choose?” summarizes Martin Splitt.
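
To make the 0.49 versus 0.51 example concrete, here is a toy Python sketch in which a declared canonical acts as a tie-breaker when two scores are close, and is ignored when one page clearly wins. The function, the 0.05 margin and the tie-breaking rule are all assumptions made for illustration; Google’s actual scoring is not described in the video.

```python
from typing import Dict, Optional


def pick_canonical(
    scores: Dict[str, float],
    declared: Optional[str] = None,
    margin: float = 0.05,
) -> str:
    """Choose a canonical URL from per-URL scores.

    The declared canonical is a hint, not a directive: it only decides
    the outcome when the top two scores are closer than `margin`.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_url, best_score = ranked[0]
    if len(ranked) > 1:
        _, second_score = ranked[1]
        too_close = best_score - second_score < margin
        if too_close and declared in scores:
            return declared
    return best_url


# Hypothetical scores for two language versions of the same page.
close_call = {"/en/page": 0.49, "/de/page": 0.51}
clear_win = {"/en/page": 0.10, "/de/page": 0.90}

print(pick_canonical(close_call))                       # /de/page
print(pick_canonical(close_call, declared="/en/page"))  # /en/page
print(pick_canonical(clear_win, declared="/en/page"))   # /de/page
```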

Preferences of the site or preferences of the users?

Despite these indications, however, Google can sometimes still make a different decision and replace the site’s preferred canonical page with one that is better for users. John Mueller also spoke about this in another video on YouTube.

This often happens with identical content in different languages: for example, if there is a canonical tag pointing to the English version of a page, but the user is in Germany, Google will show the German version of the page.

Canonicalization and unique content

The last aspect explored in this episode is how much unique content a page needs for Google to accept it as the canonical version. According to Splitt, even a small share of original content that does not exist on other pages can be enough.

However, “if the content is completely different or quite different for algorithms, so much so as to decide that it is not a duplicate, then the canonical is useless,” he concludes.