Quality content is a crucial component of any successful online marketing strategy, but the path is full of insidious obstacles. A particularly thorny issue for SEO is duplicate content: put simply, content that is repeated identically or in broadly similar form across multiple web pages, within the same site or on different sites. The practice can have deceptive intent, but it is usually the result of poor optimization or laziness, and it can worsen the ranking of the pages involved and, more generally, make that content harder to position. Here is everything you need to know about what duplicate content is, how to detect it, how to correct the problem and how to prevent it from reappearing.
What duplicate content is
Duplicate content is defined as identical or very similar blocks of text found on multiple URLs. This phenomenon can occur both within the same Web site and between different sites.
In other words, we are talking about content reproduced as an identical or very similar copy in multiple locations on the Web, within a single website or across multiple domains, and therefore about any content that lives at more than one web address or URL. More precisely, Google's documentation explains that the expression refers to “substantive blocks of content within or across domains that either completely match other content or are appreciably similar”, which can give rise to what we have called a serious and frequent SEO error.
Portions of text in different languages are not considered duplicate content, nor are quotations (even whole paragraphs) flagged as an error, especially if we use the semantic <cite> markup in the code.
This content can originate from various causes, including technical errors, mismanagement of URL parameters, or simply unauthorized copying of text from one site to another. We must in fact distinguish between different forms of duplicate content: internal duplicate content occurs when the same content appears on multiple pages within the same website, while external duplicate content, the same content appearing on multiple different websites, is often the result of unauthorized copying, a phenomenon known as SEO plagiarism.
Although it may seem harmless, duplicate content can have a serious impact on organic search rankings, negatively affecting the site's visibility.
Why duplicate content is a problem
While it does not technically trigger a penalty, duplicate content can still negatively affect search engine rankings.
More specifically, the algorithms of Google, Bing, Yahoo and the other engines are designed to provide users with relevant and unique results; when they detect identical or “noticeably similar” content on multiple pages, they can therefore struggle to determine which version is the most relevant to show in search results for a given query.
This can reduce the site's ranking and visibility, because it disperses SEO value across the different duplicate pages and weakens the overall optimization strategy. In the worst cases, Google may decide not to index the duplicate pages or may lower the ranking of the site as a whole, compromising its visibility in search results. The same happens with content cannibalization, which likewise presents crawlers with pages of similar or identical content and makes it hard to determine which page to show in search results.
Beyond this aspect, duplicate content is a problem worth solving because, essentially, it does not offer added value to the user's experience on the pages of the site, which should be the focal point of any published content. Thinking as users, would we regularly visit a site that features unoriginal articles, or would we go and read the original source of that information directly?
In addition to problems in organic search, duplicate content may also violate Google AdSense publisher policies, which can prevent a site from serving ads on pages with copied or scraped content: pages that copy and republish content from other sources without adding any original wording or intrinsic value; pages that copy content from others with slight modifications (rewriting it manually, replacing some terms with simple synonyms, or using automated techniques); or sites dedicated to embedding content such as videos, images or other media from other sources, again without adding substantial value for the user.
Using duplicate content analysis tools and anti-plagiarism software can help us quickly identify duplicate content and take steps to resolve it. In addition, following best practices for creating unique and quality content can serve to avoid duplicate content and ensure that our site maintains good digital health.
Duplicate content: types and examples
As we said, it is essential to understand the different types of duplicate content we may encounter so that we can be proactive and ready to adopt the right prevention and resolution strategies.
The main distinction relates to the “location” of the problematic content, which leads us to recognize two broad categories, internal duplicate content and external duplicate content: both can have a significant impact on SEO, but they manifest in different ways and require specific approaches to be handled effectively.
- Internal duplicate content
Internal duplicate content occurs when the same content appears on multiple pages within the same website. This can be caused by various technical and operational factors, such as the creation of multiple URLs for the same page when URL parameters are used to track sessions or apply filters, which generate different versions of the same page. Another example is the presence of print pages or AMP versions that replicate the content of the main pages.
Another frequent cause of internal duplicate content is the misuse of title tags and h1 tags: if multiple pages on the site use the same title or h1, search engines may have difficulty distinguishing between them, reducing SEO effectiveness; an audit like the one sketched below can quickly surface these cases.
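To make this concrete, here is a minimal Python sketch of such an audit. It assumes the third-party `requests` and `beautifulsoup4` packages are installed, and the URL list is a placeholder standing in for a sitemap or crawl export; pages sharing the same title or h1 are reported as candidates for review, not as proof of duplication.

```python
# Minimal duplicate title/h1 audit: flag pages of the same site that share
# the same <title> or <h1>, a common symptom of internal duplication.
# Assumes the third-party packages `requests` and `beautifulsoup4` are installed.
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

# Hypothetical URL list: in practice, feed it from a sitemap or a crawl export.
URLS = [
    "https://www.example.com/",
    "https://www.example.com/blog/post-a",
    "https://www.example.com/blog/post-a?sessionid=123",
]

def page_signals(url: str) -> tuple[str, str]:
    """Return the (title, h1) pair of a page, empty strings if missing."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    h1_tag = soup.find("h1")
    h1 = h1_tag.get_text(strip=True) if h1_tag else ""
    return title, h1

by_title, by_h1 = defaultdict(list), defaultdict(list)
for url in URLS:
    title, h1 = page_signals(url)
    by_title[title].append(url)
    by_h1[h1].append(url)

for label, groups in (("title", by_title), ("h1", by_h1)):
    for text, urls in groups.items():
        if text and len(urls) > 1:
            print(f"Duplicate {label} '{text}' on {len(urls)} URLs: {urls}")
```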
Another common case involves CMSs that automatically generate multiple versions of the same page for categories, tags or archives, leading to redundancy that confuses search engines. Automatically generated content also deserves a mention: for years, some websites have used software to mass-produce content from predefined templates, without much regard for the quality of the information or its actual usefulness to users. Google has declared war on such duplicate or unoriginal content (it is at the center, for example, of the “scaled content abuse” violation introduced with the March 2024 spam update), not to mention the rise of Artificial Intelligence applied to content creation, which is shaking up the industry and will lead to yet more changes.
In short: duplicate content within the same site, also called internal duplicate content, is identified at the level of the domain or host name. Its SEO damage is relatively limited and mainly concerns a possible worsening of the affected pages' chances of ranking well in SERPs, again because search engine crawlers struggle to determine which version should be preferred and shown to users as a relevant answer to their query.
- External duplicate content
External duplicate content occurs when the same content appears on several different websites.
One of the most common causes of external duplicate content is SEO plagiarism, which is the unauthorized copying of content from one site to another. This can occur either intentionally, when one site deliberately copies content from another, or unintentionally, when multiple sites use the same content provided by a third party, such as a content provider or affiliate.
Another example of external duplicate content is content syndication, which is the publication of the same content on multiple websites with the permission of the original content owner. Although syndication can be an effective strategy to increase the visibility of content, it is important to use the canonical tag or other methods to indicate to search engines which version of the content should be considered the main one.
The causes of duplicate content: accidental and intentional
Looking more specifically at the causes of duplicate content, we can recognize a wide range of critical situations. Duplicate content can indeed emerge for a variety of reasons, some of which are accidental, while others are intentional.
Among the examples of non-malicious and non-deceptive duplicate content, Google cites:
- Discussion forums, which can generate both regular and “abbreviated” pages associated with mobile devices.
- Items from an online store displayed or linked via multiple separate URLs.
- Print-only or PDF versions of web pages.
In any case, accidental duplication of content is very common and usually stems from technical errors or website configuration issues.
We have already mentioned the creation of multiple URLs for the same page: if a site generates different URLs for the same page depending on the parameters used, such as session tracking parameters or search filters, it will end up with different versions of the same page, all indexed by search engines, causing confusion and dispersion of SEO value. Another frequent cause of accidental duplication is the misuse of canonical tags, a powerful tool that can help consolidate SEO value on a single URL; if used incorrectly, however, it can lead to unintentional duplication. For example, if the canonical tag is not set correctly, search engines may not be able to determine which version of the page should be considered the main one.
Print versions of web pages can also cause accidental duplication: many sites offer print versions of their pages, which often replicate the content of the main page. If these print versions are not properly handled with canonical tags or the use of noindex meta tags, they can be indexed by search engines, causing duplicate content.
More numerous, and even more troublesome, are situations of intentional duplication: this phenomenon, which can obviously have even more serious consequences for SEO, occurs when one site deliberately copies content from another without permission, producing what is also called SEO plagiarism, which not only violates copyright but can also lead to penalties from search engines.
This includes content scraping, in which content is automatically extracted (“scraped”) from one site and republished on another: such illicit practices not only harm the original site, but can also lead to significant penalties for the site doing the scraping.
In general, Google's algorithms are capable of detecting these situations of cross-domain overlap, the repetition of an entire piece of content or a portion of it (e.g., a paragraph) on multiple different sites, especially when they result from Black Hat SEO tactics, that is, from a manipulative “attempt to control search engine rankings or acquire more traffic.” Google detects and seeks to punish such sites through ranking demotion or even removal from the index, because “deceptive practices like this can cause an unsatisfactory user experience” by showing visitors “the same repeated content over and over again in a set of search results.”
Beyond this diversity problem for users, external duplicate content also puts Googlebot in difficulty: faced with identical content at different URLs, it cannot immediately tell which is the original source and is forced to favor one page over the others, weighing elements such as indexing date, site authority, and so on.
Duplicate e-Commerce content, a thorny issue
Duplication is also very common on e-Commerce sites, especially when URL parameters and faceted navigation are mismanaged, creating multiple pages with identical content reachable at different addresses, all indexed by search engines, or when tags are used imprecisely and end up overlapping with category pages.
But this error can also result from other factors, such as republishing, unchanged, the product sheets provided by the original manufacturer of an item for sale. Although it may intuitively seem advantageous to reuse the standardized descriptions supplied by manufacturers or other retailers, this practice can hurt the site's search engine rankings.
Duplicate content: the technical causes
We have mentioned some of the elements that lead to internal or external duplicate content on sites, but it is worth listing more systematically the five unintended technical causes of the problem.
- URL variants
URL parameters, such as click tracking and some analytics codes, can cause duplicate content issues, as can session IDs that store a different identifier in the URL for each visitor or, again, printable versions (when several versions of the same page end up indexed).
The advice in this case is to avoid adding URL parameters or alternative URL versions wherever possible, for example by using scripts to pass along the information they would otherwise carry; a small normalization sketch follows below.
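As a rough illustration of that advice, the following Python sketch normalizes URLs by stripping a hypothetical list of tracking and session parameters; the parameter names and the example URL are assumptions to adapt to what a given site actually uses.

```python
# Sketch of URL normalization: strip tracking/session parameters so that each
# piece of content maps back to a single clean URL. The parameter list below is
# an illustrative assumption; adapt it to the parameters your site actually uses.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "sessionid", "gclid", "fbclid", "ref"}

def normalize(url: str) -> str:
    """Drop known tracking parameters (and the fragment) from a URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))  # fragment dropped as well

print(normalize("https://www.example.com/shoes?color=red&utm_source=newsletter&sessionid=42"))
# -> https://www.example.com/shoes?color=red
```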
- Separate versions of pages
You may run into a duplicate content problem if a site keeps separate versions with and without the www prefix, or if it has not completed the transition from HTTP to HTTPS and keeps both versions active and visible to search engines. Other separate versions include pages with and without the trailing slash, URLs that differ only in letter case, mobile-optimized URLs and AMP versions of pages; a quick check of the main variants is sketched below.
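A simple way to verify the protocol and host variants is sketched below in Python. It assumes the `requests` package and uses a placeholder domain, and it only looks at whether each variant answers with a 301 redirect pointing to the preferred version; a real audit would also cover trailing-slash, letter-case and mobile/AMP variants.

```python
# Rough check that the www/non-www and HTTP/HTTPS variants of the home page all
# answer with a 301 redirect to one preferred version. The domain is a placeholder.
import requests

PREFERRED = "https://www.example.com/"
VARIANTS = [
    "http://example.com/",
    "http://www.example.com/",
    "https://example.com/",
]

for url in VARIANTS:
    resp = requests.get(url, timeout=10, allow_redirects=False)
    location = resp.headers.get("Location", "")
    ok = resp.status_code == 301 and location.rstrip("/") == PREFERRED.rstrip("/")
    print(f"{url} -> {resp.status_code} {location or '(no redirect)'} "
          f"{'OK' if ok else 'CHECK'}")
```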
- Thin content
Thin content is content that is generally short and poorly developed, with no added value for users and no originality, and it can reproduce portions of the site already published at other URLs.
It also includes CMS archive pages, such as tag, author and date archives, and especially pagination pages (post-list archives beyond the first page) that are neither properly optimized nor blocked with a “noindex, follow” meta tag.
- Boilerplate content
Boilerplate content can also generate duplication: this is the text in the header, footer and sidebar, which on some sites can even be the predominant part of the on-page content. Since it appears on every URL, it can become a problem if not handled appropriately (for example, by varying it according to the section of the site the user is in); a rough way to estimate its share is sketched below.
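One rough way to estimate that share is to compare the visible text of two pages from the same site, as in the Python sketch below (placeholder URLs, `requests` and `beautifulsoup4` assumed installed): lines of text the two pages have in common are treated as boilerplate, everything else as page-specific content.

```python
# Rough estimate of the boilerplate share of a page: lines of visible text that
# two different pages of the same site have in common (header, footer, sidebar)
# versus text unique to each page. URLs are placeholders.
import requests
from bs4 import BeautifulSoup

def visible_lines(url: str) -> set[str]:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-visible elements
    return {line.strip() for line in soup.get_text("\n").splitlines()
            if len(line.strip()) > 20}  # ignore very short fragments

page_a = visible_lines("https://www.example.com/blog/post-a")
page_b = visible_lines("https://www.example.com/blog/post-b")
shared = page_a & page_b  # text both pages repeat, i.e. likely boilerplate

print(f"Boilerplate share of post-a: {len(shared) / max(len(page_a), 1):.0%}")
print(f"Boilerplate share of post-b: {len(shared) / max(len(page_b), 1):.0%}")
```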
- Scraped or copied content
This covers not only outright plagiarism (which violates copyright law and for which Google provides a specific procedure to request removal of the offending page from search results under the Digital Millennium Copyright Act, the US copyright statute), but every circumstance in which pages republish material that has been scraped or explicitly copied.
The copied material is primarily blog posts and editorial content, but also product information pages, whose contents end up in multiple locations on the Web.
Negative consequences of duplicate content
Duplicate content is a problem at various levels for all the actors on the Web, search engines, site owners and users alike, which already explains why it is important to correct these cases and prevent them from reappearing.
In detail, for search engines duplicate content presents three main problems:
- Inability to decide which versions to include or exclude from their indices.
- Indecision about whether to direct link metrics (trust, authority, anchor text, link equity and so on) to a single page or keep them split across multiple versions.
- Difficulty in deciding which version to rank for different query results.
For site owners, on the other hand, duplicate content can lead to worse rankings and traffic losses, which usually stem from two main problems that, in both cases, prevent the content from achieving the visibility in Search it might otherwise have had:
- A dilution of the visibility of each of the duplicate pages, because search engines rarely show multiple versions of the same content and are therefore forced to choose which version is most likely to be the best result.
- A further dilution of link equity, because other sites will also have to choose among the duplicates, so backlinks will not point to a single piece of content.
When duplicate content causes fluctuations in SERP rankings, the cannibalization problem occurs: Google cannot tell which page offers the most relevant content for the query and alternately tests the target URLs in search of the most suitable one.
Lastly, for users, duplicate content is not useful and offers no added value, since it is not unique.
Duplicate content and Google, the official position
The topic of duplicate content comes up often in official statements from Googlers, who have clarified how the algorithms interpret and evaluate situations of excessive similarity between web pages.
The first thing to keep in mind is that there is no real duplicate content penalty, of the kind reported with a notification in Google Search Console, nor is duplication in itself a negative ranking factor.
If the problem is with the same site, in particular, Google can also use a soft approach: that is, it may be normal for a site to have the same content repeated on several of its own pages, and therefore, a certain amount of duplicate content is acceptable. The search engine’s algorithms themselves are trained to handle this frequent occurrence, which often simply results in only one piece of content being selected for ranking (the canonical one or at any rate the one most suitable in Google’s opinion) and the others not being displayed, but without sending negative ranking signals for the entire domain.
There is another critical aspect, however, because duplicate pages can overinflate a site and consume Google’s crawl budget, taking the crawler’s attention away from more useful and profitable pages.
And then there is the other case, i.e., a site caught publishing the same or similar content taken from other sites, which can indeed have negative SEO consequences: when Google's algorithm encounters the same content on multiple sites, it quickly decides which page to rank and which to demote or even hide. And Google does not necessarily choose well: it is not uncommon for the page that ranks to be the one that copied the content from another site, while the original does not get the visibility it deserves.
This is why, from an SEO perspective, it is crucial not only to avoid publishing duplicate content, but also to check that no other sites are drawing excessive “inspiration” from our pages, so to speak, so as to protect ourselves from any problems.
How to check for duplicate content
To check for internal duplicate content on our site, we have various automated tools or manual techniques for analyzing duplicate content.
Remaining within our suite, we can launch a scan with the SEO Spider, which will highlight pages that share the same title tag, description or heading (a potential indicator of the problem) and signal whether we have correctly set a canonical. From the same scan we can also view the list of site URLs and analyze them to verify that no problematic parameters are in use.
More complex, however, is the search for duplicate content external to the site: in this case, you can rely on specific tools, such as Copyscape, or launch manual Google searches.
Using specialized tools for duplicate content analysis is one of the most effective ways to quickly identify any problems: these tools are designed to scan the website and compare content with content on other pages on the web, providing a detailed analysis of duplication.
Manual verification requires more time and effort, but it can offer a level of accuracy and control that automated tools may not provide; therefore, this method can be particularly useful for identifying duplications that might escape the tools or for checking the accuracy of the results obtained. The most common and effective techniques for manual verification are:
- Google search. One of the simplest and most straightforward techniques is to copy a portion of text, a sufficiently long sentence or a paragraph of the “offending” content (or content we think may have been copied), and paste it into Google's search bar in quotation marks. This method, which takes advantage of one of the advanced search operators, lets us see whether other web pages with the same content are indexed by Google: if the SERP returns results with identical or very similar content, that content is probably duplicated on other sites (the sketch after this list builds this kind of query automatically).
- Checking title and h1 tags. We can also manually check that each page on the site has unique title and h1 tags, either by viewing the source code of the pages or by using browser inspection tools.
- URL parameter analysis. This checks whether the site generates multiple URLs for the same page because of tracking parameters or filters: we can examine server logs or use web analytics tools.
- Checking print and AMP versions. We need to verify that print versions and any AMP pages on the site are properly handled with canonical tags or noindex meta tags: this can be done by opening these page versions and looking for the appropriate tags in the source code (the same sketch after this list also reports the canonical and robots tags a page declares).
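The following Python sketch ties the quoted-search and tag checks together: it builds the exact-match Google search URL for a text snippet, then fetches a page and reports the canonical URL and robots directives it declares. The snippet, the URLs and the `requests`/`beautifulsoup4` dependencies are assumptions for illustration.

```python
# Companion sketch for the manual checks above: build the exact-match Google
# search URL for a text snippet, then fetch a page and report the canonical URL
# and robots directives it declares. URLs and the snippet are placeholders.
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

def exact_match_search_url(snippet: str) -> str:
    """Quoted Google search, useful to spot copies of a sentence elsewhere."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')

def canonical_and_robots(url: str) -> tuple[str, str]:
    """Return (canonical href, robots meta content) declared by the page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    canonical = ""
    for link in soup.find_all("link"):
        if "canonical" in (link.get("rel") or []):
            canonical = link.get("href", "")
    robots = soup.find("meta", attrs={"name": "robots"})
    return canonical, (robots.get("content", "") if robots else "")

print(exact_match_search_url("a sufficiently long sentence from our article"))
canonical, robots = canonical_and_robots("https://www.example.com/print/article-123")
print(f"canonical: {canonical or '(none)'} | robots: {robots or '(none)'}")
```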
How to troubleshoot duplicate content
At a general level, troubleshooting duplicate content comes down to one goal: specifying which of the duplicates is the “correct” one. There are therefore specific interventions that help avoid internal duplicate content, and more generally it helps to adopt the habit of always telling Google and the other search engines which version of a page is preferred over its possible duplicates.
Specifically, Google also recommends a number of steps to take to “proactively address duplicate content issues and be sure that visitors view the content intended for them.”
- Use the canonical tag to specify the official version of the page and tell Google to treat any variants it finds while crawling the site as duplicates of it (though, beware, Google may also choose a different canonical page than the one we set). As we know, this HTML element tells search engines which version of a page should be considered the main one, consolidating SEO value on a single URL: for example, if we have different versions of a page because of URL parameters or print versions, we can use rel=canonical to tell search engines which version should be indexed (the verification sketch after this list checks both canonical annotations and the redirects described below).
- In some cases, the best solution to solve duplicate content problems is to remove or edit the problematic text: if we find duplicate content within our site, we should then analyze it and consider removing the less strategic pages or editing them to make them unique.
- More specifically, another effective strategy for solving duplicate content problems is to work on content consolidation, which can include total rewriting, adding new information, or reorganization. Consolidating content into a single, more comprehensive and informative page can help improve the quality of information, avoid dispersion of SEO value, and also improve the user experience by offering more complete and relevant information in one place.
- If a piece of content needs to be removed, use a 301 redirect from the “duplicate” page to the original content page (for example in the .htaccess file on Apache servers) to redirect users, Googlebot and other spiders intelligently. When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with each other, but also gain greater relevance and a stronger overall popularity signal. This will have a positive impact on the ability of the “correct” page to rank well.
- Manage title tags and h1 tags correctly: each page on the site should have unique and descriptive title and h1 tags, so as to avoid search engine confusion and improve site visibility.
- Maintain consistency with internal links as well.
- Use top-level domains, where possible, to help Google serve the most appropriate version of a document (for example, country-specific content on country-code domains).
- Pay attention to the dissemination of content on other sites, including in cases of syndicated distribution (possibly use or ask to use the noindex tag to prevent search engines from indexing the duplicate version of content).
- Minimize the repetition of boilerplate text.
- Use the parameter management tool in Search Console to indicate how we would like Google to handle URL parameters (note that Google retired this tool in 2022, leaving parameter handling to its automatic detection and to the other measures in this list).
- Avoid publishing incomplete pages, such as those for which we do not yet have actual content (placeholder pages, for which we can possibly use the noindex tag to block them and prevent them from being indexed).
- Familiarize yourself with the content management system and how it displays content: for example, a blog entry may appear with the same label on a blog home page, on an archive page, and on a page of other entries.
- Minimize similar content, possibly by expanding pages that are too similar or consolidating them all on one page. For example, says the guide, “If your travel site contains separate pages for two cities but the information is the same on both pages, you could merge the two pages into one page covering both cities or expand each one so that it presents unique content about each city.”
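Once the canonical tags and redirects from the list above are in place, a sketch like the following can verify them: for each duplicate URL in a hypothetical mapping, it checks whether the page 301-redirects to the chosen version or declares it via rel="canonical" (again assuming the `requests` and `beautifulsoup4` packages).

```python
# Sketch for verifying remediation: for each duplicate URL, check that it either
# 301-redirects to the chosen page or declares it via rel="canonical".
# The mapping below is a hypothetical example.
import requests
from bs4 import BeautifulSoup

# duplicate URL -> URL we want search engines to treat as the official version
EXPECTED = {
    "https://www.example.com/product?ref=homepage": "https://www.example.com/product",
    "https://www.example.com/print/product": "https://www.example.com/product",
}

for duplicate, target in EXPECTED.items():
    resp = requests.get(duplicate, timeout=10, allow_redirects=False)
    if resp.status_code in (301, 308):
        ok = resp.headers.get("Location", "").rstrip("/") == target.rstrip("/")
        print(f"{duplicate}: redirect to {resp.headers.get('Location')} "
              f"{'OK' if ok else 'WRONG TARGET'}")
        continue
    # No redirect: fall back to checking the declared canonical.
    soup = BeautifulSoup(resp.text, "html.parser")
    canonical = next((link.get("href", "") for link in soup.find_all("link")
                      if "canonical" in (link.get("rel") or [])), "")
    ok = canonical.rstrip("/") == target.rstrip("/")
    print(f"{duplicate}: canonical={canonical or '(none)'} {'OK' if ok else 'CHECK'}")
```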
Best practices for avoiding duplication
Creating unique, high-quality content is the first line of defense against duplicate content. However, there are some best practices we can follow to avoid duplication and ensure that each page on the site offers added value and is distinct from the others.
- Thorough research. Before creating new content, conducting thorough research helps us make sure that the topic has not already been exhaustively covered. If we find similar content, we can consider updating or expanding it rather than creating new pages.
- Originality. Each piece of content should be original and not copied from other sources. We use a unique voice and personal writing style to stand out. If we want to cite information from other sources, it should be done properly, including with an appropriate link, adding a comment or analysis to enrich the content.
- Clear structure. It is helpful to organize content clearly and logically, using well-defined headings, subheadings, and paragraphs. This not only improves the user experience, but also helps search engines better understand the page content.
- Continuous monitoring. Using monitoring and analysis tools can help us quickly identify any problems and take timely action.
- Regular updates. It is never wrong to optimize content to keep it current and relevant; outdated content can be a source of unintentional duplicates.
- Proper use of tags. Each page should have unique and descriptive headings. Never use the same title and h1 tags on multiple pages, as this can cause search engine confusion and reduce SEO effectiveness.
Is it possible to quantify a duplication threshold?
The question that often preoccupies SEOs and copywriters concerns the duplication “threshold” that separates content considered new from content flagged as copied.
Although many tools analyze pages and indicate a more or less well-founded limit based on their scans, and many SEO experts consider content to go from similar to duplicate when the shared parts exceed 30 percent of the total copy, Google has reiterated several times that there is no benchmark value and that each case stands on its own; the sketch below shows one common way such overlap is estimated.
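For context on where such percentages come from, here is a minimal Python sketch of one common estimation method, word-shingle overlap expressed as a Jaccard ratio. The example texts are invented, and nothing suggests Google uses this exact measure or any fixed threshold.

```python
# One common way overlap percentages are estimated: word "shingles" (here 5-word
# sequences) shared between two texts, expressed as a Jaccard similarity. The 30%
# figure circulating among SEOs is a rule of thumb applied to measures like this,
# not a threshold Google has confirmed.
import re

def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def overlap_ratio(text_a: str, text_b: str) -> float:
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / max(len(a | b), 1)

original = "Duplicate content is content that appears at more than one URL on the web."
rewrite = "Duplicate content is content that appears at more than one address on the web."
print(f"Estimated overlap: {overlap_ratio(original, rewrite):.0%}")
```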
The other major search engines take the same line: they do not define exactly what, or how much, constitutes duplicate content, partly because they could never concretely cover every case and situation.
As we said above, therefore, the only solution is to work on content more carefully, making the text on the page as original as possible to stand out from competitors and offer useful, interesting information to users, also moving in the direction indicated by Google with the Helpful Content algorithmic system.