Indexing is one of the most important technical steps, the gateway to the Search system: if it is not handled correctly, traffic can drop drastically or disappear altogether. Let’s talk about indexing and, in particular, the most frequent errors and problems that prevent a site’s pages from appearing in Google Search.
What indexing is and why it is important on Google
As we know, Google keeps an “index” of all the Web pages it discovers while crawling, a real list of all the resources it has encountered and deemed suitable for inclusion in the Search system.
It is important to remember – and Google says so openly – that not all the pages Googlebot manages to find are then actually indexed, so we should not neglect to periodically monitor the crawl status of our site’s main resources, to avoid having pages that are precious to us left out of consideration.
To use an analogy, through indexing it is as if Google built a library, made not of books but of sites and Web pages.
Indexing is a prerequisite for getting organic traffic from Google: if we want our pages to actually be displayed in Search, they must first be indexed correctly – that is, Google must find and save these pages, inserting them into its Index, and then analyze their content and decide which queries they might be relevant for – and the more pages of the site that make it into this index, the greater the chances of appearing in search results.
This explains why it is crucial to know whether Google can index your content and to check whether the site is properly indexed, using tools such as Google Search Console, whose Index Coverage report also provides useful information on the specific problem that prevented a page from being listed.
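For spot checks outside the web interface, the Search Console URL Inspection API can report the coverage state of a single page programmatically. The sketch below is a minimal, hypothetical example assuming the google-api-python-client library and a service account that has been granted access to the Search Console property; the key file name and URLs are placeholders.

```python
# Minimal, hypothetical sketch: asking the Search Console URL Inspection API
# for the index coverage state of one page. Requires google-api-python-client
# and a service account added to the Search Console property (assumptions).
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)            # placeholder key file

service = build("searchconsole", "v1", credentials=creds)

response = service.urlInspection().index().inspect(body={
    "siteUrl": "https://www.example.com/",             # verified property
    "inspectionUrl": "https://www.example.com/some-page/",
}).execute()

status = response["inspectionResult"]["indexStatusResult"]
print(status.get("coverageState"))    # e.g. "Submitted and indexed"
print(status.get("robotsTxtState"), status.get("indexingState"))
```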
The study on the most frequent indexing errors
Usually, the main causes that prevent indexing are server errors, 404 statuses and the presence of pages with thin or duplicated content.
But Tomek Rudzki went further: as he explains in an article published on Search Engine Journal, he analyzed and identified the most common indexing problems that prevent pages from being displayed in Google Search.
Thanks to his experience and his daily work on the technical optimization of sites to make them more visible on Google, he has “access to several dozen sites in the Google Search Console”; to obtain reliable statistics he then began by building a sample of pages, combining data from two sources: client sites already available to him and anonymous data shared by other SEO professionals, recruited through a survey on Twitter and direct contacts.
Work methodology
Rudzki describes the preliminary process used to obtain valid information, and in particular how he excluded the data of pages deliberately left out of indexing – old URLs, articles that are no longer relevant, filter parameters in e-commerce and more – through the various means available, “including the robots.txt file and the noindex tag”.
So, the expert “removed from the sample the pages that met one of the following criteria”:
- Blocked by robots.txt.
- Marked as noindex.
- Returning an HTTP status code other than 200.
In addition, to further improve the quality of the sample, only the pages included in sitemaps were considered, since these are “the clearest representation of valuable URLs from a given website”, while being aware that “there are many websites that contain junk in their sitemaps, and some that even include the same URLs in their sitemaps and robots.txt files”.
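To make the methodology concrete, here is a rough sketch (not Rudzki’s actual tooling) of the same kind of filtering: it keeps only sitemap URLs that are not blocked by robots.txt, carry no noindex signal and return HTTP 200. The domain, sitemap path and the naive meta-tag check are illustrative assumptions.

```python
# Rough sketch (not the study's actual tooling): build a sample of sitemap
# URLs that are crawlable, not noindexed and answer HTTP 200.
# Domain, sitemap path and the naive noindex check are placeholders.
import urllib.robotparser
import xml.etree.ElementTree as ET

import requests

SITE = "https://www.example.com"

robots = urllib.robotparser.RobotFileParser(f"{SITE}/robots.txt")
robots.read()

# Collect candidate URLs from the XML sitemap.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap = ET.fromstring(requests.get(f"{SITE}/sitemap.xml", timeout=10).content)
urls = [loc.text for loc in sitemap.findall(".//sm:loc", ns)]

sample = []
for url in urls:
    if not robots.can_fetch("Googlebot", url):
        continue                                    # blocked by robots.txt
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        continue                                    # non-200 status code
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        continue                                    # noindex via HTTP header
    if 'content="noindex' in resp.text.lower():
        continue                                    # noindex meta tag (naive)
    sample.append(url)

print(f"{len(sample)} indexable URLs kept out of {len(urls)}")
```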
Indexing issues also depend on the site’s size
Thanks to this sampling, Rudzki discovered that “the most widespread indexing problems vary according to the size of a website”. For his survey, he divided the data into 4 size categories:
- Small websites (up to 10,000 pages).
- Average websites (10,000 to 100,000 pages).
- Large websites (100,000 to 1 million pages).
- Huge websites (over 1 million pages).
Due to differences in the size of the sampled sites, the author sought a way to normalize the data, because “a particular problem encountered by a huge site may carry more weight than the problems of other, smaller sites”. It was therefore necessary to examine “each site individually to sort the indexing problems with which it is struggling”, and then assign “points to indexing problems based on the number of pages affected by a given problem on a given site”.
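As an illustration of this weighting idea, the toy script below scores each issue by the share of a site’s sampled pages it affects, so that one huge site cannot dominate the totals; the site names and figures are invented and do not come from the study.

```python
# Toy illustration of per-site normalization: each issue is scored by the
# fraction of that site's sampled pages it affects, so a huge site cannot
# dominate the totals. Site names and numbers are invented.
from collections import defaultdict

sites = {
    "site-a.example": {"total_pages": 9_000,
                       "issues": {"Crawled - currently not indexed": 1_800,
                                  "Duplicate content": 450}},
    "site-b.example": {"total_pages": 1_200_000,
                       "issues": {"Discovered - currently not indexed": 300_000,
                                  "Soft 404": 24_000}},
}

scores = defaultdict(float)
for site in sites.values():
    for issue, affected in site["issues"].items():
        scores[issue] += affected / site["total_pages"]

for issue, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{issue}: {score:.2f}")
```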
The 5 main indexing problems of sites
This meticulous work allowed him to identify the top 5 indexing problems encountered on websites of all sizes:
- Crawled – currently not indexed (quality problem).
- Duplicate content.
- Discovered – currently not indexed (crawl budget/quality problem).
- Soft 404.
- Crawl issue.
Quality issues
Quality issues include pages with sparse, misleading or excessively biased content: if a page “does not provide unique and valuable content that Google wants to show users, you will have difficulty indexing it (and you should not be surprised)”.
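A word count is a very crude proxy for quality, but a quick heuristic like the following can help flag candidate thin pages for manual review; the 300-word threshold, the URL and the use of the beautifulsoup4 library are assumptions of this sketch, not part of the study.

```python
# Crude heuristic for flagging potentially thin pages: strip boilerplate
# elements and count the remaining words. The 300-word threshold and the URL
# are arbitrary; real quality assessment is far more nuanced.
import requests
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

THIN_THRESHOLD = 300

def looks_thin(url: str) -> bool:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()                        # drop non-content elements
    words = soup.get_text(separator=" ").split()
    return len(words) < THIN_THRESHOLD

print(looks_thin("https://www.example.com/some-page/"))
```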
Issues with duplicate content
Google may recognize some pages as duplicate content, even if this was not intentional.
A common problem is canonical tags pointing to different pages, with the result that the original page is not indexed; if there is duplicate content, “use rel=canonical or a 301 redirect” to ensure that “the pages of your own site are not competing with each other for views, clicks and links”.
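A small, hypothetical check like the one below can reveal pages whose rel=canonical points somewhere other than the page itself, which is one of the ways duplicates end up excluded from the index; the URL is a placeholder and beautifulsoup4 is assumed.

```python
# Hypothetical check: does the page's rel=canonical point to a different URL?
# If so, Google is being told to index that other page instead.
# The URL is a placeholder; beautifulsoup4 is assumed.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def canonical_mismatch(url: str):
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    link = soup.find("link", rel="canonical")
    if link is None or not link.get("href"):
        return None                                # no canonical declared
    canonical = urljoin(url, link["href"]).rstrip("/")
    return canonical if canonical != url.rstrip("/") else None

other = canonical_mismatch("https://www.example.com/category/page/")
if other:
    print(f"rel=canonical points elsewhere: {other}")
```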
Problems with the crawl budget
As we know, Google allocates only a share of time to crawling each site, which is called the crawl budget: based on several factors, Googlebot will only crawl a certain number of URLs on each website. This means that optimization is vital, because we must not let the bot waste its time on pages that do not interest us and are not useful for our purposes.
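One practical way to see where crawl budget actually goes is to count Googlebot requests per URL in the server access logs. The sketch below assumes an nginx/Apache combined log format and a simple user-agent match (a rigorous check would also verify Googlebot via reverse DNS); the log path is a placeholder.

```python
# Sketch: count Googlebot requests per path from a combined-format access log
# to see which URLs consume the crawl budget. The log path is a placeholder,
# and a rigorous version would verify Googlebot with a reverse DNS lookup.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3}')

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="ignore") as fh:
    for line in fh:
        if "Googlebot" not in line:                # crude user-agent filter
            continue
        match = LOG_LINE.search(line)
        if match:
            hits[match.group("path")] += 1

for path, count in hits.most_common(20):
    print(f"{count:6d}  {path}")
```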
Soft 404 problems
404 errors indicate that “you have submitted a deleted or non-existent page for indexing”. Soft 404s display “not found” information but do not return the HTTP 404 status code from the server.
Redirecting removed pages to other, irrelevant ones is a common error, and even multiple redirects can show up as soft 404 errors: it is therefore important to shorten redirect chains as much as possible.
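A rough check for the two symptoms just described might look like the sketch below: it flags pages that answer 200 while showing “not found” wording, and counts redirect hops. The phrases and URL are placeholders and the text match is deliberately naive.

```python
# Naive sketch for the two symptoms above: a 200 response that reads like a
# "not found" page, and long redirect chains. Phrases and URL are placeholders.
import requests

NOT_FOUND_PHRASES = ("page not found", "no longer available", "nothing here")

def inspect(url: str) -> dict:
    resp = requests.get(url, timeout=10, allow_redirects=True)
    return {
        "url": url,
        "status": resp.status_code,
        "redirect_hops": len(resp.history),        # how long the chain was
        "possible_soft_404": resp.status_code == 200
            and any(p in resp.text.lower() for p in NOT_FOUND_PHRASES),
    }

print(inspect("https://www.example.com/old-product/"))
```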
Crawl issues
Lastly, there are many crawl problems, but probably the most important are issues with robots.txt: if Googlebot “finds a robots.txt file for your site but fails to access it, it will not crawl the site at all”.
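A quick way to keep an eye on this is to watch how robots.txt responds: a 200 or a 404 is workable, while a 5xx answer (or a timeout) can make Googlebot pause crawling of the whole site. The following sketch assumes the requests library; the domain is a placeholder.

```python
# Sketch: check how robots.txt answers. A 200 or 404 is fine; a 5xx response
# or a timeout can make Googlebot stop crawling. Domain is a placeholder.
import requests

def robots_txt_health(site: str) -> str:
    try:
        resp = requests.get(f"{site}/robots.txt", timeout=10)
    except requests.RequestException as exc:
        return f"unreachable ({exc.__class__.__name__}) - crawling may stall"
    if resp.status_code == 200:
        return "robots.txt served normally"
    if resp.status_code == 404:
        return "no robots.txt - the site is treated as fully crawlable"
    if resp.status_code >= 500:
        return f"{resp.status_code} on robots.txt - Googlebot may pause crawling"
    return f"unexpected status {resp.status_code}"

print(robots_txt_health("https://www.example.com"))
```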
Indexing, the main issues based on different site sizes
After highlighting the main difficulties in general terms, the author also broke down the causes according to the size of the site under consideration.
- Small websites (sample of 44 cases)
  - Crawled – currently not indexed (quality or crawl budget problem).
  - Duplicate content.
  - Crawl budget issue.
  - Soft 404.
  - Crawl issue.
- Average websites (8 cases)
  - Duplicate content.
  - Discovered – currently not indexed (crawl budget/quality problem).
  - Crawled – currently not indexed (quality problem).
  - Soft 404 (quality problem).
  - Crawl issue.
- Large websites (9 sites)
  - Crawled – currently not indexed (quality problem).
  - Discovered – currently not indexed (crawl budget/quality problem).
  - Duplicate content.
  - Soft 404.
  - Crawl issue.
- Huge websites (9 sites)
  - Crawled – currently not indexed (quality problem).
  - Discovered – currently not indexed (crawl budget/quality problem).
  - Duplicate content (duplicate, submitted URL not selected as canonical).
  - Soft 404.
  - Crawl issue.
Considerations on common indexing issues
It is interesting to note that, according to these results, two categories of websites of different sizes – large and huge – suffer from the same problems: this “shows how difficult it is to maintain quality in the case of large sites”.
The other highlights of the study:
- Even relatively small websites (over 10,000 pages) may not be fully indexed due to an insufficient crawl budget.
- The larger the website, the more urgent its crawl budget and quality problems become.
- The problem of duplicate content is serious, but its weight changes depending on the size of the site.
Orphan pages and URLs not known by Google
In the course of the research, Tomek Rudzki noted that “there is another common problem that prevents the indexing of pages”, even if it does not reach the same quantitative impact as those described. These are orphan pages, meaning pages that are not linked from any other resource on the site: if Google does not have a “path to find a page through your website, it may not find it at all”.
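One hedged way to hunt for orphan pages is to compare the sitemap against a shallow crawl of internal links: anything in the sitemap that the crawl never reaches is a candidate orphan. The sketch below assumes beautifulsoup4 and a sitemap at /sitemap.xml; the domain and the 500-request crawl limit are arbitrary placeholders.

```python
# Sketch: compare sitemap URLs with a shallow crawl of internal links starting
# from the homepage; sitemap URLs never reached are candidate orphan pages.
# Domain, sitemap path and the 500-request limit are placeholders.
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

SITE = "https://www.example.com"
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap = ET.fromstring(requests.get(f"{SITE}/sitemap.xml", timeout=10).content)
sitemap_urls = {loc.text.rstrip("/") for loc in sitemap.findall(".//sm:loc", ns)}

seen, queue = set(), [SITE]
for _ in range(500):                                # crude crawl limit
    if not queue:
        break
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    except requests.RequestException:
        continue
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0].rstrip("/")
        if urlparse(link).netloc == urlparse(SITE).netloc and link not in seen:
            queue.append(link)

orphans = sitemap_urls - seen
print(f"{len(orphans)} sitemap URLs never reached through internal links")
```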
The solution is quite simple: add links from related pages or include the orphan page in the sitemap. Despite this, “many webmasters still neglect to do so” and needlessly expose the site to indexing problems, the author concludes.