Robots meta tags: instructions for communicating with crawlers

SEO admin 27 February 2024

In the battle for online visibility we cannot neglect any detail, and it is essential to understand how search engines interact with the content we publish. This is where robots meta tags come into play: often unfairly underestimated, they are specific instructions intended for search engine robots. These short snippets of HTML code act as road signs for crawlers, telling them how to navigate and interpret the pages of our site, so understanding what they do and how to use them correctly is crucial for effective SEO.

What Robots Meta Tags are

A robots meta tag is a snippet of code placed in the <head> section of an HTML page that lets us communicate directly with search engine crawlers, giving them specific instructions on how to crawl and index our web pages.

Robots meta tags are therefore a very specific type of HTML meta tag: page-level and text-level settings used to control granularly the behavior of Googlebot and the other automated search engine bots that explore the Web to index content.

What these instructions are for

In detail, they tell crawlers how a given page should be treated: whether it should be indexed, whether the links on it should be followed, and other directives that affect SEO. Proper use of robots meta tags thus allows us to adjust the way Google presents our content in search results.

For example, they can be used to prevent duplicate content, exclude non-essential or private pages from the index, or optimize the use of crawling resources, ensuring that search engines focus on the most relevant pages.

It is important to note that these settings can only be read and followed if crawlers are allowed access to the pages that include them.

According to Google's guidelines for developers, robots meta tags allow us to “use a granular and specific page approach”, and in particular to “control how a single page should be indexed and provided to users among Google search results”.

With this tool it is possible to tell Google which resources should not be considered for indexing and ranking, because they serve no purpose for users or are published only for service reasons. For example, we can use one of these tags to request that test pages or restricted areas be excluded from indexing, prevent search engines from following links to low-quality websites, or even protect our privacy by preventing caching.

For some time now, webmasters have also been able to use these commands to “control the indexing and publishing” of Google snippets, that is, the brief text extracts shown on the SERP that serve to “prove the relevance of a document to the user’s query”.

The correct syntax for instructions

From a technical and formal point of view, robots meta tags follow the same syntactic rules as broader HTML meta tags.

This means first and foremost that they must be placed in the header of the page, between the <head> and </head> tags, because otherwise they cannot be read properly. Incorrect placement would not only render these directives ineffective, but could even harm the visibility of the site.

In addition, these snippets can contain generic instructions aimed at all search engine crawlers, or target specific user-agents such as Googlebot. Multiple directives can be used on a single page by separating them with commas, as long as they are aimed at the same robot; if we are using different commands for different user-agents, we need a separate tag for each bot. The rules are not case sensitive, so uppercase and lowercase are treated alike.

Going into detail, a robots meta tag consists of two parts: the name attribute (for example name="robots") identifies the type of meta tag and the user-agent it addresses, while the content attribute specifies the directives for search engine crawlers and indicates what their behavior should be.

For example,

<meta name="googlebot" content="noindex, nofollow">

tells Googlebot alone not to index the page and not to follow any of the links on it.

Thus, this page will not appear in the SERPs and will not pass value to the pages it links to; this command can be useful, for example, for a thank-you page.
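To make the two-part structure concrete, here is a minimal Python sketch of how a checking tool might read these tags from a page. It uses only the standard library; the names RobotsMetaParser and read_robots_meta are hypothetical, chosen for illustration, and real crawlers apply more elaborate parsing rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects robots directives per user-agent name (hypothetical helper)."""

    def __init__(self, names=("robots", "googlebot")):
        super().__init__()
        self.names = names
        self.directives = {}  # e.g. {"robots": {"noindex", "nofollow"}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        name = attr_map.get("name", "").strip().lower()
        if name not in self.names:
            return
        # Directives are comma-separated and not case sensitive.
        content = attr_map.get("content", "")
        values = {part.strip().lower() for part in content.split(",") if part.strip()}
        self.directives.setdefault(name, set()).update(values)


def read_robots_meta(html):
    """Return a dict mapping each user-agent name to its set of directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="googlebot" content="NOINDEX, nofollow"></head></html>'
print(read_robots_meta(page))
```

The name attribute becomes the dictionary key and the content attribute becomes the set of lowercase directives, which mirrors the split described above.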

All the Robots Meta Tags

There are different types of meta tag robots, each with a specific function.

Referring to the official list of rules followed by Google (but not necessarily valid for other search engines), we have:

all – This is the default, for pages with no limitations for indexing and publishing. In practice, this rule has no effect if it is included explicitly.
noindex – This is the command to keep the page, media file or other resource out of search results.
nofollow – This is the command to not follow the links on the page. However, Google now reads these instructions as a suggestion, not a directive.
none – Equivalent to noindex and nofollow together. It is the opposite of all.
noarchive – Prevents Google from showing a “Cached” link in search results. This rule is being phased out now that Google has retired its cache.
nositelinkssearchbox – Prevents Google from showing the sitelinks search box for the website in search results.
nosnippet – Prevents a text snippet or video preview from being shown in SERPs. It applies to every form of search result, including classic Search, Google Images and Discover. However, a static image thumbnail may still be visible (if available) when Google judges that it improves the user experience.
indexifembedded – Tells Google that it may index the content of a page embedded in another page via an iframe or similar HTML tag, despite the presence of a noindex rule. indexifembedded is therefore a specific exception to noindex and has effect only when it accompanies the noindex rule.
max-snippet:[number] – Sets the maximum number of characters that Google can show in a text snippet for this search result, without affecting image or video previews. The rule is ignored unless a parsable [number] is specified. Two values have a special meaning:

0 corresponds to nosnippet and prevents any snippet from being shown.

-1 indicates that there is no length limit for the snippet.

max-image-preview:[setting] – Sets the maximum size of an image preview in SERPs. Three values are accepted:

none prevents any image preview;

standard allows a default-size preview;

large allows a larger preview, up to the width of the visible area.

max-video-preview:[number] – Sets the maximum number of seconds of a video that can be used for a video snippet in SERPs. If it is not specified, Google decides the duration of any video snippet shown in search results. The rule is ignored if the [number] value is not parsable. Two values have a special meaning:

0: at most a static image can be used, in compliance with the max-image-preview setting;

-1: no limit.

notranslate – Prevents translation of the page in search results. If the rule is not specified, Google may provide a translation of the title link and snippet for results that are not in the language of the search query; if the user clicks on the translated title link, all subsequent interactions with the page go through Google Translate.
noimageindex – Blocks indexing of images.
unavailable_after: [date/time] – Sets an “expiration date” for a page, which after the specified date and time will no longer be shown in SERPs. By default, content has no expiration date and therefore Google can show resources in its SERPs indefinitely. The date and time must be specified in a widely adopted format, such as RFC 822, RFC 850 or ISO 8601, otherwise the rule is ignored.
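The special values of max-snippet described in the list can be summarized in a short Python sketch. It is an illustration of the rules as stated here, not Google's implementation: the function name snippet_limit is hypothetical, and ignoring negative numbers other than -1 is an assumption, since the list does not say how they are handled.

```python
def snippet_limit(content):
    """Return the text snippet limit implied by a robots `content` value.

    None -> no explicit limit (rule absent, unparsable, or -1)
    0    -> no snippet at all (max-snippet:0 or nosnippet)
    n    -> at most n characters
    """
    limit = None
    for raw in content.split(","):
        directive = raw.strip().lower()
        if directive == "nosnippet":
            return 0  # the most restrictive rule wins
        if directive.startswith("max-snippet:"):
            value = directive.split(":", 1)[1].strip()
            try:
                number = int(value)
            except ValueError:
                continue  # unparsable number: the rule is ignored
            if number == 0:
                return 0  # same effect as nosnippet
            if number > 0:
                limit = number
            elif number == -1:
                limit = None  # explicitly unlimited
            # Assumption: any other negative value is ignored.
    return limit


print(snippet_limit("max-snippet:50"))             # 50
print(snippet_limit("max-snippet:-1"))             # None
print(snippet_limit("max-snippet:50, nosnippet"))  # 0
```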

Then there are two additional commands, “index” and “follow”, which are essentially implicit: they tell search engines to add the page to the index (index) and to follow the links on the page (follow). Specifying these values is generally redundant, since search engine crawlers operate under the presumption that they can index the page and follow the links unless instructions to the contrary are provided.

As is evident, index and follow tags are the exact opposite of noindex or nofollow, directives that deviate from standard crawler behavior. Therefore, it is not necessary to include the index and follow tags unless you want to override a previous directive that specified noindex or nofollow.

How to write and insert robots meta tags correctly

Having clarified the theoretical framework, we can move on to some practical guidelines for writing and placing robots meta tags effectively, so that search engines move through our website the way we want.

The first rule, as widely stated, is that robots meta tags should be placed within the <head> tag of an HTML page. This is the first place that search engine crawlers look for guidance on how to treat the page.

If we adhere to the correct syntax, the <head> section with the robots meta tag included will look like this:

<!DOCTYPE html>
<html>
<head>
    <title>Title of the Page</title>
    <meta name="robots" content="noindex, nofollow">
    <!-- Other meta tags and resources such as CSS and JavaScript -->
</head>
<body>
    <!-- Content of the page -->
</body>
</html>

In this case, we told all robots not to index the page and not to follow the links on it.

Other best practices to ensure that robots meta tag instructions work are:

Choose the appropriate meta tag.
Write the meta tag correctly. Use the <meta> tag with the name and content attributes, setting the right user-agent in the “name” field (robots for all crawlers, googlebot for the Google crawler only) and the instructions in the “content” field. Multiple values should be separated by commas; a space after each comma is optional.
Do not be redundant. Robots meta tags are applied at the single page level to give specific instructions to search engines. It is not necessary to use robots meta tags for pages that we wish to be indexed and whose links are followed, since this is the default behavior of crawlers.
Do not worry about case, but be consistent. Search engines recognize attributes, values, and parameters in both uppercase and lowercase; sticking to lowercase letters is recommended to improve code readability, a tip SEOs should keep in mind.
Avoid contradictions. Do not put conflicting robots meta tags on the same page, as this may confuse crawlers and lead to undesirable results, especially in terms of indexing. For example, if a page contains both <meta name="robots" content="follow"> and <meta name="robots" content="nofollow">, only “nofollow” will be respected, because crawlers prioritize restrictive values.
Be frugal and avoid too many <meta> tags. Using multiple meta tags can cause conflicts. This is why it is preferable to put multiple values in the same tag, separated by commas. In case of conflicting robots rules, Google applies the more restrictive one: for example, if a page has both “max-snippet:50” and “nosnippet” rules, it applies the nosnippet rule.
Check compatibility with different search engines. As mentioned, search engine crawlers may have different behaviors and rules.
Always check. After entering meta tags, it is a good idea to check that they have been implemented correctly; we can use tools such as Google Search Console or even do scans with SEOZoom’s Spider to check that search engines are following the guidelines.
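The "most restrictive rule wins" principle from the list above can be sketched in a few lines of Python. This is a simplified illustration limited to indexing and link following; the function name effective_directives is hypothetical, and each search engine may resolve conflicts in its own way.

```python
def effective_directives(tag_contents):
    """Merge the content values of several robots meta tags on one page.

    A restrictive value (noindex, nofollow, none) always beats its
    permissive counterpart (index, follow, all).
    """
    seen = set()
    for content in tag_contents:
        seen.update(part.strip().lower() for part in content.split(",") if part.strip())
    if "none" in seen:
        # none is equivalent to noindex and nofollow together.
        seen.update({"noindex", "nofollow"})
    return {
        "index": "noindex" not in seen,
        "follow": "nofollow" not in seen,
    }


# Two conflicting tags on the same page: nofollow prevails.
print(effective_directives(["follow", "nofollow"]))
# {'index': True, 'follow': False}
```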

The difference between robots meta tags and robots.txt

For those with no particular expertise in technical SEO, terms such as robots.txt and robots meta tag can be confusing at first, because they seem to indicate the same thing.

In reality, these elements are very different, although they do indeed have one trait in common: being instructions communicated to search engine robots.

As an article by Anne Crowe on Search Engine Journal explains, however, there is an essential underlying difference: while robots meta tags are specific to the individual page, as just mentioned, the instructions in the robots.txt file apply to the entire site.

Therefore, the robots.txt file is a single document containing instructions that relate to individual pages or entire folders of the site, while the tag instructions are specific to each piece of content and web page, and are thus more precise.

In general, there is no one tool that is better than the other to use from an SEO perspective, but it is experience and expertise that may lead to a preference for one method over the other on a case-by-case basis. For example, the author admits to using robots meta tags in many areas for which “other SEO professionals may simply prefer the simplicity of the robots.txt file.”

Making Robots.txt and Meta Robots work together

One of the biggest and most frequent mistakes that Anne Crowe says she encounters when working on her clients’ websites is that the robots.txt file “doesn’t match what is stated in the robots meta tags.”

For example, the robots.txt file hides the page from indexing, but the meta robots tags do the opposite.

Based on her experience, the author says that Google prioritizes what is prohibited by the robots.txt file. However, nonconformity between robots meta tags and robots.txt can be eliminated by clearly indicating to search engines which pages should be hidden.

More generally, consistency between the instructions in the robots.txt file and those in the robots meta tags is critical for effective management of crawling and indexing. Mismatches between these two sources of directives can lead to a number of problems that negatively affect SEO and site visibility, such as:

Page indexed despite the ban in robots.txt. If the robots.txt file prevents crawlers from accessing a certain page, but the robots meta tag on the page itself indicates “index, follow,” search engines could still index the page. This happens because robots.txt prevents crawling but not the indexing of URLs discovered through external links. As a result, if other sites link to the page, it may appear in search results, contrary to the intentions of the site operator.
Important resources ignored. Suppose the robots.txt file blocks access to a directory that contains JavaScript or CSS files crucial for proper page rendering. If the robots meta tags in the HTML pages do not indicate restrictions, crawlers could attempt to index these pages without being able to access the blocked resources. The result would be a misrepresentation of the site in search results, which could harm the user experience and the perception of the site by search engines.
Duplicate content. If the robots.txt file allows access to pages with duplicate content that you intend to exclude from indexing, but the robots meta tags on these pages do not specify noindex, search engines may still index these pages. This can lead to duplicate content issues, which can dilute the relevance of search results and potentially lead to search engine penalties.
Ineffective expenditure of crawling resources. Search engine crawlers have a crawl budget for each site: if the robots.txt file allows low-quality or irrelevant pages to be crawled, while the robots meta tags on these pages do not limit indexing, crawlers could waste valuable resources. This could reduce the frequency with which the most relevant pages are visited and updated in search results.
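The first scenario in the list depends on one fact: a crawler that is not allowed to fetch a page never sees the robots meta tags inside it. The following Python sketch checks this with the standard library's robots.txt parser. The robots.txt content, the example.com URLs and the function name can_read_meta_tags are all hypothetical, and the standard library parser is only an approximation of how a given search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_read_meta_tags(url, user_agent="Googlebot"):
    """True if robots.txt lets the crawler fetch the page (and so see its tags)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_read_meta_tags("https://www.example.com/blog/post.html"))       # True
print(can_read_meta_tags("https://www.example.com/private/thanks.html"))  # False
```

When the function returns False for a page that carries a noindex tag, the two sources of directives are working against each other: the tag cannot be read, so the URL may still be indexed if it is discovered through links.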

The differences between Meta Tag Robots and X-Robots Tags

But there is yet another method of communicating with crawlers: the HTTP X-Robots-Tag header, a response element that the Web server sends as the HTTP header of a specific URL.

While robots meta tags are specific to HTML pages, X-Robots-Tags are used in HTTP headers and can be applied to any type of file, such as PDFs or images, and can include the same rules that can be used in a robots meta tag.

X-Robots-Tags offer greater flexibility because they allow indexing to be controlled at a more granular level and directives to be applied to files that cannot contain HTML tags. Crawlers follow the instructions in both variants; what changes is only the way the parameters are communicated. X-Robots-Tag headers are especially useful for non-HTML resources on the site, such as images and PDFs: for example, if we wish to block an image or video but not the entire page, it is worth using them. In addition, because the header is set in the server configuration, rules can be matched with regular expressions, which allows a high level of flexibility.

The other differences between robots meta tags and x-robots-tags concern:

Placement: robots meta tags go in the HTML <head>, while the X-Robots-Tag is sent as an HTTP response header for the page.
Compatibility: robots meta tags are compatible with all search engines, while the X-Robots-Tag header is not supported by all.
Priority: when a robots meta tag and an X-Robots-Tag header conflict, Google applies the more restrictive rule, as it does for any conflicting robots rules.

In essence, the X-Robots-Tag allows you to do the same thing as the meta tags, but within the headers of an HTTP response. It thus offers more versatility than robots meta tags and allows you to specify crawling rules to be applied globally to a site, but it requires access to the .php, .htaccess or server configuration files.

How to use x-robots-tag: syntax and rules

We can add X-Robots-Tag headers to a site’s HTTP responses via the configuration files of the site’s server software, and the correct syntax depends on the web server we are using.

For example, if the website is hosted on a server that uses Apache, we can add the X-Robots-Tag directives to the .htaccess file or the server configuration file.

Here is an example of how we can configure an X-Robots-Tag to prevent indexing of all PDFs on the site:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

In this example, FilesMatch is used to apply the rule to all files ending with the extension .pdf. The Header set directive adds the X-Robots-Tag header with noindex, nofollow values to HTTP responses for these files.

For servers using NGINX, the X-Robots-Tag directives can be added by editing the server configuration file. Here is how to do this to prevent indexing of all PDFs:

location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}

In this code snippet, location identifies files that match the pattern (in this case, all PDF files), and add_header adds the X-Robots-Tag header to the responses for those files.
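After deploying either configuration, it is worth confirming that the header really reaches the client. The Python sketch below inspects a mapping of response headers and reports whether indexing is blocked. The function name header_blocks_indexing is hypothetical, and the sketch assumes the header value carries no user-agent prefix (such as "googlebot: noindex"), which it does not handle.

```python
def header_blocks_indexing(headers):
    """True if the response headers carry a noindex (or none) X-Robots-Tag rule."""
    for name, value in headers.items():
        # HTTP header names are not case sensitive.
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if "noindex" in directives or "none" in directives:
            return True
    return False


# Headers as they might be returned for a PDF covered by the rules above.
print(header_blocks_indexing({"X-Robots-Tag": "noindex, nofollow"}))  # True
print(header_blocks_indexing({"Content-Type": "application/pdf"}))    # False
```

In practice the mapping could come from a real request, for example the headers attribute of the response returned by urllib.request.urlopen, which also exposes an items() method.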