We have a way to check the actual behavior of Googlebot and other crawlers on our site: log files, a kind of file that gives us useful data for analyzing the technical aspects of the domain, so that we have the tools to check whether a search engine reads the site correctly and scans all its pages. This alone should make the SEO value of log file analysis clear, but there are other important insights that derive from this work as well.
What is the log file
Log files are files in which the web server records every single request made to our site by bots or users.
In computer science, logs are sequential, chronological records of the operations carried out by a system. More generally, the term comes from eighteenth-century nautical jargon, when the log was literally the piece of wood used to roughly estimate a ship's speed from the number of knots in the rope paid out overboard (which is why ship speed is still measured in knots).
Returning to our everyday work, log files are therefore the records of who accessed the site and which content they accessed; in addition, they contain information about who made the access request (also known as the “client”), distinguishing between human visitors and search engine bots, such as Googlebot or Bingbot.
Log file records are collected by the website’s web server, usually kept for a certain period of time, and made available only to the site’s webmaster.
What log files look like
Each server records events in its logs differently, but the information provided is similar and organized in fields. When a user or bot visits a page of the website, the server writes an entry in the log file for the loaded resource: the log file contains all the data about this request and shows exactly how users, search engines and other crawlers interact with our online resources.
Visually, a log file looks like this:
27.300.14.1 – – [14/Sep/2017:17:10:07 -0400] “GET https://example.com/ex1/ HTTP/1.1” 200 “https://example.com” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)“
By breaking it into parts, we can identify the following information:
- The client’s IP address.
- A timestamp with the date and time of the request.
- The request method, typically GET or POST.
- The requested URL, i.e. the resource being accessed.
- The status code of the requested page, which shows the success or failure of the request.
- The user agent, which contains additional information about the client making the request, including the browser or bot (for example, whether it comes from mobile or desktop).
Some hosting solutions may also provide other information, which may include, for example:
- The name of the host.
- The IP of the server.
- Bytes downloaded.
- The time it took to make the request.
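The fields above can be pulled out of a raw log line programmatically. As a minimal sketch in Python, the regular expression below assumes the field layout of the sample entry shown earlier (IP, timestamp, request line, status code, referrer, user agent); real servers can be configured with different layouts, so the pattern would need adjusting per server.

```python
import re

# Pattern for a combined-log-style entry matching the sample line above.
# Each named group corresponds to one of the fields listed in the article.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '            # client IP, identity, user
    r'\[(?P<timestamp>[^\]]+)\] '       # date and time of the request
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '  # request line
    r'(?P<status>\d{3}) '               # HTTP status code
    r'"(?P<referrer>[^"]*)" '           # referrer
    r'"(?P<user_agent>[^"]*)"'          # user agent (browser or bot)
)

sample = (
    '27.300.14.1 - - [14/Sep/2017:17:10:07 -0400] '
    '"GET https://example.com/ex1/ HTTP/1.1" 200 '
    '"https://example.com" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

match = LOG_PATTERN.match(sample)
entry = match.groupdict() if match else {}
print(entry["ip"])          # client IP address
print(entry["method"])      # GET or POST
print(entry["status"])      # status code of the request
print(entry["user_agent"])  # identifies the browser or bot
```

In practice you would apply this pattern line by line to the downloaded log file; entries that fail to match usually signal a different log format rather than bad data.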
Log files, meaning and value
The log file thus tells the entire history of operations recorded during the daily use of the site (or, more generally, of a piece of software, an application or a computer), keeping all the information in chronological order, both during regular operation and when errors and problems occur.
The log therefore contains data that gives us full awareness of the health of the site: it allows us to identify, for example, whether pages are being scanned by malicious or useless bots (which we can then block, so as to lighten the load on the server), whether the actual site speed is good or some pages are too slow, and whether there are broken links or pages returning a problematic status code.
More generally, through log files we can find out which pages are visited most and how often, identify any bugs in the online software code, identify security flaws and collect data about site users to improve the user experience.
How to find and read log files
Basically, in order to analyze the site’s log file we need to get a copy of it: the method of access depends on the hosting solution (and on our level of authorization), but in some cases log files can be obtained from a CDN or even from the command line, downloaded locally and saved in an export format.
Usually, to access the log file we use the file manager in the server’s control panel, the command line, or an FTP client (such as FileZilla, which is free and generally recommended); this last option is the most common.
In this case, we need to connect to the server and access the location of the log file, which typically, in common server configurations, is:
- Apache: /var/log/access_log
- Nginx: logs/access.log
- IIS: %SystemDrive%\inetpub\logs\LogFiles
Sometimes it is not easy to retrieve log files because errors or problems may occur. For example, files may be unavailable because a server administrator has disabled them, they may be very large, or they may be set to store only recent data; in other cases there may be problems caused by the CDN, or export may only be allowed in a custom format that is unreadable on the local computer. However, none of these situations is unsolvable: it is enough to work together with a developer or server administrator to overcome the obstacles.
What log file analysis is and what it is for
By now we should have a clear idea of why log file analysis can be a strategic activity for improving site performance, as it reveals insights into how search engines are scanning the domain and its web pages.
In particular, in carrying out this operation we must focus on the study of some aspects, such as:
- How often Googlebot scans the site, which are the most important pages (and whether they are scanned), and which pages are not scanned often.
- Identification of pages and folders scanned more frequently.
- Determination of the crawl budget and verification of any waste for irrelevant pages.
- Search for URLs with unnecessarily scanned parameters.
- Verification of the transition to Google’s mobile-first indexing.
- Specific status code served for each of the pages of the site and search for areas of interest.
- Check for unnecessarily large or slow pages.
- Searching for static resources scanned too frequently.
- Search for frequently scanned redirect chains.
- Detection of sudden increases or decreases in crawler activity.
How to use log file analysis for SEO
Looking at a log file for the first time can be a little confusing, but some practice is enough to understand the value of this document for the optimization of our site.
Running a log file analysis can in fact provide us with useful information on how the site is seen by search engine crawlers, so as to help us in defining an SEO strategy and optimization interventions that are necessary.
We know, in fact, that each page has three basic SEO statuses – scannable, indexable and rankable: to be indexed, a page must first be read by a bot, and log file analysis lets us know whether this step is completed properly.
In fact, this study allows system administrators and SEO professionals to understand exactly what a bot reads, how many times the bot reads the resource, and the cost, in terms of time taken, of those visits.
The first recommended step in the analysis, according to Ruth Everett, is to select the access data to the site to view only the data from search engine bots, setting a filter limited only to the user agents we are interested in.
The same expert suggests some sample questions that can guide us in analyzing the log file for SEO:
- How much of the site is actually scanned by search engines?
- Which sections of the site are or are not scanned?
- How deep is the site scan?
- How often are certain sections of the site scanned?
- How often are regularly updated pages scanned?
- How long does it take for new pages to be discovered and scanned by search engines?
- How did the modification of the structure/architecture of the site affect the scanning of search engines?
- What is the scan speed of the website and how fast is the download of resources?
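The first step described above – filtering the access data down to the search engine bots we care about – can be sketched in a few lines of Python. The log lines below are invented examples in the format shown earlier; the bot names in the pattern are just the two crawlers the article mentions.

```python
import re

# Keep only entries whose user agent matches the crawlers of interest.
BOT_PATTERN = re.compile(r"Googlebot|bingbot", re.IGNORECASE)

log_lines = [
    '66.249.66.1 - - [14/Sep/2017:17:10:07 -0400] "GET /page-a/ HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [14/Sep/2017:17:11:02 -0400] "GET /page-b/ HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/115.0"',
]

bot_hits = [line for line in log_lines if BOT_PATTERN.search(line)]
print(len(bot_hits))  # only the Googlebot request remains
```

One caveat worth noting: user-agent strings can be spoofed, so serious analyses verify that hits claiming to be Googlebot really come from Google (for example via reverse DNS lookup of the IP) before drawing conclusions.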
Log file and SEO, useful information to look for
The log file allows us to get an idea of the crawlability of our site and of the way the crawl budget that Googlebot assigns us is spent: although we know that “most sites don’t have to worry too much about crawl budget,” as Google’s John Mueller often says, it is still useful to know which pages Google is scanning and how often, so that we can intervene if necessary to optimize the crawl budget by allocating it to the resources most important to our business.
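As a rough sketch of this check, once log entries have been parsed we can count how often Googlebot requests each URL and see where crawl activity concentrates. The entries below are invented stand-ins for parsed log records.

```python
from collections import Counter

# Invented sample of parsed log entries: (url, user agent) pairs.
parsed_entries = [
    {"url": "/products/widget/", "user_agent": "Googlebot/2.1"},
    {"url": "/products/widget/", "user_agent": "Googlebot/2.1"},
    {"url": "/old-campaign/?utm_source=x", "user_agent": "Googlebot/2.1"},
    {"url": "/about/", "user_agent": "Mozilla/5.0 Chrome"},  # human visit, ignored
]

# Count Googlebot requests per URL to spot where crawl budget goes.
crawl_counts = Counter(
    e["url"] for e in parsed_entries if "Googlebot" in e["user_agent"]
)
for url, hits in crawl_counts.most_common():
    print(url, hits)
```

A tally like this makes waste visible at a glance: if parameterized or obsolete URLs appear near the top while key pages rarely appear, crawl budget is being spent in the wrong place.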
On a broader level, we need to make sure that the site is scanned efficiently and effectively, and especially that key pages, new pages and regularly updated pages are found and scanned quickly and with adequate frequency.
We can also find this kind of information in Google Search Console’s Crawl Stats report, which lets us view Googlebot’s crawl requests over the last 90 days, with a breakdown of status codes and requests by file type, as well as which type of Googlebot (desktop, mobile, Ads, Image, etc.) is making the request and whether the URLs are newly discovered pages or previously scanned ones.
However, this report only presents a sample of pages, so it does not offer the complete picture available from the site’s log files.
What kind of data to extract from the analysis
In addition to what has already been written, the analysis of the log file offers us other useful ideas to look for to deepen our supervision.
For example, we can combine status code data to check how many requests end with an outcome other than code 200, and therefore how much crawl budget we are wasting on broken or redirecting pages. At the same time, we can examine how search engine bots are scanning indexable pages on the site compared to non-indexable ones.
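This status-code check reduces to a simple aggregation. As a minimal sketch (the status codes below are invented sample data), we can group requests by code and measure the share that did not return 200:

```python
from collections import Counter

# Invented sample: status codes extracted from bot requests in the log.
statuses = [200, 200, 301, 404, 200, 500, 301]

by_status = Counter(statuses)
non_200 = sum(count for code, count in by_status.items() if code != 200)
share_wasted = non_200 / len(statuses)

print(by_status)
print(f"{share_wasted:.0%} of requests did not return 200")
```

Spikes in 3xx responses point at redirect chains to flatten, while 4xx and 5xx responses mark broken pages where crawl budget is being spent for nothing.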
In addition, by combining log file data with site crawl information we can also find out how deep into the site architecture bots are actually scanning: according to Everett, “If we have key product pages on levels four and five, but log files show that Googlebot does not scan these levels often, we need to perform optimizations that increase the visibility of these pages”.
One possible intervention to improve this aspect is internal linking, another important data point we can examine with this combined use of log files and crawl analysis: typically, the more internal links a page has, the easier it is to discover.
Log file data are also useful for examining how a search engine’s behavior changes over time, especially when there is a content migration or a change in the structure of the site, to understand how the intervention has affected the scanning of the site.
Finally, log file data also show the user agent used to access the page, and can therefore tell us whether access was made by a mobile or desktop bot: this means we can find out how many pages of the site are scanned by mobile devices compared to desktop, how this has changed over time, and possibly work out how to optimize for the version “preferred” by Googlebot.
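This mobile-versus-desktop split can be computed directly from the user-agent field. A hedged sketch, assuming already-extracted user-agent strings (the examples below are simplified; real Googlebot smartphone user agents contain an Android device string plus “Mobile”):

```python
from collections import Counter

# Simplified sample user-agent strings from parsed log entries.
user_agents = [
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
    "Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

def classify(ua: str) -> str:
    """Label a user agent as mobile Googlebot, desktop Googlebot, or other."""
    if "Googlebot" not in ua:
        return "other"
    return "googlebot-mobile" if "Mobile" in ua else "googlebot-desktop"

split = Counter(classify(ua) for ua in user_agents)
print(split)
```

Tracking this ratio over time shows whether Googlebot has shifted a site to mobile-first crawling, and therefore which version of the pages deserves optimization priority.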