A guide to log file use and analysis for SEO
Log files are digital records that document the events occurring within a computer system: like a ship captain’s classic logbook, they note each and every event that happens during the voyage. They are generated automatically by the various software and services we use and provide a detailed account of what is happening “under the hood”; from the perspective of a site, they let us monitor the actual behavior of Googlebot and other crawlers on our pages. This alone explains the SEO value of analyzing log files, which also provide other useful data on the technical aspects of the domain, so that we can check whether a search engine is reading the site correctly and scanning all its pages.
What is a log file
Log files are, precisely, files in which the web server records every single request made to our site by bots or users, reporting each event that took place at a given time along with, where available, metadata that contextualizes it.
We might think of them as just a series of gibberish codes and numbers, but they actually contain valuable information: each line represents a specific event, such as a program startup, a system error, or an unauthorized access attempt, and reading this data can help us better understand how our system works, identify possible problems, and prevent future malfunctions.
In fact, the basic structure of a log file includes a number of entries, each usually consisting of a series of fields separated by spaces or other delimiting characters, representing a specific event. Although the exact structure may vary depending on the software or service generating the log file, most entries include at least the following information:
- Timestamp, which indicates the precise time when the logged event occurred, expressed in date and time format.
- Log level, which indicates the severity of the event. Common levels include “INFO” for normal events, “WARNING” for potentially problematic events, and “ERROR” for errors.
- Log message, which provides details about the event, including, for example, the name of the service or software that generated it, the action that was performed, or the error that occurred.
However, depending on the type of log source, the file will also contain a large amount of other relevant data: server logs, for example, will also include the referenced web page, HTTP status code, bytes served, user agent, and more.
Thus, this computer-generated log file contains information about usage patterns, activities, and operations within an operating system, application, server, or other device, and essentially serves as a check on whether resources are functioning properly and optimally.
An example of a log file might be the following:
2022-01-01 12:34:56 INFO Service X has been correctly launched.
In this case, we learn that the event occurred on January 1, 2022 at 12:34:56 p.m., that it is a normal event (as indicated by the “INFO” level), and that service X was started correctly.
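The three fields described above can be pulled apart with a few lines of Python. This is only a minimal sketch: it assumes the space-separated layout of the sample entry (date, time, level, message), and the function name `parse_entry` is illustrative; real software may use a different layout.

```python
from datetime import datetime

def parse_entry(line):
    """Split a 'DATE TIME LEVEL MESSAGE' entry into its three fields."""
    # maxsplit=3 keeps the free-text message intact, spaces included.
    date_part, time_part, level, message = line.strip().split(" ", 3)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return {"timestamp": timestamp, "level": level, "message": message}

entry = parse_entry("2022-01-01 12:34:56 INFO Service X has been correctly launched.")
print(entry["timestamp"].isoformat(), entry["level"], entry["message"])
```

Turning the timestamp into a `datetime` object, rather than keeping it as text, makes it possible to sort and filter entries by time later on.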
Why they are called log files
In computer science, logs are the sequential and chronological record of the operations performed by a system. More generally, the term comes from the nautical jargon of the 18th century, when the log was literally the piece of wood used to roughly calculate a ship’s speed based on the number of knots on the line paid out overboard (which is why the speed of ships is still measured in knots today).
Returning to our everyday concerns, log files are therefore the records of who accessed the site and which content they accessed; in addition, they contain information about who made the request to the website (also known as the “client”), distinguishing between human visitors and search engine bots such as Googlebot or Bingbot.
Log file records are collected by the website’s web server, are usually kept for a certain period of time, and are made available only to the site’s webmaster. In short, just like old ships’ logbooks, they are a historical record of everything that happens within a system, including events such as transactions, errors and intrusions, so that the voyage can continue without a hitch.
What log files look like
Each server records events differently in the logs, but the information provided is still similar, organized in fields. When a user or bot visits a website page, the server writes an entry in the log file for the loaded resource: the log file contains all the data on this request and shows exactly how users, search engines and other crawlers interact with our online resources.
Visually, a log file looks like this:
27.300.14.1 - - [14/Sep/2017:17:10:07 -0400] "GET https://example.com/ex1/ HTTP/1.1" 200 "https://example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Breaking the entry into its parts, we find the following information:
- The client’s IP address.
- A timestamp with the date and time of the request.
- The HTTP method of the request, such as GET or POST.
- The requested URL, which identifies the page or resource being accessed.
- The status code of the requested page, which shows the positive or negative result of the request.
- The referrer, that is, the page from which the request originated.
- The user agent, which contains additional information about the client making the request, such as the browser or bot and whether it identifies itself as mobile or desktop.
Some hosting solutions may also provide other information, which may include, for example:
- The name of the host.
- The IP of the server.
- Bytes downloaded.
- The time it took to serve the request.
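The fields listed above can be extracted with a regular expression. The following is a sketch rather than a universal parser: it assumes a layout like the sample entry (with straight quotes and hyphens), treats the bytes field as optional because the sample omits it, and uses illustrative names (`LINE_RE`, `parse_access_line`). Server log formats are configurable, so the pattern will usually need adapting.

```python
import re
from datetime import datetime

# Pattern modelled on the sample entry; the bytes field is optional here.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) '
    r'(?:(?P<bytes>\d+|-) )?'
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["status"] = int(hit["status"])
    # %b expects English month abbreviations under the default locale.
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit

sample = (
    '27.300.14.1 - - [14/Sep/2017:17:10:07 -0400] '
    '"GET https://example.com/ex1/ HTTP/1.1" 200 "https://example.com" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
hit = parse_access_line(sample)
print(hit["method"], hit["url"], hit["status"])
```

Returning `None` for lines that do not match, instead of raising an error, lets a long file be processed even when a few entries are malformed.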
Log files, meaning and value
The log file thus tells the entire history of the operations recorded during the daily use of the site (or, more generally, of a piece of software, an application or a computer), keeping all the information in chronological order both during regular operation and when errors and problems occur.
The log therefore contains the data we need to be fully aware of the health of the site: it allows us to identify, for example, whether pages are scanned by malicious or useless bots (which we can then block, to lighten the load on the server), whether the actual site speed is good or some pages are too slow, and whether there are broken links or pages that return a problematic status code.
More generally, through log files we can find out which pages are visited most and how often, identify any bugs in the online software code, identify security flaws and collect data about site users to improve the user experience.
How to find and read log files
Basically, to analyze the site’s log file we first need to get a copy of it: the method of access depends on the hosting solution (and the level of authorization), but in some cases log files can be obtained from a CDN or from the command line, then downloaded to the local computer and opened in the export format.
Usually, to access the log file we use the file manager of the server control panel, the command line, or an FTP client (such as FileZilla, which is free and generally recommended); the last option is the most common.
In this case, we need to connect to the server and access the location of the log file, which typically, in common server configurations, is:
- Apache: /var/log/access_log
- Nginx: logs/access.log
- IIS: %SystemDrive%\inetpub\logs\LogFiles
Sometimes it is not easy to retrieve the log files, because errors or problems may occur. For example, the files may not be available because logging has been disabled by a server administrator, they may be very large, or the server may be set to store only recent data; in other circumstances there could be problems caused by the CDN, or export could be allowed only in a custom format that is unreadable on the local computer. However, none of these situations is unsolvable: it is enough to work together with a developer or server administrator to overcome the obstacles.
With regard to reading log files, there are various tools that can help us decipher the information they contain: some are built into operating systems, such as Windows Event Viewer, while others are third-party software, such as Loggly or Logstash. These tools range from simple text editors with search capabilities to dedicated software offering advanced features such as real-time analysis, automatic alerting and data visualization.
Sometimes, in fact, log files can become very large and complex, especially in large or very active systems, and so resorting to such log analysis tools can serve to filter, search, and visualize information in a more manageable way.
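When no dedicated tool is at hand, a large log can still be processed without loading it all into memory. The sketch below reads files one line at a time and also handles gzip-compressed archives. It assumes that compressed files end in `.gz`; the function names are illustrative.

```python
import gzip
from pathlib import Path

def iter_log_lines(paths):
    """Yield lines one at a time from plain or gzip-compressed log files."""
    for path in map(Path, paths):
        # Assumption: compressed archives carry a .gz suffix.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")

def count_matching(paths, needle):
    """Count the lines containing a given substring, e.g. a bot name."""
    return sum(1 for line in iter_log_lines(paths) if needle in line)

# Example call with hypothetical file names:
# count_matching(["access.log", "access.log.1.gz"], "Googlebot")
```

Because `iter_log_lines` is a generator, memory use stays roughly constant regardless of file size.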
What is log file analysis and what it is used for
By now it should be clear why log file analysis can be a strategic activity for improving site performance: it reveals how search engines are scanning the domain and its web pages and, more generally, what is happening to our system, giving us a detailed view of events, even the “unwanted” ones.
For example, if we are experiencing problems with a particular piece of software, analysis of the log files can help us identify the source of the problem. If we notice that our website is slower than usual, log files can tell us whether it is a traffic problem, an error in the code, or a cyber attack. If we are trying to optimize the performance of our system, log files can give us valuable data about how various components are performing.
In addition, log file analysis can play a crucial role in cybersecurity: the log can reveal unauthorized access attempts, suspicious activity, and other signs of possible cyber attacks, and by analyzing this data, we can detect threats before they become a serious problem and take appropriate measures to protect our systems.
In particular, in carrying out this operation we must focus on the study of some aspects, such as:
- Frequency with which Googlebot scans the site, listing the most important pages (and whether they are scanned) and identifying the pages that are not scanned often.
- Identification of pages and folders scanned more frequently.
- Determination of the crawl budget and verification of any waste for irrelevant pages.
- Search for URLs with unnecessarily scanned parameters.
- Verification of the transition to Google’s mobile-first indexing.
- Check of the specific status code served for each page of the site, looking for areas that need attention.
- Check for unnecessarily large or slow pages.
- Searching for static resources scanned too frequently.
- Search for frequently scanned redirect chains.
- Detection of sudden increases or decreases in crawler activity.
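Several of the checks listed above come down to counting. The sketch below works on a small, made-up list of already-parsed Googlebot hits and tallies requests per day, per top-level folder and per URL, and also isolates URLs carrying parameters. The data and the helper `top_folder` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical, already-parsed Googlebot hits: (ISO date, requested URL).
hits = [
    ("2017-09-14", "https://example.com/blog/post-1/"),
    ("2017-09-14", "https://example.com/blog/post-1/"),
    ("2017-09-14", "https://example.com/shop/item?color=red"),
    ("2017-09-15", "https://example.com/shop/item?color=blue"),
    ("2017-09-15", "https://example.com/static/app.js"),
    ("2017-09-15", "https://example.com/"),
]

def top_folder(url):
    """Return the first folder of the path, or '/' for root-level URLs."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if len(segments) > 1 or (segments and path.endswith("/")):
        return f"/{segments[0]}/"
    return "/"

hits_per_day = Counter(day for day, _ in hits)          # spikes and drops
hits_per_folder = Counter(top_folder(url) for _, url in hits)
hits_per_url = Counter(url for _, url in hits)
parameter_urls = sorted({url for _, url in hits if urlsplit(url).query})

print(hits_per_day.most_common())
print(hits_per_folder.most_common())
print(parameter_urls)
```

Comparing `hits_per_day` across a longer period is one simple way to notice sudden changes in crawler activity, while `hits_per_folder` shows where the crawl is concentrated.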
How to use log file analysis for SEO
Looking at a log file for the first time can create a bit of confusion, but a bit of practice is enough to understand the value of this document for the optimization of our site.
Running a log file analysis can in fact provide us with useful information on how the site is seen by search engine crawlers, helping us to define an SEO strategy and the optimization interventions that are needed.
We know, in fact, that each page has three basic SEO statuses (crawlable, indexable and rankable): to be indexed, a page must first be read by a bot, and the analysis of log files allows us to know whether this step is completed properly.
In fact, this study allows system administrators and SEO professionals to understand exactly what a bot reads, how many times it reads each resource and what this costs in terms of time spent on crawling.
The first recommended step in the analysis, according to Ruth Everett, is to filter the site’s access data so as to view only the requests from search engine bots, limiting the filter to the user agents we are interested in.
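This filtering step can be sketched as a simple substring match on the user agent. The records and the token list below are illustrative assumptions. Note that the user agent is declared by the client itself, so a match only shows what a request claims to be.

```python
# Assumption: these lowercase tokens identify the bots we care about.
BOT_TOKENS = ("googlebot", "bingbot")

def bot_name(user_agent):
    """Return the matching bot token, or None for any other client."""
    ua = user_agent.lower()
    for token in BOT_TOKENS:
        if token in ua:
            return token
    return None

# Hypothetical, already-parsed requests (only the fields needed here).
records = [
    {"url": "/ex1/", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                                   "+http://www.google.com/bot.html)"},
    {"url": "/ex2/", "user_agent": "Mozilla/5.0 (compatible; bingbot/2.0; "
                                   "+http://www.bing.com/bingbot.htm)"},
    {"url": "/ex1/", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                                   "Chrome/120.0 Safari/537.36"},
]

bot_records = [r for r in records if bot_name(r["user_agent"]) is not None]
for record in bot_records:
    print(bot_name(record["user_agent"]), record["url"])
```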
The same expert suggests some sample questions that can guide us in analyzing the log file for SEO:
- How much of the site is actually scanned by search engines?
- Which sections of the site are scanned, and which are not?
- How deep is the site scan?
- How often are certain sections of the site scanned?
- How often are regularly updated pages scanned?
- How long does it take for new pages to be discovered and scanned by search engines?
- How did the modification of the structure/architecture of the site affect the scanning of search engines?
- What is the scan speed of the website and how fast is the download of resources?
Log files and SEO, useful information to look for
The log file allows us to get an idea of the crawlability of our site and of how the crawl budget that Googlebot gives us is spent: although we know that “most sites don’t have to worry too much about the crawl budget”, as Google’s John Mueller often says, it is still useful to know which pages Google is scanning and how often, so that we can intervene if necessary to optimize the crawl budget by allocating it to the resources that matter most to our business.
On a broader level, we need to make sure that the site is scanned efficiently and effectively, and especially that the key pages, new pages and regularly updated pages are found and scanned quickly and with adequate frequency.
We can also find information of this type in Google’s Crawl Stats report, which shows Googlebot’s scanning requests over the last 90 days, with an analysis of status codes and of requests by file type, as well as which type of Googlebot (desktop, mobile, Ads, Image, etc.) is making the request and whether it concerns newly found pages or pages scanned previously.
However, this report presents only a sample of pages, so it does not offer the complete picture that is available from the site’s log files.
What kind of data to extract from the analysis
In addition to what has already been covered, the analysis of the log file offers other useful insights that let us monitor the site more closely.
For example, we can aggregate status code data to verify how many requests end with an outcome other than code 200, and therefore how much crawl budget we are wasting on broken or redirecting pages. At the same time, we can also examine how search engine bots are scanning the indexable pages of the site compared to the non-indexable ones.
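The status code check can be sketched as a simple tally. The list of codes below is made up for illustration; in practice it would come from the parsed bot requests.

```python
from collections import Counter

# Hypothetical status codes returned to a bot, one per request.
statuses = [200, 200, 301, 200, 404, 200, 302, 500, 200, 200]

by_status = Counter(statuses)
# Group codes into classes: 2xx success, 3xx redirect, 4xx/5xx errors.
by_class = Counter(f"{code // 100}xx" for code in statuses)

non_200 = sum(1 for code in statuses if code != 200)
non_200_share = non_200 / len(statuses)

print(by_status.most_common())
print(dict(by_class))
print(f"Requests not answered with 200: {non_200_share:.0%}")
```

The share of requests not answered with 200 gives a rough measure of how much crawl activity goes to redirects and errors rather than to working pages.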
In addition, by combining log file data with site scan information we can also find out the depth in site architecture that bots are actually scanning: according to Everett, “If we have key product pages on levels four and five, but log files show that Googlebot does not scan these levels often, we need to perform optimizations that increase the visibility of these pages”.
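Combining the two sources can be as simple as joining them on the URL. In the sketch below both inputs are invented for illustration: `crawl_depth` stands for the export of a site crawler (URL mapped to click depth) and `googlebot_hits` for per-URL counts taken from the logs.

```python
from collections import defaultdict

# Hypothetical crawler export: URL path -> click depth from the home page.
crawl_depth = {
    "/": 0,
    "/shop/": 1,
    "/shop/shoes/": 2,
    "/shop/shoes/red/": 3,
    "/shop/shoes/red/item-1": 4,
    "/shop/shoes/red/item-2": 4,
}
# Hypothetical Googlebot request counts per URL, taken from the logs.
googlebot_hits = {
    "/": 120,
    "/shop/": 45,
    "/shop/shoes/": 12,
    "/shop/shoes/red/": 3,
    "/shop/shoes/red/item-1": 1,
}

def hits_by_depth(depths, hits):
    """Summarise pages, bot hits and never-requested pages per depth level."""
    totals = defaultdict(lambda: {"pages": 0, "hits": 0, "never_crawled": 0})
    for url, depth in depths.items():
        count = hits.get(url, 0)  # URLs absent from the logs count as 0
        row = totals[depth]
        row["pages"] += 1
        row["hits"] += count
        if count == 0:
            row["never_crawled"] += 1
    return dict(sorted(totals.items()))

for depth, row in hits_by_depth(crawl_depth, googlebot_hits).items():
    print(depth, row)
```

Starting from the crawler export rather than from the logs is a deliberate choice here: it keeps pages that the bot never requested in the result, which are exactly the ones this check is meant to reveal.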
One possible intervention to improve this aspect is internal linking, another important data point that we can examine through this combined use of log files and scan analysis: typically, the more internal links a page has, the easier it is to discover.
Log file data are also useful for examining how the behavior of a search engine changes over time, which is especially valuable after a content migration or a change in the structure of the site, to understand how the intervention has affected the scanning of the site.
Finally, the log file data also show the user agent used to access the page and can therefore tell us whether the access was made by a mobile or desktop bot: this means that we can find out how many pages of the site are scanned by the mobile crawler compared to the desktop one and how this has changed over time, and possibly work out how to optimize the version that is “preferred” by Googlebot.
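A rough way to split Googlebot requests by device is to look for the word "Mobile" in the user agent, which appears in the string Googlebot's smartphone crawler sends. This is a heuristic sketch with an illustrative function name; it does not distinguish other Google crawlers such as Googlebot-Image, which would be counted as desktop here.

```python
from collections import Counter

def googlebot_device(user_agent):
    """Classify a Googlebot user agent as 'mobile' or 'desktop'."""
    if "Googlebot" not in user_agent:
        return None  # not a request that claims to be Googlebot
    return "mobile" if "Mobile" in user_agent else "desktop"

DESKTOP_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
              "+http://www.google.com/bot.html)")
# W.X.Y.Z is a placeholder for the Chrome version number.
MOBILE_UA = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z "
             "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
             "+http://www.google.com/bot.html)")

# Hypothetical sequence of user agents taken from parsed log entries.
user_agents = [MOBILE_UA, MOBILE_UA, DESKTOP_UA, MOBILE_UA]
devices = Counter(googlebot_device(ua) for ua in user_agents)
print(devices)
```

Running the same count on logs from different periods shows how the balance between mobile and desktop crawling has shifted over time.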