Log File Analyser
An SEO's Guide To Apache Log Files
Introduction To Apache Log Files
The Apache web server is one of the most widely used on the web, and it can produce access logs in a wide range of formats. There isn’t an official Apache log file format. Instead, Apache has a very flexible log configuration that can be customised to log in almost any format you like. This is done using the LogFormat configuration directive.
Apache does, however, come with a few standard pre-canned formats. All of these formats output to text files, so they can easily be inspected in a text editor. The NGINX web server can also produce logs in these formats.
Common Formats
Below are examples of each of the common formats, showing the LogFormat directive used and an example of the same request from Googlebot. The LogFormat directive allows you to place values for a log line in any order you like. The full list of values can be found in the Apache docs.
These examples should give you a flavour of what Apache log lines look like, and aid you in identifying the format used when you’re given a file to analyse. There’s a good chance that if you’re given an Apache log file to look at, it will be in one of these formats.
Common Log Format (CLF)
This is not supported by the Log File Analyser, as it does not include the User Agent. It’s worth being aware of this format in case you’re given a log like this, so you can go back and request a change.
Configuration: LogFormat "%h %l %u %t \"%r\" %>s %b" common
Example Line:
66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET /contact.html HTTP/1.1" 200 250
Reading the line from left to right we have:
- The remote IP
- Remote logname, which is not set
- Remote user if the request was authenticated, which is not set
- The time the request was received
- The request line in quotes, composed of the request method, GET, the url, /contact.html and the HTTP version, 1.1
- The response code: 200
- The size of response in bytes, excluding HTTP headers: 250 bytes.
Download a 1,000 line example here.
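As a rough illustration of how the fields listed above map onto a line, the following Python sketch parses the example with a regular expression. The pattern and the parse_clf name are assumptions made for this example, not a description of how the Log File Analyser reads logs.

```python
import re
from datetime import datetime

# Illustrative pattern for the Common Log Format: %h %l %u %t "%r" %>s %b
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # %t is written as [day/month/year:hour:minute:second zone]
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # %b logs "-" rather than 0 when no bytes were sent
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


line = '66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET /contact.html HTTP/1.1" 200 250'
entry = parse_clf(line)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```

Lines that don't fit the pattern return None, which is a simple way to spot a file that isn't in this format.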
Common Log Format with Virtual Host
Again, this is not supported by the Log File Analyser due to the missing User Agent.
Configuration: LogFormat "%v %h %l %u %t \"%r\" %>s %b" commonvh
Example Line:
www.example.com 66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET /contact.html HTTP/1.1" 200 250
Reading the line from left to right we have:
- The name of the virtual host: www.example.com
- The remote IP
- Remote logname, which is not set
- Remote user if the request was authenticated, which is not set
- The time the request was received
- The request line in quotes, composed of the request method, GET, the url, /contact.html and the HTTP version, 1.1
- The response code: 200
- The size of response in bytes, excluding HTTP headers: 250 bytes.
Download a 1,000 line example here.
NCSA extended/combined log format
This contains the same information as CLF, as well as both the User Agent and Referer, so it will be recognised by the Log File Analyser.
Configuration: LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
Example Line:
66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET /contact.html HTTP/1.1" 200 250 "http://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Reading the line from left to right we have:
- The remote IP
- Remote logname, which is not set
- Remote user if the request was authenticated, which is not set
- The time the request was received
- The request line in quotes, composed of the request method, GET, the url, /contact.html and the HTTP version, 1.1
- The response code: 200
- The size of response in bytes, excluding HTTP headers: 250 bytes.
- The Referer: The home page http://www.example.com/ (Googlebot doesn’t always supply this)
- User Agent: Googlebot
Download a 1,000 line example here.
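The two extra fields are what make bot analysis possible. The sketch below is an illustration only: it splits a combined line into named fields and checks whether the User Agent claims to be Googlebot. The pattern and the function names parse_combined and claims_googlebot are assumptions for this example. The check looks at the User Agent string alone, which any client can set to whatever it likes.

```python
import re

# Illustrative pattern for the combined format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def parse_combined(line):
    """Return a dict of combined-format fields, or None if no match."""
    match = COMBINED_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def claims_googlebot(entry):
    """True if the User Agent string mentions Googlebot (not verified)."""
    return "googlebot" in entry["user_agent"].lower()


line = (
    '66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] '
    '"GET /contact.html HTTP/1.1" 200 250 "http://www.example.com/" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_combined(line)
print(entry["url"], entry["referer"], claims_googlebot(entry))
```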
Screaming Frog Log File Analyser Ultimate format
The following custom format contains all the information supported by the Log File Analyser. So if you are able to configure the log format, this is the one to go for. It’s a small enhancement on the NCSA extended/combined log format to include %D so you also get response time information.
Configuration: LogFormat "%h %l %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\"" sf_ultimate
Example Line:
66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET /contact.html HTTP/1.1" 200 250 500000 "http://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Reading the line from left to right we have:
- The remote IP
- Remote logname, which is not set
- Remote user if the request was authenticated, which is not set
- The time the request was received
- The request line in quotes, composed of the request method, GET, the url, /contact.html and the HTTP version, 1.1
- The response code: 200
- The size of response in bytes, excluding HTTP headers: 250 bytes.
- The time taken to serve the request, in microseconds: 500000. This will be shown as 500ms (Milliseconds) in the user interface, which is half a second.
- The Referer: The home page http://www.example.com/ (Googlebot doesn’t always supply this)
- User Agent: Googlebot
Download a 1,000 line example here.
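To make the microsecond conversion concrete, here is a small Python sketch that pulls the %D value out of a line in this format and converts it to milliseconds. The pattern and the response_time_ms name are assumptions for illustration. It relies on %D sitting directly after the response size, as in the sf_ultimate layout above.

```python
import re

# The three tokens after the quoted request line in this layout: %>s %b %D
TIMING_PATTERN = re.compile(
    r'" (?P<status>\d{3}) (?P<size>\d+|-) (?P<micros>\d+) "'
)


def response_time_ms(line):
    """Return the %D value converted to milliseconds, or None if absent."""
    match = TIMING_PATTERN.search(line)
    if match is None:
        return None
    # %D is logged in microseconds; 1,000 microseconds = 1 millisecond
    return int(match.group("micros")) / 1000


line = (
    '66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] '
    '"GET /contact.html HTTP/1.1" 200 250 500000 "http://www.example.com/" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
print(response_time_ms(line))
```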
Other Formats
You can read in detail about what can be included in the log by checking out the Apache user guide.
The Log File Analyser will read any format as long as it contains at least the following:
- Time the request was received: %t
- First line of request: %r
- Status: %>s
- User Agent: %{User-Agent}i
The following are optional. If they are not present, defaults will be shown in the user interface. To get the most out of the Log File Analyser it’s best to have all these values as well.
- Referer: %{Referer}i
- Size of response in bytes: %b
- Remote hostname: %h
- Time taken to serve the request, in microseconds: %D
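If you have access to the LogFormat string itself, the two lists above can be checked mechanically. The following Python sketch is a simple substring check under stated assumptions: the names REQUIRED, OPTIONAL, has_directive and missing_directives are made up for this example, and it does not try to understand every modifier that Apache's format strings allow.

```python
# Directives taken from the required and optional lists above
REQUIRED = ["%t", "%r", "%>s", "%{User-Agent}i"]
OPTIONAL = ["%{Referer}i", "%b", "%h", "%D"]


def has_directive(log_format, directive):
    """Plain substring check for a directive in a LogFormat string."""
    if directive.startswith("%{"):
        # HTTP header names are case-insensitive, so compare in lower case
        return directive.lower() in log_format.lower()
    # Other directives are case-sensitive (%D and %T mean different things)
    return directive in log_format


def missing_directives(log_format):
    """Return (missing required, missing optional) directives."""
    missing_required = [d for d in REQUIRED if not has_directive(log_format, d)]
    missing_optional = [d for d in OPTIONAL if not has_directive(log_format, d)]
    return missing_required, missing_optional


common = r'%h %l %u %t \"%r\" %>s %b'
combined = r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'

print(missing_directives(common))
print(missing_directives(combined))
```

Run against the formats described earlier, the check reports the missing User Agent in the common format and only the optional %D in the combined format.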
Domains
The request line may contain a domain. If it doesn’t, you’ll be asked to specify what it is.
If your logs cover multiple domains or subdomains and the request line doesn’t include the domain, specifying a single domain will result in some information loss, as all URLs will appear to be from the same domain. Let’s consider a simple log file with two lines, one for subdomain1.example.com and one for subdomain2.example.com.
66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET http://subdomain1.example.com/contact.html HTTP/1.1" 200 200 "http://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [01/Jan/2017:09:01:00 +0000] "GET http://subdomain2.example.com/contact.html HTTP/1.1" 200 200 "http://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Here you won’t be asked for a site URL, as the Log File Analyser can read the domain from the log file. If, however, the log file doesn’t contain the domain, as in the following two lines:
66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] "GET /contact.html HTTP/1.1" 200 200 "http://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [01/Jan/2017:09:01:00 +0000] "GET /contact.html HTTP/1.1" 200 200 "http://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
you’ll be asked to specify the domain. As you can see here, if the log covers multiple domains you’ll end up misrepresenting some of your log file lines.
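The information loss can be shown with a short Python sketch. The resolve_url helper, the http scheme and the www.example.com default are assumptions chosen for illustration. The helper keeps a request target that already carries a domain, and otherwise prefixes the domain you supply.

```python
from urllib.parse import urlsplit


def resolve_url(request_target, default_domain):
    """Return a full URL for the target taken from a log request line."""
    parts = urlsplit(request_target)
    if parts.scheme and parts.netloc:
        # The request line already carries the domain, so keep it as logged
        return request_target
    # No domain in the log: fall back to the one supplied by the user
    # (the http scheme here is an assumption for this example)
    return f"http://{default_domain}{request_target}"


with_domain = [
    "http://subdomain1.example.com/contact.html",
    "http://subdomain2.example.com/contact.html",
]
without_domain = ["/contact.html", "/contact.html"]

resolved_with = [resolve_url(t, "www.example.com") for t in with_domain]
resolved_without = [resolve_url(t, "www.example.com") for t in without_domain]

print(resolved_with)     # the two subdomains stay distinct
print(resolved_without)  # both requests collapse onto the same URL
```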
Common issues
We’ve seen a variety of issues with users importing Apache log files. The following are the most common cases we see.
Importing error logs instead of access logs
Apache produces two types of log files: an access log, recording each access to a website, and an error log, which contains internal website errors such as PHP coding issues.
These logs are typically found in /var/log/apache2/ and are named something like access_log or error_log. Error logs are not supported by the Log File Analyser, and will result in the following error if you try to import them:
Here are a few example lines of an error_log so you can see how it differs from an access log:
[Tue Dec 20 20:03:06.099840 2016] [ssl:warn] [pid 97765] AH01873: Init: Session Cache is not configured [hint: SSLSessionCache]
[Tue Dec 20 20:03:06.100972 2016] [mpm_prefork:notice] [pid 97765] AH00163: Apache/2.4.23 (Unix) LibreSSL/2.2.7 configured -- resuming normal operations
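Here you can see the log is very different to an access log: each line starts with a bracketed timestamp, followed by the module and log level, rather than a remote IP and a quoted request line.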
You can download an example error_log here.
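Before importing a file, you can check which kind of log it is by looking at the first few lines. The Python sketch below is a heuristic only, based on the example lines in this guide. The patterns and the classify_line name are assumptions for illustration, and custom formats may not fit either pattern.

```python
import re

# Error log lines in the examples above open with a timestamp such as
# [Tue Dec 20 20:03:06.099840 2016]
ERROR_LOG_START = re.compile(
    r'^\[[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d '
    r'\d{2}:\d{2}:\d{2}(?:\.\d+)? \d{4}\]'
)

# Access log lines carry a %t timestamp such as [01/Jan/2017:09:00:00 +0000]
ACCESS_LOG_TIME = re.compile(
    r'\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}\]'
)


def classify_line(line):
    """Guess whether a line comes from an error log or an access log."""
    if ERROR_LOG_START.match(line):
        return "error"
    if ACCESS_LOG_TIME.search(line):
        return "access"
    return "unknown"


error_line = (
    "[Tue Dec 20 20:03:06.099840 2016] [ssl:warn] [pid 97765] "
    "AH01873: Init: Session Cache is not configured [hint: SSLSessionCache]"
)
access_line = (
    '66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] '
    '"GET /contact.html HTTP/1.1" 200 250'
)
print(classify_line(error_line), classify_line(access_line))
```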
Importing a log in Common Log Format
As discussed previously, this format doesn’t include the User Agent field, which is integral to the Log File Analyser as it allows you to view traffic by user agent and identify bots. If you attempt to import a file in Common Log Format you’ll see this warning dialog:
To resolve this you’ll have to speak to the site administrator and request that they include the User Agent in their log format. Ideally, ask them to configure the sf_ultimate format, as this provides all the values supported by the Log File Analyser.
Custom log formats that don’t follow convention
The Log File Analyser requires that log lines are formatted in a conventional way. Each token, or value, must be separated by white space, or quoted if it contains white space, to allow for parsing.
This convention is followed by all the standard log file formats mentioned earlier. Each value has a space between it, or if it contains spaces, like the user agent does, is delimited by quotes.
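The convention can be sketched as a small tokeniser in Python. This is an illustration under stated assumptions rather than the Log File Analyser's own parser: the TOKEN_PATTERN and tokenise names are made up here, and treating the bracketed %t timestamp as a single token is an extra assumption, since that value contains a space without being quoted.

```python
import re

# A token is a quoted string, a bracketed timestamp, or a run of
# non-whitespace characters. Backslash-escaped quotes inside a quoted
# string are allowed for.
TOKEN_PATTERN = re.compile(r'"((?:[^"\\]|\\.)*)"|\[([^\]]*)\]|(\S+)')


def tokenise(line):
    """Split a log line into tokens, honouring quotes and brackets."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(line):
        # lastindex is the group that matched: quoted, bracketed or bare
        tokens.append(match.group(match.lastindex))
    return tokens


conventional = (
    '66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] '
    '"GET /contact.html HTTP/1.1" 200 250 "http://www.example.com/" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
# The same line with the user agent left unquoted
unconventional = (
    '66.249.66.1 - - [01/Jan/2017:09:00:00 +0000] '
    '"GET /contact.html HTTP/1.1" 200 250 "http://www.example.com/" '
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
)

print(len(tokenise(conventional)))
print(len(tokenise(unconventional)))
```

With the user agent quoted, the line splits into one token per value. Without the quotes, the user agent breaks into several tokens and there is no reliable way to tell where it begins and ends.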
Experimenting Locally
If you’d really like to get to grips with how to configure log files in Apache, why not download it and experiment? Apache is available for Windows, OS X & Linux. Once set up, you can start tweaking the LogFormat directive by editing the httpd.conf file, using the examples provided here.
If you have any more questions about log files or how to use the Screaming Frog Log File Analyser, then please do get in touch with our support team.