We no longer offer ftp downloads. If there is a file you need referenced here, please contact me by email and I will get it to you.

SQUID Log Analyzer

Many business owners worry that their employees will fritter away time or otherwise make improper use of Internet access. While my experience is that this is really not an issue (any employee you can trust with a telephone can be trusted for web access), I suppose it is only natural to be concerned.

Controlling access and keeping logs aren't the only reasons to use a proxy server. There are also the advantages of shared caching and improved security.

That's where proxy servers like Squid can be valuable. It's possible to block specific sites entirely, or even to allow access only to an approved list of sites. However, in practice, filtering and blocking content is difficult: if only "approved" sites can be accessed, there will be a constant flow of requests to add new sites to the list, and blocking objectionable or inappropriate sites requires even more work, because new ones spring up constantly. Therefore, most owners and managers just let any access pass and hope for the best, with perhaps some attempt at monitoring access.

It's that monitoring that we are going to look at here. Squid creates a log entry for every page requested, so analyzing that log can help our business owner determine something about how employees are using the web. I have found that analysis usually confirms what I have already told them: most employees do not misuse the privilege. Still, people need to see for themselves.
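
As a rough illustration of what each log entry holds, here is a minimal sketch that splits one line of Squid's native access log format into named fields. The sample line is the one quoted in the script's comments further down; the function name parse_squid_line and the hash keys are labels chosen for this sketch only, not anything Squid itself defines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one native-format access.log line into named fields.
# parse_squid_line and the hash keys are illustrative names only.
sub parse_squid_line {
    my ($line) = @_;
    chomp $line;
    my @f = split ' ', $line;          # fields are separated by runs of spaces
    return if @f < 10;                 # incomplete or damaged line
    my ($result, $status) = split m{/}, $f[3], 2;
    return {
        time    => $f[0],    # seconds since the epoch, with milliseconds
        elapsed => $f[1],    # milliseconds spent on the request
        client  => $f[2],    # address of the requesting machine
        result  => $result,  # e.g. TCP_MISS or TCP_IMS_HIT
        status  => $status,  # HTTP status code
        bytes   => $f[4],
        method  => $f[5],
        url     => $f[6],
    };
}

my $sample = '982962881.790      2 192.168.2.114 TCP_IMS_HIT/304 226 GET / - NONE/- text/html';
my $entry  = parse_squid_line($sample);
print "$entry->{client} $entry->{result} $entry->{url}\n";
```

The analyzer below does not rely on field positions like this; it scans each field for a recognizable pattern instead, which is more forgiving of damaged lines.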

There are hundreds, maybe thousands of Squid log file analyzers available on the net. You can go to http://www.squid-cache.org/Scripts and find a representative handful. So why write another one? Well, first of all, I couldn't find exactly what I wanted. Admittedly, I didn't sort through very many, but I quickly realized that most would require reworking for my needs anyway, so why not just write it from scratch? But there was another, more serious reason: many of the scripts I did try couldn't handle large logs: they'd obviously only been tested on small logs (or else all of these folks have machines with half a gig or more of memory!).

So, I wrote one from scratch. It should be able to handle a fairly large log on a fairly weak machine because it uses disk files for part of the work. Using disk rather than RAM lets it handle larger logs, but it also slows execution. Since this analyzer would probably only be run once a day, and after hours, that isn't usually a problem. Note that the script does keep three hashes in memory; a 15 MB, 100K line log file required less than a megabyte of memory to run, but took almost six minutes to complete on an otherwise unused 233 MHz machine.
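
The disk-for-memory trade can be sketched on its own. This is a minimal illustration, not the script itself: each record is appended to a small file named after its key, and only a per-key counter stays in memory. The name spool_records, the sample records and the temporary directory are all assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Append each record to a file named after its first field, keeping only
# a per-key counter in memory. spool_records is a hypothetical name.
sub spool_records {
    my ($dir, @records) = @_;
    my %count;
    foreach my $rec (@records) {
        my ($key) = split /\|/, $rec;      # first field is the key
        open(my $fh, '>>', "$dir/$key")
            or die "Can't append to $dir/$key: $!";
        print $fh "$rec\n";
        close $fh;
        $count{$key}++;
    }
    return \%count;
}

my $dir = tempdir(CLEANUP => 1);           # scratch directory for the demo
my $count = spool_records($dir,
    '2001_02_25|www.example.com|192.168.2.114',
    '2001_02_25|www.example.org|192.168.2.115',
    '2001_02_26|www.example.com|192.168.2.114',
);
foreach my $key (sort keys %$count) {
    print "$key: $count->{$key} entries in $dir/$key\n";
}
```

Opening and closing a file for every record is a large part of why this approach is slow, but memory use then depends on the number of distinct keys rather than the number of log lines.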

I also wrote a different version that uses a .db file instead and doesn't use static files: that's available as ftp://aplawrence.com/pub/squidlog2.tar

One problem these scripts do not address is the size of the HTML pages produced: they can get too large for browsers to handle.

Click to see Sample output from a small log.

You can download this Perl script.

There's also a shell script that can be used to run this from cron: squidlog.sh

#!/usr/bin/perl -w
#  Squid log file analysis February 2001 by Tony Lawrence
#  
#  $log_file is the squid log
#

$log_file = '/var/log/squid/sqlog';

#
#  We need to create our own logs
#

$wlogs="/root/wlog";

#
#  and finally the location of the html files this will create
#

$webloc="/localweb/html/squidlog";

#  If this isn't run off-hours, create them in a temporary location, and
#  wrap this in a shell script that copies them after they are created- 
#  analyzing a large log can take a long time.
#

$www_files = 0;   # Pages not in cache
$files_cached = 0; # Pages from cache

#
# 

($null,$minute,$hour,$day,$month,$year,$null,$null,$null)=localtime(time);
$year = $year + 1900;
$month=$month + 1;

#
# 

open(SQUIDLOG,$log_file) or die "Can't open $log_file: $!";
open(MAININDEX, ">$webloc/index.html") or die "Can't write $webloc/index.html: $!";

#
#  Main index.html for the analysis

print MAININDEX <<EOF;
<HTML><HEAD><TITLE>Squid-Log until $month/$day/$year<\/TITLE>
</HEAD><BODY>
<h2>Web access analysis</h2>
EOF

# It's tempting to just suck the logfile into an array and work with it.
# That would eliminate the need for the sub-logs in $wlogs
# However, Squid log files tend to be VERY large.  If you had tons of
# memory, that would be a quicker and easier method, but typically
# the proxy server machine is a weaker server (it doesn't need
# much to run Squid).
#

while (<SQUIDLOG>) {

#
# For each log file line, we're going to split it up into @line
#
# These variables get set from each element of the @line array
# and reset to blank as each new line is read.

  $internal_ip="";
  $linkvisited="";
  $sitevisited="";
  $linkdate="";
  $human_date="";
  $minute="";
  $hour="";
  $cached_line="";

#
# A log file line looks like this
# 982962881.790      2 192.168.2.114 TCP_IMS_HIT/304 226 GET / - NONE/- text/html
#

  chomp;
  $logfile_line=$_;
  next if $logfile_line !~ /\S/;   # skip blank lines

#
# Fields are separated by one or more spaces
  @line = split ' ', $logfile_line;
#
  foreach $tempvar (@line) {
     if (($tempvar =~ m/^\d+\.\d+$/) and  ( ! $linkdate) ) {
        # This is the date stamp; 982962881.790 
           ($null, $minute, $hour, $day, $month, $year, $null, $null, $null)= localtime($tempvar);
           $month++;
           $year=$year+1900;
        # Zero-pad so dates sort correctly and times display consistently
           if ($minute < 10) { $minute = '0'.$minute;} 
           if ($hour < 10) { $hour = '0'.$hour;} 
           if ($month < 10) { $month = '0'.$month;} 
           if ($day < 10) { $day = '0'.$day;} 
        #
           $logday="${year}_${month}_${day}";
           $accumdate{$logday} += 1;
           $linkdate=$tempvar;
           $human_date=$logday;
     }
  
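     # Local clients are recognized by their address prefix;
     # change 192.168. here if your network uses a different range
     if (($tempvar =~ m/^192\.168\./) and (! $internal_ip) ) {
             $ip = $tempvar;
             $accumip{$ip} += 1;
             $internal_ip=$ip;
     }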
  
     if (($tempvar =~ m/http:\/\//) and (! $linkvisited) )
     {
             $document = $tempvar;
             $document =~ s/.*http:..//;
             $document =~ s/\/.*//;
             $linkvisited=$document;
             #
             # We want just the base domain name
             # if the hit was http://whatever.someplace.com/this.html
             # then $linkvisited is whatever.someplace.com
             # and $sitevisited will be someplace.com
             #
             @comp=split(/\./,$document);
             $cnt2=@comp;
             # a single-label host such as "intranet" has no base domain
             $document = ($cnt2 >= 2) ? "$comp[$cnt2 - 2].$comp[$cnt2 - 1]" : $linkvisited;
             #
             # but sometimes it's just an ip address
             $document=$linkvisited if ($document =~ /^\d+\.\d+$/ );
             #
             $sitevisited=$document;
             $accumurl{"$document"} +=1;
             
     }
  
     if (($tempvar =~ m/^TCP_/) and (!$cached_line) )
     {
             $cache = $tempvar;
             $cached_line = $cache;   # count each log line only once
             if ($cache =~ m/HIT/) {
                $files_cached++;
             }
             if ($cache =~ m/MISS/) {
                 $www_files++;
             }
     }
  }
#
# finished with the line from the squid log, now rewrite to our sub-logs
#
# Sometimes log files are incomplete or otherwise messed up
# We could just skip 'em, but I'd rather know we had a problem
  $human_date="unknown" if not $human_date;
  $linkvisited="unknown" if not $linkvisited;
  $internal_ip="unknown" if not $internal_ip;
  $sitevisited="unknown" if not $sitevisited;
  # compare against "" so that a legitimate 0 (midnight) is not lost
  $minute="unknown" if $minute eq "";
  $hour="unknown" if $hour eq "";
  
  # Append the same record to three sub-logs: by site, by date and by
  # local ip.  Use $sitevisited here, not $document, which may still
  # hold the value from an earlier line.
  $our_log="$human_date|$linkvisited|$internal_ip|$sitevisited|$minute|$hour\n";
  foreach $sublog ($sitevisited, $human_date, $internal_ip) {
     open(DLOG,">>$wlogs/$sublog") or die "Can't append to $wlogs/$sublog: $!";
     print DLOG $our_log;
     close DLOG;
  }
}

# log file is now completely read; generate output


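$total_files = $www_files + $files_cached;
# default to 0 so the printf below is safe when the log was empty
$www_percent = 0;
$cache_percent = 0;
if ($total_files > 0)
{
        $www_percent = $www_files / $total_files * 100;
        $cache_percent = $files_cached / $total_files * 100;
}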

printf MAININDEX "<p>Total: $total_files files.<br>$files_cached ( %6.2f %% ) files cached.<br>$www_files ( %6.2f %% ) files downloaded\n", $cache_percent, $www_percent;
print MAININDEX "<h2>By Date</h2><table>\n";

#
# Now by date
#
foreach $key (sort keys %accumdate) {
   #
   # $key will be like 2001_02_25
   # (a named loop variable keeps the inner read loop from clobbering it)
   #
   $cnt=$accumdate{$key};
   print MAININDEX "<tr><td>$key</td><td><a href=\"$key.html\">$cnt</a></td></tr>\n";
   open(HTMLDETAIL,">$webloc/$key.html") or die "Can't write $webloc/$key.html: $!";
   print HTMLDETAIL "<HTML><HEAD><TITLE>Squid-Log until $month/$day/$year</TITLE>\n";
   print HTMLDETAIL  "</HEAD><BODY><h2>$key</h2><table>\n";
   open(SUBLOG,"$wlogs/$key") or die "Can't read $wlogs/$key: $!";
   while (<SUBLOG>) {
      chomp;
      ($ht,$url,$ip,$site,$min,$hr)=split /\|/;
      print HTMLDETAIL "<tr><td>$ht $hr:$min</td><td>$ip</td><td><a href=\"$site.html\">$site</a></td></tr>\n";
   }
   close SUBLOG;
   print HTMLDETAIL "</table></body></HTML>";
   close HTMLDETAIL;
}
#
# and by local user
#
print MAININDEX "</table><h2>By Local User</h2>";
print MAININDEX "<table>";
foreach $key (sort keys %accumip) {
   #
   # $key will be a local ip address
   # (a named loop variable keeps the inner read loop from clobbering it)
   #
   $cnt=$accumip{$key};
   print MAININDEX "<tr><td>$key</td><td><a href=\"$key.html\">$cnt</a></td></tr>\n";
   open(HTMLDETAIL,">$webloc/$key.html") or die "Can't write $webloc/$key.html: $!";
   print HTMLDETAIL "<HTML><HEAD><TITLE>Squid-Log until $month/$day/$year</TITLE>\n";
   print HTMLDETAIL  "</HEAD><BODY><h2>$key</h2><table>\n";
   open(SUBLOG,"$wlogs/$key") or die "Can't read $wlogs/$key: $!";
   while (<SUBLOG>) {
      chomp;
      ($ht,$url,$ip,$site,$min,$hr)=split /\|/;
      print HTMLDETAIL "<tr><td>$ht $hr:$min</td><td>$ip</td><td><a href=\"$site.html\">$site</a></td></tr>\n";
   }
   close SUBLOG;
   print HTMLDETAIL "</table></body></HTML>";
   close HTMLDETAIL;
}
print MAININDEX "</table><h2>By URL</h2>";
print MAININDEX "<table>";
#
# and then finally by URL
#
foreach $key (sort keys %accumurl) {
   # 
   # $key will be like ibm.com or 204.34.156.9
   # (a named loop variable keeps the inner read loop from clobbering it)
   #
   $cnt=$accumurl{$key};
   print MAININDEX "<tr><td>$key</td><td><a href=\"$key.html\">$cnt</a></td></tr>\n";
   open(HTMLDETAIL,">$webloc/$key.html") or die "Can't write $webloc/$key.html: $!";
   print HTMLDETAIL "<HTML><HEAD><TITLE>Squid-Log until $month/$day/$year</TITLE>\n";
   print HTMLDETAIL  "</HEAD><BODY><h2>$key</h2><table>\n";
   open(SUBLOG,"$wlogs/$key") or die "Can't read $wlogs/$key: $!";
   while (<SUBLOG>) {
      chomp;
      ($ht,$url,$ip,$site,$min,$hr)=split /\|/;
      #
      # this is different from the other sections because we want to be able
      # to actually go to the host that was originally accessed.
      #
      print HTMLDETAIL "<tr><td>$ht $hr:$min</td><td>$ip</td><td><a href=\"http://$url\">$url</a></td></tr>\n";
   }
   close SUBLOG;
   print HTMLDETAIL "</table></body></HTML>";
   close HTMLDETAIL;
}
# all done
print MAININDEX "</table>";
print MAININDEX "</BODY></HTML>\n";
close MAININDEX;
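
The host-to-site step in the script can be tried in isolation. The following is a minimal sketch assuming the same rule the script uses (keep the last two dot-separated labels, and leave bare IP addresses alone); site_for_host is a hypothetical helper name, and the single-label case is an addition for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce a host name to its last two labels, as the analyzer does.
# site_for_host is an illustrative name, not part of the script above.
sub site_for_host {
    my ($host) = @_;
    return $host if $host =~ /^\d+\.\d+\.\d+\.\d+$/;   # bare IP address
    my @comp = split /\./, $host;
    return $host if @comp < 2;                          # e.g. "intranet"
    return "$comp[-2].$comp[-1]";
}

foreach my $host (qw(whatever.someplace.com 204.34.156.9 intranet)) {
    print "$host -> ", site_for_host($host), "\n";
}
```

Keeping only the last two labels is a simplification: a host under a two-part country suffix, such as www.example.co.uk, would be grouped under co.uk.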

© February 2001 A.P. Lawrence. All rights reserved


