Web Log Analysis

There are many tools available to analyze web page statistics. One of the most popular is Analog, but any web search will turn up hundreds or perhaps even thousands more. There are also options like Hitbox (link dead, sorry), which gather statistics through code included in your web pages. I use both of these methods, but I also have my own analysis code because I want specific information that isn't easy to extract from the other tools. For example, I have pages linked (Unix hard links) under multiple names; most log analyzers wouldn't take that into account at all.

What I want to know is how popular certain pages are. Any of the logfile analyzers will do that, of course, but I am only interested in certain pages; I don't care about the index pages or any page that just has links to other areas within my site. I also have certain CGI scripts that are actually identical to other static web pages (the static page includes the script's output through a Server Side Include), and I want those counted as one page, not as a page and a script. There are whole sections that I'm not interested in counting, too, such as Book Reviews and the programming sections. Finally, I want the output to be a web page itself, and I only want to see a set number of pages listed. Perhaps I could have found a tool somewhere that would come close to this, but it just seemed easier to write it myself. The output can be seen at Most Popular Pages.

You'll note a reference to "favicon.ico" here. I think Microsoft started this nonsense of a little .ico graphics file that web browsers try to download - it lets web sites associate an icon with their pages in bookmarks. Thunderbird uses favicon.ico in its tabs, too. Of course you can use Windows to create such an icon (and put it or link it in every subdir of your site), but you can also create your very own favicon.ico with Linux: http://docs.kde.org/en/3.3/kdegraphics/kiconedit/ (link dead, sorry). My favicon.ico is the guy in the boat.

#!/usr/bin/perl5
#
# initialize some totals
$thispage=0; $favicons=0; $totalunique=0; $total=0; $cnt=0;
#
# get the time and date
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);
@month=("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");
#
# localtime returns "100" for the year 2000
$year +=1900;
#
# We're only interested in the current month- this will be used later
$searchfor="$month[$mon]/$year";
#
# This will be the name of a file where we store a summary of the totals
$outfile="$month[$mon]-$year";
#
# The following initializes a hash of names to associate with files.
# There are several ways this could have been done, including
# extracting title information from the file itself.  Perl also
# could easily keep this hash in another file, but I like to have
# it here where I can work on it all in one place.
#
# Only the files I want to track are listed here
# Note that all (other than cgi) are named ".shtml"- I'll explain that later
#
%names= (
"cgi-bin/tester.pl" => "Unix Skills Test",
"scotest.shtml" => "Unix Skills Test",
"linuxtest.shtml" => "Linux Skills Test",
"cgi-bin/ltester.pl" => "Linux Skills Test",
"Unixart/newsserver.shtml" => "Configuring a News Server- Roberto Zini",
"newtosco.shtml" => "New to SCO",
"Jscript/javascriptcal.shtml" => "Javascript Calendar",
"Unixart/terminals.shtml" => "Terminals 101",
"Unixart/winnet.shtml" => "Windows Network Configuration",
"UW/clusters.shtml" => "Unixware 7 Non Stop Clusters",
"Unixart/route.shtml" => "TCP/IP Routing",
"consultants.shtml" => "Consultants List",
"ace.shtml" => "SCO ACE Certification",
"Unixart/net101.shtml" => "Networking 101",
"Unixart/tapes.shtml" => "Tapes and Tape Drives",
"Unixart/moredisk.shtml" => "Adding more disk space",
"Unixart/memory.shtml" => "Adding Memory",
"Unix/passwdtoldap.shtml"=> "Passwd to LDAP server",
"Unixart/serial.art.shtml" => "Serial Wiring",
"Unixart/gui.shtml" => "The Graphical Desktop",
"Unixart/raid.shtml" => "RAID",
"Unixart/nospace.shtml" => "Out of Disk Space",
"Unixart/advtcp.shtml" => "Advanced TCP/IP",
"Unixart/trape.shtml" => "Panic: Trap E",
"Unixart/uucptofetch.shtml" => "Replacing UUCP Mail",
"Unixart/printers.shtml" => "Serial Printers",
"Unixart/jprdiff.shtml" => "J.P. Radley's HPLaserVision Script",
"Unixart/visprint.shtml" => "Visionfs Printing",
"Boot/cpu.shtml" => "Booting OSR5-Your CPU",
"Boot/filesystems.shtml" => "Booting OSR5-Filesystems",
"Boot/swap.shtml" => "Swap and Dump",
"Boot/kernel.shtml" => "Booting OSR5- The Kernel",
"Reviews/qfile.shtml" => "Reviews: Q-File Professional IT Organiser",
"Reviews/communicator47.shtml" => "Reviews: Netscape Communicator 4.7",
"Reviews/dptraid.shtml" => "Reviews: DPT RAID Controllers",
"Reviews/supertars.shtml" => "Reviews: The Supertars",
"Reviews/supertar.shtml" => "Reviews: The Supertars",
"Reviews/portserver.shtml" => "Reviews: Digi PortServer",
"Reviews/alphacom3.shtml" => "Reviews: AlphaCom 3 Terminal Emulator/LPD Service",
"Unixart/visprint.shtml" => "Visionfs Printing",
"Unixart/newdisk.shtml" => "Adding a Second Hard Drive",
"Unixart/printing.shtml" => "Printing",
"Unixart/upgrades.shtml" => "Upgrades",
"Unixart/datatran.shtml" => "Data Transfers",
"Security/cops.shtml" =>  "COPS Security",
"Security/dslsecure.shtml" =>  "DSL and Cable Modem Security with SSH",
"Security/ipfilter.shtml" =>  "IPFILTER Firewalls for OSR5",
"Unixart/netprint.shtml" => "Network Printing",
"Unixart/cidr.shtml" => "Networking-CIDR",
"Unixart/ldap.shtml" => "Networking-LDAP",
"Unixart/newtounix.shtml" => "New to Unix",
"Linux/linrh60.shtml" => "Red Hat Linux 6.0",
"Unixart/inst505.shtml" => "Installing 5.0.5",
"Unixart/hispeed.shtml" => "Configuring High Speed Modems",
"Unixart/network.shtml" => "Installing a Small Office Network",
"Unixart/star.shtml" => "Star Office on OSR5- Roberto Zini",
"Unixart/quickppp.shtml" => "PPP HOWTO",
"Linux/linrh61.shtml" => "Red Hat Linux 6.1",
"Unixart/termcap.shtml" => "Termcap and Terminfo",
"wiz.shtml" => "No Wizards Here",
"Unixart/mail.shtml" => "Unix Mail",
"Unixart/quickppp.shtml" => "PPP HOWTO",
"Opinion/gdunix.shtml" => "Not as Hard as Unix",
"Opinion/religion.shtml" => "Use and Abuse of /usr/local/bin",
"Unixart/driverart.shtml" => "Unix Device Drivers",
#
# end of the list of tracked pages
);
#
# The following section just outputs the first part of the page
print <<EOF;
<HTML>
<HEAD><TITLE>
Most Popular Pages-
A.P. Lawrence, Linux/Unix Consultant
</TITLE>
<STYLE TYPE="text/css">
<!--
P { font-family: Arial,sans-serif; font-weight: normal;}
A { font-family: Arial,sans-serif; font-weight: normal;}
TD { font-family: Arial,sans-serif; font-weight: normal;}
-->
</STYLE>

</HEAD>
<BODY bgcolor="#ffffff">
<p><!--#include virtual="/cgi-bin/logo.pl" -->
<h2>Most Popular Pages at This Site</h2>

<table width="100%">
<tr><td><table>
EOF
#
# set the root for the web pages
$myroot="/usr/home/aplawren/www/";
#
# "access_log" is the web logfile
open(STATS,"$myroot/logs/access_log");
#
# collect stats
while (<STATS>) {
   #
   # I only want the current month
   next if not /$searchfor/;
   #
   # a web log line looks like this:
   #
   # scooby.northernlight.com - - [06/Mar/2000:04:52:30 -0800]
   #   "GET /Links/sites.html HTTP/1.1" 200 3293 "-" "Gulliver/1.3"
   # I'm only interested in some of the fields
   # by default, split works on spaces, so this gives me what I need
   ($domain,$b,$c,$d,$e,$f,$file,$hits,$status,$junk)=split ;
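   #   For the sample line above, that split leaves:
   #     $domain = scooby.northernlight.com
   #     $d      = [06/Mar/2000:04:52:30
   #     $file   = /Links/sites.html
   #     $status = 200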
   #
   # The next two lines just extract the date of the log entry for later use
   $firstdate=$d if not $firstdate;
   $lastdate=$d;
   #
   # These next few lines clean up the file name.  The first thing
   # is to normalize it to be .shtml because, for historical reasons
   # I have links for both suffixes.  After that, I remove leading /'s
   # and correct for some other links.
   next if $status eq "404";
   $file =~ s/.*aplawrence.com.//;
   $file =~ s/.*aplawrence.com.//;
   $file =~ s/\.html/.shtml/;
   $file =~ s/^\///;
   $file =~ s/^\///;
   $file =~ s/supertar/supertars/;
   #
   # These are effectively the same pages
   $file =~ s/cgi-bin\/ltester.pl/linuxtest.shtml/;
   $file =~ s/cgi-bin\/tester.pl/scotest.shtml/;
   #
   # This starts to get into the tricky part
   # If domain hasn't been
   # seen before, then increment totaluniques
   $totalunique++ if not $udom{$domain};
   $udom{$domain}=1;
   #
   # Total pages viewed
   $total++;
   #
   # The cleanup above will  have eliminated "/", which I don't
   # want to count anyway
   next if not $file;
   #
   # Microsoft IE does a Get of "favicon.ico" when someone bookmarks
   # a page.  This is therefore an incomplete count of bookmarks.
   $favicons++ if /favicon.ico/;
   # You have to be careful here because $_ (the default target 
   # for the match below) contains referrer information.  As
   # favicon.ico can't be a referer, this is safe.
   next if /favicon.ico/;
   #
   # The combination of the file name and the domain is a unique
   # hit on that file
   $unique="$file/$domain";
   #
   # "topten" is the "Most Popular Pages" link itself.
   # I want to track hits to this, but I don't want it to
   # display, so it takes a little special handling
   # Note we have to specifically match to $file; we can't use the
   # default $_ because of the referrer data it can contain.
   if ( $file =~ /topten.shtml/ ) {
     $thispage++ if not $tupage{$unique};
     $tupage{$unique}=$file;
     $cpage{$file}++ if not $upage{$unique};
     $upage{$unique}=$file;
   }
   next if ($file =~ /topten.shtml/);
   #
   # If the file isn't in my list, skip it.
   next if not $names{$file};
   #
   # If it is, then increment it if the file/domain has not been seen before
   $cpage{$file}++ if not $upage{$unique};
   $upage{$unique}=$file;
}
# 
# Now we're ready to loop through the collected uniques (cpage)
foreach $i (sort { $cpage{$b} <=> $cpage{$a} } keys %cpage) {
   #
   # skip the topten page
   next if ($i =~ /topten.shtml/);
   #
   # increment counter of lines output
   $cnt++;
   #
   # calculate percentages
   $pctu=$cpage{$i}/$totalunique;
   $pctu *= 100;
   #
   # Add it to the table
   printf "<tr><td align=right>%2d</td><td>
   (%6.2f %% )</td><td> <a
   href=\"/$i\">$names{$i}</a></td></tr>\n",$cnt,$pctu;
   if ($cnt == 15 ) {
        print "\n</table><td><table>\n";
   }
   last if $cnt == 30;
   }
#
# this just neatens up the table if we didn't reach 30, which will happen in
# the first hour or so of the first day of the month
while ($cnt < 30 ) {
  $cnt++;
  printf "<tr><td><br> </td></tr>\n";
}
#
# Get stuff ready for the final lines
$pctu=$thispage/$totalunique;
$pctu *= 100;
$totalunique=commify($totalunique);
$total=commify($total);
$favicons=commify($favicons);
# strip the leading "[" from the dates and turn the first ":" into a space
$firstdate =~ s/^.//;
$lastdate =~ s/^.//;
$firstdate =~ s/:/ /;
$lastdate =~ s/:/ /;
#
# and output it
printf "</table></table><p>Percentage of views from %s
unique visitors $firstdate through  $lastdate (PST).  Main page and other
navigation pages omitted. ",$totalunique;
printf "This page itself represented %5.2f %% of unique hits. ",$pctu;
printf "Total page views for the same period were %s (not unique). ",$total;
printf "This site was bookmarked at least %s times in this period. ",$favicons;
print "<p>\n";
#
# output the summary
open (OUT, ">$outfile");
print OUT "$totalunique $total $favicons\n";
#
# and finally the rest of the page
print <<EOF;
<p>Also see <a href="http://aplawrence.com/analog/">Analog Stats</a>
<br><p>
<br><p>

</BODY>
</HTML>
EOF
sub commify {
        my $text=reverse $_[0];
        $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
        return scalar reverse $text;
}
 

This script runs as a cron job every fifteen minutes to produce the "Most Popular Pages" page. Because it isn't designed to output the entire page in one belch (it could be; I just didn't do it that way), the script that runs it writes to a temporary file and then copies that file to the real location.
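
The wrapper itself is nothing fancy. A minimal sketch (the script names and paths here are made up for illustration, not copied from my setup) would be something like:

#!/usr/bin/perl
# run_topten.pl - called from cron, something like:
#   0,15,30,45 * * * * /usr/home/aplawren/www/bin/run_topten.pl
use File::Copy;
$tmp="/tmp/topten.$$";
$page="/usr/home/aplawren/www/topten.shtml";
# run the stats script and capture its output in a temporary file
open(TMP,">$tmp") or die "can't write $tmp: $!";
print TMP `/usr/home/aplawren/www/bin/popular.pl`;
close(TMP);
# copy the finished file into place so visitors never see a half-built page
copy($tmp,$page) or die "can't copy to $page: $!";
unlink($tmp);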


