If you have a website, your logs are a great source of
information. For complete, in-depth analysis you can't beat
Google Analytics, but you can actually get a lot of quick and
useful stats from the command line with just a few ordinary
tools like grep, sed and sort.
Please don't tell me your website isn't running on Linux or BSD
or that you don't have shell access. It is completely unnecessary
and foolish to be running a website on Microsoft, and if your
problem is that your host won't give you shell access, you need to
find another host.
Let's start with the simplistic approach: I want to extract
log entries for yesterday's "There's something about a Muntz TV" post.
Obviously "grep" can do that; I just cd over to my logs directory
and run something like:
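grep muntz_tv.html access_log

But the raw results aren't quite what I want.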
For one thing, one of those IP addresses is mine. For another,
I'm not interested in the "GET /foo-web.js" lines, which are just
other files loaded along with the page. So my next attempt improves things:
grep "muntz_tv.html HTTP" access_log | sed /22.214.171.124/d
The addition of " HTTP" eliminates everything that wasn't a GET
of the page itself, and the "sed" deletes my IP address from the
results (not my actual IP, by the way - just an illustrative one).
Adding "s/- - .*//" to the "sed" strips everything from each line
except the leading IP address - a log line starts with the client
address followed by " - - ", so that substitution chops off the
rest. Sending the result through "sort -u" eliminates the duplicates.
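Put together, the whole pipeline looks something like this
(a sketch - substitute your own log file name and IP address):

grep "muntz_tv.html HTTP" access_log | sed -e '/22\.214\.171\.124/d' -e 's/- - .*//' | sort -u

Pipe that into "wc -l" and you have a quick count of unique
visitors to the page.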
What if we want to know how many of those people visited other pages?
That's easy enough:
grep "HTTP.*muntz_tv.html" access_log
is the base of it, because that gives us lines where "muntz_tv.html"
appears in the Referer field: the visitor clicked on some link on
that page. However, I need to do more than that. I'll still need
to delete my own IP address and get rid of duplicates, but I also
need to get rid of lines like this: