APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

A simple remote site monitor


(This is part 2 of a two-part article. Part One is here.)

In the last installment we made a simple script for monitoring remote sites. It seemed to work quite well, but after a week or so I knew I was getting too many outage reports for them all to be accurate. Something had to be done to improve the quality of the results.

The first thing I noticed is that sometimes all the sites I was checking were reported down at once. I would get a string of messages and then things would be calm again. Clearly my own connection was having occasional dropouts, rather than mailstarusa.com or my client sites. I reasoned that I should report outages only if machines I knew to be highly reliable were reachable. That is, if I could not ping certain DNS servers then I could safely keep quiet about the rest.


chk1="146.115.8.20"
chk2="198.6.1.5"
ping -c 2 -w 10 $chk1 > /dev/null 2>&1; chk1=$?
ping -c 2 -w 10 $chk2 > /dev/null 2>&1; chk2=$?

if [ $chk1 -gt 0 -a $chk2 -gt 0 ]
       then
       echo "reliable hosts are down. no sites checks performed." `date` >>$logfile
       else
{ ...
}
fi
 

chk1 and chk2 are DNS servers belonging to Ultranet (now RCN) and UUNet and were chosen only because I knew them to be reliable and had their addresses memorized. The exit status of each ping is recorded, and if both are non-zero then the local connection must be down and we can keep quiet about the rest.
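
The same check can be wrapped in a function. This is only a sketch: the name connection_up is made up for illustration, and unlike the snippet above it stops at the first reference host that answers instead of always pinging both.

```shell
#!/bin/bash
# Sketch: succeed if at least one reference host answers a ping.
# connection_up is a hypothetical name, not part of the article's script.
connection_up() {
    local host
    for host in "$@"; do
        # stdout is redirected first, then stderr is sent to the same place
        if ping -c 2 -w 10 "$host" > /dev/null 2>&1; then
            return 0    # one answer is enough to trust the local link
        fi
    done
    return 1            # every reference host failed
}
```

It would be called as `if connection_up 146.115.8.20 198.6.1.5; then ... fi`, with the site checks inside the `if`.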

Interestingly, something had escaped my notice the first time I read the man page for ping. ping -c2 123.123.123.123 pings a site twice and waits endlessly for results, which led me to add -w 10, thinking that I had made ping time out after 2 pings and 10 seconds. This is not quite the case. When the two options are combined, the remote site is pinged about once a second for up to 10 seconds, and the command finishes either when 2 replies have been received (no error is reported) or when the 10 seconds are up (an error is recorded). That means as many as 8 pings can go astray and we are still willing to say the site is up. You could certainly argue that the results are of poor quality as a result. On the other hand, the results seem to match the reality of the situation pretty accurately.
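
To see which way a given check ended, the exit status can be captured and reported. The sketch below uses a hypothetical helper, probe_host, and assumes the exit codes documented for the Linux iputils ping (0 when enough replies arrived, 1 when too few arrived before the deadline, 2 for other errors); other ping implementations may behave differently.

```shell
#!/bin/bash
# Sketch: run one check and report how ping finished.
# probe_host is a hypothetical helper; the status meanings assume iputils ping.
probe_host() {
    local host=$1 status=0
    # -c 2: stop once 2 replies have arrived; -w 10: give up after 10 seconds
    ping -c 2 -w 10 "$host" > /dev/null 2>&1 || status=$?
    case $status in
        0) echo "$host: up" ;;
        1) echo "$host: fewer than 2 replies before the deadline" ;;
        *) echo "$host: ping error (status $status)" ;;
    esac
    return $status
}
```

Calling `probe_host 123.123.123.123` prints one line and returns ping's own status, so it can still be used directly in an `if`.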



This gave good results, but I still got messages about sites being down. I would immediately ping the sites again and they would not be down at all. I decided, after some research, that things on the net were just slow and the packets were not returning before the 10 seconds (-w 10) were up. I carefully read the man pages again. I didn't want to make -w 10 much larger in case the list of sites grew large and cron started the script again before the previous run had finished.
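
One way to guard against overlapping cron runs is a simple lock. The following is a sketch rather than part of the original script: run_exclusive and the lock path are hypothetical, and it relies on mkdir failing when the directory already exists.

```shell
#!/bin/bash
# Sketch: run a command only if no earlier run still holds the lock.
# run_exclusive and the lock directory are assumptions for illustration.
run_exclusive() {
    local lockdir=$1 status=0
    shift
    # mkdir is atomic: it fails if the directory already exists
    if ! mkdir "$lockdir" 2> /dev/null; then
        echo "previous run still active, skipping" >&2
        return 1
    fi
    "$@" || status=$?    # run the real work and remember its status
    rmdir "$lockdir"     # release the lock
    return $status
}
```

The cron entry would then call something like `run_exclusive /tmp/pink.lock /path/to/the/script`. If a run is killed before it reaches rmdir, the stale lock directory has to be removed by hand.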

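There was some reference in the man pages to QoS bits and it turns out that if we set -Q 0x04 we get a more reliable result. (0x04 is the type-of-service value the ping man page lists for reliability; routers along the path are free to ignore it.) This is a good thing, as checking SMS messages while travelling in the geek-mobile is a Bad Thing.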

I also changed the script so that different people could be notified for each site. If my clients are interested in the results I can email them without delay. I edited pink.sites to match, adding an SMS address and a mail address to each line. The script splits each line on whitespace, so the site name must be a single word:

123.123.123.123    Fake_site    [email protected]    [email protected]
pcunix.com    pcunix.com    [email protected]    [email protected]
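
To check how a sites file will be split before trusting it with alerts, a small helper can print the fields as the shell reads them. This is a sketch: list_sites is a hypothetical name, and skipping blank lines and # comments is an addition that the original script does not make. A name containing a space would shift the later fields one column to the right.

```shell
#!/bin/bash
# Sketch: show how each line of a sites file splits into four fields.
# list_sites is a hypothetical helper; comment skipping is an assumption.
list_sites() {
    local file=$1 site name sms mail
    while read -r site name sms mail; do
        case $site in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        echo "site=$site name=$name sms=$sms mail=$mail"
    done < "$file"
}
```

Running `list_sites /usr/lib/pink/pink.sites` prints one line per site, which makes a misplaced address easy to spot.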
 

Finally, here is the whole script - not pretty, but functional:

#!/bin/bash
# ping each site; if there is no response then email a message
logfile="/var/log/pink.log"
smsnotify="[email protected]"
mailnotify="[email protected]"
chk1="146.115.8.20"
chk2="198.6.1.5"

# redirect stdout first, then stderr, so that all ping output is discarded
ping -c 2 -w 10 -Q 0x04 $chk1 > /dev/null 2>&1; chk1=$?
ping -c 2 -w 10 -Q 0x04 $chk2 > /dev/null 2>&1; chk2=$?

if [ $chk1 -gt 0 -a $chk2 -gt 0 ]
then
        echo "reliable hosts are down. no sites checks performed." `date` >>$logfile
else
        {
        cat /usr/lib/pink/pink.sites | while read pingsite pingname smsnotify mailnotify
        do
                ping -c 2 -w 10 -Q 0x04 $pingsite > /dev/null 2>&1
                # non-zero exit status - a bad ping
                if [ $? -gt 0 ]
                then
                        echo no reply from $pingsite \($pingname\) on `date` >>$logfile
                        echo $pingname $pingsite "Alert" `date` | mail -s "$pingname" $mailnotify
                        echo no reply from $pingname $pingsite | mail -s "$pingname" $smsnotify
                else
                        touch "$logfile"
                fi
        done
        }
fi

Here is a bit of the log file:
no reply from mydomain.com (mydomain) on Fri Mar 21 20:01:06 EST 2003
reliable hosts are down. no sites checks performed. Mon Mar 24 05:15:17 EST 2003
no reply from mydomain.com (mydomain) on Mon Mar 24 18:45:18 EST 2003
 

Using this script I recorded results over a period of a few weeks and noted that my customer had far more dropouts, and of longer duration, than my mail server did. I was able to show the results to my client and suggest that they contact their DSL provider.
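
A comparison like this can be pulled out of the log with a few lines of awk. The sketch below is an illustration, not something from the original setup: count_outages is a hypothetical helper and it assumes the "no reply from SITE (NAME) on DATE" line format shown in the log excerpt.

```shell
#!/bin/bash
# Sketch: count "no reply" log entries per site, busiest first.
# count_outages is a hypothetical helper; it assumes the article's log format.
count_outages() {
    awk '$1 == "no" && $2 == "reply" { n[$4]++ }
         END { for (s in n) print n[s], s }' "$1" | sort -rn
}
```

`count_outages /var/log/pink.log` prints a count and a site per line. It counts failed checks rather than measuring duration, although a run of consecutive entries for one site suggests a longer outage.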

Editor's note: If you really want to test connectivity, and want the script to be able to tell you where the problems are when it is lacking, you need a more procedural approach. See Testing for network connectivity in a script






More Articles by © Dirk Hart




---August 27, 2004

Very useful.
Been using it with '-c 5' and '-w 5' and found that a return of
0 = site responds to ping
1 = site didn't respond
and interestingly if the DNS can't be resolved it returns 2

thevicar.
