(This is part 2 of a two-part article. Part One is here.)
In the last installment we made a simple script for monitoring remote sites. It seemed to work quite well, but after a week or so I knew I was getting too many outage reports for them all to be accurate. Something had to be done to improve the quality of the results.
The first thing I noticed is that sometimes all the sites I was checking were reported down at once. I would get a string of messages and then things would be calm again. Clearly my own connection was having occasional dropouts, rather than mailstarusa.com or my client sites. I reasoned that I should report outages only if machines I knew to be highly reliable were reachable. That is, if I could not ping certain DNS servers then I could safely keep quiet about the rest.
chk1="146.115.8.20"
chk2="198.6.1.5"
# discard both stdout and stderr, then remember the exit status
ping -c 2 -w 10 $chk1 > /dev/null 2>&1; chk1=$?
ping -c 2 -w 10 $chk2 > /dev/null 2>&1; chk2=$?
if [ $chk1 -gt 0 -a $chk2 -gt 0 ]
then
    echo "reliable hosts are down. no sites checks performed." `date` >>$logfile
else
{ ...
}
fi
chk1 and chk2 are DNS servers belonging to Ultranet (now RCN) and UUNet and were chosen only because I knew them to be reliable and had their addresses memorized. Basically the exit status of each ping is recorded, and if both are non-zero then the local connection must be down and we can keep quiet about the rest.
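The same check can be sketched as a small function that accepts any number of reference hosts. This is only an illustration: the function name all_reference_hosts_down and the PING_CMD override (there so the logic can be tried without a network) are assumptions, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper, not from the original script.
# PING_CMD is an assumed hook that lets a stand-in replace the real ping.
PING_CMD="${PING_CMD:-ping}"

# Succeeds (returns 0) only when none of the given reference hosts answers.
all_reference_hosts_down() {
    for host in "$@"; do
        # A single reachable reference host is enough to trust the local link.
        if "$PING_CMD" -c 2 -w 10 "$host" > /dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# Usage sketch:
#   if all_reference_hosts_down 146.115.8.20 198.6.1.5; then
#       echo "reliable hosts are down. no sites checks performed." `date` >>$logfile
#   fi
```

Adding a third reference host is then a matter of adding one more argument rather than another variable and another test.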
Much to my interest, something had escaped my notice the first time I read the man page for ping. ping -c2 123.123.123.123 pings a site twice and waits endlessly for results - this led me to add -w 10, thinking that I had made ping time out after 2 pings and 10 seconds. This is not quite the case. What happens when these two parameters are combined is that the remote site is pinged repeatedly, about once a second, and the command finishes either when 2 replies are received (no error is reported) or when 10 seconds are up (an error is recorded). That means as many as 8 pings can go astray and we are still willing to say the site is up. You could certainly make the argument that the results are of poor quality as a result. On the other hand the results seem to match the reality of the situation pretty accurately.
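Since the script depends entirely on ping's exit status, it may help to spell the values out. The sketch below assumes the iputils ping found on most Linux systems, whose man page documents 0 for success, 1 for no replies (or fewer than the requested count before the deadline) and 2 for other errors; other ping implementations may differ. The function name describe_ping_status is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: translate a ping exit status into a short label.
# The meanings assume iputils ping; check your own man page.
describe_ping_status() {
    case "$1" in
        0) echo "up" ;;        # enough replies arrived in time
        1) echo "no reply" ;;  # too few replies before the deadline
        2) echo "error" ;;     # some other failure, e.g. a bad host name
        *) echo "unknown" ;;
    esac
}

# Usage sketch:
#   ping -c 2 -w 10 "$pingsite" > /dev/null 2>&1
#   echo "$pingsite: `describe_ping_status $?`"
```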
This gave good results, but I still got messages about sites being down. I would immediately ping the sites again and they would not be down at all. I decided, after some research, that things on the net were just slow and the replies were still not returning before the 10 seconds (-w 10) were up. I carefully read the man pages again. I didn't want to make -w 10 much larger in case the list of sites grew large and cron kicked the script again before it had finished.
There was some reference in the man pages to QoS bits, and it turns out that if we set -Q 0x04 (the Type of Service value that requests high reliability) we get a more reliable result. This is a good thing, as checking SMS messages while travelling in the geek-mobile is a Bad Thing.
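One way to guard against cron starting a second copy while the first is still running is a simple lock. The following is a minimal sketch, not something the original script does: it relies on mkdir being atomic, and the names LOCKDIR, acquire_lock and release_lock, as well as the /tmp/pink.lock path, are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical lock helpers; the path is an assumed default.
LOCKDIR="${LOCKDIR:-/tmp/pink.lock}"

# mkdir fails if the directory already exists, so only one caller can win.
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

# Remove the lock so the next run can proceed.
release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Usage sketch near the top of the monitoring script:
#   acquire_lock || exit 0      # another run is still busy, so skip this one
#   trap release_lock EXIT      # clean up when the script finishes
```

If the script is killed hard the lock directory stays behind and has to be removed by hand, which is the price of keeping the sketch this small.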
I also changed the script so that different people could be notified for each site. If my clients are interested in the results I can email them without delay. I edited pink.sites to match:
123.123.123.123 Fake_site 5085551212@vtext.com monitor@yourdomain.com
pcunix.com pcunix.com 5085551212@vtext.com monitor@pcunix.com
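Each line of pink.sites is split on whitespace into four fields: address, name, SMS address and mail address. The sketch below shows how the shell's read assigns them; the function name show_site_fields and the file passed to it are hypothetical. Note that read hands any leftover words to the last variable, so a name containing a space pushes every later field one place to the right.

```shell
#!/bin/bash
# Hypothetical helper: print how read splits each line of a pink.sites-style file.
show_site_fields() {
    while read -r pingsite pingname smsnotify mailnotify; do
        # Skip blank lines so they do not show up as empty entries.
        [ -z "$pingsite" ] && continue
        echo "site=$pingsite name=$pingname sms=$smsnotify mail=$mailnotify"
    done < "$1"
}

# Usage sketch:
#   show_site_fields /usr/lib/pink/pink.sites
```

Running it over the sites file before trusting the monitor is a quick way to spot a line whose fields have slipped.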
Finally, here is the whole script - not pretty, but functional:
#!/bin/bash
# ping a site; if no response then email a message
logfile="/var/log/pink.log"
smsnotify="5085551515@vtext.com"
mailnotify="monitor@mailstarusa.com"
chk1="146.115.8.20"
chk2="198.6.1.5"

# check the reliable hosts first; discard all output and keep the exit status
ping -c 2 -w 10 -Q 0x04 $chk1 > /dev/null 2>&1; chk1=$?
ping -c 2 -w 10 -Q 0x04 $chk2 > /dev/null 2>&1; chk2=$?

if [ $chk1 -gt 0 -a $chk2 -gt 0 ]
then
    echo "reliable hosts are down. no sites checks performed." `date` >>$logfile
else
{
    # each line of pink.sites: address name sms-address mail-address
    cat /usr/lib/pink/pink.sites|while read pingsite pingname smsnotify mailnotify
    do
        ping -c 2 -w 10 -Q 0x04 $pingsite > /dev/null 2>&1
        # non-zero exit status - a bad ping
        if [ $? -gt 0 ]
        then
            echo no reply from $pingsite \($pingname\) on `date` >>$logfile
            echo $pingname $pingsite "Alert" `date`| mail -s "$pingname" $mailnotify
            echo no reply from $pingname $pingsite | mail -s "$pingname" $smsnotify
        else
            touch "$logfile"
        fi
    done
}
fi

Here is a bit of the log file:

no reply from mydomain.com (mydomain) on Fri Mar 21 20:01:06 EST 2003
reliable hosts are down. no sites checks performed. Mon Mar 24 05:15:17 EST 2003
no reply from mydomain.com (mydomain) on Mon Mar 24 18:45:18 EST 2003
Using this script I recorded results over a period of a few weeks and noted that my customer had far more dropouts and of longer duration than did my mailserver. I was able to show the results to my client and suggest that they contact their DSL provider.
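To compare sites over a few weeks, the log can be boiled down to a count of dropouts per site. This is a rough sketch that assumes the log lines look exactly like the "no reply from" entries the script writes; the function name count_dropouts is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count "no reply" entries per site in a pink-style log.
# Assumes each entry starts with "no reply from <site>", so the site is field 4.
count_dropouts() {
    awk '/^no reply from / { n[$4]++ } END { for (s in n) print n[s], s }' "$1" | sort -rn
}

# Usage sketch:
#   count_dropouts /var/log/pink.log
```

The output lists the worst offender first, which is handy when the point is to show a client how their line compares with a known good one. It only counts alerts, though; judging the duration of an outage would mean looking at the timestamps of consecutive entries.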
More Articles by Dirk Hart
---August 27, 2004
Very useful.
Been using it with 'c 5' & 'w 5' and found that a return of 0 = site responds to ping
1 = site didn't respond
and interestingly if the DNS can't be resolved it returns 2
thevicar.