[snips-users] DataAge expires--host is gone (with fix)
Hello,

Todd Edmands has discovered, well, we think it is a bug.

Here are the symptoms:

One of his hostmon clients stopped being reachable on the net. It happened during a time when his "notifier" doesn't run frequently. Before the next run of notifier, sufficient time had passed that ALL variables associated with the host had expired. So there was no notification and when looking at either snipstv or the web interface, there was no longer any sign of the host. Effectively, the host had disappeared without any notification.

Investigation:
--------------

Ok, so we could tune our notifier runs to make sure that the notification wouldn't be lost, but on talking it over, we felt it might be more useful if something would remain visible in the interactive SNIPS displays.

We noticed that the DataAge variable is specially created within hostmon on the main SNIPS host. It is calculated for each host on each run of hostmon. The DataAge variable's _value_ is obviously the age of the data from the hostmon-client run. The _age_ of the DataAge variable also gets set to approximately the same value. This means that the DataAge variable for a missing host will expire at the same time as all of the old data for that host expires. And this is the behavior that Todd saw.

But the DataAge variable is calculated during each hostmon run. The _age_ of the variable should be set to approximately 0, since by the time we are testing ages only a few seconds will have elapsed since DataAge was calculated. So conceptually, it should not be possible for the DataAge variable to ever expire. For a host which is out of communication, the value of DataAge should continuously increase, but that value is freshly calculated on every run of hostmon.

Implications:
-------------

Let's assume that we make this change. How does the behavior change?

A host goes down. In the next run of hostmon, none of the variables being monitored for that host get updated, with the exception of DataAge.

After $OLD_AGE, all the variables for that host get marked "old," except for DataAge (because it is fresh). Because DataAge has a threshold of $OLD_AGE (hard coded in hostmon) it should be going Critical around about this time.

After $EXPIRE_AGE, all the variables for that host get marked "nodisplay" (if you're using our earlier fix for "empty devicename"), except for DataAge because it's still fresh. DataAge is now the only variable for that host that gets displayed.

In other words, DataAge will never expire. If a host goes missing, DataAge will go Critical and remain that way forever until you do something about it, like fix the connectivity, restart hostmon-client on the host, or change the hostmon-conf file to remove the host. You have to "hup" or restart hostmon if you remove a host and at that point the DataAge for that host is gone.

Fixing:
-------

This is a design change more than a fix. We think it's better to have the DataAge staying critical until user intervention than to have a host completely disappear.

The nice thing is that it's a simple change to make. Below is the context diff for the change. It's mostly comment to make it clearer what it is doing. Note that this diff is based on the "empty device" fix so you might have to apply this by hand if you aren't using that fix.

Context diff
===========Cut Here======================================
*** hostmon 2003-12-18 10:24:50.000000000 -0700
--- hostmon.new 2003-12-17 16:59:34.000000000 -0700
***************
*** 392,397 ****
--- 392,404 ----
      }
      $timestamp{$item} = 0 if (! defined($timestamp{$item}) );
      my $age = $curtime - $timestamp{$item};
+     # To avoid expiring DataAge entries, set their $age to 0.
+     # DataAge events are represented by $item values
+     # that have only $curdev (= short hostname) instead of
+     # $curdev\t$curvar\t$comment
+     if (index($item, "\t") < $[) {
+         $age = 0;
+     }
      # print STDERR "Age for $item is $age secs\n";
      # Previous code used alter_event to blank fields in the record for
===========Cut Here======================================

-- 
Anthony Vealé
National Snow and Ice Data Center
E-Mail: veale at nsidc org
Phone: (303)735-5069
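
For anyone who wants to see the effect of the patch without touching a live hostmon, here is a minimal standalone Perl sketch of the age logic. It is not code from hostmon: the helper names item_age and display_state, the threshold values, the host name, and the sample variable name are all made up for illustration, and the exact comparisons hostmon uses for "old" and "nodisplay" are assumed. The tab test is written as index(...) < 0, which matches the patch's comparison against $[ as long as the array base is left at its default of 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical thresholds in seconds; the real $OLD_AGE and $EXPIRE_AGE
# are defined inside hostmon and may differ.
my $OLD_AGE    = 15 * 60;
my $EXPIRE_AGE = 60 * 60;

# Hypothetical helper mirroring the patched age calculation.
# Items keyed as "$curdev\t$curvar\t$comment" are ordinary variables;
# items keyed by the short hostname alone (no tab) are DataAge entries
# and are always treated as fresh.
sub item_age {
    my ($item, $timestamp, $curtime) = @_;
    return 0 if index($item, "\t") < 0;
    $timestamp = 0 unless defined $timestamp;
    return $curtime - $timestamp;
}

# Hypothetical helper: map an age to the display states described in
# the message (the boundary comparisons are an assumption).
sub display_state {
    my ($age) = @_;
    return 'nodisplay' if $age > $EXPIRE_AGE;
    return 'old'       if $age > $OLD_AGE;
    return 'fresh';
}

my $curtime   = time;
my $last_seen = $curtime - 2 * 60 * 60;    # host silent for two hours
my %timestamp = (
    "myhost\tLoad5\tsample" => $last_seen, # ordinary monitored variable
    "myhost"                => $last_seen, # DataAge entry for the host
);

for my $item (sort keys %timestamp) {
    my $age = item_age($item, $timestamp{$item}, $curtime);
    (my $label = $item) =~ s/\t/ /g;       # make the key printable
    printf "%-25s age=%5d state=%s\n", $label, $age, display_state($age);
}
```

With the host silent for two hours, the ordinary variable ends up past the expiry threshold while the hostname-only DataAge entry still reports an age of 0, which is the behavior the patch is after.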