     [snips-users] DataAge expires--host is gone (with fix)

Hello,

Todd Edmands has discovered what we think is a bug.  Here are
the symptoms:

One of his hostmon clients stopped being reachable on the net.
It happened during a time when his "notifier" doesn't run frequently.
Before the next run of notifier, sufficient time had passed that
ALL variables associated with the host had expired.

So there was no notification, and neither snipstv nor the web
interface showed any sign of the host any longer.

Effectively, the host had disappeared without any notification.

Investigation:
--------------
Ok, so we could tune our notifier runs to make sure that the
notification wouldn't be lost, but on talking it over, we felt
it might be more useful if something would remain visible in the
interactive SNIPS displays.

We noticed that the DataAge variable is specially created within
hostmon on the main SNIPS host.  It is calculated for each host on
each run of hostmon.  The DataAge variable's _value_ is obviously
the age of the data from the hostmon-client run.

The _age_ of the DataAge variable also gets set to approximately
the same value.  This means that the DataAge variable for a missing
host will expire at the same time as all of the old data for that
host expires.  And this is the behavior that Todd saw.

But the DataAge variable is calculated during each hostmon run.
The _age_ of the variable should be set to approximately 0, since
by the time we are testing ages only a few seconds will have elapsed
since DataAge was calculated.

So conceptually, it should not be possible for the DataAge variable
to ever expire.  For a host which is out of communication, the
value of DataAge should continuously increase, but that value is
freshly calculated on every run of hostmon.
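
To make the distinction between the _value_ and the _age_ concrete,
here is a minimal Perl sketch.  It is not the hostmon code: the sub
data_age_entry and its arguments are hypothetical names used only for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of hostmon: returns (value, age) for a
# host's DataAge entry.  $last_report is when the last hostmon-client
# data for the host was produced.
sub data_age_entry {
    my ($curtime, $last_report) = @_;
    my $value = $curtime - $last_report;  # grows while the host is silent
    my $age   = 0;                        # computed on this run, so fresh
    return ($value, $age);
}

my ($value, $age) = data_age_entry(10_000, 4_000);
print "DataAge value=$value age=$age\n";
```

The value keeps growing for a silent host, while the age stays at 0
because the entry itself was produced on the current run.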

Implications:
-------------
Let's assume that we make this change.  How does the behavior change?

A host goes down.  In the next run of hostmon, none of the variables
being monitored for that host get updated, with the exception
of DataAge.

After $OLD_AGE, all the variables for that host get marked "old,"
except for DataAge (because it is fresh).  Because DataAge has a
threshold of $OLD_AGE (hard coded in hostmon), it should go
Critical at about this time.

After $EXPIRE_AGE, all the variables for that host get marked
"nodisplay" (if you're using our earlier fix for "empty devicename"),
except for DataAge because it's still fresh.  DataAge is now the
only variable for that host that gets displayed.

In other words, DataAge will never expire.  If a host goes missing,
DataAge will go Critical and stay that way until you do
something about it, like fix the connectivity, restart hostmon-client
on the host, or change the hostmon-conf file to remove the host.
If you remove a host, you have to "hup" or restart hostmon, and at
that point the DataAge for that host is gone.
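
The timeline above can be sketched in Perl as follows.  This is only
an illustration under stated assumptions: the numeric values of
$OLD_AGE and $EXPIRE_AGE, the subs display_state and dataage_severity,
and the "ok" label are all made up here, and the sketch assumes the
last client data arrived just as the host went down.

```perl
use strict;
use warnings;

# Assumed values for illustration only; real values come from the
# local SNIPS setup.
my $OLD_AGE    = 1800;    # seconds
my $EXPIRE_AGE = 3600;    # seconds

# Classify an entry by its age (hypothetical helper).
sub display_state {
    my ($age) = @_;
    return 'nodisplay' if $age > $EXPIRE_AGE;
    return 'old'       if $age > $OLD_AGE;
    return 'fresh';
}

# Severity of DataAge from its value, with $OLD_AGE as the threshold.
sub dataage_severity {
    my ($value) = @_;
    return $value > $OLD_AGE ? 'Critical' : 'ok';
}

for my $elapsed (600, 2000, 4000) {
    my $others  = display_state($elapsed);  # not updated since host went down
    my $dataage = display_state(0);         # recalculated on this run
    my $sev     = dataage_severity($elapsed);
    print "t=$elapsed others=$others DataAge=$dataage/$sev\n";
}
```

With these assumed thresholds, the other variables move from fresh to
old to nodisplay, while DataAge stays displayable and goes Critical
once its value passes $OLD_AGE.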

Fixing:
-------
This is a design change more than a fix.  We think it's better to
have DataAge stay Critical until user intervention than to
have a host completely disappear.

The nice thing is that it's a simple change to make.  Below is
the context diff for the change.  It's mostly comments to make it
clearer what the code is doing.  Note that this diff is based on the
"empty device" fix, so you might have to apply this by hand if you
aren't using that fix.

Context diff
===========Cut Here======================================
*** hostmon	2003-12-18 10:24:50.000000000 -0700
--- hostmon.new	2003-12-17 16:59:34.000000000 -0700
***************
*** 392,397 ****
--- 392,404 ----
      }
      $timestamp{$item} = 0 if (! defined($timestamp{$item}) );
      my $age = $curtime - $timestamp{$item};
+     # To avoid expiring DataAge entries, set their $age to 0.
+     # DataAge events are represented by $item values
+     # that have only $curdev (= short hostname) instead of
+     # $curdev\t$curvar\t$comment
+     if (index($item, "\t") < $[) {
+       $age = 0;
+     }
      # print STDERR "Age for $item is $age secs\n";
  
      # Previous code used alter_event to blank fields in the record for
===========Cut Here======================================
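
For anyone applying the change by hand, the test in the patch can be
tried on its own.  The following is a standalone sketch, not an
excerpt from hostmon: the sub is_dataage_item and the sample keys are
hypothetical.  It compares against 0 rather than $[, which is
equivalent as long as $[ has its default value of 0; newer Perl
releases deprecate $[.

```perl
use strict;
use warnings;

# True when $item contains no tab, i.e. it is a bare device name.
# Per the patch comment, that is how DataAge entries are keyed.
sub is_dataage_item {
    my ($item) = @_;
    return index($item, "\t") < 0;   # index() gives -1 when no tab is found
}

# Hypothetical keys in the two formats named in the patch comment.
my @items = ("alpha", "alpha\tLoad5\tload average");

for my $item (@items) {
    my $age = 5000;                       # pretend the stored timestamp is stale
    $age = 0 if is_dataage_item($item);   # same effect as the patched branch
    (my $shown = $item) =~ s/\t/|/g;      # make the tabs visible when printing
    print "$shown => age $age\n";
}
```

Only the key without tabs has its age reset to 0; the three-field key
keeps the age derived from its timestamp.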

-- 
Anthony Vealé
National Snow and Ice Data Center
E-Mail: veale at nsidc org
Phone: (303)735-5069
