     Re: Monitoring the monitors

On Sat, Jul 10, 1999 at 06:11:41PM -0400, TTSG wrote:
> Hi,
> 	We just ran into a problem where a machine failed, and it caused
> hostmon to "lock".  The last thing it appears to be doing was an rcp of
> the file from the machine that failed.  We fixed the machine last Friday
> and didn't check NOCOL.  Since then, 2 other machines had the hostmon
> monitor die, which we didn't know.  Even worse, though, was that another
> machine went down and we didn't find out until we saw other indications.
> 	I hate to do this, but is there something we can do to monitor the
> monitors?  (We'd of course have a monitor monitor). 


Perhaps, once every ten minutes or so, look at the most recently modified
log (with -l, ls prints a "total" line first, hence head -2):

   ls -lt ~nocol/logs | head -2

...if no file has changed "recently," either you have a fairly docile
network or something's wrong with the monitoring. (hostmon in particular
logs CPU idle time and context switches every few minutes, and it's rather
rare that the relative load on my server(s) doesn't change at least
slightly.)
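The same check can be scripted for cron. This is only a sketch: it assumes the
logs live in one directory and that your find(1) supports -mmin (GNU and BSD
find do; strict POSIX find does not). The script name and the /home/nocol/logs
path in the comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical "monitor monitor" for NOCOL: complain if no regular file
# under the log directory has been modified in the last N minutes.
# Assumes find(1) supports -mmin (GNU and BSD find do; POSIX does not).

# logs_are_fresh DIR MINUTES
# Exit status 0 if some file in DIR changed less than MINUTES minutes ago.
logs_are_fresh() {
    dir=$1
    minutes=$2
    # -mmin -N matches files modified less than N minutes ago;
    # keep only the first match, we just need to know one exists.
    recent=$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)
    [ -n "$recent" ]
}

# check_nocol_logs DIR [MINUTES]
# Print a warning on stderr and return 1 when DIR looks stale.
check_nocol_logs() {
    dir=$1
    minutes=${2:-10}
    if logs_are_fresh "$dir" "$minutes"; then
        return 0
    fi
    echo "WARNING: nothing in $dir modified in the last $minutes minutes;" \
         "a NOCOL monitor may be hung." >&2
    return 1
}

# Only run the check when given a directory, e.g.
#   check-nocol-logs.sh /home/nocol/logs 10
if [ "$#" -gt 0 ]; then
    check_nocol_logs "$@"
fi
```

Run it from cron every ten minutes (minute field 0,10,20,30,40,50); most cron
daemons mail any output to the crontab's owner, so the warning doubles as the
alert. Note that it only notices when *all* logging stops: if one monitor
hangs while others keep writing, the directory still looks fresh.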

So now we need a monitor-monitor monitor, right?  *chuckle*

> 	In an unrelated story....... Is there a way to keep "X" previous copies
> of hostmon output?  We sometimes don't catch a situation that would have
> really been great to see the hostmon information until after it's refreshed
> it.  Only keep the last "X" rolling copies of the config per machine.

Well, a "painful" way to rebuild the data may be to:

  grep -h '\[hostmon\]' ~nocol/logs/* | sort

I know... perhaps not the most elegant, but...
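If you'd rather keep the last X copies directly, a small rotation script run
after each hostmon pass (or from cron) would do it. This is a sketch, not
anything NOCOL ships: the function name and the hostmon-output path in the
comments are hypothetical, so point it at wherever your installation writes
hostmon's data file.

```shell
#!/bin/sh
# rotate_copy FILE KEEP  (hypothetical helper, not part of NOCOL)
# Keep the last KEEP snapshots of FILE as FILE.1 (newest) .. FILE.KEEP
# (oldest). The oldest snapshot is overwritten on each call.
rotate_copy() {
    file=$1
    keep=$2
    [ -f "$file" ] || return 1
    # Shift FILE.(n-1) -> FILE.n, starting from the oldest slot.
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -f "$file.$prev" ]; then
            mv -f "$file.$prev" "$file.$i"
        fi
        i=$prev
    done
    # cp rather than mv, so the monitor's live file stays in place.
    cp -p "$file" "$file.1"
}

# e.g.  rotate-hostmon.sh /path/to/hostmon-output 24
if [ "$#" -gt 0 ]; then
    rotate_copy "$@"
fi
```

Run hourly with KEEP set to 24, for example, it keeps roughly a day of history.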


