     [nocol-users] Re:(nocol-users) Sorting & Updating


I like the idea of sorting by the time they went down. It would seem,
though, that NOCOL reads the separate data files one by one, so I'm not
sure how easy that would be to program.
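Here is one possible approach, sketched in Python. Once the front end has gathered the events from all the separate data files into a single list, sorting by outage time is only a small change from sorting by name. The `DeviceEvent` record and its `down_since` field (a Unix timestamp) are hypothetical stand-ins for whatever the web interface actually reads. They are not NOCOL's own data structures.

```python
from dataclasses import dataclass


@dataclass
class DeviceEvent:
    """Hypothetical row in the Critical/Error/Warning view."""
    name: str
    down_since: int  # Unix timestamp when the device entered this state (assumed field)


def sort_by_down_time(events, newest_first=False):
    """Order events by when they went down rather than alphabetically."""
    return sorted(events, key=lambda e: e.down_since, reverse=newest_first)


if __name__ == "__main__":
    view = [
        DeviceEvent("mailhost", 970090000),
        DeviceEvent("core-switch", 970089000),
        DeviceEvent("webserver2", 970089500),
    ]
    for event in sort_by_down_time(view):
        print(event.name, event.down_since)
```

Because Python's `sorted` is stable, devices that went down in the same second keep whatever order (e.g. alphabetical) they were collected in.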

It might also be nice to display how long a device has been down.
For example, after more than 24 hours, instead of the NOCOL view
showing 18:32, it could change to something like 1d5h30m.
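Below is a rough Python sketch of that display rule. It assumes the view has the event's Unix timestamp available. The function names and the 24-hour cutover are illustrative, not part of NOCOL.

```python
import time


def format_downtime(seconds: int) -> str:
    """Render an outage length compactly, e.g. 106200 -> '1d5h30m'."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes = rem // 60
    parts = []
    if days:
        parts.append(f"{days}d")
    if days or hours:
        parts.append(f"{hours}h")
    parts.append(f"{minutes}m")
    return "".join(parts)


def display_time(event_time: int, now: int) -> str:
    """Show the clock time for recent events, elapsed time for older ones."""
    elapsed = now - event_time
    if elapsed < 86400:  # under 24 hours: keep the familiar HH:MM display
        return time.strftime("%H:%M", time.localtime(event_time))
    return format_downtime(elapsed)


if __name__ == "__main__":
    print(format_downtime(106200))  # 1d5h30m
```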

On the commenting-out question: the ability to keep some sort of
primitive dependency list in NOCOL, telling it that devices B, C and D
all depend on switch A (or even grouping them by address or subnet),
would be good but hard to implement.

What you could do is create a list of hosts and their dependencies:

172.20.0.206:ALL:192.168.170
172.20.0.2:ALL:192.168.220,192.168.210,192.168.100
172.20.0.2:novellmon:ALL
172.20.0.2:ippingmon:192.168.190
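Reading a file like that is straightforward. The following is a minimal Python sketch. It assumes one `device:monitor:targets` entry per line, where `device` is the box the others depend on, `monitor` is an agent name such as novellmon (or ALL), and `targets` is a comma-separated list of address prefixes (or ALL). That reading of the fields is an assumption based on the example, and `Dependency` and `parse_dependencies` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Dependency(NamedTuple):
    parent: str               # address of the device the others depend on
    monitor: str              # NOCOL agent name, or "ALL"
    targets: Tuple[str, ...]  # dependent address prefixes, or ("ALL",)


def parse_dependencies(text: str) -> List[Dependency]:
    """Parse 'parent:monitor:prefix,prefix,...' lines, skipping blanks and # comments."""
    deps = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected parent:monitor:targets, got {line!r}")
        parent, monitor, targets = (f.strip() for f in fields)
        deps.append(Dependency(parent, monitor, tuple(t.strip() for t in targets.split(","))))
    return deps


if __name__ == "__main__":
    example = """\
172.20.0.206:ALL:192.168.170
172.20.0.2:ALL:192.168.220,192.168.210,192.168.100
172.20.0.2:novellmon:ALL
172.20.0.2:ippingmon:192.168.190
"""
    for dep in parse_dependencies(example):
        print(dep)
```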

Then write a script, fired by noclogd on a critical event, which
automatically comments the dependent entries out. (Possibly you
could use some sort of #include file or "make" to munge your
config files easily for different sites?)
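The commenting-out half of such a script might look like the following Python sketch. It leaves out how noclogd would invoke it and how the config files are read and written back, since those details aren't covered here. It assumes each monitored host sits on its own config line containing its IPv4 address, and that `#` starts a comment. The `#DEPDOWN ` tag and the function names are hypothetical.

```python
import re

MARKER = "#DEPDOWN "  # hypothetical tag, so a later script can undo only these edits
IPV4 = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def under_prefix(address, prefixes):
    """True if address is covered by a prefix such as '192.168.170', or by 'ALL'."""
    return any(p == "ALL" or address == p or address.startswith(p + ".")
               for p in prefixes)


def comment_out(config_lines, parent, prefixes):
    """Comment out config lines whose IPv4 address depends on `parent`.

    Assumes each monitored host's line contains its IPv4 address. The parent
    itself is never commented out, so NOCOL keeps polling it and can notice
    when it comes back (this matters for entries like 'novellmon:ALL')."""
    result = []
    for line in config_lines:
        match = IPV4.search(line)
        if (match and not line.lstrip().startswith("#")
                and match.group(1) != parent
                and under_prefix(match.group(1), prefixes)):
            result.append(MARKER + line)
        else:
            result.append(line)
    return result


if __name__ == "__main__":
    config = ["router 172.20.0.2", "mail 192.168.220.5", "web 192.168.17.9"]
    print("\n".join(comment_out(config, "172.20.0.2", ["192.168.220"])))
```

Matching on whole octets (`p + "."`) keeps a prefix like 192.168.17 from accidentally catching hosts on 192.168.170.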

The problems are:

1. If you have a large/complicated network then trying to maintain 
   this dependency list will start to make your brain hurt eventually.

2. The order in which NOCOL polls things means that it might not
   reach the core switch that has gone down before it polls the 30-odd
   hosts connected to it. So there's a possibility that all of these
   would suddenly go "Critical"; then, when NOCOL pinged the core
   switch again, it would notice that this was down too, but it would
   already have marked the 30-odd hosts as critical. They'd then
   suddenly 'disappear' from the display, and you'd also have to
   restart the appropriate agent.

3. You'll have to write another script which runs when the core
   switch, router or whatever comes back to life, and which then puts
   the 30 boxes behind it back under monitoring.
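Here is a matching Python sketch for point 3. It uses the same assumptions: one IPv4 address per config line, `#` comments, and a hypothetical `#DEPDOWN ` tag on lines the first script commented out. Only tagged lines are restored, so anything commented out by hand stays commented out.

```python
import re

MARKER = "#DEPDOWN "  # must match the tag the commenting-out script used
IPV4 = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def under_prefix(address, prefixes):
    """True if address is covered by a prefix such as '192.168.170', or by 'ALL'."""
    return any(p == "ALL" or address == p or address.startswith(p + ".")
               for p in prefixes)


def reinstate(config_lines, prefixes):
    """Uncomment lines this tooling commented out, for the given prefixes only.

    Lines the administrator commented out by hand lack the marker and are
    left untouched."""
    result = []
    for line in config_lines:
        if line.startswith(MARKER):
            original = line[len(MARKER):]
            match = IPV4.search(original)
            if match and under_prefix(match.group(1), prefixes):
                result.append(original)
                continue
        result.append(line)
    return result


if __name__ == "__main__":
    config = ["#DEPDOWN mail 192.168.220.5", "# retired 192.168.220.9",
              "#DEPDOWN web 192.168.100.3"]
    print("\n".join(reinstate(config, ["192.168.220"])))
```

The tag also helps with stranded entries. Running `grep '^#DEPDOWN'` over the config files lists everything still waiting to be put back.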

If the IP address of the device on which these things depend
changes, you risk things slipping off monitoring forever and
never being put back: the core router is replaced, say, the old
one goes down as Critical, 30 hosts get commented out of the
monitoring file, and meanwhile the old router is removed from the
config file and the new one added...

I suppose you could write some sort of periodic script which
fires off a load of traceroutes to the hosts in your config files
to "discover" where devices are, but if you have a complex network
with many possible paths, or if you're using things like ATM,
this will not work well; nor will it detect devices such as
switches or bridges, which are transparent to traceroute.

On the other hand, it's nice sometimes to have NOCOL display the
full horror of all the affected hosts, so that you get a clearer
picture of what has been affected by, say, an important router going down.

It's the difference between staff saying:

"There's a router down."

and

"There's a router down, and it means the mail server is unavailable,
 the DNS is only half working, marketing has disappeared,
 one of the two web servers is unreachable....."

The display lets you assess the impact of the outage: even if
the person looking at the NOCOL screen has little idea of how
the network is laid out, they get an immediate feel for what's
affected.

In other words: if there are loads of "Critical" entries on the
screen, your network is toast!

Perhaps Vikas could come up with a new category for something
like "REALLY_CRITICAL".

:-)


--
.......................................................................
Robert Lister    -    Network Administrator  - tel: +44 (0) 1483 711227
Robert.Lister@mclaren.co.uk                    fax: +44 (0) 1483 711297
McLaren Information Systems Department           http://www.mclaren.com


____________________Reply Separator____________________
Subject:    (nocol-users) Sorting & Updating
Date:       27/09/2000 22:33

1)  Is it possible to modify the web interface so devices in the
critical, error and warning view are sorted by the time they went down
rather than alphabetically?

2)  Occasionally we will experience a major network outage, usually due
to power or telco.  When this happens our help desk spends a
significant amount of time commenting out (hiding) devices that appear
in the Critical view.  Has anyone found a quick solution for commenting
out a large number of devices (75 to 150)?  We primarily use the web
interface and it gets really old clicking on each device one at a time
to update the status.


--
Andy Cravens