[nocol-users] Re:(nocol-users) Sorting & Updating
I like the idea of sorting by the time they went down. It would seem, though, that NOCOL reads the separate data files one by one, so I'm not sure how easy that would be to program. It might be a nice idea to display how long a device has been down for, i.e. once it has been down for more than 24 hours, the display in the nocol view, which would currently show 18:32, could change to 1d5h30m or something?

On the commenting out thing - the ability to set up some sort of primitive dependency list in nocol, to tell it that devices B, C and D are all dependent on switch A, or even to do it by address or subnet, would be good but hard to implement. What you could do is create a list of hosts and their dependencies:

172.20.0.206:ALL:192.168.170
172.20.0.2:ALL:192.168.220,192.168.210,192.168.100
172.20.0.2:novellmon:ALL
172.20.0.2:ippingmon:192.168.190

Then a script of some sort, fired on a critical event by noclogd, automatically comments things out (possibly you could use some sort of #include file / "make" to easily munge your config files for different sites?)

The problem being that:

1. If you have a large/complicated network then trying to maintain this dependency list will start to make your brain hurt eventually.

2. The order in which NOCOL polls things means that it might not hit the core switch that's gone down before it polls the 30-odd hosts connected to it. Therefore, there's a possibility that all this lot would suddenly go "Critical"; then, when NOCOL pinged the core switch again, it would notice that this was down too, but sadly it has already marked 30-odd hosts as critical, so they'd suddenly 'disappear' from the display; you'd have to restart the appropriate agent also.

3. You'll have to write another script which runs when the core switch/router/whatever comes back to life, and which then reinstates the 30 boxes behind it to monitoring again.
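
For what it's worth, here is a rough Python sketch of how a script might read a dependency list like the one above and decide whether a host sits behind a device that has gone down. This is only one possible reading of the format: I'm assuming the three fields are parent device, monitor name (or ALL), and a comma-separated list of address prefixes (or ALL), and that '#' starts a comment. All the names (Dependency, parse_dependencies, depends_on) are made up for illustration; none of this is part of NOCOL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    parent: str      # device the others depend on (assumed meaning of field 1)
    monitor: str     # monitor name, or "ALL" (assumed meaning of field 2)
    prefixes: tuple  # address prefixes behind the parent, or ("ALL",)


def parse_dependencies(text):
    """Parse 'parent:monitor:prefix[,prefix...]' lines into Dependency records."""
    deps = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skipping blank lines and '#' comments is an assumption, not NOCOL syntax.
        if not line or line.startswith("#"):
            continue
        parent, monitor, targets = line.split(":", 2)
        prefixes = tuple(t.strip() for t in targets.split(","))
        deps.append(Dependency(parent, monitor, prefixes))
    return deps


def matches_prefix(address, prefix):
    """True if address falls under a dotted prefix such as '192.168.170'."""
    # Appending '.' stops '192.168.19' from matching '192.168.190.7'.
    return prefix == "ALL" or address == prefix or address.startswith(prefix + ".")


def depends_on(deps, parent, monitor, address):
    """True if 'address', as watched by 'monitor', is listed behind 'parent'."""
    for dep in deps:
        if dep.parent != parent:
            continue
        if dep.monitor not in ("ALL", monitor):
            continue
        if any(matches_prefix(address, p) for p in dep.prefixes):
            return True
    return False


EXAMPLE = """\
172.20.0.206:ALL:192.168.170
172.20.0.2:ALL:192.168.220,192.168.210,192.168.100
172.20.0.2:novellmon:ALL
172.20.0.2:ippingmon:192.168.190
"""

if __name__ == "__main__":
    deps = parse_dependencies(EXAMPLE)
    print(depends_on(deps, "172.20.0.2", "ippingmon", "192.168.190.7"))
```

A script fired by noclogd could call something like depends_on() for each host in the config file when a parent device goes critical, and again when it recovers. It does nothing about problems 1 to 3 above; it only covers the lookup.
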
If the IP address of the device on which these things depend changes, you could risk things slipping off monitoring for ever and not getting put back on: the core router has been changed or something, the old one goes down as Critical, 30 hosts get commented out from the monitoring file, meanwhile the old router gets removed from the config file, the new one gets added......

I suppose you could write some sort of periodic script which fires off a load of traceroutes to the hosts in your config files to "discover" where devices are, but if you have a complex network with many possible paths, or if you're using things like ATM, then this will not work well, and neither will it detect things like switches or bridges, which are transparent to traceroute.

On the other hand, it's nice sometimes to have NOCOL displaying the full horror of all the affected hosts, so that you get a clearer picture of what has been affected by, say, an important router going down. It's the difference between staff saying "There's a router down." and "There's a router down, and it means the mail server is unavailable, the DNS is only half working, marketing has disappeared, one of the two web servers is unreachable....."

The display allows you to assess the impact of the outage: even if the person looking at the NOCOL screen has little idea of how the network is laid out, they get an immediate feel for what's affected. I.e. if there's loads of "Critical" on the screen then your network is toast!

Perhaps Vikas could come up with a new category for something like "REALLY_CRITICAL" :-)

--
.......................................................................
Robert Lister - Network Administrator - tel: +44 (0) 1483 711227
Robert.Lister@mclaren.co.uk             fax: +44 (0) 1483 711297
McLaren Information Systems Department  http://www.mclaren.com

____________________Reply Separator____________________
Subject: (nocol-users) Sorting & Updating
Date: 27/09/2000 22:33

1) Is it possible to modify the web interface so devices in the critical, error and warning view are sorted by the time they went down rather than alphabetically?

2) Occasionally we will experience a major network outage, usually due to power or telco. When this happens our help desk spends a significant amount of time commenting out (hiding) devices that appear in the Critical view. Has anyone found a quick solution for commenting out a large number of devices (75 to 150)? We primarily use the web interface and it gets really old clicking on each device one at a time to update the status.

--
Andy Cravens