Hi All,
I’m looking for advice on how to deal with alert storms. For example, we lost a UPS last night at a critical hub, which caused Observium to be unable to reach most of our networks (and of course, all the alternative OSPF routes were powered via that UPS). Consequently we had several hundred alerts sent out via email and Telegram. Is there any interesting way to deal with this that anyone can recommend?
Thanks,
Joey
Hi Joey,
Hmm. This was something we thought about when I was first building the alerting system, but I never really came up with a good solution. Suppressing things at random seems like a bad idea.
adam.

On 2019-01-31 01:13:11, Joey Stanford via observium observium@observium.org wrote:
Hi All, I’m looking for advice on how to deal with alert storms. [...]
Maybe doing something relational -
Like Y devices are dependent on X device - therefore if X device is down, then it is highly probable that Y would be down as well - I am sure there is some drawback I am not thinking about here
Thoughts for down the road
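A minimal sketch of the relational idea above, assuming each device has at most one parent it is reached through. The `PARENT` map, the device names, and both helper functions are hypothetical and are not part of Observium; the sketch only shows how alerts for dependent devices could be folded into the upstream failure.

```python
# Hypothetical dependency map: child device -> the device it is reached through.
# None of these names come from Observium; they are illustrative only.
PARENT = {
    "site-a-switch": "site-a-router",
    "site-a-ap": "site-a-switch",
    "site-a-router": "hub-ups",
}


def root_cause(device, down):
    """Return the furthest-upstream down ancestor of `device`, or `device` itself."""
    cause = device
    current = device
    seen = {device}  # guards against accidental loops in the map
    while True:
        current = PARENT.get(current)
        if current is None or current in seen:
            break
        seen.add(current)
        if current in down:
            cause = current
    return cause


def alerts_to_send(down):
    """Keep only the down devices that have no down ancestor."""
    return sorted(d for d in down if root_cause(d, down) == d)


down_now = {"hub-ups", "site-a-router", "site-a-switch", "site-a-ap"}
print(alerts_to_send(down_now))  # ['hub-ups']
```

The obvious drawback is that someone has to maintain the map, and a wrong entry hides a real failure.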
On Thu, Jan 31, 2019 at 6:56 AM Adam Armstrong via observium <observium@observium.org> wrote:
This was something we thought about when I was first building the alerting system, but I never really came up with a good solution. [...]
This is something that comes up every few months.
In reality we don't actually know if anything is up or down, we only know that something /was/ up or down at some point in the past.
This makes deciding whether to tell you about something that has been detected NOW based on the state of something some minutes in the past somewhat unreliable.
10 superfluous alerts are better than missing 1 legitimate alert, imo.
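To make the staleness concern concrete, here is a small sketch that only allows a child alert to be suppressed when the parent's "down" observation is recent. The `Observation` type, the `may_suppress` function, and the 300-second `max_age` are all assumptions made for illustration, not anything Observium provides.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    status: str       # "up" or "down", as seen at the last poll
    polled_at: float  # Unix timestamp of that poll


def may_suppress(parent: Observation, now: float, max_age: float = 300.0) -> bool:
    """Suppress a child alert only if the parent was seen down recently enough."""
    age = now - parent.polled_at
    return parent.status == "down" and 0 <= age <= max_age


# Parent seen down 100 seconds ago: recent enough to act on.
print(may_suppress(Observation("down", 1000.0), now=1100.0))  # True
# Parent seen down 1000 seconds ago: too stale, so send the child alert anyway.
print(may_suppress(Observation("down", 1000.0), now=2000.0))  # False
```

When in doubt the function returns False, so the alert is sent rather than dropped, which matches the preference for superfluous alerts over missed ones.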
adam.

On 2019-01-31 14:04:35, Jeff Waddell via observium observium@observium.org wrote:
Maybe doing something relational [...]
In reality we don't actually know if anything is up or down, we only know that something /was/ up or down at some point in the past.
At least in our situation, there might be a programmatic solution that could work.
1) Set a delay on the alert (we have this already as part of the checker).
2) Set a variable for the threshold number of pending (concurrent) alerts, e.g. 10.
3) If the number of pending alerts exceeds the threshold, send a single "alert storm" alert and purge the alert queue.
Something along those lines would probably work to cut down the spam at least in our case.
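The three steps above can be sketched roughly as follows. `AlertQueue`, its methods, and the message format are hypothetical; the sketch assumes `flush()` is called once the alert delay has elapsed and that whatever it returns is handed to the normal notification transports.

```python
class AlertQueue:
    """Holds alerts during the delay window and collapses storms on flush."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.pending = []

    def add(self, device, message):
        self.pending.append((device, message))

    def flush(self):
        """Return the messages to send and empty the queue."""
        batch, self.pending = self.pending, []
        if len(batch) > self.threshold:
            devices = {device for device, _ in batch}
            return [f"Alert storm: {len(batch)} alerts on {len(devices)} devices"]
        return [f"{device}: {message}" for device, message in batch]


queue = AlertQueue(threshold=10)
for i in range(300):
    queue.add(f"router-{i}", "device down")
print(queue.flush())  # ['Alert storm: 300 alerts on 300 devices']
```

Below the threshold the alerts pass through unchanged, so ordinary single failures still produce individual notifications.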
PS: one of my guys said, "We need to get Joey to program Obs so that when we get this many alerts it plays a tune 😁". I told them that tune would likely be an air raid siren.
Hmm...
Originally the plan was to try to implement alert coalescing so that multiple alerts would be sent in a single email, but that isn't as easy when you don't only use email.
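As a rough illustration of what coalescing could look like with more than one transport, the sketch below groups alerts by transport and recipient and renders one digest per group. The alert dictionary keys and the `coalesce` function are assumptions for this example only. It ignores the per-transport formatting and length limits that are part of what makes this hard in practice.

```python
from collections import defaultdict


def coalesce(alerts):
    """Group alerts by (transport, recipient) and render one digest per group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["transport"], alert["recipient"])].append(alert)

    digests = {}
    for key, items in groups.items():
        lines = [f"{a['device']}: {a['message']}" for a in items]
        digests[key] = f"{len(items)} alert(s)\n" + "\n".join(lines)
    return digests


alerts = [
    {"transport": "email", "recipient": "noc@example.org", "device": "sw1", "message": "down"},
    {"transport": "email", "recipient": "noc@example.org", "device": "sw2", "message": "down"},
    {"transport": "telegram", "recipient": "noc-chat", "device": "sw1", "message": "down"},
]
for key, text in coalesce(alerts).items():
    print(key, text.splitlines()[0])
```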
I could potentially add *device* dependencies, which might be easier to implement and handle.
adam.

On 2019-02-01 00:23:22, Joey Stanford via observium observium@observium.org wrote:
At least in our situation, there might be a programmatic solution that could work. [...]
On Jan 31, 2019, at 17:33 , Adam Armstrong via observium observium@observium.org wrote:
I could potentially add *device* dependencies, which might be easier to implement and handle.
That might work, but putting in dependencies for hundreds of systems will take a while. It’ll be a little tricky since we use OSPF to route around down devices. Keep in mind our system is about 50% RF links and 50% devices. A device can have, in our topology, anywhere from one to five links to it. So a defined area would be fine: we could just make everything dependent upon the local router. On larger sites it’ll be a bit more complex.
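One way to picture the multi-link case: a device with several upstream links should only be treated as a consequence of another failure when every one of those upstream neighbours is down. The `UPSTREAM` map, the device names, and the functions below are hypothetical, and the sketch looks only at direct neighbours rather than computing real OSPF reachability.

```python
# Hypothetical topology: device -> set of upstream neighbours it can be reached through.
UPSTREAM = {
    "tower-2": {"tower-1", "tower-3"},
    "cpe-17": {"tower-2"},
}


def is_isolated(device, down):
    """True if the device has upstream neighbours and every one of them is down."""
    parents = UPSTREAM.get(device, set())
    return bool(parents) and parents <= down


def alerts_to_send(down):
    """Alert only for down devices that still had a working upstream path."""
    return sorted(d for d in down if not is_isolated(d, down))


# tower-2 still has a path via tower-3, so its own failure is reported.
print(alerts_to_send({"tower-1", "tower-2"}))  # ['tower-1', 'tower-2']
# Both paths to tower-2 are gone, so it and cpe-17 are treated as consequences.
print(alerts_to_send({"tower-1", "tower-3", "tower-2", "cpe-17"}))  # ['tower-1', 'tower-3']
```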
See, this sort of thing is why we don't bother.
It's very complex, and unless you model a lot of different scenarios, it's only useful to a subset of users.
I actually don't think getting a bunch of emails during a failure is too much of a hardship. It encourages you to have fewer outages! 😆
Adam.
On 1 Feb 2019, at 00:54, Joey Stanford via observium observium@observium.org wrote:
That might work, but putting in dependencies for hundreds of systems will take a while. [...]
Device dependencies would be a lovely addition for us.
We monitor a lot of small customer sites, generally behind a single router. If their broadband goes down, we then get a load of notifications because we're unable to poll anything behind it. In this situation, it's only the broadband being down which really warrants an alert; everything else is caused by it.
Regards, James Tandy TandyUK Servers Limited
Tel: 01903 247 011 Www: http://www.tandyukservers.co.uk Email: support@tandyukservers.co.uk
TandyUK Servers Limited Registered in England and Wales, Company number 8314911 VAT Registered in the UK, number 182 0661 19 Registered Office: Amelia House, Crescent Road, Worthing, BN11 1QR
On 01/02/2019 00:33, Adam Armstrong via observium wrote:
I could potentially add *device* dependencies, which might be easier to implement and handle.
participants (4)
- Adam Armstrong
- James Tandy
- Jeff Waddell
- Joey Stanford