Dealing with Alert Storms

Joey Stanford

31 Jan 2019

2:04 a.m.

Hi All,

I’m looking for advice on how to deal with alert storms. For example, we lost a UPS last night at a critical hub, which caused Observium to lose reachability to most of our networks (and of course, all the alternative OSPF routes were powered via that UPS). Consequently, we had several hundred alerts sent out via email and Telegram. Is there any interesting way to deal with this that anyone can recommend?

Thanks,

Joey

Adam Armstrong

31 Jan

12:56 p.m.

Hi Joey,

Hmm. This was something we thought about when I was first building the alerting system, but I never really came up with a good solution. Suppressing things at random seems like a bad idea.

adam.

_______________________________________________
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium

Jeff Waddell

3:04 p.m.

Maybe doing something relational -

For example, Y devices are dependent on X device. Therefore, if X device is down, it is highly probable that Y is down as well. I am sure there is some drawback I am not thinking about here.

Thoughts for down the road
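
A minimal sketch of the relational idea, not Observium code: each device names at most one device it depends on, and an alert is kept only if no directly connected chain of down ancestors explains it. The `PARENT` map, the device names, and both function names are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical data: device -> the device it depends on (None for a root).
PARENT: Dict[str, Optional[str]] = {
    "hub-router": None,
    "site-switch": "hub-router",
    "site-ap": "site-switch",
}


def root_cause(device: str, down: Set[str]) -> str:
    """Follow the unbroken chain of down ancestors and return the topmost one."""
    cause = device
    seen = {device}  # guards against an accidental cycle in PARENT
    parent = PARENT.get(device)
    while parent is not None and parent in down and parent not in seen:
        seen.add(parent)
        cause = parent
        parent = PARENT.get(parent)
    return cause


def alerts_to_send(down: Set[str]) -> Set[str]:
    """Keep only the down devices that are their own root cause."""
    return {d for d in down if root_cause(d, down) == d}
```

With all three devices down, only `hub-router` would be alerted on; with only `site-ap` down, `site-ap` itself would be.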

Adam Armstrong

3:18 p.m.

This is something that comes up every few months.

In reality we don't actually know if anything is up or down, we only know that something /was/ up or down at some point in the past.

This makes deciding whether to tell you about something that has been detected NOW based on the state of something some minutes in the past somewhat unreliable.

10 superfluous alerts are better than missing 1 legitimate alert, imo.
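
The staleness concern can be made concrete with a small sketch. It assumes a hypothetical record of the parent's last polled state and only suppresses a child's alert when that state is both "down" and recent; otherwise it errs towards sending. The `LastKnownState` type, `should_suppress`, and the 300-second default are illustrative assumptions, not Observium settings.

```python
from dataclasses import dataclass


@dataclass
class LastKnownState:
    is_up: bool
    checked_at: float  # Unix timestamp of the poll that produced is_up


def should_suppress(parent: LastKnownState, now: float, max_age_s: float = 300.0) -> bool:
    """Suppress a child's alert only if the parent was seen down recently enough."""
    age = now - parent.checked_at
    if age > max_age_s:
        # The parent's state is too old to trust: send the alert anyway.
        return False
    return not parent.is_up
```

Choosing `max_age_s` is the hard part: too short and nothing is ever suppressed, too long and a legitimate alert can be hidden by out-of-date information.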

adam.

Joey Stanford

1 Feb

1:23 a.m.

> In reality we don't actually know if anything is up or down, we only know that something /was/ up or down at some point in the past.

At least in our situation, there might be a programmatic solution that could work.

1) Set a delay on the alert (we already have this as part of the checker).

2) Set a variable for the maximum number of pending (concurrent) alerts, e.g. 10.

3) If the number of pending alerts exceeds the threshold, send a single "alert storm" alert and purge the alert queue.

Something along those lines would probably work to cut down the spam, at least in our case.

PS: one of my guys said "We need to get Joey to program Obs so that when we get this many alerts it plays a tune 😁". I told them that tune would likely be an air raid siren.
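
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not an Observium feature: `pending` is assumed to hold only alerts that have already passed their delay (step 1), and `flush_pending`, `send`, and `storm_threshold` are hypothetical names.

```python
from typing import Callable, List


def flush_pending(
    pending: List[str],
    send: Callable[[str], None],
    storm_threshold: int = 10,
) -> int:
    """Send queued alerts, or one summary if the queue exceeds the threshold.

    Returns the number of notifications actually sent.
    """
    if len(pending) > storm_threshold:
        # Step 3: replace the individual alerts with a single storm alert.
        send(f"ALERT STORM: {len(pending)} alerts pending, individual notifications suppressed")
        sent = 1
    else:
        for alert in pending:
            send(alert)
        sent = len(pending)
    pending.clear()  # purge the queue in both cases
    return sent
```

The trade-off is that purging discards the detail of which alerts fired, so the storm alert would need to point the operator back to the web interface.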

Adam Armstrong

1:33 a.m.

Hmm...

Originally the plan was to try to implement alert coalescing so that multiple alerts would be sent in a single email, but that isn't as easy when you don't only use email.

I could potentially add *device* dependencies, which might be easier to implement and handle.

adam.
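
For illustration, the coalescing idea can be sketched independently of any one transport by building a plain-text digest per transport. The function name and the transport labels are hypothetical, and the sketch deliberately ignores per-transport formatting and message length limits, which is part of what makes the real problem harder.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coalesce(alerts: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Group (transport, message) pairs into one digest string per transport."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for transport, message in alerts:
        grouped[transport].append(message)
    return {
        transport: f"{len(messages)} alert(s):\n" + "\n".join(f"- {m}" for m in messages)
        for transport, messages in grouped.items()
    }
```

Each transport would then send its single digest instead of one notification per alert.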

Joey Stanford

1:54 a.m.

On Jan 31, 2019, at 17:33, Adam Armstrong via observium <observium@observium.org> wrote:

> I could potentially add *device* dependencies, which might be easier to implement and handle.

That might work, but putting in dependencies for hundreds of systems will take a while. It’ll be a little tricky since we use OSPF to route around down devices. Keep in mind our system is about 50% RF links and 50% devices. In our topology, a device can have anywhere from one to five links to it. So a defined area would be fine (we could just make everything dependent upon the local router), but on larger sites it’ll be a bit more complex.
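
One way to picture the multi-link case is to treat a down device as explained only when every one of its upstream devices is also down, since any surviving link could still carry traffic. This is a sketch under that assumption; the `UPSTREAM` map, the device names, and the function names are hypothetical, and it does not model OSPF itself.

```python
from typing import Dict, List, Set

# Hypothetical topology: device -> upstream devices it can be reached through.
UPSTREAM: Dict[str, List[str]] = {
    "core": [],
    "tower-a": ["core"],
    "tower-b": ["core"],
    "cpe-1": ["tower-a", "tower-b"],  # two RF links, so two possible paths
}


def is_explained(device: str, down: Set[str]) -> bool:
    """True if the device has upstreams and all of them are down."""
    upstream = UPSTREAM.get(device, [])
    return bool(upstream) and all(u in down for u in upstream)


def unexplained_alerts(down: Set[str]) -> List[str]:
    """Return the down devices whose outage is not explained by their upstreams."""
    return sorted(d for d in down if not is_explained(d, down))
```

If `tower-a` and `cpe-1` are down while `tower-b` is up, both are reported, because `cpe-1` should still have been reachable via `tower-b`. Maintaining such a map by hand for hundreds of systems is the effort described above.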

Adam Armstrong

1:57 a.m.

See, this sort of thing is why we don't bother.

It's very complex, and unless you model a lot of different scenarios, it's only useful to a subset of users.

I actually don't think getting a bunch of emails during a failure is too much of a hardship. It encourages you to have fewer outages! 😆

Adam.

James Tandy

11 Feb

8:10 p.m.

Device dependencies would be a lovely addition for us.

We monitor a lot of small customer sites, generally behind a single router. If their broadband goes down, we then get a load of notifications because we're unable to poll anything behind it. In this situation, it's only the broadband being down that really warrants an alert; everything else is caused by it.

Regards,
James Tandy
TandyUK Servers Limited

Tel: 01903 247 011 Www: http://www.tandyukservers.co.uk Email: support@tandyukservers.co.uk

TandyUK Servers Limited Registered in England and Wales, Company number 8314911 VAT Registered in the UK, number 182 0661 19 Registered Office: Amelia House, Crescent Road, Worthing, BN11 1QR
