Reduce the number of false-positive status alerts (e.g. Machine unreachable)
CONTEXT: Many status alerts (e.g. Machine unreachable) have a simple configuration model, meaning that these alerts will trigger as soon as a ping to the Machine fails.
PROBLEM: VPN connection resets, scheduled restarts, etc. may all be acceptable causes for a ping to fail in our system, provided they are brief and the system recovers in time. Currently, however, they lead to a lot of false positives.
EXAMPLE SOLUTIONS:
• Add configuration to the machine unreachable alerts to set a time threshold
• Require two consecutive pings to fail before raising the alert
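The two example solutions above can be combined in one small check. The following is a minimal sketch in Python, not SQL Monitor's actual implementation: the class name, the parameter names and the default values are all hypothetical and only illustrate the idea of requiring both a number of consecutive failed pings and a minimum outage duration before alerting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnreachableDetector:
    """Hypothetical detector; names and defaults are illustrative only."""

    min_failures: int = 2              # consecutive failed pings required
    min_outage_seconds: float = 30.0   # how long the outage must have lasted

    failures: int = 0
    first_failure_at: Optional[float] = None

    def record_ping(self, ok: bool, now: float) -> bool:
        """Record one ping result; return True if an alert should be raised."""
        if ok:
            # Any successful ping ends the current outage.
            self.failures = 0
            self.first_failure_at = None
            return False
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.failures += 1
        outage = now - self.first_failure_at
        return (
            self.failures >= self.min_failures
            and outage >= self.min_outage_seconds
        )
```

With these settings a brief VPN reset that recovers within 30 seconds never raises an alert, while a machine that stays down is reported on the first failed ping after the threshold has passed.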
Version 5.0.5 introduced the ability to configure the Machine unreachable and SQL Server instance unreachable alerts with multiple time thresholds. We left this open for a while to see if there were any further requests. Thanks for the suggestion.
-
josh commented
I'd also like this for alerts such as SQL Agent stopped. Our SQL instances are on a delayed start, so the box will start and the alert will be raised (and page us) even though we don't expect the service to have started yet. However, if the service hasn't started after 5 minutes, we would need to know about it.
-
Thomas Franz commented
For me this problem has been solved for the last few frequent updates, since you now allow me to define a minimum timeout for the unreachable alerts.
-
We don't like the idea of repeat alerts. A better approach, in our opinion, is to add time thresholds to these types of status alerts. I'll merge these two suggestions together.
-
Robert commented
We also get a lot of noise from this alert, due to intermittent network issues, especially on VMs. The VM hosts seem to get overwhelmed with network traffic at certain times. It's usually during overnight/maintenance hours (probably from backups), so it doesn't affect the user experience, but it results in lots of alerts during off-hours. If we could configure the alert threshold, then we could determine how long before the alert fires.
-
Jeff commented
I get a machine unreachable alert with 1 second between the "Machine unreachable from" and "Machine became reachable again", so there is no waiting of 5 seconds to re-ping. If it is going to alert for a 1 second "outage", then it is noise and I have to either disable the alert or disable the email, which defeats the purpose. If it was configurable via the UI, I could tweak settings for individual servers that are more ping "lossy" than others. Thank you.
-
Anonymous commented
Oracle Enterprise Manager allows us to set these thresholds. The same goes for MySQL Enterprise Monitor.
Right now this feature is useless.
-
Brent Burbidge commented
Hi Priya
No, I still believe this should be configurable. We consistently get false alarms from this alert. We actually had to disable it because it was just noise.
I believe this should be exposed in the UI and configurable (i.e. X pings, wait X secs, X pings).
Thanks,
Brent
-
Daniel Jackson commented
Erroneous Monitoring Error for Databases During Restore
We log ship several hundred databases and are constantly getting "Monitoring error (SQL Server data collection)" notifications. The connection log will show the following error.
"Database 'Some_Database' cannot be opened. It is in the middle of a restore."
I feel this is not really an "error" as much as it is a "state". At the very least, it would be nice to allow this particular notification to be turned off. We are currently getting 50-100 emails a day for "errors" that are not really errors.
-
Hi Don,
The way SQL Monitor works is that it pings the Machine/SQL Server, and if they don't respond then SQL Monitor raises an alert. If I remember correctly, SQL Monitor pings 5 times at 1-second intervals, waits a few seconds, then pings 5 times again before it raises an alert.
The reason for not exposing this configuration to users is that we believe it is important to be notified as soon as possible if your server is down or not responding to pings. On the other hand, I appreciate that if you are getting lots of false positives then we need to understand the reasons for them.
We have seen this error in other users' environments where, for some bizarre reason, the network is flaky and pings randomly fail. I would suggest that we increase logging in your environment and capture the errors that are causing this alert to be raised in the first place. Then we can investigate further. If you would like to do this, please let us know and I can email you the details.
Thanks,
Priya
-
Don Ferguson commented
I am getting lots of false positives on this. I need to be able to make it less sensitive.
-
Don Ferguson commented
Add ability to set a custom unreachable threshold measured in seconds before the instance unreachable alert is raised.
-
Hi Brent,
It actually works exactly the way you have described.
I think it is ten pings in a row each with a 1s timeout. So by default it is
• five pings
• wait five secs
• five pings
If Monitor still doesn't get a response, it raises the alert.
This configuration is not exposed via the UI, though.
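For readers who want to see the sequence spelled out, here is a minimal Python sketch of the "five pings, wait five secs, five pings" check described above. It is an illustration only, not SQL Monitor's code: the function name and parameters are hypothetical, the `ping` callable is assumed to apply its own 1s timeout, and treating a single reply as "reachable" is an assumption.

```python
import time
from typing import Callable


def machine_unreachable(
    ping: Callable[[], bool],
    pings_per_burst: int = 5,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every ping in both bursts fails.

    `ping` is a caller-supplied function that returns True on a reply and
    is assumed to enforce its own timeout.
    """
    for burst in range(2):
        for _ in range(pings_per_burst):
            if ping():
                # Assumption: one reply is enough to treat the machine as up.
                return False
        if burst == 0:
            # Pause between the first and second burst.
            sleep(wait_seconds)
    return True
```

Making `pings_per_burst` and `wait_seconds` parameters is what exposing this configuration would amount to; the `sleep` argument is only there so the check can be exercised without real delays.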
Are you happy for us to close this feature request?
Thanks,
Priya
-
Brent Burbidge commented
Configure the alert/ping threshold to only alert if there are multiple ping failures over a time period or so many failed responses in a row.
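One way to picture the "multiple ping failures over a time period" variant is a sliding window of failure timestamps. The Python sketch below is a rough illustration under that assumption; the class name, parameter names and defaults are hypothetical and do not correspond to any SQL Monitor setting.

```python
from collections import deque


class FailureWindow:
    """Hypothetical check: alert when enough pings fail within a time window."""

    def __init__(self, max_failures: int = 3, window_seconds: float = 60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failed pings

    def record_ping(self, ok: bool, now: float) -> bool:
        """Record one ping result; return True if an alert should be raised."""
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if ok:
            return False
        self._failures.append(now)
        return len(self._failures) >= self.max_failures
```

Unlike a "failures in a row" rule, this still alerts on a lossy connection where successful pings are interleaved with failures, as long as enough failures land inside the window.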
-
Thanks for the suggestion. We agree this could be useful and could apply to other alerts of a similar nature.
We've logged this as SRP-6767
Cheers
-
Support commented
On the “Machine Unreachable” alerts, is it possible to create an additional alert (e.g. after 10 minutes) to advise that a machine is still unreachable?
This would be useful for overnight automated reboots when applying Windows updates. The current unreachable alerts are extremely useful for that, but it would be reassuring to know that we would continue to receive further alerts if a server had failed to come back up again (in case we also failed to notice that no “Machine Unreachable – Ended” alert had arrived).
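The behaviour being asked for could look roughly like the Python sketch below. It is only an illustration of the request, not an existing SQL Monitor feature: the class name, the 10-minute default and the returned strings are all hypothetical.

```python
from typing import Optional


class UnreachableReminder:
    """Hypothetical tracker that repeats an alert while a machine stays down."""

    def __init__(self, reminder_seconds: float = 600.0):
        self.reminder_seconds = reminder_seconds
        self._last_alert_at: Optional[float] = None

    def check(self, reachable: bool, now: float) -> Optional[str]:
        """Return the alert to send for this check, or None if nothing is due."""
        if reachable:
            was_down = self._last_alert_at is not None
            self._last_alert_at = None
            return "ended" if was_down else None
        if self._last_alert_at is None:
            # First failed check: raise the initial alert.
            self._last_alert_at = now
            return "unreachable"
        if now - self._last_alert_at >= self.reminder_seconds:
            # Still down after the reminder interval: alert again.
            self._last_alert_at = now
            return "still unreachable"
        return None
```

A server that reboots and returns within ten minutes would produce one "unreachable" and one "ended" alert; one that never comes back would keep producing "still unreachable" every ten minutes.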