Create an alert that triggers when receiving only partial telemetry from a server
I have a situation where Redgate Monitor sometimes does not receive telemetry from the server OS. This occurred sporadically for several weeks before I had to open the server's detailed information, where I discovered the missing telemetry. The monitor triggered no alerts during those weeks, even though the gaps were caused by crashing Windows services.
I've attached a screenshot that shows the missing telemetry for CPU and memory, caused by the perfmon service timing out and crashing.
After opening a ticket, I was informed that the monitor has no alert condition that detects this situation. This seems like a crucial condition to be alerted on, and I would like to recommend that the monitor trigger an alert whenever telemetry is missing or interrupted. Thank you!
-
Mark Freeman commented
Similarly, I had a set of servers for which no data was being collected due to authentication failures (something changed in our Azure Entra configuration for a subscription). The first I knew about it was when I needed to investigate a performance problem and found that no data had been collected for weeks.
-
Matthew commented
This is associated with Redgate tickets #316800 and #317432. It turns out that the telemetry is not received because of the following error:
RedGate.SqlMonitor.Common.Utilities.Status.StatusLogger - (servernameremoved): WMI / ReadRegistry : AuthorizationError, GroupName: General ActionName: ServerRegistryProperty, ElapsedTime: 123485
System.Runtime.InteropServices.COMException: Server execution failed (0x80080005 (CO_E_SERVER_EXEC_FAILURE))
RedGate.SqlMonitor.Common.Status.ErrorStatusReporter - Unknown error status category
System.Management.ManagementException: Provider load failure
The monitor fails to alert on this actual error and we went several weeks with the server showing "green" on the overview before someone actually went in and looked at the graph and noticed the gaps in the telemetry.
While we've been asked to make this a suggestion, I think this is more of a bug that should be fixed as the monitor -failed- to alert us to an error condition that it could see.
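For illustration only, the condition requested in this thread can be sketched as a gap check over the timestamps of collected samples. This is not Redgate Monitor's API or how the product works internally; the function names `find_gaps` and `is_stale`, the `expected_interval` parameter, and the `tolerance` multiplier are all hypothetical, and the sketch assumes sample timestamps can be obtained from whatever store holds the telemetry.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (last_seen, next_seen) pairs where consecutive samples are
    further apart than tolerance * expected_interval.

    All names and the tolerance default are illustrative assumptions.
    """
    threshold = expected_interval * tolerance
    ordered = sorted(timestamps)
    gaps = []
    # Compare each sample with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            gaps.append((previous, current))
    return gaps


def is_stale(timestamps, now, expected_interval, tolerance=2.0):
    """Return True if no sample has arrived recently enough.

    An empty series counts as stale, which covers the case where
    collection never succeeded at all (e.g. authentication failures).
    """
    if not timestamps:
        return True
    return now - max(timestamps) > expected_interval * tolerance
```

A check like this would run per metric (CPU, memory, and so on) rather than per server, so a server that still answers some queries but is missing OS-level counters would no longer show as "green".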