Our bot host servers are Windows boxes, usually sitting in pairs behind a load balancer. When one goes down, it is still pingable and responds to all external inquiries, so from the outside it looks healthy. The other server tries to pick up the load but can’t run 100% of the jobs, so some jobs don’t run at all.
We only notice the “silence” when we look in Splunk and see that no jobs have run on server #n for some length of time.
How can we set up an external monitor (probably Unix-based) that watches for “no jobs running” and sends an alert (email, page, etc.) once x minutes of silence have gone by?
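What is being described is essentially a “dead man’s switch”: record when each host last ran a job, and alert when that timestamp gets too old. Below is a minimal sketch of just the detection logic. It assumes the per-host last-job timestamps have already been pulled from somewhere (for example, a scheduled Splunk search). The function name `silent_hosts`, the host names, and the 15-minute threshold are all hypothetical, and the alert is only a placeholder `print`.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_job_seen, expected_hosts, now, threshold):
    """Return the expected hosts with no job activity within `threshold`.

    last_job_seen maps host name -> timezone-aware datetime of that host's
    most recent job record. How this dict gets populated (e.g. from a
    scheduled Splunk search) is an assumption and is not shown here.
    """
    silent = []
    for host in expected_hosts:
        seen = last_job_seen.get(host)
        # A host with no job record at all is treated as silent too.
        if seen is None or now - seen > threshold:
            silent.append(host)
    return sorted(silent)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical sample data: botA ran a job 2 minutes ago, botB 45 minutes ago.
    last_seen = {
        "botA": now - timedelta(minutes=2),
        "botB": now - timedelta(minutes=45),
    }
    quiet = silent_hosts(last_seen, ["botA", "botB"], now, timedelta(minutes=15))
    for host in quiet:
        # Placeholder alert hook: swap in the real email/pager integration.
        print(f"ALERT: no jobs on {host} for more than 15 minutes")
```

Passing the list of expected hosts explicitly matters here: a server that has stopped logging entirely would otherwise simply be absent from the data and never trigger an alert.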