I hope you’re all okay.
Today we use 8 licenses to run a little over 90 processes. When a random network problem occurs, some of our robots can hang and stay stuck until someone notices that the process is no longer progressing, whether because of an unexpected screen, an unexpected network update, or something similar. Even with timeouts configured to report an error, there are cases where the timeouts don’t work well.
To better handle these cases, I thought of creating a robot that runs every 30 minutes, checks all the processes that are currently running and, if any of them has gone more than X time without sending any logs, sends me an alert so I can check whether that process has hung and is blocking the queue.
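To make the idea concrete, here is a minimal Python sketch of the check itself. It assumes the list of running jobs and each job's latest log timestamp have already been fetched (for example through the Orchestrator API, which is not shown here). The `RunningJob` fields and the `find_stalled_jobs` helper are hypothetical names used only for illustration, not part of any UiPath library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class RunningJob:
    # Hypothetical record; in practice it would be built from API responses.
    process_name: str
    robot_name: str
    started_at: datetime
    last_log_at: Optional[datetime]  # None if the job has not logged yet


def find_stalled_jobs(
    jobs: List[RunningJob],
    max_silence: timedelta,
    now: Optional[datetime] = None,
) -> List[RunningJob]:
    """Return the jobs whose last sign of activity is older than max_silence."""
    now = now or datetime.now(timezone.utc)
    stalled = []
    for job in jobs:
        # Fall back to the start time when no log has been written yet.
        last_activity = job.last_log_at or job.started_at
        if now - last_activity > max_silence:
            stalled.append(job)
    return stalled
```

The watchdog robot would call something like `find_stalled_jobs(jobs, timedelta(minutes=45))` on each 30-minute run and send an alert only when the returned list is not empty. The threshold (the "X time" above) probably needs to be set per process, since some automations legitimately stay quiet for longer than others.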
I believe the best way to do this is with the Orchestrator API, but before building it I would like to know whether you have already run into this problem and what approach you use for better queue management.
Today we have processes that must run at pre-defined times, and if a process hangs in the queue for any reason, the next automation ends up not being executed when it should.
Have you ever had the same problem? If so, how did you solve it?
Note: The Orchestrator’s alert system doesn’t help much, as it sends a notification for every error or alert, and I don’t need all of those emails, only the ones for processes that got stuck.