Robots Connected to a Multi-Node Orchestrator Go Unresponsive Intermittently

Why do my robots become unresponsive intermittently even though we have the required licenses? The robots seem to auto-recover and we do not know of any network issues in our environment.

Issue Description

Sometimes, in a multi-node Orchestrator environment, you might experience intermittent robot connectivity issues and sporadic job failures. The robots may become licensed and work fine momentarily, but then lose their licensed status unexpectedly, causing triggers to break and executions to fail. You can observe this closely by opening the robot sessions under the tenant licenses and watching the sessions establish and drop from time to time. The Orchestrator and Robot application logs display licensing exceptions in this case, but nothing specific enough to point to the root cause.

Root Cause and Resolution

Multi-node deployment with HAA

Robots connected to such a setup may experience the issues described above, primarily due to HAA/Redis-related problems. When Redis does not function as expected, a robot may fail when it attempts to connect to one or more specific nodes.

To resolve this, follow the steps below:

  1. Fix the issues with Redis. Assess the Application event viewer logs on the Orchestrator nodes to gain better insight into the exact issue. If the issue lies beyond the Orchestrator-Redis integration, consult the Redis support team.
  2. If you need a quick workaround, consider temporarily switching Orchestrator to a single node while you diagnose and fix the HAA issues. This can be done by following the steps described in Moving From HAA Multi-Node To A Single-Server Orchestrator.
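Before digging into the event viewer logs, it can help to confirm that each Orchestrator node can actually reach the HAA/Redis endpoint. The following is a minimal sketch, assuming the HAA endpoint speaks the standard Redis protocol; the host, port, and password values are placeholders for your environment:

```python
import socket

def check_redis(host, port=6379, password=None, timeout=3.0):
    """Send a raw inline PING to a Redis/HAA endpoint and report reachability."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            if password:
                # Inline AUTH command; expect "+OK\r\n" on success.
                sock.sendall(b"AUTH " + password.encode() + b"\r\n")
                sock.recv(64)
            sock.sendall(b"PING\r\n")
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, timeout, or DNS failure: endpoint unreachable.
        return False

# Example (placeholder hostname for your HAA endpoint):
# check_redis("haa.example.local", 6379, password="your-haa-password")
```

Run this from each Orchestrator node: if it returns False on some nodes but not others, the problem is likely network-level rather than within the Orchestrator-Redis integration itself.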

Disaster Recovery

  1. In a Disaster Recovery (DR) setup with two active data centers, which essentially requires HAA, the issue is again most likely caused by HAA problems. In this case, follow the steps in the section above.
  2. In an Active-Passive DR setup, this issue occurs when both Orchestrator nodes attempt to connect to the Orchestrator database at the same time. Only the active node should connect to the Orchestrator DB. To ensure this, shut down the passive node completely: stop its application pools and IIS, and shut down the server itself.
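After shutting the passive node down, you can verify it is no longer accepting connections with a quick TCP probe. This is a minimal sketch; the hostname below is a placeholder for your passive Orchestrator node:

```python
import socket

def node_accepts_connections(host, port=443, timeout=3.0):
    """Return True if the host still accepts TCP connections on the given port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder hostname): after a full shutdown, this should
# report False both for the IIS HTTPS port and for any other port.
# node_accepts_connections("passive-orch-node.example.local", 443)
```

A False result for every port confirms the passive node can no longer reach the Orchestrator DB or serve traffic.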

Miscellaneous

  1. This issue might appear when you have recently switched from a multi-node setup to a single-node setup. In that scenario, ensure that the nodes that are no longer in use can no longer establish a connection with the Orchestrator DB. Again, this is achieved by stopping the application pools and IIS and shutting down the old Orchestrator server(s). Also make sure you have removed all the unnecessary node entries from the Load Balancer.
  2. This issue has also been observed when other applications (besides Orchestrator) attempt to establish connections with Orchestrator's SQL database. To check which applications are attempting to connect to the UiPath DB, run the sp_who2 stored procedure.

If anything outside the active UiPath Orchestrator node is trying to connect to the SQL database, it should be eliminated.
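sp_who2 returns one row per SQL session, including Login, DBName, and ProgramName columns. Reviewing its output by hand is error-prone, so here is a minimal sketch that filters exported rows down to the suspect ones; the sample data, database name, and expected-program list are illustrative assumptions for your environment:

```python
# Rows as returned by sp_who2 (subset of columns); sample data is illustrative.
sessions = [
    {"SPID": 51, "Login": "orch_svc",    "DBName": "UiPath",
     "ProgramName": "Core Microsoft SqlClient Data Provider"},
    {"SPID": 62, "Login": "report_user", "DBName": "UiPath",
     "ProgramName": "Microsoft SQL Server Management Studio"},
    {"SPID": 63, "Login": "backup_job",  "DBName": "master",
     "ProgramName": "SQLAgent - TSQL JobStep"},
]

# Client programs you expect on the Orchestrator DB; anything else is suspect.
EXPECTED_PROGRAMS = {"Core Microsoft SqlClient Data Provider"}

def unexpected_connections(rows, db_name="UiPath"):
    """Return sessions on the Orchestrator DB whose program is not expected."""
    return [r for r in rows
            if r["DBName"] == db_name and r["ProgramName"] not in EXPECTED_PROGRAMS]

for row in unexpected_connections(sessions):
    print(row["SPID"], row["Login"], row["ProgramName"])
```

In the sample above, only SPID 62 is flagged: it targets the Orchestrator DB from an unexpected client program, while the backup job is ignored because it connects to a different database.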