Troubleshooting Error #1223 "Robot Is Unresponsive"

Issue Description:

The error "Robot Is Unresponsive" means that the Robot's heartbeat request failed more than five times. This article explains how to diagnose and address the issue.

Root Cause:

This error corresponds to the Unresponsive status described here: https://docs.uipath.com/orchestrator/automation-cloud/latest/user-guide/robot-statuses

When this error is encountered, the most common causes are:

  1. The Robot cannot make a connection to the Orchestrator.
  2. Orchestrator is not properly updating the Robot status.
  3. Orchestrator is not responding in a timely manner.


Resolution:

  1. The best way to address this error is to trace it, starting with the Robot client. It is very important to note the timestamps and to capture the error as soon as it is seen. (For example, if the issue was transient and occurred a month before the logs were captured, it might not be possible to find the cause.)
  2. Capture the alert timestamps. The goal of this step is to determine the time frame in which the Robot was unresponsive and on which host machine it was reported as unresponsive.
    1. If the Robot was reported unresponsive via an email, capture a screenshot of the email message.
    2. If this was noticed through the Orchestrator UI, take the following steps:
      1. Log in to Orchestrator.
      2. Navigate to the alerts page: https://docs.uipath.com/orchestrator/standalone/2023.10/user-guide/alerts.
      3. Tenant -> Alerts.
      4. Find the first instance of the alert and take a screenshot. Make sure to note the timestamp.
      5. If the alert is not enabled or is not visible to us, we can check the monitoring page. See: https://docs.uipath.com/orchestrator/automation-cloud/latest/user-guide/monitoring-unattended-sessions
        1. Tenant -> Monitoring -> Unattended session.
        2. Take a screenshot of the unresponsive robot.
        3. Make sure to note the timestamp.
    3. Example:
      1. [Screenshot of the Orchestrator monitoring page; image not available.]
      2. In the screenshot, DESKTOP-66M9G02 is reported unresponsive. We record that the screenshot was taken on 7/30 at 8pm EST, from the monitoring page.
  3. Try running sp_who2 on the UiPath database.
    1. A very tricky issue to debug is when a rogue Orchestrator instance connects to the Orchestrator database by accident. For example:
      1. An Orchestrator instance is cloned and decommissioned. The VM gets turned back on during patching and the instance is not using Redis (or is pointing to the wrong Redis instance).
      2. Using Azure Web App, scaling was enabled without Redis.
    2. When there is a rogue Orchestrator instance connecting to the database some additional symptoms can be seen from the Orchestrator UI:
      1. Robot states are not consistent. On one page the Robot shows unresponsive and on another it shows connected.
      2. Robots might go from licensed to unlicensed for no reason.
      3. In general, we would see a lot of unexplainable and weird behavior from Orchestrator. The issue would be that another Orchestrator is updating the database without the Redis cache enabled.
    3. Run sp_who2 and verify who is connecting to the database.
      1. Log in to the Orchestrator database as a database admin.
      2. Run: EXEC sp_who2
      3. This command lists all sessions connected to the SQL Server instance. Make sure that only the correct Orchestrator hosts are connecting to the DB.
      4. If any unexpected Orchestrator connections are found, determine which machine the connection is coming from, log in to that machine, and make sure any Orchestrator instance that should be turned off is off.
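The check in step 3 can be partly automated once the sp_who2 output has been exported. The following is a minimal sketch, not an official UiPath tool: it assumes the rows have already been loaded as dictionaries keyed by the sp_who2 column names (HostName, DBName, ProgramName). The function name, host names, and program name are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def unexpected_hosts(rows, database, expected_hosts):
    """Return hosts connected to `database` that are not on the expected list.

    rows: iterable of dicts keyed by sp_who2 column names.
    """
    expected = {h.strip().upper() for h in expected_hosts}
    found = defaultdict(set)
    for row in rows:
        db = (row.get("DBName") or "").strip()
        host = (row.get("HostName") or "").strip().upper()
        # sp_who2 pads values with spaces and shows "." for system sessions.
        if db.lower() != database.lower() or host in ("", "."):
            continue
        if host not in expected:
            found[host].add((row.get("ProgramName") or "").strip())
    return {host: sorted(programs) for host, programs in found.items()}


# Hypothetical sample rows, shaped like sp_who2 output.
sample = [
    {"HostName": "ORCH-NODE1  ", "DBName": "UiPath", "ProgramName": "Core .Net SqlClient Data Provider"},
    {"HostName": "ORCH-OLD    ", "DBName": "UiPath", "ProgramName": "Core .Net SqlClient Data Provider"},
    {"HostName": "  .", "DBName": "master", "ProgramName": ""},
]
print(unexpected_hosts(sample, "UiPath", ["orch-node1"]))
```

Any host returned here is only a candidate. It could also be a legitimate machine, such as a DBA workstation, so it still needs to be checked by hand.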
  4. Once we know when the Robot was unresponsive and on which machine, we will now log in to that machine and capture the logs.
    1. Log in to the Robot machine.
    2. Run the UiPath Diagnostic Tool.
    3. The logs of particular interest in the tool output are the ones that capture heartbeat exceptions:
      1. In the 'Log Files' folder, the Robot.log file.
      2. In the 'Application Logs' folder, the application logs file.
    4. During analysis, we would examine the log files to see what errors the Robot was throwing when the issue occurred (per the alert timestamps).
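To narrow a large Robot.log down to the alert time frame, a small filter can help. This is a rough sketch under one important assumption: that each log entry starts with a timestamp such as "2024-07-01 20:00:03". The real layout may differ, so the pattern and format below would need adjusting. The submitHeartbeat keyword comes from the error messages shown later in this article. The sample lines are made up.

```python
import re
from datetime import datetime, timedelta

# Assumption: entries start with a timestamp in this shape. Adjust as needed.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def heartbeat_errors(lines, alert_time, margin=timedelta(minutes=15)):
    """Return (timestamp, text) pairs mentioning submitHeartbeat near alert_time."""
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match or "submitHeartbeat" not in line:
            continue
        stamp = datetime.strptime(match.group(1), TIMESTAMP_FORMAT)
        if abs(stamp - alert_time) <= margin:
            hits.append((stamp, line.strip()))
    return hits


# Hypothetical log lines for illustration only.
sample_log = [
    "2024-07-01 19:58:41 submitHeartbeat: System.TimeoutException: The operation timed out.",
    "2024-07-01 20:00:03 submitHeartbeat: [Http Status 500][Orchestrator Error Code Unknown]",
    "2024-07-01 20:01:10 Some unrelated message",
    "2024-06-30 20:00:03 submitHeartbeat: System.TimeoutException: The operation timed out.",
]
for stamp, text in heartbeat_errors(sample_log, datetime(2024, 7, 1, 20, 0)):
    print(stamp, text)
```

The timestamps are compared as written in the log, so the alert time must be expressed in the same time zone the log uses.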
  5. There is a known Microsoft issue that can be checked on the Robot machine at this time. See Robot Unresponsive over RDP.
  6. While logged into the Robot machine, check to make sure the Orchestrator URL is accessible. This can be done via a browser or by checking UiPath Assistant (This step assumes the issue is still happening).
    1. Open a browser and go to the Orchestrator URL. If an error or anything unexpected is seen, take a screenshot.
    2. Open the UiPath Assistant if available and take a screenshot of it. (Typically we would expect it to be in an error state.)
      1. If the Assistant is not in an error state, make sure the sp_who2 step was followed.
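If a browser is not convenient on the Robot machine, the same reachability check can be scripted. The sketch below uses only the Python standard library and is an illustration rather than a supported tool. The labels it returns are our own naming, and the URL in the usage note is a placeholder.

```python
import socket
import urllib.error
import urllib.request


def classify(error):
    """Map the outcome of a request to a short label for the ticket notes."""
    if error is None:
        return "reachable"
    if isinstance(error, urllib.error.HTTPError):
        # The server answered, but with an error status (for example 500).
        return f"http_error_{error.code}"
    # URLError wraps the underlying cause in .reason; bare timeouts have none.
    reason = getattr(error, "reason", error)
    if isinstance(reason, (TimeoutError, socket.timeout)):
        return "timeout"
    return "connection_failed"


def probe(url, timeout=10):
    """Request the URL once and classify what happened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return classify(None)
    except OSError as error:  # URLError, HTTPError and timeouts are OSError subclasses
        return classify(error)


# Usage (placeholder URL, replace with the real Orchestrator URL):
# print(probe("https://orchestrator.example.com"))
```

A "timeout" or "connection_failed" result points toward the network path, while an "http_error_500" result means Orchestrator was reached and its logs are the next place to look.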
  7. Gather the Orchestrator logs (This step does not apply to cloud):
    1. For MSI standalone: link.
      1. If the Orchestrator instance is multinode, the logs need to be gathered on all nodes.
    2. For Automation Suite: link.
  8. At this point, all the logs that UiPath needs to triage the issue have been gathered. Raise a ticket and include all the information that was gathered.
  9. Analyze the logs. This step is meant to explain how UiPath analyzes these issues.
    1. In the following examples, assume we encountered the issue on 7/1/2024 at 8pm EST and have now collected all the logs.
    2. A few things to keep in mind:
      1. There should almost always be an exception in the Robot logs. If there is not, it usually means one of the following:
        1. The issue is probably that a rogue Orchestrator is connected to the database.
        2. The logs were captured from the wrong machine, or during the wrong timeframe.
        3. Possibly the Robot is connecting to the wrong Orchestrator instance or the Robot service was forcibly killed.
      2. Unless the Robot exception is a timeout, a corresponding Orchestrator exception is almost always expected. A Robot exception that is not a timeout and has no corresponding Orchestrator exception could mean:
        1. The Orchestrator logs were captured from the wrong machine or during the wrong timeframe.
        2. The Robot is connecting to the wrong Orchestrator instance.
        3. Possibly an IIS setting is blocking the Orchestrator response (very unlikely because we would encounter more obvious errors in the UI).
        4. The request is not being forwarded by a load balancer or proxy.
      3. Always start from the Robot and trace the connection back to Orchestrator.
    3. Scenario 1:
      1. Check the Robot.log file or the Applications.evtx logs from the Robot machine around 8pm on 7/1. The following error message is encountered:
        1. UiPath.Service.Host 23.4.0.0
           submitHeartbeat: [Http Status 500][Orchestrator Error Code Unknown]UiPath.Service.Orchestrator.Clients.OrchestratorHttpException: InternalServer Error

      2. From the above, Orchestrator is returning an HTTP 500 response to the Robot.
      3. Check the Orchestrator logs that were gathered. At around 8pm EST, there is an exception. Search for this exception in the UiPath knowledge base to find an article addressing the issue.
    4. Scenario 2:
      1. Check the logs similarly to scenario 1. This time we encounter:
        1. UiPath.Service.Host 23.4.0.0
           submitHeartbeat: System.TimeoutException: The operation timed out.

      2. Check the IIS logs gathered as part of the Orchestrator log collection. Check to see if at the time of the exception there is any IIS traffic from the Robot machine. There is none.
      3. If the step of checking whether the Robot machine can reach Orchestrator was skipped, perform it now: log in to the Robot machine and try to navigate to the Orchestrator URL. It does not load and fails with a timeout. Because Orchestrator can be reached from the current machine and other machines are working, contact the network team and explain that only this particular machine cannot reach Orchestrator. The network traffic may have been accidentally disabled.
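Checking the IIS logs for traffic from one machine, as in scenario 2, can be sketched as follows. The code relies on the W3C extended log format, where a "#Fields:" line names the columns, and it assumes the date, time, and c-ip fields are being logged. IIS writes W3C timestamps in UTC by default, so the alert time is converted first. Here "EST" is taken literally as UTC-5; if the site observes daylight time in July, the offset would be -4. The IP addresses and the request path in the sample are hypothetical.

```python
from datetime import datetime, timedelta, timezone

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def requests_from(lines, client_ip, start_utc, end_utc):
    """Return W3C log entries from client_ip between start_utc and end_utc."""
    fields, hits = [], []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the entries that follow
            continue
        if line.startswith("#") or not line.strip() or not fields:
            continue
        entry = dict(zip(fields, line.split()))
        if entry.get("c-ip") != client_ip:
            continue
        stamp = datetime.strptime(f"{entry['date']} {entry['time']}", LOG_TIME_FORMAT)
        stamp = stamp.replace(tzinfo=timezone.utc)
        if start_utc <= stamp <= end_utc:
            hits.append(entry)
    return hits


# 8pm on 7/1/2024, treating EST as a fixed UTC-5 offset (assumption).
alert = datetime(2024, 7, 1, 20, 0, tzinfo=timezone(timedelta(hours=-5)))
window = timedelta(minutes=15)

# Hypothetical log excerpt.
sample_iis = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time s-ip cs-method cs-uri-stem c-ip sc-status time-taken",
    "2024-07-02 00:59:58 10.0.0.5 POST /hypothetical/heartbeat 10.0.0.21 200 35",
    "2024-07-02 01:00:30 10.0.0.5 POST /hypothetical/heartbeat 10.0.0.22 200 41",
]
print(len(requests_from(sample_iis, "10.0.0.99", alert - window, alert + window)))
```

An empty result for the Robot machine's address, while other machines show traffic in the same window, matches the situation described in scenario 2. If a load balancer or proxy sits in front of Orchestrator, c-ip may hold its address rather than the Robot machine's.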
    5. Scenario 3:
      1. Check the logs similarly to scenarios 1 and 2. In the Robot machine logs there is no exception.
      2. Check the Orchestrator logs around the time frame 7/1 8 pm EST. There are no errors.
      3. If the step of running sp_who2 was skipped, run it now. It shows that a decommissioned instance of Orchestrator is connecting to the database.
      4. Log in to that Orchestrator instance and shut down the IIS services. Afterward, uninstall Orchestrator from that instance.
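The three scenarios follow the same decision path, which can be summarized in a few lines of code. This is only a condensed restatement of the guidance above, with hypothetical names, and is not a substitute for reading the logs.

```python
def next_step(robot_exception, orchestrator_exception_found):
    """Suggest where to look next.

    robot_exception: None (no exception in the Robot logs), "timeout", or "other".
    orchestrator_exception_found: True if the Orchestrator logs show an
    exception in the same time frame.
    """
    if robot_exception is None:
        # Scenario 3: nothing in the Robot logs.
        return "Run sp_who2; also confirm the logs come from the right machine and time frame."
    if robot_exception == "timeout":
        # Scenario 2: the request may never have reached Orchestrator.
        return "Check IIS logs for traffic from the Robot machine, then test network reachability."
    if orchestrator_exception_found:
        # Scenario 1: Orchestrator answered with an error.
        return "Search for the Orchestrator exception in the knowledge base."
    return "Confirm the Orchestrator logs, the target instance, and any load balancer or proxy."


print(next_step("other", True))
```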