Why do is intermittent Error "An error has occurred. Error code: 0" thrown when trying to Get Queue items or Add Transaction Items
Issue Description:
Get Queue Items Activity within a Process fails intermittently with an error "An error has occurred. Error code: 0"
Root Cause:
This error indicates that a server-side error occurred. It is the same as if the server returned a 500 error. The causes of this could be:
- Timeout on SQL query.
- A specific exception related to the query or operation itself.
- Request not reaching the Orchestrator server (In this case we would usually expect other API calls to suffer the same issue unless it's a firewall-specific setting blocking the calls).
Diagnosing the Issue:
- Gather the Robot logs. These can be gathered from the Orchestrator UI (see link) or from the Robot machine using the Diagnostic tool (see link).
- It is critical to note the timestamp of the error. It is needed to correlate the exception with Orchestrator.
- If using tools like Spunk or Elastic Search to aggregate the logs, pull the logs from there. Ideally, if the error occurred at 7 am, the logs would pull from at least the hour time range around that time (i.e. 6:30-7:30).
- For cloud, we only need the robot logs and the support ID. The rest of the steps are for on-premise only.
- Gather the Orchestrator logs. If this is a multi-node instance, the logs from all servers are needed.
- The necessary logs are the application logs and the HTTP logs. In Standalone this would be the IIS logs and in Automation Suite this is part of the log bundle. If using a tool like ElasticSearch or Splunk for logs and the IIS logs are not included, they need to be manually gathered (only applies to standalone).
- It is very important that the logs gathered include the event. If they do not we cannot make a correlation.
- Links for log gathering:
- The most common causes of these errors are DB issues. As such there are some things we can check on the DB:
- Get a count of the database rows, etc: link.
- There is no set limit on the size of the DB. The size and hardware needed for an Orchestrator DB are determined by the use case. Very generic estimates based on some use cases can be found in the docs here: link. However, actual performance may vary based on the use case and it is important to monitor the performance of the DB server.
- The count servers to give us an estimate on the DB profile. If, for example, DB maintenance is recommended, it will be based on actual performance data (i.e. the IIS logs show a long request time for basic queries, in the Orchestrator logs we see explicit timeouts, etc.).
- Hardware specs of the database server.
- Check that read committed snapshot and allow snapshot isolation are turned on (Starting on 21.10 this is mandatory to have enabled).
-
-- Check if ALLOW_SNAPSHOT_ISOLATION is enabled SELECT name, snapshot_isolation_state_desc FROM sys.databases WHERE name = 'MyDatabase';
– Check if READ_COMMITTED_SNAPSHOT is enabled
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = ‘MyDatabase’;
-
- If deadlocks or timeouts are observed in the Orchestrator logs, a SQL profiler trace may be requested: link.
- SQL query store can also be helpful as well.
- Get a count of the database rows, etc: link.
Resolution:
- If the errors correlate to a database connection timeout then running database maintenance can be a solution, as long as the maintenance is in line with the business retention policies.
- Maintenance tools: link
- Sample exception that might lead to this recommendation (this is a partial stack trace that simply shows the timeout. We would validate the timeout corresponds to the queue item operation):
-
System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
-
- If Maintenance is not an option, database performance metrics should be checked to see if this is a hardware limitation.
- If the issue has a none timeout exception, the problem will be addressed by fixing the underlying issue. Most times the database isolation and read committed settings are not enabled.
- As a short-term workaround, the timeout on the activity can be increased or it's possible that automations can have their schedules distributed farther apart to prevent the issue.