Issue investigation in Orchestrator

TL;DR: for issue investigation, the following Orchestrator improvements would be awesome:

  1. Redesigned jobs and logs overview screens with less clutter and more meaningful information
  2. Filter jobs and logs on a meaningful time window (e.g. yesterday between 10:00 and 10:30)
  3. Job search to also consider the relevant exception and log information and for example summarize it in the results
  4. Show the actual times on the jobs screen instead of “a day ago”
  5. Remember how many results the user wants to see per screen type (e.g. 25 jobs and 100 logs)
  6. Have a “job” screen that shows additional information about the job like: logs, exception messages, related queue items, related assets
  7. Link back from a log item found via the logs screen to the originating job

When investigating issues with bot processes, Orchestrator is a vital part as it contains a lot of information that lets you figure out:

  1. What the actual problem was: i.e. the exception in both the jobs and queues
  2. What happened up until that moment: the job logs and queue item status
  3. The context in which it happened: queue item data, assets

However, getting to these pieces of information is very inefficient, making issue investigation a lot more cumbersome than it needs to be.

When an issue occurs we often get an email, either from the business or from the robot on the back of an exception. This contains information like:

  1. The process that failed
  2. The time it occurred
  3. The exception message
  4. A screenshot of the current state
  5. Some unique identifier of the item it was processing

So the first port of call would be to go to the jobs overview and find the issue. But when there are multiple issues, getting to the right one means hovering over the individual jobs’ start/end times to find the one closest to when the issue occurred, or clicking the info button of each job and scrolling up to see whether it matches your exception message. It would be a lot easier if you could see the actual time (instead of “2 hours ago”) and the exception message, as well as being able to search on information in the exception message. What would be really neat is if the search would also go through the related logs and show jobs that match. The unique identifier and the (unique) screenshot filename are, in our case at least, in the logs, so perfect to search for.
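Until such search exists, one workaround is to query the Orchestrator OData API directly for faulted jobs in the reported time window instead of hovering over each job. A minimal sketch in Python, assuming the `/odata/Jobs` endpoint with `StartTime` and `State` fields (verify the field names against your Orchestrator version, and add your tenant authentication before calling it):

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def faulted_jobs_filter(start: datetime, end: datetime) -> str:
    """Build the OData $filter for jobs that started in [start, end)
    and ended in a Faulted state. Field names (StartTime, State) are
    assumptions based on the public OData conventions."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (f"StartTime ge {start.strftime(fmt)} "
            f"and StartTime lt {end.strftime(fmt)} "
            f"and State eq 'Faulted'")

# “Between 10:00 and 10:30” as an absolute UTC window:
flt = faulted_jobs_filter(
    datetime(2020, 5, 4, 10, 0, tzinfo=timezone.utc),
    datetime(2020, 5, 4, 10, 30, tzinfo=timezone.utc),
)
# The hostname is a placeholder; newest faulted job first.
url = "https://orchestrator.example.com/odata/Jobs?" + urlencode(
    {"$filter": flt, "$orderby": "StartTime desc"}
)
```

This only narrows the candidate jobs by time and state; matching the exception message still means reading each result, which is exactly what built-in search over exception and log text would fix.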

You could of course also use the “logs” screen to search for those directly, but in that case you lose the relation to the job, so you get far too much totally unrelated log information. Assuming you did find the relevant job and are looking through the related logs, the default of 10 log lines is useless as you’re looking through all the steps and not just the last couple (and yes, you can change it to 50, but even if that’s enough you need to do it every time). Searching doesn’t help either, because that gives you a single line and again you miss what happened around it.
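What log search should really give you is grep-style context: the matching line plus the lines around it. A small sketch of that behaviour over an exported list of log lines (plain Python, no Orchestrator API assumed) to illustrate what the Logs screen could show:

```python
def search_with_context(lines, needle, context=3):
    """Return every line containing `needle`, plus `context` lines
    before and after each hit, preserving the original order and
    de-duplicating overlapping windows."""
    hits = [i for i, line in enumerate(lines) if needle in line]
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]
```

For example, searching exported job logs for the screenshot filename would then return the failing step together with the steps just before it, instead of one isolated line.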

Additionally, figuring out which queue items and/or assets were used for a particular job is highly dependent on the job and often requires digging through the logs. So finding the “data” that led up to the issue is also not as easy as it could be.
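The unique identifier from the business email can at least be turned into a direct queue item lookup via the OData API. A hedged sketch, assuming a `/odata/QueueItems` endpoint with a `Reference` field holding that identifier (both names should be verified on your Orchestrator version):

```python
def queue_item_filter(reference: str) -> str:
    """OData $filter that matches queue items by their unique reference.
    Single quotes are doubled per OData string-literal escaping."""
    escaped = reference.replace("'", "''")
    return f"Reference eq '{escaped}'"

# Would be appended to .../odata/QueueItems?$filter=...
flt = queue_item_filter("INV-0042")
```

This covers queue item data; which assets a job read is still only discoverable from the logs, which is why a job screen that lists related queue items and assets would help so much.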

So while these screens provide useful information, it would be nice to have some further improvements that consider how you would use them (at least from an investigation point of view) rather than what data could be displayed.


Thank you @mschuurman for the detailed feedback! We deeply appreciate the diligence and effort you put in to help us improve the product! The product and design teams will take it in and work on addressing it. FYI @iamwiliamb