Hi!
Problem
The problem we are trying to solve is the risk of returning to work in the morning to find that a trigger has been firing repeatedly and generating many faulted jobs. This risk exists because we have time triggers that fire as many as 24 times a day, and we use queue triggers for “on-demand” processing that can fire as often as every half hour (48 times a day, per queue trigger).
The root cause of a given incident of repeated faulted jobs is not a concern here. It may be a service that went down or an expired password. We are confident that we can resolve the issue when we return to the office.
Goal
The goal is to improve our own internal support offering, reduce faulted jobs, and avoid as many self-inflicted support headaches as possible. Ideally, we’d like to limit faulted jobs by disabling related triggers after a certain number of consecutive faulted jobs.
We would prefer to use a native UiPath solution rather than build our own. We wanted to use “execution-based trigger disabling” to accomplish this. However, the feature does not seem to function as advertised: most descriptions of it suggest that job status drives the behavior, while our testing has shown that job status has no effect on trigger disabling. So far, there is no discernible impact on trigger disabling whatsoever.
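For reference, the rule we were hoping the feature would apply can be sketched in a few lines of Python. This is only an illustration of our reading of the setting descriptions (consecutive faulted jobs, stopped jobs ignored, a grace period counted from the first failure). It is not how Orchestrator implements anything; the function name, its parameters, and the idea of passing in a list of finished jobs are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def should_disable_trigger(
    jobs: Iterable[Tuple[datetime, str]],
    threshold: int,
    grace_days: int,
    now: datetime,
) -> bool:
    """Hypothetical rule: disable after `threshold` consecutive faulted jobs,
    once `grace_days` have passed since the first fault in the current streak.

    `jobs` is a list of (finished_at, state) pairs, oldest first.
    """
    consecutive = 0
    first_failure: Optional[datetime] = None
    for finished_at, state in jobs:
        if state == "Stopped":
            # Our reading of the docs: stopped jobs are not counted at all.
            continue
        if state == "Faulted":
            consecutive += 1
            if first_failure is None:
                first_failure = finished_at
        else:
            # Any other outcome (e.g. "Successful") ends the streak.
            consecutive = 0
            first_failure = None
    if first_failure is None or consecutive < threshold:
        return False
    return now - first_failure >= timedelta(days=grace_days)
```

With a threshold of 2 and a grace period of 0 days, two faulted jobs in a row would return True, which is the point at which we expected the trigger to be disabled.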
Question
My question is: what does this feature actually do and, if it isn’t driven by job status, what other ways exist to solve our problem? We’ve attempted to answer this ourselves to no avail. For those interested, our testing methods and findings are described below.
Testing Methods
- Method A Components:
- Process - “FaultingProcessTest”, an unattended automation which only contains a throw activity. It is designed to fault and do nothing else.
- Multiple triggers with various configurations.
- Frequency - Daily, hourly, and every 10 minutes.
- Consecutive job execution fail count - 1, 2, or 3
- Grace period on disabling the trigger (days) - 0 or 1
- Robot - Different robots.
- Purpose - test the relationship between this feature and faulted jobs.
- Expectation - While the trigger is enabled, we expected jobs to be triggered and fault immediately.
- Result - Jobs triggered, faulted immediately, but triggers were not disabled, even after exceeding the specified maximum. Triggers were allowed to run for days, to allow for possible backend processing to occur.
- Method B Components:
- Process - “RobotHold”, an unattended automation designed to open a message box to “hold” a robot and prevent a scheduled job from being triggered during maintenance. Duration is configured through an argument. The message box is dismissible in an attended scenario, to stop holding the robot.
- Multiple triggers
- Frequency - intentionally overlapped schedules. Every minute and every 15 minutes.
- Consecutive job execution fail count - 2
- Grace period on disabling the trigger (days) - 0
- Duration - 5 minutes or 2 hours
- Robot - Same robot.
- Purpose - test the relationship between this feature and failed triggers.
- Expectation - While the robot is held, we expected triggers to fail.
- Result - Triggers failed, but were not disabled, even after far exceeding the specified maximum. Triggers were allowed to exceed those maximums for several hours, to allow for possible backend processing to occur.
Findings
Through testing and inspection of the UiPath database, I’ve learned a few things, but I remain confused as to what this feature actually does. I was unable to attach the SQL query I used to review the results.
- The Orchestrator UI and the documentation suggest that triggers are to be disabled based on JOB execution.
- From the edit trigger page:
- “Set execution-based trigger disabling”
- “Disable when consecutive job execution fail count”
- From the documentation:
- This source, towards the bottom, relates directly to the feature available in the Orchestrator UI: Orchestrator - Creating a time trigger
- It suggests this feature is used “to control when the trigger is disabled once a job fails.” For those who support automations in a production environment, a job failure implies a faulted job.
- “The trigger is disabled after the number of failed executions you choose for this setting.” Similarly, a failed execution implies failed execution of a job, but is admittedly more ambiguous.
- “Stopped jobs are not counted towards this value.” This is a direct reference to the influence of job status, implying a contrast between the influence of stopped and faulted job statuses.
- “The number of days to wait before the trigger is disabled after the first failure of a job.”
- This source, related to the overall Orchestrator config file, offers no further clarity: Orchestrator - UiPath.Orchestrator.dll.config
- It references the config settings Triggers.DisableWhenFailedCount and Triggers.DisableWhenFailingSinceDays, which sound similar to the “execution-based trigger disabling” that appears in the Orchestrator UI. The documentation uses slightly different language, such as “failed launches”, which suggests that these settings relate to failed triggers instead of failed jobs, but this remains unclear.
- These excerpts from the documentation prove vague and potentially misleading.
- This and other experiences contribute to the suspicion that new UiPath documentation is not being written by informed humans.
- Despite messaging that uses the word “job” a lot to describe this feature, the following UiPath database column values (see ProcessSchedules table) are only updated in relation to TRIGGER execution.
- TotalSuccessful - Trigger successfully fired and produced a pending or running job.
- CurrentConsecutiveSuccessful - Consecutive count for the above.
- LastSuccessfulTime - UTC timestamp for the above.
- TotalFailures - Trigger failed to produce a pending job.
- CurrentConsecutiveFailures - Consecutive count for the above.
- LastFailureTime - UTC timestamp for the above.
- Also within the UiPath database ProcessSchedules table, the following two columns relate directly to fields in the Orchestrator UI. However, testing has not proven a link between these columns and the ones mentioned above.
- ConsecutiveJobFailuresThreshold = “Disable when consecutive job execution fail count”
- JobFailuresGracePeriodInHours = “Grace period on disabling the trigger (days)”
- Another reason I’ve come to believe that this feature has no relation to job execution status (faulted vs successful jobs) is that the following column values remained unchanged. This is despite the similarity to the column names related to this feature, mentioned above.
- ConsecutiveJobFailures
- FirstFailedJobTime
- After all these tests, we found no evidence that this feature does anything besides change two columns in the UiPath database when the trigger is configured.
- No triggers were disabled.
- No jobs were stopped from faulting.
- One possible reason we’re not getting expected results may be the “grace period”. When a trigger has exceeded the number of consecutive failures, it is not disabled immediately. Perhaps there is a periodic check performed by the Orchestrator within the grace period, even when the number of days is zero. With such short tests, we may be missing that periodic check.
- We’re attempting to perform a longer-term test where we create a conflicting trigger which can NEVER be successful to prove whether this feature does anything.
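To summarize how the trigger-level counters appeared to behave in our tests, here is a small Python model. It reflects only our reading of the observed column values, not Orchestrator's actual code. The class and method names are made up for illustration, and resetting the opposite "consecutive" counter on each fire is an assumption. The key point is that the only input is whether the trigger managed to create a job; the job's final status never enters the picture.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TriggerCounters:
    """Hypothetical in-memory stand-in for one ProcessSchedules row."""

    total_successful: int = 0                        # TotalSuccessful
    current_consecutive_successful: int = 0          # CurrentConsecutiveSuccessful
    last_successful_time: Optional[datetime] = None  # LastSuccessfulTime
    total_failures: int = 0                          # TotalFailures
    current_consecutive_failures: int = 0            # CurrentConsecutiveFailures
    last_failure_time: Optional[datetime] = None     # LastFailureTime

    def record_fire(self, job_created: bool, when: datetime) -> None:
        """Update counters after the trigger fires.

        `job_created` is True when the fire produced a pending or running job.
        Whether that job later faults is deliberately not an input.
        """
        if job_created:
            self.total_successful += 1
            self.current_consecutive_successful += 1
            self.current_consecutive_failures = 0  # assumption: streak resets
            self.last_successful_time = when
        else:
            self.total_failures += 1
            self.current_consecutive_failures += 1
            self.current_consecutive_successful = 0  # assumption: streak resets
            self.last_failure_time = when


# Method A: every fire creates a job, and every job then faults.
method_a = TriggerCounters()
for _ in range(3):
    method_a.record_fire(job_created=True, when=datetime.now(timezone.utc))

# Method B: the robot is held, so the trigger cannot create a job.
method_b = TriggerCounters()
for _ in range(3):
    method_b.record_fire(job_created=False, when=datetime.now(timezone.utc))

print(method_a.current_consecutive_failures, method_b.current_consecutive_failures)
```

In this model, Method A never accumulates a failure even though every job faults, while Method B accumulates one failure per fire. That matches the counter values we saw, but it still does not explain why the Method B triggers were never disabled.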