We have a recent problem that came with updating our robots to 20.4.3: for some reason, two jobs start under the same user on two different machines, even though the robot with that user is configured to use only one of the machines.
For context, we use classic folders on our on-premises Orchestrator, version 2020.4.2.
In this image you can see two instances of the job (they were not running concurrently; more on that later). We named the robot 28-01 after the server it is connected to, VDA028. But as you can see from the image, the job list shows that the job also started on VDA013, our other robot machine, even though robot 28-01 is specifically connected to VDA028 (as is always the case with classic folders).
What is even more puzzling is what we find in the logs generated by the job that ran on the wrong machine. The logs start like this, and you can immediately spot that one entry comes from a different machine than the other:
As you can see, the job started twice, and on top of that it started with the same job key. The previous image was taken from the job logs; since the two runs share a job key, they both show up in the same log view. Here are the details from the “execution started” entries: the job key is the same, but the machine is different, while the user and the robot are the same.
The two jobs then run concurrently (which in itself causes headaches, as they share the robot user). Another problem arises when one of the two tangled jobs ends: at that moment, Orchestrator flags the job (or rather, the job with this job key) as finished and marks the robot as available. But in reality the robot is not yet available, because the other tangled job is still running!
This leads Orchestrator to start a new job on the same robot (as it should; the robot is marked as available, after all), which then fails with the following error:
This error message was the initial fault we found, but we were able to trace it back to two jobs running simultaneously with the same job key, on the same robot, but on two different machines.
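For anyone who wants to check their own environment for this, the tracing we did by eye can be sketched as a small script that groups job records by key and flags any key seen on more than one machine. This is only an illustration; the field names (`jobKey`, `machineName`) are placeholders for however you export your job or log data, not an official schema.

```python
from collections import defaultdict

def find_tangled_jobs(records):
    """Return job keys that appear on more than one machine,
    together with the machines involved."""
    machines_by_key = defaultdict(set)
    for rec in records:
        machines_by_key[rec["jobKey"]].add(rec["machineName"])
    return {
        key: sorted(machines)
        for key, machines in machines_by_key.items()
        if len(machines) > 1
    }

# Hypothetical sample mirroring what we saw: one key, two machines.
records = [
    {"jobKey": "1111-aaaa", "machineName": "VDA028"},
    {"jobKey": "1111-aaaa", "machineName": "VDA013"},
    {"jobKey": "2222-bbbb", "machineName": "VDA028"},
]
print(find_tangled_jobs(records))
# → {'1111-aaaa': ['VDA013', 'VDA028']}
```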
If you need more details or I didn’t manage to explain things clearly enough, I’ll be happy to elaborate. But has anyone else here encountered the same issue, and if so, are there any fixes we could try?
- We already deleted and re-added our robots with new names; it didn’t help.
- The job is started by a queue trigger.
- The same problem can also occur when starting jobs manually with X runtimes, not just from triggers.
- This problem occurs seemingly at random, roughly every hour or two.
- All robots are unattended.
- We deleted and re-added the machines (and the robots as well). No help; it still keeps happening.
- A job run from a time trigger, with input “allocate dynamically: 1 times”, still started twice with a single job key.
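As a stopgap while we hunt for the root cause, we have been considering polling Orchestrator and alerting whenever a running job key shows up on two machines at once. A rough sketch, assuming the standard OData `/odata/Jobs` endpoint and guessing at the `Key` and `HostMachineName` fields on the response (check the API reference for your exact Orchestrator version; authentication and the URL are placeholders here):

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

ORCH_URL = "https://orchestrator.example.com"  # placeholder URL
TOKEN = "..."  # bearer token; how to obtain one depends on your setup

def fetch_running_jobs():
    """Fetch the currently running jobs from the OData endpoint."""
    query = urllib.parse.urlencode({"$filter": "State eq 'Running'"})
    req = urllib.request.Request(
        f"{ORCH_URL}/odata/Jobs?{query}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["value"]

def duplicate_keys(jobs):
    """Return job keys that are currently running on more than one machine."""
    machines = defaultdict(set)
    for job in jobs:
        machines[job["Key"]].add(job["HostMachineName"])
    return {key: sorted(hosts) for key, hosts in machines.items() if len(hosts) > 1}

# Demo with canned data instead of a live call:
sample = [
    {"Key": "aaaa-1111", "HostMachineName": "VDA028"},
    {"Key": "aaaa-1111", "HostMachineName": "VDA013"},
]
print(duplicate_keys(sample))
# → {'aaaa-1111': ['VDA013', 'VDA028']}
```

In a real deployment you would call `duplicate_keys(fetch_running_jobs())` on a schedule and raise an alert on a non-empty result.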