Can I have robots in "Active-Passive HA" mode?

Hi team,
I am working with a prospect who has a very specific requirement and is currently evaluating both AA and UiPath. In a nutshell, the prospect wants “Active-Passive HA” at the robot level. Let me add some details to clarify the requirement.

  • The prospect will procure 2 robots/runtimes but they will be running on different machines.
  • One robot - let’s call this primary - will execute the jobs it has been assigned to run.
  • Second robot - let’s call this one secondary - will have the same jobs assigned to it as the primary one.
  • The secondary robot stays in “listen-only” mode when the primary robot is functional. It does not care how many queue items are there in the primary robot’s bucket or how busy it is.
  • The secondary robot jumps into action if and only if the primary robot goes down - for whatever reason. It then continues to function as primary till the actual primary comes up.
  • The original primary then resumes the role of primary and the secondary goes back to “listen-only” mode.
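
The rules in the list above can be written down as a small decision function. This is only a sketch to pin down the behaviour being asked for; the names `Role` and `secondary_role` are made up for illustration and nothing here is a UiPath feature.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    LISTEN_ONLY = "listen-only"


def secondary_role(primary_is_up: bool) -> Role:
    """Role the secondary robot should take, per the requirement.

    Only the primary's health is an input; queue depth and how busy
    the primary is are deliberately ignored.
    """
    return Role.LISTEN_ONLY if primary_is_up else Role.ACTIVE


# Primary up -> down -> down -> up again (hypothetical timeline):
# the secondary takes over, then hands back.
timeline = [True, False, False, True]
roles = [secondary_role(up).value for up in timeline]
print(roles)
```

The hard part is not this rule but deciding reliably whether the primary is actually up.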

I hope I have been able to explain my issue. Please do let me know if you folks have any questions. I am more than happy to try and answer them. Looking forward to guidance from the experts here.

My first question would be why?

This seems like a complete waste of a licence (which isn’t cheap) to have it sitting idle like that 99% of the time.

You’ve come to us with a requirement, but without us knowing the problem this requirement is trying to solve I don’t think we can give the best advice as this might simply be the wrong way to solve whatever the problem is.

Hi @ricedil794,

In short they want robot fault tolerance of 1 (1 robot can be down out of 2 at any point of time)

In that case I don’t see any need to limit the process to the primary robot. Let the process run on both robots, as this will ensure better robot license utilization.

This way you will still maintain a robot fault tolerance of 1 but also improve the processing time/utilization of license.

In theory there can also be a common-cause failure (Orchestrator is down, user is invalid); in that case both your robots may be down at the same time. There is no escaping such events unless there are contingencies/fault tolerance on those aspects as well.

My suggestion is to convince your client to think about these tradeoffs and scenarios.
I agree with @Jon_Smith that having a robot just wait for a failure of the primary robot is a waste of a license.

Yeah, you see, depending on the type of fault you could have it so that this is done with one licence. Let’s say server one completely dies: provided the two servers are both on the same machine template, the processes can then run on the second server as the licence is then released.

However, it doesn’t mitigate things like the UiPath Service freezing. I’ve seen it stuck with the ‘xxx is not responding’ popup, which has to be closed manually; the licence isn’t released until that happens, and the Orchestrator isn’t really able to see that there is a problem.

It’s why I want to properly understand the original problem to solve, as I feel that if the problem is as you describe (also my assumption) then this will still leave a lot of holes for things to go wrong.

Hi @Jon_Smith,
Fair point, let me explain their thought process. Full disclosure, we already told them the points raised by you & @jeevith. The issue here is that the deployment is being done at an isolated site where it is not possible to run another robot after the first one fails, so they need to keep both of them running. We also told them that server/service level faults cannot be managed in this case. The prospect is adamant about his requirement, saying they will take care of server DR, but that is a larger investment. For now, we (as in I and the team) need to give him a solution for fault tolerance of the robot. As Jeevith suggested, our default option was to let them both run and get utilized, but they have some weird (internal) requirement where they are allowed to procure only 1 license as of now and the other has to be a DR one which should, and I quote, “only run when the first one fails”. :neutral_face:

Anyways, to cut the story short, I know this is not a logical task and believe me, I have gone nuts trying to explain this but to no avail. I wanted to ensure that I am not missing any UiPath feature before giving up on this opportunity.

Ok, so in my opinion one licence covers this.

You have a single machine template and assign 1 licence to it, ensuring the licence goes to the main server. If that server goes down the Orchestrator will lose the heartbeat from that server and then the licence will become free and go to the second server.

That being said, should server one go down for maintenance etc. then it can also switch to server two and get stuck there, which sounds like it’s not desired.
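
To illustrate the behaviour I am describing, here is a toy model in Python. It is a sketch under my own assumptions, not Orchestrator code: `LicencePool` and the server names are hypothetical, and the real allocation logic may differ in detail.

```python
class LicencePool:
    """Toy model of one licence shared by machines on one template.

    Hypothetical illustration only, not Orchestrator code.
    """

    def __init__(self, machines):
        self.connected = {m: False for m in machines}
        self.holder = None  # machine currently holding the single licence

    def connect(self, machine):
        self.connected[machine] = True
        if self.holder is None:
            self.holder = machine  # a free licence goes to whoever connects

    def disconnect(self, machine):
        self.connected[machine] = False
        if self.holder == machine:
            # Heartbeat lost: the licence is released and reallocated
            # to another connected machine, if there is one.
            self.holder = None
            for other, up in self.connected.items():
                if up:
                    self.holder = other
                    break


pool = LicencePool(["server1", "server2"])
pool.connect("server1")      # main server connects first and gets the licence
pool.connect("server2")      # second server is connected but unlicensed
holder_start = pool.holder

pool.disconnect("server1")   # server1 dies or goes down for maintenance
holder_after_failure = pool.holder

pool.connect("server1")      # server1 returns, but server2 keeps the licence
holder_after_recovery = pool.holder

print(holder_start, holder_after_failure, holder_after_recovery)
```

In this model the licence never moves back on its own, which is the "gets stuck on server two" case.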

If the second server could only connect to the Orchestrator when the first server goes down this would work better.

If they allow two servers, but don’t care which one it runs on as long as it only runs on one server, that is much easier to manage.

I am curious about the limitation stated as ‘it is not possible to run another robot after the first one fails’

Does a fail include something like an application crashing? Cause if so what a weird requirement…?

Are you sure they haven’t had some awful advice from another vendor and are now stuck on that bad advice?

Haha
My friend, you have hit a jackpot here:

Are you sure they haven’t had some awful advice from another vendor and are now stuck on that bad advice?

They have 3 out of the big 4 consulting them on one aspect or another of their business and, as usual, everyone is a stakeholder in the digital transformation projects - I am sure you have seen this before. If there were ever a movie called “Too many cooks spoil the broth”, the BAU of this org could be the de facto plot for it! Anyways…

As for your query,

I am curious of the limitation stated as ‘it is not possible to run another robot after the first one fails’

The issue here is that this is an isolated site (I can’t say more, security sensitive prospect - think of it as a black site), so once they deploy the automations in production, no-one who knows UiPath will have access to either the Orchestrator or the robot machines (talk about a recipe for disaster when you are trying to avert one), which is why the failover has to be automatic. No one will be able to manually shift the license to the second server, on which the secondary robot will be.

Is there a way to start robots via API assuming the robot service is already running on that server? In that case, I can achieve automatic failover, but that still means I have to run some external service to monitor the heartbeat like you suggested and then get the second agent up automatically (although 100 question marks are popping up in my head as I type this).
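
To make the idea concrete, this is roughly what I picture the external monitor doing. Everything here is hypothetical: `start_secondary` and `stop_secondary` are placeholders for whatever start/stop mechanism turns out to exist (not real UiPath calls), and the threshold of 3 missed heartbeats is an arbitrary assumption.

```python
from typing import Callable, Iterable


def run_watchdog(
    heartbeats: Iterable[bool],
    start_secondary: Callable[[], None],
    stop_secondary: Callable[[], None],
    max_missed: int = 3,
) -> None:
    """Fail over after `max_missed` consecutive missed heartbeats,
    and fail back as soon as the primary answers again.

    `heartbeats` yields one True/False per polling interval.
    """
    missed = 0
    secondary_active = False
    for alive in heartbeats:
        missed = 0 if alive else missed + 1
        if not secondary_active and missed >= max_missed:
            start_secondary()   # placeholder callback, see lead-in
            secondary_active = True
        elif secondary_active and alive:
            stop_secondary()    # placeholder callback, see lead-in
            secondary_active = False


events = []
# A single missed poll is tolerated; four in a row trigger failover.
polls = [True, False, True, False, False, False, False, True]
run_watchdog(
    polls,
    start_secondary=lambda: events.append("start"),
    stop_secondary=lambda: events.append("stop"),
)
print(events)
```

Requiring several consecutive misses avoids failing over on a single dropped poll, at the cost of a slower reaction.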

This is asking for a unicorn.

They want to deploy UiPath, to a place where no-one knows UiPath, and have it fix its own problems itself without needing any UiPath specialist help…

They need to get real on some of this, at some point, someone with UiPath knowledge might need to help…

To clarify on the robot fail thing, I perhaps wasn’t clear. Do they mean if a job fails, or that the entire robot service crashes?

I also think there are network monitoring tools that should help here. Provided there is someone competent who can monitor the servers and services to make sure they run, perhaps that would solve it.

Seems like a stubborn situation and indeed made worse by the bad advice :stuck_out_tongue: unfortunately I get it because I have been there myself.
And no worries on the limitations on what you can share, totally appropriate. I think you are sharing enough for us to get the situation without crossing a professional boundary.

IKR!

To clarify on the robot fail thing, I perhaps wasn’t clear. Do they mean if a job fails, or that the entire robot service crashes?

Oh my bad, they mean the entire service crashes or essentially that server hardware crashes.

Cool, so if the server crashes entirely and you have a single licence on a machine template and that machine template is used for both servers then it will switch over automatically once the first server is down and disconnected as the licence becomes free and gets reallocated. As mentioned above, it might move during a server reset or something so I think you’d need to make it clear the second machine is not ‘only in case of backup’ but that either could be used, but only one at a time.

If the service crashes, well, that is actually tougher. The service can crash and close, and then the behaviour above would occur. But I have also seen it where the service crashes and then pops up that ‘an error has occurred, would you like to report it to Microsoft’ popup. The problem is that the service is still kinda running whilst that popup is there, and still delivering a heartbeat to the Orchestrator. This means the licence won’t get removed.

I am not a network specialist, but if you can suppress those warning popups then I reckon you are good, if not, you need some sort of network monitoring tool that looks for things like this crashing.
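
A rough sketch, in Python, of the kind of check I mean. The function name, the two timestamps and the timeout values are all assumptions for illustration; how you would actually obtain a "last progress" signal depends on your setup.

```python
def classify(
    now: float,
    last_heartbeat: float,
    last_progress: float,
    heartbeat_timeout: float = 120.0,   # assumed value, in seconds
    progress_timeout: float = 900.0,    # assumed value, in seconds
) -> str:
    """Classify a robot from two timestamps (seconds).

    A heartbeat-only check cannot tell 'hung' from 'healthy'; a second
    signal (when the robot last did any work) can.
    """
    if now - last_heartbeat > heartbeat_timeout:
        return "down"   # heartbeat lost: the licence would be released
    if now - last_progress > progress_timeout:
        return "hung"   # still heartbeating, e.g. stuck behind a popup
    return "healthy"


statuses = [
    classify(now=1000, last_heartbeat=990, last_progress=950),
    classify(now=1000, last_heartbeat=990, last_progress=0),
    classify(now=1000, last_heartbeat=500, last_progress=950),
]
print(statuses)
```

The progress check only makes sense while there is work waiting; an idle robot with an empty queue would otherwise look hung.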

I think the first part helps. Again, I understand, there isn’t much that we can do about the service failure situation but will check if someone in my IT Team knows this.

Circling back to this:

Is there a way to start robots via API assuming the robot service is already running on that server? In that case, I can achieve automatic failover but that still means I have to run some external service to monitor heartbeat like you suggested and then get the second agent up automatically (Although a 100 question marks are popping up in my head as I am typing this).

Will this work? I came across this video https://www.youtube.com/watch?v=12FOzCvwxW4 but I am not fully clear if it will work in our situation (minus the UI, of course).

This won’t help.

All this does is basically replace, say, a queue trigger or a time trigger with a ‘manual trigger’ that is done by an app they designed. I’d argue it’s pretty redundant these days considering the wide range of options we have to trigger automations manually now.

To explain why it won’t work: all this does is basically press ‘start job’ for a process. It doesn’t do anything with licence management etc. Even if you specify a specific machine, it will only run on that machine once that machine gets a licence, and licences don’t move between machines; if server A has the licence, it keeps it until you turn the server off.

As such, in the scenario I gave where the error popup is waiting to be closed, even if you have some third-party thing trying to add redundancy on top of the Orchestrator, its triggering jobs wouldn’t do anything as they’d just be stuck in pending.
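
Put as a toy model in Python (my own simplification, with hypothetical names; it says nothing about how the Orchestrator is implemented):

```python
def job_state(target_machine: str, licence_holder: str) -> str:
    """State of a newly triggered job in this toy model.

    'Start job' only queues work for a machine; it never moves the
    licence, so an unlicensed target leaves the job pending.
    """
    return "running" if target_machine == licence_holder else "pending"


# server1 is frozen behind an error popup but still holds the licence,
# so a job triggered for the backup server goes nowhere.
licence_holder = "server1"
state_on_backup = job_state("server2", licence_holder)
print(state_on_backup)
```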

I strongly feel this is a Server Management issue and that UiPath just needs one machine template, one licence and one machine.
Then a good server admin to set it up so that the service won’t give popup messages that need to be manually closed and you should be set. That and some server monitoring tools that should alert someone should the service on one machine go down so they can reset the server.
