How to make a Computer Vision target dynamic when its descriptor is set as an image and not as text from OCR?

I am trying to automate a web application that runs on a VM without any extensions installed. With this configuration, my only option is to use Computer Vision, but in some cases, where there are non-English characters or the text has a different background, the target is selected as an image and not as text.

My issue is that I want to change the target dynamically. If it reads the text with OCR, I can swap the selected text with a variable and select a different element based on that variable. But I am not sure how to do that when the bot is looking for an image.

Is there any workaround for such scenarios, or is it just a limitation of using CV?

If you want to recognise multiple similar buttons, it’s very helpful to have one of the extensions installed. Building selectors allows you to be very dynamic about what the button should contain, but with OCR you either recognise it or you don’t.
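
For the image case, one possible workaround is to keep one reference image per target and pick it with the same variable you would otherwise put into the OCR text. The sketch below shows only that idea in plain Python, outside of any automation tool: the names `TEMPLATES`, `find_template` and `locate` are hypothetical, the "images" are tiny grayscale grids, and the matching is a naive pixel comparison. It is not how the CV activities work internally, and whether a given activity accepts its image descriptor from a variable would need to be checked separately.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels, row-major (assumption for this sketch)


def find_template(screen: Image, template: Image,
                  max_diff: int = 0) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the best match, or None if it differs too much.

    The score is the summed absolute pixel difference; lower is better.
    """
    th, tw = len(template), len(template[0])
    sh, sw = len(screen), len(screen[0])
    best, best_score = None, None
    # Slide the template over every position where it fits on the screen.
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            score = sum(
                abs(screen[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            if best_score is None or score < best_score:
                best, best_score = (r, c), score
    if best_score is not None and best_score <= max_diff:
        return best
    return None


# Hypothetical lookup table: variable value -> reference image of that target.
TEMPLATES: Dict[str, Image] = {
    "save": [[9, 9], [9, 1]],
    "cancel": [[5, 5], [1, 5]],
}


def locate(screen: Image, target_name: str) -> Optional[Tuple[int, int]]:
    """Pick the reference image by name, then search for it on the screen."""
    return find_template(screen, TEMPLATES[target_name])


# Toy "screenshot" containing both button patterns.
SCREEN: Image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 1, 5, 5],
    [0, 0, 0, 1, 5],
]

if __name__ == "__main__":
    for name in ("save", "cancel"):
        print(name, locate(SCREEN, name))
```

The point is only that the variable selects which image to look for, the same way it would select which text to look for with OCR. A real workflow would load the reference images from files and use a proper matcher with a tolerance instead of an exact comparison.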