How to use natural language and computer vision to control applications in UiPath?

Hi fellow IT automation enthusiasts!

I’m venturing into an exciting area - envisioning a setup where software interaction is not manual but is driven by human intent conveyed through natural language.

If you’ve watched Star Trek, think about how they communicate with the computer - that’s the kind of interaction I’m aiming for!

I came across an interesting project by @RickLamers on GitHub, named Shell AI. It’s a command-line tool that harnesses GPT-3’s prowess to convert natural language into shell commands. It’s geared to make the command-line experience friendly for everyone, from novices to experts.

Building on this concept, there’s a broader movement towards enhancing UI interactions with natural language. For instance, Adept.AI has put forth ideas on integrating chat functionalities directly with graphical user interfaces.

This got me thinking about broadening this approach to general UIs. Imagine guiding your computer to generate a presentation, modify a video, or even initiate a game, all through simple voice commands or text.

Here are a couple of scenarios I’m imagining:

  • Command: “Design a presentation on the latest sales data.” The system understands, pulls data from the right Excel sheet, crafts a PowerPoint presentation, and applies the company’s formatting rules.
  • Command: “Stream the newest episode of The Mandalorian on Disney+.” The computer responds by launching the browser, logging in, navigating, and starting the stream.

To achieve this, we need to combine technologies such as robotic process automation, computer vision, and natural language processing, with the system getting smarter over time through reinforcement learning from human user feedback.

I’ve been familiarizing myself with UiPath’s toolkit, mainly UiPath Studio and AI Fabric. But I’m looking for tips on how best to utilize:

  • Computer Vision (CV)
  • Natural Language Processing (NLP)
  • Natural Language Understanding (NLU)
  • Document Understanding
  • AI Center & AI Fabric
  • UiPath Orchestrator

Some questions that I have:

  • Can we adapt the Document Understanding feature for real-time NLU, especially in translating user desires into bot actions?
  • How can we meld the Computer Vision feature to detect and interact with ever-changing UIs using NLU cues?
  • Are there other UiPath components, RPA tools, or examples that would be beneficial?
  • Is there another tool that I could leverage?



We can translate, but it is not an AI or language model that automatically translates requests into bot actions; we need to define them.
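To make "we need to define them" concrete, here is a minimal sketch in plain Python (not UiPath code) of a hand-written mapping from user commands to bot actions. The function names `build_presentation`, `play_episode`, and `dispatch`, and the patterns themselves, are hypothetical and only illustrate the idea; in a real project each action would presumably start a workflow.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical bot actions. Here they only return a description of
# what would be run; a real bot would invoke a workflow instead.
def build_presentation(topic: str) -> str:
    return f"build_presentation(topic={topic!r})"

def play_episode(title: str, service: str) -> str:
    return f"play_episode(title={title!r}, service={service!r})"

# Every intent is defined by hand: a pattern plus the action it triggers.
# Named groups in the pattern become the action's keyword arguments.
INTENTS: List[Tuple[re.Pattern, Callable[..., str]]] = [
    (re.compile(r"design a presentation on (?P<topic>.+)", re.I),
     build_presentation),
    (re.compile(r"stream the newest episode of (?P<title>.+) on (?P<service>.+)", re.I),
     play_episode),
]

def dispatch(command: str) -> Optional[str]:
    """Return the result of the first matching action, or None if no intent matches."""
    text = command.strip().rstrip(".")
    for pattern, action in INTENTS:
        match = pattern.fullmatch(text)
        if match:
            return action(**match.groupdict())
    return None

if __name__ == "__main__":
    print(dispatch("Design a presentation on the latest sales data."))
    print(dispatch("Stream the newest episode of The Mandalorian on Disney+."))
    print(dispatch("Make me a coffee"))
```

Anything that does not match a defined pattern returns `None`, which is the limitation described above: nothing is inferred automatically, so every supported request has to be listed explicitly.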

This is interesting too, but the elements first need to be identified, and only then can the selectors be amended. You need to find generic CV selectors/descriptors, use them to build the logic, and call them according to the changing CV targets. Also, CV selectors can be obtained only once you indicate an element, not from API interactions.
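One way to picture "generic descriptors plus logic that adapts to a changing UI" is a fallback list: try several pre-defined descriptors in order and use the first one that is found on screen. The sketch below is plain Python with a fake screen, not the UiPath Computer Vision API; `find_first`, `make_fake_locator`, and the descriptor dictionaries are all hypothetical names for illustration.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical descriptor: a plain dict describing how to find an element.
Descriptor = Dict[str, str]

def find_first(descriptors: Iterable[Descriptor],
               locate: Callable[[Descriptor], Optional[str]]) -> Optional[str]:
    """Try each descriptor in order and return the first element that is found."""
    for descriptor in descriptors:
        element = locate(descriptor)
        if element is not None:
            return element
    return None

def make_fake_locator(screen: Dict[str, str]) -> Callable[[Descriptor], Optional[str]]:
    """Stand-in for a real locator: the 'screen' maps visible text to an element id."""
    def locate(descriptor: Descriptor) -> Optional[str]:
        return screen.get(descriptor["text"])
    return locate

# Descriptors are defined up front (as if each had been indicated once),
# ordered from most to least preferred.
LOGIN_BUTTON = [
    {"type": "button", "text": "Log in"},
    {"type": "button", "text": "Sign in"},
]

if __name__ == "__main__":
    # The UI changed its label from "Log in" to "Sign in".
    screen = {"Sign in": "btn-7", "Help": "lnk-2"}
    print(find_first(LOGIN_BUTTON, make_fake_locator(screen)))
```

The descriptors still have to be prepared in advance; the fallback only decides which of the known ones applies to the UI as it currently looks.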

There is something called Communications Mining, which can be leveraged.


Thanks, I’m looking for something like this but for UI components:

Does UiPath have it available or in the product pipeline, or are you planning to partner with NVIDIA to implement some kind of a plugin?

Perhaps “ScreenAI: A visual language model for UI and visually-situated language understanding” (Google Research Blog), along the lines of “How to implement NLP(Natural Language Processing) or Cognitive services in UiPath” (#4 by Bhaskar_Agarwal), will work?