Text Summarization / Classification Machine Learning Algorithm to directly process unstructured data


What is the use case

Most of the time, the process we want to automate begins with human input, e.g. an email, an ITSM incident, or logs from some program execution. In such cases the data is unstructured and cannot be used directly. Although regex and string extraction techniques can be used to fetch information from unstructured data, they work only in the few cases where the pattern is well formed.
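To illustrate the limitation, here is a minimal sketch, assuming the order number always shows up as "Order no <digits>". The pattern, the function name `extract_order_no` and both sample texts are made up for illustration.

```python
import re

# Hypothetical pattern: assumes the order number is always written
# as "Order no <digits>".
ORDER_PATTERN = re.compile(r"Order no (\d+)")

def extract_order_no(text):
    """Return the order number as a string, or None if the pattern does not match."""
    match = ORDER_PATTERN.search(text)
    return match.group(1) if match else None

well_formed = "for Order no 1234 workflow stuck, need to process invoice asap"
free_form = "invoice for my order (#1234, I think) is stuck again, please help"

print(extract_order_no(well_formed))  # 1234
print(extract_order_no(free_form))    # None
```

The second text carries the same information, but a human phrased it differently, so the pattern misses it. Every new phrasing would need another hand-written rule.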

There are ML algorithms that can perform text classification / extraction based on generative or retrieval-based models, e.g. the TextSum model built on TensorFlow, or a generic seq2seq model in TensorFlow.

These ML algorithms can be used to process unstructured data and extract the relevant information. With this, RPA would be able to process unstructured data directly, without any human intervention.

How do you see a solution for the use case?

Existing TextSum or seq2seq models can be trained on a data set. Once trained, the model can be built with Bazel for Windows; that way we would be able to run the resulting executable as a host command. A UiPath activity would take in the unstructured input and pass it to the Bazel-built executable. Once the executable returns the relevant information (maybe in JSON / XML / plain text / a custom format), UiPath can use that info.
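A rough sketch of the host-command idea, written in Python rather than as a UiPath activity. It assumes the model executable reads the text on stdin and prints JSON on stdout; that contract is my assumption, not something TextSum provides out of the box. `run_model` and `FAKE_MODEL` are hypothetical names, and the fake model is just a Python one-liner standing in for the real binary.

```python
import json
import subprocess
import sys

def run_model(command, text, timeout=60):
    """Send unstructured text to a command on stdin and parse its stdout as JSON."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)

# Stand-in for the real model executable: echoes the first 20 characters
# back as a fake "summary". Purely illustrative.
FAKE_MODEL = [
    sys.executable,
    "-c",
    "import sys, json; t = sys.stdin.read(); print(json.dumps({'summary': t[:20]}))",
]

email = "for Order no 1234 workflow stuck, need to process invoice asap"
output = run_model(FAKE_MODEL, email)
print(output["summary"])
```

In the real setup the command list would point at the built executable instead of `FAKE_MODEL`, and the custom activity would do the equivalent of `run_model` before handing the parsed fields to the workflow.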

Bazel would be good enough for POC cases; in the long run we would probably need to use TensorFlow Serving.

I am working on integrating a TextSum model with UiPath via custom activities. Will keep posting the progress.

Scope: ______________

  • Custom Activity
  • Reusable Component
  • Template
  • Application Connector
  • Data Connector
  • Machine learning model


This sounds interesting. Can you share some examples?


Yes,

I am working on summarization and classification problems with it, and training the model. I will put up the details once I have a working model ready!

Thanks for your interest!


Should we be able to use NLP libraries like CoreNLP (already integrated out of the box in UiPath 2018.2), NLTK, spaCy etc. for NER and classification?


Yes, we would be able to do NER with NLP libraries like the ones you mentioned above, but these use pre-trained word vectors, i.e. they are trained on generic data, which is good for solving generic problems. If you are looking to extract words that are not frequent in the data set used to train those NLP libraries, you would not get proper results.

For example, NLTK would be good for texts like “Could you tell me what is the cost for an ice cream?”. It would give good PoS tags and entities, as the words are common. But it would not do a good job on a business email which says "for Order no 1234 workflow stuck , need to process invoice asap".

For this you would need a closed-domain system, not an open-domain one (e.g. NLTK, OpenNLP etc.). To build a closed-domain system you would need to train your own word2vec model on your own custom / case-specific dataset.

As for classification, it would be easier to implement in an ML model rather than relying on full-blown NLP. Most NLP libraries would tokenize and then go on to determine PoS, maybe intents etc. But a CNN model would be much more efficient and accurate at classification (as we can train it on our specific case) than an NLP library (which targets the generic problem).
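To show what "train it on our specific case" looks like without pulling in TensorFlow, here is a small sketch. Note that it is NOT a CNN: it swaps in a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The labels and the four training sentences are invented, and a real model would need far more data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in, not a CNN)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented, case-specific training data (hypothetical labels).
TRAINING_DATA = [
    ("workflow stuck for order 1234 need to process invoice asap", "invoice_issue"),
    ("invoice not generated for order 5678 please process", "invoice_issue"),
    ("cannot log in to the portal password reset needed", "access_issue"),
    ("my account is locked please reset my password", "access_issue"),
]

clf = NaiveBayesClassifier()
clf.train(TRAINING_DATA)
print(clf.predict("order 4321 invoice stuck in workflow"))  # invoice_issue
```

The point is only that the classifier learns from our own labelled emails, so domain words like "invoice" and "workflow" carry weight. A CNN trained on the same kind of data would replace the `NaiveBayesClassifier` class here.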


Can’t these NLP libraries be trained for custom entities? So that we get base model + custom entities…

Can entities be extracted without doing the tokenization, PoS tagging and parsing steps?


You would basically have to implement the whole thing with your own custom dataset: download Stanford NLP, train it with your custom data set (tagger etc.), host it locally and then use it with UiPath.

One would have to tokenize and/or vectorize for entity extraction; NLP libs do the same, albeit internally.
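For reference, the tokenize / vectorize step can be as small as the sketch below. It assumes a simple lowercase alphanumeric tokenizer and a vocabulary that maps each token to an integer id, with `<unk>` for unseen words. The function names and the two-sentence corpus are hypothetical; real libraries do something more elaborate internally.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts):
    """Assign an integer id to every token seen in the corpus; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Map each token to its id, falling back to the <unk> id for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

corpus = ["for Order no 1234 workflow stuck", "need to process invoice asap"]
vocab = build_vocab(corpus)
print(vectorize("process Order no 9999", vocab))  # [9, 2, 3, 0]
```

A model only ever sees these ids (or vectors looked up from them), which is why an order number that never appeared in training collapses to `<unk>`.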