Document Understanding / AI Extractor

I have a requirement where I need to find certain phrases in various documents and, based on which phrase is found, pass the corresponding DocType value to a source system.
For example, if the document contains:

Phrases ----> DocType
Invoice Addition ----> Addition
Invoice Extension ----> Extension
Invoice Extra ----> Extra

I am unsure which extractor I should use in this case. Any help will be appreciated.

@nanmishra

Are you able to extract the phrases?

If yes, then use a Switch to check which phrase was found and set the doctype accordingly.
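
As a rough illustration of that mapping step, here is a minimal sketch in Python rather than a UiPath workflow (in Studio this would be a Switch or a Dictionary variable). The phrases come from the question above; the function name `doctype_for` and the fallback value "Unknown" are assumptions made for the example.

```python
# Phrase -> DocType pairs taken from the question; extend with the real list.
PHRASE_TO_DOCTYPE = {
    "invoice addition": "Addition",
    "invoice extension": "Extension",
    "invoice extra": "Extra",
}


def doctype_for(phrase: str, default: str = "Unknown") -> str:
    """Return the DocType for a phrase, ignoring case and surrounding spaces."""
    return PHRASE_TO_DOCTYPE.get(phrase.strip().lower(), default)


print(doctype_for("Invoice Extension"))
print(doctype_for("  invoice extra "))
print(doctype_for("Purchase Order"))  # not in the table, falls back to the default
```

A lookup table keeps the 25 or 50 pairs in one place, which tends to be easier to maintain than one branch per phrase.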

Cheers

Thanks. No, I have not yet tried to extract the phrases, as I am not able to understand which extractor I should use.

Do I need to use AI Center for this, for example by creating an ML Skill? How do I train it, given that there are about 50 phrases that I need to search for in each document? Something like this:

image

@nanmishra

What are the document types?

We need more details. Do you want to check whether the phrases exist anywhere in the whole text, or only in a specific place?

Cheers

The documents are all investment documents. I could not find this doctype, so I think I would need to create it.

We will receive these documents as pdf in email. From each document I need two things:

  1. Company Name
  2. Search if any of the above phrases like “Term Deposit”/“FATCA” exists anywhere in the PDF.
    If it exists I need to get that value back and map it to the corresponding Doctype.

So for Doc1 if I get a match for “Term Deposit” anywhere in the pdf then it will be of Type =“Term Deposit”.

You are describing Classification, not Extraction. You can use the regex classifier to search the entire document for a single word or phrase.

However, the approach you have described is unlikely to work. You will probably have many different document types with words such as “Termination”, “IRS”, “FATCA”, etc. somewhere in the document: in the terms and conditions, the fine print, and so on.
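
To make that concern concrete, here is a small sketch in Python (not UiPath code) of whole-document keyword matching. The sample text and the helper name `matching_types` are invented for the example; the point is only that one document can match several keywords at once.

```python
import re

# Keywords from the thread, each standing for one candidate document type.
KEYWORDS = ["Term Deposit", "Termination", "IRS", "FATCA"]

# Invented sample: a term deposit form whose fine print mentions other terms.
SAMPLE_TEXT = """
Term Deposit Application
Company Name: Example Pty Ltd
Fine print: early termination fees may apply. Information may be
reported to the IRS under FATCA.
"""


def matching_types(text: str, keywords: list[str]) -> list[str]:
    """Return every keyword found anywhere in the text as a whole word or phrase."""
    found = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


print(matching_types(SAMPLE_TEXT, KEYWORDS))
```

In this sample all four keywords match a single document, so a first-match rule would depend on keyword order rather than on the actual document type.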

If your documents are structured or semi-structured, you should use the Forms Extractor. If your documents really are unstructured, like a written letter, you would probably have to use more complex and unique phrases for classification.

UiPath Academy has a beginner and intermediate course on document understanding, with videos and sample files.

@nanmishra

You can either use classification, or use Read PDF Text and then search for the required text with Contains; if it is found, decide the doctype accordingly.

Cheers

Thanks @anil_g and @KevinE

I understand that this is more of Classification, however as I have 25 different combinations of Identifiers which lead to 25 different Document Types, I did not want to do that as a Classification because in that case I think I would need to create 25 Document Types in the Taxonomy.

Thanks for both your inputs. I am now designing it this way:

  1. Classifying into a single Document Type, “Investments”, using all of my identifiers as keywords. I am doing this only to separate these documents from invoices, which might also be present.
  2. Then, for “Investments”, I am using a Regex-Extractor to get the keywords in the document. Once I extract the keyword from the document using exportdataset.Tables.Item("Simple Fields").Rows(0).Item("DocumentType"), I will do the mapping in the code.

I just wanted to know whether Read PDF Text is a better choice than Regex-Extractor, as I have not used it before. Also, will Read PDF Text be able to read scanned PDFs?

@nanmishra

Test with Read PDF With OCR. If there is no handwritten text, then yes, regex should do the job.

cheers

For your point 2: you are still confusing Classify with Extract. Regex-Extractor does not let you “get” or “extract” keywords from the document. Regex-Extractor requires you to supply a keyword that will always be in all of your documents; it then extracts some text related to that keyword. For example, a set of documents might always have the following text in the heading of every document:

doc1: “Form type: Termination”
doc2: “Form type: Liquidation”
doc3: “Form type: Foreign Account Tax Compliance Act”

You would use Regex Extractor by programming it to find the keywords "Form type: ", and then it will extract the text following the keywords. That doesn’t seem to match the situation you have described.
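
The behaviour described here, anchoring on a fixed keyword and capturing what follows it, can be sketched with a plain regular expression. This is Python, not the Regex Extractor's actual configuration; the pattern and the function name `extract_form_type` are assumptions for illustration, applied to the three sample headings.

```python
import re
from typing import Optional

# Anchor on the fixed label, then capture the rest of that line.
FORM_TYPE_PATTERN = re.compile(r"Form type:\s*(.+)")


def extract_form_type(text: str) -> Optional[str]:
    """Return the text following 'Form type:', or None if the label is absent."""
    match = FORM_TYPE_PATTERN.search(text)
    return match.group(1).strip() if match else None


for heading in (
    "Form type: Termination",
    "Form type: Liquidation",
    "Form type: Foreign Account Tax Compliance Act",
    "A letter with no form type heading",
):
    print(extract_form_type(heading))
```

The last sample returns None because the anchor label is missing, which is the situation the original poster is in: there is no fixed label to anchor on.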

To repeat - what you are describing is Classification, and you would do it with a Classify activity. You cannot just replace a Classify activity with an Extract activity because it looks easier.

For your last question: Read PDF is not the same type of activity as Regex Extractor, so you would not choose between one or the other.

Regex Extractor only works in a Document Understanding workflow. In Document Understanding, you already have a Digitize activity that reads your PDF file, so there would never be any reason to use Read PDF.

Read PDF is something you would use if you are not using Document Understanding. And in that case, you also could not use Regex Extractor. In that case you might use something like Find Matching Patterns or Is Text Matching instead of Regex Extractor.

Thanks @KevinE. I used Classification to solve it.
