Document Understanding - Form Extractor: multiple forms of the same type in one document


I am extracting data from forms using the Form Extractor, with templates that pull the details from the defined forms. A new requirement gives me multiple forms of the same type (the type the template is defined for) in the same document. For example, Name, DOB, Phone number, etc. in Form 1, Form 2, and Form 3 are different, and I am supposed to extract those details separately.

I feel this is not possible with the current Document Understanding process, as the best match among those forms will be picked and only its details extracted. Is there any other option to read multiple instances of the same form from a document? If so, how will those documents be handled in Human in the Loop? I need advice soon so that I can commit to the requirement.

@Lahiru.Fernando @Andra_Buica @Ioana_Gligan

Found a similar post: Position Based Extractor - Using multiple templates of the same document type

Hello @Pradeep.Robot ,

From a conceptual point of view, wouldn’t it be possible for you to split the multi-page document into single pages before you run each page through the extractor? In that case there would be only one instance of Name, DOB, Phone number, etc. in each form.


That’s highly doubtful too, as we are not sure which pages the forms are on. They may appear anywhere among many pages, for example on the 5th or the 40th.

You may have to do some non-DU type work to ascertain which of the 5 or 40 pages qualify as inputs to the DU pipeline.

There must be something on these pages that you can identify after you digitize each page and generate the DOM, to figure out if they contain one or more of the fields you want to extract data from. In fact, drilling down into the DOM can give you a lot of information about a given page, including the positions of controls.

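Here is an example of how I use the extraction results to get the confidence values of the fields that were read with a confidence above 0 and up to 0.7. Fields with no extracted value are skipped so that Values(0) does not fail.

out_ExtractionResults.ResultsDocument.Fields.Where(Function(f) f.Values.Any()).Select(Function(f) f.Values(0).Confidence).Where(Function(c) c > 0 AndAlso c <= 0.7).ToList()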

You might be able to query the DOM similarly to see if your fields exist inside of it.

In fact, it may be possible even before the digitization phase: if the page text can be extracted, you can run a Regex on it to check for the fields, identify the relevant pages, and pass only those into the digitizer of the DU flow.
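A rough sketch of that pre-screening idea, written in Python as a language-neutral illustration rather than as UiPath code. It assumes the text of each page is already available as a list of strings. The label patterns, the `pages_with_form` helper, and the `min_hits` threshold are all hypothetical and would need tuning against real documents.

```python
import re

# Hypothetical label patterns for the fields on the form.
FIELD_PATTERNS = [
    re.compile(r"\bname\b", re.IGNORECASE),
    re.compile(r"\b(dob|date\s+of\s+birth)\b", re.IGNORECASE),
    re.compile(r"\bphone\b", re.IGNORECASE),
]


def pages_with_form(page_texts, min_hits=2):
    """Return 1-based page numbers whose text matches at least min_hits patterns."""
    hits = []
    for number, text in enumerate(page_texts, start=1):
        matched = sum(1 for pattern in FIELD_PATTERNS if pattern.search(text))
        if matched >= min_hits:
            hits.append(number)
    return hits


pages = [
    "Terms and conditions of the agreement",
    "Name: Jane Doe   DOB: 01/02/1990   Phone: 555-0100",
    "Appendix with the name of the vendor",
    "NAME  John Roe  Date of Birth 03/04/1985  Phone number 555-0101",
]
print(pages_with_form(pages))
```

Requiring more than one label per page is a simple guard against pages that merely mention a word such as "name" in running text.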

Thanks for your response @AndyMenon. There are a few risk factors in our process. The main one is that the PDF is not a readable (text) PDF, so regex extractors are not an option. It is a scanned PDF, and the digitized text contains many stray special characters, which makes regex matching error-prone. We have ruled out regex extractors for these documents.

Secondly, we involve Action Center for certain documents. Because human intervention is per transaction and per document, we cannot easily iterate over a list of split documents in this process. From what I have read on many forums, when the same form is repeated within one document, the bot cannot pick the right form instance to take the related fields from. It finds the best match in the document, and mostly takes the values from the first few pages.

Even if we identify the relevant pages and loop through them, how can this transaction be handled within Action Center as a single document?

But I would like to know whether there is any feature in Document Understanding to handle multiple forms with the same template in one document. If yes, how is Action Center involved for that kind of document?

I’m going to give this a shot with my perspective because I would look at it this way:

A document (or the information in a document) is a complex datatype and forms one attribute of your transaction. Therefore Document Understanding by itself may not contain everything you need to handle transaction-based processing. You may have to supplement the DU + Action Center capabilities by adding a mechanism to track the transaction.

For example: Processing an invoice record can be considered a transaction which contains the following steps:

  1. Get the Invoice Header record containing the Invoice Number
  2. Get the Invoice Item records with that Invoice Number
  3. Each Invoice may have one or more Invoice items

When you process a single invoice record, these 3 steps become part of a transactional Unit of Work. Once started, they must all complete successfully, and if any of them fails, the whole unit must be rolled back. Otherwise the Invoice items can be orphaned without their invoice header information.

Therefore, you track all the 3 steps above with some kind of Unit of Work (UOW) number. Each UOW record that you are tracking in the system ends with either a Success or a Failure (rollback).
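To make the invoice analogy concrete, here is a minimal sketch in Python using the standard `sqlite3` module and an in-memory database. The table names, column names, and the `save_invoice` helper are made up for illustration. The point is only that the header and its items are written inside one transaction, so a failure on any item rolls back the header as well.

```python
import sqlite3


def save_invoice(conn, invoice_number, items):
    """Insert header and items as one unit of work; roll back everything on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO invoice_header (invoice_number) VALUES (?)",
                (invoice_number,),
            )
            for description, amount in items:
                conn.execute(
                    "INSERT INTO invoice_item (invoice_number, description, amount) "
                    "VALUES (?, ?, ?)",
                    (invoice_number, description, amount),
                )
        return "Success"
    except sqlite3.Error:
        return "Failure"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice_header (invoice_number TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE invoice_item "
    "(invoice_number TEXT, description TEXT, amount REAL NOT NULL)"
)
conn.commit()

print(save_invoice(conn, "INV-1", [("Widget", 10.0), ("Gadget", 5.5)]))
# The missing amount violates NOT NULL, so the INV-2 header is rolled back too.
print(save_invoice(conn, "INV-2", [("Widget", 10.0), ("Broken", None)]))
print(conn.execute("SELECT invoice_number FROM invoice_header").fetchall())
```

After the second call, no header or item for INV-2 remains, which is the "no orphaned items" guarantee described above.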

Now, you can extend this analogy to your DU Process and that may translate to roughly the following steps:

  1. Assign the incoming document with a unique UOW Value - this can be a number, a GUID or a datetime stamp that is unique
  2. Split the document and each page in turn is associated with that UOW ID
  3. Determine if each page needs to go to the action center based on your risk factors
  4. If yes, post the page to Action Center, and when it has been processed by the human supervisor and returns to the flow, it will still carry the UOW ID with it
  5. Therefore even if the document is split or single, or even if it goes through action center or the DU Flow straight, the UOW flows with it.
  6. In the end, you would have extracted the information from the page either with or without human intervention and that information for that page+the parent document+the UOW ID can be recorded as part of the unique transaction ID.
  7. That is because UOW ID is at the document level and therefore all pages in that document will get the same UOW ID after you split it. But combining it with the page number will make that record unique.
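The steps above can be sketched roughly as follows, again in Python as a language-neutral illustration and not as UiPath workflow code. Everything here is an assumption made for the example: the `PageRecord` type, the 0.7 threshold, and the `review` callable, which stands in for the Action Center round trip. The per-page extraction results are passed in ready-made rather than produced by a real extractor.

```python
import uuid
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off for sending a page to review


@dataclass
class PageRecord:
    uow_id: str
    page_number: int
    fields: dict
    confidence: float
    reviewed_by_human: bool = False

    @property
    def key(self):
        # The UOW ID is shared by all pages; adding the page number makes it unique.
        return (self.uow_id, self.page_number)


def process_document(page_results, review):
    """page_results: list of (fields, confidence) per page, in page order.
    review: callable standing in for the human validation round trip."""
    uow_id = str(uuid.uuid4())  # step 1: one ID per incoming document
    records = {}
    for page_number, (fields, confidence) in enumerate(page_results, start=1):
        record = PageRecord(uow_id, page_number, fields, confidence)  # step 2
        if confidence < CONFIDENCE_THRESHOLD:  # step 3
            # Step 4: the UOW ID and page number travel with the review task.
            record.fields = review(record.uow_id, record.page_number, record.fields)
            record.reviewed_by_human = True
        records[record.key] = record  # steps 6 and 7
    return uow_id, records


def fake_review(uow_id, page_number, fields):
    """Pretend a human tidied up the Name field."""
    corrected = dict(fields)
    corrected["Name"] = corrected["Name"].strip().title()
    return corrected


uow_id, records = process_document(
    [({"Name": "Jane Doe"}, 0.95), ({"Name": " john roe "}, 0.40)],
    fake_review,
)
for (uow, page), record in records.items():
    print(uow, page, record.fields, record.reviewed_by_human)
```

Keying the results by UOW ID plus page number means the pages can be reviewed and returned in any order and still be reassembled under their parent document.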

What is the challenge factor?
You have to find a way to integrate the UOW ID into your RPA flow and make sure it flows in and out of Action Center correctly.

I hope this concept helps flesh out the solution.