Extract specific data from scanned PDFs with different fonts and layouts

Can the UiPath community confirm whether I am on the right path for this project? I want to extract a policy number from several different scanned PDFs. The PDFs are not tagged. Each PDF has a different font, color, and text size, and the policy number changes location.

I tried relative scraping, but it is image-based, so it works for the first PDF and not the others: the font changes, so it cannot find an exact image match for the anchor.

I think the only two solutions to this are:

  1. Create a robot for each type of PDF, where all PDFs of a type share the same font, color, and text size and the policy number is always in the same place.
  2. Extract the text from the PDF with OCR, store it in a data table, Word document, or similar, then extract the information from it using string manipulation.

Is this correct, or is there a different method?

Policy 1
Policy 2
Policy 3

@ricardoxh

Hi,

As of now, UiPath has no capability to extract information from scanned PDFs in which the fonts, sizes, and locations of the attributes vary.

However, the recent Document Understanding release adds these capabilities, though they are limited to invoice and receipt attributes.

To answer your first question: yes, you have to create a different robot, or you can use a switch expression based on the file name or any other attribute of the file and control the flow of execution based on it. For example:

PDF A ; PDF B ; PDF C

These will flow through the switch based on name, so the cases would be Case A, Case B, and Case C.
Based on the case, the bot will try to extract the information.
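
The switch idea above can be sketched outside UiPath in a few lines of Python. This is only an illustration of the control flow, not a UiPath workflow: the file naming scheme (`PDF_A_0001.pdf`), the `pick_case` helper, and the three extractor functions are all hypothetical.

```python
from pathlib import Path

# Hypothetical extractors, one per PDF layout. Real ones would contain
# the layout-specific extraction logic; these only tag the input.
def extract_type_a(text: str) -> str:
    return "A:" + text

def extract_type_b(text: str) -> str:
    return "B:" + text

def extract_type_c(text: str) -> str:
    return "C:" + text

# The "switch": one entry per case.
EXTRACTORS = {"A": extract_type_a, "B": extract_type_b, "C": extract_type_c}

def pick_case(file_name: str) -> str:
    """Return the case key from a name such as 'PDF_A_0001.pdf' (assumed scheme)."""
    parts = Path(file_name).stem.split("_")
    if len(parts) < 2 or parts[1] not in EXTRACTORS:
        raise ValueError(f"no case for file {file_name!r}")
    return parts[1]

def process(file_name: str, text: str) -> str:
    """Route the OCR text to the extractor chosen by the file name."""
    return EXTRACTORS[pick_case(file_name)](text)
```

A file that matches no case raises an error rather than being processed by the wrong extractor, which is usually the safer default for this kind of routing.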

On your second question: string manipulation is also possible, but there would be many challenges. For example:

Assume one PDF has the policy number like this: Policy : 12345678Test
Another might have
Policy :
12345678Test

or

Policy Effective
12345678Test 8/11/2020

I hope this helps you understand it.

Hi @ricardoxh

A single regex can also be useful for extracting the data you want.
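
As a rough sketch of what such a regex might look like, the Python snippet below handles the three layouts quoted in the previous reply. It rests on two assumptions that are not confirmed by the thread: the policy number is a run of digits optionally followed by letters, and it is the first such token after the word "Policy". The names `POLICY_RE` and `find_policy_number` are made up for the example.

```python
import re

# Assumption: the policy number is digits followed by optional letters,
# and it is the first digit run that appears after the word "Policy".
# \D*? lazily skips any non-digit text in between (spaces, ':',
# newlines, or a neighbouring column header such as "Effective").
POLICY_RE = re.compile(r"Policy\b\D*?(\d+[A-Za-z]*)", re.IGNORECASE)

def find_policy_number(ocr_text: str):
    """Return the policy number found in the OCR text, or None."""
    match = POLICY_RE.search(ocr_text)
    return match.group(1) if match else None

samples = [
    "Policy : 12345678Test",
    "Policy :\n12345678Test",
    "Policy Effective\n12345678Test 8/11/2020",
]
results = [find_policy_number(s) for s in samples]
```

Real OCR output is noisier than these samples (misread characters, reordered columns), so a pattern like this would likely need tuning against actual documents.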

Happy Automation :raised_hands:

Best Regards
Er Pratik Wavhal :robot::man_technologist:t4: :computer:

Is there a way to classify the different PDFs into several types and then use a specific extractor for each type? This is basically the multiple-robots idea, but without actually having one robot per PDF type.

Hi,
This can be handled with Document Understanding.
Document Understanding combines three types of extractor: regex, form, and ML.

Only the ML extractor is limited to invoices and receipts,
but we can still do this with the regex-based extractor.

And for the different PDF formats, we can use the classifiers from the Document Understanding package itself.
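
To illustrate the classify-then-extract idea in general terms, here is a minimal Python sketch. It is not the Document Understanding API: the document types, keywords, and per-type patterns in `DOC_TYPES` are invented for the example, and a real setup would use the package's own classifier and extractor activities.

```python
import re

# Hypothetical document types. Each has keywords that identify the
# type and a regex tailored to where that layout puts the policy number.
DOC_TYPES = {
    "insurer_a": {
        "keywords": ("acme insurance",),
        "pattern": re.compile(r"Policy\s*:\s*(\w+)"),
    },
    "insurer_b": {
        "keywords": ("globex",),
        "pattern": re.compile(r"Policy Effective\s+(\w+)"),
    },
}

def classify(text: str):
    """Return the first document type whose keyword occurs in the text."""
    lowered = text.lower()
    for doc_type, spec in DOC_TYPES.items():
        if any(keyword in lowered for keyword in spec["keywords"]):
            return doc_type
    return None

def extract_policy(text: str):
    """Classify the OCR text, then apply that type's own pattern."""
    doc_type = classify(text)
    if doc_type is None:
        return None
    match = DOC_TYPES[doc_type]["pattern"].search(text)
    return match.group(1) if match else None
```

This keeps a single robot while still giving each PDF type its own extraction rule: adding a new type means adding one entry to the table.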

Kindly check these out.

Cheers @ricardoxh
