Extract specific date from different font dynamic scanned PDF

ricardoxh · July 11, 2020, 3:45pm

Can the UiPath community confirm whether I am on the right path for this project? I want to extract a policy number from several different scanned PDFs; the PDFs are not tagged. Each PDF has a different font, color, and text size, and the policy number changes location.

I tried relative scraping, but it is image-based, so it works for the first PDF and not for the others: the font changes, so it cannot find the exact same image of the anchor.

I think the only two solutions to this are:

1. Create a robot for each type of PDF, where all PDFs of that type have the same font, color, and text size and the policy number is always in the same place.
2. Extract the text from the PDF with OCR, have the robot write it to a data table, Word document, or similar, and then extract the information from that output using string manipulation.

Is this correct, or is there a different method?

Policy 1
Policy 2

Lakshay_Verma · July 11, 2020, 4:07pm

@ricardoxh

Hi,

As of now, UiPath has no built-in capability to extract information from scanned PDFs whose fonts, sizes, and attribute locations differ.

However, the recent Document Understanding release adds such capabilities, but they are limited to invoice and receipt attributes.

To answer your first question: yes, you have to create a different robot, or you can use a switch expression based on the file name or any other attribute of the file and control the flow of execution with it. For example:

PDF A ; PDF B ; PDF C

These will flow through a switch based on the name, so the cases would be Case A, Case B, Case C.
Based on the case, the bot will try to extract the information.
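
To make the switch idea concrete, here is a minimal sketch in plain Python (not a UiPath workflow). The file naming convention ("PDF A.pdf", "PDF B.pdf"), the two layouts, and the function names are all assumptions made up for illustration; in Studio the same structure would be a Switch activity with one branch per case.

```python
from pathlib import Path

# Hypothetical per-layout extractors: each one holds the logic for one PDF type.
def extract_from_a(text: str) -> str:
    # Assumed layout A: "Policy : <number>" on a single line.
    return text.split("Policy :", 1)[1].split()[0]

def extract_from_b(text: str) -> str:
    # Assumed layout B: the number sits on the line after "Policy :".
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[lines.index("Policy :") + 1]

# The "switch": one entry per case.
EXTRACTORS = {"A": extract_from_a, "B": extract_from_b}

def extract_policy(file_name: str, text: str) -> str:
    # Assumed naming convention: the case is the last word of the file name stem.
    case = Path(file_name).stem.split()[-1].upper()
    if case not in EXTRACTORS:
        raise ValueError(f"No extractor registered for case {case!r}")
    return EXTRACTORS[case](text)
```

A new PDF type then only needs one more extractor function and one more entry in `EXTRACTORS`, rather than a whole new robot.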

On your second question: string manipulation is also possible, but there would be many challenges, for example:

Assume one PDF has the policy number like this: Policy : 12345678Test
Another might have
Policy :
12345678Test

or

Policy Effective
12345678Test 8/11/2020

Hope this will help you understand it.
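
A small sketch of why plain string manipulation is fragile here, written in Python only for illustration. The three sample strings are the layouts from this post as OCR might return them, and `naive_extract` is a hypothetical helper that only looks for text after "Policy :" on the same line.

```python
from typing import Optional

# The three layouts described in this post, as OCR text (assumed line breaks).
SAMPLES = {
    "same line": "Policy : 12345678Test",
    "next line": "Policy :\n12345678Test",
    "table": "Policy Effective\n12345678Test 8/11/2020",
}

def naive_extract(text: str) -> Optional[str]:
    """Return whatever follows 'Policy :' on the same line, or None."""
    for line in text.splitlines():
        if "Policy :" in line:
            value = line.split("Policy :", 1)[1].strip()
            return value or None
    return None

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, "->", naive_extract(text))
```

Only the first layout yields a value; the other two return `None`, so each layout would need its own handling.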

Pratik_Wavhal · July 11, 2020, 5:46pm

Hi @ricardoxh

A single regex can also be used to extract the data you want.
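
As a rough sketch of what such a regex could look like, here is one pattern tried against the three layouts mentioned earlier in the thread. It is shown in Python for illustration; the pattern itself is an assumption (it expects the policy number to start with a digit and contain only letters and digits), and real OCR output may need a looser or stricter version.

```python
import re
from typing import Optional

# Assumptions: the label is "Policy", optionally followed by an "Effective"
# column header and/or a colon; the number starts with a digit.
POLICY_RE = re.compile(r"Policy\b(?:\s+Effective)?\s*:?\s*(\d\w*)")

def find_policy_number(ocr_text: str) -> Optional[str]:
    match = POLICY_RE.search(ocr_text)
    return match.group(1) if match else None

if __name__ == "__main__":
    layouts = [
        "Policy : 12345678Test",
        "Policy :\n12345678Test",
        "Policy Effective\n12345678Test 8/11/2020",
    ]
    for text in layouts:
        print(repr(text), "->", find_policy_number(text))
```

Because `\s` also matches line breaks, the same pattern covers the case where the value is on the line below the label.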

Happy Automation

Best Regards
Er Pratik Wavhal

tudor.serban · July 12, 2020, 1:56pm

Is there a way to classify the different PDFs into several types and then use a specific extractor for each type? This is basically the multiple-robots idea, but without actually having one robot per PDF type.

Palaniyappan · July 12, 2020, 3:44pm

Hi
This can be handled with Document Understanding.
Yes, Document Understanding combines three different types of extractor: regex, form, and ML extractors.

Only the ML extractor is limited to invoices and receipts.
But we can still do this with the regex-based extractor.

And for the different PDF formats, we can use the classifiers from the Document Understanding package itself.

Kindly check with these
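
To illustrate the classify-then-extract idea, here is a minimal sketch in plain Python. It does not use the Document Understanding activities or their API; the document type names, keywords, and patterns are all invented for the example, and the classifier is just a keyword count over the OCR text.

```python
import re
from typing import Optional, Tuple

# Hypothetical document types: keywords that identify each type and the
# regex used to extract the policy number from that type.
DOC_TYPES = {
    "insurer_x": {
        "keywords": ["acme insurance"],
        "pattern": re.compile(r"Policy\s*:\s*(\d\w*)"),
    },
    "insurer_y": {
        "keywords": ["globex", "effective"],
        "pattern": re.compile(r"Policy\s+Effective\s+(\d\w*)"),
    },
}

def classify(ocr_text: str) -> Optional[str]:
    """Pick the type whose keywords appear most often; None if nothing matches."""
    lowered = ocr_text.lower()
    best, best_hits = None, 0
    for name, spec in DOC_TYPES.items():
        hits = sum(kw in lowered for kw in spec["keywords"])
        if hits > best_hits:
            best, best_hits = name, hits
    return best

def extract_policy_number(ocr_text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (document type, policy number); either may be None."""
    doc_type = classify(ocr_text)
    if doc_type is None:
        return None, None
    match = DOC_TYPES[doc_type]["pattern"].search(ocr_text)
    return doc_type, (match.group(1) if match else None)
```

This keeps one workflow for all PDFs: the classifier decides the type, and only the pattern changes per type.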

Cheers @ricardoxh
