Efficiently extracting specific fields from scanned PDF

amodsinghal · October 6, 2020, 5:29am

I am a bit unclear about how to efficiently extract data from scanned PDF files. I have multipage brokerage account statements that are scanned PDFs. Each page is numbered “Page X of Y” (e.g., Page 1 of 4, Page 2 of 4, etc.) at the bottom right. At the top left of each page there is an account number, e.g., “Account Number 1111-22222”. I want to extract the account number and the “Page X of Y” text from each page.
So, for this 4-page example document, I expect to get the following extracted data:
“Account Number 1111-22222” “Page 1 of 4”
“Account Number 1111-22222” “Page 2 of 4”
“Account Number 1111-22222” “Page 3 of 4”
“Account Number 1111-22222” “Page 4 of 4”

How can I extract these fields efficiently? Should I OCR the entire document using ReadPdfWithOcr then parse the result to extract my data? How can I best extract only the fields I care about? Is screen scraping an appropriate choice? Thanks in advance.

Pradeep_Shiv · October 6, 2020, 5:40am

Hello @amodsinghal,

Good day!

You can use the ReadPdfWithOCR activity and store its output in a String variable, say strOut. Then, to extract the account number and “Page X of Y”, you can use the Matches activity.

Use strOut as the input to the Matches activity and pass the patterns below:

To get the account number: (?<=Account Number).*
To get the page details: (Page\s\d+\sof\s\d+)
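
For illustration, the same matching can be sketched outside UiPath in Python. This is only a sketch under assumptions: the account number is assumed to look like digits-digits, the OCR text is assumed to arrive as one string per page, and the sample strings are made up. A capture group is used instead of the lookbehind so that the leading space and the rest of the line are not included in the match.

```python
import re

# Tighter variants of the patterns above.
# Assumption: the account number has the form digits-digits, e.g. 1111-22222.
ACCOUNT_RE = re.compile(r"Account Number\s+(\d+-\d+)")
PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def extract_fields(page_text):
    """Return (account_number, page_label) for one page of OCR text.

    A field that is not found is returned as None.
    """
    account = ACCOUNT_RE.search(page_text)
    page = PAGE_RE.search(page_text)
    return (
        account.group(1) if account else None,
        f"Page {page.group(1)} of {page.group(2)}" if page else None,
    )


# Hypothetical OCR output, one string per page.
pages = [
    "Account Number 1111-22222\nHoldings summary\nPage 1 of 4",
    "Account Number 1111-22222\nTransactions\nPage 2 of 4",
]
results = [extract_fields(text) for text in pages]
```

Running the match per page rather than on the whole document keeps each account number paired with its own page label.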

Cheers

amodsinghal · October 6, 2020, 2:54pm

That’s what I am doing currently. The problem is that the OCR process takes too long. I suspect this is because the OCR first has to read everything on each page. If the OCR were instead to read only the data fields I need, I think it might run faster. Any thoughts?
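
To make the idea concrete, here is a rough sketch in Python (not UiPath) of what region-limited OCR could look like: compute a small box at the top left and one at the bottom right of each page image, and hand only those boxes to the OCR engine. Everything here is an assumption for illustration: the box fractions, the function names, and the `ocr_region` callback, which stands in for whatever crop-and-OCR call is available. Whether this is actually faster depends on the OCR engine and on how the pages are rendered.

```python
def field_regions(width, height, box_w=0.4, box_h=0.08):
    """Return pixel boxes (left, top, right, bottom) for the two fields.

    Assumption: the account number sits in the top-left corner and the
    page label in the bottom-right corner; the fractions are guesses
    that would need tuning against real statements.
    """
    w = round(width * box_w)
    h = round(height * box_h)
    return {
        "account_number": (0, 0, w, h),
        "page_label": (width - w, height - h, width, height),
    }


def read_fields(width, height, ocr_region):
    """OCR only the two regions of interest on one page.

    ocr_region is a hypothetical callback: it takes a box and returns
    the recognised text for that part of the page image.
    """
    regions = field_regions(width, height)
    return {name: ocr_region(box).strip() for name, box in regions.items()}
```

The callback keeps the sketch independent of any particular OCR library; the cropped text could then be checked with the same patterns used on the full-page text.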

NIVED_NAMBIAR · October 6, 2020, 3:28pm

Hi @amodsinghal, try the Document Understanding feature in UiPath.

amodsinghal · October 7, 2020, 6:14am

I am looking into Document Understanding. However, it appears that even here it first OCRs the entire document, after which I can apply different methods to extract the data of interest. Since the OCR part is what takes time (for example, OCR’ing a 10-page document will take about 10 times longer than a 1-page document), I am not sure I understand how this will be any faster. What am I missing? Thanks.
