Read Data from scanned PDF

athiram · January 26, 2022, 5:41am

Hi,

I am trying to extract data from some scanned PDFs using Tesseract OCR. For a single PDF I can extract all the data correctly, but when I run the same workflow on another PDF, it does not return the correct details. The PDF structure is the same, but the font size and alignment differ because of scanning. Is there any way to extract data from these kinds of documents with the built-in OCR?

Srini84 · January 26, 2022, 6:33am

@athiram

With the built-in OCR you can’t guarantee the output quality. It depends on parameters such as the file’s image quality and image DPI.

So I suggest integrating with a paid OCR engine, from which you can expect better-quality output.

You can easily integrate with ABBYY OCR, for example.

Hope this may help you

Thanks

athiram · January 26, 2022, 6:58am

Hi @Srini84 ,

Thanks for the reply!
But with ABBYY OCR, will it be possible to get text from a scanned PDF using anchors, the way we do from a normal PDF? Also, my PDF is a mix of Arabic and English.

kirankumar.mahanthi1 · January 26, 2022, 7:05am

Hi @athiram ,

We can do a lot more with ABBYY FlexiCapture. It can extract data from scanned PDFs using anchors and repeating groups, and it can also extract Arabic and other languages.

I think they have now introduced free courses at the ABBYY academy.

As mentioned by @Srini84, we can easily integrate ABBYY FlexiCapture with UiPath using built-in activities.

Please note that ABBYY FlexiCapture is a paid tool; they charge for the pages we extract data from. Thanks.

Srini84 · January 26, 2022, 7:08am

@athiram

ABBYY FlexiCapture is the product where you can train templates, and it also handles other languages successfully, as @kirankumar.mahanthi1 mentioned.

Also, you can’t get a trial with your personal ID; you have to email the ABBYY sales team to get trial access.

Hope this will help you

Thanks

athiram · January 26, 2022, 2:26pm

My issue with the PDF was not the clarity/quality of the file but the spacing between lines, as it is a Word document converted to PDF.
For this I created a workaround.

1. Using the Find Image activity, find the clipping region of the image associated with the label text. For example, to capture the invoice number from an invoice, find the clipping region for the text “Invoice number”.
2. Find the clipping region of the actual invoice number (e.g. INV12345).
3. Find the difference between the two clipping regions (X, Y, Width, Height), and compute the new clipping region using that offset.
4. Use the Set Clipping Region activity and pass the new clipping region as the input parameter.
5. Use the Get OCR Text activity without a clipping region.
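
The offset arithmetic in the workaround above can be sketched in a few lines. This is only an illustration in Python, not UiPath code: the `Region` class, the helper names `compute_offset` and `apply_offset`, and all pixel values are made up for the example. It assumes the offset between the label and the value is measured once on a reference document and then applied to wherever the label is found in a new document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """A clipping region in pixels: top-left corner plus size (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def compute_offset(label: Region, value: Region) -> Tuple[int, int, int, int]:
    """Difference between the value region and its label region (X, Y, Width, Height)."""
    return (
        value.x - label.x,
        value.y - label.y,
        value.width - label.width,
        value.height - label.height,
    )


def apply_offset(label: Region, offset: Tuple[int, int, int, int]) -> Region:
    """Build the new clipping region from a label region and a stored offset."""
    dx, dy, dw, dh = offset
    return Region(label.x + dx, label.y + dy, label.width + dw, label.height + dh)


# Reference document: where "Invoice number" and "INV12345" were found (made-up values).
ref_label = Region(x=40, y=120, width=110, height=18)
ref_value = Region(x=170, y=118, width=90, height=22)
offset = compute_offset(ref_label, ref_value)

# New document: line spacing differs, so the label sits 35 pixels lower.
new_label = Region(x=42, y=155, width=110, height=18)
new_value = apply_offset(new_label, offset)
print(new_value)
```

The resulting region is what would be passed to Set Clipping Region before running OCR. Because the value region is always derived from the label's position in the current document, a change in line spacing moves both together.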
