Read Data from scanned PDF

Hi,

I am trying to extract data from some scanned PDFs using Tesseract OCR. For a single PDF I am able to extract all the data correctly, but when I run the same workflow with another PDF, it does not get the correct details. The PDF structure is the same, but the font size and alignment vary because of scanning. Is there any way to extract data from these kinds of documents with the inbuilt OCR?

@athiram

With the inbuilt OCR you can’t guarantee the output quality. It depends on parameters such as the file’s image quality and image DPI.

So I suggest integrating with a paid OCR engine, from which you can expect better-quality output.

You can easily integrate with Abbyy OCR, for example.

Hope this may help you

Thanks

Hi @Srini84 ,

Thanks for the reply!
But using Abbyy OCR, will it be possible to get text from a scanned PDF using anchors, the way we get text from a normal PDF? Also, my PDF is a mix of Arabic and English.

Hi @athiram ,

We can do a lot more with Abbyy FlexiCapture. We can extract data from scanned PDFs with anchors and repeating groups. We can even extract text in Arabic and other languages.

I think they have now introduced free courses from the Abbyy academy.

As mentioned by @Srini84, we can easily integrate Abbyy FlexiCapture with UiPath using the inbuilt activities.

Please note that Abbyy FlexiCapture is a paid tool; they charge for the pages we extract data from. Thanks.

@athiram

Abbyy FlexiCapture is the product where you can train templates, and it also handles other languages successfully, as @kirankumar.mahanthi1 mentioned.

Also, you can’t get a trial with your personal ID; you have to send an email to the Abbyy sales team to get trial access.

Hope this will help you

Thanks

My issue with the PDF was not the clarity/quality of the file but the spacing between lines, as it is a Word document converted to PDF.
For this I created a workaround.

  • Using the Find Image activity, find the clipping region of the image that is associated with the text. For example, if we are trying to capture the invoice number from an invoice, we find the clipping region for the text “Invoice number”.

  • Find the clipping region of the actual invoice number (e.g. INV12345).

  • Find the difference between the two clipping regions (X, Y, Width, Height). Compute the new clipping region using the offset value computed.

  • Use the Set Clipping Region activity and give the new clipping region as the input parameter.

  • Use the Get OCR Text activity without a clipping region.
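
The offset arithmetic in the steps above can be sketched in Python. This is only an illustration of the calculation, not UiPath code: the `Region` type, the `compute_offset` and `apply_offset` helpers, and all the pixel values are hypothetical. It also assumes the offset is measured once on a reference document and then added to wherever the label is found in each new document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular clipping region in pixels (hypothetical type for this sketch)."""
    x: int
    y: int
    width: int
    height: int


def compute_offset(label: Region, value: Region) -> Region:
    """Field-by-field difference between the value region and the label region."""
    return Region(
        value.x - label.x,
        value.y - label.y,
        value.width - label.width,
        value.height - label.height,
    )


def apply_offset(label: Region, offset: Region) -> Region:
    """Shift and resize a newly found label region by a previously computed offset."""
    return Region(
        label.x + offset.x,
        label.y + offset.y,
        label.width + offset.width,
        label.height + offset.height,
    )


# Reference document: where "Invoice number" and "INV12345" were found (made-up values).
reference_label = Region(x=100, y=200, width=120, height=20)
reference_value = Region(x=240, y=200, width=90, height=20)
offset = compute_offset(reference_label, reference_value)

# New document: the same label sits lower on the page because the line spacing differs.
new_label = Region(x=100, y=260, width=120, height=20)
new_value_region = apply_offset(new_label, offset)
print(new_value_region)  # Region(x=240, y=260, width=90, height=20)
```

The resulting region is what would be passed to the Set Clipping Region step, so the OCR reads relative to the label rather than at a fixed page position.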