I am trying to extract data from some scanned PDFs using Tesseract OCR. For a single PDF I am able to extract all the data correctly, but when I run the same workflow with another PDF, it does not get the correct details. The PDF structure is the same, but the font size and alignment change because of scanning. Is there any way to extract data from these kinds of documents with the inbuilt OCR?
Thanks for the reply!
But using ABBYY OCR, will it be possible to get text from a scanned PDF the way we get text from a normal PDF using an anchor? Also, my PDF is a mix of Arabic and English.
We can do a lot more with ABBYY FlexiCapture. We can extract data from scanned PDFs using anchors and repeating groups. We can even extract Arabic and other non-Latin text.
I think they have now introduced free courses through the ABBYY academy.
As mentioned by @Srini84, we can easily integrate ABBYY FlexiCapture with UiPath using the inbuilt activities.
Please note that ABBYY FlexiCapture is a paid tool; they charge for the pages we extract data from. Thanks.
My issue with the PDF was not the clarity/quality of the file but the spacing between lines, as it is a Word document converted to PDF.
For this I created a workaround.
Using the Find Image activity, find the clipping region of the image associated with the label text. For example, if we are trying to capture the invoice number from an invoice, we find the clipping region for the text “Invoice number”.
Find the clipping region of the actual invoice number (e.g. INV12345).
Find the difference between the two clipping regions (X, Y, Width, Height), then compute the new clipping region using that offset.
Use the Set Clipping Region activity and pass the new clipping region as its input parameter.
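To illustrate just the offset arithmetic in the steps above, here is a rough sketch in plain Python. It is not UiPath code: the `Region` class, the function names, and all the coordinates are made up for the example. It assumes the label-to-value offset measured on one reference document stays roughly the same on other documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a clipping region (X, Y, Width, Height)."""
    x: int
    y: int
    width: int
    height: int


def compute_offset(label: Region, value: Region) -> Region:
    """Difference between the value region and the label region."""
    return Region(
        value.x - label.x,
        value.y - label.y,
        value.width - label.width,
        value.height - label.height,
    )


def apply_offset(label: Region, offset: Region) -> Region:
    """New clipping region: the label region shifted and resized by the offset."""
    return Region(
        label.x + offset.x,
        label.y + offset.y,
        label.width + offset.width,
        label.height + offset.height,
    )


# Reference document: regions measured once (example numbers only).
ref_label = Region(50, 100, 120, 20)   # where "Invoice number" was found
ref_value = Region(200, 98, 90, 24)    # where "INV12345" sits
offset = compute_offset(ref_label, ref_value)

# Another document: the label was found lower on the page due to line spacing.
new_label = Region(55, 140, 120, 20)
new_value = apply_offset(new_label, offset)
print(new_value)
```

In the workflow, the computed region would be what goes into Set Clipping Region before running OCR on that area. If the scans differ in scale as well as position, a fixed offset like this will probably not be enough.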