What to use instead of the Microsoft OCR engine

I am trying to extract text from a scanned PDF document.

I know that the previous version had the Microsoft OCR engine for this.

What can I use instead of the Microsoft OCR engine?

The engines I have available are shown in the image below.

image

Hi @mini9301
Try using Tesseract OCR or UiPath Document OCR.

In Tesseract OCR, change the Scale and Profile properties and check whether the extracted output meets your requirement.

Hope it helps!!

Hi @mini9301

You can use Tesseract OCR. It can extract data from scanned PDFs. Select the Scanned option in the Profile dropdown.

If the data is not extracted properly, open the properties, where there is an option called Scale. Try Scale values between 0.1 and 5 until the data is extracted properly.

The image below shows where to change it:

Hope it helps!!
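The Scale tuning described above is a trial-and-error loop: run the OCR at one scale, inspect the text, and move on to the next value if it looks wrong. As a rough illustration outside UiPath Studio, the same idea could be sketched in Python as follows. Everything here is an assumption for demonstration purposes: `find_working_scale`, `run_ocr`, `looks_right`, and `fake_ocr` are hypothetical names, not UiPath or Tesseract APIs, and the stand-in OCR function only simulates an engine that reads correctly from scale 2.0 upward.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_working_scale(
    run_ocr: Callable[[float], str],
    looks_right: Callable[[str], bool],
    scales: Iterable[float] = (1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0),
) -> Optional[Tuple[float, str]]:
    """Try each scale in order and return the first (scale, text) that passes the check.

    run_ocr and looks_right are hypothetical callables supplied by the caller:
    run_ocr(scale) returns the OCR text at that scale, and looks_right(text)
    decides whether the text is good enough.
    """
    for scale in scales:
        text = run_ocr(scale)
        if looks_right(text):
            return scale, text
    return None  # no scale in the list produced acceptable text


def fake_ocr(scale: float) -> str:
    """Stand-in for a real OCR call; simulates garbled output at low scales."""
    return "Invoice No 12345" if scale >= 2.0 else "lnv0ice N0 l2"


result = find_working_scale(fake_ocr, lambda text: "Invoice" in text)
print(result)
```

In a real workflow the check would be whatever tells you the extraction succeeded, for example the presence of an expected keyword or a field that matches a known pattern.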

You can use UiPath Document OCR or Tesseract OCR.

Use Tesseract OCR to extract data from scanned documents.

Since Tesseract OCR requires an image as its input value, do I have to convert the scanned PDF file to an image file?

What should I do in this case?

You do not need to give Tesseract OCR any input directly. Use the Read PDF With OCR activity and place Tesseract OCR inside it. Then provide the path of the PDF file as the input to the Read PDF With OCR activity.

Check the image below for a better understanding:
image

Read PDF With OCR then uses Tesseract OCR to read scanned or unstructured PDFs.

Hope you understand!!

Wow, it worked.

Thank you so much! 🙂

It’s my pleasure… @mini9301

Happy Automation!!
