About OCR Engines

Amrut_Valasang1 · September 18, 2021, 11:30am

I am using the Read PDF with OCR activity in my workflow with the UiPath Document OCR engine, but I am unable to get the output in a structured form. The extracted data should appear at the same position as it does on screen when the file is open.
Can anyone please help me with this?

Thanks

Srini84 · September 18, 2021, 11:49am

@Amrut_Valasang1

Welcome to the forums.

It’s difficult to maintain the same format with OCR engines.

You can try UiPath Document Understanding to extract the data from it.

Hope this helps.

Thanks

sonaliaggarwal47 · September 18, 2021, 12:53pm

Hi @Amrut_Valasang1,

In my view, that’s not possible with the Read PDF with OCR activity, as it takes the whole PDF and returns all the content as plain text without preserving its position on screen.

So, there are a few options you can try:

If the requirement is to extract specific data from specific positions rather than the whole PDF, I would recommend the UiPath Document Understanding framework.
Alternatively, if you want to convert everything in the PDF and also retain its position, you can use UiPath activities to open and edit those PDFs in Acrobat Pro DC (using activities such as Open Application, multiple Clicks, etc.), which would convert everything and retain its position as well, so you wouldn’t notice a difference.

Regards
Sonali

Pradeep.Robot · September 19, 2021, 5:05am

Technically, digitizing is just the conversion of the PDF to an image and then to text. When the PDF is converted to an image and the text is retrieved from it, you won’t see the text in the same position. You can observe this by writing the result out with Write Text File right after your OCR digitization.

Pass your text string on and retrieve the values through regular expressions or through Document Understanding extractors such as the Regex extractor or Form extractor, depending on your license plan. Since you have not mentioned the objective, I am just providing suggestions. Hope this helps.
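Outside of UiPath, the regex step could look roughly like the Python sketch below. This is only an illustration: the sample text, the field labels, and the `extract_fields` helper are assumptions, not output from any particular OCR engine, so the patterns would need adjusting to the real text.

```python
import re

# Hypothetical OCR output; the labels and layout are assumptions for illustration.
ocr_text = """Invoice No: INV-1042
Date: 18/09/2021
Total Amount: 1,250.00
"""

# One pattern per field; each captures the value that follows its label.
patterns = {
    "invoice_no": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total Amount:\s*([\d,]+\.\d{2})",
}


def extract_fields(text, patterns):
    """Return a dict of field name -> first match, or None when a label is missing."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(ocr_text, patterns)
print(fields)
```

Returning `None` for a missing label keeps the caller from failing when OCR drops or misreads a line, which tends to happen with scanned documents.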

Amrut_Valasang1 · September 27, 2021, 2:10pm

Thanks for your suggestion, but I am not using Document Understanding. I need the data from the whole PDF files extracted to a text file, which I will then use from other programming languages.

Thanks

Amrut_Valasang1 · September 27, 2021, 2:11pm

OK, Sonaliaggarwal,
Thanks for your suggestion. I will work on it.

Thanks

sonaliaggarwal47 · September 28, 2021, 4:59pm

Hi @Amrut_Valasang1,

Let me know how this goes

Regards
Sonali

mkankatala · July 4, 2023, 6:41am

Hi @Amrut_Valasang1

When you use any OCR engine to extract data from a PDF, the structure of the output is not the same as in the PDF. There may be changes in the output.
But if you use UiPath Document OCR or OmniPage OCR, it will give better output than other OCR engines.

If you want to use OmniPage OCR, install the UiPath.OmniPage.Activities package.

Hope it helps!!

vrdabberu · July 4, 2023, 6:43am

Hi @Amrut_Valasang1

It’s not possible to get exactly the same structure as the PDF, but with Tesseract OCR you may get the data in a particular format, that is, line by line, and data tables can be extracted in a tabular format separated by pipes.

With OmniPage OCR you may get the same data as explained above, with a few changes.
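If the text file is later read from another language, the pipe-separated lines can be split back into rows. Here is a minimal Python sketch, assuming every table row contains at least one `|` and ordinary lines contain none. The sample text and the `parse_pipe_rows` helper are made up for illustration, and real OCR output may need more cleanup.

```python
def parse_pipe_rows(text):
    """Split OCR text into table rows (lines containing '|') and plain lines."""
    rows, plain = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if "|" in line:
            # Drop any leading/trailing pipe, then trim each cell.
            cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
            rows.append(cells)
        else:
            plain.append(line.strip())
    return rows, plain


# Hypothetical OCR output with one pipe-separated table.
sample = """Order summary
Item | Qty | Price
Pen | 2 | 10.00
Book | 1 | 45.50
"""

rows, plain = parse_pipe_rows(sample)
print(rows)
print(plain)
```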
