Digitize Document text format issue


I am using the Document Understanding framework to read a PDF file.
After running the Digitize Document activity, I wrote the resulting text to a Notepad file.

Issue: many underscores appear in the text written to the Notepad file.

When I read the same PDF file using Read PDF With OCR, this issue doesn't occur and the text in the Notepad file is correct.

Can anyone tell me how I can improve the results of the Digitize Document activity?
I tried to manually edit the results in Notepad and replace the underscores with spaces, but I was not able to do that.

Can you please suggest how I can optimize the result for better use?

Also, when I use the Form Extractor in the Data Extraction Scope, all the data I want has an underscore between each character.

For example, Invoice Number: AZ12
is visible as A_Z_1_2
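
For reference, the underscore cleanup described above could be done programmatically instead of by hand in Notepad. The following is only a minimal sketch in Python, assuming the digitized text has already been saved as a plain string outside the workflow. The function name `strip_interleaved_underscores` is hypothetical and is not part of UiPath or any OCR engine.

```python
import re

# Assumption: an underscore is an OCR artifact when it sits directly
# between two letters or digits, as in "A_Z_1_2".
_INTERLEAVED_UNDERSCORE = re.compile(r"(?<=[A-Za-z0-9])_(?=[A-Za-z0-9])")


def strip_interleaved_underscores(text: str) -> str:
    """Remove single underscores that sit between two alphanumeric characters.

    Hypothetical helper for illustration only.
    """
    return _INTERLEAVED_UNDERSCORE.sub("", text)


if __name__ == "__main__":
    print(strip_interleaved_underscores("Invoice Number : A_Z_1_2"))
```

Note that this would also remove legitimate underscores inside values such as file names, so it only fits fields that are known not to contain them.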

Hi Harshit,

Which OCR engine are you using with Digitize Document?

OmniPage OCR.

I used this OCR engine with the Read PDF With OCR activity as well.
Please note that the text inside my document was underlined.

I was able to solve this using the Intelligent Form Extractor.
The Intelligent Form Extractor, used with handwritten fields, gave me the correct result.

Have you tried using any other OCR engine like Microsoft OCR, Tesseract OCR, or UiPath Document OCR?

Can you try adding Present Validation Station after the extractor and correcting the data there?

Yes, OmniPage OCR gave the best results compared to Microsoft OCR and Tesseract OCR. I didn't try UiPath Document OCR.

I didn't want any manual intervention. I was able to resolve this using the Intelligent Form Extractor.
