Digitized Document text format Issue

Harshit_Tanted1 · July 2, 2020, 10:20am

Hi,

I am using the Document Understanding framework to read a PDF file.
After running the Digitize Document activity, I wrote the resulting text to a Notepad file.

Issue: many underscores appear in the Notepad data.

When I read the same PDF file using Read PDF With OCR, this issue doesn't occur and the data in the Notepad file is correct.

Can anyone tell me how I can improve the results of the Digitize Document activity?
I tried to manually edit the results in Notepad and replace the underscores with spaces, but I am not able to do that.

Can you please suggest how I can optimize the result for better use?

Also, when I use the Form Extractor in the Data Extraction Scope, all the data I want has an underscore between each character.

For example, Invoice Number : AZ12
is visible as A_Z_1_2
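
One possible workaround is to clean the text after digitization. The sketch below is not from the thread: it assumes the stray underscores only ever sit between single characters (as in A_Z_1_2), and the function name `collapse_underscores` is hypothetical. Python is used purely for illustration; the same pattern idea could be ported to whatever language the workflow uses.

```python
import re

# Assumption: a "damaged" token is a run of single letters/digits joined by
# underscores, e.g. "A_Z_1_2". Tokens such as "file_name" do not fit this
# shape and are left untouched.
_INTERLEAVED = re.compile(r"\b(?:[A-Za-z0-9]_)+[A-Za-z0-9]\b")


def collapse_underscores(text: str) -> str:
    """Remove underscores from tokens made of single characters joined by '_'."""
    return _INTERLEAVED.sub(lambda m: m.group(0).replace("_", ""), text)


if __name__ == "__main__":
    print(collapse_underscores("Invoice Number : A_Z_1_2"))
```

This only repairs tokens where every character is separated by an underscore, so it will not help if the OCR output mixes damaged and undamaged characters within the same word.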

tudor.serban · July 10, 2020, 11:27am

Hi Harshit,

Which OCR engine are you using with Digitize Document?

Harshit_Tanted1 · July 10, 2020, 11:56am

OmniPage OCR.

I used this OCR engine with the Read PDF With OCR activity as well.
Please note that the text inside my document was underlined.

I was able to solve this using the Intelligent Form Extractor.
The Intelligent Form Extractor, used with handwritten fields, gave me the correct result.

tudor.serban · July 10, 2020, 12:09pm

Have you tried using any other OCR engine like Microsoft OCR, Tesseract OCR, or UiPath Document OCR?

Venugopal24 · July 10, 2020, 12:11pm

Can you try the Present Validation Station activity after the extractor and correct the data there…

Harshit_Tanted1 · July 10, 2020, 12:26pm

Yes, OmniPage OCR gave the best results compared to Microsoft OCR and Tesseract OCR. I didn't try UiPath Document OCR.

Harshit_Tanted1 · July 10, 2020, 12:27pm

I didn't want any manual intervention. I was able to resolve this using the Intelligent Form Extractor.
