Document Understanding – Digitize Document – Native PDF inaccuracies

ForemanChris_Rex · April 6, 2022, 11:45am

Hi Everyone,

My organization is reviewing UiPath’s Document Understanding framework. We are trying to extract the data from native PDF documents and have noticed some inaccuracies at the Digitize stage.

Our understanding is that native PDF documents do not need to be digitized using the OCR engine, so we expected the output from the ‘Digitize Document’ activity to be as accurate as that from a ‘Read PDF Text’ activity. However, we have found this is not the case, and data is missing from the Digitize output.

By the way, we have not made any changes to the Document Understanding Framework template, apart from adding our Taxonomy and Regex Extractor settings. Hence the ‘ForceApplyOCR’ flag within the properties of the ‘Digitize Document’ activity is still set to False.

Has anyone else experienced the same issue, and fixed it?

Thanks

nora_ziani · April 8, 2022, 9:31am

Hello

Did you find a solution for your issue?

Best regards

ForemanChris_Rex · April 8, 2022, 11:45am

No, I’ve had no response. Thanks

sharon.palawandram · April 11, 2022, 3:04am

Have you tried changing the “ForceApplyOCR” to “True”? Please let me know if the issue still persists.

Also, which endpoint and which OCR engine are you using?

ForemanChris_Rex · April 11, 2022, 7:41am

Hi Sharon,

Thanks for coming back to me.

This is a native PDF and shouldn’t need OCR. When you use the ‘Read PDF Text’ activity on the same PDF, it reads it 100% correctly, so you’d think that the ‘Digitize Document’ activity should do the same.

sharon.palawandram · April 11, 2022, 8:07am

The Read PDF Text activity returns only the document text, whereas the Digitize Document activity returns both the document text and a Document Object Model (DOM). The inconsistencies could originate there, so it is worth setting ForceApplyOCR to True for this case to see whether you still have the same issue.
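
To pin down exactly which data goes missing, it can help to diff the two text outputs outside the workflow. Below is a minimal sketch, assuming the text from Read PDF Text and the text from Digitize Document have each been saved as plain strings. The function name and the sample strings are hypothetical and are not part of any UiPath API.

```python
import difflib


def missing_tokens(reference_text: str, digitized_text: str) -> list[str]:
    """Return tokens of reference_text that were dropped or altered in digitized_text.

    Both texts are split on whitespace, so layout differences (line breaks,
    extra spaces) are ignored and only the tokens themselves are compared.
    """
    ref = reference_text.split()
    dig = digitized_text.split()
    # autojunk=False keeps frequent tokens from being treated as noise
    matcher = difflib.SequenceMatcher(a=ref, b=dig, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete": tokens only in the reference; "replace": tokens that differ
        if tag in ("delete", "replace"):
            missing.extend(ref[i1:i2])
    return missing


if __name__ == "__main__":
    # Hypothetical sample strings standing in for the two activity outputs
    read_pdf_text = "Invoice No 12345 Total 99.50"
    digitize_text = "Invoice No Total 99.50"
    print(missing_tokens(read_pdf_text, digitize_text))  # → ['12345']
```

Running this on both outputs for the same file shows whether the gaps cluster around particular fields, such as tables or headers, which is useful detail to include when reporting the problem.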

supermanPunch · April 18, 2022, 9:02am

Hi @ForemanChris_Rex ,

Have you checked the latest update introduced in the Digitize Document activity?
You could check the post below for further updates:

