Document Understanding – Digitize Document – Native PDF inaccuracies

Hi Everyone,

My organization is reviewing UiPath’s Document Understanding framework. We are trying to extract the data from native PDF documents and have noticed some inaccuracies at the Digitize stage.

Our understanding is that native PDF documents do not need to be digitized using the OCR engine, so we expected the output from the ‘Digitize Document’ activity to be as accurate as that from a ‘Read PDF Text’ activity. However, we have found this is not the case: some data is missing from the ‘Digitize Document’ output.

We have not made any changes to the Document Understanding Framework template, apart from adding our Taxonomy and Regex Extractor settings. Hence the ‘ForceApplyOCR’ flag within the properties of the ‘Digitize Document’ activity is still set to False.

Has anyone else experienced the same issue, and fixed it?

Thanks


Hello,

Did you find a solution for your issue?

Best regards

No, I’ve had no response. Thanks

Have you tried changing the “ForceApplyOCR” to “True”? Please let me know if the issue still persists.

Also, which endpoint and which OCR engine are you using?

Hi Sharon,

Thanks for coming back to me.

This is a native PDF and shouldn’t need OCR. When you use the ‘Read PDF Text’ activity on the same PDF, it reads it 100% correctly, so you’d think the ‘Digitize Document’ activity should do the same.

The ‘Read PDF Text’ activity returns only the document text, whereas the ‘Digitize Document’ activity returns both the document text and a Document Object Model (DOM). The inconsistencies could come from that difference. For this case, it’s better to set ‘ForceApplyOCR’ to True and see if you still have the same issue.
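
If it helps to pin down exactly what is missing, you could save both text outputs to strings and compare them outside the workflow. Below is a minimal Python sketch, assuming you have exported the ‘Read PDF Text’ output and the ‘Digitize Document’ text output yourself. The function name `missing_tokens` and the sample strings are hypothetical and not part of UiPath.

```python
from collections import Counter


def missing_tokens(reference_text: str, digitized_text: str) -> list[str]:
    """Return tokens that occur in reference_text more often than in digitized_text.

    Both arguments are plain strings; tokens are split on whitespace,
    so layout differences (line breaks, extra spaces) are ignored.
    """
    reference_counts = Counter(reference_text.split())
    digitized_counts = Counter(digitized_text.split())
    # Counter subtraction keeps only tokens with a positive remaining count.
    missing = reference_counts - digitized_counts
    return sorted(missing.elements())


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two activity outputs.
    read_pdf_text = "Invoice No 123 Total 45.00"
    digitize_text = "Invoice No Total 45.00"
    print(missing_tokens(read_pdf_text, digitize_text))
```

This only tells you which words are absent, not where they sit on the page, but it should be enough to see whether the gaps follow a pattern (for example, a particular table or font).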

Hi @ForemanChris_Rex,

Have you checked the latest update introduced into the Digitize Document activity?
You could check the post below for further updates: