My organization is reviewing UiPath’s Document Understanding framework. We are trying to extract the data from native PDF documents and have noticed some inaccuracies at the Digitize stage.
Our understanding is that native PDF documents do not need to be digitized using the OCR engine, so we expected the output from the ‘Digitize Document’ activity to be as accurate as that from a ‘Read PDF Text’ activity. However, we have found this is not the case, and data is missing.
By the way, we have not made any changes to the Document Understanding Framework template, apart from adding our Taxonomy and Regex Extractor settings. Hence the ‘ForceApplyOCR’ flag within the properties of the ‘Digitize Document’ activity is set to False.
Has anyone else experienced the same issue, and fixed it?
This is a native PDF and shouldn’t need OCR. When you use the ‘Read PDF Text’ activity on the same PDF, it reads it 100% correctly, so you’d expect the ‘Digitize Document’ activity to do the same.
The ‘Read PDF Text’ activity returns only the document text, whereas the ‘Digitize Document’ activity returns both the document text and a Document Object Model (DOM). The inconsistency may come from that difference. Try setting ‘ForceApplyOCR’ to True for this case and see whether you still have the same issue.
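To pin down exactly which data goes missing, it can help to compare the two text outputs outside the workflow. Below is a minimal sketch in Python, assuming you have saved the text from ‘Read PDF Text’ and the document text from ‘Digitize Document’ as plain strings. The function name `missing_tokens` and the sample strings are hypothetical, made up for illustration; this is not part of UiPath.

```python
import re
from collections import Counter


def missing_tokens(reference_text: str, candidate_text: str) -> Counter:
    """Return tokens that occur more often in reference_text than in candidate_text.

    Tokens are whitespace-separated chunks, so differences in line breaks
    or spacing between the two outputs are ignored.
    """
    reference_counts = Counter(re.findall(r"\S+", reference_text))
    candidate_counts = Counter(re.findall(r"\S+", candidate_text))
    # Counter subtraction keeps only tokens with a positive remaining count.
    return reference_counts - candidate_counts


# Hypothetical sample strings standing in for the two activity outputs.
read_pdf_text = "Invoice No 12345\nTotal 99.50 EUR"
digitized_text = "Invoice No Total 99.50"

for token, count in missing_tokens(read_pdf_text, digitized_text).items():
    print(f"missing {count}x: {token}")
```

Running the same comparison with ‘ForceApplyOCR’ set to False and then True would show whether forcing OCR recovers the missing values or introduces different errors.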