Illegible Characters Extracted With Digitize Document Activity

How to fix gibberish or illegible characters being read from a document when using the Digitize Document activity?

Issue Description: The PDF may look fine visually, but when it is digitized using the Digitize Document activity, some or all of the digitized text is illegible or gibberish.



Root Cause: This issue is typically caused by a corrupt PDF or by problematic metadata in the source document.

This corruption is known to be caused by, among other things, a deprecated font used in the document or the program that was used to create the PDF file.

See the example below:

[Image: image.png]

Note: This issue is not the same as wrong values being extracted. The issue discussed in this article is clearly noticeable; it is not a subtle misread such as an E mistaken for a 3.

Resolution:

  • Set ApplyOCRonPDF=YES (or, in older IntelligentOCR package versions, ForceApplyOCR=True) on the Digitize Document activity in the project. This forces the process to use OCR to read the characters that are visually present in the document instead of the native text embedded in it.
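
If it is unclear whether a given document is affected, a rough check on the digitized text can help decide whether forcing OCR is worth trying. The Python sketch below is only an illustration and is not part of UiPath or the IntelligentOCR package. The function names (gibberish_ratio, looks_illegible) and the 0.2 threshold are hypothetical. It only counts characters that rarely appear in legitimate text (private-use, unassigned, control, and replacement characters), so it will not catch gibberish made of ordinary but wrong letters.

```python
import unicodedata

# Unicode categories that rarely occur in legitimately extracted text:
# Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}


def gibberish_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look suspicious.

    Hypothetical helper: the definition of "suspicious" is an assumption
    chosen for illustration, not a rule taken from any UiPath component.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(
        1
        for c in chars
        # U+FFFD is the Unicode replacement character.
        if unicodedata.category(c) in SUSPICIOUS_CATEGORIES or c == "\ufffd"
    )
    return suspicious / len(chars)


def looks_illegible(text: str, threshold: float = 0.2) -> bool:
    """Flag text whose suspicious-character ratio reaches the threshold.

    The default threshold of 0.2 is an arbitrary assumption; tune it on
    real samples before relying on it.
    """
    return gibberish_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_illegible("Invoice number 12345"))
    print(looks_illegible("\ue001\ue002\ue003\ufffd ab"))
```

A check like this could be run on the text returned by digitization; documents that are flagged are candidates for re-digitizing with OCR forced as described above.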