Document Understanding Insight #1

Challenge 1: Digitizer Inconsistency

Problem Statement: While automating document processing with UiPath’s AI Center, we trained models to extract information from a variety of document layouts and reached an evaluation accuracy above 90%. In live extraction, however, the results fell short of that figure. Specific fields failed to extract accurately from documents whose layout appeared, to the human eye, identical to the training set, which introduced inconsistencies into our data processing pipeline.

Intuition: To find the root cause, we ran a series of tests on the digitizer component of UiPath Document Understanding, focusing on the ApplyOcrOnPdf parameter. We compared two settings: ‘Auto’, where the digitizer decides for itself whether to apply Optical Character Recognition (OCR), and ‘Yes’, which forces OCR on every document. The structure of the raw text output differed significantly between the two settings (Figures 2 and 3), which suggested a mismatch between the text the models saw during training and the text they received during live extraction.


Figure 1: Sample PDF


Figure 2: The raw text with ApplyOcrOnPdf as Auto


Figure 3: The raw text with ApplyOcrOnPdf as Yes
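
The comparison between the two raw text outputs can also be made measurable rather than visual. The following is a minimal Python sketch, assuming the raw text from each digitizer run has been saved or copied out as a plain string. The function name compare_raw_text and the sample strings are hypothetical and are not part of UiPath; the sketch uses only Python's standard difflib module.

```python
import difflib


def compare_raw_text(text_auto: str, text_forced: str) -> dict:
    """Compare two raw-text dumps of the same document line by line."""
    # Strip trailing whitespace so only structural differences are counted.
    lines_auto = [line.rstrip() for line in text_auto.splitlines()]
    lines_forced = [line.rstrip() for line in text_forced.splitlines()]

    # Ratio of 1.0 means the two dumps have identical line sequences.
    matcher = difflib.SequenceMatcher(a=lines_auto, b=lines_forced, autojunk=False)

    diff = list(
        difflib.unified_diff(
            lines_auto,
            lines_forced,
            fromfile="ApplyOcrOnPdf=Auto",
            tofile="ApplyOcrOnPdf=Yes",
            lineterm="",
        )
    )
    return {
        "similarity": matcher.ratio(),
        "line_count_auto": len(lines_auto),
        "line_count_forced": len(lines_forced),
        "diff": diff,
    }


# Hypothetical sample text, invented for illustration only.
sample_auto = "Invoice No: 123\nDate: 2024-01-01\nTotal: 100.00"
sample_forced = "Invoice No: 123 Date: 2024-01-01\nTotal: 100.00"

result = compare_raw_text(sample_auto, sample_forced)
print(f"similarity: {result['similarity']:.2f}")
print("\n".join(result["diff"]))
```

A low similarity score for a document that looks identical to the training samples is a hint that the digitizer produced a different text structure than the one the model learned from.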

Solution: Setting ApplyOcrOnPdf to ‘Yes’ forces OCR on every document, and with this setting we observed a notable improvement in the structural consistency of the digitized text. Because OCR was now applied uniformly, the text reaching the models matched the conditions under which they were trained more closely. Field extraction accuracy increased substantially as a result, bringing our live results closer to the initial evaluation metrics (compare Figures 4 and 5). Our practice going forward is to keep document digitization settings consistent, so that model performance stays reliable in real-world use.


Figure 4: The extracted text with ApplyOcrOnPdf as Auto


Figure 5: The extracted text with ApplyOcrOnPdf as Yes


Figure 6: Document Understanding – Labelling Session


Figure 7: UiPath Studio – Digitize Document
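
One way to make the consistency practice harder to forget is to record the digitization settings used for labelling and for the live workflow, then compare them before a release. The sketch below assumes those settings are written down by hand as plain dictionaries. The function name find_setting_mismatches and the ocr_engine key are hypothetical; only ApplyOcrOnPdf comes from the text above, and nothing here reads settings from UiPath itself.

```python
def find_setting_mismatches(training: dict, runtime: dict) -> dict:
    """Return settings whose values differ, as {name: (training, runtime)}."""
    all_keys = sorted(set(training) | set(runtime))
    return {
        key: (training.get(key), runtime.get(key))
        for key in all_keys
        if training.get(key) != runtime.get(key)
    }


# Hypothetical, manually recorded settings (not read from UiPath).
training_settings = {"ApplyOcrOnPdf": "Yes", "ocr_engine": "engine-A"}
runtime_settings = {"ApplyOcrOnPdf": "Auto", "ocr_engine": "engine-A"}

mismatches = find_setting_mismatches(training_settings, runtime_settings)
for name, (trained_with, running_with) in mismatches.items():
    print(f"{name}: training={trained_with!r}, runtime={running_with!r}")
```

An empty result means the recorded settings agree; any entry is worth resolving before comparing live accuracy with evaluation accuracy.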

Disclaimer: This is solely our opinion. Any technical issues should be referred to UiPath professionals.