DU model does not process all pages of a document.
Issue:
In a Document Understanding process, data does not appear to be extracted from all of the expected pages after the document passes through the extractors in the Data Extraction Scope.
Initial Troubleshooting:
To troubleshoot this issue, take a document for which extraction appears to fail and proceed with the following steps:
- As a first step, confirm that the document is valid and that the additional pages have content. Open the document manually and check that it looks as expected.
- Open the process in UiPath Studio.
- Place breakpoints on the Digitize Document activity in the Digitize workflow, on the Classify Document Scope in the Classify workflow, and on the Data Extraction Scope in the Extract workflow.
- Start debugging the process.
Troubleshooting At Digitization:
- When the breakpoint at the Digitize Document activity is reached, review the DocumentText variable and check whether the content from the expected pages is present. (Note: if there is too much content, the value is truncated in the Locals panel. It may be necessary to write the content to a text file after the Digitize Document activity in order to review it.)
- If the content is present, proceed to the next section, "Troubleshooting For Extraction".
- If the content is not present in DocumentText, the issue is likely occurring at the Digitize Document step. Check whether ApplyOCROnPDF is set to Yes, No, or Auto. If it is set to No or Auto, change it to Yes and debug the process again to re-check DocumentText. If text is still not digitized on all of the expected pages, share the following with the Product Support team:
- Project in a zipped folder
- UiPath Diagnostic Tool Report from the machine where the issue is occurring.
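Once DocumentText has been written to a text file, the review can be partly automated outside of Studio. The following is a minimal Python sketch, not part of the UiPath tooling: it assumes the reviewer has picked one short phrase known to appear on each expected page, and it reports the pages whose phrase is absent from the dumped text. The function name, the sample text, and the marker phrases are all hypothetical.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so line breaks in the dump do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def pages_missing_from_text(document_text: str, markers_by_page: dict[int, str]) -> list[int]:
    """Return the page numbers whose marker phrase does not occur in the digitized text."""
    haystack = normalize(document_text)
    return sorted(
        page
        for page, marker in markers_by_page.items()
        if normalize(marker) not in haystack
    )


# Hypothetical example: text as it might be dumped after Digitize Document.
dumped_text = "Invoice 1001\nBill To: Contoso Ltd\n\nLine items\nWidget   4   19.99"

# One hand-picked phrase per expected page (assumption: chosen by the reviewer).
markers = {1: "Bill to: Contoso Ltd", 2: "Widget 4 19.99", 3: "Total due"}

missing = pages_missing_from_text(dumped_text, markers)
print(missing)  # pages whose marker was not found
```

In practice, dumped_text would be read from the file written after the Digitize Document activity. A page reported as missing points to the digitization step rather than to extraction.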
Troubleshooting For Extraction:
After digitization, the next step is typically to classify the document. If the Document Understanding Framework is used, the classification results are saved in a ClassificationResultsArray. The process then loops through the ClassificationResultsArray with a For Each (or a Parallel For Each, as in Main-ActionCenter.xaml, which is part of the Document Understanding Framework).
This portion of troubleshooting requires a check at the classification step as well as at the extraction step. (If classification is not used in the process because only one document type is ever processed, this portion of troubleshooting can be skipped.)
- When the breakpoint at the Classify Document Scope is reached while debugging, step over the Classify Document Scope and then inspect the ClassificationResultsArray. Note how many classification results were returned and what each ClassificationResult in the array contains.
- Continue debugging until the Data Extraction Scope in the Extract workflow is reached, then check the configuration of the Data Extraction Scope.
- If a specific Document Type Id is configured for the scope, classification results that do not match that Document Type Id are ignored. If some pages in the document have a different classification, no data is extracted from them, because the Data Extraction Scope has been configured to look at one particular Document Type Id. In that case, the Classification Result property should be used instead.
- If the Document Type Id is not used and the Classification Result property is used instead, ensure that it is set to the variable/argument that represents the current item of the ClassificationResultsArray. In the Document Understanding Framework, this is the in_ClassificationResult argument by default (a single ClassificationResult from the ClassificationResultsArray).
If a fixed index of the ClassificationResultsArray is used instead (for example, ClassificationResultsArray(0)), the process only ever looks at the first result in the array, and any other classification results in the array are ignored. Use a variable/argument that represents each ClassificationResult in the array instead.
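The effect of the two misconfigurations described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the ClassificationResult class below is a hypothetical stand-in with made-up fields (document_type_id, start_page, page_count), not the UiPath type, and the sample results are invented. It compares the pages reached when every result is processed against the pages reached with a fixed index or a single Document Type Id.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClassificationResult:
    """Hypothetical stand-in for one classification result (not the UiPath class)."""
    document_type_id: str
    start_page: int   # first page covered by this result (1-based in this sketch)
    page_count: int   # number of consecutive pages covered


def covered_pages(results: list[ClassificationResult]) -> set[int]:
    """Union of the page numbers covered by the given classification results."""
    pages: set[int] = set()
    for result in results:
        pages |= set(range(result.start_page, result.start_page + result.page_count))
    return pages


# Invented example: a 4-page file split into three classification results.
results = [
    ClassificationResult("invoice", start_page=1, page_count=2),
    ClassificationResult("receipt", start_page=3, page_count=1),
    ClassificationResult("invoice", start_page=4, page_count=1),
]

# Looping over every result reaches every classified page.
all_pages = covered_pages(results)

# A fixed index, like ClassificationResultsArray(0), only reaches the first result.
first_only = covered_pages(results[:1])

# A single configured Document Type Id only reaches results of that type.
invoices_only = covered_pages([r for r in results if r.document_type_id == "invoice"])

skipped_by_index = sorted(all_pages - first_only)
skipped_by_type = sorted(all_pages - invoices_only)
print(skipped_by_index, skipped_by_type)
```

Comparing the classification results noted at the breakpoint against the pages that actually yield data in the same way can show which of the two configurations is dropping pages.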
Troubleshooting After Extraction:
If the configuration of the Data Extraction Scope is confirmed to be correct, step over the Data Extraction Scope and review out_ExtractionResults. An easy way to do this is to place a Present Validation Station activity after the extraction and pass the extraction results to it. In Validation Station, check whether the expected page is shown. If the page is shown but its values were not extracted, the issue could be with the configuration of the extractor that was used, or possibly with the ML Skill itself. (Note: if there are multiple classification results, repeat the Validation Station check for each classification result.)
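The checks in this article form a short decision sequence, which can be summarized as a Python sketch for recording findings per page. The function and its parameter names are hypothetical and only restate the order of checks described above; they do not correspond to any UiPath API.

```python
def likely_problem_area(
    text_in_document_text: bool,
    page_shown_in_validation_station: bool,
    values_extracted: bool,
) -> str:
    """Map the three observations made while debugging to the area to investigate."""
    if not text_in_document_text:
        # Page content never appeared in DocumentText.
        return "digitization (Digitize Document settings)"
    if not page_shown_in_validation_station:
        # Text was digitized, but the page never reached the extraction results.
        return "classification results or Data Extraction Scope configuration"
    if not values_extracted:
        # Page reached extraction, but no values came back for it.
        return "extractor configuration or ML Skill"
    return "no issue found on this page"


# Invented observations for pages 1 to 3 of a problem document.
observations = {
    1: (True, True, True),
    2: (True, False, False),
    3: (False, False, False),
}

report = {page: likely_problem_area(*obs) for page, obs in observations.items()}
for page, area in report.items():
    print(f"page {page}: {area}")
```

A per-page summary of this kind can also accompany the screenshots shared with Product Support.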
Next Steps:
If the steps above have been followed and the issue still occurs, share the following with Product Support for additional troubleshooting:
- Screenshots or results from the troubleshooting steps described above
- Project in a zipped folder
- UiPath Diagnostic Tool Report from the machine where the issue is occurring.