I’ve been trying to build a process using the Document Understanding Framework. I’ve recreated my templates several times, and yet every time I run the process I get this error.
I have tried swapping between the Microsoft and Tesseract OCR engines. Also, the Form Extractor uses this API endpoint, for which I use the key from the Orchestrator Services tab.
I’ve also confirmed that the file I’m trying to extract information from is less than 50 KB, which is well below the limit for this endpoint.
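For anyone else checking the same thing, here is a minimal sketch of the size check in Python. The 50 KB figure is the limit mentioned above; the function and constant names are my own, not part of any UiPath API:

```python
import os

# 50 KB limit mentioned for the Form Extractor endpoint (assumption from the post above).
LIMIT_BYTES = 50 * 1024

def within_endpoint_limit(path: str) -> bool:
    """Return True if the file at `path` is at or under the 50 KB limit."""
    return os.path.getsize(path) <= LIMIT_BYTES
```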
@AndyMenon Check whether the Digitize Document activity is able to get the text results properly. Also check the output of the Classify Document Scope activity: verify that the count of the Classifier Result is greater than 0 before continuing on to the Validation Station and exporting the document.
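As a rough sketch of that "count greater than 0" check, here it is in Python against a hypothetical JSON export of the classifier output. The `classificationResults` key is an assumption for illustration; inside the actual workflow you would check the length of the Classify Document Scope's ClassificationResults output directly:

```python
import json

def has_classification(results_json: str) -> bool:
    """Return True if the (hypothetical) exported classifier output
    contains at least one classification result.

    Assumes an export shaped like {"classificationResults": [...]};
    the real activity output is an array you would test for Count > 0.
    """
    data = json.loads(results_json)
    return len(data.get("classificationResults", [])) > 0
```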
You can try using a different OCR engine and check whether it is able to get the data. Also, if you’re using the OmniPage OCR, try different values of the Profile property: keep the Profile in Scan mode, and if that doesn’t work, change it again and check whether it is able to get the data.
I checked the Document Text output. It contains JSON content consisting of box positions and words; a snapshot of the output document text is below. Since it contains position data, I thought I could use the Position-based Extractor, but that extractor is not available in my environment despite installing the required packages and following the instructions on this forum thread (where I have posted a third problem):
I have built two projects from scratch using two different PDF documents, thinking that the documents themselves might be the problem. But in both cases the project ends with the same error message.
Document Text output confirms words extracted with position information:
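For reference, word-plus-position data like that can be consumed with a short script. The schema below (a top-level `"words"` list with `"text"` and `"box"` keys) is an assumption modeled on the snapshot, not the exact Document Understanding DOM format:

```python
import json

def extract_words(document_text_json: str):
    """Pull (word, bounding_box) pairs out of a digitized-document JSON dump.

    Assumed shape: {"words": [{"text": "...", "box": [x, y, w, h]}, ...]}.
    """
    data = json.loads(document_text_json)
    return [(w["text"], tuple(w["box"])) for w in data.get("words", [])]

# Toy sample in the assumed shape.
sample = '{"words": [{"text": "Invoice", "box": [72, 90, 130, 14]}]}'
```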
@AndyMenon May I know the output of which activity you used to get that JSON data? Because as far as I know, the data should actually be raw data from the PDF file you are using.
Yep. I sent you a snapshot of the DOM that contains the word + position info. But the DocumentText also contains the raw data, as seen in my debugger below.
Excellent timing!
I was debugging that at this very moment! It is blank!
A related problem: I see only one classifier in my environment, but the instructions forum shows a screenshot with multiple classifiers. Comparison below:
Can you send me your project with all sensitive data removed? I want to run a comparison of the project dependencies to see where this is going off the road for me.
Swapped out the Microsoft engine for Tesseract, which has a default Scale of 2.
Replaced the Keyword Based Classifier with the Intelligent Keyword Classifier and trained it by clicking the Manage Learning link and providing the trainer with an input PDF to generate the JSON file.
@AndyMenon How many keywords have you provided in the Keyword Based Classifier? How many pages does your PDF contain? The Intelligent Keyword Classifier is available starting from the latest release. I was able to get the Document Understanding Framework to work by using the OmniPage OCR with a Scan profile and using keywords that appear to be constant across documents of that type.
I’m not sure the Intelligent Keyword Classifier is really needed for your data extraction.
I have one page.
There are about 15 artifacts on the page and I’m extracting about 10 of them.
I was able to configure the Keyword Based Classifier by using the “Manage Learning” link of this activity.
I posted the steps in another thread a few hours ago (link below). Is this the right way to use this Classifier?
I’m asking because I watched a couple of YouTube videos that showed a blank JSON file being given to this Classifier as input; they did not say anything about adding keywords manually. I came up with the steps after reading through the UiPath documentation for this Classifier.
@AndyMenon I think the method you used is proper. So if you remove the Intelligent Keyword Classifier now, it still won’t be able to classify the document?
It works now. This is how my classifier setup looks without the Intelligent Keyword Classifier. One thing I did was add more keywords than the number of artifacts I’m extracting, so I now have to update my template to see how the newly added keywords help pull more information out of the document.
@AndyMenon Actually the appearance doesn’t matter. Since you are using an updated version, it appears in that form now; the parameters are proper. I don’t think there have been any extra updates to the Form Extractor. You’ll have to use Manage Templates and create templates to define the fields you want to extract for each document type.