I’ve just started using Document Understanding. It’s awesome! But I have a question about a specific scenario:
If I have one scanned PDF that contains two different document types, and I would like to classify them and then apply two different Form Extractors (for example), what should the workflow logic be?
What is the best approach to using Document Splitting? (Which activities should be combined, and how?)
Hope you can shed some light on it.
Hi @GNISH, I’m not sure how familiar you are with DU, but the general steps would be:
1. Use Digitize Document to obtain the text and DOM.
2. Use the previous results in a Classify Document Scope with Intelligent Keyword Classifier. You can use the design time wizard (Manage Learning) from the Intelligent Keyword Classifier to do some preliminary training so that it knows what each of the document types looks like.
3. Use the classification results from step 2 in a Data Extraction Scope.
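To make the flow of data between those steps more concrete, here is a minimal Python sketch of the same idea: digitized page text goes in, a keyword-based classifier labels each page, consecutive pages with the same label are grouped into page ranges, and each range is sent to the extractor for its document type. This is only an illustration and not the UiPath API. The keyword sets, the document type names, and every function name below are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical keyword sets per document type (illustration only).
KEYWORDS = {
    "Invoice": {"invoice", "total", "due"},
    "PurchaseOrder": {"purchase", "order", "ship"},
}


@dataclass
class ClassifiedRange:
    doc_type: str
    first_page: int  # 0-based, inclusive
    last_page: int   # 0-based, inclusive


def classify_page(text):
    """Pick the document type whose keywords overlap most with the page text."""
    words = set(text.lower().split())
    return max(KEYWORDS, key=lambda doc_type: len(words & KEYWORDS[doc_type]))


def classify_and_split(pages):
    """Label every page, then merge consecutive pages with the same label."""
    labels = [classify_page(text) for text in pages]
    ranges = []
    start = 0
    for doc_type, group in groupby(labels):
        count = len(list(group))
        ranges.append(ClassifiedRange(doc_type, start, start + count - 1))
        start += count
    return ranges


# Hypothetical per-type extractors; real ones would read fields from the pages.
def extract_invoice(pages):
    return {"type": "Invoice", "page_count": len(pages)}


def extract_purchase_order(pages):
    return {"type": "PurchaseOrder", "page_count": len(pages)}


EXTRACTORS = {
    "Invoice": extract_invoice,
    "PurchaseOrder": extract_purchase_order,
}


def process(pages):
    """Classify and split the pages, then run the matching extractor per range."""
    results = []
    for r in classify_and_split(pages):
        extractor = EXTRACTORS[r.doc_type]
        results.append(extractor(pages[r.first_page:r.last_page + 1]))
    return results
```

Calling process() with a list of page texts returns one extraction result per detected page range, which mirrors how the classification results decide which extractor runs on which pages.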
Hi @tudor.serban, Thanks for the reply and suggestion! It was very useful.
I tried the Intelligent Keyword Classifier and it worked, with some additional steps:
I couldn’t use the “raw” PDF for the preliminary training, as the original PDF contained both of the DocumentTypeId values that I’m looking for… So I had to split it before passing it to the Intelligent Keyword Classifier for training.
Then, when I processed the original PDF, it was able to classify and split the PDF into two separate DocumentTypeId values, ready for the Form Extractor.
@GNISHI: Glad to hear that. Alternatively, you could still use the original document for training without splitting it in the following way: digitize the document and then use the Present Classification Station activity to select the page ranges and corresponding document types. Save the result and pass it to a Train Classifiers Scope with Intelligent Keyword Classifier Trainer. You can then classify and split subsequent documents after this point.
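As a rough picture of why page-range annotations are enough for training on a mixed document, here is a small Python sketch: each annotated range contributes its pages' words to the counts for that document type, and later pages are classified against those counts. This is an assumption-laden illustration and not how the Intelligent Keyword Classifier Trainer is actually implemented. The function names and the (doc_type, first_page, last_page) annotation format are hypothetical.

```python
from collections import Counter, defaultdict


def train(pages, annotations, model=None):
    """Add word counts from annotated page ranges to a per-type model.

    annotations: list of (doc_type, first_page, last_page) tuples,
    with 0-based inclusive page indexes (hypothetical format).
    """
    if model is None:
        model = defaultdict(Counter)
    for doc_type, first_page, last_page in annotations:
        for text in pages[first_page:last_page + 1]:
            model[doc_type].update(text.lower().split())
    return model


def classify_with_model(model, text):
    """Return the document type whose learned words best match the page text."""
    words = text.lower().split()
    return max(model, key=lambda doc_type: sum(model[doc_type][w] for w in words))
```

Because the annotations carry the page ranges, the mixed document never has to be split by hand before training; the same model can then be reused on later documents.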
Hi, I just implemented IntelligentOCR and Document Understanding for classification and extraction. It is amazing. I still have a question, though: can we use it to extract data from a table in a PDF? What if the table is a nested table?
@Ioana_Gligan, I have a question about the Document Understanding framework. Can it extract images from a PDF, as well as text embedded inside an image? For example, an image of a stamp on a PDF, where the stamp contains some text. Is it capable of extracting such things?