Document Understanding: Document Splitting and Other Wonderful Stories :)

Hello!
I’ve just started using Document Understanding. It’s awesome! But I have a question about a specific scenario:
If I have one scanned PDF that contains two different document types, and I would like to classify them and then apply two different Form Extractors (for example), what should the workflow logic be?
What is the best approach for using document splitting? (Which activities should be combined, and how?)
Hope you can shed some light on it. :slightly_smiling_face:

Thanks!
Gaston.-

Hi @GNISH, I’m not sure how familiar you are with DU, but the general steps would be:

  1. Use Digitize Document to obtain the text and DOM.
  2. Use the previous results in a Classify Document Scope with Intelligent Keyword Classifier. You can use the design time wizard (Manage Learning) from the Intelligent Keyword Classifier to do some preliminary training so that it knows what each of the document types looks like.
  3. Use the classification results from step 2 in a Data Extraction Scope.
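
To make the flow concrete, here is a minimal Python sketch of the same three stages (digitize, classify and split, extract). It does not call any UiPath API: every name in it (`KEYWORDS`, `digitize`, `classify_and_split`, `EXTRACTORS`, `process`) is hypothetical, and the keyword-overlap scoring is only a stand-in for what a trained classifier would do.

```python
from itertools import groupby

# Hypothetical keyword sets; a trained classifier would learn these.
KEYWORDS = {
    "Invoice": {"invoice", "total", "due"},
    "Receipt": {"receipt", "cashier", "change"},
}

def digitize(pages):
    """Stand-in for OCR: lower-case and tokenize each page's text."""
    return [page.lower().split() for page in pages]

def classify_page(tokens):
    """Pick the document type whose keywords overlap the page the most."""
    scores = {t: len(words & set(tokens)) for t, words in KEYWORDS.items()}
    return max(scores, key=scores.get)

def classify_and_split(tokens_per_page):
    """Return (doc_type, first_page, last_page) for each contiguous run of pages."""
    labels = [classify_page(tokens) for tokens in tokens_per_page]
    results, index = [], 0
    for doc_type, run in groupby(labels):
        length = len(list(run))
        results.append((doc_type, index, index + length - 1))
        index += length
    return results

def value_after(tokens, key):
    """Return the token that follows `key`, or None if there is none."""
    return tokens[tokens.index(key) + 1] if key in tokens[:-1] else None

# One toy extractor per document type (hypothetical field names).
EXTRACTORS = {
    "Invoice": lambda tokens: {"total": value_after(tokens, "total")},
    "Receipt": lambda tokens: {"cashier": value_after(tokens, "cashier")},
}

def process(pages):
    """Digitize once, split by classification, then route each part to its extractor."""
    tokens_per_page = digitize(pages)
    output = []
    for doc_type, first, last in classify_and_split(tokens_per_page):
        tokens = [t for page in tokens_per_page[first:last + 1] for t in page]
        output.append((doc_type, EXTRACTORS[doc_type](tokens)))
    return output
```

The point of the sketch is the ordering: the digitized output is produced once, the classification result carries the page range for each document type, and that range decides which extractor sees which pages.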

Hi @tudor.serban, thanks for the reply and suggestion! :slight_smile: It was very useful.
I tried the Intelligent Keyword Classifier and it worked, with some additional steps:

I couldn’t use the “raw” PDF for the preliminary training, because the original PDF contained both of the document types I’m looking for. So I had to split it before passing it to the Intelligent Keyword Classifier for training.
Then, when I processed the original PDF, it was able to classify and split the PDF into two separate DocumentTypeId results, ready for the Form Extractor.

@GNISH: Glad to hear that. Alternatively, you could still use the original document for training without splitting it, in the following way: digitize the document and then use the Present Classification Station activity to select the page ranges and corresponding document types. Save the result and pass it to a Train Classifiers Scope with Intelligent Keyword Classifier Trainer. You can then classify and split subsequent documents after this point.
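
As a rough illustration of why page ranges are enough for training, here is a small Python sketch that turns one multi-document file plus human-labelled page ranges into per-type keyword lists. It is a toy under stated assumptions, not how the Intelligent Keyword Classifier Trainer works internally: the function name `train_from_page_ranges`, the `top_n` parameter, and the word-counting approach are all hypothetical.

```python
from collections import Counter

def train_from_page_ranges(pages, labeled_ranges, top_n=3):
    """Build per-type keyword lists from a single multi-document file.

    pages: list of page texts (the digitized document).
    labeled_ranges: (doc_type, first_page, last_page) tuples, 0-based and
    inclusive, as a person might select them in a classification screen.
    """
    counts = {}
    for doc_type, first, last in labeled_ranges:
        counter = counts.setdefault(doc_type, Counter())
        for page in pages[first:last + 1]:
            counter.update(page.lower().split())

    keywords = {}
    for doc_type, counter in counts.items():
        # Words that also occur in another type are not distinctive, so skip them.
        others = set().union(
            *(set(c) for t, c in counts.items() if t != doc_type)
        )
        distinctive = [w for w, _ in counter.most_common() if w not in others]
        keywords[doc_type] = distinctive[:top_n]
    return keywords
```

For example, labelling pages 0 to 1 as one type and page 2 as another gives each type its own word list, without ever cutting the PDF into separate files.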

That was great, I did a POC on that :slight_smile:

Hi, I just implemented IntelligentOCR and Document Understanding for classification and extraction. It is amazing. I still have a question, though: can we use it to extract data from a table in a PDF? What if the table is a nested table?

Hello @SWATI_KAROT,

All extractors have table extraction capabilities. Try them out.

We do not currently support nested tables or “repeating grouped fields” (like groups of field 1, table 2, another_field 3, that can appear multiple times in a document).

Ioana

Hi Ioana,

Thank you so much for your reply. I tried nested tables yesterday, which explains why it didn’t work. Let me try basic tables first.

Thanks,
Swati Karot

Hello,

I have a problem with AI Fabric and Data Manager.
Could someone help me with the configuration for Data Manager? I wanted to use the Docker container, but I need a login first. (In the documentation at https://docs.uipath.com/ai-fabric/v2020.4/docs/about-data-manager I found “registry credentials”, but I am not sure what that refers to.)

Thank you,
Viorel

Hi Ioana,

Please find attached the template configured for table extraction. The custom selection and the table highlight are clearly visible.

But at runtime, in the Validation Station, the table is not extracted. Please find that attached as well.

Any suggestions?

Thanks,
Swati.

Hello @viorel.balaj, and welcome to our community :slight_smile:

Please reach out to your UiPath contact to obtain credentials and all the necessary information about Data Manager.

Ioana

Hello Swati,

Please check that you have used “Configure Extractors” and that your table field is checked there.

Wow! That was bang on.
I had missed the checkboxes. The table is extracting fine now. Thanks a lot.

Hi! Just wanted to clarify for everyone on this post, the link to the documentation of Data Manager is now here:
https://docs.uipath.com/ai-fabric/v2020.4/docs/about-data-manager

Marco

Can anyone help me with this thread?

Hi Kesavraj,

I am also facing the same issue while using “Create Document Validation Action” in the workflow. I have provided the same values you provided in the screenshot.

If you have already found a solution, could you please help me with this?
If you haven’t found a solution, could anyone in the forum help with this?

@Ioana_Gligan, I have a question about the Document Understanding framework. Can it extract images from a PDF, and text embedded inside an image? For example, an image of a stamp on a PDF, where the stamp contains some text. Is it capable of extracting such things?

What version of Studio are you using? Have you checked that your project “Supports Persistence” in the Project Settings?
Are you using the Cloud Orchestrator?

Hello @Kesavaraj_K @Varshini_Ganapathy_Subram

Try giving the folder paths as I have done here in the screenshot below. This should work for you.
image

Hi Lahiru.Fernando,

Thanks for your response. I tried with the same folder name as in the screenshot, but I’m still getting the same error. Could you suggest another way to solve this error?