How to Extract a PDF Table with Document Understanding in UiPath

When we are dealing with invoices, or any other PDF document that contains a table, and we need to extract that data, Document Understanding comes into the picture and makes it easy to get the data out of the PDF.
Below is the sample invoice.

We need to extract the Invoice Number and the table into an Excel file.

Let's see how we can implement this in UiPath.

Implementation in UiPath

To start with, we first need to install the three packages shown below in UiPath Studio.

image
image

Step 1:- Drag the Load Taxonomy activity into the designer panel and create a variable to store the output, i.e. taxonomy.

image

Then click on Taxonomy Manager, shown in the ribbon in Studio.

Taxonomy Manager can only be accessed after installing UiPath.IntelligentOCR.Activities. Once the package is installed, a Taxonomy Manager button appears on the ribbon.

image

Click on Taxonomy Manager; the window below will open.

a) Click on Group and provide a name. I have used the name XYZ.

b) Click on Category and provide a relevant name.

c) Click on Document Type and provide a relevant name.

d) Then add the fields that we need to extract from the PDF to the document type.

For the table field, add all the columns that we need to extract.

Once all is done, close the Taxonomy Manager.

Step 2:- Drag the Digitize Document activity into the designer panel. In Document Path, provide the path of the PDF document.

We have stored the PDF path in the variable “DocumentPath”. We will store the Document Text in the output variable “strDocumentText” and the Document Object Model in the output variable “DocumentObjectModel”.

We will then drag and drop the OmniPage OCR activity. Various OCR engines are available, so you can use whichever you prefer.

Step 3:- Drag the Data Extraction Scope activity into the designer panel.

In Inputs-

a) Pass the DocumentPath variable under Document Path.

b) Pass the taxonomy variable under Taxonomy.

c) Pass the DocumentObjectModel variable under Document Object Model.

d) For Document Type ID:

  • You will see a folder ‘DocumentProcessing’ created in your project folder.

  • Open the “DocumentProcessing” folder; you will see a JSON file.

  • Right-click on the JSON file and open it in Notepad.

  • Copy the DocumentTypeID and paste it into Data Extraction Scope under Document Type Id.
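If you would rather not hunt for the value by eye in Notepad, the lookup can be sketched in a few lines of Python. This is only an illustration outside UiPath: the JSON content below is a made-up sample, and the layout of the real file in the DocumentProcessing folder may differ, which is why the sketch searches every level for a key named DocumentTypeId instead of assuming a fixed structure.

```python
import json


def find_document_type_ids(node):
    """Recursively collect every string stored under a 'DocumentTypeId' key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == "documenttypeid" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(find_document_type_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_document_type_ids(item))
    return found


# Hypothetical sample content; replace it with the text of your own JSON file.
sample_json = """
{
  "DocumentTypes": [
    {"DocumentTypeId": "XYZ.Invoices.Invoice", "Name": "Invoice"}
  ]
}
"""

taxonomy = json.loads(sample_json)
document_type_ids = find_document_type_ids(taxonomy)
print(document_type_ids)
```

The printed value is what you would paste into the Document Type Id field of Data Extraction Scope.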

In Outputs-

Store the extraction result in the output variable ExtractionResults.

As the invoice contains tabular data that we need to extract, we will use Form Extractor here.

End Point- It will be auto-populated.

API Key-

To get the API key, open the URL www.platform.uipath.com and click on Admin.

Click on Licenses.

Under Robots & Services, click on Document Understanding and generate a new API key.

Pass the API key to the Form Extractor.

Now click on Manage Templates; the window below will open.

Click on Create Template:

  • Select the Document Type from the drop-down.

  • Provide a template name.

  • Under Template Document, select the PDF document.

  • Under OCR Engine, select OmniPageOCR.

  • Click on the Configure button.

  • Once you click on the Configure button, you will see the window below.

  • Page 1, Matching Info: capture any five items from your PDF document that describe your document.

  • Invoice Number: capture the Invoice Number value from the document.

  • Invoice Table: select the table data, excluding the headers.

Once done, click on the Submit button.

Then click on Configure Extractors, select all the fields, and click on the Save button.

Step 4:- Drag the Export Extraction Results activity into the designer panel.

image

Store the output in a variable, DataSet.

Step 5:- We will use a For Each loop to go through the datatables in the extraction results.

Step 6:- Run the workflow.

image

The data from the PDF, i.e. the Invoice Number and the table information, has been extracted, but we can observe that there are duplicate entries.

Now we have to merge the Invoice Number and the table information into one table and also handle the duplicates. To do this, we will create a datatable at the start of the process.

It will have one column, i.e. Invoice Number, of type String.

Store the output in a variable, DTOutput.

Step 7:- We will use the Merge Data Table activity after the For Each activity.

image

Source Table- DataSet.Tables(2)

This is the PDF table, which we will merge with the new table that we have created, i.e. DTOutput.

Destination Table- DTOutput

Step 8:- We will now loop through this merged datatable, i.e. DTOutput.

Since the Invoice Number column is still blank at this point, we will assign the value to it.

image

Now we have both Invoice Number & table data in the output datatable- DTOutput.
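To make the logic of Steps 7 and 8 easier to follow, here is a rough Python sketch of the same idea: start from an output table that has an Invoice Number column, merge the extracted table rows into it, and fill the Invoice Number on every row. It is not UiPath code. The column names, the sample values, and the function name are invented for illustration, and dropping exact duplicate rows is just one possible way to handle the duplicates, assumed here for the sketch.

```python
def merge_with_invoice_number(table_rows, invoice_number):
    """Return rows that start with an 'Invoice Number' column, minus exact repeats."""
    output = []
    seen = set()
    for row in table_rows:
        # Equivalent of merging the row into DTOutput and then assigning the
        # Invoice Number column, which is blank right after the merge.
        merged = {"Invoice Number": invoice_number, **row}
        key = tuple(merged.items())
        if key in seen:
            # Assumption: an identical row is an unwanted duplicate entry.
            continue
        seen.add(key)
        output.append(merged)
    return output


# Hypothetical extracted table containing one duplicate entry.
table_rows = [
    {"Description": "Widget", "Quantity": "2", "Amount": "20.00"},
    {"Description": "Widget", "Quantity": "2", "Amount": "20.00"},
    {"Description": "Gadget", "Quantity": "1", "Amount": "15.00"},
]

dt_output = merge_with_invoice_number(table_rows, "INV-001")
for row in dt_output:
    print(row)
```

Keep in mind that an invoice can legitimately contain two identical line items, so check your own documents before removing rows this way.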

Step 9:- We will now write this data into the Excel file.

image

Step 10:- Let's run the final workflow.

This is our final output extracted in the workbook.

I hope you enjoyed the article!!

Happy Automation!!


This tutorial is spectacular.