How to read table data from a PDF and store it in Excel or Word?

Abhinandan · September 18, 2019, 5:33am

Hi Team,
I have a PDF file with multiple pages, and it contains a call transaction table.
The table runs from the 2nd page of the PDF to the 6th page.
So my question is: how can I identify the start and the end of the table in UiPath?
And how can I store the entire table in Excel or Word?
Are there any packages/activities to store PDF data in Excel?

Please help me. A lot of work depends on this, and it is a little urgent.

Ioana_Gligan · September 18, 2019, 6:53am

Hello @Abhinandan,

Are the PDF files always in the same format?

You could try using the IntelligentOCR package, with these activities:

“Digitize Document”
“Data Extraction Scope” with a “Regex Based Extractor” inside. You can configure the Regex Based Extractor to extract the entire table area as the value of a field you define in the Taxonomy Manager.

This would get you the entire table content, provided a regular expression can match it.

It would also allow you to feed the Data Extraction Scope output into a Present Validation Station attended activity for human verification and correction.

I think this would be a good start, but only if the table area is always identifiable through some regex rules.
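
To illustrate what "identifiable through some regex rules" can mean, here is a minimal sketch in plain Python, outside UiPath. Everything in it is an assumption made for illustration: the page texts, the `Date Time Number Duration` header line used as the start marker, and the `Total calls:` line used as the end marker are hypothetical, and the Regex Based Extractor itself is configured through its own UI rather than through code like this. The sketch joins the digitized text of all pages, captures the lines between the two markers, and turns them into CSV text that Excel can open.

```python
import csv
import io
import re

# Hypothetical digitized text, one string per PDF page (the first page has no table).
pages = [
    "Account summary\nCustomer: Example Ltd",
    "Call transactions\nDate Time Number Duration\n01/09/2019 10:15 5550101 00:03:20",
    "02/09/2019 11:40 5550102 00:01:05\nTotal calls: 2\nThank you",
]

# Assumed markers: the header row opens the table, the totals line closes it.
TABLE_AREA = re.compile(
    r"^Date Time Number Duration\n(?P<body>.*?)^Total calls:",
    re.DOTALL | re.MULTILINE,
)


def extract_table_area(page_texts):
    """Join all pages and return the non-empty lines between the two markers."""
    full_text = "\n".join(page_texts)
    match = TABLE_AREA.search(full_text)
    if match is None:
        return []
    return [line for line in match.group("body").splitlines() if line.strip()]


def to_csv(rows):
    """Split each row on whitespace and return CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Date", "Time", "Number", "Duration"])
    for row in rows:
        writer.writerow(row.split())
    return buffer.getvalue()


rows = extract_table_area(pages)
csv_text = to_csv(rows)
print(csv_text)
```

A real multi-page table may repeat its header or carry page footers on every page, in which case those lines would need to be filtered out of the captured body as well.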

Abhinandan · September 18, 2019, 10:33am

Hi, thanks for the reply.

The format of the PDF is the same, but the text that would be given to the regex as input changes.
That is the main obstacle to finding a solution.

Ioana_Gligan · September 20, 2019, 10:14am

Are you referring to the labels changing, or the values changing?
Could you offer a sample file and field to extract?

ercanebiler · October 4, 2019, 6:23am

Hi @Ioana_Gligan,

I am trying to use the Regex Based Extractor to get data from an invoice PDF. May I ask how it works?

First, digitize the PDF.
Then:

Do I have to put the Regex Based Extractor activity inside the Data Extraction Scope?
If yes:
There will be two configurations to set up, one for the Data Extraction Scope and the other for the Regex Based Extractor.
Inside the Data Extraction Scope, what should I enter for the taxonomy variables? What should I enter for the regex expressions? I understand the expressions will be regex, but should they match the variables’ titles or the variables themselves?

If no:

Please just outline the steps for me, based on the questions above.

Thanks for your help.

Ioana_Gligan · October 4, 2019, 11:49am

Hello @ercanebiler,

Please have a look at the workflow shared here: How to use the IntelligentOCR Package. It should be a good start to see how things get set up.

For the Regex Based Extractor: use the Configure Expressions link to open the expression manager, navigate to the field you want to edit an expression for, and use the Regex Builder that becomes available when you click the Edit link. Don’t forget to select which groups you want to capture!

The Configuration InArgument will be populated with a serialized version of all the expressions you configure, so that if you want to share them between projects you can store the serialized configuration and retrieve it at run time and pass it in as a variable.

After you set up your expressions, go to Configure Extractors and activate the Regex Based Extractor for the fields you want to apply the expressions to.

Hope this helps!

Ioana

ercanebiler · October 4, 2019, 12:57pm

Thanks for your answer, @Ioana_Gligan.

I tried to open those workflows but couldn’t. It throws “Document is invalid.”

I want to ask two other questions:

1) I have Quantity and Amount columns in invoices. Sample:

QTY AMOUNT
1,00 140,40
10,00 6,01
2,00 3,50

How can I build a regex for them? They look nearly the same.

2) As you know, we have to digitize the document before these operations. I digitized it with MS OCR, but some values could not be found. I know OCR technology does not work 100% of the time, but what do you suggest for that?

Ioana_Gligan · October 7, 2019, 10:32am

Hello @ercanebiler,

Can you please send me the entire error you are getting, along with the exact Studio version you are trying to open the workflow with?

Q1) Tables are currently not supported in the Regex Based Extractor. We are planning to extend support to table fields in the future.

Q2) You can experiment with other OCR engine options, either one of the Cloud options or OmniPage OCR (found in the UiPath.OmniPage.Activities package), and see which one works best for your needs.

Thanks,

Ioana
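
For the QTY and AMOUNT sample above, the two columns share the same number format, so a regex can only tell them apart by their position in the row. The following is a minimal sketch in plain Python, not a Regex Based Extractor configuration (which, as noted, does not support table fields). It assumes that each table row ends up on its own line after digitization and that the numbers use a decimal comma with no thousands separator; the names `ROW` and `to_decimal` are made up for the example.

```python
import re
from decimal import Decimal

# The sample rows from the question, as they might appear in digitized text.
sample = """QTY AMOUNT
1,00 140,40
10,00 6,01
2,00 3,50"""

# One row = two decimal-comma numbers on a line; the first is QTY, the second AMOUNT.
# The header line contains no digits, so it does not match.
ROW = re.compile(r"^(?P<qty>\d+,\d{2})\s+(?P<amount>\d+,\d{2})$", re.MULTILINE)


def to_decimal(text):
    """Convert a decimal-comma string such as '140,40' to a Decimal."""
    return Decimal(text.replace(",", "."))


parsed = [
    (to_decimal(m["qty"]), to_decimal(m["amount"]))
    for m in ROW.finditer(sample)
]
print(parsed)
```

If OCR drops or merges a value, the row no longer matches the pattern and is skipped, so comparing the number of parsed rows with the expected number is a simple way to flag pages that need manual review.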
