Extracting a table from a PDF and splitting rows by column

How can I extract a table from a PDF row by row and, after extracting the table data, split it column-wise? I have the table below.

Can anyone please help me?
Thank you.

Hi @Vanitha_Agamin

Try using the Document Understanding approach to extract the data from the PDF table.

Hope this helps,

Thanks.

Thank you.
I know how to extract a table from a PDF using Document Understanding. Apart from this, is there any other solution?

Hi @Vanitha_Agamin ,

Could you try reading the text using the Read PDF Text activity and check whether the data appears in a structured format?
If so, we could give Regex a try.

Kind Regards,
Ashwin A.K

Thank you.
Yes, I used the Read PDF Text activity and wrote the data to a text file. How do I then split it and extract only the column data?

Hi @Vanitha_Agamin ,

We’d appreciate it if you could share some sample data with us so that we can test a few scenarios on our end.

Kind Regards,
Ashwin A.K

text.txt (421 Bytes)
This is the text I extracted from the PDF.
Invoice 873101.pdf (36.5 KB)
This is the PDF.

Hi @Vanitha_Agamin ,

Could you also let us know how the table output should appear?

You can provide us with an expected output Excel file.

The table output should look like the image below.
[Image: Table]

Hi @Vanitha_Agamin ,

Could you check the workflow below:
Extract_TableData_Regex.xaml (8.4 KB)

It was able to extract the table data using the following regex:

(.*)\s.??(\w+[\d.,]+)\s+(\d+)\s+(\w+[\d.,]+)

However, we would need you to confirm the extraction by testing it against several other similar PDFs.

Let us know if it doesn’t work for all PDFs, and provide the PDF or its extracted text so that we can understand the cause and modify the regex.
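For anyone who wants to try the pattern outside UiPath, here is a minimal Python sketch. The sample text is made up (the text.txt attachment is not reproduced here), and the column names Description, Price, Qty and Total are assumptions. Each capture group of the pattern becomes one column of the row.

```python
import re

# Pattern quoted in the post above.
ROW_PATTERN = re.compile(r"(.*)\s.??(\w+[\d.,]+)\s+(\d+)\s+(\w+[\d.,]+)")

# Hypothetical sample text; the real PDF text may look different.
SAMPLE_TEXT = "\n".join([
    "Description Price Qty Total",
    "Web Design 85.00 2 170.00",
    "Hosting 30.00 12 360.00",
])


def extract_rows(text):
    """Return one tuple of column values per matched table row."""
    return [
        tuple(group.strip() for group in match.groups())
        for match in ROW_PATTERN.finditer(text)
    ]


# Print each row with its columns separated by " | ".
for description, price, quantity, total in extract_rows(SAMPLE_TEXT):
    print(description, price, quantity, total, sep=" | ")
```

The header line is skipped because the pattern requires numeric columns. In UiPath the same groups can be read from each item returned by the Matches activity.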

Thank you @supermanPunch
This works for this invoice.
I will try it on different PDFs that contain tables, and if I run into difficulty I will come back to you.

I need to extract the table shown in the picture below from this PDF:
Calculate Client Security Hash - 2020.10 Exercise Hints.pdf (155.9 KB)
Can anyone please help me?
Thank you.

@Vanitha_Agamin ,

The previous PDF was an invoice, and it follows a certain format.

The PDF you have now provided is from a UiPath course. Is there a reason you are providing this kind of PDF?

Will the formats differ from one PDF to another? If so, we would need to know how many different kinds of PDFs there are. If the number is finite, we should be able to create Regex patterns and extract the data by first identifying the format of each PDF.

If the PDF formats are dynamic, we may need to understand the scope of this process first.

Yes, I have PDFs in different formats.
Pdf file2.pdf (407.4 KB)
I have to extract the table from this one, but the table data may vary in the future. How can I extract the table?
Thank you in advance.

@Vanitha_Agamin ,

We would need to identify the different PDF formats from the start, if at all possible. We would need to gather all the formats and their samples, and test whether extraction using simple methods like Regex or string manipulation is possible.

Once we have gathered all the PDF formats that will appear as inputs, we should be able to categorise them based on their keywords, so that we can use a particular Regex pattern or string manipulation technique to extract data for each PDF format.
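As a rough illustration of that idea, here is a Python sketch that picks a row pattern based on a keyword found in the extracted text. The format names and the keywords "Invoice" and "Item Code" are hypothetical; the real ones would come from the gathered samples. The two patterns are the ones shared in this thread.

```python
import re

# Hypothetical registry: format name -> (identifying keyword, row pattern).
FORMATS = {
    "invoice": (
        "Invoice",
        re.compile(r"(.*)\s.??(\w+[\d.,]+)\s+(\d+)\s+(\w+[\d.,]+)"),
    ),
    "item_list": (
        "Item Code",
        re.compile(
            r"(^\d+.*?)(?=\s)(.*?)\s(?=\d)([\d.,]+)\s*([\d.,]+)\s*([\d.,]+)\s*([\d.,]+)",
            re.MULTILINE,
        ),
    ),
}


def detect_format(text):
    """Return the first format whose keyword occurs in the text, else None."""
    lowered = text.lower()
    for name, (keyword, _pattern) in FORMATS.items():
        if keyword.lower() in lowered:
            return name
    return None


def extract_table(text):
    """Identify the format, then extract rows with that format's pattern."""
    name = detect_format(text)
    if name is None:
        raise ValueError("Unknown PDF format; add a keyword and pattern for it.")
    _keyword, pattern = FORMATS[name]
    rows = [tuple(g.strip() for g in m.groups()) for m in pattern.finditer(text)]
    return name, rows
```

A text that matches no keyword raises an error instead of being parsed with the wrong pattern, which makes new, unseen formats easy to spot.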

If that is not an option, you might need to use a third-party tool for extraction. Such tools can extract any table format from a PDF, but they may require you to purchase a license or may only be available as a trial version.

One such component is provided in the post below, which looks very similar to your case:

You can browse more such components in the UiPath Marketplace:

However, if you currently need the extraction for the PDF type provided, you can check the workflow below, which uses Regex patterns:
Extract_TableData_Regex.xaml (9.2 KB)

Let us know your decision and thoughts.

Thank you @supermanPunch
But I don’t even know what kind of PDFs will come.
Can you please help me with the table in the PDF below?
pdf file3.pdf (407.9 KB)

@Vanitha_Agamin ,

Can you change the pattern in the Matches activity to the one below and check:

(^\d+.*?)(?=\s)(.*?)\s(?=\d)([\d.,]+)\s*([\d.,]+)\s*([\d.,]+)\s*([\d.,]+)

Also, select the Multiline option under RegexOptions.
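To see why the Multiline option matters: the pattern starts with `^`, which only matches at the beginning of every line when that option is on. Here is a small Python sketch with made-up rows (the contents of pdf file3.pdf are not reproduced here, and the six column names are assumptions). `re.MULTILINE` plays the role of the RegexOptions setting.

```python
import re

# Pattern quoted in the post above; re.MULTILINE lets ^ match at each line start.
ITEM_PATTERN = re.compile(
    r"(^\d+.*?)(?=\s)(.*?)\s(?=\d)([\d.,]+)\s*([\d.,]+)\s*([\d.,]+)\s*([\d.,]+)",
    re.MULTILINE,
)

# Hypothetical sample text; the real PDF text may look different.
SAMPLE_TEXT = "\n".join([
    "Item Description Qty Rate Tax Amount",
    "1001 Steel bolts 10 2.50 0.00 25.00",
    "1002 Copper wire 3 12.00 1.50 37.50",
])


def extract_item_rows(text):
    """Return one tuple of six column values per matched row."""
    return [
        tuple(group.strip() for group in match.groups())
        for match in ITEM_PATTERN.finditer(text)
    ]


# Print each row with its columns separated by " | ".
for row in extract_item_rows(SAMPLE_TEXT):
    print(" | ".join(row))
```

Without the Multiline flag, `^` would match only at the very start of the whole text, so rows on later lines would not be found.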

Thank you so much @supermanPunch
